
Ricardo Brandão – Senior Software Engineer and Team Lead

One of the biggest problems, and frankly the most annoying part, of software engineering is collecting the data required to investigate a defect: digging through DataDog traces, reading logs, or searching what is more often than not an incomplete knowledge database for similar past cases (which usually means asking the engineer who's been at the company the longest). Very rough napkin math suggests at least 50% of the time needed to fix a defect is spent on data collection and debugging. What if we could automate that?

Google SRE reports suggest that in complex distributed systems, upwards of 70% of incident resolution time is spent on diagnosis and triage, not on the actual code change. The average hourly rate for a software engineer in the US is around $45. Take a defect that needs about 4 hours to resolve: if investigation eats roughly 60% of that time (squarely within the 50-70% range above), that's 2.4 hours, or about $108 spent on data collection and debugging alone. Multiply that by the number of incidents and defects a company receives monthly (a medium-sized SaaS can face around 20-100 customer-reported defects a month), and even at the low end of that range you're looking at over $2,000 a month just for data collection. I think you can see where this is going.

So here's the deal: We're talking about an established SaaS company with around 45 software engineers (pretty decent size, right?). The team we worked with had 6 full-stack engineers - 2 seniors, 2 mid-level folks, and 2 associates. Standard Node.js shop running everything on AWS, using DataDog for monitoring, Jira for the usual project chaos, and GitHub for code management. You know, the typical stack most of us are familiar with.

Now, like any company that's been around for a while, they had accumulated some... let's call it "technical debt" (we've all been there). Their system had grown organically into this complex beast spanning 10+ repositories and services. Not uncommon, but definitely makes debugging a nightmare.

But here's where it gets ugly: In just the first 4 months of 2025, they got hit with over 300 customer-reported defects. That's roughly 75 defects per month - way above the typical 20-100 range we usually see. The engineering team was basically drowning, spending more time playing whack-a-mole with bugs than actually building cool new features.

When we at Fornax heard about their situation, we thought "this is exactly the kind of problem we can solve with some clever automation." So we decided to tackle it hackathon-style over a 3-day sprint (working outside our normal hours, because why not make it interesting?).

Like any engineering team under pressure, they tried the classic moves first. Step one: throw more people at it. They grew the team from 4 to 6 engineers over two months, pulling developers from feature work to handle the defect backlog. But surprise, surprise - this just created more context switching and didn't actually solve the core problem (which was the time-consuming investigation process).

Next up: process changes. They implemented this whole triage system where senior engineers would pre-investigate defects before assigning them, plus created these detailed defect templates to capture more upfront info. It helped a bit, but now the seniors were spending 60% of their time on triage instead of, you know, doing senior engineer stuff like architecture and mentoring.

They also doubled down on prevention (the responsible thing to do). Test coverage went from 65% to 85% over three months, they added stricter code review requirements, automated linting, security scans - the whole nine yards. Good long-term investments, but defects kept flowing in from existing code and those lovely edge cases that tests just can't catch.

The breaking point? They realized they were closing defects at roughly the same rate new ones were being reported. Despite all their efforts, investigation time was still the bottleneck. Each defect still required digging through DataDog traces, correlating logs across services, and that tribal knowledge from whoever had seen something similar before (usually the person who's been there the longest and is probably already overwhelmed). The team was burning out, and feature development had essentially stalled.

After one of our Fornax software engineers championed the use of custom MCP tools within Cursor (our client's IDE of choice), we decided to give automating the research on incoming defects a go, particularly for those that required a data fix in the database after the code fix shipped. We could have written a blanket SQL statement, but with so many different states the data could be in (ledger logic is not simple, guys), that idea was genuinely scary. If instead we could identify a pattern, categorize the issue, and perform the investigation automatically, the software engineer would be free to quickly check that everything was okay, apply the fix, and close the defect!

Here is where we proposed a proof of concept. We wanted to generate embeddings of ALL the defects and their resolutions and create a vectorized knowledge database where we could perform semantic searches based on the symptoms the user was experiencing.

Now to get to the nitty-gritty, here was the plan:

1 - Implement a workflow where, for each of our defects in Jira, we would collect all the DataDog information around it, the Jira comments, GitHub PRs and change diffs, pertinent Teams chats, and any DevOps data fixes that were executed. All of this information would go into a vector database, where the embedded text was an LLM-generated summary of the issue, the symptoms, and the solution (a rough sketch of this step is below).
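To make that concrete, here's a minimal sketch of the cataloging side, assuming the OpenAI Node SDK for both the summary and the embedding. The `DefectBundle` shape, the model names, and the prompt are illustrative placeholders for what our collectors actually produce, not the production code.

```typescript
// Minimal sketch of the cataloging step, assuming the OpenAI Node SDK;
// model names, field names, and the prompt are illustrative placeholders.
import OpenAI from "openai";

const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment

// Hypothetical shape of what our collectors return for one resolved defect.
interface DefectBundle {
  ticketId: string;        // e.g. a Jira key
  jiraComments: string;    // concatenated comment text
  prDiffs: string;         // GitHub PR titles + diffs
  datadogNotes: string;    // relevant trace/log excerpts
  dataFix?: string;        // SQL or script run by DevOps, if any
}

export async function buildCatalogEntry(defect: DefectBundle) {
  // 1. Ask an LLM for a compact summary of symptoms, root cause, and resolution.
  const chat = await openai.chat.completions.create({
    model: "gpt-4o-mini",
    messages: [
      { role: "system", content: "Summarize this defect: symptoms, root cause, and resolution." },
      { role: "user", content: JSON.stringify(defect) },
    ],
  });
  const summary = chat.choices[0].message.content ?? "";

  // 2. Embed the summary; this vector is what we store and search against.
  const emb = await openai.embeddings.create({
    model: "text-embedding-3-small",
    input: summary,
  });

  return { ticketId: defect.ticketId, summary, embedding: emb.data[0].embedding };
}
```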

The integrations needed for this were built following the MCP pattern so we could reuse them within Cursor as well, which was a fortunate double-dipping of the work!
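In its simplest form, one of those integrations looks roughly like the sketch below, based on the examples in the official @modelcontextprotocol/sdk TypeScript package. The `fetch_defect_context` tool name and its handler body are hypothetical stand-ins for our real Jira, GitHub, and DataDog collectors.

```typescript
// Hedged sketch of one MCP tool, following the @modelcontextprotocol/sdk
// TypeScript examples; the tool name and handler body are hypothetical.
import { McpServer } from "@modelcontextprotocol/sdk/server/mcp.js";
import { StdioServerTransport } from "@modelcontextprotocol/sdk/server/stdio.js";
import { z } from "zod";

const server = new McpServer({ name: "defect-tools", version: "0.1.0" });

// A tool Cursor (or our cataloging workflow) can call with a Jira ticket ID.
server.tool(
  "fetch_defect_context",
  { ticketId: z.string() },
  async ({ ticketId }) => {
    // In the real integration this would call Jira, GitHub, and DataDog.
    const context = `Collected context for ${ticketId} would go here.`;
    return { content: [{ type: "text", text: context }] };
  }
);

// Expose the server over stdio so an IDE like Cursor can launch it.
const transport = new StdioServerTransport();
await server.connect(transport);
```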

Because we were doing this work outside our normal hours as a POC, we went with OpenSearch hosted on AWS (almost all of our infrastructure was already in AWS) and chose the serverless option because we didn't need much compute for it. The cataloging part was pretty much a one-and-done deal: for 2025 alone we were able to catalog around 350 customer-reported defects and their resolutions.
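Here's roughly what the index setup and the per-defect write looked like, sketched with the @opensearch-project/opensearch client. The index name, field names, and vector dimension are our own choices, and request signing for OpenSearch Serverless is left out for brevity.

```typescript
// Sketch of creating the kNN index and cataloging one defect, assuming the
// @opensearch-project/opensearch client. Index and field names are ours;
// authentication/signing for OpenSearch Serverless is omitted for brevity.
import { Client } from "@opensearch-project/opensearch";

const client = new Client({ node: process.env.OPENSEARCH_ENDPOINT! });

// One-time setup: a kNN-enabled index with a 1536-dimension vector field
// (matching the embedding model we chose) plus a few metadata fields.
export async function createDefectIndex() {
  await client.indices.create({
    index: "defect-catalog",
    body: {
      settings: { "index.knn": true },
      mappings: {
        properties: {
          ticketId: { type: "keyword" },
          summary: { type: "text" },
          embedding: { type: "knn_vector", dimension: 1536 },
        },
      },
    },
  });
}

// Called once per cataloged defect (the output of buildCatalogEntry above).
export async function indexDefect(entry: { ticketId: string; summary: string; embedding: number[] }) {
  await client.index({
    index: "defect-catalog",
    id: entry.ticketId,
    body: entry,
  });
}
```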

2 - Implement a second workflow where, whenever a new defect is created in Jira, we generate an embedding of its symptoms and any other information we have about it, run a k-nearest-neighbor search (the k nearest points in the vector database) against our catalog, and keep only the highest-scoring matches. That gave us a very good foundation to feed into another LLM, which compares the symptoms and writes a summary of the possible causes and resolutions, including the code changes, data fixes, and DataDog traces to back it up.
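The lookup step boils down to something like the sketch below. The field name, the k value, and the score cutoff are assumptions for illustration.

```typescript
// Sketch of the lookup step: run a kNN query against the catalog with the
// new defect's symptom embedding and keep only strong matches. The 0.7
// score cutoff and k=5 are illustrative choices.
import { Client } from "@opensearch-project/opensearch";

const client = new Client({ node: process.env.OPENSEARCH_ENDPOINT! });

export async function findSimilarDefects(symptomEmbedding: number[], k = 5) {
  const result = await client.search({
    index: "defect-catalog",
    body: {
      size: k,
      query: {
        knn: {
          embedding: { vector: symptomEmbedding, k },
        },
      },
    },
  });

  // Keep only the highest-scoring hits before handing them to the LLM.
  return result.body.hits.hits
    .filter((hit: any) => (hit._score ?? 0) > 0.7)
    .map((hit: any) => hit._source);
}
```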

One particularly clever trick we used to find all the related DataDog traces was to ask support to always attach a HAR file (HTTP Archive) containing all the requests, responses, headers, cookies, and timings from the user's session; just by parsing that file, we were able to easily find all the related traces.
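Since a HAR file is just JSON, the trace mining is a short loop over its entries, picking up DataDog's x-datadog-trace-id header. The interfaces below are a simplified slice of the HAR format, not the full spec.

```typescript
// Sketch of mining DataDog trace IDs out of a HAR file. HAR is plain JSON
// (log.entries[]), and DataDog propagates trace IDs in the
// x-datadog-trace-id header; the shapes here are simplified.
import { readFileSync } from "node:fs";

interface HarHeader { name: string; value: string }
interface HarEntry {
  request: { url: string; headers: HarHeader[] };
  response: { headers: HarHeader[] };
}

export function extractTraceIds(harPath: string): string[] {
  const har = JSON.parse(readFileSync(harPath, "utf8"));
  const entries: HarEntry[] = har.log?.entries ?? [];

  const traceIds = new Set<string>();
  for (const entry of entries) {
    for (const header of [...entry.request.headers, ...entry.response.headers]) {
      if (header.name.toLowerCase() === "x-datadog-trace-id") {
        traceIds.add(header.value);
      }
    }
  }
  return [...traceIds];
}
```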

Lastly, with the summary in hand, we added it to the incoming Jira ticket as a comment, ready for the next software engineer to vet and perform the code change or data fix.
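Posting that comment is a single call to the Jira Cloud REST API, which in v3 expects comment bodies in Atlassian Document Format. The environment variable names here are our own convention.

```typescript
// Sketch of posting the generated summary back to the ticket via the Jira
// Cloud REST API (v3 uses Atlassian Document Format for comment bodies).
// The environment variable names are our own convention.
export async function postInvestigationComment(issueKey: string, summary: string) {
  const auth = Buffer.from(
    `${process.env.JIRA_EMAIL}:${process.env.JIRA_API_TOKEN}`
  ).toString("base64");

  await fetch(`${process.env.JIRA_BASE_URL}/rest/api/3/issue/${issueKey}/comment`, {
    method: "POST",
    headers: {
      Authorization: `Basic ${auth}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({
      body: {
        type: "doc",
        version: 1,
        content: [
          { type: "paragraph", content: [{ type: "text", text: summary }] },
        ],
      },
    }),
  });
}
```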

To run this second workflow, we went with a Lambda fired by a REST endpoint registered in AWS API Gateway. The Jira webhook calls it with the ticket ID and the Lambda takes it from there; roughly 5 seconds later, all the information is fetched and attached to the Jira ticket.
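The webhook-facing Lambda itself is thin; a sketch is below. The payload shape (issue.key) matches a typical Jira webhook, and the commented-out helpers stand in for the earlier sketches, with names that are purely illustrative.

```typescript
// Sketch of the webhook-facing Lambda behind API Gateway. The payload shape
// (issue.key) matches a typical Jira webhook, but validation is simplified.
import type { APIGatewayProxyHandlerV2 } from "aws-lambda";

export const handler: APIGatewayProxyHandlerV2 = async (event) => {
  const payload = JSON.parse(event.body ?? "{}");
  const issueKey: string | undefined = payload.issue?.key;

  if (!issueKey) {
    return { statusCode: 400, body: "Missing issue key" };
  }

  // Embed the symptoms, run the kNN lookup, summarize, and comment back.
  // These hypothetical helpers correspond to the earlier sketches:
  // const matches = await findSimilarDefects(await embedSymptoms(payload));
  // await postInvestigationComment(issueKey, await summarizeMatches(matches));

  return { statusCode: 202, body: `Investigation started for ${issueKey}` };
};
```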

3 - We needed a way to keep our defect knowledge base updated with the latest defects found and solved. To do that, we registered another Jira webhook that fires when a defect is moved to completed. That webhook calls another API Gateway endpoint, which in turn fires another Lambda, but this time the Lambda adds a message to an SQS queue that eventually triggers processing of that Jira ticket in the same manner as the first workflow, except for that single defect.

We went with SQS instead of just firing a Lambda directly, mainly because we wanted a dead-letter queue in case something went wrong while cataloging the defect. We didn't want to lose precious defect information just because some integration didn't behave well (a sketch of the enqueueing Lambda is below).
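That enqueueing Lambda is only a few lines with the AWS SDK v3 SQS client. The dead-letter queue itself lives in the queue's configuration (a redrive policy with a maxReceiveCount), so it doesn't show up in the code; the queue URL variable is our own naming.

```typescript
// Sketch of the "defect completed" webhook Lambda: it only enqueues the
// ticket for cataloging. The dead-letter queue is configured on the SQS
// queue itself (redrive policy), not in this code.
import type { APIGatewayProxyHandlerV2 } from "aws-lambda";
import { SQSClient, SendMessageCommand } from "@aws-sdk/client-sqs";

const sqs = new SQSClient({});

export const handler: APIGatewayProxyHandlerV2 = async (event) => {
  const payload = JSON.parse(event.body ?? "{}");
  const issueKey: string | undefined = payload.issue?.key;

  if (!issueKey) {
    return { statusCode: 400, body: "Missing issue key" };
  }

  await sqs.send(
    new SendMessageCommand({
      QueueUrl: process.env.CATALOG_QUEUE_URL!, // hypothetical env var
      MessageBody: JSON.stringify({ issueKey }),
    })
  );

  return { statusCode: 202, body: `Queued ${issueKey} for cataloging` };
};
```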

All of this work resulted in a VERY cost-efficient solution (well, it is cloud, so it was as cost-efficient as you can be) with no unnecessary compute - it only executes when needed.

That $108 investigation cost in engineer time? It now costs us about $0.02, including all the embedding, LLM, and compute costs. And more importantly? Our software engineers could spend their time fixing the issue instead of just digging through logs and traces. That led not only to faster defect resolution but also to higher engineer satisfaction and more investment in new feature work.

Obviously, that faster turnaround reflected on our customers as well. Faster defect fixes mean happy customers!

Now, obviously, it's not completely perfect yet. We still want to optimize the chunking strategy for the embeddings so we get even more accurate results, and we want to extend this knowledge database into a chatbot for our support team, so THEY can search for a resolution before a defect even gets to us.

For teams considering doing the same, my recommendation is to plan ahead of time what information you will add to the knowledge base. In our case, because we went with OpenSearch, every time we wanted to add new metadata to the index we had to recreate it because of the mapping constraints. That wasn't a big deal for us since rebuilding was cheap, but on larger datasets you will definitely want to avoid it.

I think this case really showcases where the software engineering industry is heading. We're seeing the emergence of what I like to call "AI-augmented DevOps" - where machine learning doesn't replace us engineers, but eliminates the tedious investigative work that prevents us from doing what we actually love: solving complex problems and building cool stuff.

The numbers back this up too. According to the 2024 State of DevOps Report, teams spending over 30% of their time on toil and manual processes are 2.5x more likely to experience burnout (and frankly, who wants that?). Our RAG-based approach directly tackles this by automating the most time-consuming part of defect resolution - all that investigation work.

Looking ahead, I predict knowledge-augmented development workflows will become standard practice by 2027. Companies that embrace these AI-powered investigation tools early will have significant competitive advantages: faster resolution times, happier engineers, and the ability to reinvest saved time into innovation instead of constantly firefighting.

The bottom line? The future isn't about AI replacing developers - it's about AI amplifying what we can do by handling the repetitive, research-heavy tasks that currently eat up 50-70% of our incident response time.

What defect investigation challenges are eating up your team's time? Have you experimented with RAG or similar AI solutions in your engineering workflow? We'd love to hear about your experiences in the comments below. If you're drowning in customer defects like we were, reach out to us at Fornax - we're helping teams implement similar automation solutions. Share this post if you think your network could benefit from reducing their $108 investigation costs to $0.02!
