The Debugging Workflow: The More Things Change, The More They Stay The Same
Let me take you back 20 years for a walkthrough of the ancient ways of debugging software. You’re a software developer in charge of a set of critical services in a highly scaled application. Incident management as we know it today is a bohemian practice at best, but you and your team are on the cutting edge - you use some combination of Nagios for infrastructure and service monitoring, Logwatch for summarizing log events, and an SNMP-based network monitoring tool with fancy visualizations. Both Nagios and Logwatch also send you alerts via email and text message when something is off, which you think is pretty rad! Subversion (SVN) is commonly used among your technical peers at other companies for version control and branching, and your team is no different. And while your mind isn’t constantly preoccupied with incidents or system performance issues (DevOps and SRE as bona fide practices would only emerge years later), you feel like your bases are covered - you’ve got alerting, infrastructure, service, and network monitoring, and you have a good sense of code release events and versioning. Life is good.
…Until it isn’t. You get an automated email from Nagios which might look something like this:
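(Here’s a reconstruction, loosely based on Nagios’ stock notification template - the host name, address, and timestamp are invented for illustration.)

```
***** Nagios *****

Notification Type: PROBLEM
Host: payments-db-01
State: DOWN
Address: 10.20.4.17
Info: CRITICAL - Host Unreachable (10.20.4.17)

Date/Time: Tue Mar 9 02:47:13 UTC 2004
```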
Host Unreachable. Not good. You and your team start digging into system metrics in Nagios to determine the blast radius, while others comb through logs in Logwatch to help localize the incident. Yet another developer on your team is checking SVN to see which code or config releases just went out and might be the culprit for the incident at hand.
Sound familiar?
Back to the…Past?
So what has really changed? Even if, for a moment, we stripped out the mild undertones of snark and sarcasm in this post, what has fundamentally changed in the debugging workflow for developers from 2004 to 2024? Distributed tracing has emerged as a relatively contemporary way to observe microservice-based, cloud-native applications - flame graphs are nifty ways to see request flows and where requests might be getting held up - but during a production incident, multiple traces, each with dozens to hundreds of spans, could have the answer (or a clue) hidden within them. Has your job as a developer become easier? New tooling obscures a fundamental truth of debugging software: observability tools are not the bottleneck in our quest for the root cause - it’s the speed (or lack thereof) at which a human can interpret a ton of underlying observability data.
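To make “spans” concrete, here is a minimal, illustrative sketch of how a single traced request produces nested spans, assuming the OpenTelemetry Python SDK (opentelemetry-sdk) is installed; the service and span names are invented.

```python
# Minimal sketch: one trace with nested spans, printed to the console.
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import ConsoleSpanExporter, SimpleSpanProcessor

# Export finished spans to stdout so the trace structure is visible without a backend.
provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("checkout-service")  # hypothetical service name

with tracer.start_as_current_span("handle_checkout"):        # root span of the trace
    with tracer.start_as_current_span("reserve_inventory"):  # child span
        pass
    with tracer.start_as_current_span("charge_payment"):     # another child span
        pass
```

Each span prints as it ends; now picture hundreds of these per request, across dozens of services, while an incident is live.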
If we acknowledge that underlying observability tools have improved marginally - less manual onboarding, the ability to handle greater scale, a slightly better user experience - we must also acknowledge that this benefit is being offset by the exponential increase in observability data generated by the average enterprise. Software systems today are embedded in every facet of life, and are therefore handling not only an exponentially larger number of use cases, but also significantly more complex ones: from troop deployments to inventory management, payroll processing, payments, content streaming, and more.
Remember our assertion above - it’s not the tools, it’s the human. If that assertion is to be believed, the increase in telemetry data requiring human interpretation may actually make the contemporary era of observability more challenging, not less, than it was 20 years ago. In fact, the expansion of telemetry data has introduced yet another challenge for developers in the modern era: signal-to-noise ratio. So let’s take stock of where we are twenty years later: marginally improved tooling, offset by a massive increase in the telemetry data we need to interpret when things go sideways; increased complexity and scale in what software is expected to do in our world; and, given the growing popularity of microservice-based architectures, we’ve instrumented the dickens out of our applications - which means an increasing number of false positives and benign alerts, not to mention relentless pressure to reduce the costs of storing and querying all that telemetry data.
The Great Stratification
In theory, a unified “single pane of glass” for all telemetry data - metrics, events, logs, traces (aka MELT), alerts, SLOs, change management - exists today. Modern commercial observability companies like Datadog, Dynatrace, and Grafana will often tout the benefits of a one-stop shop for all your telemetry data. It sounds nice: if all the data lived in a single place, and if advanced technologies like machine learning could further help a human analyze and interpret all that data, especially within the context of an incident - presto kazam! Isn’t that what all customers really want?
The reality is more nuanced. There are any number of reasons why enterprises don’t actually behave in this way:
Avoiding Vendor Lock-in: Enterprises are heavily incentivized to distribute risk. It’s how they protect their businesses. Because of this, no enterprise prefers to put all of its eggs in a single vendor’s basket, especially when it can naturally splinter a business process or workflow (such as observability) across multiple vendors, each of whom may have some perceived specialty.
Preference for Specialization: Certain observability and incident management tools have actually built deep expertise in a particular area - Splunk for logs, ThousandEyes for network telemetry, Prometheus for metrics, Datadog for traces, Opsgenie for alerts, ServiceNow for change management, and so on. Of course there can be some overlap and mini-consolidation here (e.g. using ServiceNow for change management AND alerting), but most commercial and open source tools have built long-standing institutional knowledge in their area of expertise, and customers know this.
Cost: Observability incumbents make money when customers store observability data in their systems - the more telemetry data stored, the more money the incumbents make. For this reason, we have not yet met the enterprise that stores all of its observability data in a single system; it’s just too expensive. Enterprises typically pursue a data tiering strategy, choosing different storage systems for different tiers of data. Here, open source telemetry platforms, such as Prometheus and Loki, play a meaningful role.
Because of this, the Great Stratification of observability data continues. For years to come (and probably longer), the data needed to infer why something is going sideways will live in different modalities and in many different places, especially for larger, more complex enterprises. The promise of a unified “single pane of glass” platform is appealing, but it is overly ambitious at best and ignores the reality enterprises live in at worst. This is, of course, the entire impetus for building Flip AI, and why customers immediately understand the value we bring: an agnostic interpretive layer that spans any number of observability, change management, and alerting tools to rapidly interpret data and help determine the root cause of an incident. We give the human in the debugging workflow the gift of warp speed.
So What’s Next?
We started this post by calling the debugging process of two decades ago “ancient,” tongue firmly in cheek. But in a way, what has really changed? Not a whole lot - UNTIL the advances of generative AI. With the right approach to building domain-specific language models, GenAI could usher in a step-function change in the way developers monitor their systems and restore them to health. Come to think of it, if these language models could isolate incident root causes, begin to predict them in certain cases, and even make highly educated suggestions on how to restore systems to health or prevent catastrophes, will the observability practices of the past 20 years begin to fade? We believe so. We are moving away from an era of “observing” and “monitoring” to an era of knowing and understanding. Legacy observability systems were designed with the understanding that the human would be the central cognitive processor of information. Now, agents step in to carry much of that cognitive load, working in tandem with the human to provide actionable information.
So let’s end with another exercise in imagination: you’re a software developer in charge of a set of critical services in a highly scaled application. You are responsible not only for improving and evolving these services, but also for maintaining them. Suddenly, an alert fires for one of the services under your purview. You don’t yet know if it’s a false positive, something potentially catastrophic, or somewhere in between. You’re calm and collected. No on-call is paged. No war room is spun up. Rather, autonomous agents, powered by a domain-specific LLM, have already begun their investigation. First, they help you determine the magnitude of the anomaly by looking at a series of metrics - and sure enough, there is a sustained HTTP error spike that is way outside of norms. They keep digging, looking across metrics to localize the error and into the logs to find the “why” of the situation. Simultaneously, they’re checking your change management system for recent deployments or configuration changes that correlate with the timestamp of the incident at hand. Finally, they reason through all of the data and assemble a cohesive, data-driven story: what is happening, where in the system the impact is being felt, what has changed, and why. Imagine that entire autonomous investigation - from the moment the alert fired to the moment an agent DM’ed you the completed analysis in Slack - took 2-3 minutes, tops.
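To make that flow concrete, here is a hypothetical, heavily simplified sketch in Python. None of the function names, values, or outputs below come from a real product or API - they are stand-ins for the sequence the agents run through: metrics, then logs, then change history, then a synthesized summary.

```python
# Hypothetical sketch only: every function here is a stand-in with fabricated data.
from dataclasses import dataclass
from datetime import timedelta
from typing import List, Optional

@dataclass
class Finding:
    source: str   # "metrics", "logs", or "changes"
    detail: str

def check_metrics(service: str) -> Optional[Finding]:
    # Stand-in for querying a metrics backend for anomalous HTTP error rates.
    http_5xx_rate = 0.27  # fabricated value, far outside a "normal" baseline
    if http_5xx_rate > 0.05:
        return Finding("metrics", f"{service}: sustained 5xx spike at {http_5xx_rate:.0%}")
    return None

def scan_logs(service: str) -> Finding:
    # Stand-in for searching recent logs for the dominant error signature.
    return Finding("logs", f"{service}: repeated 'connection pool exhausted' errors")

def recent_changes(service: str, lookback: timedelta) -> Finding:
    # Stand-in for querying a change-management system for deploys/config changes.
    return Finding("changes", f"{service}: config change deployed within the last {lookback}")

def summarize(findings: List[Finding]) -> str:
    # Stand-in for the LLM step that reasons over the findings and writes the story.
    return "\n".join(f"[{f.source}] {f.detail}" for f in findings)

if __name__ == "__main__":
    service = "checkout-service"  # hypothetical service name
    findings = [f for f in (check_metrics(service),
                            scan_logs(service),
                            recent_changes(service, timedelta(minutes=15))) if f]
    print("Draft root-cause summary:\n" + summarize(findings))
```

The real work, of course, is in what the stand-ins gloss over: querying many telemetry systems, separating signal from noise, and reasoning across all of it - but the shape of the workflow is the same.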
If this sounds like the future to you, let me end with William Gibson’s quote: “The future is already here. It’s just not evenly distributed.” If you’d like to see this in action for yourself, don’t hesitate to reach out - don’t just take our word for it; seeing is believing!