An alert is just a continuously running test case in production.
That's it. That's the reframe. If you actually believe that sentence, your entire approach to observability changes. You stop treating monitoring as an ops concern and start treating it as a quality concern -- which is what it always was.
A few years back, my team lost the tool we used to track bug escape rate and change failure rate. Budget cuts. You know how it goes. The tool is gone, but the metric doesn't stop mattering -- if anything, it matters more, because now someone has to ask: is our velocity actually producing quality outcomes, or just producing more code?
I couldn't answer that question. So I worked with an AI to build an answer.
I exported from Jira, called the GitHub API, put both into a spreadsheet with formulas I genuinely do not understand (the AI wrote them; they look like SQL and Excel had a terrible baby). It took about a day of back-and-forth. I now have a spreadsheet that updates monthly with Python scripts the AI wrote for me, and tells me exactly what my bug escape rate and change failure rate are across my teams. Real numbers. Real data.
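The shape of that join can be sketched in a few lines. This is a hypothetical reconstruction, not my actual script: the field names (`found_in`, `cause_pr`, `merged_pr`) stand in for whatever your Jira export and GitHub API responses actually carry.

```python
# Hypothetical sketch: join exported Jira bugs against shipped changes
# to compute the two monthly metrics. Field names are illustrative.

def bug_escape_rate(bugs):
    """Fraction of bugs that were found in production rather than earlier."""
    if not bugs:
        return 0.0
    escaped = sum(1 for b in bugs if b["found_in"] == "production")
    return escaped / len(bugs)

def change_failure_rate(changes, bugs):
    """Fraction of shipped changes later linked to a production bug."""
    if not changes:
        return 0.0
    failing_prs = {b["cause_pr"] for b in bugs if b["found_in"] == "production"}
    failed = sum(1 for c in changes if c["merged_pr"] in failing_prs)
    return failed / len(changes)

bugs = [
    {"key": "BUG-1", "found_in": "production", "cause_pr": 101},
    {"key": "BUG-2", "found_in": "qa", "cause_pr": 102},
]
changes = [{"merged_pr": n} for n in (101, 102, 103, 104)]
print(bug_escape_rate(bugs))               # 0.5
print(change_failure_rate(changes, bugs))  # 0.25
```

The formulas the AI wrote for me were uglier than this, but this is the calculation underneath them.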
That spreadsheet is monitoring. Not of my production service -- of my development pipeline. The point I want to make is that monitoring doesn't stop at your customer-facing infrastructure. If you're doing agentic development, the pipeline that produces your software is itself something that needs to be observed.
What Monitoring Actually Is
In a continuous delivery system, you can't test everything. You make bets. You triage. You decide that some things are important enough to have test coverage and some things will be caught after the fact if they break. That's not a failure of discipline -- it's the only honest way to operate when you have finite time and infinite potential test cases.
Monitoring is the backstop for the bets you don't take. It's your answer to the question: "I didn't test for this. How will I know if it breaks?"
In a high-velocity agentic environment where agents are shipping changes faster than a human could, you especially can't test everything. The rate of change is too high. The coverage can't keep up. This isn't new -- it's just more acute. Monitoring has to work harder than it did before.
The question monitoring answers is simple: is this still working the way I think it should be? Not "is it functionally correct" -- that's tests. Not "does it meet user requirements" -- that's exploratory validation. Is it operating correctly, right now, in the real production environment where dependencies fail, caches expire, and AWS has opinions of its own?
The Three Things You Have to Know
For every system you ship to paying customers, there are three things you need to know at all times.
Metrics tell you what your system is doing in aggregate. Counters, rates, histograms. How many requests per second? How many errors? What is the 99th percentile latency? At Bing, before anyone had coined "observability," we had error pages that included a debug string -- request IDs, timestamps, GUIDs. You could paste that string into an internal tool and get logs from every service in the stack about what happened on that specific request. That was baby's-first distributed tracing, and it worked because someone understood the principle: you should be able to reconstruct what happened. Your metrics are what tell you that something happened. "There's a spike in 500s at 2:47pm" -- you can see that in a graph in thirty seconds. You cannot see it at a glance in a raw log file.
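To make "aggregate" concrete, here's a minimal sketch of how per-request records collapse into the numbers you graph. In practice a metrics library (StatsD, a Prometheus client) does this for you; the arithmetic is the point.

```python
# Sketch: turn per-request records into aggregate metrics
# (error rate and p99 latency). Data is made up for illustration.
import math

def p99(latencies_ms):
    """Nearest-rank 99th percentile of a list of latencies."""
    ordered = sorted(latencies_ms)
    idx = math.ceil(0.99 * len(ordered)) - 1
    return ordered[idx]

# 99 healthy requests plus one slow failure
requests = [{"status": 200, "latency_ms": 20 + i} for i in range(99)]
requests.append({"status": 500, "latency_ms": 950})

error_rate = sum(1 for r in requests if r["status"] >= 500) / len(requests)
print(error_rate)                                # 0.01
print(p99([r["latency_ms"] for r in requests]))  # 118
```

Note that the one 950ms outlier barely moves the p99 here -- which is exactly why you track percentiles and error counts separately instead of relying on averages.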
Logs tell you what happened for a specific event. They're detailed, they're per-request, and they're how you understand the "why" once the metrics tell you something is wrong. Logs without metrics means you're reading every line looking for the needle. Metrics without logs means you see the needle but can't examine it. You need both.
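A minimal sketch of that metrics-to-logs handoff, with an in-memory list standing in for a real log backend and illustrative field names. Structured JSON entries keyed by a request ID are what make the "examine the needle" step fast:

```python
# Sketch: structured logs keyed by a request ID, so once a metric says
# "500s spiked at 2:47pm" you can pull the exact failing requests.
import json
import uuid

LOG = []  # stand-in for a real log backend

def log_request(path, status, error=None):
    entry = {"request_id": str(uuid.uuid4()), "path": path,
             "status": status, "error": error}
    LOG.append(json.dumps(entry))
    return entry["request_id"]

log_request("/labels", 200)
rid = log_request("/labels", 500, "label store timeout")

# The metric told you *that* something happened; the logs tell you *why*.
failures = [json.loads(line) for line in LOG
            if json.loads(line)["status"] >= 500]
print(failures[0]["error"])  # label store timeout
```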
Distributed tracing gives you the full path of a request across every service that touched it -- every machine, every hop, every place something could have gone wrong. In a microservices system, a single user action might touch a dozen services. A log from one of them tells you what happened there. A trace tells you everything.
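A toy sketch of the idea: every span carries the same trace ID, so the full path can be reassembled afterward. Real systems (OpenTelemetry and friends) propagate the ID in request headers and ship spans to a collector; the service names here are made up.

```python
# Toy trace propagation: each service records a span with the shared
# trace_id, downstream calls first, so the whole path is recoverable.
SPANS = []  # stand-in for a trace collector

def record(trace_id, service, downstream=()):
    for call in downstream:
        call(trace_id)  # propagate the same trace_id to dependencies
    SPANS.append({"trace_id": trace_id, "service": service})

def cache(trace_id):    record(trace_id, "cache")
def labels(trace_id):   record(trace_id, "labels")
def frontend(trace_id): record(trace_id, "frontend", (cache, labels))

tid = "req-123"  # normally a generated ID carried in request headers
frontend(tid)
path = [s["service"] for s in SPANS if s["trace_id"] == tid]
print(path)  # ['cache', 'labels', 'frontend']
```

That Bing debug string was this exact mechanism with a GUID playing the role of `trace_id`.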
Metrics Are Your Testing Backstop
Here's the practical connection to your test suite: your alerts are the tests for the things you chose not to test.
If you triaged labeling functionality as low enough priority that you didn't add it to your regression suite, that's fine -- as long as you have an alert that fires when label API error rates spike. The metric is the thing that tells you "labeling is broken" before your users file a ticket about it. Without the metric, your deliberate triage decision becomes "we shipped a bug and we'll find out when someone complains."
Your alert thresholds should be set with the same intentionality as your test coverage decisions. Which metrics matter most to your customers? Those need tight SLOs and short alert windows. Which metrics are informational -- things you want to know about but that aren't customer-impacting in the short term? Those can have looser thresholds. This is the same triage calculation you make for your test suite, just applied to your production signals.
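That tiering can be expressed as data. A hypothetical sketch, with made-up metric names and thresholds: the customer-facing metric pages someone on a tight threshold and short window, the informational one just gets recorded.

```python
# Sketch of tiered alert rules. Metric names, thresholds, and windows
# are illustrative, not recommendations.
ALERT_RULES = {
    # customer-impacting: tight SLO, short window, wakes someone up
    "label_api_error_rate": {"threshold": 0.01, "window_min": 5, "page": True},
    # informational: loose threshold, long window, no page
    "nightly_export_lag_h": {"threshold": 6.0, "window_min": 120, "page": False},
}

def evaluate(metric, value):
    """Return a firing alert, or None if the metric is within bounds."""
    rule = ALERT_RULES[metric]
    if value > rule["threshold"]:
        return {"metric": metric, "value": value, "page": rule["page"]}
    return None

print(evaluate("label_api_error_rate", 0.04))  # fires, pages someone
print(evaluate("nightly_export_lag_h", 3.0))   # None
```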
Monitoring Your Development Pipeline
Here's the part that's different in an agentic environment, and where most teams aren't paying enough attention yet.
You need SLOs for your development pipeline, not just your production service.
What is your deploy frequency? What is your change failure rate? How long does it take to recover when a change goes wrong? These are the DORA metrics, and they matter as much now as they ever did -- actually more, because the signal-to-noise ratio in an agentic pipeline is harder to maintain. Agents ship PRs fast. Not all of them are good. Your change failure rate will tell you whether the speed is translating to quality outcomes or just to more changes.
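For concreteness, here's how those three numbers fall out of a deploy log. The record shape is illustrative; yours will come from wherever your deploys and rollbacks are actually recorded.

```python
# Sketch: the three DORA-style pipeline numbers from a deploy log.
from datetime import datetime, timedelta

deploys = [
    {"at": datetime(2025, 6, 1), "failed": False, "recovered_at": None},
    {"at": datetime(2025, 6, 3), "failed": True,
     "recovered_at": datetime(2025, 6, 3, 2)},
    {"at": datetime(2025, 6, 5), "failed": False, "recovered_at": None},
    {"at": datetime(2025, 6, 8), "failed": True,
     "recovered_at": datetime(2025, 6, 8, 1)},
]

days = (deploys[-1]["at"] - deploys[0]["at"]).days or 1
deploy_frequency = len(deploys) / days  # deploys per day
change_failure_rate = sum(d["failed"] for d in deploys) / len(deploys)
recoveries = [d["recovered_at"] - d["at"] for d in deploys if d["failed"]]
mttr = sum(recoveries, timedelta()) / len(recoveries)  # mean time to recover

print(round(deploy_frequency, 2))  # 0.57
print(change_failure_rate)         # 0.5
print(mttr)                        # 1:30:00
```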
If your agent is subtly degrading code quality over time -- maybe because the model version changed, or context drift is accumulating, or a dependency shifted in a way that changed behavior -- you want to know before your customers do. A gradually rising bug escape rate is that signal. So is a creeping increase in how often changes roll back. These trends are invisible unless you're measuring them.
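The trend check itself can be as simple as comparing the recent months against the earlier baseline. A sketch, with an arbitrary tolerance you'd tune to your own noise level:

```python
# Sketch: flag drift when recent bug escape rates exceed the baseline
# by more than `tolerance`x. Window and tolerance are illustrative.
def drifting(monthly_rates, recent=3, tolerance=1.25):
    """True if the last `recent` months average more than
    tolerance times the baseline average."""
    baseline, latest = monthly_rates[:-recent], monthly_rates[-recent:]
    if not baseline:
        return False  # not enough history to call it a trend
    base_mean = sum(baseline) / len(baseline)
    return sum(latest) / len(latest) > base_mean * tolerance

# flat for four months, then creeping upward
print(drifting([0.05, 0.05, 0.06, 0.05, 0.08, 0.09, 0.10]))  # True
```

Any individual month in that series looks defensible. The trend is the signal.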
The bug escape rate spreadsheet I described at the top of this post is not a sophisticated monitoring system. It's a Python script and a spreadsheet. But it gives me real data on real trends, and it cost me a day to build. The question it answers -- are we shipping quality outcomes or just shipping volume -- is worth a day.
You Can't Test Everything, and That's Fine
The mistake I see teams make is treating the gaps in their test coverage as something to be ashamed of, as if relying on monitoring is an admission of failure. It isn't. You cannot test everything. Every team with a real codebase and a real deadline has made deliberate tradeoffs about what to cover and what to catch in production. The discipline isn't in having perfect test coverage. The discipline is in knowing what you didn't test and having monitoring that catches it when it breaks.
Monitoring isn't a fallback. It's half of your quality strategy -- the half that operates in production, where the real world has opinions that your test environment didn't model. Treat it like the testing infrastructure it is: define what it should catch, measure whether it's catching it, and improve it when it isn't.
The teams moving fast and staying safe in agentic development aren't the ones who write perfect tests. They're the ones who know what they're not testing, and have the monitoring in place to catch it anyway.
This is post 6 of 7 in The Boring Parts Matter: Engineering Fundamentals for the Agentic Era.

