By Andy Bold
For years, “testing in production” was the punchline to jokes about reckless engineering practices. It conjured images of cowboy developers pushing untested code to live systems and hoping for the best. But here’s the reality that’s reshaping how we think about software delivery: production is the only environment that truly matters, and with the right tools and practices, testing there isn’t just safe—it’s essential.
The production paradox
No matter how sophisticated your staging environment, it will never perfectly mirror production. The data is different, the scale is different, the usage patterns are different, and most importantly, the users are different. You can run thousands of tests in lower environments and still miss issues that only manifest when real users interact with your system at scale.
This isn’t a failure of testing—it’s a recognition of reality. Modern distributed systems are too complex to fully simulate. The interactions between services, the impact of real-world network conditions, the behaviour under actual load, and the edge cases that only emerge from millions of users doing unexpected things—these can only be truly understood in production.
But acknowledging this reality doesn’t mean we should blindly push code and cross our fingers. Instead, it means we need to fundamentally rethink how we approach production deployments. This is where observability and feature flags transform “testing in production” from a reckless gamble into a controlled, data-driven practice.
Observability: Your production microscope
Traditional monitoring tells you when something is broken. Observability shows you how your system is actually behaving. This distinction is crucial when you’re validating new features in production.
When you have proper observability in place, production becomes a rich source of immediate feedback. Every request tells a story through distributed traces. Every metric reveals patterns of behaviour. Every log entry provides context. Together, they create a real-time narrative of how your changes are performing in the wild.
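As a sketch of what "every request tells a story" can mean concretely, here is a minimal structured-event helper in Python. The names (`traced`, the record fields) are illustrative rather than any particular tracing library's API; a real system would ship these events to a tracing or logging backend instead of printing them.

```python
import json
import time
import uuid
from contextlib import contextmanager

@contextmanager
def traced(operation, **context):
    """Emit one structured record per unit of work: a trace id,
    the operation name, its duration, and its outcome."""
    record = {"trace_id": uuid.uuid4().hex, "operation": operation, **context}
    start = time.monotonic()
    try:
        yield record
        record["outcome"] = "ok"
    except Exception as exc:
        record["outcome"] = "error"
        record["error"] = type(exc).__name__
        raise
    finally:
        record["duration_ms"] = round((time.monotonic() - start) * 1000, 2)
        print(json.dumps(record))  # stand-in for shipping to a backend

# Usage: each request becomes one queryable event.
with traced("fetch_profile", user_id="u-42"):
    pass  # handler logic goes here
```

Because every event carries the same context fields, latency distributions and error rates can later be sliced by operation, user segment, or feature flag state.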
Consider what this means for a typical feature release:
Without observability, you deploy and wait nervously. Is the new feature working? Are users experiencing issues? You won’t know until your basic health checks fail or users start complaining, and by then the damage is already done.
With observability, you watch the story unfold in real-time. You see exactly how the new code paths are executing. You observe latency distributions, error rates, and resource consumption at a granular level. You can trace individual user journeys through the new feature. Most importantly, you can detect subtle degradations before they become major incidents.
This visibility transforms production from a black box into a transparent environment where you can safely observe and validate your changes. You’re not flying blind—you’re conducting controlled experiments with full visibility into the results.
Feature flags: Your safety net and control panel
If observability is your microscope, feature flags are your control panel. They provide the ability to separate deployment from release, giving you unprecedented control over who sees what and when.
At its simplest, a feature flag is a conditional statement that determines whether a piece of code executes. But this simple concept enables powerful patterns:
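In its most basic form, that conditional might look like the following Python sketch. The in-memory `FLAGS` store and `is_enabled` helper are hypothetical stand-ins; a real flag system reads its state from a config service so it can change without a deploy.

```python
# Hypothetical in-memory store; real flag systems fetch this state
# from a remote config service so it can change without a deploy.
FLAGS = {"new_checkout": True}

def is_enabled(name: str) -> bool:
    return FLAGS.get(name, False)  # unknown flags default to off

def new_checkout_flow(cart):      # stand-ins for the two code paths
    return f"new:{len(cart)}"

def legacy_checkout_flow(cart):
    return f"legacy:{len(cart)}"

def checkout(cart):
    # The flag decides which path executes; both ship in the same build.
    if is_enabled("new_checkout"):
        return new_checkout_flow(cart)
    return legacy_checkout_flow(cart)
```

Defaulting unknown flags to off is a deliberate choice: a typo or a missing flag definition fails safe, to the existing behaviour.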
Progressive rollouts
Instead of releasing to all users at once, you can gradually increase exposure. Start with 1% of traffic, watch your observability dashboards, then increase to 5%, 10%, and eventually 100%. If issues arise at any stage, you have contained the blast radius to a small percentage of users.
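One common way to implement such a rollout is stable hashing: each user is bucketed deterministically, so raising the percentage only ever adds users to the rollout and never flips anyone back out. A minimal sketch, with illustrative flag names:

```python
import hashlib

def in_rollout(flag: str, user_id: str, percent: float) -> bool:
    """Deterministically place each user in a bucket from 0.00 to
    99.99; the same user always lands in the same bucket, so a
    higher percentage is a strict superset of a lower one."""
    digest = hashlib.sha256(f"{flag}:{user_id}".encode()).hexdigest()
    bucket = int(digest[:8], 16) % 10000 / 100
    return bucket < percent

# Ramping 1% -> 5%: everyone enabled at 1% stays enabled at 5%.
one_pct = {u for u in map(str, range(1000)) if in_rollout("new_algo", u, 1)}
five_pct = {u for u in map(str, range(1000)) if in_rollout("new_algo", u, 5)}
```

Hashing the flag name together with the user id also means different flags get independent buckets, so the same small group of users isn’t always first in line for every experiment.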
Targeted releases
Feature flags can be much more sophisticated than simple on/off switches. You can target specific user segments, geographical regions, or device types. This allows you to test with users who are more tolerant of issues (like internal employees or beta testers) before exposing features to your most critical customers.
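Targeting rules are often expressed as an ordered list of match conditions where the first matching rule wins. A hedged sketch, where the rule shape and segment names are assumptions rather than any specific vendor’s schema:

```python
def flag_enabled(rules, user) -> bool:
    """Evaluate ordered targeting rules; the first rule whose
    conditions all match the user's attributes decides."""
    for rule in rules:
        if all(user.get(k) == v for k, v in rule["match"].items()):
            return rule["enabled"]
    return False  # no rule matched: default off

rules = [
    {"match": {"segment": "internal"}, "enabled": True},  # employees first
    {"match": {"segment": "beta"}, "enabled": True},      # then beta testers
    {"match": {}, "enabled": False},                      # everyone else: off
]
```

The empty-match catch-all at the end makes the default explicit, which keeps the rule list readable as a complete policy.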
Instant rollback
Perhaps most importantly, feature flags act as a kill switch. When observability reveals problems, such as increased error rates, degraded performance, or unexpected behaviour, you can instantly disable the feature without deploying new code. What might once have required an emergency rollback taking 30 minutes or more becomes a configuration change that takes effect in seconds.
A/B testing in reality
Feature flags also enable true A/B testing in production. You can run experiments comparing different implementations, measure real user behaviour, and make data-driven decisions about which approach works best. Your observability data provides the quantitative feedback needed to evaluate success.
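Variant assignment for an A/B test can reuse the same stable-hashing idea: hash each user into an arm so they see the same variant on every visit, then compare observability data per arm. A sketch with illustrative experiment names:

```python
import hashlib
from collections import Counter

def variant(experiment: str, user_id: str,
            arms=("control", "treatment")) -> str:
    """Stable assignment: the same user always hashes into the
    same arm, so their experience is consistent across visits."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    return arms[int(digest, 16) % len(arms)]

# Roughly even split across simulated users; metrics such as
# conversion rate and latency are then compared arm by arm.
counts = Counter(variant("reco_v2", str(u)) for u in range(2000))
```

Deterministic assignment matters for more than consistency: it lets you join the assignment back to your observability data after the fact, without storing a separate allocation table.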
The synergy of observability and feature flags
While powerful individually, observability and feature flags together create a framework for safe production testing that’s greater than the sum of its parts.
Imagine releasing a new recommendation algorithm for an e-commerce site. With feature flags, you enable it for 5% of users. With observability, you immediately see:
- Response times for the recommendation service
- Cache hit rates and database query patterns
- User engagement metrics for those seeing recommendations
- Error rates and specific failure modes
- Infrastructure impact and cost implications
If you notice response times creeping up beyond acceptable thresholds, you can immediately disable the feature flag. You haven’t just prevented an outage—you’ve gathered invaluable data about the performance characteristics of your new algorithm under real-world conditions.
This tight feedback loop enables a new development paradigm. Instead of trying to predict every possible scenario in staging, you can:
- Build features with comprehensive instrumentation
- Deploy to production behind feature flags
- Gradually increase exposure while monitoring key metrics
- Iterate based on real-world feedback
- Roll back instantly if needed
Building confidence through control
The key to making “testing in production” work is control. Observability gives you control through visibility—you know exactly what’s happening. Feature flags give you control through containment—you decide who’s affected and for how long.
This control transforms the risk equation. Instead of betting everything on a big-bang release, you’re making a series of small, controlled, reversible decisions. Each decision is informed by real data from real users in the real environment that matters.
Best practices for production testing
To make this approach work, consider these practices:
Instrument first, deploy second: Before enabling any feature flag, ensure you have the observability in place to understand its impact. Define your success metrics and failure conditions upfront.
Start small, fail fast: Begin with the smallest possible user segment. If you’re going to have issues, better to affect 100 users for 1 minute than 100,000 users for an hour.
Monitor both technical and business metrics: It’s not enough to know your systems are healthy. You need to understand whether the feature is achieving its business goals.
Automate where possible: Use observability data to automatically trigger feature flag changes. If error rates exceed thresholds, automatically reduce traffic or disable the feature entirely.
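A minimal version of that automation is a guardrail check that flips the flag off when the observed error rate crosses a threshold. The flag store, function name, and threshold below are illustrative assumptions:

```python
flags = {"new_reco": True}  # hypothetical in-memory flag state

def check_guardrail(flag: str, outcomes: list,
                    error_rate_max: float = 0.02) -> float:
    """outcomes holds recent request results for the flagged path
    (True = success, False = error). If the windowed error rate
    crosses the threshold, disable the flag immediately."""
    rate = outcomes.count(False) / len(outcomes) if outcomes else 0.0
    if rate > error_rate_max:
        flags[flag] = False  # kill switch: takes effect without a deploy
    return rate
```

A production version would also want hysteresis and a minimum sample size, so a single early error in a small window doesn’t kill an otherwise healthy rollout.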
Learn from every release: Each production test generates valuable data. Use it to refine your understanding of system behaviour and improve future releases.
The new normal
Testing in production isn’t about being reckless—it’s about acknowledging that production is the ultimate testing environment and building practices that make it safe to learn there. With observability providing visibility and feature flags providing control, we can confidently deliver features faster while actually reducing risk.
This approach represents a fundamental shift in how we think about software delivery. Instead of trying to eliminate all uncertainty before production, we’re building systems and practices that help us navigate uncertainty safely. We’re not preventing failures—we’re making them survivable and educational.
The organisations that master this approach gain a significant competitive advantage. They can iterate faster, learn faster, and deliver value faster, all while maintaining or even improving reliability. They’ve turned production from a scary place where things might break into a laboratory where they can safely experiment and improve.
Conclusion
“Testing in production” has evolved from a joke about bad practices to a sophisticated approach enabled by modern tooling. Observability shows us what’s really happening, while feature flags give us precise control over the experience we’re delivering. Together, they transform production from a place of risk to a place of learning.
The future belongs to organisations that can safely and confidently iterate in production. By embracing observability and feature flags, we’re not just making testing in production safe—we’re making it the most effective way to deliver software that truly serves our users’ needs. After all, production is where our users are, and ultimately, that’s the only environment that really matters.