
Businesses around the world are waking up to the importance of security in the era of AI-driven development. The reliability of our online systems needs to be part of that conversation too.
AI agents are rapidly taking ownership of more of the development pipeline: they write code, perform code reviews, and orchestrate deployments. The resulting jump in velocity makes the implications for reliability and security hard to predict. Kolton Andrus, who worked as a reliability engineer on complex systems at both Netflix and Amazon, founded Gremlin in 2017 as a platform where engineers who care about the reliability of their applications can proactively run stress tests and improve application health.
Andrus writes about the current AI boom with optimism, but states that “the velocity of AI-driven development is very powerful, but it also amplifies the risk: the faster you drive, the worse the crash. Teams should be proactively and automatically running health and safety checks by embedding them into the AI-driven development process.”
Andrus notes that when agents are composing and iterating on systems, they don’t always have the operational context that humans use to make safe design choices. Prompts rarely capture redundancy requirements, scalability thresholds, or dependency failure modes. An agent might produce code that “works” in a happy-path test, but in reality couldn’t handle an outage, a sudden traffic spike, or a failing downstream dependency.
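The gap Andrus describes is easy to picture. As a sketch (the function and payload names below are illustrative, not from any real codebase), compare a happy-path implementation with one that degrades gracefully when a downstream dependency fails:

```python
# Illustrative sketch: the naive version passes any happy-path test,
# but a downstream outage surfaces as an error to the user; the
# resilient version serves a degraded-but-valid response instead.

def call_downstream() -> dict:
    """Stand-in for a network call to a downstream service."""
    return {"items": ["a", "b", "c"]}

def fetch_recommendations_naive(downstream=call_downstream) -> dict:
    # "Works" in a happy-path test; an outage becomes a user-facing failure.
    return downstream()

def fetch_recommendations_resilient(downstream=call_downstream) -> dict:
    # Degrades gracefully: fall back to an empty payload when the
    # dependency is unavailable, rather than propagating the error.
    try:
        return downstream()
    except ConnectionError:
        return {"items": [], "degraded": True}
```

A prompt that simply asks for "a recommendations endpoint" will usually yield the first shape; the second tends to appear only when failure modes are spelled out in the prompt or exercised by a test.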
The concept of AI guardrails is straightforward: before you ship, run a suite of tests that either pass or fail. If a test fails, figure out why, fix it, and then continuously run automated tests that validate that the fix sticks. Examples of tests you can run on Gremlin to improve resilience include:
- Host redundancy: validate behavior when instances fail or are terminated.
- Zone redundancy: ensure services remain available across AZ/region failures.
- Memory and CPU scalability: simulate resource pressure and ensure graceful degradation.
- Dependency failure: confirm the system handles downstream outages without cascading failures.
- Dependency latency: measure and react to increased response times from services you depend on.
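As a minimal sketch of that pass/fail loop (this is not Gremlin's actual API; the fault-injection hooks, the 200 ms deadline, and the check names are assumptions for illustration), a guardrail suite might inject a dependency failure and extra latency, then assert that the caller degrades gracefully:

```python
import time

TIMEOUT_S = 0.2  # illustrative client-side deadline

def query_dependency(latency_s: float = 0.0, fail: bool = False) -> str:
    """Stand-in for a downstream call; latency and failure are injected faults."""
    time.sleep(latency_s)
    if fail:
        raise ConnectionError("injected dependency failure")
    return "ok"

def handle_request(latency_s: float = 0.0, fail: bool = False) -> str:
    """Caller under test: enforces a deadline and falls back on errors."""
    start = time.monotonic()
    try:
        result = query_dependency(latency_s, fail)
    except ConnectionError:
        return "fallback"
    if time.monotonic() - start > TIMEOUT_S:
        # Real code would cancel the call at the deadline; this sketch
        # just detects the blown deadline afterwards and degrades.
        return "fallback"
    return result

def run_guardrails() -> dict:
    """Each check injects a fault; all must pass before shipping."""
    return {
        "dependency_failure": handle_request(fail=True) == "fallback",
        "dependency_latency": handle_request(latency_s=0.5) == "fallback",
        "happy_path": handle_request() == "ok",
    }
```

Gating a deploy then reduces to something like `all(run_guardrails().values())`: one boolean the pipeline can block on.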
The reality is that agents will increasingly own tasks across the CI/CD pipeline, such as drafting features, iterating on code, and generating infrastructure changes. But near the end of that pipeline, security and reliability gates will still be necessary for the foreseeable future.
According to Andrus, too many teams will ship code produced by AI and assume the AI did enough, then rely on human firefighting or an "AI SRE" after something inevitably breaks. This approach, says Andrus, is "expensive, slow, and stressful."
“Teams can move fast with AI only if they’re confident they won’t introduce significant risks when they do,” said Andrus. “Automated reliability testing gives developers that confidence. It’s the difference between optimistic trust in an agent and verifiable assurance. As AI accelerates development, the organizations that adopt guardrails will be the ones that scale safely, ship boldly, and recover quickly when the unexpected happens.”