
The agentic AI on-call: Why your SREs aren’t ready for 2025’s autonomous operational risk

In 2025, the approach to using AI in Site Reliability Engineering (SRE) has changed significantly. Companies that initially experimented with it in isolated workflows are now embedding autonomous AI agents directly into the production loop. These agents take on important functions such as managing cloud infrastructure, optimising resource allocation, and responding to incidents.

“A new type of team member is emerging in SRE right now – a multifunctional AI system”, says Ashish Upadhyay, a Cloud Solutions Engineer. AI Journal invited Ashish to discuss this shift in depth.

According to Ashish, despite agentic AI’s advanced capabilities, it also carries serious risks, including decision opacity and AI hallucinations that can cause production incidents. This creates a need for AI-specific observability. Ashish believes these new agent-related risks call for new governance and a different product mindset.

Your new team member is AI 


Companies are now directing resources toward agentic AI, recognising its potential to perform operations independently and, in the future, even without human intervention.

“This change, although a logical step, has brought a cultural and operational transformation inside companies”, the expert shares. Where AI was previously an auxiliary tool, AI agents now demand a different approach: their actions might deviate from established playbooks, and understanding their autonomous judgment helps engineers minimise potential risks. Beyond the opaque decision-making and hallucinations already mentioned, there is also silent degradation or drift, loss of traceability and postmortem blind spots, and overstepping of permissions.

Elaborating on the last risk, Ashish points out that the transition to agentic AI also requires rethinking approaches to control and trust. “When AI was an assistant, people could manually review, reject, or adjust its output. Agentic systems, however, act autonomously and therefore demand more advanced guardrails”, he adds. These include transparent reasoning and decision logging, as well as permission boundaries that constrain what the agent is allowed to modify.
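To make the idea concrete, here is a minimal sketch of such a guardrail in Python. The action names, the allowlist, and the AgentAction structure are illustrative assumptions rather than part of any specific platform: the agent is constrained to a fixed set of operations, and every decision is written to an audit trail.

```python
from dataclasses import dataclass
from datetime import datetime, timezone

# Hypothetical action record; the field names are illustrative assumptions.
@dataclass(frozen=True)
class AgentAction:
    name: str       # e.g. "restart_pod", "scale_deployment"
    target: str     # resource the agent wants to modify
    reasoning: str  # the agent's own explanation, kept for the audit trail

# Permission boundary: the agent may only perform actions on this allowlist.
ALLOWED_ACTIONS = {"restart_pod", "scale_deployment", "rollback_release"}

def execute_with_guardrails(action: AgentAction) -> bool:
    """Reject any action outside the boundary and log the decision either way."""
    permitted = action.name in ALLOWED_ACTIONS
    audit_entry = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "action": action.name,
        "target": action.target,
        "reasoning": action.reasoning,
        "permitted": permitted,
    }
    print(audit_entry)  # in production this would go to an append-only log store
    return permitted
```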

The AI onboarding problem: why agents must be treated as new hires 

“If a human SRE joined the team tomorrow, they would not be given root access or full IAM roles right away”, Ashish comments about providing permissions. Granting full access is a gradual process based on training, trial results, and a proven understanding of the system.

As the expert says, the same attitude should apply to agentic systems, ideally in the form of an AI onboarding checklist. Expanding the notion of the advanced guardrails, this checklist might consist of the following criteria: 

  1. Access management: dedicated service accounts, Just-In-Time/Just-Enough-Access (JIT/JEA), and least-privileged permissions. The goal of this criterion is to prevent AI from touching resources it is not supposed to. 
  2. Auditability and logging: the AI’s decision traces, the inputs it received, the outputs it produced, and the system changes it applied all need to be logged immutably. Without audit logs, debugging becomes especially difficult. 
  3. Agent performance metrics: to ensure that the agent performs effectively, new metrics are required. They might include successful and unsuccessful actions, hallucination rate, and unsafe-action attempts blocked by guardrails. 
  4. Circuit breakers: this criterion defines the boundaries of agentic AI’s actions, including mandatory human-approval thresholds, blast-radius controls (for example, the agent can only modify one cluster), and circuit breakers that automatically disable the system (a minimal sketch follows this list). 
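As a rough illustration of how items 3 and 4 of the checklist could fit together, the sketch below combines a blast-radius check with a simple failure counter that trips the breaker. The thresholds, cluster names, and class design are assumptions made for the example, not a prescribed implementation.

```python
from typing import Optional, Set

class AgentCircuitBreaker:
    """Illustrative circuit breaker for an autonomous agent."""

    def __init__(self, max_failures: int = 3, blast_radius: Optional[Set[str]] = None):
        self.max_failures = max_failures            # consecutive failures before tripping
        self.blast_radius = blast_radius or set()   # clusters the agent may modify
        self.failures = 0
        self.tripped = False

    def allow(self, cluster: str) -> bool:
        """Deny the action if the breaker has tripped or the cluster is out of scope."""
        return not self.tripped and cluster in self.blast_radius

    def record(self, success: bool) -> None:
        """Track consecutive failures and disable the agent when the limit is hit."""
        self.failures = 0 if success else self.failures + 1
        if self.failures >= self.max_failures:
            self.tripped = True  # agent stays disabled until a human re-enables it

# Usage: restrict the agent to a single cluster, as in the blast-radius example above.
breaker = AgentCircuitBreaker(max_failures=3, blast_radius={"prod-eu-west-1"})
if breaker.allow("prod-eu-west-1"):
    pass  # hand the action to the agent, then call breaker.record(success)
```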

New failure models: when agents hallucinate at 3 AM

To understand why this onboarding is necessary, people must first recognise how these agents fail in ways traditional systems do not. Some of these failure modes are listed below:

  • Decision opacity occurs when engineers cannot explain the agent’s logic. Suppose, for instance, that people do not understand why the system chose a particular remediation when fixing a production issue. That gap limits their ability to reconstruct the incident, diagnose the root cause, or improve future reliability approaches. This failure mode aligns with the auditability and logging criterion of the onboarding checklist. 
  • Performance drift occurs when the agent’s model drifts from optimal behaviour or the underlying data changes. These failures do not immediately cause outages, but over time they accumulate technical and operational debt, which can lead to more serious incidents. This scenario corresponds to the need for agent performance metrics (a simple detection sketch follows this list).
  • Conflicting objective optimisation can also occur, because agentic AI has to balance multiple goals: latency, cost, stability, and efficiency. Under challenging conditions it may prioritise one objective over another; for example, it might aggressively downscale infrastructure to minimise cost, violating availability SLOs. This failure underscores the importance of access management.
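To make performance drift measurable, a team might track a rolling success rate for the agent’s actions and alert when it slips below an agreed baseline. The sketch below is illustrative only; the window size, baseline, and tolerance are assumptions.

```python
from collections import deque

class DriftMonitor:
    """Flag gradual degradation in an agent's action success rate.

    Window size, baseline, and tolerance are illustrative assumptions.
    """

    def __init__(self, window: int = 200, baseline: float = 0.95, tolerance: float = 0.05):
        self.outcomes = deque(maxlen=window)  # rolling window of recent action outcomes
        self.baseline = baseline
        self.tolerance = tolerance

    def record(self, success: bool) -> None:
        self.outcomes.append(success)

    def drifting(self) -> bool:
        """True when the rolling success rate has slipped well below the baseline."""
        if len(self.outcomes) < self.outcomes.maxlen:
            return False  # not enough data yet to judge
        rate = sum(self.outcomes) / len(self.outcomes)
        return rate < self.baseline - self.tolerance
```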

Ashish also points to industry data showing why understanding agents’ failures is necessary: one survey found that 72% of companies have experienced at least one outage caused by AI-generated code or agentic automation. These failures can be predicted and prevented if engineers understand them and design guardrails accordingly, as this article also highlights.

Rethinking observability: why your APM is blind to AI 

Traditional observability tools, such as logs, metrics, and traces, work well for deterministic systems: they can help engineers understand whether a pod crashed or latency spiked. However, they cannot yet tell whether the agent is hallucinating or its behaviour is becoming unstable. “This is precisely why AI observability is gaining popularity in 2025”, the expert explains. 

AI observability must cover model drift, anomaly detection on token patterns, data-quality scoring, API success/failure rates, and hallucination risk. According to Ashish, because agentic systems are built differently from deterministic ones, engineers need to observe them differently as well.
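As an illustration of what exporting such signals might look like alongside traditional telemetry, the sketch below uses the prometheus_client library. The metric names and the idea of an externally computed hallucination-risk score are assumptions for the example, not an established standard.

```python
from prometheus_client import Counter, Gauge, start_http_server

# Metric names below are illustrative assumptions, not an established convention.
AGENT_ACTIONS = Counter(
    "agent_actions_total", "Actions taken by the AI agent", ["outcome"]
)
HALLUCINATION_RISK = Gauge(
    "agent_hallucination_risk", "Latest hallucination-risk score from an external checker"
)
MODEL_DRIFT = Gauge(
    "agent_model_drift_score", "Distance between current and reference output behaviour"
)

def record_agent_step(succeeded: bool, hallucination_score: float, drift_score: float) -> None:
    """Export one agent step's AI-specific signals next to ordinary APM metrics."""
    AGENT_ACTIONS.labels(outcome="success" if succeeded else "failure").inc()
    HALLUCINATION_RISK.set(hallucination_score)
    MODEL_DRIFT.set(drift_score)

if __name__ == "__main__":
    start_http_server(9100)  # expose /metrics for scraping; the port is an arbitrary choice
    record_agent_step(succeeded=True, hallucination_score=0.12, drift_score=0.03)
    # in a real agent this would run inside the main loop, keeping the process alive
```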

Conclusion: the “Agentic SRE” runbook

Ashish concludes that the centre of gravity in SRE is shifting: instead of operating systems directly, engineers are now responsible for supervising and governing the autonomous agents that run them. This change requires acquiring new skills, including AI governance literacy, prompt engineering, a deep understanding of LLMs, the ability to design agent guardrails, and competence in AI observability. In the expert’s opinion, these capabilities will help companies safely integrate autonomous agents into their operational workflows.

Author

  • I am Erika Balla, a technology journalist and content specialist with over 5 years of experience covering advancements in AI, software development, and digital innovation. With a foundation in graphic design and a strong focus on research-driven writing, I create accurate, accessible, and engaging articles that break down complex technical concepts and highlight their real-world impact.

