
Beyond the Buzzword: The Only Question that Matters for AI in Network Operations

By Stephen Ochs, Head of Marketing at Selector

For the last couple of years, the technology industry has been having one conversation about AI, while network and infrastructure teams have been having a totally different, much more practical one. 

We’ve all seen the headlines. Every observability vendor now has “AI” plastered across its website. It’s unavoidable. But out on the floor, in the war rooms, and during daily standups, my peers and I aren’t asking whether a platform uses artificial intelligence. We’re asking a question that cuts right through the marketing hype, one that determines whether we get five minutes of sleep tonight or five hours. 

Where is the AI actually working, and what does it help me do that I couldn’t do before? 

That’s the question that needs a real answer, especially for the leaders of large, complex systems. Whether you are responsible for a global financial network, the internet backbone at the ISP level, or hyperscale cloud infrastructure, you understand the challenge. A failed AI implementation isn’t just a disappointment; it’s precious time wasted on the same old chaotic interfaces.

The Unbearable Weight of Complexity 

If you’ve spent any time in the trenches, you know that the world of today’s complex infrastructure doesn’t map readily onto simple textbook models. Our world involves hybrid cloud computing environments, diverse equipment suppliers, containerized applications, and tech debt distributed across multiple sites. 

The problem isn’t a lack of information; it’s the volume and pace at which information arrives from a dozen monitoring tools that can’t communicate with each other. You know the pattern: tool sprawl is real. A problem occurs, and it’s no longer just an alarm; it’s a full-blown storm of noise in which you can’t pinpoint the source.

Our ops teams are the real heroes who keep the lights on, and they spend far too much of their time correlating information across screens, jumping from network to log file to application trace as the clock ticks. The result shows up in MTTR, which directly affects the bottom line and customer trust. We are losing this battle before it starts, because the information we do possess is never properly aligned. In short, we deserve better.

The Myth of the Magic Model

The approach taken by many AI models in production has been to offset data chaos with model size. The thinking goes: throw enough unfiltered data at a behemoth of a model, and the model will figure it out on its own.

However, this approach has failed too many times.

In operations, reliability is everything. You cannot build a house on shifting sand, and you certainly can’t build a reliable operational strategy on noisy, inconsistent data. If critical context is missing, even the most sophisticated model will start to drift toward educated guesswork. It’s like commissioning an architect to design a skyscraper without telling them where the foundation will be or what soil they’ll be working with. The result may be beautiful, but it will ultimately be unstable. We can’t afford unquestioned inference in our critical infrastructure. We have to stop thinking that more data, without structure, automatically leads to more insight. It just leads to bigger messes.

Context is the Real Fuel for Intelligence 

The truth is, it’s time to flip the script and focus on a data-centric approach rather than a model-centric one. The hard part of the work isn’t the inference engine; it’s shaping the data into something meaningful before it ever reaches the model.

Reliable AI starts with context. 

Picture your network telemetry, the metrics, logs, and events, as a stream of microscopic pieces of information. Before you can rely on the AI’s interpretation of that stream, you must enrich it with metadata: Where does this data point live? What does it connect to? What business service does it support? A raw IP address needs to become an endpoint tied to a specific tenant, function, and logical topology.
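
As a rough illustration, a minimal enrichment step might look like the sketch below, assuming the context comes from an inventory or topology source. The lookup table, field names, and enrich function are hypothetical, not any particular product’s schema.

# Minimal sketch of telemetry enrichment (hypothetical inventory and field names).
# A raw interface metric is joined with topology and business context before
# any model ever sees it.

INVENTORY = {
    # keyed by device IP; values are illustrative metadata, not a real schema
    "10.20.30.40": {
        "device": "edge-router-nyc-01",
        "role": "edge-router",
        "site": "nyc-dc1",
        "tenant": "payments",
        "service": "card-authorization",
        "upstream": ["core-router-nyc-01"],
    },
}

def enrich(raw_event: dict) -> dict:
    """Attach topology and business context to a raw telemetry event."""
    context = INVENTORY.get(raw_event.get("source_ip"), {})
    return {**raw_event, **context}

raw = {"source_ip": "10.20.30.40", "metric": "if_out_errors", "value": 1250}
print(enrich(raw))
# The model now sees "errors on the payments edge router in nyc-dc1",
# not just a number attached to an anonymous IP address.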

This allows you to provide the models with the structured data often drawn from a real-time environment model. The system can identify a typical traffic pattern from anomaly data because it has knowledge of each device’s usage context relative to the others. When an anomaly becomes apparent, its roots can be traced back to a context-specific source. This makes the difference between hearing a thousand isolated words and understanding one carefully composed sentence. This step of enriching the information has immense value. If the AI system doesn’t understand the geographical and usage context of the information it’s analyzing, no actionable information can be derived from it. This has already transitioned from being information; it’s information waiting to happen. 
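
To make “context-aware” concrete, here is a minimal, assumed sketch of one way it can work: comparing each device against a baseline for its own role rather than a single global threshold. The roles, utilization numbers, and three-sigma cutoff are illustrative only.

import statistics
from collections import defaultdict

# Hypothetical utilization samples (%), already enriched with a role tag.
samples = [
    {"device": "edge-router-nyc-01", "role": "edge-router", "util": 62},
    {"device": "edge-router-nyc-02", "role": "edge-router", "util": 58},
    {"device": "core-router-nyc-01", "role": "core-router", "util": 21},
    {"device": "core-router-nyc-02", "role": "core-router", "util": 24},
]

# Build a per-role baseline instead of one global threshold.
by_role = defaultdict(list)
for s in samples:
    by_role[s["role"]].append(s["util"])

def is_anomalous(role: str, util: float, sigma: float = 3.0) -> bool:
    values = by_role[role]
    mean = statistics.mean(values)
    stdev = statistics.pstdev(values) or 1.0  # avoid division by zero
    return abs(util - mean) > sigma * stdev

# 60% utilization is normal for an edge router here, but alarming for a core router.
print(is_anomalous("edge-router", 60))   # False
print(is_anomalous("core-router", 60))   # True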

Moving From Symptoms to Narratives 

The biggest shortcoming of traditional tools is the siloed visibility they provide: they perceive incidents as a series of isolated points. The operator receives three notifications: one about a routing problem (NetOps), one about high CPU on a server (ITOps), and one about application latency (CloudOps). To the platform, these are three unrelated symptoms.

However, the real world doesn’t operate this way. Trust me when I say this: in the large environments I’ve run, a single routing flap can cascade through the stack, appearing first as a CPU spike and finally surfacing as application latency. That chain reaction crosses domains.

The intelligence we require must grasp the complex interplay of relationships across domains. It must monitor the co-occurrence of events and metrics across the layers, from Layer 1 through Layer 7, and track where patterns of behavior begin and how they spread. The AI’s context awareness is what lets it connect the routing flap to the CPU spike.

The result shouldn’t be a list of warning signals; it should be a story of cause and effect. It has to say: “These are the events that transpired, this was the initial cause, and this is the effect that cascaded from it.” In a crisis, my team isn’t looking for more information; it’s looking for that story. What AI needs to do is deliver that story quickly, so we can spend our time on the fix rather than the forensics. That is what separates operational intelligence that solves problems from tooling that merely reports on them.
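
As a rough sketch of how cross-domain events might be stitched into a single narrative, the example below groups events that fall inside a short time window and orders them by timestamp, treating the earliest as the likely trigger. The event names, the five-minute window, and the “earliest wins” heuristic are assumptions for illustration, not a description of any particular product’s algorithm.

from datetime import datetime, timedelta

# Hypothetical cross-domain events, each already tagged with its domain.
events = [
    {"time": datetime(2024, 5, 1, 9, 0, 5),  "domain": "NetOps",   "what": "BGP flap on edge-router-nyc-01"},
    {"time": datetime(2024, 5, 1, 9, 0, 40), "domain": "ITOps",    "what": "CPU spike on app-server-07"},
    {"time": datetime(2024, 5, 1, 9, 1, 10), "domain": "CloudOps", "what": "p99 latency breach on checkout API"},
]

def build_incident(events, window=timedelta(minutes=5)):
    """Group events inside one time window and narrate them in causal order."""
    ordered = sorted(events, key=lambda e: e["time"])
    incident = [ordered[0]]
    for e in ordered[1:]:
        if e["time"] - incident[0]["time"] <= window:
            incident.append(e)
    lines = [f"Likely trigger: {incident[0]['what']} ({incident[0]['domain']})"]
    lines += [f"  -> then: {e['what']} ({e['domain']})" for e in incident[1:]]
    return "\n".join(lines)

print(build_incident(events))
# Likely trigger: BGP flap on edge-router-nyc-01 (NetOps)
#   -> then: CPU spike on app-server-07 (ITOps)
#   -> then: p99 latency breach on checkout API (CloudOps)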

The Human Operator is Not Replaceable 

One misconception about AIOps is that it aims to eliminate the human part of the loop. This is neither feasible nor desirable.

Networking involves judgment. A network professional must grasp the business context, the regulations involved, and the human dimension of risk. The AI has the processing power to handle the sheer volume of information, billions of data points, but it does not understand the business.

This makes explainability not merely desirable but non-negotiable; it is the requirement that allows human knowledge and machine intelligence to work together. The moment the AI cannot explain why it identified a particular event as the root cause, or why it connected two seemingly unrelated behaviors, the system becomes worthless, because trust is gone from the start.

To be of real-world value, the system must be transparent about how it works, like a trusted co-pilot. That transparency lets us review and question its results, and even point out where it might be going wrong. The role of the human operator isn’t diminished; it is elevated to the judgment part of the task rather than the processing. That is the real road to automation.

Raising the Bar for True Intelligence 

Real AI in networking is not about flashy interfaces or abstract goals like “prediction.” It’s about building a shared, reliable understanding of how our complex systems actually behave. 

It means: 

  1. Data that is reliably grounded in context and enriched with topology awareness. 
  2. Models that learn patterns of behavior and causality across domains, not just isolated, noisy signals. 
  3. Insights that are fully explainable, allowing humans to trust and improve them. 

This kind of system takes longer to build than a general-purpose model. But this approach, and only this approach, honors the realities of large-scale networks, and it’s the only way to deliver real value. As leaders, it’s time to stop buying models that merely include the word “AI” in the description. Instead, we must insist on context, causation, and the ability to understand the reasoning behind a model’s results. That is the only way to truly unlock AI’s power and move from complexity to control. The future of efficient, effective, reliable networks depends on it.
