
As enterprises and cloud providers race to unleash the power of AI, infrastructure investments continue to soar. Gartner and Goldman Sachs both project more than $1 trillion in AI spending in the coming years, and that figure continues to escalate as organizations vie for competitive advantage.
Yet while AI growth has been relentless, its next era won’t be won on power contracts and GPU FLOPS alone. A critical blind spot threatens to undermine these massive investments: the network infrastructure connecting AI systems.
The blind spot in AI spending
Underutilized AI infrastructure represents one of the industry’s most significant problems, draining both capital and operational expenditures. The numbers are sobering. Organizations typically achieve GPU utilization rates of 30-40% when training AI models. Even sophisticated hyperscalers reach just 50% GPU utilization on their best days, meaning half of the world’s most advanced AI hardware sits idle, consuming power while generating minimal value.
When researchers dug into the cause of these inefficiencies, the network connecting GPUs emerged as a critical performance bottleneck. Research from Meta and Harvard shows that up to 32% of GPU hours are consumed by inter-node communication rather than actual computation, all while burning significant power. Compounding this challenge, approximately 30% of each training epoch is spent recovering from failures, many of which stem from network or link errors, per research from Alibaba.
These findings underscore a fundamental reality: Designing for AI at scale requires a radically different approach to network architecture. Until organizations address this bottleneck, AI’s potential will continue to be squandered.
Why traditional networks fall short
Modern networks appear robust on paper, offering 400Gbps and 800Gbps link speeds, low latency through RoCEv2, and tightly integrated hardware and software stacks. Yet these specifications function as mere “glamour metrics” that mask inefficiencies in the underlying infrastructure.
Large-scale AI training clusters continue to suffer from disruptive congestion, packet loss, high tail latency, scale limitations, and elevated failure rates. The problem isn’t merely technical; it’s architectural. Traditional network designs weren’t conceived with AI’s unique demands in mind, particularly the massive, synchronized data movements required for distributed training at scale.
The industry’s reliance on incremental improvements has reached a dead end. To satisfy AI’s escalating demands, organizations need to make ambitious, monumental leaps in network design – not tomorrow, but today.
Building the lossless, congestion-free network
Breaking through the networking bottleneck requires organizations to build infrastructure that delivers both lossless transmission and congestion-free operation, two goals that might seem contradictory. After all, when a network guarantees that every packet arrives successfully, doesn’t that inevitably create congestion? Consider a busy airport where airplanes (data packets) need to land (deliver their payload). If all the gates and runways are occupied, planes cannot land on schedule (packet loss) and air traffic backs up, with aircraft circling overhead or idling on the tarmac waiting for an opening.
Solving this paradox demands four critical architectural innovations:
- Credit-based flow control (CBFC) fundamentally reimagines how data moves through the network stack. By ensuring that data is transmitted only when the receiver has buffer space to accept it, this approach eliminates the pause-frame delays and buffer overflow issues that plague Ethernet networks. The result is smooth, predictable data flow that matches sender and receiver capabilities (a simplified sketch of the mechanism follows this list).
- Enhanced congestion avoidance prevents network overload through proactive sender pacing that dynamically adjusts to both network capacity and GPU acceptance rates. Rather than reacting to congestion after it occurs, this approach anticipates potential bottlenecks, prioritizes critical traffic, optimizes queue management, and prevents the cascading failures that can bring training runs to a halt.
- Fine-grained adaptive routing (FGAR) enables intelligent, real-time path selection based on current network conditions. Data packets dynamically adjust their routes to circumvent congested links or failed nodes, maintaining optimal performance even as conditions change. This agility is essential in large-scale clusters where link failures and traffic patterns shift constantly.
- Dynamic lane scaling addresses one of AI training’s most disruptive challenges: link failures in large clusters. Traditional network links bundle multiple lanes together without architecting for high availability. When failures occur, they can disrupt lengthy training runs that represent days or weeks of compute investment. Networks with independently scalable lanes deliver faster, more resilient training outcomes by maintaining connectivity even when individual lanes fail.
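To make the credit-based flow control concept concrete, here is a minimal sketch in Python. It is illustrative only, not any vendor’s implementation: the Sender and Receiver classes, credit counts, and buffer sizes are assumptions chosen for the example. The key property is that the sender transmits only while it holds credits advertised by the receiver, so receive buffers can never overflow and no packet is dropped.

```python
from collections import deque

class Receiver:
    """Advertises one credit per free receive-buffer slot."""
    def __init__(self, buffer_slots: int):
        self.buffer = deque()
        self.initial_credits = buffer_slots  # initial credit grant to the sender

    def accept(self, packet) -> None:
        self.buffer.append(packet)           # guaranteed to fit: a credit was spent to send it

    def drain(self) -> int:
        """GPU consumes buffered data; each freed slot returns a credit to the sender."""
        freed = len(self.buffer)
        self.buffer.clear()
        return freed

class Sender:
    """Transmits only while it holds credits, so the receiver can never overflow."""
    def __init__(self, receiver: Receiver):
        self.receiver = receiver
        self.credits = receiver.initial_credits
        self.pending = deque()

    def enqueue(self, packet) -> None:
        self.pending.append(packet)

    def transmit(self) -> None:
        # Send at most one packet per held credit; the rest wait at the source
        # instead of being dropped or paused mid-flight.
        while self.pending and self.credits > 0:
            self.receiver.accept(self.pending.popleft())
            self.credits -= 1

    def return_credits(self, count: int) -> None:
        self.credits += count

# Tiny demo: 8 packets, a 4-slot receive buffer, zero packets lost.
rx = Receiver(buffer_slots=4)
tx = Sender(rx)
for i in range(8):
    tx.enqueue(f"pkt-{i}")

tx.transmit()                    # only 4 packets move; 4 wait at the sender
tx.return_credits(rx.drain())    # receiver drains, credits flow back
tx.transmit()                    # remaining 4 packets move
print("packets still pending:", len(tx.pending))   # -> 0
```

Real fabrics apply the same accounting per virtual lane and in hardware, but the principle is the one shown here: backpressure is exerted at the source before congestion forms, rather than after buffers overflow.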
Together, these techniques create an intelligent network that is decisive, active, and aware – one that maximizes GPU compute efficiency rather than constraining it.
Quantifying the return on investment
When networks are properly optimized for AI workloads, the financial impact is substantial. Consider a network that increases GPU utilization from 40-50% to 70% while simultaneously reducing error recovery time from 30% of each epoch to 10%. Because effective throughput scales roughly with utilization multiplied by the share of time spent on useful work rather than recovery, such improvements can roughly double effective training throughput, yielding approximately 1.8-2.25x acceleration and translating to 45-56% shorter training runs.
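A back-of-the-envelope calculation makes the range explicit. The short Python sketch below uses a deliberately simplified model (effective throughput equals GPU utilization times the fraction of time not lost to error recovery); real clusters involve many more variables, but this assumption reproduces the figures quoted above.

```python
def effective_throughput(gpu_utilization: float, error_recovery_fraction: float) -> float:
    """Simplified model: useful work scales with utilization times the share
    of wall-clock time not lost to error recovery."""
    return gpu_utilization * (1.0 - error_recovery_fraction)

# Baseline: 40-50% utilization, 30% of each epoch lost to error recovery.
baseline_low  = effective_throughput(0.40, 0.30)   # 0.28
baseline_high = effective_throughput(0.50, 0.30)   # 0.35

# Optimized network: 70% utilization, 10% lost to error recovery.
optimized = effective_throughput(0.70, 0.10)       # 0.63

speedup_low  = optimized / baseline_high            # ~1.8x
speedup_high = optimized / baseline_low             # ~2.25x

print(f"speedup: {speedup_low:.2f}x - {speedup_high:.2f}x")
print(f"training-time reduction: {1 - 1/speedup_low:.0%} - {1 - 1/speedup_high:.0%}")
# -> speedup: 1.80x - 2.25x
# -> training-time reduction: 44% - 56%
```

The low end of the time reduction works out to roughly 44-45% depending on rounding, consistent with the 45-56% range cited above.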
These gains directly impact the bottom line. The network processes significantly more tokens or gradient updates per second, improving total cluster performance and accelerating time to results. For organizations running continuous training workloads, this translates to faster model iterations, quicker deployment cycles, and improved cost-effectiveness measured in tokens per dollar per watt.
Creating a true business accelerator
Underutilized AI infrastructure represents a massive hidden cost, largely because initial investments prioritized rapid deployment over long-term efficiency. However, optimizing the network layer can transform this liability into a competitive advantage – if organizations adopt a holistic view of total cost of ownership and continuously pursue efficiency gains.
In AI infrastructure, the network is not merely a connector; it’s a significant source of value creation and cost reduction. Organizations can’t afford to waste their AI investments on network inefficiencies. They must introduce an AI-optimized architecture that can comprehensively solve workload demands. This will position them to extract maximum value from their investments and transform their infrastructure from a cost center into a true business accelerator.
About the author
Nishant Lodha is senior director of AI networking at Cornelis. Prior to joining Cornelis, Nishant held director-level roles at Intel Corporation and Marvell. He has more than 25 years of experience in datacenter networking, storage, and compute technologies, in roles spanning product marketing, solutions and technical marketing, and network engineering. He is based in Silicon Valley.



