The Engineer Who Built AI Infrastructure That Fixes Itself

Inside ByteDance’s San Jose office, the agent-based automation framework now responsible for keeping the company’s largest Kubernetes clusters healthy went into general production this month. The system is called OpenSkill, and its sole author and maintainer is a software engineer named Shashidhar Bhat. 

OpenSkill watches the GPU fleet that powers ByteDance’s big-data pipelines, diagnoses failures, and applies remediation without escalating to a human operator. Bhat built it over a six-month design and engineering cycle that began shortly after he joined the company in June 2024. It was deployed into production this December. In the weeks since, manual operational work on the affected clusters has fallen by forty percent. Idle time on the GPU pool has fallen by thirty-five percent. 

The scale of the underlying environment is uncommon. ByteDance, the parent of TikTok, operates one of the largest Kubernetes deployments in the world. Hundreds of GPU nodes process roughly one petabyte of data each month in the big-data organization where Bhat works. At that volume, the operational overhead of cluster maintenance has, in industry practice, been absorbed by site reliability headcount. ByteDance has chosen a different answer: it builds the cluster as a system that makes the routine decisions itself.

Bhat joined the company at L5, the engineering grade ByteDance reserves for senior contributors with deep specialization. He spent his first quarter writing design notes, then most of the next two stabilizing the framework in production. The architecture is layered. A set of cooperating agents shares state on cluster health, each scoped to a class of operational decision: GPU node degradation, pod placement, accelerator fault triage, scheduling pressure. The decisions are auditable. The actions are reversible. A higher-order policy layer arbitrates contention when agents conflict. 
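OpenSkill’s code is internal, so the layering described above can only be illustrated, not reproduced. The following is a minimal sketch, assuming each agent emits a proposal that names the resource it touches, the action, and a way to undo it, and assuming the policy layer settles contention with a simple numeric priority. The names `Proposal`, `arbitrate`, and `priority` are hypothetical and are not taken from OpenSkill.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Proposal:
    agent: str     # which agent proposed the action (hypothetical field)
    target: str    # resource the action touches, e.g. a node name
    action: str    # what the agent wants to do
    undo: str      # how to reverse it, so every action stays reversible
    priority: int  # assumed ranking used by the policy layer


def arbitrate(proposals: List[Proposal], audit_log: List[dict]) -> Dict[str, Proposal]:
    """Pick one proposal per target and record every outcome in the audit log.

    The highest priority wins; on a tie, the earlier proposal is kept.
    """
    winners: Dict[str, Proposal] = {}
    for p in proposals:
        current = winners.get(p.target)
        if current is None or p.priority > current.priority:
            winners[p.target] = p

    # Log accepted and rejected proposals alike, so decisions are auditable.
    for p in proposals:
        audit_log.append({
            "agent": p.agent,
            "target": p.target,
            "action": p.action,
            "undo": p.undo,
            "accepted": winners[p.target] is p,
        })
    return winners
```

In this sketch, two agents contending for the same node produce one accepted action and one logged rejection. A production arbiter would presumably weigh far more than a single integer.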

What OpenSkill removes from the team’s workload is the long tail of routine decisions that historically consumed senior infrastructure engineers’ time. Hardware degradation events that once triggered an SRE page now resolve before the page fires. Resource imbalances that previously required manual workload migration are corrected by the agents. The on-call rotation, the team has said internally, has shrunk in volume even as the cluster has grown. 

A representative case is GPU node degradation. When a card’s error rate begins to climb, OpenSkill’s hardware-triage agent compares the signal against historical baselines, evaluates whether the workload running on the node can be migrated without violating its SLA, schedules the migration if it can, marks the node for repair, and pages a human only when the decision boundary cannot be resolved automatically. Ten years ago, the same chain of reasoning would have run through three separate engineers and a runbook over the course of an hour. OpenSkill resolves it in seconds. It writes a structured trace of the decision into the cluster’s audit log. 
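The chain of decisions in that case can be sketched in a few lines. This is an illustration under stated assumptions, not OpenSkill’s implementation: it assumes the baseline comparison is a mean-plus-three-standard-deviations test, that the SLA check arrives as a precomputed flag, and that the trace is a plain dictionary. The names `NodeSignal`, `triage`, and `z_threshold` are hypothetical.

```python
from dataclasses import dataclass
from statistics import mean, stdev
from typing import List, Optional


@dataclass
class NodeSignal:
    node: str
    error_rate: float                       # current error rate (unit is an assumption)
    baseline: List[float]                   # historical error rates, at least two samples
    migration_is_sla_safe: Optional[bool]   # None means the check could not decide


def triage(signal: NodeSignal, z_threshold: float = 3.0) -> dict:
    """Return a structured trace of the triage decision for one node."""
    mu = mean(signal.baseline)
    sigma = stdev(signal.baseline)
    # Assumed rule: the node is degraded if the error rate exceeds the baseline band.
    degraded = signal.error_rate > mu + z_threshold * sigma

    if not degraded:
        action, steps = "no_action", []
    elif signal.migration_is_sla_safe is True:
        action, steps = "migrate_and_repair", ["schedule_migration", "mark_for_repair"]
    else:
        # SLA would be violated, or the check was inconclusive: escalate.
        action, steps = "page_human", ["page_on_call"]

    return {
        "node": signal.node,
        "error_rate": signal.error_rate,
        "baseline_mean": mu,
        "degraded": degraded,
        "action": action,
        "steps": steps,
    }
```

The returned dictionary stands in for the structured trace written to the audit log. A human is paged only on the branch where the migration cannot be cleared automatically.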

Bhat spent twelve years before ByteDance at Cornerstone OnDemand, the talent-management software firm headquartered in Santa Monica. His tenure overlapped the platform’s full migration from a monolith to a Kubernetes-native infrastructure. The discipline that produced that migration is the same discipline that produced OpenSkill three years later inside a different company’s data centers. Decisions captured in design documents. Operational processes written into runbooks before they were written into code. Each replacement system run alongside its predecessor long enough to settle the reliability case before the cutover. 

Outside ByteDance, Bhat contributes to Kubewharf Katalyst, an open-source resource management framework jointly maintained by ByteDance and the broader Kubernetes community. Katalyst is one of the few projects in the cloud-native ecosystem to address joint CPU and GPU scheduling under load. Bhat’s design proposals to Katalyst track closely with the operational thesis behind OpenSkill, an unusual pattern in proprietary infrastructure work that has earned his name a degree of recognition inside the Kubernetes operator community. 

The argument on which Bhat has built OpenSkill, articulated in his internal design notes and talks over several years, is stark. Infrastructure operators do not scale linearly with the systems they manage. Cognitive load on each engineer rises with cluster complexity. Error rate rises with cognitive load. The operator economics turn unforgiving at hyperscaler volumes. The way out is to take routine decisions off the human and reserve attention for the small set of problems that genuinely require it.

“The work an SRE does at three in the morning is mostly the work no one should be doing at three in the morning,” Bhat told a recent internal forum. “If a decision is routine, it is a decision a system should be making.” 

Hyperscaler-class operators have built bespoke automation for years, and the major cloud providers have internal answers of their own. The public market for production-grade autonomous infrastructure operators remains thin. OpenSkill’s deployment at ByteDance scale, written and maintained by a single engineer rather than a team, closes a gap inside an existing category rather than opening a new one. The absence of comparable open prior art is a fact ByteDance’s competitors have begun to notice. 

The framework’s documentation is internal. The deployment has not been announced through ByteDance’s external communications channels. What the production rollout will demonstrate over the coming year is whether the operational improvements hold across the load patterns of a full annual cycle. The early signal favors them. The forty percent reduction in manual work was measured against a baseline that included the team’s most experienced operators. The thirty-five percent reduction in GPU idle time has held through the recommender-training spikes that historically erode efficiency on hyperscaler clusters.

For now, the framework runs quietly inside the company. The team that built it is smaller, by one engineer, than the team that runs it. 

Author

  • Tom Allen

    Founder and Director at The AI Journal. Created this platform with the vision to lead conversations about AI. I am an AI enthusiast.
