AgenticAI & Technology

Scaling Beyond the Model with Matvii Horskyi’s Autonomous Infrastructure Blueprint

By Dan Agbo, Professional Technology Journalist, Writer and Editor

11.02.2026

NextGen Hackathon judge and Senior Software Engineer explains why the future of AI lies in self-healing control planes, and how Kubernetes-native automation can replace brittle, script-heavy operations with resilient control-plane logic

Today, scaling AI is less about another model tweak and more about operations: higher infra bills, tougher SLAs, and multi-cloud sprawl punish brittle, script-heavy runbooks. While the industry races to refine neural architectures and scale LLMs, the real scaling constraint often sits elsewhere: data management and infrastructure resilience, not the model itself. When demand spikes, a “fragile cloud” held together by disconnected scripts starts to fail. Flexential’s 2025 State of AI Infrastructure report underscores the gap: 44% of technology leaders cite infrastructure constraints as the main barrier to expanding AI initiatives, and 59% report critical bandwidth and latency issues (up sharply from the prior year). That’s why the Control Plane matters: the system layer that orchestrates workloads, maintains availability, and enables self-healing. Without it, even the best AI will spend too much time offline.

Navigating deep infrastructure requires hands-on experience. Our guide into the world of high-load systems is Matvii Horskyi, a Senior Software Engineer with over 7 years of experience in architecting distributed services. Matvii specializes in building solutions that remain resilient under extreme operational pressure. His expertise is backed by a track record at industry giants like NetApp, as well as his work on the multi-cloud platforms Qbox and Instaclustr. His standing in the tech community was further solidified by his role as a member of the Judging Committee at the NextGen Hackathon 2025. There, he evaluated international projects in Cloud Technologies and Automation, distinguishing between flashy prototypes and technically viable, production-ready systems.

The Kubernetes-Native Era

A major trend in recent years is the shift from imperative infrastructure management to a declarative approach, which is codified in Kubernetes’ own architecture and increasingly operationalized via GitOps and IaC practices.

During his tenure at NetApp, a Fortune 500 data-infrastructure company focused on hybrid-cloud data management, Matvii tackled a familiar bottleneck: Terraform-based legacy setups had become too cumbersome for dynamic, high-scale environments. He architected and implemented fault-tolerant Kubernetes Operators for major enterprise clients, including other Fortune 500 companies, where reliability requirements are closer to those of managed services than to one-off internal platforms. The core focus was on reconciliation logic: a mechanism that continuously compares the system’s current state with the desired state and automatically corrects any deviations. That operator pattern was built for managed-services realities: standardized workflows, repeatability across many customer deployments, and failure modes you can’t fix by hand at scale. In that environment, reconciliation logic becomes the product, because it is what keeps fleets consistent under constant change.
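The reconciliation pattern described above can be sketched in a few lines of Go. This is an illustrative stand-in, not NetApp’s actual operator code: the `DesiredState`, `ObservedState`, and `Reconcile` names are hypothetical, and a real operator built on controller-runtime would read these states from the Kubernetes API and re-queue itself on every change.

```go
package main

import "fmt"

// DesiredState and ObservedState are simplified stand-ins for the spec and
// status of a Kubernetes custom resource (names here are illustrative).
type DesiredState struct {
	Replicas int
}

type ObservedState struct {
	Replicas int
}

// Action describes a corrective step the controller would take.
type Action string

// Reconcile compares observed state against desired state and returns the
// corrective actions needed to converge them. A real operator runs this
// continuously, so any drift is detected and repaired automatically.
func Reconcile(desired DesiredState, observed ObservedState) []Action {
	var actions []Action
	switch {
	case observed.Replicas < desired.Replicas:
		actions = append(actions,
			Action(fmt.Sprintf("scale up by %d", desired.Replicas-observed.Replicas)))
	case observed.Replicas > desired.Replicas:
		actions = append(actions,
			Action(fmt.Sprintf("scale down by %d", observed.Replicas-desired.Replicas)))
	}
	return actions // an empty slice means the system has converged
}

func main() {
	// Someone manually deleted two replicas: the controller sees the drift.
	actions := Reconcile(DesiredState{Replicas: 3}, ObservedState{Replicas: 1})
	fmt.Println(actions)
}
```

The key property is that the loop is level-triggered: it looks only at current and desired state, never at a history of events, so it can run any number of times and still converge on the same result.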

As Matvii explains, “In a massive environment, even a minor manual change outside of Terraform can trigger a butterfly effect, leading to inconsistent environments that are a nightmare to debug during an outage.”

By shifting the infrastructure paradigm, Matvii drove the migration of a legacy Terraform-based setup toward a Kubernetes-native ecosystem built around Helm, Custom Resource Definitions (CRDs), and controller-runtime. The result was a more predictable operational model, lower day-to-day toil, and a control plane that could continuously reconcile drift instead of relying on manual interventions during incidents.

This work went far beyond simple automation. By spearheading deep architectural optimizations that refined resource allocation, Matvii drove meaningful annual cost savings. These results offer a vital lesson for the industry: the stability required by high-stakes enterprise environments is not found in reactive manual troubleshooting. Instead, he proved that true resilience is achieved by codifying high-level engineering intuition directly into the system’s controller, effectively building “intelligence” into the infrastructure itself.

The Challenge of Distributed Systems

Running one cluster is relatively simple. Orchestrating the full lifecycle of hundreds of Elasticsearch, Kafka, or ClickHouse deployments across multiple clouds (AWS, Azure, and GCP) is an entirely different scale of problem. At managed data-infrastructure providers such as Qbox and Instaclustr, Matvii automated CRUD operations, resizing, and backups so that they worked flawlessly across all three providers.

“In the world of high-load systems, we often overlook the complexity of ‘Day 2 operations’ like cross-regional data migration,” Matvii explains. “When dealing with multi-cloud environments, a common challenge is coordinating state between providers without interruption. I’ve found that implementing a custom NATS-based signaling layer is a highly effective way to manage this. This pattern scales well to multi-terabyte workloads in production, where long-running operations must be coordinated across services without manual intervention, ensuring the infrastructure remains as fluid as the data it carries.” 
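The signaling pattern described above can be illustrated with an in-process sketch. To keep the example self-contained it substitutes a toy `Bus` type for a real NATS connection; the subject name, migration phases, and all type names are hypothetical, and a production version would publish and subscribe through the nats.go client instead.

```go
package main

import (
	"fmt"
	"sync"
)

// Signal mimics a message on a subject such as "migration.cluster-42.status".
type Signal struct {
	Subject string
	Phase   string // e.g. "snapshot", "copy", "verify", "done"
}

// Bus is an in-process stand-in for a NATS connection: publishers and
// subscribers are decoupled by subject, so long-running operations can
// report progress without any service polling another.
type Bus struct {
	mu   sync.Mutex
	subs map[string][]chan Signal
}

func NewBus() *Bus { return &Bus{subs: make(map[string][]chan Signal)} }

func (b *Bus) Subscribe(subject string) <-chan Signal {
	b.mu.Lock()
	defer b.mu.Unlock()
	ch := make(chan Signal, 16)
	b.subs[subject] = append(b.subs[subject], ch)
	return ch
}

func (b *Bus) Publish(sig Signal) {
	b.mu.Lock()
	defer b.mu.Unlock()
	for _, ch := range b.subs[sig.Subject] {
		ch <- sig
	}
}

// RunMigration emits phase signals as a long-running cross-region copy
// progresses, so downstream services can react to each phase as it lands.
func RunMigration(b *Bus, subject string) {
	for _, phase := range []string{"snapshot", "copy", "verify", "done"} {
		b.Publish(Signal{Subject: subject, Phase: phase})
	}
}

func main() {
	bus := NewBus()
	status := bus.Subscribe("migration.cluster-42.status")
	RunMigration(bus, "migration.cluster-42.status")
	for sig := range status {
		fmt.Println("phase:", sig.Phase)
		if sig.Phase == "done" {
			break
		}
	}
}
```

The design choice that matters here is the decoupling: the migration never knows who is listening, so new consumers (billing, monitoring, a customer-facing status page) can be added without touching the migration logic.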

He developed Go-based APIs and modules that improved cluster management efficiency by 30%. By implementing strict contracts (OpenAPI) and idempotent workflows, he helped build infrastructure designed to tolerate regional failures with predictable recovery and strong data-safety guarantees. This kind of operational maturity is also what makes managed data platforms attractive to investors and acquirers: Instaclustr, for example, was acquired by NetApp (a publicly traded company with multi-billion-dollar annual revenue and significant profitability) in 2022.

AI in Infrastructure and Infrastructure for AI

Serving as a judge at the NextGen Hackathon 2025 allowed Matvii to glimpse the next evolutionary step. Modern developers are increasingly demanding autonomous systems.

“Today, we see AI not just as a product feature, but as a tool to manage the infrastructure itself,” Matvii observes. “Future engineers must focus on self-healing systems where the Control Plane doesn’t just follow rigid algorithms but predicts anomalies in data distribution.” For AI projects, this is vital: model training requires massive throughput, and any misconfiguration in the network or storage can cost thousands of dollars per hour.

“If you want to build a truly resilient system today, my advice is to stop treating infrastructure as a static target,” Matvii suggests. “You should design your controllers to be ‘pessimistic’. Always assume the network will fail and the API will time out. Build your logic around idempotency from day one; if an operation can’t be safely repeated a thousand times, it shouldn’t be in your production code.”
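That advice about idempotency can be made concrete with a small sketch, assuming a key-based deduplication store (here in-memory and entirely illustrative): the same create request, retried any number of times after a timeout, must execute its side effect exactly once.

```go
package main

import (
	"fmt"
	"sync"
)

// IdempotentCreator guards a side-effecting operation (e.g. "create volume")
// with a request key, so retries after timeouts never duplicate work.
// The dedup store here is in-memory; a production system would persist it.
type IdempotentCreator struct {
	mu   sync.Mutex
	done map[string]string // request key -> recorded result
	runs int               // how many times the real side effect executed
}

func NewIdempotentCreator() *IdempotentCreator {
	return &IdempotentCreator{done: make(map[string]string)}
}

// Create performs the operation at most once per key; repeated calls
// return the recorded result instead of re-executing the side effect.
func (c *IdempotentCreator) Create(key, name string) string {
	c.mu.Lock()
	defer c.mu.Unlock()
	if result, ok := c.done[key]; ok {
		return result // retry path: safe to repeat, nothing re-executes
	}
	c.runs++ // the real (expensive, non-repeatable) work happens here
	result := "created:" + name
	c.done[key] = result
	return result
}

func main() {
	c := NewIdempotentCreator()
	// A client times out and retries the same request three times.
	for i := 0; i < 3; i++ {
		fmt.Println(c.Create("req-123", "volume-a"), "runs:", c.runs)
	}
}
```

With this shape, a pessimistic controller can retry freely on every network failure or timeout, because the worst case of a repeated call is a cheap map lookup rather than a duplicated cluster.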

Scalability is not just about buying more servers; it is a matter of architectural discipline. Matvii Horskyi’s track record demonstrates that the path to efficient AI and complex systems lies in moving away from manual management toward intelligent, Kubernetes-native automation.

Today, the role of a Senior Engineer is that of an architect building the bridge between cloud hardware and the high-level demands of the AI era. As the experience of global leaders shows, that bridge must be declarative, fault-tolerant, and deeply automated. Scale increasingly fails on operations, not compute, and the teams that hold it together treat reliability, idempotency, and automation as product features of the control plane.
