DataAI & Technology

The Query You Did Not Optimize Is Costing You More Than You Think

By Hitarth Trivedi

Every query your data platform runs has a cost attached to it. Not just the compute cost, though that is real. The actual cost includes the engineering time spent debugging slow jobs, the analyst who waits for three hours for a report that should take three minutes, the business decision that gets delayed because the data was not ready. Most organizations do not see this full cost because they are not looking at it.

I have spent seven years building query infrastructure at Uber’s scale. At that level, the performance of the query engine is not an infrastructure concern. It is a business concern. When your interactive analytics platform handles millions of queries per day across thousands of nodes, small inefficiencies compound into significant costs and, more importantly, into decisions that get made slowly or not at all.

The work I am most proud of is the migration of Uber’s Presto workers from Java to C++ using Velox. That migration produced measurable performance gains and cost reductions. But the more interesting story is what it taught me about how organizations think about query infrastructure, and why most of them are leaving money and capability on the table.

Why Query Performance Is a Business Problem

Data teams tend to frame performance as an engineering metric. Query latency, throughput, resource utilization. These are useful measurements. They are not the thing that matters.

The thing that matters is whether your organization can make decisions at the speed the business requires. When a data analyst submits a query and waits two hours for results, the business does not wait. Decisions get made on instinct, on outdated reports, on the data that was already available rather than the data that would have been most useful. This is not a technology problem. It is a decision quality problem.

At Uber’s scale, this dynamic is amplified. An analyst running an experiment analysis should get results in minutes, not hours. A fraud detection query needs to complete before the transaction times out. A pricing algorithm needs current market data to produce useful outputs. In each case, the business outcome depends on the query completing fast enough to be actionable.

When I talk about query infrastructure performance, I am talking about decision velocity. That is the number that matters.

What the Java to C++ Migration Actually Taught Us

The decision to migrate Uber’s Presto workers from Java-based execution to Velox, a C++ execution engine, was not primarily a technology decision. It was a consequence of understanding where the actual bottlenecks were.

Java has served data infrastructure well for decades. The JVM ecosystem is mature, the tooling is excellent, and the talent pool is deep. But at the level of per-query execution overhead, Java introduces costs that become significant at Uber’s query volume. Memory management overhead, garbage collection pauses, and the interpretative nature of JVM execution all add latency that cannot be optimized away without changing the execution model.

Velox addresses these bottlenecks by executing queries in native C++ with memory management that is explicit and predictable. There is no garbage collector. There is no JVM overhead. There is also no safety net, which is a trade-off that requires careful engineering but produces meaningful gains.

The migration was not simple. We had to maintain compatibility with existing Presto connectors, ensure that the new engine handled edge cases the same way as the old one, and build the monitoring infrastructure to validate that performance was actually improving. None of this is glamorous work. It is the kind of work that does not show up in conference talks but that determines whether a migration actually delivers its promised benefits.

What we found was that the gains were real and measurable. Query latency improved significantly for compute-heavy workloads. Memory usage became more predictable, which made capacity planning more reliable. And the engineering team spent less time debugging performance anomalies that traced back to JVM behavior rather than query logic.

The Cloud Migration Nobody Talks About

Moving Presto workloads from on-premise clusters to Google Cloud Platform is the kind of project that sounds straightforward in a project plan and is messy in execution. The narrative is usually about infrastructure cost, scalability, and operational overhead. Those things are real. But the part that deserves more attention is what cloud migration reveals about how workload routing and resource isolation were being managed on-premise.

On-premise clusters develop quirks. Query routing becomes optimized for the specific characteristics of the hardware you have, the specific load patterns you have experienced, and the specific incidents you have survived. These optimizations are often undocumented, implicit in the configuration of senior engineers rather than in any formal system. When you migrate to the cloud, you have to make all of that explicit, usually under time pressure and with limited ability to test against production load patterns.

We had to build scalable frameworks for query routing, workload isolation, and utilization optimization across the new cloud infrastructure. Query routing meant understanding which queries should go to which resource pools based on their characteristics, their priority, and their expected runtime. Workload isolation meant ensuring that a runaway query from one team could not degrade the experience for other teams running queries at the same time. Utilization optimization meant sizing compute resources dynamically based on actual demand rather than provisioned capacity.

These are not unique challenges. Every organization that migrates query infrastructure to the cloud encounters them. But the approaches to solving them vary widely in sophistication, and the difference between a naive migration and a well-designed one shows up in operational costs within months.

What Scale Teaches You That Smaller Environments Do Not

Working at Uber’s scale changes how you think about performance problems.

In a smaller environment, you can get away with reactive performance engineering. A query is slow, you look at the execution plan, you add an index or adjust a configuration, you move on. This works until the query volume grows, the data volume grows, or the diversity of query patterns grows. Then the reactive approach breaks down because there are too many queries to investigate individually and too many interactions between them to predict.

At scale, you need proactive performance engineering. You need capacity planning that anticipates load rather than reacting to it. You need observability that surfaces degradation before it cascades. You need query routing that directs workloads to the right resources based on characteristics you can predict rather than only observe.

The shift from reactive to proactive is not just a tooling shift. It is a mindset shift. It requires engineers who think in terms of systems rather than incidents, who design for failure modes they have not yet seen rather than only the ones they have experienced, and who can operate with incomplete information in an environment where the cost of a mistake is high.

This is why the talent question matters so much in infrastructure engineering. The difference between an engineer who can optimize a slow query and an engineer who can design a system that prevents a class of slow queries is significant. The second type is rarer, and finding and developing that capability is one of the most valuable investments an organization can make in its data infrastructure.

What Organizations Are Getting Wrong About Query Optimization

Most organizations approach query optimization as a tool selection problem. They evaluate different query engines, different data warehouses, different optimization frameworks. They run benchmarks and compare results. Then they implement the winner and declare the problem solved.

The tool is not the problem. The problem is almost always in one of three places.

The first is observability. If you cannot measure query performance with granularity, you cannot optimize it. Most organizations have high-level metrics like average query duration but lack the visibility into per-query execution characteristics, resource consumption patterns, and failure modes that would allow targeted improvements. Without this visibility, optimization is guesswork.

The second is workload classification. Not all queries have the same performance requirements, but most organizations treat them as if they do. An ad-hoc exploration query from a data scientist has different latency tolerance than a production reporting query that feeds a dashboard. A fraud detection query has different requirements than a billing reconciliation query. Routing queries based on their characteristics rather than their submitter is one of the highest-leverage improvements available.

The third is capacity planning. Most organizations size their query infrastructure for peak load, which means they are paying for capacity they use a fraction of the time. Dynamic resource allocation based on actual demand patterns is well understood in theory but rarely implemented well in practice. The engineering investment required to build this capability is real, but the operational cost savings compound over time.

The Direction the Field Is Moving

Query infrastructure is converging on a few consistent themes.

Execution engines are moving toward native, hardware-proximate execution. The Java virtual machine served well, but the performance ceiling is real, and organizations with the most demanding workloads are investing in native execution alternatives. Velox is one example of this trend. There are others.

Cloud-native data infrastructure is becoming the default rather than the exception. The operational advantages of managed infrastructure are compelling, and the remaining arguments for on-premise deployment are shrinking. The organizations that have not yet migrated are either dealing with constraints that are genuinely difficult or delaying an inevitable decision.

Observability and performance intelligence are becoming first-class requirements rather than afterthoughts. The days of tuning a query engine with basic metrics are ending. The teams building the next generation of query infrastructure are designing observability as a core capability, not a monitoring layer bolted on after the fact.

The query engine is becoming more aware of the workloads it serves. This means routing intelligence that improves over time, resource allocation that adapts to demand patterns, and cost optimization that happens automatically rather than through manual intervention.

What This Means for Data Teams

For engineering leaders and data platform teams, the implications are practical.

If you are running Java-based query infrastructure at scale, the performance ceiling is real. Understanding whether a native execution engine makes sense for your workload is worth the investigation. The migration cost is significant, but so is the opportunity cost of operating below your performance potential.

If you are planning a cloud migration, invest as much in workload routing and isolation as in the migration itself. The infrastructure move is temporary. The routing and isolation architecture you build during the migration will shape your operational costs for years.

If you are building a data platform from scratch or modernizing an existing one, make observability a first-class concern. The ability to understand how your query infrastructure is performing, where bottlenecks are developing, and which workloads are consuming resources disproportionately is the foundation for every other optimization.

The query is never free. The question is whether you are paying the price knowingly or unknowingly.

Author

Related Articles

Back to top button