Scaling ML-Based Recommendation Systems for High-Traffic Platforms

Interview with Sergey Polyashov, Principal Software Engineering Manager at Microsoft

Recommendation engines sit at the core of nearly every high-traffic digital platform. From e-commerce to streaming, they determine what people see, click, and buy. But scaling them to billions of users is one of the hardest challenges in AI today.

Few know this better than Sergey Polyashov, Principal Software Engineering Manager at Microsoft. He has led engineering efforts behind Bing, Microsoft Shopping, Ads, and Copilot, creating AI-driven systems used by hundreds of millions of people daily. At Microsoft, he spearheaded the transformation of Bing Shopping Search into an AI-powered ecosystem, unifying organic and Ads pipelines, building real-time ingestion at internet scale, and integrating product search into Microsoft’s AI agents (Copilot) — a milestone that turned search into conversational commerce.

We spoke with Sergey about the realities of building recommendation systems at scale, the misconceptions that persist, and the future of AI-powered personalization.

What makes scaling recommendation systems for high-traffic platforms uniquely challenging compared to other ML projects?

Unlike many ML projects that run offline, recommendation systems must operate in real time under massive load. When we scaled Bing Shopping globally, one of the hardest problems was freshness: ensuring that billions of price and availability updates flowed into the system within seconds. Otherwise, users would see products that were already out of stock.

To solve this, I led the design of real-time ingestion pipelines capable of processing merchant feeds at web scale while keeping latency under strict millisecond budgets. At the same time, we couldn’t afford inefficiency. At Microsoft’s traffic levels, even a tiny waste in compute translates into enormous costs. That’s why I unified organic and Ads pipelines — not just improving engagement, but also eliminating duplicated infrastructure.
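To make the freshness constraint concrete, here is a minimal sketch of the pattern described above: a consumer that applies a merchant update only when it is fresh and newer than what is already stored. Every name and threshold here (MerchantUpdate, MAX_STALENESS_S, the in-memory catalog) is invented for illustration and is not Microsoft's pipeline:

```python
import time
from dataclasses import dataclass

# Hypothetical record shape; real merchant feeds carry far more fields.
@dataclass
class MerchantUpdate:
    offer_id: str
    price: float
    in_stock: bool
    emitted_at: float  # epoch seconds when the merchant produced the update

MAX_STALENESS_S = 5.0  # invented freshness budget, not a real SLA
catalog: dict[str, MerchantUpdate] = {}  # stands in for the real serving store

def apply_update(update: MerchantUpdate) -> bool:
    """Apply a feed update only if it is fresh and newer than what we hold."""
    if time.time() - update.emitted_at > MAX_STALENESS_S:
        return False  # too old: better to skip than to serve a stale price
    current = catalog.get(update.offer_id)
    if current is not None and current.emitted_at >= update.emitted_at:
        return False  # out-of-order delivery: keep the newer record
    catalog[update.offer_id] = update
    return True
```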

Compared to NLP or vision, recommendation has its own constraints. Data is sparse and noisy — clicks, skips, partial sessions — and latency budgets are unforgiving. That’s why we built multi-stage architectures: fast retrieval to narrow candidates, deeper ranking for precision, and selective re-ranking to optimize fairness and novelty.
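A toy version of such a multi-stage cascade, with placeholder scoring functions standing in for the real retrieval and ranking models, might look like this:

```python
from typing import Callable

def recommend(
    user: dict,
    candidates: list[str],
    cheap_score: Callable[[dict, str], float],  # stage 1: fast, approximate
    deep_score: Callable[[dict, str], float],   # stage 2: accurate, expensive
    novelty: Callable[[str], float],            # stage 3: re-ranking signal
    k_retrieve: int = 500,
    k_final: int = 10,
    novelty_weight: float = 0.1,
) -> list[str]:
    # Stage 1: narrow the full candidate set with a cheap score
    # (in production this is an ANN or inverted-index lookup, not a full sort).
    shortlist = sorted(candidates, key=lambda c: cheap_score(user, c), reverse=True)[:k_retrieve]
    # Stage 2: run the expensive model only on the shortlist to stay within latency budget.
    deep = {c: deep_score(user, c) for c in shortlist}
    ranked = sorted(shortlist, key=lambda c: deep[c], reverse=True)
    # Stage 3: selective re-ranking over the head of the list, blending in novelty.
    pool = ranked[: k_final * 3]
    pool.sort(key=lambda c: deep[c] + novelty_weight * novelty(c), reverse=True)
    return pool[:k_final]
```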

Another challenge is bias. Without intervention, large merchants dominate. I introduced re-ranking strategies in Bing Shopping to ensure smaller sellers and new products surfaced alongside giants. It was critical for ecosystem health, not just user experience.
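One simple member of that family of strategies is a bounded exposure boost that shrinks as a merchant's share of the marketplace grows. The function below is an illustrative sketch, not the strategy Bing Shopping actually ships:

```python
def fair_rerank(
    items: list[str],
    relevance: dict[str, float],       # model relevance score per item
    merchant_share: dict[str, float],  # seller's share of impressions, in [0, 1]
    boost: float = 0.05,               # invented cap on the fairness adjustment
) -> list[str]:
    """Give smaller merchants a modest, bounded exposure boost.
    The boost shrinks as a merchant's share grows, so large sellers
    are never penalized outright, only rebalanced at the margin."""
    return sorted(
        items,
        key=lambda i: relevance[i] + boost * (1.0 - merchant_share[i]),
        reverse=True,
    )
```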

Finally, scaling recommendation today means moving beyond lists. The most transformative step we took was integrating product search into Copilot. Suddenly, a user could ask: ā€œWhat’s the best laptop under my budget for photo editing?ā€ and receive structured, real-time recommendations — not static results. That’s when recommendations became interactive, conversational, and AI-native.
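In miniature, the grounding idea is that the system extracts structured constraints (such as a budget) from the question and answers only from live catalog data, so price and stock never come from the language model itself. The parsing and field names below are deliberately naive placeholders, not how Copilot works internally:

```python
import re

def parse_budget(query: str) -> float | None:
    """Crude budget extraction; a production system would use an LLM or NER."""
    match = re.search(r"under \$?(\d+)", query.lower())
    return float(match.group(1)) if match else None

def grounded_recommendations(query: str, catalog: list[dict]) -> list[dict]:
    """Answer from live catalog data only: the language model may phrase the
    reply, but price and stock always come from the feed, never the model."""
    budget = parse_budget(query)
    hits = [
        p for p in catalog
        if p["in_stock"] and (budget is None or p["price"] <= budget)
    ]
    return sorted(hits, key=lambda p: p["rating"], reverse=True)[:3]

# e.g. grounded_recommendations("best laptop under $900 for photo editing", catalog)
```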

You describe the Copilot integration as transformative for recommendations. What did it take behind the scenes to make natural-language product search work at scale?

Behind the scenes, the hardest part wasn’t the language model itself — it was everything around it. To make Copilot handle natural-language product search at scale, we had to unify multiple systems that historically lived apart: search, ads, and recommendations. That meant consolidating infrastructure so results could be generated within strict latency budgets, and ensuring that the data feeding into the models was fresh and reliable.

We also had to rethink evaluation. Traditional offline metrics weren’t enough, because the real test was whether users felt they were getting trustworthy, useful answers in real time. That pushed us to build new pipelines for monitoring quality continuously in production. So the transformation wasn’t just about building a smarter model — it was about creating an ecosystem where models, data, and infrastructure worked seamlessly together to support conversational commerce.

What are the most common misconceptions about scaling recommendation systems?

The most common misconception is that a better model will fix everything. In practice, the real bottlenecks often lie in data engineering, infrastructure, or monitoring. Another myth is that academic research translates directly to production. A model that looks great in a paper may never meet the 50-millisecond response budget demanded in practice. There is also a tendency to optimize only for short-term engagement, like clicks, while underestimating the long-term damage this can do to user trust.

Many people also assume that scaling laws apply in the same way as in NLP. Historically that has not been true, because ranking models are constrained by latency. Offline metrics are another trap: they often fail to predict how a model will perform in the real world. Relying on simple next-item prediction tasks also misses the complexity of long-term behavior. And importing NLP or CV architectures wholesale rarely works, since recommendation systems have their own constraints around data structure, sparsity, and serving cost.

Many assume ā€œa better model will fix everything.ā€ Why is this a misconception?

Because the biggest bottlenecks aren’t models — they’re infrastructure and data pipelines.

For example, when we unified Bing’s organic and Ads pipelines, the gains came not from inventing a new neural net, but from consolidating infrastructure so models could run consistently within latency constraints. Earlier, during my time at Yandex — one of the largest search and tech platforms in Europe — I saw the same pattern in video recommendations: most quality gains came from improving data ingestion and feature freshness, not exotic architectures.

Models matter, but in production, if features are stale or queries miss latency SLAs, even the most accurate model fails. Scaling recommendation is as much about systems engineering as it is about ML.
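That failure mode can be guarded against explicitly. The sketch below, which assumes a hypothetical feature_store interface and invented thresholds, falls back to a cheap ranker whenever features are stale or the latency budget is already spent:

```python
import time

FEATURE_TTL_S = 60.0      # invented freshness bound for user features
LATENCY_BUDGET_S = 0.050  # a 50 ms style budget, as mentioned earlier

def serve(user_id, feature_store, rank_full, rank_fallback):
    """Use the full model only when features are fresh and time remains;
    otherwise degrade gracefully to a cheap ranker (e.g. popularity)."""
    start = time.monotonic()
    features = feature_store.get(user_id)  # assumed interface: returns dict or None
    stale = features is None or time.time() - features["updated_at"] > FEATURE_TTL_S
    over_budget = time.monotonic() - start > LATENCY_BUDGET_S
    if stale or over_budget:
        return rank_fallback(user_id)
    return rank_full(features)
```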

How do business priorities shape the technical approach?

Business goals drive every architectural choice.

  • Growth: When the priority was user engagement, we tuned Bing Shopping to emphasize exploration, surfacing fresh and diverse products.
  • Monetization: When priorities shifted, I led the integration of Ads into the unified recommendation pipeline — balancing pure relevance with advertiser value (one simple form of this blend is sketched after this list), which lifted both CTR and revenue while lowering infra cost.
  • Fairness: In e-commerce, smaller merchants need visibility. I implemented re-ranking strategies to guarantee fair exposure, which helped sustain a diverse marketplace.
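For the monetization point above, one generic way to express the relevance/advertiser-value trade-off inside a unified stack is a weighted blend that keeps relevance dominant. This formulation is illustrative, not Bing's actual scoring function:

```python
def blended_score(relevance: float, advertiser_value: float, alpha: float = 0.9) -> float:
    """Blend organic relevance with advertiser value in one ranking stack.
    Keeping alpha near 1.0 means monetization nudges, but never swamps, quality."""
    return alpha * relevance + (1.0 - alpha) * advertiser_value
```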

The most strategic case was Copilot integration. Microsoft wanted to lead in conversational commerce. To achieve that, I designed systems that grounded LLM outputs in real-time product catalogs — ensuring recommendations were not only relevant but always up to date. That alignment of technical design with business vision turned Copilot into a true shopping assistant.

What role does automation play at this scale?

Automation is survival. At Microsoft, our ingestion pipelines automatically refreshed billions of offers, so users never saw outdated inventory. Models retrained on fresh behavior, adapting continuously to new shopping trends.

We also leaned on A/B testing frameworks and anomaly detection. Every new ranking change or Copilot integration was rolled out to a fraction of users first, with automated monitoring catching regressions before full release. Without automation, scaling safely would have been impossible.
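The mechanics of such a guarded rollout are easy to sketch: hash users into a stable canary slice, then gate the full release on an automated regression check. The slice size and threshold below are invented for illustration:

```python
import hashlib

def in_canary(user_id: str, fraction: float = 0.01) -> bool:
    """Deterministically assign a stable slice of users to the canary."""
    bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 10_000
    return bucket < int(fraction * 10_000)

def guardrail_ok(canary_ctr: float, control_ctr: float, max_rel_drop: float = 0.02) -> bool:
    """Automated gate: abort the rollout if canary CTR regresses beyond the
    threshold. Real systems check many metrics with proper significance tests."""
    return canary_ctr >= control_ctr * (1.0 - max_rel_drop)
```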

What tools and processes are critical for monitoring?

Monitoring had three layers:

  • System health — latency, throughput, error rates.
  • Business metrics — CTR, conversions, revenue contribution.
  • Fairness and ecosystem health — making sure small merchants and new products weren’t drowned out.

Real-time dashboards and canary releases were essential. But the final word always came from large-scale online experiments. Before we launched Copilot shopping, we validated not just relevance and latency, but also user trust — ensuring conversational results were grounded in real product data.

What bottlenecks appear when traffic grows quickly?

The first is candidate retrieval. Indexes that work at millions of items collapse when scaled to billions. At Microsoft, I led the redesign of our retrieval layer into multi-stage pipelines to handle global-scale catalogs.

The second is feature pipelines. As traffic grew, so did demand for streaming and joining features in real time. I implemented high-throughput ingestion systems that could process merchant updates continuously — making sure freshness and trust weren’t lost.

The third is serving cost. A GPU-heavy model that takes 100 ms per query may work in a lab but is unsustainable at scale. We broke ranking into retrieval, deeper models, and selective re-ranking to balance accuracy with efficiency.

Copilot made this even harder. Every query had to combine LLM reasoning with real-time product grounding. To make it feasible, we re-optimized the entire flow so conversational answers stayed fast, fresh, and accurate.
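Concretely, the retrieval bottleneck described above is usually attacked with approximate nearest-neighbor (ANN) indexes. Below is a small sketch using FAISS, a common open-source ANN library chosen purely for illustration (no claim is made about Bing's internal stack); an IVF index clusters items into cells and probes only a few per query, trading a sliver of recall for large latency wins:

```python
import numpy as np
import faiss  # pip install faiss-cpu

d, n_items, nlist = 64, 100_000, 256  # toy sizes; real catalogs hold billions
rng = np.random.default_rng(0)
item_vecs = rng.standard_normal((n_items, d)).astype("float32")

# IVF index: cluster items into nlist cells, then search only a few cells
# per query instead of scanning everything.
quantizer = faiss.IndexFlatL2(d)
index = faiss.IndexIVFFlat(quantizer, d, nlist, faiss.METRIC_L2)
index.train(item_vecs)
index.add(item_vecs)
index.nprobe = 8  # cells probed per query: the main recall/latency knob

query = rng.standard_normal((1, d)).astype("float32")
distances, ids = index.search(query, 100)  # stage-one candidates for the ranker
```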

How is AI changing recommendation systems?

We’re moving from static lists to conversational, multimodal assistants.

At Microsoft, I led the integration of product search into Copilot, allowing users to ask natural questions and receive real-time, structured recommendations with explanations. This was one of the first large-scale examples of AI-powered commerce.

We also deployed multimodal enrichment pipelines — combining text and image signals — which made product understanding more accurate. This is laying the foundation for the next generation of personalization.
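A minimal form of that text-plus-image combination is a normalized weighted average of per-modality embeddings. Production enrichment pipelines are far richer, but the sketch shows the core idea:

```python
import numpy as np

def fuse_embeddings(text_emb: np.ndarray, image_emb: np.ndarray,
                    text_weight: float = 0.5) -> np.ndarray:
    """Combine per-product text and image embeddings into one vector.
    L2-normalizing each modality first keeps either from dominating by scale."""
    t = text_emb / np.linalg.norm(text_emb)
    v = image_emb / np.linalg.norm(image_emb)
    fused = text_weight * t + (1.0 - text_weight) * v
    return fused / np.linalg.norm(fused)
```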

On the engineering side, AI is also improving operations: automating feature generation, anomaly detection, and log analysis. That freed our teams to focus on innovation, not firefighting.

Where will recommendation systems be in 3–5 years?

I see three big shifts:

  • Platform consolidation — companies will unify recommendation stacks instead of building fragmented ones. At Microsoft, we did this for Bing, Shopping, Ads, and Copilot.
  • Conversational recommenders — assistants will guide users through dialog, not lists. Copilot shopping is an early example.
  • Transparency — users will expect to know why something is recommended. At Microsoft, we began surfacing explainable recommendations based on price, reviews, and availability.

Real-time personalization will also be baseline: systems adapting within a session, not overnight. That’s already where our real-time Bing pipelines were heading.
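In-session adaptation can be as simple as an exponentially decayed profile vector updated on every interaction, so the next request already reflects the last click rather than last night's batch job. A toy version, with invented parameters:

```python
import numpy as np

class SessionProfile:
    """A per-session user vector updated on every interaction, so the very
    next request reflects the latest click instead of an overnight batch."""

    def __init__(self, dim: int, decay: float = 0.8):
        self.vec = np.zeros(dim, dtype="float32")
        self.decay = decay  # higher = longer memory within the session

    def update(self, item_emb: np.ndarray) -> None:
        # Exponential moving average over embeddings of interacted items.
        self.vec = self.decay * self.vec + (1.0 - self.decay) * item_emb

    def score(self, item_emb: np.ndarray) -> float:
        # Cosine similarity between the session profile and a candidate item.
        denom = float(np.linalg.norm(self.vec) * np.linalg.norm(item_emb))
        return float(self.vec @ item_emb) / denom if denom else 0.0
```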

What skills are most valuable for engineers working in this field?

The best engineers combine deep ML knowledge with systems engineering skills. At Microsoft, my teams needed to understand embeddings and ranking models, but also design distributed pipelines that processed trillions of features daily under strict latency limits.

Experimentation skills are equally critical. Every improvement — from unified ranking stacks to Copilot shopping flows — went through rigorous A/B testing.

Finally, communication. When we launched Copilot product search, success depended on ML researchers, infra engineers, and product managers aligning daily. Engineers who could explain trade-offs and collaborate across teams made the difference.

If you could give one piece of advice to companies preparing to scale their recommendation systems, what would it be?

Invest in infrastructure early. Fragile pipelines and stale features will kill you faster than a weak model.

At Microsoft, before we could launch Copilot shopping, we had to build real-time ingestion pipelines and unified ranking stacks. Those foundations mattered far more than exotic models that couldn’t meet latency requirements.

The companies that succeed are not the ones with the fanciest models, but the ones with the most resilient systems. That resilience is what allowed us to bring Bing’s product search into Copilot as a conversational, AI-powered experience.
