
While the AI world obsesses over massive models and cloud-scale infrastructure, a quieter revolution is unfolding in your pocket. Valerii Popov, a senior software engineer at WhatsApp, is part of a small but impactful group making AI smarter, faster, and entirely local.
With over a decade of experience across mobile platforms, from early Android work to leading machine learning deployment at scale, Valerii has helped shape everything from encrypted messaging to intelligent voice features. At WhatsApp, his focus is bringing cutting-edge intelligence directly to users' devices, without sacrificing privacy, speed, or battery life.
We sat down with Valerii to talk about what's working in on-device AI today, what's still hard, and where the next wave of innovation will hit.
What's been the biggest on-device AI win for messaging recently?
"Early wins that really landed with users were on-device translation, both text and voice, and voice-note summarization. Especially in global markets, this is a game changer. Imagine receiving a voice message in a language you don't speak and getting a concise summary, instantly and offline."
That's not all. Valerii points to under-the-hood AI features like local ranking of messages, filtering low-value business content, and even context-aware smart replies. These aren't flashy, but they work: fast, private, and without pinging a server.
"These features keep latency low, work offline, and avoid sending sensitive content to servers. That's a trifecta of user value in messaging."
How do you decide what runs locally versus in the cloud?
It's a balancing act between privacy, latency, and resource cost.
"If it touches user content and we don't have clear consent, it stays on-device. Full stop."
Valerii often starts with a server-based model to test feasibility, then moves components on-device when privacy or real-time performance demands it. For example:
- Offline needs? Device wins.
- Battery-intensive task? Push it server-side.
- Simple classification task? Use a small on-device model; don't waste a 30B-parameter LLM.
Most production systems at scale end up hybrid: "We often try the model on-device first. If confidence is low, we fall back to the server. A remote config controls this gating logic."
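To make that gating concrete, here is a minimal sketch of local-first inference with a confidence-based server fallback. It is illustrative only: `run_local_model`, `call_server_model`, the config keys, and the 0.7 threshold are all assumptions, not WhatsApp's actual implementation.

```python
# Minimal sketch of local-first inference with a server fallback.
# run_local_model, call_server_model, and the config keys are hypothetical
# stand-ins, not WhatsApp's actual APIs.

def run_local_model(text: str) -> tuple[str, float]:
    # Stand-in for an on-device classifier; returns (label, confidence).
    return ("not_spam", 0.62)

def call_server_model(text: str) -> str:
    # Stand-in for the heavier server-side fallback model.
    return "not_spam"

def classify(text: str, remote_config: dict) -> str:
    # Remote config decides whether the on-device path is used at all.
    if not remote_config.get("on_device_enabled", True):
        return call_server_model(text)
    label, confidence = run_local_model(text)
    # Low-confidence local results fall back to the server.
    if confidence < remote_config.get("confidence_threshold", 0.7):
        return call_server_model(text)
    return label

print(classify("limited-time offer!!!", {"on_device_enabled": True}))
```

The design point is that the remote config, not the shipped client binary, decides whether the on-device path is live and how aggressive the fallback threshold is.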
What's actually working in reducing model size and power use on mobile in 2025?
"Quantization to INT8 is the default: it gives us 3–4x smaller models and up to 3x speedups with minimal accuracy loss."
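As a rough illustration of the kind of INT8 conversion he is describing, here is a post-training dynamic-quantization sketch in PyTorch. The toy model and layer sizes are placeholders; a production pipeline would use its own export and quantization tooling.

```python
import torch
import torch.nn as nn

# Toy classifier standing in for a production model.
model = nn.Sequential(nn.Linear(256, 128), nn.ReLU(), nn.Linear(128, 4))
model.eval()

# Post-training dynamic quantization: Linear weights are stored as INT8,
# activations are quantized on the fly at inference time.
quantized = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

with torch.no_grad():
    out = quantized(torch.randn(1, 256))
print(out.shape)  # torch.Size([1, 4])
```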
Beyond that, Valerii highlights distillation ("teacher-student training is gold for on-device"), operator fusion to reduce memory traffic, and NPU offloading as game changers, when supported.
Pruning is hit-or-miss: "It saves size, but real-world speedups depend on kernel support. And over-pruning hurts performance, especially on attention-heavy models."
His rule of thumb? Classifiers under 1 MB, speech or ASR models between 5 and 20 MB, and cold-start times under 500 ms.
Can you really personalize without sending user data to the cloud?
Yes, but it's tricky. The go-to solution: federated learning.
"A global model gets fine-tuned on your device using local data. Only anonymized weight updates are sent back, never the raw messages."
In practice, personalization often involves fine-tuning only the final layers of a model or running a few local SGD epochs. For features like speech recognition or smart replies, this adds meaningful value without compromising privacy.
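A minimal sketch of that pattern, assuming a toy PyTorch model: only the head is trained on local examples, and only the resulting weight update, not the data, would leave the device. Layer sizes, learning rate, and step count are illustrative.

```python
import copy
import torch
import torch.nn as nn

# Toy smart-reply ranker; only the final layer is personalized on-device.
base_model = nn.Sequential(nn.Linear(64, 32), nn.ReLU(), nn.Linear(32, 8))
local_model = copy.deepcopy(base_model)

# Freeze everything except the head.
for p in local_model[0].parameters():
    p.requires_grad = False

optimizer = torch.optim.SGD(local_model[2].parameters(), lr=0.01)
loss_fn = nn.CrossEntropyLoss()

# A few local SGD steps over (synthetic) on-device examples.
features, labels = torch.randn(16, 64), torch.randint(0, 8, (16,))
for _ in range(3):
    optimizer.zero_grad()
    loss = loss_fn(local_model(features), labels)
    loss.backward()
    optimizer.step()

# Only the weight update for the head leaves the device, never the raw data.
update = {
    name: (local_p - base_p).detach()
    for (name, local_p), (_, base_p) in zip(
        local_model[2].named_parameters(), base_model[2].named_parameters()
    )
}
print({name: delta.shape for name, delta in update.items()})
```

In a real federated setup these updates would also be aggregated across many devices, often with secure aggregation and added noise, before they touch the global model.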
How do you enforce safety and moderation in E2E encrypted environments?
This is where on-device shines. Valerii separates concerns clearly:
- On-device: Lightweight moderation (e.g., blur NSFW content, flag suspicious links), user-side nudges, local heuristics.
- Server-side: Network-wide abuse detection, rate limiting, and coordinated spam analysis, done via metadata rather than content.
"On-device checks are conservative and reversible: think 'Are you sure you want to send this?' prompts. No permanent actions without server-side confirmation."
Importantly, any local model runners are sandboxed and cryptographically verified before execution to ensure security.
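The verification step might look something like the sketch below: a digest check against a value delivered through a trusted, signed channel before any model artifact is loaded. The file name and the plain SHA-256 digest are illustrative; a production system would typically pair this with full signature verification and a sandboxed runner.

```python
import hashlib
import hmac
from pathlib import Path

def verify_model_file(path: Path, expected_sha256: str) -> bool:
    # Hash the downloaded artifact and compare, in constant time, against a
    # digest delivered through a trusted (code-signed) channel.
    digest = hashlib.sha256(path.read_bytes()).hexdigest()
    return hmac.compare_digest(digest, expected_sha256)

# Usage sketch (the expected digest would ship via a signed config, and the
# file name here is hypothetical):
# if not verify_model_file(Path("nsfw_classifier.tflite"), expected_digest):
#     raise RuntimeError("Model integrity check failed; refusing to load.")
```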
What's your blueprint for scalable hybrid architecture?
"Local-first inference with server fallback is standard. We use remote config to enable or disable features based on device capabilities."
iOS and Android bring their own pain points:
- iOS: More predictable performance; tight Core ML + ANE integration.
- Android: Device fragmentation, inconsistent NNAPI support, stricter APK size budgets.
Maintaining parity means building from a common intermediate representation like ONNX or Core ML, using device allowlists, and testing latency/battery regressions across the board.
"Rollout is blocked unless we hit latency and memory targets on both platforms."
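A toy version of such a release gate is sketched below; the metric names and budgets are invented for illustration and are not WhatsApp's real targets.

```python
# Hypothetical per-platform budgets; the real targets are not public.
TARGETS = {
    "ios": {"p95_latency_ms": 80.0, "peak_memory_mb": 150.0},
    "android": {"p95_latency_ms": 120.0, "peak_memory_mb": 150.0},
}

def rollout_allowed(benchmarks: dict) -> bool:
    # Every platform must meet every budget; a single miss blocks rollout.
    return all(
        benchmarks[platform][metric] <= budget
        for platform, budgets in TARGETS.items()
        for metric, budget in budgets.items()
    )

print(rollout_allowed({
    "ios": {"p95_latency_ms": 62.0, "peak_memory_mb": 140.0},
    "android": {"p95_latency_ms": 118.0, "peak_memory_mb": 149.0},
}))  # True
```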
How do you make on-device AI privacy-defensible?
Beyond slogans, Valerii details a rigorous approach:
- Key management + secure enclaves for sensitive caches.
- Rollback protection for model integrity.
- Schema validation + code signing for model runners.
- Clear user consent flows and opt-outs.
"No shortcuts. Every component is reviewed with a privacy lens."
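Rollback protection, for example, can be as simple as refusing any model whose version number is lower than the newest version the device has already seen. The sketch below is a simplified illustration; in practice the counter would live in hardware-backed storage rather than a plain JSON file.

```python
import json
from pathlib import Path

STATE_FILE = Path("model_version_state.json")  # hypothetical local store

def rollback_protected_install(new_version: int) -> bool:
    # Refuse any model older than the newest version already seen, so a
    # known-bad earlier build can't be pushed back onto the device.
    state = json.loads(STATE_FILE.read_text()) if STATE_FILE.exists() else {}
    latest_seen = state.get("latest_model_version", 0)
    if new_version < latest_seen:
        return False
    STATE_FILE.write_text(json.dumps({"latest_model_version": new_version}))
    return True
```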
So, what's next for on-device AI in messaging over the coming 12–18 months?
Valerii is bullish on small multimodal models, better on-device ASR, and edge summarization.
"We'll see more intent classification, lightweight agents, and context-aware workflows, all running locally. The user expects intelligence without compromise."
And maybe, just maybe, we'll stop calling it "on-device" and just call it "smart."




