
While the AI world obsesses over massive models and cloud-scale infrastructure, a quieter revolution is unfolding in your pocket. Valerii Popov, a senior software engineer at WhatsApp, is part of a small but impactful group making AI smarter, faster, and entirely local.
With over a decade of experience across mobile platforms, from early Android work to leading machine learning deployment at scale, Valerii has helped shape everything from encrypted messaging to intelligent voice features. At WhatsApp, his focus is bringing cutting-edge intelligence directly to users' devices, without sacrificing privacy, speed, or battery life.
We sat down with Valerii to talk about what's working in on-device AI today, what's still hard, and where the next wave of innovation will hit.
What's been the biggest on-device AI win for messaging recently?
"Early wins that really landed with users were on-device translation, both text and voice, and voice-note summarization. Especially in global markets, this is a game changer. Imagine receiving a voice message in a language you don't speak and getting a concise summary, instantly and offline."
That's not all. Valerii points to under-the-hood AI features like local ranking of messages, filtering low-value business content, and even context-aware smart replies. These aren't flashy, but they work: fast, private, and without pinging a server.
"These features keep latency low, work offline, and avoid sending sensitive content to servers. That's a trifecta of user value in messaging."
How do you decide what runs locally versus in the cloud?
It's a balancing act between privacy, latency, and resource cost.
"If it touches user content and we don't have clear consent, it stays on-device. Full stop."
Valerii often starts with a server-based model to test feasibility, then moves components on-device when privacy or real-time performance demands it. For example:
- Offline needs? Device wins.
- Battery-intensive task? Push it server-side.
- Simple classification task? Use a small on-device model; don't waste a 30B-parameter LLM.
Most production systems at scale end up hybrid: "We often try the model on-device first. If confidence is low, we fall back to the server. A remote config controls this gating logic."
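To make that gating concrete, here is a minimal sketch of local-first inference with a confidence-based server fallback. It is illustrative only: `run_local_model`, `call_server_model`, the config keys, and the 0.7 threshold are all assumptions, not WhatsApp's actual implementation.

```python
# Minimal sketch of local-first inference with a server fallback.
# run_local_model, call_server_model, and the config keys are hypothetical
# stand-ins, not WhatsApp's actual APIs.

def run_local_model(text: str) -> tuple[str, float]:
    # Stand-in for an on-device classifier; returns (label, confidence).
    return ("not_spam", 0.62)

def call_server_model(text: str) -> str:
    # Stand-in for the heavier server-side fallback model.
    return "not_spam"

def classify(text: str, remote_config: dict) -> str:
    # Remote config decides whether the on-device path is used at all.
    if not remote_config.get("on_device_enabled", True):
        return call_server_model(text)
    label, confidence = run_local_model(text)
    # Low-confidence local results fall back to the server.
    if confidence < remote_config.get("confidence_threshold", 0.7):
        return call_server_model(text)
    return label

print(classify("limited-time offer!!!", {"on_device_enabled": True}))
```

The design point is that the remote config, not the shipped client binary, decides whether the on-device path is live and how aggressive the fallback threshold is.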
What's actually working in reducing model size and power use on mobile in 2025?
"Quantization to INT8 is the default: it gives us 3–4x smaller models and up to 3x speedups with minimal accuracy loss."
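As a rough illustration of the kind of INT8 conversion he is describing, here is a post-training dynamic-quantization sketch in PyTorch. The toy model and layer sizes are placeholders; a production pipeline would use its own export and quantization tooling.

```python
import torch
import torch.nn as nn

# Toy classifier standing in for a production model.
model = nn.Sequential(nn.Linear(256, 128), nn.ReLU(), nn.Linear(128, 4))
model.eval()

# Post-training dynamic quantization: Linear weights are stored as INT8,
# activations are quantized on the fly at inference time.
quantized = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

with torch.no_grad():
    out = quantized(torch.randn(1, 256))
print(out.shape)  # torch.Size([1, 4])
```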
Beyond that, Valerii highlights distillation ("teacher-student training is gold for on-device"), operator fusion to reduce memory traffic, and NPU offloading as game changers, when supported.
Pruning is hit-or-miss: "It saves size, but real-world speedups depend on kernel support. And over-pruning hurts performance, especially on attention-heavy models."
His rule of thumb? Classifiers under 1 MB, speech or ASR models between 5 and 20 MB, and cold-start times under 500 ms.
Can you really personalize without sending user data to the cloud?
Yes, but it's tricky. The go-to solution: federated learning.
"A global model gets fine-tuned on your device using local data. Only anonymized weight updates are sent back, never the raw messages."
In practice, personalization often involves fine-tuning only the final layers of a model or running a few local SGD epochs. For features like speech recognition or smart replies, this adds meaningful value without compromising privacy.
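A minimal sketch of that pattern, assuming a toy PyTorch model: only the head is trained on local examples, and only the resulting weight update, not the data, would leave the device. Layer sizes, learning rate, and step count are illustrative.

```python
import copy
import torch
import torch.nn as nn

# Toy smart-reply ranker; only the final layer is personalized on-device.
base_model = nn.Sequential(nn.Linear(64, 32), nn.ReLU(), nn.Linear(32, 8))
local_model = copy.deepcopy(base_model)

# Freeze everything except the head.
for p in local_model[0].parameters():
    p.requires_grad = False

optimizer = torch.optim.SGD(local_model[2].parameters(), lr=0.01)
loss_fn = nn.CrossEntropyLoss()

# A few local SGD steps over (synthetic) on-device examples.
features, labels = torch.randn(16, 64), torch.randint(0, 8, (16,))
for _ in range(3):
    optimizer.zero_grad()
    loss = loss_fn(local_model(features), labels)
    loss.backward()
    optimizer.step()

# Only the weight update for the head leaves the device, never the raw data.
update = {
    name: (local_p - base_p).detach()
    for (name, local_p), (_, base_p) in zip(
        local_model[2].named_parameters(), base_model[2].named_parameters()
    )
}
print({name: delta.shape for name, delta in update.items()})
```

In a real federated setup these updates would also be aggregated across many devices, often with secure aggregation and added noise, before they touch the global model.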
How do you enforce safety and moderation in E2E encrypted environments?
This is where on-device shines. Valerii separates concerns clearly:
- On-device: Lightweight moderation (e.g., blur NSFW content, flag suspicious links), user-side nudges, local heuristics.
- Server-side: Network-wide abuse detection, rate limiting, and coordinated spam analysis, done via metadata rather than content.
"On-device checks are conservative and reversible: think 'Are you sure you want to send this?' prompts. No permanent actions without server-side confirmation."
Importantly, any local model runners are sandboxed and cryptographically verified before execution to ensure security.
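The verification step might look something like the sketch below: a digest check against a value delivered through a trusted, signed channel before any model artifact is loaded. The file name and the plain SHA-256 digest are illustrative; a production system would typically pair this with full signature verification and a sandboxed runner.

```python
import hashlib
import hmac
from pathlib import Path

def verify_model_file(path: Path, expected_sha256: str) -> bool:
    # Hash the downloaded artifact and compare, in constant time, against a
    # digest delivered through a trusted (code-signed) channel.
    digest = hashlib.sha256(path.read_bytes()).hexdigest()
    return hmac.compare_digest(digest, expected_sha256)

# Usage sketch (the expected digest would ship via a signed config, and the
# file name here is hypothetical):
# if not verify_model_file(Path("nsfw_classifier.tflite"), expected_digest):
#     raise RuntimeError("Model integrity check failed; refusing to load.")
```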
What's your blueprint for scalable hybrid architecture?
"Local-first inference with server fallback is standard. We use remote config to enable or disable features based on device capabilities."
iOS and Android bring their own pain points:
- iOS: More predictable performance; tight Core ML + ANE integration.
- Android: Device fragmentation, inconsistent NNAPI support, stricter APK size budgets.
Maintaining parity means building from a common intermediate representation like ONNX or Core ML, using device allowlists, and testing latency/battery regressions across the board.
"Rollout is blocked unless we hit latency and memory targets on both platforms."
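A toy version of such a release gate is sketched below; the metric names and budgets are invented for illustration and are not WhatsApp's real targets.

```python
# Hypothetical per-platform budgets; the real targets are not public.
TARGETS = {
    "ios": {"p95_latency_ms": 80.0, "peak_memory_mb": 150.0},
    "android": {"p95_latency_ms": 120.0, "peak_memory_mb": 150.0},
}

def rollout_allowed(benchmarks: dict) -> bool:
    # Every platform must meet every budget; a single miss blocks rollout.
    return all(
        benchmarks[platform][metric] <= budget
        for platform, budgets in TARGETS.items()
        for metric, budget in budgets.items()
    )

print(rollout_allowed({
    "ios": {"p95_latency_ms": 62.0, "peak_memory_mb": 140.0},
    "android": {"p95_latency_ms": 118.0, "peak_memory_mb": 149.0},
}))  # True
```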
How do you make on-device AI privacy-defensible?
Beyond slogans, Valerii details a rigorous approach:
- Key management + secure enclaves for sensitive caches.
- Rollback protection for model integrity.
- Schema validation + code signing for model runners.
- Clear user consent flows and opt-outs.
"No shortcuts. Every component is reviewed with a privacy lens."
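Rollback protection, for example, can be as simple as refusing any model whose version number is lower than the newest version the device has already seen. The sketch below is a simplified illustration; in practice the counter would live in hardware-backed storage rather than a plain JSON file.

```python
import json
from pathlib import Path

STATE_FILE = Path("model_version_state.json")  # hypothetical local store

def rollback_protected_install(new_version: int) -> bool:
    # Refuse any model older than the newest version already seen, so a
    # known-bad earlier build can't be pushed back onto the device.
    state = json.loads(STATE_FILE.read_text()) if STATE_FILE.exists() else {}
    latest_seen = state.get("latest_model_version", 0)
    if new_version < latest_seen:
        return False
    STATE_FILE.write_text(json.dumps({"latest_model_version": new_version}))
    return True
```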
So, what's next for on-device AI in messaging over the coming 12–18 months?
Valerii is bullish on small multimodal models, better on-device ASR, and edge summarization.
"We'll see more intent classification, lightweight agents, and context-aware workflows, all running locally. The user expects intelligence without compromise."
And maybe, just maybe, we'll stop calling it "on-device" and just call it "smart."




