When Pavel Andreev set out to build a pronunciation feedback feature for his iOS children’s app, Tap & Learn Animals, he faced a defining choice. He could route toddlers’ voice data through cloud-based speech recognition APIs from giants like Google, Amazon, or Meta. That route promised high baseline accuracy and would have let him ship in a matter of weeks.
But the moment an audio recording leaves a child’s device, developers are bound by stringent regulations like the US Children’s Online Privacy Protection Act (COPPA) and the UK’s Age Appropriate Design Code. Navigating that labyrinth means implementing complex parental consent procedures, strict data retention rules, robust breach notification protocols, and significantly expanded security infrastructure.
Andreev, whose technical career began in iOS engineering roles at Actriv Healthcare and Alaska Airlines, chose the frictionless path for user privacy: training an entirely on-device model to eliminate the risks of processing children’s personal data.
The 4.5 Megabyte Compromise And Why It Wasn’t Enough
Building a production-ready model for children’s speech from scratch is no small feat. The development of the bespoke model, PronunciationNet, took four distinct iterations.
His initial attempt utilized Apple’s CreateML as a basic sound classifier. Trained on a small dataset of roughly 300 recordings, the model immediately overfitted. Worse, it was functionally useless for education, returning a binary “pass” or “fail” for entire words without any granular nuance.
Next, he pivoted to PyTorch, employing Facebook’s pre-trained wav2vec2 model. While performance surged, the resulting CoreML export weighed 180MB. For a lightweight educational app, ballooning the installation size past 200MB was unacceptable.
The third attempt approached the problem laterally: treating audio analysis as an image classification task. By using MobileNetV2 with Mel-spectrograms, he aggressively shrank the model size to just 4.5MB. With an expanded dataset of over 1,200 recordings, accuracy hit 88%. Yet, this architecture still suffered from the same functional gap as CreateML—it couldn’t localize pronunciation errors down to individual syllables.
The Breakthrough: A 1.3MB Custom Architecture
Realizing that off-the-shelf models and transfer learning wouldn’t meet his rigorous standards, Andreev immersed himself in neural network literature. He engineered a bespoke dual-head architecture: 1D-convolutional (Conv1D) layers feeding into a bi-directional GRU network. Crucially, the model features two output branches—one for holistic word classification, and a second for the independent evaluation of individual syllables.
Once converted, the final model weighed an astonishingly light 1.3MB and ran seamlessly on the Apple Neural Engine.
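To give a feel for why a Conv1D-plus-GRU design can stay this small, here is a back-of-envelope parameter count. It is only a sketch: every layer size below (input features, channel counts, kernel width, hidden size, vocabulary size) is a hypothetical value chosen for illustration, not a figure from PronunciationNet, and the GRU formula follows the PyTorch nn.GRU layout with two bias vectors per gate.

```python
def conv1d_params(in_channels, out_channels, kernel_size):
    """Kernel weights plus one bias per output channel."""
    return in_channels * out_channels * kernel_size + out_channels


def gru_params(input_size, hidden_size, bidirectional=True):
    """Three gates, each with input weights, recurrent weights and two
    bias vectors (the PyTorch nn.GRU layout)."""
    per_direction = 3 * (input_size * hidden_size
                         + hidden_size * hidden_size
                         + 2 * hidden_size)
    return per_direction * (2 if bidirectional else 1)


def linear_params(in_features, out_features):
    """Weight matrix plus one bias per output."""
    return in_features * out_features + out_features


# Hypothetical layer sizes, chosen only for illustration.
N_FEATURES, CONV1, CONV2, KERNEL, HIDDEN, N_WORDS = 40, 64, 128, 5, 128, 50

layers = {
    "conv1": conv1d_params(N_FEATURES, CONV1, KERNEL),
    "conv2": conv1d_params(CONV1, CONV2, KERNEL),
    "bi_gru": gru_params(CONV2, HIDDEN),
    # A bi-directional GRU emits 2 * HIDDEN features per time step.
    "word_head": linear_params(2 * HIDDEN, N_WORDS),  # one score per word
    "syllable_head": linear_params(2 * HIDDEN, 1),    # one score per time step
}

total = sum(layers.values())
for bytes_per_weight, name in ((4, "float32"), (2, "float16")):
    size_mb = total * bytes_per_weight / 1e6
    print(f"{name}: {total} parameters, about {size_mb:.2f} MB")
```

With these assumed sizes the network has roughly a quarter of a million weights, which works out to about one megabyte at 32-bit precision. A real exported model file also carries metadata, so this estimates the weights only.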
How Toddlers Broke the Model
The real test, however, came during live testing. The initial dataset consisted of near-studio-quality recordings. In the unpredictable reality of a toddler’s environment, the model triggered false positives against TV noise, street sounds, and random room chatter.
Andreev rigorously tackled these edge cases. He introduced an `<unknown>` class, deliberately recording over 150 samples of ambient noise. This allowed the model to gracefully fail and issue a readable error prompt when background noise obscured the speech, rather than forcefully guessing a word from its dictionary.
He then layered a 0.45 confidence threshold over the classifier. Pure white noise, digital silence, and 1kHz sine waves all returned probabilities below this threshold, safely routing to the unknown class. The production system didn’t just learn what it knew; it learned the boundaries of its own competence.
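The routing logic described above can be sketched in a few lines. This is a minimal illustration, not the app's code: the 0.45 threshold and the `<unknown>` label come from the article, while the function names, the vocabulary, and the raw scores are hypothetical.

```python
import math

UNKNOWN = "<unknown>"
CONFIDENCE_THRESHOLD = 0.45  # value quoted in the article


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify(logits, labels, threshold=CONFIDENCE_THRESHOLD):
    """Return (label, confidence), falling back to UNKNOWN when unsure."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    confidence = probs[best]
    # Reject when the noise class wins or when no class is confident enough.
    if labels[best] == UNKNOWN or confidence < threshold:
        return UNKNOWN, confidence
    return labels[best], confidence


labels = ["cat", "dog", "cow", UNKNOWN]  # hypothetical vocabulary

print(classify([4.0, 0.5, 0.2, 0.1], labels))   # one clear winner
print(classify([0.3, 0.2, 0.25, 0.1], labels))  # flat scores, no clear winner
```

The two checks cover different failures. The explicit noise class catches sounds the model has learned to recognise as background, and the threshold catches inputs that resemble nothing it was trained on.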
From Education to Therapy
What started as a pronunciation tutor for a children’s app is beginning to look like something speech therapists might use. Andreev is designing exercises adapted from clinical logopedic practice. One uses the iPhone’s microphone to track sustained breath — a boat on screen that the child holds at center by exhaling steadily, the audio waveform mapped to position. Another targets phonemic awareness: matching the number of syllables in a spoken word to taps on the screen.
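One plausible way to map breath to boat position in the first exercise is to measure loudness per audio frame and convert it to a horizontal offset. The sketch below assumes that approach. The article does not describe the actual implementation, and the target level, tolerance, and smoothing factor are invented values for illustration.

```python
import math


def rms(frame):
    """Root-mean-square loudness of one frame of samples in [-1.0, 1.0]."""
    return math.sqrt(sum(s * s for s in frame) / len(frame))


def boat_offset(level, target=0.2, tolerance=0.15):
    """Map loudness to a horizontal offset in [-1.0, 1.0].

    0.0 is the centre of the screen. Exhaling too softly drifts the boat
    one way, too hard drifts it the other. Both defaults are made up.
    """
    offset = (level - target) / tolerance
    return max(-1.0, min(1.0, offset))


def smooth(values, alpha=0.3):
    """Exponential moving average so the boat glides instead of jittering."""
    if not values:
        return []
    smoothed, current = [], values[0]
    for value in values:
        current = alpha * value + (1 - alpha) * current
        smoothed.append(current)
    return smoothed


# Synthetic loudness levels standing in for successive microphone frames.
levels = [0.05, 0.18, 0.21, 0.20, 0.35]
positions = [boat_offset(level) for level in smooth(levels)]
print([round(p, 2) for p in positions])
```

Smoothing before mapping means a single noisy frame does not push the boat off course, which matters when the player is a toddler.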
He is also looking outward. His next move is to approach private paediatric speech-language therapy providers on both sides of the Atlantic — clinicians who already work the same ground the app does: living rooms, kitchens, school classrooms, telehealth calls. In the UK, that means practices like Better Days, Kiki’s Children’s Clinic, and The Children’s Place — multidisciplinary paediatric clinics with RCSLT- and HCPC-registered therapists serving London and beyond. In the US, peers include Los Angeles Intensive Pediatric Therapy, The Voz Institute in Washington, DC, and The Center for Pediatric Therapy in New Jersey. The goal is structured feedback from frontline clinicians who spend their careers helping children speak — and who could, in time, put the app in front of the families they serve. For now, every recording stays on the device. That was the whole point.



