Conversational AI

The AI Behind Voice Assistants and Their Functions

Voice assistants have become integral to our daily lives, seamlessly integrating into everything from smart homes to mobile devices. According to a Statista report, by 2024, the number of digital voice assistants is expected to surpass 8.4 billion units, exceeding the world’s population. This growth reflects the increasing reliance on these technologies in various sectors. 

These conversational agents, powered by advanced Artificial Intelligence (AI), result from years of innovation in fields such as Natural Language Processing (NLP), machine learning, and speech recognition.

We spoke with Dr. Irina Barskaya, a distinguished data scientist with over a decade of experience and the mastermind behind Yasmina, the first fully functional AI-based voice assistant localized for Saudi Arabia. In our conversation, we delved into the process of building and refining voice assistants, exploring the AI technologies that power their capabilities.

Could you walk us through how AI is being used to bring voice assistants to life? What’s going on behind the scenes?

Voice assistants are powered by cutting-edge AI technologies that work in harmony to deliver seamless user experiences. At the forefront is Natural Language Processing (NLP), which allows these assistants to understand and interpret human language. Automatic Speech Recognition (ASR) acts as the ‘ears’ of the assistant, converting spoken words into text. Then Natural Language Understanding (NLU), the ‘brain’, steps in, analyzing the intent and context behind those words to determine the appropriate response or action. Finally, Text-to-Speech (TTS) technology, the ‘voice’, transforms the response into speech, allowing the assistant to communicate naturally with users.
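
To make that division of labour concrete, here is a minimal sketch of the ASR → NLU → TTS loop in Python. The function names and toy logic are purely illustrative stand-ins, not the API of any real assistant.

```python
# A minimal, hypothetical sketch of the ASR -> NLU -> TTS pipeline described above.
# Every function here is a toy stand-in for a real model or service.

def recognize_speech(audio: bytes) -> str:
    """ASR: the 'ears' -- convert raw audio into a text transcript."""
    return "what's the weather like today"  # a real system runs acoustic + language models here

def parse_intent(transcript: str) -> dict:
    """NLU: the 'brain' -- extract the user's intent and its parameters (slots)."""
    if "weather" in transcript:
        return {"intent": "get_weather", "slots": {"date": "today"}}
    return {"intent": "fallback", "slots": {}}

def run_scenario(intent: dict) -> str:
    """Dialogue/business logic: pick a response for the detected intent."""
    if intent["intent"] == "get_weather":
        return "It's sunny and 24 degrees today."
    return "Sorry, I didn't catch that."

def synthesize_speech(text: str) -> bytes:
    """TTS: the 'voice' -- turn the response text back into audio."""
    return text.encode("utf-8")  # placeholder: a real system returns an audio waveform

def handle_utterance(audio: bytes) -> bytes:
    transcript = recognize_speech(audio)
    intent = parse_intent(transcript)
    return synthesize_speech(run_scenario(intent))

print(handle_utterance(b"\x00\x01"))
```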

Beyond these core NLP components, other AI technologies play crucial roles. For example, smart speakers use wake-word detection models to recognize when they’re being addressed. 

Plus, search engine integrations fetch real-time information, such as TV shows and news, and sentiment analysis models gauge emotions in a user’s voice. However, due to recent advancements in multimodal models and their increasing quality, we’re approaching a future where a single, unified model could handle all these tasks with even greater efficiency.
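
As a rough illustration of the wake-word idea, a small on-device model scores short, overlapping audio windows and only wakes the full pipeline when the score crosses a threshold. The scoring function below is a toy stand-in for a real keyword-spotting model.

```python
# Toy illustration of on-device wake-word detection: score short audio windows
# and only start the heavier ASR/NLU pipeline once a window passes a threshold.

from typing import Iterable

WAKE_THRESHOLD = 0.8

def score_window(window: list[float]) -> float:
    """Stand-in for a tiny keyword-spotting model; returns P(wake word in this window)."""
    return 0.95 if sum(abs(x) for x in window) > 10 else 0.05

def detect_wake_word(audio_stream: Iterable[list[float]]) -> bool:
    for window in audio_stream:  # e.g. 1-second windows with 0.5-second overlap
        if score_window(window) >= WAKE_THRESHOLD:
            return True          # only now does the assistant start listening properly
    return False

# quiet windows followed by a louder one that the toy model treats as the wake word
stream = [[0.01] * 100, [0.02] * 100, [0.5] * 100]
print(detect_wake_word(stream))  # True
```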

So, Natural Language Processing is at the heart of voice assistants. Can you tell me more about how NLP shapes the way these assistants understand and respond to us?

If we look back, we can see how tightly NLP advancement is linked with voice assistants’ abilities and widespread adoption. In 1962, IBM presented a tool called Shoebox. It was the size of a shoebox and could perform mathematical functions and recognize 16 spoken words, including the digits 0 through 9. Then, in the 1970s, scientists at Carnegie Mellon University in Pittsburgh, Pennsylvania, created Harpy, which could recognize 1,011 words, about the vocabulary of a three-year-old. The very first voice assistant was actually the Julie doll from the Worlds of Wonder toy company, released in 1987; it could recognize a child’s voice and respond to it. All of these, of course, were far from a broad-use product for millions of people.

Today, 42% of Americans use voice assistants on their phones, and 47 million out of a population of 252 million own a “house for a voice assistant” – a smart speaker. This incredible rise was possible due to the rapid development of all three key components of voice assistants: Automatic Speech Recognition (ASR), Text-to-Speech (TTS), and Natural Language Understanding (NLU). These advancements have enabled voice assistants to become more accurate, responsive, and integrated into various aspects of daily life, making them indispensable tools for many users.

Voice assistants have come a long way in holding conversations, but how are they getting better at understanding context and maintaining a natural flow during interactions?

Voice assistants have evolved considerably, but even systems built around models such as GPT-4 and Claude rely on a robust ecosystem of specialized functions, blocks, and scenarios to deliver optimal performance. These systems are designed with modularity in mind, where different functions are responsible for specific tasks.

For example, when you ask your assistant to remove all current alarms, a dedicated function will first check the device’s state, retrieve all set alarms, and then proceed to delete them. Similarly, if a user asks to like a song that is currently playing, one function determines the song currently playing, and another records the user’s preference in a database. The NLU (Natural Language Understanding) brain orchestrates these scenarios, determining which scenario to invoke and with which parameters, ensuring precise and efficient task execution.
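
Here is a hypothetical sketch of that routing: the NLU layer produces an intent, and a registry maps each intent to a dedicated scenario function. The names and the in-memory “device state” are illustrative only.

```python
# Hypothetical sketch of scenario routing: the NLU 'brain' picks which scenario
# to invoke and with which parameters. All names here are illustrative.

device_state = {"alarms": ["07:00", "08:30"], "now_playing": "Song A"}
liked_songs: list[str] = []

def clear_alarms(slots: dict) -> str:
    alarms = device_state["alarms"]      # 1) check the device state and retrieve set alarms
    device_state["alarms"] = []          # 2) delete them
    return f"Removed {len(alarms)} alarms."

def like_current_song(slots: dict) -> str:
    song = device_state["now_playing"]   # one function finds the current track...
    liked_songs.append(song)             # ...another records the preference
    return f"Added {song} to your likes."

SCENARIOS = {"clear_alarms": clear_alarms, "like_song": like_current_song}

def dispatch(intent: dict) -> str:
    """The NLU layer decides which scenario to invoke and with which parameters."""
    return SCENARIOS[intent["name"]](intent.get("slots", {}))

print(dispatch({"name": "clear_alarms"}))  # Removed 2 alarms.
print(dispatch({"name": "like_song"}))     # Added Song A to your likes.
```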

However, the real game-changer has been the implementation of transformer-based models like BERT and GPT. These models allow the system to understand the context of words within sentences rather than in isolation, leading to more natural and fluent dialogues. For instance, you can ask, “What’s the weather like today?” and then follow up with, “And tomorrow?” without needing to repeat yourself. These advancements also allow voice assistants to consider the device’s state more accurately. If you ask, “Tell me what this song is about,” the assistant can identify the current track and provide a summary of its lyrics. This level of contextual awareness and responsiveness is what makes today’s voice assistants feel more intuitive and user-friendly than ever before.
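
A toy way to picture that contextual follow-up handling is a dialogue state that carries the previous intent and slots forward, so “And tomorrow?” inherits the earlier weather request. This is a simplification of what a transformer-based NLU model does implicitly, not a description of any production system.

```python
# Toy sketch of carrying dialogue context across turns, so an elliptical follow-up
# like "And tomorrow?" reuses the previous turn's intent. Illustrative only.

conversation_state = {"last_intent": None, "last_slots": {}}

def understand(utterance: str) -> dict:
    """Very rough stand-in for a context-aware NLU model."""
    text = utterance.lower()
    if "weather" in text:
        intent = {"name": "get_weather", "slots": {"date": "today"}}
    elif text.startswith("and ") and conversation_state["last_intent"]:
        # follow-up: reuse the previous intent, override only the mentioned slot
        slots = dict(conversation_state["last_slots"])
        slots["date"] = text.removeprefix("and ").rstrip("?")
        intent = {"name": conversation_state["last_intent"], "slots": slots}
    else:
        intent = {"name": "fallback", "slots": {}}
    conversation_state["last_intent"] = intent["name"]
    conversation_state["last_slots"] = intent["slots"]
    return intent

print(understand("What's the weather like today?"))  # {'name': 'get_weather', 'slots': {'date': 'today'}}
print(understand("And tomorrow?"))                   # {'name': 'get_weather', 'slots': {'date': 'tomorrow'}}
```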

What kinds of data are used to train voice assistants, and where does that data come from?

Training voice assistants involves using various types of data, and each is tailored to different components of the system.

ASR (Automatic Speech Recognition) models are pre-trained on large, open datasets sourced from various media where speech is prevalent, such as YouTube, films, podcasts, audiobooks, and other online content. Many of these online videos and audio recordings come with captions or transcriptions, either automatically generated or manually created. Public domain and open datasets like the LibriVox audiobooks, TED-LIUM (TED talks), and Common Voice by Mozilla are commonly used. It’s crucial to ensure that the data is publicly available and legally accessible. However, initial training is just the beginning. For specific acoustic adjustments to smart speakers, recordings are made in studio environments simulating houses and apartments to capture realistic acoustic conditions. Additionally, if users consent to share their data, it can be annotated by in-house teams to further enhance training.
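
As a small illustration, an open corpus such as LibriSpeech (built from the LibriVox audiobooks mentioned above) can be pulled for ASR experiments with the Hugging Face datasets library. The dataset identifier, config, and field names below are assumptions about that library’s public catalogue and may differ in your environment.

```python
# Minimal sketch of loading an open speech corpus for ASR pre-training experiments.
# Assumes the Hugging Face `datasets` library and its public LibriSpeech mirror.

from datasets import load_dataset

librispeech = load_dataset("librispeech_asr", "clean", split="train.100")

sample = librispeech[0]
print(sample["text"])                    # the reference transcription
print(sample["audio"]["sampling_rate"])  # typically 16 kHz for this corpus
```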

Like ASR, TTS (Text-to-Speech) models start with pre-training on large text-voice datasets. For fine-tuning, recordings of a speaker in a studio environment are necessary, where the speaker reads voice assistant domain texts with specific intonations and emotions. Typically, you would need at least 40-50 hours of clean recordings to achieve high-quality results.

NLU (Natural Language Understanding) models are primarily trained on labelled data from users who have consented to share their data for this purpose. This involves labelling different scenarios, deciding which text slots/taggers to use, and determining appropriate responses. For dialogue management, models similar to GPT are used. These models are pre-trained on open datasets, including books and the Common Crawl dataset, and then fine-tuned using Reinforcement Learning from Human Feedback (RLHF) to better handle user interactions.
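
For a sense of what that labelling looks like, here is a hypothetical example of a single annotated NLU training item with an intent, slot (tagger) spans, and a reference response. The schema is illustrative and not tied to any particular assistant.

```python
# Illustrative example of one labelled NLU training item: the user's text,
# the annotated intent, slot spans, and the reference response.
# The schema is hypothetical, not a real assistant's annotation format.

labelled_example = {
    "text": "set an alarm for 7 am tomorrow",
    "intent": "set_alarm",
    "slots": [
        {"name": "time", "value": "7 am", "start": 17, "end": 21},
        {"name": "date", "value": "tomorrow", "start": 22, "end": 30},
    ],
    "reference_response": "Alarm set for 7 a.m. tomorrow.",
}

print(labelled_example["text"][17:21])  # "7 am"
```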

I can imagine that accents and dialects can be tricky for voice assistants. How do you ensure they can understand and respond accurately to such diverse ways of speaking? Is there enough data for training?

Accents and dialects bring a unique challenge given how diverse they can be, and the availability of dialect data varies significantly between languages. For instance, the Arabic language is rich with dialects like Khaliji, Masri, and Shami, each with its regional nuances.  Finding enough open data for these dialects can be difficult, often requiring the creation of custom datasets for each one. In contrast, English dialects are more widely represented online, making it easier to source data for training. However, even with English, when localizing a product for a new market, it’s essential not just to rely on existing data but also to collect specific fine-tuning data to accommodate the unique acoustics of smart speakers.

To address these challenges, ongoing efforts include crowdsourcing initiatives, partnerships with linguistic communities, and using platforms like Mozilla’s Common Voice project. These are designed to gather a diverse range of voice samples, making sure that voice assistants can understand and respond to a global audience. Despite these efforts, ensuring complete coverage of all dialects and accents is still a work in progress. Continuous data collection, annotation, and model refinement are essential to making voice assistants more inclusive and effective for everyone, regardless of how they speak.

It’s interesting how voice assistants are beginning to recognize and respond appropriately to emotional cues in human speech. How exactly are AI voice assistants being trained to do that, and what challenges do you face in making this more natural?

Training voice assistants to recognize and respond to emotions involves a few key approaches. Emotion detection systems analyze how a user speaks, picking up on vocal tones and patterns. 

Sentiment analysis then determines which emotion is appropriate for the response, and the reply is synthesized with that emotion. For instance, Speech Synthesis Markup Language (SSML) can be used to annotate text with specific prosody, rate, pitch, and other vocal characteristics to convey emotions effectively.
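
As a small example, an SSML-annotated reply might look like the snippet below, built here as a Python string. The specific attribute values are illustrative, and support for individual SSML tags varies between TTS engines.

```python
# Illustrative SSML snippet: prosody attributes shape the rate and pitch of the
# synthesized reply so it sounds apologetic rather than neutral. Exact tag support
# varies between TTS engines, so treat the attribute values as an example.

ssml_response = """
<speak>
  <prosody rate="slow" pitch="-10%">
    I'm sorry, I couldn't find that song.
  </prosody>
  <break time="300ms"/>
  <prosody rate="medium" pitch="+5%">
    Would you like me to search for something similar?
  </prosody>
</speak>
""".strip()

print(ssml_response)  # this string would be passed to an SSML-aware TTS engine
```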

Here it is worth mentioning the Tortoise TTS architecture. Unlike traditional single-speaker models, Tortoise combines an autoregressive transformer with a diffusion decoder to generate high-quality, emotionally nuanced speech without needing a separate model for each voice. It also supports text prompts, allowing emotional cues to be included directly in the input, such as [frustrated] “Why isn’t this working?” This enhances the expressiveness and context-appropriateness of the responses.

Of course, many of us have seen OpenAI’s impressive demo of an emotional voice assistant, which responds and reacts to what you say with nuanced expressions. That level of interaction is only possible with a truly multimodal model, one that integrates different types of data, such as visual cues, contextual understanding, and advanced NLP, into a single lifelike and responsive experience. The challenge lies in blending these elements seamlessly so that voice assistants not only recognize emotions but also react in a way that feels genuine and empathetic.

Given that voice data can be so personal, how do these systems ensure that our privacy and security are protected? What measures and technologies are employed?

It is true that voice assistants handle highly sensitive data, so protecting users’ privacy is, as it should be, a top priority. Here are some of the ways these systems keep your information safe:

1. Many assistants use on-device models, which means basic functions, like wake-word detection, are processed locally on your device. Your conversations are not constantly being sent to the cloud, which reduces the risk of unwanted eavesdropping.

2. When data does have to be sent to the cloud, it is encrypted both in transit and at rest, making it essentially unreadable if intercepted.

3. If those options are not suitable in your case, you can opt out. Having the choice not to have interactions stored or used for improvements gives users control over how their data is handled.

4. Anonymization is another way to protect privacy. Companies often anonymize data so it cannot be linked back to you.

5. Companies must inform you about what kind of data they collect, how, and for what purposes. Collection also requires your consent, which can be revoked at any time.

6. In Europe, voice assistants must comply with the GDPR, and in California, the CCPA plays a similar role. Providers also undergo regular audits, which ensure compliance and surface potential security issues as early as possible.

As an experienced engineer, what do you consider to be the most challenging aspect of creating voice assistant systems? What hurdles do you face in pushing the technology further?

Even with advancements like GPT-3.5 and ChatGPT, we still have not seen a popular smart speaker seamlessly integrated with robust natural language understanding and dialogue management.

Besides that, there are also device-specific problems such as wake-word activation and “long tail” ASR problems, which include specific requests like niche artists, uncommon accents, contextual references, and background noise. There is also the need to maintain a conversation state across multiple turns and to take the device state into account in context.

Beyond that, a few limitations stand out:

1. Contextual Understanding: Assistants often struggle with follow-up questions or maintaining context over longer conversations. While large language models (LLMs) have made some progress, there is still a noticeable gap in fully grasping and maintaining context throughout interactions.

2. Complex Task Completion: Voice assistants excel at simple commands, but they often fail with multi-step tasks or those that require an integration of multiple services. This becomes particularly tricky when dealing with scenarios like smart home controls or music management, which need a deeper level of context and coordination.

3. Emotional Intelligence: The system’s ability to recognize and respond appropriately to user emotions is still limited. There is still a long way to go in this area, and I believe multimodal models are the solution.

4. Proactivity: Most current voice assistants are reactive rather than proactive, so they only respond when prompted. Moving toward proactivity is a challenge because it raises issues around privacy, user acceptance, and ethical concerns about when and how these assistants should offer help.

There are always improvements to be made. What are the best practices you follow or recommend to tackle the current limitations of voice assistant technology?

  • For ASR, it is always beneficial to collect more data for pretraining and to label more domain-specific data for supervised learning. Focus on data quality, robust data collection processes, and metrics that identify performance gaps.
  • Integrate “memory” to store user preferences and knowledge about users, enhancing personalized interactions.
  • Integrate proprietary and custom-developed LLMs for handling chit-chat and informational questions.
  • Use LLM function calling to bridge the gap between existing business logic and the LLM’s ability to understand context (a minimal sketch follows this list).
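
As the last point suggests, function calling lets the model choose one of the assistant’s existing business-logic functions and fill in its arguments. The sketch below stubs out the model itself, so no specific vendor SDK is assumed; the tool schema and names are hypothetical.

```python
# Minimal sketch of LLM function calling: the app declares its business-logic
# functions as a schema, the model (stubbed out here) picks one and fills in
# arguments, and the app executes the call. `fake_llm` stands in for a real model.

def set_alarm(time: str) -> str:
    return f"Alarm set for {time}."

TOOLS = {
    "set_alarm": {
        "function": set_alarm,
        "description": "Set an alarm at the given time.",
        "parameters": {"time": "string, e.g. '07:00'"},
    }
}

def fake_llm(user_text: str, tool_schema: dict) -> dict:
    """Stand-in for a function-calling LLM: returns the chosen tool and its arguments."""
    return {"tool": "set_alarm", "arguments": {"time": "07:00"}}

def handle(user_text: str) -> str:
    decision = fake_llm(user_text, {name: t["description"] for name, t in TOOLS.items()})
    tool = TOOLS[decision["tool"]]["function"]
    return tool(**decision["arguments"])

print(handle("Wake me up at seven tomorrow"))  # Alarm set for 07:00.
```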

If we focus on these areas, we can make major strides in addressing limitations, enhancing the functionality, and simultaneously improving the user experience of voice assistants.

Author

  • Haziqa Sajid

    Haziqa, a data scientist and technical writer, loves to apply her technical skills and share her knowledge and experience through content.
