How multimodal AI is transforming user experience

We are approaching an inflection point in how users experience AI. Since the release of ChatGPT in November 2022, most of us have had more than enough time to become well acquainted with the interactive AI chatbots currently used by a majority of B2C companies (58%).

Now, with the benefits of this technology well attested across a range of sectors, the focus for many companies is to up their chatbot game. So what does this look like?

  • Firstly, companies can increase the sophistication of their chatbots by fine-tuning the GenAI model they use to provide customized responses that reflect their brand’s unique identity.
  • Secondly, companies can work on improving the accuracy of chatbot responses to customer queries by integrating the AI model with more company-specific data, which allows the chatbot to provide less generic responses and help the customer access more helpful information.
  • Thirdly, many organisations are adopting chatbots for internal workflows, which can then be used to increase the productivity of employees, and automate HR actions such as holiday leave requests or onboarding protocols.

These next steps demonstrate the broad range of applications for text-based chatbots, despite the limited mode of interaction that they offer users. But with current and ongoing developments in Generative AI, this text-based mode of operation is unlikely to be a limitation for much longer.

Multimodality is rapidly emerging as a powerful new trend in the development of Generative AI, facilitating a more seamless, intuitive, and accessible communicative experience for chatbot users. Voice-enabled interfaces are already in wide use: according to a Master of Code publication, 74% of consumers use mobile voice assistants at home, and 72% of US consumers had interacted with voice assistants in a business context as of 2024.

So far, there seems to be a positive reception for voice-enabled applications, with 71% of prospective customers preferring voice queries over typing.

The potential of multimodality is also recognised by leading players in the tech industry. OpenAI, for example, has put a strong focus on audio and visual capabilities in its latest Generative AI model, GPT-4o (with the ‘o’ standing for omni). In an official demo video, company representatives showcased the model’s new audio and video capabilities, which give it a multimodal agility that is expected to make human-computer interaction significantly easier and more pleasant.

According to Christian Jaubert, Senior Associate at the Silicon Foundry, an innovation consultancy firm, the model’s new multimodal capabilities could be a game-changer for business productivity as well as a big advantage for the general public.

“With the release of GPT-4o to private beta users, we’re already seeing various use cases for this technology come up. I think it’s very exciting because the interactive [spoken] chat feature provides a far more seamless and natural mode of interaction. For humans, I think speech is very much something that we all prefer to interact with. So this new speech-to-speech modality opens up a very wide basis of how we’re actually interacting with our phone and interacting with these models. So that’s going to be quite exciting for the general public, and I think from a business context, it’s also going to really increase productivity.” ~ Christian Jaubert, Senior Associate at the Silicon Foundry

The model is currently being rolled out to its first round of private beta users, and is soon expected to be freely available to non-subscribers. The free access that OpenAI will provide to these new capabilities is also likely to make multimodality the new norm for chatbots, and increase user expectations for everyday interactions with AI.

Evolving communication trends

It is widely perceived that speaking is the most natural and intuitive way for humans to interact, not only with each other but also with technology. From an evolutionary perspective, spoken language certainly has the stronger claim to being intuitive, since our ancestors used it to communicate long before written language emerged.

Nevertheless, with the dawn of mainstream PCs (personal computers), mobile devices, and social media, text-based communication experienced a spike in popularity as people discovered the flexibility and casual ease of texting and posting. This was further enhanced by the invention of predictive text and auto-correct that most of us have now become reliant on to tweak our written communications.

However, even within the limited domain of texting, the widespread adoption of alternative communicative modalities, including emojis, GIFs, and voice notes, is testament to the inherent limitations of written language for everyday human interaction.

Furthermore, within our post-Covid society, there is a particular need for multimodality within digital communication interfaces due to the lasting impact of the lockdown. This has led several organisations, and individuals more generally, to embrace the advantages of remote, digitally-enabled interactions on a more permanent basis.

The next step in this trend could be the widespread adoption of multimodal GenAI applications. This looks particularly likely given our society’s current preference for hybrid communication channels, a preference shaped both by a heightened appreciation for in-person interaction following the restrictions of the pandemic and by recognition of the lasting value of digital interfaces such as online meeting rooms, shared digital workspaces, and Generative AI tools like chatbots and virtual assistants.

Below, we consider the numerous benefits of multimodal AI for these digital interfaces, focusing particularly on the development and the applications of conversational AI and interactive video capabilities.

Conversational AI

Conversational AI is more or less exactly what it sounds like: the ability of AI models to maintain a spoken conversation with users. It has many advantages, including greater flexibility and easier access to information.

For example, through spoken interactions, you can now use AI models to conveniently access information, record your train of thought, or support your creative and brainstorming processes on the go, e.g. while driving, at the gym, or engaging in pretty much any manual activity.

Michael Zagorsek, COO of the voice AI and speech recognition company SoundHound AI, draws particular attention to the potential of conversational AI models in the automotive industry, where they can facilitate a far richer driving experience while retaining the flexibility of traditional AI models, in that they can be customized to embody the character of different brands.

“In the automotive industry, SoundHound Chat AI is setting a new standard by delivering customized voice assistants that offer hands-free access to a wide range of capabilities from car controls, media and communications features, real-time updates, and ChatGPT responses. And we’re able to do this while also aligning with each brand’s unique identity, strengthening brand-to-consumer relationships.” ~ Michael Zagorsek, COO of SoundHound AI 

Conversational AI also has great potential for improving the equity of our society by increasing access to activities, jobs, and opportunities for those who face barriers to written communication, such as people with limited literacy, or people living with neurological or neuromuscular disorders such as Parkinson’s disease, multiple sclerosis, and muscular dystrophy, which can affect their physical ability to type or text.

Additionally, many people, including Zagorsek, view speaking as a more natural and intuitive mode of communication than written language, which adds to the appeal of conversational AI for anyone, regardless of how proficient they are in written language.

“At SoundHound AI, we believe speaking is the most natural way to interact with products and devices. This new era of conversational AI empowers consumers with a more intuitive and enjoyable user experience that is faster, more accurate, and accessible.” ~ Michael Zagorsek, COO of SoundHound AI 

Zagorsek also highlights the potential of conversational AI to transform people’s everyday use of technology, given the relative simplicity of fitting digital devices with small microphones.

“We foresee voice assistants becoming a standard feature not only in automobiles, but in all products, with the inclusion of a small microphone holding the potential to drive a voice ecosystem of billions of innovative, versatile devices.” ~ Michael Zagorsek, COO of SoundHound AI 

The breakthrough of direct speech-to-speech technology

While conversational AI promises to provide a seamless and simpler way for users to interact with AI models, the technologies required to facilitate this behind the scenes are actually quite complex.

In an interview with the AI Journal, Christian Jaubert, Senior Associate at the Silicon Foundry, highlighted the complexity of creating a seamless conversational AI experience for users, explaining how speech-to-speech technology used to have to be broken down into separate workflows (speech-to-text and text-to-speech), which would increase the latency of outputs.

“So to give a quick breakdown on the history of speech-to-speech technology, it has traditionally involved two key components or workflows. The first is speech-to-text, and that’s behind the automatic speech recognition technology (ASR) that we’ve seen. This ASR technology has seen particularly strong developments in the past ten years. And then the second component is text-to-speech, which is able to take a transcription and put it into actually synthesized speech from that.” ~ Christian Jaubert, Senior Associate at the Silicon Foundry

This is now changing, thanks to breakthroughs in Generative AI technologies which have enabled direct speech-to-speech processing capabilities.

“The explosion of generative AI in the past two years is something that’s really transformed speech-to-speech technology. Now, with the current abilities of Generative AI, we have a middle layer that can be the thinking agents. LLMs can actually take the text, perform complex processing and analysis tasks throughout the text, and then hand that over to the text-to-speech model to actually synthesize its findings into a spoken word. So Gen AI has really allowed for a transformational change where we actually have complex thought that is available to users directly from speech-to-speech AI.” ~ Christian Jaubert, Senior Associate at the Silicon Foundry
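The three-stage workflow Jaubert describes (speech-to-text, an LLM "middle layer", then text-to-speech) can be pictured with a minimal Python sketch. Everything in it is a stand-in: the stage functions are hypothetical stubs rather than calls to any real speech or LLM service, audio is represented as a tagged string, and the latency figures are invented purely to show how per-stage delays add up in a cascaded design.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Stage:
    name: str
    run: Callable[[str], str]
    latency_ms: float  # assumed, illustrative figure (not a measurement)


def speech_to_text(audio: str) -> str:
    """Hypothetical ASR stub: 'audio' is just a string tagged with 'audio:'."""
    prefix = "audio:"
    return audio[len(prefix):] if audio.startswith(prefix) else audio


def language_model(text: str) -> str:
    """Hypothetical stub for the LLM middle layer that processes the transcript."""
    return f"You asked: {text.strip()}"


def text_to_speech(text: str) -> str:
    """Hypothetical TTS stub: wraps the reply back into the 'audio:' tag."""
    return f"audio:{text}"


CASCADE: List[Stage] = [
    Stage("speech-to-text", speech_to_text, 300.0),
    Stage("language model", language_model, 700.0),
    Stage("text-to-speech", text_to_speech, 250.0),
]


def run_pipeline(stages: List[Stage], payload: str) -> Tuple[str, float]:
    """Pass the payload through each stage in order and sum the stage latencies."""
    total_ms = 0.0
    for stage in stages:
        payload = stage.run(payload)
        total_ms += stage.latency_ms
    return payload, total_ms


if __name__ == "__main__":
    reply, latency = run_pipeline(CASCADE, "audio:what is the weather")
    print(reply)
    print(f"end-to-end latency: {latency} ms")
```

Because each stage waits for the previous one to finish, the end-to-end delay is the sum of all three stages, which is the latency cost of splitting the workflow that the article refers to.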

This technological breakthrough is instrumental for various applications of conversational AI, such as translation, where real-time speech-to-speech processing is required to provide a smooth communicative interface between the speakers.

Limitations of conversational AI

Despite the vast potential of conversational AI to improve user experience, some experts are questioning whether speech is truly the future of AI interaction in all contexts.

Sam Zheng, CEO of DeepHow, a leading provider of employee training and onboarding solutions, argues that conversational AI alone is not the optimal solution for educational applications. Instead, he promotes the use of video-based SOPs (standard operating procedures), which can provide users with a more holistic and detailed learning experience.

“While AI-powered voice assistants can provide real-time support, they currently struggle with delivering precise and contextually relevant responses. Integration into existing workflows requires substantial customization, and for detailed procedural instructions, video-based SOPs remain superior due to their visual and textual guidance.” ~ Sam Zheng, CEO of DeepHow

An additional consideration for the broader use of voice-based AI applications is the range of environments in which they can be used effectively. On public transport and in other noisy public settings, conversational AI would be difficult to use and potentially disruptive: the use of phone speakers on public transport is already recognized as a source of irritation for fellow passengers. Headphones offer an easy fix, but their overuse in public places (especially noise-cancelling ones) is a significant driver of the ‘smartphone zombie’ problem, which is becoming more and more of a threat to public safety.

Interactive video capabilities

Another very exciting development for multimodal AI is the combination of conversational AI with video technology, resulting in a holistically interactive experience for users of AI applications. Essentially, this combines the benefits of YouTube how-to videos with a standard chatbot or customer support service.

For example, digital solutions provider, TechSee, provides multisensory AI agents to improve customer support systems.

“TechSee is redefining AI-driven support with its computer vision technology, offering distinct advantages over how-to YouTube videos and traditional call centers. TechSee’s innovative platform, Sophie AI, seamlessly integrates conversational AI, computer vision, and generative AI, creating a unique multimodal approach. Unlike static YouTube videos or call centers, Sophie AI is the world’s first autonomous customer service agent capable of seeing, hearing, understanding, and guiding customers in real time just like a human.” ~ Eitan Cohen, CEO and Co-founder of TechSee

In an interview with the AI Journal, Eitan Cohen, CEO of the company, explained the customer-focused incentives for interactive video-based AI for troubleshooting applications, highlighting its key benefits.

“The multimodal platform (text, voice and crucially visual) is unique in the industry and is essential to rapid resolution times and truly understanding customers’ needs. By being able to see, hear, understand, and guide customers through installations and troubleshooting, TechSee is providing human-like service without the typical drawbacks. Recognizing that humans process images 60,000 times faster than text, TechSee employs visual storytelling by enhancing customer-captured visuals with AR guidance to bridge the gap between what customers see and understand. Moreover, typically any physical product requires visual perception to interact with it and understand its status (e.g. lid is on/off, cable is connected or not connected, button is located here, etc.). To try and explain this in text or with voice only is very difficult for problem analysis or explaining operating instructions. Think about it like trying to change the cartridge on your printer with someone trying to guide you with their eyes closed.” ~ Eitan Cohen, CEO and Co-founder of TechSee

As testament to the efficacy of interactive video-based AI in real-life applications, TechSee’s multimodal troubleshooting agent has resulted in an 89% increase in remote resolution rates for businesses and customer service agencies.

Another key use case for interactive video-based AI is in training and employee onboarding, which is the focus of DeepHow, an employee training and onboarding solutions provider.

“At DeepHow, we have made significant advancements in multimodal AI by integrating video, text, and interactive elements into comprehensive Standard Operating Procedures (SOPs). Our AI technology transforms raw video footage into structured, easily navigable training materials, enhancing user experience by making complex procedures more accessible and understandable. This approach allows users to interact with SOPs dynamically and intuitively, leading to improved knowledge retention and application.” ~ Sam Zheng, CEO of DeepHow

Zheng also points to how interactive video-based AI is helping to democratize access to knowledge and resources for neurodiverse individuals who might otherwise struggle to process and understand information presented in a single form, such as text or speech alone.

“Multimodal AI has tremendous potential to democratize knowledge by making high-quality training materials widely accessible. It creates diverse and engaging content that caters to different learning styles, helping reach individuals who struggle with traditional methods. This inclusivity bridges skill gaps by providing equal learning opportunities. Moreover, multimodal AI automates knowledge dissemination, ensuring up-to-date information is readily available across organizations. It supports continuous learning with tools for real-time assessment, feedback, and personalized learning paths. By breaking down barriers to knowledge access and enhancing training programs, multimodal AI plays a vital role in building a more skilled and capable workforce.” ~ Sam Zheng, CEO of DeepHow

Overall, when it comes to creating engaging educational/training resources, interactive video-based AI offers several advantages over other communicative modalities. These include:

  • Visual Learning: Videos provide clear visual demonstrations of procedures, making complex tasks easier to understand than text alone.
  • Engagement: Videos are inherently more engaging, helping to maintain viewer attention and improve information retention.
  • Consistency: Video SOPs ensure consistent delivery of instructions, reducing the risk of miscommunication and variation in understanding.
  • Accessibility: Videos can be easily accessed and replayed, allowing users to learn at their own pace and revisit material as needed.

Challenges unique to interactive, video-based AI

The vast benefits of interactive video-based AI applications, such as those offered by TechSee and DeepHow, come with some inherent challenges and a heavier workload. Zheng, for example, lists some of the key challenges involved in the production of video-based SOPs:

  • Content Creation: Capturing accurate video footage of procedures can be challenging.
  • Editing and Structuring: Creating a coherent and structured SOP from raw footage requires time and expertise.
  • Annotation: Adding instructional elements like annotations and captions is essential for clarity.
  • Distribution: Ensuring easy access to SOPs for all users is critical.

However, he also points out how AI can streamline these tasks:

  • Automated Editing: AI can identify key steps and automatically structure content.
  • Annotation and Transcription: AI generates annotations and transcriptions, enhancing video accessibility and informativeness.
  • Search and Retrieval: AI-powered search capabilities allow users to quickly locate relevant SOPs and specific segments, improving usability.
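To make the "Search and Retrieval" point concrete, here is a minimal Python sketch of searching timestamped video segments by their transcripts. It is an illustration under stated assumptions, not a description of DeepHow's system: the `Segment` structure, the sample transcripts, and the simple word-overlap scoring are all hypothetical, and a production system would more plausibly use semantic (embedding-based) search.

```python
import re
from dataclasses import dataclass
from typing import List, Set


@dataclass
class Segment:
    sop_title: str   # hypothetical SOP the segment belongs to
    start_s: int     # offset into the video, in seconds
    transcript: str  # text assumed to come from automatic transcription


def tokenize(text: str) -> Set[str]:
    """Lowercase the text and return its set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def search(segments: List[Segment], query: str, top_k: int = 3) -> List[Segment]:
    """Rank segments by how many query words appear in their transcript."""
    query_words = tokenize(query)
    scored = []
    for seg in segments:
        overlap = len(query_words & tokenize(seg.transcript))
        if overlap:
            scored.append((overlap, seg))
    # Sort on the score only; Python's sort is stable, so ties keep input order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [seg for _, seg in scored[:top_k]]


SEGMENTS = [
    Segment("Printer maintenance", 0, "Switch off the printer and open the front cover"),
    Segment("Printer maintenance", 42, "Remove the empty cartridge and insert the new cartridge"),
    Segment("Router setup", 15, "Connect the cable to the router and wait for the light"),
]

if __name__ == "__main__":
    for hit in search(SEGMENTS, "replace cartridge"):
        print(f"{hit.sop_title} @ {hit.start_s}s: {hit.transcript}")
```

Returning the segment's start offset rather than the whole video is what lets a user jump straight to the relevant step.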

Nevertheless, further challenges could arise as video-based AI becomes more advanced. For example, Zheng points out the prospective complexity of integrating video-based AI with augmented or virtual reality technology.

“These technologies [AR/VR] can provide immersive training experiences but are currently limited by high costs and complex implementation requirements. The expense of hardware, along with the need for specialized skills to develop content, makes them less accessible for most organizations.” ~ Sam Zheng, CEO of DeepHow

The future of multimodal AI

Future developments for multimodal AI will likely follow the trajectory of traditional, text-based models, where a primary focus has been personalizing the model’s responses to different users and contexts.

Zheng, for example, highlights some of the benefits that personalized SOPs and learning resources could bring to employees and users of DeepHow’s platform.

“Currently, DeepHow’s platform provides standardized training materials that are accessible to a broad audience. While extensive personalization based on age, physical condition, or mental ability is not yet implemented, our platform’s AI capabilities allow some customization, such as adjusting content based on user feedback and performance. Future developments could include more personalized resources, leveraging AI to tailor training materials to individual needs. This could involve adapting content delivery, pacing, and complexity to suit various learning preferences and capabilities, enhancing inclusivity and effectiveness.” ~ Sam Zheng, CEO of DeepHow

Going forward, and especially following the release of GPT-4o to the wider public, it will also be interesting to monitor the development of user preferences for different communication modalities, especially given the current perception that speech is generally the preferred mode of communication over text. Indeed, in apparent contrast to this widespread belief, our own research found that a majority of our LinkedIn followers (71%) preferred to interact with AI chatbots through text rather than speech (23%).

This finding could be more reflective of the current limitations of voice/video-based AI applications, which still have some way to go before they can provide a truly seamless user experience. Nevertheless, it could also be indicative of an innate human reluctance or conservatism in interacting with what are essentially just algorithms in the same way that we interact with each other.

Either way, the potential for multimodal AI remains vast and exciting for those willing to try it, and could truly transform user experience for those with impaired abilities or neurodiverse conditions.
