Turning Voice into Infrastructure: Building Enterprise-Scale Multimodal AI Products

For years, voice, video, and conversational data existed at the edges of enterprise analytics — recorded and stored, but rarely treated as core business data. What changed was not a single breakthrough, but the emergence of leaders who could turn multimodal AI into reliable, scalable products.

Shankar Krishnan is one of them. At Amazon Web Services, he leads product development for Amazon Transcribe and Bedrock Data Automation (audio), and previously helped launch Chime SDK Call Analytics. These platforms combine speech recognition, multimodal analytics, and large language models to support enterprise workflows at scale, across industries from healthcare and finance to telecom and public safety.

We spoke with Shankar about how multimodal AI became infrastructure, what it takes to productize generative models responsibly, and why enterprise AI succeeds or fails long before models reach production.

Disclaimer: The views and opinions expressed in this interview are solely those of Shankar Krishnan and do not necessarily reflect the official policy or position of Amazon Web Services, Inc. or its affiliates. This content is provided for informational purposes only and should not be construed as representing AWS’s official stance on any matter discussed.

Shankar, you lead the development of a full portfolio of GenAI products: from Amazon Transcribe to Bedrock Data Automation. Could you describe this product domain as a whole and share how you see the evolution of enterprise voice solutions in the era of generative AI? What exactly has made voice interfaces mature enough for large-scale enterprise adoption?

I think of this domain less as “voice AI” and more as conversational and multimodal intelligence. At the core, you’re dealing with human communication — speech, text, video — and the challenge is turning that into structured, reliable signals businesses can act on. For years, enterprises used voice technologies in isolated pockets, mostly customer support. What changed is that companies now want insights from all conversations — calls, medical consultations, internal meetings, compliance discussions.

That shift required maturity across several dimensions. First, economics. The cost of speech and language models dropped significantly, changing ROI calculations. Second, model quality improved — not just accuracy, but latency and robustness in real-world conditions. Third, features like multilingual support, speaker diarization, and multimodal reasoning reached a level where they could support global enterprises. Finally, and this part is often underestimated, enterprises became organizationally ready to treat conversational data as core business data.

You helped launch Bedrock Data Automation to extract valuable insights from audio. What strategic role does this product play within the AWS ecosystem and the enterprise market as a whole? Why has the extraction of GenAI-powered insights become critical right now?

The fundamental problem is that most enterprise data is unstructured. Developers want to turn existing data into structured outputs to build GenAI-powered applications. A common example is customer service, where enterprises aim to extract insights from support calls to identify coaching opportunities and improve customer experience. Historically, extracting custom insights from multimodal conversations required orchestrating multiple task-specific AI models alongside large language models. This became even more complex when insights had to be generated across documents, images, audio, and video. Depending on the use case, developers also needed to fine-tune models and implement guardrails to manage hallucination risk. Processing unstructured data at scale adds further complexity due to data variety, strict accuracy and latency requirements, and the lack of standardized processing tools.

Bedrock Data Automation (BDA) addresses this challenge directly. It provides a fully managed service that enables developers to extract insights from multimodal content without building complex workflows. With BDA, GenAI-powered solutions can be built in days rather than months. Developers no longer need to manage model fine-tuning or hallucination risks associated with LLMs, making it easier to process multimodal data quickly and at scale.
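To make this concrete, here is a minimal sketch of what calling BDA's managed API can look like with the AWS SDK for Python (boto3). The bucket paths and project ARN are placeholders, and the set of required parameters can vary by SDK version, so treat this as an illustration rather than a reference:

```python
import time
import boto3

# Placeholder S3 locations and project ARN -- substitute your own.
INPUT_URI = "s3://example-bucket/calls/support-call.wav"
OUTPUT_URI = "s3://example-bucket/bda-output/"
PROJECT_ARN = "arn:aws:bedrock:us-east-1:123456789012:data-automation-project/example"

bda = boto3.client("bedrock-data-automation-runtime")

# Start an asynchronous job that analyzes the audio file.
job = bda.invoke_data_automation_async(
    inputConfiguration={"s3Uri": INPUT_URI},
    outputConfiguration={"s3Uri": OUTPUT_URI},
    dataAutomationConfiguration={"dataAutomationProjectArn": PROJECT_ARN},
)

# Poll until the job completes; results are written as JSON to OUTPUT_URI.
while True:
    status = bda.get_data_automation_status(invocationArn=job["invocationArn"])
    if status["status"] in ("Success", "ServiceError", "ClientError"):
        break
    time.sleep(10)

print("Job finished with status:", status["status"])
```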

On a more practical level — which technological and product innovations within BDA proved to be the most groundbreaking? What made it possible to turn complex AI integrations into a simple, scalable product that enterprise customers could actually adopt at scale?

Organizations increasingly want to extract GenAI-powered insights — such as summaries and sentiment — from voice conversations across sales, operations, and customer support. In practice, developers face recurring challenges: selecting the right models while balancing latency, cost, accuracy, and hallucination risk; optimizing prompts with RAG and existing knowledge bases; building pipelines for real-time and post-call processing; managing ML infrastructure and costs; and ensuring security and architectural compliance.

BDA was designed to remove this complexity. It enables developers to extract custom GenAI-powered insights from multimodal data with just a few clicks, using built-in multimodal analytics, natural-language prompts, and pre-packaged foundation models. Developers no longer need to manually select, fine-tune, or orchestrate models for each use case. At its core, BDA provides a fully managed inferencing API with a single price point and pre-built workflows for common enterprise scenarios. These workflows coordinate foundation models and ML components pre-trained for specific use cases, delivering high accuracy while reducing hallucination risk.

BDA also offers pretrained insights — such as sentiment and call summaries — that customers can ingest directly into their knowledge bases. All APIs follow the AWS Well-Architected Framework, ensuring secure, high-performing, and resilient infrastructure.
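As an illustration of those natural-language prompts, a custom "blueprint" pairs output fields with plain-English instructions, and BDA fills the fields from the conversation. The sketch below uses boto3 with hypothetical field names; the exact blueprint schema keys and supported modalities should be checked against the BDA documentation:

```python
import json
import boto3

bda_build = boto3.client("bedrock-data-automation")

# Illustrative blueprint: field names and instructions are hypothetical, and
# the precise schema format is defined in the BDA documentation.
schema = {
    "class": "customer_support_call",
    "description": "Custom insights extracted from a recorded support call",
    "type": "object",
    "properties": {
        "call_summary": {
            "type": "string",
            "inferenceType": "inferred",
            "instruction": "Summarize the call in two sentences.",
        },
        "customer_sentiment": {
            "type": "string",
            "inferenceType": "inferred",
            "instruction": "Overall customer sentiment: positive, neutral, or negative.",
        },
        "coaching_opportunity": {
            "type": "string",
            "inferenceType": "inferred",
            "instruction": "One concrete coaching suggestion for the agent.",
        },
    },
}

blueprint = bda_build.create_blueprint(
    blueprintName="support-call-insights",
    type="AUDIO",
    schema=json.dumps(schema),
)
print(blueprint["blueprint"]["blueprintArn"])
```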

Another of your projects, Chime SDK Call Analytics, became an important milestone in the democratization of conversational intelligence — you reduced the path from concept to production analytics from 9–12 months to just a few days. What product principles and architectural decisions made this acceleration possible?

Several product principles drove this acceleration. The first was a deliberate reduction in integration effort. Multiple AWS services — machine learning, databases, and real-time audio processing — were combined into a single workflow, eliminating the need for developers to stitch components together themselves.

The second principle was operational simplicity: scaling and managing production workloads was handled by the platform, removing the need to provision or coordinate infrastructure across services.

Third, we focused on flexibility at the product level. An easy-to-use console allowed developers to build custom workflows — such as recording a call or recording and transcribing it in real time — without complex orchestration code.

Finally, we ensured that insights were easy to consume. Developers could access analytics in real time or post-call without building additional pipelines, with all outputs stored in a structured data lake for analysis using standard BI tools like Tableau.
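For a sense of how little orchestration code this leaves to the developer, here is a hedged boto3 sketch of a media insights pipeline configuration that transcribes a call with call analytics in real time and streams the insights to Kinesis. Names and ARNs are placeholders, and the full set of processors and sinks is documented in the Chime SDK reference:

```python
import boto3

chime = boto3.client("chime-sdk-media-pipelines")

# Hypothetical names and ARNs for illustration only.
response = chime.create_media_insights_pipeline_configuration(
    MediaInsightsPipelineConfigurationName="real-time-call-analytics",
    ResourceAccessRoleArn="arn:aws:iam::123456789012:role/ChimeSdkInsightsRole",
    Elements=[
        # Transcribe the call and run call analytics in real time.
        {
            "Type": "AmazonTranscribeCallAnalyticsProcessor",
            "AmazonTranscribeCallAnalyticsProcessorConfiguration": {
                "LanguageCode": "en-US",
            },
        },
        # Stream the resulting insights to a Kinesis data stream.
        {
            "Type": "KinesisDataStreamSink",
            "KinesisDataStreamSinkConfiguration": {
                "InsightsTarget": "arn:aws:kinesis:us-east-1:123456789012:stream/call-insights",
            },
        },
    ],
)
print(response["MediaInsightsPipelineConfiguration"]["MediaInsightsPipelineConfigurationArn"])
```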

Do you have personal rules or principles that help you make complex product decisions quickly and confidently?

Sure. More than a decade of work experience has taught me to focus on three principles. First, deep customer understanding — not just one segment, but startups, mid-market, and large enterprises across industries. I look at the intensity of the pain and whether customers will actually pay to solve it. Second, I pay close attention to technology trajectories. Generative AI, multimodal reasoning, agentic workflows — these trends matter, but only if they meaningfully improve products. Third, I think carefully about pricing and go-to-market. The best technology fails if pricing doesn’t align with perceived value or if customers can’t adopt it easily.

What competencies should the next generation of AI Product Leaders develop in order to successfully build scalable GenAI platforms at the AWS level? Which skills will become “must-haves” over the next three years?

There are several core competencies AI product leaders will need. First is sound judgment around use cases — understanding which customer problems are genuinely suited for AI and resisting the urge to add AI simply because it is trending. Getting this right is critical for adoption after launch. Leaders can accelerate customer discovery by using AI tools to synthesize feedback and analyze market and competitive trends.

Second, leaders must shorten development cycles through rapid prototyping, not just writing requirements. Building and testing early reduces friction between product and engineering, speeds up launches, and enables faster iteration based on real customer feedback.

Third, product leaders need to define meaningful evaluation frameworks. This means moving beyond generic benchmarks to focus on product-relevant metrics such as latency, accuracy, and cost, supported by evaluation criteria aligned with specific use cases.

Fourth, deployment literacy is essential. Leaders must understand how models scale in production and how infrastructure choices affect performance, reliability, and cost across the AI lifecycle.

Finally, AI product leaders need broad technical fluency across AI techniques, cloud architecture, developer tooling, and data quality. The ability to connect these layers into a coherent, scalable product will define success in the coming years.

Today, we see that the Generative AI market is overheated: hundreds of products are appearing, many companies are experiencing “GenAI fatigue,” and enterprise adoption is proving more challenging than expected. Where do you see the biggest gap between hype and real business value, especially in the voice AI space?

The biggest gap in voice AI is insufficient testing and domain adaptation. Many enterprise deployments underperform not because models are weak, but because they are never properly evaluated or fine-tuned on real, company-specific data. In healthcare, for example, a voice agent that hasn’t been adapted to clinical language will quickly fail in production. The same applies to financial services, where real-time AI assistants must work across different customer profiles, regulatory contexts, and languages.

In practice, enterprises need a disciplined three-step approach: first, evaluate models on real-world data using clear metrics; second, fine-tune them with domain- and company-specific data; and third, choose the right deployment strategy — on-premises, private cloud, edge, or public cloud — to meet performance, latency, and security requirements. Skipping any of these steps is where the gap between hype and real value usually appears.
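To ground the first step, here is a small, self-contained example of one such metric: word error rate (WER), the standard accuracy measure for transcription. The sample sentences are invented; in a real evaluation the references would be human-verified, domain-specific transcripts:

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + insertions + deletions) / reference length,
    computed with the standard edit-distance dynamic program."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(
                dp[i - 1][j] + 1,        # deletion
                dp[i][j - 1] + 1,        # insertion
                dp[i - 1][j - 1] + cost, # substitution or match
            )
    return dp[-1][-1] / max(len(ref), 1)

# A clinical term mis-recognized by a non-adapted model: 1 error in 5 words.
print(word_error_rate("patient denies dyspnea on exertion",
                      "patient denies display on exertion"))  # 0.2
```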

At the same time, the market is seeing competition between large LLM providers, cloud providers, and conversational AI startups promising “magical” quality and rapid deployment. Will the future of Conversational AI belong to giants capable of building end-to-end platforms, or to niche players who can deliver targeted industry solutions faster?

I expect large platform providers and niche players to coexist for the foreseeable future. Frontier LLMs offer strong capabilities — reasoning, multilingual support, and multimodal processing — but often fall short on highly domain-specific problems. For example, extracting insights from doctor–patient conversations in healthcare requires models trained on specialized data and workflows. As a result, enterprises may combine general-purpose LLMs with niche solutions tailored to specific domains. Niche providers compete through domain-trained models, specialized features, hands-on support, smaller and more cost-effective architectures, low-latency inference, and flexible deployment across private cloud, edge, or public environments.

While large LLMs will continue to improve, niche players will need to focus on the right domains, move faster than larger platforms, and continuously collect high-quality domain data to stay relevant.
