Article · Announcements

Introducing Deepgram’s Voice Agent API

By Josh Fox
Published Sep 19, 2024 · Updated Sep 25, 2024 · 7 min read

tl;dr:

  • Today we’ve officially launched the newest addition to our Voice AI Platform, the Deepgram Voice Agent API: a unified voice-to-voice API that enables natural-sounding conversations between humans and machines.

  • Powered by the industry’s fastest, most performant speech recognition and voice synthesis models, our voice agent stack listens, thinks, and speaks naturally and in real time.

  • Experience our voice agent API yourself with our interactive demo or be among the first to build it into your product today! Sign up now to get started and receive $200 in credits absolutely free!

Meet the Deepgram Voice Agent API: real-time conversational AI in one easy-to-use API

Deepgram is excited to unveil the latest addition to its voice AI platform: the Deepgram Voice Agent API, a unified voice-to-voice API for AI agents that enables natural-sounding conversations between humans and machines. With one powerful API, we enable enterprises and developers to easily create LLM-powered AI agents that listen, think, and speak with the same intelligence and emotive quality as a person.
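To make the shape of a unified voice-to-voice API concrete, here is a minimal sketch of a client session in Python using the third-party websockets package. The endpoint URL, auth header, message type, and settings schema shown are illustrative assumptions, not the documented contract; consult the Voice Agent API docs for the exact details.

    # A minimal sketch of a voice agent session. URL, headers, and the
    # message schema below are assumptions for illustration only.
    import asyncio
    import json
    import websockets

    AGENT_URL = "wss://agent.deepgram.com/agent"  # assumed endpoint

    def play_audio(chunk: bytes) -> None:
        """Placeholder: write agent speech to your audio output device."""
        ...

    async def run_session(api_key: str) -> None:
        headers = {"Authorization": f"Token {api_key}"}  # assumed auth scheme
        async with websockets.connect(AGENT_URL, extra_headers=headers) as ws:
            # Configure the agent once at the start of the session.
            await ws.send(json.dumps({
                "type": "SettingsConfiguration",  # assumed message type
                "audio": {"input": {"encoding": "linear16", "sample_rate": 16000}},
                "agent": {
                    "listen": {"model": "nova-2"},                           # STT
                    "think": {"provider": {"type": "open_ai"}, "model": "gpt-4o"},
                    "speak": {"model": "aura-asteria-en"},                   # TTS
                },
            }))
            # Microphone audio would be streamed to the socket elsewhere;
            # here we just consume what comes back: binary frames are agent
            # speech, text frames are JSON events (transcripts, state, etc.).
            async for message in ws:
                if isinstance(message, bytes):
                    play_audio(message)
                else:
                    print(json.loads(message))

    asyncio.run(run_session("YOUR_DEEPGRAM_API_KEY"))

One socket carries both directions of the conversation: the client streams raw audio up, and the server streams synthesized speech and structured events back down.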

Powered by the industry’s fastest, most performant speech recognition and voice synthesis models, our voice agent stack:

  • Listens, thinks, and speaks naturally and in real time (seriously). 

  • Gracefully handles interruptions with first-of-its-kind end-of-thought (EOT) detection modeling.

  • Maximizes developer control, allowing builders to choose between open-source, closed-source, and bring-your-own LLMs (see the configuration sketch after this list).

  • Scales to serve your production workloads, at a cost that makes that scale sustainable.

  • Meets your security and data privacy requirements with flexible deployment modes (including self-hosted options for VPC and on-premises).
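On the LLM-control point, swapping models is a configuration change rather than a re-architecture. A sketch, reusing the illustrative settings shape from the session example above (the provider fields here are assumptions, not the documented schema):

    # Two illustrative "think" configurations: a hosted closed-source model
    # versus a bring-your-own endpoint. Field names are assumptions.
    hosted_think = {
        "provider": {"type": "open_ai"},
        "model": "gpt-4o",
        "instructions": "You are a concise support agent for Acme, Inc.",
    }

    byo_think = {
        # Point the agent at an LLM you operate, e.g. a fine-tuned
        # open-source model served behind an OpenAI-compatible API.
        "provider": {"type": "custom", "url": "https://llm.internal.example/v1"},
        "model": "llama-3-acme-finetune",
    }

Either dictionary would slot into the agent’s "think" settings, which is what lets builders trade off capability, cost, and data governance per use case.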

For a glimpse of what can be built with our new voice agent API, check out the videos below. In the first, we demonstrate a customer support use case where the AI voice agent leverages next-gen end-of-speech prediction to handle long pauses as a phone number-based ID is spoken. The agent delivers a responsive, natural conversational flow and provides a high-quality customer interaction involving company and product-specific context:

In the second video, we showcase a drive-thru agent’s robust performance in accurately understanding a human speaker in a noisy outdoor environment. The agent demonstrates the ability to perform complex action-taking even when it’s interrupted by the speaker:

We’re also excited to share an early proof of concept using our new voice agent API, and we encourage you to try it firsthand with this interactive demo.

Enabling the future of Voice-First AI

To the average consumer, terms like “voicebot” and “AI agent” are likely to conjure memories of frustrating interactions with traditional IVRs, text-based chatbots, and personal voice assistants like Siri and Alexa that fail to complete even the simplest of tasks. However, recent advances in generative AI technology have given us the tools we need to finally build engaging, human-like voice agents that have the potential to transform the business world:

  • Speech-to-text (STT) - Low-latency transcription with superhuman accuracy, like Deepgram’s Nova-2 STT delivers, for converting the human’s spoken words into input for the AI agent.

  • Text-to-speech (TTS) - Low-latency, natural-sounding voice synthesis, like Deepgram’s Aura TTS provides, that delivers human-like spoken output from the AI agent back to the human.

  • Large language models (LLMs) - Powerful, responsive generative AI models, like Llama 3 and GPT-4, that serve as the brains of the modern conversational AI tech stack and are used for chat completion and task execution.
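Wired together naively, those three components form a turn-based loop. A sketch, with placeholder functions standing in for your STT, LLM, and TTS clients:

    # The naive cascade: wait for a full utterance, transcribe it, prompt
    # the LLM, then synthesize the reply. Each stage blocks on the last.
    def stt_transcribe(audio: bytes) -> str: ...   # placeholder STT client
    def llm_complete(prompt: str) -> str: ...      # placeholder LLM client
    def tts_synthesize(text: str) -> bytes: ...    # placeholder TTS client

    def naive_voice_turn(audio_in: bytes) -> bytes:
        transcript = stt_transcribe(audio_in)   # e.g. Nova-2
        reply_text = llm_complete(transcript)   # e.g. Llama 3 or GPT-4
        return tts_synthesize(reply_text)       # e.g. Aura

Because every stage waits for the previous one to finish, latency stacks up and interruptions have nowhere to land, which is exactly what the engineering challenges below set out to solve.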



To build the voice-powered agentic AI future, developers must integrate these key components, but there’s much more to consider beyond simply linking these elements in a pipeline and orchestrating the handoffs between them. Crafting engaging, enterprise-grade voice agents requires world-class engineering at the model level to effectively tackle key challenges and infuse a human touch into the artificial. Key focus areas include:

  • Noisy audio: Real-world audio is messy, full of background noise and varying environmental conditions that the speech-to-text model must handle robustly.

  • Lightning-fast responses: Current response times often exceed 1.5 seconds, but latency must be brought below one second so conversations flow naturally, without awkward pauses or delays, matching the rhythm we’re accustomed to in human interactions.

  • Conversational cues recognition: Agents must adeptly navigate the subtleties of conversational cues, knowing when to pause or continue after an interruption and recognizing when a speaker has finished or intends to proceed, to enable smooth interactions with the same finesse human speakers exhibit in conversation (see the barge-in sketch after this list).

  • Contextual intelligence: Voice agents need advanced understanding capabilities, naturally comprehending the context behind conversations, to respond with the most appropriate information and vocal expressiveness that feels genuine and empathetic, bringing a human touch to digital conversations.

  • Action taking: AI agents must understand intent and take action, from scheduling appointments to sending follow-up information, streamlining tasks and enhancing productivity.

  • Controllability: As the LLM landscape evolves rapidly, there isn't a universal model that fits all needs. Agent builders require flexible options, allowing them to select the optimal LLM or fine-tuned, task-specific language model that best aligns with their use case in terms of performance and cost efficiency.
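To make the conversational-cues point concrete, here is a sketch of client-side barge-in handling. The event name is an assumption for illustration; the idea is simply that when the user starts speaking, the client must stop and flush agent playback immediately:

    import json

    class Playback:
        def stop_and_flush(self) -> None:
            """Placeholder: halt the speaker and drop buffered agent audio."""
            ...

    def handle_event(raw_message: str, playback: Playback) -> None:
        event = json.loads(raw_message)
        if event.get("type") == "UserStartedSpeaking":  # assumed event name
            # Yield the floor: stop mid-utterance so the human can interrupt,
            # rather than talking over them until buffered audio drains.
            playback.stop_and_flush()

Without this kind of handling, the agent keeps talking through the interruption, which is precisely the behavior that makes traditional voicebots feel robotic.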


Taken together, WORDS + CONTEXT + TIMING are what make the effortless back-and-forth exchange of information and ideas comprising human conversation possible. And for the first time in history, we now have all the ingredients we need to truly replicate it in an artificial intelligence system.

At Deepgram, we've spent nearly a decade building, deploying, and managing thousands of voice AI models to process billions of hours of conversational audio in production. We've applied countless insights gained from these experiences into the development of our new voice agent API, optimizing both the models and system architecture to deliver exceptional performance that sets a new standard in human-machine interaction.

Getting started

Several participants in our Enterprise Voice AI Accelerator Program are already nearing the launch of their first AI voice agents built using our new API, and we’re excited to share their progress in an upcoming article—so stay tuned! If you’re facing challenges building, deploying, or scaling real-time voice agents, we can help. Our new API is now available for early access to select customers. Fill out the request form below to start developing enterprise-grade AI voice agents today!


If you have any feedback about this post, or anything else regarding Deepgram, we'd love to hear from you. Please let us know in our GitHub discussions, Discord community, or contact us to talk to one of our product experts for more information today.
