Voice Agents
Last updated: 2026-04-06
Quick answer: Voice stacks add real-time constraints and higher ambiguity; treat speech as another modality with stricter approval for consequential actions.
Definition
A voice agent is an agentic workflow whose primary interface is spoken: automatic speech recognition (ASR) for input, an LLM or policy layer for reasoning, and text-to-speech (TTS) or streaming audio for output. Deployments include telephony, smart devices, and app-based assistants, typically with real-time turn-taking and barge-in handling (letting the user interrupt the agent mid-reply).
Why it matters
Latency and error profiles differ from text chat: users interrupt mid-reply, background noise misleads ASR, and spoken confirmations feel binding even when the system misheard. Permission and approval design must assume mishears and impersonation.
When to use
Use voice when hands-free or speed of utterance matters—field ops, accessibility, drive-time tasks—and when you can keep high-risk actions behind explicit, durable confirmation (not a casual “yeah”).
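One way to make a confirmation "durable" rather than casual is to require the user to speak back a specific phrase and to reject filler acknowledgements. This is a hedged sketch of that idea; the function name, the filler-word list, and the example phrase are all illustrative assumptions, not a prescribed design.

```python
# Hypothetical gate for consequential actions: require an explicit spoken
# phrase and reject casual acknowledgements like "yeah" or "ok".

CASUAL_ACKS = {"yeah", "yep", "ok", "okay", "sure", "uh huh"}

def is_durable_confirmation(utterance: str, expected_phrase: str) -> bool:
    """True only if the user repeated the expected phrase, not a filler ack."""
    said = utterance.strip().lower()
    if said in CASUAL_ACKS:
        return False  # a casual "yeah" never authorizes a high-risk action
    return said == expected_phrase.lower()

# Example: a transfer prompt asks the user to say the amount back.
assert not is_durable_confirmation("yeah", "confirm transfer of 500 dollars")
assert is_durable_confirmation("Confirm transfer of 500 dollars",
                               "confirm transfer of 500 dollars")
```

A real gate would also verify the speaker (to address impersonation) and often route the confirmation through a second channel such as an app tap.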
When not to use
Avoid voice as the sole channel for high-stakes changes without a secondary verification path; skip when users need dense reference material or precise tokens (URLs, codes) that speech handles poorly.
Failure modes
Over-trusting ASR transcripts, lacking an end-to-end latency budget, and granting tools that cannot be safely invoked from short, ambiguous utterances.
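An end-to-end latency budget can be made concrete by giving each stage an allowance and checking the sum against a conversational ceiling. The stage names, the per-stage numbers, and the ~800 ms ceiling below are assumptions for illustration, not recommended targets.

```python
# Hypothetical end-to-end latency budget: per-stage allowances (ms) checked
# against a conversational ceiling for time-to-first-audio.

BUDGET_MS = {
    "asr_final": 300,        # final transcript available
    "llm_first_token": 350,  # reasoning layer starts responding
    "tts_first_audio": 150,  # first audio chunk playing
}

def within_budget(measured_ms: dict[str, float], ceiling_ms: float = 800) -> bool:
    """True if every stage meets its allowance and the total meets the ceiling."""
    per_stage_ok = all(measured_ms[k] <= BUDGET_MS[k] for k in BUDGET_MS)
    return per_stage_ok and sum(measured_ms.values()) <= ceiling_ms

print(within_budget(
    {"asr_final": 280, "llm_first_token": 320, "tts_first_audio": 140}
))  # True: each stage within its allowance and 740 ms total
```

The useful habit is the check itself: a stack with no budget discovers its latency problems from users interrupting and abandoning turns.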
Related pages
Swarm vs single-agent systems · Support triage case study · LLMs in agentic systems · Categories