Voice Agents
Last updated: 2026-04-06
Quick answer: Voice stacks add real-time constraints and higher ambiguity; treat speech as another modality with stricter approval for consequential actions.
Definition
A voice agent is an agentic workflow whose primary interface is spoken: automatic speech recognition (ASR) for input, an LLM or policy layer for reasoning, and text-to-speech (TTS) or streaming audio for output. Deployments include telephony, smart devices, and app-based assistants, typically with real-time turn-taking and barge-in handling (letting the user interrupt the agent mid-reply).
Why it matters
Latency and error profiles differ from text chat: users interrupt mid-reply, background noise misleads ASR, and spoken confirmations feel binding even when the system misheard. Permission and approval design must assume mishears and impersonation.
When to use
Use voice when hands-free or speed of utterance matters—field ops, accessibility, drive-time tasks—and when you can keep high-risk actions behind explicit, durable confirmation (not a casual “yeah”).
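One way to make a confirmation "durable" rather than casual is to require the user to speak back a specific phrase and to reject filler acknowledgements. This is a hedged sketch of that idea; the function name, the filler-word list, and the example phrase are all illustrative assumptions, not a prescribed design.

```python
# Hypothetical gate for consequential actions: require an explicit spoken
# phrase and reject casual acknowledgements like "yeah" or "ok".

CASUAL_ACKS = {"yeah", "yep", "ok", "okay", "sure", "uh huh"}

def is_durable_confirmation(utterance: str, expected_phrase: str) -> bool:
    """True only if the user repeated the expected phrase, not a filler ack."""
    said = utterance.strip().lower()
    if said in CASUAL_ACKS:
        return False  # a casual "yeah" never authorizes a high-risk action
    return said == expected_phrase.lower()

# Example: a transfer prompt asks the user to say the amount back.
assert not is_durable_confirmation("yeah", "confirm transfer of 500 dollars")
assert is_durable_confirmation("Confirm transfer of 500 dollars",
                               "confirm transfer of 500 dollars")
```

A real gate would also verify the speaker (to address impersonation) and often route the confirmation through a second channel such as an app tap.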
When not to use
Avoid voice as the sole channel for high-stakes changes without a secondary verification path; skip when users need dense reference material or precise tokens (URLs, codes) that speech handles poorly.
Failure modes
Over-trusting ASR transcripts, lacking an end-to-end latency budget, and granting tools that cannot be safely invoked from short, ambiguous utterances.
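An end-to-end latency budget can be made concrete by giving each stage an allowance and checking the sum against a conversational ceiling. The stage names, the per-stage numbers, and the ~800 ms ceiling below are assumptions for illustration, not recommended targets.

```python
# Hypothetical end-to-end latency budget: per-stage allowances (ms) checked
# against a conversational ceiling for time-to-first-audio.

BUDGET_MS = {
    "asr_final": 300,        # final transcript available
    "llm_first_token": 350,  # reasoning layer starts responding
    "tts_first_audio": 150,  # first audio chunk playing
}

def within_budget(measured_ms: dict[str, float], ceiling_ms: float = 800) -> bool:
    """True if every stage meets its allowance and the total meets the ceiling."""
    per_stage_ok = all(measured_ms[k] <= BUDGET_MS[k] for k in BUDGET_MS)
    return per_stage_ok and sum(measured_ms.values()) <= ceiling_ms

print(within_budget(
    {"asr_final": 280, "llm_first_token": 320, "tts_first_audio": 140}
))  # True: each stage within its allowance and 740 ms total
```

The useful habit is the check itself: a stack with no budget discovers its latency problems from users interrupting and abandoning turns.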
Related pages
Swarm vs single-agent systems · Support triage case study · LLMs in agentic systems · Categories