Purpose

The Voice Pipeline step is step 4 of the bot creation wizard (labelled “Voice Pipeline”). It controls the three engines a call runs through, end to end:
  1. Speech-to-Text (STT): turns the caller’s audio into text.
  2. Language model (LLM): reads that text plus the bot’s instructions and decides what to say.
  3. Voice / Text-to-Speech (TTS): speaks the bot’s reply back to the caller.
The step is laid out as three cards — one for STT, one for the LLM, one for Voice/TTS. Each card has an engine selector at the top and a set of tuning fields below it. The exact fields shown change depending on which engine you pick, so the lists below cover every field that can appear.
Every field on this step ships with a sensible default, and all but one are optional: the LLM project ID is required if you pick a hosted-project LLM engine. Otherwise you can create a working bot without touching anything here. Tune only when a real call shows a problem (mishearing, slow replies, wrong language, interruptions firing too early or too late).
This step has no buttons of its own — it is all selectors, dropdowns, toggles, and number inputs. Move between steps with Previous / Continue in the wizard header, and save with Create Bot on the final step. See Create a bot for the wizard overview.
Engine names in the console are shown as supplier brands. This page refers to them generically — “STT engine”, “language model”, “voice engine” — and groups the settings by what they do. Match the names you see in the console to the descriptions here.

Speech-to-Text (STT)

The STT card decides how accurately and how quickly the bot hears the caller. Pick an engine, then a model, then optionally tune language handling and turn-end detection.
Most of the advanced STT fields below apply to one specific engine only. If you do not see a field, the engine you selected does not support it — that is expected. The console hides fields that do not apply.

Engine and model

| Field | What to enter | Required | Notes |
| --- | --- | --- | --- |
| STT engine | Choose the speech-to-text engine card. Different engines suit different languages and latency needs. | No | Selecting an engine reveals its config fields below. |
| STT model | The model variant for the chosen engine. Newer/larger models are more accurate; lighter models are faster. | No | Options are engine-specific. Pick the engine’s recommended default unless you have a reason to change it. |

Language handling

| Field | What to enter | Required | Notes |
| --- | --- | --- | --- |
| STT language | A language code such as en (English) or hi (Hindi). | No | Tells the engine which language to transcribe. Leave blank to let the engine auto-detect (some engines default to empty). |
| STT language hints | A comma-separated list of language codes, e.g. en, hi, for bots that switch languages mid-call. | No | Available on the multilingual engine only. Improves accuracy when callers mix two languages. |
| STT strict language hints | Toggle on to force transcription to stay within the listed hints and not drift to other languages. | No | Use only when you are certain of the languages spoken. Turning it on for an unexpected language causes mistranscription. |
| STT keyterm | A word or short phrase the engine should bias toward recognising (e.g. a brand or product name it keeps mishearing). | No | Helps with proper nouns and jargon that generic models get wrong. |
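To make the difference between plain and strict language hints concrete, here is a minimal sketch. It is not the engine's implementation: the function names, the `SUPPORTED` list, and the idea that non-strict hints are simply tried first are all assumptions made for illustration.

```python
# Hypothetical list of languages an engine supports (illustration only).
SUPPORTED = ["en", "hi", "ta", "te"]


def parse_language_hints(raw: str) -> list[str]:
    """Split a comma-separated hints field into clean, de-duplicated codes."""
    codes: list[str] = []
    for part in raw.split(","):
        code = part.strip().lower()
        if code and code not in codes:
            codes.append(code)
    return codes


def candidate_languages(hints: list[str], strict: bool) -> list[str]:
    """Return the languages the transcriber may choose from.

    Assumption: with strict hints on, only the listed languages are allowed;
    with strict hints off, hinted languages are preferred but others remain possible.
    """
    if strict and hints:
        return [code for code in hints if code in SUPPORTED]
    return hints + [code for code in SUPPORTED if code not in hints]


hints = parse_language_hints("en, hi")
print(candidate_languages(hints, strict=True))   # ['en', 'hi']
print(candidate_languages(hints, strict=False))  # ['en', 'hi', 'ta', 'te']
```

With strict hints on, a caller who speaks Tamil has no valid candidate in this model, which is the mistranscription risk the Notes column warns about.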

Formatting

| Field | What to enter | Required | Notes |
| --- | --- | --- | --- |
| STT smart format | Toggle on to format numbers, dates, and currency into readable text (e.g. “fifty thousand” → “50,000”). | No | Improves how transcripts read in reports. Has little effect on what the LLM understands. |
| STT punctuate | Toggle on to add punctuation to the transcript. | No | Mainly affects transcript readability. |

Turn-end detection

These fields control how the engine decides the caller has finished speaking so the bot can respond. Tuning them trades responsiveness against the risk of cutting the caller off.

| Field | What to enter | Required | Notes |
| --- | --- | --- | --- |
| STT endpointing | Milliseconds of silence the engine waits before treating speech as ended. | No | Lower = snappier replies but more chance of interrupting a slow speaker; higher = more patient but slower. |
| STT VAD force turn endpoint | Toggle that forces the voice-activity detector to close the caller’s turn. | No | For Hindi and other Indic calls with soft speech, fillers, or backchannels, leave this off so the engine keeps its own turn detection. Forcing it on can leave turns stuck when the caller speaks quietly. |
| STT EOT threshold | The end-of-turn confidence threshold the engine uses to decide the caller is done. | No | Higher = the bot waits for stronger evidence the caller finished. |
| STT eager EOT threshold | A lower, “eager” threshold that lets the bot start preparing a reply sooner. | No | Used alongside the main threshold to reduce perceived latency. |
| STT EOT timeout | Milliseconds after which the turn is closed regardless of confidence. | No | A safety net so a turn never hangs open indefinitely. |
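The sketch below shows one way to picture how these settings relate. It is a simplified model, not the engine's actual logic: the class and function names are hypothetical, the numeric values are illustrative rather than console defaults, and it assumes the timeout is measured from the moment the caller stops speaking. Real engines may use either the silence rule or the confidence rule, not both.

```python
from dataclasses import dataclass


@dataclass
class TurnEndConfig:
    # Illustrative values only; these are NOT the console's defaults.
    endpointing_ms: int = 300         # silence needed before speech counts as ended
    eot_threshold: float = 0.7        # confidence needed to close the turn
    eager_eot_threshold: float = 0.4  # lower bar to start preparing a reply
    eot_timeout_ms: int = 5000        # hard limit on how long a turn stays open


def silence_ended(cfg: TurnEndConfig, silence_ms: int) -> bool:
    """Silence-based rule: speech ends after endpointing_ms of silence."""
    return silence_ms >= cfg.endpointing_ms


def eot_state(cfg: TurnEndConfig, eot_confidence: float, elapsed_ms: int) -> str:
    """Confidence-based rule with an eager stage and a timeout safety net."""
    if elapsed_ms >= cfg.eot_timeout_ms:
        return "closed"      # timeout closes the turn regardless of confidence
    if eot_confidence >= cfg.eot_threshold:
        return "closed"      # strong evidence the caller has finished
    if eot_confidence >= cfg.eager_eot_threshold:
        return "preparing"   # start drafting a reply, but keep listening
    return "listening"


cfg = TurnEndConfig()
print(eot_state(cfg, eot_confidence=0.5, elapsed_ms=200))   # preparing
print(eot_state(cfg, eot_confidence=0.1, elapsed_ms=5000))  # closed
```

Raising `eot_threshold` or `endpointing_ms` in this model makes the bot more patient; lowering them makes it respond sooner at the cost of more interruptions.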

Language model (LLM)

The LLM is the bot’s “brain”. It reads the transcript and the bot’s instructions and writes the reply. The card lets you choose the engine, the model, and how creative or deterministic its responses are.

Engine and model

| Field | What to enter | Required | Notes |
| --- | --- | --- | --- |
| LLM engine | Choose the language-model engine card. | No | Selecting an engine reveals its config fields. |
| LLM model | The model variant for the chosen engine. Larger models reason better; lighter models are cheaper and faster. | No | Options are engine-specific. For live calls, keep latency-friendly defaults; spend accuracy budget on post-call analysis instead (see Instructions). |

Response tuning

| Field | What to enter | Required | Notes |
| --- | --- | --- | --- |
| LLM temperature | A number (roughly 0–1) controlling randomness. | No | Lower = more consistent, on-script replies; higher = more varied phrasing. Keep low for collections/compliance bots. |
| LLM max tokens | Maximum length of a single bot reply, in tokens. | No | Caps how long the bot can talk in one turn. Too low truncates answers mid-sentence. |
| LLM top P | A number (0–1) that limits word choice to the most likely options (“nucleus sampling”). | No | An alternative to temperature for controlling variety. Most bots leave this at default. |
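For readers who want to see what temperature and top P do to word choice, here is a small self-contained sketch of the standard techniques (temperature-scaled softmax and nucleus filtering). The token scores are invented for illustration, and how a particular engine combines the two settings is not specified here.

```python
import math
import random


def apply_temperature(logits: dict[str, float], temperature: float) -> dict[str, float]:
    """Turn raw scores into probabilities; temperature must be greater than 0."""
    scaled = {tok: score / temperature for tok, score in logits.items()}
    peak = max(scaled.values())  # subtract the max for numerical stability
    exps = {tok: math.exp(value - peak) for tok, value in scaled.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}


def top_p_filter(probs: dict[str, float], top_p: float) -> dict[str, float]:
    """Keep the most likely tokens until their combined probability reaches top_p."""
    kept: dict[str, float] = {}
    cumulative = 0.0
    for tok, p in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
        kept[tok] = p
        cumulative += p
        if cumulative >= top_p:
            break
    total = sum(kept.values())
    return {tok: p / total for tok, p in kept.items()}  # renormalise to sum to 1


def sample_token(probs: dict[str, float], rng: random.Random) -> str:
    """Draw one token according to its probability."""
    tokens = list(probs)
    return rng.choices(tokens, weights=[probs[t] for t in tokens], k=1)[0]


# Invented scores for three possible next words.
logits = {"yes": 2.0, "sure": 1.0, "maybe": 0.0}
cold = apply_temperature(logits, 0.2)  # low temperature: "yes" dominates
warm = apply_temperature(logits, 1.0)  # higher temperature: flatter distribution
print(sorted(top_p_filter(warm, 0.8)))  # ['sure', 'yes']
```

At temperature 0.2 nearly all the probability sits on the top word, which is why low values keep collections and compliance bots on-script.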

Hosted-project settings

These appear only when you select an engine that runs inside your own cloud project.

| Field | What to enter | Required | Notes |
| --- | --- | --- | --- |
| LLM project ID | The cloud project identifier that hosts the model. | Yes, when the hosted-project engine is selected | The bot fails to start without it. |
| LLM location | The cloud region for the project. | No | Optional. Defaults to us-east4 if left blank. |
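The two rules for hosted-project engines (project ID required, location defaulting to us-east4) can be expressed as a small pre-flight check. This is a sketch only; the function name and the dictionary keys are hypothetical and do not reflect the product's actual configuration schema.

```python
from typing import Optional

DEFAULT_LOCATION = "us-east4"  # default stated in the table above


def resolve_hosted_llm(project_id: Optional[str], location: Optional[str] = None) -> dict:
    """Validate hosted-project settings and fill in the default region."""
    if not project_id or not project_id.strip():
        # Mirrors the documented behaviour: without a project ID the bot cannot start.
        raise ValueError("LLM project ID is required for the hosted-project engine")
    cleaned_location = (location or "").strip()
    return {
        "project_id": project_id.strip(),
        "location": cleaned_location or DEFAULT_LOCATION,
    }


print(resolve_hosted_llm("my-project"))
# {'project_id': 'my-project', 'location': 'us-east4'}
```

Running a check like this before a test call surfaces a missing project ID earlier than a failed bot start would.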

Voice / Text-to-Speech (TTS)

The Voice card controls how the bot sounds — which engine speaks, which voice, in which language, and the fine details of pace, pitch, and consistency. It also exposes an audio cache that can reduce cost on repeated phrases.

Engine, voice, and language

| Field | What to enter | Required | Notes |
| --- | --- | --- | --- |
| TTS engine | Choose the voice (text-to-speech) engine card. | No | Selecting an engine reveals its config fields. |
| TTS model | The voice model variant for the chosen engine. | No | Options are engine-specific. |
| TTS voice ID | The identifier of the specific voice to use (e.g. a named voice). | No | Available on engines that offer multiple named voices. |
| TTS language | The spoken language, chosen from the engine’s supported list (e.g. Hindi, English, Tamil, Telugu, Bengali, Marathi, and more). | No | Set this to match the bot’s language. A mismatch produces an accent or wrong pronunciation. |

Delivery tuning

How the voice is paced and shaped. Defaults are usually fine; adjust only if the voice sounds rushed, flat, or unnatural.

| Field | What to enter | Required | Notes |
| --- | --- | --- | --- |
| TTS speed | A playback-speed multiplier. | No | Above 1 speaks faster, below 1 slower. |
| TTS pace | The speaking pace. | No | A finer control over rhythm than speed, on engines that support it. |
| TTS pitch | A pitch adjustment up or down. | No | Use sparingly; large changes sound artificial. |
| TTS loudness | Output loudness of the voice. | No | Adjust if the bot is too quiet or too loud relative to the line. |
| TTS stability | How consistent the voice stays from sentence to sentence. | No | Higher = steadier and more predictable; lower = more expressive but more variable. |
| TTS similarity boost | How closely the output matches the chosen voice’s character. | No | Higher sticks closer to the reference voice. |
| TTS pronunciation dict | Custom pronunciation rules for words the engine says wrong (e.g. brand names, place names). | No | Add entries when a specific word is consistently mispronounced. |

Voice cache

The voice cache stores already-spoken audio so identical phrases (like the opening message) do not have to be regenerated each call. This lowers cost and speeds up repeat lines. The cache is per bot and safe to leave off.

| Field | What to enter | Required | Notes |
| --- | --- | --- | --- |
| TTS cache enabled | Toggle on to cache generated speech and reuse it for matching text. | No | Turn off to always generate fresh audio. |
| TTS cache max entries | The maximum number of cached audio clips to keep. | No | Older entries are dropped once the limit is reached. |
| TTS cache TTL | How long (in seconds) a cached clip stays valid before it expires. | No | Lower values refresh audio more often. |
| TTS cache max text length | The longest text (in characters) that is eligible for caching. | No | Keeps the cache focused on short, repeated phrases rather than long unique replies. |
| TTS cache get timeout | Milliseconds to wait for a cache lookup before generating fresh audio. | No | A safety limit so a slow cache never delays the bot. |
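As a rough mental model of how the cache limits interact, the sketch below implements an in-memory cache keyed by the exact text. It is an assumption-laden illustration, not the product's cache: the class name, default values, and eviction order (oldest stored entry first) are hypothetical, and the get timeout is left out because an in-memory lookup has nothing to wait on.

```python
import time
from collections import OrderedDict
from typing import Callable, Optional


class VoiceCache:
    """Toy text-to-audio cache with max entries, TTL, and a text-length limit."""

    def __init__(self, max_entries: int = 100, ttl_seconds: float = 3600.0,
                 max_text_length: int = 200,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.max_entries = max_entries
        self.ttl_seconds = ttl_seconds
        self.max_text_length = max_text_length
        self._clock = clock
        self._entries = OrderedDict()  # text -> (stored_at, audio bytes)

    def get(self, text: str) -> Optional[bytes]:
        entry = self._entries.get(text)
        if entry is None:
            return None
        stored_at, audio = entry
        if self._clock() - stored_at > self.ttl_seconds:
            del self._entries[text]  # expired: caller should generate fresh audio
            return None
        return audio

    def put(self, text: str, audio: bytes) -> bool:
        if len(text) > self.max_text_length:
            return False  # too long to be worth caching
        self._entries[text] = (self._clock(), audio)
        self._entries.move_to_end(text)
        while len(self._entries) > self.max_entries:
            self._entries.popitem(last=False)  # drop the oldest entry
        return True


cache = VoiceCache(max_entries=2, ttl_seconds=60, max_text_length=40)
cache.put("Hello, thanks for calling.", b"<audio bytes>")
print(cache.get("Hello, thanks for calling.") is not None)  # True
```

Short, fixed lines such as the opening message benefit most, since only an exact text match produces a cache hit in this model.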
Tuning STT turn-end detection (endpointing, VAD force turn endpoint, EOT thresholds) and the voice-activity / interruption settings together is what makes a call feel natural. The interruption sensitivity and the voice-activity-detector timings live on the next wizard step (Behavior). If a bot talks over callers or replies too slowly, check both steps — the STT timings here and the VAD settings there work as a pair.

How the three engines work together

| Symptom on a test call | Where to look first |
| --- | --- |
| Bot mishears words or proper nouns | STT model, STT keyterm, STT language / hints |
| Bot replies in the wrong language | STT language and TTS language |
| Bot cuts the caller off | STT turn-end detection here, plus VAD settings on the Behavior step |
| Bot is slow to reply | Lighter LLM model, lower max tokens, STT endpointing |
| Replies are too random / off-script | Lower LLM temperature |
| Voice sounds rushed, flat, or mispronounces a word | TTS speed/pace, stability, pronunciation dict |
| High voice cost on repeated phrases | Enable the voice cache |

Validation checklist

After configuring the voice pipeline, run a test call and confirm:
  1. The transcript matches what was said (STT accuracy and language).
  2. The bot speaks in the expected language and voice (TTS language and voice ID).
  3. The bot does not talk over you and does not lag noticeably (turn-end detection and Behavior-step VAD).
  4. Replies stay on-script and complete, not truncated (LLM temperature and max tokens).
  5. For hosted-project LLM engines, the bot actually starts — a missing project ID prevents startup.
Continue to the Behavior step to set interruption sensitivity, call duration, and active hours.