Voice pipeline

Purpose

Voice Pipeline is step 4 of the bot creation wizard. It controls the three engines every call runs through, in order:

Speech-to-Text (STT)

Turns the caller’s audio into text.

Language model (LLM)

Reads that text plus the bot’s instructions and decides what to say.

Voice / Text-to-Speech (TTS)

Speaks the bot’s reply back to the caller.

The step is laid out as three cards: one for STT, one for the LLM, and one for Voice/TTS. Each card has an engine selector at the top and a set of tuning fields below it. The fields shown depend on which engine you pick, so the tables below cover every field that can appear.

Every field on this step is optional and ships with a sensible default. You can create a working bot without touching anything here. Tune only when a real call shows a problem (mishearing, slow replies, wrong language, interruptions firing too early or too late).

This step has no buttons of its own — it is all selectors, dropdowns, toggles, and number inputs. Move between steps with Previous / Continue in the wizard header, and save with Create Bot on the final step. See Create a bot for the wizard overview.

Engine names in the console are shown as supplier brands. This page refers to them generically — “STT engine”, “language model”, “voice engine” — and groups the settings by what they do. Match the names you see in the console to the descriptions here.

Speech-to-Text (STT)

The STT card decides how accurately and how quickly the bot hears the caller. Pick an engine, then a model, then optionally tune language handling and turn-end detection.

Most of the advanced STT fields below apply to one specific engine only. If you do not see a field, the engine you selected does not support it — that is expected. The console hides fields that do not apply.

Engine and model

Field	What to enter	Required	Notes
STT engine	Choose the speech-to-text engine card. Different engines suit different languages and latency needs.	No	Selecting an engine reveals its config fields below.
STT model	The model variant for the chosen engine. Newer/larger models are more accurate; lighter models are faster.	No	Options are engine-specific. Pick the engine’s recommended default unless you have a reason to change it.

Language handling

Field	What to enter	Required	Notes
STT language	A language code such as `en` (English) or `hi` (Hindi).	No	Tells the engine which language to transcribe. Leave blank to let the engine auto-detect (some engines default to empty).
STT language hints	A comma-separated list of language codes, e.g. `en, hi`, for bots that switch languages mid-call.	No	Available on the multilingual engine only. Improves accuracy when callers mix two languages.
STT strict language hints	Toggle on to force transcription to stay within the listed hints and not drift to other languages.	No	Use only when you are certain of the languages spoken. If a caller speaks a language outside the hints, strict mode mistranscribes it.
STT keyterm	A word or short phrase the engine should bias toward recognising (e.g. a brand or product name it keeps mishearing).	No	Helps with proper nouns and jargon that generic models get wrong.
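
As a rough illustration of how the hints field and the strict toggle relate, the sketch below parses a comma-separated hints string and applies a strict check. Both function names are hypothetical, and the real engine's handling is not documented here; treat this as a mental model only.

```python
def parse_language_hints(raw: str) -> list[str]:
    """Split a comma-separated hints string into clean, de-duplicated codes."""
    codes: list[str] = []
    for part in raw.split(","):
        code = part.strip().lower()
        # Skip empty fragments (e.g. a trailing comma) and repeats.
        if code and code not in codes:
            codes.append(code)
    return codes


def language_allowed(detected: str, hints: list[str], strict: bool) -> bool:
    # Assumption: with strict hints on, only hinted languages are accepted;
    # with strict off, hints are advisory and any language may be transcribed.
    return detected in hints if strict else True


hints = parse_language_hints("en, hi")
print(hints, language_allowed("ta", hints, strict=True))
```

Under this model, a Tamil-speaking caller on a bot with strict `en, hi` hints falls outside the allowed set, which is the mistranscription risk the table warns about.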

Formatting

Field	What to enter	Required	Notes
STT smart format	Toggle on to format numbers, dates, and currency into readable text (e.g. “fifty thousand” → “50,000”).	No	Improves how transcripts read in reports. Has little effect on what the LLM understands.
STT punctuate	Toggle on to add punctuation to the transcript.	No	Mainly affects transcript readability.

Turn-end detection

These fields control how the engine decides the caller has finished speaking so the bot can respond. Tuning them trades responsiveness against the risk of cutting the caller off.

Field	What to enter	Required	Notes
STT endpointing	Milliseconds of silence the engine waits before treating speech as ended.	No	Lower = snappier replies but more chance of interrupting a slow speaker; higher = more patient but slower.
STT VAD force turn endpoint	Toggle that forces the voice-activity detector to close the caller’s turn.	No	For Hindi and other Indic calls with soft speech, fillers, or backchannels, leave this off so the engine keeps its own turn detection. Forcing it on can leave turns stuck when the caller speaks quietly.
STT EOT threshold	The end-of-turn confidence threshold the engine uses to decide the caller is done.	No	Higher = the bot waits for stronger evidence the caller finished.
STT eager EOT threshold	A lower, “eager” threshold that lets the bot start preparing a reply sooner.	No	Used alongside the main threshold to reduce perceived latency.
STT EOT timeout	Milliseconds after which the turn is closed regardless of confidence.	No	A safety net so a turn never hangs open indefinitely.
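
One way to picture how these fields could interact is the sketch below. It is an illustration under stated assumptions, not the engine's actual logic: the class, the function, the default numbers, and the idea that the timeout is measured from the start of silence are all hypothetical, and a given engine may expose only some of these fields.

```python
from dataclasses import dataclass


@dataclass
class TurnEndSettings:
    # Hypothetical defaults, chosen only to make the example concrete.
    endpointing_ms: int = 300
    eot_threshold: float = 0.7
    eager_eot_threshold: float = 0.5
    eot_timeout_ms: int = 5000


def turn_state(silence_ms: int, eot_confidence: float, s: TurnEndSettings) -> str:
    """Return "open", "eager" (start preparing a reply), or "closed"."""
    # Safety net: close the turn after the timeout regardless of confidence.
    if silence_ms >= s.eot_timeout_ms:
        return "closed"
    # Not enough silence yet: the caller may still be mid-sentence.
    if silence_ms < s.endpointing_ms:
        return "open"
    # Strong evidence the caller is done.
    if eot_confidence >= s.eot_threshold:
        return "closed"
    # Weaker evidence: begin preparing a reply, but keep listening.
    if eot_confidence >= s.eager_eot_threshold:
        return "eager"
    return "open"
```

Raising `eot_threshold` in this model makes more turns land in "eager" or "open" for the same confidence, which matches the patience-versus-latency trade-off described above.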

Language model (LLM)

The LLM is the bot’s “brain”. It reads the transcript and the bot’s instructions and writes the reply. The card lets you choose the engine, the model, and how creative or deterministic its responses are.

Engine and model

Field	What to enter	Required	Notes
LLM engine	Choose the language-model engine card.	No	Selecting an engine reveals its config fields.
LLM model	The model variant for the chosen engine. Larger models reason better; lighter models are cheaper and faster.	No	Options are engine-specific. For live calls, keep latency-friendly defaults; spend accuracy budget on post-call analysis instead (see Instructions).

Response tuning

Field	What to enter	Required	Notes
LLM temperature	A number (roughly 0–1) controlling randomness.	No	Lower = more consistent, on-script replies; higher = more varied phrasing. Keep low for collections/compliance bots.
LLM max tokens	Maximum length of a single bot reply, in tokens.	No	Caps how long the bot can talk in one turn. Too low truncates answers mid-sentence.
LLM top P	A number (0–1) that limits word choice to the most likely options (“nucleus sampling”).	No	An alternative to temperature for controlling variety. Most bots leave this at default.
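
To make temperature and top P less abstract, the sketch below applies the textbook versions of both to a tiny set of candidate words. The function name and the scores are made up for illustration, and a particular engine may implement sampling differently.

```python
import math


def next_token_distribution(
    logits: dict[str, float], temperature: float, top_p: float
) -> dict[str, float]:
    """Apply temperature scaling, then nucleus (top-p) filtering."""
    if temperature <= 0:
        raise ValueError("temperature must be positive in this sketch")
    # Temperature: divide scores before softmax; lower values sharpen the distribution.
    scaled = {tok: score / temperature for tok, score in logits.items()}
    peak = max(scaled.values())
    weights = {tok: math.exp(v - peak) for tok, v in scaled.items()}
    total = sum(weights.values())
    probs = {tok: w / total for tok, w in weights.items()}
    # Top-p: keep the most likely tokens until their cumulative probability reaches top_p.
    kept: dict[str, float] = {}
    cumulative = 0.0
    for tok, p in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
        kept[tok] = p
        cumulative += p
        if cumulative >= top_p:
            break
    # Renormalize so the surviving candidates sum to 1.
    kept_total = sum(kept.values())
    return {tok: p / kept_total for tok, p in kept.items()}


scores = {"yes": 2.0, "sure": 1.0, "maybe": 0.0}  # made-up scores
print(next_token_distribution(scores, temperature=0.1, top_p=0.9))
print(next_token_distribution(scores, temperature=1.0, top_p=0.9))
```

With these scores, the low-temperature call leaves only "yes" as a candidate, while temperature 1.0 keeps both "yes" and "sure". That narrowing is why low values suit bots that must stay on script.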

Hosted-project settings

These appear only when you select an engine that runs inside your own cloud project.

Field	What to enter	Required	Notes
LLM project ID	The cloud project identifier that hosts the model.	Yes (hosted-project engine only)	The bot fails to start without it.
LLM location	The cloud region for the project.	No	Optional. Defaults to `us-east4` if left blank.
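
A pre-flight check along the following lines can catch a missing project ID before a test call. This is a hypothetical helper, not the console's own code; only the `us-east4` default comes from the table above.

```python
from typing import Optional

DEFAULT_LOCATION = "us-east4"  # default region stated in the table above


def resolve_hosted_project(project_id: Optional[str], location: Optional[str] = None) -> dict:
    """Hypothetical validation of the two hosted-project fields."""
    # The project ID has no default, so an empty value is an error.
    if not project_id or not project_id.strip():
        raise ValueError("LLM project ID is required for the hosted-project engine")
    return {
        "project_id": project_id.strip(),
        # A blank location falls back to the documented default.
        "location": (location or "").strip() or DEFAULT_LOCATION,
    }
```

Calling it with only a project ID returns that ID paired with `us-east4`; calling it with an empty ID raises an error instead of letting the bot fail at startup.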

Voice / Text-to-Speech (TTS)

The Voice card controls how the bot sounds — which engine speaks, which voice, in which language, and the fine details of pace, pitch, and consistency. It also exposes an audio cache that can reduce cost on repeated phrases.

Engine, voice, and language

Field	What to enter	Required	Notes
TTS engine	Choose the voice (text-to-speech) engine card.	No	Selecting an engine reveals its config fields.
TTS model	The voice model variant for the chosen engine.	No	Options are engine-specific.
TTS voice ID	The identifier of the specific voice to use (e.g. a named voice).	No	Available on engines that offer multiple named voices.
TTS language	The spoken language, chosen from the engine’s supported list (e.g. Hindi, English, Tamil, Telugu, Bengali, Marathi, and more).	No	Set this to match the bot’s language. A mismatch produces an accent or wrong pronunciation.

Delivery tuning

How the voice is paced and shaped. Defaults are usually fine; adjust only if the voice sounds rushed, flat, or unnatural.

Field	What to enter	Required	Notes
TTS speed	A playback-speed multiplier.	No	Above 1 speaks faster, below 1 slower.
TTS pace	The speaking pace.	No	A finer control over rhythm than speed, on engines that support it.
TTS pitch	A pitch adjustment up or down.	No	Use sparingly — large changes sound artificial.
TTS loudness	Output loudness of the voice.	No	Adjust if the bot is too quiet or too loud relative to the line.
TTS stability	How consistent the voice stays from sentence to sentence.	No	Higher = steadier and more predictable; lower = more expressive but more variable.
TTS similarity boost	How closely the output matches the chosen voice’s character.	No	Higher sticks closer to the reference voice.
TTS pronunciation dict	Custom pronunciation rules for words the engine says wrong (e.g. brand names, place names).	No	Add entries when a specific word is consistently mispronounced.

Voice cache

The voice cache stores already-spoken audio so identical phrases (like the opening message) do not have to be regenerated each call. This lowers cost and speeds up repeat lines. The cache is per bot and safe to leave off.

Field	What to enter	Required	Notes
TTS cache enabled	Toggle on to cache generated speech and reuse it for matching text.	No	Turn off to always generate fresh audio.
TTS cache max entries	The maximum number of cached audio clips to keep.	No	Older entries are dropped once the limit is reached.
TTS cache TTL	How long (in seconds) a cached clip stays valid before it expires.	No	Lower values refresh audio more often.
TTS cache max text length	The longest text (in characters) that is eligible for caching.	No	Keeps the cache focused on short, repeated phrases rather than long unique replies.
TTS cache get timeout	Milliseconds to wait for a cache lookup before generating fresh audio.	No	A safety limit so a slow cache never delays the bot.
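
The sketch below shows how three of these limits (max entries, TTL, max text length) might fit together in a simple in-memory cache. The class and its behavior are assumptions made for illustration; the product's actual cache, including how the get timeout is enforced, is not described here.

```python
import time
from collections import OrderedDict
from typing import Optional


class VoiceCacheSketch:
    """Illustrative cache; parameter names mirror the table, behavior is assumed."""

    def __init__(self, max_entries: int, ttl_seconds: float, max_text_length: int):
        self.max_entries = max_entries
        self.ttl_seconds = ttl_seconds
        self.max_text_length = max_text_length
        self._entries = OrderedDict()  # text -> (stored_at, audio)

    def put(self, text: str, audio: bytes, now: Optional[float] = None) -> bool:
        # Long texts are not eligible for caching.
        if len(text) > self.max_text_length:
            return False
        now = time.monotonic() if now is None else now
        self._entries[text] = (now, audio)
        self._entries.move_to_end(text)
        # Drop the oldest entries once the limit is exceeded.
        while len(self._entries) > self.max_entries:
            self._entries.popitem(last=False)
        return True

    def get(self, text: str, now: Optional[float] = None) -> Optional[bytes]:
        now = time.monotonic() if now is None else now
        entry = self._entries.get(text)
        if entry is None:
            return None
        stored_at, audio = entry
        # Expired clips are removed so fresh audio gets generated.
        if now - stored_at > self.ttl_seconds:
            del self._entries[text]
            return None
        return audio
```

In this model, a short opening message is stored once and served until its TTL lapses, while a long one-off reply is never stored at all.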

The STT turn-end settings on this step (endpointing, VAD force turn endpoint, EOT thresholds) and the interruption sensitivity and voice-activity-detector timings on the next wizard step (Behavior) work as a pair. Tuning them together is what makes a call feel natural. If a bot talks over callers or replies too slowly, check both steps.

How the three engines work together

The table below maps common test-call symptoms to the settings to check first.

Symptom on a test call	Where to look first
Bot mishears words or proper nouns	STT model, STT keyterm, STT language / hints
Bot replies in the wrong language	STT language and TTS language
Bot cuts the caller off	STT turn-end detection here, plus VAD settings on the Behavior step
Bot is slow to reply	Lighter LLM model, lower max tokens, lower STT endpointing
Replies are too random / off-script	Lower LLM temperature
Voice sounds rushed, flat, or mispronounces a word	TTS speed/pace, stability, pronunciation dict
High voice cost on repeated phrases	Enable the voice cache

Validation checklist

After configuring the voice pipeline, run a test call and confirm:

The transcript matches what was said (STT accuracy and language).
The bot speaks in the expected language and voice (TTS language and voice ID).
The bot does not talk over you and does not lag noticeably (turn-end detection and Behavior-step VAD).
Replies stay on-script and complete, not truncated (LLM temperature and max tokens).
For hosted-project LLM engines, the bot actually starts — a missing project ID prevents startup.

Continue to the Behavior step to set interruption sensitivity, call duration, and active hours.