Purpose
Voice Pipeline is step 4 of the bot creation wizard. It controls the three engines every call runs through, in this order:
- Speech-to-Text (STT): turns the caller’s audio into text.
- Language model (LLM): reads that text plus the bot’s instructions and decides what to say.
- Voice / Text-to-Speech (TTS): speaks the bot’s reply back to the caller.
The step is laid out as three cards — one for STT, one for the LLM, one for Voice/TTS. Each card has an engine selector at the top and a set of tuning fields below it. The exact fields shown change depending on which engine you pick, so the lists below cover every field that can appear.
Every field on this step is optional and ships with a sensible default. You can create a working bot without touching anything here. Tune only when a real call shows a problem (mishearing, slow replies, wrong language, interruptions firing too early or too late).
This step has no buttons of its own — it is all selectors, dropdowns, toggles, and number inputs. Move between steps with Previous / Continue in the wizard header, and save with Create Bot on the final step. See Create a bot for the wizard overview.
Engine names in the console are shown as supplier brands. This page refers to them generically — “STT engine”, “language model”, “voice engine” — and groups the settings by what they do. Match the names you see in the console to the descriptions here.
Speech-to-Text (STT)
The STT card decides how accurately and how quickly the bot hears the caller. Pick an engine, then a model, then optionally tune language handling and turn-end detection.
Most of the advanced STT fields below apply to one specific engine only. If you do not see a field, the engine you selected does not support it — that is expected. The console hides fields that do not apply.
Engine and model
| Field | What to enter | Required | Notes |
|---|---|---|---|
| STT engine | Choose the speech-to-text engine card. Different engines suit different languages and latency needs. | No | Selecting an engine reveals its config fields below. |
| STT model | The model variant for the chosen engine. Newer/larger models are more accurate; lighter models are faster. | No | Options are engine-specific. Pick the engine’s recommended default unless you have a reason to change it. |
Language handling
| Field | What to enter | Required | Notes |
|---|---|---|---|
| STT language | A language code such as en (English) or hi (Hindi). | No | Tells the engine which language to transcribe. Leave blank to let the engine auto-detect (some engines default to empty). |
| STT language hints | A comma-separated list of language codes, e.g. en, hi, for bots that switch languages mid-call. | No | Available on the multilingual engine only. Improves accuracy when callers mix two languages. |
| STT strict language hints | Toggle on to force transcription to stay within the listed hints and not drift to other languages. | No | Use only when you are certain of the languages spoken. If a caller speaks a language that is not in the hints while this is on, their speech is mistranscribed. |
| STT keyterm | A word or short phrase the engine should bias toward recognising (e.g. a brand or product name it keeps mishearing). | No | Helps with proper nouns and jargon that generic models get wrong. |
| Field | What to enter | Required | Notes |
|---|---|---|---|
| STT smart format | Toggle on to format numbers, dates, and currency into readable text (e.g. “fifty thousand” → “50,000”). | No | Improves how transcripts read in reports. Has little effect on what the LLM understands. |
| STT punctuate | Toggle on to add punctuation to the transcript. | No | Mainly affects transcript readability. |
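The STT language hints field takes a comma-separated list such as `en, hi`. As a rough sketch of how such a value could be tidied before it is saved, the snippet below splits, trims, lowercases, and de-duplicates the codes. The function name `parse_language_hints` is hypothetical, and the console may normalise input differently or not at all.

```python
def parse_language_hints(raw: str) -> list:
    """Split a comma-separated hints string into clean, de-duplicated codes.

    Hypothetical helper for illustration; not part of the console.
    """
    codes = []
    for part in raw.split(","):
        code = part.strip().lower()
        # Skip empty pieces (e.g. from "en,,hi") and repeated codes.
        if code and code not in codes:
            codes.append(code)
    return codes


if __name__ == "__main__":
    print(parse_language_hints("en, hi"))
```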
Turn-end detection
These fields control how the engine decides the caller has finished speaking so the bot can respond. Tuning them trades responsiveness against the risk of cutting the caller off.
| Field | What to enter | Required | Notes |
|---|---|---|---|
| STT endpointing | Milliseconds of silence the engine waits before treating speech as ended. | No | Lower = snappier replies but more chance of interrupting a slow speaker; higher = more patient but slower. |
| STT VAD force turn endpoint | Toggle that forces the voice-activity detector to close the caller’s turn. | No | For Hindi and other Indic calls with soft speech, fillers, or backchannels, leave this off so the engine keeps its own turn detection. Forcing it on can cause turns to get stuck when the caller speaks quietly. |
| STT EOT threshold | The end-of-turn confidence threshold the engine uses to decide the caller is done. | No | Higher = the bot waits for stronger evidence the caller finished. |
| STT eager EOT threshold | A lower, “eager” threshold that lets the bot start preparing a reply sooner. | No | Used alongside the main threshold to reduce perceived latency. |
| STT EOT timeout | Milliseconds after which the turn is closed regardless of confidence. | No | A safety net so a turn never hangs open indefinitely. |
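One way to picture how the three EOT fields relate is the small decision function below. It is an illustrative model only, not the engine's actual logic: the names (`TurnEndSettings`, `turn_state`), the example values, and the assumption that the timeout is measured from the start of silence are all hypothetical. STT endpointing is left out because it is a plain silence timer.

```python
from dataclasses import dataclass


@dataclass
class TurnEndSettings:
    # Hypothetical names mirroring the console fields; values are made up.
    eot_threshold: float = 0.7        # main end-of-turn confidence threshold
    eager_eot_threshold: float = 0.5  # lower threshold for early preparation
    eot_timeout_ms: int = 3000        # close the turn regardless of confidence


def turn_state(eot_confidence: float, silence_ms: int, s: TurnEndSettings) -> str:
    """Return 'listening', 'eager' (start preparing a reply), or 'ended'."""
    if silence_ms >= s.eot_timeout_ms:
        return "ended"      # safety net: never leave a turn open forever
    if eot_confidence >= s.eot_threshold:
        return "ended"      # strong evidence the caller has finished
    if eot_confidence >= s.eager_eot_threshold:
        return "eager"      # probably finished: prepare a reply early
    return "listening"


if __name__ == "__main__":
    settings = TurnEndSettings()
    for confidence in (0.2, 0.6, 0.9):
        print(confidence, turn_state(confidence, 100, settings))
```

In this model, raising `eot_threshold` makes the bot more patient, while lowering `eager_eot_threshold` lets reply preparation start earlier without closing the turn.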
Language model (LLM)
The LLM is the bot’s “brain”. It reads the transcript and the bot’s instructions and writes the reply. The card lets you choose the engine, the model, and how creative or deterministic its responses are.
Engine and model
| Field | What to enter | Required | Notes |
|---|---|---|---|
| LLM engine | Choose the language-model engine card. | No | Selecting an engine reveals its config fields. |
| LLM model | The model variant for the chosen engine. Larger models reason better; lighter models are cheaper and faster. | No | Options are engine-specific. For live calls, keep latency-friendly defaults; spend accuracy budget on post-call analysis instead (see Instructions). |
Response tuning
| Field | What to enter | Required | Notes |
|---|---|---|---|
| LLM temperature | A number (roughly 0–1) controlling randomness. | No | Lower = more consistent, on-script replies; higher = more varied phrasing. Keep low for collections/compliance bots. |
| LLM max tokens | Maximum length of a single bot reply, in tokens. | No | Caps how long the bot can talk in one turn. Too low truncates answers mid-sentence. |
| LLM top P | A number (0–1) that limits word choice to the most likely options (“nucleus sampling”). | No | An alternative to temperature for controlling variety. Most bots leave this at default. |
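To see why a lower temperature gives more consistent replies and why top P narrows word choice, the sketch below applies the textbook definitions of temperature scaling and nucleus sampling to a toy set of word scores. It assumes nothing about how any particular engine implements sampling; the function name `next_word_probs` and the example scores are invented for illustration.

```python
import math


def next_word_probs(scores: dict, temperature: float = 1.0, top_p: float = 1.0) -> dict:
    """Turn raw word scores into probabilities using temperature and top P."""
    t = max(temperature, 1e-6)  # avoid dividing by zero at temperature 0
    highest = max(scores.values())
    # Temperature scaling: a small t exaggerates the gap between scores.
    weights = {w: math.exp((v - highest) / t) for w, v in scores.items()}
    total = sum(weights.values())
    probs = {w: x / total for w, x in weights.items()}

    # Nucleus (top P): keep the most likely words until their combined
    # probability reaches top_p, then drop the rest.
    kept = {}
    cumulative = 0.0
    for word, p in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
        kept[word] = p
        cumulative += p
        if cumulative >= top_p:
            break

    # Renormalise so the kept probabilities sum to 1.
    norm = sum(kept.values())
    return {w: p / norm for w, p in kept.items()}


if __name__ == "__main__":
    scores = {"pay": 2.0, "settle": 1.0, "maybe": 0.0}
    print(next_word_probs(scores, temperature=1.0))
    print(next_word_probs(scores, temperature=0.1))
    print(next_word_probs(scores, temperature=1.0, top_p=0.9))
```

At a temperature of 0.1 almost all probability lands on the top-scoring word, which is the behaviour you want for collections or compliance bots.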
Hosted-project settings
These appear only when you select an engine that runs inside your own cloud project.
| Field | What to enter | Required | Notes |
|---|---|---|---|
| LLM project ID | The cloud project identifier that hosts the model. | Only for the hosted-project engine | The bot fails to start without it when the hosted-project engine is selected. |
| LLM location | The cloud region for the project. | No | Optional. Defaults to us-east4 if left blank. |
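The two rules in this table (a project ID is needed, and a blank location falls back to `us-east4`) can be expressed as a short pre-flight check. This is a minimal sketch: the function name `resolve_hosted_project` and the dictionary keys are hypothetical and do not reflect the console's internal field names.

```python
from typing import Optional

DEFAULT_LOCATION = "us-east4"  # default region stated in the table above


def resolve_hosted_project(project_id: Optional[str], location: Optional[str] = None) -> dict:
    """Validate hosted-project settings and fill in the default location."""
    if not project_id or not project_id.strip():
        # Mirrors the documented behaviour: no project ID, no startup.
        raise ValueError("LLM project ID is required for the hosted-project engine")
    cleaned_location = (location or "").strip() or DEFAULT_LOCATION
    return {"project_id": project_id.strip(), "location": cleaned_location}


if __name__ == "__main__":
    print(resolve_hosted_project("my-project"))
```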
Voice / Text-to-Speech (TTS)
The Voice card controls how the bot sounds — which engine speaks, which voice, in which language, and the fine details of pace, pitch, and consistency. It also exposes an audio cache that can reduce cost on repeated phrases.
Engine, voice, and language
| Field | What to enter | Required | Notes |
|---|---|---|---|
| TTS engine | Choose the voice (text-to-speech) engine card. | No | Selecting an engine reveals its config fields. |
| TTS model | The voice model variant for the chosen engine. | No | Options are engine-specific. |
| TTS voice ID | The identifier of the specific voice to use (e.g. a named voice). | No | Available on engines that offer multiple named voices. |
| TTS language | The spoken language, chosen from the engine’s supported list (e.g. Hindi, English, Tamil, Telugu, Bengali, Marathi, and more). | No | Set this to match the bot’s language. A mismatch produces an accent or wrong pronunciation. |
Delivery tuning
How the voice is paced and shaped. Defaults are usually fine; adjust only if the voice sounds rushed, flat, or unnatural.
| Field | What to enter | Required | Notes |
|---|---|---|---|
| TTS speed | A playback-speed multiplier. | No | Above 1 speaks faster, below 1 slower. |
| TTS pace | The speaking pace. | No | A finer control over rhythm than speed, on engines that support it. |
| TTS pitch | A pitch adjustment up or down. | No | Use sparingly — large changes sound artificial. |
| TTS loudness | Output loudness of the voice. | No | Adjust if the bot is too quiet or too loud relative to the line. |
| TTS stability | How consistent the voice stays from sentence to sentence. | No | Higher = steadier and more predictable; lower = more expressive but more variable. |
| TTS similarity boost | How closely the output matches the chosen voice’s character. | No | Higher sticks closer to the reference voice. |
| TTS pronunciation dict | Custom pronunciation rules for words the engine says wrong (e.g. brand names, place names). | No | Add entries when a specific word is consistently mispronounced. |
Voice cache
The voice cache stores already-spoken audio so identical phrases (like the opening message) do not have to be regenerated each call. This lowers cost and speeds up repeat lines. The cache is per bot and safe to leave off.
| Field | What to enter | Required | Notes |
|---|---|---|---|
| TTS cache enabled | Toggle on to cache generated speech and reuse it for matching text. | No | Turn off to always generate fresh audio. |
| TTS cache max entries | The maximum number of cached audio clips to keep. | No | Older entries are dropped once the limit is reached. |
| TTS cache TTL | How long (in seconds) a cached clip stays valid before it expires. | No | Lower values refresh audio more often. |
| TTS cache max text length | The longest text (in characters) that is eligible for caching. | No | Keeps the cache focused on short, repeated phrases rather than long unique replies. |
| TTS cache get timeout | Milliseconds to wait for a cache lookup before generating fresh audio. | No | A safety limit so a slow cache never delays the bot. |
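The sketch below shows how the max entries, TTL, and max text length settings could interact in a simple in-memory cache. It is a hypothetical illustration, not the product's implementation: the class name `VoiceCache`, the default values, and the oldest-first eviction order are assumptions, and the get timeout is not modelled.

```python
import time
from collections import OrderedDict


class VoiceCache:
    """Toy cache keyed by reply text (all names and defaults are hypothetical)."""

    def __init__(self, max_entries=100, ttl_seconds=3600.0,
                 max_text_length=200, clock=time.monotonic):
        self.max_entries = max_entries
        self.ttl_seconds = ttl_seconds
        self.max_text_length = max_text_length
        self._clock = clock
        self._entries = OrderedDict()  # text -> (stored_at, audio bytes)

    def put(self, text: str, audio: bytes) -> bool:
        """Store audio for text; return False if the text is too long to cache."""
        if len(text) > self.max_text_length:
            return False
        self._entries[text] = (self._clock(), audio)
        self._entries.move_to_end(text)
        # Drop the oldest entries once the limit is exceeded.
        while len(self._entries) > self.max_entries:
            self._entries.popitem(last=False)
        return True

    def get(self, text: str):
        """Return cached audio, or None if it is missing or has expired."""
        item = self._entries.get(text)
        if item is None:
            return None
        stored_at, audio = item
        if self._clock() - stored_at > self.ttl_seconds:
            del self._entries[text]  # expired: caller should regenerate
            return None
        return audio


if __name__ == "__main__":
    cache = VoiceCache(max_entries=2, ttl_seconds=60)
    cache.put("Hello, thanks for calling.", b"audio-bytes")
    print(cache.get("Hello, thanks for calling.") is not None)
```

A short opening message fits this pattern well: it is identical on every call, so one generated clip can be reused until its TTL runs out.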
A call feels natural only when the STT turn-end settings on this step (endpointing, VAD force turn endpoint, EOT thresholds) are tuned together with the interruption settings. Interruption sensitivity and the voice-activity-detector timings live on the next wizard step (Behavior). If a bot talks over callers or replies too slowly, check both steps: the STT timings here and the VAD settings there work as a pair.
How the three engines work together
| Symptom on a test call | Where to look first |
|---|---|
| Bot mishears words or proper nouns | STT model, STT keyterm, STT language / hints |
| Bot replies in the wrong language | STT language and TTS language |
| Bot cuts the caller off | STT turn-end detection here, plus VAD settings on the Behavior step |
| Bot is slow to reply | Lighter LLM model, lower max tokens, STT endpointing |
| Replies are too random / off-script | Lower LLM temperature |
| Voice sounds rushed, flat, or mispronounces a word | TTS speed/pace, stability, pronunciation dict |
| High voice cost on repeated phrases | Enable the voice cache |
Validation checklist
After configuring the voice pipeline, run a test call and confirm:
- The transcript matches what was said (STT accuracy and language).
- The bot speaks in the expected language and voice (TTS language and voice ID).
- The bot does not talk over you and does not lag noticeably (turn-end detection and Behavior-step VAD).
- Replies stay on-script and complete, not truncated (LLM temperature and max tokens).
- For hosted-project LLM engines, the bot actually starts (a missing project ID prevents startup).
Continue to the Behavior step to set interruption sensitivity, call duration, and active hours.