OtoniStark

Assistant Modes

Understand the two voice generation modes available for your AI assistants and when to use each one.

AI assistants on OtoniStark can speak in two distinct modes. Each mode determines how a caller’s speech is understood and how the assistant’s reply is generated:

1. Pipeline

Label in UI	`Pipeline`
How it works	Speech-to-Text → LLM → Text-to-Speech
Latency	~800 – 1500 ms (depends on language & model)
Best for	Complex reasoning, dynamic prompts, multi-sentence replies

Pipeline mode first transcribes the caller’s words into text, passes that text to the language model, then converts the model’s response back to audio. It is a well-established approach that offers the most flexibility:

Supports all voices in the library (including custom-cloned voices).
Handles long-form answers or paragraph-style responses well.
Allows the LLM to inject variables and reference earlier context cleanly.
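
The three stages can be pictured as three functions chained together. The following is a minimal Python sketch, not OtoniStark's implementation: `transcribe`, `generate_reply`, and `synthesize` are hypothetical stand-ins for real speech-to-text, LLM, and text-to-speech services, and "audio" is represented as UTF-8 bytes so the example runs on its own.

```python
from dataclasses import dataclass, field


@dataclass
class Conversation:
    """Text history that the LLM stage can refer back to."""
    turns: list[str] = field(default_factory=list)


def transcribe(audio: bytes) -> str:
    # Hypothetical STT stage: the "audio" here is just UTF-8 text.
    return audio.decode("utf-8")


def generate_reply(text: str, conversation: Conversation,
                   variables: dict[str, str]) -> str:
    # Hypothetical LLM stage: a fixed template stands in for the model.
    # It records the turn and injects a variable into the reply.
    conversation.turns.append(text)
    name = variables.get("caller_name", "there")
    return f"Hi {name}, you said: {text} (turn {len(conversation.turns)})"


def synthesize(text: str, voice: str) -> bytes:
    # Hypothetical TTS stage: tag the text with the chosen voice.
    return f"[{voice}] {text}".encode("utf-8")


def pipeline_turn(audio: bytes, conversation: Conversation,
                  variables: dict[str, str], voice: str) -> bytes:
    """Run one caller turn through Speech-to-Text -> LLM -> Text-to-Speech."""
    text = transcribe(audio)
    reply = generate_reply(text, conversation, variables)
    return synthesize(reply, voice)
```

Because the middle stage receives and returns plain text, the voice is chosen independently in the last stage, which is one way to see why Pipeline can work with any voice in the library.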

When to choose Pipeline

You need rich, multi-sentence answers (e.g. support queries, detailed explanations).
The assistant must reason over structured data or complex prompts.
You need full control over the spoken voice (a cloned or brand voice).

2. Speech-to-Speech (Multimodal)

Label in UI	`Speech-to-speech`
How it works	Direct speech-to-speech generation (no intermediate text)
Latency	~300 – 600 ms (ultra low)
Best for	Natural back-and-forth, short & reactive replies

Speech-to-speech mode skips the separate transcription and text-to-speech steps. Instead, a multimodal model listens and speaks directly, which produces a more natural conversational flow:

Fast turn-taking – callers experience near-instant responses.
Generates more expressive prosody natively (intonation, fillers).
Currently supports a limited set of voices; more are added regularly.

When to choose Speech-to-Speech

The conversation needs to feel snappy (sales, booking confirmations).
Your replies are generally short sentences or quick acknowledgements.
You’re willing to use the system-provided voices in exchange for faster responses.

Speech-to-speech is evolving rapidly. If you need a custom cloned voice or advanced prompt logic, stick with Pipeline for now.
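
The selection criteria for the two modes can be condensed into a small decision helper. This is only an illustrative sketch of the guidance on this page: the `AssistantNeeds` fields and the `recommend_mode` function are hypothetical names, not part of the OtoniStark product or API, and the returned strings simply mirror the labels shown in the UI.

```python
from dataclasses import dataclass


@dataclass
class AssistantNeeds:
    """Hypothetical checklist derived from the criteria on this page."""
    needs_custom_voice: bool = False       # cloned or brand voice
    needs_long_answers: bool = False       # multi-sentence or paragraph replies
    needs_complex_reasoning: bool = False  # structured data, advanced prompt logic


def recommend_mode(needs: AssistantNeeds) -> str:
    """Return the UI label of the mode that fits the stated needs."""
    if (needs.needs_custom_voice
            or needs.needs_long_answers
            or needs.needs_complex_reasoning):
        return "Pipeline"
    # Short, reactive replies with a system-provided voice.
    return "Speech-to-speech"
```

For example, `recommend_mode(AssistantNeeds(needs_custom_voice=True))` returns `"Pipeline"`, while an assistant with none of these needs gets `"Speech-to-speech"`. Treat the result as a starting point and confirm it by testing both modes.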

Switching modes

You can pick the mode for each assistant in Assistant → Settings → Voice Engine. Test both modes to see which delivers the best balance of speed and quality for your use case.

Pro Tip: Record two calls – one in each mode – and compare the caller’s perceived latency and engagement level to decide which fits your flow.
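
If you time the pause before each assistant reply in the two recordings, a few lines of Python are enough to compare them. This is a rough sketch under stated assumptions: the function names are hypothetical, the timings are entered by hand, and the numbers in the example are placeholders for illustration, not real measurements.

```python
from statistics import median


def summarize(gaps_ms: list[float]) -> dict[str, float]:
    """Summarize the response gaps (in ms) timed in one recorded call."""
    return {"median_ms": median(gaps_ms), "worst_ms": max(gaps_ms)}


def compare_modes(pipeline_gaps_ms: list[float],
                  s2s_gaps_ms: list[float]) -> dict:
    """Return per-mode summaries and the difference between the medians."""
    pipeline = summarize(pipeline_gaps_ms)
    s2s = summarize(s2s_gaps_ms)
    return {
        "Pipeline": pipeline,
        "Speech-to-speech": s2s,
        "median_gap_saved_ms": pipeline["median_ms"] - s2s["median_ms"],
    }


# Placeholder values for illustration only, not real measurements.
example = compare_modes([900, 1200, 1100], [350, 500, 400])
```

The median is used rather than the mean so that one unusually slow turn does not dominate the comparison; the worst case is reported separately because a single long pause can still be noticeable to a caller.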
