The Speech-to-Text (STT) feature applies to voice-based agents and controls how the agent listens to and transcribes audio input during a call. It converts spoken language into text, which the agent then processes to generate responses.

Customizing Speech-to-Text Settings

In the AI Voice Settings, you can choose the appropriate speech-to-text provider and language for your agent. These settings directly affect how the agent interprets and processes voice input during calls.

Available Speech-to-Text Providers

  1. Deepgram: Nova 2, Nova 3
  2. Groq: Whisper Large V3 Turbo
  3. Cartesia: Ink Whisper
  4. Sarvam: Saaras 2.5

Providers differ in accuracy, speed, and language support. Select the one that best fits your use case.
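
As a rough illustration, the provider and model list above can be represented as a small lookup table with a validation step. This is a sketch only: the names `STT_MODELS`, `SttSettings`, and `validate_settings` are hypothetical and not part of any product API, and the language field format is an assumption. Only the provider and model names come from the list above.

```python
from dataclasses import dataclass

# Provider and model names copied from the list above.
STT_MODELS = {
    "Deepgram": ["Nova 2", "Nova 3"],
    "Groq": ["Whisper Large V3 Turbo"],
    "Cartesia": ["Ink Whisper"],
    "Sarvam": ["Saaras 2.5"],
}


@dataclass(frozen=True)
class SttSettings:
    """Hypothetical container for an agent's speech-to-text choice."""
    provider: str
    model: str
    language: str  # assumed to be a short language code such as "en"


def validate_settings(settings: SttSettings) -> SttSettings:
    """Return the settings unchanged, or raise ValueError if the
    provider/model pair does not appear in STT_MODELS."""
    models = STT_MODELS.get(settings.provider)
    if models is None:
        raise ValueError(f"Unknown STT provider: {settings.provider}")
    if settings.model not in models:
        raise ValueError(
            f"{settings.provider} does not offer model {settings.model}"
        )
    return settings
```

Validating the pair up front catches a mismatched provider and model before any call is placed.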

Key Considerations

  1. Use Case
    Choose your speech-to-text provider based on the complexity of the conversation. If your agent handles basic conversations, a simpler STT model might be sufficient. For more technical or nuanced conversations, consider a model that offers higher accuracy and broader language support.
  2. Language
    Make sure the selected language matches the language your customers speak, as this improves transcription accuracy. Providers support different sets of languages, so confirm that your required language is available.
  3. Pricing
    The choice of STT model can affect your costs. More advanced models, or those supporting specialized languages, may cost more. Evaluate your use case to choose the most cost-effective option without compromising quality.
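
The three considerations above can be sketched as a simple filter: keep the options that support the required language and accuracy level, then take the cheapest. Everything here is hypothetical. `SttOption`, `pick_option`, the placeholder provider names, and the language, cost, and accuracy values are invented for illustration and do not describe any real provider or its pricing.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Optional


@dataclass(frozen=True)
class SttOption:
    """Hypothetical description of one STT choice."""
    name: str
    languages: FrozenSet[str]  # placeholder language codes
    relative_cost: float       # placeholder value, not real pricing
    high_accuracy: bool        # placeholder flag for demanding use cases


def pick_option(
    options: List[SttOption], language: str, needs_high_accuracy: bool
) -> Optional[SttOption]:
    """Return the cheapest option that supports the language and,
    if requested, is flagged as high accuracy. None if nothing fits."""
    candidates = [
        o for o in options
        if language in o.languages
        and (o.high_accuracy or not needs_high_accuracy)
    ]
    return min(candidates, key=lambda o: o.relative_cost, default=None)


# Placeholder data only; real language support and pricing must be
# confirmed with each provider.
EXAMPLE_OPTIONS = [
    SttOption("Provider A", frozenset({"en", "hi"}), 1.0, False),
    SttOption("Provider B", frozenset({"en"}), 2.0, True),
    SttOption("Provider C", frozenset({"en", "es"}), 3.0, True),
]
```

Filtering on language first reflects that an unsupported language rules a provider out regardless of cost.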

How Speech-to-Text Works

Once configured, the STT provider is automatically applied to voice-based agent interactions. The agent listens during the conversation and converts the caller's speech into text for further processing. This text is then used to generate responses or retrieve information from a knowledge base.
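
The flow described above (audio in, text out, then a response or knowledge base lookup) can be sketched as follows. This is a minimal illustration under stated assumptions, not the product's implementation: `handle_turn`, `fake_transcribe`, and the keyword-matching knowledge base are all hypothetical, and the stand-in transcriber simply decodes bytes as text instead of calling a real STT provider.

```python
from typing import Callable, Dict

FALLBACK_REPLY = "Sorry, I did not catch that. Could you repeat it?"


def handle_turn(
    audio: bytes,
    transcribe: Callable[[bytes], str],
    knowledge_base: Dict[str, str],
) -> str:
    """Transcribe one caller utterance and answer from the knowledge base."""
    # Step 1: speech to text (delegated to the configured provider).
    text = transcribe(audio).strip().lower()
    # Step 2: look for a knowledge base entry whose keyword appears in the text.
    for keyword, answer in knowledge_base.items():
        if keyword in text:
            return answer
    # Step 3: no match, so ask the caller to repeat.
    return FALLBACK_REPLY


def fake_transcribe(audio: bytes) -> str:
    """Stand-in for a real STT call: treats the bytes as UTF-8 text."""
    return audio.decode("utf-8")


reply = handle_turn(
    b"What are your opening hours?",
    fake_transcribe,
    {"opening hours": "We are open 9 to 5."},
)
```

Passing the transcriber in as a function keeps the rest of the flow unchanged when the configured provider changes.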