# Audio Streaming Guide

> Complete guide to building Voice AI applications with real-time bidirectional audio streaming

Real-time bidirectional audio streaming enables Voice AI applications, live transcription, voice assistants, and custom audio processing on Plivo calls.

***

## Architecture

Plivo Audio Streaming enables real-time, bidirectional audio communication between your application and an ongoing phone call via WebSocket.

### High-Level Flow

```mermaid theme={null}
sequenceDiagram
    participant Caller as Caller (Phone)
    participant Plivo as Plivo Platform
    participant WS as Your WebSocket Server
    participant AI as AI Services (STT/LLM/TTS)

    Caller->>Plivo: Initiates Call
    Plivo->>WS: HTTP Request to Answer URL
    WS->>Plivo: Returns Stream XML Response
    Plivo->>WS: Opens WebSocket Connection
    WS->>Plivo: Connection Established

    Note over Plivo,WS: Stream Session Active

    Plivo->>WS: start event (call metadata)

    loop Real-time Audio Stream
        Plivo->>WS: media event (caller audio)
        WS->>AI: Forward audio to STT
        AI->>WS: Transcription
        WS->>AI: Send to LLM
        AI->>WS: LLM Response
        WS->>AI: Send to TTS
        AI->>WS: Generated Audio
        WS->>Plivo: playAudio event (response audio)
        Plivo->>Caller: Plays Audio
    end

    opt DTMF
        Caller->>Plivo: Presses Key
        Plivo->>WS: dtmf event
        WS->>WS: Handle DTMF
    end

    Plivo->>WS: Connection Closes
    Plivo->>Caller: Call Ends
```

### Step-by-Step Flow

1. **Call Initiation**: A caller dials your Plivo number, or your application initiates an outbound call.
2. **Answer URL Request**: Plivo makes an HTTP request to your configured Answer URL.
3. **Stream XML Response**: Your server responds with XML containing the `<Stream>` element, specifying the WebSocket URL and streaming parameters.
4. 
**WebSocket Connection**: Plivo opens a WebSocket connection to the URL you specified. If signature validation is configured, your server verifies the request signature at this point.
5. **Start Event**: Plivo sends a `start` event containing call metadata (call ID, stream ID, media format, etc.).
6. **Media Streaming**:
   * **Inbound**: Plivo continuously sends `media` events containing base64-encoded audio chunks from the caller.
   * **Outbound**: Your server sends `playAudio` events with base64-encoded audio to be played to the caller.
7. **DTMF Events**: When the caller presses keys, Plivo sends `dtmf` events with the digit information.
8. **Control Events**: Your server can send `clearAudio` to interrupt playback or `checkpoint` to track playback progress.
9. **Confirmation Events**: Plivo sends `playedStream` when playback reaches a checkpoint and `clearedAudio` when the queue is cleared.
10. **Connection Close**: When the call ends or streaming stops, the WebSocket connection closes.

***

## Stream XML

The `<Stream>` XML element initiates audio streaming for a call. Include it in your Answer URL response.

### Basic Syntax

```xml theme={null}
<Response>
    <Stream>wss://your-server.com/stream</Stream>
</Response>
```

### Parameters

| Parameter              | Type    | Default                   | Description                                                                                   |
| ---------------------- | ------- | ------------------------- | --------------------------------------------------------------------------------------------- |
| `bidirectional`        | boolean | `false`                   | Enable two-way audio streaming. When `true`, you can send audio back to the caller.           |
| `keepCallAlive`        | boolean | `false`                   | Keep the call active after the stream ends. When `false`, the call ends when streaming stops. |
| `contentType`          | string  | `audio/x-mulaw;rate=8000` | Audio codec and sample rate. See [Supported Content Types](#supported-content-types).         |
| `statusCallbackUrl`    | string  | —                         | URL for stream status callbacks (started, stopped, failed).                                   |
| `statusCallbackMethod` | string  | `POST`                    | HTTP method for status callbacks (`GET` or `POST`). 
| | `extraHeaders` | string | — | Custom headers to include in the start event. Format: `key1=value1;key2=value2` | ### Supported Content Types | Content Type | Description | Use Case | | ------------------------- | -------------------------- | ------------------------------------------------------------------------ | | `audio/x-mulaw;rate=8000` | μ-law codec at 8kHz | **Recommended**. Standard telephony, lowest latency, best compatibility. | | `audio/x-l16;rate=8000` | Linear PCM 16-bit at 8kHz | Higher quality for speech processing. | | `audio/x-l16;rate=16000` | Linear PCM 16-bit at 16kHz | High-quality speech recognition. | ### Examples #### Basic Unidirectional Stream (Listen Only) ```xml theme={null} wss://your-server.com/stream ``` #### Bidirectional Stream with μ-law Codec ```xml theme={null} Hello! I'm connecting you to our AI assistant. wss://your-server.com/stream ``` #### Stream with Status Callbacks and Extra Headers ```xml theme={null} wss://your-server.com/stream ``` #### Higher Quality Stream (16kHz) ```xml theme={null} wss://your-server.com/stream ``` #### Record After Stream ```xml theme={null} wss://your-server.com/stream ``` *** ## Stream APIs The Plivo Stream API allows you to control active streams programmatically via REST API calls. ### Base URL ``` https://api.plivo.com/v1/Account/{auth_id}/Call/{call_uuid}/Stream/ ``` ### Authentication Use HTTP Basic Authentication with your Plivo Auth ID and Auth Token. ### Stop a Stream Stop an active stream on a call. 
**Endpoint**: `DELETE /v1/Account/{auth_id}/Call/{call_uuid}/Stream/` **Parameters**: | Parameter | Type | Required | Description | | ----------- | ------ | -------- | -------------------- | | `auth_id` | string | Yes | Your Plivo Auth ID | | `call_uuid` | string | Yes | The UUID of the call | **Example Request**: ```bash theme={null} curl -X DELETE \ https://api.plivo.com/v1/Account/YOUR_AUTH_ID/Call/CALL_UUID/Stream/ \ -u YOUR_AUTH_ID:YOUR_AUTH_TOKEN ``` **Example Response**: ```json theme={null} { "message": "stream stopped", "api_id": "b8e78dd0-1234-11ec-8a9e-0242ac110002" } ``` ### Get Stream Details Retrieve information about active streams on a call. **Endpoint**: `GET /v1/Account/{auth_id}/Call/{call_uuid}/Stream/` **Example Request**: ```bash theme={null} curl -X GET \ https://api.plivo.com/v1/Account/YOUR_AUTH_ID/Call/CALL_UUID/Stream/ \ -u YOUR_AUTH_ID:YOUR_AUTH_TOKEN ``` **Example Response**: ```json theme={null} { "api_id": "c9f89ee1-1234-11ec-8a9e-0242ac110002", "objects": [ { "stream_id": "12345678-1234-1234-1234-123456789abc", "call_uuid": "CALL_UUID", "status": "streaming", "service_url": "wss://your-server.com/stream", "bidirectional": true, "content_type": "audio/x-mulaw;rate=8000" } ] } ``` ### Using the Plivo SDK #### Node.js ```javascript theme={null} const plivo = require('plivo'); const client = new plivo.Client('YOUR_AUTH_ID', 'YOUR_AUTH_TOKEN'); // Stop a stream await client.calls.stopStream('CALL_UUID'); ``` #### Python ```python theme={null} import plivo client = plivo.RestClient('YOUR_AUTH_ID', 'YOUR_AUTH_TOKEN') # Stop a stream client.calls.stop_stream(call_uuid='CALL_UUID') ``` *** ## Stream Status Callback URL Configure a callback URL to receive notifications about stream lifecycle events. 
### Configuration Set the `statusCallbackUrl` attribute in your Stream XML: ```xml theme={null} wss://your-server.com/stream ``` ### Callback Events Your callback URL receives POST (or GET) requests with the following parameters: | Parameter | Type | Description | | -------------- | ------ | ---------------------------------------------- | | `CallUUID` | string | The unique identifier for the call | | `StreamID` | string | The unique identifier for the stream | | `Event` | string | The event type: `started`, `stopped`, `failed` | | `Timestamp` | string | ISO 8601 timestamp of the event | | `From` | string | The caller's phone number | | `To` | string | The called phone number | | `Direction` | string | Call direction: `inbound` or `outbound` | | `StatusReason` | string | Reason for status (on `stopped` or `failed`) | | `Duration` | number | Stream duration in seconds (on `stopped`) | ### Event Types #### `started` Sent when the WebSocket connection is successfully established. ```json theme={null} { "CallUUID": "12345678-1234-1234-1234-123456789abc", "StreamID": "87654321-4321-4321-4321-cba987654321", "Event": "started", "Timestamp": "2024-01-15T10:30:00Z", "From": "+14155551234", "To": "+14155555678", "Direction": "inbound" } ``` #### `stopped` Sent when the stream ends normally. ```json theme={null} { "CallUUID": "12345678-1234-1234-1234-123456789abc", "StreamID": "87654321-4321-4321-4321-cba987654321", "Event": "stopped", "Timestamp": "2024-01-15T10:35:00Z", "StatusReason": "completed", "Duration": 300, "From": "+14155551234", "To": "+14155555678", "Direction": "inbound" } ``` #### `failed` Sent when the stream fails to start or encounters an error. 
```json theme={null} { "CallUUID": "12345678-1234-1234-1234-123456789abc", "StreamID": "87654321-4321-4321-4321-cba987654321", "Event": "failed", "Timestamp": "2024-01-15T10:30:05Z", "StatusReason": "connection_failed", "From": "+14155551234", "To": "+14155555678", "Direction": "inbound" } ``` ### Example Callback Handler ```javascript theme={null} app.post('/stream-status', (req, res) => { const { CallUUID, StreamID, Event, StatusReason, Duration } = req.body; switch (Event) { case 'started': console.log(`Stream ${StreamID} started for call ${CallUUID}`); break; case 'stopped': console.log(`Stream ${StreamID} stopped after ${Duration}s: ${StatusReason}`); break; case 'failed': console.error(`Stream ${StreamID} failed: ${StatusReason}`); break; } res.sendStatus(200); }); ``` *** ## Plivo Signature Validation Plivo signs WebSocket connection requests to verify authenticity. Validate these signatures to ensure requests originate from Plivo. ### V3 Signature Headers Plivo includes two headers with each WebSocket connection request: | Header | Description | | ---------------------------- | ------------------------------- | | `X-Plivo-Signature-V3` | The HMAC-SHA256 signature | | `X-Plivo-Signature-V3-Nonce` | A unique nonce for this request | ### Validation Process 1. Construct the signature base string: `{METHOD}{URI}{NONCE}` 2. Compute HMAC-SHA256 using your Auth Token as the key 3. Base64 encode the result 4. 
Compare with the `X-Plivo-Signature-V3` header

### Using the Plivo SDK

The Plivo SDK provides a built-in validation function:

```javascript theme={null}
import { validateV3Signature } from 'plivo';

const isValid = validateV3Signature(
  method, // 'GET' for WebSocket upgrade requests
  uri, // Full URI including protocol and path
  nonce, // X-Plivo-Signature-V3-Nonce header value
  authToken, // Your Plivo Auth Token
  signature, // X-Plivo-Signature-V3 header value
);
```

### Using the Node.js Stream SDK

The `plivo-stream-sdk-node` handles signature validation automatically:

```javascript theme={null}
const plivoServer = new PlivoWebSocketServer({
  server,
  path: '/stream',
  validateSignature: true,
  authToken: process.env.PLIVO_AUTH_TOKEN,
});
```

When `validateSignature` is enabled, connections with invalid signatures are automatically rejected with a 1008 WebSocket close code.

### Manual Validation Example

```javascript theme={null}
import crypto from 'crypto';

function validatePlivoSignature(request, authToken) {
  const signature = request.headers['x-plivo-signature-v3'];
  const nonce = request.headers['x-plivo-signature-v3-nonce'];

  if (!signature || !nonce) {
    return false;
  }

  // Rebuild the URI that was signed: protocol, host, and request path
  const host = request.headers.host;
  const protocol = request.socket.encrypted ? 'https' : 'http';
  const uri = `${protocol}://${host}${request.url}`;

  // Base string is {METHOD}{URI}{NONCE}; WebSocket upgrades are GET requests
  const baseString = `GET${uri}${nonce}`;
  const expectedSignature = crypto.createHmac('sha256', authToken).update(baseString).digest('base64');

  // timingSafeEqual throws when the buffers differ in length, so compare lengths first
  const received = Buffer.from(signature);
  const expected = Buffer.from(expectedSignature);
  return received.length === expected.length && crypto.timingSafeEqual(received, expected);
}
```

***

## The Plivo Stream Event Protocol

All communication over the WebSocket uses JSON messages. Events are categorized as **Input Events** (from Plivo to your server) and **Output Events** (from your server to Plivo).

### Input Events (Plivo → Your Server)

#### `start`

Sent once when the stream begins. Contains call and stream metadata.
```json theme={null} { "event": "start", "sequenceNumber": 1, "start": { "callId": "12345678-1234-1234-1234-123456789abc", "streamId": "87654321-4321-4321-4321-cba987654321", "accountId": "MAXXXXXXXXXXXXXXXXXX", "tracks": ["inbound"], "mediaFormat": { "encoding": "audio/x-mulaw", "sampleRate": 8000 } }, "extra_headers": "userId=12345;sessionId=abc-xyz" } ``` | Field | Type | Description | | ------------------------------ | ------------- | ---------------------------------------------------------------------------- | | `event` | string | Always `"start"` | | `sequenceNumber` | number | Event sequence number (starts at 1) | | `start.callId` | string (UUID) | Unique identifier for the call | | `start.streamId` | string (UUID) | Unique identifier for the stream | | `start.accountId` | string | Your Plivo account ID | | `start.tracks` | string\[] | Audio tracks being streamed (e.g., `["inbound"]`, `["inbound", "outbound"]`) | | `start.mediaFormat.encoding` | string | Audio codec (e.g., `"audio/x-mulaw"`) | | `start.mediaFormat.sampleRate` | number | Sample rate in Hz | | `extra_headers` | string | Custom headers from the Stream XML `extraHeaders` attribute | #### `media` Sent continuously during the call. Contains audio data from the caller. ```json theme={null} { "event": "media", "sequenceNumber": 42, "streamId": "87654321-4321-4321-4321-cba987654321", "media": { "track": "inbound", "timestamp": "1705312200000", "chunk": 41, "payload": "//uQxAAAAAANIAAAAAExBTUUzLjEwMFVV..." 
}, "extra_headers": "userId=12345;sessionId=abc-xyz" } ``` | Field | Type | Description | | ----------------- | ------------- | ---------------------------------------- | | `event` | string | Always `"media"` | | `sequenceNumber` | number | Event sequence number | | `streamId` | string (UUID) | Stream identifier | | `media.track` | string | Audio track (`"inbound"` = caller audio) | | `media.timestamp` | string | Unix timestamp in milliseconds | | `media.chunk` | number | Chunk sequence number for this track | | `media.payload` | string | Base64-encoded audio data | | `extra_headers` | string | Custom headers from the Stream XML | **Audio Chunk Details**: * Each chunk contains approximately 20ms of audio * At 8kHz with μ-law encoding: \~160 bytes per chunk * Decode using: `Buffer.from(payload, 'base64')` #### `dtmf` Sent when the caller presses a key on their phone. ```json theme={null} { "event": "dtmf", "sequenceNumber": 50, "streamId": "87654321-4321-4321-4321-cba987654321", "dtmf": { "track": "inbound", "digit": "5", "timestamp": "1705312250000" }, "extra_headers": "userId=12345;sessionId=abc-xyz" } ``` | Field | Type | Description | | ---------------- | ------------- | ----------------------------------------------- | | `event` | string | Always `"dtmf"` | | `sequenceNumber` | number | Event sequence number | | `streamId` | string (UUID) | Stream identifier | | `dtmf.track` | string | Audio track (`"inbound"`) | | `dtmf.digit` | string | The DTMF digit pressed (`0-9`, `*`, `#`, `A-D`) | | `dtmf.timestamp` | string | Unix timestamp in milliseconds | | `extra_headers` | string | Custom headers from the Stream XML | #### `playedStream` Confirmation that audio with a checkpoint has finished playing. 
```json theme={null} { "event": "playedStream", "sequenceNumber": 75, "streamId": "87654321-4321-4321-4321-cba987654321", "name": "greeting-complete" } ``` | Field | Type | Description | | ---------------- | ------------- | --------------------------------- | | `event` | string | Always `"playedStream"` | | `sequenceNumber` | number | Event sequence number | | `streamId` | string (UUID) | Stream identifier | | `name` | string | The checkpoint name you specified | #### `clearedAudio` Confirmation that the audio queue has been cleared. ```json theme={null} { "event": "clearedAudio", "sequenceNumber": 80, "streamId": "87654321-4321-4321-4321-cba987654321" } ``` | Field | Type | Description | | ---------------- | ------------- | ----------------------- | | `event` | string | Always `"clearedAudio"` | | `sequenceNumber` | number | Event sequence number | | `streamId` | string (UUID) | Stream identifier | ### Output Events (Your Server → Plivo) #### `playAudio` Send audio to be played to the caller. For bidirectional streams only. ```json theme={null} { "event": "playAudio", "media": { "contentType": "audio/x-mulaw", "sampleRate": 8000, "payload": "//uQxAAAAAANIAAAAAExBTUUzLjEwMFVV..." } } ``` | Field | Type | Description | | ------------------- | ------ | --------------------------------------------------- | | `event` | string | Always `"playAudio"` | | `media.contentType` | string | Audio MIME type (must match stream's `contentType`) | | `media.sampleRate` | number | Sample rate in Hz (must match stream's sample rate) | | `media.payload` | string | Base64-encoded audio data | **Important**: The content type and sample rate must match what was specified in your Stream XML. #### `checkpoint` Mark a point in the audio queue. Receive a `playedStream` event when playback reaches this point. 
```json theme={null} { "event": "checkpoint", "streamId": "87654321-4321-4321-4321-cba987654321", "name": "greeting-complete" } ``` | Field | Type | Description | | ---------- | ------------- | ------------------------------------- | | `event` | string | Always `"checkpoint"` | | `streamId` | string (UUID) | Stream identifier | | `name` | string | Unique identifier for this checkpoint | **Use Cases**: * Track when a specific response finishes playing * Coordinate actions after audio playback * Measure time from sending audio to playback completion #### `clearAudio` Clear all queued audio. Use this to implement interruption. ```json theme={null} { "event": "clearAudio", "streamId": "87654321-4321-4321-4321-cba987654321" } ``` | Field | Type | Description | | ---------- | ------------- | --------------------- | | `event` | string | Always `"clearAudio"` | | `streamId` | string (UUID) | Stream identifier | *** ## X-Headers X-Headers (Extra Headers) allow you to pass custom metadata from your Stream XML to your WebSocket server. ### Configuration Set the `extraHeaders` attribute in your Stream XML: ```xml theme={null} wss://your-server.com/stream ``` ### Format * Key-value pairs separated by semicolons: `key1=value1;key2=value2` * Keys and values are strings * URL-encode values if they contain special characters ### Accessing X-Headers X-Headers appear in the `extra_headers` field of every event: ```json theme={null} { "event": "start", "extra_headers": "userId=12345;sessionId=abc-xyz;tier=premium", ... 
}
```

### Parsing X-Headers

```javascript theme={null}
function parseExtraHeaders(extraHeaders) {
  const headers = {};
  if (!extraHeaders) return headers;

  for (const pair of extraHeaders.split(';')) {
    // Split on the first '=' only, so a value that contains '=' stays intact
    const separatorIndex = pair.indexOf('=');
    if (separatorIndex === -1) continue;

    const key = pair.slice(0, separatorIndex).trim();
    const value = pair.slice(separatorIndex + 1).trim();
    if (key && value) {
      headers[key] = decodeURIComponent(value);
    }
  }

  return headers;
}

// Usage
plivoServer.onStart((event, ws) => {
  const headers = parseExtraHeaders(event.extra_headers);
  console.log(headers.userId); // "12345"
  console.log(headers.sessionId); // "abc-xyz"
  console.log(headers.tier); // "premium"
});
```

### Why Use X-Headers?

1. **Session Correlation**: Pass session IDs to correlate WebSocket connections with HTTP sessions
2. **User Context**: Include user IDs, account tiers, or language preferences
3. **Routing**: Pass information to route audio to different processing pipelines
4. **Analytics**: Include tracking IDs for analytics and debugging

### Example: Dynamic Agent Selection

```xml theme={null}
<Response>
    <Stream extraHeaders="agentType=sales;language=es;customerId=cust_123">wss://your-server.com/stream</Stream>
</Response>
```

```javascript theme={null}
plivoServer.onStart((event, ws) => {
  const headers = parseExtraHeaders(event.extra_headers);

  // Route to appropriate AI agent based on headers
  const agent = initializeAgent({
    type: headers.agentType, // "sales"
    language: headers.language, // "es"
    customerId: headers.customerId, // "cust_123"
  });

  // Store agent in connection state
  connectionState.set(ws, { agent });
});
```

***

## Limits

### WebSocket URL Length

| Limit                        | Value               |
| ---------------------------- | ------------------- |
| Maximum WebSocket URL length | **2048 characters** |

This includes the full URL with protocol, host, path, and any query parameters.
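
Because the limit counts the entire URL, query parameters appended at runtime can push an otherwise short URL over it. The sketch below shows one way to check a stream URL before placing it in the Stream XML. The names `checkStreamUrl` and `MAX_STREAM_URL_LENGTH` are illustrative and not part of any Plivo SDK, and counting JavaScript string characters is an assumption about how the limit is measured.

```javascript
// Hypothetical helper, not part of any Plivo SDK.
// The 2048-character figure comes from the limits table above.
const MAX_STREAM_URL_LENGTH = 2048;

function checkStreamUrl(url) {
  // Assumption: the limit applies to the full URL string,
  // including protocol, host, path, and query parameters.
  if (typeof url !== 'string' || url.length === 0) {
    return { ok: false, reason: 'URL must be a non-empty string' };
  }
  if (url.length > MAX_STREAM_URL_LENGTH) {
    return {
      ok: false,
      reason: `URL is ${url.length} characters; the limit is ${MAX_STREAM_URL_LENGTH}`,
    };
  }
  return { ok: true, reason: null };
}

// Usage: a short URL passes, an oversized query string does not.
const shortUrl = 'wss://your-server.com/stream?session=abc-xyz';
const longUrl = `wss://your-server.com/stream?data=${'x'.repeat(2048)}`;

console.log(checkStreamUrl(shortUrl).ok); // true
console.log(checkStreamUrl(longUrl).ok); // false
```

If a URL fails the check, one option is to move per-call metadata out of the query string and into the `extraHeaders` attribute described earlier.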
### Stream Limits | Limit | Value | | ----------------------------------- | --------------------- | | Maximum concurrent streams per call | 1 | | Maximum stream duration | Same as call duration | | Audio buffer size (playback queue) | \~60 seconds of audio | ### Rate Limits | Limit | Value | | ----------------------------------- | --------------------------------------------- | | Media events per second | \~50 (approximately 20ms chunks) | | Maximum playAudio events per second | No hard limit, but limited by playback buffer | ### Message Size | Limit | Value | | ------------------------------ | --------------------- | | Maximum WebSocket message size | 64 KB | | Recommended audio chunk size | ≤16 KB base64-encoded | *** ## Protocol Schema Reference ### JSON Schema ```json theme={null} { "$schema": "http://json-schema.org/draft-07/schema#", "definitions": { "StartEvent": { "type": "object", "required": ["event", "sequenceNumber", "start", "extra_headers"], "properties": { "event": { "const": "start" }, "sequenceNumber": { "type": "integer", "minimum": 1 }, "start": { "type": "object", "required": ["callId", "streamId", "accountId", "tracks", "mediaFormat"], "properties": { "callId": { "type": "string", "format": "uuid" }, "streamId": { "type": "string", "format": "uuid" }, "accountId": { "type": "string" }, "tracks": { "type": "array", "items": { "type": "string" } }, "mediaFormat": { "type": "object", "required": ["encoding", "sampleRate"], "properties": { "encoding": { "type": "string" }, "sampleRate": { "type": "integer" } } } } }, "extra_headers": { "type": "string" } } }, "MediaEvent": { "type": "object", "required": ["event", "sequenceNumber", "streamId", "media", "extra_headers"], "properties": { "event": { "const": "media" }, "sequenceNumber": { "type": "integer" }, "streamId": { "type": "string", "format": "uuid" }, "media": { "type": "object", "required": ["track", "timestamp", "chunk", "payload"], "properties": { "track": { "type": "string" }, 
"timestamp": { "type": "string" }, "chunk": { "type": "integer" }, "payload": { "type": "string", "contentEncoding": "base64" } } }, "extra_headers": { "type": "string" } } }, "DTMFEvent": { "type": "object", "required": ["event", "sequenceNumber", "streamId", "dtmf", "extra_headers"], "properties": { "event": { "const": "dtmf" }, "sequenceNumber": { "type": "integer" }, "streamId": { "type": "string", "format": "uuid" }, "dtmf": { "type": "object", "required": ["track", "digit", "timestamp"], "properties": { "track": { "type": "string" }, "digit": { "type": "string", "pattern": "^[0-9*#A-D]$" }, "timestamp": { "type": "string" } } }, "extra_headers": { "type": "string" } } }, "PlayedStreamEvent": { "type": "object", "required": ["event", "sequenceNumber", "streamId", "name"], "properties": { "event": { "const": "playedStream" }, "sequenceNumber": { "type": "integer" }, "streamId": { "type": "string", "format": "uuid" }, "name": { "type": "string" } } }, "ClearedAudioEvent": { "type": "object", "required": ["event", "sequenceNumber", "streamId"], "properties": { "event": { "const": "clearedAudio" }, "sequenceNumber": { "type": "integer" }, "streamId": { "type": "string", "format": "uuid" } } }, "PlayAudioEvent": { "type": "object", "required": ["event", "media"], "properties": { "event": { "const": "playAudio" }, "media": { "type": "object", "required": ["contentType", "sampleRate", "payload"], "properties": { "contentType": { "type": "string" }, "sampleRate": { "type": "integer" }, "payload": { "type": "string", "contentEncoding": "base64" } } } } }, "CheckpointEvent": { "type": "object", "required": ["event", "streamId", "name"], "properties": { "event": { "const": "checkpoint" }, "streamId": { "type": "string", "format": "uuid" }, "name": { "type": "string" } } }, "ClearAudioEvent": { "type": "object", "required": ["event", "streamId"], "properties": { "event": { "const": "clearAudio" }, "streamId": { "type": "string", "format": "uuid" } } } } } ``` ### 
TypeScript Types ```typescript theme={null} // Input Events (Plivo → Your Server) interface StartEvent { event: 'start'; sequenceNumber: number; start: { callId: string; // UUID streamId: string; // UUID accountId: string; tracks: string[]; mediaFormat: { encoding: string; sampleRate: number; }; }; extra_headers: string; } interface MediaEvent { event: 'media'; sequenceNumber: number; streamId: string; media: { track: string; timestamp: string; chunk: number; payload: string; // Base64 }; extra_headers: string; getRawMedia(): Buffer; // SDK helper method } interface DTMFEvent { event: 'dtmf'; sequenceNumber: number; streamId: string; dtmf: { track: string; digit: string; timestamp: string; }; extra_headers: string; } interface PlayedStreamEvent { event: 'playedStream'; sequenceNumber: number; streamId: string; name: string; } interface ClearedAudioEvent { event: 'clearedAudio'; sequenceNumber: number; streamId: string; } // Output Events (Your Server → Plivo) interface PlayAudioEvent { event: 'playAudio'; media: { contentType: string; sampleRate: number; payload: string; // Base64 }; } interface CheckpointEvent { event: 'checkpoint'; streamId: string; name: string; } interface ClearAudioEvent { event: 'clearAudio'; streamId: string; } ``` *** ## Recommendations for an Effective Plivo Stream Experience ### Audio Codec and Sample Rate Considerations #### Recommended: μ-law 8000Hz **Why μ-law at 8kHz is the best choice for most applications:** 1. **Native Telephony Format**: μ-law (PCMU) is the standard codec for telephony networks. Using this format means no transcoding is required, reducing latency. 2. **Lowest Latency**: Because it's the native format, audio passes through Plivo with minimal processing overhead. 3. **Bandwidth Efficient**: μ-law compresses 16-bit audio to 8-bit, reducing data transfer by 50% while maintaining voice quality. 4. **Universal Compatibility**: Every speech-to-text and text-to-speech service supports μ-law. No conversion needed. 5. 
**Sufficient for Voice**: Human speech is well-represented at 8kHz. Higher sample rates don't significantly improve voice AI applications. ```xml theme={null} wss://your-server.com/stream ``` #### When to Use Higher Sample Rates Consider 16kHz (`audio/x-l16;rate=16000`) only if: * Your STT model specifically benefits from higher sample rates (verify with benchmarks) * You're doing audio analysis beyond speech recognition * You have abundant bandwidth and can accept slightly higher latency ### Minimize Latency for a Better Experience #### 1. Choose the Right Region for Your WebSocket Server ```mermaid theme={null} graph LR A[Caller Location] -->|PSTN| B[Plivo Edge] B -->|WebSocket| C[Your Server] C -->|API| D[AI Services] style B fill:#f9f,stroke:#333 style C fill:#9f9,stroke:#333 ``` **Key Latency Sources**: * PSTN to Plivo: Fixed, based on caller location * Plivo to your server: Depends on server location * Your server to AI services: Depends on AI provider regions #### 2. Server Location Strategy | Your Use Case | Recommended Server Location | | ---------------------- | -------------------------------------------------- | | US-focused traffic | US East (Virginia) or US West (Oregon) | | Europe-focused traffic | Frankfurt or London | | Asia-Pacific traffic | Singapore or Mumbai | | Global traffic | Deploy in multiple regions with geographic routing | #### 3. Latency Budget For a responsive Voice AI experience, aim for: | Component | Target Latency | | -------------------- | --------------- | | Speech-to-Text | \< 200ms | | LLM Processing | \< 500ms | | Text-to-Speech | \< 200ms | | Network (round trip) | \< 100ms | | **Total** | **\< 1 second** | ### Where Is My Call Located? How Does Plivo Select the Location? Plivo routes calls through the edge location closest to the **caller's location** on the PSTN, not your server location. 
**Edge Locations**: * United States (multiple) * Europe (London, Frankfurt) * Asia-Pacific (Singapore, Mumbai, Sydney) * And more **Implications**: 1. A caller in London connects to Plivo's London edge 2. The WebSocket connects from London to your server 3. Position your server close to your expected caller locations ### India: Phone Numbers and Regulations Indian telecommunications regulations require: 1. **Local Presence**: Indian phone numbers require local business registration 2. **DND Registry**: Respect the Do Not Disturb registry for outbound calls 3. **Content Restrictions**: Certain types of automated content may be restricted Contact Plivo support for guidance on Indian number provisioning and compliance. ### Where to Host Your WebSocket Server **Cloud Providers with Low-Latency Options**: | Provider | Best Regions for Voice | | ------------------ | ------------------------------------------ | | AWS | us-east-1, eu-west-1, ap-southeast-1 | | Google Cloud | us-central1, europe-west1, asia-southeast1 | | Azure | East US, West Europe, Southeast Asia | | Fly.io | Automatic edge deployment | | Cloudflare Workers | Global edge (for lightweight processing) | **Optimization Tips**: 1. Use the same region as your AI services when possible 2. Deploy WebSocket servers in multiple regions for global traffic 3. Use connection pooling for AI service clients 4. Keep WebSocket handlers lightweight—offload heavy processing ### What Is Noise Cancellation and Why Do You Need It? **The Problem**: Phone calls often include background noise—traffic, coffee shops, offices, wind. This noise degrades: * Speech recognition accuracy * Voice AI response quality * Overall user experience **Plivo Noise Cancellation** removes background noise in real-time before audio reaches your WebSocket server. 
#### How It Works ```mermaid theme={null} graph LR A[Caller with Noise] -->|Raw Audio| B[Plivo Edge] B -->|AI Processing| C[Noise Removal] C -->|Clean Audio| D[Your Server] D -->|Better Recognition| E[AI Services] ``` 1. **Real-time Processing**: Audio is processed in milliseconds at the edge 2. **AI-Powered**: Uses machine learning models trained on telephony noise patterns 3. **Voice Preservation**: Enhances speech while removing noise 4. **No Code Changes**: Works transparently with existing streams #### Benefits * **Higher STT Accuracy**: 15-30% improvement in word error rate * **Fewer Misunderstandings**: Reduces need for "I didn't understand that" responses * **Better User Experience**: Callers can use your voice AI from anywhere #### Enable Noise Cancellation Add noise cancellation to your streams via XML or the REST API: **XML attribute:** ```xml theme={null} wss://ai.example.com/voice-agent ``` **REST API parameter:** ```json theme={null} { "service_url": "wss://ai.example.com/voice-agent", "bidirectional": true, "noise_cancellation": "true", "noise_cancellation_level": 85 } ``` #### Choosing a Cancellation Level | Level Range | Environment | Notes | | ----------- | ----------------------------- | --------------------------------------------------- | | `60`–`70` | Quiet (home, office) | Light filtering, preserves voice detail | | `70`–`85` | Moderate noise | Good balance for most use cases (default: `85`) | | `85`–`100` | Heavy noise (traffic, crowds) | Aggressive filtering, may introduce minor artifacts | Start with the default value of `85`. Increase toward `100` for environments with heavy background noise (traffic, crowds). Decrease toward `60` if you notice audio artifacts or voice distortion. #### Access Original (Pre-Denoised) Recording When noise cancellation is enabled, call recordings contain the denoised audio by default. 
To access the original recording before noise cancellation was applied, append `type=original` to the media fetch URL:

```
https://media.plivo.com/v1/Account/{auth_id}/Recording/{recording_id}.mp3?type=original
```

Pre-denoised recordings are currently available for **India region** accounts only.

#### Performance

Noise cancellation adds minimal latency because audio is processed at the edge before it reaches your WebSocket server. No server-side code changes are required. For latency-sensitive applications, test with your specific use case to ensure it meets requirements.

#### Reference

* [Stream XML attributes](/voice/xml/audio-streaming): full `noiseCancellation` and `noiseCancellationLevel` attribute docs
* [Audio Streams API](/voice/api/audio-streams): full `noise_cancellation` and `noise_cancellation_level` parameter docs

***

## How-To and Examples

### Start a Plivo Stream with Stream XML

**Basic Answer URL Handler**:

```javascript theme={null}
// Express.js example
app.get('/answer', (req, res) => {
  const streamUrl = `wss://${req.get('host')}/stream`;

  // Same settings as the SDK example that follows
  const xml = `<Response>
    <Speak>Hello! I'm connecting you now.</Speak>
    <Stream bidirectional="true" keepCallAlive="true" contentType="audio/x-mulaw;rate=8000">${streamUrl}</Stream>
</Response>`;

  res.type('application/xml').send(xml);
});
```

**Using Plivo SDK (Node.js)**:

```javascript theme={null}
import * as Plivo from 'plivo';

app.get('/answer', (req, res) => {
  const response = new Plivo.Response();
  response.addSpeak("Hello! 
I'm connecting you now."); response.addStream(`wss://${req.get('host')}/stream`, { bidirectional: true, keepCallAlive: true, contentType: 'audio/x-mulaw;rate=8000', }); res.type('application/xml').send(response.toXML()); }); ``` ### Record a Plivo Stream with Stream XML Record the call while streaming for compliance or training purposes: ```xml theme={null} wss://your-server.com/stream ``` ### Stop a Plivo Stream with the Stream API ```javascript theme={null} import Plivo from 'plivo'; const client = new Plivo.Client(process.env.PLIVO_AUTH_ID, process.env.PLIVO_AUTH_TOKEN); // Stop stream when you need to end it programmatically async function stopStream(callUuid) { try { await client.calls.stopStream(callUuid); console.log('Stream stopped successfully'); } catch (error) { console.error('Failed to stop stream:', error); } } ``` ### Example: Node.js Stream SDK with Deepgram, OpenAI, and ElevenLabs A complete voice AI implementation: ```typescript theme={null} import express from 'express'; import PlivoWebSocketServer from 'plivo-stream-sdk-node'; import type { StartEvent, MediaEvent, DTMFEvent } from 'plivo-stream-sdk-node'; import { createClient, LiveTranscriptionEvents } from '@deepgram/sdk'; import { ElevenLabsClient } from '@elevenlabs/elevenlabs-js'; import { OpenAI } from 'openai'; const app = express(); const PORT = 8000; // Initialize clients const deepgram = createClient(process.env.DEEPGRAM_API_KEY); const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY }); const elevenlabs = new ElevenLabsClient({ apiKey: process.env.ELEVENLABS_API_KEY }); // Per-connection state const connectionState = new WeakMap(); // Plivo Answer URL app.get('/stream', (req, res) => { const xml = ` Hello! How can I help you today? 
wss://${req.get('host')}/stream `; res.type('application/xml').send(xml); }); const server = app.listen(PORT); // TTS streaming function async function streamTTS(text: string, ws: any, plivoServer: any) { const audioStream = await elevenlabs.textToSpeech.stream(process.env.ELEVENLABS_VOICE_ID!, { text, modelId: 'eleven_turbo_v2', outputFormat: 'ulaw_8000', }); for await (const chunk of audioStream) { plivoServer.playAudio(ws, 'audio/x-mulaw', 8000, Buffer.from(chunk)); } } // Create Plivo WebSocket Server const plivoServer = new PlivoWebSocketServer({ server, path: '/stream', }); plivoServer .onConnection(async (ws, req) => { console.log('New connection'); // Create Deepgram connection const dgConnection = deepgram.listen.live({ model: 'nova-2', encoding: 'mulaw', sample_rate: 8000, smart_format: true, }); const messages: OpenAI.Chat.ChatCompletionMessageParam[] = []; dgConnection.on(LiveTranscriptionEvents.Transcript, async (data) => { const transcript = data.channel.alternatives[0].transcript; if (transcript.trim()) { console.log('User:', transcript); // Get AI response messages.push({ role: 'user', content: transcript }); const completion = await openai.chat.completions.create({ model: 'gpt-4o-mini', messages: [ { role: 'system', content: 'You are a helpful voice assistant. Keep responses brief and conversational.' 
            },
            ...messages,
          ],
        });

        const response = completion.choices[0].message.content!;
        console.log('AI:', response);
        messages.push({ role: 'assistant', content: response });

        // Stream TTS
        await streamTTS(response, ws, plivoServer);
      }
    });

    connectionState.set(ws, { dgConnection, messages });
  })
  .onMedia((event: MediaEvent, ws) => {
    const state = connectionState.get(ws);
    if (state?.dgConnection) {
      state.dgConnection.send(event.getRawMedia());
    }
  })
  .onDtmf((event: DTMFEvent, ws) => {
    console.log('DTMF:', event.dtmf.digit);

    // Clear audio on * press (interrupt)
    if (event.dtmf.digit === '*') {
      plivoServer.clearAudio(ws);
    }
  })
  .onClose((ws) => {
    const state = connectionState.get(ws);
    if (state?.dgConnection) {
      state.dgConnection.requestClose();
    }
  })
  .start();
```

### Sending and Receiving DTMFs

Handle DTMF input for menu navigation or controls:

```typescript theme={null}
plivoServer.onDtmf((event: DTMFEvent, ws) => {
  const { digit } = event.dtmf;

  switch (digit) {
    case '1':
      // Transfer to sales
      streamTTS('Connecting you to sales.', ws, plivoServer);
      break;
    case '2':
      // Transfer to support
      streamTTS('Connecting you to support.', ws, plivoServer);
      break;
    case '*':
      // Interrupt current response
      plivoServer.clearAudio(ws);
      streamTTS('Response cleared. How can I help?', ws, plivoServer);
      break;
    case '#': {
      // Repeat last response (assumes lastResponse is stored in connectionState
      // each time the assistant replies)
      const state = connectionState.get(ws);
      if (state?.lastResponse) {
        streamTTS(state.lastResponse, ws, plivoServer);
      }
      break;
    }
    default:
      console.log(`Received DTMF: ${digit}`);
  }
});
```

### Example with Python Stream SDK

```python theme={null}
import asyncio
import base64
import json
import os

import websockets
from deepgram import DeepgramClient, LiveTranscriptionEvents
from openai import OpenAI
from elevenlabs import ElevenLabs

# Initialize clients
deepgram = DeepgramClient(os.environ["DEEPGRAM_API_KEY"])
openai_client = OpenAI()
elevenlabs = ElevenLabs(api_key=os.environ["ELEVENLABS_API_KEY"])


async def handle_stream(websocket):
    # Per-connection conversation history
    messages = []

    # Set up Deepgram. The async client is used because start() and send() are
    # awaited below (assumes a deepgram-sdk release that provides listen.asyncwebsocket).
    dg_connection = deepgram.listen.asyncwebsocket.v("1")

    async def on_transcript(self, result, **kwargs):
        transcript = result.channel.alternatives[0].transcript
        if transcript.strip():
            print(f"User: {transcript}")

            # Get AI response
            messages.append({"role": "user", "content": transcript})
            completion = openai_client.chat.completions.create(
                model="gpt-4o-mini",
                messages=[
                    {"role": "system", "content": "You are a helpful voice assistant."},
                    *messages
                ]
            )
            response = completion.choices[0].message.content
            messages.append({"role": "assistant", "content": response})

            # Stream TTS
            audio_stream = elevenlabs.text_to_speech.stream(
                voice_id=os.environ["ELEVENLABS_VOICE_ID"],
                text=response,
                model_id="eleven_turbo_v2",
                output_format="ulaw_8000"
            )

            for chunk in audio_stream:
                await websocket.send(json.dumps({
                    "event": "playAudio",
                    "media": {
                        "contentType": "audio/x-mulaw",
                        "sampleRate": 8000,
                        "payload": base64.b64encode(chunk).decode()
                    }
                }))

    dg_connection.on(LiveTranscriptionEvents.Transcript, on_transcript)
    await dg_connection.start({"model": "nova-2", "encoding": "mulaw", "sample_rate": 8000})

    async for message in websocket:
        data = json.loads(message)

        if data["event"] == "media":
            audio = base64.b64decode(data["media"]["payload"])
            await dg_connection.send(audio)

        elif data["event"] == "dtmf":
            print(f"DTMF: {data['dtmf']['digit']}")
            if data["dtmf"]["digit"] == "*":
                await websocket.send(json.dumps({
                    "event": "clearAudio",
                    "streamId": data["streamId"]
                }))

    # Close the Deepgram connection when the Plivo stream ends
    await dg_connection.finish()


async def main():
    async with websockets.serve(handle_stream, "0.0.0.0", 8000):
        await asyncio.Future()

asyncio.run(main())
```

### Example with Pipecat

[Pipecat](https://github.com/pipecat-ai/pipecat) is an open-source framework for building voice AI pipelines.

```python theme={null}
import asyncio
import os

from pipecat.pipeline.pipeline import Pipeline
from pipecat.pipeline.runner import PipelineRunner
from pipecat.services.deepgram import DeepgramSTTService
from pipecat.services.openai import OpenAILLMService
from pipecat.services.elevenlabs import ElevenLabsTTSService
from pipecat.transports.services.plivo import PlivoTransport

async def main():
    # Configure transport
    transport = PlivoTransport(
        host="0.0.0.0",
        port=8000,
        path="/stream"
    )

    # Configure services
    stt = DeepgramSTTService(
        api_key=os.environ["DEEPGRAM_API_KEY"],
        model="nova-2"
    )

    llm = OpenAILLMService(
        api_key=os.environ["OPENAI_API_KEY"],
        model="gpt-4o-mini",
        system_prompt="You are a helpful voice assistant. Be concise."
    )

    tts = ElevenLabsTTSService(
        api_key=os.environ["ELEVENLABS_API_KEY"],
        voice_id=os.environ["ELEVENLABS_VOICE_ID"],
        model_id="eleven_turbo_v2"
    )

    # Build pipeline
    pipeline = Pipeline([
        transport.input(),
        stt,
        llm,
        tts,
        transport.output()
    ])

    # Run
    runner = PipelineRunner()
    await runner.run(pipeline)

asyncio.run(main())
```

***

## General Considerations for Voice AI Agents

### Noise Cancellation

**Why it matters**: Background noise is one of the most common causes of speech recognition errors.

**Implementation**:

1. Enable Plivo's built-in noise cancellation (contact support)
2. Consider client-side noise suppression for high-quality microphones
3. Expect more background noise from mobile callers

### Voice Activity Detection (VAD) and Turn Detection

**The Challenge**: Knowing when the user has finished speaking.

**Approaches**:

1. **Silence-based VAD**: Wait for N milliseconds of silence
   * Pros: Simple
   * Cons: Slow, doesn't handle pauses well

2. **STT End-of-Speech Detection**: Most STT services provide `speech_final` events
   * Pros: Understands speech patterns
   * Cons: Slight delay

3. **Semantic Turn Detection**: Use an LLM to determine whether a response is needed
   * Pros: Handles complex dialogue
   * Cons: Added latency

**Recommendation**: Combine STT's `speech_final` with a short timeout (300-500ms).

### Interruption

Users should be able to interrupt the AI mid-response.

**Implementation**:

```typescript theme={null}
// Single-connection sketch: sttClient and handleUserInput are placeholders for
// your STT client and dialogue logic. Keep this state per connection in production.
let isPlaying = false; // set to true whenever you send playAudio
let activeWs: any; // connection currently being served
const interruptionBuffer: Buffer[] = []; // assumes getRawMedia() returns a Buffer

plivoServer
  .onMedia((event, ws) => {
    activeWs = ws;
    const audio = event.getRawMedia();

    // Send to STT
    sttClient.send(audio);

    // If user speaks while AI is playing, they might be interrupting
    if (isPlaying) {
      // Accumulate audio and check for speech
      interruptionBuffer.push(audio);
    }
  })
  .onPlayedStream((event, ws) => {
    isPlaying = false;
  });

// In STT callback
sttClient.on('transcript', (data) => {
  if (data.isFinal && isPlaying) {
    // User interrupted
    plivoServer.clearAudio(activeWs);
    isPlaying = false;
    interruptionBuffer.length = 0;

    // Process interruption
    handleUserInput(data.transcript);
  }
});
```

### Context Management

Maintain conversation context for coherent multi-turn dialogue:

```typescript theme={null}
interface ConversationContext {
  messages: Array<{ role: string; content: string }>;
  userProfile?: {
    name?: string;
    preferences?: Record<string, any>;
  };
  sessionData?: Record<string, any>;
}

// Per-connection context, keyed by the WebSocket object
const contexts = new WeakMap<object, ConversationContext>();

function getSystemPrompt(context: ConversationContext): string {
  let prompt = `You are a helpful voice assistant.`;

  if (context.userProfile?.name) {
    prompt += ` The user's name is ${context.userProfile.name}.`;
  }

  if (context.sessionData?.topic) {
    prompt += ` You are currently helping with ${context.sessionData.topic}.`;
  }

  return prompt;
}

// Limit context size to control costs and latency
function trimContext(messages: Array<{ role: string; content: string
}>, maxMessages = 20) {
  if (messages.length > maxMessages) {
    // Keep system message + recent messages
    return [messages[0], ...messages.slice(-maxMessages + 1)];
  }
  return messages;
}
```

### Best Practices Summary

| Aspect             | Recommendation                                         |
| ------------------ | ------------------------------------------------------ |
| **Codec**          | μ-law 8000Hz for lowest latency                        |
| **Response Time**  | Aim for \< 1 second total                              |
| **Interruption**   | Always support; use `clearAudio`                       |
| **DTMF**           | Support `*` for interrupt, `#` for repeat              |
| **Error Handling** | Graceful fallbacks, don't leave user hanging           |
| **Context**        | Maintain conversation history, trim when needed        |
| **Testing**        | Test on actual phone calls, not just WebSocket clients |

***

## Support

For questions, issues, or feature requests:

* **Documentation**: [https://www.plivo.com/docs/](https://www.plivo.com/docs/)
* **Support**: [support@plivo.com](mailto:support@plivo.com)
* **GitHub Issues**: For SDK-specific issues

***

*Last updated: January 2026*
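
***

### Sketch: Combining `speech_final` with a Timeout

The turn-detection recommendation in this guide (combine the STT's `speech_final` flag with a 300-500ms timeout) could look roughly like the following. This is a minimal sketch under stated assumptions, not part of the Plivo Stream SDK or any STT SDK: `TurnDetector`, `TranscriptSegment`, `push`, and `poll` are hypothetical names, and timestamps are passed in explicitly so the logic stays deterministic and easy to test.

```typescript
// Hypothetical helper: not provided by Plivo or by any STT SDK.
interface TranscriptSegment {
  text: string;         // finalized text for this segment
  speechFinal: boolean; // the STT's end-of-speech flag for this segment
}

class TurnDetector {
  private readonly graceMs: number;
  private parts: string[] = [];
  private speechFinalAt: number | null = null;

  constructor(graceMs = 400) {
    this.graceMs = graceMs;
  }

  // Feed each finalized STT segment together with its arrival time (ms).
  push(segment: TranscriptSegment, nowMs: number): void {
    const text = segment.text.trim();
    if (text) this.parts.push(text);
    // Speech arriving after an end-of-speech flag cancels the pending turn end.
    this.speechFinalAt = segment.speechFinal ? nowMs : null;
  }

  // Call periodically. Returns the whole utterance once the grace period
  // after speech_final has elapsed, otherwise null.
  poll(nowMs: number): string | null {
    if (this.speechFinalAt === null || this.parts.length === 0) return null;
    if (nowMs - this.speechFinalAt < this.graceMs) return null;
    const utterance = this.parts.join(' ');
    this.parts = [];
    this.speechFinalAt = null;
    return utterance;
  }
}

// Usage with explicit timestamps
const detector = new TurnDetector(400);
detector.push({ text: 'I need help', speechFinal: false }, 0);
detector.push({ text: 'with my order', speechFinal: true }, 900);
console.log(detector.poll(1000)); // null (still inside the grace period)
console.log(detector.poll(1300)); // I need help with my order
```

In a real handler, `push` would be called from the STT transcript callback with `Date.now()`, and `poll` from a short interval timer, with the returned utterance forwarded to the LLM. Polling keeps the class free of timers, at the cost of up to one polling interval of extra latency.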