Real-time bidirectional audio streaming enables Voice AI applications, live transcription, voice assistants, and custom audio processing on Plivo calls.

Architecture

Plivo Audio Streaming enables real-time, bidirectional audio communication between your application and an ongoing phone call via WebSocket.

High-Level Flow

A caller is connected to Plivo over the PSTN; Plivo fetches Stream XML from your Answer URL, opens a WebSocket connection to your server, and then relays audio in both directions between the call and your application until the stream ends.

Step-by-Step Flow

  1. Call Initiation: A caller dials your Plivo number, or your application initiates an outbound call.
  2. Answer URL Request: Plivo makes an HTTP request to your configured Answer URL.
  3. Stream XML Response: Your server responds with XML containing the <Stream> element, specifying the WebSocket URL and streaming parameters.
  4. WebSocket Connection: Plivo establishes a WebSocket connection to your specified URL, validating signatures if configured.
  5. Start Event: Plivo sends a start event containing call metadata (call ID, stream ID, media format, etc.).
  6. Media Streaming:
    • Inbound: Plivo continuously sends media events containing base64-encoded audio chunks from the caller.
    • Outbound: Your server sends playAudio events with base64-encoded audio to be played to the caller.
  7. DTMF Events: When the caller presses keys, Plivo sends dtmf events with the digit information.
  8. Control Events: Your server can send clearAudio to interrupt playback or checkpoint to track playback progress.
  9. Confirmation Events: Plivo sends playedStream when audio finishes playing and clearedAudio when the queue is cleared.
  10. Connection Close: When the call ends or streaming stops, the WebSocket connection closes.
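
To make the flow concrete, here is a minimal sketch of a server built on the ws package rather than the Plivo Stream SDK. The event shapes follow the protocol described later in this document; handleCallerAudio is a hypothetical placeholder for whatever STT or processing pipeline you attach.

import { WebSocketServer } from 'ws';

const wss = new WebSocketServer({ port: 8000, path: '/stream' });

wss.on('connection', (ws) => {
  ws.on('message', (raw) => {
    const msg = JSON.parse(raw.toString());

    switch (msg.event) {
      case 'start': // step 5: call and stream metadata
        console.log('stream started', msg.start.streamId);
        break;
      case 'media': // step 6, inbound: base64-encoded caller audio
        handleCallerAudio(Buffer.from(msg.media.payload, 'base64')); // hypothetical pipeline
        break;
      case 'dtmf': // step 7
        console.log('digit pressed:', msg.dtmf.digit);
        break;
      case 'playedStream': // step 9: a checkpoint was reached
      case 'clearedAudio': // step 9: the playback queue was cleared
        console.log('confirmation:', msg.event);
        break;
    }
  });

  // Step 6, outbound (bidirectional streams only): call this from your TTS pipeline
  const sendAudio = (mulawBuffer) =>
    ws.send(JSON.stringify({
      event: 'playAudio',
      media: { contentType: 'audio/x-mulaw', sampleRate: 8000, payload: mulawBuffer.toString('base64') },
    }));

  ws.on('close', () => console.log('stream closed')); // step 10
});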

Stream XML

The <Stream> XML element initiates audio streaming for a call. Include it in your Answer URL response.

Basic Syntax

<?xml version="1.0" encoding="UTF-8"?>
<Response>
    <Stream bidirectional="true" keepCallAlive="true" contentType="audio/x-mulaw;rate=8000">
        wss://your-server.com/stream
    </Stream>
</Response>

Parameters

| Parameter | Type | Default | Description |
|---|---|---|---|
| bidirectional | boolean | false | Enable two-way audio streaming. When true, you can send audio back to the caller. |
| keepCallAlive | boolean | false | Keep the call active after the stream ends. When false, the call ends when streaming stops. |
| contentType | string | audio/x-mulaw;rate=8000 | Audio codec and sample rate. See Supported Content Types. |
| statusCallbackUrl | string | — | URL for stream status callbacks (started, stopped, failed). |
| statusCallbackMethod | string | POST | HTTP method for status callbacks (GET or POST). |
| extraHeaders | string | — | Custom headers to include in the start event. Format: key1=value1;key2=value2 |

Supported Content Types

| Content Type | Description | Use Case |
|---|---|---|
| audio/x-mulaw;rate=8000 | μ-law codec at 8 kHz | Recommended. Standard telephony, lowest latency, best compatibility. |
| audio/x-l16;rate=8000 | Linear PCM 16-bit at 8 kHz | Higher quality for speech processing. |
| audio/x-l16;rate=16000 | Linear PCM 16-bit at 16 kHz | High-quality speech recognition. |
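
If your pipeline produces 16-bit linear PCM (many TTS engines do) but the stream is configured for μ-law, you will need to transcode before sending audio back. Below is a minimal G.711 μ-law encoder for illustration only; resampling to 8 kHz, if your source uses a different rate, is a separate step.

// Convert one 16-bit signed PCM sample to an 8-bit μ-law byte (G.711).
function linearToMulaw(sample) {
  const BIAS = 0x84;
  const CLIP = 32635;
  const sign = sample < 0 ? 0x80 : 0x00;
  const magnitude = Math.min(Math.abs(sample), CLIP) + BIAS;

  let exponent = 7;
  for (let mask = 0x4000; (magnitude & mask) === 0 && exponent > 0; exponent--, mask >>= 1) {
    // scan down from bit 14 to bit 7 to find the μ-law segment
  }

  const mantissa = (magnitude >> (exponent + 3)) & 0x0f;
  return ~(sign | (exponent << 4) | mantissa) & 0xff;
}

// Encode a whole Int16Array of PCM samples into a μ-law Buffer.
function pcm16ToMulaw(samples) {
  const out = Buffer.alloc(samples.length);
  for (let i = 0; i < samples.length; i++) out[i] = linearToMulaw(samples[i]);
  return out;
}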

Examples

Basic Unidirectional Stream (Listen Only)

<?xml version="1.0" encoding="UTF-8"?>
<Response>
    <Stream>
        wss://your-server.com/stream
    </Stream>
</Response>

Bidirectional Stream with μ-law Codec

<?xml version="1.0" encoding="UTF-8"?>
<Response>
    <Speak>Hello! I'm connecting you to our AI assistant.</Speak>
    <Stream bidirectional="true"
            keepCallAlive="true"
            contentType="audio/x-mulaw;rate=8000">
        wss://your-server.com/stream
    </Stream>
</Response>

Stream with Status Callbacks and Extra Headers

<?xml version="1.0" encoding="UTF-8"?>
<Response>
    <Stream bidirectional="true"
            keepCallAlive="true"
            contentType="audio/x-mulaw;rate=8000"
            statusCallbackUrl="https://your-server.com/stream-status"
            statusCallbackMethod="POST"
            extraHeaders="userId=12345;sessionId=abc-xyz">
        wss://your-server.com/stream
    </Stream>
</Response>

Higher Quality Stream (16kHz)

<?xml version="1.0" encoding="UTF-8"?>
<Response>
    <Stream bidirectional="true"
            keepCallAlive="true"
            contentType="audio/x-l16;rate=16000">
        wss://your-server.com/stream
    </Stream>
</Response>

Record After Stream

<?xml version="1.0" encoding="UTF-8"?>
<Response>
    <Stream bidirectional="true"
            keepCallAlive="true"
            contentType="audio/x-mulaw;rate=8000">
        wss://your-server.com/stream
    </Stream>
    <Record action="https://your-server.com/recording-complete"
            recordingFormat="mp3"
            maxLength="3600"/>
</Response>

Stream APIs

The Plivo Stream API allows you to control active streams programmatically via REST API calls.

Base URL

https://api.plivo.com/v1/Account/{auth_id}/Call/{call_uuid}/Stream/

Authentication

Use HTTP Basic Authentication with your Plivo Auth ID and Auth Token.

Stop a Stream

Stop an active stream on a call.

Endpoint: DELETE /v1/Account/{auth_id}/Call/{call_uuid}/Stream/

Parameters:

| Parameter | Type | Required | Description |
|---|---|---|---|
| auth_id | string | Yes | Your Plivo Auth ID |
| call_uuid | string | Yes | The UUID of the call |
Example Request:
curl -X DELETE \
  https://api.plivo.com/v1/Account/YOUR_AUTH_ID/Call/CALL_UUID/Stream/ \
  -u YOUR_AUTH_ID:YOUR_AUTH_TOKEN
Example Response:
{
  "message": "stream stopped",
  "api_id": "b8e78dd0-1234-11ec-8a9e-0242ac110002"
}

Get Stream Details

Retrieve information about active streams on a call.

Endpoint: GET /v1/Account/{auth_id}/Call/{call_uuid}/Stream/

Example Request:
curl -X GET \
  https://api.plivo.com/v1/Account/YOUR_AUTH_ID/Call/CALL_UUID/Stream/ \
  -u YOUR_AUTH_ID:YOUR_AUTH_TOKEN
Example Response:
{
  "api_id": "c9f89ee1-1234-11ec-8a9e-0242ac110002",
  "objects": [
    {
      "stream_id": "12345678-1234-1234-1234-123456789abc",
      "call_uuid": "CALL_UUID",
      "status": "streaming",
      "service_url": "wss://your-server.com/stream",
      "bidirectional": true,
      "content_type": "audio/x-mulaw;rate=8000"
    }
  ]
}

Using the Plivo SDK

Node.js

const plivo = require('plivo');

const client = new plivo.Client('YOUR_AUTH_ID', 'YOUR_AUTH_TOKEN');

// Stop a stream
await client.calls.stopStream('CALL_UUID');

Python

import plivo

client = plivo.RestClient('YOUR_AUTH_ID', 'YOUR_AUTH_TOKEN')

# Stop a stream
client.calls.stop_stream(call_uuid='CALL_UUID')

Stream Status Callback URL

Configure a callback URL to receive notifications about stream lifecycle events.

Configuration

Set the statusCallbackUrl attribute in your Stream XML:
<Stream bidirectional="true"
        statusCallbackUrl="https://your-server.com/stream-status"
        statusCallbackMethod="POST">
    wss://your-server.com/stream
</Stream>

Callback Events

Your callback URL receives POST (or GET) requests with the following parameters:
| Parameter | Type | Description |
|---|---|---|
| CallUUID | string | The unique identifier for the call |
| StreamID | string | The unique identifier for the stream |
| Event | string | The event type: started, stopped, failed |
| Timestamp | string | ISO 8601 timestamp of the event |
| From | string | The caller's phone number |
| To | string | The called phone number |
| Direction | string | Call direction: inbound or outbound |
| StatusReason | string | Reason for the status (present on stopped and failed) |
| Duration | number | Stream duration in seconds (present on stopped) |

Event Types

started

Sent when the WebSocket connection is successfully established.
{
  "CallUUID": "12345678-1234-1234-1234-123456789abc",
  "StreamID": "87654321-4321-4321-4321-cba987654321",
  "Event": "started",
  "Timestamp": "2024-01-15T10:30:00Z",
  "From": "+14155551234",
  "To": "+14155555678",
  "Direction": "inbound"
}

stopped

Sent when the stream ends normally.
{
  "CallUUID": "12345678-1234-1234-1234-123456789abc",
  "StreamID": "87654321-4321-4321-4321-cba987654321",
  "Event": "stopped",
  "Timestamp": "2024-01-15T10:35:00Z",
  "StatusReason": "completed",
  "Duration": 300,
  "From": "+14155551234",
  "To": "+14155555678",
  "Direction": "inbound"
}

failed

Sent when the stream fails to start or encounters an error.
{
  "CallUUID": "12345678-1234-1234-1234-123456789abc",
  "StreamID": "87654321-4321-4321-4321-cba987654321",
  "Event": "failed",
  "Timestamp": "2024-01-15T10:30:05Z",
  "StatusReason": "connection_failed",
  "From": "+14155551234",
  "To": "+14155555678",
  "Direction": "inbound"
}

Example Callback Handler

// Note: req.body requires body-parsing middleware, e.g. express.urlencoded() or express.json(),
// depending on the content type of the callback.
app.post('/stream-status', (req, res) => {
  const { CallUUID, StreamID, Event, StatusReason, Duration } = req.body;

  switch (Event) {
    case 'started':
      console.log(`Stream ${StreamID} started for call ${CallUUID}`);
      break;
    case 'stopped':
      console.log(`Stream ${StreamID} stopped after ${Duration}s: ${StatusReason}`);
      break;
    case 'failed':
      console.error(`Stream ${StreamID} failed: ${StatusReason}`);
      break;
  }

  res.sendStatus(200);
});

Plivo Signature Validation

Plivo signs WebSocket connection requests to verify authenticity. Validate these signatures to ensure requests originate from Plivo.

V3 Signature Headers

Plivo includes two headers with each WebSocket connection request:
| Header | Description |
|---|---|
| X-Plivo-Signature-V3 | The HMAC-SHA256 signature |
| X-Plivo-Signature-V3-Nonce | A unique nonce for this request |

Validation Process

  1. Construct the signature base string: {METHOD}{URI}{NONCE}
  2. Compute HMAC-SHA256 using your Auth Token as the key
  3. Base64 encode the result
  4. Compare with the X-Plivo-Signature-V3 header

Using the Plivo SDK

The Plivo SDK provides a built-in validation function:
import { validateV3Signature } from 'plivo';

const isValid = validateV3Signature(
  method, // 'GET' for WebSocket upgrade requests
  uri, // Full URI including protocol and path
  nonce, // X-Plivo-Signature-V3-Nonce header value
  authToken, // Your Plivo Auth Token
  signature, // X-Plivo-Signature-V3 header value
);

Using the Node.js Stream SDK

The plivo-stream-sdk-node handles signature validation automatically:
const plivoServer = new PlivoWebSocketServer({
  server,
  path: '/stream',
  validateSignature: true,
  authToken: process.env.PLIVO_AUTH_TOKEN,
});
When validateSignature is enabled, connections with invalid signatures are automatically rejected with a 1008 WebSocket close code.

Manual Validation Example

import crypto from 'crypto';

function validatePlivoSignature(request, authToken) {
  const signature = request.headers['x-plivo-signature-v3'];
  const nonce = request.headers['x-plivo-signature-v3-nonce'];

  if (!signature || !nonce) {
    return false;
  }

  const host = request.headers.host;
  const protocol = request.socket.encrypted ? 'https' : 'http';
  const uri = `${protocol}://${host}${request.url}`;

  const baseString = `GET${uri}${nonce}`;
  const expectedSignature = crypto.createHmac('sha256', authToken).update(baseString).digest('base64');

  // timingSafeEqual throws if the buffers differ in length, so compare lengths first
  const expected = Buffer.from(expectedSignature);
  const provided = Buffer.from(signature);
  return provided.length === expected.length && crypto.timingSafeEqual(provided, expected);
}

The Plivo Stream Event Protocol

All communication over the WebSocket uses JSON messages. Events are categorized as Input Events (from Plivo to your server) and Output Events (from your server to Plivo).

Input Events (Plivo → Your Server)

start

Sent once when the stream begins. Contains call and stream metadata.
{
  "event": "start",
  "sequenceNumber": 1,
  "start": {
    "callId": "12345678-1234-1234-1234-123456789abc",
    "streamId": "87654321-4321-4321-4321-cba987654321",
    "accountId": "MAXXXXXXXXXXXXXXXXXX",
    "tracks": ["inbound"],
    "mediaFormat": {
      "encoding": "audio/x-mulaw",
      "sampleRate": 8000
    }
  },
  "extra_headers": "userId=12345;sessionId=abc-xyz"
}
| Field | Type | Description |
|---|---|---|
| event | string | Always "start" |
| sequenceNumber | number | Event sequence number (starts at 1) |
| start.callId | string (UUID) | Unique identifier for the call |
| start.streamId | string (UUID) | Unique identifier for the stream |
| start.accountId | string | Your Plivo account ID |
| start.tracks | string[] | Audio tracks being streamed (e.g., ["inbound"], ["inbound", "outbound"]) |
| start.mediaFormat.encoding | string | Audio codec (e.g., "audio/x-mulaw") |
| start.mediaFormat.sampleRate | number | Sample rate in Hz |
| extra_headers | string | Custom headers from the Stream XML extraHeaders attribute |

media

Sent continuously during the call. Contains audio data from the caller.
{
  "event": "media",
  "sequenceNumber": 42,
  "streamId": "87654321-4321-4321-4321-cba987654321",
  "media": {
    "track": "inbound",
    "timestamp": "1705312200000",
    "chunk": 41,
    "payload": "//uQxAAAAAANIAAAAAExBTUUzLjEwMFVV..."
  },
  "extra_headers": "userId=12345;sessionId=abc-xyz"
}
| Field | Type | Description |
|---|---|---|
| event | string | Always "media" |
| sequenceNumber | number | Event sequence number |
| streamId | string (UUID) | Stream identifier |
| media.track | string | Audio track ("inbound" = caller audio) |
| media.timestamp | string | Unix timestamp in milliseconds |
| media.chunk | number | Chunk sequence number for this track |
| media.payload | string | Base64-encoded audio data |
| extra_headers | string | Custom headers from the Stream XML |
Audio Chunk Details:
  • Each chunk contains approximately 20ms of audio
  • At 8kHz with μ-law encoding: ~160 bytes per chunk
  • Decode using: Buffer.from(payload, 'base64')
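
As a small illustration (not part of any SDK), the helper below decodes a media payload and estimates how much audio it carries at 8 kHz μ-law, where one byte is one sample:

function decodeChunk(mediaEvent) {
  const audio = Buffer.from(mediaEvent.media.payload, 'base64');
  // 8000 bytes per second at 8 kHz μ-law, so a 160-byte chunk is ~20 ms of audio
  const durationMs = (audio.length / 8000) * 1000;
  return { audio, durationMs };
}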

dtmf

Sent when the caller presses a key on their phone.
{
  "event": "dtmf",
  "sequenceNumber": 50,
  "streamId": "87654321-4321-4321-4321-cba987654321",
  "dtmf": {
    "track": "inbound",
    "digit": "5",
    "timestamp": "1705312250000"
  },
  "extra_headers": "userId=12345;sessionId=abc-xyz"
}
| Field | Type | Description |
|---|---|---|
| event | string | Always "dtmf" |
| sequenceNumber | number | Event sequence number |
| streamId | string (UUID) | Stream identifier |
| dtmf.track | string | Audio track ("inbound") |
| dtmf.digit | string | The DTMF digit pressed (0-9, *, #, A-D) |
| dtmf.timestamp | string | Unix timestamp in milliseconds |
| extra_headers | string | Custom headers from the Stream XML |

playedStream

Confirmation that audio with a checkpoint has finished playing.
{
  "event": "playedStream",
  "sequenceNumber": 75,
  "streamId": "87654321-4321-4321-4321-cba987654321",
  "name": "greeting-complete"
}
| Field | Type | Description |
|---|---|---|
| event | string | Always "playedStream" |
| sequenceNumber | number | Event sequence number |
| streamId | string (UUID) | Stream identifier |
| name | string | The checkpoint name you specified |

clearedAudio

Confirmation that the audio queue has been cleared.
{
  "event": "clearedAudio",
  "sequenceNumber": 80,
  "streamId": "87654321-4321-4321-4321-cba987654321"
}
| Field | Type | Description |
|---|---|---|
| event | string | Always "clearedAudio" |
| sequenceNumber | number | Event sequence number |
| streamId | string (UUID) | Stream identifier |

Output Events (Your Server → Plivo)

playAudio

Send audio to be played to the caller. For bidirectional streams only.
{
  "event": "playAudio",
  "media": {
    "contentType": "audio/x-mulaw",
    "sampleRate": 8000,
    "payload": "//uQxAAAAAANIAAAAAExBTUUzLjEwMFVV..."
  }
}
| Field | Type | Description |
|---|---|---|
| event | string | Always "playAudio" |
| media.contentType | string | Audio MIME type (must match the stream's contentType) |
| media.sampleRate | number | Sample rate in Hz (must match the stream's sample rate) |
| media.payload | string | Base64-encoded audio data |
Important: The content type and sample rate must match what was specified in your Stream XML.
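
One way to honor that requirement is to cache the mediaFormat reported in the start event and reuse it for every playAudio message. This is a sketch over a raw WebSocket connection; onStart and playAudio are illustrative helpers, not SDK methods.

let mediaFormat = { encoding: 'audio/x-mulaw', sampleRate: 8000 }; // overwritten by the start event

function onStart(event) {
  mediaFormat = event.start.mediaFormat;
}

function playAudio(ws, audioBuffer) {
  ws.send(JSON.stringify({
    event: 'playAudio',
    media: {
      contentType: mediaFormat.encoding,
      sampleRate: mediaFormat.sampleRate,
      payload: audioBuffer.toString('base64'),
    },
  }));
}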

checkpoint

Mark a point in the audio queue. Receive a playedStream event when playback reaches this point.
{
  "event": "checkpoint",
  "streamId": "87654321-4321-4321-4321-cba987654321",
  "name": "greeting-complete"
}
| Field | Type | Description |
|---|---|---|
| event | string | Always "checkpoint" |
| streamId | string (UUID) | Stream identifier |
| name | string | Unique identifier for this checkpoint |
Use Cases:
  • Track when a specific response finishes playing
  • Coordinate actions after audio playback
  • Measure time from sending audio to playback completion
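
For example, to learn exactly when a queued response has been heard in full, send a checkpoint immediately after the audio and wait for the matching playedStream event. A sketch using raw WebSocket messages; sayWithCheckpoint and onPlayedStream are illustrative helpers.

function sayWithCheckpoint(ws, streamId, audioBuffer, name) {
  ws.send(JSON.stringify({
    event: 'playAudio',
    media: { contentType: 'audio/x-mulaw', sampleRate: 8000, payload: audioBuffer.toString('base64') },
  }));
  ws.send(JSON.stringify({ event: 'checkpoint', streamId, name }));
}

// Later, when Plivo confirms that playback reached the checkpoint:
function onPlayedStream(event) {
  if (event.name === 'greeting-complete') {
    // e.g. start listening for the caller's reply, hang up, or log the latency
  }
}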

clearAudio

Clear all queued audio. Use this to implement interruption.
{
  "event": "clearAudio",
  "streamId": "87654321-4321-4321-4321-cba987654321"
}
| Field | Type | Description |
|---|---|---|
| event | string | Always "clearAudio" |
| streamId | string (UUID) | Stream identifier |

X-Headers

X-Headers (Extra Headers) allow you to pass custom metadata from your Stream XML to your WebSocket server.

Configuration

Set the extraHeaders attribute in your Stream XML:
<Stream bidirectional="true"
        extraHeaders="userId=12345;sessionId=abc-xyz;tier=premium">
    wss://your-server.com/stream
</Stream>

Format

  • Key-value pairs separated by semicolons: key1=value1;key2=value2
  • Keys and values are strings
  • URL-encode values if they contain special characters
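
A small helper (illustrative only) that builds the attribute value and URL-encodes each value so separators inside the data cannot break the format:

function buildExtraHeaders(values) {
  return Object.entries(values)
    .map(([key, value]) => `${key}=${encodeURIComponent(String(value))}`)
    .join(';');
}

// buildExtraHeaders({ userId: '12345', note: 'a;b=c' })
// => "userId=12345;note=a%3Bb%3Dc"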

Accessing X-Headers

X-Headers appear in the extra_headers field of every event:
{
  "event": "start",
  "extra_headers": "userId=12345;sessionId=abc-xyz;tier=premium",
  ...
}

Parsing X-Headers

function parseExtraHeaders(extraHeaders) {
  const headers = {};
  if (!extraHeaders) return headers;

  for (const pair of extraHeaders.split(';')) {
    const [key, value] = pair.split('=');
    if (key && value) {
      headers[key.trim()] = decodeURIComponent(value.trim());
    }
  }
  return headers;
}

// Usage
plivoServer.onStart((event, ws) => {
  const headers = parseExtraHeaders(event.extra_headers);
  console.log(headers.userId); // "12345"
  console.log(headers.sessionId); // "abc-xyz"
  console.log(headers.tier); // "premium"
});

Why Use X-Headers?

  1. Session Correlation: Pass session IDs to correlate WebSocket connections with HTTP sessions
  2. User Context: Include user IDs, account tiers, or language preferences
  3. Routing: Pass information to route audio to different processing pipelines
  4. Analytics: Include tracking IDs for analytics and debugging

Example: Dynamic Agent Selection

<!-- In your Answer URL response -->
<Response>
    <Stream bidirectional="true"
            extraHeaders="agentType=sales;language=es;customerId=cust_123">
        wss://your-server.com/stream
    </Stream>
</Response>
plivoServer.onStart((event, ws) => {
  const headers = parseExtraHeaders(event.extra_headers);

  // Route to appropriate AI agent based on headers
  const agent = initializeAgent({
    type: headers.agentType, // "sales"
    language: headers.language, // "es"
    customerId: headers.customerId, // "cust_123"
  });

  // Store agent in connection state
  connectionState.set(ws, { agent });
});

Limits

WebSocket URL Length

| Limit | Value |
|---|---|
| Maximum WebSocket URL length | 2048 characters |
This includes the full URL with protocol, host, path, and any query parameters.

Stream Limits

| Limit | Value |
|---|---|
| Maximum concurrent streams per call | 1 |
| Maximum stream duration | Same as the call duration |
| Audio buffer size (playback queue) | ~60 seconds of audio |

Rate Limits

| Limit | Value |
|---|---|
| Media events per second | ~50 (approximately 20 ms chunks) |
| Maximum playAudio events per second | No hard limit, but constrained by the playback buffer |

Message Size

| Limit | Value |
|---|---|
| Maximum WebSocket message size | 64 KB |
| Recommended audio chunk size | ≤16 KB base64-encoded |
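
If you generate long audio (for example, a full TTS response) in a single buffer, split it before sending so each base64-encoded playAudio payload stays within the limit. A sketch, assuming a playAudio helper like the one shown earlier and a ttsAudioBuffer produced by your pipeline; 12,000 raw bytes expand to roughly 16,000 base64 characters:

const MAX_RAW_BYTES = 12000; // ~16 KB after base64 expansion (4/3 ratio)

function* chunkAudio(buffer, size = MAX_RAW_BYTES) {
  for (let offset = 0; offset < buffer.length; offset += size) {
    yield buffer.subarray(offset, offset + size);
  }
}

for (const piece of chunkAudio(ttsAudioBuffer)) {
  playAudio(ws, piece); // send each piece as its own playAudio event
}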

Protocol Schema Reference

JSON Schema

{
  "$schema": "http://json-schema.org/draft-07/schema#",
  "definitions": {
    "StartEvent": {
      "type": "object",
      "required": ["event", "sequenceNumber", "start", "extra_headers"],
      "properties": {
        "event": { "const": "start" },
        "sequenceNumber": { "type": "integer", "minimum": 1 },
        "start": {
          "type": "object",
          "required": ["callId", "streamId", "accountId", "tracks", "mediaFormat"],
          "properties": {
            "callId": { "type": "string", "format": "uuid" },
            "streamId": { "type": "string", "format": "uuid" },
            "accountId": { "type": "string" },
            "tracks": { "type": "array", "items": { "type": "string" } },
            "mediaFormat": {
              "type": "object",
              "required": ["encoding", "sampleRate"],
              "properties": {
                "encoding": { "type": "string" },
                "sampleRate": { "type": "integer" }
              }
            }
          }
        },
        "extra_headers": { "type": "string" }
      }
    },
    "MediaEvent": {
      "type": "object",
      "required": ["event", "sequenceNumber", "streamId", "media", "extra_headers"],
      "properties": {
        "event": { "const": "media" },
        "sequenceNumber": { "type": "integer" },
        "streamId": { "type": "string", "format": "uuid" },
        "media": {
          "type": "object",
          "required": ["track", "timestamp", "chunk", "payload"],
          "properties": {
            "track": { "type": "string" },
            "timestamp": { "type": "string" },
            "chunk": { "type": "integer" },
            "payload": { "type": "string", "contentEncoding": "base64" }
          }
        },
        "extra_headers": { "type": "string" }
      }
    },
    "DTMFEvent": {
      "type": "object",
      "required": ["event", "sequenceNumber", "streamId", "dtmf", "extra_headers"],
      "properties": {
        "event": { "const": "dtmf" },
        "sequenceNumber": { "type": "integer" },
        "streamId": { "type": "string", "format": "uuid" },
        "dtmf": {
          "type": "object",
          "required": ["track", "digit", "timestamp"],
          "properties": {
            "track": { "type": "string" },
            "digit": { "type": "string", "pattern": "^[0-9*#A-D]$" },
            "timestamp": { "type": "string" }
          }
        },
        "extra_headers": { "type": "string" }
      }
    },
    "PlayedStreamEvent": {
      "type": "object",
      "required": ["event", "sequenceNumber", "streamId", "name"],
      "properties": {
        "event": { "const": "playedStream" },
        "sequenceNumber": { "type": "integer" },
        "streamId": { "type": "string", "format": "uuid" },
        "name": { "type": "string" }
      }
    },
    "ClearedAudioEvent": {
      "type": "object",
      "required": ["event", "sequenceNumber", "streamId"],
      "properties": {
        "event": { "const": "clearedAudio" },
        "sequenceNumber": { "type": "integer" },
        "streamId": { "type": "string", "format": "uuid" }
      }
    },
    "PlayAudioEvent": {
      "type": "object",
      "required": ["event", "media"],
      "properties": {
        "event": { "const": "playAudio" },
        "media": {
          "type": "object",
          "required": ["contentType", "sampleRate", "payload"],
          "properties": {
            "contentType": { "type": "string" },
            "sampleRate": { "type": "integer" },
            "payload": { "type": "string", "contentEncoding": "base64" }
          }
        }
      }
    },
    "CheckpointEvent": {
      "type": "object",
      "required": ["event", "streamId", "name"],
      "properties": {
        "event": { "const": "checkpoint" },
        "streamId": { "type": "string", "format": "uuid" },
        "name": { "type": "string" }
      }
    },
    "ClearAudioEvent": {
      "type": "object",
      "required": ["event", "streamId"],
      "properties": {
        "event": { "const": "clearAudio" },
        "streamId": { "type": "string", "format": "uuid" }
      }
    }
  }
}

TypeScript Types

// Input Events (Plivo → Your Server)
interface StartEvent {
  event: 'start';
  sequenceNumber: number;
  start: {
    callId: string; // UUID
    streamId: string; // UUID
    accountId: string;
    tracks: string[];
    mediaFormat: {
      encoding: string;
      sampleRate: number;
    };
  };
  extra_headers: string;
}

interface MediaEvent {
  event: 'media';
  sequenceNumber: number;
  streamId: string;
  media: {
    track: string;
    timestamp: string;
    chunk: number;
    payload: string; // Base64
  };
  extra_headers: string;
  getRawMedia(): Buffer; // SDK helper method
}

interface DTMFEvent {
  event: 'dtmf';
  sequenceNumber: number;
  streamId: string;
  dtmf: {
    track: string;
    digit: string;
    timestamp: string;
  };
  extra_headers: string;
}

interface PlayedStreamEvent {
  event: 'playedStream';
  sequenceNumber: number;
  streamId: string;
  name: string;
}

interface ClearedAudioEvent {
  event: 'clearedAudio';
  sequenceNumber: number;
  streamId: string;
}

// Output Events (Your Server → Plivo)
interface PlayAudioEvent {
  event: 'playAudio';
  media: {
    contentType: string;
    sampleRate: number;
    payload: string; // Base64
  };
}

interface CheckpointEvent {
  event: 'checkpoint';
  streamId: string;
  name: string;
}

interface ClearAudioEvent {
  event: 'clearAudio';
  streamId: string;
}

Recommendations for an Effective Plivo Stream Experience

Audio Codec and Sample Rate Considerations

Why μ-law at 8kHz is the best choice for most applications:
  1. Native Telephony Format: μ-law (PCMU) is the standard codec for telephony networks. Using this format means no transcoding is required, reducing latency.
  2. Lowest Latency: Because it’s the native format, audio passes through Plivo with minimal processing overhead.
  3. Bandwidth Efficient: μ-law compresses 16-bit audio to 8-bit, reducing data transfer by 50% while maintaining voice quality.
  4. Universal Compatibility: Every speech-to-text and text-to-speech service supports μ-law. No conversion needed.
  5. Sufficient for Voice: Human speech is well-represented at 8kHz. Higher sample rates don’t significantly improve voice AI applications.
<!-- Recommended configuration -->
<Stream bidirectional="true"
        contentType="audio/x-mulaw;rate=8000">
    wss://your-server.com/stream
</Stream>

When to Use Higher Sample Rates

Consider 16kHz (audio/x-l16;rate=16000) only if:
  • Your STT model specifically benefits from higher sample rates (verify with benchmarks)
  • You’re doing audio analysis beyond speech recognition
  • You have abundant bandwidth and can accept slightly higher latency

Minimize Latency for a Better Experience

1. Choose the Right Region for Your WebSocket Server

Key Latency Sources:
  • PSTN to Plivo: Fixed, based on caller location
  • Plivo to your server: Depends on server location
  • Your server to AI services: Depends on AI provider regions

2. Server Location Strategy

| Your Use Case | Recommended Server Location |
|---|---|
| US-focused traffic | US East (Virginia) or US West (Oregon) |
| Europe-focused traffic | Frankfurt or London |
| Asia-Pacific traffic | Singapore or Mumbai |
| Global traffic | Deploy in multiple regions with geographic routing |

3. Latency Budget

For a responsive Voice AI experience, aim for:
| Component | Target Latency |
|---|---|
| Speech-to-Text | < 200 ms |
| LLM Processing | < 500 ms |
| Text-to-Speech | < 200 ms |
| Network (round trip) | < 100 ms |
| Total | < 1 second |
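
A simple way to check whether you are inside this budget is to timestamp each stage of a turn and log the breakdown. The field names below are illustrative; you would set them around your own STT, LLM, and TTS calls.

function logTurnLatency(marks) {
  const stt = marks.sttFinalAt - marks.userStoppedAt;
  const llm = marks.llmDoneAt - marks.sttFinalAt;
  const tts = marks.firstAudioSentAt - marks.llmDoneAt;
  const total = marks.firstAudioSentAt - marks.userStoppedAt;
  console.log(`STT ${stt}ms | LLM ${llm}ms | TTS ${tts}ms | total ${total}ms`);
}

// e.g. set marks.userStoppedAt = Date.now() when end of speech is detected,
// and marks.firstAudioSentAt = Date.now() when the first playAudio message goes out.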

Where Is My Call Located? How Does Plivo Select the Location?

Plivo routes calls through the edge location closest to the caller’s location on the PSTN, not your server location.

Edge Locations:
  • United States (multiple)
  • Europe (London, Frankfurt)
  • Asia-Pacific (Singapore, Mumbai, Sydney)
  • And more
Implications:
  1. A caller in London connects to Plivo’s London edge
  2. The WebSocket connects from London to your server
  3. Position your server close to your expected caller locations

India: Phone Numbers and Regulations

Indian telecommunications regulations require:
  1. Local Presence: Indian phone numbers require local business registration
  2. DND Registry: Respect the Do Not Disturb registry for outbound calls
  3. Content Restrictions: Certain types of automated content may be restricted
Contact Plivo support for guidance on Indian number provisioning and compliance.

Where to Host Your WebSocket Server

Cloud Providers with Low-Latency Options:
| Provider | Best Regions for Voice |
|---|---|
| AWS | us-east-1, eu-west-1, ap-southeast-1 |
| Google Cloud | us-central1, europe-west1, asia-southeast1 |
| Azure | East US, West Europe, Southeast Asia |
| Fly.io | Automatic edge deployment |
| Cloudflare Workers | Global edge (for lightweight processing) |
Optimization Tips:
  1. Use the same region as your AI services when possible
  2. Deploy WebSocket servers in multiple regions for global traffic
  3. Use connection pooling for AI service clients
  4. Keep WebSocket handlers lightweight—offload heavy processing
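
As one way to follow tip 4, keep the onMedia handler itself cheap and drain a per-connection queue asynchronously. This sketch reuses the plivoServer handlers shown elsewhere in this document; processChunk is a hypothetical placeholder for the heavy work (STT, storage, analytics).

const mediaState = new WeakMap(); // per-connection queue and drain flag

plivoServer
  .onConnection((ws) => mediaState.set(ws, { queue: [], draining: false }))
  .onMedia((event, ws) => {
    const s = mediaState.get(ws);
    if (!s) return;
    s.queue.push(event.getRawMedia()); // cheap: enqueue and return immediately
    drain(ws, s);
  });

async function drain(ws, s) {
  if (s.draining) return;
  s.draining = true;
  while (s.queue.length > 0) {
    await processChunk(ws, s.queue.shift()); // hypothetical heavy processing
  }
  s.draining = false;
}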

What Is Noise Cancellation and Why Do You Need It?

The Problem: Phone calls often include background noise—traffic, coffee shops, offices, wind. This noise degrades:
  • Speech recognition accuracy
  • Voice AI response quality
  • Overall user experience
Plivo Noise Cancellation removes background noise in real-time before audio reaches your WebSocket server.

How It Works

  1. Real-time Processing: Audio is processed in milliseconds at the edge
  2. AI-Powered: Uses machine learning models trained on telephony noise patterns
  3. Voice Preservation: Enhances speech while removing noise
  4. No Code Changes: Works transparently with existing streams

Benefits

  • Higher STT Accuracy: 15-30% reduction in word error rate
  • Fewer Misunderstandings: Reduces need for “I didn’t understand that” responses
  • Better User Experience: Callers can use your voice AI from anywhere

Enable Noise Cancellation

Noise cancellation is an account-level feature. Contact Plivo to enable it:

📧 [email protected]

Or reach out to your Plivo account manager.

How-To and Examples

Start a Plivo Stream with Stream XML

Basic Answer URL Handler:
// Express.js example
app.get('/answer', (req, res) => {
  const streamUrl = `wss://${req.get('host')}/stream`;

  const xml = `<?xml version="1.0" encoding="UTF-8"?>
<Response>
    <Speak>Hello! I'm connecting you now.</Speak>
    <Stream bidirectional="true" 
            keepCallAlive="true" 
            contentType="audio/x-mulaw;rate=8000">
        ${streamUrl}
    </Stream>
</Response>`;

  res.type('application/xml').send(xml);
});
Using Plivo SDK (Node.js):
import * as Plivo from 'plivo';

app.get('/answer', (req, res) => {
  const response = new Plivo.Response();

  response.addSpeak("Hello! I'm connecting you now.");

  response.addStream(`wss://${req.get('host')}/stream`, {
    bidirectional: true,
    keepCallAlive: true,
    contentType: 'audio/x-mulaw;rate=8000',
  });

  res.type('application/xml').send(response.toXML());
});

Record a Plivo Stream with Stream XML

Record the call while streaming for compliance or training purposes:
<?xml version="1.0" encoding="UTF-8"?>
<Response>
    <Stream bidirectional="true"
            keepCallAlive="true"
            contentType="audio/x-mulaw;rate=8000">
        wss://your-server.com/stream
    </Stream>
    <Record action="https://your-server.com/recording-callback"
            recordingFormat="mp3"
            maxLength="3600"
            callbackMethod="POST"/>
</Response>

Stop a Plivo Stream with the Stream API

import Plivo from 'plivo';

const client = new Plivo.Client(process.env.PLIVO_AUTH_ID, process.env.PLIVO_AUTH_TOKEN);

// Stop stream when you need to end it programmatically
async function stopStream(callUuid) {
  try {
    await client.calls.stopStream(callUuid);
    console.log('Stream stopped successfully');
  } catch (error) {
    console.error('Failed to stop stream:', error);
  }
}

Example: Node.js Stream SDK with Deepgram, OpenAI, and ElevenLabs

A complete voice AI implementation:
import express from 'express';
import PlivoWebSocketServer from 'plivo-stream-sdk-node';
import type { StartEvent, MediaEvent, DTMFEvent } from 'plivo-stream-sdk-node';
import { createClient, LiveTranscriptionEvents } from '@deepgram/sdk';
import { ElevenLabsClient } from '@elevenlabs/elevenlabs-js';
import { OpenAI } from 'openai';

const app = express();
const PORT = 8000;

// Initialize clients
const deepgram = createClient(process.env.DEEPGRAM_API_KEY);
const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });
const elevenlabs = new ElevenLabsClient({ apiKey: process.env.ELEVENLABS_API_KEY });

// Per-connection state
const connectionState = new WeakMap();

// Plivo Answer URL
app.get('/stream', (req, res) => {
  const xml = `<?xml version="1.0" encoding="UTF-8"?>
<Response>
    <Speak>Hello! How can I help you today?</Speak>
    <Stream bidirectional="true" 
            keepCallAlive="true" 
            contentType="audio/x-mulaw;rate=8000">
        wss://${req.get('host')}/stream
    </Stream>
</Response>`;
  res.type('application/xml').send(xml);
});

const server = app.listen(PORT);

// TTS streaming function
async function streamTTS(text: string, ws: any, plivoServer: any) {
  const audioStream = await elevenlabs.textToSpeech.stream(process.env.ELEVENLABS_VOICE_ID!, {
    text,
    modelId: 'eleven_turbo_v2',
    outputFormat: 'ulaw_8000',
  });

  for await (const chunk of audioStream) {
    plivoServer.playAudio(ws, 'audio/x-mulaw', 8000, Buffer.from(chunk));
  }
}

// Create Plivo WebSocket Server
const plivoServer = new PlivoWebSocketServer({
  server,
  path: '/stream',
});

plivoServer
  .onConnection(async (ws, req) => {
    console.log('New connection');

    // Create Deepgram connection
    const dgConnection = deepgram.listen.live({
      model: 'nova-2',
      encoding: 'mulaw',
      sample_rate: 8000,
      smart_format: true,
    });

    const messages: OpenAI.Chat.ChatCompletionMessageParam[] = [];

    dgConnection.on(LiveTranscriptionEvents.Transcript, async (data) => {
      const transcript = data.channel.alternatives[0].transcript;
      if (transcript.trim()) {
        console.log('User:', transcript);

        // Get AI response
        messages.push({ role: 'user', content: transcript });
        const completion = await openai.chat.completions.create({
          model: 'gpt-4o-mini',
          messages: [
            { role: 'system', content: 'You are a helpful voice assistant. Keep responses brief and conversational.' },
            ...messages,
          ],
        });

        const response = completion.choices[0].message.content!;
        console.log('AI:', response);
        messages.push({ role: 'assistant', content: response });

        // Stream TTS
        await streamTTS(response, ws, plivoServer);
      }
    });

    connectionState.set(ws, { dgConnection, messages });
  })
  .onMedia((event: MediaEvent, ws) => {
    const state = connectionState.get(ws);
    if (state?.dgConnection) {
      state.dgConnection.send(event.getRawMedia());
    }
  })
  .onDtmf((event: DTMFEvent, ws) => {
    console.log('DTMF:', event.dtmf.digit);

    // Clear audio on * press (interrupt)
    if (event.dtmf.digit === '*') {
      plivoServer.clearAudio(ws);
    }
  })
  .onClose((ws) => {
    const state = connectionState.get(ws);
    if (state?.dgConnection) {
      state.dgConnection.requestClose();
    }
  })
  .start();

Sending and Receiving DTMFs

Handle DTMF input for menu navigation or controls:
plivoServer.onDtmf((event: DTMFEvent, ws) => {
  const { digit, timestamp } = event.dtmf;

  switch (digit) {
    case '1':
      // Transfer to sales
      streamTTS('Connecting you to sales.', ws, plivoServer);
      break;
    case '2':
      // Transfer to support
      streamTTS('Connecting you to support.', ws, plivoServer);
      break;
    case '*':
      // Interrupt current response
      plivoServer.clearAudio(ws);
      streamTTS('Response cleared. How can I help?', ws, plivoServer);
      break;
    case '#':
      // Repeat last response
      const state = connectionState.get(ws);
      if (state?.lastResponse) {
        streamTTS(state.lastResponse, ws, plivoServer);
      }
      break;
    default:
      console.log(`Received DTMF: ${digit}`);
  }
});

Example with Python Stream SDK

import asyncio
import base64
import json
import os

import websockets
from deepgram import DeepgramClient, LiveTranscriptionEvents
from elevenlabs import ElevenLabs
from openai import OpenAI

# Initialize clients
deepgram = DeepgramClient(os.environ["DEEPGRAM_API_KEY"])
openai_client = OpenAI()
elevenlabs = ElevenLabs(api_key=os.environ["ELEVENLABS_API_KEY"])

messages = []

async def handle_stream(websocket):
    # Set up Deepgram
    dg_connection = deepgram.listen.live.v("1")

    async def on_transcript(self, result, **kwargs):
        transcript = result.channel.alternatives[0].transcript
        if transcript.strip():
            print(f"User: {transcript}")

            # Get AI response
            messages.append({"role": "user", "content": transcript})
            completion = openai_client.chat.completions.create(
                model="gpt-4o-mini",
                messages=[
                    {"role": "system", "content": "You are a helpful voice assistant."},
                    *messages
                ]
            )
            response = completion.choices[0].message.content
            messages.append({"role": "assistant", "content": response})

            # Stream TTS
            audio_stream = elevenlabs.text_to_speech.stream(
                voice_id=os.environ["ELEVENLABS_VOICE_ID"],
                text=response,
                model_id="eleven_turbo_v2",
                output_format="ulaw_8000"
            )

            for chunk in audio_stream:
                await websocket.send(json.dumps({
                    "event": "playAudio",
                    "media": {
                        "contentType": "audio/x-mulaw",
                        "sampleRate": 8000,
                        "payload": base64.b64encode(chunk).decode()
                    }
                }))

    dg_connection.on(LiveTranscriptionEvents.Transcript, on_transcript)
    await dg_connection.start({"model": "nova-2", "encoding": "mulaw", "sample_rate": 8000})

    async for message in websocket:
        data = json.loads(message)

        if data["event"] == "media":
            audio = base64.b64decode(data["media"]["payload"])
            dg_connection.send(audio)

        elif data["event"] == "dtmf":
            print(f"DTMF: {data['dtmf']['digit']}")
            if data["dtmf"]["digit"] == "*":
                await websocket.send(json.dumps({
                    "event": "clearAudio",
                    "streamId": data["streamId"]
                }))

async def main():
    async with websockets.serve(handle_stream, "0.0.0.0", 8000):
        await asyncio.Future()

asyncio.run(main())

Example with Pipecat

Pipecat is an open-source framework for building voice AI pipelines.
import asyncio
import os
from pipecat.pipeline.pipeline import Pipeline
from pipecat.pipeline.runner import PipelineRunner
from pipecat.services.deepgram import DeepgramSTTService
from pipecat.services.openai import OpenAILLMService
from pipecat.services.elevenlabs import ElevenLabsTTSService
from pipecat.transports.services.plivo import PlivoTransport

async def main():
    # Configure transport
    transport = PlivoTransport(
        host="0.0.0.0",
        port=8000,
        path="/stream"
    )

    # Configure services
    stt = DeepgramSTTService(
        api_key=os.environ["DEEPGRAM_API_KEY"],
        model="nova-2"
    )

    llm = OpenAILLMService(
        api_key=os.environ["OPENAI_API_KEY"],
        model="gpt-4o-mini",
        system_prompt="You are a helpful voice assistant. Be concise."
    )

    tts = ElevenLabsTTSService(
        api_key=os.environ["ELEVENLABS_API_KEY"],
        voice_id=os.environ["ELEVENLABS_VOICE_ID"],
        model_id="eleven_turbo_v2"
    )

    # Build pipeline
    pipeline = Pipeline([
        transport.input(),
        stt,
        llm,
        tts,
        transport.output()
    ])

    # Run
    runner = PipelineRunner()
    await runner.run(pipeline)

asyncio.run(main())

General Considerations for Voice AI Agents

Noise Cancellation

Why it matters: Background noise is the #1 cause of speech recognition errors.

Implementation:
  1. Enable Plivo’s built-in noise cancellation (contact support)
  2. Consider client-side noise suppression for high-quality microphones
  3. For mobile callers, noise is especially prevalent

Voice Activity Detection (VAD) and Turn Detection

The Challenge: Knowing when the user has finished speaking.

Approaches:
  1. Silence-based VAD: Wait for N milliseconds of silence
    • Pros: Simple
    • Cons: Slow, doesn’t handle pauses well
  2. STT End-of-Speech Detection: Most STT services provide speech_final events
    • Pros: Understands speech patterns
    • Cons: Slight delay
  3. Semantic Turn Detection: Use LLM to determine if response is needed
    • Pros: Handles complex dialogue
    • Cons: Added latency
Recommendation: Combine STT’s speech_final with a short timeout (300-500ms).
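
A sketch of that combination; sttClient, its transcript event shape (text, speech_final), and respondTo are assumptions standing in for your actual STT client and response pipeline.

const TURN_TIMEOUT_MS = 400; // within the 300-500 ms range suggested above
let turnTimer = null;
let pendingTranscript = '';

sttClient.on('transcript', (data) => {
  if (data.text) pendingTranscript += ' ' + data.text;

  // Any new speech cancels a pending end-of-turn decision
  if (turnTimer) clearTimeout(turnTimer);

  if (data.speech_final) {
    turnTimer = setTimeout(() => {
      respondTo(pendingTranscript.trim()); // run LLM + TTS on the full utterance
      pendingTranscript = '';
    }, TURN_TIMEOUT_MS);
  }
});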

Interruption

Users should be able to interrupt the AI mid-response.

Implementation:
// Set isPlaying = true whenever you queue playAudio for this connection
let isPlaying = false;
let interruptionBuffer: Buffer[] = [];

plivoServer
  .onMedia((event, ws) => {
    const audio = event.getRawMedia();

    // Send to STT
    sttClient.send(audio);

    // If user speaks while AI is playing, they might be interrupting
    if (isPlaying) {
      // Accumulate audio and check for speech
      interruptionBuffer.push(audio);
    }
  })
  .onPlayedStream((event, ws) => {
    isPlaying = false;
  });

// In the STT callback (register it per connection so `ws` stays in scope)
sttClient.on('transcript', (data) => {
  if (data.isFinal && isPlaying) {
    // User interrupted: stop playback and handle the new input
    plivoServer.clearAudio(ws);
    isPlaying = false;
    interruptionBuffer = [];

    handleUserInput(data.transcript);
  }
});

Context Management

Maintain conversation context for coherent multi-turn dialogue:
interface ConversationContext {
  messages: Array<{ role: string; content: string }>;
  userProfile?: {
    name?: string;
    preferences?: Record<string, any>;
  };
  sessionData?: Record<string, any>;
}

// Per-connection context
const contexts = new WeakMap<WebSocketType, ConversationContext>();

function getSystemPrompt(context: ConversationContext): string {
  let prompt = `You are a helpful voice assistant.`;

  if (context.userProfile?.name) {
    prompt += ` The user's name is ${context.userProfile.name}.`;
  }

  if (context.sessionData?.topic) {
    prompt += ` You are currently helping with ${context.sessionData.topic}.`;
  }

  return prompt;
}

// Limit context size to control costs and latency
function trimContext(messages: Array<{ role: string; content: string }>, maxMessages = 20) {
  if (messages.length > maxMessages) {
    // Keep system message + recent messages
    return [messages[0], ...messages.slice(-maxMessages + 1)];
  }
  return messages;
}

Best Practices Summary

| Aspect | Recommendation |
|---|---|
| Codec | μ-law at 8000 Hz for lowest latency |
| Response Time | Aim for < 1 second total |
| Interruption | Always support it; use clearAudio |
| DTMF | Support * for interrupt, # for repeat |
| Error Handling | Graceful fallbacks; don't leave the user hanging |
| Context | Maintain conversation history; trim when needed |
| Testing | Test on real phone calls, not just WebSocket clients |

Support

For questions, issues, or feature requests, contact Plivo support or your account manager.
Last updated: January 2026