Getting started with Speech Synthesis Markup Language (SSML)

The World Wide Web Consortium (W3C) created Speech Synthesis Markup Language (SSML) as an XML-based markup language to assist in generating natural-sounding synthesized speech. The Plivo Speak XML element supports SSML-based speech, powered by Amazon Polly, in 27 languages and more than 40 voices, and lets developers control pronunciation, pitch, and volume. Here’s how SSML appears within Plivo Speak XML elements:

<Response>
    <!-- Basic text-to-speech -->
    <Speak voice="MAN">Go Green, Go Plivo</Speak>
    <!-- Text-to-speech using SSML -->
    <Speak voice="Polly.Joey">
        <emphasis level="moderate">Go Green, Go Plivo</emphasis>
    </Speak>
</Response>

To synthesize SSML speech on Plivo, specify one of the Amazon Polly voices in the voice attribute of Plivo’s <Speak> XML tag. Polly voice names must carry the Polly. prefix, as in Polly.Joey. For example:

<Response>
    <Speak voice="Polly.Joey">
        <emphasis level="moderate">Go Green, Go Plivo</emphasis>
    </Speak>
</Response>
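
As a minimal sketch of the rule above, the following Python builds the same document using only the standard library's xml.etree.ElementTree rather than the Plivo SDK. The helper name build_speak_xml and its up-front check for the Polly. prefix are illustrative assumptions, not part of any Plivo API.

```python
import xml.etree.ElementTree as ET


def build_speak_xml(text, voice, level="moderate"):
    """Return a Plivo-style <Response> with one emphasized <Speak> element.

    Hypothetical helper for illustration; it is not part of the Plivo SDK.
    """
    # Assumption: reject non-Polly voices early, since SSML needs a Polly voice.
    if not voice.startswith("Polly."):
        raise ValueError("SSML requires a voice such as 'Polly.Joey'")
    response = ET.Element("Response")
    speak = ET.SubElement(response, "Speak", {"voice": voice})
    emphasis = ET.SubElement(speak, "emphasis", {"level": level})
    emphasis.text = text
    # encoding="unicode" returns a str without an XML declaration.
    return ET.tostring(response, encoding="unicode")


xml_doc = build_speak_xml("Go Green, Go Plivo", "Polly.Joey")
print(xml_doc)
```

Passing a basic voice such as MAN to this helper raises ValueError, which surfaces a missing prefix before the XML is ever returned to Plivo.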

SSML tags

You can use these SSML tags within Plivo XML.

SSML Tag	Action	Description
<break>	Add a pause	Use this tag to include a pause in the speech.
<emphasis>	Emphasize words	Use this tag to change the rate and volume of the speech.
<lang>	Specify another language for specific words	Use this tag to set the natural language of the text.
<p>	Add a pause between paragraphs	Use this tag to represent a paragraph.
<phoneme>	Use phonetic pronunciation	Use this tag to set phonetic pronunciation for specific text.
<prosody>	Control volume, speaking rate, and pitch	Use this tag to modify the volume, speaking rate, and pitch of the tagged text.
<s>	Add a pause between sentences	Use this tag to represent a sentence. This adds a strong break before and after the tag.
<say-as>	Control how special types of words are spoken	Use this tag to describe how to interpret the text.
<sub>	Pronounce acronyms and abbreviations	Use this tag to pronounce the specified words or phrases as different words or phrases.
<w>	Improve pronunciation by specifying parts of speech	Use this tag to customize the pronunciation of words by specifying the part of speech they are.

Note: Plivo doesn’t support these Amazon Polly-specific tags in Plivo XML:

<amazon:auto-breaths>
<amazon:effect name="drc">
<amazon:effect phonation="soft">
<amazon:effect vocal-tract-length>
<amazon:effect name="whispered">
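
One way to catch unsupported markup before it reaches Plivo is to scan the SSML for tag names that are not in the supported table. The sketch below is a rough illustration using a regular expression; the function name find_unsupported_tags is hypothetical, and the check assumes it is given only the content placed inside <Speak>.

```python
import re

# Tags listed as supported in the SSML tags table.
SUPPORTED_TAGS = {
    "break", "emphasis", "lang", "p", "phoneme",
    "prosody", "s", "say-as", "sub", "w",
}

# Captures the name of each opening or self-closing tag,
# e.g. "prosody" or "amazon:effect". Closing tags are skipped.
TAG_NAME = re.compile(r"<\s*([A-Za-z][\w:.-]*)")


def find_unsupported_tags(ssml):
    """Return the sorted names of tags in `ssml` that are not in SUPPORTED_TAGS."""
    names = set(TAG_NAME.findall(ssml))
    return sorted(names - SUPPORTED_TAGS)


sample = (
    '<prosody rate="slow">Hello</prosody> '
    '<amazon:effect name="whispered">there</amazon:effect>'
)
print(find_unsupported_tags(sample))  # ['amazon:effect']
```

A regular expression is used instead of an XML parser because the amazon: prefix is normally written without a namespace declaration, which a strict parser would reject.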

SSML voices

Plivo supports these Amazon Polly voices for use with Plivo XML:

Language	Female	Male
Australian English (en-AU)	Polly.Nicole	Polly.Russell
Brazilian Portuguese (pt-BR)	Polly.Vitória	Polly.Ricardo
Canadian French (fr-CA)	Polly.Chantal	-
Danish (da-DK)	Polly.Naja	Polly.Mads
Dutch (nl-NL)	Polly.Lotte	Polly.Ruben
French (fr-FR)	Polly.Celine	Polly.Mathieu
	Polly.Lea	-
German (de-DE)	Polly.Vicki	Polly.Hans
	Polly.Marlene	-
Hindi (hi-IN)	Polly.Aditi	-
Icelandic (is-IS)	Polly.Dora	Polly.Karl
Indian English (en-IN)	Polly.Raveena	-
	Polly.Aditi	-
Italian (it-IT)	Polly.Carla	Polly.Giorgio
Japanese (ja-JP)	Polly.Mizuki	Polly.Takumi
Korean (ko-KR)	Polly.Seoyeon	-
Mandarin Chinese (cmn-CN)	Polly.Zhiyu	-
Norwegian (nb-NO)	Polly.Liv	-
Polish (pl-PL)	Polly.Ewa	Polly.Jacek
	Polly.Maja	Polly.Jan
Portuguese - Iberic (pt-PT)	Polly.Ines	Polly.Cristiano
Romanian (ro-RO)	Polly.Carmen	-
Russian (ru-RU)	Polly.Tatyana	Polly.Maxim
Spanish - Castilian (es-ES)	Polly.Conchita	Polly.Enrique
Spanish - Mexican (es-MX)	Polly.Mia	-
Spanish - US (es-US)	Polly.Penelope	Polly.Miguel
	Polly.Lupe-Standard	-
Swedish (sv-SE)	Polly.Astrid	-
Turkish (tr-TR)	Polly.Filiz	-
UK English (en-GB)	Polly.Amy	Polly.Brian
	Polly.Emma	-
US English (en-US)	Polly.Joanna	Polly.Matthew
	Polly.Salli	Polly.Justin
	Polly.Kendra	Polly.Joey
	Polly.Kimberly	-
	Polly.Ivy	-
Welsh (cy-GB)	Polly.Gwyneth	-
Welsh English (en-GB-WLS)	-	Polly.Geraint

Character limit

To ensure quick synthesis, Plivo caps the length of text that can be synthesized in one <Speak> tag at 3,000 characters.
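
Text longer than the cap has to be divided across several <Speak> elements. The sketch below shows one possible way to do that for plain text, preferring sentence boundaries. The function name split_for_speak is hypothetical, and the sketch assumes the limit is counted on the text passed in; the source does not say whether markup counts toward the 3,000 characters.

```python
import re

MAX_SPEAK_CHARS = 3000  # cap stated in the Character limit section


def split_for_speak(text, limit=MAX_SPEAK_CHARS):
    """Split plain text into chunks of at most `limit` characters.

    Prefers sentence boundaries; a sentence longer than `limit` is hard-cut.
    """
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        # Hard-cut any sentence that cannot fit in a chunk on its own.
        while len(sentence) > limit:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(sentence[:limit])
            sentence = sentence[limit:]
        candidate = f"{current} {sentence}".strip()
        if len(candidate) <= limit:
            current = candidate
        else:
            # The sentence does not fit; close the chunk and start a new one.
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks


print(split_for_speak("One. Two. Three.", limit=10))  # ['One. Two.', 'Three.']
```

Each returned chunk would then be placed in its own <Speak> element. SSML markup is not handled here; splitting tagged text safely would require keeping each element intact.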

Pricing

Support for SSML-based speech synthesis is currently in beta and free for all Plivo users. We expect to eventually charge for text-to-speech on the basis of the number of characters synthesized.

SSML support in Plivo Server SDKs

SSML tags are supported in all of our Server SDKs.

Example

These examples use the Joey voice for US English (en-US). Use the voice attribute of the <Speak> tag to specify the voice for your text.

say-as

The say-as tag describes how to interpret the text.

from flask import Flask, Response
from plivo import plivoxml

app = Flask(__name__)

@app.route("/ssml/", methods=["GET", "POST"])
def ssml():
    # Build <Response><Speak> with a nested <say-as interpret-as="date"> element.
    element = plivoxml.ResponseElement()
    response = (
        element.add(
            plivoxml.SpeakElement(content="The date is", voice="Polly.Joey", language="en-US")
            .add_say_as("20200626", interpret_as="date")
        )
        .to_string(False)
    )
    print(response)  # log the generated XML
    return Response(response, mimetype="text/xml")

if __name__ == "__main__":
    app.run(host="0.0.0.0", debug=True)

The rendered XML document would be:

<Response>
    <Speak voice="Polly.Joey">The date is
      <say-as interpret-as="date">20200626</say-as>
    </Speak>
</Response>

w

The w tag lets you customize the pronunciation of a word by specifying its part of speech.

from flask import Flask, Response
from plivo import plivoxml

app = Flask(__name__)

@app.route("/ssml/", methods=["GET", "POST"])
def ssml():
    # Build <Response><Speak> that spells out "read", then speaks it as two verb forms.
    element = plivoxml.ResponseElement()
    response = (
        element.add(
            plivoxml.SpeakElement(content="The word", voice="Polly.Joey", language="en-US")
            .add_say_as("read", interpret_as="characters")
            .add_s("may be interpreted as either the present simple form")
            .add_w("read", role="amazon:VB")
            .add_s("or the past participle form")
            .add_w("read", role="amazon:VBD")
        )
        .to_string(False)
    )
    print(response)  # log the generated XML
    return Response(response, mimetype="text/xml")

if __name__ == "__main__":
    app.run(host="0.0.0.0", debug=True)

The rendered XML document would be:

<Response>
    <Speak voice="Polly.Joey">The word
      <say-as interpret-as="characters">read</say-as>
      <s>
          may be interpreted as either the present simple form
      </s>
      <w role="amazon:VB">read</w>
      <s>or the past participle form</s>
      <w role="amazon:VBD">read</w>
    </Speak>
</Response>

More examples

<Response>
    <Speak voice="Polly.Joey">I can speak in a
      <prosody pitch="high">higher pitched voice</prosody>
      , or I can speak
      <prosody pitch="low">in a lower pitched voice</prosody>
    </Speak>
</Response>

<Response>
    <Speak voice="Polly.Joey">I can speak
      <prosody rate="x-slow">really slowly</prosody>
      , or I can speak
      <prosody rate="x-fast">really fast</prosody>
    </Speak>
</Response>

<Response>
    <Speak voice="Polly.Joey">I can also speak
      <prosody volume="x-loud">very loudly</prosody>
      , or I can speak
      <prosody volume="x-soft">very quietly</prosody>.
    </Speak>
</Response>
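
The prosody examples in this section are given as XML only. As a rough sketch, equivalent markup can be generated in Python with the standard library's xml.etree.ElementTree instead of the Plivo SDK. The helper name prosody_speak and its list-of-pairs input are assumptions made for illustration, and the Polly.Joey voice is chosen to match the earlier examples.

```python
import xml.etree.ElementTree as ET


def prosody_speak(voice, parts):
    """Build <Response><Speak> from (text, prosody_attrs) pairs.

    A part whose attrs is None is plain text; otherwise it is wrapped in <prosody>.
    """
    response = ET.Element("Response")
    speak = ET.SubElement(response, "Speak", {"voice": voice})
    last = None  # most recently added <prosody> child, if any
    for text, attrs in parts:
        if attrs is None:
            # Plain text goes in Speak.text before the first child,
            # otherwise in the tail of the previous child.
            if last is None:
                speak.text = (speak.text or "") + text
            else:
                last.tail = (last.tail or "") + text
        else:
            last = ET.SubElement(speak, "prosody", attrs)
            last.text = text
    return ET.tostring(response, encoding="unicode")


print(prosody_speak("Polly.Joey", [
    ("I can speak in a ", None),
    ("higher pitched voice", {"pitch": "high"}),
    (", or I can speak ", None),
    ("in a lower pitched voice", {"pitch": "low"}),
]))
```

The same helper covers the rate and volume variants by passing, for example, {"rate": "x-slow"} or {"volume": "x-loud"} as the attributes.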
