How To Use Speech Synthesis Markup Language to Master Text-to-Speech Requests

Text-to-speech technology (TTST) has evolved from an intervention for a select few into a comprehensive assistive strategy available to everyone. Computer-generated, synthesized speech has become very popular. TTST supports many business-related tasks and, in many cases, acts as a virtual assistant.

For end users, the process is usually straightforward: they type the desired message, select one of the available languages and voice types (gender, tonality, etc.), and press a button to convert the text to speech.

Developers, on the other hand, face a more complex process when creating and configuring a text-to-speech engine. They use Speech Synthesis Markup Language (SSML) tags to customize speech, build high-quality voice applications, and convert written text into a natural-sounding voice. With SSML tags you can add pauses, control how numbers, dates, and times are read, and give other pronunciation instructions that improve the user experience.

What is the Speech Synthesis Markup Language (SSML)?

Speech Synthesis Markup Language (SSML) is an XML-based markup language. It can be used to generate speech in a variety of applications, mobile devices, websites, and Internet of Things (IoT) devices. You can also use SSML to control the finer aspects of speech, such as pronunciation, inflection, and pitch, in every language your project supports.

SSML augments the plain text passed to a synthesizer with predefined elements, or tags. The XML text is first processed by a parser, which separates the plain text from the markup. The markup is then turned into instructions that tell the synthesizer how to produce the desired effects. The markup must be well-formed for the XML parser to do this job correctly.
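
As a rough illustration of the parsing step described above, the Python sketch below uses the standard library's xml.etree.ElementTree to separate the plain text from the markup and to check whether a document is well-formed. The function names and sample strings are invented for this example, and a real TTS engine's parser will be more involved.

```python
import xml.etree.ElementTree as ET


def split_text_and_markup(ssml: str):
    """Return (plain_text, tag_names) for a well-formed SSML string."""
    root = ET.fromstring(ssml)  # raises ET.ParseError if the markup is not well-formed
    # Collect all character data and normalize runs of whitespace.
    plain_text = " ".join("".join(root.itertext()).split())
    # Collect the element names in document order.
    tag_names = [element.tag for element in root.iter()]
    return plain_text, tag_names


def is_well_formed(ssml: str) -> bool:
    """Report whether an XML parser can read the document at all."""
    try:
        ET.fromstring(ssml)
        return True
    except ET.ParseError:
        return False


# Illustrative inputs: the second one never closes its <break> tag.
GOOD = '<speak version="1.1" xml:lang="en-US">Hello <break time="500ms"/> world.</speak>'
BAD = '<speak version="1.1" xml:lang="en-US">Hello <break time="500ms"> world.</speak>'

text, tags = split_text_and_markup(GOOD)
print(text, tags)
print(is_well_formed(BAD))
```

The unclosed tag in the second string is enough to make the parser reject the whole document, which is why well-formedness matters before any speech is produced.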

What are the elements of SSML?

Let’s look at the elements that make up SSML to get a better grasp of how you can use them to master your text-to-speech requests. An SSML element consists of an opening tag, a closing tag, and any text or elements contained within.

In an SSML document, the root element is <speak>. The <meta>, <metadata>, and <lexicon> elements must appear before all other elements and text inside it. Beyond that, the specification does not require elements to appear in any particular order. The root element must, however, carry an xml:lang attribute that specifies the language of the root document.
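
The structural rules above can be sketched as a small checker. This is only an illustration written with Python's standard library, not a full SSML validator; the function names and the sample documents (including the example.com lexicon URI) are assumptions made for the example.

```python
import xml.etree.ElementTree as ET

# ElementTree expands the xml: prefix to this fixed namespace URI.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"
HEADER_TAGS = {"meta", "metadata", "lexicon"}


def local_name(tag: str) -> str:
    """Drop a '{namespace}' prefix, in case the document declares one."""
    return tag.rsplit("}", 1)[-1]


def check_structure(ssml: str) -> list:
    """Return a list of human-readable problems (empty if none were found)."""
    problems = []
    root = ET.fromstring(ssml)
    if local_name(root.tag) != "speak":
        problems.append("root element is not <speak>")
    if XML_LANG not in root.attrib:
        problems.append("<speak> is missing xml:lang")

    # Track whether any text or non-header element has been seen yet.
    seen_content = bool(root.text and root.text.strip())
    for child in root:
        name = local_name(child.tag)
        if name in HEADER_TAGS:
            if seen_content:
                problems.append(f"<{name}> appears after other content")
        else:
            seen_content = True
        if child.tail and child.tail.strip():
            seen_content = True
    return problems


OK_DOC = (
    '<speak version="1.1" xml:lang="en-US">'
    '<lexicon uri="https://example.com/lexicon.pls"/>Hello.</speak>'
)
BAD_DOC = (
    '<speak version="1.1">Hello.'
    '<lexicon uri="https://example.com/lexicon.pls"/></speak>'
)

print(check_structure(OK_DOC))
print(check_structure(BAD_DOC))
```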

The speak element <speak>

This is the root element for SSML documents. It accepts the attributes version, xml:lang, and xml:base.

The audio element <audio>

The element inserts recorded audio into the service-generated speech.

The break element <break>

The element inserts a pause into spoken text, with optional attributes of strength and time.
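
A minimal sketch of how a pause might be placed between two sentences, built with Python's xml.etree.ElementTree. The helper name with_pause and the sample sentences are invented for illustration, and the set of accepted time values depends on the service you use.

```python
import xml.etree.ElementTree as ET


def with_pause(before: str, after: str, time: str = "500ms") -> str:
    """Build a <speak> document with a <break> between two pieces of text."""
    speak = ET.Element("speak", {"version": "1.1", "xml:lang": "en-US"})
    speak.text = before
    pause = ET.SubElement(speak, "break", {"time": time})
    pause.tail = after  # text that follows the empty <break> element
    return ET.tostring(speak, encoding="unicode")


pause_ssml = with_pause("Your code is 4 2 7.", " Please enter it now.", time="1s")
print(pause_ssml)
```

Swapping the time attribute for strength (for example strength="strong") asks for a pause by relative size instead of by duration.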

The desc element <desc>

The element can occur only within an <audio> element, where it provides a text description of the audio source.

The emphasis element <emphasis>

The element indicates that the text enclosed should be spoken with emphasis.

The lexicon element <lexicon>

For each SSML document, this element sets up pronunciation dictionaries.

The mark element <mark>

This is an empty element that places a marker in the text to be synthesized. When all of the text that precedes the <mark> element has been synthesized, the client is notified.

The meta <meta> and metadata <metadata> elements

Each of these elements holds information about the document.


The paragraph <p> and sentence <s> elements

These elements are optional. They act as hints about the structure of the text and add pauses to the synthesized audio.

The phoneme element <phoneme>

The element defines how a word sounds, how the sound divides into syllables, and which syllables receive stress. It has two attributes: alphabet, which specifies the phonetic alphabet to be used, and ph, which provides the pronunciation in that alphabet.
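
As an illustration of the two attributes, the following Python sketch wraps a word in a <phoneme> element. The helper name and the IPA transcription are illustrative choices, and whether a service accepts "ipa" or another alphabet name is something to confirm in its documentation.

```python
import xml.etree.ElementTree as ET


def phoneme(word: str, ph: str, alphabet: str = "ipa") -> str:
    """Wrap a word in <phoneme> with an explicit pronunciation."""
    element = ET.Element("phoneme", {"alphabet": alphabet, "ph": ph})
    element.text = word
    # encoding="unicode" returns a str and keeps the phonetic symbols readable.
    return ET.tostring(element, encoding="unicode")


# An illustrative transcription of "tomato" in the International Phonetic Alphabet.
phoneme_ssml = phoneme("tomato", "təˈmeɪtoʊ")
print(phoneme_ssml)
```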

The prosody element <prosody>

The element controls the pitch and speaking rate of the enclosed text. The pitch attribute sets the baseline pitch for the text within the element, while the rate attribute sets the speaking rate for that text.
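
A small sketch of how the pitch and rate attributes might be applied from code, this time with plain string formatting and explicit escaping. The helper name is invented, and the labels "low" and "slow" are used on the assumption that the target service supports the named values alongside numeric ones.

```python
from xml.sax.saxutils import escape, quoteattr


def prosody(text: str, pitch: str = None, rate: str = None) -> str:
    """Wrap text in <prosody>, emitting only the attributes that were given."""
    attributes = "".join(
        f" {name}={quoteattr(value)}"
        for name, value in (("pitch", pitch), ("rate", rate))
        if value is not None
    )
    # escape() protects characters such as & and < inside the spoken text.
    return f"<prosody{attributes}>{escape(text)}</prosody>"


prosody_ssml = prosody("Terms & conditions apply.", pitch="low", rate="slow")
print(prosody_ssml)
```

Escaping the text matters here: an unescaped ampersand would make the document ill-formed and the request would be rejected.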

The say-as element <say-as>

The element provides information about the type of text contained within the element. It has one required attribute, interpret-as, which indicates how the enclosed text is to be interpreted, and two optional attributes, format and detail, which are used only with particular values of the interpret-as attribute.
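
The sketch below shows one way the required and optional attributes could be assembled. The helper name is hypothetical, and the values "date", "mdy", and "digits" are assumptions: they are common across TTS services, but the supported set varies, so check your provider's list.

```python
import xml.etree.ElementTree as ET


def say_as(text: str, interpret_as: str, fmt: str = None, detail: str = None) -> str:
    """Wrap text in <say-as>; format and detail are added only when supplied."""
    attributes = {"interpret-as": interpret_as}
    if fmt is not None:
        attributes["format"] = fmt
    if detail is not None:
        attributes["detail"] = detail
    element = ET.Element("say-as", attributes)
    element.text = text
    return ET.tostring(element, encoding="unicode")


# Assumed values: read the first string as a month/day/year date,
# and the second one digit by digit.
date_ssml = say_as("12/25/2024", "date", fmt="mdy")
digits_ssml = say_as("4027", "digits")
print(date_ssml)
print(digits_ssml)
```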

The sub element <sub>

The element indicates that the text specified by the alias attribute will replace the text enclosed within the element when speech is synthesized.
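
One possible use of the alias attribute is expanding abbreviations automatically before a request is sent. The following is a sketch under that assumption; the ALIASES table and the function name are made up for the example.

```python
import re
from xml.sax.saxutils import escape, quoteattr

# Hypothetical lookup table: written form -> spoken form.
ALIASES = {
    "W3C": "World Wide Web Consortium",
    "SSML": "Speech Synthesis Markup Language",
}


def expand_abbreviations(text: str, aliases: dict) -> str:
    """Wrap each known abbreviation in <sub alias="..."> and escape the rest."""
    pattern = re.compile(r"\b(" + "|".join(map(re.escape, aliases)) + r")\b")
    parts, position = [], 0
    for match in pattern.finditer(text):
        parts.append(escape(text[position:match.start()]))
        alias = quoteattr(aliases[match.group(1)])
        parts.append(f"<sub alias={alias}>{escape(match.group(1))}</sub>")
        position = match.end()
    parts.append(escape(text[position:]))
    return "".join(parts)


sub_ssml = expand_abbreviations("The W3C publishes SSML.", ALIASES)
print(sub_ssml)
```

The written text stays in the document, so a transcript still shows "W3C" while the listener hears the full name.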

The voice element <voice>

The element requests a change in the voice.

How can you use SSML in text-to-speech requests?

Modern text-to-speech conversational systems put most of the elements above into action to help you build your text-to-speech requests. You no longer have to create a tool from scratch to get valid and efficient speech output. You can use these ready-made services in your projects instead.

A ready-to-use TTS tool automatically validates all SSML elements, ignores unsupported ones, and returns an HTTP 400 error code for problems such as invalid elements or symbols, missing vowels, invalid locations, and invalid syllable stress. You can easily integrate a text-to-speech tool into your projects through a REST API. Some services also let you add and fine-tune custom voices that match your product or brand’s tone of voice for better customer service.
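
To make the REST integration more concrete, here is a hedged sketch of what such a request could look like using only Python's standard library. Everything service-specific in it is hypothetical: the URL, the "ssml" JSON field, and the bearer-token header are placeholders, so replace them with the values from your provider's API reference.

```python
import json
import urllib.error
import urllib.request

# Hypothetical endpoint; not a real service URL.
TTS_URL = "https://tts.example.com/v1/synthesize"


def build_tts_request(ssml: str, api_key: str) -> urllib.request.Request:
    """Prepare (but do not send) a POST request carrying the SSML document."""
    body = json.dumps({"ssml": ssml}).encode("utf-8")  # assumed payload shape
    return urllib.request.Request(
        TTS_URL,
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",  # assumed auth scheme
        },
    )


def synthesize(ssml: str, api_key: str) -> bytes:
    """Send the request and return the audio bytes, surfacing HTTP 400 clearly."""
    try:
        with urllib.request.urlopen(build_tts_request(ssml, api_key)) as response:
            return response.read()
    except urllib.error.HTTPError as error:
        if error.code == 400:
            detail = error.read().decode("utf-8", "replace")
            raise ValueError(f"The service rejected the SSML: {detail}") from error
        raise


request = build_tts_request('<speak xml:lang="en-US">Hello.</speak>', "my-api-key")
print(request.get_method(), request.full_url)
```

Building the request separately from sending it keeps the SSML payload easy to inspect and test before any network call is made.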

Start your next text-to-speech project

Routee’s Text-to-Speech service empowers your dev projects, applications, tools, or devices to convert text into human-like synthesized speech. You can use human-like neural voices in over 16 languages to broadcast your messages to your audience worldwide in their native language and elevate the customer experience. Check out how easy it is to integrate this modern TTS solution into your projects.
