In this article I want to show you how to build a programmable voice solution on top of Twilio services. One of the pieces that will help us is TwiML.
TwiML, or Twilio Markup Language, is an XML-based language that tells Twilio how to handle various events, in particular incoming and outgoing calls. To start working with TwiML we need to configure a specific Twilio service called a TwiML application. It handles the calls for all phone numbers linked to it.
The entry point of your programmable voice solution lives inside this TwiML application. There you can set a
REQUEST URL endpoint, which is triggered as soon as the app receives an incoming phone call.
While developing, you can point this property at a public URL that forwards to your locally running application, for example a tunnel created with ngrok.
Now we know that whenever a new call comes in, the TwiML application will trigger this URL for us. From that moment we can start
communicating with our caller by sending back instructions, which the TwiML application turns into interactive voice responses.
Let’s take a look at what these instructions look like. This is the first instruction we send to the TwiML application after it triggers our entry endpoint:
<?xml version="1.0" encoding="UTF-8"?>
<Response>
  <Gather action="/ivr/gather-digits" timeout="4" numDigits="1">
    <Say voice="Polly.Emma" language="en-GB">
      <prosody rate="100%">
        <s>Hello and welcome to Test Restaurant</s>
        <s>If you would like to make a booking press <say-as interpret-as="digits">1</say-as> <break />for other enquiries press <say-as interpret-as="digits">2</say-as> </s>
      </prosody>
    </Say>
  </Gather>
  <Redirect>/ivr/redirect</Redirect>
</Response>
Response – the root element of Twilio’s XML markup.
Gather – collects the digits the caller types on their keypad.
Say – reads text to the caller.
Redirect – transfers control of the call to the TwiML at a different URL.
With this instruction we give the caller 4 seconds to respond and expect exactly one digit. Meanwhile we play a welcome message that explains what the caller should do. Finally, depending on whether we gathered the caller’s input or not, the next request goes to
/ivr/gather-digits in case of success or to
/ivr/redirect in case of failure.
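The same document can also be generated in code instead of being kept as a static file. The following is only a minimal sketch using Python's standard library: the function name `build_gather_twiml` and the constant `WELCOME_SAY` are assumptions made for illustration, not part of Twilio's API or of our actual application.

```python
import xml.etree.ElementTree as ET

# The <Say> prompt from the welcome TwiML above, kept as a string.
WELCOME_SAY = (
    '<Say voice="Polly.Emma" language="en-GB">'
    '<prosody rate="100%">'
    "<s>Hello and welcome to Test Restaurant</s>"
    "<s>If you would like to make a booking press "
    '<say-as interpret-as="digits">1</say-as> <break />'
    "for other enquiries press "
    '<say-as interpret-as="digits">2</say-as> </s>'
    "</prosody>"
    "</Say>"
)


def build_gather_twiml(say_xml, action, fallback, timeout=4, num_digits=1):
    """Wrap a <Say> prompt in <Gather> and add a <Redirect> fallback."""
    response = ET.Element("Response")
    gather = ET.SubElement(
        response,
        "Gather",
        {"action": action, "timeout": str(timeout), "numDigits": str(num_digits)},
    )
    # Parsing the prompt also checks that it is well-formed XML.
    gather.append(ET.fromstring(say_xml))

    # Executed only when <Gather> ends without any input from the caller.
    redirect = ET.SubElement(response, "Redirect")
    redirect.text = fallback

    body = ET.tostring(response, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?>' + body


if __name__ == "__main__":
    print(build_gather_twiml(WELCOME_SAY, "/ivr/gather-digits", "/ivr/redirect"))
```

Building the tree with an XML library rather than string concatenation means attribute values and text are escaped for you, which matters as soon as prompts contain user-supplied data.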
All these routes, which start with
/ivr, are served by our IVR application. It handles each request and, once its work is done (for example a call to an external API service), builds an XML document with the next set of instructions and sends it back to the TwiML application. This exchange continues until the call is ended or dropped.
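To make this round trip more concrete, here is a rough sketch of what the handler behind /ivr/gather-digits could look like. Twilio posts the pressed keys to the action URL as a form-encoded `Digits` parameter. Everything else here is an assumption for illustration: the function name `handle_gather_digits`, the routes `/ivr/booking` and `/ivr/enquiries`, and the choice to send unknown input to /ivr/redirect.

```python
import xml.etree.ElementTree as ET
from urllib.parse import parse_qs

# Hypothetical routes for the two menu options; not part of the original app.
NEXT_STEP = {
    "1": "/ivr/booking",
    "2": "/ivr/enquiries",
}


def handle_gather_digits(form_body):
    """Turn the form-encoded request body into the next TwiML document."""
    params = parse_qs(form_body)
    digits = params.get("Digits", [""])[0]

    response = ET.Element("Response")
    redirect = ET.SubElement(response, "Redirect")
    # Anything we do not recognise goes to the failure route.
    redirect.text = NEXT_STEP.get(digits, "/ivr/redirect")
    return ET.tostring(response, encoding="unicode")


if __name__ == "__main__":
    print(handle_gather_digits("CallSid=CA123&Digits=1"))
```

Keeping the handler a pure function from request body to XML string makes it easy to plug into whatever web framework you use and to test without placing a real call.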
TwiML provides a lot of instructions, but I want to focus on one that is among the most widely used and, from my point of view, the trickiest: it is the only instruction we had real problems with.
Say – converts text to speech that is read back to the caller.
There are several attributes you can configure: voice, loop and language.
For voice you can choose the default one, Alice, or an Amazon Polly voice.
The main tip here: if you have to support non-English languages, or even different accents of English, just use an Amazon Polly voice. Unlike the alternatives it is not free, but it gives you speech that sounds more or less right.
The interesting part is that you can’t be 100% sure of the result even with an Amazon Polly voice. For example, Finns have a hard time understanding the speech that Amazon Polly synthesizes from Finnish text. In our case we use a combination of different text-to-speech providers, and for Finnish in particular we use Microsoft Azure.
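One possible way to mix providers inside TwiML is to let Twilio speak the languages it handles well with Say, and to use the Play verb with an audio URL for everything else. The sketch below only illustrates that idea and is not a description of our production setup: the function `add_speech`, the `POLLY_VOICES` table and the `https://tts.example.com/speak` service are all hypothetical.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

# Languages we let Twilio synthesize itself, with the voice to use.
POLLY_VOICES = {"en-GB": "Polly.Emma"}

# Hypothetical service that returns audio rendered by another provider.
EXTERNAL_TTS_URL = "https://tts.example.com/speak"


def add_speech(parent, text, language):
    """Append <Say> for languages in POLLY_VOICES, <Play> for all others."""
    voice = POLLY_VOICES.get(language)
    if voice is not None:
        say = ET.SubElement(parent, "Say", {"voice": voice, "language": language})
        say.text = text
        return say

    # Fall back to audio produced outside Twilio and fetched by URL.
    play = ET.SubElement(parent, "Play")
    play.text = EXTERNAL_TTS_URL + "?" + urlencode({"lang": language, "text": text})
    return play


if __name__ == "__main__":
    response = ET.Element("Response")
    add_speech(response, "Welcome", "en-GB")
    add_speech(response, "Tervetuloa", "fi-FI")
    print(ET.tostring(response, encoding="unicode"))
```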
Text-to-speech providers like Amazon Polly and Microsoft Azure also support SSML, which lets you control the synthesized speech almost completely. Let’s take a look at how to use SSML in the real world. I hope you still remember our welcome TwiML file.
<Say voice="Polly.Emma" language="en-GB">
  <s>If you would like to make a booking press <say-as interpret-as="digits">1</say-as> <break />for other enquiries press <say-as interpret-as="digits">2</say-as> </s>
</Say>
This is how we use the Say instruction. But which part of it is SSML?
SSML, or Speech Synthesis Markup Language, is a W3C specification that gives developers an XML-based markup language for controlling how synthesized speech is generated.
s – an element that represents a sentence.
break – an empty element that controls pausing.
say-as – an element that indicates what kind of text construct it contains and helps specify the level of detail for rendering that text.
In our welcome TwiML we use
say-as to force the provider to pronounce each option as a plain number, because in some languages they would otherwise be interpreted as ordinal numbers, which is obviously a problem. We also want a pause right after the first option is pronounced, so we use
break to get it.
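If your menu has more than two options, writing this markup by hand gets repetitive. As a small sketch, the sentence can be assembled from a list of options; the helper name `build_menu_sentence` and its input format are assumptions made for this example.

```python
import xml.etree.ElementTree as ET


def build_menu_sentence(options):
    """Build an SSML <s> element from (prompt, digit) pairs.

    Each digit is wrapped in <say-as interpret-as="digits"> and a <break />
    is placed between consecutive options.
    """
    sentence = ET.Element("s")
    for index, (prompt, digit) in enumerate(options):
        if index == 0:
            sentence.text = prompt + " "
        else:
            pause = ET.SubElement(sentence, "break")
            pause.tail = prompt + " "  # text that follows the pause
        say_as = ET.SubElement(sentence, "say-as", {"interpret-as": "digits"})
        say_as.text = digit
        say_as.tail = " "
    return sentence


if __name__ == "__main__":
    menu = build_menu_sentence([
        ("If you would like to make a booking press", "1"),
        ("for other enquiries press", "2"),
    ])
    print(ET.tostring(menu, encoding="unicode"))
```

The resulting element can be appended to a Say element in the same way as any other child node.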
Now we have a better understanding of how all the pieces work together, so it’s a good moment to start building your own programmable voice solution.