Talking Web Pages and the Speech Synthesis API
A few weeks ago, I briefly discussed NLP and its related technologies. When dealing with natural language, there are two different, yet complementary, aspects to consider: Automatic Speech Recognition (ASR) and Text-to-Speech (TTS). In the article Introducing the Web Speech API, I discussed the Web Speech API, an API that provides speech input and text-to-speech output features in a web browser. You may have noticed that I only presented how to implement speech recognition in a website, not speech synthesis. In this article, we'll fill the gap by describing the Speech Synthesis API.
Speech recognition gives users, especially those with disabilities, the chance to provide information to a website. Recalling the use cases I highlighted:
In a website, users could navigate pages or populate form fields using their voice. Users could also interact with a page while driving, without taking their eyes off of the road. These are not trivial use cases.
So, we can see it as the channel from the user to the website. Speech synthesis works the other way around, giving websites the ability to provide information to users by reading text aloud. This is especially useful for blind people and, in general, those with visual impairments.
Speech synthesis has as many use cases as speech recognition. Think of the systems implemented in some new cars that read your texts or emails so that you don't have to take your eyes off of the road. Visually impaired people who use computers are familiar with software like JAWS that reads whatever is on the desktop, allowing them to perform tasks. These applications are great, but they cost a lot of money. Thanks to the Speech Synthesis API, we can help people using our websites regardless of their disabilities.
As an example, imagine you're writing a post on your blog (as I'm doing right now), and in order to improve its readability you split it into several paragraphs. Isn't this a good chance to use the Speech Synthesis API? In fact, we could program our website so that, once a user hovers over (or focuses on) text, an icon of a speaker appears on the screen. If the user clicks the icon, we call a function that synthesizes the text of the given paragraph. This is a non-trivial improvement. Even better, it has a very low overhead for us as developers, and no overhead for our users. A basic implementation of this concept is shown in the JS Bin below.
Speech Synthesis API demo
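If you can't run the JS Bin, here is a minimal sketch of the same idea. It relies on the SpeechSynthesisUtterance object and the speak() method that we'll cover in the next sections, and the speakable class is just a hypothetical hook for the example, not something the API requires.

// A minimal sketch of the "read this paragraph" idea.
// The "speakable" class is a hypothetical hook; adapt the selector to your markup.
if ('speechSynthesis' in window) {
  var speakables = document.querySelectorAll('p.speakable');
  for (var i = 0; i < speakables.length; i++) {
    speakables[i].addEventListener('click', function() {
      // Synthesize the text of the clicked paragraph
      var utterance = new SpeechSynthesisUtterance(this.textContent);
      window.speechSynthesis.speak(utterance);
    });
  }
}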
Now that we know more about the use cases of this API, let’s learn about its methods and properties.
Methods and Properties
The Speech Synthesis API defines an interface, called SpeechSynthesis, whose structure is presented here. Like the previous article, this one won't cover all the properties and methods described in the specification. The reason is that it's too complex to be covered in one article. However, we'll explain enough elements to let you easily understand those not covered.
The SpeechSynthesisUtterance Object
The first object we need to learn about is the SpeechSynthesisUtterance object. It represents the utterance (i.e. the text) that will be spoken by the synthesizer. This object is pretty flexible, and can be customized in several ways. Apart from the text, we can set the language used to pronounce the text, the speaking rate, and even the pitch. The following is a list of its properties:
text – A string that specifies the utterance (text) to be synthesized.
lang – A string representing the language of the speech synthesis for the utterance (for example "en-GB" or "it-IT").
voiceURI – A string that specifies the speech synthesis voice and the location of the speech synthesis service that the web application wishes to use.
volume – A number representing the volume for the text. It ranges from 0 (minimum) to 1 (maximum) inclusive, and the default value is 1.
rate – A number representing the speaking rate for the utterance. It is relative to the default rate for the voice. The default value is 1. A value of 2 means that the utterance will be spoken at twice the default speed. Values below 0.1 or above 10 are disallowed.
pitch – A number representing the speaking pitch for the utterance. It ranges from 0 (minimum) to 2 (maximum) inclusive. The default value is 1.
To instantiate this object we can either pass the text to synthesize as a constructor argument, or omit the text and set it later. The following code is an example of the first scenario.
// Create the utterance object
var utterance = new SpeechSynthesisUtterance('My name is Aurelio De Rosa');
The second case, which constructs a SpeechSynthesisUtterance and then assigns parameters, is shown below.
// Create the utterance object
var utterance = new SpeechSynthesisUtterance();
utterance.text = 'My name is Aurelio De Rosa';
utterance.lang = 'it-IT';
utterance.rate = 1.2;
Some of the event handlers exposed by this object are listed below (a short example of their use follows the list):
onstart – Sets a callback that is fired when the synthesis starts.
onpause – Sets a callback that is fired when the speech synthesis is paused.
onresume – Sets a callback that is fired when the synthesis is resumed.
onend – Sets a callback that is fired when the synthesis is concluded.
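For instance, assuming the utterance object created above, a sketch that logs the start and the end of the synthesis could look like this:

// Log when the synthesis of the utterance starts and ends
utterance.onstart = function() {
  console.log('Synthesis started');
};
utterance.onend = function() {
  console.log('Synthesis finished');
};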
The SpeechSynthesisUtterance object allows us to set the text to be spoken as well as to configure how it will be spoken. At the moment, we've only created the object representing the utterance though. We still need to tie it to the synthesizer.
The SpeechSynthesis Object
The SpeechSynthesis object doesn't need to be instantiated. It belongs to the window object, and can be used directly. This object exposes several methods, such as those listed below (a short usage sketch follows the list):
speak() – Accepts a SpeechSynthesisUtterance object as its only parameter. This method is used to synthesize an utterance.
cancel() – Immediately terminates the synthesis process.
pause() – Pauses the synthesis process.
resume() – Resumes the synthesis process.
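As a sketch, and assuming the utterance object created earlier, these methods can be combined like this:

// Hand the utterance to the synthesizer
window.speechSynthesis.speak(utterance);

// Later (for example from a button handler), pause and resume the synthesis...
window.speechSynthesis.pause();
window.speechSynthesis.resume();

// ...or terminate it altogether
window.speechSynthesis.cancel();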
Another interesting method is getVoices(). It doesn't accept any arguments, and is used to retrieve the list (an array) of voices available for the specific browser. Each entry in the list provides information such as name, a mnemonic name to give developers a hint of the voice (for example "Google US English"), lang, the language of the voice (for example it-IT), and voiceURI, the location of the speech synthesis service for this voice.
Important note: In Chrome and Safari, the voiceURI property is named voice instead. So, the demo we'll build in this article uses voice instead of voiceURI.
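As a quick sketch, the following snippet logs the name and language of every available voice. Keep in mind that in Chrome the list may initially be empty, an issue we'll work around in the demo below.

// Log the voices exposed by the browser, e.g. "Google US English (en-US)"
var availableVoices = window.speechSynthesis.getVoices();
for (var i = 0; i < availableVoices.length; i++) {
  console.log(availableVoices[i].name + ' (' + availableVoices[i].lang + ')');
}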
Browser Compatibility
Unfortunately, at the time of writing the only browsers that support the Speech Synthesis API are Chrome 33 with full support, and Safari for iOS 7 with partial support.
Demo
This section provides a simple demo of the Speech Synthesis API. This page allows you to input some text and have it synthesized. In addition, it’s possible to set the rate, the pitch, and the language you want to use. You can also stop, pause, or resume the synthesis of the text at any time using the respective buttons provided.
Because the support for this API is still very limited, we perform a test for the implementation before attaching the listeners to the buttons. As usual, the test is very simple and consists of the following code:
if (window.SpeechSynthesisUtterance === undefined) {
// Not supported
} else {
// Read my text
}
If the test fails, we show the message "API not supported" to the user. Once support is verified, we dynamically load the available voices into the select box included in the markup. Please note that the getVoices() method in Chrome has an issue (#340160), so I created a workaround for it using setInterval(). Then, we attach a handler to each button so that it can trigger its specific action (play, stop, and so on).
A live demo of the code is available here. In addition, this demo, together with all the others I've built so far, is available in my HTML5 API demos repository.
<!DOCTYPE html>
<html>
  <head>
    <meta charset="UTF-8">
    <meta name="viewport" content="width=device-width, initial-scale=1.0"/>
    <title>Speech Synthesis API Demo</title>
    <style>
      *
      {
        -webkit-box-sizing: border-box;
        -moz-box-sizing: border-box;
        box-sizing: border-box;
      }
      body
      {
        max-width: 500px;
        margin: 2em auto;
        padding: 0 0.5em;
        font-size: 20px;
      }
      h1,
      .buttons-wrapper
      {
        text-align: center;
      }
      .hidden
      {
        display: none;
      }
      #text,
      #log
      {
        display: block;
        width: 100%;
        height: 5em;
        overflow-y: scroll;
        border: 1px solid #333333;
        line-height: 1.3em;
      }
      .field-wrapper
      {
        margin-top: 0.2em;
      }
      .button-demo
      {
        padding: 0.5em;
        display: inline-block;
        margin: 1em auto;
      }
    </style>
  </head>
  <body>
    <h1>Speech Synthesis API</h1>
    <h3>Play area</h3>
    <form action="" method="get">
      <label for="text">Text:</label>
      <textarea id="text"></textarea>
      <div class="field-wrapper">
        <label for="voice">Voice:</label>
        <select id="voice"></select>
      </div>
      <div class="field-wrapper">
        <label for="rate">Rate (0.1 - 10):</label>
        <input type="number" id="rate" min="0.1" max="10" value="1" step="any" />
      </div>
      <div class="field-wrapper">
        <label for="pitch">Pitch (0.1 - 2):</label>
        <input type="number" id="pitch" min="0.1" max="2" value="1" step="any" />
      </div>
      <div class="buttons-wrapper">
        <button id="button-speak-ss" class="button-demo">Speak</button>
        <button id="button-stop-ss" class="button-demo">Stop</button>
        <button id="button-pause-ss" class="button-demo">Pause</button>
        <button id="button-resume-ss" class="button-demo">Resume</button>
      </div>
    </form>
    <span id="ss-unsupported" class="hidden">API not supported</span>
    <h3>Log</h3>
    <div id="log"></div>
    <button id="clear-all" class="button-demo">Clear all</button>
    <script>
      // Test browser support
      if (window.SpeechSynthesisUtterance === undefined) {
        document.getElementById('ss-unsupported').classList.remove('hidden');
        ['button-speak-ss', 'button-stop-ss', 'button-pause-ss', 'button-resume-ss'].forEach(function(elementId) {
          document.getElementById(elementId).setAttribute('disabled', 'disabled');
        });
      } else {
        var text = document.getElementById('text');
        var voices = document.getElementById('voice');
        var rate = document.getElementById('rate');
        var pitch = document.getElementById('pitch');
        var log = document.getElementById('log');
        // Workaround for a Chrome issue (#340160 - https://code.google.com/p/chromium/issues/detail?id=340160)
        var watch = setInterval(function() {
          // Load all voices available
          var voicesAvailable = speechSynthesis.getVoices();
          if (voicesAvailable.length !== 0) {
            for (var i = 0; i < voicesAvailable.length; i++) {
              voices.innerHTML += '<option value="' + voicesAvailable[i].lang + '" ' +
                'data-voice-uri="' + voicesAvailable[i].voiceURI + '">' +
                voicesAvailable[i].name +
                (voicesAvailable[i].default ? ' (default)' : '') + '</option>';
            }
            clearInterval(watch);
          }
        }, 1);
        document.getElementById('button-speak-ss').addEventListener('click', function(event) {
          event.preventDefault();
          var selectedVoice = voices.options[voices.selectedIndex];
          // Create the utterance object setting the chosen parameters
          var utterance = new SpeechSynthesisUtterance();
          utterance.text = text.value;
          utterance.voice = selectedVoice.getAttribute('data-voice-uri');
          utterance.lang = selectedVoice.value;
          utterance.rate = rate.value;
          utterance.pitch = pitch.value;
          utterance.onstart = function() {
            log.innerHTML = 'Speaker started' + '<br />' + log.innerHTML;
          };
          utterance.onend = function() {
            log.innerHTML = 'Speaker finished' + '<br />' + log.innerHTML;
          };
          window.speechSynthesis.speak(utterance);
        });
        document.getElementById('button-stop-ss').addEventListener('click', function(event) {
          event.preventDefault();
          window.speechSynthesis.cancel();
          log.innerHTML = 'Speaker stopped' + '<br />' + log.innerHTML;
        });
        document.getElementById('button-pause-ss').addEventListener('click', function(event) {
          event.preventDefault();
          window.speechSynthesis.pause();
          log.innerHTML = 'Speaker paused' + '<br />' + log.innerHTML;
        });
        document.getElementById('button-resume-ss').addEventListener('click', function(event) {
          event.preventDefault();
          if (window.speechSynthesis.paused === true) {
            window.speechSynthesis.resume();
            log.innerHTML = 'Speaker resumed' + '<br />' + log.innerHTML;
          } else {
            log.innerHTML = 'Unable to resume. Speaker is not paused.' + '<br />' + log.innerHTML;
          }
        });
        document.getElementById('clear-all').addEventListener('click', function() {
          log.textContent = '';
        });
      }
    </script>
  </body>
</html>
Conclusion
In this article we’ve covered the Speech Synthesis API. It’s an API to synthesize text and improve the overall experience for the users of our websites, especially those with visual impairments. As we’ve seen, this API exposes several objects, methods, and properties, but it isn’t very difficult to use. Unfortunately, at the moment its browser support is very poor, with Chrome and Safari being the only browsers to support it.
Hopefully, more browsers will follow their lead, allowing you to realistically consider using it on your website. I've decided to. Don't forget to play with the demo and to post a comment if you liked the article. I'd really love to hear your opinion.