Talking Web Pages and the Speech Synthesis API
A few weeks ago, I briefly discussed NLP and its related technologies. When dealing with natural language, there are two different, yet complementary, aspects to consider: Automatic Speech Recognition (ASR) and Text-to-Speech (TTS). In the article Introducing the Web Speech API, I discussed the Web Speech API, an API that provides speech input and text-to-speech output features in a web browser. You may have noticed that I only presented how to implement speech recognition in a website, not speech synthesis. In this article, we'll fill the gap by describing the Speech Synthesis API.
Speech recognition gives users, especially those with disabilities, the chance to provide information to a website. Recalling the use cases I highlighted:
In a website, users could navigate pages or populate form fields using their voice. Users could also interact with a page while driving, without taking their eyes off of the road. These are not trivial use cases.
So, we can see it as the channel from the user to the website. Speech synthesis works the other way around, giving websites the ability to provide information to users by reading text aloud. This is especially useful for blind people and, in general, those with visual impairments.
Speech synthesis has as many use cases as speech recognition. Think of the systems implemented in some new cars that read your texts or emails so that you don't have to take your eyes off of the road. Visually impaired people who use computers are familiar with software like JAWS that reads whatever is on the desktop, allowing them to perform tasks. These applications are great, but they cost a lot of money. Thanks to the Speech Synthesis API, we can help people using our websites regardless of their disabilities.
As an example, imagine you're writing a post on your blog (as I'm doing right now), and in order to improve its readability you split it into several paragraphs. Isn't this a good chance to use the Speech Synthesis API? In fact, we could program our website so that, once a user hovers over (or focuses on) text, an icon of a speaker appears on the screen. If the user clicks the icon, we call a function that synthesizes the text of the given paragraph. This is a non-trivial improvement. Even better, it has a very low overhead for us as developers, and no overhead for our users. A basic implementation of this concept is shown in the JS Bin below.
Speech Synthesis API demo
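If you can't run the JS Bin, here is a minimal sketch of the same idea. It relies on the SpeechSynthesisUtterance object and the speak() method that we'll cover in the next sections, and the speakable class is just a hypothetical hook for the example, not something the API requires.

// A minimal sketch of the "read this paragraph" idea.
// The "speakable" class is a hypothetical hook; adapt the selector to your markup.
if ('speechSynthesis' in window) {
  var speakables = document.querySelectorAll('p.speakable');
  for (var i = 0; i < speakables.length; i++) {
    speakables[i].addEventListener('click', function() {
      // Synthesize the text of the clicked paragraph
      var utterance = new SpeechSynthesisUtterance(this.textContent);
      window.speechSynthesis.speak(utterance);
    });
  }
}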
Now that we know more about the use cases of this API, let’s learn about its methods and properties.
Methods and Properties
The Speech Synthesis API defines an interface, called SpeechSynthesis, whose structure is presented here. Like the previous article, this one won't cover all the properties and methods described in the specification. The reason is that it's too complex to be covered in one article. However, we'll explain enough elements to let you easily understand those not covered.
The SpeechSynthesisUtterance Object
The first object we need to learn about is the SpeechSynthesisUtterance object. It represents the utterance (i.e. the text) that will be spoken by the synthesizer. This object is pretty flexible, and can be customized in several ways. Apart from the text, we can set the language used to pronounce the text, the speaking rate, and even the pitch. The following is a list of its properties:
text – A string that specifies the utterance (text) to be synthesized.
lang – A string representing the language of the speech synthesis for the utterance (for example "en-GB" or "it-IT").
voiceURI – A string that specifies the speech synthesis voice and the location of the speech synthesis service that the web application wishes to use.
volume – A number representing the volume for the text. It ranges from 0 (minimum) to 1 (maximum) inclusive, and the default value is 1.
rate – A number representing the speaking rate for the utterance. It is relative to the default rate for the voice. The default value is 1. A value of 2 means that the utterance will be spoken at twice the default speed. Values below 0.1 or above 10 are disallowed.
pitch – A number representing the speaking pitch for the utterance. It ranges from 0 (minimum) to 2 (maximum) inclusive. The default value is 1.
To instantiate this object we can either pass the text to synthesize as a constructor argument, or omit the text and set it later. The following code is an example of the first scenario.
// Create the utterance object
var utterance = new SpeechSynthesisUtterance('My name is Aurelio De Rosa');
The second case, which constructs a SpeechSynthesisUtterance and then assigns parameters, is shown below.
// Create the utterance object
var utterance = new SpeechSynthesisUtterance();
utterance.text = 'My name is Aurelio De Rosa';
utterance.lang = 'it-IT';
utterance.rate = 1.2;
Some of the event handlers exposed by this object are listed below (a short example of their use follows the list):
onstart – Sets a callback that is fired when the synthesis starts.
onpause – Sets a callback that is fired when the speech synthesis is paused.
onresume – Sets a callback that is fired when the synthesis is resumed.
onend – Sets a callback that is fired when the synthesis is concluded.
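For instance, assuming the utterance object created above, a sketch that logs the start and the end of the synthesis could look like this:

// Log when the synthesis of the utterance starts and ends
utterance.onstart = function() {
  console.log('Synthesis started');
};
utterance.onend = function() {
  console.log('Synthesis finished');
};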
The SpeechSynthesisUtterance object allows us to set the text to be spoken as well as to configure how it will be spoken. At the moment, we've only created the object representing the utterance though. We still need to tie it to the synthesizer.
The SpeechSynthesis Object
The SpeechSynthesis object doesn't need to be instantiated. It belongs to the window object, and can be used directly. This object exposes several methods, such as those listed below (a short usage sketch follows the list):
speak() – Accepts a SpeechSynthesisUtterance object as its only parameter. This method is used to synthesize an utterance.
cancel() – Immediately terminates the synthesis process.
pause() – Pauses the synthesis process.
resume() – Resumes the synthesis process.
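As a sketch, and assuming the utterance object created earlier, these methods can be combined like this:

// Hand the utterance to the synthesizer
window.speechSynthesis.speak(utterance);

// Later (for example from a button handler), pause and resume the synthesis...
window.speechSynthesis.pause();
window.speechSynthesis.resume();

// ...or terminate it altogether
window.speechSynthesis.cancel();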
Another interesting method is getVoices(). It doesn't accept any arguments, and is used to retrieve the list (an array) of voices available for the specific browser. Each entry in the list provides information such as name, a mnemonic name to give developers a hint of the voice (for example "Google US English"), lang, the language of the voice (for example it-IT), and voiceURI, the location of the speech synthesis service for this voice.
Important note: In Chrome and Safari, the voiceURI property is named voice instead. So, the demo we'll build in this article uses voice instead of voiceURI.
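As a quick sketch, the following snippet logs the name and language of every available voice. Keep in mind that in Chrome the list may initially be empty, an issue we'll work around in the demo below.

// Log the voices exposed by the browser, e.g. "Google US English (en-US)"
var availableVoices = window.speechSynthesis.getVoices();
for (var i = 0; i < availableVoices.length; i++) {
  console.log(availableVoices[i].name + ' (' + availableVoices[i].lang + ')');
}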
Browser Compatibility
Unfortunately, at the time of writing the only browsers that support the Speech Synthesis API are Chrome 33 with full support, and Safari for iOS 7 with partial support.
Demo
This section provides a simple demo of the Speech Synthesis API. This page allows you to input some text and have it synthesized. In addition, it’s possible to set the rate, the pitch, and the language you want to use. You can also stop, pause, or resume the synthesis of the text at any time using the respective buttons provided.
Because the support for this API is still very limited, we perform a test for the implementation before attaching the listeners to the buttons. As usual, the test is very simple and consists of the following code:
if (window.SpeechSynthesisUtterance === undefined) {
// Not supported
} else {
// Read my text
}
If the test fails, we show the message "API not supported" to the user. Once support is verified, we dynamically load the available voices into the select box included in the markup. Please note that the getVoices() method in Chrome has an issue (#340160), so I created a workaround for it using setInterval(). Then, we attach a handler to each button so that it can trigger its specific action (play, stop, and so on).
A live demo of the code is available here. In addition, this demo, together with all the others I've built so far, is available in my HTML5 API demos repository.
<!DOCTYPE html>
<html>
  <head>
    <meta charset="UTF-8">
    <meta name="viewport" content="width=device-width, initial-scale=1.0"/>
    <title>Speech Synthesis API Demo</title>
    <style>
      *
      {
        -webkit-box-sizing: border-box;
        -moz-box-sizing: border-box;
        box-sizing: border-box;
      }
      body
      {
        max-width: 500px;
        margin: 2em auto;
        padding: 0 0.5em;
        font-size: 20px;
      }
      h1,
      .buttons-wrapper
      {
        text-align: center;
      }
      .hidden
      {
        display: none;
      }
      #text,
      #log
      {
        display: block;
        width: 100%;
        height: 5em;
        overflow-y: scroll;
        border: 1px solid #333333;
        line-height: 1.3em;
      }
      .field-wrapper
      {
        margin-top: 0.2em;
      }
      .button-demo
      {
        padding: 0.5em;
        display: inline-block;
        margin: 1em auto;
      }
    </style>
  </head>
  <body>
    <h1>Speech Synthesis API</h1>
    <h3>Play area</h3>
    <form action="" method="get">
      <label for="text">Text:</label>
      <textarea id="text"></textarea>
      <div class="field-wrapper">
        <label for="voice">Voice:</label>
        <select id="voice"></select>
      </div>
      <div class="field-wrapper">
        <label for="rate">Rate (0.1 - 10):</label>
        <input type="number" id="rate" min="0.1" max="10" value="1" step="any" />
      </div>
      <div class="field-wrapper">
        <label for="pitch">Pitch (0.1 - 2):</label>
        <input type="number" id="pitch" min="0.1" max="2" value="1" step="any" />
      </div>
      <div class="buttons-wrapper">
        <button id="button-speak-ss" class="button-demo">Speak</button>
        <button id="button-stop-ss" class="button-demo">Stop</button>
        <button id="button-pause-ss" class="button-demo">Pause</button>
        <button id="button-resume-ss" class="button-demo">Resume</button>
      </div>
    </form>
    <span id="ss-unsupported" class="hidden">API not supported</span>
    <h3>Log</h3>
    <div id="log"></div>
    <button id="clear-all" class="button-demo">Clear all</button>
    <script>
      // Test browser support
      if (window.SpeechSynthesisUtterance === undefined) {
        document.getElementById('ss-unsupported').classList.remove('hidden');
        ['button-speak-ss', 'button-stop-ss', 'button-pause-ss', 'button-resume-ss'].forEach(function(elementId) {
          document.getElementById(elementId).setAttribute('disabled', 'disabled');
        });
      } else {
        var text = document.getElementById('text');
        var voices = document.getElementById('voice');
        var rate = document.getElementById('rate');
        var pitch = document.getElementById('pitch');
        var log = document.getElementById('log');
        // Workaround for a Chrome issue (#340160 - https://code.google.com/p/chromium/issues/detail?id=340160)
        var watch = setInterval(function() {
          // Load all voices available
          var voicesAvailable = speechSynthesis.getVoices();
          if (voicesAvailable.length !== 0) {
            for (var i = 0; i < voicesAvailable.length; i++) {
              voices.innerHTML += '<option value="' + voicesAvailable[i].lang + '" ' +
                'data-voice-uri="' + voicesAvailable[i].voiceURI + '">' +
                voicesAvailable[i].name +
                (voicesAvailable[i].default ? ' (default)' : '') + '</option>';
            }
            clearInterval(watch);
          }
        }, 1);
        document.getElementById('button-speak-ss').addEventListener('click', function(event) {
          event.preventDefault();
          var selectedVoice = voices.options[voices.selectedIndex];
          // Create the utterance object setting the chosen parameters
          var utterance = new SpeechSynthesisUtterance();
          utterance.text = text.value;
          utterance.voice = selectedVoice.getAttribute('data-voice-uri');
          utterance.lang = selectedVoice.value;
          utterance.rate = rate.value;
          utterance.pitch = pitch.value;
          utterance.onstart = function() {
            log.innerHTML = 'Speaker started' + '<br />' + log.innerHTML;
          };
          utterance.onend = function() {
            log.innerHTML = 'Speaker finished' + '<br />' + log.innerHTML;
          };
          window.speechSynthesis.speak(utterance);
        });
        document.getElementById('button-stop-ss').addEventListener('click', function(event) {
          event.preventDefault();
          window.speechSynthesis.cancel();
          log.innerHTML = 'Speaker stopped' + '<br />' + log.innerHTML;
        });
        document.getElementById('button-pause-ss').addEventListener('click', function(event) {
          event.preventDefault();
          window.speechSynthesis.pause();
          log.innerHTML = 'Speaker paused' + '<br />' + log.innerHTML;
        });
        document.getElementById('button-resume-ss').addEventListener('click', function(event) {
          event.preventDefault();
          if (window.speechSynthesis.paused === true) {
            window.speechSynthesis.resume();
            log.innerHTML = 'Speaker resumed' + '<br />' + log.innerHTML;
          } else {
            log.innerHTML = 'Unable to resume. Speaker is not paused.' + '<br />' + log.innerHTML;
          }
        });
        document.getElementById('clear-all').addEventListener('click', function() {
          log.textContent = '';
        });
      }
    </script>
  </body>
</html>
Conclusion
In this article we’ve covered the Speech Synthesis API. It’s an API to synthesize text and improve the overall experience for the users of our websites, especially those with visual impairments. As we’ve seen, this API exposes several objects, methods, and properties, but it isn’t very difficult to use. Unfortunately, at the moment its browser support is very poor, with Chrome and Safari being the only browsers to support it.
Hopefully, more browsers will follow their lead, allowing you to realistically consider using it on your website. I've decided to. Don't forget to play with the demo and to post a comment if you liked the article. I'd really love to hear your opinion.