Introducing the Web Speech API

After receiving my bachelor’s degree, I began working in a group called NLP. As the name implies, we focused on Natural Language Processing (NLP) technologies. At the time, two of the most popular technologies to work with were the VoiceXML standard and Java applets. Both of them had issues. The first was only supported by Opera. The second, used to send the data to the server and execute an action based on the command pronounced by the user, required a lot of code and time.

Today things are different. Thanks to the introduction of a dedicated JavaScript API, working with speech recognition has never been easier. This article will introduce you to this API, known as the Web Speech API.

Speech recognition has several real-world applications. Many more people have become familiar with this concept thanks to software like Siri and S-Voice. These applications can drastically improve the way users, especially those with disabilities, perform tasks. On a website, users could navigate pages or populate form fields using their voice. Users could also interact with a page while driving, without taking their eyes off the road. These are not trivial use cases.

What is the Web Speech API?

The Web Speech API, introduced at the end of 2012, allows web developers to provide speech input and text-to-speech output features in a web browser. Typically, these features aren’t available when using standard speech recognition or screen reader software.

This API takes care of the privacy of the users. Before allowing the website to access the voice via microphone, the user must explicitly grant permission. Interestingly, the permission request is the same as the one used by the getUserMedia API, although it doesn’t need the webcam. If the page that runs this API uses the HTTPS protocol, the browser asks for the permission only once; otherwise it asks every time a new process starts.

The Web Speech API defines a complex interface, called SpeechRecognition, whose structure is defined in the specification. This article won’t cover all the properties and methods described in the specification for two main reasons. First, as you’ll see if you look at the interface, it’s too complex to be covered in one article. Secondly, as we’ll see in the next sections, there is only one browser that supports this API, and its implementation is very limited. Therefore, we’ll cover only the implemented methods and properties.

The specification asserts that the API itself is agnostic of the underlying speech recognition and synthesis implementation and can support both server-based and client-based/embedded recognition and synthesis. It allows two types of recognition: one-shot and continuous. In the first type, the recognition ends as soon as the user stops talking, while in the second it ends when the stop() method is called. In the second case, we can still allow our users to end the recognition by attaching a handler that calls the stop() method (via a button for example). The results of the recognition are provided to our code as a list of hypotheses, along with other relevant information for each hypothesis.

Another interesting feature of the Web Speech API is that it allows you to specify a grammar object. Explaining in detail what a grammar is goes beyond the scope of this article. You can think of it as a set of rules for defining a language. The advantage of using a grammar is that it usually leads to better results because it restricts the language possibilities.

This API may not surprise you because of the already implemented x-webkit-speech attribute introduced in Chrome 11. The main difference is that the Web Speech API allows you to see results in real time and utilize a grammar. You can read more about this attribute by taking a look at How to Use HTML5 Speech Input Fields.

Now that you should have a good overview of what this API is and what it can do, let’s examine its properties and methods.

Methods and Properties

To instantiate a speech recognizer, use the SpeechRecognition() constructor as shown below:

var recognizer = new SpeechRecognition();

This object exposes the following event handlers:

onstart: Sets a callback that is fired when the recognition service has begun to listen to the audio with the intention of recognizing.
onresult: Sets a callback that is fired when the speech recognizer returns a result. The event must use the SpeechRecognitionEvent interface.
onerror: Sets a callback that is fired when a speech recognition error occurs. The event must use the SpeechRecognitionError interface.
onend: Sets a callback that is fired when the service has disconnected. The event must always be generated when the session ends, no matter what the reason.
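
As a rough illustration of how these four handlers fit together, the sketch below wires each one to a logging callback. The helper name attachLogging and the log parameter are hypothetical, introduced only for this example; the recognizer is passed in so that the snippet does not depend on a specific browser.

```javascript
// Hypothetical helper: wire the four event handlers to a logging callback.
function attachLogging(recognizer, log) {
  recognizer.onstart = function () {
    log('start: the service is listening');
  };
  recognizer.onresult = function (event) {
    log('result: ' + event.results.length + ' result(s) so far');
  };
  recognizer.onerror = function (event) {
    // event.error holds a short error code such as "no-speech"
    log('error: ' + event.error);
  };
  recognizer.onend = function () {
    log('end: the service has disconnected');
  };
  return recognizer;
}

// Browser usage (assumes a recognizer object already exists):
// attachLogging(recognizer, function (message) { console.log(message); });
```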

In addition to these event handlers, we can configure the speech recognition object using the following properties:

continuous: Sets the type of the recognition (one-shot or continuous). If its value is set to true we have a continuous recognition, otherwise the process ends as soon as the user stops talking. By default it’s set to false.
lang: Specifies the recognition language. By default it corresponds to the browser language.
interimResults: Specifies if we want interim results. If its value is set to true we’ll have access to interim results that we can show to the users to provide feedback. If the value is false, we’ll obtain the results only after the recognition ends. By default it’s set to false.
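
The three properties can be set in one place, as in the minimal sketch below. The helper applySettings and its settings object are assumptions made for this example, not part of the API; only continuous, lang, and interimResults come from the specification.

```javascript
// Hypothetical helper: apply the three configuration properties.
// Values that are not supplied fall back to the defaults described above.
function applySettings(recognizer, settings) {
  settings = settings || {};
  recognizer.continuous = settings.continuous === true;         // default: false
  recognizer.interimResults = settings.interimResults === true; // default: false
  if (typeof settings.lang === 'string') {
    recognizer.lang = settings.lang; // a language tag such as "en-US"
  }
  return recognizer;
}

// Browser usage (assumes a recognizer object already exists):
// applySettings(recognizer, { continuous: true, lang: 'en-US' });
```

Leaving lang untouched when no value is supplied keeps the browser's default language, which matches the behavior described in the list.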

As the argument to the result event handler, we receive an object of type SpeechRecognitionEvent. The latter contains the following data:

results: An array-like list containing the results of the recognition. Each element, results[i], corresponds to a recognized segment of speech (a word or a phrase).
resultIndex: The lowest index in results that has changed since the previous result event.
results[i].isFinal: A Boolean that indicates if the result is final or interim.
results[i][j]: The j-th alternative for the i-th result. The first element is the most probable alternative.
results[i][j].transcript: The text representation of the recognized word(s).
results[i][j].confidence: The probability of the result being correct. The value ranges from 0 to 1.
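
To make the shape of this data more concrete, here is a small sketch that walks the results of an event and separates final from interim text. The function name collectTranscripts and the returned object are hypothetical; the property names it reads are the ones listed above.

```javascript
// Hypothetical helper: split an event's results into final and interim text.
function collectTranscripts(event) {
  var finalText = '';
  var interimText = '';

  // Start at resultIndex: earlier results have not changed since the last event.
  for (var i = event.resultIndex; i < event.results.length; i++) {
    var best = event.results[i][0]; // the most probable alternative
    if (event.results[i].isFinal) {
      finalText += best.transcript;
    } else {
      interimText += best.transcript;
    }
  }

  return { finalText: finalText, interimText: interimText };
}

// Browser usage (assumes a recognizer object already exists):
// recognizer.onresult = function (event) {
//   var texts = collectTranscripts(event);
//   console.log(texts.finalText, texts.interimText);
// };
```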

Browser Compatibility

As mentioned earlier, the proposal for the Web Speech API was made in late 2012. So far, the only browser that supports this API is Chrome, starting in version 25 with a very limited subset of the specification. Additionally, Chrome supports this API using the webkit prefix. Therefore, creating a speech recognition object looks like this in Chrome:

var recognizer = new webkitSpeechRecognition();
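
Because the constructor may be available with or without the prefix, or not at all, one possible approach is to look it up through a small function. The name getSpeechRecognition and its scope parameter are assumptions made for this sketch.

```javascript
// Hypothetical helper: return the available constructor, or null if the
// API is not supported. "scope" is normally the window object.
function getSpeechRecognition(scope) {
  return scope.SpeechRecognition || scope.webkitSpeechRecognition || null;
}

// Browser usage:
// var Recognition = getSpeechRecognition(window);
// if (Recognition !== null) {
//   var recognizer = new Recognition();
// }
```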

Demo

This section provides a demo of the Web Speech API in action. The demo page contains one readonly field and three buttons. The field is needed to show the transcription of the recognized speech. The first two buttons start and stop the recognition process, while the third clears the log of actions and error messages. The recognizer runs in continuous mode, and the demo lets you choose between final-only and interim results using two radio buttons.

Because only Chrome supports this API, we perform a check, and if it fails we display an error message. Once support is verified, we initialize the speech recognition object so that we don’t have to perform this action every time the user clicks on the “Play demo” button. We also attach a handler to start the recognition process. Note that inside the handler, we also set whether we want interim results. We need to read this setting inside the handler to ensure it reflects the user’s choice (it needs to be refreshed every time a new recognition starts).

A live demo of the code is available here. Oh, and just for fun, try to say a dirty word.

<!DOCTYPE html>
<html>
  <head>
    <meta charset="UTF-8">
    <meta name="viewport" content="width=device-width, initial-scale=1.0"/>
    <title>Web Speech API Demo</title>
    <style>
      body
      {
        max-width: 500px;
        margin: 2em auto;
        font-size: 20px;
      }

      h1
      {
        text-align: center;
      }

      .buttons-wrapper
      {
        text-align: center;
      }

      .hidden
      {
        display: none;
      }

      #transcription,
      #log
      {
        display: block;
        width: 100%;
        height: 5em;
        overflow-y: scroll;
        border: 1px solid #333333;
        line-height: 1.3em;
      }

      .button-demo
      {
        padding: 0.5em;
        display: inline-block;
        margin: 1em auto;
      }
    </style>
  </head>
  <body>
    <h1>Web Speech API</h1>
    <h2>Transcription</h2>
    <textarea id="transcription" readonly="readonly"></textarea>

    <span>Results:</span>
    <label><input type="radio" name="recognition-type" value="final" checked="checked" /> Final only</label>
    <label><input type="radio" name="recognition-type" value="interim" /> Interim</label>

    <h3>Log</h3>
    <div id="log"></div>

    <div class="buttons-wrapper">
      <button id="button-play-ws" class="button-demo">Play demo</button>
      <button id="button-stop-ws" class="button-demo">Stop demo</button>
      <button id="clear-all" class="button-demo">Clear all</button>
    </div>
    <span id="ws-unsupported" class="hidden">API not supported</span>

    <script>
      // Test browser support
      window.SpeechRecognition = window.SpeechRecognition       ||
                                 window.webkitSpeechRecognition ||
                                 null;

      if (window.SpeechRecognition === null) {
        document.getElementById('ws-unsupported').classList.remove('hidden');
        document.getElementById('button-play-ws').setAttribute('disabled', 'disabled');
        document.getElementById('button-stop-ws').setAttribute('disabled', 'disabled');
      } else {
        var recognizer = new window.SpeechRecognition();
        var transcription = document.getElementById('transcription');
        var log = document.getElementById('log');

        // Recogniser doesn't stop listening even if the user pauses
        recognizer.continuous = true;

        // Handle recognition results as they arrive
        recognizer.onresult = function(event) {
          transcription.textContent = '';

          for (var i = event.resultIndex; i < event.results.length; i++) {
            if (event.results[i].isFinal) {
              transcription.textContent = event.results[i][0].transcript + ' (Confidence: ' + event.results[i][0].confidence + ')';
            } else {
              transcription.textContent += event.results[i][0].transcript;
            }
          }
        };

        // Listen for errors
        recognizer.onerror = function(event) {
          // event.error holds the error code (e.g. "no-speech"); event.message may be empty
          log.innerHTML = 'Recognition error: ' + event.error + '<br />' + log.innerHTML;
        };

        document.getElementById('button-play-ws').addEventListener('click', function() {
          // Set if we need interim results
          recognizer.interimResults = document.querySelector('input[name="recognition-type"][value="interim"]').checked;

          try {
            recognizer.start();
            log.innerHTML = 'Recognition started' + '<br />' + log.innerHTML;
          } catch(ex) {
            log.innerHTML = 'Recognition error: ' + ex.message + '<br />' + log.innerHTML;
          }
        });

        document.getElementById('button-stop-ws').addEventListener('click', function() {
          recognizer.stop();
          log.innerHTML = 'Recognition stopped' + '<br />' + log.innerHTML;
        });

        document.getElementById('clear-all').addEventListener('click', function() {
          transcription.textContent = '';
          log.textContent = '';
        });
      }
    </script>
  </body>
</html>

Conclusion

This article introduced the Web Speech API and explained how it can help improve user experience, especially for those with disabilities. The implementation of this API is at a very early stage, with only Chrome offering a limited set of features. This API has a lot of potential, so keep an eye on its evolution. As a final note, don’t forget to play with the demo. It’s really entertaining.

Frequently Asked Questions (FAQs) about Web Speech API

What is the Web Speech API and how does it work?

The Web Speech API is a web-based interface that allows websites and web applications to incorporate speech recognition and speech synthesis into their functionality. It works by converting spoken language into written text (speech recognition) and vice versa (speech synthesis). This API is particularly useful in creating more accessible and interactive web experiences, such as voice-driven web apps, assistive technologies, and other innovative web projects.

How can I implement the Web Speech API in my web application?

Implementing the Web Speech API involves using JavaScript to interact with the API’s SpeechRecognition and SpeechSynthesis interfaces. You can start by creating a new instance of these interfaces, then use their methods and properties to control the speech recognition and synthesis processes. For example, you can use the start() method to begin speech recognition, and the onresult event handler to process the results.

What are the main features of the Web Speech API?

The Web Speech API provides two main features: speech recognition and speech synthesis. Speech recognition allows your web application to convert spoken language into written text, which can be useful for dictation, voice commands, and more. Speech synthesis, on the other hand, enables your application to generate speech from text, which can be used for text-to-speech, voice prompts, and other applications.

Is the Web Speech API supported by all browsers?

The Web Speech API is not universally supported by all browsers. As of now, it is fully supported by Google Chrome and partially supported by other browsers like Firefox and Safari. It’s always a good idea to check the current browser compatibility before implementing the Web Speech API in your web application.

How can I handle errors when using the Web Speech API?

The Web Speech API provides several event handlers that you can use to handle errors. For example, the onerror event handler is triggered when a speech recognition error occurs. You can use this event handler to determine the type of error and take appropriate action.
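
As a sketch of what "appropriate action" might start from, the function below turns the error code carried by the event into a readable message. The codes are the ones defined by the specification; the message wording and the names ERROR_MESSAGES and describeRecognitionError are made up for this example.

```javascript
// Error codes defined by the specification, with illustrative messages.
var ERROR_MESSAGES = {
  'no-speech': 'No speech was detected.',
  'aborted': 'Speech input was aborted.',
  'audio-capture': 'Audio capture failed.',
  'network': 'A network error prevented recognition.',
  'not-allowed': 'Permission to use speech input was denied.',
  'service-not-allowed': 'The speech service is not allowed.',
  'bad-grammar': 'There was an error in the speech grammar.',
  'language-not-supported': 'The language is not supported.'
};

// Hypothetical helper: describe the error reported by an error event.
function describeRecognitionError(event) {
  var code = event.error;
  if (Object.prototype.hasOwnProperty.call(ERROR_MESSAGES, code)) {
    return ERROR_MESSAGES[code];
  }
  return 'Unknown recognition error: ' + code;
}

// Browser usage (assumes a recognizer object already exists):
// recognizer.onerror = function (event) {
//   console.log(describeRecognitionError(event));
// };
```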

Can I customize the voice and speech rate in speech synthesis?

Yes, the Web Speech API allows you to customize various aspects of speech synthesis, including the voice, pitch, volume, and rate of speech. You can use the getVoices() method to get a list of available voices, and the SpeechSynthesisUtterance interface to set the voice and other properties.
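
A minimal sketch of this customization follows. The helper buildUtterance, its options object, and the clamping are assumptions made for the example; the constructor is passed in as a parameter so the function does not depend on a browser global. The ranges used for clamping are the ones given in the specification.

```javascript
// Keep a number within the given range.
function clamp(value, min, max) {
  return Math.min(max, Math.max(min, value));
}

// Hypothetical helper: create an utterance and apply optional settings.
// "Utterance" is normally the SpeechSynthesisUtterance constructor.
function buildUtterance(Utterance, text, options) {
  var utterance = new Utterance(text);
  options = options || {};

  if (options.voice) {
    utterance.voice = options.voice; // one of the objects returned by getVoices()
  }
  if (typeof options.rate === 'number') {
    utterance.rate = clamp(options.rate, 0.1, 10);
  }
  if (typeof options.pitch === 'number') {
    utterance.pitch = clamp(options.pitch, 0, 2);
  }
  if (typeof options.volume === 'number') {
    utterance.volume = clamp(options.volume, 0, 1);
  }
  return utterance;
}

// Browser usage:
// var voices = window.speechSynthesis.getVoices();
// var utterance = buildUtterance(SpeechSynthesisUtterance, 'Hello', {
//   voice: voices[0],
//   rate: 1.2
// });
// window.speechSynthesis.speak(utterance);
```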

How can I improve the accuracy of speech recognition?

The accuracy of speech recognition can be influenced by several factors, including the quality of the input audio, the speaker’s accent, and background noise. You can improve accuracy by using a high-quality microphone, minimizing background noise, and providing a context to the SpeechRecognition interface using the grammars property.

Can I use the Web Speech API for language translation?

While the Web Speech API itself does not provide language translation, you can combine it with other APIs or services to create a language translation application. For example, you can use the Web Speech API for speech recognition and synthesis, and a translation API to translate the recognized text into another language.

Is the Web Speech API free to use?

Yes, the Web Speech API is a free, openly published web specification developed under the W3C. However, keep in mind that while the API itself is free, you may incur costs if you choose to use third-party services or APIs in conjunction with the Web Speech API.

What are some potential applications of the Web Speech API?

The Web Speech API opens up a wide range of possibilities for web applications. It can be used to create voice-driven web apps, assistive technologies for people with disabilities, interactive games, dictation software, language learning apps, and much more. The only limit is your imagination!