
Experimenting with the Web Speech API

By Aurelio De Rosa

A few days ago I spoke at WebTech Conference 2014, giving a presentation titled Talking and listening to web pages, in which I discussed the Web Speech API and what a developer can do with it to improve the user experience. The talk was inspired by two articles I wrote for SitePoint, titled Introducing the Web Speech API and Talking Web Pages and the Speech Synthesis API.

In this tutorial we’ll build upon that knowledge and develop a demo that uses both of the interfaces defined by this API. If you need an introduction to the Web Speech API, I recommend reading the two previously mentioned articles, because this one assumes you have a good knowledge of it. Have fun!

Developing an Interactive Form

The goal of this article is to build an interactive form that our users can fill in with their voice. For the sake of this example we’ll develop a registration form, but you can apply the same concepts to any form you want. An important concept to keep in mind is that voice should never be the only input method, because no matter how accurate a speech recognizer is, it’ll never be perfect. So, the user should always be able to modify any field to fix any error the recognizer has made.

In this demo we’ll provide a button that, once clicked, starts asking questions to the user, with the interaction continuing as the user speaks each answer. The recognizer transforms the speech into text, which is placed in the corresponding text field. Once the interaction is complete, which means all the fields of our form have been filled, our application will be polite and thank the user.

As a final point, remember that at the time of this writing the Web Speech API is highly experimental and fully supported only in Chrome, so our experiment will work in that browser alone.
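If you want to fail gracefully in other browsers, a simple feature-detection check can tell you whether both interfaces are available. The following is a minimal sketch of such a check:

// Detect support for the two interfaces used in this demo.
var hasSynthesis = 'speechSynthesis' in window &&
                   'SpeechSynthesisUtterance' in window;
// Chrome exposes the recognizer behind a vendor prefix.
var hasRecognition = 'SpeechRecognition' in window ||
                     'webkitSpeechRecognition' in window;

if (!hasSynthesis || !hasRecognition) {
   console.warn('The Web Speech API is not fully supported in this browser.');
}

Without further ado, let’s start building the markup of the registration form.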

The HTML of the Registration Form

To keep things as easy as possible, our form will contain only three fields, but you can add as many as you need. In particular, we’ll require our user to fill in their name, surname, and nationality. If you have a basic knowledge of HTML, performing this task should be pretty easy. I suggest you try to implement it before taking a look at the code below (my implementation):

<form>
   <label for="form-demo-name">Name:</label>
   <input id="form-demo-name" />
   <label for="form-demo-surname">Surname:</label>
   <input id="form-demo-surname" />
   <label for="form-demo-nationality">Nationality:</label>
   <input id="form-demo-nationality" />
   <input id="form-demo-voice" type="submit" value="Start" />
</form>

The previous code shows nothing but a classic form that can only be filled with the use of a keyboard or similar input devices. So, we need to find a way to specify the question we want to ask for each of the fields defined in the form. A good and simple solution is to employ the data-* attributes of HTML5. In particular, we’ll specify a data-question attribute for every label/input pair. I’ve decided to set the attribute on the label associated with the input, but you can easily change the demo to define the attribute on the input element instead.

The resulting code is shown below:

<form>
   <label for="form-demo-name" data-question="What's your name?">Name:</label>
   <input id="form-demo-name" />
   <label for="form-demo-surname" data-question="What's your surname?">Surname:</label>
   <input id="form-demo-surname" />
   <label for="form-demo-nationality" data-question="What's your nationality?">Nationality:</label>
   <input id="form-demo-nationality" />
   <input id="form-demo-voice" type="submit" value="Start" />
</form>

Whether you’re surprised or not, this is all the markup we need to create our interactive form. Let’s now delve into the core of our demo by discussing the JavaScript code.

Adding the Business Logic

To develop the business logic of our form we need three ingredients: a speech synthesizer, a speech recognizer, and promises. We need the speech synthesizer to emit the sound that asks the user the question we’ve defined using the data-question attribute. The speech recognizer is used to transform the user’s response into text that will be set as the value of each field. Finally, we need promises to avoid callback hell!

The Web Speech API is driven by asynchronous operations, so we need a way to synchronize all of them. We need to start recognizing the user’s speech after the question has been asked, and we have to ask a new question after the user has spoken their answer and the recognizer has completed its work. Thus, we need to synchronize a variable set of consecutive (serial) asynchronous operations. We can easily solve this issue by adopting promises in our code. If you need a primer on what promises are, SitePoint has you covered with the article An Overview of JavaScript Promises. Another very good article, written by Jake Archibald, is titled JavaScript Promises: There and back again.
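To give you an idea of the pattern we’ll rely on, here is a minimal, self-contained sketch (the step function is mine, standing in for any asynchronous operation) showing how a then() chain serializes asynchronous steps:

// Stand-in for an asynchronous operation: returns a promise
// that is resolved one second after it's created.
function step(label) {
   return new Promise(function(resolve) {
      setTimeout(function() {
         console.log(label + ' completed');
         resolve();
      }, 1000);
   });
}

// Each step starts only after the previous one has been resolved.
step('First question')
   .then(function() { return step('Second question'); })
   .then(function() { return step('Third question'); })
   .then(function() { console.log('All done'); });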

Our code will be logically divided into two parts: a support library that deals with the Web Speech API and acts as the producer of the promises, and the code that consumes those promises. We’ll discuss them in the next two sections of this article.

Developing the Support Library

If you have a working knowledge of how the Web Speech API works, understanding the support library won’t be very hard.

We’ll define an object literal that we’ll assign to a variable named Speech. This object has two methods: speak and recognize. The former accepts the text to speak and is responsible for emitting the audio as well as creating the promise associated with this operation. The promise will be resolved once the text has been spoken (the end event) or rejected if the error event is triggered. The promise will also be rejected if the browser doesn’t support the API. The recognize method is used to recognize the user’s speech. It doesn’t accept any arguments, and returns the recognized text by passing it to the resolve method of the promise created. As you’ll see, recognize is slightly more complex than speak because it has to deal with more situations. The promise created by recognize will be resolved when the final results are available or rejected in case any error occurs. Please note that the code also takes care of an issue I discovered a few days ago on Chrome for Windows 8.1 (#428873).

The complete code of our support library is shown below:

var Speech = {
   speak: function(text) {
      return new Promise(function(resolve, reject) {
         // Access the constructor via window to avoid a ReferenceError
         // in browsers that don't expose it, and bail out early.
         if (!window.SpeechSynthesisUtterance) {
            reject('API not supported');
            return;
         }
      
         var utterance = new SpeechSynthesisUtterance(text);

         utterance.addEventListener('end', function() {
            console.log('Synthesizing completed');
            resolve();
         });

         utterance.addEventListener('error', function (event) {
            console.log('Synthesizing error');
            reject('An error has occurred while speaking: ' + event.error);
         });

         console.log('Synthesizing the text: ' + text);
         speechSynthesis.speak(utterance);
      });
   },
   recognize: function() {
      return new Promise(function(resolve, reject) {
         // Chrome exposes the recognizer behind a webkit prefix, so we
         // look up both names via window to avoid a ReferenceError.
         var SpeechRecognition = window.SpeechRecognition       ||
                                 window.webkitSpeechRecognition ||
                                 null;

         if (SpeechRecognition === null) {
            reject('API not supported');
            return;
         }

         var recognizer = new SpeechRecognition();

         recognizer.addEventListener('result', function (event) {
            console.log('Recognition completed');
            for (var i = event.resultIndex; i < event.results.length; i++) {
               if (event.results[i].isFinal) {
                  resolve(event.results[i][0].transcript);
               }
            }
         });

         recognizer.addEventListener('error', function (event) {
            console.log('Recognition error');
            reject('An error has occurred while recognizing: ' + event.error);
         });

         recognizer.addEventListener('nomatch', function (event) {
            console.log('Recognition ended because of nomatch');
            reject('Error: sorry but I could not find a match');
         });

         recognizer.addEventListener('end', function (event) {
            console.log('Recognition ended');
            // If the Promise isn't resolved or rejected at this point
            // the demo is running on Chrome and Windows 8.1 (issue #428873).
            reject('Error: sorry but I could not recognize your speech');
         });

         console.log('Recognition started');
         recognizer.start();
      });
   }
};
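Before wiring this library to the form, you can try it in isolation. A minimal usage sketch:

// Ask a question out loud, then recognize the spoken answer.
Speech.speak('What is your name?')
   .then(function() {
      return Speech.recognize();
   })
   .then(function(text) {
      console.log('You said: ' + text);
   })
   .catch(function(error) {
      console.error(error);
   });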

Putting All the Pieces Together

With our support library in place, we need to write the code that retrieves the questions we’ve specified and interacts with the library to create the interactive form.

The first thing we need to do is retrieve all the labels of our form, because we’ll use their for attribute to retrieve the inputs and their data-question attribute to ask the questions. This operation is performed by the statement below, where the [].slice.call() idiom converts the static NodeList returned by querySelectorAll() into a true array:

var fieldLabels = [].slice.call(document.querySelectorAll('label'));

Recalling how we wrote the markup, we can shorten the necessary code by keeping the label/input pairs, which means the question/answer pairs, coupled. We can do that by using a support function that we’ll call formData. Its goal is to return the new promise generated by every label/input pair. Treating every label and input in our form as a single component, instead of as different entities, allows us to reduce the code needed, because we can extract more abstract code and loop over the pairs.

The code of the formData function and how it’s called is shown below:

function formData(i) {
   // Chain onto the previous promise: ask the question first,
   // then recognize the answer and copy it into the related input.
   return promise.then(function() {
             return Speech.speak(fieldLabels[i].dataset.question);
          })
          .then(function() {
             return Speech.recognize().then(function(text) {
                document.getElementById(fieldLabels[i].getAttribute('for')).value = text;
             });
          });
}

// Build the chain: each iteration appends a question/answer step.
for(var i = 0; i < fieldLabels.length; i++) {
   promise = formData(i);
}

Because we’ve chained the promises as shown in the formData function, we need an initial, already-resolved promise to allow the others to start. This task is achieved by creating an immediately resolved promise before the loop of the previous snippet:

var promise = new Promise(function(resolve) {
   resolve();
});
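As a side note, the same initial promise can be created more concisely with the Promise.resolve() shorthand:

// Equivalent shorthand for an immediately resolved promise.
var promise = Promise.resolve();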

As a final touch, we want to thank our users but also catch any possible error generated by our process:

promise.then(function() {
   return Speech.speak('Thank you for filling the form!');
})
.catch(function(error) {
   alert(error);
});

At this point our code is almost complete. The final step is to place all the code of this section inside a function executed when the user clicks the button.
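For example, the wiring might look like the following minimal sketch, where fillFormWithVoice is a hypothetical name for a function wrapping all the code developed in this section:

// fillFormWithVoice is assumed to wrap all the code developed in this
// section (retrieving the labels, building the promise chain, and
// thanking the user).
document.getElementById('form-demo-voice')
   .addEventListener('click', function(event) {
      // Prevent the submit button from actually submitting the form.
      event.preventDefault();
      fillFormWithVoice();
   });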

The Result

As you may have noted, I haven’t discussed the style of this demo because it’s completely irrelevant here and you’re free to write your own. As an additional note, in the demo below I’ve also created a simple spinner to give visual feedback when the recognizer is ready to do its job.
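One simple way to achieve such feedback, assuming a #spinner element and a visible CSS class (both names are mine), is to toggle the spinner around the recognize() call:

// Hypothetical helper: show the spinner while the recognizer listens.
function recognizeWithSpinner() {
   var spinner = document.getElementById('spinner');
   spinner.classList.add('visible');
   return Speech.recognize().then(function(text) {
      spinner.classList.remove('visible');
      return text;
   }, function(error) {
      spinner.classList.remove('visible');
      throw error;
   });
}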

The result of the code we’ve developed is shown below, but it’s also available as a JSBin:

Form demo

Conclusion

In this tutorial we’ve developed a simple yet fully functional interactive form that a user can fill in using their voice. To do that we’ve used some cutting-edge technologies such as the Web Speech API and promises. The demo should have given you an idea of what’s possible using the new JavaScript APIs and how they can improve the experience of your users. As a final note, remember that you can play with this demo in Chrome only.

I hope you enjoyed this tutorial and have learned something new and interesting.

  • fred

Impressive. Very nice. I am using Chrome on Windows, and I am asked to allow use of the microphone for each field. Is there a way to allow it once for the whole form?

    • Aurelio De Rosa

      You should run the demo on a webpage served through HTTPS.

  • Vasi

Very interesting idea. I have talked about this today with my partner. Unfortunately, your JSBin example is not working for me in Chrome 38.

    • Aurelio De Rosa

      Hi. I’ve just tried it and it still works. Can you double check?

      • Vasi

Today it’s working. I think it was Chrome settings. Thanks.

        • yahooakshay

Any specific Chrome setting that you changed to get it to work?

  • Jason

    Can confirm, still doesn’t work for the Scottish accent. F***ing ELEVEN!

  • Katja Hollaar

Hello, I’m trying to implement an app that recognizes speech from an audio file. What steps would you recommend to do this? I was trying to call https://www.google.com/speech-api/v2/recognize?output=json&lang=LANG&key=MY_API_KEY but the key always seems to be invalid… The idea will be to automatically generate captions based on audio containing speech. Thank you in advance for your reply.

    • Gautam Krishnan

      Hey! Did you manage to get it working?

  • yahooakshay

This is very good, but it stopped working. I tried it on Chrome 38 and Chrome 39; it’s not working in either of them.

  • Sunil Choudhary

Very nice piece of code, but it’s not working on the phone… any tweaks so I can use it on mobile too? Thank you, Aurelio.

  • Saras Space

How can we do this for a checkbox input? Thanks

  • jay m

@aurelioderosa, very interesting post! Thank you for posting. I was experimenting with the Speech API using the method described in your post as the base. I was trying to fill a set of fields with speech, but in some situations (when an exception occurs or speech cannot be recognized), we have to redo the whole filling process. How can I make sure that the filling process continues from where it got stuck? Any suggestions?
