Make a Voice-Controlled Audio Player with the Web Speech API

Ivan Dimov

This article was peer reviewed by Edwin Reynoso and Mark Brown. Thanks to all of SitePoint’s peer reviewers for making SitePoint content the best it can be!

The Web Speech API is a JavaScript API that enables web developers to incorporate speech recognition and synthesis into their web pages.

There are many reasons to do this. For example, it can enhance the experience of people with disabilities (particularly users with sight problems, or users with limited ability to move their hands), or allow users to interact with a web app while performing a different task (such as driving).

If you have never heard of the Web Speech API, or you would like a quick primer, then it might be a good idea to read Aurelio De Rosa’s articles Introducing the Web Speech API, Speech Synthesis API and the Talking Form.

Browser Support

Browser vendors have only recently started implementing both the Speech Recognition API and the Speech Synthesis API. Support for these is still far from perfect, so if you are following along with this tutorial, please use an appropriate browser.

In addition, the speech recognition API currently requires an Internet connection, as the speech gets passed through the wire and the results are returned to the browser. If the connection uses HTTP, the user has to permit a site to use their microphone on every request. If the connection uses HTTPS, then this is only necessary once.
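Because support is uneven, it can help to check for both halves of the API before relying on them. The following is a minimal sketch rather than code from the finished player: the detectSpeechSupport name and its scope parameter are hypothetical, and in a browser you would pass in window.

```javascript
// Hypothetical helper: reports which parts of the Web Speech API a global
// object exposes. In a browser, pass `window` as the scope.
function detectSpeechSupport(scope) {
  return {
    // Chrome exposes speech recognition under a webkit prefix
    recognition: Boolean(scope.SpeechRecognition || scope.webkitSpeechRecognition),
    // Synthesis needs both the controller and the utterance constructor
    synthesis: Boolean(scope.speechSynthesis && scope.SpeechSynthesisUtterance)
  };
}

// Possible usage in a browser:
// var support = detectSpeechSupport(window);
// if (!support.recognition) { /* fall back to regular buttons */ }
```

Checking the two features separately lets a page keep spoken feedback even where recognition is unavailable.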

Speech Recognition Libraries

Libraries can help us manage complexity and can ensure we stay forward compatible. For example, when another browser starts supporting the Speech Recognition API, we would not have to worry about adding vendor prefixes.

One such library is Annyang, which is incredibly easy to work with.

To initialize Annyang, we add their script to our website:

<script src="//"></script>

We can check if the API is supported like so:

if (annyang) { /*logic */ }

And add commands using an object with the command names as keys and the callbacks as methods:

var commands = {
  'show divs': function() { /* logic to show the divs */ },
  'show forms': function() { /* logic to show the forms */ }
};

Finally, we just add them and start the speech recognition using:

annyang.addCommands(commands);
annyang.start();


Voice-controlled Audio Player

In this article, we will be building a voice-controlled audio player. We will be using both the Speech Synthesis API (to inform users which song is beginning, or that a command was not recognized) and the Speech Recognition API (to convert voice commands to strings which will trigger different app logic).

The great thing about an audio player that uses the Web Speech API is that users will be able to surf to other pages in their browser or minimize the browser and do something else while still being able to switch between songs. If we have a lot of songs in the playlist, we could even request a particular song without searching for it manually (if we know its name or singer, of course).

We will not be relying on a third-party library for the speech recognition as we want to show how to work with the API without adding extra dependencies in our projects. The voice-controlled audio player will only be supporting browsers that support the interimResults attribute. The latest version of Chrome should be a safe bet.

As ever, you can find the complete code on GitHub, and a demo on CodePen.

Getting Started — a Playlist

Let’s start with a static playlist. It consists of an object with different songs in an array. Each song is a new object containing the path to the file, the singer’s name and the name of the song:

var data = {
  "songs": [
    {
      "fileName": "",
      "singer": "Jason Shaw",
      "songName": "Running Waters"
    }
  ]
};

We should be able to add new objects to the songs array and have each new song automatically included in our audio player.

The Audio Player

Now we come to the player itself. This will be an object containing the following things:

  • some setup data
  • methods pertaining to the UI (e.g. populating the list of songs)
  • methods pertaining to the Speech API (e.g. recognizing and processing commands)
  • methods pertaining to the manipulation of audio (e.g. play, pause, stop, prev, next)

Setup Data

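This is relatively straightforward.

var audioPlayer = {
  audioData: {
    currentSong: -1, // index of the current song; -1 until one is chosen
    songs: []        // audio for songs the user has already listened to
  }
  // the UI, player and Speech API methods follow here
};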

The currentSong property refers to the index of the song that the user is currently on. This is useful, for example, when we have to play the next/previous song, or stop/pause the song.

The songs array contains all the songs that the user has listened to. This means that the next time the user listens to the same song, we can load it from the array and not have to download it.
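The caching idea can be sketched as follows. This is an illustration under stated assumptions, not the player's actual code: the getSong helper is hypothetical, and the createAudio factory is injected so the sketch does not depend on the browser (there it could simply wrap new Audio(src)).

```javascript
// Hypothetical helper: returns the cached audio object for a playlist index,
// creating and storing it the first time that song is requested.
function getSong(audioData, playlist, index, createAudio) {
  if (!audioData.songs[index]) {
    // First request for this song: load it from its file name/path
    audioData.songs[index] = createAudio(playlist.songs[index].fileName);
  }
  return audioData.songs[index];
}

// Possible usage in a browser:
// var song = getSong(audioPlayer.audioData, data, 0, function (src) {
//   return new Audio(src);
// });
```

Storing each object at the same index as its playlist entry means currentSong can address both the playlist and the cache.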

You can see the full code here.

UI Methods

The UI will consist of a list of available commands, a list of available tracks and a context box to inform the user of both the current operation and the previous command. I won’t go into the UI methods in detail, rather offer a brief overview. You can find the code for these methods here.


This iterates over our previously declared playlist and appends the name of the song, as well as the name of the artist to a list of available tracks.


This indicates which song is currently playing (by marking it green and adding a pair of headphones next to it) as well as those which have finished playing.


This indicates to the user that a song is playing, or that it has ended. It does this via the changeStatusCode method, which adds this information to the box and informs the user of the change via the Speech API.


As mentioned above, this updates the status message in the context box (e.g. to indicate that a new song is playing) and utilizes the speak method to announce this change to the user.


A small helper which updates the last command box.


A small helper to hide or show the spinner icon (which indicates to the user that their voice command is currently being processed).

Player Methods

The player will be responsible for what you might expect, namely: starting, stopping and pausing playback, as well as moving backwards and forwards through the tracks. Again, I don’t want to go into the methods in detail, but would rather point you towards our GitHub repo.


This checks if the user has listened to a song yet. If not, it starts the song, otherwise it just calls the playSong method we discussed previously on the currently cached song. This is located in audioData.songs and corresponds to the currentSong index.


This pauses or completely stops (returns playback time to the song’s beginning) a song, depending on what is passed as the second parameter. It also updates the status code to notify the user that the song has either been stopped or paused.
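The difference between pausing and stopping comes down to whether the playback position is reset. Here is a minimal sketch, assuming an object with a pause method and a currentTime property (as an HTMLAudioElement has); the pauseOrStop name and the returned status strings are hypothetical.

```javascript
// Hypothetical sketch: pauses the audio and, when `stop` is truthy,
// also rewinds it to the beginning.
function pauseOrStop(audio, stop) {
  audio.pause();
  if (stop) {
    audio.currentTime = 0;
  }
  // Text that could be passed on to the status box and the speech output
  return stop ? "Song has been stopped" : "Song has been paused";
}
```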


This either pauses or stops the song based on its first and only parameter.


This checks whether the previous song is cached and if so, it pauses the current song, decrements currentSong and plays the current song again. If the new song is not in the array, it does the same but it first loads the song from the file name/path corresponding to the decremented currentSong index.


If the user has listened to a song before, this method tries to pause it. If there is a next song in our data object (i.e. our playlist) it loads it and plays it. If there is no next song it just changes the status code and informs the user that they have reached the final song.
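The index bookkeeping behind moving backwards and forwards can be sketched in isolation. The helper names below are hypothetical, and returning -1 to signal "no song in that direction" is an assumption made for this example.

```javascript
// Hypothetical helpers: compute the index to move to, or -1 when there is
// no song in that direction.
function nextIndex(currentSong, total) {
  return currentSong + 1 < total ? currentSong + 1 : -1;
}

function prevIndex(currentSong) {
  return currentSong > 0 ? currentSong - 1 : -1;
}
```

Because currentSong starts at -1, asking for the next song before anything has played yields index 0, the first track.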


This takes a keyword as an argument and performs a linear search across song names and artists, before playing the first match.
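A linear search of this kind could look like the sketch below. The findSong name, the case-insensitive substring matching and the -1 return value are assumptions made for illustration; only the songName and singer fields come from the playlist shown earlier.

```javascript
// Hypothetical helper: returns the index of the first song whose name or
// singer contains the keyword (case-insensitive), or -1 if nothing matches.
function findSong(songs, keyword) {
  var needle = keyword.trim().toLowerCase();
  if (needle === "") {
    return -1; // an empty keyword would otherwise match every song
  }
  for (var i = 0; i < songs.length; i++) {
    var name = songs[i].songName.toLowerCase();
    var singer = songs[i].singer.toLowerCase();
    if (name.indexOf(needle) !== -1 || singer.indexOf(needle) !== -1) {
      return i;
    }
  }
  return -1;
}
```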

Speech API Methods

The Speech API is surprisingly easy to implement. In fact, it only takes two lines of code to get a web app talking to users:

var utterance = new SpeechSynthesisUtterance('Hello');
window.speechSynthesis.speak(utterance);

What we are doing here is creating an utterance object which contains the text we wish to be spoken. The speechSynthesis interface (which is available on the window object) is responsible for processing this utterance object and controlling the playback of the resulting speech.

Go ahead and try it out in your browser. It’s that easy!


We can see this in action in our speak method, which reads aloud the message passed as an argument:

speak: function(text, scope) {
  var message = new SpeechSynthesisUtterance(text.replace("-", " "));
  message.rate = 1;
  if (scope) {
    message.onend = function() {
      scope.play();
    };
  }
  window.speechSynthesis.speak(message);
}

If there is a second argument (scope), we call the play method on scope (which would be an Audio object) after the message has finished playing.


This method is not as exciting. It receives a command as a parameter and calls the appropriate method to respond to it. It checks if the user wants to play a specific song with a regular expression, otherwise, it enters a switch statement to test different commands. If none corresponds to the command received, it informs the user that the command was not understood.
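The shape of that logic (a regular expression for specific songs, then a switch for everything else) can be sketched as below. This is not the player's actual method: the parseCommand name, the "play <keyword>" pattern and the returned action objects are all assumptions, and only the command words play, pause, stop, prev and next come from the list of player methods above.

```javascript
// Hypothetical sketch: turns a recognized transcript into an action descriptor.
function parseCommand(command) {
  var text = command.trim().toLowerCase();
  // "play <something>" requests a specific song; a bare "play" falls through
  var specific = /^play\s+(.+)$/.exec(text);
  if (specific) {
    return { action: "search", keyword: specific[1] };
  }
  switch (text) {
    case "play":
    case "pause":
    case "stop":
    case "next":
    case "prev":
      return { action: text };
    default:
      // The caller can tell the user the command was not understood
      return { action: "unknown" };
  }
}
```

Returning a plain object rather than calling the player directly keeps the parsing step easy to test without audio or a microphone.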

You can find the code for it here.

Tying Things Together

By now we have a data object representing our playlist, as well as an audioPlayer object representing the player itself. Now we need to write some code to recognize and deal with user input. Please note that this will only work in webkit browsers.

The code to have users talk to your app is just as simple as before:

var recognition = new webkitSpeechRecognition();
recognition.onresult = function(event) { console.log(event); };
recognition.start();

This will invite the user to allow a page access to their microphone. If you allow access, you can start talking, and when you stop, the onresult event will be fired, making the results of the speech capture available as a JavaScript object.
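To make the shape of that object more concrete: the event carries a results list, each result holds one or more alternatives, and each alternative has a transcript. The sketch below reads the top alternative of the most recent result; the latestTranscript name is hypothetical.

```javascript
// Hypothetical helper: extracts the top alternative of the most recent
// result from a SpeechRecognitionEvent-shaped object.
function latestTranscript(event) {
  var results = event.results;
  if (!results || results.length === 0) {
    return "";
  }
  // results[i] is one recognition result; results[i][0] is its best alternative
  var best = results[results.length - 1][0];
  return best.transcript.trim();
}

// Possible usage in a browser:
// recognition.onresult = function (event) {
//   console.log(latestTranscript(event));
// };
```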

Reference: The HTML5 Speech Recognition API

We can implement this in our app as follows:

if (window['webkitSpeechRecognition']) {
  var speechRecognizer = new webkitSpeechRecognition();

  // Recognition will not end when user stops speaking
  speechRecognizer.continuous = true;

  // Process the request while the user is speaking
  speechRecognizer.interimResults = true;

  // Account for accent
  speechRecognizer.lang = "en-US";

  speechRecognizer.onresult = function (evt) { ... }
  speechRecognizer.onend = function () { ... }

  // Begin listening once the handlers are in place
  speechRecognizer.start();
} else {
  alert("Your browser does not support the Web Speech API");
}

As you can see, we test for the presence of webkitSpeechRecognition on the window object. If it is there, then we’re good to go, otherwise we inform the user that the browser doesn’t support it. If all’s good, we then set a couple of options. Of these, lang is an interesting one, as it can improve the results of the recognition based on where you hail from.

We then declare handlers for the onresult and the onend events, before kicking things off with the start method.

Handling a result

There are a few things we want to do when the speech recognizer gets a result, at least in the context of the current implementation of speech recognition and our needs. Each time there is a result, we want to save it in an array and set a timeout to wait for three seconds, so the browser can collect any further results. After the three seconds are up, we want to loop over the gathered results in reverse order (newer results have a better chance of being accurate) and check whether the recognized transcript contains one of our available commands. If it does, we execute the command and restart the speech recognition. We do this because waiting for a final result can take up to a minute, which would make our audio player seem quite unresponsive and pointless, as it would be faster to just click on a button.

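speechRecognizer.onresult = function (evt) {
  // Save each result so the timeout below can look at them together
  results.push(evt.results);
  if (!timeoutSet) {
    setTimeout(function() {
      timeoutSet = false;
      // Newer results are more likely to be accurate, so check them first
      results.reverse();
      try {
        results.forEach(function (val, i) {
          var el = val[0][0].transcript.toLowerCase();
          if (currentCommands.indexOf(el.split(" ")[0]) !== -1) {
            // Known command: execute it and restart recognition (calls omitted here)
            results = [];
            // forEach has no break, so an exception is thrown to leave the loop
            throw new BreakLoopException;
          }
          if (i === 0) {
            results = [];
          }
        });
      }
      catch(e) {return e;}
    }, 3000)
  }
  timeoutSet = true;
}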

As we are not using a library we have to write more code to set up our speech recognizer, looping over each result and checking if its transcript matches a given keyword.
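The matching step on its own can be sketched with a plain loop, which allows an early return instead of throwing an exception to leave forEach. This is one possible alternative, not the code used in the player: the findCommand name and the idea of passing in an array of transcript strings (oldest first) are assumptions made for this example.

```javascript
// Hypothetical helper: `transcripts` is ordered oldest to newest and
// `commands` lists the first words the player accepts. Returns the newest
// transcript that starts with a known command, or null if there is none.
function findCommand(transcripts, commands) {
  for (var i = transcripts.length - 1; i >= 0; i--) {
    var text = transcripts[i].trim().toLowerCase();
    if (commands.indexOf(text.split(" ")[0]) !== -1) {
      return text;
    }
  }
  return null;
}
```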

Lastly, we restart speech recognition as soon as it ends:

speechRecognizer.onend = function () { speechRecognizer.start(); };

You can see the full code for this section here.

And that’s it. We now have an audio player which is fully functional and voice-controlled. I urge you to download the code from GitHub and have a play about with it, or check out the CodePen demo. I have also made available a version which is served via HTTPS.


I hope this practical tutorial has served as a healthy introduction to what is possible with the Web Speech API. I think that we will see use of this API grow, as implementations stabilize and new features are added. For example, I see a YouTube of the future which is completely voice-controlled, where we can watch the videos of different users, play specific songs and move between songs just with voice commands.

There are also many other areas where the Web Speech API could bring improvements, or open new possibilities. For example, browsing email, navigating websites, or searching the web, all with your voice.

Are you using this API in your projects? I’d love to hear from you in the comments below.