Accessible Audio Descriptions for HTML5 Video

By James Edwards

A client recently asked me to produce an accessible video player, and one of the features she was very keen to have is audio descriptions. Audio descriptions are intended for people who are blind or have impaired vision, providing additional spoken information to describe important visual details.

Traditionally, audio-described videos have to be made specially, with the audio encoded in a separate track of the single video file. It takes pretty specialised video-editing equipment to encode these audio tracks, and that raises the bar for most content producers beyond a practical level.

All the audio-described content I’ve seen on the web is like this. For example, BBC iPlayer has a selection of such content, but the video player doesn’t give you control over the relative volumes, and you can’t turn the audio descriptions off — you can only watch separate described or non-described versions of the programme.


Enter HTML5

The HTML5 video specification does provide an audioTracks object, which would make it possible to implement an on/off button, and to control the audio and video volumes separately. But its browser support is virtually non-existent — at the time of writing, only IE10 supports this feature.

In any case, what my client wanted was audio descriptions in a separate file, which could be added to a video without needing to create a separate version, and which would be easy to make without specialised software. And of course, it had to work in a decent range of browsers.

So my next thought was to use a MediaController, which is a feature of HTML5 audio and video that allows you to synchronise multiple sources. However, browser support for this is equally scant — at the time of writing, only Chrome supports this feature.

But you know — even without that support, it’s clearly not a problem to start two media files at the same time; it’s just a case of keeping them in sync. So can we use existing, widely-implemented features to make that work?

Video Events

The video API provides a number of events we can hook into, which should make it possible to synchronise audio playback with events from the video:

  • The "play" event (which fires when the video is played).
  • The "pause" event (which fires when the video is paused).
  • The "ended" event (which fires when the video ends).
  • The "timeupdate" event (which fires continually while the video is playing).

It’s the "timeupdate" event that’s really crucial. The frequency at which it fires is not specified, and in practice it varies considerably — but as a rough, overall average, it amounts to 3–5 times per second, which is enough for our purposes.

I’ve seen a similar approach being tried to synchronise two video files, but it isn’t particularly successful, because even tiny discrepancies are very obvious. But audio descriptions generally don’t need to be so precisely in sync — a delay of 100ms either way would be acceptable — and playing audio files is far less work for the browser anyway.

So all we need to do is use the video events we have, to lock the audio and video playback together:

  • When the video is played, play the audio.
  • When the video is paused, pause the audio.
  • When the video ends, pause the video and audio together.
  • When the time updates, set the audio time to match the video time, if they’re different.
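
As a sketch, the four rules above can be expressed as a pure dispatch function that maps a video event to the action taken on the audio. The syncAction name and its state argument are my own inventions for illustration; the actual event wiring appears in the full script later in the article.

```javascript
// Illustrative sketch only: maps each video event to the action we take
// on the audio track when no native MediaController is managing playback.
// Returns 'play', 'pause', 'resync' or null (nothing to do).
function syncAction(eventType, state) {
  if (state.hasController) return null; // a native controller handles sync
  switch (eventType) {
    case 'play':
      return state.audioPaused ? 'play' : null;
    case 'pause':
    case 'ended': // on 'ended' the real script also pauses the video
      return state.audioPaused ? null : 'pause';
    case 'timeupdate':
      return state.audioTime !== state.videoTime ? 'resync' : null;
    default:
      return null;
  }
}
```

The 'timeupdate' case uses a strict comparison here; as explained in a moment, the comparison works better when done in whole seconds.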

After some experimentation, I discovered that the best results are achieved by comparing the time in whole seconds, like this:

if(Math.ceil(audio.currentTime) != Math.ceil(video.currentTime))
  audio.currentTime = video.currentTime;

This seems counter-intuitive, and initially I had assumed we’d need as much precision as the data provides, but that doesn’t seem to be the case. By testing it using a literal audio copy of the video’s soundtrack (i.e. so the audio and video both produce identical sound), it’s easy to hear when the synchronisation is good or bad. Experimenting on that basis, I got much better synchronisation when rounding the figures than when comparing them precisely.
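
To see the effect of the whole-second comparison in isolation, it can be wrapped in a small predicate (the needsResync name is mine, not part of the final script):

```javascript
// Sketch of the comparison described above: only resync when the two
// times fall in different whole seconds (the name needsResync is mine).
function needsResync(audioTime, videoTime) {
  return Math.ceil(audioTime) !== Math.ceil(videoTime);
}

needsResync(4.2, 4.6); // false: both round up to 5, so leave the audio alone
needsResync(4.2, 5.1); // true: 5 vs. 6, so snap the audio to the video time
```

Small sub-second differences are tolerated rather than corrected, which avoids the constant nudging that made precise comparisons sound worse.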

So here’s the final script. If the browser supports MediaController then we just use that, otherwise we implement manual synchronisation, as described:

var video = document.getElementById('video');
var audio = document.getElementById('audio');

if(typeof(window.MediaController) === 'function') {
  var controller = new MediaController();
  video.controller = controller;
  audio.controller = controller;
} else {
  var controller = null;
}

video.volume = 0.8;
audio.volume = 1;

video.addEventListener('play', function() {
  if(!controller && audio.paused) {
    audio.play();
  }
}, false);

video.addEventListener('pause', function() {
  if(!controller && !audio.paused) {
    audio.pause();
  }
}, false);

video.addEventListener('ended', function() {
  video.pause();
  audio.pause();
}, false);

video.addEventListener('timeupdate', function() {
  if(!controller && audio.readyState >= 4) {
    if(Math.ceil(audio.currentTime) != Math.ceil(video.currentTime)) {
      audio.currentTime = video.currentTime;
    }
  }
}, false);

Note that this script creates the MediaController through scripting, whereas it’s also possible to define a controller declaratively, using the static "mediagroup" attribute:

<video mediagroup="foo"> ... </video>
<audio mediagroup="foo"> ... </audio>

If we did that, then it would work without JavaScript in Chrome. It would sync the media sources, but the user would have no control over the audio (including not being able to turn it off), because the browser wouldn’t know what the audio represents. This is the case in which it would be better to have the audio encoded into the video, because then it could appear in the audioTracks object, and the browser could recognise that and be able to provide native controls.

But since we have no audioTracks data, that’s rather a moot point! So if scripting is not available, the audio simply won’t play.

Here’s the final demo, which will work in any recent version of Opera, Firefox, Chrome, Safari, or IE9 or later:

This is just a simple proof-of-concept demo, of course — there’s no initial feature detection, and it only has the basic controls provided by the native "controls" attribute. For a proper implementation it would need custom controls, to provide (among other things) a button to switch the audio on and off, and separate volume sliders. The interface should also be accessible to the keyboard, which is not the case in some browsers’ native controls. And it would need to handle buffering properly — as it is, if you seek past the point where the video has preloaded, the audio will continue to play freely until the video has loaded enough to bring it back into sync.
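
To give a flavour of how proper buffering support might work, here is a sketch of the kind of check a custom player could run before letting both sources play. The helper name and the three-second margin are my own assumptions; a real implementation would read the ranges from the video.buffered TimeRanges object inside "progress" event handlers.

```javascript
// Sketch only: decide whether playback can safely continue, given the
// parts of the video that are already cached. Ranges are plain
// [start, end] pairs here so the logic stands alone; in a real player
// they would come from video.buffered. The 3-second margin is arbitrary.
function canKeepPlaying(bufferedRanges, currentTime, margin) {
  if (margin === undefined) margin = 3;
  for (var i = 0; i < bufferedRanges.length; i++) {
    var start = bufferedRanges[i][0];
    var end = bufferedRanges[i][1];
    if (start <= currentTime && currentTime + margin <= end) {
      return true; // enough data cached ahead of the playhead
    }
  }
  return false; // pause both sources and wait for more to preload
}

// e.g. the first 30 seconds are cached, plus a chunk around a seek target:
var cached = [[0, 30], [60, 75]];
canKeepPlaying(cached, 10); // true: 10s-13s lies within the 0-30s range
canKeepPlaying(cached, 29); // false: only one second is cached ahead
```

When the check fails, the player would pause both video and audio, keep monitoring progress events, and resume once a few more seconds have preloaded.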

I might also mention that the descriptions themselves are hardly up to professional standards! That’s my voice you can hear, recorded and converted using Audacity. But such as it is, I think it makes an effective demonstration of how low the technical barrier to entry is with this approach. I didn’t have to edit the video, and I made the audio in an hour with free software.

As a proof of concept, I’d say it was pretty successful — and I’m sure my client will be very pleased!

  • liza

    good information

  • Great post, I always look at these and think they must take so long to write. Appreciate it!

  • Thanks, I’m glad you like it.

    What really takes the time with posts like this is researching, making and testing the demo. I spent a day on that, but only a couple of hours actually writing the blog post.

    The trick is to write about things that you’re working on anyway, so that research and development time is an investment in knowledge and skills.

  • Please don’t forget about captioning as one of the important components of videos – it is not mentioned in your article. Accessibility is not only for blind people. There are 50 million deaf and hard-of-hearing people in the USA who need access to audio via captions and transcripts. I have an audio accessibility website – you can click on my name to go to the website to learn more. Thanks!

    • Nobody’s forgetting about captioning, it’s just not what this particular article is about.

      I was going to add captions to this demo, but it seemed counter-productive, because (as you say) captions and descriptions are intended for different audiences.

      Captions are a solved problem in any case, so it doesn’t need mentioning in this particular context (ie. when talking about technical limitations with HTML5 video), because there are no technical limitations.

  • Great post!
    This is exactly what I was looking for in HTML5. I got your blog post as a reference from another writer/researcher/expert.

    Do you think the future of HTML5 video with this concept will be safe (safe in the sense, sync between a/v will be proper)?
    Also @mediagroup will be an alternative for this hack, right?

    Consider my questions as a newbie’s. I am only a beginner.

    thank you James.

    • That’s a good question. As long as the media sources don’t have to buffer (i.e. they’ve already preloaded when you play, or your connection is fast enough to never be a problem) then the synchronisation is totally solid.

      It’s not absolute of course — because the speed of timing events is not consistent, the difference in time between the two sources can be plus or minus maybe a quarter of a second. That’s way too much for something like DJ mixing, but easily acceptable for audio descriptions.

      But the real problem comes when buffering has to happen. If you seek a long way forward so the video has to buffer for a few seconds, then the audio will keep playing until the video catches up. Or if you start playing the video before the audio has loaded enough, then the audio won’t start playing until it has. The native MediaController doesn’t have this problem, because it locks the sources together, so if one has to buffer then the other one pauses.

      But even without native controllers, those problems are fixable. In the client project I’m working on, I’ve developed this idea much further. What I do is monitor the “progress” events, which give information on which parts of the video are in cache, and use that data to detect when the video has to buffer. When that happens, I pause the video and the audio, and then continue to monitor progress events until a few more seconds have preloaded, then play again. If the user seeks, a similar process occurs (and there are additional seeking events and properties to detect that too).

      But ultimately, this is just a hack to cope with lack of support for MediaController — because only Chrome implements that, and its implementation is quite buggy anyway (in fact, in my client project, Chrome uses my solution and not a native controller, because it works better!)

      However given time, when most browsers have a solid and reliable implementation of native media controllers, then hacks like this won’t be necessary.
