Adding Subtitles to HTML5 Audio


Video subtitles

For <video>, we have the <track> element, which we can use to attach WebVTT subtitle files to any video. This has been in all modern browsers since 2015, and you should definitely use it before any bloated JS plugins.

Now, it would be nice if the same also applied to <audio>.

It just works?

<audio preload="metadata" controls crossorigin="anonymous">
<track src="https://skyfalls.xyz/sub.vtt" kind="captions" srclang="en" label="English">
<source src="https://skyfalls.xyz/mumble.ogg">
</audio>

With this markup, Chrome loads the VTT file successfully and shows a Captions option in the menu. But when you click on it, nothing happens.

The default media menu in Chrome

I couldn’t make it display the captions like with HTML video. At least nothing I’ve tried has worked.

Firefox didn’t even try to load the VTT file.

Interlude: CORS

Since a page can read the contents of a subtitle file once it’s loaded, loading a VTT file is governed by the same-origin policy (SOP).

Unsafe attempt to load URL ... from frame with URL ... Domains, protocols, and ports must match.

Similar to loading a page with an <iframe>, if the VTT file comes from another origin, you’ll need to add crossorigin="anonymous" to your media element (<audio> or <video>). Both URLs should also be served over the same protocol. Yours are probably https.

Your server also needs to send proper CORS headers, such as Access-Control-Allow-Origin, along with the subtitle file.
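As a rough sketch of the server side, assuming the subtitle host runs Node.js: the helper names vttCorsHeaders and handleSubtitleRequest are made up for illustration, and res is expected to behave like the response object Node’s built-in http server passes to its request handler.

```javascript
// Headers that let a page on `pageOrigin` read the subtitle file cross-origin.
// Using "*" instead of a specific origin also works with crossorigin="anonymous".
function vttCorsHeaders(pageOrigin) {
  return {
    "Content-Type": "text/vtt; charset=utf-8",
    "Access-Control-Allow-Origin": pageOrigin,
  };
}

// Hypothetical request handler body: answer with the VTT text plus CORS headers.
// `res` is assumed to be a Node http.ServerResponse (writeHead + end).
function handleSubtitleRequest(vttText, pageOrigin, res) {
  res.writeHead(200, vttCorsHeaders(pageOrigin));
  res.end(vttText);
}
```

If the files sit behind nginx, a CDN, or object storage instead, the same Access-Control-Allow-Origin header would be set in that service’s configuration.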

Video hack

During research, I found these two older posts from 2015 and 2017: WebVTT and Audio, Native HTML5 captions and titles for audio content with WebVTT.

They both present the same solution, which is to use the <video> tag for audio instead.

Apparently it works, since browsers don’t really care about what file type you’re trying to play. So here is audio played with a <video> element in Chrome:

Default Chrome audio controls, same as audio elements, different from true video

Notice how the controls differ from when playing an actual video.

Now, we can select a caption track as usual, but still, nothing shows up. Not until we hide the controls.

Captions displayed in small font

Ah. It’s been hiding behind the controls the whole time. Can we style the caption to move it up a bit, so it’s not obstructed?

Well, you can select the pseudo-element video::cue, but the list of permitted properties is extremely limited. In short, nothing on that list can move the text around.

Here we reach another dead end with this “hack”.

JS polyfill

Hard mode

When the native implementation fails you, it’s time for some JavaScript action. Since we’re here instead of Google, let’s skip off-the-shelf plugins and try to build our own.

First, the obvious solution. We fetch() a subtitle file, parse it, and display it ourselves. You can implement a simpler format that only contains what you need, e.g. text and timestamps. In case you haven’t noticed, WebVTT supports far more than that, which makes it much more complex to parse.
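To make that concrete, here’s a minimal sketch. The format is invented for this example and is not WebVTT: each non-empty line is assumed to be "<start seconds> <end seconds> <text>". The function names parseSimpleSubtitles and loadSubtitles are hypothetical too.

```javascript
// Parse the made-up line format "<start> <end> <text>" into cue objects.
function parseSimpleSubtitles(source) {
  const cues = [];
  for (const rawLine of source.split(/\r?\n/)) {
    const line = rawLine.trim();
    if (line === "") continue;
    const match = /^(\d+(?:\.\d+)?)\s+(\d+(?:\.\d+)?)\s+(.+)$/.exec(line);
    if (!match) continue; // silently skip lines that don't fit the format
    cues.push({ start: Number(match[1]), end: Number(match[2]), text: match[3] });
  }
  // Keep cues ordered by start time so lookups stay predictable.
  return cues.sort((a, b) => a.start - b.start);
}

// Fetch a subtitle file and parse it; the URL is whatever your page uses.
async function loadSubtitles(url) {
  const response = await fetch(url);
  if (!response.ok) {
    throw new Error(`Failed to load subtitles: ${response.status}`);
  }
  return parseSimpleSubtitles(await response.text());
}
```

The cross-origin rules from the CORS interlude apply to fetch() as well, so a subtitle file on another origin still needs the right headers.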

After grabbing a list of subtitles, we need to match them with the audio stream. For that, listen for the timeupdate event on <audio>. It fires repeatedly during playback, so we can read the current position from currentTime each time, and all that’s left is showing the corresponding subtitle line.
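The matching step could look roughly like this. It assumes cues are plain objects of the shape { start, end, text } with times in seconds, and that output is any element you’ve set aside for the subtitle text. findActiveCue and showSubtitles are names made up for this sketch.

```javascript
// Return the cue that covers `time`, or null when we're between lines.
function findActiveCue(cues, time) {
  return cues.find((cue) => time >= cue.start && time < cue.end) ?? null;
}

// Keep `output` in sync with playback of `audio`.
function showSubtitles(audio, output, cues) {
  let current = null;
  audio.addEventListener("timeupdate", () => {
    const cue = findActiveCue(cues, audio.currentTime);
    if (cue === current) return; // same line as last time, skip the DOM write
    current = cue;
    output.textContent = cue ? cue.text : "";
  });
}
```

Browsers only fire timeupdate a few times per second, so very short lines can show up late or get skipped. For most speech that’s fine.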

But I think we can do better than that.

Easy mode

As we’ve established before, we can just convert all our <audio> elements to <video>, and get additional features for free. This includes loading subtitles from the VTT file, parsing, and time matching.

So it makes sense to do this: First, we convert all <audio> to <video>. At the end of this post, I have another simple hack for doing this.

Then, we can access the loaded <track>s with HTMLMediaElement.textTracks. This returns a TextTrackList instance containing our subtitles.

Next, we can get the individual TextTrack object with HTMLMediaElement.textTracks[index]. The TextTrack object has a cuechange event, which fires whenever the set of active cues (a fancy name for subtitle lines) changes.

After that, we hide Chrome’s default subtitles by setting TextTrack.mode="hidden". In this mode, cues still load and cuechange still fires, but the browser doesn’t draw anything.

Finally, we display the cue line. TextTrack.activeCues returns a list of the TextTrackCues that are currently active. TextTrackCue is an interface, and its implementation, VTTCue, has a method called getCueAsHTML(), which returns a DocumentFragment.
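Put together, the steps above can be sketched like this. It isn’t necessarily the demo’s exact code: renderCaptions is a made-up name, video is assumed to be a media element that already has a <track> child, and output is any element where the captions should appear.

```javascript
// Render the cues of one text track into `output` instead of the native overlay.
function renderCaptions(video, output, trackIndex = 0) {
  const track = video.textTracks[trackIndex];
  if (!track) return null; // no <track> at that index

  // "hidden" keeps cues loading and cuechange firing, but nothing is drawn natively.
  track.mode = "hidden";

  track.addEventListener("cuechange", () => {
    // activeCues is an array-like list; it can be null if the track gets disabled.
    const fragments = Array.from(track.activeCues ?? [], (cue) => cue.getCueAsHTML());
    // Swap whatever was shown before for the current cue content.
    output.replaceChildren(...fragments);
  });

  return track;
}
```

A call might look like renderCaptions(document.querySelector("video"), document.querySelector("#captions")), where #captions is a placeholder id. Because output is a normal element, it can be positioned and styled with ordinary CSS, which is exactly what ::cue wouldn’t allow.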

And we’re done!

Demo

Here’s a simple demo I made. You can check out the code with right-click -> inspect frame source.

Appendix: Changing element tag

I found this answer by Martin Braun on Stack Overflow when searching for this. Basically, you can change an element’s markup with Element.outerHTML, which seems to be a new thing for most people.

With this trick, we can write something like this:

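document.querySelectorAll("audio").forEach(e => {
    // Swap the opening and closing tags, and mark the new element for later lookup
    e.outerHTML = e.outerHTML
        .replace(/^<audio/, "<video data-audio-polyfill")
        .replace(/<\/audio>$/, "</video>")
})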

Important: The browser creates new elements when we change the tag. That means e here is still the original <audio> element, but it’s no longer in the DOM tree.

This is why I added the data-audio-polyfill attribute, so we can select the newly created <video> elements later.
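For example, the conversion and the later lookup could be wrapped up like this. The helper names are invented for this sketch, and root is assumed to be document or any element containing the players.

```javascript
// Rewrite the serialized markup of a single <audio> element.
function audioMarkupToVideo(markup) {
  return markup
    .replace(/^<audio/, "<video data-audio-polyfill")
    .replace(/<\/audio>$/, "</video>");
}

// Convert every <audio> under `root`, then return the newly created <video> elements.
function convertAudioElements(root) {
  root.querySelectorAll("audio").forEach((el) => {
    // Assigning outerHTML replaces `el` in the tree; `el` itself becomes detached.
    el.outerHTML = audioMarkupToVideo(el.outerHTML);
  });
  // The marker attribute is how we find the replacements again.
  return Array.from(root.querySelectorAll("video[data-audio-polyfill]"));
}
```

Since the old elements are thrown away, any event listeners attached to the original <audio> elements are lost, so it’s safest to run this before wiring anything else up.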