Cross browser speech synthesis - the hard way and the easy way

Jan Küster - Dec 7 '21 - Dev Community

When I implemented my first speech-synthesis app using the Web Speech API I was shocked how hard it was to setup and execute it with cross-browser support in mind:

  • Some browsers don't support speech synthesis at all, for instance IE (at least I don't care 🤷‍♂️) and Opera (I do care 😠) and a few more mobile browsers (I haven't decided yet whether I care or not 🤔).
  • On top of that, each browser implements the API differently or with specific quirks that the other browsers don't have.

Just try it yourself: run the MDN speech synthesis example on different browsers and different platforms:

  • Linux, Windows, MacOS, BSD, Android, iOS
  • Firefox, Chrome, Chromium, Safari, Opera, Edge, IE, Samsung Browser, Android Webview, Safari on iOS, Opera Mini

You will realize that this example only works on a subset of these platform-browser combinations. Worse: once you start researching, you'll be shocked how quirky and underdeveloped this whole API still is in 2021/2022.

To be fair: it is still labeled as experimental technology. However, it has been almost 10 years since it was first drafted, and it still is not a living standard.

This makes it much harder to leverage in our applications, and I hope this guide will help you get the most out of it for as many browsers as possible.


Minimal example

Let's approach this topic step-by-step and start with a minimal example that all browsers (that generally support speech synthesis) should run:

if ('speechSynthesis' in window) {
  window.speechSynthesis.speak(
    new SpeechSynthesisUtterance('Hello, world!')
  )
}

You can simply copy that code and execute it in your browser console.

If you have basic support, you will hear some "default" voice speaking the text 'Hello, world!'. Whether it sounds natural depends on the default voice that is used.


Loading voices

Browsers may detect your current language and select a default voice, if one is installed. However, this may not be the language you want the text to be spoken in.

In that case you need to load the list of voices, which are instances of SpeechSynthesisVoice. This is the first major obstacle, where browsers behave quite differently:

Load voices sync-style

const voices = window.speechSynthesis.getVoices()
voices // Array of voices or empty if none are installed

Firefox and Safari Desktop simply load the voices immediately, sync-style. On Chrome Desktop and Chrome Android, however, this returns an empty array, and on Firefox Android it may return an empty array, too (see next section).

Load voices async-style

window.speechSynthesis.onvoiceschanged = function () {
  const voices = window.speechSynthesis.getVoices()
  voices // Array of voices or empty if none are installed
}

This method loads the voices asynchronously, so your code needs a callback or a Promise wrapper. Firefox Desktop does not support this method at all, although it is defined as a property of window.speechSynthesis, while Safari does not have it at all.

In contrast: Firefox Android loads the voices the first time using this method and on a refresh has them available via the sync-style method.
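The two loading styles can be combined into a single Promise-based helper. This is only a sketch (not how any particular library implements it); the synth parameter is injected so the helper can be exercised with a mock instead of the real window.speechSynthesis:

```javascript
// Sketch: resolve voices either sync-style (Firefox/Safari Desktop)
// or async-style via onvoiceschanged (Chrome, Firefox Android).
// The synth parameter defaults to the real API but can be mocked.
const loadVoicesAsync = (synth = window.speechSynthesis) =>
  new Promise(resolve => {
    const voices = synth.getVoices()

    if (voices.length > 0) {
      // voices were already available sync-style
      return resolve(voices)
    }

    // otherwise wait for the voiceschanged event
    synth.onvoiceschanged = () => resolve(synth.getVoices())
  })
```

Keep in mind this still never resolves on browsers where neither style works, such as the older Safari versions covered in the next section.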

Loading using interval

Some users of older Safari versions have reported that their voices are not available immediately (while onvoiceschanged is not available either). For this case we need to check for the voices at a fixed interval:

let timeout = 0
const maxTimeout = 2000
const interval = 250

const loadVoices = (cb) => {
  const voices = speechSynthesis.getVoices()

  if (voices.length > 0) {
    return cb(undefined, voices)
  }

  if (timeout >= maxTimeout) {
    return cb(new Error('loadVoices max timeout exceeded'))
  }

  timeout += interval
  setTimeout(() => loadVoices(cb), interval)
}

loadVoices((err, voices) => {
  if (err) return console.error(err)

  voices // voices loaded and available
})

Speaking with a certain voice

There are use-cases where the default selected voice does not match the language of the text to be spoken. In that case we need to change the voice for the "utterance" to speak.

Step 1: get a voice by a given language

// assume voices are loaded, see previous section
const getVoiceByLang = lang => speechSynthesis
  .getVoices()
  .find(voice => voice.lang.startsWith(lang))

const german = getVoiceByLang('de')

Note: voices have standard language codes, like en-GB, en-US or de-DE. However, on Android's Samsung Browser or Android Chrome, voices have underscore-connected codes, like en_GB.

On Firefox Android, in turn, voices have three characters before the separator, like deu-DEU-f00 or eng-GBR-f00.

However, they all start with the language code, so passing a two-letter short-code should be sufficient.
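Because of these inconsistent codes, it is safer to normalize the separators before matching. The helper names below (matchesLang, getVoiceByLang) are made up for this article and not part of the Web Speech API:

```javascript
// Hypothetical helpers: normalize the different separator styles
// ('de-DE', 'en_GB', 'deu-DEU-f00') so a two-letter language
// short-code is enough to match a voice.
const matchesLang = (voiceLang, lang) => {
  const normalized = voiceLang.toLowerCase().replace(/_/g, '-')
  return normalized.startsWith(lang.toLowerCase())
}

const getVoiceByLang = (voices, lang) =>
  voices.find(voice => matchesLang(voice.lang, lang))
```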

Step 2: create a new utterance

We can now pass the voice to a new SpeechSynthesisUtterance and, as your precognitive abilities correctly manifest, there are again some browser-specific issues to consider:

const text = 'Guten Tag!'
const utterance = new SpeechSynthesisUtterance(text)

if (utterance.text !== text) {
  // I found no browser yet that does not support text
  // as constructor arg, but who knows!?
  utterance.text = text
}

utterance.voice = german // iOS requires voice
utterance.lang = german.lang // Android Chrome requires lang
utterance.voiceURI = german.voiceURI // who knows if required?

utterance.pitch = 1
utterance.volume = 1

// the API allows up to 10, but values > 2 break speech on Chrome
utterance.rate = 1
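To guard against the rate issue mentioned in the comment above, the values can be clamped defensively. The 0.1 to 2 bounds are an assumption based on the observed Chrome behaviour, not limits taken from the spec (which allows rates up to 10):

```javascript
// Clamp rate and pitch into ranges that work reliably across
// browsers; the spec allows rates up to 10, but values > 2 are
// known to break speech on Chrome.
const safeRate = rate => Math.max(0.1, Math.min(2, rate))
const safePitch = pitch => Math.max(0, Math.min(2, pitch))
```

Usage: `utterance.rate = safeRate(2.5) // becomes 2`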

We can now pass the utterance to the speak function as a preview:

speechSynthesis.speak(utterance) // speaks 'Guten Tag!' in German

Step 3: add events and speak

This is of course just half of it. We actually want deeper insight into what's happening and what's missing by tapping into some of the utterance's events:

const handler = e => console.debug(e.type)

utterance.onstart = handler
utterance.onend = handler
utterance.onerror = e => console.error(e)

// SSML markup is rarely supported
// See: https://www.w3.org/TR/speech-synthesis/
utterance.onmark = handler

// word boundaries are supported by
// Safari on macOS and by browsers on Windows,
// but not on Linux or in Android browsers
utterance.onboundary = handler

// not supported / fired
// on many browsers somehow
utterance.onpause = handler
utterance.onresume = handler

// finally speak and log all the events
speechSynthesis.speak(utterance)

Step 4: Chrome-specific fix

Longer texts on Chrome Desktop are cancelled automatically after 15 seconds. This can be fixed either by chunking the texts or by using an interval of "zero-latency" pause/resume combinations. At the same time this fix breaks on Android, since Android devices don't implement speechSynthesis.pause() as pause but as cancel:

let timer

utterance.onstart = () => {
  // detecting the platform is up to you, as
  // this is a huge topic of its own
  if (!isAndroid) {
    resumeInfinity(utterance)
  }
}

const clear = () => { clearTimeout(timer) }

utterance.onerror = clear
utterance.onend = clear

const resumeInfinity = (target) => {
  // prevent memory-leak in case utterance is deleted, while this is ongoing
  if (!target && timer) { return clear() }

  speechSynthesis.pause()
  speechSynthesis.resume()

  timer = setTimeout(function () {
    resumeInfinity(target)
  }, 5000)
}
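The chunking alternative mentioned above can be sketched like this. The 120 character budget is a rough assumption, not a measured limit:

```javascript
// Sketch of the chunking alternative: split a long text at sentence
// boundaries into pieces short enough to finish well within Chrome's
// ~15 second limit. The 120 character budget is a rough assumption.
const chunkText = (text, maxLength = 120) => {
  const sentences = text.match(/[^.!?]+[.!?]*\s*/g) || [text]
  const chunks = []
  let current = ''

  for (const sentence of sentences) {
    if (current && (current + sentence).length > maxLength) {
      chunks.push(current.trim())
      current = ''
    }
    current += sentence
  }

  if (current.trim()) chunks.push(current.trim())
  return chunks
}
```

Each chunk can then be spoken as its own utterance, starting the next one from the previous utterance's end event.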

Furthermore, some browsers don't update the speechSynthesis.paused property when speechSynthesis.pause() is executed (even though speech is correctly paused). You then need to manage these states yourself.
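A minimal way to track the paused state yourself is a thin wrapper around the calls. Again a sketch, with the synth injected so it can be tested with a mock:

```javascript
// Track the paused state manually, since speechSynthesis.paused
// is not updated reliably on every browser.
const createSpeechState = (synth = window.speechSynthesis) => {
  let paused = false

  return {
    pause () { synth.pause(); paused = true },
    resume () { synth.resume(); paused = false },
    get paused () { return paused }
  }
}
```

On Android even this is misleading, since pause() actually cancels, as described in the next section.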


Issues that can't be fixed with JavaScript

All the above fixes rely on JavaScript, but some issues are platform-specific. You need to design your app in a way that avoids these issues where possible:

  • All browsers on Android actually do a cancel/stop when calling speechSynthesis.pause; pause is simply not supported on Android 👎
  • There are no voices on Chromium-Ubuntu and Ubuntu-derivatives unless the browser is started with a flag 👎
  • If on Chromium-Desktop Ubuntu the very first page wants to load speech synthesis, then there are no voices ever loaded until the page is refreshed or a new page is entered. This can be fixed with JavaScript, but auto-refreshing the page can lead to very bad UX. 👎
  • If voices are not installed on the host-OS and there are no voices loaded from remote by the browser, then there are no voices and thus no speech synthesis 👎
  • There is no chance to just instant-load custom voices from remote and use them as a shim in case there are no voices 👎
  • If the installed voices are just bad, users have to manually install better voices 👎

Making your life easier with EasySpeech

Now you have seen the worst and, believe me, it takes ages to implement all potential fixes.

Fortunately I already did this and published a package to NPM, with the intent to provide a common API that handles most issues internally and provides the same experience across browsers (that support speechSynthesis):

GitHub: leaonline / easy-speech

🔊 Cross browser Speech Synthesis, also known as Text to Speech or TTS; no dependencies; uses the Web Speech API


โญ๏ธ Why EasySpeech?

This project was created because it's always a struggle to get the synthesis part of the Web Speech API running on most major browsers.

✨ Features

  • 🪄 Single API for using speechSynthesis across multiple browsers
  • 🌈 Async API (Promises, async/await)
  • 🚀 Hooks for all events; global and/or voice-instance-specific
  • 🌱 Easy to set up and integrate: auto-detects and loads available voices
  • 🔧 Includes fixes or workarounds for many browser-specific quirks
  • 📝 Internal logging via EasySpeech.debug hook
  • 📦 Multiple build targets
  • 🎮 Live demo to test your browser

Note: this is not a polyfill package; if your target browser does not support speech synthesis or the Web Speech API, this package is not usable.

🚀 Live Demo

The live demo is available at https://leaonline.github.io/easy-speech/. You can use it to test your browser for speechSynthesis support and functionality.


You should give it a try the next time you want to implement speech synthesis. It also comes with a demo page, so you can easily test and debug your devices there: https://jankapunkt.github.io/easy-speech/

Let's take a look how it works:

import EasySpeech from 'easy-speech'

// sync, returns an Object with the detected features
EasySpeech.detect()

EasySpeech.init()
  .then(() => {
    EasySpeech.speak({ text: 'Hello, world!' })
  })
  .catch(e => console.error('no speech synthesis:', e.message))

It will not only detect which features are available, but also load an optimal default voice based on a few heuristics.

Of course there is much more to use and the full API is also documented via JSDoc: https://github.com/jankapunkt/easy-speech/blob/master/API.md

If you like it, leave a star, and please file an issue if you find (yet another) browser-specific issue.

