Text-to-Speech in Python: On-Device Solutions

grace - Aug 16 - Dev Community

August 16th, 2024 · 2 min read

Text-to-Speech (TTS) technology, also known as speech synthesis, converts text into human-like speech. Deep learning has brought major advances in TTS quality and naturalness, but at the cost of higher computational requirements. Most big tech companies offer cloud-based TTS APIs, such as Google Text-to-Speech, Amazon Polly, and Microsoft Text-to-Speech, and newer providers such as ElevenLabs and Coqui Studio have emerged with similar offerings. While convenient, these services require an internet connection, raise privacy concerns, and are vulnerable to network outages. On-device solutions offer more flexibility and privacy by synthesizing speech directly on the user's device, but few such options exist. This article explores three open-source Python libraries, plus Picovoice Orca Text-to-Speech.

PyTTSx3

PyTTSx3 is a Python library that uses the popular eSpeak speech synthesis engine on Linux, NSSpeechSynthesizer on macOS, and SAPI5 on Windows. Getting started is straightforward:

  1. Install pyttsx3:
pip install pyttsx3
  2. Save synthesized speech to a file in Python:
import pyttsx3

engine = pyttsx3.init()
engine.save_to_file(text='Hello World', filename='PATH/TO/OUTPUT.wav')
engine.runAndWait()

While pyttsx3 is simple to use, eSpeak's voices sound robotic compared to more modern TTS systems.
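Speaking rate and volume can be adjusted before saving, which sometimes makes the output easier to listen to. The following is a minimal sketch that assumes pyttsx3's documented 'rate' (words per minute) and 'volume' (0.0 to 1.0) properties; the helper names configure_engine and save_speech are hypothetical and not part of the library.

```python
def configure_engine(engine, rate=150, volume=0.8):
    """Apply a speaking rate (words per minute) and a volume to an engine."""
    # Clamp the volume into the 0.0 to 1.0 range that pyttsx3 documents.
    volume = max(0.0, min(1.0, volume))
    engine.setProperty("rate", rate)
    engine.setProperty("volume", volume)
    return engine


def save_speech(text, filename, rate=150, volume=0.8):
    """Synthesize `text` to `filename` using the given rate and volume."""
    # Imported lazily so configure_engine can be used without pyttsx3 installed.
    import pyttsx3

    engine = configure_engine(pyttsx3.init(), rate=rate, volume=volume)
    engine.save_to_file(text, filename)
    engine.runAndWait()
```

Calling save_speech('Hello World', 'PATH/TO/OUTPUT.wav', rate=120) would then produce slower speech than the default settings used here.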

Coqui TTS

Coqui TTS is the open-source repository of Coqui Studio. Developers can use Coqui's pretrained models or train custom voices. To synthesize speech, follow these steps:

  1. Install Coqui TTS:
pip install TTS
  2. List available models in Python:
from TTS.api import TTS

TTS().list_models()
  3. Choose a model name and save synthesized speech to a file:
tts = TTS("CHOSEN/MODEL/NAME")
tts.tts_to_file(text="Hello World", output_path="PATH/TO/OUTPUT.wav")

Coqui offers high-quality voices with natural prosody, at the cost of larger model sizes and longer processing times.
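The model list can be long, so it may help to narrow it down before choosing a name. The sketch below assumes the model names are available as plain strings of the form type/language/dataset/model, which is how Coqui names its pretrained models; depending on the installed TTS version, the object returned by list_models() may need to be converted to such a list first. The helper filter_models is hypothetical.

```python
def filter_models(model_names, model_type="tts_models", language="en"):
    """Keep only names that match the given model type and language code."""
    selected = []
    for name in model_names:
        parts = name.split("/")
        # Expected layout: type/language/dataset/model
        if len(parts) == 4 and parts[0] == model_type and parts[1] == language:
            selected.append(name)
    return selected


# A few example names, used here in place of the full list.
sample_names = [
    "tts_models/en/ljspeech/tacotron2-DDC",
    "tts_models/de/thorsten/tacotron2-DDC",
    "vocoder_models/en/ljspeech/hifigan_v2",
    "tts_models/multilingual/multi-dataset/xtts_v2",
]

english_tts = filter_models(sample_names)
print(english_tts)
```

Any of the remaining names can then be passed to TTS(...) as in step 3.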

Mimic3 from Mycroft

Mycroft is a free and open-source virtual assistant that offers a TTS system called Mimic3. Mimic3 currently lacks a pure Python API, so we will call its command-line tool through Python's subprocess module:

  1. Install Mimic3:
pip install mycroft-mimic3-tts
  2. Synthesize speech and save the file to the directory OUTPUT/DIR:
import subprocess

# Each list element is passed to mimic3 as one argument,
# so the text needs no extra shell-style quoting.
args = [
    "mimic3",
    "Hello World",
    "--output-dir", "OUTPUT/DIR",
]

try:
    subprocess.check_call(args)
except subprocess.CalledProcessError as e:
    # Handle error
    pass

For prototyping on-device TTS, Mimic3 from Mycroft provides a balance of quality and performance.
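Because Mimic3 is driven through its command-line tool, it can be convenient to wrap the call in a small function. The sketch below reuses only the mimic3 command and the --output-dir option shown above; the function names build_mimic3_args and synthesize_with_mimic3 are hypothetical, and returning a boolean is just one possible way to report failures.

```python
import shutil
import subprocess


def build_mimic3_args(text, output_dir):
    """Build the argument list for the mimic3 command-line tool."""
    # The text is a single list element, so spaces and quotes are passed as-is.
    return ["mimic3", text, "--output-dir", output_dir]


def synthesize_with_mimic3(text, output_dir):
    """Return True on success, False if mimic3 is missing or exits with an error."""
    if shutil.which("mimic3") is None:
        # The executable is not on PATH (for example, the package is not installed).
        return False
    try:
        subprocess.run(build_mimic3_args(text, output_dir), check=True)
    except subprocess.CalledProcessError:
        return False
    return True
```

A call such as synthesize_with_mimic3("Hello World", "OUTPUT/DIR") then behaves like the snippet above, but lets the caller decide what to do when synthesis fails.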

Orca Text-to-Speech

Picovoice Orca Text-to-Speech leverages state-of-the-art Text-to-Speech (TTS) models to provide high-quality voices, while still being small and efficient.

  1. Install the Orca Text-to-Speech Python SDK:
pip install pvorca
  2. Import Orca and create an Orca instance:
import pvorca
orca = pvorca.create(access_key="${ACCESS_KEY}")

Sign up or log in to the Picovoice Console to copy your access key, then replace ${ACCESS_KEY} with it.

  3. Synthesize your desired text:
orca.synthesize(text="${TEXT}")

For more information refer to the Orca Text-to-Speech Python SDK Documentation.
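To keep the synthesized audio, the returned samples can be written to a WAV file with Python's standard library. The sketch below rests on several assumptions that should be checked against the SDK documentation for the installed version: that synthesize returns 16-bit PCM samples (possibly as the first element of a tuple), that the instance exposes sample_rate, and that delete() releases its resources. The environment variable PICOVOICE_ACCESS_KEY and both function names are hypothetical.

```python
import struct
import wave


def write_wav(pcm, sample_rate, path):
    """Write 16-bit mono PCM samples to a WAV file."""
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(1)   # mono
        wav_file.setsampwidth(2)   # 2 bytes per sample (16-bit)
        wav_file.setframerate(sample_rate)
        # Pack the integers as little-endian signed 16-bit values.
        wav_file.writeframes(struct.pack(f"<{len(pcm)}h", *pcm))


def synthesize_to_wav(text, path):
    """Synthesize `text` with Orca and save it to `path` (assumptions noted above)."""
    import os
    import pvorca  # imported lazily so write_wav works without the SDK

    orca = pvorca.create(access_key=os.environ["PICOVOICE_ACCESS_KEY"])
    try:
        result = orca.synthesize(text=text)
        # Assumption: some SDK versions return (pcm, alignments), others only pcm.
        pcm = result[0] if isinstance(result, tuple) else result
        write_wav(pcm, orca.sample_rate, path)
    finally:
        orca.delete()
```

The write_wav helper is independent of Orca, so it can be reused with any engine that produces 16-bit mono PCM.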

Conclusion

On-device TTS removes privacy concerns and internet requirements, and minimizes latency. With Python solutions like PyTTSx3, Coqui TTS, and Mimic3, developers have several options for synthesizing speech directly on devices, depending on their needs. However, each solution comes with drawbacks, such as poor voice quality, large resource requirements, or the lack of a flexible API. Another alternative is Orca Text-to-Speech, which combines state-of-the-art neural TTS with efficiency, making it possible to synthesize high-quality speech even on a Raspberry Pi.
