Creating Dynamic Audio Narratives: A Guide to Combining Text-to-Speech and Music Using Python

Dmitry Romanoff - Sep 7 - Dev Community


Introduction

Creating engaging multimedia content is more accessible than ever. One interesting application is combining text-to-speech (TTS) with background music to produce dynamic audio narratives. In this article, we’ll walk through a Python script that does exactly that, using the pydub and gtts libraries to merge spoken text with music. The result is an MP3 file suitable for podcasts, audiobooks, or other multimedia projects.

To see a practical example of this technique in action, check out Storyteller4uuu, a YouTube channel that uses a similar approach to create captivating audio stories and narratives.

Getting Started

Before diving into the code, ensure you have the necessary Python libraries installed. You’ll need pydub for audio processing and gtts for converting text to speech. Install both with pip:

pip install pydub gtts

You’ll also need ffmpeg, which pydub relies on to read and write formats such as MP3. It is not installed by pip: download it from FFmpeg's official site and make sure it is accessible from your system PATH.
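
As a quick sanity check before running the script, the following minimal sketch looks for the ffmpeg executables on PATH. The helper name find_tool is hypothetical and not part of the original script; checking ffprobe as well is an assumption based on pydub using it to inspect input files.

```python
import shutil


def find_tool(name):
    """Return the full path of an executable found on PATH, or None if it is missing."""
    return shutil.which(name)


if __name__ == "__main__":
    # ffprobe normally ships together with ffmpeg
    for tool in ("ffmpeg", "ffprobe"):
        path = find_tool(tool)
        print(f"{tool}: {path if path else 'NOT FOUND on PATH'}")
```

If either tool is reported as missing, loading or exporting MP3 files with pydub is likely to fail until the PATH is fixed.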

Step 1: Converting Text to Speech

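The first part of our script reads text from a file and converts it to audio using Google Text-to-Speech (gtts). Because the speech is synthesized by Google's service, this step needs an internet connection. The text is split into sentences at periods, each sentence is synthesized separately, and silence is added between sentences, as well as at the start and end, to create natural pauses.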

from gtts import gTTS
from pydub import AudioSegment
import os

# All lowercase Russian letters; text is lowercased before the check
CYRILLIC_LETTERS = set('абвгдеёжзийклмнопрстуфхцчшщъыьэюя')

def detect_language(text):
    # Use Russian if the text contains any Cyrillic letter, otherwise English
    return 'ru' if any(c in CYRILLIC_LETTERS for c in text.lower()) else 'en'

def text_to_speech(input_file_path, output_file_path, silence_duration_ms=1000, start_silence_ms=2000, end_silence_ms=2000):
    try:
        # Read the text from the input file
        with open(input_file_path, 'r', encoding='utf-8') as file:
            text = file.read()

        # Split the text by periods and drop empty pieces
        sentences = [s.strip() for s in text.split('.') if s.strip()]

        # Create silence segments
        silence_segment = AudioSegment.silent(duration=silence_duration_ms)
        start_silence = AudioSegment.silent(duration=start_silence_ms)
        end_silence = AudioSegment.silent(duration=end_silence_ms)

        # Synthesize each sentence separately and join them with pauses
        final_audio = start_silence
        for i, sentence in enumerate(sentences):
            temp_segment_file = f'temp_segment_{i}.mp3'
            try:
                gTTS(sentence, lang=detect_language(sentence)).save(temp_segment_file)
                final_audio += AudioSegment.from_mp3(temp_segment_file)
            finally:
                # Remove the temporary file even if synthesis or loading fails
                if os.path.exists(temp_segment_file):
                    os.remove(temp_segment_file)
            # Add a pause after every sentence except the last one
            if i < len(sentences) - 1:
                final_audio += silence_segment

        # Add the closing silence and save the result
        final_audio += end_silence
        final_audio.export(output_file_path, format='mp3')
        print(f"Speech successfully saved to {output_file_path}")

    except Exception as e:
        print(f"An error occurred: {e}")
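
The function above splits on periods only, so sentences ending in "!" or "?" are not separated by a pause. One possible refinement, shown here as a sketch rather than as part of the original script, is a regex-based splitter plus a language guess based on the Cyrillic Unicode block. The names split_sentences and guess_language are hypothetical. The splitter is naive and will also break after abbreviations such as "Dr.", and the language guess labels any Cyrillic text (including Ukrainian or Bulgarian) as Russian.

```python
import re

# Split at whitespace that follows ., !, ? or the ellipsis character
_SENTENCE_END = re.compile(r"(?<=[.!?…])\s+")


def split_sentences(text):
    """Split text into sentences, keeping the closing punctuation."""
    return [part for part in _SENTENCE_END.split(text.strip()) if part]


def guess_language(text):
    """Return 'ru' if the text contains any Cyrillic character, else 'en'."""
    # U+0400..U+04FF is the basic Cyrillic block and covers both cases
    return "ru" if any("\u0400" <= ch <= "\u04ff" for ch in text) else "en"
```

Each item returned by split_sentences could then be passed to gTTS in place of the pieces produced by text.split('.').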


Step 2: Mixing Speech with Background Music

Once the speech audio is prepared, we mix it with background music. The pydub library lets us lower the music volume, loop or trim the music to match the speech length, fade it out, and overlay the two tracks.

from pydub import AudioSegment
import os

def mix_audio_with_music(speech_file_path, music_file_path, output_file_path, volume_reduction_percent=50, fade_duration_ms=3000):
    try:
        # Load the speech and music files
        speech = AudioSegment.from_mp3(speech_file_path)
        music = AudioSegment.from_mp3(music_file_path)

        # Lower the music volume. Subtracting a number from an AudioSegment
        # reduces its gain in decibels, so volume_reduction_percent=50
        # means a 5 dB cut and 100 means a 10 dB cut
        music = music - (10 * volume_reduction_percent / 100.0)

        # Loop the music if it is shorter than the speech
        if len(music) < len(speech):
            repeats = len(speech) // len(music) + 1
            music = music * repeats

        # Trim the music to the speech length and fade it out at the end
        music = music[:len(speech)].fade_out(fade_duration_ms)

        # Mix the audio files
        mixed_audio = speech.overlay(music)
        mixed_audio.export(output_file_path, format='mp3')

        print(f"Mixed audio saved to {output_file_path}")

    except Exception as e:
        print(f"An error occurred: {e}")
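
Note that the function maps volume_reduction_percent linearly onto decibels (50 becomes a 5 dB cut), which is not the same as halving the amplitude. If you would rather have the percentage describe the amplitude, one option is to convert it to decibels first. The sketch below uses the standard amplitude-to-dB formula; the helper name percent_to_db is hypothetical and not part of the original script.

```python
import math


def percent_to_db(volume_reduction_percent):
    """Return the gain change in dB for an amplitude reduction given in percent."""
    if not 0 <= volume_reduction_percent < 100:
        raise ValueError("volume_reduction_percent must be in the range [0, 100)")
    # Fraction of the original amplitude that remains after the reduction
    remaining = 1.0 - volume_reduction_percent / 100.0
    # Amplitude ratio to decibels: 20 * log10(ratio); the result is <= 0
    return 20.0 * math.log10(remaining)
```

With this helper, the volume line could become music = music + percent_to_db(volume_reduction_percent), since adding a number to an AudioSegment applies a gain in dB. A 50% reduction then corresponds to roughly a 6 dB cut.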

Putting It All Together

Finally, in the main section of your script, define the paths for the input text file, output speech file, and background music. The script then performs the TTS conversion and mixes the resulting speech with the selected music.

if __name__ == "__main__":
    # Define file paths
    input_file = 'stories/story_2.txt'
    output_file = 'output/converted_speech.mp3'

    # Create output directory if it doesn't exist
    os.makedirs(os.path.dirname(output_file), exist_ok=True)

    # Convert text to speech
    text_to_speech(input_file, output_file, silence_duration_ms=2000, start_silence_ms=3000, end_silence_ms=3000)

    speech_file = 'output/converted_speech.mp3'
    music_file = 'music/FIVE_OF_A_KIND_Density_Time.mp3'
    final_output_file = 'output/converted_speech_with_music.mp3'

    # Mix the audio with music
    mix_audio_with_music(speech_file, music_file, final_output_file, volume_reduction_percent=60, fade_duration_ms=3000)
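
Both functions catch exceptions and only print a message, so the main section will still attempt the mixing step even if the speech file was never created. A small pre-flight check can make such failures easier to spot. This is a sketch under that assumption; the helper name missing_files is hypothetical.

```python
import os


def missing_files(paths):
    """Return the paths from the given list that do not exist as regular files."""
    return [p for p in paths if not os.path.isfile(p)]
```

For example, calling missing_files([input_file, music_file]) before the conversion, and missing_files([speech_file]) before the mixing step, lets the script stop early with a clear message when the returned list is not empty.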

Example in Action

To see a practical example of combining TTS and music, check out Storyteller4uuu on YouTube. This channel effectively uses similar techniques to create engaging and immersive audio stories, demonstrating the potential of this approach.

Conclusion

This Python script demonstrates how to combine text-to-speech and background music to create professional-sounding audio content. By leveraging libraries like pydub and gtts, you can automate the process of generating engaging audio for various applications. Experiment with different parameters and files to tailor the results to your specific needs and enjoy the creative possibilities of multimedia content creation!
