Video captioning is crucial for making content accessible and understandable across different languages.

By combining transcription, translation, and the creation of subtitle files (SRT), you can offer a smooth experience for users to consume video content.

In this guide, I'll show you how to build a working video captioning and translation tool using Python and Streamlit.

We'll go through the process step-by-step, write the code, and understand each implementation part.

The result will be a functional web app where users can upload a video and automatically get captions in multiple languages.

Introduction to Video Captioning and Translating

Adding captions to videos helps make them more accessible for people who are deaf or hard of hearing.

It also helps non-native speakers understand the content better.

Translating these captions into multiple languages can make your video content reach a global audience.

We’ll use several powerful libraries:

Streamlit: For creating the web interface.
MoviePy: To handle video and audio extraction.
Faster Whisper: For speech-to-text transcription on the CPU.
Translate: To handle language translations.

You can get the complete source code at: Get Source Code

Setting Up the Environment

Before you start coding, make sure your environment is properly set up.

Here are the steps you need to follow to get everything ready.

Install the Required Libraries

You'll need to install several Python libraries. These include streamlit, moviepy, faster-whisper, and translate. You can install these using pip.

pip install streamlit moviepy faster-whisper translate

With these libraries installed, you can move on to the next steps.
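Before moving on, it can help to confirm that the four packages are importable from the interpreter you plan to run the app with. The following is a minimal sketch using only the standard library; the helper name missing_modules is hypothetical and not part of any of the libraries above.

```python
import importlib.util

# Import names of the four packages installed above. Note that the
# faster-whisper distribution is imported as "faster_whisper".
REQUIRED_MODULES = ["streamlit", "moviepy", "faster_whisper", "translate"]


def missing_modules(names):
    """Return the module names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(REQUIRED_MODULES)
    if missing:
        print("Missing modules:", ", ".join(missing))
    else:
        print("All required modules are installed.")
```

If anything is reported as missing, rerun the pip command inside the same virtual environment that Streamlit uses.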

Building the Video Captioning and Translating Tool

In this section, we'll build the video captioning and translating tool step by step. We'll break down the code into segments for better understanding.

Importing the Libraries

First, import the necessary libraries:

import streamlit as st
import datetime
from faster_whisper import WhisperModel
from moviepy.editor import VideoFileClip
from translate import Translator

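These libraries work together to extract, transcribe, and translate the video's speech, and ultimately to generate captions in several languages. Note that the moviepy.editor import path belongs to MoviePy 1.x; MoviePy 2.x exposes VideoFileClip from the top-level moviepy package instead.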

Extracting Audio from Video

To process the video for captioning, we first need to extract the audio. MoviePy is an excellent tool for this.

Here's how you can do it:

# Extract audio from video with MoviePy
def extract_audio(video_path, audio_path):
    # Load the video file
    video = VideoFileClip(video_path)
    try:
        # Extract the audio track (None if the video has no audio)
        audio = video.audio
        if audio is None:
            raise ValueError("The video has no audio track")
        # Save the audio to the output path
        audio.write_audiofile(audio_path)
    finally:
        # Close the clip, which also releases its audio reader
        video.close()

The extract_audio function performs the following steps:

Loads a video file from the specified video_path.
Extracts the audio track from the video.
Saves the extracted audio to the specified audio_path.
Closes the audio file to release resources.

Transcribing Audio to Text

The next step involves transcribing the extracted audio to text. For this, we use the Whisper model:

# Set up the Whisper model
model_size = "medium.en"
model = WhisperModel(model_size, device="cpu", compute_type="int8")


# Transcribe an audio file
def transcribe_from_video(audio_path):
    # transcribe() returns the segments plus an info object we don't need here
    segments, _ = model.transcribe(audio_path)
    # Return the segments
    return segments

This code does the following:

Sets up the Whisper model with a medium-sized English model, configured to run on the CPU with int8 computation type.
Defines a function transcribe_from_video that transcribes an audio file specified by audio_path using the initialized Whisper model.
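Returns the transcription segments from the audio file. faster-whisper yields them lazily as a generator rather than as a list, so the transcription work runs while the segments are iterated.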
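Because the segments come back as a generator, they can be read only once. The sketch below illustrates that behavior with a hypothetical stand-in (the Segment class and fake_transcribe function are assumptions for illustration, not part of faster-whisper), so it runs without downloading a model.

```python
from dataclasses import dataclass
from typing import Iterator, List


@dataclass
class Segment:
    """Hypothetical stand-in with the three fields this guide reads."""
    start: float
    end: float
    text: str


def fake_transcribe() -> Iterator[Segment]:
    """Stand-in for model.transcribe(): yields segments one at a time."""
    yield Segment(0.0, 2.5, " Hello and welcome.")
    yield Segment(2.5, 5.0, " Let's get started.")


segments = fake_transcribe()
first_pass: List[Segment] = list(segments)   # consumes the generator
second_pass: List[Segment] = list(segments)  # nothing left to read

print(len(first_pass), len(second_pass))  # 2 0
```

If you need the segments more than once, for example to write captions in several languages, wrapping the result in list() once and reusing that list is probably the simplest option.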

Function to Format Time for SRT

The SubRip Subtitle (SRT) format uses a specific timestamp format. We need a utility function to convert seconds into this format:

# Function to convert seconds to SRT timestamp format
def format_time(seconds):
    timestamp = str(datetime.timedelta(seconds=seconds))
    # Check if there is a fractional part in the seconds
    if '.' in timestamp:
        hours, minutes, seconds = timestamp.split(':')
        seconds, milliseconds = seconds.split('.')
        # Truncate the milliseconds to 3 decimal places
        milliseconds = milliseconds[:3]
    else:
        hours, minutes, seconds = timestamp.split(':')
        milliseconds = "000"
    # Return the formatted timestamp
    return f"{hours.zfill(2)}:{minutes.zfill(2)}:{seconds.zfill(2)},{milliseconds.zfill(3)}"

The format_time function performs the following steps:

Converts a time duration in seconds to a string representation of a timedelta object.
Checks whether the seconds have a fractional part and splits the timestamp accordingly.
Formats the timestamp into the SRT format hh:mm:ss,mmm, where mmm is a three-digit millisecond value.
Returns the formatted timestamp string.
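As a point of comparison, the same conversion can be sketched with integer arithmetic instead of string splitting. This is an alternative under stated assumptions, not the function used in this guide: the name srt_timestamp is hypothetical, it rounds to the nearest millisecond rather than truncating, and it keeps counting hours past 24, whereas str(timedelta) switches to a "1 day, ..." form for durations of a day or more.

```python
def srt_timestamp(seconds: float) -> str:
    """Convert seconds to an SRT timestamp (HH:MM:SS,mmm) with integer math."""
    total_ms = int(round(seconds * 1000))
    hours, remainder = divmod(total_ms, 3_600_000)
    minutes, remainder = divmod(remainder, 60_000)
    secs, millis = divmod(remainder, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"


print(srt_timestamp(0))        # 00:00:00,000
print(srt_timestamp(3.5))      # 00:00:03,500
print(srt_timestamp(3661.25))  # 01:01:01,250
print(srt_timestamp(90000))    # 25:00:00,000
```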

Function to Generate the SRT File

Next, let's create a function to generate the SRT file from the transcription data:

# Function to generate SRT file from transcription data
def generate_srt(transcription_data, output_file, lang):
    # Create the translator once, only if the target language is not English
    translator = Translator(from_lang="en", to_lang=lang) if lang != "en" else None

    # Open the output file
    with open(output_file, 'w', encoding="UTF-8") as srt_file:
        # Iterate over the transcription data
        for i, segment in enumerate(transcription_data):
            # Format the start and end times
            start_time = format_time(segment.start)
            end_time = format_time(segment.end)

            # Whisper segments usually start with a space, so strip the text
            text = segment.text.strip()
            # Translate the text to the target language if needed
            if translator is not None:
                text = translator.translate(text)

            # Write the segment data to the SRT file
            srt_file.write(f"{i + 1}\n")
            srt_file.write(f"{start_time} --> {end_time}\n")
            srt_file.write(f"{text}\n\n")

The generate_srt function performs the following steps:

Opens the output file in write mode with UTF-8 encoding.
Iterates over the transcription data, formatting the start and end times for each segment.
Translates the text of each segment if the target language is not English.
Writes the segment data to the SRT file in the required format.
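To make the "required format" concrete, here is a small sketch that builds the SRT text in memory for two made-up segments and prints it. The Segment tuple, to_timestamp, and build_srt are hypothetical names used only for this illustration; translation is left out so the example runs offline.

```python
from collections import namedtuple

# Hypothetical stand-in for the segment objects (start, end, text).
Segment = namedtuple("Segment", ["start", "end", "text"])


def to_timestamp(seconds):
    """Compact HH:MM:SS,mmm formatter used only for this illustration."""
    total_ms = int(round(seconds * 1000))
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, millis = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"


def build_srt(segments):
    """Return the SRT text for the given segments as a single string."""
    blocks = []
    for index, segment in enumerate(segments, start=1):
        blocks.append(
            f"{index}\n"
            f"{to_timestamp(segment.start)} --> {to_timestamp(segment.end)}\n"
            f"{segment.text.strip()}\n"
        )
    # A blank line separates consecutive subtitle blocks
    return "\n".join(blocks)


sample = [
    Segment(0.0, 2.5, " Hello and welcome."),
    Segment(2.5, 5.0, " Let's get started."),
]
print(build_srt(sample))
```

Each block consists of a running index, a line with the start and end timestamps joined by "-->", the caption text, and a blank line before the next block.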

Building the Streamlit Interface

Now, let's build the interface using Streamlit:

# Set the page title in Streamlit
st.title("Video Captioning and Translating")

# Set the page description in Streamlit
st.write("Upload a video and get the captions and translations.")

# Select the language for translation
language = st.selectbox("Select the target language", ["en", "es", "fr", "de", "it", "pt"])

# Upload a video file
video_file = st.file_uploader("Upload a video file", type=["mp4", "mov", "avi"])

This code sets up a Streamlit interface with the following components:

A page title set to "Video Captioning and Translating".
A page description that informs users about uploading a video to get captions and translations.
A dropdown menu for selecting the target language for translation, with options including English, Spanish, French, German, Italian, and Portuguese.
A file uploader widget for uploading video files, supporting mp4, mov, and avi formats.
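The dropdown shows raw language codes such as "es". If you would rather show readable names, one option is a small lookup passed to the selectbox through its format_func parameter. The sketch below covers only the lookup; LANGUAGE_NAMES and language_label are hypothetical names, and the Streamlit call is shown as a comment so the snippet runs on its own.

```python
# Display names for the language codes offered in the dropdown.
LANGUAGE_NAMES = {
    "en": "English",
    "es": "Spanish",
    "fr": "French",
    "de": "German",
    "it": "Italian",
    "pt": "Portuguese",
}


def language_label(code):
    """Return a readable name for a language code, or the code itself."""
    return LANGUAGE_NAMES.get(code, code)


# In the app, the helper could be passed to the dropdown like this:
# language = st.selectbox(
#     "Select the target language",
#     list(LANGUAGE_NAMES),
#     format_func=language_label,
# )
print([language_label(code) for code in LANGUAGE_NAMES])
```

The selectbox still returns the code ("es"), so the rest of the script does not need to change.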

Processing the Uploaded Video

Once the user uploads a video and selects a language, we process the video:

# Check if the video file is uploaded and the language is selected
if video_file and language:
    # Display progress message
    with st.status("Processing the video...", expanded=True):
        # Save the video file temporarily
        st.write("Uploading the video...")
        with open("output.mp4", "wb") as f:
            f.write(video_file.read())

        # Extract the audio from the video
        st.write("Extracting audio from the video...")
        extract_audio("output.mp4", "output.wav")

        # Transcribe the audio file
        st.write("Transcribing the audio...")
        text_segments = transcribe_from_video("output.wav")

        # Generate the SRT file
        st.write("Generating the SRT file...")
        generate_srt(text_segments, "output.srt", language)

    # Display the video file
    st.video(video_file, subtitles="output.srt")

This code performs the following steps:

Checks whether both the video file and the target language are provided.
Displays a progress message indicating that the video is being processed.
Saves the uploaded video file temporarily.
Extracts the audio from the video and saves it as a separate file.
Transcribes the extracted audio file and returns the transcription segments.
Generates the SRT file from the transcription data, translating the text to the selected language if necessary.
Displays the video file with the generated subtitles in the Streamlit interface.
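One thing to keep in mind is that the fixed names output.mp4, output.wav, and output.srt are shared by every session, so two users processing videos at the same time could overwrite each other's files. A possible way around this, sketched here with the standard library only, is to give each run its own temporary directory. The save_upload helper and the file names inside it are assumptions for illustration; in the app, the bytes could come from video_file.getvalue().

```python
import os
import tempfile


def save_upload(data: bytes, workdir: str, suffix: str = ".mp4") -> dict:
    """Write uploaded bytes into workdir and return the paths for each stage."""
    video_path = os.path.join(workdir, "input" + suffix)
    with open(video_path, "wb") as f:
        f.write(data)
    return {
        "video": video_path,
        "audio": os.path.join(workdir, "audio.wav"),
        "srt": os.path.join(workdir, "captions.srt"),
    }


# The directory and everything in it are removed when the block exits.
with tempfile.TemporaryDirectory() as workdir:
    paths = save_upload(b"fake video bytes", workdir)
    print(os.path.exists(paths["video"]))  # True
```

Because the directory is deleted when the with block ends, any file that has to outlive it (such as the SRT passed to st.video) would need to be read into memory or copied elsewhere first.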

Running the Application

You can run the application locally with the following command:

streamlit run app.py

Streamlit will serve the application on a local web server and open it in a web browser, allowing users to interact with the application as defined in the script.

If the page doesn't open automatically, open http://localhost:8501 in your browser.

Testing the Application

Let's look at an example of the application in action:

The video used in the example is: