Video captioning is crucial for making content accessible and understandable across different languages.
By combining transcription, translation, and the creation of subtitle files (SRT), you can offer a smooth experience for users to consume video content.
In this guide, I'll show you how to build a video captioning and translation tool using Python and Streamlit.
We'll go through the process step by step, write the code, and look at each part of the implementation.
The result will be a functional web app where users can upload a video and automatically get captions in multiple languages.
Introduction to Video Captioning and Translating
Adding captions to videos helps make them more accessible for people who are deaf or hard of hearing.
It also helps non-native speakers understand the content better.
Translating these captions into multiple languages can make your video content reach a global audience.
We’ll use several powerful libraries:
- Streamlit: For creating the web interface.
- MoviePy: To handle video and audio extraction.
- Faster Whisper: For speech-to-text transcription on the CPU.
- Translate: To handle language translations.
You can get the complete source code at: Get Source Code
Setting Up the Environment
Before you start coding, make sure your environment is properly set up.
Here are the steps you need to follow to get everything ready.
Install the Required Libraries
You'll need to install several Python libraries. These include streamlit, moviepy, faster-whisper, and translate. You can install these using pip.
pip install streamlit moviepy faster-whisper translate
With these libraries installed, you can move on to the next steps.
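Before moving on, it can help to confirm that the packages are importable from the interpreter you plan to use. The snippet below is a minimal sketch using only the standard library; the helper name missing_modules is my own, and the list holds import names (for example faster_whisper), which differ from some of the pip package names.

```python
from importlib.util import find_spec

# Import names (not pip names) of the packages used in this guide.
REQUIRED_MODULES = ["streamlit", "moviepy", "faster_whisper", "translate"]


def missing_modules(names):
    """Return the top-level module names that cannot be found (hypothetical helper)."""
    return [name for name in names if find_spec(name) is None]


missing = missing_modules(REQUIRED_MODULES)
print("Missing modules:", ", ".join(missing) if missing else "none")
```

If anything is reported as missing, rerun the pip command inside the same virtual environment that runs the script.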
Building the Video Captioning and Translating Tool
In this section, we'll build the video captioning and translating tool step by step. We'll break down the code into segments for better understanding.
Importing the Libraries
First, import the necessary libraries:
import streamlit as st
import datetime
from faster_whisper import WhisperModel
from moviepy.editor import VideoFileClip
from translate import Translator
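These libraries work together to extract, transcribe, and translate video content, and ultimately to generate captions in several languages.
Note that the moviepy.editor import path applies to MoviePy 1.x. MoviePy 2.x removed that module, so with a 2.x install the import becomes from moviepy import VideoFileClip.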
Extracting Audio from Video
To process the video for captioning, we first need to extract the audio. MoviePy is an excellent tool for this.
Here's how you can do it:
# Extract audio from video with MoviePy
def extract_audio(video_path, audio_path):
    # Load the video file
    video = VideoFileClip(video_path)
    # Extract the audio from the video
    audio = video.audio
    # Save the audio to the output path
    audio.write_audiofile(audio_path)
    # Close the audio and video clips to release resources
    audio.close()
    video.close()
The extract_audio function performs the following steps:
- Loads a video file from the specified video_path.
- Extracts the audio track from the video.
- Saves the extracted audio to the specified audio_path.
- Closes the audio and video clips to release resources.
Transcribing Audio to Text
The next step involves transcribing the extracted audio to text. For this, we use the Whisper model:
# Set up the Whisper model
model_size = "medium.en"
model = WhisperModel(model_size, device="cpu", compute_type="int8")
# Transcribe an audio file
def transcribe_from_video(audio_path):
    segments, _ = model.transcribe(audio_path)
    # Return the segments
    return segments
This code does the following:
- Sets up the Whisper model with a medium-sized, English-only model (medium.en), configured to run on the CPU with the int8 computation type.
- Defines a function transcribe_from_video that transcribes an audio file specified by audio_path using the initialized Whisper model.
- Returns the transcription segments from the audio file. Faster Whisper returns them as a generator, so the transcription itself only runs as the segments are iterated.
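One practical consequence is worth a quick illustration: the segments come back from Faster Whisper as a generator, so they can be consumed only once. The sketch below does not call the real model. It uses a hypothetical stand-in generator (fake_transcribe) and a hypothetical Segment tuple to show the behavior, and why you might wrap the result in list() if you want to reuse it, for example to write captions in several languages.

```python
from collections import namedtuple

# Hypothetical stand-in for a transcription segment (start, end, text).
Segment = namedtuple("Segment", ["start", "end", "text"])


def fake_transcribe():
    """Stand-in generator that yields segments lazily, one at a time."""
    yield Segment(0.0, 2.5, " Hello and welcome.")
    yield Segment(2.5, 5.0, " Let's get started.")


segments = fake_transcribe()
first_pass = [s.text.strip() for s in segments]   # consumes the generator
second_pass = [s.text.strip() for s in segments]  # nothing left to iterate

# Materializing the generator once makes the segments reusable.
segment_list = list(fake_transcribe())
```

In the app, the same idea would look like segments = list(transcribe_from_video("output.wav")), at the cost of holding all segments in memory.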
Function to Format Time for SRT
The SubRip Subtitle (SRT) format uses a specific timestamp format. We need a utility function to convert seconds into this format:
# Function to convert seconds to SRT timestamp format
def format_time(seconds):
    timestamp = str(datetime.timedelta(seconds=seconds))
    hours, minutes, secs = timestamp.split(':')
    # Check if there is a fractional part in the seconds
    if '.' in secs:
        secs, milliseconds = secs.split('.')
        # Truncate the microseconds to 3 digits (milliseconds)
        milliseconds = milliseconds[:3]
    else:
        milliseconds = "000"
    # Return the formatted timestamp
    return f"{hours.zfill(2)}:{minutes.zfill(2)}:{secs.zfill(2)},{milliseconds}"
The format_time function performs the following steps:
- Converts a time duration in seconds to the string representation of a timedelta object.
- Splits the string into hours, minutes, and seconds, and checks whether the seconds have a fractional part.
- Formats the timestamp into the SRT format hh:mm:ss,mmm, with milliseconds truncated to three digits.
- Returns the formatted timestamp string.
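As an alternative, the same conversion can be done with plain integer arithmetic instead of parsing a timedelta string. This is a sketch, not the version used in the rest of the guide: the name format_time_ms is my own, it rounds to the nearest millisecond rather than truncating, and it keeps counting hours past 24 (the string form of a timedelta switches to "1 day, ..." at that point, which the split on ':' does not expect).

```python
def format_time_ms(seconds):
    """Convert seconds (int or float) to an SRT timestamp, hh:mm:ss,mmm."""
    # Work in whole milliseconds to avoid string parsing
    total_ms = int(round(seconds * 1000))
    hours, remainder = divmod(total_ms, 3_600_000)
    minutes, remainder = divmod(remainder, 60_000)
    secs, millis = divmod(remainder, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"


print(format_time_ms(0))       # 00:00:00,000
print(format_time_ms(3661.5))  # 01:01:01,500
print(format_time_ms(90000))   # 25:00:00,000
```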
Function to Generate the SRT File
Next, let's create a function to generate the SRT file from the transcription data:
# Function to generate SRT file from transcription data
def generate_srt(transcription_data, output_file, lang):
    # Create the translator once if the target language is not English
    translator = Translator(from_lang="en", to_lang=lang) if lang != "en" else None
    # Open the output file
    with open(output_file, 'w', encoding="UTF-8") as srt_file:
        # Iterate over the transcription data
        for i, segment in enumerate(transcription_data):
            # Format the start and end times
            start_time = format_time(segment.start)
            end_time = format_time(segment.end)
            # Whisper segment text usually starts with a space, so strip it
            text = segment.text.strip()
            # Translate the text to the target language if it is not English
            if translator is not None:
                text = translator.translate(text)
            # Write the segment data to the SRT file
            srt_file.write(f"{i + 1}\n")
            srt_file.write(f"{start_time} --> {end_time}\n")
            srt_file.write(f"{text}\n\n")
The generate_srt function performs the following steps:
- Opens the output file in write mode with UTF-8 encoding.
- Iterates over the transcription data, formatting the start and end times for each segment.
- Translates the text of each segment if the target language is not English.
- Writes the segment data to the SRT file in the required format.
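To make the required format concrete, here is a small self-contained sketch that writes two made-up segments to an in-memory buffer. Everything in it is illustrative: Segment, srt_timestamp, and write_srt are hypothetical names, and the optional translate argument is just any callable that maps text to text, which makes the layout easy to check without calling a translation service.

```python
import io
from collections import namedtuple

# Hypothetical stand-in for a transcription segment.
Segment = namedtuple("Segment", ["start", "end", "text"])


def srt_timestamp(seconds):
    """Format seconds as hh:mm:ss,mmm."""
    total_ms = int(round(seconds * 1000))
    hours, remainder = divmod(total_ms, 3_600_000)
    minutes, remainder = divmod(remainder, 60_000)
    secs, millis = divmod(remainder, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"


def write_srt(segments, stream, translate=None):
    """Write numbered SRT blocks, each followed by a blank line."""
    for index, segment in enumerate(segments, start=1):
        text = segment.text.strip()
        if translate is not None:
            text = translate(text)
        stream.write(f"{index}\n")
        stream.write(f"{srt_timestamp(segment.start)} --> {srt_timestamp(segment.end)}\n")
        stream.write(f"{text}\n\n")


buffer = io.StringIO()
write_srt([Segment(0.0, 2.5, " Hello."), Segment(2.5, 4.0, " Goodbye.")], buffer)
srt_text = buffer.getvalue()
print(srt_text)
```

Each block consists of a sequence number, a start --> end line, the caption text, and a blank line separating it from the next block.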
Building the Streamlit Interface
Now, let's build the interface using Streamlit:
# Set the page title in Streamlit
st.title("Video Captioning and Translating")
# Set the page description in Streamlit
st.write("Upload a video and get the captions and translations.")
# Select the language for translation
language = st.selectbox("Select the target language", ["en", "es", "fr", "de", "it", "pt"])
# Upload a video file
video_file = st.file_uploader("Upload a video file", type=["mp4", "mov", "avi"])
This code sets up a Streamlit interface with the following components:
- A page title set to "Video Captioning and Translating".
- A page description that informs users about uploading a video to get captions and translations.
- A dropdown menu for selecting the target language for translation, with options including English, Spanish, French, German, Italian, and Portuguese.
- A file uploader widget for uploading video files, supporting mp4, mov, and avi formats.
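The dropdown above shows raw language codes such as "es". If you would rather show readable names, one option is a small lookup table plus a label function. The sketch below is an optional variation, not part of the original app: LANGUAGE_NAMES and language_label are names I made up, and the selectbox call is left as a comment so the snippet runs without Streamlit.

```python
# Hypothetical mapping from the language codes used in the app to display names.
LANGUAGE_NAMES = {
    "en": "English",
    "es": "Spanish",
    "fr": "French",
    "de": "German",
    "it": "Italian",
    "pt": "Portuguese",
}


def language_label(code):
    """Return a readable label for a language code, falling back to the code itself."""
    return f"{LANGUAGE_NAMES.get(code, code)} ({code})"


# In the Streamlit script, this could be wired in through format_func:
# language = st.selectbox(
#     "Select the target language",
#     list(LANGUAGE_NAMES),
#     format_func=language_label,
# )
print([language_label(code) for code in LANGUAGE_NAMES])
```

With format_func, the widget still returns the code ("es"), so the rest of the script does not need to change.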
Processing the Uploaded Video
Once the user uploads a video and selects a language, we process the video:
# Check if the video file is uploaded and the language is selected
if video_file and language:
    # Display progress message
    with st.status("Processing the video...", expanded=True):
        # Save the video file temporarily
        st.write("Uploading the video...")
        with open("output.mp4", "wb") as f:
            # getvalue() returns the full contents regardless of the read position
            f.write(video_file.getvalue())
        # Extract the audio from the video
        st.write("Extracting audio from the video...")
        extract_audio("output.mp4", "output.wav")
        # Transcribe the audio file
        st.write("Transcribing the audio...")
        text_segments = transcribe_from_video("output.wav")
        # Generate the SRT file
        st.write("Generating the SRT file...")
        generate_srt(text_segments, "output.srt", language)
    # Display the video file
    st.video(video_file, subtitles="output.srt")
This code performs the following steps:
- Checks whether both the video file and the target language are provided.
- Displays a progress message indicating that the video is being processed.
- Saves the uploaded video file temporarily.
- Extracts the audio from the video and saves it as a separate file.
- Transcribes the extracted audio file and returns the transcription segments.
- Generates the SRT file from the transcription data, translating the text to the selected language if necessary.
- Displays the video file with the generated subtitles in the Streamlit interface.
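One thing to keep in mind is that the fixed file names (output.mp4, output.wav, output.srt) are shared by every session, so two users processing videos at the same time could overwrite each other's files. A possible variation, sketched below with the standard library only, is to give each run its own temporary directory. The helper save_upload and the file names inside it are hypothetical, and the placeholder bytes stand in for the uploaded video's contents.

```python
import tempfile
from pathlib import Path


def save_upload(data, workdir, suffix=".mp4"):
    """Write uploaded bytes into workdir and return the paths for each stage."""
    workdir = Path(workdir)
    video_path = workdir / f"input{suffix}"
    video_path.write_bytes(data)
    return {
        "video": video_path,
        "audio": workdir / "audio.wav",
        "srt": workdir / "captions.srt",
    }


# Everything inside the directory is removed when the block exits.
with tempfile.TemporaryDirectory() as tmp:
    paths = save_upload(b"placeholder bytes", tmp)
    saved_size = paths["video"].stat().st_size
    existed_inside = paths["video"].exists()
existed_after = paths["video"].exists()
```

In the app, the extraction, transcription, and SRT steps would run inside the with block. Because the directory is deleted on exit, the subtitles would need to be read into memory (or the video displayed) before leaving it.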
Running the Application
You can run the application locally with the following command:
streamlit run app.py
Streamlit will serve the application on a local web server and open it in a web browser, allowing users to interact with the application as defined in the script.
You can access it by navigating on the browser to http://localhost:8501 (if the web page doesn't open automatically).
Testing the Application
Let's see an example of the application in action:
The video used in the example is:
Conclusion
In this article, we explored how to build a video captioning and translating tool using Python and Streamlit.
This tool can be incredibly useful for content creators, educators, and anyone looking to make their video content more accessible and widely understood.
It can be expanded and adapted for even more sophisticated applications, such as live captioning, more advanced translation options, or integrations with other video processing libraries.