Adding Speaker Diarization to OpenAI Whisper using Picovoice Falcon

grace · August 22, 2024 · Dev Community

OpenAI Whisper Speech-to-Text is a locally executable speech recognition model that comes in various sizes, allowing users to choose a model that suits their device's specifications. Unfortunately, Whisper lacks speaker diarization, a crucial feature for applications that require speaker identification (e.g. discerning speakers in a meeting scenario).

This article guides you through the process of integrating Picovoice Falcon Speaker Diarization with OpenAI Whisper in Python. Adding speaker diarization will result in a more user-friendly, dialogue-style transcription.

Setup
Start by installing the necessary Python packages (note that Whisper also requires ffmpeg to be installed on your system):

pip3 install -U openai-whisper
pip3 install -U pvfalcon

Both Falcon Speaker Diarization and Whisper Speech-to-Text run on CPU and do not require a GPU. Whisper can be slow on CPU, however, so running it on a GPU improves its runtime.
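
For instance, assuming PyTorch with CUDA support is installed and a GPU is available, Whisper can be loaded onto it explicitly (the "base" model name below is only an illustrative choice):

import torch
import whisper

# Use the GPU when one is available; otherwise fall back to the CPU.
device = "cuda" if torch.cuda.is_available() else "cpu"
model = whisper.load_model("base", device=device)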

Speech Recognition with Whisper
Let's begin by utilizing Whisper for speech recognition. The code snippet below demonstrates how to transcribe speech using Whisper:

import whisper

model = whisper.load_model(${WHISPER_MODEL})
result = model.transcribe(${AUDIO_FILE_PATH})
transcript_segments = result["segments"]

Here, ${WHISPER_MODEL} refers to one of the available Whisper models, and ${AUDIO_FILE_PATH} is the path to the audio file. Since our goal is a dialogue-style transcription, we'll focus on extracting segments from the result, each representing a part of the transcript with its corresponding timestamps.
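
Each segment is a dictionary that includes, among other fields, the start and end times (in seconds) and the recognized text, which can be inspected directly:

for segment in transcript_segments:
    # Each Whisper segment carries its own timing information and text.
    print(f"[{segment['start']:.1f}s - {segment['end']:.1f}s] {segment['text']}")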

Speaker Diarization with Falcon
Next, let's perform speaker diarization using Falcon. The following code snippet illustrates how to apply Falcon for this purpose:

import pvfalcon

falcon = pvfalcon.create(access_key=${ACCESS_KEY})
speaker_segments = falcon.process_file(${AUDIO_FILE_PATH})

Here, ${ACCESS_KEY} is your access key obtained from the Picovoice Console. The process_file method returns a list of speaker segments, similar to Whisper's segments but with a speaker_tag field identifying the speaker.
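
For example, the returned segments can be printed to see who spoke when:

for segment in speaker_segments:
    # Each Falcon segment has a speaker tag plus start and end times in seconds.
    print(f"Speaker {segment.speaker_tag}: {segment.start_sec:.1f}s - {segment.end_sec:.1f}s")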

Integrating Whisper and Falcon Speaker Diarization
By combining OpenAI Whisper for speech recognition and Picovoice Falcon Speaker Diarization for speaker diarization, we aim to create a dialogue-style transcription. To achieve this, we'll define a simple score to measure the overlap between Whisper and Falcon Speaker Diarization segments. The following code snippet demonstrates how to calculate this score:

def segment_score(transcript_segment, speaker_segment):
    # Whisper segments are dictionaries with "start"/"end" timestamps in seconds.
    transcript_segment_start = transcript_segment["start"]
    transcript_segment_end = transcript_segment["end"]
    # Falcon segments expose start_sec/end_sec attributes.
    speaker_segment_start = speaker_segment.start_sec
    speaker_segment_end = speaker_segment.end_sec

    # Duration shared by the two segments (0 if they do not overlap at all).
    overlap = max(0.0, min(transcript_segment_end, speaker_segment_end) - max(transcript_segment_start, speaker_segment_start))
    # Fraction of the transcript segment covered by the speaker segment.
    overlap_ratio = overlap / (transcript_segment_end - transcript_segment_start)
    return overlap_ratio
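
To illustrate with made-up numbers, a Whisper segment spanning 2.0–6.0 seconds and a Falcon segment spanning 4.0–9.0 seconds overlap for 2.0 seconds, which is half of the 4.0-second transcript segment, so the score is 0.5:

from types import SimpleNamespace

# Hypothetical segments, purely to demonstrate the score calculation.
t_segment = {"start": 2.0, "end": 6.0}
s_segment = SimpleNamespace(start_sec=4.0, end_sec=9.0)
print(segment_score(t_segment, s_segment))  # 0.5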

Utilizing this score, we can find the best-matching Falcon Speaker Diarization segment for each Whisper segment. The code snippet below demonstrates this process:

for t_segment in transcript_segments:
    # Pick the Falcon segment that overlaps this transcript segment the most.
    max_score = 0
    best_s_segment = None
    for s_segment in speaker_segments:
        score = segment_score(t_segment, s_segment)
        if score > max_score:
            max_score = score
            best_s_segment = s_segment

    print(f"Speaker {best_s_segment.speaker_tag}: {t_segment['text']}")

This is a basic approach for merging the two segment lists, intended for demonstration purposes. Results can be further enhanced with a more sophisticated matching algorithm, such as the per-speaker overlap accumulation sketched below.
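
As one possible refinement, overlap can be summed per speaker instead of picking a single best segment, so that a Whisper segment spanning several Falcon segments of the same speaker is still attributed correctly (the best_speaker helper is hypothetical and not part of either SDK):

from collections import defaultdict

def best_speaker(transcript_segment, speaker_segments):
    # Accumulate the overlapping duration per speaker across all Falcon segments.
    overlap_by_speaker = defaultdict(float)
    for s_segment in speaker_segments:
        overlap = min(transcript_segment["end"], s_segment.end_sec) - \
            max(transcript_segment["start"], s_segment.start_sec)
        if overlap > 0:
            overlap_by_speaker[s_segment.speaker_tag] += overlap
    # Return the speaker with the largest total overlap, or None if nothing overlaps.
    if not overlap_by_speaker:
        return None
    return max(overlap_by_speaker, key=overlap_by_speaker.get)
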
Putting everything together with the basic matching approach results in the script below:

import pvfalcon
import whisper

model = whisper.load_model(${WHISPER_MODEL})
result = model.transcribe(${AUDIO_FILE_PATH})
transcript_segments = result["segments"]

falcon = pvfalcon.create(access_key=${ACCESS_KEY})
speaker_segments = falcon.process_file(${AUDIO_FILE_PATH})


def segment_score(transcript_segment, speaker_segment):
    # Whisper segments are dictionaries with "start"/"end" timestamps in seconds.
    transcript_segment_start = transcript_segment["start"]
    transcript_segment_end = transcript_segment["end"]
    # Falcon segments expose start_sec/end_sec attributes.
    speaker_segment_start = speaker_segment.start_sec
    speaker_segment_end = speaker_segment.end_sec

    # Duration shared by the two segments (0 if they do not overlap at all).
    overlap = max(0.0, min(transcript_segment_end, speaker_segment_end) - max(transcript_segment_start, speaker_segment_start))
    # Fraction of the transcript segment covered by the speaker segment.
    overlap_ratio = overlap / (transcript_segment_end - transcript_segment_start)
    return overlap_ratio


for t_segment in transcript_segments:
    # Pick the Falcon segment that overlaps this transcript segment the most.
    max_score = 0
    best_s_segment = None
    for s_segment in speaker_segments:
        score = segment_score(t_segment, s_segment)
        if score > max_score:
            max_score = score
            best_s_segment = s_segment

    print(f"Speaker {best_s_segment.speaker_tag}: {t_segment['text']}")

The expected result follows a format similar to the output below:

Speaker 1:  Hey, has the task been completed?
Speaker 2:  I don't know anything about it.
Speaker 3:  Well, we're in the process of working on it. 
Speaker 3:  There's a bit of a delay because we're waiting on someone else to complete their part.
Speaker 1:  Waiting again? This is taking longer than expected. 
Speaker 1:  Can we get an update on the timeline?
Speaker 3:  I understand the urgency. 
Speaker 3:  I've followed up with the person responsible, and they've assured me they're working on it. 
Speaker 3:  We should have a clearer timeline by the end of the day.

It only takes a minute to add speaker diarization to Whisper using Falcon.


For more in-depth information on the Falcon Speaker Diarization Python SDK, see the documentation. For a solution that combines speech recognition and speaker diarization out of the box, consider Picovoice Leopard Speech-to-Text. Leopard Speech-to-Text is lightweight and fast, incorporates Falcon Speaker Diarization internally, and lets you obtain a transcript with speaker information through a single function call.
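
A minimal sketch of that workflow, assuming the Leopard Python SDK exposes an enable_diarization option and a speaker_tag field on each word (check the Picovoice documentation for the exact API), could look like this:

import pvleopard

# enable_diarization is assumed here; verify against the Leopard SDK documentation.
leopard = pvleopard.create(access_key=${ACCESS_KEY}, enable_diarization=True)
transcript, words = leopard.process_file(${AUDIO_FILE_PATH})

for word in words:
    # Each word is assumed to carry timestamps and a speaker_tag.
    print(f"Speaker {word.speaker_tag}: {word.word} ({word.start_sec:.2f}s - {word.end_sec:.2f}s)")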
