Build a Transcription App with Strapi, ChatGPT, & Whisper: Part 2 - Bringing it All Together
Introduction
In Part 1 of this series, we set the stage for our transcription app by building the backend infrastructure with Strapi and configuring a simple frontend with React. In this part, we integrate the AI capabilities that bring the app to life: we'll use OpenAI's Whisper speech-to-text model to transcribe audio and ChatGPT to refine the transcription and add contextual understanding.
Key Concepts
- Whisper: A powerful speech-to-text model from OpenAI, Whisper excels at transcribing audio in various languages and even handles background noise effectively. We'll utilize its API to convert audio files into text.
- ChatGPT: A large language model (LLM) also from OpenAI, ChatGPT is adept at generating human-like text, answering questions, and providing summaries. We'll leverage ChatGPT to refine and enhance the initial transcription from Whisper.
- Strapi API: Our backend, built with Strapi, provides the foundation for managing data and seamlessly integrating with the AI models.
- React Frontend: The user interface we built in Part 1 will serve as the platform to interact with the AI models and present the final transcription.
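Before wiring up the individual pieces, it helps to see how they fit together. The sketch below shows the overall flow as a small pipeline; the function names are illustrative placeholders (not actual OpenAI or Strapi APIs), with the concrete transcribe and enhance steps injected so the flow itself stays independent of any SDK:

```javascript
// Illustrative sketch of the overall flow. The concrete transcribe/enhance
// implementations are passed in, so this pipeline has no API dependencies.
async function transcriptionPipeline(audioFile, { transcribe, enhance }) {
  const rawText = await transcribe(audioFile); // Whisper: speech -> text
  const refined = await enhance(rawText);      // ChatGPT: clean up the text
  return refined;                              // shown to the user via React
}
```

The rest of this guide implements each stage: a Strapi endpoint for the Whisper step, a second endpoint for the ChatGPT step, and a React component that ties them together.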
Step-by-Step Implementation
1. Setting Up the OpenAI API Integration
Obtain API Keys: Sign up for an OpenAI account and generate an API key (note that API usage is billed per request). This key is required for accessing both Whisper and ChatGPT.
Install OpenAI Library: Add the OpenAI library to your project:
npm install openai
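The key is typically supplied through an environment variable rather than hardcoded. With Strapi, a `.env` file at the project root works; the variable name below matches the configuration snippet that follows, and the value shown is a placeholder:

```shell
# .env (never commit this file to version control)
OPENAI_API_KEY=sk-your-key-here
```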
- Configure the OpenAI Library: In your backend code (Strapi), set up the OpenAI library with your API key:
const OpenAI = require('openai');

const openai = new OpenAI({
  apiKey: process.env.OPENAI_API_KEY,
});

// Use the openai instance to interact with the Whisper and ChatGPT APIs
2. Integrating Whisper for Speech-to-Text
- Create an Endpoint: In your Strapi backend, create a new API endpoint that handles audio file uploads and Whisper transcription:
const fs = require('fs');
const OpenAI = require('openai');

const openai = new OpenAI({
  apiKey: process.env.OPENAI_API_KEY,
});

module.exports = async (ctx) => {
  try {
    const { file } = ctx.request.files;
    // The Whisper API expects the audio as a file stream (multipart upload),
    // not a base64 string. Depending on the upload middleware version, the
    // temporary path is exposed as `filepath` or `path`.
    const response = await openai.audio.transcriptions.create({
      file: fs.createReadStream(file.filepath || file.path),
      model: 'whisper-1',
      language: 'en', // Optional: specify the language if known
    });
    // Save the transcription to the database via Strapi's entity service
    const entry = await strapi.entityService.create('api::transcription.transcription', {
      data: {
        audio: file.name,
        text: response.text,
      },
    });
    // Return the new entry's ID so the frontend can request enhancement
    ctx.send({ id: entry.id, text: entry.text });
  } catch (error) {
    ctx.status = 500;
    ctx.send({ error: error.message });
  }
};
- Handle File Uploads: Ensure your React frontend allows users to upload audio files. This can be done using HTML input elements.
3. Leveraging ChatGPT for Transcription Enhancement
- Create an Endpoint: Add another endpoint to your Strapi backend to process the initial transcription from Whisper and enhance it with ChatGPT:
const OpenAI = require('openai');

const openai = new OpenAI({
  apiKey: process.env.OPENAI_API_KEY,
});

module.exports = async (ctx) => {
  try {
    const { id } = ctx.params; // Get the transcription ID
    // Fetch the initial transcription from the database
    const transcription = await strapi.db.query('api::transcription.transcription').findOne({
      where: { id },
    });
    // Use ChatGPT to refine and enhance the transcription
    const response = await openai.chat.completions.create({
      model: 'gpt-3.5-turbo',
      messages: [
        { role: 'user', content: `Please review and refine this transcription: ${transcription.text}` },
      ],
    });
    const text = response.choices[0].message.content;
    // Update the transcription with the enhanced text
    await strapi.db.query('api::transcription.transcription').update({
      where: { id },
      data: { text },
    });
    // Return the enhanced text so the frontend can display it
    ctx.send({ id, text });
  } catch (error) {
    ctx.status = 500;
    ctx.send({ error: error.message });
  }
};
- Call the Endpoint: In your React frontend, after receiving the initial transcription from Whisper, call the ChatGPT endpoint to refine the text.
4. Presenting the Final Transcription to the User
Retrieve Data from Strapi: In your React frontend, fetch the final, enhanced transcription from the Strapi API.
Display the Transcription: Display the transcribed text to the user in a readable format, perhaps with formatting options like paragraphs and line breaks.
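One simple approach to paragraph formatting is to split the raw text on blank lines before rendering. The helper below is a hypothetical utility (not part of the Strapi or OpenAI APIs); each returned paragraph can then be rendered as its own element:

```javascript
// Split raw transcription text into paragraphs on blank lines, dropping
// empty fragments, so each paragraph can be rendered as a separate <p>.
function splitIntoParagraphs(text) {
  return text
    .split(/\n\s*\n/)          // blank lines separate paragraphs
    .map((p) => p.trim())
    .filter((p) => p.length > 0);
}

// In a React component:
// splitIntoParagraphs(transcription).map((p, i) => <p key={i}>{p}</p>)
```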
Example Frontend Code (React)
import React, { useState } from 'react';

function TranscriptionApp() {
  const [audioFile, setAudioFile] = useState(null);
  const [transcription, setTranscription] = useState('');

  const handleFileChange = (event) => {
    setAudioFile(event.target.files[0]);
  };

  const handleSubmit = async (event) => {
    event.preventDefault();
    if (!audioFile) return; // Nothing to transcribe yet
    const formData = new FormData();
    formData.append('file', audioFile);
    try {
      // Send the audio to the Whisper endpoint
      const response = await fetch('/api/upload-audio', {
        method: 'POST',
        body: formData,
      });
      const { id } = await response.json(); // Get the transcription ID
      // Send the transcription ID to the ChatGPT endpoint
      const enhanceResponse = await fetch(`/api/enhance-transcription/${id}`, {
        method: 'GET',
      });
      const enhancedData = await enhanceResponse.json(); // Get the enhanced transcription
      setTranscription(enhancedData.text);
    } catch (error) {
      console.error(error);
    }
  };

  return (
    <div>
      <h1>Transcription App</h1>
      {/* JSX uses camelCase props: onSubmit/onChange, not onsubmit/onchange */}
      <form onSubmit={handleSubmit}>
        <input type="file" accept="audio/*" onChange={handleFileChange} />
        <button type="submit">Transcribe</button>
      </form>
      <div>
        <h2>Transcription:</h2>
        <p>{transcription}</p>
      </div>
    </div>
  );
}

export default TranscriptionApp;
Conclusion
This detailed guide walks you through the process of integrating Whisper and ChatGPT to build a robust and intelligent transcription app. By combining the power of Strapi for backend management, React for user interface development, and OpenAI for state-of-the-art AI capabilities, you can create a powerful and user-friendly solution for audio transcription.
Best Practices
- API Rate Limits: Be mindful of API rate limits for both Whisper and ChatGPT. Implementing proper error handling and throttling mechanisms is essential for a seamless user experience.
- Data Security: Securely store API keys and user data. Utilize appropriate encryption techniques and follow best practices for data protection.
- User Feedback: Gather user feedback to continuously improve your transcription accuracy and the overall user experience.
- Experiment with Model Parameters: Explore various Whisper and ChatGPT model parameters to fine-tune the performance of your app.
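For the rate-limit point above, a common throttling pattern is retry with exponential backoff. The helper below is a minimal sketch (the names are illustrative, not part of the OpenAI SDK): it retries a failed call with exponentially growing delays, which helps absorb transient 429 rate-limit errors:

```javascript
// Retry a failing async operation with exponential backoff.
// Delays grow as baseDelayMs * 2^attempt (e.g. 500ms, 1s, 2s, ...).
async function withRetry(fn, { retries = 3, baseDelayMs = 500 } = {}) {
  for (let attempt = 0; attempt <= retries; attempt++) {
    try {
      return await fn();
    } catch (error) {
      if (attempt === retries) throw error; // out of retries: give up
      const delay = baseDelayMs * 2 ** attempt;
      await new Promise((resolve) => setTimeout(resolve, delay));
    }
  }
}
```

In the endpoints above, an API call could then be wrapped as `withRetry(() => openai.chat.completions.create({ ... }))`. A production setup would also inspect the error status and only retry on rate-limit or transient failures.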
Further Enhancements
- Speaker Identification: Integrate speaker diarization models to distinguish between multiple speakers in an audio recording.
- Punctuation and Formatting: Further refine the transcribed text with advanced formatting and punctuation.
- Real-time Transcription: Implement real-time transcription using WebSockets for a more interactive user experience.
By implementing these best practices and exploring further enhancements, you can create a truly impressive transcription app that utilizes the latest advances in AI technology.