Hey Friends! This week's adventure sort of doubles as a Product Launch... It's something I've dreamed about and hinted at for a while now, and previous adventures have even been building up to it... but it's becoming a real thing now and I couldn't be happier to share it with you!
I'd rather watch a YouTube video
Finding a Problem to solve
I attend a church that records a video of the pastor's message each week and posts it to their YouTube channel. As I was browsing their channel, I noticed that it takes a while for a new video to reach 100 views, and I started digging into why that might be.
What I found was that it came down to time. The staff have a lot going on, both professionally and personally, and as a result they don't have the time required to market the YouTube channel effectively. Some key points that I noted:
- The titles don't attract attention very well.
- Most of the videos don't have anything in the description field.
- There are no hashtags in use.
- There's no additional marketing around the video posting - we don't make use of Shorts, or other channels like Reels or TikTok, to acquire new viewers and direct them to our content.
Anybody can complain; Devs roll up their sleeves and do work!
Given my recent Adventures exploring AI capabilities, my brain was primed to solve this problem. I could build a Social Media Assistant to help my pals using Generative AI tools!
So here's the plan:
- Start with the "finished" video file that they're going to upload
- Transcribe the audio
- Create vector embeddings out of the transcript
- Push those into a RAG application pattern using a free LLM
- Have the app do the work listed in the bullets above:
- Write a clickbait-y title
- Give me a description summary
- Find appropriate hashtags
- Write a list of discussion/reflection questions
- Write the social media email invitation to next week's services
- Find quotable moments in the message and clip them for use on short-form platforms
- Drop all the outputs into a folder where the staff can review / tweak before they post it
Uh, Blink, why not just automate the whole upload?
True, I could probably make it a completely automatic pipeline. So why stop short of that?
First of all, there's a trust issue to overcome. How do we know the model isn't going to hallucinate and leave our church with an embarrassing faux pas?
Second, full automation without human oversight is unhealthy. I'm fully on board with making all of the work automatic - as long as a person is still governing its use. It's a safety thing, yeah, but it's also a philosophical stance. I don't like the idea of replacing people with AI. I like the idea of augmenting people's efforts with AI.
Introducing... the PhilBott!
Our pastor's surname is "Philpott". In his honor, I named my robot buddy "the PhilBott"! 😏 You can find the code for the PhilBott on my GitHub. I've made it open-source and I have no intention of profiting from it - I'm a bit of a 'digital hippie' and I just want the information to be free, man...
A Brief Tour for the Devs in the Room
Let's take a look at the parts of the PhilBott! There are several components, and I've tried hard to keep them coupled very loosely. First, here's the flow:
Now that we have a big picture in mind, let's take a look at the first step: creating a transcript.
Turning Sounds into Words
I elected to wrap this in a Python class so that it could be easily called from my main program.
A Transcripter takes a single input - the file name of an input video - and then it's ready to use. By "use", I mean that you call the transcribe() method and it will spit out a transcript.
There's one extra requirement for Transcripter, however: we need to have ffmpeg installed in order for it to work. We use ffmpeg to extract the audio from the mp4 - it creates a wav file and strips it down to a mono channel. Once that's done, we can feed it to Vosk, a Python library and associated machine-learning model that transcribes the audio and creates a text file from it. One of the features of Vosk we're taking advantage of is that it marks the beginning and ending timestamp of every word it transcribes. This is particularly important for our quotable short-form tool - we need to know where those quotes are in the video in order to clip them out!
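Here's a rough sketch of what that class could look like. This isn't the code from the repo - the constructor arguments, the 16 kHz sample rate, and the return shape are my assumptions - but it shows the ffmpeg-then-Vosk flow:

```python
import json
import subprocess
import wave

from vosk import Model, KaldiRecognizer


class Transcripter:
    """Minimal sketch: extract audio with ffmpeg, then transcribe it with Vosk."""

    def __init__(self, video_path, model_path="model"):
        # model_path points at a downloaded Vosk model directory (assumption)
        self.video_path = video_path
        self.model = Model(model_path)

    def _extract_audio(self, wav_path="audio.wav"):
        # ffmpeg pulls the audio out of the mp4 as a 16 kHz mono WAV,
        # which is the format Vosk is happiest with
        subprocess.run(
            ["ffmpeg", "-y", "-i", self.video_path,
             "-vn", "-ac", "1", "-ar", "16000", wav_path],
            check=True,
        )
        return wav_path

    def transcribe(self):
        wav_path = self._extract_audio()
        wf = wave.open(wav_path, "rb")
        rec = KaldiRecognizer(self.model, wf.getframerate())
        rec.SetWords(True)  # ask Vosk for per-word start/end timestamps

        words = []
        while True:
            chunk = wf.readframes(4000)
            if len(chunk) == 0:
                break
            if rec.AcceptWaveform(chunk):
                words.extend(json.loads(rec.Result()).get("result", []))
        words.extend(json.loads(rec.FinalResult()).get("result", []))

        # Plain text for the RAG pipeline, timestamped words for clipping quotes
        text = " ".join(w["word"] for w in words)
        return text, words
```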
Transcripts become Vector Embeddings
I won't get too deep into these weeds here because we've covered it in a previous Adventure. I used the same pattern and flow to make this work: We chunk the transcript, create vector embeddings, and then push them into a Chroma DB.
As with the Transcripter, I made the RAG application into a Python class so it would be easier to call from my main program.
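Translated into code, the idea looks roughly like this. The chunk sizes, the collection name, and leaning on Chroma's default embedding function are my assumptions for the sketch, not necessarily what the PhilBott does:

```python
import chromadb


def build_vector_store(transcript, chunk_size=1000, overlap=200):
    """Sketch: chunk the transcript and load it into an in-memory Chroma collection."""
    # Naive fixed-size chunking with a little overlap between chunks
    step = chunk_size - overlap
    chunks = [transcript[start:start + chunk_size]
              for start in range(0, len(transcript), step)]

    client = chromadb.Client()  # in-memory; the real app might persist to disk
    collection = client.create_collection(name="sermon")

    # Chroma's default embedding function turns each chunk into a vector for us
    collection.add(
        documents=chunks,
        ids=[f"chunk-{i}" for i in range(len(chunks))],
    )
    return collection


# Retrieval side of the RAG pattern: pull the chunks most relevant to a prompt
# collection = build_vector_store(transcript_text)
# hits = collection.query(query_texts=["What are the key quotes?"], n_results=4)
# context = "\n".join(hits["documents"][0])
```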
Stitching it all together
The main Python program accepts three command-line parameters:
- The video file we're going to work on
- The config YAML file that tells the LLM what our outputs need to be
- The location of the output folder where our responses will live
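To give a feel for that second parameter, here's a hypothetical config. The keys and prompt wording are made up for illustration - the real file in the repo will look different:

```yaml
# Hypothetical config - one entry per output the staff wants generated
outputs:
  title:
    prompt: "Write an attention-grabbing YouTube title for this message."
  description:
    prompt: "Summarize the message in two or three sentences for the video description."
  hashtags:
    prompt: "Suggest relevant hashtags for this message."
  discussion_questions:
    prompt: "Write five discussion/reflection questions based on the message."
  invitation_email:
    prompt: "Draft a short email inviting people to next week's services."
  quotable_moments:
    prompt: "List the most quotable moments, with the words that begin and end each one."
```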
The program creates a Transcripter, transcribes the video, and hands the transcript to the RAG app. The RAG app loads the chunks into Chroma, connects the retriever to the language model, and invokes it to answer each of the questions in our config.
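Stitched together, the main program could look something like this. Again, it's a sketch rather than the repo's actual code: the module names, the RagApp class and its ask() method, and the config shape from the YAML above are all placeholders on my part:

```python
import argparse
import pathlib

import yaml

# These imports are assumptions about the project layout; the real classes
# described above may live in differently named modules.
from transcripter import Transcripter
from rag_app import RagApp


def main():
    parser = argparse.ArgumentParser(description="PhilBott - sketch of the main entry point")
    parser.add_argument("video", help="video file to process")
    parser.add_argument("config", help="YAML file describing the outputs we want")
    parser.add_argument("out_dir", help="folder where the generated drafts are written")
    args = parser.parse_args()

    # Load the list of outputs the staff wants generated
    with open(args.config) as fh:
        config = yaml.safe_load(fh)

    # Step 1: turn the video into a transcript (with word timestamps)
    transcript, words = Transcripter(args.video).transcribe()

    # Step 2: chunk + embed the transcript and wire it up to the LLM
    rag = RagApp(transcript)  # hypothetical interface for the RAG class

    # Step 3: ask every question in the config and drop each answer into the folder
    out_dir = pathlib.Path(args.out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    for name, spec in config["outputs"].items():
        answer = rag.ask(spec["prompt"])
        (out_dir / f"{name}.txt").write_text(answer)


if __name__ == "__main__":
    main()
```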
Wrapping up
To me, the PhilBott represents all of the AI learning I've been doing lately. It's sort of the place where I've joined all of my thoughts together and made something practical and useful out of them.
I think projects like this are important because they force us - the technologists, the practitioners - to stop and think about the people around us and how our work can help them achieve more.