SurvBot🎥: Automatic Surveillance Tagging using Moondream and Streamlit

Aryan Kargwal - Aug 27 - Dev Community

YOUTUBE VIDEO LIVE NOW 🔗

I’m still riding the GenAI train, testing and tweaking new apps using LLMs and VLMs like a mad scientist in a digital lab. My latest experiment? Resurrecting my old projects with some modern, no-nonsense solutions. Enter Moondream 2, the open-source VLM that plays nicely with Streamlit. I put it to work creating an intelligent video tagging system for surveillance because who doesn’t love a bit of AI snooping?


In my latest tutorial, I'll walk you through deploying a VLM locally: no cloud needed, just good old-fashioned DIY. You'll also get the lowdown on handling tokenization and the other infernal chores involved in passing an image through a VLM. Trust me, it's more fun than it sounds!

The Model: Moondream

Moondream is a highly versatile and modular Vision Language Model (VLM) capable of performing a range of vision-related tasks. From answering questions about images and detecting objects with bounding boxes to generating accurate image captions, Moondream is designed to deliver reliable results across a wide variety of applications. It's a capable tool for developers looking to integrate powerful Vision AI into their projects.


Built to run efficiently across multiple platforms, Moondream stands out as a compact, open-source VLM that combines performance with accessibility. It's a great choice for developing next-level AI vision applications without the burden of heavy or complex models such as GPT-4o and Gemma. Its Apache 2.0 license also lets us use the model freely for our own use cases.

Implementation

Moving to the implementation, we are looking at four major pieces of functionality: loading the VLM, setting up a tokenizer for the logging, extracting frames from an uploaded video, and finally running inference and storing the logs in a CSV.


We are using a Streamlit workflow to set up the application's input and output streams. To see the full implementation code, check out the GitHub repository here or the YouTube tutorial here.
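To ground the rest of the walkthrough, here is a minimal sketch of the Streamlit scaffolding I am assuming around the functions below; the widget labels and the temp-file handling are illustrative, not the repository's exact code.

# Hypothetical Streamlit scaffolding (illustrative, not the exact repo code)
import tempfile

import streamlit as st

st.title("SurvBot: Automatic Surveillance Tagging")

uploaded_file = st.file_uploader("Upload CCTV footage", type=["mp4", "avi", "mov"])
if uploaded_file is not None:
    # OpenCV needs a real file path, so persist the upload to a temporary file
    with tempfile.NamedTemporaryFile(delete=False, suffix=".mp4") as tmp:
        tmp.write(uploaded_file.read())
        video_path = tmp.name
    st.video(video_path)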

Loading Model and Tokenizer

We load the Moondream VLM through Hugging Face's AutoModelForCausalLM. Wrapping the call in a cached function downloads all the model weights once and reuses them across reruns of our web application, avoiding repeated downloads.

Warning: The Model is Over 2.5 GB, So Mind Your Internet Connection

import streamlit as st
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Cache the model and tokenizer to avoid downloading them repeatedly
@st.cache_resource
def load_model_and_tokenizer():
    model_id = "vikhyatk/moondream2"
    revision = "2024-07-23"

    # trust_remote_code is required because Moondream ships custom model code
    model = AutoModelForCausalLM.from_pretrained(
        model_id, trust_remote_code=True, revision=revision,
        torch_dtype=torch.float16).to("cuda")  # assumes a CUDA-capable GPU

    tokenizer = AutoTokenizer.from_pretrained(model_id, revision=revision)
    return model, tokenizer
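With the loader cached, calling it from the main script costs nothing after the first run. A one-line usage sketch, assuming the function above is in scope:

# Downloads and loads only once per server process, thanks to @st.cache_resource
model, tokenizer = load_model_and_tokenizer()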

Extracting Frames with Timestamp

The next function we write handles the uploaded CCTV surveillance footage, capturing frames at a fixed time interval along with their timestamps. These timestamps will also help us identify key frames later.

import cv2
from PIL import Image

# Function to extract frames from video and their timestamps
def extract_frames_with_timestamps(video_path, interval=0.2):
    cap = cv2.VideoCapture(video_path)
    frames = []
    timestamps = []
    frame_rate = cap.get(cv2.CAP_PROP_FPS)
    # Number of frames to skip between captures; integer step avoids a
    # float-modulo bug on non-integer frame rates (e.g. 29.97 fps)
    frame_step = max(1, int(interval * frame_rate))
    success, image = cap.read()
    count = 0

    while success:
        timestamp_ms = cap.get(cv2.CAP_PROP_POS_MSEC)
        timestamp_sec = timestamp_ms / 1000.0

        if count % frame_step == 0:
            # OpenCV reads BGR; convert to RGB before handing off to PIL
            img = Image.fromarray(cv2.cvtColor(image, cv2.COLOR_BGR2RGB))
            frames.append(img)
            timestamps.append(timestamp_sec)

        success, image = cap.read()
        count += 1

    cap.release()

    print(f"Total frames captured: {len(frames)}")
    return frames, timestamps
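As a quick sanity check, the extractor also runs standalone outside Streamlit; the file name below is a placeholder:

# Standalone check -- "sample_cctv.mp4" is a placeholder path
frames, timestamps = extract_frames_with_timestamps("sample_cctv.mp4", interval=1)
for ts in timestamps[:5]:
    print(f"Captured frame at {ts:.2f}s")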

Frame Inference

Now we pass the prompt "Describe this image." to the model for each extracted frame, collecting descriptions for the video log. On top of that, we add a simple heuristic for estimating key frames: flag any frame whose description contains more than five words not present in the previous frame's description.

# Stand-in for the repository's filter_description helper;
# assumed behavior: split the description into lowercase words
def filter_description(description):
    return {word.lower().strip(".,") for word in description.split()}

# Extract frames and timestamps from the video
frames, timestamps = extract_frames_with_timestamps(video_path, interval=1)  # Extract 1 frame per second

# Process each frame using the model
descriptions = []
prev_description_words = set()
key_frames = []

with st.spinner("Processing..."):
    for i, frame in enumerate(frames):
        enc_image = model.encode_image(frame)
        description = model.answer_question(enc_image, "Describe this image.", tokenizer)
        filtered_words = list(filter_description(description))  # Convert to list

        # Key-frame logic: flag frames introducing more than 5 new words
        new_words = set(filtered_words) - prev_description_words
        if len(new_words) > 5:
            key_frames.append((timestamps[i], frame))

        descriptions.append((timestamps[i], filtered_words))
        prev_description_words = set(filtered_words)  # Ensure it remains a set
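The overview promised logs stored in a CSV. The repository handles this step in its own way, but a minimal sketch with pandas could look like the following; the column names and file name here are my choices, not the repo's schema:

# Minimal CSV logging sketch -- schema and file name are assumptions
import pandas as pd

log_df = pd.DataFrame(
    [(ts, " ".join(words)) for ts, words in descriptions],
    columns=["timestamp_sec", "description"],
)
log_df.to_csv("surveillance_log.csv", index=False)

# Optionally let the user grab the log straight from the app
st.download_button("Download log CSV", log_df.to_csv(index=False),
                   file_name="surveillance_log.csv")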

Streamlit Formatting for Images

Finally, for the more curious, here is the code that formats the displayed frames and key frames using Streamlit commands.

# Display the frames in a grid layout
num_columns = 3  # Number of columns in the grid
num_rows = (len(frames) + num_columns - 1) // num_columns  # Ceiling division for row count

for row in range(num_rows):
    cols = st.columns(num_columns)
    for col in range(num_columns):
        index = row * num_columns + col
        if index < len(frames):
            frame = frames[index]
            cols[col].image(frame, caption=f"Frame {index + 1} at {timestamps[index]:.2f}s")

# Display key frames in a grid layout
if key_frames:
    st.write("Key Frames:")
    num_columns_key_frames = 3  # Number of columns in the key-frame grid
    num_rows_key_frames = (len(key_frames) + num_columns_key_frames - 1) // num_columns_key_frames

    for row in range(num_rows_key_frames):
        cols = st.columns(num_columns_key_frames)
        for col in range(num_columns_key_frames):
            index = row * num_columns_key_frames + col
            if index < len(key_frames):
                timestamp, frame = key_frames[index]
                cols[col].image(frame, caption=f"Key Frame {index + 1} at {timestamp:.2f}s")

Conclusion

So there you have it, a crash course in making your old projects feel new again with a bit of VLM magic. Whether you want to impress your boss or geek out over some next-level AI, Moondream 2 has your back. If you’re anything like me, you’ll probably wonder why you didn’t do this sooner. Go forth and tag those videos like a pro!
