Quick tip: Using OpenAI with SingleStoreDB to store and query vectors of Fine Food Reviews

Akmal Chaudhri - May 5 '23 - - Dev Community

Abstract

This short article will demonstrate another example of working with OpenAI and storing and querying vectors in SingleStoreDB. In a previous article, we saw that SingleStoreDB was adept at working with vectors through its vector functions, using a local installation. This time, however, we'll use SingleStoreDB in the cloud. In a future article, we'll use Docker.

Introduction

In a great article, the author shows how to create vectors for a clustering task. The vectors are eventually stored and queried using a vector database. However, we could store and query the vectors using SingleStoreDB instead. Since the original article code is copyrighted, we'll just provide the additional code required to use SingleStoreDB, instead of a vector database. There is, however, one code correction required in the original article. The following line:

matrix = np.array(df.ada_similarity.apply(eval).to_list())
Enter fullscreen mode Exit fullscreen mode

needs to be changed to:

matrix = np.array(df.ada_similarity.to_list())
Enter fullscreen mode Exit fullscreen mode

Create a SingleStoreDB Cloud account

A previous article showed the steps required to create a free SingleStoreDB Cloud account. We'll use OpenAI Demo Group as our Workspace Group Name and openai-demo as our Workspace Name.

Fill out the Notebook

In a SingleStore Jupyter notebook, we'll need to add the following libraries:

!pip install matplotlib --quiet
!pip install openai --quiet
!pip install scikit-learn --quiet
!pip install tabulate --quiet
!pip install transformers --quiet
Enter fullscreen mode Exit fullscreen mode

and imports:

import getpass
import matplotlib.pyplot as plt
import numpy as np
import openai
import os
import pandas as pd
import warnings

from openai import OpenAI
from tabulate import tabulate
from transformers import GPT2TokenizerFast
from singlestoredb import notebook as nb
from sklearn.cluster import KMeans
from sklearn.manifold import TSNE
Enter fullscreen mode Exit fullscreen mode

We'll create a database and table:

CREATE DATABASE IF NOT EXISTS openai_demo;

USE openai_demo;

DROP TABLE IF EXISTS fine_food_reviews;
CREATE TABLE IF NOT EXISTS fine_food_reviews (
    combined TEXT,
    ada_similarity VECTOR(1536)
)
Enter fullscreen mode Exit fullscreen mode

and a connection to SingleStoreDB, as follows:

from sqlalchemy import *

db_connection = create_engine(connection_url)
Enter fullscreen mode Exit fullscreen mode

We could store all the Dataframe data in SingleStoreDB, as shown in the previous article, but we'll limit it this time to several columns. Our code below will be similar to the previous article.

Next, we'll insert the data into the table, as follows:

df_copy = df[["combined", "ada_similarity"]]

df_copy.to_sql(
    "fine_food_reviews",
    con = db_connection,
    if_exists = "append",
    index = False,
    chunksize = 1000
)
Enter fullscreen mode Exit fullscreen mode

We'll now create a Python function that will call OpenAI and query SingleStoreDB, returning the values and scores, as follows:

def search_reviews(
    query: str,
    num_rows: int = 10
) -> list:
    """Searches Fine Foods Reviews for the given query and returns the top `num_rows` results.

    Args:
        query: The query to search for.
        num_rows: The number of results to return.

    Returns:
        A list of the top `num_rows` results.
    """

    # Get the embedding of the query
    query_embedding_response = client.embeddings.create(
        input = query,
        model = EMBEDDING_MODEL
    )
    query_embedding = query_embedding_response.data[0].embedding

    results = pd.read_sql("""
        SELECT combined, (%s <*> ada_similarity) AS score
        FROM fine_food_reviews
        ORDER BY score DESC
        LIMIT %s
    """, db_connection, params = (str(query_embedding), num_rows))

    # Separate the results into two lists
    values = results["combined"].tolist()
    scores = results["score"].tolist()

    # Return the results
    return values, scores
Enter fullscreen mode Exit fullscreen mode

Now we are ready to query the data:

values, scores = search_reviews(
    query = "I have ordered these raisins",
    num_rows = 5
)
Enter fullscreen mode Exit fullscreen mode

We can tabulate the results as follows:

# Combine the values and scores lists into a list of tuples.
# Each tuple contains a value and its corresponding score.
table_data = list(zip([value[:50] for value in values], scores))

# Add a rank column to the table data.
table_data = [(i + 1,) + data for i, data in enumerate(table_data)]

# Create the table.
table = tabulate(table_data, headers = ["Rank", "Combined", "Score"])

# Print the table.
print(table)
Enter fullscreen mode Exit fullscreen mode

The output should be similar to the following:

  Rank  Combined                                               Score
------  --------------------------------------------------  --------
     1  Title: Delicious!; Content: I have ordered these r  0.879987
     2  Title: Just what I expected!; Content: I bought th  0.838344
     3  Title: its delicious...; Content: I was surprised   0.819695
     4  Title: What a treat!; Content: Ordered these as pa  0.813487
     5  Title: Perfect Gift; Content: I got these to give   0.813398
Enter fullscreen mode Exit fullscreen mode

A quick visual inspection shows that the results are similar to those in the original article.

Summary

In this short article, we have seen another example where SingleStoreDB can store and query vectors. In future articles, we'll look at further examples and scenarios where we might wish to perform additional queries on the database beyond just querying the vector data.

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Terabox Video Player