This tutorial will cover how to use embeddings and vectors to perform semantic search using ChromaDB in JavaScript.
What are Embeddings
Have you ever wondered how recommendation systems like the one in Netflix almost always seem to know what movies you like? When you log in to Netflix, the app presents recommendations that will likely fit your tastes and preferences. Embeddings power the mechanism behind this.
Embeddings refer to the transformation of words, text, or audio into numerical vectors. A numerical vector is essentially an array of numbers. This transformation preserves the meaning of the words and also captures their relationship to other words in the vector space.
What is a Vector Space
A vector space is a mathematical space where vectors represent data. For example, consider the words 'cat' and 'kitten.' When these words are represented as vectors in a vector space, the vectors capture their semantic relationship: words with similar meanings end up close to each other in the space.
The distance between the 'cat' and 'kitten' vectors measures their relatedness. Since 'cat' and 'kitten' are close to one another, the distance between them is small. Larger distances between vectors indicate that the words or texts are not closely related.
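To make the idea of distance concrete, here is a minimal sketch that compares vectors with cosine similarity, one common way to measure relatedness. The three-dimensional vectors are made up for illustration (real embeddings have hundreds or thousands of dimensions), and the helper names `cosineSimilarity` and `cosineDistance` are our own, not part of any library.

```javascript
// Cosine similarity: close to 1 when two vectors point the same way,
// close to 0 when they are unrelated.
function cosineSimilarity(a, b) {
  if (a.length !== b.length) {
    throw new Error("Vectors must have the same length");
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Cosine distance is simply 1 minus the similarity: smaller means more related.
function cosineDistance(a, b) {
  return 1 - cosineSimilarity(a, b);
}

// Hypothetical toy vectors, invented for this example only.
const cat = [0.9, 0.8, 0.1];
const kitten = [0.85, 0.75, 0.2];
const car = [0.1, 0.2, 0.9];

console.log(cosineDistance(cat, kitten) < cosineDistance(cat, car)); // true
```

With these toy numbers, 'cat' sits much closer to 'kitten' than to 'car', which is the behavior the paragraph above describes.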
This means that when you search for "cat," the system can recognize the similarity and suggest content related to cats and kittens.
This powerful technology is what allows platforms like Netflix and Spotify to provide you with personalized and accurate recommendations, enhancing your viewing and listening experience.
How to Create Embeddings with OpenAI
OpenAI provides an embedding model that measures the relatedness of text. To get an embedding of our 'cat' and 'kitten' words, we need to send each string to the OpenAI embeddings API endpoint along with the model name.
First, define your OpenAI API key:
const OPENAI_API_KEY = "your_openai_api_key";
Create a function that takes a phrase or word as an argument, sends it to the OpenAI embeddings API, and returns the embedding.
async function createEmbeddings(word) {
  const url = "https://api.openai.com/v1/embeddings";
  const headers = {
    "Content-Type": "application/json",
    Authorization: `Bearer ${OPENAI_API_KEY}`,
  };
  const data = {
    input: word,
    model: "text-embedding-3-small",
  };
  const response = await fetch(url, {
    method: "POST",
    headers: headers,
    body: JSON.stringify(data),
  });
  const embedding = await response.json();
  console.log(embedding.data);
  // Return the data so callers can use the embedding, not just see it logged.
  return embedding.data;
}
Now let's invoke the function with the words 'cat' and 'kitten':
createEmbeddings("cat");
createEmbeddings("kitten");
The output will look like this:
[
{
object: 'embedding',
index: 0,
embedding: [
0.02552942, -0.023411665, -0.016092611, 0.03937628, 0.02094483,
-0.02632067, 0.0018908527, 0.030602723, -0.015929706, 0.0053118416,
0.02214334, -0.0002121755, 0.010460779, 0.0031213614, 0.02985802,
0.006265995, -0.021363726, -0.010716772, -0.030532908, 0.057528466,
0.03409353, 0.04589245, 0.020502662, -0.046637155, -0.006871068,
0.03800323, -0.009268087, 0.04405396, 0.051803548, -0.013497779,
0.0033686268, -0.043123078, -0.0112753, -0.029090041, -0.022946225,
0.017768197, 0.017570386, -0.028019529, -0.015743531, 0.01378868,
-0.037281796, -0.008773557, 0.045799363, 0.011473113, 0.009460081,
-0.0533395, -0.022597145, -0.019606689, 0.019362332, 0.037142165,
0.023388393, -0.014870829, 0.01746566, 0.04998833, -0.004168603,
-0.0011636016, -0.019292515, 0.04659061, -0.0029279126, 0.009279723,
-0.024970891, 0.0059925485, 0.02518034, -0.002679193, 0.019420512,
0.038282495, 0.01837327, 0.017232941, -0.05962295, -0.018210366,
-0.0058034635, 0.028415153, -0.062089786, 0.011286936, 0.047218956,
0.009401902, -0.029974379, -0.000250538, 0.062974125, 0.043425616,
0.0011352389, 0.058552437, 0.016243879, -0.025226884, 0.01259017,
-0.023202218, -0.034512427, 0.02850824, 0.011054216, -0.026041405,
-0.0038457036, 0.015487539, -0.044798665, -0.038980655, -0.010332783,
0.043774694, -0.008517564, -0.048219655, -0.001969396, 0.014149397,
... 1436 more items
]
}
]
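The vector itself sits inside the `embedding` field of the first item in `data`. The sketch below shows one way to pull it out. `extractEmbedding` is a hypothetical helper, and `sampleResponse` is a shortened stand-in for a real response body that reuses the first three numbers from the output above.

```javascript
// Pull the raw vector out of an embeddings API response body.
function extractEmbedding(responseBody) {
  const first = responseBody?.data?.[0];
  if (!first || !Array.isArray(first.embedding)) {
    throw new Error("Response does not contain an embedding");
  }
  return first.embedding;
}

// Shortened stand-in for a real response (a real one holds far more numbers).
const sampleResponse = {
  data: [
    {
      object: "embedding",
      index: 0,
      embedding: [0.02552942, -0.023411665, -0.016092611],
    },
  ],
};

const vector = extractEmbedding(sampleResponse);
console.log(vector.length); // 3
```

In the real output above, 100 values are printed and 1436 more are elided, so the full vector has 1536 dimensions.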
What is a Vector Database
As the name suggests, a vector database is a database that stores vectors. Traditional databases typically look up rows by exact values such as primary keys and foreign keys. Vector databases instead store data as high-dimensional vectors and, when queried, use mathematical proximity to find the most similar items.
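As a rough illustration of what a proximity query means, the sketch below does a brute-force nearest-neighbor search over an in-memory array using Euclidean distance. This is a simplification: real vector databases use specialized indexes rather than comparing against every item. The names `euclideanDistance`, `nearest`, and `store`, and the two-dimensional vectors, are all invented for this example.

```javascript
// Straight-line distance between two vectors of equal length.
function euclideanDistance(a, b) {
  let sum = 0;
  for (let i = 0; i < a.length; i++) {
    const diff = a[i] - b[i];
    sum += diff * diff;
  }
  return Math.sqrt(sum);
}

// Return the n stored items closest to the query vector, nearest first.
function nearest(items, queryVector, n) {
  return items
    .map((item) => ({
      id: item.id,
      distance: euclideanDistance(item.vector, queryVector),
    }))
    .sort((x, y) => x.distance - y.distance)
    .slice(0, n);
}

// Hypothetical toy data, made up for illustration.
const store = [
  { id: "truck", vector: [0.1, 0.9] },
  { id: "kitten", vector: [0.8, 0.2] },
  { id: "cat", vector: [0.9, 0.1] },
];

console.log(nearest(store, [0.88, 0.1], 2).map((hit) => hit.id)); // [ 'cat', 'kitten' ]
```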
How to Set Up a Vector Database with ChromaDB and Docker
Vector databases are ideal for building complex AI applications. ChromaDB is an open-source vector database that requires minimal configuration to get started.
To get started, you should have Docker installed. Follow the steps below to get ChromaDB running on your machine:
Pull the ChromaDB Docker image from Docker Hub.
docker pull chromadb/chroma
Run the ChromaDB container and map port 8000, the port the Chroma server listens on by default.
docker run -d -p 8000:8000 --name chromadb chromadb/chroma
To verify that the container is running, issue this command:
docker ps
You should see the ChromaDB container in your list of running containers.
Adding Data to the Vector Store
To keep the semantic meaning of the data accurate, the data should be split into small chunks. We will start with an array of strings, each describing one movie, that looks like this:
const movies = [
'"Title":"Due Date","Year":"2010","Rated":"R","Released":"05 Nov 2010","Runtime":"95 min","Genre":"Comedy, Drama","Actors":"Robert Downey Jr., Zach Galifianakis, Michelle Monaghan","Plot":"High-strung father-to-be Peter Highman is forced to hitch a ride with aspiring actor Ethan Tremblay on a road trip in order to make it to his child\'s birth on time."',
'"Title":"Easy A","Year":"2010","Rated":"PG-13","Released":"17 Sep 2010","Runtime":"92 min","Genre":"Comedy, Drama, Romance","Actors":"Emma Stone, Amanda Bynes, Penn Badgley","Plot":"When Olive lies to her best friend about losing her virginity to one of the college boys, a girl overhears their conversation. Soon, her story spreads across the entire school like wildfire."',
'"Title":"Unstoppable","Year":"2010","Rated":"PG-13","Released":"12 Nov 2010","Runtime":"98 min","Genre":"Action, Thriller","Actors":"Denzel Washington, Chris Pine, Rosario Dawson","Plot":"With an unmanned, half-mile-long freight train barreling toward a city, a veteran engineer and a young conductor race against the clock to prevent a catastrophe."',
'"Title":"Despicable Me","Year":"2010","Rated":"PG","Runtime":"95 min","Genre":"Animation, Adventure, Comedy","Actors":"Steve Carell, Jason Segel, Russell Brand","Plot":"Gru, a criminal mastermind, adopts three orphans as pawns to carry out the biggest heist in history. His life takes an unexpected turn when the little girls see the evildoer as their potential father."',
'"Title":"Don Henley: Live Inside Job","Year":"2000","Rated":"N/A","Runtime":"105 min","Genre":"Documentary, Music","Actors":"Don Henley, Jonathan K. Bendis, Will Hollis","Plot":"Don Henley performs his greatest hits live in Dallas."',
'"Title":"Harry Potter and the Deathly Hallows: Part 1","Year":"2010","Rated":"PG-13","Runtime":"146 min","Genre":"Adventure, Family, Fantasy","Actors":"Daniel Radcliffe, Emma Watson, Rupert Grint","Plot":"As Harry, Ron and Hermione race against time and evil to destroy the Horcruxes, they uncover the existence of the three most powerful objects in the wizarding world: the Deathly Hallows."',
'"Title":"Tangled","Year":"2010","Rated":"PG","Runtime":"100 min","Genre":"Animation, Adventure, Comedy","Actors":"Mandy Moore, Zachary Levi, Donna Murphy","Plot":"The magically long-haired Rapunzel has spent her entire life in a tower, but now that a runaway thief has stumbled upon her, she is about to discover the world for the first time, and who she really is."',
'"Title":"Black Swan","Year":"2010","Rated":"R","Runtime":"108 min","Genre":"Drama, Thriller","Actors":"Natalie Portman, Mila Kunis, Vincent Cassel","Plot":"Nina is a talented but unstable ballerina on the verge of stardom. Pushed to the breaking point by her artistic director and a seductive rival, Nina\'s grip on reality slips, plunging her into a waking nightmare."',
'"Title":"The Social Network","Year":"2010","Rated":"PG-13","Released":"01 Oct 2010","Runtime":"120 min","Genre":"Biography, Drama","Actors":"Jesse Eisenberg, Andrew Garfield, Justin Timberlake","Plot":"As Harvard student Mark Zuckerberg creates the social networking site that would become known as Facebook, he is sued by the twins who claimed he stole their idea and by the co-founder who was later squeezed out of the business."',
'"Title":"Toy Story 3","Year":"2010","Rated":"G","Runtime":"103 min","Genre":"Animation, Adventure, Comedy","Actors":"Tom Hanks, Tim Allen, Joan Cusack","Plot":"The toys are mistakenly delivered to a day-care center instead of the attic right before Andy leaves for college, and it\'s up to Woody to convince the other toys that they weren\'t abandoned and to return home."',
'"Title":"A Clockwork Orange","Year":"1971","Rated":"R","Runtime":"136 min","Genre":"Crime, Sci-Fi","Actors":"Malcolm McDowell, Patrick Magee, Michael Bates","Plot":"In the future, a sadistic gang leader is imprisoned and volunteers for a conduct-aversion experiment, but it doesn\'t go as planned."',
'"Title":"Inception","Year":"2010","Rated":"PG-13","Runtime":"148 min","Genre":"Action, Adventure, Sci-Fi","Actors":"Leonardo DiCaprio, Joseph Gordon-Levitt, Elliot Page","Plot":"A thief who steals corporate secrets through the use of dream-sharing technology is given the inverse task of planting an idea into the mind of a C.E.O., but his tragic past may doom the project."'
];
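If your movie data starts out as objects rather than ready-made strings, a small helper can flatten each object into one text chunk in the same "key":"value" style used in the array above. This is only a sketch: `movieToDocument` is a hypothetical name, and the sample object contains just two fields to keep things short.

```javascript
// Turn a flat object into a single string of "key":"value" pairs.
function movieToDocument(movie) {
  return Object.entries(movie)
    .map(([key, value]) => `${JSON.stringify(key)}:${JSON.stringify(value)}`)
    .join(",");
}

// Hypothetical, shortened movie object for illustration.
const dueDate = { Title: "Due Date", Year: "2010" };

console.log(movieToDocument(dueDate)); // "Title":"Due Date","Year":"2010"
```

Keeping one movie per string means each embedding describes exactly one movie, which keeps search results easy to interpret.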
Import ChromaClient.
import { ChromaClient } from "chromadb";
Instantiate a ChromaDB client that will connect to the ChromaDB server.
const client = new ChromaClient();
Create a collection.
A collection is a way to organize vectors. Our collection will store all the details and features about the movies in the movies array. Each vector will have the following features:
- ID,
- metadata,
- movie details,
- and embeddings.
Chroma integrates with OpenAI's embedding models, so it can generate embeddings for your documents on your behalf.
Import the OpenAIEmbeddingFunction class from chromadb, instantiate it with your OpenAI API key, and supply the resulting embedding function when creating a collection.
import { ChromaClient, OpenAIEmbeddingFunction } from "chromadb";
const embeddingFunction = new OpenAIEmbeddingFunction({
  openai_api_key: OPENAI_API_KEY,
});
Create a collection called movies and specify the embedding function.
const collection = await client.createCollection({
  name: "movies",
  embeddingFunction: embeddingFunction,
});
The embedding function ensures that Chroma transforms each individual movie into a high-dimensional vector (its embedding). This preserves the semantic meaning of the text, which will be useful when performing queries.
Add Data to the Collection
Each movie should have a unique ID, so we will loop over the movies array, create a unique ID for each movie, and insert it into the database.
for (const movie of movies) {
  const uniqueId = `${Date.now()}-${Math.floor(Math.random() * 10000)}`;
  // Await each insert so errors surface and all writes finish before we query.
  await collection.add({
    documents: [movie],
    ids: [uniqueId],
    metadatas: [{ name: movie }],
  });
}
To view the collection, navigate to http://localhost:8000/api/v1/collections, and you should see all your collections.
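The loop above sends one request per movie. Since `collection.add()` takes arrays for `ids`, `documents`, and `metadatas`, another option is to prepare everything up front and add it in a single call. The sketch below only builds the arrays. `buildBatch` is a hypothetical helper, and the index-based IDs and `position` metadata are our own choices, used here as an alternative to the timestamp-and-random scheme.

```javascript
// Build the parallel arrays that one collection.add() call can take.
function buildBatch(documents, prefix = "movie") {
  return {
    ids: documents.map((_, index) => `${prefix}-${index}`),
    documents: documents,
    metadatas: documents.map((_, index) => ({ position: index })),
  };
}

// Shortened sample documents for illustration.
const sample = [
  '"Title":"Due Date","Year":"2010"',
  '"Title":"Easy A","Year":"2010"',
];

const batch = buildBatch(sample);
console.log(batch.ids); // [ 'movie-0', 'movie-1' ]

// With a live collection you would then run:
// await collection.add(batch);
```

Index-based IDs are predictable and cannot collide within one batch, whereas two random IDs generated in the same millisecond have a small chance of clashing.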
Perform Similarity Search
Let's first get the collection. Use the .getCollection() method and specify the name of your collection and the embeddingFunction.
const mycollection = await client.getCollection({
  name: "movies",
  embeddingFunction: embeddingFunction,
});
Search Collection
Let's run a query with the phrase "recommend for me a movie suitable for kids":
const results = await mycollection.query({
  queryTexts: ["recommend for me a movie suitable for kids"],
  nResults: 2,
});
console.log(results.documents);
Here is the response:
[
[
'"Title":"Despicable Me","Year":"2010","Rated":"PG","Runtime":"95 min","Genre":"Animation, Adventure, Comedy","Actors":"Steve Carell, Jason Segel, Russell Brand","Plot":"Gru, a criminal mastermind, adopts three orphans as pawns to carry out the biggest heist in history. His life takes an unexpected turn when the little girls see the evildoer as their potential father."',
`"Title":"Toy Story 3","Year":"2010","Rated":"G","Runtime":"103 min","Genre":"Animation, Adventure, Comedy","Actors":"Tom Hanks, Tim Allen, Joan Cusack","Plot":"The toys are mistakenly delivered to a day-care center instead of the attic right before Andy leaves for college, and it's up to Woody to convince the other toys that they weren't abandoned and to return home."`
]
]
We expected our query to return results that are semantically similar to the query, and as you can see, the response is accurate. Despicable Me and Toy Story 3 are both movies suitable for kids, even though the query shares almost no exact words with their descriptions. How awesome is this?
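Notice that `results.documents` is an array of arrays: one inner array per query text. The query result also carries matching `ids` and `distances` arrays in the same nested shape. The sketch below flattens the hits for one query into a simple list. `toHits` is a hypothetical helper, and `mockResults` is a made-up object that only imitates the shape of a query result.

```javascript
// Flatten the nested result arrays for one query into a list of hits.
function toHits(results, queryIndex = 0) {
  const ids = results.ids[queryIndex];
  return ids.map((id, i) => ({
    id: id,
    document: results.documents[queryIndex][i],
    distance: results.distances ? results.distances[queryIndex][i] : null,
  }));
}

// Made-up values shaped like a query result, for illustration only.
const mockResults = {
  ids: [["id-1", "id-2"]],
  documents: [['"Title":"Despicable Me"', '"Title":"Toy Story 3"']],
  distances: [[0.31, 0.35]],
};

const hits = toHits(mockResults);
console.log(hits[0].document); // "Title":"Despicable Me"
```

A smaller distance means a closer match, so the first hit is the one most similar to the query.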
Conclusion
In conclusion, this tutorial has shown you how to leverage the power of embeddings and ChromaDB to perform semantic searches in JavaScript.
Stay tuned for part 2, where we will cover how to add a retriever.