If you're new to RAG, vector search, and related concepts, this article will guide you through the key terms and principles used in modern LLM-based applications.
This article provides a very high-level overview of the key concepts and terms used in the LLM ecosystem, with easy-to-relate explanations. For a more in-depth understanding, I recommend reading other dedicated resources.
With that said, let's get started!
## Embedding
An embedding is a way to represent unstructured data as numbers that capture the data's semantic meaning. In the context of LLMs, embeddings are used to represent words, sentences, or documents.
Let's say we have a couple of words that we want to represent as numbers. For simplicity, we will only consider 2 aspects of the words: edibility and affordability.
| Word | Edibility | Affordability | Label |
|--------|-----------|---------------|--------------|
| Apple | 0.9 | 0.8 | Fruit |
| Apple | 0.0 | 0.0 | Tech Company |
| Banana | 0.8 | 0.8 | ? |
In the table above, we can roughly deduce that the first apple is a fruit, while the second apple refers to a tech company. If we had to decide whether the banana is a fruit or some tech company we have never heard of, we could reasonably say it's a fruit, since its edibility and affordability values are close to those of the first apple.
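To make this concrete, here is a minimal Python sketch (using the toy two-dimensional values from the table above; the `euclidean_distance` helper is written purely for illustration) that measures which apple the banana is closer to:

```python
import math

# Toy 2-dimensional embeddings: [edibility, affordability].
apple_fruit = [0.9, 0.8]
apple_company = [0.0, 0.0]
banana = [0.8, 0.8]

def euclidean_distance(a, b):
    """Straight-line distance between two embeddings."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Banana is far closer to the fruit apple than to the tech company,
# so we would label it a fruit.
print(euclidean_distance(banana, apple_fruit))    # 0.1
print(euclidean_distance(banana, apple_company))  # ~1.13
```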
In practice, embeddings are much more complex and have many more dimensions, often capturing various semantic properties beyond simple attributes like edibility and affordability. For instance, embeddings in models like Word2Vec, GloVe, BERT, or GPT-3 can have hundreds or thousands of dimensions. These embeddings are learned by neural networks and are used in numerous applications, such as search engines, recommendation systems, sentiment analysis, and machine translation.
Moreover, modern LLMs use contextual embeddings, meaning the representation of a word depends on the context in which it appears. This allows the model to distinguish between different meanings of the same word based on its usage in a sentence.
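As a rough sketch of what contextual embeddings look like in practice, the snippet below (assuming the Hugging Face `transformers` and `torch` packages are installed, and using `bert-base-uncased` purely as an example model) extracts the embedding of the word "apple" from two different sentences:

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def word_embedding(sentence: str, word: str) -> torch.Tensor:
    """Return the contextual embedding of `word` within `sentence`."""
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist())
    return hidden[tokens.index(word)]

fruit = word_embedding("I ate a fresh apple for lunch.", "apple")
company = word_embedding("Apple just released a new phone.", "apple")

# Same word, different contexts: the two vectors are noticeably different.
print(torch.cosine_similarity(fruit, company, dim=0).item())
```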
Note that embedding and vector are often used interchangeably in the context of LLMs.
## Indexing
Indexing is the process of organizing and storing data to optimize search and retrieval efficiency. In the context of RAG and vector search, indexing organizes data based on their embeddings.
Let's consider the 4 data points below, each with an embedding representing two features: alive and edible.
| ID | Embedding | Data |
|----|------------|--------|
| 1 | [0.0, 0.8] | Apple |
| 2 | [0.0, 0.7] | Banana |
| 3 | [1.0, 0.4] | Dog |
| 4 | [0.0, 0.0] | BMW |
To illustrate simple indexing, let's use a simplified version of the NSW (Navigable Small World) algorithm. This algorithm establishes links between data points based on the distances between their embeddings:
```
# ID -> Closest IDs
1 -> 2, 3
2 -> 1, 3
3 -> 2, 4
4 -> 3, 2
```
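For illustration, here is a small Python sketch that builds this kind of link structure by connecting each point to its nearest neighbors. Note that real NSW constructs the graph incrementally, so the links depend on insertion order; this naive version won't necessarily reproduce the exact links above:

```python
import math

# ID -> (embedding, data), taken from the table above.
points = {
    1: ([0.0, 0.8], "Apple"),
    2: ([0.0, 0.7], "Banana"),
    3: ([1.0, 0.4], "Dog"),
    4: ([0.0, 0.0], "BMW"),
}

def distance(a, b):
    """Euclidean distance between two embeddings."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Link every point to its 2 nearest neighbors.
index = {}
for pid, (vec, _) in points.items():
    others = sorted(
        (o for o in points if o != pid),
        key=lambda o: distance(vec, points[o][0]),
    )
    index[pid] = others[:2]

print(index)
```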
## ANNS
ANNS (Approximate Nearest Neighbor Search) is a technique for efficiently finding the data points nearest to a given query, albeit approximately. While it may not always return the exact nearest data points, ANNS provides results that are close enough. This probabilistic approach balances accuracy against efficiency.
Imagine we have a query with specific constraints:

- Find the closest data to [0.0, 0.9].
- Calculate a maximum of 2 distances using the Euclidean distance formula.
Here's how we utilize the index created above to find the closest data point:

1. We start at a random data point, say 4, which is linked to 3 and 2.
2. We calculate the distances and find that 2 is closer to [0.0, 0.9] than 3.
3. We determine that the closest data to [0.0, 0.9] is Banana.
This method isn't perfect: in this case, the actual closest data point to [0.0, 0.9] is Apple. Under the same constraints, however, a linear search would rely heavily on chance to find the nearest data point. Indexing mitigates this issue by efficiently narrowing down the search based on the data's embeddings.
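The walkthrough above amounts to a greedy traversal of the link graph. Below is a rough Python sketch of that traversal, reusing `points` and `distance` from the indexing sketch earlier; `greedy_search` is an illustrative helper, not a real library API:

```python
# The hand-built links from the indexing section.
links = {1: [2, 3], 2: [1, 3], 3: [2, 4], 4: [3, 2]}

def greedy_search(query, start, budget=2):
    """Follow index links greedily, computing at most `budget`
    distances to candidate neighbors."""
    best_id, best_dist, spent = start, float("inf"), 0
    while spent < budget:
        current = best_id
        for neighbor in links[current]:
            if spent >= budget:
                break
            d = distance(query, points[neighbor][0])
            spent += 1
            if d < best_dist:
                best_id, best_dist = neighbor, d
        if best_id == current:  # no neighbor improved the result; stop
            break
    return points[best_id][1]

# Starting from point 4 with a budget of 2 distance calculations,
# the search settles on "Banana" even though "Apple" is the true nearest.
print(greedy_search([0.0, 0.9], start=4))
```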
In real-world applications with millions of data points, linear search becomes impractical. Indexing, however, enables swift retrieval by structuring data intelligently according to their embeddings.
Note that for managing billions of data points, sophisticated disk-based indexing algorithms may be necessary to ensure efficient data handling.
## RAG
RAG (Retrieval-Augmented Generation) is a framework that combines information retrieval and large language models (LLMs) to generate high-quality, contextually relevant responses to user queries. This approach enhances the capabilities of LLMs by incorporating relevant information retrieved from external sources into the model's input.
In practice, RAG works by retrieving relevant information from a vector database, which allows efficient searching for the most relevant data based on the user query. This retrieved information is then inserted into the input context of the language model, providing it with additional knowledge to generate more accurate and informative responses.
Below is an example of a prompt with and without RAG in a simple Q&A scenario:

**Without RAG**

```
What is the name of my dog?

LLM: I don't know.
```

**With RAG**

```
Based on the context below:
I have a dog named Pluto.
Answer the following question: What is the name of my dog?

LLM: The name of your dog is Pluto.
```
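In code, the "with RAG" prompt is just string assembly around a retrieval step. The sketch below uses a toy in-memory store and a fake `embed` function as stand-ins for a real vector database and embedding model:

```python
# Toy "vector database": (embedding, document) pairs.
documents = [
    ([0.9, 0.1], "I have a dog named Pluto."),
    ([0.1, 0.9], "My favorite food is pizza."),
]

def embed(text: str) -> list[float]:
    # Stand-in embedding; a real system calls an embedding model here.
    return [0.8, 0.2] if "dog" in text else [0.2, 0.8]

def retrieve(question: str) -> str:
    query = embed(question)
    # Return the document whose embedding is most similar (dot product).
    return max(documents, key=lambda d: sum(q * x for q, x in zip(query, d[0])))[1]

def build_rag_prompt(question: str) -> str:
    return (
        f"Based on the context below:\n{retrieve(question)}\n"
        f"Answer the following question: {question}"
    )

print(build_rag_prompt("What is the name of my dog?"))
```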
By integrating retrieval with generation, RAG significantly improves the performance of LLMs in tasks that require specific, up-to-date, or external information, making it a powerful tool for various applications such as customer support, knowledge management, and content generation.
## Token
A token is a unit of text that AI models use to process and understand natural language. Tokens can be words, subwords, or characters, depending on the model's architecture. Tokenization is a crucial preprocessing step in natural language processing (NLP) and is essential for breaking down text into manageable pieces that the model can process.
In this example, we'll use WordPunctTokenizer from the NLTK library to tokenize the sentence: "OasysDB is awesome."
```python
from nltk.tokenize import WordPunctTokenizer

tokenizer = WordPunctTokenizer()
tokens = tokenizer.tokenize("OasysDB is awesome.")
print(tokens)
```

```
['OasysDB', 'is', 'awesome', '.']
```
Tokenization plays a big role in LLMs and embedding models, and understanding it can help with everything from optimizing model performance to managing costs. Many AI service providers charge based on the number of tokens processed, so you'll often encounter this term when working with LLMs and embedding models, especially when estimating the cost of using a specific model.
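For example, assuming OpenAI's `tiktoken` package, you can count tokens and estimate cost like this (the price below is a made-up placeholder; check your provider's actual pricing):

```python
import tiktoken

# cl100k_base is the encoding used by several OpenAI models.
encoding = tiktoken.get_encoding("cl100k_base")

prompt = "OasysDB is awesome."
num_tokens = len(encoding.encode(prompt))

# Hypothetical price, purely for illustration.
price_per_1k_tokens = 0.0005
cost = num_tokens / 1000 * price_per_1k_tokens
print(f"{num_tokens} tokens, estimated cost: ${cost:.6f}")
```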
## Conclusion
These five concepts are crucial to understanding and implementing RAG effectively.
Thank you for reading! If you have any questions or if there's anything I missed, please let me know in the comments section.
If you found this article helpful, consider supporting OasysDB. We are developing a production-ready vector database that supports hybrid ANN searches from the ground up.
OasysDB is a hybrid vector database that allows you to utilize relational databases like SQLite and Postgres as a storage engine for your vector data without using them to compute expensive vector operations.

This allows you to consolidate your data into a single database and ensure high data integrity with the ACID properties of traditional databases while also having a fast and isolated vector indexing layer.
For more details about OasysDB, please visit the Documentation.
## Quickstart 🚀
Currently, OasysDB is only available for Rust projects as an embedded database. We are still working on implementing RPC APIs to allow you to use OasysDB in any language as a standalone service.
OasysDB has 2 primary components: Database and Index.
The Database is responsible for managing the vector indices and connecting the storage engine, the SQL database, to the indices as the data source. OasysDB uses SQLx…