Previously, I introduced a generic RAG template, in which I mentioned that three core components are needed to build a high-quality RAG:
- an embedding model with semantic understanding
- an LLM with contextualized knowledge
- result compression with a reranker
When all of these are in place, you get a high-quality RAG, with or without fine-tuning.
Add high quality sources and accurate prompts, and you've got a complete RAG.
Simple, right?
Is it possible to containerize such a simple yet useful implementation and run it completely locally? Yes, of course.
Let's take the three models mentioned in the previous template as an example.
- Ollama plus TAIDE as the LLM
- BGE-M3 for embedding
- ms-marco-MultiBERT-L-12 as the reranker
Ollama with Models
Ollama is a completely local LLM framework; you can pull down the model you want with `ollama pull`.
Ollama itself provides a base container image: `docker pull ollama/ollama`.
Nevertheless, there is no simple way to get a model into this container image, so here's a little hack. Let me demonstrate with a Dockerfile.
```dockerfile
FROM ollama/ollama as taide_base
# Start the Ollama server during the build, wait briefly, then pull the TAIDE model into the image.
RUN nohup bash -c "ollama serve &" && sleep 5 && ollama pull cwchang/llama3-taide-lx-8b-chat-alpha1
```
We use Ollama's container image directly, wake up the Ollama service during the docker build, and download the model right there. This way we end up with an LLM framework that already contains the model.
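For reference, here is a minimal sketch of building and running such an image; the image and container names below are placeholders, not part of the original setup:

```bash
# Build the image; the TAIDE model is pulled during the build step shown above.
docker build -t local-rag-ollama .

# Run the container and expose Ollama's default port (11434).
docker run -d --name local-rag-ollama -p 11434:11434 local-rag-ollama
```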
Packaging BGE-M3
The BGE-M3 used here is a model supplied via HuggingFace, so all we need to do is locate the local HuggingFace model cache and copy it into the container.
In my environment (without modifying any settings), the model directory is at `~/.cache/huggingface/hub/models--BAAI--bge-m3`.
Therefore, we only need to COPY the contents of this directory into the container, as sketched below.
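A hedged Dockerfile sketch of that step; it assumes the cached model directory has already been copied next to the Dockerfile, and that `/app` is the image's working directory:

```dockerfile
# Copy the locally cached BGE-M3 model into the image's working directory.
COPY models--BAAI--bge-m3 /app/models--BAAI--bge-m3
```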
However, it is important to note that HuggingFace requires config.json when loading the model, and that file sits deep inside the snapshot directory.
```python
def init_embeddings():
    from langchain_huggingface import HuggingFaceEmbeddings

    # Point directly at the snapshot directory copied into the container.
    HF_EMBEDDING_MODEL = './models--BAAI--bge-m3/snapshots/5617a9f61b028005a4858fdac845db406aefb181'
    return HuggingFaceEmbeddings(
        model_name=HF_EMBEDDING_MODEL,
        model_kwargs={'device': 'cpu'},
        encode_kwargs={'normalize_embeddings': False},
    )
```
As this code shows, we need to point at the specific snapshot directory currently in use when loading the model.
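As a quick sanity check (a sketch using the standard LangChain embeddings interface), we can embed a sample query; for BGE-M3 the resulting vector should have 1024 dimensions:

```python
embeddings = init_embeddings()

# Embed a sample query; BGE-M3 produces 1024-dimensional dense vectors.
vector = embeddings.embed_query("What is retrieval-augmented generation?")
print(len(vector))  # expected: 1024
```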
That leaves the last piece: the reranker.
Packaging ms-marco-MultiBERT-L-12
The ms-marco-MultiBERT-L-12 used here is integrated through LangChain. By default, LangChain's document_compressors place the model in /tmp.
In other words, when we run the following code, the model is downloaded into /tmp.
```python
from langchain.retrievers.document_compressors import FlashrankRerank

# Without an explicit client, FlashRank downloads the model to /tmp by default.
compressor = FlashrankRerank(model_name='ms-marco-MultiBERT-L-12', top_n=5)
```
So what we need to do is copy /tmp/ms-marco-MultiBERT-L-12 into the container.
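Again as a hedged sketch, assuming the model directory was first copied from /tmp into the build context and that `/app` is the image's working directory:

```dockerfile
# Copy the reranker model (downloaded to /tmp on the host) into the image.
COPY ms-marco-MultiBERT-L-12 /app/ms-marco-MultiBERT-L-12
```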
But that's not enough: we also need to explicitly tell the client that the model directory is now the container's current directory. This is easier to show than to explain, so let's look at an example.
```python
from flashrank import Ranker
from langchain.retrievers.document_compressors import FlashrankRerank

# Load the reranker from the current directory, where the model was copied into the container.
ranker = Ranker(model_name='ms-marco-MultiBERT-L-12', cache_dir='.')
compressor = FlashrankRerank(client=ranker, top_n=5)
```
All right, we've got the three models we need in the container.
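To show how the pieces might be wired together, here is a sketch of a compression retriever on top of an existing vector store; `vectorstore` is assumed to have been built beforehand with init_embeddings(), and the exact chain wiring will vary with your setup:

```python
from flashrank import Ranker
from langchain_community.chat_models import ChatOllama
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import FlashrankRerank

# LLM served by the Ollama container built earlier.
llm = ChatOllama(
    model="cwchang/llama3-taide-lx-8b-chat-alpha1",
    base_url="http://localhost:11434",
)

# Reranker loaded from the directory copied into the container.
ranker = Ranker(model_name="ms-marco-MultiBERT-L-12", cache_dir=".")
compressor = FlashrankRerank(client=ranker, top_n=5)

# `vectorstore` is assumed to exist, built with the BGE-M3 embeddings above.
retriever = ContextualCompressionRetriever(
    base_compressor=compressor,
    base_retriever=vectorstore.as_retriever(search_kwargs={"k": 20}),
)

docs = retriever.invoke("What is TAIDE?")
context = "\n\n".join(doc.page_content for doc in docs)
answer = llm.invoke(f"Answer based on the context below.\n\n{context}\n\nQuestion: What is TAIDE?")
print(answer.content)
```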
Conclusion
Although this article provides a containerized RAG solution, I have to point out that the container image weighs in at 18 GB.
If we also packaged the embedded vectors built from the source documents, it would easily exceed 20 GB.
Therefore, this container is only suitable for simple testing; it is not really built to scale, so use it with care.