Abstract
This is the first part of a series of blog posts on using a RAG (Retrieval Augmented Generation) information system for coding. Find out how this can empower your development team.
In this episode, we will look at a very simple RAG system for Ruby built with LangChain, JinaAI embeddings and a very lightweight local LLM served via Ollama.
If you donʼt know what any of this means, this blog post is for you. You will see how to benefit from the technology without uploading/donating your codebase to external providers like OpenAI.
It is an introduction; the coding follows in the next episodes.
Naming is hard, and I wanted something that goes against the flashy AI vibe. So let's call this a Granny RAG.
The next parts will then move on to the following topics:
- examine different embedding models
- splitting and chunking strategies
- RAG for and with Elixir
Letʼs start.
A Use Case
Imagine you are a developer and you just started a new job. The code base looks well maintained, the team is nice. Could be bitcrowd. You grab your first ticket, and it looks like this:
Uff. No description. Normally, you would reach out to your onboarding buddy and ask for advice. This ticket is clearly written with a developer who knows the application in mind.
But ... you could also quickly investigate. Your team does have a Granny after all!
You go to your local granny. Granny RAG is in fact just a light off-the-shelf application wrapper around a simple RAG system, but you are used to calling her Granny RAG, because she is patient, kind and knows all the stories.
Letʼs ask her:
Hy Granny, I need to show the end date of the subscriptions in the “subscription list”. What is a subscription in the application?
Granny: To display the end dates of subscriptions in the list of subscriptions, you can modify the COLLECTION_ATTRIBUTES constant in your SubscriptionDashboard class to include the active_until field. Hereʼs an updated version of the SubscriptionDashboard class with the modified COLLECTION_ATTRIBUTES:
```ruby
class SubscriptionDashboard < ApplicationDashboard
  ATTRIBUTE_TYPES = {
    course: Field::BelongsTo.with_options(order: :name),
    user: Field::BelongsTo.with_options(order: :name),
    user_id: Field::Number,
    id: Field::String,
    active_until: Field::DateTime,
    deactivated_at: Field::DateTime,
    active?: Field::Boolean,
    product_type: EnumField,
    created_at: Field::DateTime,
    updated_at: Field::DateTime
  }.freeze

  COLLECTION_ATTRIBUTES = %i[
    course
    user
    id
    active?
    product_type
    active_until
  ].freeze
```
Wow, this is magic!
Without knowing the application too well, you came from a non-workable ticket to a plan of action. You can now choose to update the ticket description, or create a PR and use that for the discussion.
ℹ️ Note
This is a real life ticket from a client.
This ticket came from the work at our client MedGurus. When I tried out my Granny RAG, I browsed through the tickets in progress and found this one. I compared the solution Granny RAG proposed with the PR that came out of this ticket. I was actually thrilled how well this works.
My conversation with Granny RAG was actually a bit longer. Here is the full conversation with Granny RAG.
How does it all work?
Granny RAG is a RAG system. That is short for Retrieval Augmented Generation. If you are looking for a quick intro, here is a nice video by Marina Danilevsky.
In essence, RAG improves the quality of LLM responses by enriching user prompts with relevant contextual information. It retrieves this information from an efficiently searchable index of your entire project, generated with the help of an embedding model.
Embedding models
It's not easy to say something simple about the embedding process without being incorrect. Embedding models generate a representation of the "meaning" of a sequence of text. This "meaning" is represented as a vector called an "embedding": a long array of numbers that represents semantic meaning within the given context.
Tokens with a similar meaning in the source document get embedding vectors “close to each other” by some distance measurement.
A suitable model will place expressions with similar meaning in similar regions of its vector space. So `subscription` will be next to `activation` and `active_until`.
You can think of the process as hashing with a hashing function that understands the input.
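To make this a bit more concrete, here is a minimal sketch of the embedding step in Python. It assumes the `sentence-transformers` package and the Hugging Face model id `jinaai/jina-embeddings-v2-base-code`; the exact model id our scripts use may differ.

```python
# Minimal sketch of embedding a few expressions and comparing their "closeness".
# Assumes the sentence-transformers package and the Jina code embedding model.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer(
    "jinaai/jina-embeddings-v2-base-code", trust_remote_code=True
)

texts = ["subscription", "activation", "active_until", "banana bread recipe"]
vectors = model.encode(texts)  # one embedding vector per input text

def cosine_similarity(a, b):
    # close to 1.0 means "similar meaning", close to 0 means "unrelated"
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

for text, vector in zip(texts[1:], vectors[1:]):
    print(f"subscription vs. {text}: {cosine_similarity(vectors[0], vector):.2f}")
```

The related terms should score noticeably higher than the banana bread recipe, which is the whole point of the index we are about to build.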
Retrieval
Later, when the user asks a question, we throw it into the same embedding function to get an index for it. With that, we look up which sequences of text occupy a similar region of the vector space.
There are multiple strategies for measuring this similarity. We will explore similarity in more depth in the second post of this series. For now, let's assume we found entries "close" to the index we got for the search term.
Each of those entries carries a piece of text and some metadata. The metadata tells us more about the source, e.g. which file it came from. Up to this point, we have built a more intelligent search function: it finds `active_until` even if you searched for "end date", something a classic full-text index would not find.
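Here is a sketch of that retrieval step with LangChain and the Chroma vector store. The code snippets, file paths and the choice of Chroma are illustrative only, not necessarily what the Granny RAG scripts use.

```python
# Sketch of the retrieval step with LangChain and Chroma. The snippets and
# file paths are made up for illustration; the real index holds the codebase.
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import Chroma
from langchain_core.documents import Document

embedding = HuggingFaceEmbeddings(
    model_name="jinaai/jina-embeddings-v2-base-code",
    model_kwargs={"trust_remote_code": True},
)

docs = [
    Document(
        page_content="COLLECTION_ATTRIBUTES = %i[course user id active_until].freeze",
        metadata={"source": "app/dashboards/subscription_dashboard.rb"},
    ),
    Document(
        page_content="scope :active, -> { where('active_until > ?', Time.current) }",
        metadata={"source": "app/models/subscription.rb"},
    ),
]

store = Chroma.from_documents(docs, embedding)

# The question goes through the same embedding model, then we look up neighbours.
hits = store.similarity_search("Where is the end date of a subscription shown?", k=2)
for hit in hits:
    print(hit.metadata["source"], "->", hit.page_content[:60])
```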
In an “old fashioned” information system, we would output those magical pieces of text and leave it to the reader to go through them, understand their meaning and evaluate their relevance.
"But wait", you say, "are there not these new cool kids on the block, the LLMs™, that are brilliant at exactly that?" You are right, this is exactly what RAG systems do.
Context
Attention: we will be simplifying heavily. If you would like a lightweight intro, head over to this Hugging Face course, or this series of videos from 3Blue1Brown.
It boils down to this: when LLMs generate text, they predict the next word, or fill gaps in a text. They take this a step at a time, a bit like friends finishing each other's sentences.
Then they look at the text created, including the new word, and predict the next word, and the next. Put differently, they try to find the piece of text or the character that is most likely to make sense in the previously generated context.
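You can watch this step-by-step generation happen with a local model. The sketch below streams chunks from Llama3 via Ollama and LangChain; it assumes Ollama is running locally and the llama3 model has been pulled.

```python
# Watching a local model generate one chunk at a time via Ollama and LangChain.
# Assumes Ollama is running locally and `ollama pull llama3` has been done.
from langchain_community.llms import Ollama

llm = Ollama(model="llama3")

for chunk in llm.stream("Finish this sentence: A subscription ends when"):
    print(chunk, end="", flush=True)
```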
Here is an example for a prompt that uses RAG:
You are an assistant for question-answering tasks. Use the following pieces of
retrieved context to answer the question. If you donʼt know the answer, just
say that you donʼt know.
Use three sentences maximum and keep the answer concise. # (1)
--
Question: “What would I need to change to show the active_until date in the list
of subscriptions?” # (2)
Context: {context} # <- The RAG magic happens here
Answer: # (3)
ℹ️ Info
A system prompt tells the LLM what is expected from it (1), then a question specifies the task (2), and the "please fill in your answer here" part (3) is what LLMs are used to working with.
LLMs do so, again, based on vector representations, starting from a seed: often the system prompt and the user's instructions.
The idea of RAG is that if you include facts that you researched into your prompt, the context for the generation is narrowed down significantly compared to a prompt that does not include those facts. Retrieval Augmented Generation is an effective countermeasure against hallucinations. It does not stop them, but makes them less likely.
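Putting the pieces together, here is a hedged sketch of the "augmented generation" step with LangChain and Ollama. The retrieved snippets below are hardcoded placeholders standing in for whatever the vector store returned; the prompt mirrors the template shown above.

```python
# Sketch of the "augmented generation" step: retrieved snippets are pasted into
# the prompt template from above and a local model answers. The snippets here
# are placeholders; in the real system they come from the retrieval step.
from langchain_community.llms import Ollama
from langchain_core.prompts import PromptTemplate

prompt = PromptTemplate.from_template(
    """You are an assistant for question-answering tasks. Use the following pieces of
retrieved context to answer the question. If you don't know the answer, just say
that you don't know. Use three sentences maximum and keep the answer concise.

Question: {question}
Context: {context}
Answer:"""
)

retrieved_snippets = [
    "# app/dashboards/subscription_dashboard.rb (placeholder)\nCOLLECTION_ATTRIBUTES = %i[course user id active?].freeze",
    "# app/models/subscription.rb (placeholder)\n# active_until: datetime",
]

chain = prompt | Ollama(model="llama3")  # the filled-in prompt is piped into the LLM
answer = chain.invoke({
    "question": "What would I need to change to show the active_until date in the list of subscriptions?",
    "context": "\n\n".join(retrieved_snippets),
})
print(answer)
```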
💡 Tip
Outline the Important Findings in your follow up
All LLM-based systems hallucinate at some point. RAG helps to avoid that, but as you can see in 5. Follow Up, even retrieval-based systems stray from the truth at times. You can detect that because the information in 5. Follow Up does not align with the previous answers.
If this happens, it helps to outline the previous facts in the next prompt, as I did in 6. Follow Up:
...
"You said before that the `COLLECTION_ATTRIBUTES` are responsible for the list of `Subscriptions` (...)" ...
Mentioning previous findings in the new prompt amplifies them in the context. This steers the conversation in the direction you like and helps the LLM to let go of the hallucinations.
The Code
You can find the code on bitcrowdʼs GitHub, which is probably easier to read than going through the inline version here.
We will use LangChain, which is a Python library. There are also implementations of it in JavaScript and Elixir. We use Python to start with, because it is the common language for machine learning, so prototyping is easy. Once we know what we want, we will explore the Elixir way. So please make sure you have Python installed.
We use JinaAI embeddings, a very capable embedding model that can be used locally under the Apache 2 licence.
Finally, you need Ollama, or any other tool that lets you run a model and expose it via a web endpoint. In the example, we use Meta's Llama3 model. Models like CodeLlama:7b-instruct also work. Feel free to change the `.env` file and experiment with different models.
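As an illustration of swapping models, the snippet below reads the model name from an environment variable before handing it to Ollama. The variable name `LLM_MODEL` is an assumption made for this sketch; check the repo's `.env` for the actual name it uses.

```python
# Hypothetical sketch: pick the Ollama model from an environment variable so
# you can switch between llama3, codellama:7b-instruct, etc. The variable name
# LLM_MODEL is an assumption made for this example.
import os

from langchain_community.llms import Ollama

llm = Ollama(model=os.getenv("LLM_MODEL", "llama3"))
print(llm.invoke("In one sentence: what is a subscription?"))
```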
Test Drive
We need some Ruby codebase that we can try this out on. Let's go for Discourse.
We're interested in the logic Discourse uses to prevent users from re-using an old, expired password, and in which files in the code are relevant for this. So we ask it:
How does Discourse keep track of which passwords a user already used? So that when changing their password, a user does not use the same, expired password again.
Granny RAG answers:
According to the context you provided, Discourse keeps track of which passwords a user already used by storing the old password hash and algorithm in the `user_passwords` table. This is done through the following mechanisms:
- The `user_passwords` table stores the user's previous passwords along with their corresponding salts and algorithms.
- When a user changes their password, Discourse generates a new password hash using the new password, salt, and algorithm.
- Discourse then checks if the new password hash is already present in the `user_passwords` table for that user. If it is, it means the user has used this password before, and therefore, it's considered expired.
By storing the old password hashes and algorithms in the database, Discourse can easily verify whether a user has used a particular password before and prevent them from using an expired password again.
This approach ensures that users cannot reuse the same password twice, even if they change their password multiple times.
…
Without knowing all the details of Discourse's codebase, this seems like a perfectly valid answer. It gives us a nice summary of the feature and an idea of where to go looking for the implementation in Discourse's otherwise overwhelmingly big codebase.
Wrap up
For our InfoServer Granny RAG, we created a sophisticated search index for our codebase. This index allows us to infer meaning. We can now search for an answer by inferring the meaning of the question and retrieve relevant information by looking at what is close in the vector space.
We fed this information to an LLM to understand and prioritize the sources. We made the LLM create answers to our questions, based on its training data and the information retrieved.
So with a few off-the-shelf, open-source components, we managed to get expert knowledge about our codebase, delivered to us by our friendly LLM companion, hosted on our own premises.
How is this different from Copilot, Duo and friends?
If you ask Granny RAG a question, it can draw information from the whole codebase. It is able to incorporate all kinds of documents, and can be extended to incorporate additional data sources. Granny RAG operates on resource-efficient, local LLMs.
No data needs to leave your control.
The scripts that ingest and embed your data and code can be specific to your needs - as is your codebase. That way, you can even specify what should, and what should not, find its way into your RAG knowledge base.
Copilot and GitLab Duo have a much narrower field of vision. Their context is primarily the files opened in the editor, or the PR. That means that once you know where to look, they can be helpful: both to you and to their creators, who can (and probably will) use some data to improve their models. Even if, per contract, your data and code should not be shared with GitLab or Microsoft, you lose all control once your data leaves your premises.
If you set these concerns aside, you still have little control over what makes its way into the LLMs that are hosted on remote servers.
Here again, Granny RAG is different. You can collect data from usage and reactions, and you can use that data to train both the LLM and the embedding model on your data and needs.
That way, new arrivals in your dev team get an assistant that is steadily improving. Granny RAG can integrate into a Slack channel to provide a first opinion, and take feedback from the more seasoned developers to improve.
All in all, Granny RAG is a concept that can (and should) be adapted to your use case and needs. It's not a subscription you buy, but a technique your team learns to master. You invest in consulting or learning time, and you get control and excellent knowledge about the core of your business logic.
Try it yourself!
It is really easy! Just clone our repo, follow the README and tell the script where to find your codebase:
CODEBASE_PATH="./path-to-my-codebase"
CODEBASE_LANGUAGE="ruby"
We kept the scripts basic, so that they are easy to understand and extend. Depending on your codebase, the results might not always be perfect, but often surprisingly good.
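For orientation, here is a rough sketch of what such an ingestion script can look like with LangChain's building blocks: a file loader, a Ruby-aware splitter, and the embedding and indexing step. It is not the repo's exact script, just the same idea in a few lines; the index path is made up.

```python
# Sketch of an ingestion script: load all Ruby files, split them along
# Ruby-aware boundaries, embed them and store them in a local vector index.
# This mirrors the idea of the repo's scripts, not their exact code.
import os

from langchain_community.document_loaders import DirectoryLoader, TextLoader
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import Chroma
from langchain_text_splitters import Language, RecursiveCharacterTextSplitter

codebase_path = os.getenv("CODEBASE_PATH", "./path-to-my-codebase")

# Load every .rb file in the codebase as a separate document
loader = DirectoryLoader(codebase_path, glob="**/*.rb", loader_cls=TextLoader)
documents = loader.load()

# Split along Ruby constructs (classes, modules, methods) so chunks stay meaningful
splitter = RecursiveCharacterTextSplitter.from_language(
    language=Language.RUBY, chunk_size=1000, chunk_overlap=100
)
chunks = splitter.split_documents(documents)

# Embed with the Jina code model and persist the index locally
embedding = HuggingFaceEmbeddings(
    model_name="jinaai/jina-embeddings-v2-base-code",
    model_kwargs={"trust_remote_code": True},
)
Chroma.from_documents(chunks, embedding, persist_directory="./granny_index")
print(f"Indexed {len(chunks)} chunks from {codebase_path}")
```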
Outlook
In this introductory post, we saw what a little off-the-shelf system can achieve. Itʼs already impressive, and it only uses local models, namely Llama3 and JinaAI Code.
You will find that this off-the-shelf solution lacks precision in some use cases. To improve this, in the next episodes of this blog post series we will explore how changes to the parsing, chunking and embedding strategies affect performance.
Or, if you canʼt wait, give the team at bitcrowd a shout via granny-rag@bitcrowd.net or book a consulting call here.