The Adventures of Blink #27: LLMs in Code

Ben Link - Jun 13 - - Dev Community

So last week was pretty wild, huh? Our previous adventure saw us download GPT4All and add in some local data, which gave me the ability to have a conversation with my memoirs!

The UI has limited usefulness

GPT4All's application is a fantastic sandbox for ideas. We can start to load data and models and experiment to see how they interact. While being able to do that in the GPT4All interface was wicked cool, let's be real: this is a developer blog. Code needs to be written here! Besides... the real compelling thing about this AI revolution is what happens when models and data are able to interact "in the wild"... when we can start to use their unique capabilities by embedding them in other software.

Note: This is why Gemini and ChatGPT sandboxes are free-to-play. You might be wondering how they can afford to serve up so many requests... it's because the API calls are where the real money is!

Uh... so, where do we start?

This question confounded me for quite a while. I don't know if I was just obtuse, or if the documentation needs to be improved, or what... but I simply couldn't figure out where to get started with coding things that used LLMs! And that, my friends, is what we're going to tackle today - let's learn how to connect a python program to an LLM!

Finding the model

Fortunately, GPT4All makes it pretty easy to do this! We can start in the official docs. Right there on the front page you can see the simplest python app you can write to connect to a language model - just four lines:

# Load GPT4All and point it at a small local model
from gpt4all import GPT4All
model = GPT4All("orca-mini-3b-gguf2-q4_0.gguf")
# Ask a one-off question, capping the response at 3 tokens
output = model.generate("The capital of France is ", max_tokens=3)
print(output)

It's admittedly very crude - it's sort of the "hello world" of LLMs - but technically, that's enough to get started! We'll of course need to pip install gpt4all before we run our program... but then the model will respond with its guess as to the identity of France's capital.

Let's move on to the next step and have a conversation

Having a hard-coded question that generates a short response is kinda boring. It might be interesting if you were still learning Python, but wouldn't it be better if we had a loop of some kind that gave us an actual conversation?

You can see details on building this in the quickstart. We just need to open a model.chat_session()! That gives us a context in which the model remembers everything said so far, so we can interact with it repeatedly.
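
Here's a minimal sketch of what that looks like (assuming the same orca model from the hello-world above, already downloaded) - every generate() call inside the with block shares the same conversation:

from gpt4all import GPT4All

model = GPT4All("orca-mini-3b-gguf2-q4_0.gguf")

# Everything inside this block is one continuous conversation
with model.chat_session():
    print(model.generate("Name three French cities.", max_tokens=100))
    # The model can refer back to its earlier answer because we're
    # still inside the same chat_session
    print(model.generate("Which of those is the capital?", max_tokens=50))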

TL/DR: YouTube

Not a reader? Just click here and you can watch me build the same thing!

Note: I don't cover the vocabulary check below in the video, just the build...

A Quick Vocabulary Check

If this is your first time in LLM documentation, you've probably run across some words that didn't make sense. You might be able to infer their meaning from context, but instead of assuming you figured it all out, here's a handy list:

Embedding

An Embedding is a sort of "knowledge nugget" for a model. You create one by taking some input text and running it through a model, which turns it into a Vector. LLMs work by finding Embeddings that are "near" each other in space - the distance between two Embedding Vectors is a measure of how closely related the two pieces of text are. Using probabilities and measurements of these vector relationships is how an LLM can generate sentences that sound "normal" to humans... by predicting which words are likely to come next in the flow of thought.
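
If you want to see one of these with your own eyes, the gpt4all package I'm using ships an Embed4All helper (treat this as a sketch - if your version differs, check the docs). It turns a piece of text into a long list of numbers:

from gpt4all import Embed4All

embedder = Embed4All()  # downloads a small embedding model on first use
vector = embedder.embed("The quick brown fox jumps over the lazy dog")

# The "meaning" of that sentence is now a point in a high-dimensional space
print(len(vector))    # how many dimensions the space has
print(vector[:5])     # a peek at the first few values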

Generation

This is the ability of the model to create an appropriate response to an input. This requires you to have an LLM and an input, nothing else. You're at the mercy of whatever the model was trained on - so if it was built with the works of William Shakespeare, you're not going to find any Emily Brontë in there!
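
That hello-world generate() call is Generation in action. It also accepts some tuning knobs - the parameter names below (max_tokens, temp) come from the gpt4all version I'm running, so treat this as a sketch rather than gospel:

from gpt4all import GPT4All

model = GPT4All("orca-mini-3b-gguf2-q4_0.gguf")

# Same prompt, two different "creativity" settings
prompt = "Write a one-line motto for a bakery."
cautious = model.generate(prompt, max_tokens=30, temp=0.2)   # lower temp: more predictable
creative = model.generate(prompt, max_tokens=30, temp=1.0)   # higher temp: more adventurous

print(cautious)
print(creative)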

Model

A Model is a machine-learning algorithm that's been trained on a large data set, giving it a base of knowledge to evaluate your tokens against. Models vary in size and complexity, and the data science folks are regularly training new ones with increasing efficiency and skill to create smarter and smarter AI tools! A model's knowledge is limited to the data set it was trained on - for example, you couldn't train a model on Wikipedia articles and expect it to be a good surgeon's assistant!
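
If you're curious what models are out there, the gpt4all bindings I'm using can fetch the official model catalog. The method name and dictionary keys below are based on my installed version, so consider this a sketch and double-check against your docs:

from gpt4all import GPT4All

# Fetches the official GPT4All model catalog (needs internet access)
for entry in GPT4All.list_models():
    # Each entry is a plain dict describing one downloadable model
    print(entry.get("name"), "-", entry.get("filesize"), "bytes")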

RAG

RAG (Retrieval-Augmented Generation) is an architecture that allows you to "enhance" a model with embeddings of a specific dataset. For instance, if you wanted to download all of your company's policies, procedure manuals, and documentation, you could enhance a generic LLM like Llama3 with that dataset to create a "smart assistant" that knows about your business practices and can help with something like customer support or HR questions. In a RAG architecture you have a base model, a set of embeddings of your dataset, and some code that ties them together, using the model's Generation capabilities to interact with the data set's embeddings.
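
To make that architecture concrete, here's a deliberately tiny sketch of the RAG flow. The sample documents, the cosine-similarity helper, and the prompt wording are all made up for illustration - a real RAG app would use a vector store and proper document chunking, but the moving parts are the same ones described above:

from math import sqrt
from gpt4all import GPT4All, Embed4All

# 1. "Your data": a few made-up documents, each turned into an Embedding
docs = [
    "Our refund policy allows returns within 30 days of purchase.",
    "Support is available Monday through Friday, 9am to 5pm.",
    "New employees receive 15 days of paid time off per year.",
]
embedder = Embed4All()
doc_vectors = [embedder.embed(doc) for doc in docs]

# A plain-python stand-in for what a vector database does for you
def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(x * x for x in b)))

# 2. Retrieval: embed the question and find the closest document
question = "How long do customers have to return an item?"
q_vector = embedder.embed(question)
scores = [cosine_similarity(q_vector, v) for v in doc_vectors]
best_doc = docs[scores.index(max(scores))]

# 3. Augmented Generation: hand the retrieved context to the model
model = GPT4All("orca-mini-3b-gguf2-q4_0.gguf")
with model.chat_session():
    prompt = f"Using this context: {best_doc}\nAnswer this question: {question}"
    print(model.generate(prompt, max_tokens=100))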

Token

A Token is the smallest chunk of data that an LLM can work with. It's kinda analogous to a word (but not exactly, sometimes it could be a couple words, or maybe just a few characters). Tokens are used to measure the complexity of a query (and therefore used to measure how much it costs you as a consumer to make a given API call).

Vector

Just like in math & physics classes, this is an ordered list of numbers - a point in a multi-dimensional space. Every nugget of input that a model is trained on becomes a special Vector called an Embedding.

Back to building

So... a simple python program that creates a conversation loop would look like the following:

# Importing GPT4All and picking the orca model.  allow_download=False
# tells the library to use a copy of the model that's already on disk
# rather than fetching it from the internet.
from gpt4all import GPT4All
model = GPT4All('orca-mini-3b-gguf2-q4_0.gguf', allow_download=False)

# The system prompt provides instructions to the model about how to 
# respond.  You can change this to your preferences.
system_prompt = '### System:\nYou are my personal AI assistant.  You follow my instructions carefully and try to provide me with the most accurate information.\n\n'

# The prompt template helps the model understand the format of the 
# data it's going to parse.  This helps the model understand the flow
# of the conversation - you could theoretically set a delimiter here
# and it would keep processing until it found it.
prompt_template = '### User:\n{0}\n\n### Response:\n'

# Now we're ready to actually do something.  We create a chat_session 
# with the model, passing it the system_prompt and the 
# prompt_template, and everything in this block will be kept 
# contiguously as a "session".  The model will be able to use 
# all of the text in the conversation... but its "memory" will end
# when we exit the `with` block.
with model.chat_session(system_prompt=system_prompt, prompt_template=prompt_template):
    # loop forever, breaking out when the user's input contains 'quit'
    while True:
        user_input = input("User:")
        if 'quit' in user_input.lower():
            break
        else:
            # if the user didn't quit, we pass whatever the input was
            # to the model and get its response.
            print(f"\n\n{model.generate(user_input)}\n\n")

Impressively short code, isn't it? If I remove the comments, it's like 11 lines... to create an ongoing chat with a Large Language Model!

That's a wrap... for today 😉

While that's a super cool project to get started with, next week we're going to see if we can kick things up a notch. See, this LLM conversation is limited to the model's training data. Just like we talked about in last week's Adventure, we'd like to use our own data set alongside the LLM in a Retrieval-Augmented Generation (RAG) application. That lets us "train" our bot to handle specific information expertly... data that wasn't part of the model's original training. So tune in next week as we expand on this concept to create our very own RAG app!
