GEN AI for JavaScript Devs: Working with the OpenAI SDK

Arsalan Ahmed Yaldram - Sep 20 - Dev Community

Introduction

Last time, we set up our project with the OpenAI SDK and made our first API call. Now, we're going deeper into the SDK's features and exploring key concepts in large language models (LLMs). In this post, we'll cover:

  • Memory in LLMs: How to keep context in chatbots.
  • Temperature: What it is and how it affects LLM responses.
  • Top-P and Top-K: Tweaking these to improve LLM responses.

While we're using OpenAI's SDK, these concepts apply to most LLMs and their tools.

Memory in LLMs: Understanding Context and Pricing

LLMs don't have built-in memory: they don't remember previous conversations on their own. They're designed to generate text based on the input they receive at that moment. This means each time you ask a question, it's like starting a new conversation. Let's look at an example:

import OpenAI from "openai";

// The client reads the OPENAI_API_KEY environment variable by default.
const openai = new OpenAI();

async function main() {
  const responseOne = await openai.chat.completions.create({
    model: "gpt-4o-mini",
    messages: [
      {
        role: "system",
        content: "You are a friendly teacher, teaching 4 year olds.",
      },
      {
        role: "user",
        content: "What is the capital of India?",
      },
    ],
  });

  console.log("First Response", responseOne.choices[0].message);

  // A second, separate request: it carries no trace of the first question or answer.
  const responseTwo = await openai.chat.completions.create({
    model: "gpt-4o-mini",
    messages: [
      {
        role: "user",
        content: "How large is it?",
      },
    ],
  });

  console.log("Second Response", responseTwo.choices[0].message);
}

main();

The responses:

First Response {
  role: 'assistant',
  content: "The capital of India is New Delhi! It's a big city where the government works and many important decisions are made. Do you want to know more about New Delhi?",
  refusal: null
}

Second Response {
  role: 'assistant',
  content: 'Could you please provide more context or specify what "it" refers to? That way, I can give you a more accurate answer.',
  refusal: null
}

As you can see, the second response doesn't know we were talking about New Delhi. The LLM treats each request as brand new.

Managing Memory: The Developer's Job

async function main() {
  const messages = [
    {
      role: "system",
      content: "You are a friendly teacher, teaching 4 year olds.",
    },
  ];

  messages.push({
    role: "user",
    content: "What is the capital of India?",
  });

  const responseOne = await openai.chat.completions.create({
    model: "gpt-4o-mini",
    messages,
  });

  console.log("First Response", responseOne.choices[0].message);

  // Add the assistant's reply to the history so the next request has context.
  messages.push({
    role: responseOne.choices[0].message.role,
    content: responseOne.choices[0].message.content,
  });

  messages.push({
    role: "user",
    content: "How large is it?",
  });

  // The full history (system + user + assistant + user) is sent again.
  const responseTwo = await openai.chat.completions.create({
    model: "gpt-4o-mini",
    messages,
  });

  console.log("Second Response", responseTwo.choices[0].message);
}

main();

Above is a simple way to maintain context in our conversations with the LLM. We can keep track of all messages in a messages array, which includes both user inputs and AI responses. Whenever we make a new request to the LLM, we send the entire messages array along with it.

This method is quite straightforward. It’s easy to implement, and it ensures that the LLM has access to the full conversation history. But there is an issue. As conversations get longer, passing the entire history to the LLM for each new message increases the token count. This directly impacts your costs. For example, after 20 messages, you'd be sending all 20 previous messages plus the new one each time.

There are more sophisticated ways to handle memory, like using libraries such as LangChain's Memory feature. These can help manage long conversations more efficiently, but they're complex enough to warrant their own tutorial in the future.

Other LLM providers, like Cohere, make it even easier. Their SDK allows you to simply add a conversation_id, and it will automatically manage the conversation history and context for you.
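
As a rough illustration only (the exact method and parameter names depend on your Cohere SDK version, so treat this as an assumption and check Cohere's docs), such a call might look like this:

import { CohereClient } from "cohere-ai";

const cohere = new CohereClient({ token: process.env.COHERE_API_KEY });

// Assumed usage: Cohere keeps the history server-side for this conversation id.
const response = await cohere.chat({
  message: "How large is it?",
  conversationId: "geography-lesson-1", // hypothetical id chosen by you
});

console.log(response.text);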

A few approaches to LLM memory management are highlighted below; a code sketch of the "keep only recent messages" approach follows the table.

Approach | Pro | Con
Pass the entire conversation | Maintains full context. | Can quickly hit token limits and increase costs.
Keep only recent messages | Balances context and token count. | May lose some important earlier context.
Summarize previous messages | Maintains context while reducing token count. | Requires additional LLM processing and may lose some details.
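
Here's a minimal sketch of the "keep only recent messages" approach: a hypothetical trimHistory helper that keeps the system prompt plus only the most recent messages (the window size of 6 is an arbitrary choice for illustration):

// Hypothetical helper: keep the system prompt and only the last `windowSize` messages.
function trimHistory(messages, windowSize = 6) {
  const [systemMessage, ...rest] = messages;
  return [systemMessage, ...rest.slice(-windowSize)];
}

// Send the trimmed history instead of the full one on every request.
const response = await openai.chat.completions.create({
  model: "gpt-4o-mini",
  messages: trimHistory(messages),
});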

Pricing Control: Limiting Tokens

async function main() {
  const response = await openai.chat.completions.create({
    model: "gpt-4o-mini",
    messages: [
      {
        role: "system",
        content: "You are a friendly teacher, teaching 4 year olds.",
      },
      {
        role: "user",
        content: "What is the capital of India?",
      },
    ],
  });

  console.dir(response, { depth: null, colors: true });
}

I discussed tokens in depth in my previous post. Tokens are the basic units that LLMs process: they can be parts of words, whole words, or even punctuation marks. LLM providers charge based on the number of tokens used, both in your input (prompt) and in the generated output. In the example above, we print the whole response object:

{
  id: 'chatcmpl-9wUBsQCvYQGWOvZif7UWrPl0vPefs',
  object: 'chat.completion',
  created: 1723726104,
  model: 'gpt-4o-mini-2024-07-18',
  choices: [
    {
      index: 0,
      message: {
        role: 'assistant',
        content: "The capital of India is New Delhi! It's a big city with lots of important buildings and places. Would you like to know more about India or its culture?",
        refusal: null
      },
      logprobs: null,
      finish_reason: 'stop'
    }
  ],
  usage: { prompt_tokens: 30, completion_tokens: 32, total_tokens: 62 },
  system_fingerprint: 'fp_507c9469a1'
}

In the LLM's response, you can see the usage field. Here, 62 tokens were used in total (30 for the prompt and 32 for the completion), and that's what you'll be charged for. You can check the pricing for OpenAI models on OpenAI's pricing page.
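
As a rough sketch, you can turn that usage object into an estimated cost. The per-million-token rates below are placeholders, so check OpenAI's pricing page for the current numbers before relying on them:

// Placeholder rates in USD per million tokens - look up current pricing for your model.
const INPUT_PRICE_PER_MILLION = 0.15;
const OUTPUT_PRICE_PER_MILLION = 0.6;

function estimateCost(usage) {
  const inputCost = (usage.prompt_tokens / 1_000_000) * INPUT_PRICE_PER_MILLION;
  const outputCost = (usage.completion_tokens / 1_000_000) * OUTPUT_PRICE_PER_MILLION;
  return inputCost + outputCost;
}

console.log(`Estimated cost: $${estimateCost(response.usage).toFixed(6)}`);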

To check token counts before sending requests:

  1. Use OpenAI's 'tiktoken' package (a quick sketch follows this list).
  2. Or use this handy tool: OpenAI Platform Token Counter.
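
Here's a minimal sketch of counting tokens with the tiktoken package before you send anything. It assumes a tiktoken version that ships the o200k_base encoding, which the GPT-4o family of models uses:

import { get_encoding } from "tiktoken";

// o200k_base is the encoding used by the GPT-4o family of models.
function countTokens(text) {
  const encoding = get_encoding("o200k_base");
  const tokenCount = encoding.encode(text).length;
  encoding.free(); // release the WASM-backed encoder
  return tokenCount;
}

console.log("Prompt tokens:", countTokens("What is the capital of India?"));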

When building apps with LLMs, it's crucial to manage costs. Here are two ways to limit token usage:

  1. Limiting Input Tokens: Count the tokens in your prompt with the 'tiktoken' package (as sketched above) and shorten the prompt if it's too long.
  2. Limiting Output Tokens: Use the 'max_tokens' parameter in your API call. This sets a cap on how many tokens the LLM can generate in its response.
const response = await openai.chat.completions.create({
  model: "gpt-4o-mini",
  messages: [/* your messages here */],
  max_tokens: 100  // This limits the response to 100 tokens
});

Why Limit Tokens?

  • Cost Control: Prevent unexpected high bills.
  • Performance: Shorter responses can be generated faster.
  • Focused Answers: Encourages more concise, targeted responses.

Remember, finding the right balance is key. Too few tokens might cut off important information, while too many could lead to unnecessary costs.

Fine-tuning LLM Responses: What is LLM Temperature?

As discussed in the previous post, LLMs are text generation models. An LLM starts by reading the context you've provided (the prompt), much like reading a book and trying to guess the next word.

Example: If the LLM is completing the sentence "The sky is ______," it might assign probabilities like this:

  • "blue" (40% chance)
  • "cloudy" (30% chance)
  • "clear" (20% chance)
  • "beautiful" (10% chance)

The LLM is likely to choose "blue," but sometimes it might surprise you. This variety makes LLM responses feel natural and diverse.
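
You can actually peek at these probabilities through the Chat Completions API's logprobs option. Here's a small sketch (the prompt and the number of alternatives are arbitrary choices):

const response = await openai.chat.completions.create({
  model: "gpt-4o-mini",
  messages: [{ role: "user", content: "Complete the sentence: The sky is" }],
  max_tokens: 1,
  logprobs: true,   // return log probabilities for each generated token
  top_logprobs: 4,  // also return the 4 most likely alternatives per token
});

// Log probabilities for the first generated token and its top alternatives.
console.log(response.choices[0].logprobs.content[0].top_logprobs);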

Temperature is a setting that controls how creative or predictable an AI's responses will be. Think of it like a creativity knob you can adjust. Here's how temperature works:

  • Low Temperature (0 to 0.5): The AI plays it safe. It gives predictable, common answers. Good for tasks needing accurate, consistent responses. Example: Answering factual questions, a quiz chatbot, etc.
  • High Temperature (0.5 to 1): The AI gets more creative. It might give unusual or unexpected answers. Good for tasks needing originality. Example: Brainstorming ideas, writing stories, a sales pitch chatbot, etc.

Example in code:

const response = await openai.chat.completions.create({
  model: "gpt-4o-mini",
  messages: [/* your messages here */],
  temperature: 0.7  // Adjust this value between 0 and 1
});
A couple of things to keep in mind:

  • Low temperature doesn't mean "better" - it depends on what you need.
  • Experiment with different settings to find what works best for your task.

By adjusting temperature, you can fine-tune the AI's responses to better suit your needs, whether you're looking for creativity or consistency. You can try playing around with different temperature values for your prompts at the OpenAI Playground.
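
Or compare settings directly in code. Here's a quick sketch that sends the same prompt at two different temperatures (the prompt itself is just an example):

async function compareTemperatures() {
  for (const temperature of [0, 1]) {
    const response = await openai.chat.completions.create({
      model: "gpt-4o-mini",
      messages: [{ role: "user", content: "Give me a tagline for a tea shop." }],
      temperature,
    });
    console.log(`temperature ${temperature}:`, response.choices[0].message.content);
  }
}

compareTemperatures();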

Fine-tuning LLM Responses: Top-P & Top-K parameters

Now, let’s talk about top-p and top-k, which influence how the LLM chooses words.

Top-P (Nucleus Sampling): Imagine you're at an ice cream shop, and the flavors represent words. Top-P is like saying, "I'll only look at flavors that make up 75% of all sales." This approach is flexible—you might consider just a few super popular flavors or a larger mix.

Example: If top-p is 0.75 and you have:

  • Vanilla (40% popularity)
  • Chocolate (35% popularity)
  • Strawberry (15% popularity)
  • Mint (10% popularity)

You'd only consider Vanilla, Chocolate, and Strawberry, because together they cover the 75% threshold, while Mint is cut off. This is how the LLM narrows down its word choices when you set the top_p parameter.

const response = await openai.chat.completions.create({
  model: "gpt-4o-mini",
  messages: [/* your messages here */],
  top_p: 0.75
});

Top-K: On the other hand, top-k is like saying, "I'll only look at the top 3 most popular flavors," regardless of their exact popularity. Example: if top-k is 3, you always consider Vanilla, Chocolate, and Strawberry. This is how the LLM narrows down its word choices when you set the top_k parameter. You can play with all of these parameters in the OpenAI Playground. The OpenAI SDK doesn't offer a top-k setting, but other LLM providers do, like Anthropic's Claude:

import Anthropic from "@anthropic-ai/sdk";

const anthropic = new Anthropic(); // reads the ANTHROPIC_API_KEY environment variable

const response = await anthropic.messages.create({
  model: "claude-3-5-sonnet-20240620",
  max_tokens: 1024,
  messages: [{ role: "user", content: "Hello, Claude" }],
  top_k: 50, // only sample from the 50 most likely tokens at each step
});

How They Differ from Temperature: While top-p and top-k decide which words the LLM can choose from, the temperature setting adjusts how "daring" the AI is in making its choices.

Conclusion

In this tutorial, we explored the OpenAI SDK, focusing on memory, conversational context, and tokens. We discussed how tokens impact pricing and how to fine-tune LLM responses using parameters like temperature, top-p, and top-k. In the next tutorial, we'll dive into some prompting techniques with the OpenAI SDK. Until then, keep coding!
