How OpenAI o1 works in a simple way and why it matters for RAG and Agentic 🤯

ravindu somawansa - Sep 18 - Dev Community

You want to learn, in a simple way, how OpenAI's newest model o1 works and why it is a revolution? You also want to know why it matters for RAG and Agentic? Say no more, this is exactly what we are going to see!

Representation of the o1 model

How a traditional LLM works

Before discussing the o1 model, we need to talk about how a traditional LLM works:

LLMs are gigantic neural networks (with billions of parameters) that are trained on an equally gigantic corpus of data (basically all the data that can be found on the web “ethically”).
Many techniques (like self-attention) are used so that the model can capture the nuances and patterns of words and sentences.
They are trained to predict the next word in a sentence. That is why, if you give one a question, it will try to predict the next words, which form the answer (a minimal sketch of this loop follows the list).
The training process consists of two phases: the pre-training, where the model learns general language understanding from vast datasets, and the post-training, where it is fine-tuned for specific tasks and aligned with human preferences.
Some of the largest models are suspected to use an architecture called Mixture of Experts (MoE). In this architecture, the model is composed of multiple sub-networks, called “experts”, that specialize in different aspects of the data. These experts are not separate models but rather specific parts of a larger network. A routing mechanism dynamically selects which experts to use for a given input, allowing the model to spend compute only on the relevant experts. This selective activation reduces the overall cost and improves efficiency (a toy routing layer is also sketched after this list).
Inference, depending on the available compute (the size of the GPU cluster), is typically fast (tens of tokens per second) and needs far fewer resources than training.
Training, depending on the size of the model and the compute available, can take a significant amount of time (hours to weeks) and needs hundreds of times more resources than inference.
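
To make the two key ideas above concrete, here is a minimal sketch of greedy next-token prediction with the Hugging Face transformers library. The model name "gpt2" is only a small, convenient example and has nothing to do with o1:

```python
# Greedy next-token prediction: the core loop behind every LLM answer.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")          # small illustrative model
model = AutoModelForCausalLM.from_pretrained("gpt2")

ids = tok("The capital of France is", return_tensors="pt").input_ids
with torch.no_grad():
    for _ in range(5):                               # generate five tokens, one at a time
        logits = model(ids).logits                   # (batch, seq_len, vocab_size)
        next_id = logits[0, -1].argmax()             # greedy: most likely next token
        ids = torch.cat([ids, next_id.view(1, 1)], dim=1)
print(tok.decode(ids[0]))
```

And here is a toy top-1 Mixture of Experts layer to illustrate the routing idea; real MoE architectures are far more sophisticated, so treat this as a sketch of the concept only:

```python
import torch
import torch.nn as nn

class ToyMoE(nn.Module):
    """Illustrative top-1 MoE layer: a router picks one expert per token."""
    def __init__(self, dim=64, n_experts=4):
        super().__init__()
        self.router = nn.Linear(dim, n_experts)       # scores each expert per token
        self.experts = nn.ModuleList(nn.Linear(dim, dim) for _ in range(n_experts))

    def forward(self, x):                             # x: (tokens, dim)
        weights = self.router(x).softmax(dim=-1)      # routing probabilities
        top = weights.argmax(dim=-1)                  # top-1 expert per token
        out = torch.zeros_like(x)
        for i, expert in enumerate(self.experts):
            mask = top == i                           # tokens routed to expert i
            if mask.any():                            # only selected experts run
                out[mask] = expert(x[mask]) * weights[mask, i].unsqueeze(-1)
        return out

print(ToyMoE()(torch.randn(10, 64)).shape)            # torch.Size([10, 64])
```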
Now let’s see how the o1 model works.

The model o1 “thinks”

The o1 model works much like a traditional LLM, except for one major feature, which OpenAI describes as follows:

We trained these models to spend more time thinking through problems before they respond, much like a person would. Through training, they learn to refine their thinking process, try different strategies, and recognize their mistakes.

Here’s what you can see on ChatGPT, for example:

ChatGPT screenshot

There is a new section called “Thought” that appears for each call, showing how the model “thinks” before answering. This should allow the model to reach PhD level in many scientific fields and to be better at coding, by testing different strategies and recognising its mistakes.
So how does this work? To explain, we first need to introduce a major concept: chain of thought (CoT)!

Chain of thought (CoT)

Chain of thought (CoT) is a prompt engineering technique that makes an LLM generate intermediate reasoning steps when solving problems or answering questions.

This technique allows:

Step-by-Step Reasoning: CoT allows models to break down complex tasks into simpler, sequential steps, leading to more accurate and coherent answers.
Enhanced Problem-Solving: By simulating a reasoning process, models can handle arithmetic problems, logical puzzles, and questions that require understanding context or making inferences.
Transparency: The intermediate steps provide insights into how the model arrived at an answer, which can be useful for debugging and improving model performance.
Scalability: The larger the model, the more coherent and accurate its chain of thought tends to be.
This prompt engineering technique is widely used to force a model to think step by step and give better answers. It is notably used for mathematical or complex tasks, so that the model does not skip a step.
Here’s an example:

Question: If a train travels at 60 miles per hour for 2 hours, how far does it travel? Answer step by step.
Chain of Thought:
    1.  Identify the given values:
       •  Speed of the train: 60 miles per hour (mph)
       •  Time traveled: 2 hours
    2.  Apply the formula: Distance = 60 mph x 2 hours
    3.  Calculate the distance: Distance = 120 miles

Answer: The train travels 120 miles.
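
In code, zero-shot CoT is often as simple as appending a trigger phrase to the prompt. A minimal sketch, assuming the openai Python package is installed and OPENAI_API_KEY is set; the model name is an illustrative choice:

```python
# Zero-shot chain-of-thought: ask the model to reason step by step.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

question = "If a train travels at 60 miles per hour for 2 hours, how far does it travel?"
response = client.chat.completions.create(
    model="gpt-4o-mini",  # illustrative model choice
    messages=[{"role": "user", "content": question + " Let's think step by step."}],
)
print(response.choices[0].message.content)  # intermediate steps, then the answer
```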
So, how can this technique be integrated into a model? A novel paper explains it very well: the Quiet-STaR approach!

The Quiet-STaR Approach

The Quiet-STaR (Quiet Self-Taught Reasoner) approach is a method to enhance a model by making it generate intermediate steps (“thoughts”) for each input token.

The Quiet-STaR method involves three steps: “Think”, “Talk” and “Learn”:

Think: Generate multiple “thought” or CoT sequences for each input token in parallel, creating multiple reasoning paths.
Talk: Mix predictions by combining the original input’s prediction with the generated thoughts, using a learned weight that determines how much influence the thoughts have on the next prediction.
Learn: Optimise the model’s parameters by favouring thought sequences that lead to more accurate predictions and penalising those that do not.
Simply put, for each input the model generates multiple CoTs, refines its reasoning using those CoTs, and then produces an output. The contribution of each CoT to the prediction is recorded and used for further training of the model, allowing it to improve at the next inferences.
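
To give an intuition of the mechanics, here is a heavily simplified toy version of one “Think / Talk / Learn” step. The tiny mean-pooling “LM”, the thought sampler, and the mixing head are all stand-ins invented for illustration, not the paper’s actual implementation:

```python
# Toy Quiet-STaR step: Think (sample thoughts), Talk (mix predictions),
# Learn (reward thoughts that improved the prediction).
import torch
import torch.nn.functional as F

vocab = 50
emb = torch.nn.Embedding(vocab, 32)
head = torch.nn.Linear(32, vocab)
mix_head = torch.nn.Linear(vocab, 1)                  # "Talk": learned mixing weight

def lm_logits(ids):
    """Toy stand-in for an LM: mean-pooled embeddings -> next-token logits."""
    return head(emb(ids).mean(dim=1))                 # (1, vocab)

def sample_thought(ids, length=3):
    """Sample a short 'thought' from the toy LM, tracking its log-probability."""
    logprob = torch.tensor(0.0)
    for _ in range(length):
        probs = F.softmax(lm_logits(ids), dim=-1)
        tok = torch.multinomial(probs, 1)             # next thought token
        logprob = logprob + probs[0, tok.item()].log()
        ids = torch.cat([ids, tok], dim=1)
    return ids, logprob

ids = torch.randint(vocab, (1, 8))                    # context tokens
target = torch.randint(vocab, (1,)).item()            # true next token
base_lp = F.log_softmax(lm_logits(ids), dim=-1)[0, target]  # prediction w/o thought

losses = []
for _ in range(4):                                    # Think: several reasoning paths
    with_thought, thought_lp = sample_thought(ids)
    w = torch.sigmoid(mix_head(lm_logits(with_thought)))    # Talk: mixing weight
    mixed = w * lm_logits(with_thought) + (1 - w) * lm_logits(ids)
    lp = F.log_softmax(mixed, dim=-1)[0, target]
    reward = (lp - base_lp).detach()                  # Learn: did this thought help?
    losses.append(-(reward * thought_lp) - lp)        # REINFORCE on thought + LM loss
torch.stack(losses).mean().backward()                 # favours helpful thoughts
```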

So, with all this, we now have a better idea of how the o1 model might work. I say “might” because OpenAI rarely talks about their internal architecture, even more so for something state of the art like this.
But this hypothesis is corroborated by the fact that the community could mostly reproduce the o1 model’s output using the aforementioned methods (prompt engineering using self-reflection and CoT) with classic LLMs (see this link).

If you want more details, you can check the official paper here.

The model o1 paradigm change

Now that we have seen how the o1 model might work, we can talk about the paradigm change.
Traditional LLMs spent most of their compute on training; inference was just using the model to generate the prediction.
With this new model, the LLM spends far more time “thinking” during the inference phase.

Model training and inference time

And this is the big paradigm change: the scaling of inference.
Because we are reaching the ceiling of scaling model training, OpenAI has just opened the door to scaling inference, meaning scaling the search for the best reasoning.
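
OpenAI does not disclose o1’s actual search procedure, but self-consistency is a well-known, public example of what scaling inference means: spend more compute sampling several reasoning paths, then keep the majority answer. A minimal sketch, again assuming the openai package, with an illustrative model name and prompt format:

```python
# Self-consistency: more inference compute -> more reasoning paths -> better answer.
from collections import Counter
from openai import OpenAI

client = OpenAI()

def self_consistent_answer(question: str, n_samples: int = 8) -> str:
    answers = []
    for _ in range(n_samples):                        # each sample is extra inference compute
        r = client.chat.completions.create(
            model="gpt-4o-mini",                      # illustrative model choice
            temperature=1.0,                          # diversity across reasoning paths
            messages=[{"role": "user", "content":
                       question + " Think step by step, then end with 'Answer: <value>'."}],
        )
        answers.append(r.choices[0].message.content.split("Answer:")[-1].strip())
    return Counter(answers).most_common(1)[0][0]      # majority vote on final answers
```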

This paves the way for a completely new kind of model: reasoning cores. These models do not focus on memorizing vast amounts of knowledge but on dynamic reasoning and search strategies, and they are far more capable of using different tools for each task.

This paradigm shift will also create a data flywheel effect for continuous learning. Each “thought” the model generates becomes a dataset that can be further used to make the model reason better, which will attract more users.

Now let’s see why this is not only a paradigm change for LLMs.

Innovation for Agentic and RAG

The advent of reasoning cores is also an innovation for RAG and Agentic.

An agent is an entity that autonomously executes a task (takes an action, answers a question, …). One of the most important capacities of an agent is to reason about the current state and define a plan to achieve its end goal.
By building reasoning cores, which concentrate on dynamic reasoning and search strategies while shedding the excess knowledge, we can have far lighter but more performant LLMs that respond faster and plan better.
Even better, by integrating tools more tightly, these reasoning cores will be able to use them within their thoughts and create far better strategies to achieve their tasks.

With these tool-augmented thoughts, we could achieve far better performance in RAG, because the model will by itself test multiple strategies, effectively creating a parallel Agentic graph over a vector store, and keep the one that gives the best result.
Besides, RAG systems integrate more and more agents, so any advance in Agentic will make RAG systems more performant.
Finally, by continuously fine-tuning a reasoning core on the specific thoughts that gave the best results, notably for RAG where we can gather more feedback, we could have a truly specialized model, tailored to the RAG system’s data and usage.
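
As a rough illustration of where this could go, here is what a reasoning-core-driven RAG loop might look like. plan_next_step and vector_store are hypothetical stand-ins for a planning LLM call and a similarity-search index, not any existing API:

```python
# Hypothetical agentic RAG loop: the model reasons about the current state,
# decides whether to retrieve more documents or answer, and iterates.
def agentic_rag(question, vector_store, plan_next_step, max_steps=3):
    context = []
    for _ in range(max_steps):
        step = plan_next_step(question, context)        # reason about current state
        if step["action"] == "answer":
            return step["text"]                         # enough evidence gathered
        context += vector_store.search(step["query"])   # retrieve more documents
    # out of budget: force a final answer from whatever was retrieved
    return plan_next_step(question, context, force_answer=True)["text"]
```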

Conclusion

OpenAI’s newest model, o1, opens the way to scaling the inference part of an LLM and training its reasoning and search strategies. It opens the door to a new kind of model, the reasoning core, which focuses on being lighter while excelling at dynamic reasoning and search strategies. This will be a big innovation for Agentic and RAG, where these kinds of models will make systems even more autonomous and performant.

Afterwards

I really hope you loved this post. Don’t forget to check my other posts here and on my own blog, and please leave a comment if you liked it (even if you didn't :D). See you!
