CIM Accelerates LLMs: A Top GitHub Repository for Learning CIM

huimin liao - Sep 9 - Dev Community

This article focuses on the potential of computing-in-memory (CIM) technology for accelerating the inference of large language models (LLMs). Starting from the background of LLMs, it explores the challenges they currently face, then analyzes two classic papers to highlight how CIM could address existing issues in LLM inference acceleration. Finally, it discusses the integration of LLMs with CIM and its future prospects.

Background and Challenges of Large Language Models
(I) Basic Concepts of Large Language Models

In layman's terms, Large Language Models (LLMs) are deep learning models trained on massive text data, with parameters in the billions (or more). They can not only generate natural language text but also understand its meaning, handling various natural language tasks such as text summarization, question answering, and translation.

The performance of LLMs often follows the scaling law, but certain abilities only emerge once a language model reaches a sufficient scale. These are called "emergent abilities," and three representative examples stand out. The first is in-context learning: given a few worked examples in the input text, the model produces the expected output for a test example simply by completing the word sequence, without additional training or gradient updates. The second is instruction following: LLMs fine-tuned on a mixture of multi-task datasets formatted as instructions perform well on tasks that are likewise described in the form of instructions. The third is step-by-step reasoning: with a prompting mechanism that elicits intermediate reasoning steps, LLMs can work through multi-step tasks to derive a final answer.
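To make the first and third abilities concrete, here is a minimal, model-agnostic sketch of how in-context learning and step-by-step prompting are expressed purely through the input text. The prompts and helper function are illustrative assumptions; no specific model or API is involved.

```python
# Illustrative prompts only; no model or API is called here.
few_shot_examples = [
    ("Translate to French: cheese", "fromage"),
    ("Translate to French: bread", "pain"),
]

def build_prompt(examples, query):
    """Concatenate worked examples and the new query into a single prompt string."""
    lines = [f"{q}\n{a}" for q, a in examples]
    lines.append(query)  # the model is expected to complete this last line
    return "\n\n".join(lines)

# In-context learning: the "training" happens inside the prompt, with no gradient updates.
print(build_prompt(few_shot_examples, "Translate to French: water"))

# Step-by-step reasoning: a cue phrase elicits intermediate reasoning steps.
cot_prompt = (
    "Q: A pack holds 3 pens and I buy 4 packs. How many pens do I have?\n"
    "A: Let's think step by step."
)
print(cot_prompt)
```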

Figure 1: The development timeline of LLMs

Currently, representative LLMs include models such as GPT-4, LLaMA, and PaLM (as shown in Figure 1). They have demonstrated strong capabilities in various natural language processing tasks, such as machine translation, text summarization, dialog systems, and question-answering systems. Moreover, LLMs have played a significant role in driving innovation in various fields, including healthcare, finance, and education, by providing enhanced data analysis, pattern recognition, and predictive modeling capabilities. This transformative impact underscores the importance of exploring and understanding the foundations of these models and their widespread application in different domains.

(II) Challenges Faced by Large Language Models

Executing LLM inference on traditional computing architectures faces challenges such as computational latency, energy consumption, and data-transfer bottlenecks. As models like GPT and BERT grow to billions of parameters or more, inference requires a massive amount of computation, especially matrix operations and activation-function processing. This high-density computational demand leads to significant latency, particularly in applications that require rapid responses, such as real-time language translation or interactive dialog systems.

Furthermore, as model sizes increase, the required computational resources grow with them, leading to substantially higher energy consumption, operating costs, and environmental impact. Running these large models in data centers or cloud environments consumes a significant amount of electricity, and the high energy consumption limits their deployment on edge devices. Data-transfer bottlenecks are another critical issue: the enormous parameter volume of LLMs cannot fit into the processor's caches, so data must be fetched frequently from slower main memory, which further increases inference latency and energy consumption.
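A rough back-of-envelope estimate shows why data transfer tends to dominate. All numbers below are assumptions for illustration, not measurements; the point is only that, during autoregressive decoding, essentially every weight must be read from main memory for each generated token.

```python
# All numbers are assumptions for illustration, not measurements.
params = 7e9                 # assumed model size: 7 billion parameters
bytes_per_param = 2          # FP16 weights
weight_bytes = params * bytes_per_param          # ~14 GB of weights

flops_per_token = 2 * params                     # ~2 FLOPs per parameter per generated token
dram_bandwidth = 900e9                           # assumed ~900 GB/s main-memory bandwidth
peak_flops = 300e12                              # assumed ~300 TFLOP/s of FP16 compute

# During autoregressive decoding, essentially every weight is streamed
# from memory once per generated token.
t_memory = weight_bytes / dram_bandwidth         # time limited by data transfer
t_compute = flops_per_token / peak_flops         # time limited by arithmetic

print(f"memory-bound time per token : {t_memory * 1e3:.2f} ms")
print(f"compute-bound time per token: {t_compute * 1e3:.4f} ms")
# The memory term dominates by orders of magnitude, which is exactly the
# data-transfer bottleneck described above.
```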

Computing-in-memory (CIM) technology performs data processing directly within memory chips, effectively reducing the data transfer between memory and processors required by traditional computing architectures. This reduction in data movement significantly lowers energy consumption and latency for inference tasks, making model responses faster and more efficient. Given the challenges LLMs currently face, CIM may be an effective solution to these problems.
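Conceptually, a CIM array performs a matrix-vector product where the weights already reside: cell conductances store the matrix, input voltages encode the vector, and the summed bit-line currents produce the result. The sketch below is a purely illustrative, idealized model of this idea; the 4-bit conductance resolution is an assumption, not a property of any particular device.

```python
import numpy as np

# Idealized, purely illustrative model of an analog CIM matrix-vector product:
# weights are stored as cell conductances (here with an assumed 4-bit resolution),
# the input vector is applied as voltages, and currents sum along the bit lines.
rng = np.random.default_rng(0)
W = rng.normal(size=(4, 8))      # weight matrix mapped onto the memory array
x = rng.normal(size=8)           # input activations applied to the word lines

levels = 2 ** 4                  # assumed 4-bit conductance resolution
G = np.round((W - W.min()) / (W.max() - W.min()) * (levels - 1))
W_analog = G / (levels - 1) * (W.max() - W.min()) + W.min()

y_digital = W @ x                # ideal digital result
y_cim = W_analog @ x             # result "computed inside" the array

print("max deviation due to limited analog precision:", np.abs(y_digital - y_cim).max())
```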

Case Study I - X-Former: In-Memory Acceleration of Transformers
The X-Former architecture is a CIM hardware platform designed to accelerate Transformer-based LLMs. By computing directly within storage units, optimizing parameter management, and improving computational efficiency and hardware utilization, it addresses the performance bottlenecks that traditional computing hardware encounters on natural language tasks, such as high energy consumption, high latency, difficult parameter management, and limited scalability.

The uniqueness of X-Former lies in its integration of specific hardware units, such as the projection engine and attention engine, which are optimized for different computational needs of the model (Figure 2).
The projection engine uses non-volatile memory (NVM) to handle static, compute-intensive matrix operations, while the attention engine uses CMOS technology for the frequent and dynamically changing computations. This division allows X-Former to almost eliminate dependence on external memory during execution, since most data processing is completed within the memory arrays.
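The split can be pictured as follows. This is a functional sketch of my reading of the description above, not code from the paper: the query/key/value projections multiply activations by static weights that can stay resident in NVM, while the attention score and context computations multiply two dynamic operands and are assigned to the CMOS engine.

```python
import numpy as np

# Functional sketch of the engine split described above (my reading, not the paper's code).
rng = np.random.default_rng(1)
seq_len, d_model = 16, 32
X = rng.normal(size=(seq_len, d_model))                    # input activations
W_q, W_k, W_v = (rng.normal(size=(d_model, d_model)) for _ in range(3))

def projection_engine(X, W):
    """Static weights x dynamic activations: the weights can stay resident in NVM."""
    return X @ W

def attention_engine(Q, K, V):
    """Dynamic x dynamic products and softmax: handled by the CMOS engine."""
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    probs = np.exp(scores - scores.max(axis=-1, keepdims=True))
    probs /= probs.sum(axis=-1, keepdims=True)
    return probs @ V

Q, K, V = (projection_engine(X, W) for W in (W_q, W_k, W_v))
print(attention_engine(Q, K, V).shape)                     # (16, 32)
```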

Additionally, by adopting an intra-layer sequential blocking data flow method, X-Former further optimizes the data processing procedure, reducing memory occupancy and improving overall computational speed through parallel processing. This approach is particularly suitable for handling self-attention operations as it allows simultaneous processing of multiple data blocks instead of processing the entire dataset sequentially. Such design makes X-Former perform excellently in terms of hardware utilization and memory requirements, especially when dealing with models with a large number of parameters and complex computational demands.
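A rough software analogue of such a block-wise schedule is shown below. The blocking details are an assumption for illustration, not the paper's exact dataflow: queries are processed one block at a time, so only a small slice of the attention-score matrix exists at any moment, while the hardware can still work on multiple blocks in parallel.

```python
import numpy as np

# Block-wise self-attention sketch: queries are processed one block at a time,
# so only a (block x seq_len) slice of the score matrix exists at any moment.
# The blocking details are an assumption for illustration, not the paper's exact dataflow.

def blocked_attention(Q, K, V, block=4):
    out = np.empty_like(Q)
    scale = 1.0 / np.sqrt(Q.shape[-1])
    for start in range(0, Q.shape[0], block):
        q_blk = Q[start:start + block]                   # one block of queries
        scores = q_blk @ K.T * scale                     # block x seq_len scores only
        scores -= scores.max(axis=-1, keepdims=True)     # numerically stable softmax
        probs = np.exp(scores)
        probs /= probs.sum(axis=-1, keepdims=True)
        out[start:start + block] = probs @ V
    return out

rng = np.random.default_rng(2)
Q, K, V = (rng.normal(size=(16, 32)) for _ in range(3))

# Reference: the same attention computed without blocking.
scores = Q @ K.T / np.sqrt(Q.shape[-1])
scores -= scores.max(axis=-1, keepdims=True)
ref = np.exp(scores)
ref /= ref.sum(axis=-1, keepdims=True)
ref = ref @ V

print(np.allclose(blocked_attention(Q, K, V), ref))      # True
```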

In performance evaluations, X-Former shows significant advantages when processing Transformer models, with substantial improvements in latency and energy consumption over traditional GPUs and other NVM-based accelerators. For example, compared to an NVIDIA GeForce GTX 1060 GPU, it achieves 69.8x and 13x improvements in latency and energy consumption, respectively; compared to a state-of-the-art NVM-based accelerator, the improvements are 24.1x and 7.95x. These results demonstrate the great potential of X-Former in natural language processing, especially for complex, parameter-rich models such as GPT-3 and BERT.


Case Study II - iMCAT: In-Memory Computing based Accelerator for Transformer Networks for Long Sequences
The iMCAT computing-in-memory architecture proposed by Laguna et al. targets the acceleration of Transformer networks on long input sequences.

In conclusion, computing-in-memory architectures are a potentially important, next-generation foundation for LLM and multimodal LLM (MLLM) acceleration. The GitHub organization mentioned in the title is presented as the first open-source community dedicated to CIM arrays; it collects the newest CIM research motivation, frequently used tools, and a paper collection for study.