How to deploy Llama3.1-405B with Novita AI

Novita AI - Jul 26

Foreword

The release of Llama3.1 immediately garnered global attention: it marks the first time an open-source model has approached, and on some metrics even surpassed, the leading proprietary models.

This article walks through deploying the Llama3.1-405B model from scratch on a Novita AI GPU Instance. Be aware that deploying a large model from scratch is time-consuming and labor-intensive. If you would rather skip that work and use Llama3.1-405B directly, you can call the inference service of a model hosting platform through an OpenAI-compatible API; this is easy to use and keeps the total cost low and controllable. Novita AI has already launched an API service for Llama3.1, and you can try it directly in the Playground: https://novita.ai/model-api/llm-api/playground
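For reference, such a hosted service is called like any OpenAI-compatible chat-completions endpoint. The sketch below is illustrative only; the base URL, model name, and the $NOVITA_API_KEY variable are assumptions and should be replaced with the values from the Novita AI documentation.

## Illustrative call to a hosted OpenAI-compatible endpoint (base URL and model name are assumptions).
curl https://api.novita.ai/v3/openai/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $NOVITA_API_KEY" \
  -d '{
        "model": "meta-llama/llama-3.1-405b-instruct",
        "messages": [{"role": "user", "content": "Hello, Llama 3.1!"}]
      }'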

Basic Requirements for Llama3.1-405B Deployment

The Llama3.1 series of models open-sourced by Meta comes in three sizes: 8B, 70B, and 405B. The 405B model is the largest open-source large language model to date, with roughly 405 billion parameters. Its reported results on multiple benchmarks exceed those of GPT-4 and GPT-4o and are comparable to Claude 3.5 Sonnet.

(Figure: benchmark comparison of Llama3.1-405B against GPT-4, GPT-4o, and Claude 3.5 Sonnet)

It is challenging to load such a large model onto GPUs. The original FP16 version of the 405B model requires about 810GB of GPU memory (roughly 405 billion parameters at 2 bytes each), as shown in the figure below. Even the highest-specification GPU currently available, the H100, provides only 640GB in its maximum 8-card server configuration (8 x 80GB), so it cannot load this version of the model directly. The FP16 weights therefore need to be quantized to a lower-precision representation (FP8 roughly halves the footprint to about 405GB), reducing the memory requirement so the model can be loaded onto the GPUs.

(Figure: GPU memory required by the Llama3.1-405B model at different precisions)

The model architecture of the Llama3.1 series is inherited from Llama3 and is very similar in structure, which makes it easy for inference frameworks to adapt: open-source solutions such as vLLM were able to add support for inference of large models like Llama3.1-405B quickly.

In summary, deploying the Llama3.1-405B inference service requires preparation in three areas:

  • Hardware: It is recommended to choose a GPU server instance configured with 8 H100 GPUs and to reserve about 1.5TB of storage space. You can select the Instance through the Novita AI console and provide the storage capacity by mounting a network volume.


  • Model: Prepare a Huggingface account and download the original Llama3.1-405B model, or alternatively download a pre-quantized FP8 or INT4 version.
  • Inference Framework: Download vLLM v0.5.3.post1 (the latest version at the time of writing).

Model Preparation

After preparing the GPU server with 8 H100 cards, log in to the server and start downloading the model. This section shows how to download the FP16 version of the model and manually convert it into FP8 and INT4 quantized versions (of course, you can also download already-quantized models directly from the Huggingface platform).
It is recommended to download the Instruct version from the Huggingface platform. First, register and log in to Huggingface, then create an Access Token on the settings page and save it; it will be needed when downloading the model.
Open the Llama3.1-405B model page: https://huggingface.co/meta-llama/Meta-Llama-3.1-405B-Instruct, submit the access request to Meta, and wait (usually about half an hour) for authorization.
Back on the GPU server, install the Huggingface client and start downloading the model with the following commands:

pip install huggingface-hub
huggingface-cli login  ## Enter Access Token
huggingface-cli download meta-llama/Meta-Llama-3.1-405B-Instruct  ## Start downloading the 405B large model

After a long wait (depending on your network speed), the roughly 800GB 405B model is downloaded locally; the downloaded model details can be viewed with "huggingface-cli scan-cache".
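If you want the weights to land on the mounted network volume rather than the system disk, you can point the Huggingface cache there before downloading. The /data mount point below is an assumption; substitute your own volume path.

## Assumes the network volume is mounted at /data (adjust to your setup).
export HF_HOME=/data/huggingface   ## relocate the Huggingface cache to the volume
huggingface-cli download meta-llama/Meta-Llama-3.1-405B-Instruct
huggingface-cli scan-cache   ## verify the download and its size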
Next, perform FP8 and INT4 quantization on the original model.
First, download the FP8 quantization tool. You can use the open-source AutoFP8 (https://github.com/neuralmagic/AutoFP8) directly and run its quantization script manually in dynamic activation mode, as follows:

git clone https://github.com/neuralmagic/AutoFP8.git
cd AutoFP8
pip install -e .  ## Compile and install the AutoFP8 tool locally
python3 examples/quantize.py --model-id meta-llama/Meta-Llama-3.1-405B-Instruct --save-dir Meta-Llama-3.1-405B-Instruct-fp8 --activation-scheme dynamic --max-seq-len 2048 --num-samples 2048

The quantization script specifies the output path for the quantized model with "--save-dir". Since the dynamic scheme quantizes only the weights offline (activation scales are computed at runtime), the process completes very quickly.
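The directory passed to "--save-dir" can later be given directly to vLLM's "--model" argument instead of a Huggingface model ID. A minimal sketch, assuming the output path used above:

## Serve the locally quantized FP8 checkpoint (path from --save-dir above).
python3 -m vllm.entrypoints.openai.api_server --model ./Meta-Llama-3.1-405B-Instruct-fp8 --tensor-parallel-size 8 --served-model-name llama31-405b-fp8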

You can also directly download the FP8 version quantized by Meta: https://huggingface.co/meta-llama/Meta-Llama-3.1-405B-Instruct-FP8. The download command is as follows:
huggingface-cli download meta-llama/Meta-Llama-3.1-405B-Instruct-FP8

Similarly to FP8 quantization, you can use the open-source AutoAWQ (https://github.com/casper-hansen/AutoAWQ) to produce an INT4 quantized version. The steps are as follows:

git clone https://github.com/casper-hansen/AutoAWQ.git
cd AutoAWQ
pip install -e .   ## Compile and install the AutoAWQ tool locally

## Manually edit examples/quantize.py and set:
##     model_path = 'meta-llama/Meta-Llama-3.1-405B-Instruct'
##     quant_path = 'meta-llama/Meta-Llama-3.1-405B-Instruct-awq'

## Start quantization
python examples/quantize.py

Building the Inference Framework

Thanks to strong support from the open-source community, vLLM is recognized as an industrial-grade, highly efficient inference framework; it supports a very wide range of large models and is updated promptly. Starting from v0.5.3.post1, vLLM supports inference for the Llama3.1 series of models.

Download and compile the vLLM source code as follows:

git clone https://github.com/vllm-project/vllm
cd vllm
git checkout -b v0.5.3.post1 v0.5.3.post1
pip install -e .  ## Compile and install vLLM from source

After successful compilation, you can start the inference service.
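You can quickly confirm that the build succeeded and that the expected version is installed:

python3 -c "import vllm; print(vllm.__version__)"  ## expect a 0.5.3.post1 version string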

In addition to compiling vLLM locally, you can also build a vLLM Docker image, as follows:

git clone https://github.com/vllm-project/vllm
cd vllm
git checkout -b v0.5.3.post1 v0.5.3.post1
docker build -t vllm_0.5.3.p1 .

You can also directly download the official vLLM image:

docker pull vllm/vllm-openai:v0.5.3.post1

Running the Inference Service

With the vLLM inference framework built, you can start vLLM, load the Llama3.1-405B model, and begin serving inference requests. For a locally compiled vLLM, switch to the root directory of the vLLM source code and run the following command to start the service:

cd vllm
python3 -m vllm.entrypoints.openai.api_server --port 18001 --model meta-llama/Meta-Llama-3.1-405B-Instruct-FP8 --tensor-parallel-size 8 --pipeline-parallel-size 1 --swap-space 16 --gpu-memory-utilization 0.99 --dtype auto --served-model-name llama31-405b-fp8 --max-num-seqs 32 --max-model-len 32768 --max-num-batched-tokens 32768 --max-seq-len-to-capture 32768

If running the inference service in a docker container, you can refer to the following command to run it:

docker run -d --gpus all --privileged --ipc=host --net=host vllm/vllm-openai:v0.5.3.post1 --port 18001 --model meta-llama/Meta-Llama-3.1-405B-Instruct-FP8 --tensor-parallel-size 8 --pipeline-parallel-size 1 --swap-space 16 --gpu-memory-utilization 0.99 --dtype auto --served-model-name llama31-405b-fp8 --max-num-seqs 32 --max-model-len 32768 --max-num-batched-tokens 32768 --max-seq-len-to-capture 32768
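Note that the command above does not mount the host's model cache into the container, so the container would need to download the weights itself. A common pattern (a sketch, assuming the model was downloaded to the default Huggingface cache on the host) is to mount that cache into the container:

## Mount the host Huggingface cache so the container reuses the already-downloaded weights.
docker run -d --gpus all --ipc=host --net=host \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  vllm/vllm-openai:v0.5.3.post1 \
  --port 18001 --model meta-llama/Meta-Llama-3.1-405B-Instruct-FP8 \
  --tensor-parallel-size 8 --gpu-memory-utilization 0.99 --dtype auto \
  --served-model-name llama31-405b-fp8 --max-num-seqs 32 --max-model-len 32768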

Once the vLLM inference framework is running successfully, it listens on the specified port (18001 in the examples above) and exposes an OpenAI-compatible API.

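To verify that the service is up, you can send requests to the standard OpenAI-compatible endpoints that vLLM exposes, for example:

## List the served models (should return "llama31-405b-fp8").
curl http://localhost:18001/v1/models

## Send a test chat completion to the local service.
curl http://localhost:18001/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "llama31-405b-fp8",
        "messages": [{"role": "user", "content": "Hello!"}],
        "max_tokens": 64
      }'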