The Fastest Llama: Uncovering the Speed of LLMs

Francesco Mattia - Sep 1 - Dev Community

Curious about LLM Speed? I Tested Local vs Cloud GPUs (and CPUs too!)

I've been itching to compare the speed of locally-run LLMs against the big players like OpenAI and Anthropic. So, I decided to put my curiosity to the test with a series of experiments across different hardware setups.

I started with LM Studio and Ollama on my trusty laptop, but then I thought, "Why not push it further?" So, I fired up my PC with an RTX 3070 GPU and dove into some cloud options like RunPod, AWS, and vast.ai. I wanted to see not just the speed differences but also get a handle on the costs involved.

Now, I'll be the first to admit my test wasn't exactly scientific. I used just two prompts for inference, which some might argue is a bit basic. But hey, it gives us a solid starting point for comparing speeds across different GPUs and understanding the difference between prompt evaluation (input) and response generation (output) speeds.

Check out this table of results.

| Device | Cost/hr | Phi-3 input (t/s) | Phi-3 output (t/s) | Phi-3 IO ratio | Llama3 input (t/s) | Llama3 output (t/s) | Llama3 IO ratio |
| --- | --- | --- | --- | --- | --- | --- | --- |
| M1 Pro | - | 96.73 | 30.63 | 3.158 | 59.12 | 25.44 | 2.324 |
| RTX 3070 | - | 318.68 | 103.12 | 3.090 | 167.48 | 64.15 | 2.611 |
| g5g.xlarge (T4G) | $0.42 | 185.55 | 60.85 | 3.049 | 88.61 | 42.33 | 2.093 |
| g5.12xlarge (4x A10G) | $5.672 | 266.46 | 105.97 | 2.514 | 131.36 | 68.07 | 1.930 |
| A40 (runpod) | $0.49 (spot) | 307.51 | 123.73 | 2.485 | 153.41 | 79.33 | 1.934 |
| L40 (runpod) | $0.69 (spot) | 444.29 | 154.22 | 2.881 | 212.25 | 97.51 | 2.177 |
| RTX 4090 (runpod) | $0.49 (spot) | 470.42 | 168.08 | 2.799 | 222.27 | 101.43 | 2.191 |
| 2x RTX 4090 (runpod) | $0.99 (spot) | 426.73 | 40.95 | 10.4 | 168.60 | 111.34 | 1.51 |
| RTX 3090 (vast.ai) | $0.24 | 335.49 | 142.02 | 2.36 | 145.47 | 88.99 | 1.63 |
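
If you're wondering where these numbers come from: Ollama reports prompt-evaluation and generation token counts and durations with every response, and the IO ratio above is simply input speed divided by output speed. Below is a minimal sketch of that calculation (my own illustration, not the benchmark repo's code), assuming an Ollama server running on the default port with the models already pulled:

# Sketch: measure input (prompt eval) and output (generation) speed via
# Ollama's REST API. Durations in the response are reported in nanoseconds.
import requests

def measure(model: str, prompt: str) -> dict:
    resp = requests.post(
        "http://localhost:11434/api/generate",
        json={"model": model, "prompt": prompt, "stream": False},
        timeout=600,
    )
    data = resp.json()
    input_tps = data["prompt_eval_count"] / (data["prompt_eval_duration"] / 1e9)
    output_tps = data["eval_count"] / (data["eval_duration"] / 1e9)
    return {
        "input t/s": round(input_tps, 2),
        "output t/s": round(output_tps, 2),
        "IO ratio": round(input_tps / output_tps, 3),
    }

print(measure("llama3", "Why is the sky blue?"))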

Setup and Specs: For the Tech-Curious

I ran tests on a variety of setups, from cloud services to my local machines. Below is a quick rundown of the hardware. I wrote about running LLMs in the cloud in more detail here.

The benchmarks are run using the Python scripts at https://github.com/MinhNgyuen/llm-benchmark.git, which lean on Ollama for inference. Hence, on every environment we need to set up Ollama and Python, pull the models we want to test, and run the benchmark.

On runpod (starting from ollama/ollama Docker template):

# basic setup (on ubuntu)
apt-get update
apt-get install -y python3 python3-pip python3-venv git

# pull models we want to test
ollama pull phi3; ollama pull llama3

python3 -m venv venv
source venv/bin/activate

# download benchmarking script and install dependencies 
git clone https://github.com/MinhNgyuen/llm-benchmark.git
cd llm-benchmark
pip install -r requirements.txt

# run benchmarking script with installed models and these prompts
python benchmark.py --verbose --skip-models nomic-embed-text:latest --prompts "Why is the sky blue?" "Write a report on the financials of Nvidia"

System specs

| Environment | Hardware specification | VRAM / RAM | Software |
| --- | --- | --- | --- |
| AWS EC2 | g5g.xlarge, Nvidia T4G | 16GB VRAM | ollama |
| AWS EC2 | g5.12xlarge, 4x Nvidia A10G | 96GB VRAM | ollama |
| runpod | Nvidia A40 | 48GB VRAM | ollama |
| runpod | Nvidia L40 | 48GB VRAM | ollama |
| runpod | Nvidia RTX 4090 | 24GB VRAM | ollama |
| runpod | 2x Nvidia RTX 4090 | 48GB VRAM | ollama |
| vast.ai | Nvidia RTX 3090 | 24GB VRAM | ollama |
| Local Mac | M1 Pro, 8 CPU cores (6P + 2E) + 14 GPU cores | 16GB (V)RAM | LM Studio |
| Local PC | Nvidia RTX 3070, LLM on GPU | 8GB VRAM | LM Studio |
| Local PC | Ryzen 5500, 6 CPU cores, LLM on CPU | 64GB RAM | LM Studio |

But Wait, What About CPUs?

Curious about CPU performance compared to GPUs? I ran a quick test to give you an idea. I used a single prompt across three different setups:

  1. A Mac, which uses its integrated GPUs
  2. A PC with an Nvidia GPU, which, as expected, gave the best results
  3. A PC running solely on its CPU

For this test, I used LM Studio, which gives you flexibility over where to load the LLM layers, conveniently letting you choose whether or not to use your system's GPU. I ran the tests with the temperature set to 0, using the prompt "Who is the president of the US?"
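
If you'd rather capture these numbers programmatically than read them off the LM Studio UI, here's a rough sketch against LM Studio's OpenAI-compatible local server (default port 1234). The model name is a placeholder for whatever you have loaded, and counting streamed chunks only approximates the token count:

# Sketch: time to first token (TTFT) and rough output speed via LM Studio's
# OpenAI-compatible server. Assumes the local server is started in LM Studio
# and a model is loaded; the API key is a dummy value the local server ignores.
import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")

def time_completion(model: str, prompt: str) -> None:
    start = time.perf_counter()
    first_token_at = None
    chunks = 0
    stream = client.chat.completions.create(
        model=model,  # placeholder: use the identifier LM Studio shows for your model
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
        stream=True,
    )
    for chunk in stream:
        if chunk.choices and chunk.choices[0].delta.content:
            if first_token_at is None:
                first_token_at = time.perf_counter()
            chunks += 1  # roughly one token per streamed chunk
    ttft = first_token_at - start
    tps = chunks / (time.perf_counter() - first_token_at)
    print(f"TTFT: {ttft:.2f}s, ~{tps:.0f} tok/s")

time_completion("phi-3-mini-4k-instruct", "Who is the president of the US?")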

Here are the results:

| Model | Device | TTFT (time to first token) | Speed |
| --- | --- | --- | --- |
| Phi3 mini 4k instruct q4 | M1 Pro | 0.04s | ~35 tok/s |
| Phi3 mini 4k instruct q4 | RTX 3070 | 0.01s | ~97 tok/s |
| Phi3 mini 4k instruct q4 | Ryzen 5 | 0.07s | ~13 tok/s |
| Meta Llama 3 Instruct 7B | M1 Pro | 0.17s | ~23 tok/s |
| Meta Llama 3 Instruct 7B | RTX 3070 | 0.02s | ~64 tok/s |
| Meta Llama 3 Instruct 7B | Ryzen 5 | 0.13s | ~7 tok/s |
| Gemma It 2B Q4_K_M | M1 Pro | 0.02s | ~63 tok/s |
| Gemma It 2B Q4_K_M | RTX 3070 | 0.01s | ~170 tok/s |
| Gemma It 2B Q4_K_M | Ryzen 5 | 0.05s | ~23 tok/s |

My takeaways

  1. Dedicated GPUs are speed demons: They outperform Macs when it comes to inference speed, especially considering the costs.

  2. Size matters (for models): Smaller models can provide a viable experience even on lower-end hardware, as long as you've got the RAM or VRAM to back them up (see the back-of-envelope sketch after this list).

  3. CPUs? Not so hot for inference: Your average desktop CPU is still vastly slower than a dedicated GPU.

  4. Gaming GPUs for the win: A beastly gaming GPU like the RTX 4090 is quite cost-effective and can deliver top-notch results, comparable to an H100. Multiple GPUs didn't necessarily make things faster in this scenario.
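
To put some rough numbers behind point 2, here's a back-of-envelope sketch (my own approximation, not something measured above) of the memory a quantized model's weights alone need; the ~4.5 bits per weight is an assumption for a typical Q4 quantization, and context/KV cache plus runtime overhead come on top:

# Back-of-envelope estimate of weight memory for a quantized model.
def weight_memory_gb(n_params_billions: float, bits_per_weight: float) -> float:
    return n_params_billions * 1e9 * bits_per_weight / 8 / 1024**3

# ~3.8B-parameter phi3 and ~8B-parameter llama3 at an assumed ~4.5 bits/weight
for name, params in [("phi3 (3.8B)", 3.8), ("llama3 (8B)", 8.0)]:
    print(f"{name}: ~{weight_memory_gb(params, 4.5):.1f} GB for the weights")

That lines up with why an 8GB RTX 3070 or a 16GB M1 Pro can hold these models comfortably, while larger models quickly push you towards bigger (or multiple) GPUs.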

This little experiment has been a real eye-opener for me, and I'm eager to dive deeper. I'd love to hear your thoughts! What other tests would you like to see? Any specific hardware or models you're curious about?
