How to Set Up and Run Ollama on a GPU-Powered VM (vast.ai)
In this tutorial, we'll walk you through setting up and using Ollama for private model inference on a machine with a GPU, either your local machine or a VM rented from Vast.ai or Runpod.io. Ollama lets you run models privately, keeping your data secure, and running it on a GPU-powered VM significantly improves the speed and efficiency of your inference tasks.
Outline
Set up a VM with GPU on Vast.ai
Start Jupyter Terminal
Install Ollama
Run Ollama Serve
Test Ollama with a model
(Optional) Use your own model
Setting Up a VM with GPU on Vast.ai
1. Create a VM with GPU:
   - Visit Vast.ai to create your VM.
   - Choose a VM with at least 30 GB of storage to accommodate the models. This ensures you have enough space for the installation and model files.
   - Select a VM that costs less than $0.30 per hour to keep the setup cost-effective.
2. Start Jupyter Terminal:
   - Once your VM is up and running, start Jupyter and open a terminal within it. This is the easiest way to get started.
   - Alternatively, you can connect to the VM over SSH from your local machine (for example, with VS Code), but you will need to create an SSH key and add it to your Vast.ai account first; a sketch of that route follows below.
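For the SSH route, a minimal sketch looks like this (PORT and INSTANCE_IP are placeholders; copy the real values from the connection details of your Vast.ai instance):

```bash
# Generate an SSH key pair if you do not already have one
ssh-keygen -t ed25519 -f ~/.ssh/id_ed25519

# Add the public key (~/.ssh/id_ed25519.pub) to your Vast.ai account,
# then connect using the port and IP shown for your instance
ssh -p PORT root@INSTANCE_IP
```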
Downloading and Running Ollama
1. Install Ollama: Open the terminal in Jupyter and run the following command to install Ollama:
```bash
curl -fsSL https://ollama.com/install.sh | sh
```
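If you want to confirm the installation finished cleanly, you can print the installed version (assuming the install script placed the ollama binary on your PATH, which is its default behavior):

```bash
# Print the installed Ollama version
ollama --version
```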
2. Run Ollama Serve: After installation, start the Ollama service by running:
```bash
ollama serve &
```
Check the startup output and ensure there are no GPU errors. If the GPU is not detected, inference falls back to the CPU and responses will be noticeably slow when you interact with the model.
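To verify that the server is actually up before loading a model, you can query its HTTP API (assuming Ollama is listening on its default port, 11434):

```bash
# List the models known to the local Ollama server; an empty list is fine at this point
curl http://localhost:11434/api/tags
```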
3. Test Ollama with a Model: Test the setup by running a sample model such as Mistral:
```bash
ollama run mistral
```
You can now start chatting with the model to ensure everything is working correctly.
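If you prefer a scripted test over an interactive chat, you can send a single prompt through the REST API instead (a minimal sketch, again assuming the default port 11434 and that the mistral model was pulled in the previous step):

```bash
curl http://localhost:11434/api/generate -d '{
  "model": "mistral",
  "prompt": "Why is the sky blue?",
  "stream": false
}'
```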
Optional (Check GPU usage)
Check GPU Utilization: During inference (the previous step), check whether the GPU is being used by running the following command:

```bash
nvidia-smi
```
- Ensure that the memory utilization is greater than 0%. This indicates that the GPU is being used for the inference process.
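If you want to keep an eye on the GPU while you chat with the model, the query flags below print just the relevant numbers once per second (a small convenience sketch; the field names are standard nvidia-smi query properties):

```bash
# Refresh GPU memory and utilization figures every second
watch -n 1 nvidia-smi --query-gpu=memory.used,memory.total,utilization.gpu --format=csv
```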
Using Your Own Hugging Face Model with Ollama
1. Install Hugging Face CLI: If you want to use your own model from Hugging Face, first install the Hugging Face CLI, which ships with the huggingface-hub Python package:

```bash
pip3 install huggingface-hub
```

2. Download Your Model: Download your desired model from Hugging Face. In this example, we use a fine-tuned German Mistral model in GGUF format, TheBloke/em_german_mistral_v01-GGUF (file em_german_mistral_v01.Q4_K_M.gguf):

```bash
huggingface-cli download TheBloke/em_german_mistral_v01-GGUF em_german_mistral_v01.Q4_K_M.gguf --local-dir . --local-dir-use-symlinks False
```
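The TheBloke repository used here is public, so no authentication is needed. If the model you want to use is gated or private, log in with your Hugging Face account first:

```bash
huggingface-cli login
```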
3. Create a Model File: Create a model configuration file named Modelfile with the following content:

```
FROM em_german_mistral_v01.Q4_K_M.gguf

# set the temperature to 0 [higher is more creative, lower is more coherent]
PARAMETER temperature 0

# optionally set a system message
# SYSTEM """
# You are Mario from Super Mario Bros. Answer as Mario, the assistant, only.
# """
```
4. Instruct Ollama to Create the Model: Create the custom model using Ollama with the following command (the model name comes first, and -f points at the Modelfile):

```bash
ollama create mymodel -f Modelfile
```
5. Run Your Custom Model: Run your custom model using:

```bash
ollama run mymodel
```
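To confirm the model was registered and responds as expected, you can list the local models and send a one-off prompt without entering the interactive chat (the German question is just an example, since this is a German fine-tune):

```bash
# Show all models known to the local Ollama instance; mymodel should appear in the list
ollama list

# Send a single prompt to the custom model
ollama run mymodel "Was ist die Hauptstadt von Deutschland?"
```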
By following these steps, you can effectively utilize Ollama for private model inference on a VM with GPU, ensuring secure and efficient operations for your machine learning projects.
Happy prompting!