How to Set Up and Run Ollama on a GPU-Powered VM (vast.ai)
In this tutorial, we'll walk you through setting up and using Ollama for private model inference on a machine with a GPU, either your local machine or a VM rented from Vast.ai or Runpod.io. Ollama lets you run models privately, keeping your data secure, and running it on a GPU-powered VM significantly improves the speed and efficiency of your inference tasks.
Outline
Set up a VM with GPU on Vast.ai
Start Jupyter Terminal
Install Ollama
Run Ollama Serve
Test Ollama with a model
(Optional) Use your own model
Setting Up a VM with GPU on Vast.ai
1. Create a VM with GPU:
   - Visit Vast.ai to create your VM.
   - Choose a VM with at least 30 GB of storage to accommodate the models. This ensures you have enough space for the installation and model files.
   - Select a VM that costs less than $0.30 per hour to keep the setup cost-effective.
2. Start Jupyter Terminal:
   - Once your VM is up and running, start Jupyter and open a terminal within it. This is the easiest way to get started.
   - Alternatively, you can connect to the VM over SSH from your local machine (for example, with VS Code), but you will need to create an SSH key and add it to your Vast.ai account first; a sketch of that route follows below.
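For the SSH route, a minimal sketch looks like this (PORT and INSTANCE_IP are placeholders; copy the real values from the connection details of your Vast.ai instance):

```bash
# Generate an SSH key pair if you do not already have one
ssh-keygen -t ed25519 -f ~/.ssh/id_ed25519

# Add the public key (~/.ssh/id_ed25519.pub) to your Vast.ai account,
# then connect using the port and IP shown for your instance
ssh -p PORT root@INSTANCE_IP
```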
Downloading and Running Ollama
1. Install Ollama: Open the terminal in Jupyter and run the following command to install Ollama:
```bash
curl -fsSL https://ollama.com/install.sh | sh
```
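If you want to confirm the installation finished cleanly, you can print the installed version (assuming the install script placed the ollama binary on your PATH, which is its default behavior):

```bash
# Print the installed Ollama version
ollama --version
```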
2. Run Ollama Serve: After installation, start the Ollama service by running:
```bash
ollama serve &
```
Check the startup output and ensure there are no GPU errors. If the GPU is not detected, inference falls back to the CPU and responses will be noticeably slow when you interact with the model.
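To verify that the server is actually up before loading a model, you can query its HTTP API (assuming Ollama is listening on its default port, 11434):

```bash
# List the models known to the local Ollama server; an empty list is fine at this point
curl http://localhost:11434/api/tags
```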
3. Test Ollama with a Model: Test the setup by running a sample model such as Mistral:
```bash
ollama run mistral
```
You can now start chatting with the model to ensure everything is working correctly.
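If you prefer a scripted test over an interactive chat, you can send a single prompt through the REST API instead (a minimal sketch, again assuming the default port 11434 and that the mistral model was pulled in the previous step):

```bash
curl http://localhost:11434/api/generate -d '{
  "model": "mistral",
  "prompt": "Why is the sky blue?",
  "stream": false
}'
```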
Optional (Check GPU usage)
Check GPU Utilization: During inference (the previous step), check whether the GPU is being used by running the following command:

```bash
nvidia-smi
```
- Ensure that the memory utilization is greater than 0%. This indicates that the GPU is being used for the inference process.
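If you want to keep an eye on the GPU while you chat with the model, the query flags below print just the relevant numbers once per second (a small convenience sketch; the field names are standard nvidia-smi query properties):

```bash
# Refresh GPU memory and utilization figures every second
watch -n 1 nvidia-smi --query-gpu=memory.used,memory.total,utilization.gpu --format=csv
```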
Using Your Own Hugging Face Model with Ollama
1. Install Hugging Face CLI: If you want to use your own model from Hugging Face, first install the Hugging Face CLI, which ships with the huggingface-hub Python package:

```bash
pip3 install huggingface-hub
```

2. Download Your Model: Download your desired model from Hugging Face. In this example, we use a fine-tuned German Mistral model in GGUF format, TheBloke/em_german_mistral_v01-GGUF (file em_german_mistral_v01.Q4_K_M.gguf):

```bash
huggingface-cli download TheBloke/em_german_mistral_v01-GGUF em_german_mistral_v01.Q4_K_M.gguf --local-dir . --local-dir-use-symlinks False
```
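The TheBloke repository used here is public, so no authentication is needed. If the model you want to use is gated or private, log in with your Hugging Face account first:

```bash
huggingface-cli login
```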
3. Create a Model File: Create a model configuration file named Modelfile with the following content:

```
FROM em_german_mistral_v01.Q4_K_M.gguf

# set the temperature to 0 [higher is more creative, lower is more coherent]
PARAMETER temperature 0

# optionally set a system message
# SYSTEM """
# You are Mario from Super Mario Bros. Answer as Mario, the assistant, only.
# """
```
4. Instruct Ollama to Create the Model: Create the custom model using Ollama with the following command (the model name comes first, and -f points at the Modelfile):

```bash
ollama create mymodel -f Modelfile
```
5. Run Your Custom Model: Run your custom model using:

```bash
ollama run mymodel
```
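To confirm the model was registered and responds as expected, you can list the local models and send a one-off prompt without entering the interactive chat (the German question is just an example, since this is a German fine-tune):

```bash
# Show all models known to the local Ollama instance; mymodel should appear in the list
ollama list

# Send a single prompt to the custom model
ollama run mymodel "Was ist die Hauptstadt von Deutschland?"
```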
By following these steps, you can effectively utilize Ollama for private model inference on a VM with GPU, ensuring secure and efficient operations for your machine learning projects.
Happy prompting!