How to Deploy Ollama on Runpod — Run Any Model on Cloud GPU
Step-by-step guide to deploying Ollama on Runpod with persistent storage, API access, and cost optimization. Run models up to 70B parameters on cloud GPU.
Running Ollama on Runpod gives you access to powerful GPUs without buying expensive hardware. This guide walks you through deploying Ollama with persistent storage so your models and data survive restarts.
Prerequisites
- A Runpod account
- Basic terminal familiarity
- A credit card for billing (you only pay while the instance is running)
Step 1: Choose Your GPU
Select a GPU based on the models you want to run:
| Model Size | Min VRAM | Recommended GPU | Est. Cost/hr |
|---|---|---|---|
| 7-8B params | 8 GB | RTX 4090 | ~$0.44 |
| 14B params | 16 GB | RTX 4090 | ~$0.44 |
| 32B params | 32 GB | A100 40GB | ~$0.80 |
| 70B params | 64 GB | A100 80GB | ~$1.50 |
For most users, an RTX 4090 offers the best value. It can handle all models up to 14B parameters comfortably.
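As a rough sanity check before renting a GPU, a 4-bit quantized model needs on the order of 0.7 GB of VRAM per billion parameters, plus 1–2 GB of overhead for context. The sketch below encodes that rule of thumb; the constants and the VRAM-to-model mapping are illustrative assumptions, not official figures, so always check the model card.

```shell
# Rough VRAM estimate for a 4-bit quantized model.
# 0.7 GB per billion params + 1.5 GB overhead is a heuristic
# assumption, not an official number.
est_vram_gb() {
  awk -v p="$1" 'BEGIN { printf "%.1f", p * 0.7 + 1.5 }'
}

# Illustrative mapping from available VRAM (GB) to a model tag.
pick_model() {
  if   [ "$1" -ge 64 ]; then echo "llama3.1:70b"
  elif [ "$1" -ge 32 ]; then echo "qwen2.5:32b"
  elif [ "$1" -ge 16 ]; then echo "qwen2.5:14b"
  else                       echo "llama3.1:8b"
  fi
}

est_vram_gb 14   # prints 11.3 -- a 14B model fits a 24 GB RTX 4090
pick_model 24    # prints qwen2.5:14b
```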
Step 2: Deploy with the Ollama Template
- Go to GPU Cloud in your Runpod dashboard
- Click Deploy
- Select your GPU (e.g., RTX 4090)
- In Customize Deployment, search for "Ollama" in the template list
- Select the official Ollama template
- Click Deploy
Wait 1-2 minutes for the instance to start. You'll see the status change to "Running."
Step 3: Connect and Verify
Click Connect on your instance, then open the web terminal (or SSH in if you prefer).
Verify Ollama is running:
ollama --version
ollama list
Step 4: Pull and Run Models
# Pull a model
ollama pull llama3.1:8b
# Run it
ollama run llama3.1:8b
# Try a larger model (if your GPU has enough VRAM)
ollama pull qwen2.5:14b
ollama run qwen2.5:14b
Step 5: Set Up Persistent Storage
By default, everything is lost when you stop the instance. To keep your models:
- Go to Storage in your Runpod dashboard
- Click Add Network Volume
- Choose a size (50 GB is enough for several models)
- Select a data center (pick the same region as your GPU)
- Attach the volume to your instance at /workspace
Then configure Ollama to use the volume:
# Stop Ollama
sudo systemctl stop ollama
# Set the model storage path
export OLLAMA_MODELS=/workspace/ollama/models
# Restart Ollama
ollama serve &
Now your downloaded models persist across restarts.
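Note that the export above only lasts for the current shell session. One way to make it stick across sessions is an idempotent line in your shell profile; this sketch assumes the pod's default shell is bash and reads ~/.bashrc (adjust the path if yours differs):

```shell
# Append the OLLAMA_MODELS export to ~/.bashrc, but only once
# (grep -qxF matches the exact line, so reruns are no-ops).
PROFILE="${PROFILE:-$HOME/.bashrc}"
LINE='export OLLAMA_MODELS=/workspace/ollama/models'
grep -qxF "$LINE" "$PROFILE" 2>/dev/null || echo "$LINE" >> "$PROFILE"
```

Restart ollama serve after changing the variable so the server picks up the new path.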
Step 6: Expose the API
To connect external tools (like Open WebUI or your own apps) to your cloud Ollama:
- In your instance settings, expose port 11434
- Use the Runpod proxy URL as your API endpoint
Your API URL includes the exposed port and will look like:
https://your-pod-id-11434.proxy.runpod.net
Test it:
curl https://your-pod-id-11434.proxy.runpod.net/api/tags
Ollama's native API lives under /api, and it also exposes an OpenAI-compatible endpoint under /v1, so you can point OpenAI-style clients at https://your-pod-id-11434.proxy.runpod.net/v1.
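To call the generate endpoint from scripts, you POST JSON to /api/generate. A minimal sketch, where build_payload is a hypothetical helper and your-pod-id is a placeholder for your actual pod ID:

```shell
# Hypothetical helper: JSON body for Ollama's /api/generate.
# Note: prompts containing quotes would need real JSON escaping.
build_payload() {
  printf '{"model":"%s","prompt":"%s","stream":false}' "$1" "$2"
}

build_payload llama3.1:8b Hello
# prints {"model":"llama3.1:8b","prompt":"Hello","stream":false}

# Usage against your pod (replace your-pod-id):
# curl -s https://your-pod-id-11434.proxy.runpod.net/api/generate \
#   -d "$(build_payload llama3.1:8b 'Why is the sky blue?')"
```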
Step 7: Connect from Open WebUI
If you have Open WebUI running locally or on another instance:
- Go to Open WebUI Settings
- Set the Ollama API URL to your Runpod proxy URL
- Your cloud models will appear in the model selector
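If you run Open WebUI via Docker, the same setting is the OLLAMA_BASE_URL environment variable. A sketch of a compose file following Open WebUI's published quick start, with your-pod-id as a placeholder:

```yaml
# docker-compose.yml -- Open WebUI pointed at the Runpod pod
services:
  open-webui:
    image: ghcr.io/open-webui/open-webui:main
    ports:
      - "3000:8080"    # UI served on http://localhost:3000
    environment:
      # placeholder pod id -- substitute your own proxy URL
      - OLLAMA_BASE_URL=https://your-pod-id-11434.proxy.runpod.net
    volumes:
      - open-webui:/app/backend/data    # persist chats and settings
volumes:
  open-webui:
```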
Cost Optimization
Enable Auto-Stop:
- Go to your instance settings
- Set Auto-Stop to 1 hour of inactivity
- This prevents accidental overcharges
Use Spot Instances:
- Spot instances are up to 70% cheaper
- They can be interrupted when demand is high
- Fine for experimentation, not for production use
Estimated monthly costs (10 hours/week usage):
| GPU | Monthly Cost |
|---|---|
| RTX 4090 | ~$17 |
| A100 40GB | ~$32 |
| A100 80GB | ~$60 |
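The table assumes roughly 40 billable hours a month (10 hours/week times about 4 weeks). A quick sketch for running your own numbers, using the estimated hourly rates from the tables above:

```shell
# Monthly cost = hourly rate * hours per week * ~4 weeks/month.
monthly_cost() {
  awk -v rate="$1" -v hours="$2" 'BEGIN { printf "%.2f", rate * hours * 4 }'
}

monthly_cost 0.44 10   # RTX 4090 at 10 hrs/week -> 17.60
monthly_cost 1.50 10   # A100 80GB at 10 hrs/week -> 60.00
```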
Troubleshooting
Ollama not responding: Check whether the server is up with ollama list. If it errors out, restart it with ollama serve &.
Out of VRAM: Use a model with fewer parameters or a lower-bit quantization. Each model's page in the Ollama library lists its available quantization tags; the plain llama3.1:8b tag is already 4-bit, so look for a 3-bit or 2-bit variant if you need to go smaller.
Slow first response: The first run after pulling a model loads it into VRAM. Subsequent responses will be fast.
Port not accessible: Make sure port 11434 is exposed in your instance settings.
Summary
Deploying Ollama on Runpod takes about 5 minutes. The Ollama template handles the setup. Add persistent storage to keep your models, expose the API port for external access, and set auto-stop to control costs.
Next Steps
- Runpod Beginner Guide — if you're completely new to Runpod
- Run Open WebUI on Runpod — add a browser interface
- Best GPU Cloud for LLM — compare Runpod with alternatives