Run Ollama on Runpod — Persistent Cloud GPU Setup Guide
Set up Ollama as a persistent cloud AI service on Runpod. Keep your models between sessions, expose the API endpoint, and connect from any device you own.
If you want to run Ollama on the cloud but keep your models and settings between sessions, you need a persistent setup. This guide shows you how to deploy Ollama on Runpod with persistent storage, API access, and auto-recovery.
What You'll Get
- Ollama running on a cloud GPU (available 24/7 or on-demand)
- Models stored persistently — no re-downloading after restarts
- OpenAI-compatible API accessible from anywhere
- Automatic startup when the instance boots
Prerequisites
- A Runpod account
- Basic Docker and terminal knowledge
- A credit card for billing
Step 1: Create a Network Volume
Persistent storage ensures your models survive instance restarts:
- Go to Storage → Network Volumes
- Click Add Network Volume
- Size: 50 GB (enough for several large models)
- Data Center: Pick one close to you (remember this for Step 2)
- Click Create
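To sanity-check the 50 GB sizing, you can tally approximate model download sizes against the volume. A quick sketch — the sizes below are my rough assumptions for Q4 builds, not official figures; check the SIZE column of `ollama list` for real numbers:

```python
# Approximate on-disk sizes (GB) for the models pulled later in this guide.
# These are assumed values for illustration only.
MODEL_SIZES_GB = {"llama3.1:8b": 4.9, "qwen2.5:14b": 9.0, "deepseek-r1:8b": 5.2}

def remaining_space_gb(volume_gb: float, installed: list[str]) -> float:
    """Space left on the volume after pulling the given models."""
    return round(volume_gb - sum(MODEL_SIZES_GB[m] for m in installed), 1)

# All three models together still leave roughly 30 GB free on a 50 GB volume.
print(remaining_space_gb(50, list(MODEL_SIZES_GB)))
```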
Step 2: Deploy a GPU Instance
- Go to GPU Cloud → Deploy
- Select a GPU:
- RTX 4090 ($0.44/hr) — best for models up to 14B
- A100 40GB ($0.80/hr) — best for models up to 30B
- A100 80GB ($1.50/hr) — best for 70B models
- Important: Select the same data center as your network volume
- Under Customize Deployment, select the Ollama template
- Attach your network volume at mount path /workspace
- Click Deploy
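A rule of thumb behind these GPU recommendations: 4-bit quantized weights take roughly half a gigabyte per billion parameters, plus a few GB for the KV cache and runtime. A rough sketch — the constants are my approximations, not Runpod or Ollama figures:

```python
def estimated_vram_gb(params_billion: float, bits_per_weight: int = 4,
                      overhead_gb: float = 2.0) -> float:
    """Rough VRAM needed for a quantized model: weights plus cache/runtime overhead."""
    weights_gb = params_billion * bits_per_weight / 8  # 4-bit ≈ 0.5 GB per billion params
    return weights_gb + overhead_gb

def fits(params_billion: float, gpu_vram_gb: float) -> bool:
    """Does a Q4 model of this size plausibly fit in the given VRAM?"""
    return estimated_vram_gb(params_billion) <= gpu_vram_gb

print(fits(14, 24))  # 14B Q4 ≈ 9 GB — fine on a 24 GB RTX 4090
print(fits(70, 24))  # 70B Q4 ≈ 37 GB — too big for 24 GB
```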
Step 3: Configure Persistent Storage
Connect to your instance via HTTP Proxy terminal, then configure Ollama to store models on the persistent volume:
```bash
# Create model directory on persistent storage
mkdir -p /workspace/ollama/models

# Set Ollama to use persistent storage
export OLLAMA_MODELS=/workspace/ollama/models

# Stop the default Ollama service
sudo systemctl stop ollama 2>/dev/null || true

# Start Ollama with persistent storage
OLLAMA_MODELS=/workspace/ollama/models ollama serve > /workspace/ollama.log 2>&1 &
```
Step 4: Download Your Models
```bash
# Set the model path
export OLLAMA_MODELS=/workspace/ollama/models

# Download your preferred models
ollama pull llama3.1:8b
ollama pull qwen2.5:14b
ollama pull deepseek-r1:8b

# Verify downloads
ollama list
```
Models are now stored on your persistent volume and will survive restarts.
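You can also verify the installed models over the API rather than the CLI, which is handy once the pod is remote. A minimal sketch against Ollama's `/api/tags` endpoint — the helper names are mine, and `base_url` should point at your pod's proxy URL when calling from outside:

```python
import json
import urllib.request

def parse_model_names(payload: dict) -> list[str]:
    """Extract model names from an Ollama /api/tags response body."""
    return [m["name"] for m in payload.get("models", [])]

def list_model_names(base_url: str = "http://localhost:11434") -> list[str]:
    """Fetch the installed models from a running Ollama server."""
    with urllib.request.urlopen(f"{base_url}/api/tags") as resp:
        return parse_model_names(json.load(resp))
```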
Step 5: Set Up Auto-Start
Create a startup script so Ollama launches automatically:
```bash
cat > /workspace/start-ollama.sh << 'EOF'
#!/bin/bash
export OLLAMA_MODELS=/workspace/ollama/models
export OLLAMA_HOST=0.0.0.0:11434

# Kill any existing Ollama process
pkill ollama 2>/dev/null || true
sleep 2

# Start Ollama
ollama serve > /workspace/ollama.log 2>&1 &
echo "Ollama started. Waiting for it to be ready..."
sleep 5
ollama list
EOF

chmod +x /workspace/start-ollama.sh
```
Add it to your instance's start command in Runpod settings, or run it manually after each restart.
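The `sleep 5` in the script is a guess at startup time; a more robust approach is to poll until the API actually answers. A sketch of such a readiness check — the helper name and default timeouts are my own choices:

```python
import time
import urllib.error
import urllib.request

def wait_for_ollama(base_url: str = "http://localhost:11434",
                    timeout_s: float = 60.0, interval_s: float = 2.0) -> bool:
    """Poll Ollama's root endpoint until it responds, or give up after timeout_s."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(base_url, timeout=5) as resp:
                if resp.status == 200:  # Ollama answers "Ollama is running"
                    return True
        except (urllib.error.URLError, OSError):
            pass  # not up yet; retry after a short pause
        time.sleep(interval_s)
    return False
```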
Step 6: Expose the API
To access Ollama from external applications:
- Go to your instance settings in Runpod
- Under Ports, expose port 11434
- Use the proxy URL (Runpod puts the port into the hostname):
https://your-pod-id-11434.proxy.runpod.net
Test it:
```bash
curl https://your-pod-id-11434.proxy.runpod.net/api/tags
```
Use as OpenAI-Compatible API
Your Runpod Ollama instance works as a drop-in replacement for the OpenAI API:
```python
import openai

client = openai.OpenAI(
    base_url="https://your-pod-id-11434.proxy.runpod.net/v1",
    api_key="not-needed"  # Ollama ignores the key, but the client requires one
)

response = client.chat.completions.create(
    model="llama3.1:8b",
    messages=[
        {"role": "user", "content": "Hello from the cloud!"}
    ]
)

print(response.choices[0].message.content)
```
Step 7: Connect from Your Local Tools
Open WebUI
- Deploy Open WebUI (or run it locally)
- Set the Ollama URL to your Runpod proxy URL
- Your cloud models appear in the model selector
Custom Applications
Point any OpenAI-compatible client to:
https://your-pod-id-11434.proxy.runpod.net/v1
Cost Management
Auto-Stop Configuration
Save money by auto-stopping idle instances:
- Go to instance settings
- Set Auto-Stop to 1 hour of inactivity
- Your instance stops automatically when not in use
- Restart it from the dashboard when needed (takes ~2 minutes)
Estimated Costs
| Usage Pattern | GPU | Monthly Cost |
|---|---|---|
| 2 hrs/day, weekdays | RTX 4090 | ~$18 |
| 4 hrs/day, weekdays | RTX 4090 | ~$35 |
| Always on (24/7) | RTX 4090 | ~$320 |
| 2 hrs/day, weekdays | A100 80GB | ~$60 |
For most users, 2-4 hours per day on an RTX 4090 is sufficient and affordable.
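The table's numbers come from simple rate-times-hours arithmetic; a sketch you can adapt to your own usage pattern (assuming ~20 weekday billable days a month):

```python
def monthly_cost(hourly_rate: float, hours_per_day: float, days: int = 20) -> float:
    """Estimate a month's GPU bill: hourly rate × hours per day × billable days."""
    return round(hourly_rate * hours_per_day * days, 2)

print(monthly_cost(0.44, 2))       # ~$18: 2 hrs/day on weekdays, RTX 4090
print(monthly_cost(0.44, 4))       # ~$35: 4 hrs/day on weekdays
print(monthly_cost(0.44, 24, 30))  # ~$320: always on
```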
Spot Instances for Development
Use spot instances (up to 70% cheaper) when:
- You're testing and don't mind interruptions
- You can save your work frequently
- You're doing batch processing that can resume
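The "can resume" requirement is straightforward to satisfy by checkpointing progress to the network volume after each item. A minimal sketch — the state-file layout and function name are illustrative, not a Runpod or Ollama feature:

```python
import json
from pathlib import Path

def process_batch(items: list, state_path: Path) -> int:
    """Process items in order, checkpointing the next index after each one
    so a spot interruption can resume where it left off. Returns the number
    of items this particular run processed."""
    start = json.loads(state_path.read_text())["next"] if state_path.exists() else 0
    for i in range(start, len(items)):
        # ... call the model on items[i] here ...
        state_path.write_text(json.dumps({"next": i + 1}))
    return len(items) - start

# On Runpod, keep the checkpoint on the persistent volume:
# process_batch(prompts, Path("/workspace/batch_state.json"))
```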
Troubleshooting
Models missing after restart: Make sure OLLAMA_MODELS=/workspace/ollama/models is set in your startup script.
API not accessible: Verify port 11434 is exposed in instance settings and Ollama is running (ollama list).
Slow first response: The model needs to load into VRAM after Ollama starts. Subsequent responses are fast.
Out of VRAM: Switch to a smaller model or a GPU with more VRAM. Use ollama rm model-name to free space.
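The first two issues above can be checked in one place; a small diagnostic sketch (the function and message strings are mine, and the path assumes the mount point from Step 2):

```python
import os
from pathlib import Path

def check_persistence(expected: str = "/workspace/ollama/models") -> list[str]:
    """Return a list of setup problems; an empty list means the
    persistent-storage configuration looks right."""
    problems = []
    if os.environ.get("OLLAMA_MODELS") != expected:
        problems.append(f"OLLAMA_MODELS is not set to {expected}")
    if not Path(expected).is_dir():
        problems.append(f"{expected} does not exist")
    return problems

# Run inside the pod; models should survive restarts only if this prints [].
print(check_persistence())
```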
Summary
With persistent storage and auto-start, your Runpod Ollama instance behaves like a personal AI server. Models stay between sessions, the API is accessible from anywhere, and you only pay for what you use.
Next Steps
- Runpod Beginner Guide — basics if you're new
- Run Open WebUI on Runpod — add a browser interface
- Best GPU Cloud for LLM — compare cloud providers