Run Ollama on Runpod — Persistent Cloud GPU Setup Guide
Set up Ollama as a persistent cloud AI service on Runpod. Keep your models between sessions, expose the API endpoint, and connect from any device you own.
If you want to run Ollama on the cloud but keep your models and settings between sessions, you need a persistent setup. This guide shows you how to deploy Ollama on Runpod with persistent storage, API access, and auto-recovery.
What You'll Get
- Ollama running on a cloud GPU (available 24/7 or on-demand)
- Models stored persistently — no re-downloading after restarts
- OpenAI-compatible API accessible from anywhere
- Automatic startup when the instance boots
Prerequisites
- A Runpod account
- Basic Docker and terminal knowledge
- A credit card for billing
Step 1: Create a Network Volume
Persistent storage ensures your models survive instance restarts:
- Go to Storage → Network Volumes
- Click Add Network Volume
- Size: 50 GB (enough for several large models)
- Data Center: Pick one close to you (remember this for Step 2)
- Click Create
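To sanity-check the 50 GB sizing, you can tally approximate model download sizes against the volume. A quick sketch — the sizes below are my rough assumptions for Q4 builds, not official figures; check the SIZE column of `ollama list` for real numbers:

```python
# Approximate on-disk sizes (GB) for the models pulled later in this guide.
# These are assumed values for illustration only.
MODEL_SIZES_GB = {"llama3.1:8b": 4.9, "qwen2.5:14b": 9.0, "deepseek-r1:8b": 5.2}

def remaining_space_gb(volume_gb: float, installed: list[str]) -> float:
    """Space left on the volume after pulling the given models."""
    return round(volume_gb - sum(MODEL_SIZES_GB[m] for m in installed), 1)

# All three models together still leave roughly 30 GB free on a 50 GB volume.
print(remaining_space_gb(50, list(MODEL_SIZES_GB)))
```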
Step 2: Deploy a GPU Instance
- Go to GPU Cloud → Deploy
- Select a GPU:
- RTX 4090 ($0.44/hr) — best for models up to 14B
- A100 40GB ($0.80/hr) — best for models up to 30B
- A100 80GB ($1.50/hr) — best for 70B models
- Important: Select the same data center as your network volume
- Under Customize Deployment, select the Ollama template
- Attach your network volume at mount path /workspace
- Click Deploy
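A rule of thumb behind these GPU recommendations: 4-bit quantized weights take roughly half a gigabyte per billion parameters, plus a few GB for the KV cache and runtime. A rough sketch — the constants are my approximations, not Runpod or Ollama figures:

```python
def estimated_vram_gb(params_billion: float, bits_per_weight: int = 4,
                      overhead_gb: float = 2.0) -> float:
    """Rough VRAM needed for a quantized model: weights plus cache/runtime overhead."""
    weights_gb = params_billion * bits_per_weight / 8  # 4-bit ≈ 0.5 GB per billion params
    return weights_gb + overhead_gb

def fits(params_billion: float, gpu_vram_gb: float) -> bool:
    """Does a Q4 model of this size plausibly fit in the given VRAM?"""
    return estimated_vram_gb(params_billion) <= gpu_vram_gb

print(fits(14, 24))  # 14B Q4 ≈ 9 GB — fine on a 24 GB RTX 4090
print(fits(70, 24))  # 70B Q4 ≈ 37 GB — too big for 24 GB
```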
Step 3: Configure Persistent Storage
Connect to your instance via HTTP Proxy terminal, then configure Ollama to store models on the persistent volume:
```bash
# Create model directory on persistent storage
mkdir -p /workspace/ollama/models

# Set Ollama to use persistent storage
export OLLAMA_MODELS=/workspace/ollama/models

# Stop the default Ollama service
sudo systemctl stop ollama 2>/dev/null || true

# Start Ollama with persistent storage
OLLAMA_MODELS=/workspace/ollama/models ollama serve > /workspace/ollama.log 2>&1 &
```
Step 4: Download Your Models
```bash
# Set the model path
export OLLAMA_MODELS=/workspace/ollama/models

# Download your preferred models
ollama pull llama3.1:8b
ollama pull qwen2.5:14b
ollama pull deepseek-r1:8b

# Verify downloads
ollama list
```
Models are now stored on your persistent volume and will survive restarts.
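You can also verify the installed models over the API rather than the CLI, which is handy once the pod is remote. A minimal sketch against Ollama's `/api/tags` endpoint — the helper names are mine, and `base_url` should point at your pod's proxy URL when calling from outside:

```python
import json
import urllib.request

def parse_model_names(payload: dict) -> list[str]:
    """Extract model names from an Ollama /api/tags response body."""
    return [m["name"] for m in payload.get("models", [])]

def list_model_names(base_url: str = "http://localhost:11434") -> list[str]:
    """Fetch the installed models from a running Ollama server."""
    with urllib.request.urlopen(f"{base_url}/api/tags") as resp:
        return parse_model_names(json.load(resp))
```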
Step 5: Set Up Auto-Start
Create a startup script so Ollama launches automatically:
```bash
cat > /workspace/start-ollama.sh << 'EOF'
#!/bin/bash
export OLLAMA_MODELS=/workspace/ollama/models
export OLLAMA_HOST=0.0.0.0:11434

# Kill any existing Ollama process
pkill ollama 2>/dev/null || true
sleep 2

# Start Ollama
ollama serve > /workspace/ollama.log 2>&1 &
echo "Ollama started. Waiting for it to be ready..."
sleep 5
ollama list
EOF

chmod +x /workspace/start-ollama.sh
```
Add it to your instance's start command in Runpod settings, or run it manually after each restart.
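The `sleep 5` in the script is a guess at startup time; a more robust approach is to poll until the API actually answers. A sketch of such a readiness check — the helper name and default timeouts are my own choices:

```python
import time
import urllib.error
import urllib.request

def wait_for_ollama(base_url: str = "http://localhost:11434",
                    timeout_s: float = 60.0, interval_s: float = 2.0) -> bool:
    """Poll Ollama's root endpoint until it responds, or give up after timeout_s."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(base_url, timeout=5) as resp:
                if resp.status == 200:  # Ollama answers "Ollama is running"
                    return True
        except (urllib.error.URLError, OSError):
            pass  # not up yet; retry after a short pause
        time.sleep(interval_s)
    return False
```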
Step 6: Expose the API
To access Ollama from external applications:
- Go to your instance settings in Runpod
- Under Ports, expose port 11434
- Use the proxy URL (Runpod puts the port into the hostname):
https://your-pod-id-11434.proxy.runpod.net
Test it:
```bash
curl https://your-pod-id-11434.proxy.runpod.net/api/tags
```
Use as OpenAI-Compatible API
Your Runpod Ollama instance works as a drop-in replacement for the OpenAI API:
```python
import openai

client = openai.OpenAI(
    base_url="https://your-pod-id-11434.proxy.runpod.net/v1",
    api_key="not-needed"  # Ollama ignores the key, but the client requires one
)

response = client.chat.completions.create(
    model="llama3.1:8b",
    messages=[
        {"role": "user", "content": "Hello from the cloud!"}
    ]
)

print(response.choices[0].message.content)
```
Step 7: Connect from Your Local Tools
Open WebUI
- Deploy Open WebUI (or run it locally)
- Set the Ollama URL to your Runpod proxy URL
- Your cloud models appear in the model selector
Custom Applications
Point any OpenAI-compatible client to:
https://your-pod-id-11434.proxy.runpod.net/v1
Cost Management
Auto-Stop Configuration
Save money by auto-stopping idle instances:
- Go to instance settings
- Set Auto-Stop to 1 hour of inactivity
- Your instance stops automatically when not in use
- Restart it from the dashboard when needed (takes ~2 minutes)
Estimated Costs
| Usage Pattern | GPU | Monthly Cost |
|---|---|---|
| 2 hrs/day, weekdays | RTX 4090 | ~$18 |
| 4 hrs/day, weekdays | RTX 4090 | ~$35 |
| Always on (24/7) | RTX 4090 | ~$320 |
| 2 hrs/day, weekdays | A100 80GB | ~$60 |
For most users, 2-4 hours per day on an RTX 4090 is sufficient and affordable.
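The table's numbers come from simple rate-times-hours arithmetic; a sketch you can adapt to your own usage pattern (assuming ~20 weekday billable days a month):

```python
def monthly_cost(hourly_rate: float, hours_per_day: float, days: int = 20) -> float:
    """Estimate a month's GPU bill: hourly rate × hours per day × billable days."""
    return round(hourly_rate * hours_per_day * days, 2)

print(monthly_cost(0.44, 2))       # ~$18: 2 hrs/day on weekdays, RTX 4090
print(monthly_cost(0.44, 4))       # ~$35: 4 hrs/day on weekdays
print(monthly_cost(0.44, 24, 30))  # ~$320: always on
```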
Spot Instances for Development
Use spot instances (up to 70% cheaper) when:
- You're testing and don't mind interruptions
- You can save your work frequently
- You're doing batch processing that can resume
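The "can resume" requirement is straightforward to satisfy by checkpointing progress to the network volume after each item. A minimal sketch — the state-file layout and function name are illustrative, not a Runpod or Ollama feature:

```python
import json
from pathlib import Path

def process_batch(items: list, state_path: Path) -> int:
    """Process items in order, checkpointing the next index after each one
    so a spot interruption can resume where it left off. Returns the number
    of items this particular run processed."""
    start = json.loads(state_path.read_text())["next"] if state_path.exists() else 0
    for i in range(start, len(items)):
        # ... call the model on items[i] here ...
        state_path.write_text(json.dumps({"next": i + 1}))
    return len(items) - start

# On Runpod, keep the checkpoint on the persistent volume:
# process_batch(prompts, Path("/workspace/batch_state.json"))
```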
Troubleshooting
Models missing after restart: Make sure OLLAMA_MODELS=/workspace/ollama/models is set in your startup script.
API not accessible: Verify port 11434 is exposed in instance settings and Ollama is running (ollama list).
Slow first response: The model needs to load into VRAM after Ollama starts. Subsequent responses are fast.
Out of VRAM: Switch to a smaller model or a GPU with more VRAM. Use ollama rm model-name to free space.
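The first two issues above can be checked in one place; a small diagnostic sketch (the function and message strings are mine, and the path assumes the mount point from Step 2):

```python
import os
from pathlib import Path

def check_persistence(expected: str = "/workspace/ollama/models") -> list[str]:
    """Return a list of setup problems; an empty list means the
    persistent-storage configuration looks right."""
    problems = []
    if os.environ.get("OLLAMA_MODELS") != expected:
        problems.append(f"OLLAMA_MODELS is not set to {expected}")
    if not Path(expected).is_dir():
        problems.append(f"{expected} does not exist")
    return problems

# Run inside the pod; models should survive restarts only if this prints [].
print(check_persistence())
```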
Summary
With persistent storage and auto-start, your Runpod Ollama instance behaves like a personal AI server. Models stay between sessions, the API is accessible from anywhere, and you only pay for what you use.
Next Steps
- Runpod Beginner Guide — basics if you're new
- Run Open WebUI on Runpod — add a browser interface
- Best GPU Cloud for LLM — compare cloud providers