Run Ollama on Runpod — Persistent Cloud GPU Setup Guide
2026/04/16
Intermediate · 25 min

Set up Ollama as a persistent cloud AI service on Runpod. Keep your models between sessions, expose the API endpoint, and connect from any device you own.

If you want to run Ollama in the cloud but keep your models and settings between sessions, you need a persistent setup. This guide walks through deploying Ollama on Runpod with persistent storage, API access, and auto-recovery.

What You'll Get

  • Ollama running on a cloud GPU (available 24/7 or on-demand)
  • Models stored persistently — no re-downloading after restarts
  • OpenAI-compatible API accessible from anywhere
  • Automatic startup when the instance boots

Prerequisites

  • A Runpod account
  • Basic Docker and terminal knowledge
  • A credit card for billing

Step 1: Create a Network Volume

Persistent storage ensures your models survive instance restarts:

  1. Go to Storage → Network Volumes
  2. Click Add Network Volume
  3. Size: 50 GB (enough for several large models)
  4. Data Center: Pick one close to you (remember this for Step 2)
  5. Click Create
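As a sanity check on the 50 GB figure, you can tally approximate model sizes. A rough sketch in Python — the per-model sizes are estimates for the default 4-bit quantized builds, not exact figures (`ollama list` shows the real values after download):

```python
# Approximate on-disk sizes (GB) for the models pulled later in this guide.
# These are estimates for the default quantized builds.
MODEL_SIZES_GB = {
    "llama3.1:8b": 4.9,
    "qwen2.5:14b": 9.0,
    "deepseek-r1:8b": 5.2,
}

def suggested_volume_gb(models, headroom=2.0):
    """Total model storage times a headroom factor for future pulls."""
    return sum(models.values()) * headroom

print(f"Current models: {sum(MODEL_SIZES_GB.values()):.1f} GB")
print(f"Suggested volume with 2x headroom: {suggested_volume_gb(MODEL_SIZES_GB):.0f} GB")
```

With these three models that comes out to roughly 19 GB, so 50 GB leaves comfortable room for a couple of extra pulls.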

Step 2: Deploy a GPU Instance

  1. Go to GPU Cloud → Deploy
  2. Select a GPU:
    • RTX 4090 ($0.44/hr) — best for models up to 14B
    • A100 40GB ($0.80/hr) — best for models up to 30B
    • A100 80GB ($1.50/hr) — best for 70B models
  3. Important: Select the same data center as your network volume
  4. Under Customize Deployment, select the Ollama template
  5. Attach your network volume at mount path /workspace
  6. Click Deploy
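The GPU pairings above follow a common rule of thumb for 4-bit quantized models: roughly 0.55 GB of VRAM per billion parameters for the weights, plus a couple of GB of overhead for the KV cache and CUDA context. A rough estimator — the constants are approximations, not exact requirements:

```python
def est_vram_gb(params_b, gb_per_billion=0.55, overhead_gb=2.0):
    """Rough VRAM needed for a 4-bit quantized model of params_b billion params."""
    return params_b * gb_per_billion + overhead_gb

for size in (8, 14, 30, 70):
    print(f"{size}B model: ~{est_vram_gb(size):.1f} GB VRAM")
```

A 70B model lands just above 40 GB by this estimate, which is why the A100 80GB is the safe pick for that tier; the headroom in the other pairings leaves room for longer context windows.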

Step 3: Configure Persistent Storage

Connect to your instance's terminal (via the web terminal or SSH), then configure Ollama to store models on the persistent volume:

# Create model directory on persistent storage
mkdir -p /workspace/ollama/models

# Set Ollama to use persistent storage and listen on all interfaces
# (binding 0.0.0.0 is required for the Runpod proxy to reach the API)
export OLLAMA_MODELS=/workspace/ollama/models
export OLLAMA_HOST=0.0.0.0:11434

# Stop the default Ollama service if one is running
sudo systemctl stop ollama 2>/dev/null || true

# Start Ollama with persistent storage
ollama serve > /workspace/ollama.log 2>&1 &

Step 4: Download Your Models

# Set the model path
export OLLAMA_MODELS=/workspace/ollama/models

# Download your preferred models
ollama pull llama3.1:8b
ollama pull qwen2.5:14b
ollama pull deepseek-r1:8b

# Verify downloads
ollama list

Models are now stored on your persistent volume and will survive restarts.

Step 5: Set Up Auto-Start

Create a startup script so Ollama launches automatically:

cat > /workspace/start-ollama.sh << 'EOF'
#!/bin/bash
export OLLAMA_MODELS=/workspace/ollama/models
export OLLAMA_HOST=0.0.0.0:11434

# Kill any existing Ollama process
pkill ollama 2>/dev/null || true
sleep 2

# Start Ollama
ollama serve > /workspace/ollama.log 2>&1 &

echo "Ollama started. Waiting for it to be ready..."
sleep 5
ollama list
EOF

chmod +x /workspace/start-ollama.sh

Add it to your instance's start command in Runpod settings, or run it manually after each restart.

Step 6: Expose the API

To access Ollama from external applications:

  1. Go to your instance settings in Runpod
  2. Under Ports, expose HTTP port 11434
  3. Use the proxy URL Runpod assigns, which includes the port: https://your-pod-id-11434.proxy.runpod.net

Test it:

curl https://your-pod-id-11434.proxy.runpod.net/api/tags
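If you script against the endpoint, it helps to wait until the API is actually up after a restart. A minimal readiness check in Python using only the standard library — the pod URL is a placeholder, substitute your own proxy address:

```python
import json
import time
import urllib.request

BASE_URL = "https://your-pod-id-11434.proxy.runpod.net"  # placeholder pod URL

def wait_for_ollama(base_url, timeout_s=60):
    """Poll /api/tags until Ollama responds; return the available model names."""
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        try:
            with urllib.request.urlopen(f"{base_url}/api/tags", timeout=5) as resp:
                data = json.load(resp)
            return [m["name"] for m in data.get("models", [])]
        except OSError:
            time.sleep(2)  # API not ready yet; retry
    raise TimeoutError(f"Ollama at {base_url} did not respond within {timeout_s}s")
```

Call `wait_for_ollama(BASE_URL)` at the top of any automation that runs right after a pod restart, since the instance takes a couple of minutes to boot.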

Use as OpenAI-Compatible API

Your Runpod Ollama instance works as a drop-in replacement for the OpenAI API:

import openai

client = openai.OpenAI(
    base_url="https://your-pod-id-11434.proxy.runpod.net/v1",
    api_key="not-needed"
)

response = client.chat.completions.create(
    model="llama3.1:8b",
    messages=[
        {"role": "user", "content": "Hello from the cloud!"}
    ]
)
print(response.choices[0].message.content)

Step 7: Connect from Your Local Tools

Open WebUI

  1. Deploy Open WebUI (or run it locally)
  2. Set the Ollama URL to your Runpod proxy URL
  3. Your cloud models appear in the model selector

Custom Applications

Point any OpenAI-compatible client to:

https://your-pod-id-11434.proxy.runpod.net/v1
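Ollama's native API works over the same proxy URL too. A minimal standard-library client for the /api/chat endpoint — the pod URL is a placeholder, and streaming is disabled so the full reply arrives in one response:

```python
import json
import urllib.request

BASE_URL = "https://your-pod-id-11434.proxy.runpod.net"  # placeholder pod URL

def build_chat_request(model, prompt):
    """Assemble the JSON body for Ollama's native /api/chat endpoint."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,  # one complete response instead of a chunk stream
    }

def chat(model, prompt, base_url=BASE_URL):
    body = json.dumps(build_chat_request(model, prompt)).encode()
    req = urllib.request.Request(
        f"{base_url}/api/chat",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["message"]["content"]
```

Use the native endpoint when you need Ollama-specific options; use the /v1 route when a tool only speaks the OpenAI API.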

Cost Management

Auto-Stop Configuration

Save money by auto-stopping idle instances:

  1. Go to instance settings
  2. Set Auto-Stop to 1 hour of inactivity
  3. Your instance stops automatically when not in use
  4. Restart it from the dashboard when needed (takes ~2 minutes)

Estimated Costs

Usage Pattern        | GPU       | Monthly Cost
2 hrs/day, weekdays  | RTX 4090  | ~$18
4 hrs/day, weekdays  | RTX 4090  | ~$35
Always on (24/7)     | RTX 4090  | ~$320
2 hrs/day, weekdays  | A100 80GB | ~$60

For most users, 2-4 hours per day on an RTX 4090 is sufficient and affordable.
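The table can be reproduced with simple arithmetic: hourly rate times hours per day times billable days (about 21 weekdays in a month, or 30 days for always-on), using the on-demand rates from Step 2:

```python
def monthly_cost(rate_per_hr, hrs_per_day, days_per_month=21):
    """Estimated monthly spend; 21 approximates the weekdays in a month."""
    return rate_per_hr * hrs_per_day * days_per_month

print(f"RTX 4090, 2 hrs/day weekdays:  ${monthly_cost(0.44, 2):.0f}")
print(f"RTX 4090, 4 hrs/day weekdays:  ${monthly_cost(0.44, 4):.0f}")
print(f"RTX 4090, always on:           ${monthly_cost(0.44, 24, days_per_month=30):.0f}")
print(f"A100 80GB, 2 hrs/day weekdays: ${monthly_cost(1.50, 2):.0f}")
```

Plug in your own rate and schedule before deciding between on-demand and always-on; the break-even point moves quickly with daily usage.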

Spot Instances for Development

Use spot instances (up to 70% cheaper) when:

  • You're testing and don't mind interruptions
  • You can save your work frequently
  • You're doing batch processing that can resume

Troubleshooting

Models missing after restart: Make sure OLLAMA_MODELS=/workspace/ollama/models is set in your startup script.

API not accessible: Verify port 11434 is exposed in instance settings and Ollama is running (ollama list).

Slow first response: The model needs to load into VRAM after Ollama starts. Subsequent responses are fast.

Out of VRAM: Switch to a smaller model or a GPU with more VRAM. (Note that ollama rm model-name removes a model to free disk space on the volume; it does not free VRAM.)

Summary

With persistent storage and auto-start, your Runpod Ollama instance behaves like a personal AI server. Models stay between sessions, the API is accessible from anywhere, and you only pay for what you use.

Next Steps

  • Runpod Beginner Guide — basics if you're new
  • Run Open WebUI on Runpod — add a browser interface
  • Best GPU Cloud for LLM — compare cloud providers

© 2026 Local AI Hub. All Rights Reserved.