Enterprise Local AI Deployment — Air-Gapped, On-Premise, and Compliant
Deploy local AI for enterprise use. Covers air-gapped setups, on-premise GPU servers, compliance, and multi-user configurations powered by Open WebUI.
Enterprises are adopting local AI for three reasons: data sovereignty, regulatory compliance, and cost control. When you process customer data, proprietary code, or regulated content through cloud APIs, you introduce risk. Local AI eliminates that risk by keeping everything on infrastructure you control.
This guide covers enterprise-grade local AI deployment — from air-gapped environments to multi-user Open WebUI setups with role-based access control.
Enterprise Use Cases for Local AI
| Use Case | Why Local | Compliance Driver |
|---|---|---|
| Legal document analysis | Privileged client communications | Attorney-client privilege, bar confidentiality rules |
| Healthcare record processing | Protected health information | HIPAA |
| Financial data analysis | Sensitive financial records | SOX, GDPR |
| Code assistant for proprietary code | Trade secrets, source code | Trade secret law, IP protection |
| Internal knowledge base | Confidential business data | NDA, internal policies |
| Customer support automation | Customer PII in queries | GDPR, CCPA |
Architecture Overview
An enterprise local AI deployment has four layers:
[Users] → [Reverse Proxy / Auth] → [Open WebUI] → [Ollama] → [GPU Server]
- GPU Server — runs the inference workload (Ollama + models)
- Open WebUI — provides the web interface and user management
- Reverse Proxy — handles TLS, authentication, and access logging
- Users — access the system through a browser
Air-Gapped Deployment
Air-gapped deployments have zero internet connectivity. This is required for classified environments, certain government workloads, and organizations with strict data isolation policies.
Step 1: Prepare on a Connected Machine
Download everything you need on a machine with internet access:
# Download Ollama
curl -fsSL https://ollama.com/install.sh -o ollama-install.sh
# Download Docker images
docker pull ghcr.io/open-webui/open-webui:main
docker save ghcr.io/open-webui/open-webui:main -o open-webui.tar
# Download the Ollama Docker image (if you run Ollama in a container)
docker pull ollama/ollama:latest
docker save ollama/ollama:latest -o ollama.tar
# Pull models
ollama pull llama3.1:70b
ollama pull qwen2.5-coder:32b
ollama pull nomic-embed-text
# Export models for transfer
# Models are stored in ~/.ollama by default
tar -czf ollama-models.tar.gz ~/.ollama/models
Step 2: Transfer to Air-Gapped Network
Use approved transfer media (USB, encrypted drive, secure file transfer):
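Before copying, fingerprint each artifact so the air-gapped side can prove nothing was corrupted or tampered with in transit. A sketch using sha256sum (from GNU coreutils); the helper names are mine, not part of any tool:

```shell
# make_manifest: record SHA-256 checksums of the given files into a manifest
make_manifest() {
  local manifest=$1
  shift
  sha256sum "$@" > "$manifest"
}

# check_manifest: re-hash every listed file; exits non-zero on any mismatch
check_manifest() {
  sha256sum -c "$1"
}

# On the connected machine, before copying to transfer media:
#   make_manifest transfer.sha256 ollama-install.sh open-webui.tar ollama.tar ollama-models.tar.gz
# On the air-gapped machine, after copying:
#   check_manifest transfer.sha256
```

Ship `transfer.sha256` on the same media as the artifacts; many air-gap policies require exactly this kind of manifest for media entering the enclave.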
# Copy these files to transfer media:
# - ollama-install.sh
# - open-webui.tar
# - ollama.tar
# - ollama-models.tar.gz
Step 3: Install on Air-Gapped Machine
# Install Ollama
chmod +x ollama-install.sh
./ollama-install.sh
# Load Docker images
docker load -i ollama.tar
docker load -i open-webui.tar
# Restore models
tar -xzf ollama-models.tar.gz -C ~/
# Verify
ollama list
Step 4: Deploy the Stack
# Create docker-compose.yml
cat > docker-compose.yml << 'EOF'
version: "3.8"
services:
  ollama:
    image: ollama/ollama:latest
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    volumes:
      - ollama:/root/.ollama
    ports:
      - "11434:11434"
  open-webui:
    image: ghcr.io/open-webui/open-webui:main
    ports:
      - "443:8080"
    volumes:
      - open-webui:/app/backend/data
    environment:
      - OLLAMA_BASE_URL=http://ollama:11434
      - WEBUI_AUTH=true
      - ENABLE_SIGNUP=false
      - WEBUI_SECRET_KEY=<generate-a-strong-secret>
    depends_on:
      - ollama
volumes:
  ollama:
  open-webui:
EOF
docker compose up -d
The system is now running with no internet dependency. Users on the local network reach Open WebUI at the server's address on port 443 (add TLS via a reverse proxy before exposing it — see the Security Checklist).
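The compose file leaves WEBUI_SECRET_KEY as a placeholder. One way to generate a suitable value, assuming openssl is installed:

```shell
# 32 random bytes, hex-encoded: a 64-character secret for WEBUI_SECRET_KEY
WEBUI_SECRET_KEY=$(openssl rand -hex 32)
echo "$WEBUI_SECRET_KEY"
```

Store the value in a secrets manager or a permission-restricted env file rather than committing it to the compose file.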
On-Premise GPU Server Hardware
Choosing the right hardware determines what models you can run and how many users you can support.
GPU Recommendations
| Configuration | GPU | VRAM | Max Model Size | Concurrent Users | Est. Cost |
|---|---|---|---|---|---|
| Entry | 1x RTX 4090 | 24 GB | 32B Q4 | 5-10 | $2,000-3,000 |
| Mid-range | 2x RTX 4090 | 48 GB | 70B Q4 | 10-20 | $5,000-7,000 |
| Enterprise | 2x A100 80GB | 160 GB | 70B+ Q8 / multiple models | 20-50 | $25,000-35,000 |
| High-end | 4x A100 80GB | 320 GB | Multiple large models | 50-100 | $60,000-80,000 |
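As a sanity check against the table, a common rule of thumb: a Q4-quantized model needs roughly 0.55 GB of VRAM per billion parameters, plus a few GB for KV cache and runtime overhead. This is an approximation, not a guarantee — context length and quantization variant shift it:

```shell
# Back-of-envelope VRAM estimate for a Q4-quantized model:
# ~0.55 GB per billion parameters, plus ~5 GB for KV cache and overhead
params_b=70
vram_gb=$(( params_b * 55 / 100 + 5 ))
echo "~${vram_gb} GB needed for a ${params_b}B model at Q4"
```

By this estimate a 70B Q4 model wants about 43 GB, which is why it lands in the 48 GB mid-range tier above, and a 32B Q4 (~22 GB) just fits a single 24 GB card.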
System Requirements
| Component | Minimum | Recommended |
|---|---|---|
| CPU | 16-core modern x86 | 32+ core (EPYC or Xeon) |
| RAM | 64 GB DDR5 | 128-256 GB DDR5 ECC |
| Storage | 1 TB NVMe SSD | 2-4 TB NVMe SSD (models are large) |
| Network | 1 GbE | 10 GbE (for multi-server setups) |
| Power | 850W | 1600W+ (per GPU server) |
Cooling and Environment
- Dedicated server room with temperature control (GPU servers generate significant heat)
- UPS battery backup to prevent data corruption during power events
- Cable management for maintenance access
- Noise isolation if the server is near workspaces (GPU fans are loud under load)
Open WebUI Multi-User Configuration
Open WebUI supports multi-user setups with authentication and basic role-based access control.
Enable Authentication
# In docker-compose.yml environment section:
environment:
  - WEBUI_AUTH=true
  - ENABLE_SIGNUP=false              # Disable public signup
  - WEBUI_SECRET_KEY=<your-secret>   # Strong random secret
  - DATA_EXPORT_ENABLED=true         # Allow data export for compliance
User Roles
Open WebUI provides three roles:
| Role | Capabilities |
|---|---|
| Admin | Full control: manage users, configure models, set system settings, view all chats |
| User | Use models, create chats, upload documents, manage own data |
| Pending | Registered but awaiting admin approval |
Admin Setup Workflow
- First user is automatically admin — this is your IT administrator account
- Disable public signup after creating the admin account (ENABLE_SIGNUP=false)
- Create user accounts manually through the admin panel, or enable LDAP/SSO for enterprise environments (see below)
LDAP Integration
For Active Directory or LDAP environments, configure Open WebUI to authenticate against your existing directory:
environment:
  - ENABLE_LDAP=true
  - LDAP_SERVER_URL=ldaps://your-ad-server:636
  - LDAP_BIND_DN=CN=service-account,OU=ServiceAccounts,DC=company,DC=com
  - LDAP_BIND_PASSWORD=<bind-password>
  - LDAP_SEARCH_BASE=OU=Employees,DC=company,DC=com
  - LDAP_SEARCH_FILTER=(sAMAccountName={username})
  - LDAP_USE_SSL=true
Users authenticate with their existing corporate credentials. No separate passwords to manage.
RBAC and Access Control
Model-Level Access
Restrict which models different user groups can access:
- Go to Admin Settings → Models
- For each model, set the Access Control list
- Assign models to user groups (e.g., "Legal team gets Llama 3.1 70B; Engineering gets Qwen Coder 32B")
This prevents unauthorized users from accessing expensive models and controls GPU costs.
Document Workspace Isolation
Open WebUI stores documents per user by default. For team workspaces:
- Create shared workspaces through the admin panel
- Assign users to workspaces based on their department
- Each workspace maintains its own vector database and document index
This ensures the legal team's documents are separate from engineering's knowledge base.
Compliance Considerations
GDPR Compliance
| Requirement | Implementation |
|---|---|
| Data minimization | Only upload documents needed for the task |
| Right to erasure | Open WebUI allows admin to delete user data and chat history |
| Data portability | Enable DATA_EXPORT_ENABLED=true for user data exports |
| Processing records | Enable access logging (see Monitoring section) |
| Lawful basis | Internal business operations typically fall under legitimate interest |
HIPAA Considerations
For organizations handling protected health information:
- Encryption at rest — store Open WebUI data volumes on encrypted storage
- Encryption in transit — use TLS (configure in your reverse proxy)
- Access logging — log all queries and responses for audit trails
- BAA (Business Associate Agreement) — inference runs entirely on infrastructure you control, so no external AI vendor ever processes PHI and no BAA with an AI provider is required
- Access controls — enforce role-based access so only authorized personnel query health data
- Audit trails — retain logs for the required period (typically 6 years)
This is one of the strongest arguments for local AI in healthcare: no data flows to third-party AI providers, eliminating an entire category of HIPAA risk.
SOC 2 Alignment
| Control | How Local AI Helps |
|---|---|
| CC6.1 (Logical access) | All access through authenticated Open WebUI |
| CC6.2 (Access removal) | Admin can deactivate users immediately |
| CC6.3 (Encryption) | TLS in transit, encrypted volumes at rest |
| CC7.1 (Detection) | Access logs capture all queries |
| CC7.2 (Monitoring) | Prometheus + Grafana dashboards (see below) |
Monitoring and Logging
Access Logging
Configure Open WebUI to log all user interactions:
environment:
  - ENABLE_AUDIT_LOGGING=true
  - LOG_LEVEL=INFO
Logs include:
- User authentication events
- Model access per user
- Query timestamps
- Document uploads and deletions
Forward logs to your SIEM (Splunk, Elastic, or similar) for centralized monitoring.
Performance Monitoring with Prometheus
# Add to docker-compose.yml under services:
prometheus:
  image: prom/prometheus
  ports:
    - "9090:9090"
  volumes:
    - ./prometheus.yml:/etc/prometheus/prometheus.yml
grafana:
  image: grafana/grafana
  ports:
    - "3001:3000"
  volumes:
    - grafana:/var/lib/grafana
Track these metrics:
| Metric | What It Tells You | Alert Threshold |
|---|---|---|
| GPU utilization | Are GPUs being used efficiently? | < 20% (over-provisioned) or > 95% (bottleneck) |
| GPU memory usage | Are models fitting in VRAM? | > 90% (risk of OOM) |
| Request latency | How fast are responses? | > 30s for chat, > 5s for autocomplete |
| Request queue depth | Are users waiting for GPU time? | > 10 queued requests |
| Error rate | Are requests failing? | > 1% error rate |
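The compose snippet above mounts a ./prometheus.yml that isn't shown. A minimal sketch that scrapes NVIDIA's dcgm-exporter for GPU metrics — the exporter container and its default port 9400 are assumptions; add it to the compose file if you use this:

```yaml
# prometheus.yml — minimal scrape configuration
global:
  scrape_interval: 15s
scrape_configs:
  - job_name: "gpu"
    static_configs:
      - targets: ["dcgm-exporter:9400"]   # NVIDIA DCGM exporter: GPU util, VRAM, temps
  - job_name: "prometheus"
    static_configs:
      - targets: ["localhost:9090"]       # Prometheus self-monitoring
```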
Ollama Health Check
# Check Ollama status
curl http://localhost:11434/api/tags
# Monitor running models
curl http://localhost:11434/api/ps
# Simple health check script for cron
#!/bin/bash
if ! curl -sf http://localhost:11434/api/tags > /dev/null; then
  echo "Ollama is down!" | mail -s "Local AI Alert" admin@company.com
fi
Scaling Strategies
Vertical Scaling (More GPU)
The simplest approach: upgrade your GPU server.
# With 4x A100 80GB, you can run:
ollama run llama3.1:70b # Uses ~40GB VRAM
# Plus simultaneously:
ollama run qwen2.5-coder:32b # Uses ~20GB VRAM
# Leaving ~260GB for additional models or batch processing
Horizontal Scaling (Multiple Servers)
For 50+ concurrent users, distribute the load:
          [Nginx / HAProxy]
          /       |       \
 [Ollama-1]  [Ollama-2]  [Ollama-3]
 (70B model) (Coder 32B) (General 8B)
Configure Open WebUI to point to multiple Ollama instances:
- Deploy 2-3 Ollama servers, each running different models
- Configure Open WebUI with multiple Ollama endpoints
- Users select the appropriate model for their task
- Load balancing happens at the model-selection level
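Recent Open WebUI releases accept several backends through the semicolon-separated OLLAMA_BASE_URLS variable — verify the exact name against your version's documentation; the hostnames below are placeholders for the three servers in the diagram:

```yaml
# Open WebUI environment section with multiple Ollama backends
environment:
  - OLLAMA_BASE_URLS=http://ollama-1:11434;http://ollama-2:11434;http://ollama-3:11434
```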
Cloud Burst for Peak Loads
For organizations that want on-premise baseline with cloud burst capability:
- Run your primary infrastructure on-premise
- Configure a secondary Ollama instance on Runpod
- During peak usage, route overflow requests to the cloud instance
- Cloud instances are destroyed after use — no persistent data in the cloud
This hybrid approach keeps sensitive data on-premise by default while providing elasticity.
Security Checklist
Before going live, verify:
- TLS configured on the reverse proxy (Nginx/Caddy)
- Public signup disabled in Open WebUI
- Strong admin password set
- LDAP/SSO integrated if available
- Data volumes encrypted at rest
- Access logging enabled
- Firewall rules restrict access to authorized IP ranges
- Ollama API not exposed to the network (only accessible to Open WebUI)
- Regular backup schedule for Open WebUI data and configurations
- Incident response plan for AI-generated content issues
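The "Ollama API not exposed" item can be enforced in the compose file itself: bind the published port to loopback, or drop the ports mapping entirely, since Open WebUI reaches Ollama over the internal compose network either way:

```yaml
# Ollama service: host-only API access
ollama:
  ports:
    - "127.0.0.1:11434:11434"   # or omit 'ports' entirely if only Open WebUI needs it
```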
Backup and Recovery
# Backup Open WebUI data
docker cp open-webui:/app/backend/data ./backup-$(date +%Y%m%d)
# Backup Ollama models (if you want to avoid re-downloading)
tar -czf ollama-backup-$(date +%Y%m%d).tar.gz ~/.ollama/models
# Restore from backup
docker cp ./backup-20260422 open-webui:/app/backend/data
docker restart open-webui
Schedule daily backups with cron. Store backups on encrypted, off-site storage for disaster recovery.
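One way to schedule the backup commands above — the script path is a placeholder; wrap the commands in a script at that location first:

```cron
# crontab entry: nightly backup at 02:00, output logged for audit purposes
0 2 * * * /usr/local/bin/backup-local-ai.sh >> /var/log/local-ai-backup.log 2>&1
```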