Enterprise Local AI Deployment — Air-Gapped, On-Premise, and Compliant
Deploy local AI for enterprise use. Covers air-gapped setups, on-premise GPU servers, compliance, and multi-user configurations powered by Open WebUI.
Enterprises are adopting local AI for three reasons: data sovereignty, regulatory compliance, and cost control. When you process customer data, proprietary code, or regulated content through cloud APIs, you introduce risk. Local AI eliminates that risk by keeping everything on infrastructure you control.
This guide covers enterprise-grade local AI deployment — from air-gapped environments to multi-user Open WebUI setups with role-based access control.
Enterprise Use Cases for Local AI
| Use Case | Why Local | Compliance Driver |
|---|---|---|
| Legal document analysis | Privileged client communications | Attorney-client privilege, bar confidentiality rules |
| Healthcare record processing | Protected health information | HIPAA |
| Financial data analysis | Sensitive financial records | SOX, GDPR |
| Code assistant for proprietary code | Trade secrets, source code | Trade secret law, IP protection |
| Internal knowledge base | Confidential business data | NDA, internal policies |
| Customer support automation | Customer PII in queries | GDPR, CCPA |
Architecture Overview
An enterprise local AI deployment has four layers:
[Users] → [Reverse Proxy / Auth] → [Open WebUI] → [Ollama] → [GPU Server]
- GPU Server — runs the inference workload (Ollama + models)
- Open WebUI — provides the web interface and user management
- Reverse Proxy — handles TLS, authentication, and access logging
- Users — access the system through a browser
Air-Gapped Deployment
Air-gapped deployments have zero internet connectivity. This is required for classified environments, certain government workloads, and organizations with strict data isolation policies.
Step 1: Prepare on a Connected Machine
Download everything you need on a machine with internet access:
# Download Ollama
curl -fsSL https://ollama.com/install.sh -o ollama-install.sh
# Download Docker images
docker pull ghcr.io/open-webui/open-webui:main
docker save ghcr.io/open-webui/open-webui:main -o open-webui.tar
# Download the Ollama Docker image (if you run Ollama in a container)
docker pull ollama/ollama:latest
docker save ollama/ollama:latest -o ollama.tar
# Pull models
ollama pull llama3.1:70b
ollama pull qwen2.5-coder:32b
ollama pull nomic-embed-text
# Export models for transfer
# Models are stored in ~/.ollama by default
tar -czf ollama-models.tar.gz ~/.ollama/models
Step 2: Transfer to Air-Gapped Network
Use approved transfer media (USB, encrypted drive, secure file transfer):
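Before copying, fingerprint each artifact so the air-gapped side can prove nothing was corrupted or tampered with in transit. A sketch using sha256sum (from GNU coreutils); the helper names are mine, not part of any tool:

```shell
# make_manifest: record SHA-256 checksums of the given files into a manifest
make_manifest() {
  local manifest=$1
  shift
  sha256sum "$@" > "$manifest"
}

# check_manifest: re-hash every listed file; exits non-zero on any mismatch
check_manifest() {
  sha256sum -c "$1"
}

# On the connected machine, before copying to transfer media:
#   make_manifest transfer.sha256 ollama-install.sh open-webui.tar ollama.tar ollama-models.tar.gz
# On the air-gapped machine, after copying:
#   check_manifest transfer.sha256
```

Ship `transfer.sha256` on the same media as the artifacts; many air-gap policies require exactly this kind of manifest for media entering the enclave.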
# Copy these files to transfer media:
# - ollama-install.sh
# - open-webui.tar
# - ollama.tar
# - ollama-models.tar.gz
Step 3: Install on Air-Gapped Machine
# Install Ollama
chmod +x ollama-install.sh
./ollama-install.sh
# Load Docker images
docker load -i ollama.tar
docker load -i open-webui.tar
# Restore models
tar -xzf ollama-models.tar.gz -C ~/
# Verify
ollama list
Step 4: Deploy the Stack
# Create docker-compose.yml
cat > docker-compose.yml << 'EOF'
version: "3.8"
services:
  ollama:
    image: ollama/ollama:latest
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    volumes:
      - ollama:/root/.ollama
    ports:
      - "11434:11434"
  open-webui:
    image: ghcr.io/open-webui/open-webui:main
    ports:
      - "443:8080"
    volumes:
      - open-webui:/app/backend/data
    environment:
      - OLLAMA_BASE_URL=http://ollama:11434
      - WEBUI_AUTH=true
      - ENABLE_SIGNUP=false
      - WEBUI_SECRET_KEY=<generate-a-strong-secret>
    depends_on:
      - ollama
volumes:
  ollama:
  open-webui:
EOF
docker compose up -d
The system is now running with no internet dependency. Users on the local network reach Open WebUI at the server's address on port 443 (add TLS via a reverse proxy before exposing it — see the Security Checklist).
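The compose file leaves WEBUI_SECRET_KEY as a placeholder. One way to generate a suitable value, assuming openssl is installed:

```shell
# 32 random bytes, hex-encoded: a 64-character secret for WEBUI_SECRET_KEY
WEBUI_SECRET_KEY=$(openssl rand -hex 32)
echo "$WEBUI_SECRET_KEY"
```

Store the value in a secrets manager or a permission-restricted env file rather than committing it to the compose file.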
On-Premise GPU Server Hardware
Choosing the right hardware determines what models you can run and how many users you can support.
GPU Recommendations
| Configuration | GPU | VRAM | Max Model Size | Concurrent Users | Est. Cost |
|---|---|---|---|---|---|
| Entry | 1x RTX 4090 | 24 GB | 32B Q4 | 5-10 | $2,000-3,000 |
| Mid-range | 2x RTX 4090 | 48 GB | 70B Q4 | 10-20 | $5,000-7,000 |
| Enterprise | 2x A100 80GB | 160 GB | 70B+ Q8 / multiple models | 20-50 | $25,000-35,000 |
| High-end | 4x A100 80GB | 320 GB | Multiple large models | 50-100 | $60,000-80,000 |
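As a sanity check against the table, a common rule of thumb: a Q4-quantized model needs roughly 0.55 GB of VRAM per billion parameters, plus a few GB for KV cache and runtime overhead. This is an approximation, not a guarantee — context length and quantization variant shift it:

```shell
# Back-of-envelope VRAM estimate for a Q4-quantized model:
# ~0.55 GB per billion parameters, plus ~5 GB for KV cache and overhead
params_b=70
vram_gb=$(( params_b * 55 / 100 + 5 ))
echo "~${vram_gb} GB needed for a ${params_b}B model at Q4"
```

By this estimate a 70B Q4 model wants about 43 GB, which is why it lands in the 48 GB mid-range tier above, and a 32B Q4 (~22 GB) just fits a single 24 GB card.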
System Requirements
| Component | Minimum | Recommended |
|---|---|---|
| CPU | 16-core modern x86 | 32+ core (EPYC or Xeon) |
| RAM | 64 GB DDR5 | 128-256 GB DDR5 ECC |
| Storage | 1 TB NVMe SSD | 2-4 TB NVMe SSD (models are large) |
| Network | 1 GbE | 10 GbE (for multi-server setups) |
| Power | 850W | 1600W+ (per GPU server) |
Cooling and Environment
- Dedicated server room with temperature control (GPU servers generate significant heat)
- UPS battery backup to prevent data corruption during power events
- Cable management for maintenance access
- Noise isolation if the server is near workspaces (GPU fans are loud under load)
Open WebUI Multi-User Configuration
Open WebUI supports multi-user setups with authentication and basic role-based access control.
Enable Authentication
# In docker-compose.yml environment section:
environment:
  - WEBUI_AUTH=true
  - ENABLE_SIGNUP=false              # Disable public signup
  - WEBUI_SECRET_KEY=<your-secret>   # Strong random secret
  - DATA_EXPORT_ENABLED=true         # Allow data export for compliance
User Roles
Open WebUI provides three roles:
| Role | Capabilities |
|---|---|
| Admin | Full control: manage users, configure models, set system settings, view all chats |
| User | Use models, create chats, upload documents, manage own data |
| Pending | Registered but awaiting admin approval |
Admin Setup Workflow
- First user is automatically admin — this is your IT administrator account
- Disable public signup after creating the admin account (ENABLE_SIGNUP=false)
- Create user accounts manually through the admin panel, or enable LDAP/SSO for enterprise environments (see below)
LDAP Integration
For Active Directory or LDAP environments, configure Open WebUI to authenticate against your existing directory:
environment:
  - ENABLE_LDAP=true
  - LDAP_SERVER_URL=ldaps://your-ad-server:636
  - LDAP_BIND_DN=CN=service-account,OU=ServiceAccounts,DC=company,DC=com
  - LDAP_BIND_PASSWORD=<bind-password>
  - LDAP_SEARCH_BASE=OU=Employees,DC=company,DC=com
  - LDAP_SEARCH_FILTER=(sAMAccountName={username})
  - LDAP_USE_SSL=true
Users authenticate with their existing corporate credentials. No separate passwords to manage.
RBAC and Access Control
Model-Level Access
Restrict which models different user groups can access:
- Go to Admin Settings → Models
- For each model, set the Access Control list
- Assign models to user groups (e.g., "Legal team gets Llama 3.1 70B; Engineering gets Qwen Coder 32B")
This prevents unauthorized users from accessing expensive models and controls GPU costs.
Document Workspace Isolation
Open WebUI stores documents per user by default. For team workspaces:
- Create shared workspaces through the admin panel
- Assign users to workspaces based on their department
- Each workspace maintains its own vector database and document index
This ensures the legal team's documents are separate from engineering's knowledge base.
Compliance Considerations
GDPR Compliance
| Requirement | Implementation |
|---|---|
| Data minimization | Only upload documents needed for the task |
| Right to erasure | Open WebUI allows admin to delete user data and chat history |
| Data portability | Enable DATA_EXPORT_ENABLED=true for user data exports |
| Processing records | Enable access logging (see Monitoring section) |
| Lawful basis | Internal business operations typically fall under legitimate interest |
HIPAA Considerations
For organizations handling protected health information:
- Encryption at rest — store Open WebUI data volumes on encrypted storage
- Encryption in transit — use TLS (configure in your reverse proxy)
- Access logging — log all queries and responses for audit trails
- BAA (Business Associate Agreement) — inference runs entirely on infrastructure you control, so no external AI vendor ever processes PHI and no BAA with an AI provider is required
- Access controls — enforce role-based access so only authorized personnel query health data
- Audit trails — retain logs for the required period (typically 6 years)
This is one of the strongest arguments for local AI in healthcare: no data flows to third-party AI providers, eliminating an entire category of HIPAA risk.
SOC 2 Alignment
| Control | How Local AI Helps |
|---|---|
| CC6.1 (Logical access) | All access through authenticated Open WebUI |
| CC6.2 (Access removal) | Admin can deactivate users immediately |
| CC6.3 (Encryption) | TLS in transit, encrypted volumes at rest |
| CC7.1 (Detection) | Access logs capture all queries |
| CC7.2 (Monitoring) | Prometheus + Grafana dashboards (see below) |
Monitoring and Logging
Access Logging
Configure Open WebUI to log all user interactions:
environment:
  - ENABLE_AUDIT_LOGGING=true
  - LOG_LEVEL=INFO
Logs include:
- User authentication events
- Model access per user
- Query timestamps
- Document uploads and deletions
Forward logs to your SIEM (Splunk, Elastic, or similar) for centralized monitoring.
Performance Monitoring with Prometheus
# Add to docker-compose.yml under services:
prometheus:
  image: prom/prometheus
  ports:
    - "9090:9090"
  volumes:
    - ./prometheus.yml:/etc/prometheus/prometheus.yml
grafana:
  image: grafana/grafana
  ports:
    - "3001:3000"
  volumes:
    - grafana:/var/lib/grafana
Track these metrics:
| Metric | What It Tells You | Alert Threshold |
|---|---|---|
| GPU utilization | Are GPUs being used efficiently? | < 20% (over-provisioned) or > 95% (bottleneck) |
| GPU memory usage | Are models fitting in VRAM? | > 90% (risk of OOM) |
| Request latency | How fast are responses? | > 30s for chat, > 5s for autocomplete |
| Request queue depth | Are users waiting for GPU time? | > 10 queued requests |
| Error rate | Are requests failing? | > 1% error rate |
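The compose snippet above mounts a ./prometheus.yml that isn't shown. A minimal sketch that scrapes NVIDIA's dcgm-exporter for GPU metrics — the exporter container and its default port 9400 are assumptions; add it to the compose file if you use this:

```yaml
# prometheus.yml — minimal scrape configuration
global:
  scrape_interval: 15s
scrape_configs:
  - job_name: "gpu"
    static_configs:
      - targets: ["dcgm-exporter:9400"]   # NVIDIA DCGM exporter: GPU util, VRAM, temps
  - job_name: "prometheus"
    static_configs:
      - targets: ["localhost:9090"]       # Prometheus self-monitoring
```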
Ollama Health Check
# Check Ollama status
curl http://localhost:11434/api/tags
# Monitor running models
curl http://localhost:11434/api/ps
# Simple health check script for cron
#!/bin/bash
if ! curl -sf http://localhost:11434/api/tags > /dev/null; then
  echo "Ollama is down!" | mail -s "Local AI Alert" admin@company.com
fi
Scaling Strategies
Vertical Scaling (More GPU)
The simplest approach: upgrade your GPU server.
# With 4x A100 80GB, you can run:
ollama run llama3.1:70b # Uses ~40GB VRAM
# Plus simultaneously:
ollama run qwen2.5-coder:32b # Uses ~20GB VRAM
# Leaving ~260GB for additional models or batch processing
Horizontal Scaling (Multiple Servers)
For 50+ concurrent users, distribute the load:
          [Nginx / HAProxy]
          /       |       \
 [Ollama-1]  [Ollama-2]  [Ollama-3]
 (70B model) (Coder 32B) (General 8B)
Configure Open WebUI to point to multiple Ollama instances:
- Deploy 2-3 Ollama servers, each running different models
- Configure Open WebUI with multiple Ollama endpoints
- Users select the appropriate model for their task
- Load balancing happens at the model-selection level
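Recent Open WebUI releases accept several backends through the semicolon-separated OLLAMA_BASE_URLS variable — verify the exact name against your version's documentation; the hostnames below are placeholders for the three servers in the diagram:

```yaml
# Open WebUI environment section with multiple Ollama backends
environment:
  - OLLAMA_BASE_URLS=http://ollama-1:11434;http://ollama-2:11434;http://ollama-3:11434
```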
Cloud Burst for Peak Loads
For organizations that want on-premise baseline with cloud burst capability:
- Run your primary infrastructure on-premise
- Configure a secondary Ollama instance on Runpod
- During peak usage, route overflow requests to the cloud instance
- Cloud instances are destroyed after use — no persistent data in the cloud
This hybrid approach keeps sensitive data on-premise by default while providing elasticity.
Security Checklist
Before going live, verify:
- TLS configured on the reverse proxy (Nginx/Caddy)
- Public signup disabled in Open WebUI
- Strong admin password set
- LDAP/SSO integrated if available
- Data volumes encrypted at rest
- Access logging enabled
- Firewall rules restrict access to authorized IP ranges
- Ollama API not exposed to the network (only accessible to Open WebUI)
- Regular backup schedule for Open WebUI data and configurations
- Incident response plan for AI-generated content issues
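The "Ollama API not exposed" item can be enforced in the compose file itself: bind the published port to loopback, or drop the ports mapping entirely, since Open WebUI reaches Ollama over the internal compose network either way:

```yaml
# Ollama service: host-only API access
ollama:
  ports:
    - "127.0.0.1:11434:11434"   # or omit 'ports' entirely if only Open WebUI needs it
```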
Backup and Recovery
# Backup Open WebUI data
docker cp open-webui:/app/backend/data ./backup-$(date +%Y%m%d)
# Backup Ollama models (if you want to avoid re-downloading)
tar -czf ollama-backup-$(date +%Y%m%d).tar.gz ~/.ollama/models
# Restore from backup
docker cp ./backup-20260422 open-webui:/app/backend/data
docker restart open-webui
Schedule daily backups with cron. Store backups on encrypted, off-site storage for disaster recovery.
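One way to schedule the backup commands above — the script path is a placeholder; wrap the commands in a script at that location first:

```cron
# crontab entry: nightly backup at 02:00, output logged for audit purposes
0 2 * * * /usr/local/bin/backup-local-ai.sh >> /var/log/local-ai-backup.log 2>&1
```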