Running Multimodal AI Models Locally — Image and Vision with LLaVA
2026/04/22
Intermediate · 9 min read

Run vision-capable AI models like LLaVA on your hardware. Analyze images, describe photos, and extract text — all locally, without sending data to the cloud.

Text-only AI models are useful, but the real world is visual. Photos, screenshots, diagrams, charts, and documents all contain information that text models simply cannot access. Multimodal models solve this by combining language understanding with visual perception — and you can run them entirely on your own hardware.

What Are Multimodal Models?

Multimodal models process both text and images as input. You can show them a photo and ask questions about it, hand them a screenshot and ask for an explanation, or feed them a chart and request a summary.

These models work by connecting a vision encoder (which understands images) with a language model (which generates text). The vision encoder converts an image into a sequence of numerical representations (embeddings), and the language model treats those embeddings as additional context alongside your text prompt.
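As a toy sketch (illustrative dimensions, not LLaVA's real ones), here is that pipeline in miniature: patch embeddings from the vision encoder are projected into the language model's embedding space and prepended to the prompt tokens:

```python
# Toy sketch of the multimodal pipeline: a vision encoder turns an image into
# patch embeddings, a projector maps them into the LLM's embedding space, and
# they are prepended to the text-token embeddings as extra context.
# All dimensions below are illustrative, not LLaVA's actual sizes.

NUM_PATCHES = 4   # real models produce hundreds of patches per image
VISION_DIM = 3    # vision encoder output width (toy)
LLM_DIM = 5       # language model embedding width (toy)

def vision_encoder(image_pixels):
    # Stand-in: one embedding per image patch.
    return [[float(p)] * VISION_DIM for p in range(NUM_PATCHES)]

def project(patch_embeddings):
    # Stand-in for the learned projection layer (vision dim -> LLM dim).
    return [[v[0]] * LLM_DIM for v in patch_embeddings]

def embed_text(tokens):
    # Stand-in for the LLM's token embedding table.
    return [[float(len(t))] * LLM_DIM for t in tokens]

image_embeds = project(vision_encoder("fake-image"))
text_embeds = embed_text(["Describe", "this", "image"])

# The language model simply sees a longer sequence: image context, then prompt.
context = image_embeds + text_embeds
print(len(context))  # 4 image "tokens" + 3 text tokens = 7
```

The key point: by the time the language model runs, the image is just more context tokens, which is why longer images (more patches) cost RAM and speed the same way longer prompts do.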

What You Can Do with Local Vision Models

  • Describe images — "What is shown in this photo?"
  • Extract text (OCR) — "Read the text on this sign"
  • Analyze charts and graphs — "What are the key trends in this chart?"
  • Understand screenshots — "Explain what this error message means"
  • Answer visual questions — "How many red cars are in this image?"
  • Identify objects — "What species of bird is this?"

Available Multimodal Models

Several open-source multimodal models work well locally. Here are the best options:

Model         | Base LLM                 | Size    | Best For
LLaVA 1.6     | Mistral 7B / Vicuna 13B  | 7B, 13B | General vision tasks
llava-llama3  | Llama 3 8B               | 8B      | Better language reasoning
LLaVA 1.5     | Vicuna 7B/13B            | 7B, 13B | Lightweight, fast
MiniCPM-V 2.6 | MiniCPM                  | 8B      | High efficiency, multilingual
Qwen2-VL      | Qwen 2                   | 7B      | Strong overall performance

Setting Up with Ollama

Ollama makes running multimodal models straightforward. The setup is the same as text-only models — you pull the model and start chatting.

Install Ollama

# macOS / Linux
curl -fsSL https://ollama.com/install.sh | sh

# Or download the macOS app from ollama.com

Pull a Vision Model

# LLaVA 1.6 (7B) — best general-purpose vision model
ollama pull llava:7b

# LLaVA 1.6 (13B) — higher quality, needs more RAM
ollama pull llava:13b

# llava-llama3 — Llama 3 backbone, stronger reasoning
ollama pull llava-llama3

# MiniCPM-V — compact and efficient
ollama pull minicpm-v

Chat with Images in the Terminal

Ollama supports image inputs directly from the command line:

# Describe an image
ollama run llava:7b "Describe what you see in this image" /path/to/photo.jpg

# Extract text from a screenshot
ollama run llava-llama3 "Extract all the text from this screenshot" /path/to/screenshot.png

# Analyze a chart
ollama run llava:7b "What are the key trends in this chart? Provide specific numbers." /path/to/chart.png

You can also start an interactive session and include an image by typing its file path directly in the prompt:

ollama run llava:7b

>>> /path/to/photo.jpg Describe what's in this image

Setting Up with Open WebUI

For a better experience with image uploads, use Open WebUI:

docker run -d -p 3000:8080 \
  --add-host=host.docker.internal:host-gateway \
  -v open-webui:/app/backend/data \
  ghcr.io/open-webui/open-webui:main

Once Open WebUI is running:

  1. Open http://localhost:3000 in your browser
  2. Select a vision model (e.g., llava:7b) from the model dropdown
  3. Click the + button or drag and drop an image into the chat
  4. Type your question about the image
  5. The model will analyze the image and respond

Open WebUI stores conversations locally, so you can review past image analyses anytime.

Practical Examples

Example 1: Describe a Photo

ollama run llava:7b "Describe this photo in detail. What objects, people, and setting do you see?" ~/Photos/vacation.jpg

Sample output:

The image shows a coastal landscape at sunset. In the foreground, there is a rocky shoreline with waves gently washing over the rocks. The sky displays a gradient of orange, pink, and purple hues. A wooden pier extends into the water on the right side. No people are visible in the frame.

Example 2: Extract Text from Screenshots

ollama run llava-llama3 "Extract all text from this image. Format it exactly as shown." ~/Desktop/error-message.png

This works well for error messages, receipts, signs, and documents. Accuracy is highest with clear, well-lit images containing standard fonts.

Example 3: Analyze Charts and Data

ollama run llava:7b "Analyze this sales chart. What are the trends? Which quarter performed best? Give specific numbers where possible." ~/Documents/q3-sales.png

Sample output:

The chart shows quarterly sales from Q1 to Q4. Q1 shows approximately $2.1M, Q2 rises to $2.8M, Q3 peaks at $3.4M, and Q4 drops to $2.9M. The overall trend is upward with Q3 being the strongest quarter, likely driven by seasonal demand.

Example 4: Code Understanding from Screenshots

ollama run llava-llama3 "What does this code do? Identify any bugs or issues." ~/Desktop/code-snippet.png

This is useful for reviewing code from tutorials, presentations, or screenshots shared in chat.

Model Comparison: LLaVA vs llava-llama3 vs MiniCPM-V

Task                      | LLaVA 7B  | llava-llama3 | MiniCPM-V
General image description | Good      | Good         | Good
Text extraction (OCR)     | Good      | Very good    | Good
Chart analysis            | Good      | Very good    | Good
Code from screenshots     | Moderate  | Very good    | Moderate
Detailed reasoning        | Moderate  | Good         | Moderate
Speed (tokens/sec)        | ~20 tok/s | ~18 tok/s    | ~22 tok/s
RAM required              | 8 GB      | 8 GB         | 8 GB

Recommendation: Use llava-llama3 for tasks that involve text extraction or detailed reasoning, and LLaVA 13B for the highest-quality image understanding if you have 16 GB of RAM.

Hardware Requirements

Multimodal models are slightly more demanding than text-only models because they need to process image data alongside text.

Model        | Minimum RAM | Recommended RAM | VRAM (GPU)
LLaVA 7B     | 8 GB        | 12 GB           | 6 GB
LLaVA 13B    | 16 GB       | 24 GB           | 12 GB
llava-llama3 | 8 GB        | 12 GB           | 6 GB
MiniCPM-V    | 8 GB        | 12 GB           | 6 GB
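These figures line up with a common rule of thumb (an estimate, not an official formula): a Q4_K_M quantized model needs roughly 0.57 bytes per parameter for weights, plus a couple of gigabytes of headroom for the vision encoder, KV cache, and runtime:

```python
# Back-of-envelope RAM estimate for a 4-bit quantized model. The 0.57
# bytes/parameter figure and 2 GB overhead are rough assumptions, not
# measured values for any specific model.

def estimate_ram_gb(params_billions, bytes_per_param=0.57, overhead_gb=2.0):
    weights_gb = params_billions * bytes_per_param
    return round(weights_gb + overhead_gb, 1)

print(estimate_ram_gb(7))   # ~6.0 GB, which fits the 8 GB minimum for 7B models
print(estimate_ram_gb(13))  # ~9.4 GB, which is why 13B models want 16 GB
```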

Tips for Better Performance

  1. Close other applications — free up RAM before loading vision models
  2. Resize large images — models work best with images under 1024x1024 pixels
  3. Use PNG for text extraction — sharper edges help OCR accuracy
  4. Crop to relevant areas — focus the model on what matters
  5. Use Q4_K_M quantization — the default in Ollama, best balance of speed and quality
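As a sketch of tip 2, the size math for fitting an image within 1024x1024 while preserving aspect ratio looks like this (the actual resampling would then be done with an imaging library such as Pillow):

```python
# Sketch for tip 2: compute a target size that fits within 1024x1024 while
# preserving aspect ratio. Only the size math is shown; the resampling itself
# could be done with e.g. Pillow's Image.thumbnail().

MAX_SIDE = 1024

def fit_within(width, height, max_side=MAX_SIDE):
    longest = max(width, height)
    if longest <= max_side:
        return width, height  # already small enough, leave it alone
    scale = max_side / longest
    return round(width * scale), round(height * scale)

print(fit_within(4032, 3024))  # (1024, 768), a typical 12 MP photo scaled down
print(fit_within(800, 600))    # (800, 600), left untouched
```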

Using the Ollama API for Vision Tasks

You can integrate vision models into your own applications using the Ollama API:

import ollama

# Analyze an image
response = ollama.chat(
    model='llava:7b',
    messages=[
        {
            'role': 'user',
            'content': 'What is shown in this image? List all objects.',
            'images': ['/path/to/image.jpg']
        }
    ]
)

print(response['message']['content'])

# Using curl (on Linux, use `base64 -w0 photo.jpg` so the output stays on one line;
# GNU base64 wraps at 76 columns by default, which breaks the JSON)
curl http://localhost:11434/api/chat -d '{
  "model": "llava:7b",
  "messages": [
    {
      "role": "user",
      "content": "Describe this image",
      "images": ["'"$(base64 -i photo.jpg)"'"]
    }
  ]
}'

This makes it easy to build applications that process images — document scanners, photo organizers, accessibility tools, and more.
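To make the wire format concrete, here is a small sketch (with stand-in image bytes) of the JSON body both examples above send; the key detail is that images are passed as base64 strings, not raw bytes:

```python
# Sketch of the /api/chat request body once the image is base64-encoded.
# The fake_image bytes stand in for the contents of a real .jpg file,
# i.e. open("photo.jpg", "rb").read().

import base64
import json

fake_image = b"\xff\xd8\xff\xe0fake-jpeg-bytes"  # stand-in JPEG data

payload = {
    "model": "llava:7b",
    "messages": [
        {
            "role": "user",
            "content": "Describe this image",
            # Ollama expects each entry in "images" to be a base64 string.
            "images": [base64.b64encode(fake_image).decode("ascii")],
        }
    ],
}

body = json.dumps(payload)
print(body[:40])  # this string is what gets POSTed to /api/chat
```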

Limitations to Keep in Mind

Local multimodal models are impressive but have real limitations:

  1. No real-time video — they process static images only, not video streams
  2. OCR accuracy — good but not at the level of dedicated OCR tools like Tesseract for complex layouts
  3. Fine details — small text, far-away objects, or subtle details may be missed
  4. Hallucination — models can confidently describe things that are not in the image; verify critical results
  5. Image size limits — very high-resolution images are automatically downscaled, which can lose detail

For production OCR tasks, consider pairing vision models with a dedicated OCR library. For visual Q&A and general image understanding, these models perform very well.

Going Further

If you need more capable vision models and have the hardware, consider:

  • Qwen2-VL 7B — strong performance on document understanding and multilingual OCR
  • LLaVA-NeXT (LLaVA 1.6) 34B — near-GPT-4V quality on image understanding, requires 32GB+ RAM
  • Cloud GPU — deploy on Runpod with an A100 for the largest vision models

Related Guides

  • Getting Started with Local AI
  • How to Install Ollama
  • Open WebUI vs AnythingLLM
  • Local RAG Tutorial
  • Best Local AI Tools in 2026