Local RAG Tutorial — Chat with Your Documents Using Free AI Tools
A step-by-step guide to setting up Retrieval-Augmented Generation (RAG) locally. Chat with your PDFs, documents, and knowledge base — fully offline and private.
RAG (Retrieval-Augmented Generation) lets you chat with your own documents using AI. Instead of relying on a model's general knowledge, you feed it your specific files — PDFs, docs, text files — and ask questions about them. And you can do this entirely locally, for free.
What You'll Need
- A computer with 8GB+ RAM (16GB recommended for larger document sets)
- Ollama installed and running
- One of: AnythingLLM or Open WebUI
- Your documents (PDF, DOCX, TXT, MD)
How RAG Works (Simplified)
- Upload documents — your files are processed and stored locally
- Ask a question — the system finds relevant sections from your documents
- Generate answer — the local AI model reads those sections and answers your question
- All local — no data leaves your machine at any point
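The retrieval steps above can be sketched in a few lines of Python. This is a toy illustration, not what AnythingLLM or Open WebUI actually run: real pipelines use vector embeddings to find relevant chunks, and plain word overlap stands in for them here, but the flow (chunk the documents, find matching chunks for a question) is the same.

```python
def chunk(text, size=8):
    """Split a document into fixed-size word chunks."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

def retrieve(question, chunks, k=2):
    """Return the k chunks sharing the most words with the question."""
    q = set(question.lower().split())
    return sorted(chunks,
                  key=lambda c: len(q & set(c.lower().split())),
                  reverse=True)[:k]

doc = ("The Q3 report shows revenue grew 12 percent. Headcount stayed flat. "
       "Marketing spend fell. The Q4 forecast projects 8 percent growth.")
top = retrieve("How much did revenue grow?", chunk(doc))
print(top[0])  # the chunk mentioning revenue growth
```

The retrieved chunks, not the whole document, are what the model reads — which is why RAG works even when your documents are far larger than the model's context window.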
Method 1: AnythingLLM (Easiest)
AnythingLLM is purpose-built for document chat. Best for users who want a simple setup.
Step 1: Install Ollama
# Download from ollama.com, then pull a model
ollama pull llama3.1
ollama pull qwen2.5:14b  # Better for RAG if you have 16GB RAM

Verify Ollama is running:

ollama list

Step 2: Install AnythingLLM
- Download from anythingllm.com
- Available for macOS, Windows, and Linux
- No account needed — runs entirely locally
Step 3: Connect to Ollama
- Open AnythingLLM
- Go to Settings → LLM Provider
- Select Ollama
- It should auto-detect your running Ollama instance
- Select your preferred model (e.g., Llama 3.1 8B)
Step 4: Create a Workspace and Upload Documents
- Click New Workspace
- Give it a name (e.g., "Research Papers" or "Project Docs")
- Drag and drop your documents into the workspace
- Supported formats: PDF, DOCX, TXT, MD, CSV, and more
- Wait for processing to complete (usually seconds per document)
Step 5: Start Chatting
Ask questions about your documents:
- "Summarize the key findings in the Q3 report"
- "What are the main arguments in this paper?"
- "Extract all action items from these meeting notes"
AnythingLLM shows you which document sections it used to answer each question, so you can verify accuracy.
Method 2: Open WebUI (Most Flexible)
Open WebUI gives you more control and a ChatGPT-like interface. Best for advanced users and teams.
Step 1: Install Ollama
Same as above — install Ollama and pull your preferred model.
Step 2: Deploy Open WebUI
docker run -d -p 3000:8080 \
  --add-host=host.docker.internal:host-gateway \
  -v open-webui:/app/backend/data \
  ghcr.io/open-webui/open-webui:main

Open http://localhost:3000 in your browser.
Step 3: Enable Document Upload
- Open Open WebUI in your browser
- Go to Settings → Documents
- Set the embedding model (use the default — it downloads automatically)
- Configure the document store path if needed
Step 4: Upload and Chat
- Start a new chat
- Click the + button or paperclip icon to attach a document
- Upload your files (PDF, DOCX, TXT)
- Ask questions about the uploaded content
Open WebUI will:
- Process the document into searchable chunks
- Find relevant sections when you ask a question
- Feed those sections to your local model for accurate answers
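That final hand-off is just an HTTP call to Ollama's local API. The sketch below uses Ollama's documented /api/generate endpoint; `build_payload` and `ask` are illustrative names of my own, not part of Open WebUI, and the chunk-selection step is assumed to have happened already.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local API

def build_payload(question, chunks, model="llama3.1"):
    """Assemble the prompt the model receives: retrieved chunks plus the question."""
    prompt = ("Use only the context below to answer.\n\n"
              + "\n---\n".join(chunks)
              + "\n\nQuestion: " + question)
    return {"model": model, "prompt": prompt, "stream": False}

def ask(question, chunks):
    """POST the grounded prompt to the local Ollama instance (requires Ollama running)."""
    data = json.dumps(build_payload(question, chunks)).encode()
    req = urllib.request.Request(OLLAMA_URL, data=data,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]
```

Because the prompt contains only the retrieved sections, the model's answer is constrained to your documents rather than its general training data.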
Choosing the Right Model for RAG
RAG quality depends heavily on the model. Here are recommendations:
| Model | RAM | RAG Quality | Best For |
|---|---|---|---|
| Llama 3.1 8B | 8 GB | Good | General document Q&A |
| Qwen 2.5 7B | 8 GB | Good | Multilingual documents |
| Qwen 2.5 14B | 16 GB | Very good | Complex documents, better accuracy |
| DeepSeek R1 8B | 8 GB | Good | Analytical / reasoning tasks |
Recommendation: Use Qwen 2.5 14B if you have 16GB RAM — it handles document comprehension noticeably better than 7-8B models.
Tips for Better RAG Results
Document Preparation
- Clean your documents — remove headers, footers, and navigation text from PDFs
- Use text-based PDFs — scanned PDFs need OCR first (AnythingLLM handles some OCR automatically)
- Break large documents into sections — smaller chunks improve retrieval accuracy
- Use descriptive filenames — helps you organize and find documents later
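If you want to pre-split a large file before uploading, a small script can do it. This is a hypothetical helper of my own, assuming your document is Markdown and that `## ` headings mark section boundaries:

```python
import re

def split_by_heading(markdown_text):
    """Split a Markdown document into sections at '## ' headings."""
    parts = re.split(r"(?m)^(?=## )", markdown_text)
    return [p.strip() for p in parts if p.strip()]

doc = "Intro text.\n## Methods\nDetails here.\n## Results\nNumbers here.\n"
sections = split_by_heading(doc)
# Each section can now be saved as its own file and uploaded separately.
```

Smaller, topic-focused files give the retriever cleaner boundaries to match against than one monolithic document.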
Asking Better Questions
- Be specific — "What is the revenue for Q3 2025?" beats "Tell me about revenue"
- Reference document types — "According to the meeting notes, what was decided?"
- Ask for sources — "What evidence supports this answer?"
- Iterate — if the first answer isn't great, rephrase and ask again
Performance Optimization
- Close other apps — free up RAM for the model
- Use Q4_K_M quantization — best balance of speed and quality for RAG
- Limit workspace size — 50-100 documents per workspace works best
- Rebuild the index if you add many documents at once
Common Issues and Fixes
"No relevant context found"
The model can't find matching content in your documents:
- Check that documents were processed successfully
- Try rephrasing your question
- Make sure the document actually contains the information you're asking about
Slow responses
- Try a smaller model (Qwen 2.5 7B instead of 14B)
- Reduce the number of documents in the workspace
- Check RAM usage — close other apps if needed
Incorrect answers
- LLMs can hallucinate — always verify important answers against the source document
- Use a larger model for better accuracy
- Ask the model to cite the specific section it used
Advanced: RAG with Cloud GPU
If you want to use a powerful model like Llama 3.1 70B for RAG but don't have the hardware:
- Deploy Ollama on Runpod with an A100 GPU
- Run Open WebUI on Runpod alongside Ollama
- Upload your documents to the cloud instance
- Access from your browser
This gives you enterprise-grade RAG quality at pay-per-hour pricing. Your data is removed when you terminate the instance (a stopped pod may keep its volume, so terminate fully if privacy matters).