Advanced RAG Techniques — Chunking, Reranking, and Hybrid Search
Go beyond basic RAG. Learn chunking strategies, embedding model selection, reranking, and hybrid search to get more accurate answers from your local documents.
Basic RAG works — upload a PDF, ask a question, get an answer. But the quality of those answers depends heavily on how you chunk your documents, which embedding model you use, and how you retrieve relevant text. If your RAG setup returns irrelevant results or misses key information, the problem isn't the language model. It's the pipeline feeding it.
This guide covers the techniques that separate a working RAG system from a great one: smart chunking, embedding selection, reranking, and hybrid search. All running locally with Ollama.
The RAG Pipeline Refresher
Before diving into optimizations, here's what happens when you ask a question:
1. Chunking — your documents are split into smaller pieces
2. Embedding — each chunk is converted to a vector (a list of numbers)
3. Storage — vectors are stored in a vector database
4. Retrieval — your question is embedded, and similar chunks are found
5. Generation — the LLM reads those chunks and generates an answer
Steps 1-4 determine what information the LLM sees. Garbage in, garbage out. Let's optimize each step.
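To make steps 2-4 concrete, here's a deliberately tiny sketch: a letter-frequency vector stands in for a real embedding model, and a plain Python list stands in for the vector database. Every name here is illustrative.

```python
import numpy as np

def embed(text):
    # Toy stand-in: a real system would call an embedding model here
    vec = np.zeros(26)
    for ch in text.lower():
        if "a" <= ch <= "z":
            vec[ord(ch) - ord("a")] += 1
    return vec / (np.linalg.norm(vec) or 1.0)

# "Storage": a plain list stands in for a vector database
store = [(chunk, embed(chunk)) for chunk in ["cats purr", "dogs bark"]]

def retrieve(question, top_k=1):
    # Embed the question and rank stored chunks by cosine similarity
    # (vectors are unit-length, so the dot product is the cosine)
    q = embed(question)
    ranked = sorted(store, key=lambda item: float(q @ item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:top_k]]
```

Swap in a real embedding model and a real vector store and the shape of the pipeline stays exactly the same.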
Chunking Strategies
Chunking is the single most impactful RAG parameter. Bad chunks mean bad retrieval, regardless of how good your model is.
Fixed-Size Chunking
The simplest approach — split text every N characters or tokens, with optional overlap.
```python
def fixed_chunk(text, chunk_size=500, overlap=50):
    chunks = []
    start = 0
    while start < len(text):
        end = start + chunk_size
        chunks.append(text[start:end])
        start += chunk_size - overlap
    return chunks
```

Pros: Simple, predictable chunk sizes, easy to implement.
Cons: Splits mid-sentence, breaks semantic meaning, can separate related information.
When to use: As a baseline, or for homogeneous documents (logs, structured records).
Recursive Chunking
Split text hierarchically using multiple separators. Try paragraph breaks first, then sentences, then characters.
```python
def recursive_chunk(text, max_size=500, separators=("\n\n", "\n", ". ", " ")):
    if len(text) <= max_size:
        return [text]
    # Try each separator in order, from coarse to fine
    for sep in separators:
        parts = text.split(sep)
        if len(parts) > 1:
            chunks = []
            current = ""
            for part in parts:
                candidate = current + sep + part if current else part
                if len(candidate) <= max_size:
                    current = candidate
                else:
                    if current:
                        chunks.append(current)
                    if len(part) > max_size:
                        # Recurse with finer separators on oversized parts
                        chunks.extend(recursive_chunk(part, max_size, separators))
                        current = ""
                    else:
                        current = part
            if current:
                chunks.append(current)
            return chunks
    # Fallback to fixed-size slicing when no separator matches
    return [text[i:i + max_size] for i in range(0, len(text), max_size)]
```

Pros: Respects document structure, keeps paragraphs and sentences together.
Cons: Still purely text-based, doesn't understand meaning.
When to use: Most general-purpose RAG setups. This is the default in frameworks like LangChain for good reason.
Semantic Chunking
Group text by meaning, not size. Split when the semantic similarity between consecutive sentences drops below a threshold.
```python
import numpy as np

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def semantic_chunk(sentences, embeddings, threshold=0.7):
    chunks = []
    current_chunk = [sentences[0]]
    for i in range(1, len(sentences)):
        # Split where similarity between adjacent sentence embeddings drops
        sim = cosine_similarity(embeddings[i - 1], embeddings[i])
        if sim < threshold:
            chunks.append(" ".join(current_chunk))
            current_chunk = [sentences[i]]
        else:
            current_chunk.append(sentences[i])
    if current_chunk:
        chunks.append(" ".join(current_chunk))
    return chunks
```

Pros: Chunks respect topic boundaries, each chunk is semantically coherent.
Cons: Requires embedding every sentence first (slower processing), variable chunk sizes.
When to use: High-stakes RAG where precision matters — legal documents, medical records, technical documentation.
Chunking Recommendations
| Document Type | Best Strategy | Chunk Size | Overlap |
|---|---|---|---|
| Technical docs | Recursive | 400-600 tokens | 50-100 |
| Legal contracts | Semantic | Variable | N/A |
| Research papers | Recursive | 500-800 tokens | 100 |
| Meeting transcripts | Semantic | Variable | N/A |
| General PDFs | Recursive | 400-500 tokens | 50 |
| Code documentation | Recursive | 300-400 tokens | 50 |
Overlapping by 10-20% prevents information from being split across chunk boundaries. Always use overlap with fixed-size and recursive chunking.
Embedding Model Selection
Your embedding model determines how well "what the user asked" matches "what the document says." A bad embedding model makes similar concepts look unrelated.
Local Embedding Models for Ollama
| Model | Dimensions | Speed | Quality | Best For |
|---|---|---|---|---|
| nomic-embed-text | 768 | Fast | Good | General purpose, most RAG setups |
| mxbai-embed-large | 1024 | Medium | Very good | Complex documents, nuanced queries |
| all-minilm | 384 | Very fast | Decent | Large document sets, speed-critical |
Pulling and Testing Embedding Models
```shell
# Pull recommended models
ollama pull nomic-embed-text
ollama pull mxbai-embed-large

# Test embedding generation (embedding-only models are queried
# via the API rather than `ollama run`)
curl http://localhost:11434/api/embeddings \
  -d '{"model": "nomic-embed-text", "prompt": "What is machine learning?"}'
```

When to Use Which
- nomic-embed-text — default choice. Good quality, fast, 768 dimensions. Works for 90% of RAG use cases.
- mxbai-embed-large — when you need higher retrieval precision. Better at matching paraphrased queries to documents. Slower but worth it for complex knowledge bases.
- all-minilm — when you're embedding millions of chunks and speed matters more than perfect precision.
Rule of thumb: Start with nomic-embed-text. Switch to mxbai-embed-large if you notice missed retrievals on semantically similar but lexically different queries.
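Whichever model you pick, retrieval works the same way underneath: cosine similarity between the query vector and each chunk vector. A minimal sketch, using made-up 3-dimensional vectors standing in for real 768-dimensional embeddings:

```python
import numpy as np

def cosine(a, b):
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Made-up vectors standing in for real embeddings
query = [0.9, 0.1, 0.0]
chunk_about_ml = [0.8, 0.2, 0.1]
chunk_about_cooking = [0.0, 0.1, 0.9]

# The ML chunk sits much closer to the query in vector space
print(cosine(query, chunk_about_ml) > cosine(query, chunk_about_cooking))  # True
```

A better embedding model doesn't change this math; it places semantically related texts closer together so the comparison is more often right.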
Reranking
Vector search finds similar chunks, but similarity doesn't always mean relevance. Reranking takes the top-K results from vector search and scores them for actual relevance to the query.
How Reranking Works
- Vector search retrieves top 20-50 chunks
- A cross-encoder model scores each (query, chunk) pair for relevance
- Results are re-sorted by relevance score
- Only the top N (usually 5-10) are sent to the LLM
Implementing Reranking Locally
```python
from sentence_transformers import CrossEncoder

# Load a local cross-encoder model
reranker = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')

def rerank(query, chunks, top_n=5):
    # Score each (query, chunk) pair
    pairs = [(query, chunk) for chunk in chunks]
    scores = reranker.predict(pairs)
    # Sort by score and return the top N chunks
    ranked = sorted(zip(scores, chunks), key=lambda x: x[0], reverse=True)
    return [chunk for score, chunk in ranked[:top_n]]

# Usage (vector_search is your existing retrieval step)
initial_results = vector_search(query, top_k=20)
reranked_results = rerank(query, initial_results, top_n=5)
```

When Reranking Helps
- Query is ambiguous — reranking disambiguates by considering full query-chunk context
- Document has similar-but-different sections — reranking separates truly relevant from merely similar
- You're retrieving many chunks — the more chunks you fetch, the more noise you need to filter
Performance tip: Reranking is slower than vector search (it processes each pair individually). Use it when quality matters more than speed. For real-time chat, skip it. For document analysis, always use it.
Hybrid Search
Vector search captures semantic similarity. Keyword search (BM25) captures exact term matches. Hybrid search combines both for the best of both worlds.
Why Hybrid Search Matters
Vector search alone fails when:
- The user searches for a specific product code, error message, or name
- The answer depends on exact terminology
- The query contains rare words not well-represented in the embedding space
BM25 alone fails when:
- The user paraphrases differently from the document
- The query uses synonyms the document doesn't
- The meaning matters more than the words
Implementing Hybrid Search
```python
import math
from collections import Counter

def bm25_score(query_tokens, doc_tokens, corpus, k1=1.5, b=0.75):
    """Simple BM25 score for one document. corpus is the list of all tokenized docs."""
    score = 0.0
    doc_freq = Counter(doc_tokens)
    N = len(corpus)
    avg_dl = sum(len(doc) for doc in corpus) / N
    for term in query_tokens:
        if term in doc_freq:
            tf = doc_freq[term]
            # IDF component: how many documents contain this term
            df = sum(1 for doc in corpus if term in doc)
            idf = math.log((N - df + 0.5) / (df + 0.5) + 1)
            # TF component with document-length normalization
            tf_score = (tf * (k1 + 1)) / (tf + k1 * (1 - b + b * len(doc_tokens) / avg_dl))
            score += idf * tf_score
    return score

def normalize(scores):
    """Min-max normalize scores to the [0, 1] range."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]

def hybrid_search(query, chunks, vector_scores, alpha=0.7):
    """
    Combine vector and BM25 scores.
    alpha controls the blend: 1.0 = pure vector, 0.0 = pure BM25
    """
    # Naive whitespace tokenization; use a proper tokenizer in production
    corpus = [chunk.lower().split() for chunk in chunks]
    query_tokens = query.lower().split()
    bm25_scores = [bm25_score(query_tokens, doc, corpus) for doc in corpus]
    # Normalize both score sets to [0, 1] so they're comparable
    vector_normalized = normalize(vector_scores)
    bm25_normalized = normalize(bm25_scores)
    # Weighted combination
    combined = []
    for i in range(len(chunks)):
        score = alpha * vector_normalized[i] + (1 - alpha) * bm25_normalized[i]
        combined.append((score, chunks[i]))
    combined.sort(key=lambda x: x[0], reverse=True)
    return [chunk for score, chunk in combined]
```

The Alpha Parameter
The alpha parameter controls how much weight goes to vector vs. keyword search:
| Alpha | Behavior | Best For |
|---|---|---|
| 1.0 | Pure vector search | Conceptual queries, paraphrased questions |
| 0.7 | Mostly vector, some keyword | General purpose — good starting point |
| 0.5 | Equal weight | Balanced queries |
| 0.3 | Mostly keyword, some vector | Technical docs with specific terms |
| 0.0 | Pure BM25 | Exact match queries, code search |
Start with alpha=0.7 and adjust based on your retrieval quality.
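To see what tuning alpha actually does, here's a toy blend of pre-normalized scores for two hypothetical chunks (the numbers are invented): one matches the query's meaning, the other matches its exact keywords.

```python
def blend(vector_scores, bm25_scores, alpha):
    # Weighted sum of score lists that are already normalized to [0, 1]
    return [alpha * v + (1 - alpha) * k for v, k in zip(vector_scores, bm25_scores)]

# Invented scores for two chunks:
# chunk A matches the query's meaning, chunk B matches its exact keywords
vector_scores = [0.9, 0.4]   # A wins on semantics
bm25_scores = [0.2, 1.0]     # B wins on keywords

print(blend(vector_scores, bm25_scores, alpha=0.7))  # semantics-leaning: A ranks first
print(blend(vector_scores, bm25_scores, alpha=0.3))  # keyword-leaning: B ranks first
```

The same two chunks swap places as alpha moves, which is exactly the lever the table above describes.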
Putting It All Together
Here's an optimized RAG pipeline combining all techniques:
- Recursive chunking with 500-token chunks and 100-token overlap
- nomic-embed-text for embeddings
- Hybrid search with alpha=0.7
- Cross-encoder reranking on top 20 results
- Feed top 5 chunks to your LLM
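As a sketch, the pipeline above reduces to a few composed steps. The search, rerank, and generate callables are placeholders for your own implementations; the names and signatures are illustrative, not any framework's API.

```python
def answer(query, chunks, *, search, rerank, generate, top_k=20, top_n=5):
    """Retrieval -> reranking -> generation with pluggable components."""
    # Hybrid search retrieves a broad candidate set
    candidates = search(query, chunks, top_k)
    # Cross-encoder reranking keeps only the most relevant few
    best = rerank(query, candidates, top_n)
    # The LLM answers grounded in the surviving chunks
    context = "\n\n".join(best)
    return generate(query, context)
```

Keeping the components pluggable makes it easy to measure each stage's contribution, for example by swapping the reranker for an identity function and comparing answer quality.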
With Open WebUI
Open WebUI handles chunking and embedding automatically. To optimize:
- Go to Settings → Documents
- Select nomic-embed-text as the embedding model
- Adjust the chunk size (try 400-600 with 100 overlap)
- Increase top-K to 10-20 for better recall
With AnythingLLM
AnythingLLM gives you workspace-level control:
- Create separate workspaces for different document types
- Use the built-in embedding (based on Ollama)
- Adjust chunk settings in workspace configuration
- Use the citation feature to verify which chunks were used
Measuring RAG Quality
How do you know if your improvements actually helped?
Manual Evaluation
Create a test set of 20-30 questions with known correct answers from your documents. Run each question through your RAG pipeline and check:
- Recall — did the retrieval find the right chunk?
- Faithfulness — does the answer stick to the retrieved content?
- Relevance — does the answer actually address the question?
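Of the three checks, recall is the easiest to automate. Given a test set mapping each question to the chunk that contains its answer, recall@k is just the fraction of questions whose gold chunk appears in the top k retrieved results (retrieve here is a hypothetical stand-in for your pipeline's retrieval step):

```python
def recall_at_k(test_set, retrieve, k=5):
    """test_set: (question, gold_chunk_id) pairs; retrieve(q) -> ranked chunk ids."""
    hits = sum(1 for question, gold in test_set if gold in retrieve(question)[:k])
    return hits / len(test_set)
```

Run this after every pipeline change so you know whether a new chunking strategy or alpha value actually moved the number.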
Key Metrics to Track
| Metric | What It Measures | Good Target |
|---|---|---|
| Retrieval recall @ 5 | Is the right chunk in top 5? | > 85% |
| Answer faithfulness | Answer matches source content | > 90% |
| Answer relevance | Answer addresses the question | > 85% |
If recall is below 85%, improve chunking or try hybrid search. If faithfulness is low, your LLM is hallucinating — try a larger model or more focused prompts. If relevance is low, improve your query processing or reranking.
Common Pitfalls
Chunks too small (under 200 tokens): You lose context. The model sees fragments without the surrounding explanation.
Chunks too large (over 1000 tokens): You dilute relevance. The retrieval matches broadly but misses specific answers buried in large chunks.
No overlap: Information at chunk boundaries gets lost. Always use 10-20% overlap.
Ignoring document structure: Headers, lists, and code blocks should inform your chunking strategy, not be treated as plain text.
One embedding model for everything: Different domains benefit from different embedding models. If you're working with technical or specialized content, test alternatives.