Advanced RAG Techniques — Chunking, Reranking, and Hybrid Search
Go beyond basic RAG. Learn chunking strategies, embedding model selection, reranking, and hybrid search to get more accurate answers from your local documents.
Basic RAG works — upload a PDF, ask a question, get an answer. But the quality of those answers depends heavily on how you chunk your documents, which embedding model you use, and how you retrieve relevant text. If your RAG setup returns irrelevant results or misses key information, the problem isn't the language model. It's the pipeline feeding it.
This guide covers the techniques that separate a working RAG system from a great one: smart chunking, embedding selection, reranking, and hybrid search. All running locally with Ollama.
The RAG Pipeline Refresher
Before diving into optimizations, here's what happens when you ask a question:
1. Chunking — your documents are split into smaller pieces
2. Embedding — each chunk is converted to a vector (a list of numbers)
3. Storage — vectors are stored in a vector database
4. Retrieval — your question is embedded, and similar chunks are found
5. Generation — the LLM reads those chunks and generates an answer
Steps 1-4 determine what information the LLM sees. Garbage in, garbage out. Let's optimize each step.
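To make steps 2-4 concrete, here's a deliberately tiny sketch: a letter-frequency vector stands in for a real embedding model, and a plain Python list stands in for the vector database. Every name here is illustrative.

```python
import numpy as np

def embed(text):
    # Toy stand-in: a real system would call an embedding model here
    vec = np.zeros(26)
    for ch in text.lower():
        if "a" <= ch <= "z":
            vec[ord(ch) - ord("a")] += 1
    return vec / (np.linalg.norm(vec) or 1.0)

# "Storage": a plain list stands in for a vector database
store = [(chunk, embed(chunk)) for chunk in ["cats purr", "dogs bark"]]

def retrieve(question, top_k=1):
    # Embed the question and rank stored chunks by cosine similarity
    # (vectors are unit-length, so the dot product is the cosine)
    q = embed(question)
    ranked = sorted(store, key=lambda item: float(q @ item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:top_k]]
```

Swap in a real embedding model and a real vector store and the shape of the pipeline stays exactly the same.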
Chunking Strategies
Chunking is the single most impactful RAG parameter. Bad chunks mean bad retrieval, regardless of how good your model is.
Fixed-Size Chunking
The simplest approach — split text every N characters or tokens, with optional overlap.
```python
def fixed_chunk(text, chunk_size=500, overlap=50):
    chunks = []
    start = 0
    while start < len(text):
        end = start + chunk_size
        chunks.append(text[start:end])
        start += chunk_size - overlap
    return chunks
```

Pros: Simple, predictable chunk sizes, easy to implement.
Cons: Splits mid-sentence, breaks semantic meaning, can separate related information.
When to use: As a baseline, or for homogeneous documents (logs, structured records).
Recursive Chunking
Split text hierarchically using multiple separators. Try paragraph breaks first, then sentences, then characters.
```python
def recursive_chunk(text, max_size=500, separators=("\n\n", "\n", ". ", " ")):
    if len(text) <= max_size:
        return [text]
    # Try each separator in order, from coarse to fine
    for sep in separators:
        parts = text.split(sep)
        if len(parts) > 1:
            chunks = []
            current = ""
            for part in parts:
                candidate = current + sep + part if current else part
                if len(candidate) <= max_size:
                    current = candidate
                else:
                    if current:
                        chunks.append(current)
                    if len(part) > max_size:
                        # Recurse with finer separators on oversized parts
                        chunks.extend(recursive_chunk(part, max_size, separators))
                        current = ""
                    else:
                        current = part
            if current:
                chunks.append(current)
            return chunks
    # Fallback to fixed-size slicing when no separator matches
    return [text[i:i + max_size] for i in range(0, len(text), max_size)]
```

Pros: Respects document structure, keeps paragraphs and sentences together.
Cons: Still purely text-based, doesn't understand meaning.
When to use: Most general-purpose RAG setups. This is the default in frameworks like LangChain for good reason.
Semantic Chunking
Group text by meaning, not size. Split when the semantic similarity between consecutive sentences drops below a threshold.
```python
import numpy as np

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def semantic_chunk(sentences, embeddings, threshold=0.7):
    chunks = []
    current_chunk = [sentences[0]]
    for i in range(1, len(sentences)):
        # Split where similarity between adjacent sentence embeddings drops
        sim = cosine_similarity(embeddings[i - 1], embeddings[i])
        if sim < threshold:
            chunks.append(" ".join(current_chunk))
            current_chunk = [sentences[i]]
        else:
            current_chunk.append(sentences[i])
    if current_chunk:
        chunks.append(" ".join(current_chunk))
    return chunks
```

Pros: Chunks respect topic boundaries, each chunk is semantically coherent.
Cons: Requires embedding every sentence first (slower processing), variable chunk sizes.
When to use: High-stakes RAG where precision matters — legal documents, medical records, technical documentation.
Chunking Recommendations
| Document Type | Best Strategy | Chunk Size | Overlap |
|---|---|---|---|
| Technical docs | Recursive | 400-600 tokens | 50-100 |
| Legal contracts | Semantic | Variable | N/A |
| Research papers | Recursive | 500-800 tokens | 100 |
| Meeting transcripts | Semantic | Variable | N/A |
| General PDFs | Recursive | 400-500 tokens | 50 |
| Code documentation | Recursive | 300-400 tokens | 50 |
Overlapping by 10-20% prevents information from being split across chunk boundaries. Always use overlap with fixed-size and recursive chunking.
Embedding Model Selection
Your embedding model determines how well "what the user asked" matches "what the document says." A bad embedding model makes similar concepts look unrelated.
Local Embedding Models for Ollama
| Model | Dimensions | Speed | Quality | Best For |
|---|---|---|---|---|
| nomic-embed-text | 768 | Fast | Good | General purpose, most RAG setups |
| mxbai-embed-large | 1024 | Medium | Very good | Complex documents, nuanced queries |
| all-minilm | 384 | Very fast | Decent | Large document sets, speed-critical |
Pulling and Testing Embedding Models
```shell
# Pull recommended models
ollama pull nomic-embed-text
ollama pull mxbai-embed-large

# Test embedding generation (embedding-only models are queried
# via the API rather than `ollama run`)
curl http://localhost:11434/api/embeddings \
  -d '{"model": "nomic-embed-text", "prompt": "What is machine learning?"}'
```

When to Use Which
- nomic-embed-text — default choice. Good quality, fast, 768 dimensions. Works for 90% of RAG use cases.
- mxbai-embed-large — when you need higher retrieval precision. Better at matching paraphrased queries to documents. Slower but worth it for complex knowledge bases.
- all-minilm — when you're embedding millions of chunks and speed matters more than perfect precision.
Rule of thumb: Start with nomic-embed-text. Switch to mxbai-embed-large if you notice missed retrievals on semantically similar but lexically different queries.
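Whichever model you pick, retrieval works the same way underneath: cosine similarity between the query vector and each chunk vector. A minimal sketch, using made-up 3-dimensional vectors standing in for real 768-dimensional embeddings:

```python
import numpy as np

def cosine(a, b):
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Made-up vectors standing in for real embeddings
query = [0.9, 0.1, 0.0]
chunk_about_ml = [0.8, 0.2, 0.1]
chunk_about_cooking = [0.0, 0.1, 0.9]

# The ML chunk sits much closer to the query in vector space
print(cosine(query, chunk_about_ml) > cosine(query, chunk_about_cooking))  # True
```

A better embedding model doesn't change this math; it places semantically related texts closer together so the comparison is more often right.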
Reranking
Vector search finds similar chunks, but similarity doesn't always mean relevance. Reranking takes the top-K results from vector search and scores them for actual relevance to the query.
How Reranking Works
- Vector search retrieves top 20-50 chunks
- A cross-encoder model scores each (query, chunk) pair for relevance
- Results are re-sorted by relevance score
- Only the top N (usually 5-10) are sent to the LLM
Implementing Reranking Locally
```python
from sentence_transformers import CrossEncoder

# Load a local cross-encoder model
reranker = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')

def rerank(query, chunks, top_n=5):
    # Score each (query, chunk) pair
    pairs = [(query, chunk) for chunk in chunks]
    scores = reranker.predict(pairs)
    # Sort by score and return the top N chunks
    ranked = sorted(zip(scores, chunks), key=lambda x: x[0], reverse=True)
    return [chunk for score, chunk in ranked[:top_n]]

# Usage (vector_search is your existing retrieval step)
initial_results = vector_search(query, top_k=20)
reranked_results = rerank(query, initial_results, top_n=5)
```

When Reranking Helps
- Query is ambiguous — reranking disambiguates by considering full query-chunk context
- Document has similar-but-different sections — reranking separates truly relevant from merely similar
- You're retrieving many chunks — the more chunks you fetch, the more noise you need to filter
Performance tip: Reranking is slower than vector search (it processes each pair individually). Use it when quality matters more than speed. For real-time chat, skip it. For document analysis, always use it.
Hybrid Search
Vector search captures semantic similarity. Keyword search (BM25) captures exact term matches. Hybrid search combines both for the best of both worlds.
Why Hybrid Search Matters
Vector search alone fails when:
- The user searches for a specific product code, error message, or name
- The answer depends on exact terminology
- The query contains rare words not well-represented in the embedding space
BM25 alone fails when:
- The user paraphrases differently from the document
- The query uses synonyms the document doesn't
- The meaning matters more than the words
Implementing Hybrid Search
```python
import math
from collections import Counter

def bm25_score(query_tokens, doc_tokens, corpus, k1=1.5, b=0.75):
    """Simple BM25 score for one document. corpus is the list of all tokenized docs."""
    score = 0.0
    doc_freq = Counter(doc_tokens)
    N = len(corpus)
    avg_dl = sum(len(doc) for doc in corpus) / N
    for term in query_tokens:
        if term in doc_freq:
            tf = doc_freq[term]
            # IDF component: how many documents contain this term
            df = sum(1 for doc in corpus if term in doc)
            idf = math.log((N - df + 0.5) / (df + 0.5) + 1)
            # TF component with document-length normalization
            tf_score = (tf * (k1 + 1)) / (tf + k1 * (1 - b + b * len(doc_tokens) / avg_dl))
            score += idf * tf_score
    return score

def normalize(scores):
    """Min-max normalize scores to the [0, 1] range."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]

def hybrid_search(query, chunks, vector_scores, alpha=0.7):
    """
    Combine vector and BM25 scores.
    alpha controls the blend: 1.0 = pure vector, 0.0 = pure BM25
    """
    # Naive whitespace tokenization; use a proper tokenizer in production
    corpus = [chunk.lower().split() for chunk in chunks]
    query_tokens = query.lower().split()
    bm25_scores = [bm25_score(query_tokens, doc, corpus) for doc in corpus]
    # Normalize both score sets to [0, 1] so they're comparable
    vector_normalized = normalize(vector_scores)
    bm25_normalized = normalize(bm25_scores)
    # Weighted combination
    combined = []
    for i in range(len(chunks)):
        score = alpha * vector_normalized[i] + (1 - alpha) * bm25_normalized[i]
        combined.append((score, chunks[i]))
    combined.sort(key=lambda x: x[0], reverse=True)
    return [chunk for score, chunk in combined]
```

The Alpha Parameter
The alpha parameter controls how much weight goes to vector vs. keyword search:
| Alpha | Behavior | Best For |
|---|---|---|
| 1.0 | Pure vector search | Conceptual queries, paraphrased questions |
| 0.7 | Mostly vector, some keyword | General purpose — good starting point |
| 0.5 | Equal weight | Balanced queries |
| 0.3 | Mostly keyword, some vector | Technical docs with specific terms |
| 0.0 | Pure BM25 | Exact match queries, code search |
Start with alpha=0.7 and adjust based on your retrieval quality.
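To see what tuning alpha actually does, here's a toy blend of pre-normalized scores for two hypothetical chunks (the numbers are invented): one matches the query's meaning, the other matches its exact keywords.

```python
def blend(vector_scores, bm25_scores, alpha):
    # Weighted sum of score lists that are already normalized to [0, 1]
    return [alpha * v + (1 - alpha) * k for v, k in zip(vector_scores, bm25_scores)]

# Invented scores for two chunks:
# chunk A matches the query's meaning, chunk B matches its exact keywords
vector_scores = [0.9, 0.4]   # A wins on semantics
bm25_scores = [0.2, 1.0]     # B wins on keywords

print(blend(vector_scores, bm25_scores, alpha=0.7))  # semantics-leaning: A ranks first
print(blend(vector_scores, bm25_scores, alpha=0.3))  # keyword-leaning: B ranks first
```

The same two chunks swap places as alpha moves, which is exactly the lever the table above describes.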
Putting It All Together
Here's an optimized RAG pipeline combining all techniques:
- Recursive chunking with 500-token chunks and 100-token overlap
- nomic-embed-text for embeddings
- Hybrid search with alpha=0.7
- Cross-encoder reranking on top 20 results
- Feed top 5 chunks to your LLM
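As a sketch, the pipeline above reduces to a few composed steps. The search, rerank, and generate callables are placeholders for your own implementations; the names and signatures are illustrative, not any framework's API.

```python
def answer(query, chunks, *, search, rerank, generate, top_k=20, top_n=5):
    """Retrieval -> reranking -> generation with pluggable components."""
    # Hybrid search retrieves a broad candidate set
    candidates = search(query, chunks, top_k)
    # Cross-encoder reranking keeps only the most relevant few
    best = rerank(query, candidates, top_n)
    # The LLM answers grounded in the surviving chunks
    context = "\n\n".join(best)
    return generate(query, context)
```

Keeping the components pluggable makes it easy to measure each stage's contribution, for example by swapping the reranker for an identity function and comparing answer quality.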
With Open WebUI
Open WebUI handles chunking and embedding automatically. To optimize:
- Go to Settings → Documents
- Select nomic-embed-text as the embedding model
- Adjust the chunk size (try 400-600 with 100 overlap)
- Increase top-K to 10-20 for better recall
With AnythingLLM
AnythingLLM gives you workspace-level control:
- Create separate workspaces for different document types
- Use the built-in embedding (based on Ollama)
- Adjust chunk settings in workspace configuration
- Use the citation feature to verify which chunks were used
Measuring RAG Quality
How do you know if your improvements actually helped?
Manual Evaluation
Create a test set of 20-30 questions with known correct answers from your documents. Run each question through your RAG pipeline and check:
- Recall — did the retrieval find the right chunk?
- Faithfulness — does the answer stick to the retrieved content?
- Relevance — does the answer actually address the question?
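Of the three checks, recall is the easiest to automate. Given a test set mapping each question to the chunk that contains its answer, recall@k is just the fraction of questions whose gold chunk appears in the top k retrieved results (retrieve here is a hypothetical stand-in for your pipeline's retrieval step):

```python
def recall_at_k(test_set, retrieve, k=5):
    """test_set: (question, gold_chunk_id) pairs; retrieve(q) -> ranked chunk ids."""
    hits = sum(1 for question, gold in test_set if gold in retrieve(question)[:k])
    return hits / len(test_set)
```

Run this after every pipeline change so you know whether a new chunking strategy or alpha value actually moved the number.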
Key Metrics to Track
| Metric | What It Measures | Good Target |
|---|---|---|
| Retrieval recall @ 5 | Is the right chunk in top 5? | > 85% |
| Answer faithfulness | Answer matches source content | > 90% |
| Answer relevance | Answer addresses the question | > 85% |
If recall is below 85%, improve chunking or try hybrid search. If faithfulness is low, your LLM is hallucinating — try a larger model or more focused prompts. If relevance is low, improve your query processing or reranking.
Common Pitfalls
Chunks too small (under 200 tokens): You lose context. The model sees fragments without the surrounding explanation.
Chunks too large (over 1000 tokens): You dilute relevance. The retrieval matches broadly but misses specific answers buried in large chunks.
No overlap: Information at chunk boundaries gets lost. Always use 10-20% overlap.
Ignoring document structure: Headers, lists, and code blocks should inform your chunking strategy, not be treated as plain text.
One embedding model for everything: Different domains benefit from different embedding models. If you're working with technical or specialized content, test alternatives.