Advanced RAG Techniques — Chunking, Reranking, and Hybrid Search
2026/04/22
Advanced · 15 min read

Go beyond basic RAG. Learn chunking strategies, embedding model selection, reranking, and hybrid search to get more accurate answers from your local documents.

Basic RAG works — upload a PDF, ask a question, get an answer. But the quality of those answers depends heavily on how you chunk your documents, which embedding model you use, and how you retrieve relevant text. If your RAG setup returns irrelevant results or misses key information, the problem isn't the language model. It's the pipeline feeding it.

This guide covers the techniques that separate a working RAG system from a great one: smart chunking, embedding selection, reranking, and hybrid search. All running locally with Ollama.

The RAG Pipeline Refresher

Before diving into optimizations, here's what happens when you ask a question:

  1. Chunking — your documents are split into smaller pieces
  2. Embedding — each chunk is converted to a vector (a list of numbers)
  3. Storage — vectors are stored in a vector database
  4. Retrieval — your question is embedded, and similar chunks are found
  5. Generation — the LLM reads those chunks and generates an answer

Steps 1-4 determine what information the LLM sees. Garbage in, garbage out. Let's optimize each step.
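The retrieval steps above can be sketched end to end in a few lines. This toy version uses word-count vectors in place of a real embedding model and a plain list in place of a vector database, so the mechanics are visible without any dependencies (`toy_embed` and the sample chunks are made up for illustration):

```python
import math
import re
from collections import Counter

def toy_embed(text):
    # Stand-in for a real embedding model: a bag-of-words count vector
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine(a, b):
    # Cosine similarity between two sparse count vectors
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Steps 2-3: embed each chunk and "store" the vectors
chunks = [
    "Ollama runs language models locally.",
    "Chunk overlap prevents boundary information loss.",
]
store = [(toy_embed(c), c) for c in chunks]

# Step 4: embed the question and retrieve the most similar chunk
query_vec = toy_embed("why use chunk overlap?")
best = max(store, key=lambda item: cosine(query_vec, item[0]))
print(best[1])
```

A real pipeline swaps `toy_embed` for an embedding model and the list for a vector database, but the retrieval logic is exactly this: embed, compare, pick the closest.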

Chunking Strategies

Chunking is the single most impactful RAG parameter. Bad chunks mean bad retrieval, regardless of how good your model is.

Fixed-Size Chunking

The simplest approach — split text every N characters or tokens, with optional overlap.

def fixed_chunk(text, chunk_size=500, overlap=50):
    """Split text into fixed-size character chunks with overlap."""
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        # Step forward by less than the chunk size so chunks overlap
        start += chunk_size - overlap
    return chunks

Pros: Simple, predictable chunk sizes, easy to implement.

Cons: Splits mid-sentence, breaks semantic meaning, can separate related information.

When to use: As a baseline, or for homogeneous documents (logs, structured records).
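For a token-based variant of the same idea, you can count whitespace tokens instead of characters. Real pipelines usually count tokens with the model's own tokenizer, so the whitespace split here is a simplification:

```python
def fixed_chunk_tokens(text, chunk_size=500, overlap=50):
    """Fixed-size chunking by whitespace token count, with overlap."""
    tokens = text.split()
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start:start + chunk_size]))
        if start + chunk_size >= len(tokens):
            break
    return chunks

# 1,200 tokens with size 500 and overlap 50 yields three chunks
words = " ".join(f"w{i}" for i in range(1200))
out = fixed_chunk_tokens(words, chunk_size=500, overlap=50)
print(len(out))
```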

Recursive Chunking

Split text hierarchically using multiple separators. Try paragraph breaks first, then sentences, then characters.

def recursive_chunk(text, max_size=500, separators=("\n\n", "\n", ". ", " ")):
    """Split hierarchically: paragraphs first, then sentences, then words."""
    if len(text) <= max_size:
        return [text]

    # Try each separator in order, coarsest first
    for sep in separators:
        parts = text.split(sep)
        if len(parts) > 1:
            chunks = []
            current = ""
            for part in parts:
                candidate = current + sep + part if current else part
                if len(candidate) <= max_size:
                    current = candidate
                elif len(part) > max_size:
                    # Flush what we have, then recurse on the oversized part
                    if current:
                        chunks.append(current)
                    chunks.extend(recursive_chunk(part, max_size, separators))
                    current = ""
                else:
                    # Part fits on its own — start a new chunk with it
                    if current:
                        chunks.append(current)
                    current = part
            if current:
                chunks.append(current)
            return chunks
    # No separator matched — fall back to fixed-size slices
    return [text[i:i + max_size] for i in range(0, len(text), max_size)]

Pros: Respects document structure, keeps paragraphs and sentences together.

Cons: Still purely text-based, doesn't understand meaning.

When to use: Most general-purpose RAG setups. This is the default in frameworks like LangChain for good reason.

Semantic Chunking

Group text by meaning, not size. Split when the semantic similarity between consecutive sentences drops below a threshold.

import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    a, b = np.asarray(a), np.asarray(b)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def semantic_chunk(sentences, embeddings, threshold=0.7):
    chunks = []
    current_chunk = [sentences[0]]

    for i in range(1, len(sentences)):
        # Cosine similarity between adjacent sentence embeddings
        sim = cosine_similarity(embeddings[i-1], embeddings[i])

        if sim < threshold:
            # Topic shift detected — close the current chunk
            chunks.append(" ".join(current_chunk))
            current_chunk = [sentences[i]]
        else:
            current_chunk.append(sentences[i])

    if current_chunk:
        chunks.append(" ".join(current_chunk))

    return chunks

Pros: Chunks respect topic boundaries, each chunk is semantically coherent.

Cons: Requires embedding every sentence first (slower processing), variable chunk sizes.

When to use: High-stakes RAG where precision matters — legal documents, medical records, technical documentation.

Chunking Recommendations

| Document Type | Best Strategy | Chunk Size | Overlap |
| --- | --- | --- | --- |
| Technical docs | Recursive | 400-600 tokens | 50-100 |
| Legal contracts | Semantic | Variable | N/A |
| Research papers | Recursive | 500-800 tokens | 100 |
| Meeting transcripts | Semantic | Variable | N/A |
| General PDFs | Recursive | 400-500 tokens | 50 |
| Code documentation | Recursive | 300-400 tokens | 50 |

Overlapping by 10-20% prevents information from being split across chunk boundaries. Always use overlap with fixed-size and recursive chunking.
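That overlap guarantee is easy to verify: with a chunk size of 500 and an overlap of 50, each chunk's last 50 characters reappear as the next chunk's first 50, so every boundary is covered twice (the chunker is restated here so the snippet runs on its own):

```python
def fixed_chunk(text, chunk_size=500, overlap=50):
    chunks, start = [], 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap
    return chunks

text = " ".join(f"sentence {i}" for i in range(200))
chunks = fixed_chunk(text)

# Every chunk boundary is covered twice, so nothing falls in a gap
for a, b in zip(chunks, chunks[1:]):
    assert a[-50:] == b[:50]
```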

Embedding Model Selection

Your embedding model determines how well "what the user asked" matches "what the document says." A bad embedding model makes similar concepts look unrelated.

Local Embedding Models for Ollama

| Model | Dimensions | Speed | Quality | Best For |
| --- | --- | --- | --- | --- |
| nomic-embed-text | 768 | Fast | Good | General purpose, most RAG setups |
| mxbai-embed-large | 1024 | Medium | Very good | Complex documents, nuanced queries |
| all-minilm | 384 | Very fast | Decent | Large document sets, speed-critical |

Pulling and Testing Embedding Models

# Pull recommended models
ollama pull nomic-embed-text
ollama pull mxbai-embed-large

# Test embedding generation via the API
# (embedding models can't be run as an interactive chat)
curl http://localhost:11434/api/embeddings \
  -d '{"model": "nomic-embed-text", "prompt": "What is machine learning?"}'

When to Use Which

  • nomic-embed-text — default choice. Good quality, fast, 768 dimensions. Works for 90% of RAG use cases.
  • mxbai-embed-large — when you need higher retrieval precision. Better at matching paraphrased queries to documents. Slower but worth it for complex knowledge bases.
  • all-minilm — when you're embedding millions of chunks and speed matters more than perfect precision.

Rule of thumb: Start with nomic-embed-text. Switch to mxbai-embed-large if you notice missed retrievals on semantically similar but lexically different queries.

Reranking

Vector search finds similar chunks, but similarity doesn't always mean relevance. Reranking takes the top-K results from vector search and scores them for actual relevance to the query.

How Reranking Works

  1. Vector search retrieves top 20-50 chunks
  2. A cross-encoder model scores each (query, chunk) pair for relevance
  3. Results are re-sorted by relevance score
  4. Only the top N (usually 5-10) are sent to the LLM

Implementing Reranking Locally

from sentence_transformers import CrossEncoder

# Load a local cross-encoder model
reranker = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')

def rerank(query, chunks, top_n=5):
    # Score each (query, chunk) pair with the cross-encoder
    pairs = [(query, chunk) for chunk in chunks]
    scores = reranker.predict(pairs)

    # Sort by relevance score, highest first, and keep the top N
    ranked = sorted(zip(scores, chunks), key=lambda pair: pair[0], reverse=True)
    return [chunk for score, chunk in ranked[:top_n]]

# Usage (vector_search is whatever retrieval your vector store provides)
initial_results = vector_search(query, top_k=20)
reranked_results = rerank(query, initial_results, top_n=5)

When Reranking Helps

  • Query is ambiguous — reranking disambiguates by considering full query-chunk context
  • Document has similar-but-different sections — reranking separates truly relevant from merely similar
  • You're retrieving many chunks — the more chunks you fetch, the more noise you need to filter

Performance tip: Reranking is slower than vector search (it processes each pair individually). Use it when quality matters more than speed. For real-time chat, skip it. For document analysis, always use it.
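The reordering effect is easy to see even without downloading a cross-encoder. In this sketch, a simple keyword-overlap scorer stands in for the model (the `overlap_score` name and sample candidates are made up for illustration; a real reranker would use the `CrossEncoder` shown above):

```python
def overlap_score(query, chunk):
    # Stand-in for a cross-encoder: count words the query and chunk share
    q = set(query.lower().split())
    c = set(chunk.lower().split())
    return len(q & c)

def rerank_toy(query, chunks, top_n=2, scorer=overlap_score):
    scores = [scorer(query, c) for c in chunks]
    ranked = sorted(zip(scores, chunks), key=lambda p: p[0], reverse=True)
    return [c for _, c in ranked[:top_n]]

candidates = [
    "GPU pricing varies by provider.",
    "Reranking improves retrieval relevance for ambiguous queries.",
    "Ollama pulls models from a registry.",
]
top = rerank_toy("how does reranking improve relevance", candidates)
print(top[0])
```

Swapping `overlap_score` for `reranker.predict` on (query, chunk) pairs turns this toy into the real thing; the sort-and-truncate logic is identical.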

Hybrid Search

Vector search captures semantic similarity. Keyword search (BM25) captures exact term matches. Hybrid search combines both for the best of both worlds.

Why Hybrid Search Matters

Vector search alone fails when:

  • The user searches for a specific product code, error message, or name
  • The answer depends on exact terminology
  • The query contains rare words not well-represented in the embedding space

BM25 alone fails when:

  • The user paraphrases differently from the document
  • The query uses synonyms the document doesn't
  • The meaning matters more than the words

Implementing Hybrid Search

import math
from collections import Counter

def bm25_score(query_tokens, doc_tokens, all_docs, avg_dl, k1=1.5, b=0.75):
    """BM25 score of one document (doc_tokens) against a query.
    all_docs is the whole corpus as lists of tokens, needed for the IDF."""
    score = 0.0
    term_freqs = Counter(doc_tokens)
    N = len(all_docs)
    for term in query_tokens:
        if term in term_freqs:
            tf = term_freqs[term]
            # IDF component: terms rare across the corpus count for more
            df = sum(1 for doc in all_docs if term in doc)
            idf = math.log((N - df + 0.5) / (df + 0.5) + 1)
            # TF component with document-length normalization
            tf_score = (tf * (k1 + 1)) / (tf + k1 * (1 - b + b * len(doc_tokens) / avg_dl))
            score += idf * tf_score
    return score

def normalize(scores):
    """Min-max normalize a list of scores to the [0, 1] range."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0] * len(scores)
    return [(s - lo) / (hi - lo) for s in scores]

def compute_bm25(query, chunks):
    """BM25 score for every chunk against the query."""
    docs = [chunk.lower().split() for chunk in chunks]
    avg_dl = sum(len(d) for d in docs) / len(docs)
    query_tokens = query.lower().split()
    return [bm25_score(query_tokens, d, docs, avg_dl) for d in docs]

def hybrid_search(query, chunks, vector_scores, alpha=0.7):
    """
    Combine vector and BM25 scores.
    alpha controls the blend: 1.0 = pure vector, 0.0 = pure BM25
    """
    # Normalize both score sets to [0, 1] so they're comparable
    vector_normalized = normalize(vector_scores)
    bm25_normalized = normalize(compute_bm25(query, chunks))

    # Weighted combination
    combined = []
    for i in range(len(chunks)):
        score = alpha * vector_normalized[i] + (1 - alpha) * bm25_normalized[i]
        combined.append((score, chunks[i]))

    combined.sort(key=lambda pair: pair[0], reverse=True)
    return [chunk for score, chunk in combined]

The Alpha Parameter

The alpha parameter controls how much weight goes to vector vs. keyword search:

| Alpha | Behavior | Best For |
| --- | --- | --- |
| 1.0 | Pure vector search | Conceptual queries, paraphrased questions |
| 0.7 | Mostly vector, some keyword | General purpose — good starting point |
| 0.5 | Equal weight | Balanced queries |
| 0.3 | Mostly keyword, some vector | Technical docs with specific terms |
| 0.0 | Pure BM25 | Exact match queries, code search |

Start with alpha=0.7 and adjust based on your retrieval quality.
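A quick numerical example shows what tuning alpha does. The scores below are made-up, already-normalized values for three chunks, not real model output:

```python
def blend(vector_scores, bm25_scores, alpha=0.7):
    # Weighted blend of two already-normalized score lists
    return [alpha * v + (1 - alpha) * k for v, k in zip(vector_scores, bm25_scores)]

# Toy normalized scores for three chunks (illustrative values)
vector_scores = [0.9, 0.4, 0.1]   # semantic match favors chunk 0
bm25_scores   = [0.0, 0.2, 1.0]   # exact keywords favor chunk 2

print(blend(vector_scores, bm25_scores, alpha=0.7))  # vector-leaning: chunk 0 wins
print(blend(vector_scores, bm25_scores, alpha=0.3))  # keyword-leaning: chunk 2 wins
```

The same candidate set produces a different winner depending on alpha, which is why the right value depends on whether your users ask conceptual questions or search for exact terms.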

Putting It All Together

Here's an optimized RAG pipeline combining all techniques:

  1. Recursive chunking with 500-token chunks and 100-token overlap
  2. nomic-embed-text for embeddings
  3. Hybrid search with alpha=0.7
  4. Cross-encoder reranking on top 20 results
  5. Feed top 5 chunks to your LLM
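The five steps above can be wired together in one function. This is a runnable sketch with toy stand-ins at each stage — `toy_embed` replaces the embedding model and plain keyword overlap replaces the cross-encoder — so the shape of the pipeline is clear without any model downloads (`answer_context` and the sample docs are illustrative names):

```python
import math
import re
from collections import Counter

def tokenize(text):
    return re.findall(r"[a-z]+", text.lower())

def toy_embed(text):
    # Stand-in for an embedding model: a bag-of-words count vector
    return Counter(tokenize(text))

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def answer_context(query, chunks, alpha=0.7, rerank_k=20, top_n=5):
    """Return the chunks that would be fed to the LLM."""
    qv = toy_embed(query)
    q_terms = set(tokenize(query))

    # Hybrid retrieval: vector similarity blended with keyword hits
    vec = [cosine(qv, toy_embed(c)) for c in chunks]
    kw = [len(q_terms & set(tokenize(c))) for c in chunks]
    kw_max = max(kw) or 1
    hybrid = [alpha * v + (1 - alpha) * k / kw_max for v, k in zip(vec, kw)]

    # Rerank the top candidates (a cross-encoder in a real pipeline;
    # plain keyword overlap stands in here)
    candidates = sorted(zip(hybrid, chunks), key=lambda p: p[0], reverse=True)[:rerank_k]
    reranked = sorted(candidates,
                      key=lambda p: len(q_terms & set(tokenize(p[1]))),
                      reverse=True)

    # Hand the top chunks to the LLM
    return [c for _, c in reranked[:top_n]]

docs = [
    "Ollama runs models locally.",
    "Hybrid search blends vector and keyword scores.",
    "GPUs accelerate inference.",
]
print(answer_context("how does hybrid search blend scores", docs, top_n=1))
```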

With Open WebUI

Open WebUI handles chunking and embedding automatically. To optimize:

  1. Go to Settings → Documents
  2. Select nomic-embed-text as the embedding model
  3. Adjust the chunk size (try 400-600 with 100 overlap)
  4. Increase top-K to 10-20 for better recall

With AnythingLLM

AnythingLLM gives you workspace-level control:

  1. Create separate workspaces for different document types
  2. Use the built-in embedding (based on Ollama)
  3. Adjust chunk settings in workspace configuration
  4. Use the citation feature to verify which chunks were used

Measuring RAG Quality

How do you know if your improvements actually helped?

Manual Evaluation

Create a test set of 20-30 questions with known correct answers from your documents. Run each question through your RAG pipeline and check:

  • Recall — did the retrieval find the right chunk?
  • Faithfulness — does the answer stick to the retrieved content?
  • Relevance — does the answer actually address the question?

Key Metrics to Track

| Metric | What It Measures | Good Target |
| --- | --- | --- |
| Retrieval recall @ 5 | Is the right chunk in top 5? | > 85% |
| Answer faithfulness | Answer matches source content | > 90% |
| Answer relevance | Answer addresses the question | > 85% |

If recall is below 85%, improve chunking or try hybrid search. If faithfulness is low, your LLM is hallucinating — try a larger model or more focused prompts. If relevance is low, improve your query processing or reranking.
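Retrieval recall is straightforward to compute once you log which chunk ids each question retrieved. The eval data below is made up for illustration:

```python
def recall_at_k(results, gold, k=5):
    """Fraction of questions whose gold chunk id appears in the top-k results."""
    hits = sum(1 for retrieved, g in zip(results, gold) if g in retrieved[:k])
    return hits / len(gold)

# Hypothetical eval set: retrieved chunk ids per question, plus the correct id
retrieved = [[3, 7, 1, 9, 4], [2, 5, 8, 0, 6], [4, 3, 1, 2, 9]]
gold = [7, 9, 1]

print(recall_at_k(retrieved, gold, k=5))  # 2 of 3 questions hit
```

Faithfulness and relevance are harder to score automatically; a spot-check against your 20-30 question test set is usually enough to catch regressions.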

Common Pitfalls

Chunks too small (under 200 tokens): You lose context. The model sees fragments without the surrounding explanation.

Chunks too large (over 1000 tokens): You dilute relevance. The retrieval matches broadly but misses specific answers buried in large chunks.

No overlap: Information at chunk boundaries gets lost. Always use 10-20% overlap.

Ignoring document structure: Headers, lists, and code blocks should inform your chunking strategy, not be treated as plain text.

One embedding model for everything: Different domains benefit from different embedding models. If you're working with technical or specialized content, test alternatives.

Related Guides

  • Local RAG Tutorial — Chat with Your Documents
  • Best Models for Coding, Chat, and RAG
  • Open WebUI vs AnythingLLM
  • Getting Started with Local AI
  • How to Install Ollama
