Local AI Fine-Tuning Guide — Customize Models with LoRA and Quantization
2026/04/22
Advanced · 12 min read


Learn how to fine-tune open-source LLMs on your own hardware using LoRA, and understand quantization formats like GGUF, AWQ, and GPTQ to optimize performance.

Pre-trained language models are powerful out of the box, but they are general-purpose by design. If you need a model that speaks your company's tone, understands your domain's jargon, or follows a specific output format, fine-tuning is the answer. This guide walks you through fine-tuning open-source LLMs locally using LoRA, and explains the quantization formats that make running these models practical on consumer hardware.

What Is Fine-Tuning?

Fine-tuning means taking a pre-trained model and training it further on a smaller, task-specific dataset. Instead of learning everything from scratch, the model adjusts its existing knowledge to specialize.

There are two main approaches:

| Approach | VRAM Needed | Training Time | Quality |
|---|---|---|---|
| Full fine-tuning | 40GB+ | Hours to days | Best |
| LoRA (Low-Rank Adaptation) | 8-24GB | Minutes to hours | Very good |

For local, consumer-hardware setups, LoRA is the practical choice. It achieves 90-95% of full fine-tuning quality at a fraction of the cost.

Understanding LoRA

LoRA (Low-Rank Adaptation) works by freezing the original model weights and training a small set of new, low-rank matrices alongside them. These "adapters" are tiny — typically 1-5% of the original model size — but they capture the adjustments needed for your specific task.

Why LoRA matters for local AI:

  • Low VRAM requirements — fine-tune a 7B model on a single consumer GPU
  • Fast training — most tasks finish in under an hour
  • Portable adapters — share your LoRA weights as a small file (50-500MB)
  • Composable — merge multiple LoRA adapters or switch between them at runtime
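The mechanics are easy to see in a few lines of NumPy. This is an illustrative sketch, not any library's implementation: the dimensions, rank, and scaling follow the LoRA paper's W' = W + (alpha/r)·B·A formulation, with B zero-initialized so the adapted model starts out identical to the base model.

```python
import numpy as np

d, k, r = 4096, 4096, 16          # weight matrix dims and LoRA rank (illustrative)
alpha = 32                        # lora_alpha scaling factor

rng = np.random.default_rng(0)
W = rng.standard_normal((d, k)).astype(np.float32)   # frozen base weight, never updated

# LoRA trains only A and B; B starts at zero, so before any training
# the adapted model behaves exactly like the base model.
A = rng.standard_normal((r, k)).astype(np.float32) * 0.01
B = np.zeros((d, r), dtype=np.float32)

def adapted_forward(x):
    # The low-rank update is added alongside W instead of modifying it
    return x @ (W + (alpha / r) * B @ A).T

trainable = A.size + B.size       # r*k + d*r parameters
frozen = W.size                   # d*k parameters
print(f"trainable fraction: {trainable / frozen:.2%}")
```

At r=16 on a 4096×4096 projection, the trainable fraction is well under 1%, which is where LoRA's VRAM savings come from.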

Key LoRA Parameters

| Parameter | What It Does | Typical Value |
|---|---|---|
| r (rank) | Controls adapter capacity | 8-64 |
| lora_alpha | Scaling factor | 16-32 |
| lora_dropout | Regularization | 0.05-0.1 |
| target_modules | Which layers to adapt | q_proj, v_proj, k_proj, o_proj |

Higher rank (r) means more capacity but larger adapters and more VRAM. Start with r=16 and increase if the model underfits.

Preparing Your Training Data

LoRA fine-tuning works best with well-formatted examples. The most common format is JSONL with instruction-response pairs.

Example: JSONL Format

{"instruction": "Classify this customer feedback", "input": "The delivery was fast but the packaging was damaged", "output": "Mixed | Delivery: Positive | Packaging: Negative"}
{"instruction": "Classify this customer feedback", "input": "Product works exactly as described, very happy", "output": "Positive | Product Quality: Positive | Overall: Positive"}
{"instruction": "Classify this customer feedback", "input": "Still waiting for my order after 2 weeks", "output": "Negative | Delivery: Negative | Timeliness: Negative"}

Data Preparation Tips

  1. 50-500 examples is enough for most LoRA tasks
  2. Be consistent — use the same format, tone, and structure across all examples
  3. Cover edge cases — include unusual inputs the model might encounter
  4. Quality over quantity — 100 well-written examples beat 1,000 sloppy ones
  5. Shuffle your data — randomize the order to prevent the model from learning sequence patterns
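A small validation script along these lines can catch formatting problems before you waste a training run. The field names match the JSONL example above; the file path and helper name are my own, not part of any tool:

```python
import json
import random

REQUIRED_KEYS = {"instruction", "input", "output"}

def load_and_shuffle(path, seed=42):
    """Validate each JSONL line and return a shuffled list of examples."""
    examples = []
    with open(path, encoding="utf-8") as f:
        for lineno, line in enumerate(f, start=1):
            line = line.strip()
            if not line:
                continue
            record = json.loads(line)              # raises on malformed JSON
            missing = REQUIRED_KEYS - record.keys()
            if missing:
                raise ValueError(f"line {lineno}: missing keys {sorted(missing)}")
            examples.append(record)
    random.Random(seed).shuffle(examples)          # tip 5: randomize the order
    return examples
```

A fixed seed keeps the shuffle reproducible across runs, which helps when comparing training configurations.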

Fine-Tuning with Unsloth

Unsloth is the fastest way to fine-tune LLMs locally. It supports Llama, Mistral, Qwen, and other popular architectures with 2x faster training and 70% less memory usage compared to standard HuggingFace trainers.

Installation

# Create a virtual environment
python -m venv finetune-env
source finetune-env/bin/activate

# Install Unsloth
pip install unsloth
pip install --upgrade torch transformers datasets trl

Fine-Tuning Script

from unsloth import FastLanguageModel
from trl import SFTTrainer
from transformers import TrainingArguments
from datasets import load_dataset

# Load the base model (Llama 3.1 8B)
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/llama-3.1-8b",
    max_seq_length=2048,
    load_in_4bit=True,  # Use 4-bit quantization to save VRAM
)

# Add LoRA adapters
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                     "gate_proj", "up_proj", "down_proj"],
)

# Load your dataset
dataset = load_dataset("json", data_files="training_data.jsonl", split="train")

# Format the dataset
def format_example(example):
    return {
        "text": f"### Instruction:\n{example['instruction']}\n\n"
                f"### Input:\n{example['input']}\n\n"
                f"### Response:\n{example['output']}"
    }

dataset = dataset.map(format_example)

# Set up training
trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    args=TrainingArguments(
        per_device_train_batch_size=4,
        gradient_accumulation_steps=4,
        warmup_steps=50,
        max_steps=500,
        learning_rate=2e-4,
        fp16=True,
        logging_steps=10,
        output_dir="./output",
        optim="adamw_8bit",
    ),
    dataset_text_field="text",
    max_seq_length=2048,
)

# Start training
trainer.train()

# Save the LoRA adapter
model.save_pretrained("./my-lora-adapter")
tokenizer.save_pretrained("./my-lora-adapter")

Hardware Requirements for Fine-Tuning

| Model Size | Minimum VRAM | Recommended VRAM | Training Time (500 examples) |
|---|---|---|---|
| 3B | 6 GB | 8 GB | ~10 minutes |
| 7-8B | 8 GB | 16 GB | ~30 minutes |
| 14B | 16 GB | 24 GB | ~1 hour |
| 32B | 32 GB | 48 GB | ~3 hours |

Tip: If you don't have enough VRAM, use cloud GPU instances on Runpod for the training step, then run the fine-tuned model locally.

Quantization: Running Models Efficiently

Quantization reduces the precision of model weights — from 16-bit floats to 4-bit or even 2-bit integers. This shrinks the model dramatically with minimal quality loss, making it possible to run large models on consumer hardware.
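The memory math is simple: model size ≈ parameter count × bits per weight ÷ 8. A quick sketch (parameter counts are nominal; real files add some overhead for embeddings and metadata, so treat these as lower bounds):

```python
def approx_size_gb(n_params: float, bits_per_weight: float) -> float:
    """Rough on-disk/in-memory size: params * bits / 8, in gigabytes."""
    return n_params * bits_per_weight / 8 / 1e9

params_8b = 8e9  # e.g. a Llama 3.1 8B class model

for label, bpw in [("FP16", 16), ("Q8_0", 8), ("Q4_K_M", 4.85)]:
    print(f"{label:7s} ~{approx_size_gb(params_8b, bpw):.1f} GB")
```

This is why an 8B model that needs ~16 GB in FP16 fits in under 5 GB at 4-bit, matching the Ollama download sizes shown later.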

Quantization Formats Compared

| Format | Created By | Key Strength | Best For | GPU Required |
|---|---|---|---|---|
| GGUF | Georgi Gerganov (llama.cpp) | CPU + GPU flexibility | General local use, Ollama | Optional |
| AWQ | MIT | GPU-optimized inference | NVIDIA/AMD GPU servers | Recommended |
| GPTQ | AutoGPTQ team | Mature GPU quantization | NVIDIA GPU inference | Required (NVIDIA) |
| EXL2 | ExLlamaV2 | Fastest GPU inference | High-throughput serving | Required |

GGUF — The Universal Format

GGUF is the most versatile quantization format. It works on CPU, GPU, or a mix of both. This is what Ollama uses under the hood.

Naming convention: GGUF files include the quantization level in the name:

| Quantization | Bits/Weight | Model Size (7B) | Quality Loss |
|---|---|---|---|
| Q2_K | ~2.7 | ~3 GB | Noticeable |
| Q3_K_M | ~3.4 | ~3.5 GB | Moderate |
| Q4_K_M | ~4.8 | ~4.5 GB | Minimal |
| Q5_K_M | ~5.7 | ~5 GB | Very small |
| Q6_K | ~6.6 | ~5.5 GB | Negligible |
| Q8_0 | 8 | ~7 GB | Almost none |

Recommendation: Q4_K_M is the best balance for most users. It preserves 98%+ of the model's quality while shrinking the file to roughly a quarter of its FP16 size.

Using Quantized Models with Ollama

Ollama automatically uses quantized GGUF models. You don't need to configure anything:

# These all use Q4_K_M quantization by default
ollama pull llama3.1          # 8B model, ~4.7 GB
ollama pull qwen2.5:14b       # 14B model, ~9 GB
ollama pull deepseek-r1:8b    # 8B model, ~5 GB

# Verify downloaded models
ollama list

Creating Custom GGUF Models

If you fine-tuned a model and want to convert it to GGUF:

# Clone and build llama.cpp (the quantize tool must be compiled first)
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build && cmake --build build --config Release

# Convert HuggingFace model to GGUF (FP16 first)
python convert_hf_to_gguf.py ./my-finetuned-model --outfile model-f16.gguf

# Quantize to Q4_K_M
./build/bin/llama-quantize model-f16.gguf model-q4km.gguf Q4_K_M

Importing Custom Models into Ollama

Create a Modelfile to bring your quantized model into Ollama:

# Create a Modelfile
cat > Modelfile << 'EOF'
FROM ./model-q4km.gguf

PARAMETER temperature 0.7
PARAMETER top_p 0.9

SYSTEM You are a helpful assistant specialized in customer feedback classification.
EOF

# Build the model in Ollama
ollama create my-custom-model -f Modelfile

# Run it
ollama run my-custom-model

AWQ — GPU-Optimized Quantization

AWQ (Activation-aware Weight Quantization) preserves important weights based on activation patterns, resulting in better quality at the same compression level.

Best for: Running on servers with NVIDIA or AMD GPUs where you want maximum inference speed.

# Install AutoAWQ
pip install autoawq

# Quantize a model with AWQ
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model = AutoAWQForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B")
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B")

quant_config = {
    "zero_point": True,
    "q_group_size": 128,
    "w_bit": 4,
}

model.quantize(tokenizer, quant_config=quant_config)
model.save_quantized("./llama-3.1-8b-awq")

AWQ models work with vLLM and other serving frameworks:

# Serve with vLLM
python -m vllm.entrypoints.openai.api_server \
  --model ./llama-3.1-8b-awq \
  --quantization awq

GPTQ — Mature GPU Quantization

GPTQ uses layer-wise calibration to compress models. It has been around longer than AWQ and has wide framework support.

# Install AutoGPTQ
pip install auto-gptq

# Quantize (Python)
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig

config = BaseQuantizeConfig(bits=4, group_size=128, desc_act=True)
model = AutoGPTQForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B", config)

# calibration_data: a small list of tokenized sample texts from your domain,
# which GPTQ uses to calibrate the layer-wise compression
model.quantize(calibration_data)
model.save_quantized("./llama-3.1-8b-gptq")

EXL2 — Maximum GPU Throughput

EXL2 is the quantization format for ExLlamaV2, designed for the fastest possible inference on NVIDIA GPUs. It supports mixed quantization (different bits per layer) for optimal quality-per-bit.

Best for: High-throughput serving where latency matters most.

# Clone and build ExLlamaV2
git clone https://github.com/turboderp/exllamav2
cd exllamav2
pip install .

# Convert to EXL2 (-i expects a local model directory, so download the weights first)
python convert.py \
  -i ./Llama-3.1-8B \
  -o ./llama-3.1-8b-exl2 \
  -b 4.0

Which Quantization Should You Use?

| Scenario | Format | Why |
|---|---|---|
| Running on Mac or laptop | GGUF | CPU+GPU flexibility, Ollama support |
| Running on desktop with NVIDIA GPU | GGUF or AWQ | GGUF for simplicity, AWQ for speed |
| Serving many users | EXL2 or AWQ | Maximum throughput |
| Fine-tuned model for personal use | GGUF (Q4_K_M) | Easy import into Ollama |
| Cloud GPU deployment | AWQ or EXL2 | Best GPU utilization |
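If you want this decision encoded in a provisioning script, the table reduces to a small lookup. The scenario keys here are my own labels, not any standard API:

```python
def pick_quant_format(scenario: str) -> str:
    """Map a deployment scenario to a quantization format, per the table above."""
    table = {
        "laptop_or_mac": "GGUF",
        "desktop_nvidia": "GGUF or AWQ",
        "multi_user_serving": "EXL2 or AWQ",
        "personal_finetune": "GGUF (Q4_K_M)",
        "cloud_gpu": "AWQ or EXL2",
    }
    return table[scenario]
```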

Putting It All Together: End-to-End Workflow

Here is the complete workflow from raw data to a running fine-tuned model:

  1. Prepare your dataset in JSONL format with 50-500 examples
  2. Fine-tune with Unsloth using LoRA on your GPU (or a cloud GPU)
  3. Merge LoRA adapters into the base model
  4. Convert to GGUF and quantize to Q4_K_M
  5. Import into Ollama with a Modelfile
  6. Run locally on any machine with enough RAM

# Step 3: Merge LoRA into base model
# (merge_lora.py is a placeholder for your merge script — PEFT's
#  merge_and_unload() and Unsloth's save_pretrained_merged() do this job;
#  check their docs for current usage)
python merge_lora.py --base-model unsloth/llama-3.1-8b \
  --lora-path ./my-lora-adapter \
  --output-dir ./merged-model

# Step 4: Convert and quantize (shown above)

# Step 5-6: Import and run (shown above)
ollama run my-custom-model

Common Pitfalls

Overfitting: If your model memorizes training examples but fails on new inputs, reduce epochs, increase dropout, or add more diverse training data.

Catastrophic forgetting: The model loses general knowledge after fine-tuning. Use a lower learning rate (e.g. step down from 2e-4 toward 5e-5) and fewer training steps. LoRA inherently mitigates this since the base weights stay frozen.

Poor quantization quality: If Q4_K_M degrades quality too much, try Q5_K_M or Q6_K. The extra GB is worth it for critical tasks.

Out of memory during training: Reduce per_device_train_batch_size to 1-2, increase gradient_accumulation_steps to 8, and ensure load_in_4bit=True.
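The key invariant when making that trade: effective batch size = per_device_train_batch_size × gradient_accumulation_steps, so the suggested change keeps the optimizer seeing the same amount of data per update while cutting peak activation memory:

```python
def effective_batch(per_device: int, grad_accum: int, n_gpus: int = 1) -> int:
    """Number of examples contributing to each optimizer step."""
    return per_device * grad_accum * n_gpus

# Original script settings vs the low-VRAM fallback: same effective batch of 16,
# but activations for only 2 examples (instead of 4) live in VRAM at once
assert effective_batch(4, 4) == effective_batch(2, 8) == 16
```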

Related Guides

  • Getting Started with Local AI
  • How to Install Ollama
  • Best Local AI Tools in 2026
  • Private AI Setup Guide
  • Models for 16GB RAM
