Local AI Fine-Tuning Guide — Customize Models with LoRA and Quantization
Learn how to fine-tune open-source LLMs on your own hardware using LoRA, and understand quantization formats like GGUF, AWQ, and GPTQ to optimize performance.
Pre-trained language models are powerful out of the box, but they are general-purpose by design. If you need a model that speaks your company's tone, understands your domain's jargon, or follows a specific output format, fine-tuning is the answer. This guide walks you through fine-tuning open-source LLMs locally using LoRA, and explains the quantization formats that make running these models practical on consumer hardware.
What Is Fine-Tuning?
Fine-tuning means taking a pre-trained model and training it further on a smaller, task-specific dataset. Instead of learning everything from scratch, the model adjusts its existing knowledge to specialize.
There are two main approaches:
| Approach | VRAM Needed | Training Time | Quality |
|---|---|---|---|
| Full fine-tuning | 40GB+ | Hours to days | Best |
| LoRA (Low-Rank Adaptation) | 8-24GB | Minutes to hours | Very good |
For local, consumer-hardware setups, LoRA is the practical choice. It achieves 90-95% of full fine-tuning quality at a fraction of the cost.
Understanding LoRA
LoRA (Low-Rank Adaptation) works by freezing the original model weights and training a small set of new, low-rank matrices alongside them. These "adapters" are tiny — typically 1-5% of the original model size — but they capture the adjustments needed for your specific task.
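The parameter savings are easy to verify with back-of-envelope arithmetic. The sketch below uses NumPy with illustrative numbers (a 4096x4096 projection, as in a 7B-class model, and rank 16); it only counts parameters, it does not train anything:

```python
import numpy as np

# Illustrative numbers: one 4096x4096 attention projection (7B-class model)
# with LoRA rank r = 16. The base weight W stays frozen; only A and B train.
d, r = 4096, 16
W = np.zeros((d, d))   # frozen base weight
B = np.zeros((d, r))   # trainable low-rank factor
A = np.zeros((r, d))   # trainable low-rank factor

# The effective weight at inference time is W + (lora_alpha / r) * (B @ A).
full_params = W.size           # 16,777,216
lora_params = A.size + B.size  # 131,072
print(f"trainable fraction per adapted layer: {lora_params / full_params:.2%}")
```

Under 1% of the layer's parameters are trainable, which is why adapter files stay in the tens-to-hundreds of megabytes.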
Why LoRA matters for local AI:
- Low VRAM requirements — fine-tune a 7B model on a single consumer GPU
- Fast training — most tasks finish in under an hour
- Portable adapters — share your LoRA weights as a small file (50-500MB)
- Composable — merge multiple LoRA adapters or switch between them at runtime
Key LoRA Parameters
| Parameter | What It Does | Typical Value |
|---|---|---|
| r (rank) | Controls adapter capacity | 8-64 |
| lora_alpha | Scaling factor | 16-32 |
| lora_dropout | Regularization | 0.05-0.1 |
| target_modules | Which layers to adapt | q_proj, v_proj, k_proj, o_proj |
Higher rank (r) means more capacity but larger adapters and more VRAM. Start with r=16 and increase if the model underfits.
Preparing Your Training Data
LoRA fine-tuning works best with well-formatted examples. The most common format is JSONL with instruction-response pairs.
Example: JSONL Format
{"instruction": "Classify this customer feedback", "input": "The delivery was fast but the packaging was damaged", "output": "Mixed | Delivery: Positive | Packaging: Negative"}
{"instruction": "Classify this customer feedback", "input": "Product works exactly as described, very happy", "output": "Positive | Product Quality: Positive | Overall: Positive"}
{"instruction": "Classify this customer feedback", "input": "Still waiting for my order after 2 weeks", "output": "Negative | Delivery: Negative | Timeliness: Negative"}Data Preparation Tips
- 50-500 examples is enough for most LoRA tasks
- Be consistent — use the same format, tone, and structure across all examples
- Cover edge cases — include unusual inputs the model might encounter
- Quality over quantity — 100 well-written examples beat 1,000 sloppy ones
- Shuffle your data — randomize the order to prevent the model from learning sequence patterns
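The format and shuffle tips can be enforced mechanically before training. This is a small standalone sketch (the filename and the three-key schema match the JSONL example above):

```python
import json
import random

REQUIRED_KEYS = {"instruction", "input", "output"}

def load_and_check(path):
    """Load a JSONL dataset, verify every record has the expected keys,
    and shuffle it so the trainer never sees a fixed order."""
    examples = []
    with open(path, encoding="utf-8") as f:
        for lineno, line in enumerate(f, start=1):
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            record = json.loads(line)  # raises on malformed JSON
            missing = REQUIRED_KEYS - record.keys()
            if missing:
                raise ValueError(f"line {lineno}: missing keys {sorted(missing)}")
            examples.append(record)
    random.shuffle(examples)
    return examples

# examples = load_and_check("training_data.jsonl")
```

Running this before every training run catches malformed lines early, which is much cheaper than discovering them mid-training.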
Fine-Tuning with Unsloth
Unsloth is the fastest way to fine-tune LLMs locally. It supports Llama, Mistral, Qwen, and other popular architectures with 2x faster training and 70% less memory usage compared to standard HuggingFace trainers.
Installation
# Create a virtual environment
python -m venv finetune-env
source finetune-env/bin/activate
# Install Unsloth
pip install unsloth
pip install --upgrade torch transformers datasets trl
Fine-Tuning Script
from unsloth import FastLanguageModel
from trl import SFTTrainer
from transformers import TrainingArguments
from datasets import load_dataset
# Load the base model (Llama 3.1 8B)
model, tokenizer = FastLanguageModel.from_pretrained(
model_name="unsloth/llama-3.1-8b",
max_seq_length=2048,
load_in_4bit=True, # Use 4-bit quantization to save VRAM
)
# Add LoRA adapters
model = FastLanguageModel.get_peft_model(
model,
r=16,
lora_alpha=32,
lora_dropout=0.05,
target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
"gate_proj", "up_proj", "down_proj"],
)
# Load your dataset
dataset = load_dataset("json", data_files="training_data.jsonl", split="train")
# Format the dataset
def format_example(example):
return {
"text": f"### Instruction:\n{example['instruction']}\n\n"
f"### Input:\n{example['input']}\n\n"
f"### Response:\n{example['output']}"
}
dataset = dataset.map(format_example)
# Set up training
trainer = SFTTrainer(
model=model,
tokenizer=tokenizer,
train_dataset=dataset,
args=TrainingArguments(
per_device_train_batch_size=4,
gradient_accumulation_steps=4,
warmup_steps=50,
max_steps=500,
learning_rate=2e-4,
fp16=True,
logging_steps=10,
output_dir="./output",
optim="adamw_8bit",
),
dataset_text_field="text",
max_seq_length=2048,
)
# Start training
trainer.train()
# Save the LoRA adapter
model.save_pretrained("./my-lora-adapter")
tokenizer.save_pretrained("./my-lora-adapter")
Hardware Requirements for Fine-Tuning
| Model Size | Minimum VRAM | Recommended VRAM | Training Time (500 examples) |
|---|---|---|---|
| 3B | 6 GB | 8 GB | ~10 minutes |
| 7-8B | 8 GB | 16 GB | ~30 minutes |
| 14B | 16 GB | 24 GB | ~1 hour |
| 32B | 32 GB | 48 GB | ~3 hours |
Tip: If you don't have enough VRAM, use cloud GPU instances on Runpod for the training step, then run the fine-tuned model locally.
Quantization: Running Models Efficiently
Quantization reduces the precision of model weights — from 16-bit floats to 4-bit or even 2-bit integers. This shrinks the model dramatically with minimal quality loss, making it possible to run large models on consumer hardware.
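The size savings follow directly from the bit width. A rough weights-only calculation (it ignores file metadata and runtime memory such as the KV cache, so real files run slightly larger):

```python
def weight_size_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the weights alone: parameters x bits -> gigabytes.
    Ignores file metadata and runtime memory such as the KV cache."""
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9

# FP16 (16 bits) down to Q2-class (~2.7 bits) for a 7B-parameter model
for bits in (16, 8, 4.8, 2.7):
    print(f"7B at {bits:>4} bits/weight ≈ {weight_size_gb(7, bits):.1f} GB")
```

A 7B model drops from about 14 GB at FP16 to roughly 4 GB at a 4-bit-class quantization, which is what puts it within reach of consumer GPUs.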
Quantization Formats Compared
| Format | Created By | Key Strength | Best For | GPU Required |
|---|---|---|---|---|
| GGUF | Georgi Gerganov (llama.cpp) | CPU + GPU flexibility | General local use, Ollama | Optional |
| AWQ | MIT | GPU-optimized inference | NVIDIA/AMD GPU servers | Recommended |
| GPTQ | AutoGPTQ team | Mature GPU quantization | NVIDIA GPU inference | Required (NVIDIA) |
| EXL2 | ExLlamaV2 | Fastest GPU inference | High-throughput serving | Required |
GGUF — The Universal Format
GGUF is the most versatile quantization format. It works on CPU, GPU, or a mix of both. This is what Ollama uses under the hood.
Naming convention: GGUF files include the quantization level in the name:
| Quantization | Bits/Weight | Model Size (7B) | Quality Loss |
|---|---|---|---|
| Q2_K | ~2.7 bit | ~3 GB | Noticeable |
| Q3_K_M | ~3.4 bit | ~3.5 GB | Moderate |
| Q4_K_M | ~4.8 bit | ~4.5 GB | Minimal |
| Q5_K_M | ~5.7 bit | ~5 GB | Very small |
| Q6_K | ~6.6 bit | ~5.5 GB | Negligible |
| Q8_0 | 8 bit | ~7 GB | Almost none |
Recommendation: Q4_K_M is the best balance for most users. Quality loss is minimal, and the file shrinks to roughly a third of its FP16 size.
Using Quantized Models with Ollama
Ollama automatically uses quantized GGUF models. You don't need to configure anything:
# These all use Q4_K_M quantization by default
ollama pull llama3.1 # 8B model, ~4.7 GB
ollama pull qwen2.5:14b # 14B model, ~9 GB
ollama pull deepseek-r1:8b # 8B model, ~5 GB
# Verify downloaded models
ollama list
Creating Custom GGUF Models
If you fine-tuned a model and want to convert it to GGUF:
# Clone llama.cpp
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
# Convert HuggingFace model to GGUF (FP16 first)
python convert_hf_to_gguf.py ./my-finetuned-model --outfile model-f16.gguf
# Quantize to Q4_K_M
./llama-quantize model-f16.gguf model-q4km.gguf Q4_K_M
Importing Custom Models into Ollama
Create a Modelfile to bring your quantized model into Ollama:
# Create a Modelfile
cat > Modelfile << 'EOF'
FROM ./model-q4km.gguf
PARAMETER temperature 0.7
PARAMETER top_p 0.9
SYSTEM You are a helpful assistant specialized in customer feedback classification.
EOF
# Build the model in Ollama
ollama create my-custom-model -f Modelfile
# Run it
ollama run my-custom-model
AWQ — GPU-Optimized Quantization
AWQ (Activation-aware Weight Quantization) preserves important weights based on activation patterns, resulting in better quality at the same compression level.
Best for: Running on servers with NVIDIA or AMD GPUs where you want maximum inference speed.
# Install AutoAWQ
pip install autoawq
# Quantize a model with AWQ
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer
model = AutoAWQForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B")
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B")
quant_config = {
"zero_point": True,
"q_group_size": 128,
"w_bit": 4,
}
model.quantize(tokenizer, quant_config=quant_config)
model.save_quantized("./llama-3.1-8b-awq")
AWQ models work with vLLM and other serving frameworks:
# Serve with vLLM
python -m vllm.entrypoints.openai.api_server \
--model ./llama-3.1-8b-awq \
--quantization awq
GPTQ — Mature GPU Quantization
GPTQ uses layer-wise calibration to compress models. It has been around longer than AWQ and has wide framework support.
# Install AutoGPTQ
pip install auto-gptq
# Quantize (Python)
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
config = BaseQuantizeConfig(bits=4, group_size=128, desc_act=True)
model = AutoGPTQForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B", config)
# calibration_data: a list of tokenized examples used to calibrate each layer
model.quantize(calibration_data)
model.save_quantized("./llama-3.1-8b-gptq")
EXL2 — Maximum GPU Throughput
EXL2 is the quantization format for ExLlamaV2, designed for the fastest possible inference on NVIDIA GPUs. It supports mixed quantization (different bits per layer) for optimal quality-per-bit.
Best for: High-throughput serving where latency matters most.
# Clone and build ExLlamaV2
git clone https://github.com/turboderp/exllamav2
cd exllamav2
pip install .
# Convert to EXL2 (-i expects a local directory containing the HF model files)
python convert.py \
    -i ./Llama-3.1-8B \
    -o ./llama-3.1-8b-exl2 \
    -b 4.0
Which Quantization Should You Use?
| Scenario | Format | Why |
|---|---|---|
| Running on Mac or laptop | GGUF | CPU+GPU flexibility, Ollama support |
| Running on desktop with NVIDIA GPU | GGUF or AWQ | GGUF for simplicity, AWQ for speed |
| Serving many users | EXL2 or AWQ | Maximum throughput |
| Fine-tuned model for personal use | GGUF (Q4_K_M) | Easy import into Ollama |
| Cloud GPU deployment | AWQ or EXL2 | Best GPU utilization |
Putting It All Together: End-to-End Workflow
Here is the complete workflow from raw data to a running fine-tuned model:
1. Prepare your dataset in JSONL format with 50-500 examples
2. Fine-tune with Unsloth using LoRA on your GPU (or a cloud GPU)
3. Merge LoRA adapters into the base model
4. Convert to GGUF and quantize to Q4_K_M
5. Import into Ollama with a Modelfile
6. Run locally on any machine with enough RAM
# Step 3: Merge LoRA into base model. With Unsloth this is one call:
#   model.save_pretrained_merged("./merged-model", tokenizer, save_method="merged_16bit")
# Alternatively, use a standalone merge script (merge_lora.py here is illustrative):
python merge_lora.py --base-model unsloth/llama-3.1-8b \
    --lora-path ./my-lora-adapter \
    --output-dir ./merged-model
# Step 4: Convert and quantize (shown above)
# Step 5-6: Import and run (shown above)
ollama run my-custom-model
Common Pitfalls
Overfitting: If your model memorizes training examples but fails on new inputs, reduce epochs, increase dropout, or add more diverse training data.
Catastrophic forgetting: The model loses general knowledge after fine-tuning. Use a lower learning rate (try 5e-5 to 1e-4 instead of 2e-4) and fewer steps. LoRA inherently mitigates this because the base weights stay frozen.
Poor quantization quality: If Q4_K_M degrades quality too much, try Q5_K_M or Q6_K. The extra GB is worth it for critical tasks.
Out of memory during training: Reduce per_device_train_batch_size to 1-2, increase gradient_accumulation_steps to 8, and ensure load_in_4bit=True.
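The out-of-memory fix can be expressed as overrides for the TrainingArguments used in the fine-tuning script above. A sketch of the suggested settings (gradient_checkpointing is an extra lever not shown earlier; the exact values are starting points, not hard rules):

```python
# Overrides for the TrainingArguments in the fine-tuning script.
# Effective batch size stays at 16 (1 micro-batch x 16 accumulation steps),
# but peak VRAM drops because fewer activations are held at once.
low_vram_overrides = dict(
    per_device_train_batch_size=1,
    gradient_accumulation_steps=16,
    gradient_checkpointing=True,   # recompute activations in the backward pass
    optim="adamw_8bit",            # 8-bit optimizer states
)

# Usage: TrainingArguments(..., **low_vram_overrides)
```

Gradient accumulation trades wall-clock time for memory: the optimizer steps less often, but each micro-batch fits comfortably in VRAM.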