Fine-Tune Local LLMs 2026 | Practical Guide

The ability to fine-tune local LLMs has become a realistic option for individual developers and small teams in 2026, driven by reduced VRAM requirements, maturing toolchains, and a growing catalog of permissively licensed base models. Where once only well-funded labs could adapt large language models to domain-specific tasks, QLoRA improvements and unified frameworks like Unsloth now enable fine-tuning of 8B-parameter models on a single 12 GB consumer GPU. This guide walks through the entire workflow: deciding whether fine-tuning is the right approach, preparing datasets, configuring and running training, evaluating results, and exporting models for local inference.

How to Fine-Tune Local LLMs

Evaluate whether fine-tuning is needed over prompt engineering or RAG using a decision framework based on cost, latency, and data privacy.
Prepare your hardware and software environment with Python, PyTorch, Unsloth, and the Hugging Face ecosystem.
Curate a high-quality dataset of 500–10,000 examples in ChatML, ShareGPT, or Alpaca format with deduplication and length filtering.
Select a permissively licensed base model in the 7B–8B parameter range (Llama 3.1 8B, Mistral 7B, or Qwen 2.5 7B).
Configure QLoRA training with appropriate rank, learning rate, and target modules using Unsloth and SFTTrainer.
Monitor training and validation loss to detect overfitting or divergence, stopping early if validation loss climbs.
Merge LoRA adapters into the base model and export to GGUF format for local inference with llama.cpp or Ollama.
Evaluate the fine-tuned model with quantitative metrics and side-by-side qualitative comparison against the base model.

When to Fine-Tune vs Prompt Engineering vs RAG
Prerequisites and Hardware Requirements
Preparing Your Fine-Tuning Dataset
Understanding LoRA, QLoRA, and Full Fine-Tuning
Fine-Tuning with Unsloth and Hugging Face (Step-by-Step)
Evaluating Your Fine-Tuned Model
Troubleshooting and Best Practices
What Comes Next

When to Fine-Tune vs Prompt Engineering vs RAG

Decision Framework: Choosing the Right Approach

Before committing to a fine-tuning workflow, it is worth systematically comparing the three dominant strategies for customizing LLM behavior: prompt engineering, retrieval-augmented generation (RAG), and fine-tuning. Each carries distinct trade-offs across cost, latency, privacy, accuracy, and maintenance overhead.

Dimension	Prompt Engineering	RAG	Fine-Tuning
Upfront Cost	Near zero	Moderate (embedding pipeline, vector store)	High (compute, dataset curation)
Inference Latency	Low	Higher (retrieval + generation)	Low (no retrieval step)
Data Privacy	Depends on API provider	Data stays local if self-hosted	Data stays fully local
Accuracy Ceiling	Limited by context window and base model knowledge	High for factual recall; limited by retrieval quality	Highest for behavioral and stylistic adaptation
Maintenance Burden	Low (update prompts)	Moderate (keep index current)	Higher (retrain on new data)
Best Use Cases	Prototyping, general tasks, few-shot scenarios	Knowledge-intensive Q&A, document search	Domain jargon, strict output formatting, offline deployment

Use this table as a quick reference when choosing between approaches. The key insight is that these approaches are not mutually exclusive. RAG and fine-tuning combine well: a fine-tuned model that also retrieves from a knowledge base often outperforms either technique in isolation.

Signs You Actually Need Fine-Tuning

Fine-tuning is the right tool when:

The problem demands consistent structural output formatting that prompt engineering cannot reliably enforce.
The model must internalize domain-specific terminology (legal, medical, proprietary codebases).
Behavioral alignment requires the model to adopt a specific persona or tone across all interactions.
Latency-sensitive inference rules out the retrieval step that RAG introduces.
Air-gapped and offline deployments leave no external API or vector store available.

Fine-tuning is overkill, however, when a well-crafted system prompt achieves the desired behavior, when the knowledge gap can be filled by injecting context at inference time, or when the dataset contains fewer than 200 to 300 high-quality examples, which is too few to meaningfully shift model behavior. In those cases, the compute cost and iteration cycle of fine-tuning are hard to justify.
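
The criteria above can be written down as a small checklist function. This is only a sketch: the Project fields, the recommend function, and the 300-example cutoff are hypothetical names and values chosen to mirror the lists in this section, not part of any library.

```python
from dataclasses import dataclass


@dataclass
class Project:
    """Hypothetical description of a customization project."""
    prompt_already_works: bool       # a system prompt achieves the behavior
    gap_is_missing_knowledge: bool   # injecting context at inference would fix it
    num_examples: int                # high-quality training examples available
    needs_strict_format: bool = False
    needs_domain_terms: bool = False
    needs_fixed_persona: bool = False
    latency_sensitive: bool = False
    offline_only: bool = False


def recommend(p: Project, min_examples: int = 300) -> str:
    """Return 'prompt engineering', 'RAG', or 'fine-tuning'."""
    if p.prompt_already_works:
        return "prompt engineering"
    # RAG fills knowledge gaps, unless retrieval is ruled out by latency or isolation
    if p.gap_is_missing_knowledge and not (p.latency_sensitive or p.offline_only):
        return "RAG"
    signs = [
        p.needs_strict_format, p.needs_domain_terms, p.needs_fixed_persona,
        p.latency_sensitive, p.offline_only,
    ]
    if any(signs) and p.num_examples >= min_examples:
        return "fine-tuning"
    # Too little data or no clear sign: keep iterating on prompts for now
    return "prompt engineering"


print(recommend(Project(False, False, 2000, needs_strict_format=True)))
```

Treat the output as a starting point for discussion, since real projects often combine approaches.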

Prerequisites and Hardware Requirements

Minimum Hardware for Each Fine-Tuning Method

Hardware requirements vary dramatically depending on the fine-tuning method and model size:

Full fine-tuning of a 7B-parameter model requires 48 GB or more of VRAM. This means an NVIDIA A6000 or a multi-GPU setup. For models above 13B parameters, multi-node training or 80 GB A100/H100 cards become necessary.

LoRA (Low-Rank Adaptation) reduces trainable parameters substantially, bringing VRAM requirements down to 16 to 24 GB. An RTX 4090 (24 GB) or RTX 5090 handles 7B models comfortably with LoRA.

QLoRA pushes requirements further down to 8 to 12 GB by quantizing the base model to 4-bit precision and training only the low-rank adapters in higher precision. An RTX 4070 Ti (12 GB) or equivalent consumer card is viable for 7B-8B models.

For those without local GPU access, cloud instances on RunPod, Lambda, or Vast.ai with A100 or H100 GPUs eliminate the need to purchase dedicated hardware. Pricing varies by provider and GPU type; check current rates before provisioning.
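
To check where a given machine falls, the VRAM thresholds from this section can be encoded in a few lines. The sketch below is an illustration under assumptions: suggest_method and detect_vram_gb are hypothetical helper names, the cutoffs are simply the lower bounds quoted above for 7B-class models, and PyTorch is treated as optional.

```python
def suggest_method(vram_gb: float) -> str:
    """Map available VRAM (GB) to the methods described for 7B-class models."""
    if vram_gb >= 48:
        return "full fine-tuning"
    if vram_gb >= 16:
        return "LoRA"
    if vram_gb >= 8:
        return "QLoRA"
    return "cloud GPU"


def detect_vram_gb():
    """Return VRAM of GPU 0 rounded to whole GB, or None if unavailable."""
    try:
        import torch
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    total_bytes = torch.cuda.get_device_properties(0).total_memory
    # Cards report slightly less than their nominal size, so round to nearest GB
    return round(total_bytes / 1024**3)


vram = detect_vram_gb()
if vram is None:
    print("No CUDA GPU detected; consider a cloud instance.")
else:
    print(f"{vram} GB detected: {suggest_method(vram)}")
```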

Software Stack Setup

The 2026 fine-tuning stack centers on Python 3.11+, PyTorch 2.5+, CUDA 12.x, and the Hugging Face ecosystem (transformers, datasets, peft, trl). Unsloth provides optimized training kernels that reduce memory usage and increase throughput. Pinning dependency versions is critical for reproducibility.

# Code Example 1: Environment setup with pinned versions
conda create -n finetune python=3.11 -y
conda activate finetune

# Verify your CUDA driver version with nvidia-smi and select the matching
# index URL from https://pytorch.org/get-started/locally
pip install torch==2.5.1 --index-url https://download.pytorch.org/whl/cu124
pip install "unsloth>=2025.3,<2026.0"  # Check https://pypi.org/project/unsloth for latest stable version
pip install transformers==4.48.0 datasets==3.2.0 peft==0.14.0 trl==0.14.0
pip install bitsandbytes==0.45.0
pip install wandb tensorboard
pip install sentencepiece protobuf

# Verify CUDA availability
python -c "import torch; print(f'CUDA available: {torch.cuda.is_available()}, Device: {torch.cuda.get_device_name(0)}')"

This environment provides a stable foundation. The unsloth package patches Hugging Face internals for faster training with lower memory consumption, while bitsandbytes enables 4-bit quantization for QLoRA workflows and is required by the paged_adamw_8bit optimizer used during training.

Preparing Your Fine-Tuning Dataset

Dataset Formats and Standards

Three dataset formats dominate the 2026 fine-tuning ecosystem:

ChatML uses structured <|im_start|> and <|im_end|> tokens with explicit role labels (system, user, assistant). This is the native format for most chat-oriented models and the preferred choice for instruction-following and conversational fine-tuning.
ShareGPT stores conversations as a list of turns with from and value fields. It maps cleanly to multi-turn dialogue and is widely used for community datasets on Hugging Face.
Alpaca uses just three fields: instruction, input, and output. It is the simplest option for single-turn instruction-response tasks and classification workflows.
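
To make the three formats concrete, here is the same example expressed in each. This is a minimal sketch: the helper names are hypothetical, the "conversations" key and the "human"/"gpt" role labels follow a common ShareGPT convention rather than a formal standard, and in a real pipeline the ChatML text is normally produced by the tokenizer's chat template instead of by hand.

```python
import json


def to_alpaca(instruction, response, context=""):
    """Alpaca: three flat fields; 'input' carries optional extra context."""
    return {"instruction": instruction, "input": context, "output": response}


def to_sharegpt(instruction, response, system=None):
    """ShareGPT: a list of turns with 'from' and 'value' fields."""
    turns = []
    if system:
        turns.append({"from": "system", "value": system})
    turns.append({"from": "human", "value": instruction})
    turns.append({"from": "gpt", "value": response})
    return {"conversations": turns}


def to_chatml_text(instruction, response, system="You are a helpful domain expert."):
    """ChatML: role-labelled segments wrapped in <|im_start|> / <|im_end|>."""
    parts = [("system", system), ("user", instruction), ("assistant", response)]
    return "".join(
        f"<|im_start|>{role}\n{content}<|im_end|>\n" for role, content in parts
    )


question = "What does LoRA stand for?"
answer = "Low-Rank Adaptation."
print(json.dumps(to_alpaca(question, answer), indent=2))
print(json.dumps(to_sharegpt(question, answer), indent=2))
print(to_chatml_text(question, answer))
```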

For dataset size, the minimum viable dataset depends heavily on the task. Simple classification or formatting tasks can show measurable improvement with 500 to 1,000 examples. Complex instruction-following or adapting a model to a new domain requires 3,000 to 10,000 high-quality examples. Beyond 10,000 examples, returns diminish unless the domain is exceptionally broad.

Data Cleaning and Validation

You rarely receive raw data in the right format. The following script converts CSV or JSON records into ChatML format, removes duplicates, filters by token length, and creates a train/validation split:

# Code Example 2: Dataset preparation — CSV/JSON to ChatML conversion
import json
import hashlib
import random
import logging
from pathlib import Path

logger = logging.getLogger(__name__)


def load_raw_data(filepath):
    """Load from CSV or JSON."""
    path = Path(filepath)
    if path.suffix == ".json":
        with open(path, encoding="utf-8") as f:
            return json.load(f)
    elif path.suffix == ".csv":
        import csv
        with open(path, encoding="utf-8-sig") as f:  # utf-8-sig handles BOM
            reader = csv.DictReader(f)
            return list(reader)
    raise ValueError(f"Unsupported format: {path.suffix}")


def to_chatml(record, system_prompt="You are a helpful domain expert."):
    """Convert a record with 'instruction' and 'response' fields to ChatML."""
    return {
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": record["instruction"].strip()},
            {"role": "assistant", "content": record["response"].strip()},
        ]
    }


def deduplicate(records):
    """Remove exact duplicates based on instruction + response SHA-256 hash."""
    seen = set()
    unique = []
    for r in records:
        if "instruction" not in r or "response" not in r:
            logger.warning("Record missing required fields; skipping: %s", r)
            continue
        # The separator keeps "ab" + "c" and "a" + "bc" from hashing identically
        key = hashlib.sha256(
            f'{r["instruction"]}\x1f{r["response"]}'.encode("utf-8")
        ).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(r)
    return unique


def filter_by_length(records, min_chars=50, max_chars=4000):
    """Remove records that are too short or too long."""
    valid = []
    for r in records:
        resp = r.get("response", "")
        if not isinstance(resp, str):
            logger.warning("Non-string response field; skipping: %s", r)
            continue
        if min_chars <= len(resp) <= max_chars:
            valid.append(r)
    return valid


def prepare_dataset(input_path, output_dir, val_ratio=0.1, system_prompt="You are a helpful domain expert."):
    random.seed(42)  # Set seed first for full reproducibility
    raw = load_raw_data(input_path)
    logger.info("Loaded %d raw records", len(raw))

    raw = deduplicate(raw)
    logger.info("After dedup: %d", len(raw))

    raw = filter_by_length(raw)
    logger.info("After length filter: %d", len(raw))

    random.shuffle(raw)
    split_idx = int(len(raw) * (1 - val_ratio))
    train_data = [to_chatml(r, system_prompt) for r in raw[:split_idx]]
    val_data = [to_chatml(r, system_prompt) for r in raw[split_idx:]]

    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    with open(out / "train.json", "w", encoding="utf-8") as f:
        json.dump(train_data, f)
    with open(out / "val.json", "w", encoding="utf-8") as f:
        json.dump(val_data, f)
    logger.info("Train: %d, Val: %d", len(train_data), len(val_data))


# Usage
logging.basicConfig(level=logging.INFO)  # Without this, the logger.info counts are not displayed
prepare_dataset("raw_data.json", "./dataset", val_ratio=0.1)

Common Dataset Pitfalls

Overfitting on small datasets is the most frequent failure mode. With fewer than 500 examples, the model memorizes training samples rather than generalizing. Label leakage, where information from the expected output bleeds into the input field, inflates evaluation metrics while producing useless models. Inconsistent formatting across examples (mixing Markdown and plain text, varying response structures) forces the model to learn formatting noise rather than the target behavior.
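
Some of these pitfalls can be screened for automatically before training. The sketch below is a rough heuristic, not a complete audit: the audit function, the Markdown regex, and the 500-example cutoff are assumptions made for illustration, and it expects records with 'instruction' and 'response' string fields as in the preparation script above.

```python
import re

# Crude signal for Markdown formatting: headings, bold markers, or bullet lines
MARKDOWN_HINT = re.compile(r"(^#{1,6} |\*\*|^[-*] )", re.MULTILINE)


def audit(records, min_examples=500):
    """Flag small datasets, verbatim label leakage, and mixed formatting."""
    leak_indices = []
    markdown_count = 0
    for i, r in enumerate(records):
        response = r["response"].strip()
        # Leakage heuristic: the full expected output already appears in the input
        if response and response.lower() in r["instruction"].lower():
            leak_indices.append(i)
        if MARKDOWN_HINT.search(response):
            markdown_count += 1
    return {
        "too_small": len(records) < min_examples,
        "leak_indices": leak_indices,
        # A share near 0.0 or 1.0 suggests consistent formatting; mid-range suggests a mix
        "markdown_share": markdown_count / len(records) if records else 0.0,
    }


sample = [
    {"instruction": "Classify: great movie. Answer: positive", "response": "positive"},
    {"instruction": "Summarize the memo.", "response": "**Summary**: budget approved."},
]
print(audit(sample))
```

Substring matching only catches the most blatant leakage, so a manual review of flagged and unflagged samples is still worthwhile.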

Understanding LoRA, QLoRA, and Full Fine-Tuning

Full Fine-Tuning

Full fine-tuning updates every parameter in the model. It adapts the model more thoroughly than any other method but demands proportionally large VRAM. Weight storage alone costs 4 bytes per parameter for fp32, or 2 bytes per parameter for bf16; add optimizer states (8 bytes per parameter for AdamW) and gradient buffers for total VRAM estimation. For a 7B model in bf16, this means 14 GB just for model weights, and standard AdamW states add another 56 GB before gradients and activations are counted. The 48 GB floor quoted earlier is therefore reachable only with memory-saving measures such as 8-bit optimizer states; without them, plan for considerably more. Full fine-tuning also carries the highest risk of catastrophic forgetting, where the model loses general capabilities while specializing. Full fine-tuning justifies its cost only when the domain shift is extreme and the dataset is large enough (tens of thousands of examples) to warrant full parameter updates.
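
The per-parameter costs above translate into a quick back-of-the-envelope estimate. The sketch below is illustrative only: full_finetune_gb is a hypothetical helper, gradients are assumed to be stored in bf16 (2 bytes per parameter), and activation memory is ignored, so real usage will be higher.

```python
def full_finetune_gb(params_billion, weight_bytes=2, grad_bytes=2, optimizer_bytes=8):
    """Estimate memory in GB; billions of parameters times bytes per parameter."""
    return {
        "weights": params_billion * weight_bytes,
        "gradients": params_billion * grad_bytes,
        "optimizer": params_billion * optimizer_bytes,
        "total": params_billion * (weight_bytes + grad_bytes + optimizer_bytes),
    }


standard = full_finetune_gb(7)                      # bf16 weights, standard AdamW
print(standard)                                     # total: 84 GB before activations

eight_bit = full_finetune_gb(7, optimizer_bytes=2)  # assumed 8-bit optimizer states
print(eight_bit)                                    # total: 42 GB before activations
```

Under these assumptions a 48 GB card only becomes plausible once the optimizer states are compressed, which is why LoRA-style methods are the usual route on smaller GPUs.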

LoRA (Low-Rank Adaptation)

LoRA freezes the base model weights and injects small trainable matrices into specific layers, typically the attention projection matrices (q_proj, k_proj, v_proj, o_proj). These matrices use rank decomposition: instead of learning the full weight update ΔW directly, LoRA approximates it as the product B·A, where A has shape (r × d_in) and B has shape (d_out × r), with rank r ≪ min(d_in, d_out). This reduces trainable parameters by orders of magnitude. Key hyperparameters include rank (commonly 8 to 64; higher rank captures more complex shifts but increases memory), alpha (a scaling factor commonly set equal to rank or 2× rank; the ratio alpha/rank acts as an effective learning rate scalar), and the choice of target modules.
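
The savings from the decomposition are easy to verify with arithmetic. The sketch below counts parameters for a single projection matrix; the 4096 × 4096 shape is an assumed example size and the helper names are hypothetical.

```python
def full_params(d_in, d_out):
    """Entries in the full weight update for one d_out x d_in matrix."""
    return d_in * d_out


def lora_params(d_in, d_out, r):
    """Trainable entries in A (r x d_in) plus B (d_out x r)."""
    return r * d_in + d_out * r


d_in = d_out = 4096   # assumed projection size for illustration
r, alpha = 16, 32     # same rank and alpha as commonly used defaults

full = full_params(d_in, d_out)
lora = lora_params(d_in, d_out, r)
print(f"full update: {full:,} parameters")          # 16,777,216
print(f"LoRA update: {lora:,} parameters")          # 131,072
print(f"reduction:   {full // lora}x")              # 128x
print(f"scaling applied to B·A: {alpha / r}")       # 2.0
```

Doubling the rank doubles the adapter size, which is why rank is the main lever for trading capacity against memory.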

QLoRA and 2026 Improvements

QLoRA combines LoRA with 4-bit NormalFloat (NF4) quantization of the base model, paged optimizers that pre-allocate CPU RAM pages so the system swaps optimizer states to CPU when GPU memory fills up, and double quantization (quantizing the quantization constants themselves). The base model loads in 4-bit precision, while LoRA adapters and computations remain in bf16 or fp16. This architecture makes it possible to fine-tune an 8B model in under 10 GB of VRAM with sequence length ≤512, batch size 1, and gradient checkpointing enabled. At longer sequence lengths or larger batch sizes, VRAM usage increases proportionally.
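
The effect of double quantization can be estimated from block sizes alone. The sketch below assumes the defaults described in the QLoRA paper (blocks of 64 weights with one 32-bit constant each, then 8-bit constants grouped in blocks of 256); the function names are hypothetical, and layers that stay in 16-bit, such as embeddings, are ignored.

```python
def constant_overhead_bits(block=64, double_quant=False, second_block=256):
    """Bits per weight spent on quantization constants."""
    if not double_quant:
        return 32 / block                         # one fp32 constant per block
    # 8-bit constants per block, plus fp32 constants for each group of constants
    return 8 / block + 32 / (block * second_block)


def nf4_weights_gb(params_billion, double_quant=True):
    """Approximate size of 4-bit weights plus their quantization constants."""
    bits_per_weight = 4 + constant_overhead_bits(double_quant=double_quant)
    return params_billion * bits_per_weight / 8


print(constant_overhead_bits())                   # 0.5 bits per weight
print(constant_overhead_bits(double_quant=True))  # about 0.127 bits per weight
print(nf4_weights_gb(8, double_quant=False))      # 4.5 GB
print(nf4_weights_gb(8, double_quant=True))       # about 4.13 GB
```

The saving is a few hundred megabytes for an 8B model, which matters when the whole run has to fit in 10 to 12 GB.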

Since 2024, the QLoRA ecosystem has matured considerably. Broader model architecture support means QLoRA works out of the box with Llama 3, Mistral, Qwen 2.5, Phi-3, and Gemma 2 families. Unsloth's kernel optimizations further reduce memory overhead by fusing operations and cutting intermediate activation storage.

Fine-Tuning with Unsloth and Hugging Face (Step-by-Step)

Choosing a Base Model

Model selection in 2026 should balance capability, VRAM fit, and licensing:

The 7B-8B tier is the sweet spot for consumer hardware: Llama 3.1 8B, Mistral 7B v0.3, and Qwen 2.5 7B all fit comfortably with QLoRA on 12 GB GPUs. Llama 3.1 8B uses the Llama community license (permissive for most commercial use; verify current terms at ai.meta.com/llama/license before deployment). Mistral 7B and Qwen 2.5 7B are both released under Apache 2.0.

Larger models (13B variants at 16 GB+ with QLoRA, 70B models requiring multi-GPU or 80 GB cards even with QLoRA) are options if you have the hardware, but for most local fine-tuning scenarios the 7B-8B tier with QLoRA provides the best balance of capability and accessibility.

Note: Llama 3.1 8B is a gated model on Hugging Face. You must accept Meta's license at huggingface.co/meta-llama/Meta-Llama-3.1-8B and authenticate with huggingface-cli login before downloading.

Configuring the Training Run

The following script demonstrates a complete QLoRA training run using Unsloth with an SFTTrainer from the trl library:

# Code Example 3: Complete QLoRA training script with Unsloth
from unsloth import FastLanguageModel
from trl import SFTTrainer
from transformers import TrainingArguments
from datasets import load_dataset
import torch

# Model configuration
max_seq_length = 2048
model_name = "unsloth/Meta-Llama-3.1-8B-Instruct-bnb-4bit"

# Load model with 4-bit quantization via Unsloth
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name=model_name,
    max_seq_length=max_seq_length,
    dtype=None,  # Auto-detect: bf16 on Ampere+, fp16 otherwise
    load_in_4bit=True,
)

# Apply LoRA adapters
model = FastLanguageModel.get_peft_model(
    model,
    r=16,                          # Rank: 8-64, higher = more capacity
    lora_alpha=32,                 # Scaling factor, commonly 2x rank
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    lora_dropout=0.05,
    bias="none",
    use_gradient_checkpointing="unsloth",  # Unsloth optimized checkpointing
)

# Load dataset (ChatML format)
dataset = load_dataset("json", data_files={
    "train": "./dataset/train.json",
    "validation": "./dataset/val.json",
})

# Format conversations for the tokenizer and remove source column
def format_chat(example):
    return {"text": tokenizer.apply_chat_template(
        example["messages"], tokenize=False, add_generation_prompt=False
    )}

dataset = dataset.map(format_chat, remove_columns=["messages"])

# Training arguments
training_args = TrainingArguments(
    output_dir="./output",
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,       # Effective batch size = 16
    warmup_steps=50,
    num_train_epochs=3,
    learning_rate=2e-4,
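    bf16=torch.cuda.is_bf16_supported(),      # Match the auto-detected dtype above
    fp16=not torch.cuda.is_bf16_supported(),  # Fall back to fp16 on GPUs without bf16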
    logging_steps=10,
    save_strategy="steps",
    save_steps=100,
    save_total_limit=3,                  # Prevents disk exhaustion; keeps 3 most recent checkpoints
    eval_strategy="steps",
    eval_steps=100,
    load_best_model_at_end=True,         # Save best checkpoint, not last
    metric_for_best_model="eval_loss",
    optim="paged_adamw_8bit",
    lr_scheduler_type="cosine",
    seed=42,
    report_to=[],                        # Explicit empty list; no external reporting. Change to ["wandb"] only for non-sensitive data with WANDB_API_KEY set.
)

# Initialize trainer
trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset["train"],
    eval_dataset=dataset["validation"],
    args=training_args,
    dataset_text_field="text",
    max_seq_length=max_seq_length,
    packing=True,                        # Pack short examples for efficiency
)

# Execute training; must follow all configuration above
trainer.train()

Privacy note: If you switch report_to to ["wandb"], training metadata and loss curves will be transmitted to Weights & Biases external servers. Use ["tensorboard"] for air-gapped or proprietary workflows.

Annotated configuration templates for common scenarios (instruction tuning, multi-turn chat, text classification) are provided as companion config files. Key parameters to adjust per use case: rank (8 for simple formatting tasks, 32 to 64 for complex domain shifts), learning rate (1e-4 to 3e-4 for QLoRA), and epochs (1 to 3 for large datasets, 3 to 5 for smaller ones).
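
Before launching, it helps to know roughly how many optimizer steps a configuration produces, since warmup_steps, eval_steps, and save_steps are all expressed in steps. The sketch below is an approximation with a hypothetical helper name: the trainer's own rounding may differ slightly, and with packing enabled the number of training sequences is smaller than the number of raw examples.

```python
import math


def schedule(num_examples, per_device_batch=4, grad_accum=4, epochs=3, num_gpus=1):
    """Approximate optimizer-step counts for a training configuration."""
    effective_batch = per_device_batch * grad_accum * num_gpus
    steps_per_epoch = math.ceil(num_examples / effective_batch)
    return {
        "effective_batch": effective_batch,
        "steps_per_epoch": steps_per_epoch,
        "total_steps": steps_per_epoch * epochs,
    }


# Assumed example: 4,500 training records (5,000 minus a 10% validation split)
plan = schedule(4500)
print(plan)
print(f"warmup share with 50 warmup steps: {50 / plan['total_steps']:.1%}")
print(f"evaluations with eval_steps=100:   {plan['total_steps'] // 100}")
```

If the total comes out below a few hundred steps, consider lowering eval_steps so that validation loss is sampled often enough to spot overfitting.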

Launching Training and Monitoring

# Code Example 4: Optional WandB logging setup and checkpoint saving
# Requires WANDB_API_KEY env var. For offline/private use: wandb.init(..., mode="offline")
import os
import wandb


# Only initialize WandB if explicitly requested and key is available
def init_wandb_if_configured(project: str, run_name: str) -> bool:
    """Returns True if WandB was initialized, False otherwise."""
    api_key = os.environ.get("WANDB_API_KEY")
    if not api_key:
        print("WANDB_API_KEY not set; skipping WandB logging.")
        return False
    wandb.init(project=project, name=run_name)
    return True


# Call before trainer.train() in Example 3's session, not standalone
init_wandb_if_configured("llm-finetune", "llama3.1-8b-qlora-domain")

# trainer must be defined in the same session (from Example 3)
# Save final checkpoint
trainer.save_model("./output/final_checkpoint")
tokenizer.save_pretrained("./output/final_checkpoint")

print("Training complete. Final checkpoint saved.")

Expected training times vary by hardware. On an RTX 4090 (24 GB) with a 5,000-example dataset and the configuration above, expect 1 to 2 hours for 3 epochs at sequence length 2048. An RTX 4070 Ti (12 GB) will take longer due to smaller batch sizes and more gradient accumulation steps, typically 2 to 4 hours for the same dataset. Monitoring training loss and validation loss via WandB or TensorBoard is essential for detecting overfitting early. A divergence where training loss continues to drop while validation loss increases signals the need to stop training or reduce epochs.
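
The stopping rule described above can be stated precisely. The sketch below is a plain-Python illustration with a hypothetical function name and an assumed patience of three evaluations; the transformers library also ships an EarlyStoppingCallback that serves a similar purpose inside the trainer, though its exact semantics may differ from this sketch.

```python
def should_stop(eval_losses, patience=3, min_delta=0.0):
    """Return True when the last `patience` evaluations all failed to beat
    the best validation loss seen before them."""
    if len(eval_losses) <= patience:
        return False  # Not enough history to judge
    best_before = min(eval_losses[:-patience])
    recent = eval_losses[-patience:]
    return all(loss > best_before - min_delta for loss in recent)


# Made-up loss curves for illustration only
climbing = [1.20, 1.00, 0.90, 0.95, 1.00, 1.10]
improving = [1.20, 1.00, 0.90, 0.85, 0.80]
print(should_stop(climbing))   # True: three evaluations in a row above 0.90
print(should_stop(improving))  # False: validation loss is still falling
```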

Merging Adapters and Exporting the Model

After training, the LoRA adapters exist as separate weight files. For deployment, they need to be merged back into the base model and optionally converted to GGUF format for inference engines like llama.cpp or Ollama. GGUF export requires either llama.cpp build tools installed locally or the appropriate Unsloth extras; consult Unsloth documentation for your installed version.

# Code Example 5: Merging LoRA weights and exporting to GGUF
import torch
from pathlib import Path
from unsloth import FastLanguageModel

checkpoint_path = Path("./output/final_checkpoint")
assert checkpoint_path.exists(), f"Checkpoint not found: {checkpoint_path}"

# trainer.save_model() on a PEFT model writes the LoRA adapters only, so this
# checkpoint should contain adapter_config.json; the merge itself happens below.
adapter_config = checkpoint_path / "adapter_config.json"
assert adapter_config.exists(), (
    "final_checkpoint does not look like a LoRA adapter checkpoint. "
    "Point checkpoint_path at the directory saved after training."
)

# Reload model — must load in 16-bit (not 4-bit) for lossless merge
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name=str(checkpoint_path),
    max_seq_length=2048,
    dtype=torch.bfloat16,
    load_in_4bit=False,  # Must be False for lossless 16-bit merge
)

# Merge LoRA weights into base model
model.save_pretrained_merged(
    "./output/merged_model",
    tokenizer,
    save_method="merged_16bit",  # Merged weights stored in 16-bit precision
)

# Export to GGUF for llama.cpp / Ollama
model.save_pretrained_gguf(
    "./output/gguf_model",
    tokenizer,
    quantization_method="q4_k_m",  # Options: q4_k_m, q5_k_m, q8_0
)

print("Model merged and exported to GGUF format.")

Quantization method selection affects both model size and quality. Q4_K_M offers a good size-to-quality balance for most deployments (roughly 4.9 bits per weight once per-block scales are included). Q5_K_M offers slightly higher fidelity at about 5.7 bits per weight. Q8_0 preserves nearly full quality at about 8.5 bits per weight, at roughly 1.7 times the file size of Q4_K_M. For an 8B model, expect GGUF file sizes of approximately 4.9 GB (Q4_K_M), 5.7 GB (Q5_K_M), and 8.5 GB (Q8_0). Sizes scale roughly in proportion to parameter count.
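
Working backward from the quoted file sizes gives the effective bits per weight for each method, which can then be reused to estimate sizes for other models. The sketch below is arithmetic only: the helper names are hypothetical, the parameter count is taken as a round 8.0 billion, and file metadata is ignored.

```python
def implied_bits_per_weight(size_gb, params_billion):
    """Effective bits per weight implied by a file size, scales included."""
    return size_gb * 8 / params_billion


def estimate_size_gb(params_billion, bits_per_weight):
    """Approximate GGUF file size for a model at a given effective bit rate."""
    return params_billion * bits_per_weight / 8


quoted_sizes_gb = {"q4_k_m": 4.9, "q5_k_m": 5.7, "q8_0": 8.5}  # figures from the text
for method, size in quoted_sizes_gb.items():
    bpw = implied_bits_per_weight(size, 8.0)
    print(f"{method}: {bpw:.1f} bits/weight, 7B estimate {estimate_size_gb(7.0, bpw):.1f} GB")
```

Add headroom for the context cache at inference time, since the file size covers weights only.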

Evaluating Your Fine-Tuned Model

Quantitative Evaluation

Evaluation should combine general language modeling metrics with task-specific measures. Perplexity on the held-out validation set provides a baseline signal: lower perplexity indicates better prediction of the validation data, though it does not directly measure task performance. Task-specific metrics matter more. For classification tasks, measure F1 score and exact match accuracy. For generation tasks, BLEU and ROUGE measure n-gram overlap, which correlates weakly with human preference for open-ended text; prefer task-specific automated metrics (e.g., LLM-as-judge) or human evaluation when possible.

Check validation set performance against the base model's scores on the same set. If the fine-tuned model shows only marginal improvement over the base model with well-crafted prompts, the fine-tuning may not be justified.
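
The metrics mentioned above are simple to compute by hand, which is useful for sanity-checking numbers reported by a framework. The sketch below assumes the mean validation loss is a per-token cross-entropy in natural-log units and uses hypothetical helper names; the labels and predictions are made up for illustration.

```python
import math


def perplexity(mean_loss):
    """Perplexity is the exponential of the mean per-token cross-entropy."""
    return math.exp(mean_loss)


def exact_match(predictions, references):
    """Share of predictions that equal the reference after trimming whitespace."""
    hits = sum(p.strip() == r.strip() for p, r in zip(predictions, references))
    return hits / len(references)


def f1_score(predictions, references, positive):
    """F1 for one class, treating `positive` as the label of interest."""
    pairs = list(zip(predictions, references))
    tp = sum(p == positive and r == positive for p, r in pairs)
    fp = sum(p == positive and r != positive for p, r in pairs)
    fn = sum(p != positive and r == positive for p, r in pairs)
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


preds = ["pos", "pos", "neg", "neg"]   # made-up model outputs
refs = ["pos", "neg", "pos", "neg"]    # made-up gold labels
print(f"perplexity at loss 1.5: {perplexity(1.5):.2f}")
print(f"exact match: {exact_match(preds, refs)}")
print(f"F1 (pos):    {f1_score(preds, refs, 'pos')}")
```

Running the same functions on base-model outputs gives the comparison baseline described above.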

Qualitative Testing and Red-Teaming

Manual testing against the base model on identical prompts reveals behavioral changes that metrics miss. The following script runs the same prompts through both models for side-by-side comparison. It loads and tests models sequentially to avoid OOM on consumer GPUs.

# Code Example 6: Side-by-side inference comparison
import torch
from unsloth import FastLanguageModel


def run_inference(model, tokenizer, messages, device="cuda"):
    """Tokenize, generate, and decode only new tokens."""
    inputs = tokenizer.apply_chat_template(
        messages,
        tokenize=True,
        return_tensors="pt",
        add_generation_prompt=True,
    ).to(device)
    attention_mask = torch.ones_like(inputs)  # Single unpadded prompt: attend to every token
    with torch.no_grad():
        output = model.generate(
            input_ids=inputs,
            attention_mask=attention_mask,
            max_new_tokens=256,
            pad_token_id=tokenizer.eos_token_id,
        )
    # Decode only newly generated tokens, not the prompt
    new_tokens = output[0][inputs.shape[-1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)


# Test prompts
test_prompts = [
    "Explain the key differences between LoRA and full fine-tuning.",
    "Generate a compliance report summary for Q3 2025.",
    # Medical prompts are included only to test domain regression;
    # do not deploy fine-tuned models for medical advice without clinical review.
    "What are the side effects of metformin?",
]

# Load and test models sequentially to avoid OOM on consumer GPUs
for model_name, label in [
    ("unsloth/Meta-Llama-3.1-8B-Instruct-bnb-4bit", "BASE MODEL"),
    ("./output/final_checkpoint", "FINE-TUNED"),
]:
    model, tokenizer = FastLanguageModel.from_pretrained(
        model_name=model_name,
        max_seq_length=2048,
        load_in_4bit=True,
    )
    FastLanguageModel.for_inference(model)

    for prompt in test_prompts:
        messages = [{"role": "user", "content": prompt}]
        result = run_inference(model, tokenizer, messages)
        print(f"\n{'='*60}")
        print(f"PROMPT: {prompt}")
        print(f"\n{label}:\n{result}")

    # Free VRAM before loading next model
    del model
    torch.cuda.empty_cache()

Beyond positive test cases, red-teaming is critical. Test for regression on general knowledge questions the base model handles correctly. Check for hallucinations introduced by overfitting. Probe edge cases in the domain to verify the model has genuinely learned the target behavior rather than surface-level pattern matching.

Troubleshooting and Best Practices

Common Failure Modes

When loss refuses to converge, the most common cause is a learning rate that is too high. For QLoRA, start at 2e-4 and reduce to 1e-4 if loss is unstable. Also check dataset formatting: malformed ChatML templates cause the model to train on garbage tokens.

The model shows strong domain performance but degrades on general tasks? That is catastrophic forgetting. Reduce epochs, lower the learning rate, or reduce LoRA rank. Mixing a small percentage (5 to 10%) of general instruction-following data into the training set helps preserve broad capabilities.

Training loss approaching zero while validation loss climbs is the classic overfitting signature. Reduce epochs, increase dropout, or expand the dataset. On datasets with fewer than 1,000 examples, cutting back to 1 to 3 epochs is often enough to stop it.

For CUDA OOM errors, reduce batch size first, then increase gradient accumulation steps to maintain effective batch size. Enable gradient checkpointing (already set in the Unsloth config above). If still out of memory, reduce sequence length or LoRA rank.
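
The first step of that recipe, trading batch size for gradient accumulation, can be expressed as a small helper. This is a sketch with a hypothetical function name; it halves the per-device batch and raises accumulation so that the effective batch size does not shrink.

```python
import math


def shrink_batch(per_device_batch, grad_accum):
    """Halve the per-device batch while keeping the effective batch size."""
    if per_device_batch <= 1:
        raise ValueError(
            "Batch size is already 1; reduce sequence length or LoRA rank instead."
        )
    effective = per_device_batch * grad_accum
    new_batch = per_device_batch // 2
    new_accum = math.ceil(effective / new_batch)
    return new_batch, new_accum


batch, accum = 4, 4                       # starting point: effective batch of 16
batch, accum = shrink_batch(batch, accum)
print(batch, accum)                       # 2 8
batch, accum = shrink_batch(batch, accum)
print(batch, accum)                       # 1 16
```

Each halving roughly doubles the number of forward passes per optimizer step, so expect longer wall-clock time for the same number of steps.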

Hyperparameter Tuning Tips

For learning rate, a brief sweep across 1e-4, 2e-4, and 3e-4 on a small data subset (10% of training data, 1 epoch) quickly identifies a workable value. Rank selection follows the complexity of the task: rank 8 for formatting and style changes, rank 16 to 32 for moderate domain shifts, rank 64 for substantial knowledge injection. Epochs should be kept low (1 to 3 for datasets above 5,000 examples, 3 to 5 for smaller datasets). When results are poor, improving dataset size and quality usually pays off more than further hyperparameter tweaking.
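
The learning-rate sweep described above can be planned in a few lines before any GPU time is spent. The sketch below only prepares the subset and the per-run settings: sweep_plan and pick_best are hypothetical helpers, the output directory names are arbitrary, and the actual training call for each run is left to the training script.

```python
import random


def sweep_plan(train_records, learning_rates=(1e-4, 2e-4, 3e-4), fraction=0.1, seed=42):
    """Sample a fixed subset and build one single-epoch config per learning rate."""
    rng = random.Random(seed)  # Local RNG so the same subset is used for every run
    subset_size = max(1, int(len(train_records) * fraction))
    subset = rng.sample(train_records, subset_size)
    runs = [
        {
            "learning_rate": lr,
            "num_train_epochs": 1,
            "output_dir": f"./sweep/lr_{lr:.0e}",
        }
        for lr in learning_rates
    ]
    return subset, runs


def pick_best(eval_loss_by_lr):
    """Return the learning rate with the lowest final validation loss."""
    return min(eval_loss_by_lr, key=eval_loss_by_lr.get)


subset, runs = sweep_plan(list(range(5000)))
print(len(subset), "examples per sweep run")
for run in runs:
    print(run)
```

Keeping the subset identical across runs means differences in validation loss reflect the learning rate rather than the data sample.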

What Comes Next

The workflow above covers the full cycle from decision-making through dataset preparation, QLoRA training, evaluation, and export. The decision framework from section one helps determine when fine-tuning is the right investment.

The most immediate next step: deploy the exported GGUF model with Ollama for local inference. From there, you can wrap it in a REST API for application integration (using FastAPI or llama-cpp-python's built-in OpenAI-compatible server) or set up a retraining pipeline triggered when new labeled data exceeds a threshold or evaluation scores drift below an acceptable baseline.
