Short answer

In 2026, the default fine-tuning stack for most production teams is QLoRA + SFT for behavior, followed by DPO for alignment. QLoRA lets you fit large models on a single GPU. DPO has largely replaced PPO-based RLHF for alignment because it's simpler, more stable, and needs roughly half the GPU memory. Use RFT (Reinforcement Fine-Tuning) only when correctness is automatically verifiable. Skip fine-tuning entirely if prompting or RAG solves the problem.

Three years ago, fine-tuning an LLM meant standing up a training cluster, hiring an ML platform engineer, and committing six weeks of work. Today an applied engineer can fine-tune a 70B-class open model on a single GPU in a few hours using QLoRA, ship it as a vLLM endpoint by the end of the week, and run alignment on top of it with DPO the week after. The methods got better, the libraries got easier, and the wall between research and production got thinner.

But the menu also got longer. LoRA, QLoRA, full fine-tuning, supervised fine-tuning, PPO-based RLHF, DPO, IPO, KTO, RFT, GRPO — the acronyms keep coming. Most teams don't need all of them. Most teams need a clean opinion about which two or three to combine, what order to apply them in, and when to stop.

This guide is the opinion. It's the framework we walk through with AI engineers researching companies on JobsByCulture and trying to figure out where their fine-tuning skills will be valued in 2026 hiring.

The Decision Tree (Read This First)

Before picking a method, decide if you should be fine-tuning at all. The interventions, ordered from cheapest to most expensive:

  1. Prompting — free, instant, no training. Solves ~80% of business cases.
  2. RAG — cheap, fast to iterate. Solves cases needing fresh, proprietary, or large context.
  3. SFT (supervised fine-tuning) — teaches style, format, behavior. Use when prompts can't reliably elicit it.
  4. DPO or RLHF — aligns behavior to preferences. Use after SFT when you have preference data.
  5. RFT — rewards verifiable correctness. Use when ground truth is automatable.
  6. Continued pre-training — rare. Use only for genuine domain shift (specialized legal, medical, code).

Skip ahead in the tree only when the cheaper option provably doesn't work. The classic failure pattern in 2024–2025 was teams jumping to fine-tuning because it sounded sophisticated, only to find a better prompt would have worked. See our fine-tuning vs RAG vs prompt engineering guide for the full decision tree.
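The ordering above can be written down as a small function. This is only a sketch of one way to read the tree: the `Situation` fields and the `recommend` helper are hypothetical names introduced for illustration, and real decisions will involve more nuance than five booleans.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Situation:
    """Hypothetical summary of what you have already tried and what data you hold."""
    prompting_meets_bar: bool
    rag_meets_bar: bool
    genuine_domain_shift: bool
    has_preference_data: bool
    has_automatic_grader: bool


def recommend(s: Situation) -> List[str]:
    """Return the cheapest stack that fits, following the order in the list above."""
    # Stop at the first cheap intervention that already works.
    if s.prompting_meets_bar:
        return ["prompting"]
    if s.rag_meets_bar:
        return ["RAG"]
    stack = []
    # Continued pre-training is rare and comes before any task-specific tuning.
    if s.genuine_domain_shift:
        stack.append("continued pre-training")
    stack.append("SFT")
    # Alignment is layered on top of SFT only when preference data exists.
    if s.has_preference_data:
        stack.append("DPO")
    # RFT needs an automatable correctness check.
    if s.has_automatic_grader:
        stack.append("RFT")
    return stack


print(recommend(Situation(False, False, False, True, False)))  # ['SFT', 'DPO']
```

The early returns are the point: the function never reaches the training branches if a cheaper option already meets the bar.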

"Try prompting first, RAG second, fine-tuning third. The fine-tuning decision should usually come after you've validated demand with the cheaper approaches."

The Four Method Families You Need to Understand

Every fine-tuning method falls into one of four categories. Understanding the family is more useful than memorizing acronyms.

Family 1: Parameter-Efficient Methods (LoRA, QLoRA)

Instead of updating all the model's weights, you freeze the base and train small "adapter" matrices on top. With LoRA, the adapter typically trains a tiny fraction of the total parameters (often under 1%) and the result is close to full fine-tuning for many tasks. QLoRA goes further by quantizing the frozen base model to 4-bit precision (using NF4 format), dramatically reducing memory.
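To see why the adapter is such a small fraction of the model, it helps to count parameters for a single weight matrix. The sketch below assumes the standard LoRA factorization, where a frozen d_out × d_in matrix gets a trainable update built from two rank-r matrices; the 4096 × 4096 layer size and the helper name are illustrative assumptions, not figures from any specific model.

```python
def lora_param_fraction(d_out: int, d_in: int, rank: int) -> float:
    """Trainable adapter parameters as a fraction of the frozen matrix."""
    full = d_out * d_in               # parameters in the frozen weight matrix
    adapter = rank * (d_out + d_in)   # B is d_out x r, A is r x d_in
    return adapter / full


# Illustrative layer size; real models mix several matrix shapes.
print(f"{lora_param_fraction(4096, 4096, 16):.2%}")  # 0.78%
```

The fraction for a whole model depends on which matrices get adapters, so treat the per-matrix number as an upper-level intuition rather than a model-wide figure.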

The win: a 70B-class model that would need around 140 GB of VRAM in 16-bit precision fits in roughly 46 GB with QLoRA — within reach of a single 80 GB GPU. The cost: a small quality tax at the upper end of model size, usually within a couple of points on benchmarks.
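The 140 GB figure is simple arithmetic on weight storage, and the same arithmetic shows where the 4-bit saving comes from. In the sketch below, `weight_memory_gb` is a hypothetical helper that counts weights only. The gap between the 35 GB it reports for 4-bit weights and the roughly 46 GB quoted above would have to come from quantization constants, adapters, optimizer state, and activations; that breakdown is our assumption, not a measurement.

```python
def weight_memory_gb(n_params: int, bits_per_param: int) -> float:
    """Memory for model weights alone, in GB (10^9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


seventy_b = 70_000_000_000
print(weight_memory_gb(seventy_b, 16))  # 140.0
print(weight_memory_gb(seventy_b, 4))   # 35.0
```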

Family 2: Supervised Fine-Tuning (SFT)

The simplest objective: show the model labeled input–output pairs and train it to match. Cross-entropy loss against the labeled answer. This is the workhorse for teaching style, format, and behavior — "respond in JSON," "use this brand voice," "follow this taxonomy."
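For readers who want to see the objective itself, here is a minimal pure-Python sketch of token-level cross-entropy: the average negative log-probability the model assigns to each labeled token. The function names are ours, and real training code computes this on tensors in a framework rather than on Python lists.

```python
import math
from typing import List, Sequence


def log_softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable log-softmax over one vocabulary-sized vector."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_sum_exp for x in logits]


def sft_loss(logits_per_token: Sequence[Sequence[float]], target_ids: Sequence[int]) -> float:
    """Mean negative log-likelihood of the labeled tokens."""
    if not target_ids or len(logits_per_token) != len(target_ids):
        raise ValueError("need one logit vector per target token")
    total = 0.0
    for logits, target in zip(logits_per_token, target_ids):
        total -= log_softmax(logits)[target]
    return total / len(target_ids)


# A model that is indifferent between 4 tokens pays ln(4) per token.
print(round(sft_loss([[0.0, 0.0, 0.0, 0.0]], [2]), 4))  # 1.3863
```

Raising the logit of the labeled token lowers the loss, which is all SFT is asking the optimizer to do.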

SFT works best with curated data: roughly 1,000 to 10,000 high-quality examples typically outperform 100,000 noisy ones. SFT is also what most "instruction-tuning" runs are doing under the hood. You'll almost always do SFT first, then layer alignment on top.

Family 3: Preference-Based Alignment (DPO, PPO/RLHF, IPO, KTO)

Once you have a model that produces approximately what you want, you align it to actual human preferences. The classic approach was RLHF with PPO: train a reward model from preference pairs, then use reinforcement learning to maximize the reward. It works but it's expensive — PPO carries four model copies (policy, reference, reward, critic), needing on the order of 220+ GB for a 7B model.
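One way to reproduce the order of magnitude is to count bytes per parameter. The sketch below assumes full-parameter training with mixed-precision Adam at about 16 bytes per trainable parameter and 2 bytes per frozen parameter, and it ignores activations entirely. Those constants are our assumptions, so treat the output as a rough estimate rather than a measurement.

```python
# Assumption: 2 (16-bit weights) + 2 (16-bit grads) + 12 (fp32 master copy + two Adam moments)
BYTES_PER_TRAINABLE_PARAM = 16
# Assumption: frozen models hold 16-bit weights only
BYTES_PER_FROZEN_PARAM = 2


def training_memory_gb(n_params: int, trainable_copies: int, frozen_copies: int) -> float:
    """Rough weight + optimizer memory in GB, excluding activations."""
    per_param = (trainable_copies * BYTES_PER_TRAINABLE_PARAM
                 + frozen_copies * BYTES_PER_FROZEN_PARAM)
    return n_params * per_param / 1e9


seven_b = 7_000_000_000
# PPO: policy and critic are trained; reference and reward models are frozen.
ppo_gb = training_memory_gb(seven_b, trainable_copies=2, frozen_copies=2)
# A two-model setup: one trained policy plus one frozen reference.
two_model_gb = training_memory_gb(seven_b, trainable_copies=1, frozen_copies=1)
print(ppo_gb, two_model_gb)  # 252.0 126.0
```

Under these assumptions the four-model setup lands in the same range as the figure above, and dropping the reward and critic models halves it.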

DPO (Direct Preference Optimization) skips the reward model and uses preference pairs (chosen vs rejected) to directly update the policy. It needs roughly half the GPU memory, trains more stably, and produces quality comparable to RLHF for most product cases. By 2026, DPO has displaced PPO/RLHF for most production teams, while frontier labs with very large preference datasets still get marginal gains from PPO.
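The DPO objective for a single preference pair is short enough to write out. This sketch follows the published DPO loss as we understand it: the inputs are summed log-probabilities of the chosen and rejected responses under the policy and under a frozen reference model, and `beta` (0.1 here, an illustrative default) controls how far the policy may drift from the reference. The function name is ours.

```python
import math


def dpo_loss(pi_chosen: float, pi_rejected: float,
             ref_chosen: float, ref_rejected: float,
             beta: float = 0.1) -> float:
    """-log(sigmoid(beta * margin)) for one (chosen, rejected) pair."""
    # How much more the policy prefers chosen over rejected than the reference does.
    margin = (pi_chosen - ref_chosen) - (pi_rejected - ref_rejected)
    x = -beta * margin
    # Stable softplus(x) = log(1 + exp(x)), which equals -log(sigmoid(-x)).
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


# Policy identical to the reference: no preference learned yet, loss is ln(2).
print(round(dpo_loss(-10.0, -10.0, -10.0, -10.0), 4))  # 0.6931
```

No reward model appears anywhere in the function, which is where the memory and stability advantages come from.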

IPO and KTO are variants that handle preference noise and unpaired feedback better, but for most teams DPO is the right default.

Family 4: Reinforcement Fine-Tuning (RFT, GRPO)

RFT trains the model by rewarding it for producing verifiably correct outputs rather than imitating a labeled answer. The reward function is automated and checks the output against ground truth — whether a math answer is right, whether a code sample passes tests, whether a tool call returns the expected result.

This family powered much of the recent wave of reasoning-focused models, where the reward function naturally exists. For most product teams the prerequisite of a reliable, automatable reward function is what makes RFT impractical — if you can write the reward, you may not need the model to learn it. Use RFT when correctness is cheap to check (math, code, structured outputs, tool use), not when quality is subjective.
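A minimal sketch of the two pieces involved: an automated reward that checks an answer against ground truth, and the group-relative normalization that GRPO-style methods apply to the rewards of several samples for the same prompt. The exact-match grader and both function names are illustrative assumptions, and implementations differ on details such as which standard deviation they use.

```python
import statistics
from typing import List, Sequence


def math_reward(model_answer: str, ground_truth: str) -> float:
    """1.0 if the final answer matches the ground truth exactly, else 0.0."""
    return 1.0 if model_answer.strip() == ground_truth.strip() else 0.0


def group_relative_advantages(rewards: Sequence[float], eps: float = 1e-8) -> List[float]:
    """Score each sample relative to the other samples for the same prompt."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mean) / (std + eps) for r in rewards]


# Four sampled answers to one prompt whose correct answer is "42".
samples = ["42", "41", " 42 ", "7"]
rewards = [math_reward(s, "42") for s in samples]
print(rewards)  # [1.0, 0.0, 1.0, 0.0]
print([round(a, 3) for a in group_relative_advantages(rewards)])  # [1.0, -1.0, 1.0, -1.0]
```

If every sample in a group gets the same reward, the advantages are all zero and that prompt contributes no learning signal, which is one reason a clean, discriminating reward function matters so much.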

Side-by-Side Comparison

| Method | What it does | When to use | Skip if |
|---|---|---|---|
| SFT (full) | Updates all weights to match labeled examples | Small models (<7B), domain shift, want top quality, have GPUs | Compute budget tight; model is 13B+ |
| LoRA | Trains low-rank adapters; freezes base in full precision | Multi-tenant serving, you need to swap adapters at inference time | Quantization is acceptable; cost matters more than the 1–2 quality points |
| QLoRA | LoRA on top of a 4-bit-quantized frozen base | Default for SFT in 2026; large models on small GPU budgets | You have spare GPU memory and want the marginal quality of full LoRA |
| PPO / RLHF | Trains reward model, uses RL to maximize reward | Frontier labs with large preference datasets and platform investment | You're a product team; DPO is almost always the better starting point |
| DPO | Directly trains policy from preference pairs (no reward model) | Default for alignment in 2026; most product teams | You don't have preference pairs yet; collect them with SFT first |
| KTO | Like DPO but works with single-rating feedback (thumbs up/down) | You have lots of unpaired feedback but few chosen/rejected pairs | You already have clean preference pairs; DPO will outperform |
| RFT / GRPO | RL with verifiable-reward function (math, code, tool use) | Correctness is automatable and important; reasoning-style models | Quality is subjective; reward function isn't clean |
| Continued pre-training | Long training run on new corpus before any task-specific tuning | Genuine domain shift (specialized legal, medical, code, language) | Your task is "make it write like our brand"; SFT is the right tool |

The Production Stack Most Teams Should Run

If you're building a fine-tuned model into a product in 2026, this is the stack that wins for the median case:

Step 1: Pick a base model

Choose an open-weights model that's competitive on the benchmarks closest to your task. Llama, Qwen, Mistral, and Gemma families are the common starting points. Stay one model generation behind the absolute frontier — the tooling, quantization support, and known-good recipes lag the latest model by 3–6 months. Latest isn't always best for production.

Step 2: SFT with QLoRA

Curate a high-quality dataset (start with 1,000–10,000 examples) and run QLoRA SFT. Start with an NF4 base, 16-bit LoRA adapters, and rank 16 or 32. A single 80 GB GPU is usually enough for models up to ~70B. Training time for a small instruction set is hours, not days.
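Written down as a config, the starting recipe looks roughly like the sketch below. It is a plain dictionary rather than library code. The key names under `quantization` and `lora` mirror keyword arguments we believe are used by Hugging Face `BitsAndBytesConfig` and `LoraConfig`, so check them against your installed versions. The `adapter_dtype` key and the `check_recipe` helper are hypothetical, and the `lora_alpha` and `lora_dropout` values are our own illustrative choices, not recommendations from this guide.

```python
from typing import Dict, List

qlora_recipe: Dict[str, object] = {
    "quantization": {
        "load_in_4bit": True,                  # frozen base in 4-bit
        "bnb_4bit_quant_type": "nf4",          # NF4 format
        "bnb_4bit_compute_dtype": "bfloat16",  # 16-bit compute for dequantized matmuls
    },
    "lora": {
        "r": 16,               # rank 16 or 32 as a starting point
        "lora_alpha": 32,      # illustrative assumption
        "lora_dropout": 0.05,  # illustrative assumption
        "task_type": "CAUSAL_LM",
    },
    "adapter_dtype": "bfloat16",  # hypothetical key: keep adapters in 16-bit
}


def check_recipe(recipe: Dict[str, object]) -> List[str]:
    """Return a list of ways the recipe departs from the starting point above."""
    problems = []
    quant = recipe["quantization"]
    lora = recipe["lora"]
    if not quant.get("load_in_4bit"):
        problems.append("base model is not loaded in 4-bit")
    if quant.get("bnb_4bit_quant_type") != "nf4":
        problems.append("quantization type is not NF4")
    if recipe.get("adapter_dtype") not in ("bfloat16", "float16"):
        problems.append("adapters are not 16-bit")
    if lora.get("r") not in (16, 32):
        problems.append("rank is outside the suggested 16 or 32 starting point")
    return problems


print(check_recipe(qlora_recipe))  # []
```

Keeping the recipe as data makes it easy to log alongside each training run and to diff between experiments.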

The hidden trap: data quality. The single biggest predictor of fine-tuning success is whether your labeled examples are correct, diverse, and reflective of how the model will be used in production. Spend 2x the time on data curation that you spend on training scripts.

Step 3: Evaluate before alignment

Run a real evaluation against held-out examples. If SFT alone meets your bar, stop here. Many teams skip this step and immediately move to alignment — only to discover later that the alignment didn't help because SFT already solved the problem. Evals are also where most production fine-tuning projects fail; see our LLM evaluation guide.
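The stop-or-continue decision can be made explicit with a small gate. This sketch uses exact-match accuracy on held-out examples, which only suits tasks with a single right answer; the metric, the function names, and the 0.9 bar in the usage line are all illustrative assumptions.

```python
from typing import Sequence


def exact_match_accuracy(predictions: Sequence[str], references: Sequence[str]) -> float:
    """Share of held-out examples where the model output matches the reference."""
    if not references or len(predictions) != len(references):
        raise ValueError("need one prediction per held-out reference")
    hits = sum(p.strip() == r.strip() for p, r in zip(predictions, references))
    return hits / len(references)


def sft_is_enough(predictions: Sequence[str], references: Sequence[str], bar: float) -> bool:
    """True means stop here; False means alignment may be worth trying."""
    return exact_match_accuracy(predictions, references) >= bar


preds = ["refund", "billing", "shipping", "refund"]
refs = ["refund", "billing", "returns", "refund"]
print(exact_match_accuracy(preds, refs))    # 0.75
print(sft_is_enough(preds, refs, bar=0.9))  # False
```

Fixing the bar before training starts keeps the decision honest; it is much harder to argue for another alignment round when the number was agreed in advance.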

Step 4: DPO for alignment (if needed)

If SFT alone isn't enough, collect preference pairs and run DPO. The preference pair format is simple: prompt + chosen response + rejected response. 5,000 to 50,000 preference pairs is the typical working range. DPO is more stable to train than PPO/RLHF, needs roughly half the GPU memory, and is usually competitive in quality.

Generating preference pairs can be done with human labelers, with a stronger model labeling outputs from your weaker model, or with synthetic data pipelines — the last has its own quality risks but is increasingly common in production teams.
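Whichever source the pairs come from, it is worth validating them before training. The sketch below uses the prompt, chosen, and rejected fields described in this step; the `validate_pair` helper and the specific checks are our own illustrative choices, and the example record is made up.

```python
from typing import Dict, List

REQUIRED_FIELDS = ("prompt", "chosen", "rejected")


def validate_pair(pair: Dict[str, str]) -> List[str]:
    """Return a list of problems with one preference record (empty if clean)."""
    problems = []
    for field in REQUIRED_FIELDS:
        value = pair.get(field)
        if not isinstance(value, str) or not value.strip():
            problems.append(f"missing or empty field: {field}")
    # A pair with identical responses carries no preference signal.
    if not problems and pair["chosen"].strip() == pair["rejected"].strip():
        problems.append("chosen and rejected are identical")
    return problems


# Made-up example record.
pair = {
    "prompt": "Summarize our refund policy in one sentence.",
    "chosen": "Refunds are available within 30 days with a receipt.",
    "rejected": "We have a policy about refunds.",
}
print(validate_pair(pair))  # []
```

Synthetic pipelines in particular tend to produce duplicates and empty responses, so a check like this is cheap insurance.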

Step 5: RFT only if correctness is automatable

RFT is the right tool when your task has a clean correctness signal — math, code, structured outputs, tool use. If you can write a function that grades the output, RFT can lift quality significantly. If you can't, skip it. For most product use cases (customer support, internal assistants, document QA), RFT is overkill and DPO is the right ceiling.
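As a quick test of whether your task qualifies, try writing the grader. For structured outputs it can be as small as the sketch below, which scores a response 1.0 only if it parses as a JSON object and contains the required keys. The function name and the example keys are hypothetical, and a production grader would usually check value types and ranges too.

```python
import json
from typing import Sequence


def grade_json_output(output: str, required_keys: Sequence[str]) -> float:
    """1.0 if the output is a JSON object with every required key, else 0.0."""
    try:
        parsed = json.loads(output)
    except json.JSONDecodeError:
        return 0.0
    if not isinstance(parsed, dict):
        return 0.0
    return 1.0 if all(key in parsed for key in required_keys) else 0.0


keys = ["intent", "priority"]
print(grade_json_output('{"intent": "refund", "priority": 2}', keys))  # 1.0
print(grade_json_output("Sure! Here is the JSON you asked for.", keys))  # 0.0
```

If you cannot write a function like this for your task without a human in the loop, that is a strong sign DPO is the ceiling.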

QLoRA base precision: 4-bit
DPO vs PPO memory saving: ~2×
SFT examples sweet spot: 1K–10K

Open Weights vs Closed-Model Fine-Tuning APIs

Several closed providers offer hosted fine-tuning APIs. The trade-off is straightforward.

Hosted fine-tuning APIs are the right call when: you want fast iteration with no infrastructure work, your team doesn't have ML platform engineering, your data volume is modest, and you don't need to ship the model artifact to private infrastructure. They handle the training, the quantization, the serving, the rollback. You write data, you write a config, you get an endpoint.

Open-weights fine-tuning is the right call when: you need control over the model artifact, you want to ship to private infrastructure (regulated industries, sovereign cloud, on-prem), you need to do alignment work that vendor APIs don't support, or your inference volume is high enough that vendor inference cost is painful. The labor cost is higher but the flexibility is real.

Most production teams in 2026 use both: closed-model APIs for prototyping and experimentation, open-weights for what they actually ship.

What's Hot in 2026 (And Mostly Hype)

A few things you'll see on Twitter that you don't need to chase as a practitioner:

What Hiring Teams Want From Fine-Tuning Engineers

If you're an AI engineer using JobsByCulture's ML/AI job board to research roles, here's the depth bar hiring teams are setting for fine-tuning work in 2026:

Companies hiring heavily for fine-tuning work in 2026 include Databricks (Mosaic AI platform), Anthropic, OpenAI, Cohere, Mistral, and a long tail of applied-AI startups across enterprise verticals.

Frequently Asked Questions

Should I use LoRA or QLoRA in 2026?
QLoRA is the right default for most production fine-tuning in 2026. It quantizes the frozen base model to 4-bit precision (NF4), which makes a 70B-class model fit on a single 80GB GPU instead of needing multiple. The quality loss is small for instruction-tuning workloads, and the cost reduction is large. Use plain LoRA without quantization only when you have spare GPU memory and need to squeeze out the last point or two of quality on a model that's small enough to fit anyway.
Is DPO replacing RLHF for alignment work?
For most production teams, yes. Direct Preference Optimization skips the reward-model training step entirely and uses preference pairs (chosen vs rejected) to directly update the policy. It needs roughly half the GPU memory of PPO-based RLHF because it doesn't carry a separate reward and critic model. The training is more stable, the data pipeline is simpler, and the quality is comparable for most alignment tasks. PPO/RLHF still wins for frontier labs that have very large preference datasets and the engineering budget to run it well.
What is RFT (Reinforcement Fine-Tuning) and when should I use it?
RFT trains a model by rewarding it for producing verifiably correct outputs, rather than imitating a labeled answer. The reward function is automated and checks the output against ground truth. It works well when correctness is checkable (math, code, structured outputs, tool use) but poorly when quality is subjective. The recent wave of reasoning models has been trained largely with RFT-style approaches. For most product teams in 2026, RFT is overkill — SFT plus DPO is the safer starting point unless your task has a clean correctness check.
When should I fine-tune at all versus using prompting or RAG?
Try prompting first, RAG second, fine-tuning third. Prompting handles roughly 80% of business use cases at zero training cost. RAG handles cases where you need fresh, proprietary, or large factual context. Fine-tuning is right when you need consistent style or format, lower latency than prompt-stuffing allows, lower per-token cost at high volume, or specialized behavior that prompting can't reliably elicit. See our fine-tuning vs RAG vs prompt engineering guide.
Can I fine-tune a 70B model on a single GPU?
Yes, with QLoRA. The base model gets quantized to 4-bit precision (NF4) and frozen, and only low-rank LoRA adapters get trained. A 70B-class model that would need around 140GB of VRAM in 16-bit precision fits in roughly 46GB once quantized — within reach of a single 80GB GPU like an A100 or H100. Training time depends on the dataset size, but small instruction-tuning datasets (tens of thousands of examples) typically complete in hours, not days.
How much data do I need to fine-tune?
For SFT instruction-tuning, 1,000 to 10,000 well-curated examples usually outperform 100,000 noisy examples. For DPO alignment, 5,000 to 50,000 preference pairs is the typical range. For RFT, you need a high-quality reward function more than you need a lot of data; even a few hundred prompts can work if the reward signal is reliable. The single biggest predictor of fine-tuning success is data quality, not data quantity.
Should I use a closed model's fine-tuning API or train open-weights myself?
Closed-model fine-tuning APIs are the right call when you want fast iteration with no infrastructure work, your team doesn't have deep ML engineering, and your data volume is modest. Open-weights fine-tuning is the right call when you need control over the model artifact, you want to ship to private infrastructure, you need to do alignment work (DPO/RFT) that isn't well-supported in vendor APIs, or your inference volume is high enough that vendor inference costs are painful. Most production teams in 2026 use both — closed-model APIs for prototyping, open-weights for what they ship.
