
Updating the Model from Within: Fine-Tuning, LoRA, RLHF and Knowledge Distillation

In the previous article, we covered RAG, Tool Use and Agentic AI, which support the model from the outside: fetch information, call tools, plan and orchestrate. The four approaches in this article, however, change the model itself. External support leaves the model unchanged. An internally updated model becomes a different model altogether.

1. Fine-Tuning: From Off-the-Rack to Bespoke

A large language model is like a general knowledge expert who has completed training with massive amounts of text (14.8 trillion tokens for DeepSeek-V3-Base). It can reason about any topic but lacks deep expertise in any specific domain. Fine-tuning deepens this general capability in a particular area: retraining the model on a new, domain-specific dataset turns it into a specialist.

After Google’s Med-PaLM 2 model was fine-tuned with medical data, it achieved 86.5% accuracy on the US Medical Licensing Examination, a 19-point leap over its predecessor. Moreover, doctors preferred this model’s responses over real doctors’ responses on 8 out of 9 clinical criteria.

An even more striking example: Stanford’s Alpaca project fine-tuned Meta’s LLaMA 7B model with just 52,000 AI-generated samples. The total cost, including data generation and training, was under $600. The resulting model demonstrated performance comparable to GPT-3.5. An open-source model with 7 billion parameters was transformed into a model that could compete with a commercial giant—all on a master’s student’s budget.

Why it matters: Fine-tuning is the most direct way to turn a general-purpose model into a domain-specific expert. But classical fine-tuning comes at a cost: updating all the model’s weights is extremely expensive and resource-intensive for large models. At GPT-3 scale, this means retraining 175 billion parameters.
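To make “updating all the weights” concrete, here is a minimal sketch in Python (NumPy only). A single weight matrix stands in for billions of parameters and a synthetic dataset stands in for domain data; all the numbers are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "model": one weight matrix standing in for billions of parameters.
W = rng.normal(size=(4, 4))

# Synthetic "domain dataset": inputs X and the outputs we want the model to produce.
X = rng.normal(size=(32, 4))
Y = X @ rng.normal(size=(4, 4))

initial_loss = np.mean((X @ W - Y) ** 2)

lr = 0.1
for step in range(500):
    grad = X.T @ ((X @ W) - Y) / len(X)  # gradient of the mean-squared error
    W -= lr * grad                       # full fine-tuning: EVERY weight is updated

final_loss = np.mean((X @ W - Y) ** 2)
print(f"loss: {initial_loss:.3f} -> {final_loss:.6f}")
```

At GPT-3 scale, the same loop would touch all 175 billion parameters on every step, which is exactly the cost problem the methods in the next section attack.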

2. LoRA and QLoRA: Making Fine-Tuning Accessible

Classical fine-tuning updates all of a model’s parameters. LoRA (Low-Rank Adaptation) freezes the model’s original weights and adds small “adapter” matrices, training only those. It’s like getting better performance from the same engine by installing a performance chip rather than completely rebuilding the engine.
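The mechanism can be sketched in a few lines of Python (NumPy). The matrix sizes and rank below are illustrative, but the structure matches the paper: the pretrained weight W is frozen, and only two small matrices A and B are trained:

```python
import numpy as np

d, r = 1024, 8                       # hidden size and LoRA rank (r << d)
alpha = 16                           # LoRA scaling hyperparameter
rng = np.random.default_rng(0)

W = rng.normal(size=(d, d))          # pretrained weight: frozen, never updated
A = rng.normal(size=(r, d)) * 0.01   # adapter "down" projection: trainable
B = np.zeros((d, r))                 # adapter "up" projection: starts at zero

def forward(x):
    # Original path plus the low-rank adapter path, scaled by alpha / r.
    return x @ W.T + (x @ A.T) @ B.T * (alpha / r)

full_params = W.size                 # what classical fine-tuning would train
lora_params = A.size + B.size        # what LoRA trains
print(f"full: {full_params:,}  LoRA: {lora_params:,}  "
      f"reduction: {full_params / lora_params:.0f}x")
```

Because B starts at zero, the adapter initially contributes nothing, so the model begins by behaving exactly like the pretrained one; training then moves only A and B. At real model sizes, applied across many layers, this structure is where the enormous reduction in trainable parameters comes from.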

In Microsoft’s original 2021 LoRA research, applying LoRA to GPT-3 175B reduced the number of trainable parameters by 10,000x, cut GPU memory requirements during training by 3x, and shrank the saved checkpoint from 350GB to 35MB. Quality was equivalent to full fine-tuning, sometimes even better. This paper has been cited over 26,000 times to date.

QLoRA took this a step further: it compressed normally 16-bit weights to 4-bit precision and applied LoRA on top. This enabled a 65-billion-parameter model to be fine-tuned on a single 48GB GPU in 24 hours. The resulting Guanaco model reached 99.3% of ChatGPT performance. What once required multiple expensive GPU clusters became possible with a single graphics card.
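A simplified sketch of the quantization half of the idea. Real QLoRA uses the NF4 data type with per-block scales and double quantization; the uniform 4-bit scheme below is only an illustration of the principle:

```python
import numpy as np

def quantize_4bit(w):
    # Map weights onto 16 integer levels in [-8, 7] with one shared scale.
    # (QLoRA itself uses NF4 with per-block scales; this is simplified.)
    scale = np.abs(w).max() / 7
    q = np.clip(np.round(w / scale), -8, 7).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
W = rng.normal(size=(256, 256)).astype(np.float32)  # a "pretrained" weight

q, scale = quantize_4bit(W)     # frozen base weights stored in 4 bits
W_hat = dequantize(q, scale)    # dequantized on the fly during forward passes

err = np.abs(W - W_hat).mean()
print(f"mean absolute quantization error: {err:.4f}")
```

Training then places full-precision LoRA adapters on top of the frozen 4-bit base: the memory-hungry part (the base weights) shrinks roughly 4x versus 16-bit storage, while the trainable part stays tiny.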

Another example: when Bloomberg trained a financial NLP model from scratch, it cost approximately $2.7 million. Researchers from Columbia and NYU Shanghai achieved similar financial sentiment analysis performance with LoRA applied to open-source models through the FinGPT project for under $300. That’s roughly 1/10,000th the cost. Moreover, with such inexpensive fine-tuning, FinGPT can be retrained weekly, keeping pace with market changes using up-to-date information.

Why it matters: LoRA and QLoRA (along with methods we didn’t cover like DoRA, LoRA-FA, and Unsloth) democratized fine-tuning, taking it from the exclusive domain of big-budget companies and making it accessible to university labs, startups, and individual developers. Today, over 30,000 LoRA adapters have been shared for just a single model (Flux.1) on HuggingFace. While a fully fine-tuned model output is about 11GB, a LoRA adapter is only 19MB in size.

3. RLHF and DPO: Teaching the Model the “Preferred Answer” Instead of Just the “Correct Answer”

Fine-tuning equips the model with new knowledge and skills. But what if the model knows the correct information yet presents it in a way that’s inappropriate, rude, verbose, or unhelpful? RLHF (Reinforcement Learning from Human Feedback) addresses exactly this problem: it teaches the model not “what to know” but “how to behave.” If fine-tuning is like training a specialist doctor, RLHF is like teaching that doctor bedside manner: how to communicate with patients.

How does this work? The process has three fundamental steps:

1. Collecting Human Feedback (Sampling): First, the model is asked various questions and requested to generate several different responses for each. Human evaluators then read these responses and rank them: “A is most helpful, then C, B is worst.” This creates a dataset of what people like and dislike.

2. Training the Reward Model: Since it’s impossible for humans to read and score millions of responses individually, a second AI model steps in: the Reward Model. It is trained on the human rankings and becomes a digital jury that simulates human preferences, predicting which responses humans will find more appropriate, polite, or helpful.

3. Optimization (Model Self-Improvement): Our main language model (LLM) starts generating different responses again. But this time, instead of asking humans, it asks the “digital jury” we just trained. The jury assigns a score (reward) based on the response’s tone and helpfulness. When the model receives a high score, it internalizes that style: “Great, I’m on the right track!” When it receives a low score, it moves away from that style. Like a player trying new strategies to achieve the highest score in a computer game, it adopts the most appropriate communication style through trial and error.
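Step 2 has a compact mathematical core. The reward model is typically trained with a pairwise (Bradley–Terry) loss: given the scores it assigns to the human-preferred response and to the rejected one, it is penalized whenever the preferred response does not score higher. A minimal sketch with made-up scores:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def reward_pair_loss(r_chosen, r_rejected):
    # Pairwise (Bradley-Terry) loss: small when the reward model scores
    # the human-preferred response above the rejected one, large otherwise.
    return -np.log(sigmoid(r_chosen - r_rejected))

# Hypothetical scores for two responses to the same prompt:
good_ordering = reward_pair_loss(r_chosen=2.0, r_rejected=-1.0)
bad_ordering = reward_pair_loss(r_chosen=-1.0, r_rejected=2.0)
print(f"correct ranking: {good_ordering:.3f}  violated ranking: {bad_ordering:.3f}")
```

In step 3, a reinforcement learning algorithm (PPO, in InstructGPT’s case) then optimizes the language model against this learned reward, with an extra penalty that keeps it from drifting too far from its original behavior.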

In OpenAI’s InstructGPT experiment, a 1.3 billion parameter model trained with RLHF was preferred by humans over the raw 175 billion parameter GPT-3. A model 100x smaller outperformed the larger model simply by being aligned with human preferences. Hallucination rates dropped from 41% to 21%.

DPO (Direct Preference Optimization) is a simpler, more stable alternative to RLHF. RLHF requires training a separate reward model; DPO skips this step by presenting both good and bad responses to the model directly and optimizing it by telling it to prefer the good response. HuggingFace’s Zephyr-7B model, trained with DPO, outperformed the 70 billion parameter LLaMA-2-Chat (trained with expensive RLHF) on the MT-Bench chat benchmark. Thus, a model 10x smaller was trained with zero human feedback and a total training cost of just $500.
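DPO’s core is a single loss. Instead of a learned reward, it compares how much the model being trained prefers the good answer over the bad one relative to a frozen reference copy of itself. A minimal sketch with made-up log-probabilities:

```python
import numpy as np

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    # logp_w / logp_l: policy log-probs of the preferred (w) and rejected (l) answers.
    # ref_*: the same log-probs under the frozen reference model.
    # beta controls how far the policy may drift from the reference.
    margin = (logp_w - ref_logp_w) - (logp_l - ref_logp_l)
    return -np.log(1.0 / (1.0 + np.exp(-beta * margin)))

# Policy has shifted toward the preferred answer (relative to the reference):
low = dpo_loss(logp_w=-5.0, logp_l=-20.0, ref_logp_w=-10.0, ref_logp_l=-12.0)
# Policy has shifted toward the rejected answer:
high = dpo_loss(logp_w=-12.0, logp_l=-5.0, ref_logp_w=-10.0, ref_logp_l=-12.0)
print(f"aligned: {low:.3f}  misaligned: {high:.3f}")
```

Minimizing this loss directly on preference pairs replaces both the reward model and the reinforcement learning loop, which is why DPO is simpler and more stable to train.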

Why it matters: RLHF and DPO are the technical foundation behind ChatGPT’s ability to “have conversations.” The model may already know the information; but alignment—ensuring the model behaves in line with its purpose and human expectations—is essential for presenting it in a useful, safe, and understandable way. Remember the Toolformer example from the previous article: a 6.7B model using tools competed with a 175B model without tools. RLHF/DPO tells a similar story: size isn’t what matters, alignment is.

4. Knowledge Distillation: Transferring the Big Brain’s Knowledge to the Small Brain

In Knowledge Distillation, a large “teacher” model transfers its knowledge to a smaller “student” model. The student learns not just by copying answers, but by imitating the teacher’s reasoning process. Fine-tuning trains the model on new, correct data prepared in advance; distillation trains it on the better responses of another model.
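The classic formulation (Hinton et al.’s soft labels) captures the “imitating the reasoning” idea: the student is trained to match the teacher’s full probability distribution over answers, softened by a temperature, rather than only the single top answer. A minimal sketch with made-up logits:

```python
import numpy as np

def softmax(logits, T=1.0):
    z = logits / T
    e = np.exp(z - z.max())
    return e / e.sum()

def distillation_loss(teacher_logits, student_logits, T=2.0):
    # KL divergence between the teacher's and student's softened
    # distributions: the student learns HOW the teacher weighs all
    # options, not just which one it picks.
    p = softmax(teacher_logits, T)
    q = softmax(student_logits, T)
    return np.sum(p * (np.log(p) - np.log(q)))

teacher = np.array([4.0, 1.0, 0.5])        # hypothetical logits over 3 tokens
close_student = np.array([3.8, 1.2, 0.4])  # nearly agrees with the teacher
far_student = np.array([0.5, 4.0, 1.0])    # prefers a different token

print(distillation_loss(teacher, close_student),
      distillation_loss(teacher, far_student))
```

The temperature T spreads probability over the teacher’s “second choices,” which carry information a hard correct/incorrect label would throw away.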

In Google’s 2023 “Distilling Step-by-Step” research, a 770 million parameter T5 model outperformed the 540 billion parameter PaLM on NLP benchmarks. That’s a 700x reduction in model size.

In 2025, DeepSeek took this story to the next level. The 1.5 billion parameter DeepSeek-R1-Distill-Qwen-1.5B, distilled from the 671-billion-parameter R1 model, outperformed GPT-4o and Claude 3.5 Sonnet on math benchmarks (28.9% on AIME 2024, 83.9% on MATH-500). A model small enough to fit on a smartphone left two commercial giants behind in mathematical reasoning. This research led DeepSeek to conclude that distillation from a powerful model (DeepSeek-R1) yields better results than training small models (Qwen2.5-1.5B) with large-scale reinforcement learning.

In February 2026, Anthropic announced it had detected “distillation attacks” by AI companies including DeepSeek, Moonshot, and MiniMax aimed at copying Claude’s capabilities. These companies allegedly generated millions of queries through tens of thousands of accounts, attempting to extract Claude’s coding, tool use, and advanced reasoning capabilities without authorization in order to train their own models.

Why it matters: Knowledge Distillation enables us to transfer the capabilities of large models to smaller, faster, and cheaper models. This makes AI runnable not just on cloud servers, but on phones, laptops, and edge devices.

External Support or Internal Update?

Two articles ago we saw the limitations of LLMs. In the previous article we examined approaches that overcome these limitations from the outside. In this article, we covered methods that update the model from within. So when should we use which?

In practice:

  • If you want to reduce hallucinations → RAG (bring reliable sources from outside)
  • If you need current or real-time data → Tool Use (call APIs, search the web)
  • If you have complex multi-step tasks → Agentic AI (plan, orchestrate)
  • If you want to specialize the model in a domain → Fine-tuning
  • If your fine-tuning budget is limited → LoRA / QLoRA / DoRA / Unsloth
  • If you want to improve model behavior and response quality → RLHF / DPO
  • If you want to transfer large model capabilities to a small model → Knowledge Distillation

External support is fast and flexible: it adds new capabilities without touching the model. Internal updates are permanent and deep: they change the model itself. Both can be used together, because one is not an alternative to the other—they are complementary.

In your own projects, do you use the model as-is, or have you tried customizing it with fine-tuning or LoRA?
