I Fine Tuned an Open Source Model and the Bhagavad Gita Explained It Better Than Any Paper

A personal experience told the way I actually lived it.

When I was fine tuning an open source model recently, I had one simple question stuck in my head:

“What does fine tuning really mean beyond the math?”

We see the equations everywhere. Matrices. Gradients. Low rank decomposition. But when you are sitting in front of your terminal watching a training loop run, the math alone does not tell the full story. Something deeper is happening, something very human.

This is the mental model that finally made it click for me.

It Started With a Graduate

Imagine someone who just finished their degree.

They spent years building something real: not just memorizing facts, but developing a way of thinking. Problem solving. Structured reasoning. A foundation that took enormous effort to build.

Now they join a company on their first day.

Nobody sits them down and says “Forget everything you learned. We are going to retrain you from scratch for this job.” That would be absurd. It would waste everything they worked for. Instead, something much more natural happens:

Their degree knowledge stays exactly where it is: intact, preserved, untouched.

They start learning the specific skills this job requires. New tools. New domain knowledge. New context. But all of it is being built on top of what they already know, not in place of it.

And they keep a notebook.

A small, practical notebook where they write down the things that are specific to this role. A client preference here. A workflow shortcut there. Notes that make them effective in this context, without needing to relearn how to think.

After six months on the job, they answer questions using both the deep foundation from their education, and the specific adaptations they have written in that notebook. The notebook is small. The degree is vast. But together they produce something neither could produce alone.

That is LoRA. That is exactly what it does.

Mapping the Analogy to LoRA

The Degree Is the Pre Trained Model

When a large language model is pre trained, it goes through something not entirely unlike that university education. It is exposed to an enormous amount of human knowledge: text, reasoning, language patterns, facts, concepts. And it develops something deep and general as a result.

That knowledge is baked into its weights. Every parameter in the model is part of that foundation. And when you want to use this model for a specific task (customer support, medical Q&A, legal document analysis, anything), your first instinct might be to retrain it.

Do not.

Updating every single weight in a large model is expensive in ways that are easy to underestimate. A 7 billion parameter model in half-precision takes about 14 gigabytes just to sit in memory. To actually train it with gradients and optimizer states you are looking at four times that. Most people simply do not have the hardware. And even if they did, there is a deeper problem: if you aggressively update all those weights, the model starts forgetting what it already knew. This is called catastrophic forgetting, and it is exactly what it sounds like.
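
A rough sketch of that arithmetic, assuming two bytes per parameter for the weights and an Adam-style optimizer (the exact overhead varies with precision choices, implementation, and activation memory):

n_params = 7e9                     # a 7 billion parameter model
bytes_per_param = 2                # half precision (FP16 / BF16)

weights_gb = n_params * bytes_per_param / 1e9   # ~14 GB just to hold the model
grads_gb   = weights_gb                         # one gradient per weight
optim_gb   = 2 * weights_gb                     # e.g. Adam keeps two extra buffers per weight
total_gb   = weights_gb + grads_gb + optim_gb   # ~56 GB before activations

print(f"weights: {weights_gb:.0f} GB, full fine-tuning: ~{total_gb:.0f} GB")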

The general intelligence you were trying to leverage starts eroding the moment you push too hard on specific training.

LoRA says: stop trying to rewrite the degree. Work with the notebook instead.

What LoRA Actually Does

LoRA (Low-Rank Adaptation) keeps the original model weights completely frozen. Not slowed down. Not partially updated. Frozen. The pre trained weights do not change by a single decimal point during fine tuning.

Instead, LoRA adds two small matrices alongside the existing weights. In the original paper these are called A and B. Together they represent the adaptation: the delta, the change, the notebook. During training, only these small matrices receive gradients. Only they get updated. The original model watches from the sidelines, untouched.

At inference time, the model uses both. The frozen weights carry the broad general knowledge. The small adapter matrices carry the task-specific adaptation. The answer comes from both working together, just like the graduate using their education and their notebook at the same time.

The reduction in trainable parameters is dramatic. A weight matrix that would normally require sixteen million numbers to update can be approximated with sixty five thousand. That is a reduction of over ninety-nine percent. Training becomes faster, cheaper, and far less likely to destroy what the model already knew.
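
Here is the arithmetic behind those numbers, assuming a 4096-by-4096 weight matrix and rank 8 (both assumptions, but they reproduce the figures above):

d_in = d_out = 4096                        # a typical attention projection size
rank = 8

full_update = d_in * d_out                 # parameters touched by full fine-tuning
lora_update = rank * d_in + d_out * rank   # parameters in the A and B notebook

print(f"full:  {full_update:,}")                        # 16,777,216
print(f"lora:  {lora_update:,}")                        # 65,536
print(f"saved: {1 - lora_update / full_update:.2%}")    # 99.61%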

What requires_grad = False Really Means

There is a line of code in PyTorch that I kept staring at while I was working through this:

import torch
import torch.nn as nn

# Freeze the degree: the pre-trained weights stop receiving gradients
for param in model.parameters():
    param.requires_grad = False

# The work notebook: two small matrices that actively learn.
# Following the LoRA paper, A starts as small random noise and B starts at zero,
# so the adapter contributes nothing until training writes something into it.
lora_A = nn.Parameter(torch.randn(rank, d_in) * 0.01)
lora_B = nn.Parameter(torch.zeros(d_out, rank))

# At inference: both working together
output = W @ input + lora_B @ (lora_A @ input)
#          ↑                      ↑
#    frozen degree           work notebook

Before I had the graduate analogy, this felt like a technical detail. After it, it felt obvious.

**requires_grad = False** means: this parameter will not learn anything new. It is frozen. It will not receive gradients. It will not be updated. It is the degree: the thing we are preserving.

The LoRA matrices A and B are created separately, and they have **requires_grad = True** by default. They are the notebook: the thing that is actively being written.

When the model produces an answer, it uses both. The frozen knowledge from the pre trained weights. The fresh adaptation from the learned matrices. Core knowledge plus contextual adaptation, combined.

output = W × input  +  (B × A) × input
           ↑                 ↑
      frozen degree      work notebook
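
A minimal training-time sketch under the same assumptions as the snippet above (a frozen base model plus the standalone lora_A and lora_B parameters): only the notebook is handed to the optimizer, so only the notebook can change.

import torch

# Only the notebook parameters go to the optimizer, so only they can be updated
optimizer = torch.optim.AdamW([lora_A, lora_B], lr=1e-4)

# The frozen base reports zero trainable parameters of its own
frozen = sum(p.numel() for p in model.parameters() if not p.requires_grad)
notebook = lora_A.numel() + lora_B.numel()
print(f"frozen base: {frozen:,} parameters, notebook: {notebook:,} parameters")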

This is not a hack or a workaround. It is an elegant solution to a real problem. You want to specialize something without damaging what made it valuable in the first place.

Why This Approach Feels Right

The more I worked with LoRA, the more I realized it solves a very human problem.

We do not grow by replacing ourselves. We grow by layering experience on top of what already exists. A surgeon who learns a new procedure does not forget how to perform surgery. A writer who adopts a new style does not lose their fundamental sense of craft. A professional who changes industries brings everything they have ever learned with them.

Models, it turns out, should work the same way.

The pre trained model carries something genuinely valuable: the result of processing more language and knowledge than any person could read in a lifetime. That foundation should be respected. Fine tuning should add, not erase.

LoRA makes this possible. And once you see it through that lens, the technical details fall into place naturally. The frozen weights make sense because you would never erase the degree. The small adapter matrices make sense because you would never try to cram a new specialty into every corner of someone’s brain. The notebook is small because it only needs to contain what is new.

Part Two: QLoRA, the Vishwarupa of Lord Krishna

After LoRA solved my training memory problem, I ran into a different wall entirely. Even with frozen weights and tiny adapters, just loading a large model into memory is its own challenge. A 70 billion parameter model in half-precision needs around 140 gigabytes before a single training step begins. That is beyond what most researchers have access to.

This is where QLoRA enters. And this is where I found an analogy that I did not expect, one that came not from computer science but from the Bhagavad Gita.

In the eleventh chapter, Arjuna asks Lord Krishna to reveal his true form.

Krishna grants this wish, and what Arjuna sees is the Vishwarupa: the Cosmic Universal Form. Infinite faces. Infinite arms. Every being in existence contained within. Every star, every god, every moment of time and space, all present simultaneously. It is everything. It is complete. It is the full 32-bit floating point representation of divine reality: every detail, every dimension, infinite precision.

And Arjuna, for all his courage and devotion, cannot bear it.

Not because he is weak, but because the human form was never designed to hold that level of detail. The mind, the eyes, the capacity for understanding: none of it was built to process the infinite at full resolution. Even with divine sight granted by Krishna himself, Arjuna trembles. He begs Krishna to return to his human form.

So Krishna compresses.

He folds the Vishwarupa back into the familiar form: the friend, the charioteer, the guide standing beside Arjuna on the battlefield. The same infinite wisdom is present. The same divine intelligence is intact. Nothing essential has been lost. But now it is expressed in a form that a human can stand beside, speak to, learn from, and actually work with.

That is quantization. That is QLoRA.

The Compression Is Not a Reduction

This is the insight that changes everything.

When QLoRA quantizes a model from 32-bit floating point down to 4-bit integers, it is not making the model less intelligent. It is not discarding the Vishwarupa. It is expressing the same knowledge in a form that the hardware, our modern-day Arjuna, can actually work with.

A 70 billion parameter model that needed 140 gigabytes now needs around 35. A 7 billion parameter model that needed 28 gigabytes now needs around 6. The same model. The same pretrained knowledge. The same frozen degree. Just stored in a representation that fits within the constraints of what we have available.

Vishwarupa (16-bit) → 140 GB → incomprehensible to hardware
Human Form (4-bit) → 35 GB → workable, trainable, reachable

Krishna did not become less divine by taking human form. The 4bit model does not become less knowledgeable by being quantized. The essence is preserved. Only the representation changes to something we can actually stand beside and learn from.

The Notebook Still Writes in Full Precision

Here is where the analogy deepens in a way I find extraordinary.

Even in human form, Krishna does not give Arjuna a simplified or reduced version of wisdom. The Bhagavad Gita, the teaching that flows from that compressed, accessible form, is delivered in complete, full, uncompromised clarity. Every verse. Every concept. Every layer of meaning. Full-precision teaching, from a compressed form.

This is exactly what happens with the LoRA adapters in QLoRA.

The base model, the frozen weights, the degree, the compressed Vishwarupa, is stored in 4-bit. But the LoRA adapter matrices A and B, the notebook, the active learning, remain in full 16-bit precision throughout training. The notebook is never compressed. The teaching is never diluted.
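
In code, this split usually shows up as configuration rather than anything hand-written. Here is a sketch of one common way to load the compressed base, using the Hugging Face transformers integration with bitsandbytes; the model name is a placeholder, and option defaults can drift between library versions.

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# The Vishwarupa, compressed: base weights stored as 4-bit NF4,
# while any computation on them still happens in BF16
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)
model = AutoModelForCausalLM.from_pretrained(
    "some-7b-model",                    # placeholder model name
    quantization_config=bnb_config,
)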

You cannot quantize what you are actively writing. The wisdom being transmitted must remain complete.

Base model weights → 4-bit (Vishwarupa, compressed for accessibility)
LoRA adapters A, B → BF16 (the Gita, full-precision teaching)

When the model produces an answer, it dequantizes the base weights temporarily, combines them with the full-precision adapter output, and delivers something that draws from both the compressed cosmic knowledge and the full-precision specific adaptation.
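
To make the "dequantize temporarily, then combine" step concrete, here is a toy simulation. It uses naive absmax rounding to sixteen levels rather than the real NF4 scheme, and tiny made-up dimensions, but the shape of the computation is the same: coarse storage, full-precision arithmetic at the moment of use.

import torch

d_out, d_in, rank = 64, 64, 8
W = torch.randn(d_out, d_in)                     # pretend this is a base weight

# Toy 4-bit-style storage: round onto 16 integer levels with one scale per tensor
scale = W.abs().max() / 7
W_q = torch.clamp((W / scale).round(), -8, 7).to(torch.int8)

# Full-precision notebook (freshly initialized, so its contribution starts at zero)
lora_A = torch.randn(rank, d_in) * 0.01
lora_B = torch.zeros(d_out, rank)

x = torch.randn(d_in)

W_deq = W_q.float() * scale                      # dequantize temporarily
output = W_deq @ x + lora_B @ (lora_A @ x)       # compressed knowledge + precise adaptation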

Arjuna received the full Gita from a form he could comprehend. We receive full-quality outputs from a model our hardware can hold.

What This Means in Practice

The numbers become meaningful once you feel the analogy underneath them.

Full Fine-Tuning → ~56 GB VRAM → hardware breaks, Arjuna trembles
LoRA (BF16) → ~16 GB VRAM → manageable, notebook approach
QLoRA (4-bit) → ~6 GB VRAM → Krishna in human form, fully reachable

A 7 billion parameter model that previously required server-grade hardware with 56 gigabytes of memory can now be fine tuned on a single consumer GPU with 6 gigabytes. The same model. The same intelligence. A form we can actually work with.

This is not a compromise. This is wisdom expressed accessibly.

What I Wish Someone Had Told Me

Looking back at when I started, a few things would have saved me significant confusion.

Fine-tuning is not retraining. These are different activities with different costs and different risks. The graduate does not go back to university every time they start a new role. Neither should your model.

The original weights never change in LoRA. W is frozen. Completely. Only A and B learn anything. The degree is not touched. Ever.

In QLoRA, the adapters are not quantized. This is the part most explanations get wrong. The base model is compressed, the Vishwarupa made accessible. But the adapters stay in full precision, the Gita stays complete. Never confuse the two.

Quantization does not reduce intelligence. It changes representation. Krishna was not less divine in human form. Your model is not less capable at 4-bit. The knowledge is the same. The storage is different.

Rank 8 is enough for most tasks. Start there. Adjust only if you have a specific reason.
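
As a concrete example of that last point, here is roughly what a rank-8 setup looks like with the peft library, assuming the model loaded in the earlier sketch; the target modules are an assumption and depend on the architecture you are tuning.

from peft import LoraConfig, get_peft_model

lora_config = LoraConfig(
    r=8,                                  # the rank: how thick the notebook is
    lora_alpha=16,
    target_modules=["q_proj", "v_proj"],  # which layers get a notebook (assumption)
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)   # base stays frozen; adapters are added
model.print_trainable_parameters()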

The Bigger Idea

When I finished these experiments, something stayed with me that felt larger than the technical details.

Both LoRA and QLoRA are fundamentally about the same thing: expressing knowledge in a form that the receiver can actually work with, without losing what makes that knowledge valuable.

The graduate does not lose their education by keeping a notebook. They make their education useful in a specific context.

Krishna does not lose his divinity by taking human form. He makes his wisdom reachable to someone who needed it on a battlefield.

The model does not lose its intelligence by being quantized or adapted. It becomes something we can actually stand beside, learn from, and work with on the hardware we have, for the tasks we need.

That is the philosophy underneath all of this.

Preserve wisdom. Express it accessibly.
Adapt with precision.
That is how both models and deities reach us.

Thanks
Sreeni Ramadorai
