
We got a real performance lift: Delta A = +0.263 (p < 0.0001).
But that result exposed a harder question:
Did the adapter learn how Tenacious writes, or just what repeated Tenacious-like samples looked like?
This post answers that at the mechanism level: cross-entropy token-by-token, LoRA gradient flow, and why low-diversity augmentation can make convergence look better than generalization.
1) What SFT cross-entropy actually optimizes
In autoregressive SFT, the model predicts the next token at each step.
Cross-entropy loss measures how much probability mass the model assigned to the correct next token.
So the objective is:
not “be honest,”
not “be cautious,”
not “be Tenacious,”
but: assign high probability to target tokens in the training distribution.
If your targets consistently reflect Tenacious behavior, style improves indirectly.
But the optimization target is still token prediction.
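To make that concrete, here is a minimal sketch of the per-token objective in plain PyTorch (function name and tensor shapes are illustrative, not taken from the training code):

```python
import torch
import torch.nn.functional as F

def per_token_cross_entropy(logits: torch.Tensor, input_ids: torch.Tensor) -> torch.Tensor:
    # logits: (batch, seq, vocab); input_ids: (batch, seq)
    # Shift so the model's output at position t is scored against token t+1.
    shift_logits = logits[:, :-1, :]
    shift_labels = input_ids[:, 1:]
    loss = F.cross_entropy(
        shift_logits.reshape(-1, shift_logits.size(-1)),
        shift_labels.reshape(-1),
        reduction="none",  # keep one loss value per predicted token
    )
    # Each entry is -log p(correct next token): the only thing SFT optimizes.
    return loss.view(shift_labels.shape)
```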
2) How gradients flow in LoRA when base weights are frozen
For each adapted layer:
W = W0 + BA
W0 is frozen
only A and B are trainable
During backprop, gradients pass through the full forward graph, but updates only change A/B.
That means LoRA acts as a low-rank steering update on top of a fixed backbone.
Practical interpretation: you are not retraining the model’s full knowledge. You are learning a compact directional adjustment that shifts output tendencies.
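A minimal sketch of that structure (illustrative only; this is not the peft library's actual implementation, and rank/alpha are placeholders):

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Minimal LoRA wrapper: y = W0 x + (alpha / rank) * B A x."""
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)  # W0 (and bias) frozen: no updates
        # Standard LoRA init: A small random, B zero, so BA starts at zero.
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Gradients flow through both terms, but only A and B receive updates.
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)
```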
3) What your seven target modules imply
You adapted:
attention projections: q_proj, k_proj, v_proj, o_proj
feed-forward projections: gate_proj, up_proj, down_proj
A useful diagnostic lens:
Attention-heavy updates often correlate with better context routing (e.g., weak signal -> interrogative phrasing).
MLP-heavy updates often correlate with lexical/phrase-shape adaptation (which can be desired style—or shortcut memorization).
This is why module-level gradient norms matter. Without them, “it improved” is under-explained.
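For reference, selecting those seven modules looks roughly like this with the Hugging Face peft library (r and lora_alpha here are placeholders, not your run's actual hyperparameters):

```python
from peft import LoraConfig

config = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",   # attention projections
        "gate_proj", "up_proj", "down_proj",      # feed-forward projections
    ],
    task_type="CAUSAL_LM",
)
```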
4) Why low diversity is a gradient problem, not just a datasheet warning
Your datasheet states that 94.3% of training pairs are augmented variants of only 128 originals.
That has direct optimization consequences.
Near-duplicate examples repeatedly produce highly aligned gradient directions.
Cross-entropy rewards those repeated token patterns quickly. Training loss falls. Metrics can rise.
But this can represent two different realities:
Generalizable policy learning (what you want)
Surface-pattern reinforcement (what you fear)
Cross-entropy alone cannot tell which one happened.
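One way to see the alignment effect directly, sketched under the assumption of a Hugging Face-style model(**batch).loss interface: backprop two examples separately and compare the cosine similarity of their trainable-parameter gradients. Augmentation siblings tend to score close to 1.0.

```python
import torch
import torch.nn.functional as F

def lora_grad_vector(model, batch) -> torch.Tensor:
    # Flattened gradient over trainable (LoRA) params for a single batch.
    model.zero_grad()
    loss = model(**batch).loss  # HF causal-LM convention; adapt as needed
    loss.backward()
    return torch.cat([p.grad.flatten() for p in model.parameters()
                      if p.requires_grad and p.grad is not None])

def gradient_alignment(model, batch_a, batch_b) -> float:
    # Cosine similarity near 1.0 means the two examples push the adapter
    # in nearly the same direction.
    g_a = lora_grad_vector(model, batch_a)
    g_b = lora_grad_vector(model, batch_b)
    return F.cosine_similarity(g_a, g_b, dim=0).item()
```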
5) Why your Delta A is real but not sufficient on its own
A statistically strong Delta A means the adapter improved on your evaluation distribution.
It does not automatically prove robust style generalization out-of-family.
The defensible claim is:
“The adapter improved predictive behavior on measured data; generalization vs memorization requires additional diagnostics.”
That is stronger science and better engineering.
6) Minimal diagnostics to separate style learning from memorization
A) Grouped holdout by original family
Do not split augmentation siblings across train/held-out.
Keep all variants of one original together in one split.
Stable performance on grouped holdout -> stronger evidence of true style learning
Large drop -> evidence of augmentation-family memorization
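A sketch of that split, assuming each pair records the ID of the original it was augmented from (the family_id field is a hypothetical name; use whatever your datasheet actually stores):

```python
from sklearn.model_selection import GroupShuffleSplit

def grouped_split(examples, test_size=0.2, seed=0):
    # Group by the original each pair was augmented from, so all
    # augmentation siblings of one original land on the same side.
    groups = [ex["family_id"] for ex in examples]
    splitter = GroupShuffleSplit(n_splits=1, test_size=test_size, random_state=seed)
    train_idx, test_idx = next(splitter.split(examples, groups=groups))
    return [examples[i] for i in train_idx], [examples[i] for i in test_idx]
```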
B) Gradient norm breakdown by LoRA module
Log gradient norms for LoRA params and aggregate by:
q/k/v/o
gate/up/down
This doesn’t “prove style” alone, but it makes your mechanism claim concrete: where did training pressure concentrate?
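A logging sketch, assuming peft-style parameter names like "...q_proj.lora_A.default.weight" (check against your own model's named_parameters):

```python
from collections import defaultdict

ATTN = ("q_proj", "k_proj", "v_proj", "o_proj")
MLP = ("gate_proj", "up_proj", "down_proj")

def lora_grad_norms_by_module(model) -> dict:
    # Call after loss.backward(); accumulates squared norms per module type.
    sq_norms = defaultdict(float)
    for name, param in model.named_parameters():
        if "lora" not in name or param.grad is None:
            continue
        for module in ATTN + MLP:
            if module in name:
                sq_norms[module] += param.grad.norm().item() ** 2
    return {m: sq ** 0.5 for m, sq in sq_norms.items()}
```

Comparing the attention total against the MLP total then gives you a first-order read on where the adapter's training pressure went.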
7) Practical conclusion for FDE fine-tuning work
This issue generalizes to any narrow, augmented SFT project (sales writing, summarization, code style, domain formatting):
loss convergence is necessary,
benchmark gain is valuable,
but neither alone proves intended behavior learning.
If you want to claim “learned policy,” add grouped holdout and module-level gradient diagnostics as standard evidence.
Final takeaway
Your LoRA adapter likely learned a useful steering update.
But with heavy augmentation concentration, the safest conclusion is:
“We improved next-token policy on this distribution; we are validating whether that policy generalizes beyond augmentation families.”
That framing is honest, technically grounded, and production-defensible.