The paper introduces SDFT (self-distillation fine-tuning), a method that uses in-context learning to generate on-policy training signals from demonstrations, reducing catastrophic forgetting without requiring a reward function.
Key Takeaways
Standard SFT trains on off-policy targets, which drives catastrophic forgetting; SDFT instead conditions the model on the demonstration and uses it as its own teacher, so the training targets stay on-policy (see the sketch after this list).
SDFT requires no explicit reward functions, unlike RL-based continual learning approaches, making it practical for real fine-tuning pipelines.
In sequential learning experiments, a single model accumulated multiple skills over time with no performance regression on prior tasks.
SDFT outperformed SFT on both new-task accuracy and forgetting reduction across skill learning and knowledge acquisition benchmarks.
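To make the mechanism concrete, here is a minimal sketch of the self-distillation step described above: the model is prompted with the demonstration in context and asked to reproduce the answer in its own words, and those on-distribution rewrites replace the raw demonstrations in an otherwise standard SFT loop. The checkpoint name, template wording, and helper names below are illustrative assumptions, not the paper's released code.

```python
# Hedged sketch of SDFT's self-distillation step, assuming a Hugging
# Face-style causal LM. All names here are illustrative.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-chat-hf"  # placeholder checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Hypothetical distillation template: condition the model on the
# demonstration so its rewrite stays close to its own distribution.
DISTILL_TEMPLATE = (
    "Below is an instruction and a reference answer. Answer the "
    "instruction in your own words, using the reference as a guide.\n\n"
    "Instruction: {instruction}\n"
    "Reference answer: {response}\n"
    "Your answer:"
)

def generate_distilled(instruction: str, response: str) -> str:
    """Use the model as its own teacher: sample an on-policy rewrite
    of the demonstration response."""
    prompt = DISTILL_TEMPLATE.format(instruction=instruction, response=response)
    inputs = tokenizer(prompt, return_tensors="pt")
    output = model.generate(
        **inputs, max_new_tokens=512, do_sample=True, top_p=0.9
    )
    # Decode only the newly generated tokens, not the prompt.
    return tokenizer.decode(
        output[0, inputs["input_ids"].shape[1]:], skip_special_tokens=True
    )

# The resulting (instruction, distilled_response) pairs are then used as
# targets in an otherwise standard supervised fine-tuning loop.
```

Because the targets come from the model's own sampling distribution, the gradient updates resemble on-policy learning without the reward model that RL-based approaches would need.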
Hacker News Comment Review
One commenter flagged that the paper’s language (“enable,” “establishing”) overstates certainty and questioned how broadly the results generalize.