The paper introduces SDFT (self-distillation fine-tuning), a method that uses in-context learning to generate on-policy training signals from demonstrations, reducing catastrophic forgetting without requiring a reward function.
Key Takeaways
Standard SFT trains on off-policy targets, which drives catastrophic forgetting; SDFT instead conditions the model on the demonstration and uses it as its own teacher, so the training targets stay on-policy (see the sketch after this list).
SDFT requires no explicit reward functions, unlike RL-based continual learning approaches, making it practical for real fine-tuning pipelines.
In sequential learning experiments, a single model accumulated multiple skills over time with no performance regression on prior tasks.
SDFT outperformed SFT on both new-task accuracy and forgetting reduction across skill learning and knowledge acquisition benchmarks.
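To make the mechanism concrete, here is a minimal sketch of the self-distillation step described above: the model is prompted with the demonstration in context and asked to reproduce the answer in its own words, and those on-distribution rewrites replace the raw demonstrations in an otherwise standard SFT loop. The checkpoint name, template wording, and helper names below are illustrative assumptions, not the paper's released code.

```python
# Hedged sketch of SDFT's self-distillation step, assuming a Hugging
# Face-style causal LM. All names here are illustrative.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-chat-hf"  # placeholder checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Hypothetical distillation template: condition the model on the
# demonstration so its rewrite stays close to its own distribution.
DISTILL_TEMPLATE = (
    "Below is an instruction and a reference answer. Answer the "
    "instruction in your own words, using the reference as a guide.\n\n"
    "Instruction: {instruction}\n"
    "Reference answer: {response}\n"
    "Your answer:"
)

def generate_distilled(instruction: str, response: str) -> str:
    """Use the model as its own teacher: sample an on-policy rewrite
    of the demonstration response."""
    prompt = DISTILL_TEMPLATE.format(instruction=instruction, response=response)
    inputs = tokenizer(prompt, return_tensors="pt")
    output = model.generate(
        **inputs, max_new_tokens=512, do_sample=True, top_p=0.9
    )
    # Decode only the newly generated tokens, not the prompt.
    return tokenizer.decode(
        output[0, inputs["input_ids"].shape[1]:], skip_special_tokens=True
    )

# The resulting (instruction, distilled_response) pairs are then used as
# targets in an otherwise standard supervised fine-tuning loop.
```

Because the targets come from the model's own sampling distribution, the gradient updates resemble on-policy learning without the reward model that RL-based approaches would need.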
Hacker News Comment Review
One commenter flagged that the paper’s language (“enable,” “establishing”) overstates certainty and questioned how broadly the results generalize.