On the Surprising Efficacy of Distillation as an Alternative to Pre-Training Small Models
arXiv (2024)
Abstract
In this paper, we propose that small models may not need to absorb the cost
of pre-training to reap its benefits. Instead, they can capitalize on the
astonishing results achieved by modern, enormous models to a surprising degree.
We observe that, when distilled on a task from a pre-trained teacher model, a
small model can match or surpass the performance it would achieve if it were
pre-trained and then finetuned on that task. To allow this phenomenon to be easily
leveraged, we establish a connection reducing knowledge distillation to modern
contrastive learning, opening two doors: (1) vastly different model
architecture pairings can work for the distillation, and (2) most contrastive
learning algorithms rooted in the theory of Noise Contrastive Estimation can be
readily applied. We demonstrate this paradigm using pre-trained teacher
models from open-source model hubs, Transformer- and convolution-based model
combinations, and a novel distillation algorithm that massages the
Alignment/Uniformity perspective of contrastive learning by Wang & Isola (2020)
into a distillation objective. We choose this flavor of contrastive learning
due to its low computational cost, an overarching theme of this work. We also
observe that this phenomenon tends not to occur if the task is data-limited.
However, this can be alleviated by leveraging yet another scale-inspired
development: large, pre-trained generative models for dataset augmentation.
Again, we use an open-source model, and our rudimentary prompts are sufficient
to boost the small model's performance. Thus, we highlight a training method
for small models that is up to 94% cheaper than the standard pre-training
paradigm without sacrificing performance. For practitioners discouraged from
fully utilizing modern foundation datasets for their small models due to the
prohibitive scale, we believe our work keeps that door open.
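As a concrete illustration of how the Alignment/Uniformity perspective of Wang & Isola (2020) might be massaged into a distillation objective, the sketch below treats the teacher and student embeddings of the same input as the positive pair for the alignment term and regularizes the student embeddings with the uniformity term. This is a minimal sketch under assumed conventions (a PyTorch setup, L2-normalized projection heads, an illustrative weight `lam`), not the paper's exact formulation.

```python
# Minimal sketch (not the paper's exact objective): adapting the
# Alignment/Uniformity losses of Wang & Isola (2020) into a distillation loss.
import torch
import torch.nn.functional as F

def align_loss(z_student, z_teacher, alpha=2):
    # Alignment: pull the student embedding toward the teacher embedding
    # of the same input (the "positive pair" in the distillation setting).
    return (z_student - z_teacher).norm(p=2, dim=1).pow(alpha).mean()

def uniform_loss(z, t=2):
    # Uniformity: encourage embeddings to spread over the unit hypersphere
    # via the log of the mean Gaussian potential over all pairs in the batch.
    return torch.pdist(z, p=2).pow(2).mul(-t).exp().mean().log()

def distill_loss(student_feats, teacher_feats, lam=1.0):
    # Project both views onto the unit hypersphere before applying the losses.
    zs = F.normalize(student_feats, dim=1)
    zt = F.normalize(teacher_feats, dim=1)
    return align_loss(zs, zt) + lam * uniform_loss(zs)

# Hypothetical usage with any architecture pairing, e.g. a convolutional
# student and a Transformer teacher whose projection heads share an
# embedding dimension:
#   loss = distill_loss(student_head(student(x)), teacher_head(teacher(x)).detach())
```

Note that the uniformity term only needs pairwise distances within the student batch, which keeps the extra compute small, in line with the low-computational-cost theme emphasized above.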