Lory: Fully Differentiable Mixture-of-Experts for Autoregressive Language Model Pre-training
arXiv (2024)
Abstract
Mixture-of-experts (MoE) models facilitate efficient scaling; however,
training the router network introduces the challenge of optimizing a
non-differentiable, discrete objective. Recently, a fully-differentiable MoE
architecture, SMEAR, was proposed (Muqeeth et al., 2023), which softly merges
experts in the parameter space; nevertheless, its effectiveness was only
demonstrated in downstream fine-tuning on classification tasks. In this paper,
we present Lory, the first approach that scales such architectures to
autoregressive language model pre-training. Lory introduces two key techniques:
(1) a causal segment routing strategy that achieves high efficiency for expert
merging operations while preserving the autoregressive nature of language
models; (2) a similarity-based data batching method that encourages expert
specialization by grouping similar documents in training instances. We
pre-train a series of Lory models on 150B tokens from scratch, with up to 32
experts and 30B (1.5B active) parameters. Experimental results show significant
performance gains over parameter-matched dense models on both perplexity
(+13.9%) and downstream tasks. Despite segment-level
routing, Lory models achieve competitive performance compared to
state-of-the-art MoE models with token-level routing. We further demonstrate
that the trained experts in Lory capture domain-level specialization without
supervision. Our work highlights the potential of fully-differentiable MoE
architectures for language model pre-training and advocates future research in
this area.
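The abstract names three techniques: SMEAR-style soft expert merging, causal segment routing, and similarity-based data batching. Below is a minimal PyTorch sketch of the first two, assuming a mean-pooled hidden state of the previous segment as router input and a zero vector for the first segment; all dimensions, names, and the gradient handling are illustrative assumptions, not the paper's exact design.

```python
# Hedged sketch of soft expert merging with causal segment routing.
# Layer sizes, the mean-pooled router input, and the first-segment
# handling are assumptions for illustration only.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SoftMergedFFN(nn.Module):
    """FFN whose weights are a router-weighted average of expert weights."""

    def __init__(self, d_model=64, d_ff=256, n_experts=4):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)
        # Expert parameter banks: (n_experts, d_model, d_ff) and the reverse.
        self.w_in = nn.Parameter(torch.randn(n_experts, d_model, d_ff) * 0.02)
        self.w_out = nn.Parameter(torch.randn(n_experts, d_ff, d_model) * 0.02)

    def forward(self, x, router_input):
        # Soft merging: build one weighted-average FFN per sequence, so the
        # routing decision stays fully differentiable (no discrete top-k).
        probs = F.softmax(self.router(router_input), dim=-1)   # (B, n_experts)
        w_in = torch.einsum("be,edf->bdf", probs, self.w_in)    # (B, d_model, d_ff)
        w_out = torch.einsum("be,efd->bfd", probs, self.w_out)  # (B, d_ff, d_model)
        h = F.gelu(torch.einsum("bsd,bdf->bsf", x, w_in))
        return torch.einsum("bsf,bfd->bsd", h, w_out)


def causal_segment_forward(ffn, x, seg_len):
    """Route each segment with a summary of the *previous* segment,
    so no token's routing weight depends on future tokens."""
    B, T, D = x.shape
    outputs = []
    router_input = x.new_zeros(B, D)  # first segment: uninformed routing (assumed)
    for start in range(0, T, seg_len):
        seg = x[:, start:start + seg_len]
        outputs.append(ffn(seg, router_input))
        # Mean-pooled hidden state of the current segment routes the next one
        # (whether gradients flow through this summary is an assumption).
        router_input = seg.mean(dim=1)
    return torch.cat(outputs, dim=1)


x = torch.randn(2, 32, 64)  # (batch, tokens, d_model)
ffn = SoftMergedFFN()
y = causal_segment_forward(ffn, x, seg_len=8)
print(y.shape)  # torch.Size([2, 32, 64])
```

Because the merged FFN weights are a softmax-weighted average of expert parameters, gradients reach the router directly, sidestepping the non-differentiable discrete routing objective; merging once per segment rather than per token keeps the number of merging operations small.

The similarity-based batching idea can likewise be sketched as a greedy nearest-neighbor ordering of documents, so that each training instance concatenates topically related documents. The embedding source and the greedy chaining strategy are placeholders, not the paper's exact procedure.

```python
# Toy sketch: order documents so adjacent ones are similar by cosine
# similarity; a training instance would then pack consecutive documents.
import numpy as np

def similarity_order(doc_embeddings):
    """Return a document order where adjacent docs are similar."""
    E = doc_embeddings / np.linalg.norm(doc_embeddings, axis=1, keepdims=True)
    remaining = set(range(len(E)))
    order = [remaining.pop()]  # start from an arbitrary document
    while remaining:
        idx = list(remaining)
        nxt = idx[int(np.argmax(E[idx] @ E[order[-1]]))]
        remaining.remove(nxt)
        order.append(nxt)
    return order
```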