Radial Networks: Dynamic Layer Routing for High-Performance Large Language Models
CoRR (2024)
Abstract
Large language models (LLMs) often struggle with strict memory, latency, and
power demands. To meet these demands, various forms of dynamic sparsity have
been proposed that reduce compute on an input-by-input basis. These methods
improve over static methods by exploiting the variance across individual
inputs, which has steadily grown with the exponential increase in training
data. Yet, the increasing depth within modern models, currently with hundreds
of layers, has opened opportunities for dynamic layer sparsity, which skips the
computation for entire layers. In this work, we explore the practicality of
layer sparsity by profiling residual connections and establish the relationship
between model depth and layer sparsity. For example, the residual blocks in the
OPT-66B model have a median contribution of 5%. We then take
advantage of this dynamic sparsity and propose Radial Networks, which perform
token-level routing between layers guided by a trained router module. These
networks can be used in a post-training distillation from sequential networks
or trained from scratch to co-learn the router and layer weights. They enable
scaling to larger model sizes by decoupling the number of layers from the
dynamic depth of the network, and their design allows for layer reuse. By
varying the compute token by token, they reduce the overall resources needed
for generating entire sequences. Overall, this leads to larger capacity
networks with significantly lower compute and serving costs for large language
models.
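
As a rough illustration of the residual-connection profiling described above, the sketch below measures how much each decoder block changes the residual stream for each token, using the ratio between the norm of a block's output delta and the norm of its incoming hidden state. It assumes a Hugging Face Transformers OPT checkpoint; facebook/opt-125m is a stand-in here because OPT-66B needs multi-GPU serving, and the exact contribution metric used in the paper may differ.

    # Sketch: per-token contribution of each decoder block to the residual stream.
    # The specific metric (norm ratio of the block's delta) is an assumption.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_name = "facebook/opt-125m"  # stand-in for OPT-66B in this sketch
    tok = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)
    model.eval()

    inputs = tok("Dynamic layer sparsity skips entire layers.", return_tensors="pt")
    with torch.no_grad():
        # output_hidden_states=True returns the residual stream before and after
        # every decoder block, so each block's delta can be read off directly.
        out = model(**inputs, output_hidden_states=True)

    hidden = out.hidden_states  # tuple of (num_layers + 1) tensors [1, seq, dim]
    for i in range(len(hidden) - 1):
        delta = hidden[i + 1] - hidden[i]                      # what block i added
        ratio = delta.norm(dim=-1) / hidden[i].norm(dim=-1)    # per-token contribution
        print(f"block {i:2d}: median contribution {ratio.median().item():.2%}")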
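The routing idea itself can be sketched as a small PyTorch module: a per-token router scores every available layer plus an "exit" action, and each token hops through layers until it exits or a hop budget is reached, with layer reuse allowed. The names RadialNetwork, Router, and max_hops are illustrative assumptions rather than the paper's API, and greedy argmax routing stands in for however the trained router is actually applied at inference time.

    # Minimal sketch of token-level layer routing with layer reuse (assumed names).
    import torch
    import torch.nn as nn

    class Router(nn.Module):
        """Scores each candidate layer (plus an exit action) for every token."""
        def __init__(self, d_model: int, num_layers: int):
            super().__init__()
            self.proj = nn.Linear(d_model, num_layers + 1)  # last logit = exit

        def forward(self, x):            # x: [batch, seq, d_model]
            return self.proj(x)          # logits: [batch, seq, num_layers + 1]

    class RadialNetwork(nn.Module):
        """Applies up to max_hops routed layers per token; layers may repeat."""
        def __init__(self, d_model: int, num_layers: int, max_hops: int):
            super().__init__()
            self.layers = nn.ModuleList(
                nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
                for _ in range(num_layers)
            )
            self.router = Router(d_model, num_layers)
            self.max_hops = max_hops

        def forward(self, x):
            for _ in range(self.max_hops):
                choice = self.router(x).argmax(dim=-1)   # greedy routing per token
                if (choice == len(self.layers)).all():   # every token chose "exit"
                    break
                # Run each layer on the full sequence and gather per-token outputs;
                # a real implementation would batch tokens by their chosen layer.
                outputs = torch.stack([layer(x) for layer in self.layers], dim=0)
                routed = torch.gather(
                    outputs,
                    0,
                    choice.clamp(max=len(self.layers) - 1)
                          .unsqueeze(0).unsqueeze(-1)
                          .expand(1, *x.shape),
                ).squeeze(0)
                exited = (choice == len(self.layers)).unsqueeze(-1)
                x = torch.where(exited, x, routed)       # exited tokens keep x as-is
            return x

    # Usage: 4 physical layers, but tokens may take up to 8 hops (layer reuse).
    net = RadialNetwork(d_model=64, num_layers=4, max_hops=8)
    print(net(torch.randn(2, 10, 64)).shape)             # torch.Size([2, 10, 64])

Because the hop budget is decoupled from the number of physical layers, the same four layers above can serve tokens needing anywhere from zero to eight blocks of compute, which is the capacity-versus-cost trade-off the abstract describes.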