Temporal Scaling Law for Large Language Models
arXiv (2024)
Abstract
Recently, Large Language Models (LLMs) have been widely adopted across a broad
range of tasks, leading to increasing attention towards research on how scaling
LLMs affects their performance. Existing works, termed Scaling Laws, have
discovered that the loss of LLMs scales as a power law with model size,
computational budget, and dataset size. However, the performance of LLMs
throughout the training process remains largely unexplored. In this paper, we propose
the novel concept of Temporal Scaling Law and study the loss of LLMs from the
temporal dimension. We first investigate the imbalance of loss across token
positions and develop a reciprocal law that holds across model scales and
training stages.
We then derive the temporal scaling law by studying the temporal patterns of
the reciprocal-law parameters. Results on both in-distribution (IID) data and
out-of-distribution (OOD) data demonstrate that our temporal scaling law
accurately predicts the performance of LLMs in future training stages.
Moreover, the temporal scaling law reveals that LLMs learn uniformly across
different token positions, despite the loss imbalance. Experiments on
pre-training LLMs at various scales show that this finding supports the
default training paradigm for generative language models, in which no
re-weighting strategies are applied during training. Overall, the temporal
scaling law provides deeper insight into LLM pre-training.
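
To make the two-stage procedure described in the abstract concrete, below is a
minimal, hypothetical sketch: fit a reciprocal law over token positions at each
observed checkpoint, then model the temporal trend of the fitted parameters and
extrapolate them to predict the loss at a future training stage. The functional
forms used here (a per-position loss L(i) ≈ a/(i+b) + c and a power-law trend
over training steps), along with the synthetic data and all parameter choices,
are illustrative assumptions; the paper's exact parameterizations may differ.

```python
# Illustrative sketch only: the functional forms below are assumptions,
# not the paper's exact parameterizations.
import numpy as np
from scipy.optimize import curve_fit


def reciprocal_law(i, a, b, c):
    # Assumed per-token-position loss: L(i) = a / (i + b) + c.
    return a / (i + b) + c


def temporal_trend(t, alpha, beta, gamma):
    # Assumed power-law evolution of a fitted parameter over training steps t.
    return alpha * t ** (-beta) + gamma


positions = np.arange(1, 513)                  # token positions 1..512
checkpoints = np.array([1.0, 2.0, 4.0, 8.0])   # observed steps (in thousands)
rng = np.random.default_rng(0)


def synthetic_losses(t):
    # Stand-in for measured per-position losses; flattens as training proceeds.
    a, b, c = 8.0 * t ** -0.3, 5.0, 2.0 + 2.0 * t ** -0.4
    return reciprocal_law(positions, a, b, c) + rng.normal(0.0, 0.01, positions.size)


# Stage 1: fit the reciprocal law independently at each observed checkpoint.
params = np.array([
    curve_fit(reciprocal_law, positions, synthetic_losses(t), p0=(1.0, 1.0, 1.0))[0]
    for t in checkpoints
])

# Stage 2: fit the temporal trend of each reciprocal-law parameter (a, b, c).
trends = [
    curve_fit(temporal_trend, checkpoints, params[:, k],
              p0=(1.0, 0.5, 1.0), maxfev=10_000)[0]
    for k in range(params.shape[1])
]

# Extrapolate the parameters to a future step and predict the mean loss there.
t_future = 32.0  # thousands of steps, beyond any observed checkpoint
a_f, b_f, c_f = (temporal_trend(t_future, *tr) for tr in trends)
predicted = reciprocal_law(positions, a_f, b_f, c_f).mean()
print(f"predicted mean loss at step {t_future:.0f}k: {predicted:.3f}")
```

In this sketch, forecasting future performance reduces to ordinary curve
fitting: once the parameter trends are pinned down on early checkpoints, no
further training is needed to estimate later-stage loss.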