Towards Fast Inference: Exploring and Improving Blockwise Parallel Drafts
CoRR (2024)
Abstract
Despite the remarkable strides made by autoregressive language models, their
potential is often hampered by the slow inference speeds inherent in sequential
token generation. Blockwise parallel decoding (BPD) was proposed by Stern et
al. (2018) as a way to improve inference speed of language models. In this
paper, we make two contributions to understanding and improving BPD drafts. We
first offer an analysis of the token distributions produced by the BPD
prediction heads. Secondly, we use this analysis to inform algorithms to
improve BPD inference speed by refining the BPD drafts using small n-gram or
neural language models. We empirically show that these refined BPD drafts yield
a higher average verified prefix length across tasks.
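
The "verified prefix length" metric mentioned above can be illustrated with a minimal sketch. It assumes greedy verification, in which a drafted token is accepted only if it equals the token the base model would itself choose at that position, and acceptance stops at the first mismatch. The function names and the integer token-ID representation are hypothetical choices for illustration and are not taken from the paper.

```python
from typing import Iterable, Sequence, Tuple


def verified_prefix_length(draft: Sequence[int], base_choices: Sequence[int]) -> int:
    """Count the leading draft tokens that match the base model's own choices.

    Assumption: `base_choices[i]` is the token the base model would pick at
    position i given the preceding draft tokens (greedy verification).
    """
    length = 0
    for drafted, verified in zip(draft, base_choices):
        if drafted != verified:
            break  # the first mismatch ends the accepted prefix
        length += 1
    return length


def average_verified_prefix_length(
    pairs: Iterable[Tuple[Sequence[int], Sequence[int]]]
) -> float:
    """Average the verified prefix length over (draft, base_choices) pairs."""
    lengths = [verified_prefix_length(d, b) for d, b in pairs]
    return sum(lengths) / len(lengths) if lengths else 0.0
```

Under these assumptions, a longer average verified prefix means more tokens are accepted per model call, which is why refining the drafts can translate into faster inference.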