Decoding Speculative Decoding
arXiv (2024)
Abstract
Speculative Decoding is a widely used technique to speed up inference for
Large Language Models (LLMs) without modifying their output. When performing
inference on an LLM, speculative decoding uses a smaller draft model which
generates speculative tokens and then uses the target LLM to verify those draft
tokens. The speedup provided by speculative decoding heavily depends on the
choice of the draft model. It has been widely suggested to select a draft model
that provides a high probability of the generated token being accepted by the
LLM to achieve the highest throughput. However, our experiments indicate the
contrary: throughput diminishes as the probability of generated tokens being
accepted by the target model increases. To understand this phenomenon, we
perform extensive experiments to characterize the different factors that affect
speculative decoding and how those factors interact and affect the speedups.
Based on our experiments, we describe an analytical model which can be used to
decide the right draft model for a given workload. Further, using our insights,
we design a new draft model for LLaMA-65B which can provide 30% higher
throughput than existing draft models.
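
The trade-off described in the abstract can be illustrated with a minimal sketch. It is not the paper's analytical model. It assumes the standard expected-accepted-tokens formula from the speculative decoding literature, an independent per-token acceptance probability `alpha`, and a fixed lookahead `k`. The function names and the latency figures (`t_draft`, `t_target`) are hypothetical values chosen only to show how a draft model with a higher acceptance rate but higher latency can still yield lower throughput.

```python
def expected_tokens_per_step(alpha: float, k: int) -> float:
    """Expected tokens produced per verification step.

    Assumes each of the k draft tokens is accepted independently with
    probability alpha, and that the target model always contributes one
    token (the correction or the bonus token).
    """
    if alpha >= 1.0:
        return float(k + 1)
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)


def throughput(alpha: float, k: int, t_draft: float, t_target: float) -> float:
    """Tokens per unit time: k sequential draft passes plus one target pass."""
    step_latency = k * t_draft + t_target
    return expected_tokens_per_step(alpha, k) / step_latency


# Hypothetical numbers (arbitrary time units), not measurements from the paper.
small_draft = throughput(alpha=0.6, k=4, t_draft=1.0, t_target=20.0)
large_draft = throughput(alpha=0.8, k=4, t_draft=6.0, t_target=20.0)

print(f"small draft: {small_draft:.4f} tokens per time unit")
print(f"large draft: {large_draft:.4f} tokens per time unit")
```

Under these assumed latencies, the smaller draft model comes out ahead despite its lower acceptance rate, because the cost of the `k` sequential draft passes dominates the gain in accepted tokens.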