Accelerating ViT Inference on FPGA through Static and Dynamic Pruning
CoRR (2024)
Abstract
Vision Transformers (ViTs) have achieved state-of-the-art accuracy on various
computer vision tasks. However, their high computational complexity prevents
them from being applied to many real-world applications. Weight and token
pruning are two well-known methods for reducing complexity: weight pruning
reduces the model size and associated computational demands, while token
pruning further dynamically reduces the computation based on the input.
Combining these two techniques should significantly reduce computational
complexity and model size; however, naively integrating them results in
irregular computation patterns, leading to significant accuracy drops and
difficulties in hardware acceleration.
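The abstract gives no code; as a rough, hypothetical illustration of the two pruning styles (the function names, block size, and keep ratios below are assumptions, not the paper's implementation), a minimal PyTorch-style sketch:

```python
import torch

def block_prune_weights(w: torch.Tensor, block: int = 16, keep_ratio: float = 0.5):
    """Static structured block pruning (illustrative sketch, not the paper's method).

    Zeroes out whole block x block tiles of a weight matrix whose norm falls
    below the keep threshold, so the sparsity stays hardware-friendly.
    """
    rows, cols = w.shape
    assert rows % block == 0 and cols % block == 0
    tiles = w.reshape(rows // block, block, cols // block, block)
    norms = tiles.norm(dim=(1, 3))                       # one norm per tile
    k = max(1, int(norms.numel() * keep_ratio))
    thresh = norms.flatten().kthvalue(norms.numel() - k + 1).values
    mask = (norms >= thresh).float()[:, None, :, None]   # keep the strongest tiles
    return (tiles * mask).reshape(rows, cols)

def prune_tokens(x: torch.Tensor, scores: torch.Tensor, keep_ratio: float = 0.7):
    """Dynamic token pruning (illustrative): keep the top-scoring tokens per input.

    x: (batch, num_tokens, dim); scores: (batch, num_tokens), e.g. attention
    received by each token. The kept set differs per input, which is exactly
    what makes the resulting computation pattern irregular.
    """
    k = max(1, int(x.shape[1] * keep_ratio))
    idx = scores.topk(k, dim=1).indices                  # per-sample token choice
    return torch.gather(x, 1, idx.unsqueeze(-1).expand(-1, -1, x.shape[-1]))
```

Note how the static mask is fixed after training while the token subset changes with every input; superimposing the two is what yields the irregular patterns discussed above.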
Addressing the above challenges, we propose a comprehensive
algorithm-hardware codesign for accelerating ViT on FPGA through simultaneous
pruning, combining static weight pruning and dynamic token pruning. For
algorithm design, we systematically combine a hardware-aware structured
block-pruning method for pruning model parameters and a dynamic token pruning
method for removing unimportant token vectors. Moreover, we design a novel
training algorithm to recover the model's accuracy. For hardware design, we
develop a novel hardware accelerator for executing the pruned model. The
proposed hardware design employs multi-level parallelism with a load-balancing
strategy to efficiently handle the irregular computation patterns introduced by
the two pruning approaches. Moreover, we develop an efficient hardware
mechanism for executing on-the-fly token pruning.
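The abstract does not specify the load-balancing strategy; as a loose sketch under assumptions (names and structure are hypothetical), a longest-processing-time greedy assignment shows one way irregular per-row work could be spread evenly across processing elements:

```python
def balance_rows(row_nnz: list[int], num_pes: int) -> list[list[int]]:
    """Greedy load balancing across processing elements (illustrative sketch).

    After block weight pruning, different rows of a layer carry different
    numbers of surviving blocks (row_nnz). Assigning each row to the currently
    least-loaded PE keeps per-PE work roughly equal, which is one simple way
    to cope with the irregular computation pattern described above.
    """
    loads = [0] * num_pes
    assignment: list[list[int]] = [[] for _ in range(num_pes)]
    # Process the heaviest rows first so large items do not unbalance the tail.
    for row in sorted(range(len(row_nnz)), key=lambda r: -row_nnz[r]):
        pe = min(range(num_pes), key=loads.__getitem__)
        assignment[pe].append(row)
        loads[pe] += row_nnz[row]
    return assignment
```

An actual FPGA scheduler would make this decision in hardware and per inference, since token pruning changes the workload on the fly, but the balancing objective is the same.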