WSQ-AdderNet: Efficient Weight Standardization Based Quantized AdderNet FPGA Accelerator Design with High-Density INT8 DSP-LUT Co-Packing Optimization.

Yunxiang Zhang,Biao Sun,Weixiong Jiang,Yajun Ha,Miao Hu,Wenfeng Zhao

ICCAD（2022）

引用 1|浏览36

暂无评分

摘要

Convolutional neural networks (CNNs) have been widely adopted for various machine intelligence tasks. Nevertheless, CNNs are still known to be computational demanding due to the convolutional kernels involving expensive Multiply-ACcumulate (MAC) operations. Recent proposals on hardware-optimal neural network architectures suggest that AdderNet with a lightweight ℓ 1 -norm based feature extraction kernel can be an efficient alternative to the CNN counterpart, where the expensive MAC operations are substituted with efficient Sum-of-Absolute-Difference (SAD) operations. Nevertheless, it lacks an efficient hardware implementation methodology for AdderNet as compared to the existing methodologies for CNNs, including efficient quantization, full-integer accelerator implementation, and judicious resource utilization of DSP slices of FPGA devices. In this paper, we present WSQ-AdderNet, a generic framework to quantize and optimize AdderNet-based accelerator designs on embedded FPGA devices. First, we propose a weight standardization technique to facilitate weight quantization in AdderNet. Second, we demonstrate a full-integer quantization hardware implementation strategy, including weight and activation quantization methodologies. Third, we apply DSP packing optimization to maximize the DSP utilization efficiency, where Octo-INT8 can be achieved via DSP-LUT co-packing. Finally, we implement the design using Xilinx Vitis HLS (high-level synthesis) and Vivado to Xilinx Kria KV-260 FPGA. Our experimental results of ResNet-20 using WSQ-AdderNet demonstrate that the implementations achieve 89.9% inference accuracy with INT8 implementation, which shows little performance loss as compared to the FP32 and INT8 CNN designs. At the hardware level, WSQ-AdderNet achieves up to 3.39× DSP density improvement with nearly the same throughput as compared to INT8 CNN design. The reduction in DSP utilization makes it possible to deploy large network models on resource-constrained devices. When further scaling up the PE sizes by 39.8%, WSQ-AdderNet can achieve 1.48× throughput improvement while still achieving 2.42× DSP density improvement.

查看译文

关键词

CNN,AdderNet,model quantization,DSP packing,SIMD,FPGA neural network accelerators

AI 理解论文

溯源树

样例

生成溯源树，研究论文发展脉络

Chat Paper

正在生成论文摘要