SySMOL: Co-designing Algorithms and Hardware for Neural Networks with Heterogeneous Precisions
arXiv (2023)
Abstract
Recent quantization techniques have enabled heterogeneous precisions at very
fine granularity, e.g., each parameter/activation can take on a different
precision, resulting in compact neural networks without sacrificing accuracy.
However, there is a lack of efficient architectural support for such networks,
which require additional hardware to decode the precision settings for
individual variables, align the variables, and provide fine-grained
mixed-precision compute capabilities. The complexity of these operations
introduces high overheads. Thus, the improvements in inference latency/energy
of these networks are not commensurate with the compression ratio, and may be
inferior to those of larger quantized networks with uniform precision.
We present an end-to-end co-design approach encompassing computer
architecture, training algorithm, and inference optimization to efficiently
execute networks with fine-grained heterogeneous precisions. The key to our
approach is a novel training algorithm designed to accommodate hardware
constraints and inference operation requirements, outputting networks with
input-channel-wise heterogeneous precisions and at most three precision levels.
Combined with inference optimization techniques, existing architectures with
low-cost enhancements can support such networks efficiently, yielding optimized
tradeoffs between accuracy, compression ratio and inference latency/energy.
We demonstrate the efficacy of our approach across CPU and GPU architectures.
For various representative neural networks, our approach achieves >10x
improvements in both compression ratio and inference latency, with negligible
degradation in accuracy compared to full-precision networks.
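
To make "input-channel-wise heterogeneous precisions with at most three precision levels" concrete, the following is a minimal sketch and not the paper's training algorithm. It assumes a 2-D weight matrix of shape (out_channels, in_channels), symmetric uniform quantization, and bit-widths that are already given rather than learned. The function names and the bit-widths 2, 4 and 8 are hypothetical choices for illustration only.

```python
import numpy as np


def quantize_per_input_channel(weights, bits_per_channel):
    """Fake-quantize each input channel (column) with its own bit-width.

    Assumptions (not from the paper): symmetric uniform quantization,
    one scale per input channel, bit-widths supplied by the caller.
    """
    weights = np.asarray(weights, dtype=np.float64)
    bits = np.asarray(bits_per_channel)
    if bits.shape != (weights.shape[1],):
        raise ValueError("need exactly one bit-width per input channel")
    if np.unique(bits).size > 3:
        raise ValueError("at most three precision levels are allowed")
    if np.any(bits < 2):
        raise ValueError("this symmetric scheme needs at least 2 bits")

    quantized = np.empty_like(weights)
    for c in range(weights.shape[1]):
        column = weights[:, c]
        qmax = 2 ** (int(bits[c]) - 1) - 1  # largest integer code, e.g. 7 for 4 bits
        max_abs = np.max(np.abs(column))
        scale = max_abs / qmax if max_abs > 0 else 1.0
        codes = np.clip(np.round(column / scale), -qmax, qmax)
        quantized[:, c] = codes * scale  # dequantized value for inspection
    return quantized


def compression_ratio(bits_per_channel, full_precision_bits=32):
    """Ratio of full-precision storage to quantized storage.

    Ignores per-channel scales and precision metadata, so it is an upper bound.
    """
    return full_precision_bits / float(np.mean(bits_per_channel))
```

Because every weight in a given input channel shares one precision, the precision setting only needs to be decoded once per channel rather than once per variable, which is the kind of regularity the abstract says hardware can exploit cheaply.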