Exploring HW/SW Co-Design for Video Analysis on CPU-FPGA Heterogeneous Systems

IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems(2022)

引用 15|浏览26
暂无评分
摘要
Deep neural network (DNN)-based video analysis has become one of the most essential and challenging tasks to capture implicit information from video streams. Although DNNs significantly improve the analysis quality, they introduce intensive compute and memory demands and require dedicated hardware for efficient processing. The customized heterogeneous system is one of the promising solutions with general-purpose processors (CPUs) and specialized processors (DNN Accelerators). Among various heterogeneous systems, the combination of CPU and FPGA has been intensively studied for DNN inference with improved latency and energy consumption compared to CPU + GPU schemes and with increased flexibility and reduced time-to-market cost compared to CPU + ASIC designs. However, deploying DNN-based video analysis on CPU + FPGA systems still presents challenges from the tedious RTL programming, the intricate design verification, and the time-consuming design space exploration. To address these challenges, we present a novel framework, called EcoSys, to explore co-design and optimization opportunities on CPU-FPGA heterogeneous systems for accelerating video analysis. Novel technologies include 1) a coherent memory space shared by the host and the customized accelerator to enable efficient task partitioning and online DNN model refinement with reduced data transfer latency; 2) an end-to-end design flow that supports high-level design abstraction and allows rapid development of customized hardware accelerators from Python-based DNN descriptions; 3) a design space exploration (DSE) engine that determines the design space and explores the optimized solutions by considering the targeted heterogeneous system and user-specific constraints; and 4) a complete set of co-optimization solutions, including a layer-based pipeline, a feature map partition scheme, and an efficient memory hierarchical design for the accelerator and multithreading programming for the CPU. In this article, we demonstrate our design framework to accelerate the long-term recurrent convolution network (LRCN), which analyzes the input video and output one semantic caption for each frame. EcoSys can deliver 314.7 and 58.1 frames/s by targeting the LRCN model with AlexNet and VGG-16 backbone, respectively. Compared to the multithreaded CPU and pure FPGA design, EcoSys achieves $20.6\times $ and $5.3\times $ higher throughput performance.
更多
查看译文
关键词
Acceleration,coherent accelerator processor interface (CAPI),deep neural network (DNN),heterogeneous system,HW/SW co-design
AI 理解论文
溯源树
样例
生成溯源树,研究论文发展脉络
Chat Paper
正在生成论文摘要