ROLLER: Fast and Efficient Tensor Compilation for Deep Learning

Hongyu Zhu, Ruofan Wu, Yijia Diao, Shanbin Ke, Haoyu Li, Chen Zhang, Jilong Xue, Lingxiao Ma, Yuqing Xia, Wei Cui, Fan Yang, Mao Yang, Lidong Zhou, Asaf Cidon, Gennady Pekhimenko

USENIX Symposium on Operating Systems Design and Implementation (OSDI), 2022

Citations: 36 | Views: 108

Abstract

Despite recent advances in tensor compilers, it often takes hours to generate an efficient kernel for an operator, a compute-intensive sub-task in a deep neural network (DNN), on various accelerators (e.g., GPUs). This significantly slows down DNN development cycles and places a heavy burden on the development of general kernel libraries and custom kernels, especially for new hardware vendors. The slow compilation process is due to the large search space formulated by existing DNN compilers, which have to use machine learning algorithms to find good solutions. In this paper, we present ROLLER, which takes a different, construction-based approach to generating kernels. At the core of ROLLER is rTile, a new tile abstraction that encapsulates tensor shapes that align with the key features of the underlying accelerator, thus achieving efficient execution by limiting the shape choices. ROLLER then adopts a recursive rTile-based construction algorithm to generate rTile-based programs (rProgram), whose performance can be evaluated efficiently with a micro-performance model, without running them on a real device. As a result, ROLLER can generate efficient kernels in seconds, with performance comparable to state-of-the-art solutions on popular accelerators like GPUs, while offering better kernels on newer accelerators like IPUs.
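
To make the contrast between construction and search concrete, here is a minimal Python sketch of the general idea, not ROLLER's actual algorithm. It grows a matrix-multiplication tile in hardware-aligned steps and scores each candidate with a simple analytic data-reuse estimate instead of running it on a device. The names `Tile` and `construct_tile`, the alignment step of 16, the capacity of 8192 elements, and the reuse formula are all assumptions made for illustration; none of them come from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tile:
    """Hypothetical tile of an M x N x K matrix multiplication."""
    m: int
    n: int
    k: int

    def footprint(self) -> int:
        # Elements held in fast memory: A tile (m x k), B tile (k x n), C tile (m x n).
        return self.m * self.k + self.k * self.n + self.m * self.n

    def reuse(self) -> float:
        # Multiply-accumulates per element loaded for the A and B tiles.
        return (self.m * self.n * self.k) / (self.m * self.k + self.k * self.n)


def construct_tile(shape, align=(16, 16, 16), capacity=8192):
    """Grow an aligned tile step by step, guided only by an analytic estimate.

    `align` and `capacity` stand in for hardware features (assumed values).
    """
    limits = tuple(shape)
    tile = Tile(*align)
    while True:
        candidates = []
        for axis, (step, limit) in enumerate(zip(align, limits)):
            dims = [tile.m, tile.n, tile.k]
            dims[axis] += step  # only aligned shapes are ever considered
            grown = Tile(*dims)
            if dims[axis] <= limit and grown.footprint() <= capacity:
                candidates.append(grown)
        # Keep growing only while the estimated data reuse strictly improves.
        better = [c for c in candidates if c.reuse() > tile.reuse()]
        if not better:
            return tile
        tile = max(better, key=Tile.reuse)


if __name__ == "__main__":
    print(construct_tile((1024, 1024, 1024)))
```

Because every step only considers aligned shapes and picks the best one by a closed-form estimate, the number of candidates examined stays small, which is the intuition behind generating kernels in seconds rather than hours.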
