SIP: Autotuning GPU Native Schedules via Stochastic Instruction Perturbation
CoRR (2024)
Abstract
Large language models (LLMs) have become a significant workload since their
appearance. However, they are also computationally expensive as they have
billions of parameters and are trained with massive amounts of data. Thus,
recent works have developed dedicated CUDA kernels for LLM training and
inference instead of relying on compiler-generated ones, so that hardware
resources are as fully utilized as possible. In this work, we explore the
possibility of GPU native instruction optimization to further push the CUDA
kernels to extreme performance. Contrary to prior works, we adopt an automatic
optimization approach by defining a search space of possible GPU native
instruction schedules, and then we apply stochastic search to perform
optimization. Experiments show that SIP can further improve CUDA kernel
throughput by automatically discovering better GPU native instruction schedules,
and the optimized schedules are validated against 10 million test samples.
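
The idea of searching a space of instruction schedules stochastically can be illustrated with a minimal sketch. This is not SIP's implementation: the instruction names, latencies, the in-order cost model, and the annealing-style acceptance rule below are all assumptions made for illustration, and a real system would measure kernel time on the GPU rather than use a toy model.

```python
import math
import random

# Hypothetical toy kernel: name -> (latency in cycles, names it depends on).
INSTRS = {
    "ld_a": (4, set()),
    "ld_b": (4, set()),
    "mul": (1, {"ld_a", "ld_b"}),
    "ld_c": (4, set()),
    "add": (1, {"mul", "ld_c"}),
    "st": (1, {"add"}),
}


def is_valid(order):
    """A schedule is valid if every instruction comes after its dependencies."""
    seen = set()
    for name in order:
        if not INSTRS[name][1] <= seen:
            return False
        seen.add(name)
    return True


def cost(order):
    """Assumed cost model: one issue per cycle, stall until inputs are ready."""
    ready = {}
    cycle = 0
    for name in order:
        latency, deps = INSTRS[name]
        issue = max([cycle] + [ready[d] for d in deps])
        ready[name] = issue + latency
        cycle = issue + 1
    return max(ready.values())


def perturb(order, rng):
    """Perturbation: swap one randomly chosen pair of adjacent instructions."""
    i = rng.randrange(len(order) - 1)
    cand = list(order)
    cand[i], cand[i + 1] = cand[i + 1], cand[i]
    return cand


def stochastic_search(start, steps=2000, temp=2.0, cooling=0.995, seed=0):
    """Annealing-style search: always accept improvements, sometimes accept regressions."""
    rng = random.Random(seed)
    current, current_cost = list(start), cost(start)
    best, best_cost = list(current), current_cost
    for _ in range(steps):
        cand = perturb(current, rng)
        if not is_valid(cand):
            continue  # discard schedules that break a dependency
        c = cost(cand)
        if c <= current_cost or rng.random() < math.exp((current_cost - c) / temp):
            current, current_cost = cand, c
            if c < best_cost:
                best, best_cost = list(cand), c
        temp = max(temp * cooling, 1e-3)
    return best, best_cost


start = ["ld_a", "ld_b", "mul", "ld_c", "add", "st"]
best, best_cost = stochastic_search(start)
print(cost(start), best_cost, best)
```

In this toy model, hoisting a load above the arithmetic that does not depend on it hides its latency, which is the kind of reordering a schedule search can discover. The dependency check stands in for the correctness testing that a real schedule would need.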