Automatic Asynchronous Execution of Synchronously Offloaded OpenMP Target Regions

2022 IEEE/ACM Eighth Workshop on the LLVM Compiler Infrastructure in HPC (LLVM-HPC) (2022)

Abstract
The use of heterogeneous architectures has steadily increased during the past decade. However, non-homogeneous systems present a challenge to the programming model, as the execution models of the CPU and the accelerator might differ considerably. Since version 4.0, OpenMP has been trying to bridge this gap by allowing a code block to be offloaded to a target device. Among the additions to the OpenMP offloading API since then, probably the most notable is asynchronous execution between device and host. By default, offloaded regions are executed synchronously, so the host thread blocks until they complete. The nowait clause allows work to overlap between the host and the target device. However, nowait must be added manually by the user, along with the task's data dependencies and appropriate synchronization to avoid race conditions, which increases program complexity and developer burden. In this work, we present automatic asynchronous execution for OpenMP offloaded regions. By taking advantage of the distinct host and target data environments, we discover opportunities that allow them to overlap execution without any need for user intervention. We also describe the necessary changes in the LLVM/OpenMP runtime. We evaluate our implementation through multiple HPC proxy applications and well-known parallel benchmarks executed on GPUs. The measured performance can double for an ideal test case, while real applications exhibit speedups between 5% and 34%.
Keywords
LLVM, OpenMP, accelerator offloading, GPU, Asynchronicity
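
For readers unfamiliar with the manual pattern the abstract refers to, the following is a minimal sketch of a hand-written asynchronous offload using nowait. It is not taken from the paper: the function name `scale_and_sum`, its parameters, and the choice of workload are hypothetical and serve only to illustrate the clause, the dependence annotation, and the explicit synchronization the user must otherwise supply.

```c
#include <stddef.h>

/* Hypothetical example (not from the paper): scale x on the target device
 * while the host sums y, then synchronize before x is used again. */
double scale_and_sum(double *x, size_t n, double factor,
                     const double *y, size_t m)
{
    double host_sum = 0.0;

    /* nowait turns the target region into a deferred task, so the host
     * thread does not block here. depend(out: x[0]) records that this task
     * writes x, which would order it against later tasks that name x. */
    #pragma omp target teams distribute parallel for nowait \
        map(tofrom: x[0:n]) depend(out: x[0])
    for (size_t i = 0; i < n; ++i)
        x[i] *= factor;

    /* Host work that touches only y, so it can safely overlap with the
     * offloaded region above. */
    for (size_t j = 0; j < m; ++j)
        host_sum += y[j];

    /* Explicit synchronization: wait for the offloaded task to finish
     * before the caller reads x. Omitting this would be a data race. */
    #pragma omp taskwait

    return host_sum;
}
```

Compiled without OpenMP support, the pragmas are ignored and the function runs sequentially with the same result. The point of the sketch is that the nowait, depend, and taskwait lines are all the programmer's responsibility in the manual approach.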