Memory Transfer Decomposition: Exploring Smart Data Movement Through Architecture-Aware Strategies.

Diego A. Roa Perdomo, Rodrigo Ceccato, Rémy Neveu, Hervé Yviquel, Xiaoming Li, Jose Manuel Monsalve Diaz, Johannes Doerfert

SC-W '23: Proceedings of the SC '23 Workshops of The International Conference on High Performance Computing, Network, Storage, and Analysis (2023)

Abstract

Modern high-performance computing systems feature myriad compute units connected via a complex network of nonuniform links. While extensive connectivity can improve communication latency and bandwidth between components, it requires careful orchestration. However, heterogeneous programming models generally expose a flat view of the hardware in which all components are connected in a star topology through a uniform, non-descriptive link to the central processing unit. This discrepancy between the actual architecture and the simplified abstraction most often employed by programmers results in suboptimal utilization of the complex system interconnects. In this paper, we provide an automated framework that utilizes complex hardware links while preserving the simplified abstraction level for the user. By decomposing user-issued memory operations into architecture-aware subtasks, we automatically exploit generally underused connections of the system. Users can target these links directly, but doing so in a portable way is complex, machine-specific, and cumbersome. The operations we support include movement, distribution, and consolidation of memory across the node. For each of them, our AutoStrategizer framework proposes a task graph that transparently improves performance, in terms of latency or bandwidth, compared with naive strategies. For our evaluation, we integrated the AutoStrategizer as a C++ library into the LLVM-OpenMP runtime infrastructure. We demonstrate that some memory operations can be improved by a factor of 6× compared with naive versions. Integrated into LLVM/OpenMP, our AutoStrategizer accelerates cross-device memory movement by a factor of ≈2× for large transfers, resulting in a 4× decrease in end-to-end execution time for a scientific proxy application.
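To make the idea of decomposing a memory operation into architecture-aware subtasks more concrete, the following is a minimal sketch of one distribution (broadcast) operation planned two ways: a naive star-topology plan in which the host sends to every device in turn, and a tree-shaped plan in which devices that already hold the data forward it over peer links. This is not the AutoStrategizer's actual algorithm or API; the names `Transfer`, `naiveBroadcast`, `treeBroadcast`, and `roundsNeeded` are hypothetical, and the model assumes that every endpoint can issue only one transfer per round and that all links are equally fast.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical identifier for the host; devices are numbered 0..n-1.
constexpr int kHost = -1;

// One subtask of a decomposed memory operation: copy the buffer from
// `src` to `dst` during scheduling round `round`.
struct Transfer {
  int src;
  int dst;
  int round;
};

// Naive star-topology plan: the host sends to each device one after another.
std::vector<Transfer> naiveBroadcast(int numDevices) {
  std::vector<Transfer> plan;
  for (int d = 0; d < numDevices; ++d)
    plan.push_back({kHost, d, d});
  return plan;
}

// Tree-shaped plan (assumption: peer links exist between all devices).
// In every round, each endpoint that already holds the data sends it to
// one device that does not, so the number of holders doubles per round.
std::vector<Transfer> treeBroadcast(int numDevices) {
  std::vector<Transfer> plan;
  std::vector<int> holders{kHost};
  int next = 0;  // next device that still needs the data
  for (int round = 0; next < numDevices; ++round) {
    // Snapshot the sender count: devices added this round send next round.
    const std::size_t senders = holders.size();
    for (std::size_t i = 0; i < senders && next < numDevices; ++i) {
      plan.push_back({holders[i], next, round});
      holders.push_back(next);
      ++next;
    }
  }
  return plan;
}

// Number of rounds a plan occupies (0 for an empty plan).
int roundsNeeded(const std::vector<Transfer>& plan) {
  int rounds = 0;
  for (const Transfer& t : plan)
    if (t.round + 1 > rounds) rounds = t.round + 1;
  return rounds;
}
```

Under these simplifying assumptions, calling `roundsNeeded(naiveBroadcast(7))` and `roundsNeeded(treeBroadcast(7))` compares the two plans for seven devices. Both plans contain the same number of transfers; the tree-shaped one simply spreads them over links that the star-topology view leaves idle. A real planner would also need to account for the nonuniform link bandwidths and topology the abstract describes.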
