GPU Cluster Scheduling for Network-Sensitive Deep Learning

Aakash Sharma, Vivek M. Bhasi, Sonali Singh, George Kesidis, Mahmut T. Kandemir, Chita R. Das

CoRR (2024)

Abstract

We propose a novel GPU-cluster scheduler for distributed DL (DDL) workloads that enables proximity-based consolidation of GPU resources based on the DDL jobs' sensitivities to the anticipated communication-network delays. Our scheduler consists of three major components: (i) a classical delay scheduling algorithm to facilitate job placement and consolidation; (ii) a network-sensitive job preemption strategy; and (iii) an "auto-tuner" mechanism to optimize delay timers for effective delay scheduling. Additionally, to enable a cost-effective methodology for large-scale experiments, we develop a data-driven DDL cluster simulation platform. Employing the simulation platform, we compare against several state-of-the-art alternatives on real-world workload traces to demonstrate the benefits of our design. Our scheduler can provide an improvement of up to 69% relative to the prevailing consolidation-based scheduling methods, while reducing the average job completion time by up to 83% and overheads by up to 98%.
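
The abstract names classical delay scheduling as the basis for placement and consolidation. As a rough illustration of that general idea only, and not the authors' implementation, the Python sketch below lets a job wait up to a per-job delay timer for a consolidated (single-node) placement and falls back to a spread placement once the timer expires. The `Job` class, the `place` function, and the node names are hypothetical, and the single-node preference is an assumed stand-in for proximity.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Job:
    # Hypothetical job record; field names are illustrative assumptions.
    name: str
    gpus: int            # number of GPUs requested
    delay_timer: float   # max seconds to wait for a consolidated placement
    submitted_at: float  # submission time in seconds


def place(job: Job, free_gpus: Dict[str, int], now: float) -> Optional[Dict[str, int]]:
    """Return a {node: gpu_count} allocation, or None if the job should keep waiting."""
    # Preferred outcome: all GPUs on one node (consolidated placement).
    for node, free in sorted(free_gpus.items()):
        if free >= job.gpus:
            return {node: job.gpus}

    # No consolidated slot yet: keep waiting while the delay timer is running.
    if now - job.submitted_at < job.delay_timer:
        return None

    # Timer expired: accept a spread placement if the cluster has enough GPUs.
    if sum(free_gpus.values()) < job.gpus:
        return None
    allocation: Dict[str, int] = {}
    remaining = job.gpus
    # Fill from the nodes with the most free GPUs to limit fragmentation.
    for node, free in sorted(free_gpus.items(), key=lambda kv: -kv[1]):
        if remaining == 0:
            break
        take = min(free, remaining)
        if take > 0:
            allocation[node] = take
            remaining -= take
    return allocation
```

In this sketch, a longer `delay_timer` trades queueing time for a better chance of consolidation; choosing that value is the kind of knob the abstract's "auto-tuner" is described as optimizing.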
