GPU Cluster Scheduling for Network-Sensitive Deep Learning
CoRR (2024)
Abstract
We propose a novel GPU-cluster scheduler for distributed DL (DDL) workloads
that enables proximity-based consolidation of GPU resources according to the
DDL jobs' sensitivities to the anticipated communication-network delays. Our
scheduler consists of three major components: (i) a classical delay scheduling
algorithm to facilitate job placement and consolidation; (ii) a
network-sensitive job preemption strategy; and (iii) an "auto-tuner" mechanism
to optimize delay timers for effective delay scheduling. Additionally, to
enable a cost-effective methodology for large-scale experiments, we develop a
data-driven DDL cluster simulation platform. Using this platform, we compare
our scheduler against several state-of-the-art alternatives on real-world workload
traces to demonstrate the benefits of our design. Our scheduler can provide
improvement of up to 69% [...]
to the prevailing consolidation-based scheduling methods, while reducing the
average job completion time by up to 83% [...]
overheads by up to 98% [...]
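
To make component (i) concrete, the following is a minimal sketch of classical delay scheduling applied to GPU consolidation. It is an illustration under stated assumptions, not the authors' implementation: the `Job` fields, the two timers (`machine_timer`, `rack_timer`), the three placement tiers, and the `free` cluster map are all hypothetical names chosen for this example. The idea shown is that a job first waits for the most consolidated placement (one machine), and relaxes to a rack-level and then a cluster-wide placement only after its delay timers expire.

```python
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    gpus: int            # number of GPUs requested
    machine_timer: int   # rounds to wait for a single-machine placement (assumed knob)
    rack_timer: int      # extra rounds to wait for a single-rack placement (assumed knob)
    waited: int = 0      # scheduling rounds this job has been skipped so far


def take(need, slots):
    """Greedily pick GPUs from (location, free_count) pairs until `need` is met."""
    alloc = {}
    for key, n in slots:
        if need == 0:
            break
        k = min(n, need)
        if k > 0:
            alloc[key] = k
            need -= k
    return alloc


def place(job, free):
    """Offer resources to `job` once. `free` maps rack -> machine -> free GPUs.

    Returns (tier, allocation) on success, or None if the job keeps waiting.
    This sketch does not update `free`; the caller would do that.
    """
    # Tier 1: all GPUs on one machine is always acceptable.
    for rack, machines in free.items():
        for machine, n in machines.items():
            if n >= job.gpus:
                return "machine", {(rack, machine): job.gpus}
    # Tier 2: all GPUs within one rack, only once the machine timer has expired.
    if job.waited >= job.machine_timer:
        for rack, machines in free.items():
            if sum(machines.values()) >= job.gpus:
                slots = [((rack, m), n) for m, n in machines.items()]
                return "rack", take(job.gpus, slots)
    # Tier 3: GPUs anywhere in the cluster, only once both timers have expired.
    if job.waited >= job.machine_timer + job.rack_timer:
        slots = [((r, m), n) for r, ms in free.items() for m, n in ms.items()]
        if sum(n for _, n in slots) >= job.gpus:
            return "cluster", take(job.gpus, slots)
    # No acceptable placement yet: skip this round and keep waiting.
    job.waited += 1
    return None


if __name__ == "__main__":
    free = {"r0": {"m0": 2, "m1": 2}, "r1": {"m2": 1}}
    job = Job("a", gpus=4, machine_timer=2, rack_timer=1)
    for round_no in range(4):
        result = place(job, free)
        print(round_no, result)
        if result is not None:
            break
```

In this sketch, a larger timer trades queueing delay for a tighter placement. A network-sensitive job would be given longer timers than a job whose training speed barely depends on GPU proximity, which is the kind of per-job setting an auto-tuner could adjust.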