Stragglers in Distributed Matrix Multiplication

JOB SCHEDULING STRATEGIES FOR PARALLEL PROCESSING, JSSPP 2023(2023)

引用 0|浏览4
暂无评分
摘要
A delay in a single processor may affect an entire system since the slowest processor typically determines the runtime. Problems with such stragglers are often mitigated using dynamic load balancing or redundancy solutions such as task replication. Unfortunately, the former option incurs high communication cost, and the latter significantly increases the arithmetic cost and memory footprint, making high resource overhead seem inevitable. Matrix multiplication and other numerical linear algebra kernels typically have structures that allow better straggler management. Redundancy based solutions tailored for such algorithms often combine codes in the algorithm's structure. These solutions add fixed cost overhead and may perform worse than the original algorithm when little or no delays occur. We propose a new loadbalancing solution tailored for distributed matrix multiplication. Our solution reduces latency overhead by O (P/ log P) compared to existing dynamic load-balancing solutions, where P is the number of processors. Our solution overtakes redundancy-based solutions in all parameters: arithmetic cost, bandwidth cost, latency cost, memory footprint, and the number of stragglers it can tolerate. Moreover, our overhead costs depend on the severity of delays and are negligible when delays are minor. We compare our solution with previous ones and demonstrate significant improvements in asymptotic analysis and simulations: up to x4.4 and x5.3 compared to general-purpose dynamic load balancing and redundancy-based solutions, respectively.
更多
查看译文
关键词
Dynamic Load Balancing,Straggler Mitigation,Distributed Computing,Numerical Linear Algebra
AI 理解论文
溯源树
样例
生成溯源树,研究论文发展脉络
Chat Paper
正在生成论文摘要