Anticipatory Resource Allocation for ML Training

Proceedings of the 2023 ACM Symposium on Cloud Computing (SoCC 2023)

Abstract
Our analysis of a large public cloud ML training service shows that resources often remain unused, likely because users statically (over-)allocate resources for their jobs out of a desire for predictable performance, and state-of-the-art schedulers do not exploit idle resources lest they slow down some jobs excessively. We consider whether an anticipatory scheduler, which schedules based on predictions of future job arrivals and durations, can improve over the state of the art. We find that realizing gains from anticipation requires dealing effectively with prediction errors, and even the best predictors have errors that do not conform to simple models (such as bounded or i.i.d. error). We devise a novel anticipatory scheduler called SIA that is robust to such errors. On real workloads, SIA reduces job latency by an average of 2.83x over the current production scheduler, while reducing the likelihood of job slowdowns by orders of magnitude relative to schedulers that naively share resources.
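The core tension the abstract describes, between exploiting idle resources and not slowing down reserved jobs when arrival predictions are wrong, can be illustrated with a minimal sketch. This is not the paper's SIA algorithm; the function, parameter names, and numbers below are hypothetical, chosen only to show how a safety margin hedges a lending decision against prediction error.

```python
# Toy illustration (not SIA): decide whether to lend an idle GPU to a
# waiting backfill job, given a possibly erroneous prediction of when the
# next reserved job arrives. All names and values are hypothetical.

def lend_idle_gpu(now, backfill_duration, predicted_next_arrival,
                  error_margin):
    """Lend the GPU only if the backfill job is expected to finish before
    the predicted arrival of the next reserved job, padded by an error
    margin that hedges against misprediction."""
    return now + backfill_duration <= predicted_next_arrival - error_margin

# A naive sharer (margin 0) lends eagerly and risks slowing reserved jobs
# on a misprediction; a larger margin trades utilization for robustness.
print(lend_idle_gpu(now=0, backfill_duration=50,
                    predicted_next_arrival=60, error_margin=0))   # True
print(lend_idle_gpu(now=0, backfill_duration=50,
                    predicted_next_arrival=60, error_margin=20))  # False
```

The abstract's finding that prediction errors are neither bounded nor i.i.d. suggests that a fixed margin like this is too crude in practice, which is the gap a robust anticipatory scheduler must close.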
Keywords
Systems for machine learning, Machine learning for systems, Multi-tenancy in the cloud, Machine Learning-as-a-Service