Anticipatory Resource Allocation for ML Training

Proceedings of the 2023 ACM Symposium on Cloud Computing (SoCC 2023)

Abstract
Our analysis of a large public cloud ML training service shows that resources often remain unused, likely because users statically (over-)allocate resources for their jobs out of a desire for predictable performance, and state-of-the-art schedulers do not exploit idle resources lest they slow down some jobs excessively. We consider whether an anticipatory scheduler, which schedules based on predictions of future job arrivals and durations, can improve over the state of the art. We find that realizing gains from anticipation requires dealing effectively with prediction errors, and even the best predictors have errors that do not conform to simple models (such as bounded or i.i.d. error). We devise a novel anticipatory scheduler called SIA that is robust to such errors. On real workloads, SIA reduces job latency by an average of 2.83x over the current production scheduler, while reducing the likelihood of job slowdowns by orders of magnitude relative to schedulers that naively share resources.
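The core tension the abstract describes, between exploiting idle resources and not slowing down reserved jobs when arrival predictions are wrong, can be illustrated with a minimal sketch. This is not the paper's SIA algorithm; the function, parameter names, and numbers below are hypothetical, chosen only to show how a safety margin hedges a lending decision against prediction error.

```python
# Toy illustration (not SIA): decide whether to lend an idle GPU to a
# waiting backfill job, given a possibly erroneous prediction of when the
# next reserved job arrives. All names and values are hypothetical.

def lend_idle_gpu(now, backfill_duration, predicted_next_arrival,
                  error_margin):
    """Lend the GPU only if the backfill job is expected to finish before
    the predicted arrival of the next reserved job, padded by an error
    margin that hedges against misprediction."""
    return now + backfill_duration <= predicted_next_arrival - error_margin

# A naive sharer (margin 0) lends eagerly and risks slowing reserved jobs
# on a misprediction; a larger margin trades utilization for robustness.
print(lend_idle_gpu(now=0, backfill_duration=50,
                    predicted_next_arrival=60, error_margin=0))   # True
print(lend_idle_gpu(now=0, backfill_duration=50,
                    predicted_next_arrival=60, error_margin=20))  # False
```

The abstract's finding that prediction errors are neither bounded nor i.i.d. suggests that a fixed margin like this is too crude in practice, which is the gap a robust anticipatory scheduler must close.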
Keywords
Systems for machine learning, Machine learning for systems, Multi-tenancy in the cloud, Machine Learning-as-a-Service