Materialization and Reuse Optimizations for Production Data Science Pipelines

Proceedings of the 2022 International Conference on Management of Data (SIGMOD '22), 2022

Abstract
Many companies and businesses train and deploy machine learning (ML) pipelines to answer prediction queries. In many applications, new training data continuously becomes available. A typical approach to keeping ML models up to date is to retrain the pipelines on a schedule, e.g., every day on the last seven days of data. Several use cases, such as A/B testing and ensemble learning, require many pipelines to be deployed in parallel. Existing solutions train each pipeline separately, which results in redundant data processing. Our goal is to eliminate this redundancy using materialization and reuse optimizations. Our solution consists of two main parts. First, we propose a materialization algorithm that, given a storage budget, materializes the subset of pipeline artifacts that minimizes the run time of subsequent executions. Second, we design a reuse algorithm that generates an execution plan by combining the pipelines into a single directed acyclic graph (DAG) and reusing materialized artifacts where appropriate. Our experiments show that our solution can reduce training time by up to an order of magnitude across different deployment scenarios.
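To make the idea concrete, the sketch below is a minimal, hypothetical illustration (not the paper's actual algorithms) of the two ingredients the abstract describes: choosing which artifacts to materialize under a storage budget, and executing pipelines merged into a DAG while reusing materialized or already-computed artifacts. The artifact names, costs, and the greedy cost-per-size heuristic are assumptions made for illustration only.

```python
# Illustrative sketch only: a toy greedy materialization under a storage budget
# and artifact reuse across pipelines merged into a DAG. Names, costs, and the
# greedy heuristic are hypothetical; the paper's algorithms are more involved.
from dataclasses import dataclass, field

@dataclass
class Artifact:
    name: str
    compute_cost: float                  # time to recompute (hypothetical units)
    size: float                          # storage footprint (hypothetical units)
    inputs: list = field(default_factory=list)

def choose_materialization(artifacts, budget):
    """Greedily pick artifacts with the best cost-saved-per-unit-storage ratio."""
    ranked = sorted(artifacts, key=lambda a: a.compute_cost / a.size, reverse=True)
    chosen, used = set(), 0.0
    for a in ranked:
        if used + a.size <= budget:
            chosen.add(a.name)
            used += a.size
    return chosen

def execute(pipeline, materialized, cache):
    """Run one pipeline; skip artifacts that are cached or materialized."""
    total = 0.0
    for a in pipeline:                   # assumes topological order
        if a.name in cache:
            continue                     # shared node already computed this run
        if a.name not in materialized:
            total += a.compute_cost      # must recompute from scratch
        cache[a.name] = True             # materialized artifacts load "for free"
    return total

# Two pipelines sharing one preprocessing artifact; a shared cache simulates
# merging them into a single DAG before execution.
shared = Artifact("clean_features", compute_cost=50, size=10)
p1 = [shared, Artifact("model_a_train", 30, 5, ["clean_features"])]
p2 = [shared, Artifact("model_b_train", 40, 5, ["clean_features"])]

materialized = choose_materialization([shared], budget=10)
cache = {}
runtime = execute(p1, materialized, cache) + execute(p2, materialized, cache)
print(f"materialized: {materialized}, total recomputation time: {runtime}")
```

In this toy run, materializing the shared preprocessing artifact lets both pipelines skip its recomputation, which is the kind of cross-pipeline redundancy the paper targets.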
Keywords
machine learning pipeline, materialization, reuse