Jointly optimizing task granularity and concurrency for in-memory MapReduce frameworks

2017 IEEE International Conference on Big Data (Big Data)(2017)

Cited by 10 | Views 15
Abstract
Recently, in-memory big data processing frameworks such as Apache Spark and Ignite have emerged to accelerate workloads that require frequent data reuse. With effective in-memory caching, these frameworks eliminate most of the I/O operations that would otherwise be necessary for communication between producer and consumer tasks. However, this performance benefit is nullified if the memory footprint exceeds the available memory, due to excessive spill and garbage collection (GC) operations. To fit the working set in memory, two system parameters play an important role: the number of data partitions (N_partitions), which specifies task granularity, and the number of tasks per executor (N_threads), which specifies the degree of parallelism in execution. Existing approaches to optimizing these parameters either do not take workload characteristics into account or optimize only one of the parameters in isolation, thus yielding suboptimal performance. This paper introduces WASP, a workload-aware task scheduler and partitioner that jointly optimizes both parameters at runtime. To find an optimal setting, WASP first analyzes the DAG structure of a given workload and uses an analytical model to predict optimal settings of N_partitions and N_threads for all stages based on their computation types. Taking this as input, the WASP scheduler employs a hill-climbing algorithm to find an optimal N_threads for each stage, thus maximizing concurrency while minimizing data spills and GCs. We prototype WASP on Spark and evaluate it using six workloads on three different parallel platforms. WASP improves performance by up to 3.22x and reduces the cluster operating cost on cloud by up to 40% over a baseline following the Spark Tuning Guidelines, and it provides robust performance for both shuffle-heavy and shuffle-light workloads.
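The per-stage hill-climbing search described above can be sketched as follows. This is a minimal illustration, not WASP's implementation: the cost function here is a hypothetical stand-in for WASP's analytical model, which predicts stage time from computation type, spill, and GC behavior.

```python
def hill_climb_threads(stage_cost, max_threads, start=1):
    """Greedily adjust N_threads until no neighboring value lowers the cost.

    stage_cost: callable mapping a candidate N_threads to a predicted stage
    cost (in WASP, this role is played by its analytical model; here it is
    an arbitrary function supplied by the caller).
    """
    current = start
    best_cost = stage_cost(current)
    improved = True
    while improved:
        improved = False
        # Examine the two neighbors of the current setting.
        for candidate in (current - 1, current + 1):
            if 1 <= candidate <= max_threads:
                cost = stage_cost(candidate)
                if cost < best_cost:
                    current, best_cost = candidate, cost
                    improved = True
    return current

# Toy convex cost model (hypothetical): too few threads underutilize cores,
# too many increase spills and GC pressure.
cost = lambda n: (n - 6) ** 2 + 10
print(hill_climb_threads(cost, max_threads=16))  # -> 6
```

Because the trade-off between concurrency and memory pressure tends to yield a roughly unimodal cost curve, a local search like this converges quickly without exhaustively sweeping all thread counts.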
Keywords
data spills,Spark Tuning Guidelines,task granularity,concurrency,in-memory mapreduce frameworks,in-memory big data processing frameworks,Apache Spark,I/O operations,data partitions,workload-aware task scheduler,WASP scheduler,data reuse,cluster operating cost reduction,in-memory caching,garbage collection