Fast and Effective Distribution-Key Recommendation for Amazon Redshift.

Proc. VLDB Endow. (2020)

Abstract
How should we split data among the nodes of a distributed data warehouse in order to boost performance for a forecasted workload? In this paper, we study the effect of different data partitioning schemes on the overall network cost of pairwise joins. We describe a generally applicable data distribution framework initially designed for Amazon Redshift, a fully managed petabyte-scale data warehouse in the cloud. To formalize the problem, we first introduce the Join Multi-Graph, a concise graph-theoretic representation of the workload history of a cluster. We then formulate the "Distribution-Key Recommendation" problem, a novel combinatorial problem on the Join Multi-Graph, and relate it to problems studied in other subfields of computer science. Our theoretical analysis proves that "Distribution-Key Recommendation" is NP-complete and is hard to approximate efficiently. Thus, we propose BAW, a hybrid approach that combines heuristic and exact algorithms to find a good data distribution scheme. Our extensive experimental evaluation on real and synthetic data showcases the efficacy of our method in recommending optimal (or close to optimal) distribution keys, which improve cluster performance by reducing network cost by up to 32x in some real workloads.
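To make the problem statement concrete, the following is a minimal sketch of one plausible reading of it; it is not the paper's BAW algorithm. It assumes that each edge of the Join Multi-Graph connects two tables, is labeled with the pair of join columns, and carries a weight equal to the network cost avoided when both tables are distributed on those columns. It also assumes that each table receives exactly one distribution key. The table names, column names, weights, and helper functions are all hypothetical. The search is exhaustive, so it is only feasible for tiny graphs, which is consistent with the hardness result stated in the abstract.

```python
from itertools import product

# Hypothetical workload: (table_a, column_a, table_b, column_b, weight).
# The weight is an assumed network cost saved when both tables are
# distributed on the listed join columns (the join becomes co-located).
EDGES = [
    ("orders", "customer_id", "customers", "id", 10.0),
    ("orders", "product_id", "products", "id", 4.0),
    ("orders", "customer_id", "returns", "customer_id", 3.0),
    ("returns", "product_id", "products", "id", 5.0),
]


def candidate_keys(edges):
    """Collect, per table, every column that appears in some join."""
    keys = {}
    for ta, ca, tb, cb, _ in edges:
        keys.setdefault(ta, set()).add(ca)
        keys.setdefault(tb, set()).add(cb)
    return {table: sorted(cols) for table, cols in keys.items()}


def saved_cost(edges, assignment):
    """Sum the weights of edges whose two join columns are both chosen."""
    return sum(
        w
        for ta, ca, tb, cb, w in edges
        if assignment[ta] == ca and assignment[tb] == cb
    )


def best_assignment(edges):
    """Try every one-key-per-table assignment (exponential; toy sizes only)."""
    keys = candidate_keys(edges)
    tables = sorted(keys)
    best, best_score = None, -1.0
    for combo in product(*(keys[t] for t in tables)):
        assignment = dict(zip(tables, combo))
        score = saved_cost(edges, assignment)
        if score > best_score:
            best, best_score = assignment, score
    return best, best_score


if __name__ == "__main__":
    assignment, score = best_assignment(EDGES)
    print(assignment, score)
```

In this toy graph the two tables with more than one candidate key, `orders` and `returns`, cannot satisfy all of their joins at once, which is the tension the recommendation problem has to resolve.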