Analysis of Distributed Optimization Algorithms on a Real Processing-In-Memory System
arXiv (2024)
Abstract
Machine Learning (ML) training on large-scale datasets is a very expensive
and time-consuming workload. Processor-centric architectures (e.g., CPU, GPU)
commonly used for modern ML training workloads are limited by the data movement
bottleneck caused by repeatedly accessing the training dataset. As a result,
processor-centric systems suffer from performance degradation and high energy
consumption. Processing-In-Memory (PIM) is a promising solution to alleviate
the data movement bottleneck by placing the computation mechanisms inside or
near memory.
Our goal is to understand the capabilities and characteristics of popular
distributed optimization algorithms on real-world PIM architectures to
accelerate data-intensive ML training workloads. To this end, we 1) implement
several representative centralized distributed optimization algorithms on
UPMEM's real-world general-purpose PIM system, 2) rigorously evaluate these
algorithms for ML training on large-scale datasets in terms of performance,
accuracy, and scalability, 3) compare to conventional CPU and GPU baselines,
and 4) discuss implications for future PIM hardware and the need to shift to an
algorithm-hardware codesign perspective to accommodate decentralized
distributed optimization algorithms.
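To make the term concrete: in a centralized (parameter-server-style) scheme, each worker computes an update on its local shard of the data and a central coordinator aggregates those updates into one global model. The abstract does not name the specific algorithms evaluated, so the following is only a minimal sketch of one such scheme, gradient-averaging minibatch SGD on a least-squares objective in NumPy; the function names (local_gradient, centralized_sgd_step) are illustrative assumptions, and the simulated workers stand in for PIM cores only conceptually.

```python
# Illustrative sketch of centralized distributed optimization
# (gradient averaging); not the paper's actual implementation.
import numpy as np

def local_gradient(w, X_shard, y_shard):
    """Least-squares gradient computed on one worker's data shard."""
    residual = X_shard @ w - y_shard
    return X_shard.T @ residual / len(y_shard)

def centralized_sgd_step(w, shards, lr=0.01):
    """Each worker computes a local gradient; the host averages them
    and applies a single global update (parameter-server style)."""
    grads = [local_gradient(w, X, y) for X, y in shards]
    return w - lr * np.mean(grads, axis=0)

# Usage: split a synthetic dataset across 4 simulated workers.
rng = np.random.default_rng(0)
X, y = rng.normal(size=(400, 8)), rng.normal(size=400)
shards = [(X[i::4], y[i::4]) for i in range(4)]
w = np.zeros(8)
for _ in range(100):
    w = centralized_sgd_step(w, shards)
```

In a real PIM deployment the per-shard gradients would be computed in memory, so the host-side aggregation and broadcast become the dominant communication cost, which is why the choice of aggregation scheme matters.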
Our results demonstrate three major findings: 1) modern general-purpose PIM
architectures can be a viable alternative to state-of-the-art CPUs and GPUs for
many memory-bound ML training workloads when operations and datatypes are
natively supported by PIM hardware, 2) it is important to carefully choose the
optimization algorithm that best fits PIM, and 3) contrary to popular belief,
contemporary PIM architectures do not scale approximately linearly with the
number of nodes for many data-intensive ML training workloads. To facilitate
future research, we aim to open-source our complete codebase.