Distributed Out-of-Memory SVD on CPU/GPU Architectures

2022 IEEE High Performance Extreme Computing Conference (HPEC), 2022

Abstract
We propose an efficient, distributed, out-of-memory implementation of the truncated singular value decomposition (t-SVD) for heterogeneous high-performance computing (HPC) systems. Various implementations of SVD have been proposed, but most estimate only the singular values, since estimating the singular vectors can significantly increase the time and memory complexity of the algorithm. In this work, we propose an implementation of SVD based on the power method that estimates both the truncated singular values and the corresponding singular vectors. Memory-utilization bottlenecks in the power method used to decompose a matrix $A$ are typically associated with the computation of the Gram matrix $A^{T}A$, which can be significant when $A$ is large and dense, or when $A$ is super-large and sparse. The proposed implementation is optimized for out-of-memory problems, where the memory required to factorize a given matrix exceeds the available GPU memory. We reduce the memory complexity of forming $A^{T}A$ with a batching strategy in which the intermediate factors are computed block by block, and we hide the I/O latency of both host-to-device (H2D) and device-to-host (D2H) batch copies by overlapping each copy with compute using CUDA streams. Furthermore, we use optimized NCCL-based communicators to reduce the latency of collective communications, both intra-node and inter-node. In addition, sparse and dense matrix multiplications are significantly accelerated on GPU cores (or tensor cores when available), resulting in an implementation with good scaling. We demonstrate the scalability of our distributed out-of-memory SVD algorithm by successfully decomposing a dense matrix of size 1 TB and a sparse matrix with 1e-6 sparsity whose dense representation would occupy 128 PB.
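As a hedged illustration (not the authors' code), the batched Gram-matrix computation with copy/compute overlap described in the abstract might look roughly as follows in CuPy. The function name batched_gram, the batch size, and the two-stream scheme are assumptions; truly asynchronous H2D copies would additionally require pinned (page-locked) host buffers.

# Illustrative sketch: batched computation of the Gram matrix G = A^T A with
# H2D copy / compute overlap via two CUDA streams (CuPy). Not the authors'
# implementation; batch size and names are assumptions, and fully asynchronous
# copies would additionally require pinned host memory.
import numpy as np
import cupy as cp

def batched_gram(A_host: np.ndarray, batch_rows: int = 4096) -> cp.ndarray:
    """Accumulate G = A^T A while keeping only one row batch of A on the GPU."""
    m, n = A_host.shape
    G = cp.zeros((n, n), dtype=A_host.dtype)
    copy_stream = cp.cuda.Stream(non_blocking=True)
    compute_stream = cp.cuda.Stream(non_blocking=True)

    # Prefetch the first batch onto the device.
    with copy_stream:
        current = cp.asarray(A_host[:min(batch_rows, m)])
    copy_stream.synchronize()

    start = 0
    while start < m:
        end = min(start + batch_rows, m)
        nxt = None
        if end < m:
            # Launch the H2D copy of the next batch so it overlaps with
            # the partial Gram update running on the compute stream.
            with copy_stream:
                nxt = cp.asarray(A_host[end:min(end + batch_rows, m)])
        with compute_stream:
            G += current.T @ current  # partial contribution of this row batch
        copy_stream.synchronize()
        compute_stream.synchronize()
        current, start = nxt, end
    return G

The truncated singular values and vectors can then be estimated by running the power (subspace) iteration on the accumulated Gram matrix, assuming the $n \times n$ result fits in device memory even when $A$ itself does not.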
Keywords
SVD,out-of-memory,latent features,data compression,distributed processing,parallel programming,big data,heterogeneous HPC systems,GPU,CUDA,NCCL,cupy