H5Spark: Bridging the I/O Gap between Spark and Scientific Data Formats on HPC Systems

Jialin Liu, Evan Racah, Quincey Koziol, Richard Shane Canon, Alex Gittens, Lisa Gerhardt, Suren Byna, Mike F. Ringenburg, Prabhat

Cray User Group (2016)

Abstract

The Spark framework has been tremendously powerful for performing Big Data analytics in distributed data centers. However, using Spark to analyze large-scale scientific data on HPC systems has several challenges. For instance, parallel file systems are shared among all computing nodes, in contrast to shared-nothing architectures. Additionally, accessing data stored in commonly used scientific data formats, such as HDF5 and netCDF, is not natively supported in Spark. Our study focuses on improving I/O performance of Spark on HPC systems when reading and writing scientific data stored in HDF5/netCDF. We select several scientific use cases to drive the design of an efficient parallel I/O API for Spark on HPC systems, called H5Spark, which optimizes I/O performance and takes into account Lustre file system striping. We evaluate the performance of H5Spark on Cori, a Cray XC40 system located at NERSC.
