H5Spark: Bridging the I/O Gap between Spark and Scientific Data Formats on HPC Systems
Cray User Group (2016)
Abstract
The Spark framework has been tremendously powerful for performing Big Data analytics in distributed data centers. However, using Spark to analyze large-scale scientific data on HPC systems poses several challenges. For instance, parallel file systems are shared among all compute nodes, in contrast to shared-nothing architectures. Additionally, Spark does not natively support accessing data stored in commonly used scientific data formats, such as HDF5 and netCDF. Our study focuses on improving the I/O performance of Spark on HPC systems when reading and writing scientific data stored in HDF5/netCDF. We select several scientific use cases to drive the design of an efficient parallel I/O API for Spark on HPC systems, called H5Spark, which optimizes I/O performance and takes Lustre file system striping into account. We evaluate the performance of H5Spark on Cori, a Cray XC40 system located at NERSC.