Efficient User-Level Storage Disaggregation for Deep Learning

2019 IEEE International Conference on Cluster Computing (CLUSTER)

Cited by 28 | Viewed 119
Abstract
On large-scale high performance computing (HPC) systems, applications are provisioned with aggregated resources to meet their peak demands for brief periods. This results in resource underutilization because application requirements vary significantly during execution. The problem is particularly pronounced for deep learning applications running on leadership HPC systems with a large pool of burst buffers in the form of flash or non-volatile memory (NVM) devices. In this paper, we examine the I/O patterns of deep neural networks and reveal their critical need to load many small samples randomly for successful training. We have designed a specialized Deep Learning File System (DLFS) that provides a thin set of APIs. In particular, we design the metadata management of DLFS around an in-memory tree-based sample directory, and its file services around the user-level SPDK framework, which can disaggregate the capabilities of NVM Express (NVMe) devices across parallel training tasks. Our experimental results show that DLFS can dramatically improve the training throughput of deep neural networks on NVMe over Fabrics, compared with the kernel-based Ext4 file system. Furthermore, DLFS achieves efficient user-level storage disaggregation with very little CPU utilization.
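The abstract does not include code, but as a rough illustration of the metadata design it describes, below is a minimal sketch in C of an in-memory tree-based sample directory. This is not the authors' implementation, and all names and structures are hypothetical: sample IDs are mapped to on-device extents entirely in user space, so a training task can resolve a randomly shuffled sample ID without going through a kernel file-system metadata path.

```c
/* Hypothetical sketch (not the DLFS source): an in-memory tree-based
 * sample directory mapping sample IDs to extents on a raw NVMe
 * namespace. Lookups stay in user space; actual I/O would then be
 * issued through a user-level driver such as SPDK. */
#include <stdio.h>
#include <stdlib.h>

typedef struct extent {
    unsigned long lba;    /* starting logical block address */
    unsigned long nbytes; /* sample size in bytes */
} extent_t;

typedef struct node {
    unsigned long sample_id;
    extent_t ext;
    struct node *left, *right;
} node_t;

/* Insert a sample's extent into the binary search tree. */
static node_t *dir_insert(node_t *root, unsigned long id, extent_t ext)
{
    if (!root) {
        node_t *n = malloc(sizeof(*n));
        n->sample_id = id;
        n->ext = ext;
        n->left = n->right = NULL;
        return n;
    }
    if (id < root->sample_id)
        root->left = dir_insert(root->left, id, ext);
    else if (id > root->sample_id)
        root->right = dir_insert(root->right, id, ext);
    return root;
}

/* Resolve a (possibly randomly shuffled) sample ID to its extent. */
static const extent_t *dir_lookup(const node_t *root, unsigned long id)
{
    while (root) {
        if (id == root->sample_id)
            return &root->ext;
        root = id < root->sample_id ? root->left : root->right;
    }
    return NULL;
}

int main(void)
{
    node_t *dir = NULL;
    /* Register three small training samples at arbitrary extents. */
    dir = dir_insert(dir, 42, (extent_t){ .lba = 0,  .nbytes = 4096 });
    dir = dir_insert(dir, 7,  (extent_t){ .lba = 8,  .nbytes = 2048 });
    dir = dir_insert(dir, 99, (extent_t){ .lba = 16, .nbytes = 8192 });

    /* A training task resolves a shuffled sample ID to its extent. */
    const extent_t *e = dir_lookup(dir, 7);
    if (e)
        printf("sample 7 -> lba %lu, %lu bytes\n", e->lba, e->nbytes);
    return 0;
}
```

A balanced tree or hash index would serve equally well here; the point of the design, as described in the abstract, is that resolving one of many small, randomly accessed samples costs only a few pointer chases in DRAM rather than a kernel metadata operation.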
Keywords
large-scale high performance computing systems,aggregated resources,I/O patterns,deep learning applications,HPC systems,burst buffers,deep neural networks,DLFS,in-memory tree-based sample directory,user-level SPDK protocol,kernel-based Ext4 file system,user-level storage disaggregation,flash memory,nonvolatile memory,NVM express devices,NVMe devices,deep learning file system,CPU utilization,parallel training tasks,metadata management,API