Detection and correction of silent data corruption for large-scale high-performance computing

SC Companion (2012)

Abstract
Faults have become the norm rather than the exception for high-end computing on clusters with tens to hundreds of thousands of cores, and this situation will only become more dire as we reach exascale computing. Exacerbating this situation, some of these faults will not be detected, manifesting themselves as silent errors that corrupt memory while applications continue to operate but report incorrect results. This paper introduces RedMPI, an MPI library residing in the profiling layer of any standards-compliant MPI implementation. RedMPI is capable of both online detection and correction of soft errors that occur in MPI applications, without requiring changes to application source code. By providing redundancy, RedMPI is capable of transparently detecting corrupt messages from MPI processes that become faulted during execution. Furthermore, with triple redundancy RedMPI "votes" out MPI messages of a faulted process by replacing corrupted results with corrected results from unfaulted processes. We present an evaluation of RedMPI on an assortment of applications to demonstrate its effectiveness and assess the associated overheads. Fault injection experiments establish that RedMPI is not only capable of successfully detecting injected faults, but can also correct these faults while carrying a corrupted application to successful completion without propagating invalid data.
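The voting idea in the abstract can be illustrated with a small sketch. It is not RedMPI's implementation: the abstract does not say how replica messages are compared, so this example assumes each replica's payload is hashed and the majority digest wins. The names `vote` and `replica_payloads` are hypothetical and chosen only for illustration.

```python
import hashlib
from collections import Counter
from typing import List, Optional, Sequence, Tuple


def vote(replica_payloads: Sequence[bytes]) -> Tuple[Optional[bytes], List[int]]:
    """Compare the same logical message as sent by each replica.

    Returns (agreed payload, indices of dissenting replicas). If no strict
    majority exists, returns (None, all indices): corruption is detected
    but cannot be corrected.

    Assumption: payloads are compared by SHA-256 digest. The abstract does
    not specify RedMPI's comparison mechanism.
    """
    digests = [hashlib.sha256(p).digest() for p in replica_payloads]
    winner, count = Counter(digests).most_common(1)[0]
    if count * 2 <= len(digests):
        # E.g. two replicas that disagree: detection only, no correction.
        return None, list(range(len(digests)))
    faulted = [i for i, d in enumerate(digests) if d != winner]
    # Hand the receiver a copy from a replica that agrees with the majority.
    good = replica_payloads[digests.index(winner)]
    return good, faulted
```

With two replicas a mismatch can only be flagged, whereas with three replicas a single corrupted message is outvoted and replaced, which mirrors the detection versus correction distinction drawn in the abstract.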
Keywords
parallel processing, application program interfaces, consistency protocol, RedMPI, MPI application, MPI redundancy, fault tolerant computing, silent data corruption correction, standards-compliant MPI implementation, exascale computing, message passing interface, high performance computing, fault diagnosis, soft error detection, MPI library, data handling, message passing, fault injector, online detection, silent data corruption, faulted process, high-end computing cluster, silent data corruption detection, soft error correction, large scale high performance computing