Linking Resource Usage Anomalies with System Failures from Cluster Log Data

Reliable Distributed Systems (2013)

Abstract
Bursts of abnormally high resource use are thought to be an indirect cause of failures in large cluster systems, but little work has systematically investigated the role of high resource usage in system failures, largely because of the lack of a comprehensive resource monitoring tool that resolves resource use by job and node. The recently developed TACC_Stats resource use monitor provides the required data. This paper presents the ANCOR diagnostics system, which applies TACC_Stats data to identify resource use anomalies and applies log analysis to link those anomalies with system failures. Applying ANCOR to the Ranger supercomputer, first to identify multiple sources of resource anomalies, then to correlate them with failures recorded in the message logs and diagnose the cause of the failures, has identified four new causes of compute node soft lockups. ANCOR can be adapted to any system that uses a resource use monitor which resolves resource use by job.
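
The two-step idea in the abstract (flag resource use anomalies, then link them to failures in the message logs) can be illustrated with a minimal Python sketch. This is not ANCOR's actual method, which the abstract does not specify. The z-score threshold, the 600-second linking window, and the `Sample` and `Failure` record layouts are all assumptions chosen for illustration, and they do not reflect the real TACC_Stats or log formats.

```python
from dataclasses import dataclass
from statistics import mean, stdev


@dataclass
class Sample:
    """One resource reading, resolved by node and job (hypothetical layout)."""
    node: str
    job: str
    time: float   # seconds since some epoch
    value: float  # e.g. memory in use


@dataclass
class Failure:
    """One failure record parsed from a message log (hypothetical layout)."""
    node: str
    time: float
    message: str


def find_anomalies(samples, z_threshold=3.0):
    """Flag samples whose value lies far above the mean (simple z-score rule).

    Assumption: a plain z-score stands in for whatever detector ANCOR uses.
    """
    values = [s.value for s in samples]
    if len(values) < 2:
        return []
    mu, sigma = mean(values), stdev(values)
    if sigma == 0:
        return []
    return [s for s in samples if (s.value - mu) / sigma > z_threshold]


def link_to_failures(anomalies, failures, window=600.0):
    """Pair each anomaly with failures on the same node within `window`
    seconds after it. The window length is an illustrative assumption."""
    links = []
    for a in anomalies:
        for f in failures:
            if f.node == a.node and 0 <= f.time - a.time <= window:
                links.append((a, f))
    return links
```

Because each sample carries a job identifier, every linked pair points back to the job that was running when the anomaly occurred, which is why a monitor that resolves resource use by job matters here.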
Keywords
fault tolerant computing, parallel machines, resource allocation, ANCOR diagnostics system, TACC_Stats data, TACC_Stats resource, cluster log data, cluster systems, comprehensive resource monitoring tool, message logs, Ranger supercomputer, resource usage anomalies, system failures, Large clusters, Linux O/S, Lustre file-system, Resource Anomalies and Failures