Comprehensive, open-source resource usage measurement and analysis for HPC systems

Concurrency and Computation: Practice and Experience (2014)

Abstract
The important role that high-performance computing (HPC) resources play in science and engineering research, coupled with their high cost (capital, power and manpower), short life and oversubscription, requires us to optimize their usage. This outcome is possible only if adequate analytical data are collected and used to drive systems management at different granularities: job, application, user and system. This paper presents a method for comprehensive job-, application- and system-level resource use measurement and analysis, together with its implementation. The steps in the method are (1) system-wide collection of comprehensive resource use and performance statistics at the job and node levels in a uniform format across all resources, and (2) mapping and storage of the resultant job-wise data to a relational database, which enables further implementation and transformation of the data to the formats required by specific statistical and analytical algorithms. Analyses can be carried out at different levels of granularity: job, user, application or system-wide. Measurements are based on a new lightweight job-centric measurement tool, 'TACC_Stats', which gathers a comprehensive set of resource use metrics on all compute nodes, and on data logged by the system scheduler. The data mapping and analysis tools are an extension of the XDMoD project. The method is illustrated with analyses of resource use for the Texas Advanced Computing Center's Lonestar4, Ranger and Stampede supercomputers and the HPC cluster at the Center for Computational Research. The illustrations focus on resource use at the system, job and application levels; they reveal many interesting insights into system usage patterns and also anomalous behavior due to failure or misuse. The method can be applied to any system that runs the TACC_Stats measurement tool and a tool to extract job execution environment data from the system scheduler. Copyright © 2014 John Wiley & Sons, Ltd.
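As a rough illustration of the second step (storing job-wise data in a relational database and analysing it at different granularities), the following minimal Python sketch uses an in-memory SQLite database. The table layout, the column names (`user_name`, `cpu_user_fraction`, and so on) and the helper functions are hypothetical and chosen only for this example; they are not the TACC_Stats output format or the XDMoD schema.

```python
import sqlite3

# Hypothetical mapping from analysis granularity to a grouping expression.
# Only these fixed expressions are ever interpolated into the SQL text.
LEVELS = {
    "job": "job_id",
    "user": "user_name",
    "application": "application",
    "system": "'all'",  # constant key: every job falls into one group
}


def create_store():
    """Create an in-memory database with an illustrative job-wise table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """
        CREATE TABLE jobs (
            job_id            TEXT PRIMARY KEY,
            user_name         TEXT,
            application       TEXT,
            nodes             INTEGER,
            wall_seconds      REAL,
            cpu_user_fraction REAL
        )
        """
    )
    return conn


def load_jobs(conn, records):
    """Insert one row per job; each record is a 6-tuple matching the table."""
    conn.executemany("INSERT INTO jobs VALUES (?, ?, ?, ?, ?, ?)", records)
    conn.commit()


def summarize(conn, level):
    """Return (group, job count, node-hours, mean CPU user fraction) rows."""
    if level not in LEVELS:
        raise ValueError(f"unknown level: {level!r}")
    key = LEVELS[level]
    query = (
        f"SELECT {key} AS grp, COUNT(*), "
        "SUM(nodes * wall_seconds) / 3600.0, AVG(cpu_user_fraction) "
        "FROM jobs GROUP BY grp ORDER BY grp"
    )
    return conn.execute(query).fetchall()


if __name__ == "__main__":
    conn = create_store()
    # Made-up sample records, for demonstration only.
    load_jobs(conn, [
        ("1001", "alice", "namd", 4, 3600, 0.90),
        ("1002", "alice", "namd", 2, 7200, 0.80),
        ("1003", "bob", "wrf", 8, 1800, 0.20),
    ])
    for level in ("user", "application", "system"):
        print(level, summarize(conn, level))
```

Keeping one row per job means the same table can answer job-, user-, application- and system-level questions by changing only the grouping key, which is the property the abstract attributes to the relational mapping.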
Keywords
supremm, usage analysis, system management, xsede, tacc_stats, hpc resource management, performance analysis, xdmod