Detection Is Better Than Cure: A Cloud Incidents Perspective

PROCEEDINGS OF THE 31ST ACM JOINT MEETING EUROPEAN SOFTWARE ENGINEERING CONFERENCE AND SYMPOSIUM ON THE FOUNDATIONS OF SOFTWARE ENGINEERING, ESEC/FSE 2023(2023)

引用 0|浏览16
暂无评分
摘要
Cloud providers use automated watchdogs or monitors to continuously observe service availability and to proactively report incidents when system performance degrades. Improper monitoring can lead to delays in the detection and mitigation of production incidents, which can be extremely expensive in terms of customer impacts and manual toil from engineering resources. Therefore, a systematic understanding of the pitfalls in current monitoring practices and how they can lead to production incidents is crucial for ensuring continuous reliability of cloud services. In this work, we carefully study the production incidents from the past year at Microsoft to understand the monitoring gaps in a hyperscale cloud platform. We conduct an extensive empirical study to answer: (1) What are the major causes of failures in early detection of production incidents and what are the steps taken for mitigation, (2) What is the impact of failures in early detection, (3) How do we recommend best monitoring practices for different services, and (4) How can we leverage the insights from this study to enhance the reliability of the cloud services. This study provides a deeper understanding of existing monitoring gaps in cloud platforms, uncover interesting insights and provide guidance for best monitoring practices for ensuring continuous reliability.
更多
查看译文
关键词
Empirical Study,Reliability,Cloud Services
AI 理解论文
溯源树
样例
生成溯源树,研究论文发展脉络
Chat Paper
正在生成论文摘要