Machine fault tolerance for reliable datacenter systems

Danyang Zhuo, Qiao Zhang, Dan R. K. Ports, Arvind Krishnamurthy, Thomas Anderson

APSys (2014)

Abstract

Although rare in absolute terms, undetected CPU, memory, and disk errors occur often enough at datacenter scale to significantly affect overall system reliability and availability. In this paper, we propose a new failure model, called Machine Fault Tolerance, and a new abstraction, a replicated write-once trusted table, to provide improved resilience to these types of failures. Since most machine failures manifest in application server and operating system code, we assume a Byzantine model for those parts of the system. However, by assuming that the hypervisor and network are trustworthy, we are able to reduce the overhead of machine-fault masking to be close to that of non-Byzantine Paxos.

Keywords

design, experimentation, fault tolerance, measurement, reliability, distributed databases, performance
