Data-centric Reliability Management in GPUs

2021 51st Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN)(2021)

引用 2|浏览10
暂无评分
摘要
Graphics Processing Units (GPUs) have become the default choice of acceleration in a wide range of application domains. To keep up with computational demands, the GPU memory system is constantly being innovated from both the cache and DRAM perspectives. Such innovations can adversely affect GPU reliability and in fact, can lead to an increase in the number of multi-bit faults. To address this problem, we systematically study a wide range of GPGPU applications and find that usually, only a small percentage of data needs protection to increase application resilience. This data is highly accessed and shared (constitutes hot memory), which implies that faults in this space can often lead to incorrect application output. An in-depth analysis of application code shows that information of such data can be passed on to the hardware to guide low-overhead detection/correction schemes. In this vein, we developed low-overhead partial data replication schemes that exploit latency tolerance in GPUs. Overall, this data-centric approach dramatically improves GPGPU application resilience, with a minimal additional average performance overhead of 1.2% for detection-only and 3.4% for detection-and-correction.
更多
查看译文
关键词
GPUs,Reliability,Multi-bit Faults,Application Resilience
AI 理解论文
溯源树
样例
生成溯源树,研究论文发展脉络
Chat Paper
正在生成论文摘要