Rethinking Software Runtimes for Disaggregated Memory (Extended Abstract)

Semantic Scholar (2020)

Abstract
Disaggregated memory addresses resource provisioning inefficiencies in current datacenters by improving memory utilization and decreasing the total memory over-provisioning necessary to avoid out-of-memory errors or swapping [3]. In addition, disaggregated memory enables independent scaling of memory and compute, and it disentangles hardware failures and replacements from the monolithic server. Fine-grain, microsecond-latency networking technologies, such as Remote Direct Memory Access (RDMA) [8, 14, 15] and Gen-Z [6], are key enablers for hardware-disaggregated memory and make the technology feasible in the near future. Unfortunately, enabling applications to efficiently adopt disaggregated memory is not straightforward. Many software systems [10, 4, 2, 7, 12] have been proposed to let applications transparently, without code changes, use remote memory, i.e., the memory of another host in the rack or memory that has been physically disaggregated from the compute. These systems use various kernel subsystems [7, 2, 12] or redesign the kernel altogether [11]. Fundamentally, they all rely on the core virtual memory mechanism for three essential functions: 1) fetching remote data, by detecting remote accesses through page faults and caching the remote pages in a local DRAM cache; 2) tracking dirty data in the cache, by write-protecting pages and taking a page fault on the first write to each page; and 3) evicting cached pages from the local DRAM cache, which requires marking the pages as not present and flushing the translation look-aside buffers (TLBs). Virtual memory provides application transparency, but it incurs high overhead and causes a significant drop in application performance, even when the amount of remote data accessed is small. Page fault latencies exceed network latencies, making the system software stack the bottleneck. Moreover, virtual memory moves and tracks data at page granularity, with a page size of 4KB or higher. In contrast, applications often access only a small part of each page over its lifetime, causing large write amplification and poor network utilization, because data that is already present in remote memory is re-written over the network. For example, we typically see applications modifying only 1-8 cache lines in a 4KB page, yet the entire page is marked dirty and transferred over the network. Overall, there is a mismatch between applications' requirements for remote memory (low latency, fine-grain access, transparency) and the mechanism used to implement them: virtual memory was never designed to provide low-latency or fine-grain access. We need a different, more efficient mechanism to realize practical memory disaggregation.
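
To make the page-granularity problem concrete, the following minimal user-space C sketch (our illustration, not the paper's runtime; the region size, variable names, and SIGSEGV-based handler are assumptions) implements the write-protection scheme described above with mprotect: the first store to a protected page marks the entire page dirty, even though only a single cache line changed.

    /* Illustrative sketch of write-protection-based dirty tracking.
     * Not the paper's system; a real remote-memory runtime performs the
     * equivalent steps inside the kernel's paging subsystem. */
    #define _GNU_SOURCE
    #include <signal.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <unistd.h>

    #define REGION_PAGES 4
    static uint8_t *region;          /* local DRAM cache of remote pages */
    static long page_size;
    static int dirty[REGION_PAGES];  /* per-page dirty bits */

    /* First store to a write-protected page faults here: mark the whole page
     * dirty and re-enable writes. One modified cache line dirties the entire
     * page. (Calling mprotect from a signal handler is a simplification that
     * works on Linux but is not strictly async-signal-safe.) */
    static void wp_handler(int sig, siginfo_t *info, void *ctx) {
        (void)sig; (void)ctx;
        uintptr_t addr = (uintptr_t)info->si_addr;
        uintptr_t base = (uintptr_t)region;
        if (addr < base || addr >= base + REGION_PAGES * page_size)
            _exit(1);                /* fault outside the cached region */
        size_t page = (addr - base) / page_size;
        dirty[page] = 1;
        mprotect((void *)(base + page * page_size), page_size,
                 PROT_READ | PROT_WRITE);
    }

    int main(void) {
        page_size = sysconf(_SC_PAGESIZE);
        region = mmap(NULL, REGION_PAGES * page_size, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

        struct sigaction sa = {0};
        sa.sa_sigaction = wp_handler;
        sa.sa_flags = SA_SIGINFO;
        sigaction(SIGSEGV, &sa, NULL);

        /* Write-protect the cached pages: the next store to each page faults. */
        mprotect(region, REGION_PAGES * page_size, PROT_READ);

        region[5] = 42;                           /* touches one cache line ... */
        printf("page 0 dirty: %d\n", dirty[0]);   /* ... whole page marked dirty */
        return 0;
    }

Kernel-based remote-memory systems apply the same pattern inside the paging subsystem, which is why a one-byte store can still cause a full 4KB page to be marked dirty and later transferred over the network.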