Big data analytics frameworks, such as Spark and Giraph, need to process and cache massive datasets that do not always fit in the managed heap. Therefore, frameworks temporarily move long-lived objects outside the heap (off-heap) onto a fast storage device. However, this practice results in (1) high serialization/deserialization (S/D) cost and (2) high memory pressure when off-heap objects are moved back to the heap for processing.
In this article, we propose TeraHeap, a system that eliminates S/D overhead and expensive GC scans for a large portion of objects in analytics frameworks. TeraHeap relies on three concepts: (1) It eliminates S/D by extending the managed runtime (JVM) to use a second high-capacity heap (H2) over a fast storage device. (2) It offers a simple hint-based interface, allowing analytics frameworks to leverage object knowledge to populate H2. (3) It reduces GC cost by fencing the collector from scanning H2 objects while maintaining the illusion of a single managed heap, ensuring memory safety.
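To make the hint-based idea concrete, the following is a minimal, illustrative Java sketch of how an analytics framework might tag a long-lived cached object graph as a candidate for the second heap (H2) instead of serializing it off-heap. The class and method names (H2HintSketch, hintMoveToH2) are assumptions for illustration, not the published TeraHeap API; in a TeraHeap-style runtime the hint would be a call into the modified JVM, whereas here it only records the hint so the example runs standalone.

```java
import java.util.ArrayList;
import java.util.List;

/**
 * Sketch of a hint-based interface in the spirit of TeraHeap's design.
 * All names here are hypothetical; the real interface lives inside the
 * modified OpenJDK runtime.
 */
public class H2HintSketch {

    /** Stand-in for the JVM-level hint; records candidates so the sketch is runnable. */
    static final List<Object> h2Candidates = new ArrayList<>();

    /** Hypothetical hint: the runtime may migrate the object graph rooted here
     *  to H2 and fence the garbage collector from scanning it afterwards. */
    static void hintMoveToH2(Object root) {
        h2Candidates.add(root);
    }

    public static void main(String[] args) {
        // A framework-level cached partition that stays live across many tasks.
        List<long[]> cachedPartition = new ArrayList<>();
        for (int i = 0; i < 4; i++) {
            cachedPartition.add(new long[1024]);
        }

        // Instead of serializing the partition off-heap, the framework hints
        // that the whole object graph is a good H2 candidate.
        hintMoveToH2(cachedPartition);

        System.out.println("H2 candidates registered: " + h2Candidates.size());
    }
}
```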
We implement TeraHeap in OpenJDK8 and OpenJDK17 and evaluate it with fifteen widely used applications in two real-world big data frameworks, Spark and Giraph. We find that, for the same DRAM size, TeraHeap improves performance by up to 73% and 28% compared to native Spark and Giraph, respectively. It can also deliver better performance while consuming up to 4.6× and 1.2× less DRAM than native Spark and Giraph, respectively. TeraHeap applies to in-memory frameworks as well: using it with the Neo4j Graph Data Science library improves performance by up to 26%. Finally, it outperforms Panthera, a state-of-the-art garbage collector for hybrid DRAM-NVM memories, by up to 69%.
Published in:
ACM Transactions on Programming Languages and Systems (TOPLAS), 2024