Architecture of a distributed storage that combines file system, memory and computation in a single layer

Jia Zou; Arun Iyengar; Chris Jermaine

doi:10.1007/s00778-020-00605-w

VLDB Journal

Paper

26 Feb 2020

Abstract

Storage and memory systems for modern data analytics are heavily layered: shared persistent data, cached data, and non-shared execution data are managed in separate systems, such as a distributed file system like HDFS, an in-memory file system like Alluxio, and a computation framework like Spark. Such layering introduces significant performance and management costs. In this paper, we propose a single system called Pangea that manages all data, both intermediate and long-lived, along with buffering and caching, page replacement, data placement optimization, and failure recovery, in one monolithic distributed storage system without any layering. We present a detailed performance evaluation of Pangea and show that its performance compares favorably with several widely used layered systems such as Spark.