ProvLake - overview


Overview

ProvLake is a lineage data management system that uses provenance data to capture, integrate, and query data across multiple distributed services, programs, databases, stores, and computational workflows. It targets data tracking in distributed, heterogeneous environments such as hybrid cloud deployments. Its design principles aim at very low data capture overhead, which makes it suitable for High-Performance Computing workloads. A major motivation for ProvLake is explaining how complex Artificial Intelligence models are generated, such as those found in Computational Science and Engineering projects, for example in the Oil and Gas industry.

ProvLake captures multiworkflow data at runtime to support integrated data analysis in heterogeneous, distributed projects. It logically integrates and ingests this data into a provenance database, named ProvLake Data View (PLView), which is ready for data analyses while the workflows are still running. During a multiworkflow execution, the PLView is filled with the following contents: domain data extracted from data stored in multiple stores; explicit data relationships among datasets distributed across the multiple stores; and the multiworkflow data relationships. The PLView is materialized in a DBMS.

ProvLake provides a lightweight data tracking API whose calls are added to workflow code, such as scripts. It also provides a query API for analytical queries that integrate multistore data at runtime. When combined with a polystore, ProvLake can query data directly in the multiple stores jointly with their provenance data. ProvLake follows the W3C PROV standards for provenance data representation.
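To make the idea of a lightweight tracker concrete, here is a minimal sketch of how capture calls might wrap a task in a workflow script. It does not use the ProvLake library: `track_task`, `captured`, and the record fields are hypothetical names chosen for illustration, and the in-memory list stands in for whatever transport a real tracker would use.

```python
import time
from contextlib import contextmanager

# Hypothetical in-memory buffer; a real tracker would send records to a server.
captured = []


@contextmanager
def track_task(workflow, task, inputs):
    """Record a task's inputs on entry and its outputs and end time on exit."""
    record = {
        "workflow": workflow,
        "task": task,
        "used": dict(inputs),
        "generated": {},
        "started_at": time.time(),
    }
    try:
        # The task body fills this dict with the values it produces.
        yield record["generated"]
    finally:
        record["ended_at"] = time.time()
        captured.append(record)


def train(learning_rate, epochs):
    """Toy task standing in for a real training step."""
    params = {"learning_rate": learning_rate, "epochs": epochs}
    with track_task("model_training", "train", params) as out:
        loss = 1.0 / (1.0 + learning_rate * epochs)
        out["loss"] = loss
    return loss


train(0.1, 20)
```

The only change to the task code is the `with` block and one assignment per output, which is the sense in which such a tracker stays lightweight.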

System Architecture

The PLView is the main component. The ProvLake server has three components: ProvCapturer, ProvManager, and PolyProvQueryEngine.

To populate the PLView, ProvLake Lib (which encapsulates the API calls added to workflow code) sends data to the ProvCapturer. The ProvCapturer receives the data coming from the running workflows and transforms it into provenance data following ProvLake's data representation, which extends W3C PROV-DM. The ProvCapturer then sends the provenance data to the ProvManager, which has the connectors to insert data into the DBMS managing the PLView.
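The capture path described above can be sketched roughly as follows. This is an illustration under stated assumptions, not ProvLake's actual code: `to_prov`, `store`, the triple layout, and the `prov` table are hypothetical, and an in-memory SQLite database stands in for the DBMS behind the PLView. Only the relation names `used` and `wasGeneratedBy` come from W3C PROV-DM; ProvLake's own extensions to PROV-DM are not modeled.

```python
import sqlite3


def to_prov(record):
    """Map one raw task record to PROV-DM-style (subject, relation, object) triples."""
    activity = f"activity:{record['workflow']}/{record['task']}"
    triples = []
    for name, value in record["used"].items():
        # PROV-DM "used": an activity consumed an entity.
        triples.append((activity, "used", f"entity:{name}={value}"))
    for name, value in record["generated"].items():
        # PROV-DM "wasGeneratedBy": an entity was produced by an activity.
        triples.append((f"entity:{name}={value}", "wasGeneratedBy", activity))
    return triples


def store(conn, triples):
    """Insert triples into a single table (stand-in for a DBMS connector)."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS prov (subject TEXT, relation TEXT, object TEXT)"
    )
    conn.executemany("INSERT INTO prov VALUES (?, ?, ?)", triples)
    conn.commit()


# A raw record as a workflow-side library might send it (hypothetical shape).
raw = {
    "workflow": "model_training",
    "task": "train",
    "used": {"epochs": 20},
    "generated": {"model_file": "model.h5"},
}

conn = sqlite3.connect(":memory:")
store(conn, to_prov(raw))
rows = conn.execute("SELECT * FROM prov ORDER BY relation").fetchall()
```

Keeping the transformation (`to_prov`) separate from the insertion (`store`) mirrors the split between a capturing component and a managing component with DBMS connectors.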

To query the PLView, clients send API query requests to the PolyProvQueryEngine, which connects to the ProvManager to query the PLView and employs a polystore to query data directly in the multiple stores, jointly with their provenance data.
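A query that joins provenance with data held in other stores might look like the sketch below. Everything here is assumed for illustration: the `generated` table, the `stores` dictionary, and `outputs_of` are hypothetical, and plain Python dictionaries stand in for the stores a polystore would reach. The point is only the two-step pattern: look up in the provenance view where a task's outputs live, then fetch the data from the store that holds it.

```python
import sqlite3

# Stand-in for the provenance view: which task generated which dataset,
# and in which store and under which key that dataset is kept.
plview = sqlite3.connect(":memory:")
plview.execute(
    "CREATE TABLE generated (task TEXT, dataset TEXT, store TEXT, key TEXT)"
)
plview.executemany(
    "INSERT INTO generated VALUES (?, ?, ?, ?)",
    [
        ("train", "metrics", "metrics_db", "run-1"),
        ("preprocess", "features", "file_store", "features.csv"),
    ],
)

# Hypothetical stores holding the actual domain data.
stores = {
    "metrics_db": {"run-1": {"accuracy": 0.91}},
    "file_store": {"features.csv": {"rows": 1000}},
}


def outputs_of(task):
    """Return each dataset generated by `task` with the data fetched from its store."""
    rows = plview.execute(
        "SELECT dataset, store, key FROM generated WHERE task = ? ORDER BY dataset",
        (task,),
    ).fetchall()
    return [
        {"dataset": dataset, "store": store, "data": stores[store][key]}
        for dataset, store, key in rows
    ]


result = outputs_of("train")
```

Here `result` contains one entry that combines the provenance fact (the `train` task generated `metrics`) with the value read from the `metrics_db` stand-in.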

See usage examples and implementation details for further understanding of ProvLake.

Short URL: https://ibm.biz/provlake