ProvLake - Provenance Data Representation



ProvLake's provenance data representation follows the W3C PROV-DM standard. It also inherits concepts from prior work on representing workflow provenance [1,2,3,4].

As in prior work [1,2,3], we separate the data representation into prospective provenance (white background) and retrospective provenance (gray background). <<Stereotypes>> indicate the PROV concepts (mainly Entity, Agent, and Activity) from which ProvLake's concepts inherit. The UML class diagram below shows ProvLake's provenance data representation.

Prospective Provenance

The OWL file of the ontology is available at files/br-lga/provlake.turtle.owl.

Starting point

A Project has many Workflows, each of which has many Data Transformations. Each Data Transformation has many Attributes, which are either input or output attributes, depending on whether the Data Transformation uses them (input) or generates them (output).
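The containment hierarchy above can be sketched with plain Python dataclasses. This is only an illustration of the structure, not ProvLake's implementation: the class and field names mirror the concepts in the text, while `Direction`, `names`, and the example values (`train_model`, `learning_rate`, `accuracy`) are assumptions made for the sketch.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List


class Direction(Enum):
    INPUT = "input"    # attribute is used by the data transformation
    OUTPUT = "output"  # attribute is generated by the data transformation


@dataclass
class Attribute:
    name: str
    direction: Direction


@dataclass
class DataTransformation:
    name: str
    attributes: List[Attribute] = field(default_factory=list)

    def names(self, direction: Direction) -> List[str]:
        """Return the names of the attributes with the given direction."""
        return [a.name for a in self.attributes if a.direction is direction]


@dataclass
class Workflow:
    name: str
    data_transformations: List[DataTransformation] = field(default_factory=list)


@dataclass
class Project:
    name: str
    workflows: List[Workflow] = field(default_factory=list)


# Hypothetical example data, not taken from ProvLake.
train = DataTransformation(
    "train_model",
    [Attribute("learning_rate", Direction.INPUT),
     Attribute("accuracy", Direction.OUTPUT)],
)
project = Project("demo_project", [Workflow("training_workflow", [train])])
```

Keeping the direction on the attribute, rather than in two separate lists, is one possible design choice; it keeps a single Attributes collection per Data Transformation, as the text describes.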

Attributes and Schemas

DataSchemas group attributes that semantically belong to the same data representation. For example, givenName and familyName are two attributes that belong to the same DataSchema, named Person.

One schema can inherit from another, e.g., SeismicFile inherits from File. This is represented by the specializationOf relationship.
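As a rough sketch of how a specializationOf link might be followed, the snippet below collects the attributes of a schema together with those of the schemas it specializes. Treating inheritance as "the specialized schema also carries the general schema's attributes" is an assumption of this sketch, and the attribute name `sampleInterval` is hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class DataSchema:
    name: str
    attributes: List[str] = field(default_factory=list)
    # specializationOf: the more general schema this one inherits from, if any
    specialization_of: Optional["DataSchema"] = None

    def all_attributes(self) -> List[str]:
        """Attributes of the general schemas first, then this schema's own."""
        inherited = (
            self.specialization_of.all_attributes()
            if self.specialization_of is not None
            else []
        )
        return inherited + self.attributes


# File and SeismicFile come from the text; the attribute names are illustrative.
file_schema = DataSchema("File", ["filePath"])
seismic_file = DataSchema("SeismicFile", ["sampleInterval"],
                          specialization_of=file_schema)
```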

Thus, Attributes can be part of a DataSchema, which can be part of a DatabaseSchema, which can be part of a Database, which is stored in a DataStore. Each DataStore has a type property that indicates whether it is a FILE_SYSTEM, RELATIONAL_DBMS, DOCUMENT_DBMS, etc., depending on the data stores in use by the multiworkflow.

An Attribute need not be associated with any DataSchema if grouping attributes is not required when modeling the application data. For example, filePath is an attribute that may not be associated with any DataSchema. In this case, filePath can be directly associated with a DataStore whose type is FILE_SYSTEM.
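The two ways an Attribute can reach a DataStore (through the schema chain, or directly) can be sketched as follows. The class names follow the text; the method `resolve_data_store` and the example names (`postgres`, `hr`, `public`, `local_disk`) are assumptions made for illustration only.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class DataStoreType(Enum):
    FILE_SYSTEM = "FILE_SYSTEM"
    RELATIONAL_DBMS = "RELATIONAL_DBMS"
    DOCUMENT_DBMS = "DOCUMENT_DBMS"


@dataclass
class DataStore:
    name: str
    type: DataStoreType


@dataclass
class Database:
    name: str
    data_store: DataStore


@dataclass
class DatabaseSchema:
    name: str
    database: Database


@dataclass
class DataSchema:
    name: str
    database_schema: DatabaseSchema


@dataclass
class Attribute:
    name: str
    data_schema: Optional[DataSchema] = None  # set when attributes are grouped
    data_store: Optional[DataStore] = None    # direct link when there is no schema

    def resolve_data_store(self) -> Optional[DataStore]:
        """Follow DataSchema -> DatabaseSchema -> Database -> DataStore if possible."""
        if self.data_schema is not None:
            return self.data_schema.database_schema.database.data_store
        return self.data_store


# Hypothetical stores and names.
rdbms = DataStore("postgres", DataStoreType.RELATIONAL_DBMS)
person = DataSchema("Person", DatabaseSchema("public", Database("hr", rdbms)))
given_name = Attribute("givenName", data_schema=person)

fs = DataStore("local_disk", DataStoreType.FILE_SYSTEM)
file_path = Attribute("filePath", data_store=fs)
```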

An Attribute may be simple, a list, or a dictionary. A simple Attribute is one that is not subdivided. A list is a complex Attribute whose composition we represent with the hadMember relationship; we also store the order of the elements in the list. A dictionary is an Attribute composed of other attributes without an order; we also use hadMember to represent it.
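One way to picture the three kinds of Attribute is a single class whose hadMember edges carry an optional position. This is a minimal sketch under that assumption; the `kind` strings, the `add_member` helper, and the example attribute names are hypothetical and not part of ProvLake.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Attribute:
    name: str
    kind: str = "simple"  # "simple", "list", or "dictionary"
    # hadMember edges as (member, position); position is None for dictionaries
    had_member: List[Tuple["Attribute", Optional[int]]] = field(default_factory=list)

    def add_member(self, member: "Attribute") -> None:
        """Attach a member; only lists record the member's position."""
        if self.kind == "simple":
            raise ValueError("a simple attribute is not subdivided")
        position = len(self.had_member) if self.kind == "list" else None
        self.had_member.append((member, position))


# Hypothetical attributes.
layers = Attribute("layers", kind="list")
layers.add_member(Attribute("layer_a"))
layers.add_member(Attribute("layer_b"))

person = Attribute("person", kind="dictionary")
person.add_member(Attribute("givenName"))
person.add_member(Attribute("familyName"))
```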

An attribute in one schema can be equivalent to an attribute in another schema, even in a different data store. For instance, the attribute "name" of an entity can be stored in different data stores, possibly with different spellings, while keeping the same meaning. We use the alternateOf self-relationship to represent this equivalence.
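The alternateOf links can be pictured as unordered pairs of attribute references. The sketch below treats the relationship as symmetric, which is an assumption of this example; the `AttributeRef` and `AlternateOf` helpers and the store and schema names are hypothetical.

```python
from dataclasses import dataclass
from typing import FrozenSet, Set


@dataclass(frozen=True)
class AttributeRef:
    data_store: str
    data_schema: str
    name: str


class AlternateOf:
    """Stores alternateOf links as unordered pairs of attribute references."""

    def __init__(self) -> None:
        self._pairs: Set[FrozenSet[AttributeRef]] = set()

    def add(self, a: AttributeRef, b: AttributeRef) -> None:
        self._pairs.add(frozenset((a, b)))

    def holds(self, a: AttributeRef, b: AttributeRef) -> bool:
        # An unordered pair makes the lookup independent of argument order.
        return frozenset((a, b)) in self._pairs


# Same meaning, different spelling, different data stores (illustrative names).
name_sql = AttributeRef("postgres", "Person", "name")
name_doc = AttributeRef("mongodb", "Customer", "fullName")

alternates = AlternateOf()
alternates.add(name_sql, name_doc)
```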

Schemas and Identifiers

DataSchemas have identifying attributes. An attribute or a set of attributes may uniquely identify a record (e.g., a tuple in an RDBMS, a document in a document DBMS, a resource in an RDF DBMS, a file in a file system) in a data collection (e.g., a table in an RDBMS, a collection in a document DBMS, a class in an RDF DBMS, a collection of files in a file system). In this case, an Identifier has one attribute (a simple identifier) or more (a composite identifier). For example, one can model a DataSchema named Person with attributes {socialSecurityNumber, givenName, familyName}; in this case, there is an Identifier instance with one attribute, socialSecurityNumber.
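The Person example can be sketched as follows. The `Identifier` and `DataSchema` names come from the text; the `is_composite` property, the `key_of` helper, and the record values are assumptions added to show how an identifier might be used to extract a record's key.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Identifier:
    attributes: List[str]  # one attribute (simple) or several (composite)

    @property
    def is_composite(self) -> bool:
        return len(self.attributes) > 1

    def key_of(self, record: Dict[str, object]) -> Tuple[object, ...]:
        """Extract the identifying values of a record, in attribute order."""
        return tuple(record[name] for name in self.attributes)


@dataclass
class DataSchema:
    name: str
    attributes: List[str]
    identifier: Identifier


person = DataSchema(
    "Person",
    ["socialSecurityNumber", "givenName", "familyName"],
    Identifier(["socialSecurityNumber"]),
)

# Made-up record used only to exercise the sketch.
record = {
    "socialSecurityNumber": "000-00-0000",
    "givenName": "Ada",
    "familyName": "Lovelace",
}
```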

Relationships between identifiers

Analogously to foreign keys in relational schema modeling, in ProvLake's provenance representation, an attribute that is part of an IdentifierSet may refer (referred relationship) to attributes that are part of another IdentifierSet, even if they belong to different schemas (and data stores). This preserves the relationships between data kept in multiple stores.
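A foreign-key-like link across stores might be recorded as a directed edge from the referring attribute to the referred one. This is a sketch under that assumption; `IdentifierAttribute`, `add_referred`, and the `Order`/`customerSSN` names are hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import DefaultDict, List


@dataclass(frozen=True)
class IdentifierAttribute:
    data_store: str
    data_schema: str
    name: str


# referred edges: referring attribute -> attributes it refers to
referred: DefaultDict[IdentifierAttribute, List[IdentifierAttribute]] = defaultdict(list)


def add_referred(source: IdentifierAttribute, target: IdentifierAttribute) -> None:
    """Record that `source` refers to `target`, possibly in another data store."""
    referred[source].append(target)


# An order document in one store pointing at a person row in another (illustrative).
order_customer = IdentifierAttribute("mongodb", "Order", "customerSSN")
person_ssn = IdentifierAttribute("postgres", "Person", "socialSecurityNumber")
add_referred(order_customer, person_ssn)
```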

Retrospective Provenance

Data Transformation Execution and Attribute Values

When a data transformation executes, it uses (as input) attribute values and generates (as output) attribute values.
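The used and generated sides of an execution can be sketched as two lists of attribute values attached to one execution record. The names `DataTransformationExecution`, `AttributeValue`, and `run_training`, as well as the values, are assumptions for illustration; the constant accuracy is a placeholder standing in for real computation.

```python
from dataclasses import dataclass, field
from typing import Any, List


@dataclass
class AttributeValue:
    attribute: str  # name of the prospective Attribute this value instantiates
    value: Any


@dataclass
class DataTransformationExecution:
    data_transformation: str
    used: List[AttributeValue] = field(default_factory=list)       # inputs
    generated: List[AttributeValue] = field(default_factory=list)  # outputs


def run_training(learning_rate: float) -> DataTransformationExecution:
    """Record what one (hypothetical) execution used and generated."""
    execution = DataTransformationExecution("train_model")
    execution.used.append(AttributeValue("learning_rate", learning_rate))
    accuracy = 0.9  # placeholder result; no real training happens here
    execution.generated.append(AttributeValue("accuracy", accuracy))
    return execution


execution = run_training(0.01)
```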

Some attributes can be stored in DataStoreInstances. DataStoreInstances are specialized into Relational DBMS, Document DBMS, KeyValue Store, FileSystem, etc., depending on the data stores in use by the multiworkflow. Each specialized class has properties specific to its kind of store.

 

General

We use prov:value (https://www.w3.org/TR/2013/REC-prov-o-20130430/#value) to provide a description of an entity.

References

[1] Renan Souza, Leonardo Azevedo, Raphael Thiago, Elton Soares, Marcelo Nery, Marco Netto, Emilio Vital Brazil, Renato Cerqueira, Patrick Valduriez, Marta Mattoso: Efficient Runtime Capture of Multiworkflow Data Using Provenance. In: IEEE International Conference on e-Science (eScience). pp. 1–10 (2019).

[2] Costa, F., Silva, V., de Oliveira, D., Ocaña, K., Ogasawara, E., Dias, J., Mattoso, M.: Capturing and querying workflow runtime provenance with PROV: a practical approach. In: EDBT/ICDT workshops. pp. 282–289 (2013).

[3] Silva, V., Leite, J., Camata, J.J., de Oliveira, D., Coutinho, A.L.G.A., Valduriez, P., Mattoso, M.: Raw data queries during data-intensive parallel workflow execution. Future Generation Computer Systems (2017).

[4] Souza, R., Mattoso, M.: Provenance of Dynamic Adaptations in User-steered Dataflows. International Provenance and Annotation Workshop (2018).

Disclaimer

This is a work in progress. We are evolving this data representation as the project evolves.

Short URL: https://ibm.biz/provlake-dm