ProvLake - Usage examples


This page presents examples of using ProvLake in workflows that preprocess data for a deep learning model in the Oil & Gas domain.

Multiworkflow use case

 

Deep dive into Workflow 2

 

Example of Python workflow script with added lightweight API

The following lines are a small excerpt from a real Python script for a data processing workflow. The excerpt represents the "Import Seismic" data transformation of Workflow 2, displayed in the figure above.

  1. from provlake import ProvLake, DT
  2. prov = ProvLake(wf_specification_path)
  3. args = [
  4.   segy_path,
  5.   inline_byte,
  6.   xline_byte,
  7.   geox_byte,
  8.   geoy_byte]
  9. with DT(prov, "import_seismic", args) as dt:
  10.      document_id = import_seismic(args)
  11.      dt.output(document_id)
  12. prov.close()

Line 10 calls a library function that loads raw SEG-Y data into MongoDB, passing the input arguments of the data transformation. The call returns a reference to the MongoDB document that represents the raw SEG-Y data.
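
To make the capture pattern concrete, the following is a minimal sketch of how a context manager in the style of DT could record the inputs and outputs of a data transformation. It is not ProvLake's implementation: SketchProv, SketchDT, and the stubbed import_seismic are hypothetical names used only for illustration, the argument values are made up, and the records are kept in an in-memory list rather than sent to a provenance service.

```python
import time


class SketchProv:
    """Hypothetical stand-in for a provenance collector (not the ProvLake class)."""

    def __init__(self):
        self.records = []  # captured provenance records, kept in memory


class SketchDT:
    """Hypothetical context manager that wraps one data transformation."""

    def __init__(self, prov, name, args):
        self.prov = prov
        self.record = {"dt": name, "used": list(args), "generated": []}

    def __enter__(self):
        self.record["started_at"] = time.time()
        return self

    def output(self, *values):
        # Register values produced by the transformation.
        self.record["generated"].extend(values)

    def __exit__(self, exc_type, exc, tb):
        self.record["ended_at"] = time.time()
        self.record["status"] = "failed" if exc_type else "finished"
        self.prov.records.append(self.record)
        return False  # never swallow exceptions raised by the transformation


def import_seismic(args):
    # Stub for the real library call; returns a made-up document reference.
    return "doc-" + str(len(args))


prov = SketchProv()
args = ["netherlands.sgy", 189, 193, 181, 185]  # illustrative values only
with SketchDT(prov, "import_seismic", args) as dt:
    document_id = import_seismic(args)
    dt.output(document_id)

record = prov.records[0]
print(record["dt"], record["used"], record["generated"], record["status"])
```

Recording the inputs on entry and the outputs on exit keeps the instrumentation to a few lines around the existing library call, which is the point of the lightweight API shown in the excerpt.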

Results of Querying the PLView

For the prefixes, “p:” is used for types in the ProvLake ontology (they are represented as <<stereotypes>> in the figure), “i:” is used for instances, and “prov:” is the prefix for the W3C PROV ontology. The excerpt shows the data dependencies when data containing a seismic cube acquired in a Netherlands basin were processed in all four workflows. The figure shows the classes (prov:Activities and prov:Entities) and their associations. The associations are RDF object properties, either reused from the W3C PROV ontology (such as used and generated) or newly created (such as hadDataRelationship and hadStore).

Further, each instance has dataset attributes. For example, when ProvLake tracks domain-specific data values extracted from database reference attributes, like the x,y geographic coordinates of a seismic cube, it creates the corresponding associations to the prov:Entities as RDF data properties. Similarly, parameter values used in a data transformation execution (a subclass of prov:Activity) and data values output by the data transformation are also associated with the data transformation executions as data properties. These associations follow the attribute semantics definition (Section 4).

In the figure, we see both the multi-dataflow and multi-database aspects. A grayscale pattern between database reference attributes and a data store (represented as prov:Entities with a dashed stroke in the figure) illustrates the hadStore relationship. All database reference attributes in the same data store follow the background color of the data store instance.

To illustrate, during an execution of a data transformation in Workflow 1, ProvLake registered that the data transformation used a seismic raw file (netherlands.sgy), extracted data from it (not shown in the figure), and generated an instance in a relational table, which is stored in PostgreSQL (represented as i:PostgreSQL-001 in the figure).
In addition, ProvLake creates relationships between i:PostgreSQL-001 and the database reference attributes with their corresponding values, and stores the hadDataRelationship relationship between the seismic file and the instance in the relational table. Similarly, when Workflows 2 to 4 execute, ProvLake tracks domain data values extracted from the data stores, parameters, output values from data transformations, and the database reference attributes, and creates the relationships among data references with hadDataRelationship. Then, given the PLView generated during the execution of the multi-workflow, PolyProvQueryExecutor answers queries by executing predefined, parametrized graph traversal data analytical queries on the PLView contents jointly with the polystore.
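
The relationships described above can be sketched as a small set of subject-predicate-object triples with a parametrized traversal over them. This is only an illustration under stated assumptions: apart from netherlands.sgy and i:PostgreSQL-001, the instance names are hypothetical, the "p:" prefix on the new properties and the direction chosen for hadStore and hadDataRelationship are assumed, and related_data is a made-up helper, not the PolyProvQueryExecutor API. Plain Python tuples stand in for an RDF store.

```python
from collections import deque

# (subject, predicate, object) triples. Names marked "hypothetical" are not from ProvLake.
triples = [
    ("i:import_exec_wf1", "prov:used", "i:netherlands.sgy"),          # hypothetical execution id
    ("i:import_exec_wf1", "prov:generated", "i:seismic_row_1"),       # hypothetical table instance
    ("i:seismic_row_1", "p:hadStore", "i:PostgreSQL-001"),
    ("i:netherlands.sgy", "p:hadDataRelationship", "i:seismic_row_1"),
    ("i:seismic_row_1", "p:hadDataRelationship", "i:seismic_doc_1"),  # hypothetical Workflow 2 output
    ("i:seismic_doc_1", "p:hadStore", "i:MongoDB-001"),               # hypothetical store instance
]


def objects(subject, predicate):
    """Return every o such that (subject, predicate, o) is in the triples."""
    return [o for s, p, o in triples if s == subject and p == predicate]


def related_data(start, predicate="p:hadDataRelationship"):
    """Breadth-first traversal: every entity reachable from start via predicate."""
    seen, queue, order = {start}, deque([start]), []
    while queue:
        node = queue.popleft()
        for nxt in objects(node, predicate):
            if nxt not in seen:
                seen.add(nxt)
                order.append(nxt)
                queue.append(nxt)
    return order


# Which data derived from the raw file, and where is each piece stored?
for entity in related_data("i:netherlands.sgy"):
    print(entity, "stored in", objects(entity, "p:hadStore"))
```

The starting entity and the predicate are the parameters of the traversal. Following hadDataRelationship links data references across workflows, and hadStore then indicates which data store would have to be consulted for each reference.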