ProvLake - PROV-ML



Here we describe PROV-ML, a W3C PROV- and W3C MLS-compliant data representation for Machine Learning. PROV-ML is depicted in Fig. 1 using a UML class diagram, where the light-color classes represent prospective provenance and the dark-color classes represent retrospective provenance.

Figure 1. PROV-ML: a W3C PROV- and W3C ML Schema-compliant workflow provenance data representation.

The colors in the figure map to these concepts: the blue-shaded classes account for the Learning Data; the gray-shaded, for the Learning; and the yellow-shaded, for the Model.
The stereotypes indicated in the figure represent the classes inherited from ProvLake. All classes illustrated in the figure are individually described in Table 1.

In PROV-ML, the Study class introduces a series of experiments, portrayed by the LearningExperiment class, which defines one of the three major phases in the lifecycle, the Learning phase. A learning experiment comprises a set of learning stages, represented by the BaseLearningStage class, which are the primary data transformations within the Learning phase and with which the agent (Persona class) is associated.
The base learning stage serves as an abstract class from which the LearningStage and LearningStageSection classes inherit. It also relates the ML algorithm used in the stage, represented by the Algorithm class, which may be defined in the context of a specific ML task (e.g., classification, regression), represented by the LearningTask class. This approach allows both the learning stage and the learning stage section to preserve their relationships with the other classes while still having the specific characteristics discussed next.

A learning stage varies by type, i.e., the Training, Validation, and Evaluation classes. Providing a specific class for the learning stage allows the explicit representation of the relationship between the Learning Data Preparation phase, through its Learning Data, and the Learning phase of an ML lifecycle. The LearningStageSection class introduces sectioning semantics that enable referencing subparts of the learning stage and its data. For example, sectioning makes it possible to reference a specific epoch within a training stage, or a set of batches within a specific epoch. The Learning Data appears in the model through the LearningDataSetReference class. Another data transformation specified in PROV-ML is the FeatureExtraction class, which represents the process that transforms the learning dataset into a set of features, represented by the FeatureSet class. This modeling favors the reproducibility of the ML experiment, since it relates the dataset with the feature extraction process and the resulting feature set.
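The class hierarchy described above can be sketched in plain Python. This is only an illustrative approximation of the model, not an official PROV-ML API; the attribute names (epoch, batches, stage_type) are assumptions chosen to mirror the sectioning semantics discussed in the text.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Algorithm:
    name: str                      # the ML technique, e.g., "k-means"

@dataclass
class LearningTask:
    name: str                      # e.g., "classification", "regression"

@dataclass
class BaseLearningStage:
    """Abstract base: holds the relations shared by stages and sections."""
    algorithm: Algorithm
    task: Optional[LearningTask] = None

@dataclass
class LearningStageSection(BaseLearningStage):
    """References a subpart of a stage, e.g., one epoch or a batch range."""
    epoch: Optional[int] = None
    batches: Optional[range] = None

@dataclass
class LearningStage(BaseLearningStage):
    stage_type: str = "Training"   # Training, Validation, or Evaluation
    sections: List[LearningStageSection] = field(default_factory=list)

# Reference epoch 3 of a training stage, as in the sectioning example above.
algo = Algorithm("k-means")
stage = LearningStage(algorithm=algo, task=LearningTask("clustering"))
stage.sections.append(LearningStageSection(algorithm=algo, epoch=3))
```

Because both LearningStage and LearningStageSection inherit from BaseLearningStage, each keeps the relationships to Algorithm and LearningTask while adding its own characteristics, exactly the design rationale given above.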

Further fundamental aspects of the Learning phase are the outputs and the parametrization used to produce them. The ModelSchema class describes the characteristics of the models produced in a learning stage or learning stage section, such as the number of layers of a neural network or the number of trees in a random forest. The ModelProspection class represents the prospected ML models, i.e., the references for the ML models learned during a learning stage or learning stage section of a training stage. Also among the data produced in the Learning phase is the EvaluationMeasure class. This class, combined with the EvaluationProcedure and EvaluationSpecification classes, provides the representation of the evaluation mechanisms applied to the produced ML models during any learning stage. Specifically: an evaluation measure defines an overall metric used to evaluate a learning stage (e.g., accuracy, F1-score, area under the curve);
an evaluation specification defines the set of evaluation measures used in the evaluation of learned models; and an evaluation procedure serves as the model evaluation framework, i.e., it details the evaluation process and the methods used. On the parametrization side, PROV-ML affords two classes: LearningHyperparameter and ModelHyperparameter. The first represents the hyperparameters used in a learning stage or learning stage section (e.g., maximum training epochs, weight initialization). The second represents the models' hyperparameters (e.g., network weights). Finally, PROV-ML addresses the retrospective counterparts of the classes mentioned above: the classes ending in Execution and Value are the retrospective analogues of the data transformations and the attributes, respectively.
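The prospective/retrospective pairing can be illustrated with a small Python sketch: an attribute is declared prospectively (what should be set or measured) and captured retrospectively as a Value counterpart once a stage executes. The class and field names below follow the text but are illustrative assumptions, not the normative PROV-ML vocabulary.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class LearningHyperparameter:        # prospective: what will be set
    name: str                        # e.g., "max_epochs"

@dataclass(frozen=True)
class LearningHyperparameterValue:   # retrospective: what was actually used
    parameter: LearningHyperparameter
    value: object

@dataclass(frozen=True)
class EvaluationMeasure:             # prospective metric, e.g., "accuracy"
    name: str

@dataclass(frozen=True)
class EvaluationMeasureValue:        # retrospective result of an execution
    measure: EvaluationMeasure
    value: float

# Prospectively declare a hyperparameter, then record what a run used.
max_epochs = LearningHyperparameter("max_epochs")
used_epochs = LearningHyperparameterValue(max_epochs, 50)
# Record a retrospective evaluation result for the declared measure.
accuracy = EvaluationMeasureValue(EvaluationMeasure("accuracy"), 0.93)
```

Keeping the retrospective object linked to its prospective declaration is what lets a query answer both "what was planned?" and "what actually happened?" for the same attribute.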

Table 1. Summary of PROV-ML data representation classes.

Study: Investigation (e.g., a research hypothesis) leading to the ML workflow definitions.
Experiment: The set of analyses (e.g., research questions) that drives the ML workflow.
LearningProcessExecution: An ML workflow execution. This is equivalent to mls:Run and was renamed to explicitly preserve the aspects of retrospective provenance, which are not explicitly handled in MLS.
LearningTask and LearningTaskValue: Define the goal of a learning process, i.e., the ML task (e.g., LearningTask: Classification; LearningTaskValue: Seismic Stratigraphic Classification).
LearningStageType: A stage in the learning process; one of training, testing, or validation.
LearningDatasetReference and LearningDatasetReferenceValue: Define the dataset to be used by a LearningStage and the dataset reference actually used in a LearningStageExecution, respectively.
DatasetCharacteristic and DatasetCharacteristicValue: Define metadata about a LearningDatasetReference (e.g., #instances); a DatasetCharacteristicValue relates to a LearningDatasetReferenceValue (e.g., #instances = 8).
FeatureSet and FeatureSetData: Define the features that FeatureExtraction should generate over a LearningDatasetReference and the values generated in the execution, respectively.
FeatureSetCharacteristic: Defines the set of metadata that describes the FeatureSet (e.g., number of features, features' type).
FeatureExtraction and FeatureExtractionExecution: Define the feature retrieval process and its execution.
Software: Defines a collection of ML techniques' implementations (e.g., Scikit-Learn).
Algorithm: An ML technique with no associated technology, software, or implementation (e.g., the k-means clustering technique).
Implementation: Defines the retrospective aspect of an Algorithm, i.e., an ML technique's implementation in a software package (e.g., Scikit-Learn's k-means implementation).
ImplementationCharacteristic: Defines the implementation's set of metadata (e.g., version, git hash).
LearningHyperParameter: Defines a prior parameter of an Algorithm used by a LearningStage.
LearningHyperParameterSetting: Defines the parameter values of an execution (e.g., the k value in the k-means clustering technique, the range of epochs in a neural network training).
ModelSchema: The scope of the resulting model.
ModelReference and ModelReferenceValue: The resulting model a LearningStage should generate and the generated value (e.g., the trained model after the training stage), respectively.
ModelHyperParameter and ModelHyperParameterValue: The hyperparameters a LearningStage generates and their values in the resulting model (e.g., the epoch at which the resulting model was generated), respectively.
DataStoreInstance: Storage of the resulting model (i.e., a ModelReferenceValue).
EvaluationMeasure and ModelEvaluation: A measure a LearningStage should evaluate and its associated value generated in the execution (e.g., the precision of a classifier model), respectively.
EvaluationSpecification and EvaluationProcedure: Classes directly inherited from MLS, with their semantics preserved.
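The dataset-related rows of the table can be made concrete with a short sketch, using the #instances = 8 example from the DatasetCharacteristic row. Names mirror the table; the dataset name and the dictionary-based structure are illustrative assumptions.

```python
from dataclasses import dataclass, field
from typing import Dict

@dataclass
class LearningDatasetReference:
    """Prospective: which dataset a LearningStage should use."""
    name: str
    # DatasetCharacteristic: declared metadata fields and their types
    characteristics: Dict[str, str] = field(default_factory=dict)

@dataclass
class LearningDatasetReferenceValue:
    """Retrospective: the dataset reference used in a LearningStageExecution."""
    reference: LearningDatasetReference
    # DatasetCharacteristicValue: the metadata values actually observed
    characteristic_values: Dict[str, object] = field(default_factory=dict)

# Hypothetical dataset name, for illustration only.
dataset_ref = LearningDatasetReference("seismic-lines", {"#instances": "integer"})
dataset_used = LearningDatasetReferenceValue(dataset_ref, {"#instances": 8})
```

The same prospective/retrospective split applies to the other paired rows (FeatureSet/FeatureSetData, ModelReference/ModelReferenceValue, and so on).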



Figure 2 presents an object diagram corresponding to the same example used to illustrate the W3C MLS model. The example is derived from the OpenML portal. Some elements were created because they did not exist in the original example. The example describes the entities involved in modeling a single run of the implementation of a logistic regression algorithm from the Weka machine learning environment. The referenced individuals can easily be looked up online; for instance, run 100241 can be found on the OpenML portal.
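To suggest how the individuals in Figure 2 relate, the sketch below encodes a few of them as plain subject-predicate-object triples. The run identifier 100241 comes from the text; the dataset identifier and the provml: terms are placeholders, while prov:used and prov:wasAssociatedWith are standard W3C PROV relations.

```python
# Run 100241 is from the example; other identifiers are placeholders.
run = "openml:run/100241"
triples = {
    (run, "rdf:type", "provml:LearningStageExecution"),
    (run, "prov:used", "openml:dataset/example"),
    (run, "prov:wasAssociatedWith", "weka:LogisticRegression"),
    ("weka:LogisticRegression", "rdf:type", "provml:Implementation"),
    ("weka:LogisticRegression", "provml:implements", "algorithm:LogisticRegression"),
}

# A provenance query: everything this run used as input.
used_inputs = [o for (s, p, o) in triples if s == run and p == "prov:used"]
```

In a real deployment these statements would live in an RDF store and be queried with SPARQL; the list comprehension stands in for such a query.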

Figure 2. PROV-ML object diagram example.


A demo of PROV-ML was presented at the 2019 IBM Colloquium on Artificial Intelligence. We make the presentation and the Jupyter output available.


[1] Renan Souza, Leonardo Azevedo, Vítor Lourenço, Elton Soares, Raphael Thiago, Rafael Brandão, Daniel Civitarese, Emilio Vital Brazil, Marcio Moreno, Patrick Valduriez, Marta Mattoso, Renato Cerqueira, and Marco A. S. Netto. "Provenance Data in the Machine Learning Lifecycle in Computational Science and Engineering." Workflows in Support of Large-Scale Science (WORKS) workshop, co-located with the ACM/IEEE International Conference for High Performance Computing, Networking, Storage, and Analysis (SC), 2019.

[2] W3C PROV-DM.

[3] W3C Machine Learning Schema (MLS).