IBM Research India - Trusted Data and AI
Trusted Data and AI
Data is the central piece in any enterprise's journey to AI. Over the past decade, AI/ML technologies have become pervasive in academia and industry, finding utility in ever newer and more challenging applications. However, every enterprise faces multiple challenges around managing its data, ensuring that it is of good quality, and building trust in the models trained on this data before deploying them in the real world. While much effort has gone into building better, smarter, and automated ML models, little work has been done to systematically understand the challenges in the data and assess its quality issues before it is fed to an ML pipeline. Issues such as incorrect labels, synonymous categories in a categorical variable, or heterogeneity within columns, which may go undetected by standard pre-processing modules in these frameworks, can lead to sub-optimal model performance. Although some systems can generate comprehensive reports with details of the ML pipeline, a lack of insight and explainability with respect to data quality issues means that data scientists spend roughly 80% of their time on data preparation before employing these AutoML solutions. This is why data preparation is often called out as one of the most time-consuming steps in the AI lifecycle. At IBM Research, this is the broad lens through which we have been building AI toolkits, algorithms, and systems, with a human in the loop, that enable and accelerate the enterprise's journey to AI.
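To make one of these quality issues concrete, here is a minimal sketch of how synonymous categories in a categorical column might be surfaced. It uses plain string similarity from Python's standard library; this is an illustration only, not the method used by our toolkit, and the function name and threshold are our own choices for this example.

```python
from difflib import SequenceMatcher

def find_synonymous_categories(values, threshold=0.85):
    """Flag pairs of category labels that likely denote the same concept.

    Illustrative only: a real data-quality tool would combine string
    similarity with frequency and semantic (embedding) signals.
    """
    # Normalize case and surrounding whitespace before comparing.
    cats = sorted(set(v.strip().lower() for v in values))
    suspects = []
    for i in range(len(cats)):
        for j in range(i + 1, len(cats)):
            ratio = SequenceMatcher(None, cats[i], cats[j]).ratio()
            if ratio >= threshold:
                suspects.append((cats[i], cats[j], round(ratio, 2)))
    return suspects

print(find_synonymous_categories(
    ["personal loan", "Personal Loans", "home loan", "auto loan"]))
# → [('personal loan', 'personal loans', 0.96)]
```

A singular/plural pair such as "personal loan" / "Personal Loans" is flagged, while genuinely distinct categories like "home loan" and "auto loan" fall below the threshold.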
Let’s look at a standard Data and AI lifecycle, as shown in Figure 1. A lot of research and tooling has gone into simplifying the model learning steps, but data assessment and testing need comparable attention.
Figure 1: Different steps of Data and AI lifecycle from data acquisition to model deployment.
Data Readiness for AI
Several studies have shown that data preparation is one of the most time-consuming steps in the AI lifecycle. Beyond being time-consuming, it is non-standardized and involves a lot of trial and error on the user's part to get it right. One reason for this is the absence of standard tools and techniques to systematically measure and improve data quality and to correct the data at step 0 of the data and AI lifecycle. We have been working in this area for the last several years and have built a toolkit that helps a user to: (1) systematically measure the quality of the data using several different metrics; (2) obtain explanations for cases of low quality, pointing to the regions of data responsible; (3) receive recommended actions to improve the quality of the data, along with an easy way to execute them; and (4) view an auto-generated report that captures the history of operations on the dataset, serving both as a reference for all changes made to the data and as a record for audit and governance needs. The toolkit can serve as a decision support system that helps data scientists and other personas make informed decisions in later parts of the lifecycle, such as selecting a model that can handle the issues present in the data, dropping irrelevant features, or giving feedback to the data acquisition process. The toolkit is designed so that each data quality assessment metric provides a score on a scale of 0 to 1, where a score of 1 indicates that no issues have been identified in the data. Our work addresses multiple modalities, including tabular data, free-form text, and log datasets. To give the reader a flavor of the different problems, let us look at the different dimensions along which data needs to be assessed to be fit for an AI pipeline.
We divide this into four dimensions: (1) Assessment related to training data labels, (2) Assessment related to data distribution, (3) Assessment related to privacy and trust, (4) Assessment related to cleanliness of the data.
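The score/explanation/recommendation behavior described above can be sketched with a toy metric. The example below computes a completeness (non-missing-values) score in [0, 1] and reports the worst column; the function name and output shape are hypothetical and do not reflect the actual toolkit API.

```python
def completeness_score(rows):
    """Toy data-quality metric in the spirit described above: a score in
    [0, 1] plus an explanation and a recommended action.
    (Illustrative only -- not the actual toolkit API.)
    """
    columns = list(rows[0].keys())
    per_column = {}
    for col in columns:
        present = sum(1 for r in rows if r.get(col) not in (None, ""))
        per_column[col] = present / len(rows)
    score = sum(per_column.values()) / len(per_column)
    worst = min(per_column, key=per_column.get)  # column driving the low score
    return {
        "score": round(score, 2),
        "explanation": f"column '{worst}' has the most missing values",
        "recommendation": f"impute or drop column '{worst}'",
    }

rows = [
    {"age": 34, "city": "Pune"},
    {"age": None, "city": "Delhi"},
    {"age": 29, "city": ""},
    {"age": 41, "city": "Chennai"},
]
print(completeness_score(rows))  # score is 0.75 for this sample
```

A real assessment would run many such metrics (label quality, distribution checks, privacy, cleanliness) and aggregate them; the point here is only the contract: a bounded score, an explanation, and an actionable recommendation.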
We have also released a subset of this toolkit as free APIs available for trial use. You can try out the algorithms via our trial APIs and join our Slack community to discuss and share challenges with other like-minded practitioners.
Next, we discuss the importance of steps post model learning and our efforts associated with AI testing.
AI Model Testing
AI model testing occurs in different phases of the AI lifecycle. After model building, the data scientist uses the validation data to iteratively strengthen the model, and then one model is selected from among several candidates based on performance on the hold-out test data. Moreover, in some regulated industries, an auditor or a risk manager additionally performs model validation or risk assessment. Once deployed, the model is continuously monitored on the payload data to check its runtime performance.
Each of these four steps brings its own challenges to model testing. The first step requires that data scientists understand the reason for test failures and repair the model, either through hyperparameter tuning or by changing the training data; this calls for sophisticated techniques to localize faults in the data or the model. Testing with the hold-out test data does not require debugging, but it needs comprehensive test data to compare multiple models and the capability to understand behavioral differences between them. The risk assessment step needs unseen test samples without the additional burden of labeling the data. The monitoring phase needs to pinpoint model failures to production data drift or to particular characteristics of the trained model.
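As one concrete way the monitoring phase can pinpoint production data drift, here is a sketch of the Population Stability Index (PSI), a common drift statistic for a numeric feature. This is an assumption-laden illustration (fixed equal-width binning on the training range, a conventional ~0.2 alert threshold), not a description of our monitoring stack.

```python
import bisect
import math

def population_stability_index(train, payload, n_bins=10):
    """PSI between training and payload samples of one numeric feature.

    Values near 0 mean the payload matches training; values above ~0.2 are
    conventionally read as significant drift. (Sketch, with equal-width
    bins derived from the training range.)
    """
    lo, hi = min(train), max(train)
    edges = [lo + (hi - lo) * i / n_bins for i in range(1, n_bins)]

    def distribution(sample):
        counts = [0] * n_bins
        for x in sample:
            counts[bisect.bisect_right(edges, x)] += 1
        # Smooth empty bins so the log term below is always defined.
        return [(c + 0.5) / (len(sample) + 0.5 * n_bins) for c in counts]

    p, q = distribution(train), distribution(payload)
    return sum((pi - qi) * math.log(pi / qi) for pi, qi in zip(p, q))

train = [0.1 * i for i in range(100)]        # roughly uniform on [0, 10)
shifted = [0.1 * i + 5 for i in range(100)]  # same shape, shifted right

print(population_stability_index(train, train))    # ~0: no drift
print(population_stability_index(train, shifted))  # well above 0.2: drift
```

In practice a monitor would compute such a statistic per feature over rolling windows of payload data and raise an alert when the chosen threshold is crossed, separating drift-induced failures from model-specific ones.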
Performing testing across multiple modalities such as tabular, time-series, NLP, image, and speech, and for multiple testing properties such as generalizability, fairness, robustness, interpretability, and business KPIs, is a daunting challenge that our work tries to address. Except for generalizability, all of these properties are metamorphic, which means that labeling the test data is not required. For example, to check the robustness of a model, one can create two very similar data instances and check whether the model returns the same prediction for both. This paves the way for synthetic test data generation, which is a requirement in the risk assessment phase and alleviates other problems of the train-test split. Firstly, test data obtained from the training data split can be limited in size; this is an issue for testing properties like group fairness, which need sufficient representation of the entire data distribution. Secondly, the test data may not contain enough variation in the samples; for example, a chatbot can predict a different intent for a semantics-preserving variation of a training instance. Thirdly, most model failures occur due to changes in the distribution of the production workload relative to the training data. We address these challenges by generating synthetic test data with different characteristics based on the user's choice: 1) realistic yet different from the training data, 2) user-customizable to produce distributions and variations beyond the training data, and 3) drawn even from low-density regions, specifically for tabular data.
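The metamorphic robustness check described above — two very similar instances should receive the same prediction, so no labels are needed — can be sketched as follows. The model here is a deliberately hypothetical stand-in, and the function names and thresholds are ours, chosen only to illustrate the idea.

```python
import random

def toy_model(features):
    """Stand-in classifier: approves a loan when income minus weighted debt
    is high. (Hypothetical model, used only to illustrate the check.)
    """
    return "approve" if features["income"] - 2 * features["debt"] > 50 else "reject"

def robustness_check(model, instance, feature, epsilon, trials=100, seed=0):
    """Metamorphic robustness test: tiny perturbations of one feature should
    not flip the prediction, so no labeled test data is required.
    Returns the base prediction and the perturbed instances that flipped it.
    """
    rng = random.Random(seed)  # seeded for reproducible test generation
    base = model(instance)
    failures = []
    for _ in range(trials):
        perturbed = dict(instance)
        perturbed[feature] += rng.uniform(-epsilon, epsilon)
        if model(perturbed) != base:
            failures.append(perturbed)
    return base, failures

base, failures = robustness_check(
    toy_model, {"income": 120, "debt": 20}, "income", epsilon=1.0)
print(base, len(failures))  # prediction is stable: no failures for this instance
```

Instances that sit near the model's decision boundary would yield non-empty `failures`; collecting such flipped pairs is exactly the kind of synthetic, label-free test data the risk assessment phase needs.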
Generating effective test cases that can reveal issues with the model, and then mitigating those issues, poses significant technical challenges across the various properties and modalities.
IBM Research is actively engaging with premier academic institutes in India to solve these problems, and we have taught a course on data lifecycle management. We are always looking for new collaborations, so feel free to reach out to us if you are interested in this area.