Data Readiness for AI     


Shashank Mujumdar, Vitobha Munigala, Hima Patel

Data Readiness for AI - overview

In the last several years, AI/ML technologies have become pervasive in industry, finding utility in newer applications and industry verticals. While there has been a focus on building better, smarter, and more automated ML models, little work has been done to systematically understand the challenges in the data before it is fed to a model. It is well understood that an ML model is only as good as the data it is trained on, so a systematic analysis of data readiness before building AI/ML models is of utmost importance. In addition, several studies have shown that a data scientist spends 60-80% of their time in the data preparation stage. We aim to reduce this effort by analyzing the data at the start of the data-science lifecycle and recommending suitable steps for data scientists to prepare it. We address these challenges by building a framework that acts as a decision support system for data scientists and gives insights into the quality of the data with respect to building a machine learning model. Specifically, our framework does three things:

1. Analyzes the data and gives a score on data readiness across different dimensions.
2. Explains a low score by pointing to regions in the data that are of low quality.
3. Recommends data fixes to make the data ready for AI.
We build algorithms and tools so that both data scientists and data stewards can use them to analyze and clean the data.
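
The three steps above (score, explain, recommend) can be illustrated with a minimal sketch for a single quality dimension, completeness. This is not the toolkit's API: the function name `completeness_report`, the `threshold` parameter, the report keys, and the sample records are all hypothetical, and the suggested fixes are deliberately simplistic.

```python
from statistics import median

# Hypothetical illustration only; not the API of the toolkit described above.
def completeness_report(rows, threshold=0.9):
    """Score each column by its fraction of non-missing values, point to the
    rows that lower the score, and suggest a simple fix."""
    if not rows:
        return {}
    columns = sorted({key for row in rows for key in row})
    report = {}
    for col in columns:
        # Explain: indices of rows where the value is missing (None or absent).
        missing_rows = [i for i, row in enumerate(rows) if row.get(col) is None]
        # Score: fraction of rows that have a value for this column.
        score = 1 - len(missing_rows) / len(rows)
        present = [row[col] for row in rows if row.get(col) is not None]
        # Recommend: a naive fix, chosen by the type of the observed values.
        if score >= threshold:
            fix = None
        elif present and all(isinstance(v, (int, float)) for v in present):
            fix = f"impute with median ({median(present)})"
        elif present:
            fix = "impute with most frequent value"
        else:
            fix = "drop column"
        report[col] = {
            "score": score,
            "missing_rows": missing_rows,
            "recommendation": fix,
        }
    return report

# Made-up sample records for demonstration.
rows = [
    {"age": 34, "city": "Pune"},
    {"age": None, "city": "Delhi"},
    {"age": 29, "city": "Pune"},
    {"age": 41, "city": None},
]
report = completeness_report(rows)
for column, details in report.items():
    print(column, details)
```

A real readiness assessment would cover more dimensions than completeness (for example label noise or class imbalance), but the same score, explain, and recommend structure would apply to each.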


Relevant Articles

  1. If Your Data Is Bad, Your Machine Learning Tools Are Useless
  2. Cleaning Big Data: Most Time-Consuming, Least Enjoyable Data Science Task, Survey Says


Workshop proposal accepted at PAKDD 2021: Data Assessment and Readiness for Artificial Intelligence

Tutorial accepted at KDD 2020: Overview and Importance of Data Quality for Machine Learning Tasks

The Data Readiness Toolkit for AI will be demoed at the IBM booth at SIGMOD 2020.