Learning Based Data Cleansing       


Zhong Su

Learning Based Data Cleansing - overview

In this project, we extend our machine learning packages from Named Entity Recognition (NER) to data cleansing, which includes entity profiling, standardization, matching, deduplication, and householding. We focus on how to improve and adapt the learning algorithms effectively across languages, locales, data scales, and poor-quality data sets, all with minimal manual labeling effort.

IRMS (Identity Resolution & Matching Solution)

Data cleansing is an important element of data warehousing, customer relationship management, marketing, privacy, and security applications. IRMS detects and identifies a single entity - which can be a person, organization, inventory, or product - across multiple sources, even if the data is insufficient, incorrect, or fraudulent. By providing a consolidated customer view, IRMS helps banks, healthcare and insurance providers, government agencies, telecom service providers, and similar organizations to better "Know Your Customer" and thus improve service across channels and lines of business. IRMS also helps verify the identity of existing and potential clients faster and more accurately.
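
To make the standardization, matching, and deduplication steps concrete, here is a minimal Python sketch. It is not the IRMS implementation and it is not learning-based: a fixed string-similarity threshold stands in for a learned matching model, and the function names (`standardize`, `is_match`, `deduplicate`), the record fields, and the 0.85 threshold are all assumptions chosen for illustration.

```python
from difflib import SequenceMatcher


def standardize(record):
    """Lowercase each field, drop punctuation, and collapse whitespace."""
    cleaned = {}
    for key, value in record.items():
        text = "".join(
            ch for ch in value.lower() if ch.isalnum() or ch.isspace()
        )
        cleaned[key] = " ".join(text.split())
    return cleaned


def similarity(a, b):
    """Character-level similarity in [0, 1] from the standard library."""
    return SequenceMatcher(None, a, b).ratio()


def is_match(r1, r2, threshold=0.85):
    """Treat two records as the same entity if the mean similarity of
    their shared fields reaches the (assumed) threshold."""
    s1, s2 = standardize(r1), standardize(r2)
    scores = [similarity(s1[k], s2[k]) for k in s1 if k in s2]
    return bool(scores) and sum(scores) / len(scores) >= threshold


def deduplicate(records, threshold=0.85):
    """Greedy clustering: attach each record to the first cluster whose
    representative (first member) it matches, else start a new cluster."""
    clusters = []
    for rec in records:
        for cluster in clusters:
            if is_match(cluster[0], rec, threshold):
                cluster.append(rec)
                break
        else:
            clusters.append([rec])
    return clusters


if __name__ == "__main__":
    # Hypothetical records from two sources describing overlapping people.
    records = [
        {"name": "John A. Smith", "city": "New York"},
        {"name": "john a smith", "city": "New York "},
        {"name": "Maria Garcia", "city": "Madrid"},
    ]
    for cluster in deduplicate(records):
        print(cluster)
```

In a learning-based system, the fixed threshold and the unweighted mean would typically be replaced by a classifier trained on labeled record pairs, which is where the manual labeling effort mentioned above comes in.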