Human-in-the-Loop Entity Resolution and Knowledge Curation - overview


Our group conducts research toward the next generation of human-in-the-loop systems and tools for building curated knowledge from combinations of unstructured and semi-structured data sources, across various deep domains. A key operation in such knowledge curation is entity resolution: the ability to link pieces of semantically related information across large amounts of data. Other related and important operations are entity attribute normalization, data transformation, and entity fusion, all of which are used to build entity-centric views from the underlying data. Our work is complementary to natural language understanding and text extraction (see the related SystemT project), and it is often applied to data produced by natural language techniques.
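To make the entity resolution operation concrete, here is a minimal Python sketch that normalizes author names, scores record pairs, and groups the records that appear to describe the same person. It is an illustration only, not the group's method and not HIL: the record fields, the 0.85 threshold, the 0.1 affiliation boost, and the function names are all assumptions chosen for the example.

```python
from difflib import SequenceMatcher
from itertools import combinations


def normalize(name: str) -> str:
    """Lowercase, drop punctuation, and sort tokens so 'Last, First' equals 'First Last'."""
    cleaned = "".join(ch if ch.isalnum() or ch.isspace() else " " for ch in name.lower())
    return " ".join(sorted(cleaned.split()))


def similarity(a: dict, b: dict) -> float:
    """Name similarity in [0, 1], with a small (assumed) boost for a shared affiliation."""
    score = SequenceMatcher(None, normalize(a["name"]), normalize(b["name"])).ratio()
    if a["affiliation"] == b["affiliation"]:
        score = min(1.0, score + 0.1)
    return score


def resolve(records: list, threshold: float = 0.85) -> list:
    """Group records whose pairwise similarity reaches the threshold (union-find)."""
    parent = list(range(len(records)))

    def find(i: int) -> int:
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    # Comparing all pairs is quadratic; real systems first block candidates.
    for i, j in combinations(range(len(records)), 2):
        if similarity(records[i], records[j]) >= threshold:
            parent[find(i)] = find(j)

    clusters = {}
    for i, record in enumerate(records):
        clusters.setdefault(find(i), []).append(record["id"])
    return sorted(clusters.values())


# Hypothetical author records, invented for this example.
records = [
    {"id": "a1", "name": "Smith, John A.", "affiliation": "Mayo Clinic"},
    {"id": "a2", "name": "John A. Smith", "affiliation": "Mayo Clinic"},
    {"id": "a3", "name": "Maria Garcia", "affiliation": "Stanford"},
    {"id": "a4", "name": "M. Garcia", "affiliation": "Stanford"},
]

print(resolve(records))  # → [['a1', 'a2'], ['a3', 'a4']]
```

The sketch covers attribute normalization (`normalize`) and matching (`resolve`); entity fusion, which would merge each group into a single entity-centric record, is left out for brevity.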

Examples: a) Matching authors across medical research publications; b) An integrated entity view obtained via text extraction and entity resolution from many unstructured regulatory filings (from SEC and FDIC).


HIL. One of the main technology components that we have built is the high-level language HIL, used as a representation language for unstructured entity integration and resolution algorithms. Once expressed in the language, these operations become reusable, explainable and scalable artifacts that can run on platforms such as Hadoop or Spark as well as on a single node. HIL has been used extensively within IBM for various applications and customer engagements, and has been productized as part of Infosphere BigMatch.

Active Learning and Human-in-the-Loop. We also focus on making it easier to create, maintain, refine and deploy complex entity resolution and integration flows over heterogeneous datasets. One of the key challenges here is the development of human-in-the-loop tools that target domain experts (rather than programmers) and focus on extracting domain-specific or task-specific knowledge from the human expert, in order to map this knowledge into the "right" algorithms for entity resolution and the associated operations. As part of this, we are developing active learning techniques, sometimes complemented with deep learning and transfer learning, in order to further automate the entity resolution process and reduce the amount of labeled data required from an expert.
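The idea of reducing the labels required from an expert can be illustrated with uncertainty sampling, one common active learning strategy. The Python sketch below is a simplified illustration under stated assumptions, not the group's technique: the matcher is reduced to a single similarity threshold, the expert is simulated by a function, and every name and number (`fit_threshold`, `most_uncertain`, `active_learn`, the seed labels, the pool scores) is hypothetical. Deep learning and transfer learning are not shown.

```python
def fit_threshold(labeled):
    """Place the decision boundary midway between the lowest-scoring labeled match
    and the highest-scoring labeled non-match (assumes both kinds are present)."""
    matches = [score for score, is_match in labeled if is_match]
    non_matches = [score for score, is_match in labeled if not is_match]
    return (min(matches) + max(non_matches)) / 2


def most_uncertain(pool, threshold):
    """Uncertainty sampling: pick the unlabeled pair closest to the boundary."""
    return min(pool, key=lambda score: abs(score - threshold))


def active_learn(pool, seed_labels, ask_expert, budget):
    """Query the expert on at most `budget` pairs, refitting after each answer."""
    pool = list(pool)
    labeled = list(seed_labels)
    queried = []
    threshold = fit_threshold(labeled)
    for _ in range(min(budget, len(pool))):
        score = most_uncertain(pool, threshold)
        pool.remove(score)
        labeled.append((score, ask_expert(score)))  # the human-in-the-loop step
        queried.append(score)
        threshold = fit_threshold(labeled)
    return threshold, queried


# Each number stands for the similarity score of one candidate record pair.
pool = [0.30, 0.48, 0.55, 0.65, 0.70, 0.75, 0.80]
seed_labels = [(0.10, False), (0.95, True)]  # one known non-match, one known match


def simulated_expert(score):
    """Stand-in for a domain expert whose (hidden) boundary is 0.72."""
    return score >= 0.72


threshold, queried = active_learn(pool, seed_labels, simulated_expert, budget=4)
print(queried)              # → [0.55, 0.75, 0.65, 0.7]
print(round(threshold, 3))  # about 0.725, close to the expert's hidden boundary
```

With four questions the learned threshold lands between 0.70 and 0.75, bracketing the simulated expert's boundary, and the pairs at 0.30, 0.48 and 0.80 are never shown to the expert. A practical system would use a richer model and handle noisy or conflicting labels, which this sketch ignores.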