Human-in-the-Loop Entity Resolution and Knowledge Curation - overview
Our group conducts research toward the next generation of human-in-the-loop systems and tools for building curated knowledge from combinations of unstructured and semi-structured data sources, across various deep domains. A key operation in such knowledge curation is entity resolution: the ability to link pieces of semantically related information across large amounts of data. Our work is complementary to natural language understanding and text extraction (see the related SystemT project), and it is often applied on top of the data produced by natural language techniques.
In more detail, we focus on developing the algorithms, tools and languages that make it easier to create, maintain, refine and deploy complex entity resolution and integration flows over heterogeneous datasets. Such flows require entity-centric operations that link, combine and fuse information in order to construct a rich view of the concrete entity instances and their relationships in a specific data domain.
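As a rough illustration of the link and fuse operations mentioned above, here is a minimal Python sketch. It is not HIL and not the group's implementation: the two record sets, their field names, the `normalize` helper, and the exact-match rule on normalized name plus city are all assumptions made up for this example.

```python
from collections import defaultdict

# Hypothetical records from two sources; all field names and values are made up.
filings = [
    {"id": "f1", "name": "Acme Holdings Inc.", "city": "Boston", "assets": 120},
    {"id": "f2", "name": "Borealis Bank", "city": "Denver", "assets": 45},
]
registry = [
    {"id": "r1", "name": "ACME HOLDINGS INC", "city": "Boston", "ceo": "J. Doe"},
    {"id": "r2", "name": "Borealis Bancorp", "city": "Austin", "ceo": "A. Roe"},
]

def normalize(name):
    """Lowercase, strip punctuation, and collapse whitespace."""
    kept = "".join(ch for ch in name.lower() if ch.isalnum() or ch.isspace())
    return " ".join(kept.split())

def link(left, right):
    """Link step: pair records whose normalized name and city agree."""
    index = defaultdict(list)
    for rec in right:
        index[(normalize(rec["name"]), rec["city"].lower())].append(rec)
    pairs = []
    for rec in left:
        key = (normalize(rec["name"]), rec["city"].lower())
        for match in index.get(key, []):
            pairs.append((rec, match))
    return pairs

def fuse(pairs):
    """Fuse step: merge each linked pair into one entity-centric record."""
    entities = []
    for a, b in pairs:
        entity = {"sources": [a["id"], b["id"]]}
        for rec in (b, a):  # apply a last, so its values win on shared fields
            for field, value in rec.items():
                if field != "id":
                    entity[field] = value
        entities.append(entity)
    return entities

profiles = fuse(link(filings, registry))
```

Here only the two Acme records are linked, and the fused profile carries fields from both sources. Real flows would replace the exact-match rule with richer matching conditions and would need an explicit policy for conflicting values.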
One of the main technology components that we have built is HIL, a high-level language for expressing unstructured entity integration and resolution algorithms. Once expressed in the language, these operations become reusable, explainable and scalable artifacts that can be run across platforms such as Hadoop or Spark. HIL has been used extensively within IBM for various applications and customer engagements, and has been productized as part of InfoSphere BigMatch.
One of the key related challenges that we address is the development of human-in-the-loop tools that target domain experts rather than programmers. These tools focus on extracting the domain- or task-specific knowledge from the human expert and mapping this knowledge into the "right" algorithms for entity resolution and the associated operations. As part of this, we have recently developed a new active learning strategy for entity resolution algorithms, in which the system actively searches for a small set of the most representative examples (matches or non-matches) to be labeled by the expert. Through a sequence of such interactions, the system refines its understanding of the domain or of the task at hand, and learns algorithms that have quality guarantees, using HIL as the target representation language.
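The general shape of such an interaction loop can be sketched in Python. This is only a toy stand-in, not the strategy described above: it uses plain uncertainty sampling over a single similarity threshold, assumes that matches always score higher than non-matches, and offers no quality guarantees. The names `learn_threshold` and `ask_expert`, the candidate scores, and the simulated expert are all hypothetical.

```python
def learn_threshold(scores, ask_expert, budget):
    """Learn a match threshold on a similarity score from few expert labels.

    Assumes matches always score higher than non-matches (a simplification).
    Returns the learned threshold and the number of labels requested.
    """
    lo, hi = 0.0, 1.0            # the true threshold lies somewhere in (lo, hi]
    unlabeled = sorted(scores)
    asked = 0
    while asked < budget:
        # Only pairs strictly between the bounds can still change the answer.
        informative = [s for s in unlabeled if lo < s < hi]
        if not informative:
            break
        mid = (lo + hi) / 2
        # Query the pair the current threshold is least certain about.
        query = min(informative, key=lambda s: abs(s - mid))
        unlabeled.remove(query)
        asked += 1
        if ask_expert(query):
            hi = query           # labeled a match: threshold is at or below it
        else:
            lo = query           # labeled a non-match: threshold is above it
    return (lo + hi) / 2, asked

# Hypothetical similarity scores for 19 candidate record pairs.
candidate_scores = [i / 100 for i in range(5, 100, 5)]

def simulated_expert(score):
    """Stand-in for the human expert; the 0.62 cutoff is made up."""
    return score >= 0.62

threshold, labels_used = learn_threshold(candidate_scores, simulated_expert, budget=10)
```

With these inputs the loop stops after labeling only a handful of the 19 candidate pairs, because every remaining pair already falls on a known side of the learned threshold. A realistic system would learn over many features and rule structures rather than one score.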
The following paper introduces the HIL language (see also the additional HIL page for more details):
HIL: A High-Level Scripting Language for Entity Integration. Mauricio A. Hernández, Georgia Koutrika, Rajasekar Krishnamurthy, Lucian Popa, Ryan Wisnesky. In EDBT 2013.
Entity integration flows are at the heart of the more advanced data analytics applications and systems that require or benefit from a unified, entity-centric view of the data. Some prominent examples where HIL is featured include:
- Financial entity profile construction from public regulatory filings (SEC and FDIC):
Extracting, Linking and Integrating Data from Public Sources: A Financial Case Study. Douglas Burdick, Mauricio A. Hernández, Howard Ho, Georgia Koutrika, Rajasekar Krishnamurthy, Lucian Popa, Ioana Stanoi, Shivakumar Vaithyanathan, Sanjiv R. Das. In IEEE Data Eng. Bull., 2011.
- Consumer profile construction from social media and other consumer demographics data:
Surfacing time-critical insights from social media. Bogdan Alexe, Mauricio A. Hernández, Kirsten Hildrum, Rajasekar Krishnamurthy, Georgia Koutrika, Meenakshi Nagarajan, Haggai Roitman, Michal Shmueli-Scheuer, Ioana R. Stanoi, Chitra Venkatramani, Rohit Wagle. SIGMOD Demonstration, 2012.
- Matching of consumer profiles with internal customer records. See: Social MDM Matching in IBM InfoSphere Master Data Management
- Water Cost Index calculation from water utility filings. See: Water Cost Index
Another important and related area of our research is developing the theoretical foundations that can give us a better understanding of the entity integration space. In the past, we have done foundational work on schema mapping and data exchange (with two Test-of-Time Awards, at ICDT 2013 and PODS 2014). More recently, we have introduced a new declarative framework that provides the foundations for entity linking, and we have shown several interesting connections to HIL as well as to probabilistic approaches for matching (e.g., Markov Logic Networks). This work appeared at the ICDT 2015 conference, where it received the Best Paper Award, with a follow-up paper at ICDT 2017 that focused on the expressive power of entity linking frameworks.
A Declarative Framework for Linking Entities. Douglas Burdick, Ronald Fagin, Phokion G. Kolaitis, Lucian Popa, Wang-Chiew Tan. In ICDT 2015.
Expressive Power of Entity-Linking Frameworks. Douglas Burdick, Ronald Fagin, Phokion G. Kolaitis, Lucian Popa, Wang-Chiew Tan. In ICDT 2017.