Midas - overview
The Unstructured Entity Integration group (also known as the Midas group) is conducting research in the general area of large-scale data integration over combinations of structured and unstructured sources, with a focus on deriving rich entity views from the raw data. More specifically, we focus on developing the algorithms, tools and languages that make it easier to create, maintain, refine and deploy complex entity resolution and integration flows over unstructured data. Such flows require multiple types of entity-centric operations to extract the raw data, and then link, combine and fuse information, in order to construct a rich view around the concrete entity instances and their relationships, in a specific data domain.
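As a rough illustration of the link and fuse steps described above, the following minimal Python sketch groups records from two sources into entities and merges their attributes. It is not HIL syntax and not the group's implementation: the record fields, the sample data, the suffix list, the similarity threshold and the function names (`normalize`, `is_match`, `link`, `fuse`) are all assumptions made for this example.

```python
from collections import defaultdict
from difflib import SequenceMatcher

# Hypothetical list of corporate suffixes to ignore when comparing names.
SUFFIXES = {"inc", "corp", "co", "ltd"}

def normalize(name):
    """Lowercase, replace punctuation with spaces, drop common suffixes."""
    cleaned = "".join(c if c.isalnum() else " " for c in name.lower())
    return " ".join(t for t in cleaned.split() if t not in SUFFIXES)

def is_match(a, b, threshold=0.9):
    """Assumed matching rule: same state and highly similar normalized names."""
    if a["state"] != b["state"]:
        return False
    ratio = SequenceMatcher(None, normalize(a["name"]), normalize(b["name"])).ratio()
    return ratio >= threshold

def link(records):
    """Cluster records by the transitive closure of pairwise matches (union-find)."""
    parent = list(range(len(records)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    # All-pairs comparison: fine for a toy example, quadratic in general.
    for i in range(len(records)):
        for j in range(i + 1, len(records)):
            if is_match(records[i], records[j]):
                parent[find(j)] = find(i)

    clusters = defaultdict(list)
    for i, record in enumerate(records):
        clusters[find(i)].append(record)
    return list(clusters.values())

def fuse(cluster):
    """Build one entity view: the sorted distinct values seen for each attribute."""
    entity = defaultdict(set)
    for record in cluster:
        for key, value in record.items():
            entity[key].add(value)
    return {key: sorted(values) for key, values in entity.items()}

# Made-up records standing in for data extracted from two sources.
records = [
    {"source": "sec", "name": "Acme Corp.", "state": "NY", "cik": "0001"},
    {"source": "fdic", "name": "ACME Corp", "state": "NY", "cert": "777"},
    {"source": "sec", "name": "Zenith Bank", "state": "CA", "cik": "0002"},
]
entities = [fuse(cluster) for cluster in link(records)]
```

Here the two Acme records end up in one entity that carries both the `cik` and the `cert` identifiers, while the Zenith record stays on its own. The all-pairs comparison would not scale to large inputs; a realistic flow would first restrict the candidate pairs, for example by blocking on a key.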
One of the main technology components that we have built is HIL, a high-level language for expressing unstructured entity integration and resolution algorithms. Once expressed in the language, these operations become reusable, explainable and scalable artifacts that can be run across platforms such as Hadoop or Spark. HIL has been used extensively within IBM for various applications and customer engagements, and has been productized as part of InfoSphere BigMatch.
One of the key related challenges that we address is the development of human-in-the-loop tools that target domain experts (rather than programmers), and focus on extracting the domain or task-specific knowledge from the human expert, in order to map this knowledge into the "right" algorithms for entity resolution and the associated operations. As part of this, we have recently developed a new active learning strategy for entity resolution algorithms, in which the system actively searches for a small set of most representative examples (matches or non-matches) to be labeled by the expert. As a result of a sequence of such interactions, the system refines its understanding of the domain or of the task at hand, and learns algorithms that have quality guarantees, while using HIL as the target representation language.
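To make the interaction loop concrete, here is a minimal Python sketch of a generic active learning loop based on uncertainty sampling. It is deliberately not the strategy from our paper: the learner is just a threshold on a single similarity score, it carries no quality guarantees, and it does not produce HIL. The function `learn_threshold`, the scores and the simulated expert are all assumptions made for this example.

```python
def learn_threshold(scores, oracle, budget):
    """Learn a match threshold over similarity scores via uncertainty sampling.

    scores: one similarity score per candidate pair (assumed 1-D feature).
    oracle: callable index -> bool, standing in for the domain expert.
    budget: maximum number of labels to request from the expert.
    """
    labeled = {}
    threshold = 0.5  # arbitrary initial guess
    for _ in range(budget):
        unlabeled = [i for i in range(len(scores)) if i not in labeled]
        if not unlabeled:
            break
        # Ask about the pair the current model is least certain about,
        # i.e. the one whose score lies closest to the threshold.
        query = min(unlabeled, key=lambda k: abs(scores[k] - threshold))
        labeled[query] = oracle(query)
        # Refit: place the threshold midway between the highest labeled
        # non-match and the lowest labeled match.
        matches = [scores[k] for k, is_match in labeled.items() if is_match]
        non_matches = [scores[k] for k, is_match in labeled.items() if not is_match]
        low = max(non_matches) if non_matches else 0.0
        high = min(matches) if matches else 1.0
        threshold = (low + high) / 2
    return threshold, labeled

# Made-up similarity scores for seven candidate pairs, and a simulated
# expert who considers a pair a match when its score is at least 0.7.
scores = [0.10, 0.35, 0.55, 0.62, 0.71, 0.80, 0.95]
truth = [s >= 0.7 for s in scores]

threshold, labeled = learn_threshold(scores, lambda i: truth[i], budget=3)
predictions = [s >= threshold for s in scores]
```

In this toy run the learner requests three labels (the pairs scored 0.55, 0.80 and 0.71, in that order) and its final threshold classifies all seven pairs as the simulated expert would. The point of the sketch is only the shape of the loop: the system picks the example, the expert labels it, and the model is refined.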
The first paper below introduces the HIL language (see also the additional HIL page for more details); the second describes the active learning strategy that we have built on top of HIL.
HIL: A High-Level Scripting Language for Entity Integration. Mauricio A. Hernández, Georgia Koutrika, Rajasekar Krishnamurthy, Lucian Popa, Ryan Wisnesky. In EDBT 2013.
Active Learning for Large-Scale Entity Resolution Algorithms. Kun Qian, Lucian Popa, Prithviraj Sen. In CIKM 2017.
Entity integration flows are at the heart of the more advanced data analytics applications and systems that require or benefit from a unified, entity-centric view of the data. Some prominent examples where HIL is featured include:
- Financial entity profile construction from public regulatory filings (SEC and FDIC):
Extracting, Linking and Integrating Data from Public Sources: A Financial Case Study. Douglas Burdick, Mauricio A. Hernández, Howard Ho, Georgia Koutrika, Rajasekar Krishnamurthy, Lucian Popa, Ioana Stanoi, Shivakumar Vaithyanathan, Sanjiv R. Das. In IEEE Data Eng. Bull., 2011.
- Consumer profile construction from social media and other consumer demographics data:
Surfacing time-critical insights from social media. Bogdan Alexe, Mauricio A. Hernández, Kirsten Hildrum, Rajasekar Krishnamurthy, Georgia Koutrika, Meenakshi Nagarajan, Haggai Roitman, Michal Shmueli-Scheuer, Ioana R. Stanoi, Chitra Venkatramani, Rohit Wagle. SIGMOD Demonstration, 2012.
- Matching of consumer profiles with internal customer records. See this link: Social MDM Matching in IBM InfoSphere Master Data Management
- Water Cost Index calculation from water utility filings. See this link: Water Cost Index
Another important, related area of our research is developing the theoretical foundations that can give us a better understanding of the entity integration space. In the past, we have done foundational work on schema mapping and data exchange (with two Test-of-Time Awards, at ICDT 2013 and PODS 2014). More recently, we have introduced a new declarative framework that provides the foundations for entity linking, and we have shown several interesting connections to HIL as well as to probabilistic approaches for matching (e.g., Markov Logic Networks). This work appeared at ICDT 2015, where it received the Best Paper Award, with a follow-up paper at ICDT 2017 that focused on the expressive power of entity linking frameworks.
A Declarative Framework for Linking Entities. Douglas Burdick, Ronald Fagin, Phokion G. Kolaitis, Lucian Popa, Wang-Chiew Tan. In ICDT 2015.
Expressive Power of Entity-Linking Frameworks. Douglas Burdick, Ronald Fagin, Phokion G. Kolaitis, Lucian Popa, Wang-Chiew Tan. In ICDT 2017.