Douglas R. (Doug) Burdick, Mauricio A. Hernandez, Min Li, Lucian Popa

Midas - overview

The Unstructured Entity Integration group (also known as the Midas group) is conducting research in the general area of large-scale data integration over combinations of structured and unstructured sources, with a focus on deriving rich entity views from the raw data. More specifically, we focus on developing the algorithms, tools and languages that make it easier to create, maintain, refine and deploy complex entity resolution and integration flows over unstructured data. Such flows require multiple types of operations to extract the raw data, and then link, combine and fuse information, in order to construct a rich, entity-centric view of a specific data domain.
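As a rough illustration of the extract, link and fuse operations described above, here is a minimal Python sketch. It is not HIL and not the group's actual implementation. The sample records, the field names, the `normalize` matching key and the "first value seen wins" fusion policy are all assumptions made for the example.

```python
from collections import defaultdict

# Hypothetical raw records from two made-up sources (illustration only).
FILINGS = [
    {"company": "Acme Corp.", "ceo": "Jane Roe"},
    {"company": "Globex Inc", "ceo": "John Doe"},
]
NEWS = [
    {"name": "ACME CORP", "city": "Springfield"},
    {"name": "Initech", "city": "Austin"},
]


def normalize(name):
    """Crude matching key: lowercase, drop punctuation and a few legal suffixes."""
    cleaned = "".join(ch for ch in name.lower() if ch.isalnum() or ch.isspace())
    tokens = [t for t in cleaned.split() if t not in {"corp", "inc", "llc"}]
    return " ".join(tokens)


def extract(records, name_field):
    """Extract step: turn each raw record into a (key, attributes) pair."""
    for rec in records:
        attrs = {k: v for k, v in rec.items() if k != name_field}
        attrs["name"] = rec[name_field]
        yield normalize(rec[name_field]), attrs


def link(*extracted_sources):
    """Link step: group records from all sources that share the same key."""
    groups = defaultdict(list)
    for source in extracted_sources:
        for key, attrs in source:
            groups[key].append(attrs)
    return groups


def fuse(groups):
    """Fuse step: merge each group into one entity; the first value seen wins."""
    entities = {}
    for key, members in groups.items():
        entity = {}
        for attrs in members:
            for field, value in attrs.items():
                entity.setdefault(field, value)
        entities[key] = entity
    return entities


def build_entities():
    """Run the whole toy flow over the two hypothetical sources."""
    groups = link(extract(FILINGS, "company"), extract(NEWS, "name"))
    return fuse(groups)


if __name__ == "__main__":
    for key, entity in sorted(build_entities().items()):
        print(key, entity)
```

A real flow would replace exact key matching with fuzzier, rule-based or probabilistic linking, and would need an explicit policy for conflicting values. The sketch only shows how the three kinds of operations compose into an entity-centric view.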

One of the main technology components that we have built is HIL, a high-level language for expressing unstructured entity integration and resolution algorithms. HIL has been used extensively within IBM for various applications and customer engagements, and was recently productized as part of InfoSphere BigMatch. See the following paper, and also the additional HIL page, for more details.


HIL: A High-Level Scripting Language for Entity Integration. Mauricio A. Hernández, Georgia Koutrika, Rajasekar Krishnamurthy, Lucian Popa, Ryan Wisnesky. In EDBT 2013.


Entity integration flows are at the heart of the more advanced data analytics applications and systems that require or benefit from a unified, entity-centric view of the data. Some prominent examples where HIL is featured include:

  • Financial entity profile construction from public regulatory filings (SEC and FDIC):
Extracting, Linking and Integrating Data from Public Sources: A Financial Case Study. Douglas Burdick, Mauricio A Hernández, Howard Ho, Georgia Koutrika, Rajasekar Krishnamurthy, Lucian Popa, Ioana Stanoi, Shivakumar Vaithyanathan, Sanjiv R Das. In IEEE Data Eng. Bull., 2011.
  • Consumer profile construction from social media and other consumer demographics data:
Surfacing time-critical insights from social media. Bogdan Alexe, Mauricio A Hernández, Kirsten Hildrum, Rajasekar Krishnamurthy, Georgia Koutrika, Meenakshi Nagarajan, Haggai Roitman, Michal Shmueli-Scheuer, Ioana R. Stanoi, Chitra Venkatramani, Rohit Wagle. SIGMOD Demonstration, 2012.
  • Water Cost Index calculation from water utility filings. See this link: Water Cost Index


Another related and important area of our research is developing the theoretical foundations that can give us a better understanding of the entity integration space. In the past, we have done foundational work on schema mapping and data exchange (recognized with two Test-of-Time Awards, at ICDT 2013 and PODS 2014). More recently, we have introduced a new declarative framework that provides the foundations for entity linking, and we have shown several interesting connections to HIL as well as to probabilistic approaches for matching (e.g., Markov Logic Networks). This work appeared at the ICDT 2015 conference, where it received the Best Paper Award.


A Declarative Framework for Linking Entities. Douglas Burdick, Ronald Fagin, Phokion G. Kolaitis, Lucian Popa, Wang-Chiew Tan. In ICDT 2015.