Many enterprises maintain large repositories of unstructured text data, ranging from email and web pages to call-center records and business reports. Unfortunately, this data is of limited use as long as it remains in its unstructured form. Consequently, there has been increasing interest in enterprise information extraction: building annotators that extract structured information from unstructured enterprise data. Existing information extraction systems have difficulty scaling to enterprise-wide document collections, and building new annotators generally requires specialized expertise and training.
The SystemT project makes information extraction orders of magnitude more scalable and much easier to use. Our information extraction system is built around AQL, a declarative rule language with a familiar SQL-like syntax. AQL replaces the multiple obscure languages typically used to build annotators. Because AQL is declarative, rule developers can focus on what to extract and leave it to SystemT's cost-based optimizer to determine the most efficient execution plan for the annotator.
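To make the split between "what to extract" and "how to execute it" concrete, here is a minimal Python sketch. It is not AQL and does not reflect how SystemT's optimizer works; the rule format, the cost estimate, and every name in it (`RULE`, `estimated_cost`, `execute`, and so on) are hypothetical and chosen only for illustration. The rule is plain data describing a name followed closely by a phone number, and a toy planner decides which extractor to run first.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    begin: int
    end: int
    text: str


def extract_regex(pattern, doc):
    """Return one span per regex match in the document."""
    return [Span(m.start(), m.end(), m.group()) for m in re.finditer(pattern, doc)]


def extract_dictionary(entries, doc):
    """Return one span per whole-word occurrence of any dictionary entry."""
    spans = []
    for entry in entries:
        for m in re.finditer(r"\b" + re.escape(entry) + r"\b", doc):
            spans.append(Span(m.start(), m.end(), m.group()))
    return sorted(spans, key=lambda s: s.begin)


def follows_within(left, right, max_gap):
    """Pair each left span with right spans that start at most max_gap chars after it."""
    return [(l, r) for l in left for r in right if 0 <= r.begin - l.end <= max_gap]


# The rule states only WHAT to extract (hypothetical format, not AQL).
RULE = {
    "left": ("dictionary", ["Alice", "Bob"]),
    "right": ("regex", r"\d{3}-\d{4}"),
    "max_gap": 20,
}


def run_operand(operand, doc):
    kind, arg = operand
    if kind == "regex":
        return extract_regex(arg, doc)
    if kind == "dictionary":
        return extract_dictionary(arg, doc)
    raise ValueError(f"unknown operand kind: {kind}")


def estimated_cost(operand):
    """Toy cost model (an assumption): a dictionary costs one scan per entry."""
    kind, arg = operand
    return len(arg) if kind == "dictionary" else 1


def execute(rule, doc):
    """Decide HOW to run the rule: cheapest operand first, stop early if it is empty."""
    order = sorted(["left", "right"], key=lambda side: estimated_cost(rule[side]))
    results = {}
    for side in order:
        results[side] = run_operand(rule[side], doc)
        if not results[side]:
            return []  # the other operand cannot produce any pair
    return follows_within(results["left"], results["right"], rule["max_gap"])


doc = "Call Alice at 555-1234 or write to Bob."
pairs = [(l.text, r.text) for l, r in execute(RULE, doc)]
```

The result does not depend on which operand the planner evaluates first, which is what lets an optimizer reorder the work without the rule author being involved.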
SystemT currently ships in InfoSphere BigInsights and Streams, is deployed in many IBM products (Lotus Notes, Social Media Analytics, IBM eDiscovery Analyzer, etc.), is installed at more than 400 customers, and is being used in several ongoing research projects.
Our current research focuses on building a complete tooling framework around SystemT that facilitates the development and maintenance of extraction rules for both expert and non-expert users. We are interested in both the algorithmic and the user-experience aspects of the framework.