The SystemT project is an amalgam of two major research themes centered around analytics and search over unstructured content. These two themes are represented by two corresponding sub-projects: SystemT-Information Extraction (SystemT-IE) and SystemT-Programmable Search (SPS).
SystemT-IE: Many enterprises maintain large repositories of unstructured text data, ranging from email and web pages to call-center records and business reports. Unfortunately, this data is of limited use as long as it remains in its unstructured form. Consequently, there has been an increasing interest in enterprise information extraction: building annotators that extract structured information from unstructured enterprise data. Existing information extraction systems have difficulty scaling to enterprise-wide document collections, and building new annotators generally requires specialized expertise and training.
The SystemT-IE project makes information extraction orders of magnitude more scalable and easier to use. Our information extraction system is built around AQL, a declarative rule language with a familiar SQL-like syntax. AQL replaces the multiple obscure languages typically used to build annotators. Because AQL is declarative, rule developers can focus on what to extract and leave it to SystemT-IE's cost-based optimizer to determine the most efficient execution plan for the annotator. SystemT-IE's information extraction engine is currently deployed in many IBM products (Lotus Notes, IBM eDiscovery Analyzer, etc.) and is being used in several ongoing research projects.
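To make the split between "what to extract" and "how to execute it" concrete, here is a minimal sketch in Python. It is not AQL and does not use any SystemT-IE API: the rule table `RULES`, the `Span` type, and the functions `extract` and `person_phone_pairs` are all hypothetical names chosen for illustration, and the single "plan" decision shown (skipping one scan when the other finds nothing) is only a toy stand-in for a real cost-based optimizer.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    """A matched region of the document (hypothetical type, not a SystemT-IE class)."""
    begin: int
    end: int
    text: str


# Declarative part: each rule states only WHAT to extract, as a pattern.
# Both patterns are illustrative assumptions.
RULES = {
    "Phone": r"\b\d{3}-\d{4}\b",
    "Person": r"\b(?:Linda|Carlos)\b",
}


def extract(rule_name, document):
    """Run one named rule over the document and return its spans."""
    pattern = re.compile(RULES[rule_name])
    return [Span(m.start(), m.end(), m.group()) for m in pattern.finditer(document)]


def person_phone_pairs(document, max_gap=30):
    """Pair each person with a phone number starting within max_gap characters after the name.

    The rule author never says in which order to scan; this toy engine
    decides to look for phone numbers first and to skip the person scan
    when there are none.
    """
    phones = extract("Phone", document)
    if not phones:
        return []  # engine's choice: no phones, so no pair can exist
    persons = extract("Person", document)
    return [
        (person.text, phone.text)
        for person in persons
        for phone in phones
        if 0 <= phone.begin - person.end <= max_gap
    ]
```

Calling `person_phone_pairs("Call Linda at 555-1234 today. Carlos is out.")` pairs Linda with the number that follows her name and leaves Carlos unpaired, because no phone number follows his name.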
Our current research is focused on building a complete tooling framework around SystemT-IE aimed at facilitating the development and maintenance of extraction rules, for both expert and non-expert users. We are interested in the algorithmic aspects, as well as the user experience aspects of the framework.
SPS: SPS is a platform for developing and integrating high-quality search into a wide range of enterprise applications. There are two key ideas underlying SPS:
Concept-based search: The main idea of concept-based search is that user query terms are not merely matched against document text but "interpreted" (i.e., associated with a specific meaning) in the context of a search taxonomy. For example, consider the domain of personal email search with an appropriate taxonomy describing the concepts that people often look for in their emails -- persons, phone numbers, addresses, etc. Given such a taxonomy, SPS will interpret the query "linda address" as "find emails that mention person linda's address", as opposed to merely retrieving emails containing the words "linda" and "address". In general, the task of producing all possible interpretations and narrowing them down to the ones likely to interest the user is non-trivial. The SPS platform comes with sophisticated, general-purpose built-in algorithms that perform precisely these tasks.
Rule-driven: Unlike classical approaches to search that depend on complex weights and hard-to-unravel ranking functions to achieve search quality, SPS adopts a fully transparent rule-driven approach. Every step of the search process -- from tokenization and parsing, to index term generation, interpretation, and ranking -- is controlled by rules with well-defined semantics. This ensures that searches that are configured to work in a certain way continue to reliably produce the expected results even as underlying data collections and their statistics change. In addition, the rule-driven approach enables application developers and domain experts (e.g., a search administrator) to easily configure and customize SPS components to a specific application or deployment scenario. Most importantly, such customization can often be accomplished without writing a single line of code or tweaking ranking function parameters.
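The two ideas above can be illustrated together with a small Python sketch. It is not the SPS implementation or API: the taxonomy contents and the names `TAXONOMY`, `term_readings`, `interpretations`, and `rank` are hypothetical, and the single ranking rule (prefer interpretations that bind more terms to concepts) is an assumption standing in for a configurable rule set. The sketch enumerates every reading of the query "linda address" and then orders the candidates by that explicit rule rather than by learned weights.

```python
from itertools import product

# Hypothetical taxonomy for personal email search: concept -> words that denote it.
TAXONOMY = {
    "person": {"linda", "carlos"},
    "address": {"address", "street"},
    "phone": {"phone", "number"},
}


def term_readings(term):
    """Return every reading of one query term.

    A reading is a (kind, term) pair, where kind is either "keyword"
    (plain text match) or the name of a taxonomy concept.
    """
    readings = [("keyword", term)]
    for concept, vocabulary in TAXONOMY.items():
        if term in vocabulary:
            readings.append((concept, term))
    return readings


def interpretations(query):
    """Enumerate all combinations of per-term readings for the query."""
    terms = query.lower().split()
    return [list(combo) for combo in product(*(term_readings(t) for t in terms))]


def rank(candidates):
    """Order candidates by one transparent rule: more concept readings first.

    The sort is stable, so ties keep their enumeration order.
    """
    def concept_count(candidate):
        return sum(kind != "keyword" for kind, _ in candidate)

    return sorted(candidates, key=concept_count, reverse=True)
```

For "linda address" the sketch produces four candidate interpretations, and the top-ranked one reads "linda" as a person and "address" as the address concept, which corresponds to "find emails that mention person linda's address". Because the ordering comes from a stated rule and not from collection statistics, it does not shift as the indexed data changes.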