Statistical Information and Relation Extraction (SIRE) - overview
To improve a computer's ability to process rich unstructured text (and speech), one needs to do three things. Mention detection finds the mentions in the text of entities of interest (e.g., Person, Organization, Medication). Co-reference resolution groups together all mentions that refer to the same real-world entity. Relation extraction identifies the relations between the detected entities that the text expresses. For example, the snippet "... its chairman ..." implies the relation ManagerOf between the entity with the nominal Person mention "chairman" and the entity with the pronominal Organization mention "its."
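The outputs of these three steps can be pictured as simple records. The following is only an illustrative sketch in Python, not SIRE's actual data model: the class names, field names, entity ids, and the example sentence built around "its chairman" are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class Mention:
    """One span of text that refers to an entity (hypothetical schema)."""
    text: str           # surface string, e.g. "chairman"
    start: int          # character offset where the span begins
    end: int            # character offset just past the span
    entity_type: str    # e.g. "Person", "Organization"
    mention_level: str  # "name", "nominal", or "pronominal"


@dataclass
class Entity:
    """All mentions in a document that refer to the same real-world entity."""
    entity_id: str
    mentions: List[Mention] = field(default_factory=list)


@dataclass(frozen=True)
class Relation:
    """A typed, directed link between two entities."""
    relation_type: str
    arg1: str  # id of the first argument entity
    arg2: str  # id of the second argument entity


def make_mention(doc: str, text: str, entity_type: str, level: str) -> Mention:
    """Build a Mention for the first occurrence of `text` in `doc`."""
    start = doc.index(text)
    return Mention(text, start, start + len(text), entity_type, level)


# Hypothetical sentence containing the "its chairman" snippet.
doc = "The company said its chairman will retire."

# Mention detection: three mentions, each with a type and a level.
company = make_mention(doc, "company", "Organization", "nominal")
its = make_mention(doc, "its", "Organization", "pronominal")
chairman = make_mention(doc, "chairman", "Person", "nominal")

# Co-reference resolution: "company" and "its" refer to the same entity.
org = Entity("E1", [company, its])
person = Entity("E2", [chairman])

# Relation extraction: the person entity is ManagerOf the organization entity.
relation = Relation("ManagerOf", person.entity_id, org.entity_id)
```

Note that the relation is stated between entities rather than between individual mentions, so it holds for "company" as well as "its" once co-reference has linked them.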
We are developing the Statistical Information and Relation Extraction (SIRE) toolkit to build trainable extractors for new applications and domains. SIRE provides three trainable components: a mention detection component based on Maximum Entropy models, trained from annotated data created with HAT, a highly optimized web-browser annotation tool; a co-reference component that groups the detected mentions in a document that refer to the same entity; and a relation extraction system. The SIRE training tools produce extractors that can be deployed as UIMA annotators.
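As background on the Maximum Entropy models mentioned above: such a model scores each candidate label by summing the weights of the features that fire, then normalizes the exponentiated scores into a probability distribution. The sketch below shows only this general form. The feature names, labels, and weight values are invented for illustration, and SIRE's actual features, training procedure, and decoding are not described here.

```python
import math
from typing import Dict, List, Tuple

# Hypothetical weights keyed by (feature, label). In a real system these
# would be learned from annotated training data, not set by hand.
WEIGHTS: Dict[Tuple[str, str], float] = {
    ("word=chairman", "Person"): 2.0,
    ("prev=its", "Person"): 0.5,
    ("word=chairman", "Organization"): -0.5,
    ("prev=its", "Organization"): 0.3,
}

# Hypothetical label set; "O" stands for "not a mention".
LABELS: List[str] = ["Person", "Organization", "O"]


def label_distribution(features: List[str]) -> Dict[str, float]:
    """Return P(label | features) under a maximum entropy (log-linear) model."""
    # Linear score per label: sum of weights of the active features.
    scores = {
        label: sum(WEIGHTS.get((f, label), 0.0) for f in features)
        for label in LABELS
    }
    # Subtract the maximum score before exponentiating for numerical stability.
    top = max(scores.values())
    exps = {label: math.exp(s - top) for label, s in scores.items()}
    z = sum(exps.values())  # normalization constant
    return {label: e / z for label, e in exps.items()}


# Classify the token "chairman" when the previous token is "its".
probs = label_distribution(["word=chairman", "prev=its"])
best = max(probs, key=probs.get)
```

With these made-up weights, the highest-probability label for the token is Person, because its summed score (2.5) exceeds those of Organization (-0.2) and O (0.0).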
We have developed SIRE systems for multiple domains, including a news domain (about 50 entity types and about 40 relations). More recently, SIRE is being applied in healthcare analytics to extract facts such as vital signs, allergies, and medications from medical reports.