2019
Unified text analytics annotator development life cycle combining rule-based and machine learning based techniques
Laura Chiticariu, Jeffrey Kreulen, Rajasekar Krishnamurthy, Prithviraj Sen, Shivakumar Vaithyanathan
Patent 10,289,963
Abstract
One embodiment provides a method for developing a text analytics program for extracting at least one target concept including: utilizing at least one processor to execute computer code that performs the steps of: initiating a development tool that accepts user input to develop rules for extraction of features of the at least one target concept within a dataset comprising textual information; developing, using the rules for feature extraction, an evaluation dataset comprising at least one document annotated with the at least one target concept to be extracted by the text analytics program; creating, using the rules for feature extraction, a rule-based annotator to extract the at least one target concept; training, using the evaluation dataset, a machine-learning annotator to extract the at least one target concept within the dataset; combining the rule-based annotator and the machine learning annotator to form a combined annotator; evaluating, using the evaluation dataset, extraction performance of the combined annotator against a predetermined threshold; and publishing, when the extraction performance of the combined annotator exceeds the predetermined threshold, the combined annotator for use in an application that extracts the at least one target concept from a plurality of datasets.
2018
Cross-lingual information extraction program
Alan Akbik, Laura Chiticariu, Marina Danilevsky, Yunyao Li, Huaiyu Zhu
Patent 10,042,846
Abstract
One embodiment provides a method for constructing a cross-lingual information extraction program, the method including: utilizing at least one processor to execute computer code that performs the steps of: constructing a plurality of language-specific representations from text expressed in a plurality of languages by parsing the text of each language using a language-specific semantic parser; mapping the plurality of language-specific representations to a single cross-lingual semantic representation, wherein the cross-lingual semantic representation encompasses the plurality of languages; and constructing the cross-lingual information extraction program based on the cross-lingual semantic representation. Other aspects are described and claimed.
Constructing concepts from a task specification
Laura Chiticariu, George Cypher, Rajasekar Krishnamurthy, Yunyao Li, Huahai Yang
Patent 10,162,852
Abstract
Embodiments relate to facilitating construction of concepts from a task specification. A method includes receiving, from a user via a user interface, a task specification in natural language form. The method also includes parsing the task specification into a plurality of components, and searching a database for an existing concept having a pattern that approximates at least a portion of the plurality of components. The concept includes semantic meanings that are representable by textual patterns. The method further includes identifying any components of the plurality of components that are not included in the existing concept, and building a new concept that combines the existing concept and the components of the plurality of components that are not included in the existing concept.
Generation of a natural language resource using a parallel corpus
Alan Akbik, Laura Chiticariu, Marina Danilevsky, Yunyao Li, Huaiyu Zhu
U.S. Patent 9,898,460
Abstract
One embodiment provides a method for generating a natural language resource using a parallel corpus, the method including: utilizing at least one processor to execute computer code that performs the steps of: receiving, from a parallel corpus, natural language text in a source language and a corresponding translation of the natural language text in a target language, wherein the natural language text in the source language comprises linguistic annotations; projecting the linguistic annotations from the source language natural language text to the target language natural language text; applying one or more filters to remove at least one projected linguistic annotation from the target language natural language text that results in at least one error; selecting at least one target language natural language text having substantially complete linguistic annotations; training a machine learning model using the selected at least one target language natural language text and annotations; and adding, using the trained machine learning model, linguistic annotations to at least one target language natural language text having incomplete linguistic annotations. Other aspects are described and claimed.
2017
Probabilistic surfacing of potentially sensitive identifiers
Varun Bhagwan, Laura Chiticariu, Daniel F. Gruhl
U.S. Patent 9,652,627
Abstract
Probabilistic surfacing of potentially sensitive identifiers is provided. In one embodiment of the present invention, a method of and computer program product for surfacing of potentially sensitive identifiers are provided. An input string is read. The input string has a length. The input string is divided into a plurality of tokens. Each of the tokens has a predetermined length. A score is determined for each of the plurality of tokens. A composite score is determined based on the scores of each of the plurality of tokens. Whether the input string comprises an identifier is determined by comparing the composite score to a predetermined threshold.
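The steps in this abstract (split into fixed-length tokens, score each token, combine the scores, compare to a threshold) can be sketched as follows. This is a minimal illustration, not the patented method: the sliding-window tokenization, the digit-share scoring function, the mean as the composite, and the default threshold of 0.5 are all assumptions chosen for the example, and every function name is hypothetical.

```python
def split_tokens(text, size=3):
    """Divide the input into overlapping tokens of a fixed length (assumed: sliding window)."""
    return [text[i:i + size] for i in range(len(text) - size + 1)]


def token_score(token):
    """Hypothetical per-token score: the share of characters that are digits."""
    return sum(ch.isdigit() for ch in token) / len(token)


def composite_score(text, size=3):
    """Combine the per-token scores; the mean is an assumed choice of composite."""
    tokens = split_tokens(text, size)
    if not tokens:
        return 0.0  # input shorter than one token
    return sum(token_score(t) for t in tokens) / len(tokens)


def looks_like_identifier(text, size=3, threshold=0.5):
    """Flag the input when the composite score exceeds the (assumed) threshold."""
    return composite_score(text, size) > threshold
```

With these assumptions, a digit-heavy string such as "123-45-6789" is flagged while ordinary prose such as "hello world" is not. A realistic scorer would more likely use token statistics learned from data rather than a fixed digit heuristic.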
2015
Identifying and ranking pirated media content
Vijay Bommireddipalli, Laura Chiticariu, Yunyao Li, Richard Maraschi, Ah-Fung Sit, Shivakumar Vaithyanathan, Shankar Venkataraman
U.S. Patent 9,215,243
Abstract
A computer identifies and ranks URL hyperlinks to possible pirated media content by searching a web page from a first website for one or more indicator keywords, wherein a strength of an indicator keyword is related to a likelihood of pirated media content. Responsive to locating a plurality of instances of the one or more indicator keywords, the computer identifies a plurality of hyperlinks respectively associated with one or more of the plurality of instances. The computer weights the identified plurality of hyperlinks based on at least one of: a strength of associated indicator keywords, number of associated indicator keywords, number of times each hyperlink was identified, and date of posting. The computer then ranks the plurality of hyperlinks in a ranked list according to weight, which indicates a relative likelihood that the respective hyperlinks point to pirated media content.
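The weighting and ranking described above can be illustrated with a short sketch. It uses only one of the listed signals, the summed strength of the indicator keywords found near each hyperlink, which also grows with the number of times a link is identified. The keyword table, its strength values, and the function name are hypothetical; date of posting is left out.

```python
from collections import defaultdict

# Hypothetical indicator keywords with assumed strengths in [0, 1].
KEYWORD_STRENGTH = {
    "free download": 0.9,
    "torrent": 0.8,
    "watch online": 0.6,
}


def rank_links(hits):
    """Rank hyperlinks by accumulated keyword strength.

    hits: iterable of (url, keyword) pairs, one per keyword instance
    associated with a hyperlink on the searched page.
    """
    weights = defaultdict(float)
    for url, keyword in hits:
        # Unknown keywords contribute nothing to the weight.
        weights[url] += KEYWORD_STRENGTH.get(keyword, 0.0)
    # Highest weight first: the most likely pointers to pirated content.
    return sorted(weights.items(), key=lambda item: item[1], reverse=True)
```

Summing strengths is one simple way to fold several signals into a single weight; other combinations, such as a weighted sum that also includes posting date, would fit the abstract equally well.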
2014
Refining a dictionary for information extraction
Laura Chiticariu, Vitaly Feldman, Frederick R. Reiss, Huaiyu Zhu, Sudeepa Roy
U.S. Patent 8,775,419
Abstract
A method for refining a dictionary for information extraction, the operations including: inputting a set of extracted results from execution of an extractor comprising the dictionary on a collection of text, wherein the extracted results are labeled as correct results or incorrect results; processing the extracted results using an algorithm configured to set a score of the extractor above a score threshold, wherein the score threshold balances a precision and a recall of the extractor; and outputting a set of candidate dictionary entries corresponding to a full set of dictionary entries, wherein the candidate dictionary entries are candidates to be removed from the dictionary based on the extracted results.
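To make the idea concrete, here is a minimal sketch that proposes dictionary entries for removal from labeled extraction results. It substitutes a simple greedy search that maximizes F1 (one common way of balancing precision and recall) for the algorithm in the patent, which the abstract does not spell out. Recall is measured against the correct results of the full dictionary, and all names are hypothetical.

```python
def f1(tp, fp, total_correct):
    """F1 score, with recall measured against the full dictionary's correct results."""
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / total_correct
    return 2 * precision * recall / (precision + recall)


def removal_candidates(results):
    """Greedily pick entries whose removal raises F1.

    results: iterable of (dictionary_entry, is_correct) pairs.
    """
    counts = {}  # entry -> (correct matches, incorrect matches)
    for entry, ok in results:
        good, bad = counts.get(entry, (0, 0))
        counts[entry] = (good + 1, bad) if ok else (good, bad + 1)

    total_correct = sum(good for good, _ in counts.values())
    tp = total_correct
    fp = sum(bad for _, bad in counts.values())
    removed = []
    while True:
        best_entry, best_score = None, f1(tp, fp, total_correct)
        for entry, (good, bad) in counts.items():
            score = f1(tp - good, fp - bad, total_correct)
            if score > best_score:
                best_entry, best_score = entry, score
        if best_entry is None:
            return removed  # no single removal improves the score
        good, bad = counts.pop(best_entry)
        tp, fp = tp - good, fp - bad
        removed.append(best_entry)
```

An entry that mostly produces incorrect matches (for example, an ambiguous first name that is also a month) tends to be proposed first, because dropping it costs little recall and gains a lot of precision.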
Extraction of information from clinical reports
Tanveer F. Syeda-Mahmood, Laura Chiticariu
U.S. Patent 8,793,199
Abstract
A phrase matching system is described. The system includes a training engine and a matching engine. The training engine is configured to: learn terms and term variants from a training corpus, wherein the terms and the term variants correspond to a specialized dictionary related to the training corpus; and generate a list of negative indicators found in the training corpus. The matching engine is configured to: perform a partial match of the terms and the term variants in a set of electronic documents to create initial match results; and perform a negation test using the negative indicators and a positive terms test using the terms and the term variants on the initial match results to remove matches from the initial match results that fail either the negation test or the positive terms test, resulting in final match results.
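The negation test can be illustrated with a small sketch. It is a simplification under stated assumptions: terms are matched exactly rather than partially, the positive terms test is omitted, and a match is rejected when a negative indicator appears within a fixed window of words before it. The window size and the function name are hypothetical.

```python
import re


def match_terms(sentences, terms, negative_indicators, window=3):
    """Return (sentence, term) pairs for term matches that pass the negation test."""
    results = []
    for sentence in sentences:
        words = re.findall(r"[a-z]+", sentence.lower())
        for term in terms:
            term_words = term.lower().split()
            n = len(term_words)
            for i in range(len(words) - n + 1):
                if words[i:i + n] != term_words:
                    continue  # no match at this position
                # Negation test: look at the few words preceding the match.
                preceding = words[max(0, i - window):i]
                if any(neg in preceding for neg in negative_indicators):
                    continue  # negated mention, drop it
                results.append((sentence, term))
    return results
```

For instance, with the term "aortic stenosis" and the negative indicators "no" and "without", the sentence "No evidence of aortic stenosis." is filtered out, while "Moderate aortic stenosis is present." is kept.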
Building and maintaining information extraction rules
Arnaldo Carreno Fuentes, Laura Chiticariu, Eser Kandogan, Yunyao Li, Huahai Yang
U.S. Patent 9,436,660
Abstract
Methods and arrangements for managing development of information extraction rules. One or more documents are opened for extraction. An interface is provided to create a label and thereupon label a portion of the document. The created label is stored, and an extractor is developed based on the labeling. A test interface is provided for the extractor, and results of a test conducted through the test interface are displayed. The extractor is exported. In accordance with at least one embodiment, developers are presented with eased automated guidance to write extractors, which thereby reduces an overall manual effort involved in extractor development. Generally, a focused, tutorial-type environment serves as a guide based on previously developed best practices.
2013
Automatic refinement of information extraction rules
Laura Chiticariu, Bin Liu, Frederick R. Reiss
U.S. Patent 8,417,709 B2
Abstract
A method and system for automatically refining information extraction (IE) rules. A provenance graph for IE rules on a set of test documents is determined. The provenance graph indicates a sequence of evaluations of the IE rules that generates an output of each operator of the IE rules. Based on the provenance graph, high-level rule changes (HLCs) of the IE rules are determined. Low-level rule changes (LLCs) of the IE rules are determined to specify how to implement the HLCs. Each LLC specifies changing an operator's structure or inserting a new operator in between two operators. Based on how the LLCs affect the IE rules and previously received correct results of applying the rules on the test documents, a ranked list of the LLCs is determined. The IE rules are refined based on the ranked list.
2012
Interactive generation of integrated schemas
Laura Chiticariu, Phokion Kolaitis, Lucian Popa
U.S. Patent 8,180,810
Abstract
Methods, systems and computer program products for interactive generation of integrated schemas. Exemplary embodiments include a method for schema integration, the method including recasting a first source schema into a first graph of concepts with HasA relationships, recasting a second source schema into a second graph of concepts with HasA relationships, identifying matching concepts in the first graph and the second graph based on correspondences between attributes of the concepts of the first and second graphs, producing an integrated schema, based on a fixed specification of matching concepts to merge, and generating a mapping from the first source schema to the integrated schema and from the second source schema to the integrated schema.