SystemT - Datasets

This page will be updated to host the datasets we describe in our scientific publications.


Universal Proposition Banks

In this project, we are investigating ways to enable multilingual semantic role labeling (SRL) that maps into a unified representation of semantics. We are releasing our generated proposition banks (propbanks) with SRL annotations for several languages, including Chinese, French, and German.

Check out the project here.

The corresponding publication is:

Alan Akbik, Xinyu Guan, and Yunyao Li. "Multilingual Aliasing for Auto-Generating Proposition Banks." 26th International Conference on Computational Linguistics (COLING 2016).

2019 - AKBC Best Application Paper Award

2018 - NAACL Test-of-Time Award

2014 - IBM Corporate Award

2013 - IBM Research Outstanding Technical Accomplishment Award

2008, 2010, 2013 - IBM Research A-Level Accomplishment Award

Recent News


Research paper "NormCo: Deep Disease Normalization for Biomedical Knowledge Base Construction" received the Best Application Paper Award at AKBC 2019


Yunyao gave a talk on "Building Domain-Specific Knowledge with Human in the Loop" at the National Academies workshop "Robust Machine Learning Algorithms and Systems: Detection & Mitigation of Adversarial Attacks and Anomalies"


Yunyao gave a talk on "Building Domain-Specific Knowledge with Human in the Loop" at the University of Michigan AI Lab


Research paper "DIMSIM: An Accurate Chinese Phonetic Similarity Algorithm based on Learned High Dimensional Encoding" was accepted at CoNLL 2018 (IBM Research Blog Post).


Research paper "Exploiting Structure in Representation of Named Entities using Active Learning" was accepted at COLING 2018.


Officially joined the NSF Center for Big Learning as an Industry Partner.


Demoed LUSTRE, an interactive system for entity understanding and standardization, at ICDE 2018


Hosted Stanford professor Mark Musen's visit to IBM Research - Almaden


Industry track paper on the design and implementation of SystemT was accepted at the NAACL-HLT 2018 Industry Track (the very first industry track at a major NLP conference).


Hosted Univ. of Washington professor Luke Zettlemoyer's visit to IBM Research - Almaden


Yunyao is co-chairing the very first NAACL-HLT Industry Track


Demo paper "Creating and Interacting with Large-Scale Domain-Specific Knowledge Bases" was presented at VLDB 2017 [video] [poster]


Research paper "Distant Meta-Path Similarities for Text-Based Heterogeneous Information Networks" was accepted at CIKM 2017


Research paper "Crowd-in-the-loop: A Hybrid Approach for Annotating Semantic Roles" was accepted at EMNLP 2017


Hosted Stanford professor Dan Jurafsky's visit to IBM Research - Almaden


Research paper "Hardware Compilation Framework for Text Analytics Queries" was accepted to the Journal of Parallel and Distributed Computing (JPDC)


SEER, a system for learning extractors from examples, was presented at CHI and SIGMOD 2017 [video] [paper]


Workshop paper on understanding relationships in the financial domain was presented at DSMM 2017 [paper]