My research interests lie in the general areas of data mining, information retrieval, data management, and databases. I am interested in developing novel machine learning techniques as well as applying existing techniques to solve challenging problems.
I received my Ph.D. from the College of Information Sciences and Technology at Pennsylvania State University in 2010. Before that, I received an M.S. degree and a Bachelor's degree in Electrical Engineering from Jilin University, China.
Some Recent Work at IBM
My current research at IBM is about extracting chemical and biological information from unstructured or semi-structured data and managing that information so that it can be accessed, searched, and consumed easily. Data in the domains of chemistry and biology have many characteristics that are not shared by more general data. Existing information extraction, natural language processing (NLP), and data storage techniques are not readily applicable to our data, so novel methods and techniques have to be developed. One interesting problem I have worked on is cross-media information extraction and linkage, a problem that existing work has left untouched. The motivation is that images are arguably as important an information source as text for chemical and biological data, and mining the two sources independently loses the logical association between the two media. If you are interested in the details, please take a look at my recent work on this topic, "Cross Media Entity Extraction and Linkage for Chemical Documents".
The current problem I am working on is "Information-Rich Named-Entity Extraction (NE)". An entity is typically associated with multiple attributes. For example, the attributes of a drug name can be drug ID, synonyms, brand names, generic names, SMILES, category, etc., and the attributes of a gene name can be gene ID, gene symbols, chromosome, gene types, etc. Traditional NE systems cannot associate each extracted entity name with its attributes, which hinders downstream data analysis. Naively, one might expect that, given the extracted names, simply querying an entity database for the associated attributes would be a reasonable solution. In practice, this is seldom the case, because the surface form of a name in free text can vary substantially from its dictionary version. I have designed and developed a novel NE system that systematically returns extracted entity names to users together with rich associated information. The main ideas are leveraging the Web as a universal knowledge base for data preprocessing and designing a novel inverted index structure for efficient querying. If you are interested in the details, please take a look at my technical report on this topic, "Using Web and Length-based Inverted Index for Efficient Named-Entity Extraction" (under review).
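To give a rough feel for the indexing idea, here is a minimal Python sketch. It is not the system described in the technical report: it assumes that "length-based" means dictionary names are bucketed by their number of tokens, so that a candidate span of n tokens is checked only against names of the same length. The class `LengthBasedIndex`, its methods, and the entity IDs `D1` and `G1` are hypothetical names chosen for illustration. The sketch normalizes only case and punctuation, and it omits the Web-based preprocessing entirely.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it on non-alphanumeric characters."""
    return re.findall(r"[a-z0-9]+", text.lower())


class LengthBasedIndex:
    """Illustrative inverted index bucketed by name length (in tokens)."""

    def __init__(self):
        # postings[length][token] -> ids of names with that many tokens
        self.postings = defaultdict(lambda: defaultdict(set))
        self.names = []        # name_id -> (token tuple, entity_id)
        self.attributes = {}   # entity_id -> attribute dict
        self.max_len = 0       # longest indexed name, in tokens

    def add_entity(self, entity_id, names, attributes):
        """Index every name of an entity and store its attributes once."""
        self.attributes[entity_id] = attributes
        for name in names:
            tokens = tuple(tokenize(name))
            if not tokens:
                continue
            name_id = len(self.names)
            self.names.append((tokens, entity_id))
            self.max_len = max(self.max_len, len(tokens))
            for token in set(tokens):
                self.postings[len(tokens)][token].add(name_id)

    def lookup(self, tokens):
        """Return the entity id whose name equals `tokens`, or None."""
        bucket = self.postings.get(len(tokens))
        if bucket is None:
            return None
        candidates = None
        for token in tokens:
            ids = bucket.get(token)
            if not ids:
                return None
            candidates = set(ids) if candidates is None else candidates & ids
            if not candidates:
                return None
        # Posting lists ignore token order, so verify the exact sequence.
        for name_id in sorted(candidates):
            name_tokens, entity_id = self.names[name_id]
            if name_tokens == tuple(tokens):
                return entity_id
        return None

    def extract(self, text):
        """Greedy longest-match scan returning (name, id, attributes)."""
        tokens = tokenize(text)
        results, i = [], 0
        while i < len(tokens):
            for n in range(min(self.max_len, len(tokens) - i), 0, -1):
                span = tokens[i:i + n]
                entity_id = self.lookup(span)
                if entity_id is not None:
                    results.append(
                        (" ".join(span), entity_id, self.attributes[entity_id])
                    )
                    i += n
                    break
            else:
                i += 1  # no name starts at this token
        return results


index = LengthBasedIndex()
index.add_entity("D1", ["aspirin", "acetylsalicylic acid"], {"category": "NSAID"})
index.add_entity("G1", ["BRCA1", "breast cancer 1"], {"chromosome": "17"})
hits = index.extract("Acetylsalicylic acid did not alter BRCA1 expression.")
```

In this toy setup, each hit carries the matched name, the entity ID, and the stored attributes, so no separate database query is needed after extraction. Bucketing by length keeps each lookup restricted to names that could match the span at all, which is one plausible reason to organize the index this way.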
Some Recent Work outside IBM
As a researcher, I actively collaborate with researchers and professors at universities. My recent work includes large-scale social network analysis and spam detection, semi-supervised dimensionality reduction, and general entity indexing and search.
In the pharmaceutical industry, the confidentiality of data has always been a concern. For this reason, many companies and research institutes invest heavily in developing in-house data analysis tools. I recently developed a lightweight, self-contained tool called "BioMiner" that can be used in-house with no data security concerns. BioMiner extracts a range of biological entities, including chemicals, drugs, genes, diseases, and targets, from free text. For each type of entity, using the new technique mentioned above, BioMiner also returns a rich list of attributes without requiring users to install or manage any database.
My group's product: IBM BAO Strategic IP Insight Platform (SIIP)
SIIP is a project that I have been working on since I joined IBM. I am in charge of the bioinformatics-related components of this product, such as chemical and biological annotations. Here is a list of links that you may find useful if you want to learn more about our work.