Text Analytics

Overview

The aim of our text mining project is to research technologies for discovering useful knowledge in enormous collections of documents, and to develop systems that present this knowledge and support users' decisions. Traditional data mining technologies mine knowledge from data structured with well-formed schemas, such as relational tables. Text data has no such schema, and the information is described freely within the documents. Therefore, we are focusing on Natural Language Processing (NLP) technologies to extract the information. Using NLP technologies, target documents are transformed into collections of concepts, which are described using terms discovered within the texts.

Often, "text mining" is used to refer to a text search technique, but we think of text mining from a more functional perspective. Text mining technologies do more than pick out keywords from texts: they extract facts, authors' intentions, their expectations, and their claims. This knowledge is helpful for many applied tasks such as marketing, trend analysis, claim processing, or generating FAQs (frequently asked questions).

In this project, we are currently working primarily on two text mining solutions. One is for customer relationship management (CRM): it extracts useful information from call center logs to support marketing strategies and similar activities. The other is for life sciences: it helps find hidden knowledge in large numbers of biomedical documents.

Framework Technologies

IBM TAKMI

TAKMI is a text mining project that we have been working on since 1997. It was initially created to provide technologies for understanding the voices of customers as recorded in contact center logs, and its use is now expanding to extracting valuable information from various types of text (such as medical records, weblogs, and highway accident reports). IBM Content Analytics is built on this TAKMI technology.

UIMA

UIMA stands for Unstructured Information Management Architecture, which IBM Research originally created as a common platform for natural language processing modules. Now it is expanding to deal with other types of media (such as voice or video). UIMA has been released as an open source project to foster wider use.

Text and Network Analysis (TENA)

Text and Network Analysis is a composite framework for performing both text analysis and network analysis in environments where the text data resides in personal networks (e.g., social networking services, or SNS).

Component Technologies

Alerting

Alerting is a technology to detect incidents which may cause problems later. This kind of technology is especially helpful for business managers who are seeking to act more proactively.

Top-K

Top-K is a fast indexing method for extracting the K most frequent keywords in a document set based on some search condition.
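The task that Top-K addresses can be illustrated with a minimal sketch. This is not the indexing method itself, which is not described here: it is a brute-force scan that returns the same kind of answer (the K most frequent keywords among the documents matching a search condition). The document layout, the field names `product` and `keywords`, and the function name `top_k_keywords` are all assumptions made for illustration.

```python
from collections import Counter


def top_k_keywords(documents, condition, k):
    """Return the k most frequent keywords among documents matching condition.

    Brute-force baseline: scans every document. A real index would avoid
    the full scan; this only shows what the query is meant to return.
    """
    counts = Counter()
    for doc in documents:
        if condition(doc):
            # Count each keyword once per document (document frequency).
            counts.update(set(doc["keywords"]))
    # Sort by descending count, then alphabetically, so ties are deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:k]


# Hypothetical contact-center records, invented for this example.
docs = [
    {"product": "printer", "keywords": ["paper", "jam", "tray"]},
    {"product": "printer", "keywords": ["paper", "toner"]},
    {"product": "laptop", "keywords": ["battery", "paper"]},
    {"product": "printer", "keywords": ["jam", "paper"]},
]

# Search condition: only documents about printers.
result = top_k_keywords(docs, lambda d: d["product"] == "printer", 2)
print(result)  # [('paper', 3), ('jam', 2)]
```

The cost of this baseline grows with the size of the whole collection for every query, which is the situation a dedicated index is meant to avoid.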

Relation Extraction

This is a technology for extracting direct or indirect relationships between the entities mentioned in documents.

Detection of Evaluation

This is a technology for extracting phrases with polar opinions (positive or negative feelings) from text documents such as blogs. We developed a method to extract domain-dependent sentiment expressions automatically.
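As a rough illustration of what extracting polar phrases means, the sketch below looks up words in a small hand-written polarity lexicon and returns each match with its neighboring words. It is a simplification, not the method described above: the lexicon `POLARITY_LEXICON`, the function `extract_polar_phrases`, and the sample sentence are hypothetical, and a fixed lexicon is exactly what an automatic, domain-dependent method is meant to replace.

```python
import re

# Hypothetical hand-written lexicon: +1 for positive words, -1 for negative.
POLARITY_LEXICON = {
    "great": 1,
    "love": 1,
    "reliable": 1,
    "slow": -1,
    "broken": -1,
    "terrible": -1,
}


def extract_polar_phrases(text, window=1):
    """Return (phrase, label) pairs for each lexicon word found in text.

    The phrase is the matched word plus up to `window` tokens on each side.
    """
    tokens = re.findall(r"[a-z']+", text.lower())
    hits = []
    for i, token in enumerate(tokens):
        if token in POLARITY_LEXICON:
            start = max(0, i - window)
            phrase = " ".join(tokens[start:i + window + 1])
            label = "positive" if POLARITY_LEXICON[token] > 0 else "negative"
            hits.append((phrase, label))
    return hits


hits = extract_polar_phrases("The battery is great but the screen is terrible.")
print(hits)  # [('is great but', 'positive'), ('is terrible', 'negative')]
```

A general-purpose lexicon like this one misses expressions whose polarity depends on the domain (for example, "long" is good for battery life but bad for startup time), which is why domain-dependent expressions are worth extracting automatically.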

Advantage Analysis

Advantage analysis aims to extract and analyze phrases expressing technical achievements from documents.

Business Applications

Analysis of the voices of the customers

This is a solution that helps people understand customers' issues and problems by analyzing what actual customers say. The solution recognizes and extracts questions, requests, complaints, and similar categories from contact center logs.

Early identification of problems

This is a solution for detecting problems with products and services based on text data (such as contact center logs) by using alerting technology.

Sales acceleration by opportunity identification

This is a solution for extracting valuable information from sales logs to improve future sales activities.

Use of customers' opinions

Customers report many impressions of services and products on the Internet. These reactions, known as word-of-mouth information, are a very valuable information source for companies. Sentiment analysis is one of the useful technologies for analyzing such information.

Trend analysis from technical documents

We are working on technologies that extract valuable information from technical documents in order to outline the major trends in a technical area and to identify possible advances in technologies and related tasks.

Analysis of discussions

We have developed a tool to increase the effectiveness of online discussions with functions such as detecting important messages and visualizing message threads.

Life sciences

We are studying a solution to extract valuable information from life-science-related documents (such as published papers) for pharmaceutical research.