Speech Technologies



Automatic Speech Recognition (ASR) is a technology that uses computers to analyze human speech and convert utterances into text. In 1997, IBM Research – Tokyo commercialized IBM ViaVoice, the first Large Vocabulary Continuous Speech Recognition (LVCSR) software package for Japanese. At that time, LVCSR software focused on analyzing well-pronounced speech in quiet environments. As the technologies improved, new targets included spontaneous speech, such as daily conversations. Our research findings have been utilized in various business areas such as call center monitoring, smartphones, and car navigation systems. IBM Research – Tokyo continues to study various speech technologies involving speech recognition, synthesis, and analytics in practical environments.

In 2011, the IBM Corporation marked its 100th anniversary and commemorated many of the technologies produced during that century (*1). In the field of speech technology, IBM teams around the world have led in many business areas since the 1960s, and we are continuing with our advanced research. IBM prepares an annual report entitled “IBM 5 in 5” (*2), which focuses on five innovations that may drastically change our lives over the next five years. In 2012, speech technology was included in “IBM 5 in 5” as one of the hottest research fields that are transforming the world. The webpages below describe some of our past and current speech technology research at IBM Research – Tokyo.

(*1) IBM 100 - Pioneering Speech Recognition
(*2) IBM 5 in 5


Speech Solutions

There are many opportunities to apply speech technology wherever humans speak. For example, call centers record customer calls, but have found it hard to utilize this valuable data in their businesses. Given the recent progress in speech technologies, there is enormous potential in applying them in contact centers. Agent-support technologies will provide real-time assistance to call agents, such as relevant background information for responding to customers’ questions. Mining technologies will support the back offices of call centers by extracting useful patterns and knowledge from accumulated dialogues between customers and call agents. However, such applications pose many difficult problems. For example, much of the speech from call centers is acoustically messy and linguistically confusing. These technical challenges are also opportunities that open up new research topics, and we are pursuing various approaches that may lead to solutions. IBM Research – Tokyo aims to resolve these problems in collaboration with speech teams around the world, other researchers, and line-of-business (LOB) groups to deliver advanced speech technologies to clients.



Robust speech recognition

In 1997, IBM Research – Tokyo helped launch IBM ViaVoice, the first commercially successful large vocabulary continuous speech recognition (LVCSR) system for Japanese. ViaVoice mainly focused on creating Microsoft Word documents and email using voice input, and it can accurately recognize clear speech in quiet environments. However, ASR research continues in pursuit of better results. For example, close-talking microphones work well, but distant microphones still yield low accuracy because the speech is mixed with too much ambient noise and too many echoes. The real world contains many kinds of sounds, such as driving noises, music, children’s voices, and announcements at stations, but to an ASR system they are all just noise sources. In contrast, humans can easily distinguish voices from noise during a conversation. Therefore, one of our major goals is to create robust ASR systems that can handle all kinds of noise with a performance closer to that of the human ear. To do this, IBM Research – Tokyo is studying the acoustic aspects of noise reduction through signal processing techniques, feature extraction based on the mechanisms of human ears, acoustic modeling using discriminative criteria, speaker and environmental adaptations, and other areas. Related techniques involve machine learning and deep learning with neural networks. In addition, we are actively studying microphone array processing and independent component analysis using multiple microphones, which can be used for applications such as in-car ASR and controlling robots.
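
As a rough illustration of what microphone array processing can look like, the sketch below implements delay-and-sum beamforming, a textbook technique chosen here only as an example and not necessarily the method used at IBM Research – Tokyo. It assumes the arrival delay of the target speech at each microphone is already known as a whole number of samples. The function name `delay_and_sum` and its parameters are hypothetical.

```python
from typing import List, Sequence


def delay_and_sum(channels: Sequence[Sequence[float]],
                  delays: Sequence[int]) -> List[float]:
    """Align each microphone channel by its delay and average the result.

    Assumption: delays[m] is the known, non-negative number of samples by
    which the target speech arrives late at microphone m.
    """
    if len(channels) != len(delays):
        raise ValueError("one delay per channel is required")
    if any(d < 0 for d in delays):
        raise ValueError("delays must be non-negative")
    if not channels:
        return []

    # Only the span covered by every shifted channel can be averaged.
    length = min(len(ch) - d for ch, d in zip(channels, delays))
    if length <= 0:
        return []

    output = []
    for n in range(length):
        # Sample n of the target sits at index n + delay in each channel.
        total = sum(ch[n + d] for ch, d in zip(channels, delays))
        output.append(total / len(channels))
    return output


# Two microphones observe the same target [1, 2, 3, 4]; the second one
# receives it two samples later, and the two noise signals are unrelated.
mic_near = [1.5, 1.5, 3.5, 3.5, 0.0, 0.0]
mic_far = [0.0, 0.0, 0.5, 2.5, 2.5, 4.5]
enhanced = delay_and_sum([mic_near, mic_far], [0, 2])
```

After alignment the target speech adds up coherently across channels while uncorrelated noise tends to average out, which is why adding microphones can help distant-talking recognition. In practice the delays would have to be estimated and are rarely whole samples.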


LVCSR for spontaneous speech

Some businesses, such as call centers, want to transcribe telephone conversations for call monitoring and text mining of the transcribed speech. Although ASR is already in use in many business scenarios with some limitations, many advanced speech applications would become possible if ASR could be significantly improved. The potential value of the information in calls and discussions at meetings is very high, but conversational speech has many ambiguities, filled pauses, and casual errors that currently confuse or defy recognition. Conversational ASR lags behind ASR for read speech, such as TV news broadcasts. IBM Research – Tokyo is continuing to research various techniques, including fundamental language modeling, speech corpus utilization, and named entity extraction.


Speech Analytics

Speech data carries not only textual information but also paralinguistic information such as emotions. Conversational speech includes filled pauses, sighs, and laughter that are removed in the process of obtaining accurately transcribed speech. However, the original speech data offers many clues to such things as the speakers’ emotions, response times that reveal the complexity of the topic, and how the replies evolved to their final states. IBM Research – Tokyo is continuing fundamental research on detecting stress in speech and on ways to extract emotions from call center conversations. Here are some of the most promising approaches:

  • Transcribed speech with precise time-stamps
  • Clustered speech segments with human utterances, instrumental sounds, noises, paralinguistic events, and speaker diarization
  • Detecting emotions (such as happiness or anger) by using acoustic and linguistic analyses
  • Tracking turn-taking in long conversations
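
To make the time-stamp and turn-taking items above more concrete, here is a minimal sketch that groups time-stamped transcript segments into speaker turns and measures the gap before each change of speaker. The `Segment` type, its fields, and the `track_turns` function are hypothetical names introduced for illustration; the sketch assumes each segment already carries a speaker label and start and end times in seconds.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Segment:
    speaker: str   # hypothetical label such as "agent" or "customer"
    start: float   # start time in seconds
    end: float     # end time in seconds
    text: str      # transcribed words for this segment


def track_turns(segments: List[Segment]
                ) -> Tuple[List[Tuple[str, float, float]],
                           List[Tuple[str, float]]]:
    """Return (turns, latencies) for a time-stamped conversation.

    turns:     (speaker, start, end), with consecutive segments from the
               same speaker merged into one turn.
    latencies: (responding speaker, gap in seconds) for every change of
               speaker; a negative gap means the speakers overlapped.
    """
    turns: List[Tuple[str, float, float]] = []
    latencies: List[Tuple[str, float]] = []
    for seg in sorted(segments, key=lambda s: s.start):
        if turns and turns[-1][0] == seg.speaker:
            # Same speaker keeps the floor: extend the current turn.
            speaker, start, end = turns[-1]
            turns[-1] = (speaker, start, max(end, seg.end))
        else:
            if turns:
                # Response time = silence between the previous turn's end
                # and the new speaker's first word.
                latencies.append((seg.speaker, seg.start - turns[-1][2]))
            turns.append((seg.speaker, seg.start, seg.end))
    return turns, latencies


call = [
    Segment("customer", 0.0, 2.0, "My order has not arrived."),
    Segment("agent", 2.5, 4.0, "I am sorry to hear that."),
    Segment("agent", 4.25, 5.0, "Let me check."),
    Segment("customer", 6.5, 7.0, "Thank you."),
]
turns, latencies = track_turns(call)
```

Long gaps before an agent's reply could then be examined as one possible signal of topic complexity, alongside the acoustic and linguistic cues listed above.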