Language and Robots     



Language and Robots - overview

Much of natural language -- humans' preferred communication medium -- concerns itself with nouns. Since it would be very convenient to have machines directly "programmable" by this method, it would help if they could understand objects in a manner similar to humans. In the real, physical world this means being able to automatically detect and characterize graspable objects, people, and places, as well as determine the spatial relations between them. Ideally we would like an Extensible Linguistic Interface (ELI) that lets the system learn new objects, places, and actions.

Publications (click to view):

Extensible Grounding of Speech for Robot Instruction
J. Connell, in Robots that Talk and Listen, J. Markowitz (ed.), De Gruyter, 2014.

An Extensible Language Interface for Robot Manipulation
J. Connell, E. Marcheret, S. Pankanti, M. Kudoh, and R. Nishiyama, Proc. Artificial General Intelligence Conf. (AGI-12), LNAI 7716, pp. 21-30, December 2012.

Presentations (click to view):

Quick Intro
This frames the project in an AI context and highlights the varied capabilities of the robot.

Overview of project
This outlines the eldercare problem, provides a walkthrough of the videos (below), and gives some details on noun learning, remote DB access (with TRL), and verb learning.

Business case
Costs, markets, and deployment scenario for eldercare (IBM Internal only).

Arm videos (click individually, or download compendium):

Robot using speech dialog to manipulate objects on a tabletop.
This illustrates the robot's understanding of objects, colors, sizes, and positions as well as the use of gesture, both by the human and by the robot. In addition it demonstrates cross-utterance pronoun resolution and the tracking of dialog focus.

Robot learning nouns based on speech and gesture.
This shows how the robot can be taught the names of objects and what they look like. It can then respond to requests for these new names and automatically locate the requested objects.

Using remote database access to veto or modify actions.
Once objects have names, additional sources of information can be accessed. These include user-specific databases, web ontologies, and personal histories.

Robot learning a verb from an action walk-through.
The robot's vocabulary can also be extended by describing how to do new tasks. Here the verb "poke" is taught as a sequence of object-relative motions.

Whole body videos (click individually):

Robot responding to simple speech commands.
This works using MS Speech v5 (under XP) at 3-4 feet, but provides only very low-level control. The goal of the project is to be able to autonomously carry out much less detailed commands, like simply "Put this away".

Robot lifting object off floor using (human) joystick control.
This demonstrates the ability of the robot to reach surfaces at a variety of heights, as might be needed to help Grandma pick up her dropped glasses. The motion is sort of jerky because I am not such a good driver.

Robot Scenarios:

The final object-centered cognitive architecture can be applied in many domains. One example is as the basis for a robotic "gopher" in hospital settings. Having a grounded language ability would let it handle incidental chores like "Bring Mr. Brown's medication to Nurse Betty at station 3" or spontaneous requests such as "Retrieve the crash cart from exam room 47B". This would free up the highly skilled nurses to spend more of their time on the actual patients.

Another application would be in elder-care situations. People often have decreased mobility and dexterity as they age. Rather than demoralizing institutionalization or a potentially intrusive live-in caretaker, many could keep living at home with some simple help. Meal preparation and occasional cleaning would likely still have to be done by a human. But a robot that could "Pick up my glasses from the floor" or "Get me that book I left in the bathroom" would be a real boon in terms of empowerment (and a potential cost-saver).

The technology could also be used with remotely-operated vehicles, to combat operator fatigue. Typically there are only a few properly trained operators and you would like them to keep their attention on the high level problem as long as possible. For instance, in a military scenario a bomb-disposal robot could be instructed to "Enter the building, go to the northeast corner, and open the closet". Note that here a headset, or even typed input, might be acceptable. In a commercial setting, an undersea robot might similarly be told to "Rise and hold position 1m in front of the blowout preventer". This relieves the operator of having to control the moment-to-moment details of the vehicle, letting them instead focus on more important aspects of the mission.

Major Issues to Address:

A good testbed for developing such high-level "scripting" capabilities would be a fetch-and-carry mobile robot. Yet to build an integrated demo we must address a number of topics in various subfields.

In general, a crucial ingredient for successful deployment of robots is keeping their cost low. To this end we aim to use computation to replace, whenever possible, expensive special-purpose sensors and high-precision mechanical components. As indicated below, for speech this means using beam-forming, blind source separation, and echo cancellation software rather than dense microphone arrays. For navigation we intend to rely solely on a pair of stereo color cameras rather than laser rangefinders or active beacon systems. And, while vision is known to be extremely computationally intensive, we believe we can keep the total cost down by exploiting modern graphics processors (GPUs) that contain hundreds of cores.


In the robot context, naturally the solid objects are targets for the fetch operations, but they can also be destinations (especially when they are people). In addition they can form the basis for identifying places and local directions. That is, the kitchen is the open space which contains the fridge and the sink. Similarly, to go from the living room to the kitchen, head through the door which is to the left of the big blue chair. Note that this not only lets us communicate with the robot in human terms, it also lets us do away with precise geometric maps which are difficult to build and maintain. To implement this style of navigation there are several sub-areas that need to be investigated.
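
As a toy illustration of this object-anchored style of place grounding (the place and landmark names here are hypothetical, not from the actual system), a place can be represented simply by the set of landmark objects it contains, with no geometric map at all:

```python
# Hypothetical sketch: places defined by the landmark objects they contain,
# rather than by coordinates on a precise geometric map.
PLACES = {
    "kitchen": {"fridge", "sink"},
    "living room": {"big blue chair", "couch"},
}

def resolve_place(visible_objects):
    """Return the place whose landmarks are all currently visible, if any."""
    for name, landmarks in PLACES.items():
        if landmarks <= set(visible_objects):
            return name
    return None

print(resolve_place({"sink", "fridge", "toaster"}))  # -> kitchen
```

The same table can be read in reverse to ground a navigation request: "go to the kitchen" becomes "drive until the fridge and the sink are both in view".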

  • Object approach: Once an object has been detected and localized by the vision system, the robot needs to be able to drive toward it. Also, if it detours from the direct path for some reason, it needs to be able to re-find the target and resume its path. Finally, depending on whether the object is graspable or an environmental landmark, the robot needs to know how close it is required to get to an object in order to have "arrived".

  • Obstacle avoidance: If the robot relies on tracking objects to generate trajectories, it still needs some "tactical" method for preventing collisions along its route. A good choice for this is coarse stereo-vision. This generates a depth image of the area in front of the robot from which a short-term local occupancy map of structures protruding from the floor can be produced. Although computationally rather expensive, this technique relies on commodity cameras rather than low-volume special-purpose sensors like sonar or laser scanners.

  • Freespace parsing: To ground negative-space nouns such as rooms, halls, and doorways the robot needs some way to detect and reify such areas. We propose using short-term occupancy maps similar to those for collision avoidance but looking at structures above the robot's head instead of near the floor.

  • Person following: We do not contemplate any autonomous exploration or mapping capability. Instead we intend for the user to lead the robot around naming places and demonstrating paths (room sequences). In order to make use of this sort of instruction the robot has to be able to tag along after the human.
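
The occupancy-map idea behind the obstacle avoidance bullet can be sketched in a few lines. This is a minimal illustration, not the project's implementation; the grid resolution, extent, and floor threshold are made-up values, and the stereo back-projection step is assumed to have already produced a point cloud in the robot's frame:

```python
import numpy as np

def occupancy_from_points(points, cell=0.10, extent=5.0, floor_z=0.05):
    """Build a short-term local occupancy grid from a 3-D point cloud.

    points: (N, 3) array of (x forward, y left, z up) in metres, robot frame.
    Cells containing any point protruding above floor_z are marked occupied.
    The grid covers [0, extent) ahead and [-extent/2, extent/2) to the sides.
    """
    n = int(extent / cell)
    grid = np.zeros((n, n), dtype=bool)
    above = points[points[:, 2] > floor_z]            # ignore the floor itself
    ix = (above[:, 0] / cell).astype(int)             # forward bins
    iy = ((above[:, 1] + extent / 2) / cell).astype(int)  # lateral bins
    ok = (ix >= 0) & (ix < n) & (iy >= 0) & (iy < n)  # clip to grid
    grid[ix[ok], iy[ok]] = True
    return grid

# A chair leg 1.0 m ahead and 0.1 m to the left: one point on the leg,
# one on the floor beneath it (which should be ignored).
pts = np.array([[1.0, 0.1, 0.4], [1.0, 0.1, 0.02]])
grid = occupancy_from_points(pts)
print(grid[10, 26])   # cell at (1.0 m forward, 0.1 m left) -> True
```

A fresh grid of this kind would be rebuilt every frame, which is what makes the map "short-term": nothing needs to be stored or maintained between runs.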


The robot is largely visually guided as this is the most cost-effective solution. In particular, its command language, manipulation actions, and navigation abilities all revolve around the concept of "objects". It is up to the vision system to deliver the necessary groundings for these terms. To do this there are several classes of problems it needs to solve.

  • Figure/ground separation: The robot will need to be able to autonomously divide the world into solid objects and locally convex spaces. The space representation would come from a short-range depth map that is also used for obstacle avoidance. This would let us detect positive spaces (e.g. objects) as well as negative spaces (e.g. rooms). We could then localize objects based on pre-attentive color and texture segmentation.

  • Object recognition: We can obtain some overall characterization of an object based simply on its segmentation. We can get its size, overall shape, and color as bulk properties. This can be refined by looking at more detailed descriptions around "interest points" such as corners or inflections. Commonly used descriptors for this purpose include SIFT (Scale-Invariant Feature Transform) and HOG (Histogram of Oriented Gradients).

  • Deictic gestures: The robot needs to build associations between detected objects and their linguistic denotations. The way humans do this is by pointing at or "polishing" the relevant item. We need methods to draw the robot's attention to some particular object or location in order to impart a name.

  • Face finding & recognition: Sometimes the robot might be asked to bring an object to a specific person. Alternatively, to improve the robustness of speech recognition it can help to know who is speaking. While a person is a general object, there is some evidence that primates have a special-purpose facility for finding and identifying faces. Similar face-specific algorithms exist in computer vision and could be usefully incorporated into the system.
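
The "bulk properties" mentioned in the object recognition bullet can be computed directly from a segmentation mask. The sketch below assumes the figure/ground step has already produced a boolean mask; the particular properties chosen (pixel area, bounding-box aspect ratio, mean colour) are illustrative, not the system's actual feature set:

```python
import numpy as np

def bulk_properties(image, mask):
    """Coarse object description from a segmentation mask alone.

    image: (H, W, 3) RGB array; mask: (H, W) boolean object mask.
    Returns pixel area, bounding-box aspect ratio, and mean colour --
    the sort of size/shape/colour summary available before any
    interest-point descriptors (SIFT, HOG) are computed.
    """
    ys, xs = np.nonzero(mask)
    area = len(xs)                       # size: number of object pixels
    h = ys.max() - ys.min() + 1
    w = xs.max() - xs.min() + 1
    aspect = w / h                       # shape: wide vs. tall
    mean_color = image[mask].mean(axis=0)  # colour: average over the mask
    return area, aspect, mean_color

img = np.zeros((20, 20, 3))
img[5:10, 5:15] = [1.0, 0.0, 0.0]        # a red 5x10 patch
m = np.zeros((20, 20), dtype=bool)
m[5:10, 5:15] = True
area, aspect, color = bulk_properties(img, m)
print(area, aspect, color)               # 50 2.0 [1. 0. 0.]
```

Even these crude numbers are enough to ground adjectives like "big", "long", and "red" against a detected object.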


This is the core of the research problem -- binding objects to terms and interpreting commands. It also can be broken into a number of subparts.

  • Predicates and arguments: The term "language model" here is meant to cover grammars, n-gram language models, and n-gram language models with embedded grammars. In the simplest case a constrained grammar can be tied to actions and entities, but this would be fairly rigid and difficult to extend. Using an n-gram language model with tokens that represent embedded grammars would provide flexibility in command usage, with the language model acting as filler between grammar slots that are tied to the robot's actions. In general we need to be able to pick out the action/command elements and determine what their arguments/objects are.

  • Relation handling: Sometimes we do not just want any "book", but instead the book "on the counter". For this reason it is important to bind adjectives and prepositional phrases to their heads. Similarly, it may not make sense to the robot to just "drop" an item, but instead it should drop it "into the box". Such locative phrases must be correctly attached to the matrix verb.

  • Dialog: Sometimes the argument of an action may not be clear, or may not even be known yet. Thus it is important that the robot be able to ask questions to fill in the required blanks. Sometimes this is most efficiently done by tying specialized language models to particular dialog states.

    One possible way to model the dialog with the robot is to draw analogies to natural language call routers. For example, using a vector space model we may train a routing matrix based on the statistics of words and word sequences paired with the labelled action. These routing classifiers can be designed for the various dialog states of the robot, and are readily adaptable in a semi-supervised approach. The decoding graph here could be a general n-gram language model.

    Hong-Kwang Jeff Kuo, Chin-Hui Lee, "Discriminative Training of Natural Call Routers", IEEE Transactions on Speech and Audio Processing, Vol. 11, No. 1, January 2003.

  • Sequencing and control structure: To do useful work it would not make sense to drive the robot using a "linguistic joystick", that is a series of discrete one-time commands like "turn right 90 degrees" or "go forward". Instead we would like to load the robot up with a whole sequence like "go through the door then grab the red book to your right". To do this we must separate the two commands conjoined by the word "then", and realize that the robot has to turn to the right so it can see the book before it has any hope of grabbing it. There are other linguistic patterns indicating conditional action (e.g. "until") or looping (e.g. "all") that must also be properly handled. Basic tools exist for statistical parsing, which will provide us with robust techniques to parse a phrase containing multiple commands. By parsing these multiple commands and designing appropriate action classifiers as discussed above we have a structured way to build up the language interface to the robot.

  • Noun addition: Nouns generally are an "open" class -- new elements are routinely added over time. Generally the robot's focus of attention will be on some object, person, surface, room, or passageway indicated by the user. The question then becomes how to associate a name with this item. From parsing it might be determined that there is only one unknown in the sentence and hence the attended object must be its referent. Alternatively, it may be more robust to have an explicit command sequence for teaching new words like "Robot, this object is a ... BOX."

  • Verb addition: We plan to treat verbs as a "semi-closed" class. That is, the robot will have innate bindings for all its basic actions such as "goto", "grasp", "drop", etc. New verbs are added as subroutines composed of sequences of these primitives. For instance the robot could be taught that "To CLEAN-UP this room means ... grab each of the toys on the floor, goto the corner of the room, and drop them in the box". The challenge becomes understanding how to parameterize the inferred subroutine and how to infer its control structure (conditional, looping, etc.).
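
The sequencing step above can be sketched with a toy splitter. The action vocabulary and the split-on-"then" heuristic are illustrative assumptions (a real system would use the statistical parsing and action classifiers discussed above):

```python
import re

# Hypothetical innate action vocabulary for the sketch.
ACTIONS = {"go", "goto", "grab", "drop", "turn"}

def parse_sequence(utterance):
    """Split a compound command on 'then' and pull out verb + argument.

    Returns a list of (action, argument) pairs to be executed in order,
    e.g. "go through the door then grab the red book" ->
    [('go', 'through the door'), ('grab', 'the red book')]
    """
    steps = []
    for clause in re.split(r"\bthen\b", utterance.lower()):
        words = clause.strip().split()
        if not words:
            continue
        verb = words[0]
        if verb in ACTIONS:               # only clauses led by a known verb
            steps.append((verb, " ".join(words[1:])))
    return steps

print(parse_sequence("go through the door then grab the red book"))
```

A learned verb like "poke" or "clean-up" would then simply be a stored list of such (action, argument) steps, replayed with its arguments rebound to the objects named in the new command.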


To enhance usability we believe that the robot should respond to verbal commands. However this does not necessarily entail the full difficulty of some speech systems. Still, there are some high level topics that must be dealt with on the speech side.

  • Confidence estimation: This is important for any type of disambiguation dialog: a false rejection rate must be chosen that is low enough to be acceptable yet still drives up the rate of correctly decoded actions. This rejection threshold can be relaxed as the ASR decoded output and the action classifiers become more reliable.

    B. Maison and R.A. Gopinath, "Robust Confidence Annotation and Rejection for Continuous Speech Recognition", Proc. ICASSP 2001.

  • Acoustic add-word: We would like the ability to build up the vocabulary of the robot on the fly. A simple example is a newly introduced object whose name, after the appropriate disambiguation dialog, confidence scoring indicates is not yet in the vocabulary.

    B. Ramabhadran, L.R. Bahl, P.V. DeSouza and M. Padmanabhan, "Acoustics-Only Based Automatic Phonetic Baseform Generation", Proc. ICASSP 1998

  • Speaker adaptation: It is reasonable to assume that the typical user will make use of the robot enough that speaker-specific acoustic models will be available, either through MAP/MLLR adaptation, or through unsupervised (supervised if reasonable) adaptation of discriminative feature space transforms (recent work on fMMI level-2 speaker-specific transforms). The method of adaptation is ultimately dictated by the amount of speaker-specific training data available. The speaker-enrolled acoustic models can be reliably and rapidly selected as needed using visual biometrics.

    For short enrollment, or unseen speakers, robust techniques exist for maximum likelihood feature space transform methods.

    V.Goel, K. Visweswariah, R. Gopinath, "Rapid Adaptation with Linear Combinations of Rank-One Matrices", Proc. ICASSP 2002.
    K. Visweswariah, V. Goel, R.A. Gopinath, "Structuring Linear Transformations for Adaptation Using Training Time Information", Proc. ICASSP 2002

  • Far-Field Speech: The ultimate goal is to have the ability to speak with the robot in a far-field manner. The acoustic channel in this case brings its own set of ASR performance issues. Some existing techniques in beam forming can elevate the SNR in this condition. We may also constrain the position of the robot to be in front of the speaker, and have the robot use a directional microphone.

    J. Huang, E. Marcheret, K. Visweswariah, V. Libal, G. Potamianos, "The IBM Rich Transcription 2007 Speech to Text Systems for Lecture Meetings", CLEAR-2007.

    The visual channel can also be used to drive up the effective SNR by bringing this modality into the ASR process. Dynamic stream weighting at the ASR HMM level of the audio and visual modalities allows us to take into account rapid changes in both the audio and visual environments.

    G. Potamianos, J. Luettin, and C. Neti, "Hierarchical discriminant features for Audio Visual LVCSR", Proc. ICASSP 2001.
    E. Marcheret, V. Libal, G. Potamianos, "Dynamic Stream Weight Modeling for Audio Visual Speech Recognition", Proc. ICASSP 2007.

    Another important role for the visual channel is assisting with determination of who is speaking, which is needed to drive the directional microphones and becomes relevant when multiple people are present and in close proximity. Here you need to address both the diarization and the audio/visual synchrony problems.

    K. Kumar, J. Navratil, E. Marcheret, V. Libal, G. Potamianos,"Robust Audio-Visual Speech Synchrony Detection by Generalized Bimodal Linear Prediction", Proc. Interspeech 2009.
    J. Huang, E. Marcheret, K. Visweswariah, "Improving Speaker Diarization for CHIL Lecture Meetings", Proc. Interspeech 2007.

    As an initial step, however, we propose to use close-talking microphones, or possibly a cell phone, to design the language understanding portion of the project. Yet the ultimate goal is to end up with a far-field implementation.
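
The confidence-gated dialog described in the bullets above can be sketched as a simple three-way decision. The thresholds here are made-up illustrative values, not tuned operating points:

```python
def route_decoded(action, confidence, accept=0.75, clarify=0.40):
    """Three-way decision on an ASR-decoded action using its confidence.

    Above `accept` the action executes directly; in the grey zone the
    robot asks a disambiguation question; below `clarify` the utterance
    is rejected outright and the user is asked to repeat it.
    """
    if confidence >= accept:
        return ("execute", action)
    if confidence >= clarify:
        return ("confirm", f"Did you mean: {action}?")
    return ("reject", "Sorry, please repeat that.")

print(route_decoded("grab the red book", 0.55))
# -> ('confirm', 'Did you mean: grab the red book?')
```

As the acoustic models adapt to the speaker and far-field conditions improve, the `accept` threshold can be lowered, reducing how often the robot interrupts to confirm.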