The DeepQA Research Team - Engineering and Science


A team of approximately 25 IBM researchers built Watson and its underlying DeepQA architecture in just four years. To make such rapid progress, the team had to create an entire environment for high-speed innovation.

At the core of this environment is a rigorous software engineering methodology in which research and development investment is guided by metrics. Only by constantly measuring Watson's performance and identifying its current strengths and weaknesses could the team decide how best to spend its limited resources and explore new algorithms, analytics, and approaches.
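As a rough illustration of what metrics-guided development can look like, the sketch below computes two headline measures over a batch of results: overall accuracy, and precision over only the questions the system answers above a confidence threshold. The metric names and data format here are assumptions for illustration, not Watson's actual instrumentation.

```python
# Hedged sketch: metrics that might guide investment decisions.
# Data format and metric choices are illustrative assumptions.

def accuracy(results):
    """Fraction of questions whose top-ranked answer was correct."""
    return sum(1 for r in results if r["correct"]) / len(results)

def precision_at_threshold(results, threshold):
    """Precision over only the questions answered with confidence at or
    above `threshold`, plus the fraction of questions attempted --
    useful when the system may decline to answer."""
    attempted = [r for r in results if r["confidence"] >= threshold]
    if not attempted:
        return 0.0, 0.0
    precision = sum(1 for r in attempted if r["correct"]) / len(attempted)
    return precision, len(attempted) / len(results)

results = [
    {"correct": True,  "confidence": 0.92},
    {"correct": False, "confidence": 0.35},
    {"correct": True,  "confidence": 0.71},
    {"correct": False, "confidence": 0.64},
]
print(accuracy(results))                     # 0.5
print(precision_at_threshold(results, 0.7))  # (1.0, 0.5)
```

Tracking how such numbers move after each change is what lets a team see where a new analytic helps and where it hurts.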

Of course, performance results are meaningful only if they are collected over a large enough data set, so early on the team adopted a methodology of evaluating Watson over very large test sets containing several thousand questions. Dealing with such large data sets creates two problems. First, the logging information from a single experiment is enormous (tens of gigabytes) and impossible to understand in raw form. Second, running large-scale experiments is expensive and time-consuming, especially since Watson can take nearly two hours to evaluate a single clue on a single high-end processing core.
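A back-of-envelope calculation using the figures above shows why this is a problem. The specific clue count and cluster size below are assumptions chosen only to make the arithmetic concrete.

```python
# Rough cost estimate for one large experiment, using the numbers
# from the text: several thousand clues at roughly two core-hours
# per clue. Exact values are illustrative assumptions.

clues = 3000              # "several thousand questions"
core_hours_per_clue = 2   # "nearly two hours ... on a single core"

sequential_hours = clues * core_hours_per_clue
print(sequential_hours)        # 6000 core-hours for one experiment
print(sequential_hours / 24)   # 250 days if run on a single core

cores = 2000                   # hypothetical cluster capacity
print(sequential_hours / cores)  # 3.0 -- "just a few hours"
```

The sequential figure, on the order of hundreds of days per experiment, is what makes a dedicated cluster unavoidable.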

To address the first problem, the team created sophisticated error analysis tools using DB2 to store experimental results and a web front end that allows quick access to summary performance metrics, a wide variety of ways to filter and compare experiments, and the ability to drill down into the details of how Watson came up with its final answer list. The team also created a tool for detailed browsing and visualization of the various evidence dimensions that contribute to an answer's final score and confidence.
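A minimal sketch of this kind of results store is shown below, with SQLite standing in for DB2. The schema, column names, and experiment identifiers are illustrative assumptions, not the team's actual design; the point is that once per-answer results live in a database, summary metrics like top-rank accuracy become simple queries for a front end to surface.

```python
# Hedged sketch of a results store for error analysis.
# SQLite is a stand-in for DB2; schema is an assumption.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE answers (
        experiment_id TEXT,
        question_id   INTEGER,
        answer        TEXT,
        rank          INTEGER,
        confidence    REAL,
        correct       INTEGER
    )""")

rows = [
    ("exp-042", 1, "Toronto", 1, 0.31, 0),
    ("exp-042", 1, "Chicago", 2, 0.27, 1),
    ("exp-042", 2, "Jupiter", 1, 0.88, 1),
]
conn.executemany("INSERT INTO answers VALUES (?,?,?,?,?,?)", rows)

# One summary metric a web front end might expose: the fraction of
# questions whose top-ranked answer was correct.
(acc,) = conn.execute("""
    SELECT AVG(correct) FROM answers
    WHERE experiment_id = 'exp-042' AND rank = 1
""").fetchone()
print(acc)  # 0.5
```

Storing every candidate answer, not just the top one, is what makes it possible to drill down from a summary number to the full ranked list behind any individual question.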

To address the second problem, the team created the QACollider, a computational cluster with hundreds of high-end, large-memory, multicore servers running a cluster management and load balancing system that schedules experiments on the cluster and manages resource sharing among the users. The QACollider enables a single experiment of several thousand clues to run in just a few hours and allows the entire DeepQA team to share the cluster fairly with near-optimal resource utilization.
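The core load-balancing idea can be sketched in a few lines: place each unit of work on whichever server is currently least loaded. The greedy heap-based policy below is an assumption for illustration, not the QACollider's actual scheduler, but it shows how independent clues spread evenly across a pool of machines.

```python
# Minimal load-balancing sketch: assign each clue to the currently
# least-loaded server. Greedy heap policy is an illustrative
# assumption, not the actual QACollider scheduler.
import heapq

def schedule(clue_costs, n_servers):
    """Greedily place each clue (by estimated core-hours) on the
    least-loaded server; return the sorted per-server loads."""
    heap = [(0.0, s) for s in range(n_servers)]  # (load, server id)
    heapq.heapify(heap)
    for cost in clue_costs:
        load, server = heapq.heappop(heap)
        heapq.heappush(heap, (load + cost, server))
    return sorted(load for load, _ in heap)

# Ten two-hour clues spread over four servers.
loads = schedule([2.0] * 10, 4)
print(loads)  # [4.0, 4.0, 6.0, 6.0]
```

Because each clue is evaluated independently, the experiment's wall-clock time shrinks to roughly the heaviest single server's load, which is what turns months of sequential compute into a few hours.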

The final component of conducting science in this rapid innovation environment is careful versioning of the system to ensure that experiments can be repeated and progress can be carefully tracked. The DeepQA team addressed this by adopting strict source code versioning and tracking protocols and, in particular, a regular system release and test process in which all of the system components are tagged and a "weekly run" is conducted to evaluate the system on a suite of test sets.
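The essence of such a weekly run is small: evaluate a tagged version of the system on each test set, file the scores under that tag, and compare against earlier tags to track progress. The driver below is a hedged sketch; the tag format, test-set layout, and stand-in evaluator are illustrative assumptions, not the team's actual release process.

```python
# Hedged sketch of a "weekly run" driver: score a tagged system
# version on a suite of test sets and keep results keyed by tag so
# progress between releases can be tracked. All names here are
# illustrative stand-ins.

def weekly_run(tag, test_sets, evaluate, history):
    """Run `evaluate` on each test set, record scores under the
    release tag in `history`, and return the new scores."""
    scores = {name: evaluate(questions)
              for name, questions in test_sets.items()}
    history[tag] = scores
    return scores

# Scores from a previous tagged run, kept for comparison.
history = {"v1.0": {"regression": 0.60}}
test_sets = {"regression": [("clue", "answer")] * 4}

# Stand-in evaluator: pretend 3 of the 4 clues are answered correctly.
scores = weekly_run("v1.1", test_sets, lambda qs: 3 / len(qs), history)
print(scores["regression"])  # 0.75
```

Keying every result by an immutable release tag is what makes an experiment repeatable: the same tag can be checked out and rerun months later, and any score movement can be attributed to a specific set of changes.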