Stream Computing Platforms, Applications, and Analytics - overview


Overview

We conduct research in the areas of data-intensive applications, analytics, and platforms with a particular focus on stream computing. Most of our research efforts come under the umbrella System S project, which spans several teams at the IBM Thomas J. Watson Research Center.

Stream Computing

The stream processing computational paradigm consists of assimilating data readings from collections of software or hardware sensors in stream form (i.e., as an infinite series of tuples), analyzing the data, and producing actionable results, possibly in stream format as well.

Stream processing application.

In a stream processing system, applications typically act as continuous queries, ingesting data continuously, analyzing and correlating the data, and generating a stream of results. Applications are represented as data-flow graphs composed of operators and interconnected by streams, as shown in the figure. The individual operators implement algorithms for data analysis, such as parsing, filtering, feature extraction, and classification. Such algorithms are typically single-pass because of the high data rates of external feeds (e.g., market information from stock exchanges, environmental sensors readings from sites in a forest, etc.).

Stream processing applications are usually constructed to identify new information by incrementally building models and assessing whether new data deviates from model predictions and, thus, is interesting in some way. For example, in a financial engineering application, one might be constructing pricing models for options on securities, while at the same time detecting mispriced quotes, from a live stock market feed. In such an application, the predictive model itself might be refined as more market data and other data sources become available (e.g., a feed with weather predictions, estimates on fuel prices, or headline news).

Streams applications may consist of dozens to hundreds of analytic operators, deployed on production systems hosting many other potentially interconnected stream applications, distributed over a large number of processing nodes.

System S

System S is a distributed stream processing platform under development by our group. The project consists of a multi-disciplinary effort, bringing together researchers from several Computer Science areas from High Performance Systems, Programming Languages, Knowledge Representation, Data Management, to Optimization, and Analytics (including experts from signal processing and data mining areas).

The System S prototype is capable of running large distributed data analysis applications, demonstrating the ingestion and processing of high-bandwidth multi-modal data streams (audio, video, text, low-level network traffic, among others). System S prototype has been productized under the name IBM InfoSphere Streams. Advanced research in the area of stream computing still continues on the research prototype.

Areas of Research

Our group conducts research in the following areas:

  • Systems: Runtime design for distributed stream processing systems, including transport, scheduling, parallelization, debugging and profiling, fault-tolerance, load shedding, hardware acceleration, to name a few.
  • Distributed Data Management: Shared state and storage support for streaming systems, integration with Big Data technologies like Hadoop and HDFS, distributed key/value stores, graph-analytics and graph-processing systems.
  • Languages & Compilers: Language and compiler design for stream processing systems, including streaming languages, intermediate representations, optmizations, language translation, macro systems, etc.
  • Analytics: Time series analysis, signal processing, and data mining for streaming systems, with a focus on adaptive and proactive analytics. Integration of offline and online analytics.
  • Knowledge Representation: Automated composition of flows, goal-oriented planning in streaming systems.

Selected Publications

The System S project has produced hundreds of publications. Here we include some of the most significant papers on System S. Please see the Publications tab for a more complete list.