Stream Computing Platforms, Applications, and Analytics - overview
Overview
We conduct research in the areas of data-intensive applications, analytics, and platforms, with a particular focus on stream computing. Most of our research efforts come under the umbrella of the System S project, which spans several teams at the IBM Thomas J. Watson Research Center.
The stream processing computational paradigm consists of assimilating data readings from collections of software or hardware sensors in stream form (i.e., as an infinite series of tuples), analyzing the data, and producing actionable results, possibly in stream format as well.
In a stream processing system, applications typically act as continuous queries: they ingest data continuously, analyze and correlate it, and generate a stream of results. Applications are represented as data-flow graphs composed of operators interconnected by streams, as shown in the figure. The individual operators implement algorithms for data analysis, such as parsing, filtering, feature extraction, and classification. Such algorithms are typically single-pass because of the high data rates of external feeds (e.g., market information from stock exchanges or environmental sensor readings from sites in a forest).
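As a rough illustration of the data-flow idea described above, the sketch below chains three single-pass operators using Python generators. It is not System S or SPL code; the `Quote` tuple type, the `SYMBOL,PRICE` input format, and all function names are assumptions made for this example only.

```python
from typing import Iterable, Iterator, NamedTuple


class Quote(NamedTuple):
    """Hypothetical tuple type carried on the stream."""
    symbol: str
    price: float


def parse(lines: Iterable[str]) -> Iterator[Quote]:
    """Parsing operator: turn raw 'SYMBOL,PRICE' text into tuples."""
    for line in lines:
        symbol, price = line.strip().split(",")
        yield Quote(symbol, float(price))


def keep_symbol(quotes: Iterable[Quote], symbol: str) -> Iterator[Quote]:
    """Filtering operator: pass through only tuples for one symbol."""
    for quote in quotes:
        if quote.symbol == symbol:
            yield quote


def running_mean(quotes: Iterable[Quote]) -> Iterator[float]:
    """Feature-extraction operator: single-pass running mean of the price.

    Only a count and the current mean are kept, so each tuple is seen once
    and the input may be unbounded.
    """
    count = 0
    mean = 0.0
    for quote in quotes:
        count += 1
        mean += (quote.price - mean) / count
        yield mean


def build_flow(source: Iterable[str], symbol: str) -> Iterator[float]:
    """Connect the operators into a linear data-flow graph."""
    return running_mean(keep_symbol(parse(source), symbol))
```

Because each operator pulls one tuple at a time from its upstream neighbor, `build_flow` can be driven by any iterable source, including one that never ends. A distributed platform would place such operators on different nodes and connect them by network transport, which this single-process sketch does not attempt.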
Stream processing applications are usually constructed to identify new information by incrementally building models and assessing whether new data deviates from model predictions and, thus, is interesting in some way. For example, in a financial engineering application, one might be constructing pricing models for options on securities, while at the same time detecting mispriced quotes, from a live stock market feed. In such an application, the predictive model itself might be refined as more market data and other data sources become available (e.g., a feed with weather predictions, estimates on fuel prices, or headline news).
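The pattern of building a model incrementally while checking new data against it can be sketched as follows. This is a minimal example under stated assumptions, not the pricing model discussed above: the "model" is simply a running mean and standard deviation (maintained with Welford's method), and the `threshold` and `warmup` parameters are hypothetical choices for illustration.

```python
import math
from typing import Iterable, Iterator, Tuple


class RunningStats:
    """Incrementally maintained mean and variance (Welford's method)."""

    def __init__(self) -> None:
        self.count = 0
        self.mean = 0.0
        self.m2 = 0.0  # sum of squared differences from the current mean

    def update(self, value: float) -> None:
        self.count += 1
        delta = value - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (value - self.mean)

    def stddev(self) -> float:
        """Sample standard deviation; 0.0 until two values have been seen."""
        if self.count < 2:
            return 0.0
        return math.sqrt(self.m2 / (self.count - 1))


def flag_deviations(
    values: Iterable[float], threshold: float = 3.0, warmup: int = 5
) -> Iterator[Tuple[float, bool]]:
    """Yield (value, interesting) pairs in a single pass.

    A value is flagged when it lies more than `threshold` standard
    deviations from the mean of the values seen before it. The model is
    then refined with that value, flagged or not.
    """
    stats = RunningStats()
    for value in values:
        std = stats.stddev()
        interesting = (
            stats.count >= warmup
            and std > 0.0
            and abs(value - stats.mean) > threshold * std
        )
        yield value, interesting
        stats.update(value)
```

Each value is tested against the model as it stood before that value arrived, and only then used to refine it. Whether flagged values should also update the model is a design choice; this sketch includes them for simplicity.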
Stream applications may consist of dozens to hundreds of analytic operators, deployed on production systems hosting many other potentially interconnected stream applications, distributed over a large number of processing nodes.
System S is a distributed stream processing platform under development by our group. The project is a multidisciplinary effort that brings together researchers from several areas of Computer Science, ranging from High Performance Systems, Programming Languages, Knowledge Representation, and Data Management to Optimization and Analytics (including experts in signal processing and data mining).
The System S prototype is capable of running large distributed data analysis applications, demonstrating the ingestion and processing of high-bandwidth multi-modal data streams (audio, video, text, and low-level network traffic, among others). The System S prototype has been productized under the name IBM InfoSphere Streams. Advanced research in the area of stream computing continues on the research prototype.
Areas of Research
Our group conducts research in the following areas:
- Systems: Runtime design for distributed stream processing systems, including transport, scheduling, parallelization, debugging and profiling, fault tolerance, load shedding, and hardware acceleration, among others.
- Distributed Data Management: Shared state and storage support for streaming systems, integration with Big Data technologies like Hadoop and HDFS, distributed key/value stores, and graph-analytics and graph-processing systems.
- Languages & Compilers: Language and compiler design for stream processing systems, including streaming languages, intermediate representations, optimizations, language translation, and macro systems.
- Analytics: Time series analysis, signal processing, and data mining for streaming systems, with a focus on adaptive and proactive analytics. Integration of offline and online analytics.
- Knowledge Representation: Automated composition of flows, goal-oriented planning in streaming systems.
Selected Publications
The System S project has produced hundreds of publications. Here we include some of the most significant papers on System S. Please see the Publications tab for a more complete list.
- SPL Stream Processing Language Specification, M Hirzel, H Andrade, B Gedik, V Kumar, G Losa
- A Model-Based Framework for Building Extensible, High Performance Stream Processing Middleware and Programming Language for IBM InfoSphere Streams, B Gedik and H Andrade, Software: Practice & Experience, 2012
- Modeling Stream Processing Applications for Dependability Evaluation, Gabriela Jacques-Silva, Zbigniew Kalbarczyk, Buğra Gedik, Henrique Andrade, Kun-Lung Wu, and Ravishankar K. Iyer, International Conference on Dependable Systems and Networks, IEEE/IFIP DSN, 2011
- COLA: Optimizing Stream Processing Applications Via Graph Partitioning, Rohit Khandekar, Kirsten Hildrum, Sujay Parekh, Deepak Rajan, Joel Wolf, Henrique Andrade, Kun-Lung Wu, and Buğra Gedik, ACM/IFIP/USENIX International Middleware Conference, Middleware, 2009
- Scale-up strategies for processing high-rate data streams in System S, H Andrade, B Gedik, K L Wu, P S Yu, Data Engineering, 2009
- SODA: an optimizing scheduler for large-scale stream-based distributed computer systems, J Wolf, N Bansal, K Hildrum, S Parekh, D Rajan, R Wagle, K L Wu, L Fleischer, Proceedings of the 9th ACM/IFIP/USENIX International Conference on Middleware, Springer, 2008