IBM Workshop on Big Data Analytics - June 28, 2013 - Program & Schedule

Program At A Glance

Friday June 28th, 2013

8:00 - 8:40

Registration + Breakfast

8:40 - 9:00

Welcome Speech by David McQueeney, Vice President, IBM Research

9:00 - 9:30

Laura Haas: The Real Challenge with Big Data for Big Business Isn't Size

9:30 - 10:00

Jignesh Patel: Data @ Bare Metal Speed

10:00 - 10:30

Hamid Pirahesh: Big Data Platform and Ecosystem, Nothing new or Not Your Grandfather's SQL Ones

10:30 - 11:00

Coffee Break

11:00 - 11:30

Ashraf Aboulnaga: Workload Management for Big Data Analytics

11:30 - 12:00

Tasos Kementsietsidis: Large-Scale Graph Data Management for Big Data and Beyond

12:00 - 12:30

Tim Kraska: MLbase: A User-friendly System for Distributed Machine Learning

12:30 - 13:30

Lunch Break

13:00 - 14:30

Poster Session (see list of posters below)

14:30 - 15:30

Panel on Enterprise Data Analytics

15:30 - 16:00

Murthy Devarakonda: Big Data Challenges in Watson

16:00 - 16:30

Floris Geerts, Wenfei Fan, Frank Neven: Making Queries Tractable on Big Data with Preprocessing

16:30 - 17:30

Open Discussion Sessions and One-to-One Meetings


Program Details - Talks

Laura Haas, IBM Fellow

TITLE: The Real Challenge with Big Data for Big Business Isn't Size

ABSTRACT: Yes, there's lots of data out there, but we have a whole host of new platforms that can process those volumes while doing complex analytics at warp speed. While large-scale science will continue to demand improvements in these platforms and the analytics algorithms that run on them, the current capabilities are more than powerful enough to handle most of the world's business problems today. However, according to a 2012 IBM study, only 6% of all businesses are really exploiting Big Data, and fewer than 30% have even tried a small pilot. We believe that Big Data can be leveraged to accelerate discoveries and innovation in many industries, for many aspects of businesses. However, unlocking the value of this data requires not only a powerful platform, but a rich environment that makes these tools consumable, lowering skill and expertise barriers to their use. This talk will describe the efforts of the IBM Research Accelerated Discovery Lab to address this challenge, highlight a few important research topics, and describe some of the emerging use cases that we are trying to address.

Jignesh Patel, University of Wisconsin-Madison

TITLE: Data @ Bare Metal Speed

ABSTRACT: Big data systems (SQL, NoSQL, and everything in between) today largely employ data processing techniques that were developed for relational database management systems many decades ago. However, the underlying hardware on which such data management software runs has made a fundamental shift in recent years, driven by two dominating factors. First, driven by energy dissipation considerations, we are now firmly in the era of massive multicore multi-socket systems. Second, for the first time in many decades, the traditional memory and I/O hierarchy is now undergoing a dramatic transformation, with large main memory and NVRAM-based technologies shaking up the memory hierarchy. A natural question that then follows is: Should we design and build the internals of a contemporary big data processing engine in dramatically different ways for this new hardware reality? The Quickstep project at Wisconsin aims to answer this question by designing, building and evaluating data processing kernels that aim to run at the speed of the bare metal hardware. This talk presents what we have discovered so far, showing how some core data processing techniques (e.g. joins and storage organization) must be reconsidered to deliver data processing rates that fully utilize the underlying hardware.

BIO: Jignesh Patel is a Professor in Computer Sciences at the University of Wisconsin-Madison, which is where he also got his PhD. He has worked in the area of database systems for over two decades. He is the recipient of an NSF Career Award, and multiple Google, IBM and Microsoft faculty awards. He also has a strong interest in seeing research ideas transition to actual products. His thesis work was commercialized, and acquired by NCR/Teradata. At Wisconsin, he founded an undergraduate entrepreneurship contest called CS NEST (along with two undergraduate students) to foster startups. A number of startups seeded from this program are now operating in Madison, WI. Jignesh is also an ACM Distinguished Scientist.

Hamid Pirahesh, IBM Fellow

TITLE: Big Data Platform and Ecosystem, Nothing new or Not Your Grandfather's SQL Ones

Ashraf Aboulnaga, Qatar Computing Research Institute

TITLE: Workload Management for Big Data Analytics

ABSTRACT: This talk is based on a SIGMOD 2013 tutorial by Shivnath Babu and me. I will present an overview of workload management research, and highlight the increasing importance of this research in the era of Big Data. Systems for Big Data analytics (e.g., parallel database systems and Hadoop) process multiple concurrent workloads consisting of complex user requests, where each request is associated with an (explicit or implicit) service level objective. Workload management focuses on ensuring that the system meets the service level objectives of various requests while at the same time minimizing the resources required to achieve this goal. This requires technical innovation in the areas of scheduling and resource allocation, performance prediction, workload isolation, and execution control, all of which will be covered in this talk.

Anastasios Kementsietsidis, IBM Research

TITLE: Large-Scale Graph Data Management for Big Data and Beyond

ABSTRACT: Whether it's Big Data, Linked Data, or Your Data, managing graph data is an integral part of any future data analytics platform. Yet, graph data management is still in its infancy and there are many open challenges. In this talk, we address some of these challenges and describe a novel storage and query mechanism for graph data which works on top of existing relational representations. There are significant challenges in storing graph data in relational form, which include data sparsity and schema variability. We describe novel mechanisms to shred graph data into relational form that do not require schema changes as the graph data evolve, and novel query translation techniques to maximize the advantages of this shredded representation. We show that these mechanisms result in consistently good performance across multiple benchmarks, even when compared with current state-of-the-art stores. This work is going to be presented at SIGMOD 2013, and provides the basis for graph support in DB2 v.10.1.

BIO: Anastasios Kementsietsidis is a Research Staff Member at IBM T.J. Watson Research Center. Anastasios has a PhD in computer science from the University of Toronto. He is currently interested in various aspects of graph data management (including querying, storing and benchmarking graph data). In the past, he worked on data integration, cleaning, provenance and annotation, as well as (distributed) query evaluation and optimization on relational and semi-structured data. He has several publications in the leading database conferences, including a best paper award at ICDE 2007 and a best demo award at EDBT 2006, and his CIKM 2009 paper was a runner-up for a best paper award. He has served on the program committee of several leading conferences and workshops. Anastasios is a Senior Member of IEEE and a regular member of ACM SIGMOD.

Tim Kraska, Brown University

TITLE: MLbase: A User-friendly System for Distributed Machine Learning

ABSTRACT: Machine learning (ML) and statistical techniques are crucial to transforming Big Data into actionable knowledge. However, the complexity of existing ML algorithms is often overwhelming. End-users often do not understand the trade-offs and challenges of parameterizing and choosing between different learning techniques. Furthermore, existing scalable systems that support ML are typically not accessible to ML developers without a strong background in distributed systems and low-level primitives. In this work we present MLbase, a system designed to tackle both of these issues simultaneously. MLbase provides (1) a simple declarative way for end-users to specify ML tasks, (2) a novel optimizer to select and dynamically adapt the choice of learning algorithm, (3) a set of high-level operators to enable ML developers to scalably implement a wide range of ML methods without deep systems knowledge, and (4) a distributed run-time optimized for the data-access patterns of these high-level operators.

BIO: Tim Kraska is an Assistant Professor in the Computer Science department at Brown University. Currently, his research focuses on Big Data management in the cloud and hybrid human/machine database systems. Before joining Brown, Tim Kraska spent 2 years as a PostDoc in the AMPLab at UC Berkeley after receiving his PhD from ETH Zurich, where he worked on transaction management and stream processing. He was awarded a Swiss National Science Foundation Prospective Researcher Fellowship (2010), a DAAD Scholarship (2006), a University of Sydney Master of Information Technology Scholarship for outstanding achievement (2005), the University of Sydney Siemens Prize (2005), a VLDB best demo award (2011) and an ICDE best paper award (2013).

Murthy Devarakonda, IBM Research

TITLE: Big Data Challenges in Watson

ABSTRACT: After arriving on the big stage with the historic Jeopardy! win, Watson is now mastering the health care domain. For Watson, big data means not only the linear size of the domain corpus and question scenarios, but also the possible semantic relationships among words, phrases, and sentences in the corpus and in the questions. Those possible relationships are big, really big. This is partly why the Jeopardy! Watson, while using only a few hundred TBytes of corpus, required 3000 cores to determine a response in 2-6 seconds. The problem is even worse in the medical domain, because the discourse and inferencing are even harder. In addition, we raised the bar for Watson by moving the problem to one of inference chaining from factors in the question to possible answers rather than just producing an answer with a confidence score. We also raised the bar in terms of more complex question scenarios and larger question inputs such as Electronic Medical Records. This talk will discuss the new challenges and current progress along these lines in the Watson project.

BIO: Murthy Devarakonda is a Research Staff Member and Manager at IBM T. J. Watson Research Center. His career at IBM Research involved transforming results from systematic measurement, monitoring and study of systems into foundational observations, and applying them to novel systems design and management. He practiced this methodology in various systems areas in the past. Now, as a technical leader in the Watson Technologies team in IBM Research, he is leading the adaptation of Watson to extracting insights from Electronic Medical Records to help physicians in patient care. He received his Ph.D. from the University of Illinois at Urbana-Champaign, and he is an ACM Distinguished Engineer.

Floris Geerts, University of Antwerp

TITLE: Making Queries Tractable on Big Data with Preprocessing

ABSTRACT: A query class is traditionally considered tractable if there exists a polynomial-time (PTIME) algorithm to answer its queries. When it comes to big data, however, PTIME algorithms often become infeasible in practice. A traditional and effective approach to coping with this is to preprocess data off-line, so that queries in the class can be subsequently evaluated on the data efficiently. In this talk we discuss a formal foundation of such preprocessing approaches in terms of computational complexity. This work is a first step towards a revised complexity theory that allows us to distinguish those query classes that can be regarded as being feasible on big data from those that cannot. This is joint work with Wenfei Fan & Frank Neven.


Poster Presentations

  • Communication Steps for Parallel Query Processing
    By: Paraschos Koutris - University of Washington

  • Timeline Index: A Unified Data Structure for Processing Queries on Temporal Data
    By: Martin Kaufmann - ETHZ

  • Parallel Analytics as a Service
    By: Eric Lo - The Hong Kong Polytechnic University

  • Integrating Scale Out and Fault Tolerance in Stream Processing using Operator State Management
    By: Raul Castro Fernandez, Matteo Migliavacca, Evangelia Kalyvianaki, Peter Pietzuch - Imperial College London

  • Data-driven Neuroscience: Enabling Breakthroughs Via Innovative Data Management
    By: Thomas Heinis - EPFL

  • DBalancer: Distributed Load Balancing for NoSQL Data-stores
    By: Dimitris Tsoumakos, Giagkos Mytilinis, Ioannis Konstantinou - CSLAB-NTUA

  • Deepsea: Self-adaptive Data Partitioning and Replication in Scalable Distributed Data Systems
    By: Jiang Du - University of Toronto

  • SystemML: Declarative Machine Learning on MapReduce
    By: Matthias Boehm - IBM Almaden Research Center

  • WoW: What the World of (Data) Warehousing Can Learn from the World of Warcraft
    By: Tim Kaldewey, Rene Mueller - IBM Almaden Research Center

  • Matrix Completion Algorithms for Massive Data
    By: Rainer Gemulla - Max Planck Institute, Saarbrucken, Yannis Sismanis & Peter Haas - IBM Almaden Research Center

  • Information Scout: A Social and Intelligent Conversation to Transform Data into Insight
    By: Eser Kandogan, Mary Roth, Joshua Hui, Anshu Jain, Holger Kache, Cheryl Kieliszewski, Sarah Knoop, Fatma Ozcan, Bob Schloss, Marc-Thomas Schmidt, Peter Schwarz, Kevin Shank, Kavitha Srinivas, John Timm, Michael Ward, April Webster, Steffen Zeuch - IBM Research

  • Building 360-degree Customer Profiles from Social Media Data
    By: Mauricio Hernandez, Kirsten Hildrum, Prateek Jain, Rohit Wagle, Bogdan Alexe, Rajasekar Krishnamurthy, Ioana Roxana Stanoi, Chitra Venkatramani - IBM Research

  • ClouDiA: A Deployment Advisor for Public Clouds
    By: Tao Zou - Cornell University

  • Fast Iterative Graph Computation with Block Updates
    By: Wenlei Xie - Cornell University

  • SQPR: Stream Query Planning with Reuse
    By: Evangelia Kalyvianaki - City University London, Peter Pietzuch - Imperial College London

  • A Vision for Personalized Service Level Agreements in the Cloud
    By: Jennifer Ortiz, Victor Teixeira de Almeida, Magdalena Balazinska - University of Washington