IBM Student Workshop for Frontiers of Cloud Computing 2013 - Program
IBM Frontiers of Cloud Computing Workshop, Day 1 (Monday, July 29, 2013)
8:15am - 8:45am
8:45am - 9:00am
Introduction (Salman Baset)
9:00am - 10:00am
Keynote by Daniel Dias
10:15am - 11:45am
BIG DATA (Session chair: Roman Vaculin)
Sameer Agarwal (UC Berkeley). BlinkDB: Queries with Bounded Errors and Bounded Response Times on Very Large Data
We present BlinkDB, a massively parallel, approximate query engine for running interactive SQL queries on large volumes of data. BlinkDB allows users to trade off query accuracy for response time, enabling interactive queries over massive data by running queries on data samples and presenting results annotated with meaningful error bars. To achieve this, BlinkDB uses two key ideas: (1) an adaptive optimization framework that builds and maintains a set of multi-dimensional stratified samples from original data over time, and (2) a dynamic sample selection strategy that selects an appropriately sized sample based on a query’s accuracy or response time requirements. We evaluate BlinkDB against the well-known TPC-H benchmarks and a real-world analytic workload derived from Conviva Inc., a company that manages video distribution over the Internet. Our experiments on a 100-node cluster show that BlinkDB can answer queries on up to 17 TB of data in less than 2 seconds (over 200x faster than Hive), within an error of 2-10%.
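As an illustration of the general idea of answering a query from a sample and attaching an error bar, here is a minimal Python sketch. It is not BlinkDB's implementation: it uses plain uniform sampling rather than stratified samples, and the function names, the candidate sample sizes, and the 95% normal-approximation interval are all assumptions made for this example.

```python
import math
import random
import statistics


def approximate_mean(values, sample_size, rng, z=1.96):
    """Estimate the mean of `values` from a uniform random sample.

    Returns (estimate, half_width), where half_width is z times the
    standard error, i.e. roughly a 95% error bar when z = 1.96.
    """
    sample = rng.sample(values, sample_size)
    estimate = statistics.fmean(sample)
    std_err = statistics.stdev(sample) / math.sqrt(sample_size)
    return estimate, z * std_err


def pick_sample_size(values, max_half_width, rng, sizes=(100, 1_000, 10_000)):
    """Try sample sizes from smallest to largest (hypothetical sizes) and
    return the first one whose error bar fits within max_half_width.

    If no size meets the bound, the result for the largest size is returned.
    """
    for size in sizes:
        estimate, half_width = approximate_mean(values, size, rng)
        if half_width <= max_half_width:
            break
    return size, estimate, half_width


if __name__ == "__main__":
    data = list(range(100_000))
    size, estimate, half_width = pick_sample_size(data, 1_000.0, random.Random(0))
    print(f"sample size {size}: mean ~ {estimate:.1f} +/- {half_width:.1f}")
```

A tighter error bound forces a larger sample and therefore more work, which is the accuracy versus response time trade-off the abstract describes.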
Abdul H. Quamar (University of Maryland). Scalable big data analytics on the cloud: Challenges and Solutions
Increasing volumes of data are being stored and processed in the cloud due to the inherent advantages associated with the economies of scale. In this talk, I will present my work on addressing the challenges associated with cloud data management and processing for two different application domains.
First, I will present my ongoing work on developing a general-purpose distributed programming framework for processing large-scale graph-structured data in the cloud. A large number of graph applications are neighborhood-centric, i.e., interested in the statistical properties, behavior, and communication patterns of individual nodes or node neighborhoods in a graph. I argue that such applications are not well served by the graph programming frameworks proposed to date. I describe NScale, a neighborhood-centric distributed graph processing framework for efficient large-scale graph analytics and interactive query processing on the cloud. NScale provides materialized in-memory subgraphs (e.g., neighborhoods of interest, motifs, etc.) as a fundamental abstraction and enables processing of analytical and interactive queries in the form of neighborhood computations provided by the user.
Second, I will present a new data-parallel, progressive analytics system, called NOW, that provides early results to analytical queries based on partial data, and progressively refines these results as more data is received. NOW has “progress” built in as a first-class citizen, and provides explicit data provenance as well as efficient and deterministic query processing over progressive samples communicated to the system. If time permits, I will briefly present SWORD, our work on transparently scaling out transactional (OLTP) workloads on relational databases to support database-as-a-service in the cloud.
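To make the notion of a neighborhood-centric computation concrete, the following single-machine Python sketch extracts the k-hop neighborhood of each node of interest as an induced subgraph and applies a user-supplied function to it. This is only an illustration of the programming pattern described in the abstract: it is not NScale's API, it is not distributed, and every function name here is hypothetical.

```python
from collections import deque


def k_hop_neighborhood(adj, center, k):
    """Return the set of nodes within k hops of `center` (breadth-first search)."""
    depth = {center: 0}
    queue = deque([center])
    while queue:
        node = queue.popleft()
        if depth[node] == k:
            continue  # do not expand beyond k hops
        for nbr in adj[node]:
            if nbr not in depth:
                depth[nbr] = depth[node] + 1
                queue.append(nbr)
    return set(depth)


def induced_subgraph(adj, nodes):
    """Keep only the given nodes and the edges between them."""
    return {u: {v for v in adj[u] if v in nodes} for u in nodes}


def run_on_neighborhoods(adj, centers, k, compute):
    """Apply the user function `compute(subgraph, center)` to each k-hop neighborhood."""
    return {
        c: compute(induced_subgraph(adj, k_hop_neighborhood(adj, c, k)), c)
        for c in centers
    }


def local_clustering(sub, center):
    """Example user function: fraction of the center's neighbor pairs that are linked."""
    nbrs = sub[center] - {center}
    d = len(nbrs)
    if d < 2:
        return 0.0
    links = sum(1 for u in nbrs for v in sub[u] if v in nbrs and u < v)
    return 2.0 * links / (d * (d - 1))


if __name__ == "__main__":
    # Undirected graph as adjacency sets: a triangle 0-1-2 with a pendant node 3.
    graph = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
    print(run_on_neighborhoods(graph, graph.keys(), 1, local_clustering))
```

The user function only ever sees a small materialized subgraph, so each neighborhood can in principle be processed independently of the others.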
Oded Green (Georgia Institute of Technology). Creating and redesigning algorithms for social network analysis for shared-memory systems (Awarded Best Talk)
Analysis of social networks is challenging due to the rapid changes in their members and relationships. In many cases it is impractical to recompute the analytic of interest from scratch; therefore, streaming algorithms are used to reduce the total runtime following a graph update. Analytics of interest include connected components, clustering coefficients, modularity, and different centrality metrics. The output of these analytics can serve as input to applications with different purposes, including community detection, anomaly detection, and finding key players in a social network. In this talk, I will present some recent algorithms and modifications to existing algorithms for social network analysis, with an emphasis on high performance computing. I will focus on betweenness centrality and will provide performance analysis.
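For readers unfamiliar with betweenness centrality, the Python sketch below computes it for an unweighted, undirected graph using Brandes' algorithm. This is the standard static computation, i.e. the full recomputation that streaming approaches aim to avoid after each update; it is not the streaming or shared-memory algorithm presented in the talk. The adjacency-dictionary input format is an assumption made for this example.

```python
from collections import deque


def betweenness_centrality(adj):
    """Static betweenness centrality (Brandes' algorithm) for an unweighted,
    undirected graph given as {node: iterable of neighbors}."""
    bc = {v: 0.0 for v in adj}
    for s in adj:
        # Phase 1: BFS from s, counting shortest paths (sigma) and predecessors.
        order = []
        preds = {v: [] for v in adj}
        sigma = {v: 0 for v in adj}
        dist = {v: -1 for v in adj}
        sigma[s] = 1
        dist[s] = 0
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if dist[w] < 0:  # first time w is reached
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:  # v lies on a shortest path to w
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Phase 2: accumulate dependencies in order of decreasing distance from s.
        delta = {v: 0.0 for v in adj}
        while order:
            w = order.pop()
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1.0 + delta[w])
            if w != s:
                bc[w] += delta[w]
    # Each unordered pair was counted from both endpoints, so halve the totals.
    return {v: c / 2.0 for v, c in bc.items()}


if __name__ == "__main__":
    path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
    print(betweenness_centrality(path))
```

Each source requires a full traversal of the graph, so the static computation costs on the order of |V| * |E| for unweighted graphs, which is why recomputing it after every edge insertion quickly becomes impractical.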
12:45pm - 1:15pm
IBM talk (Session chair: Nilton Bila)
Marcio Silva: Recent Work on CloudBench.