Frontiers of Cloud Computing and Big Data Workshop 2014 - Program


8:15am Gathering with breakfast (Mezzanine)
8:45am Welcome (Salman)  
Session 1 (9-10:30am) Salman + Kaoutar (chair)  
9:00-9:30am IBM Talk 1: (Malgorzata Steinder - Research Challenges in Cloud Computing)  
9:30-9:45am Swapna Buccapatnam, Ohio State, PMA
Stochastic Bandits with Side Observations on Networks

 
9:45-10am Wolfgang Richter, CMU 
Agentless Cloud-wide Monitoring of Virtual Disk State
 
10:00-10:15am Muhammad Naveed, UIUC 
Controlled Functional Encryption
 
10:15-10:30am Puneet Jain, Duke University
Practical Mobile Augmented Reality
 
     
Break (10:30-10:45am) Mezzanine  
     
Session 2 (10:45am-12:15pm) Ramya + Xin (chair)  
10:45-11:15am ThinkLab tour (Sambit Sahu)  
11:15-11:30am Sara Alspaugh, UC Berkeley
Mining user interaction data to provide user assistance
 
11:30-11:45am Ankita Kejriwal, Stanford University
SLIK: Scalable Low-Latency Indexes for a Key-Value Store
 
11:45-12:00pm Ali Munir, Michigan State University
Friends, not Foes – Synthesizing Existing Transport Strategies for Data Center Networks
 
12:00-12:15pm Venkatanathan Varadarajan, U Wisconsin
How can VMM schedulers improve security in public clouds?
 
     
Lunch 12:15-1:00pm Cafeteria Annex  
Session 3 (1:00-2:20pm) (poster session) Cafeteria Annex + Tea/Coffee (setup at 2:00pm) - Ramya  
Session 4 (2:30-4:15pm) Oktie + Roman (chair)  
2:30-2:45pm Martin Jergler, Munich, Services Computing
Data-centric Workflow Management in the Cloud
 
2:45-3:00pm Pavan Kapanipathi, Wright State University
Harnessing Volume and Velocity Challenge on the Social Web using Crowd-Sourced Knowledge Bases
 
3:00-3:15pm Robert Escriva, Cornell University
HyperDex: A Consistent, Fault-tolerant, Transaction Key-Value Store
 
3:15-3:30pm Jiang Du, U Toronto
DeepSea: Revisiting materialized views in scalable data analytics
 
3:30-3:45pm Shumo Chu, U Washington
Evaluating Multiway Joins in a Parallel Database System
 
3:45-4:15pm IBM Talk 2: Watson Platform Next (Speaker: Marc-Thomas H. Schmidt, Distinguished Engineer, Chief Architect, Watson Platform Next)  
4:15-4:45pm IBM Talk 3: (Dimitrios Pendarakis - Cloud Security)

 
4:45-5:15pm Blue Gene Lab tour (Jose Moreira)  
     

 

Abstracts


Ankita Kejriwal (Stanford University)

Title: SLIK: Scalable Low-Latency Indexes for a Key-Value Store

Abstract:

Many large-scale storage systems sacrifice features and/or consistency in favor of scalability or performance. In contrast, I will talk about SLIK, which adds secondary indexes to an existing high-performance key-value store (RAMCloud) without sacrificing either latency or scalability. By keeping index data in DRAM, SLIK performs indexed reads in 10 μs and writes in 24 μs. At the same time, SLIK supports indexes spanning hundreds or thousands of nodes. It allows internal inconsistencies in the implementation of indexes, which makes the implementation more efficient, but provides strong consistency externally to clients. It uses the existing RAMCloud tablet mechanism to store the nodes of index B-trees, which allows SLIK to reuse existing mechanisms for durability and crash recovery.

Bio:

Ankita Kejriwal is a 5th-year PhD candidate in the Computer Science Department at Stanford University. She works with John Ousterhout and the lab on RAMCloud. Her main research interests include Data Center Storage Systems and Distributed Systems. She did an internship at Microsoft Research - Silicon Valley (MSR-SVC) in Summer 2013 on Distributed Computing and enjoyed it thoroughly. Before Stanford, Ankita did her B.E. (Hons) in Computer Science at Birla Institute of Technology and Science (BITS) - Pilani, Goa and interned at Bhabha Atomic Research Center (BARC), Indian Institute of Science (IISc) and Yahoo!. Webpage: http://web.stanford.edu/~ankitak/.

 

Ali Munir (Michigan State University)

Title:  Friends, not Foes – Synthesizing Existing Transport Strategies for Data Center Networks

Abstract: Many data center transports have been proposed in recent times (e.g., DCTCP, PDQ, and pFabric). Contrary to the common perception that they are competitors (i.e., protocol A vs. protocol B), we claim that the underlying strategies used in these protocols are, in fact, complementary. Based on this insight, we design PASE, a transport framework that synthesizes existing transport strategies, namely self-adjusting endpoints (used in TCP-style protocols), in-network prioritization (used in pFabric), and arbitration (used in PDQ). PASE is deployment friendly: it does not require any changes to the network fabric, yet its performance is comparable to, or better than, state-of-the-art protocols that require changes to network elements (e.g., pFabric). We evaluate PASE using simulations and testbed experiments. Our results show that PASE performs well for a wide range of application workloads and network settings.

Bio: Ali Munir received the BSc degree in Electronics Engineering and the Master's degree in Electrical Engineering from the National University of Sciences & Technology (NUST), Pakistan, in 2009 and 2012, respectively. He has been a PhD student at Michigan State University, USA, since January 2013. In summer 2013, he was a Research Intern at Microsoft Research Cambridge, UK. In summer 2014, he was a Research Intern at AT&T Labs, Bedminster, New Jersey, USA. His current research interests include computer networking and distributed systems and span software-defined networks, future Internet design, algorithm design, wireless systems, and performance modeling of networked systems.
 
 
Wolfgang Richter (CMU)

Title: Agentless Cloud-wide Monitoring of Virtual Disk State

Abstract: What if the cloud infrastructure assisted in monitoring the virtual servers it maintains?  It already monitors them at a coarse-grained level for metering their consumed resources.  From a research perspective, we have a huge opportunity to design clean-slate monitoring interfaces for virtualized clouds, which handle up to 70% of all x86 applications as of 2014.  We propose implementing deep monitoring interfaces directly into the cloud infrastructure that are scalable and do not intrude into monitored virtual servers.  These interfaces are designed for cloud scale and to perform their duties agentlessly: not a single instruction is executed within a cloud customer environment.

Bio: Wolfgang Richter is a 5th-year PhD student in Computer Science at Carnegie Mellon University's School of Computer Science.  He has won research and teaching awards at both Carnegie Mellon and the University of Virginia where he received his BS in Computer Science with Highest Honors.  During his tenure as a graduate student at Carnegie Mellon he has been supported by an NSF Fellowship and an IBM Research Fellowship.  His research interests lie in distributed systems and cloud computing.  He has collaborated predominantly with researchers at Carnegie Mellon, IBM Research, Intel Research, and the Georgia Institute of Technology.  He blogs professionally for the ACM and his articles are often syndicated in the ACM XRDS magazine.  His thesis work was proposed in February 2014, and a publication related to this work received the best paper award at IC2E'14.  He plans on defending his thesis and graduating in December 2014.

 

Sara Alspaugh (UC Berkeley)

Title: Mining user interaction data to provide user assistance 

Abstract: In this era of big data, one type of data that has not yet realized its full potential is records of user interactions with software systems. Complex software systems create execution logs primarily for debugging. Logs have not been used for understanding how users have interacted with the system. By making an effort to collect high-volume, high-quality user interaction data, we can enable applications like interface optimization, recommendations, interaction analysis, user assistance, and so on, much as has been done in the web domain over the past two decades. In this talk we will discuss our work on collecting and mining records of log analysis activity from Splunk, a major platform for data analytics. The results of our analysis have been used to make improvements to the Splunk product and interface. We are currently working on harvesting exploratory visual analytic activity in Tableau from trained volunteers as part of an ongoing research seminar. We will use this activity data to construct exploratory analysis guidelines and train an intelligent assistance application that can help beginner users more quickly solve a given task or extract value from a given data set. We will conclude this short talk with a discussion of how such approaches can be applied in the domain of large-scale, multi-component distributed system debugging, given appropriate activity data from such domains.

Bio: Sara Alspaugh is a PhD candidate in the EECS Department at UC Berkeley. She works in the AMPLab. Her advisors are Randy Katz and Marti Hearst. She received her Master’s in computer science in 2012 from UC Berkeley and her BA in computer science in 2009 from the University of Virginia. Her research interests include data mining, visualization, systems, and user interaction with data analysis tools. In particular, she is interested in mining user interaction records from data analysis and visualization tools to improve system and interface design.

 

Muhammad Naveed (UIUC)

Title: Controlled Functional Encryption

Abstract: Motivated by privacy and usability requirements in various scenarios where existing cryptographic tools (like secure multi-party computation and functional encryption) are not adequate, we introduce a new cryptographic tool called Controlled Functional Encryption (C-FE). As in functional encryption, C-FE allows a user (client) to learn only certain functions of encrypted data, using keys obtained from an authority. However, we allow (and require) the client to send a fresh key request to the authority every time it wants to evaluate a function on a ciphertext. We obtain efficient solutions by carefully combining CCA2 secure public-key encryption (or rerandomizable RCCA secure public-key encryption, depending on the nature of security desired) with Yao’s garbled circuit. Our main contributions in this work include developing and formally defining the notion of C-FE; designing theoretical and practical constructions of C-FE schemes achieving these definitions for specific and general classes of functions; and evaluating the performance of our constructions on various application scenarios.

 
Bio: Muhammad Naveed is a fourth year PhD student in computer science at the University of Illinois at Urbana-Champaign. He is interested in cryptography, security, and privacy. He is working on practically efficient cryptographic primitives. His current research focuses on secure cloud storage, secure multiparty computation, functional encryption, smartphone security, and genomics privacy. He has done internships at SRI International, EPFL (Switzerland), and University of Virginia. Currently, he is interning at Microsoft Research.  For more details about him, visit www.cryptoonline.com.

 

Robert Escriva (Cornell University)

Title: HyperDex: A Consistent, Fault-tolerant, Transaction Key-Value Store

Abstract: Distributed key-value stores are now a standard component of high-performance web services and cloud computing applications. While key-value stores offer significant performance and scalability advantages, the first wave of NoSQL stores typically compromises on consistency and limits the system API to key-based operations.

This talk will present HyperDex, a novel, distributed key-value store developed by my group that provides (1) strong consistency guarantees, (2) fault-tolerance for failures and partitions affecting up to f nodes, and (3) a rich API which includes ACID transactions and a unique search primitive that enables queries on secondary attributes. HyperDex achieves these properties through the combination of three recent technical advances called hyperspace hashing, value-dependent chaining, and linear transactions.  Performance measurements from the industry-standard YCSB benchmark show that these properties do not incur a high overhead: HyperDex is actually a factor of 2-13 faster than Cassandra and MongoDB.

Overall, HyperDex offers a rich API and combines strong consistency with fault-tolerance guarantees. We'll discuss how these techniques work behind the scenes, do a brief tour of the HyperDex API, and outline how these properties relate to the oft-repeated CAP credo.

Bio:

Robert Escriva is a PhD candidate in Computer Science at Cornell University working with Prof. Emin Gun Sirer. He is interested in distributed systems to support large-scale applications. His current project is HyperDex, a new NoSQL system.

 

Venkatanathan Varadarajan (Wisconsin)

Title: How can VMM schedulers improve security in public clouds?

Abstract: Infrastructure as a Service (IaaS) solutions often pack multiple customer virtual machines (VMs) onto the same physical server and multiplex resources for greater efficiency and low service cost. Unfortunately, the systems managing these resources are often directly adopted from systems tailored for private datacenters and stand-alone operating systems. Such systems are fraught with security threats when they are used in the publicly accessible and open environment of public clouds, where security is as important as performance and efficiency.

Recent work has shown how to mount cross-VM side-channel attacks to steal cryptographic secrets. Such attacks rely on vulnerabilities in the hypervisor's CPU scheduler. In this talk, I will first present how a simple change to the CPU scheduler can defend against a class of side-channel attacks with almost no performance overhead. I will also touch on other related work done at Wisconsin, such as Resource-Freeing Attacks and VM placement gaming, that demonstrates similar vulnerabilities in resource schedulers and motivates the need for security research on resource schedulers.

Bio:

Venkatanathan Varadarajan is a research assistant and a fifth-year Ph.D. student at the University of Wisconsin-Madison working under the excellent guidance of Prof. Thomas Ristenpart and Prof. Michael Swift. He has worked on various layers of interactions between hardware and system software. His dissertation involves studying various security vulnerabilities in resource schedulers that are uniquely enabled through the multi-tenancy and openness of public clouds. He is a member of the security team in the Wisconsin Institute on Software-defined Datacenters Of Madison (WISDOM). In Summer 2013, he interned at VMware R&D, Palo Alto, where he worked on automating horizontal scaling of the virtual machines of a multi-tier application. He completed his Bachelor's at the College of Engineering Guindy, Anna University, Chennai in June 2010 and flew directly to the University of Wisconsin-Madison, where he received his Master's degree in December 2012. 

 

Martin Jergler (Technical University of Munich)

Title: Data-centric Workflow Management in the Cloud

Abstract
In recent years, data-centric workflows have become increasingly popular. The unification of activity flow and associated data enables operational management and concurrent analytics of business processes (BPs). Moreover, it significantly simplifies adapting BPs to meet changing policies or other constraints. A prominent approach to managing data-centric workflows is Business Artifacts with Guard-Stage-Milestone lifecycles (GSM). In GSM, a workflow is described by a data model and a lifecycle model. The data model represents application data and the workflow status. The lifecycle model is defined by a set of ECA-like rules that describe how the status evolves over time and when individual tasks are executed.
As BPs nowadays span administrative and geographic boundaries, manage many possibly long-running instances, consider vast amounts of data, and coordinate participants from all over the world, centralized workflow systems no longer scale. Varying load characteristics, e.g., due to user behavior or the nature of the workflow itself, impose additional challenges. To this end, we present a distributed architecture for data-centric workflows that is based on a publish/subscribe messaging infrastructure. A workflow is mapped into a set of interacting workflow components (IWCs). IWCs, for instance, represent gateways for attribute updates, evaluate single ECA-like rules, and manage the interaction with the environment. In this sense, each IWC (1) subscribes to updates on a subset of the data model, (2) evaluates a policy over the data, and (3) potentially updates other attributes by publishing to the infrastructure. The implicit order in which IWCs communicate ensures the same execution semantics as a centralized execution. In our architecture, IWCs are the unit of distribution and inherit all properties of publish/subscribe systems.
This allows for very flexible deployment and incremental scalability on cloud infrastructures, as IWCs can be easily migrated from one VM to another. IWCs not only distribute the execution semantics, but also partition the global data model into independent subsets. This makes it easier for data management to comply with geographical constraints.
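The subscribe, evaluate, publish cycle described for IWCs can be illustrated with a small in-memory sketch. It is not the architecture presented in the talk: the `Broker` and `RuleComponent` classes, the attribute names, and the single-process broker are all assumptions made for illustration, and a real deployment would use a distributed publish/subscribe middleware.

```python
from collections import defaultdict


class Broker:
    """In-memory stand-in for a publish/subscribe infrastructure (hypothetical)."""

    def __init__(self):
        self.data = {}                        # current values of data-model attributes
        self.subscribers = defaultdict(list)  # attribute name -> list of callbacks

    def subscribe(self, attribute, callback):
        self.subscribers[attribute].append(callback)

    def publish(self, attribute, value):
        if self.data.get(attribute) == value:
            return  # ignore no-op updates so that rule chains terminate
        self.data[attribute] = value
        for callback in list(self.subscribers[attribute]):
            callback(self)


class RuleComponent:
    """One workflow component: watches attributes, evaluates one rule, publishes a result."""

    def __init__(self, broker, inputs, condition, target, value):
        self.inputs, self.condition = inputs, condition
        self.target, self.value = target, value
        for attribute in inputs:
            broker.subscribe(attribute, self.on_update)  # step (1): subscribe

    def on_update(self, broker):
        view = {name: broker.data.get(name) for name in self.inputs}
        if self.condition(view):                         # step (2): evaluate the rule
            broker.publish(self.target, self.value)      # step (3): publish an update


# Example workflow with two chained rules (attribute names are made up).
broker = Broker()
RuleComponent(broker, ["order_received", "payment_ok"],
              lambda v: v["order_received"] and v["payment_ok"],
              "shipping_open", True)
RuleComponent(broker, ["shipping_open"],
              lambda v: v["shipping_open"],
              "milestone_ready", True)
broker.publish("order_received", True)
broker.publish("payment_ok", True)
```

Because each component touches only the attributes it subscribes to and the one it publishes, components like these could in principle be placed on different machines without changing the rule logic.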

Biography
Martin Jergler is a doctoral candidate working with Professor Hans-Arno Jacobsen in Computer Science at the Technical University of Munich, Germany. He has been a member of the chair for Application & Middleware Systems since 2012. Martin received his BSc (2009) in Internet Computing and MSc (2012) in Computer Science from the Department of Informatics and Mathematics at the University of Passau, Germany. In his undergraduate studies he focused on multimedia technologies and applications and received the :a:k:t: student scholarship in 2010. Martin’s current research interests revolve around distributed data management and applications. These include publish/subscribe middleware, service-oriented architectures, data-centric workflows, and case management.

 

Puneet Jain (Duke University)

Title:  Practical Mobile Augmented Reality

Abstract:

Mobile Augmented Reality has remained a fascinating concept in the research community since the early adoption days of smartphones. Despite several attempts by app designers, we see no real-world application on the leading app hosting platforms such as Google Play and the Apple App Store. Realizing mobile AR’s true potential and developing new applications require immediately addressing many open challenges. In this talk, we shall try to address one key question: "What does it take to enable real-time mobile augmented reality on current-generation smartphones?" Specifically, we will demonstrate how careful system design rooted in sensing, vision, cloud offloading, and semi-supervised learning leads to sufficient precision for useful mobile applications and tolerable latency for an end user.

Bio:

Puneet Jain is a Ph.D. candidate in computer science at Duke University. He is a part of the Systems Networking Research Group at UIUC led by Professor Romit Roy Choudhury. Jain holds an M.S. in CS (2013) and an M.Tech/B.Tech in CSE (2009) from Duke and IIT Kharagpur, respectively. Jain’s current research interests include real-time computer vision, mobile sensing, and big data for social analytics. More about him can be found at: http://www.cs.duke.edu/~puneet

 

Jiang Du (University of Toronto)

Title:  DeepSea: Revisiting materialized views in scalable data analytics

Abstract
Materialized views are a powerful abstraction for improving query performance in DBMSs and have been proposed for use in shared-nothing distributed systems. We revisit some important decisions that need to be made when using materialized views for big data analysis systems. Specifically, we consider how to efficiently create new views from intermediate query results, how to determine when to use a view in query processing, and how to determine, based on the query workload, what views to save (and what views to discard). Using DeepSea, an extension of Hive, we demonstrate how to exploit the fault-tolerance mechanisms of big data analysis systems for creating new views at minimal cost that we predict will be beneficial in the future. 
 
Bio
Jiang Du conducts research in the field of databases. His current research interests include the physical design of database systems, specifically non-relational systems (NoSQL), online transaction processing systems (NewSQL), data streaming, and large-scale data analytics. He received his BSc and MSc degrees in Computer Science from the University of Toronto.

 

Swapna Buccapatnam (Ohio State University)

Title: Stochastic Bandits with Side Observations on Networks

Abstract:

I will present my work on stochastic multi-armed bandit problems with side-observations on networks. In our model, choosing an action provides additional side observations for its neighboring actions in the network. One example of this occurs in the problem of targeting users in online social networks where users respond to their friends’ activity. These side observations can be leveraged to improve scalability of bandit policies in the presence of a large set of actions.
Our contributions are as follows: 1) We derive an asymptotic lower bound (as a function of the network structure) on the regret (loss) of any uniformly good policy that achieves the maximum long term average reward. 2) We propose two policies, both of which explore each action at a rate that is a function of its network position. We further show that these policies are optimal, i.e. they achieve the asymptotic lower bound on the regret up to a multiplicative factor independent of the network structure. Finally, we use numerical examples on a real-world social network to demonstrate the significant benefits obtained by our policies against other existing policies.
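The side-observation model can be made concrete with a short sketch. The abstract does not specify the two proposed policies, so the code below swaps in a plain UCB1 index and only changes the update step so that a pull also records rewards for the pulled arm's neighbors. The class name, the `neighbors` input format, and the reward dictionary are assumptions for illustration.

```python
import math


class SideObservationUCB:
    """UCB1-style index policy in which pulling an arm also reveals neighbors' rewards.

    Illustrative baseline only; these are not the policies proposed in the talk.
    """

    def __init__(self, neighbors):
        # neighbors: dict mapping each arm to the set of arms it reveals (assumed format)
        self.neighbors = {arm: set(adj) for arm, adj in neighbors.items()}
        self.counts = {arm: 0 for arm in neighbors}    # observations per arm
        self.means = {arm: 0.0 for arm in neighbors}   # running mean reward per arm
        self.t = 0                                     # number of pulls so far

    def select(self):
        # First play any arm that has never been observed, directly or as a neighbor.
        for arm, n in self.counts.items():
            if n == 0:
                return arm
        # Otherwise choose the arm with the largest upper confidence bound.
        return max(
            self.counts,
            key=lambda a: self.means[a]
            + math.sqrt(2 * math.log(self.t) / self.counts[a]),
        )

    def update(self, arm, rewards):
        # rewards: observed reward for the pulled arm and for each of its neighbors.
        self.t += 1
        for a in {arm} | self.neighbors[arm]:
            self.counts[a] += 1
            self.means[a] += (rewards[a] - self.means[a]) / self.counts[a]
```

On a star-shaped network, one pull of the center arm yields an observation for every arm, which shows why the network structure can reduce the amount of exploration needed.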

Bio:

Swapna Buccapatnam is a Ph.D. candidate in the Department of Electrical and Computer Engineering at the Ohio State University. She received her B.Tech and M.Tech. degrees in Electrical Engineering from the Indian Institute of Technology, Madras, India, in 2008. Her research interests lie in the analysis and control of complex networks such as wireless networks, cloud computing systems, online social networks, and crowd-sourcing markets.
Webpage: http://www.ece.osu.edu/~buccapat/

 

Shumo Chu (University of Washington)

Title: Evaluating Multiway Joins in a Parallel Database System
Abstract:
Join evaluation is the cornerstone of query processing in database systems. In past decades of database research, work on join evaluation has mostly focused on choosing among join operators (hash join, merge join, etc.) and on choosing optimal join orders and the layout of join trees. If joins need to be evaluated in a distributed database system, each binary join along the join tree requires a hash partition or range partition on the joined attributes before the join. Recently, there have been advances in both data partitioning algorithms and a new paradigm of join algorithms. Afrati et al. proposed the HyperCube shuffle algorithm, which uses a one-step replicated shuffle to send data to destination workers. Ngo et al. proposed new join algorithms with non-trivial optimality guarantees, which use a multiway join operator to join more than two relations at one time. In this paper, we study the following three problems: 1. What is the ideal approach to join evaluation, considering different shuffle strategies and join algorithms? 2. What is a practical server allocation algorithm for the HyperCube replicated shuffle? 3. How can the multiway join operator be optimized locally?
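The one-step replicated shuffle can be illustrated on the triangle query R(a, b), S(b, c), T(c, a). The sketch below follows the general HyperCube idea only and does not reproduce the server allocation algorithm studied in this work; the function name, the fixed three-relation schema, and the `shares` dictionary are assumptions for illustration.

```python
from itertools import product


def hypercube_destinations(relation, tup, shares, hash_fn=hash):
    """Return the grid coordinates of the workers that must receive `tup`.

    Workers form a grid with one dimension per join variable (a, b, c);
    shares[v] is the number of grid cells along variable v's dimension.
    A tuple is hashed on the variables it contains and replicated along
    every dimension whose variable it lacks.
    """
    variables = ("a", "b", "c")
    # Hypothetical schema for the triangle query R(a, b), S(b, c), T(c, a).
    schema = {"R": ("a", "b"), "S": ("b", "c"), "T": ("c", "a")}[relation]
    bound = dict(zip(schema, tup))
    axes = []
    for v in variables:
        if v in bound:
            # Variable present: the tuple goes to exactly one cell on this axis.
            axes.append([hash_fn(bound[v]) % shares[v]])
        else:
            # Variable absent: replicate the tuple across the whole axis.
            axes.append(range(shares[v]))
    return set(product(*axes))
```

With `shares = {"a": 2, "b": 2, "c": 2}`, each tuple is sent to two of the eight workers, and any three tuples that form a triangle meet at exactly one worker, so each worker can run a local multiway join without further communication.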

Bio:
Shumo Chu is a PhD student in Computer Science and Engineering at the University of Washington, working with Prof. Dan Suciu and Prof. Magdalena Balazinska.  Before UW, he was an intern in the Data Management Group of Microsoft Research Asia and worked with Prof. James Cheng at NTU, Singapore (now at CUHK). He obtained his bachelor's degree from Wuhan University, China.  He is interested in various aspects of big data management, such as system building, theoretical foundations, and domain-specific applications.

 

Pavan Kapanipathi (Wright State University)
Title: Harnessing Volume and Velocity Challenge on the Social Web using Crowd-Sourced Knowledge Bases

Abstract:
Due to the increased adoption of the social web, users, specifically Twitter users, are facing information overload. Unless a user is willing to restrict the sources (e.g., the number of accounts followed), important information relevant to the user's interests often goes unnoticed. The reasons include: (1) the postings may appear at a time when the user is not looking; (2) the user is unaware of, and hence not following, the information source; and (3) the information arrives at a rate faster than the user can consume. Furthermore, information that is temporally relevant might be of no use if discovered late.

My research addresses these challenges by:
(1) Generating user profiles of interests from Twitter using Wikipedia. The interests gleaned from users' Twitter data can be leveraged by personalization and recommendation systems in order to reduce information overload (volume) for users.
(2) Filtering Twitter data relevant to dynamically evolving entities. In addition to volume, this addresses the velocity challenge in delivering relevant information in real time. The approach is deployed on Twitris to crawl for dynamic event-relevant tweets for analysis. The prominent aspect of both approaches is the use of a crowd-sourced knowledge base such as Wikipedia. 

Bio
Pavan Kapanipathi is a PhD student at the Knoesis Center (Ohio Center of Excellence in Knowledge Enabled Computing) under the guidance of Dr. Amit Sheth. He is a web enthusiast and enjoys working with big data, especially in the Social Web and Semantic Web areas. His work focuses on addressing information overload problems on the Social Web by leveraging Wikipedia.