The 6th Student Workshop on Cloud and Data Services - Program
Thursday, December 3, 2015
08:30 : Registration Lobby
09:00 : Welcome Remarks - Seetharami Seelam, IBM Research Room 12-113
09:15 : Keynote -"The Power of Cloud to fuel Innovation," Tamar Eilam, IBM Fellow Room 12-113
10:00 : Coffee Break Room 12-113
Session-1: Resource Management and Performance (Room 12-113)
10:30 : Using Efficient Redundancy to Reduce Latency in Cloud Systems. Gauri Joshi, Massachusetts Institute of Technology
11:00 : Towards Resource QoS in Container Clouds. Stephen Herbein, University of Delaware
11:30 - 12:30 : Lunch
12:30 – 14:00 : Poster Session. Cafeteria Annex.
Session – 2 : Big Data Part-I (Room 12-113)
14:00 : Optimizing Grouped Aggregation in Geo-Distributed Streaming Analytics. Benjamin Heintz, University of Minnesota, Twin Cities
14:30 : Principles of Dataset Versioning: Exploring the Recreation/Storage Tradeoff. Souvik Bhattacharjee, University of Maryland, College Park
15:00 : SocialTrove: A Cloud-Backed Summarization Infrastructure for Social Sensing. Tanvir Amin, University of Illinois, Urbana Champaign
15:30 : Coffee Break
Session – 3: Storage and Caching (Room 12-113)
16:00 : NetKV: Scalable, Self-Managing, Load Balancing as a Network Function. Wei Zhang, George Washington University
16:30 : Be Fast, Cheap, and in Control with SwitchKV. Xiaozhou Li, Princeton University
17:00 : Flash Caching for Cloud Computing Systems. Dulcardo Arteaga, Florida International University
Friday, December 4, 2015
Session – 4 : Big Data Part-II (ThinkLab CR2)
09:00 : Network Scheduling Aware Task Placement in Cloud. Ali Munir, Michigan State University
09:30 : Interruptible Tasks: Treating Memory Pressure as Interrupts for Highly Scalable Data-Parallel Programs. Lu Fang, University of California, Irvine
10:00 : Scalable Reasoning and Querying over Semantic Data. Raghava Mutharaju, Wright State University
10:30 : Coffee Break
Session – 5 : Security and Forensics (ThinkLab CR2)
11:00 : Automatically Recovering Human-Understandable Evidence from Memory Images. Brendan Saltaformaggio, Purdue University
11:30 : Improving Network Security with a Secure SDN Architecture. Jill Jermyn, Columbia University
12:00 : Mining Temporal Lag from Fluctuating Events for Correlation and Root Cause Analysis. Chunqiu Zheng, Florida International University
12:30 : Lunch -- Cafeteria Main Dining Area
13:30 : Research Highlights and Think Lab Tour
14:30 : Workshop Ends
Title: Using Efficient Redundancy to Reduce Latency in Cloud Systems
Abstract: In cloud computing systems, assigning a job to multiple servers and waiting for the earliest copy to finish is an effective method to combat the variability in response time of individual servers. Although adding redundant replicas always reduces service time, the total computing time spent per job may be higher, thus increasing waiting time in queue. In this talk I show how the the log-concavity of the service time distribution is a key factor determining whether adding redundancy reduces latency and cost. Using insights from this analysis I develop cost-efficient task scheduling strategies to minimize latency.
Title: Towards Resource QoS in Container Clouds
Abstract: Innovations in Operating System (OS) level virtualization technologies like resource control groups, isolated namespaces, layered file systems propelled a new bread of virtualization solutions called Containers. Applications running in containers depend on the host operating system for resource allocation, throttling, and qualities of service (QoS). However, OS can provide only best effort/fair-share resource allocation. Lack of QoS enforcement, like in VMMs, constrains the use of containers and container cloud to a subset of the workloads. In this paper, we describe issues with fair-share QoS enforcement of CPU, Network bandwidth, and I/O bandwidth and present solutions to allocate, throttle and enforce QoS for each of these three critical resources. These techniques enable container cloud providers to host applications with different QoS requirements and enforce them effectively so a large collection of applications can benefit from flexibility, portability and agile DevOps characteristics of containers.
Title: Optimizing Grouped Aggregation in Geo-Distributed Streaming Analytics
Abstract: Large quantities of data are generated continuously over time and from disparate sources such as users, devices, and sensors located around the globe. This results in the need for efficient geo-distributed streaming analytics to extract timely information. A typical analytics service in these settings uses a simple hub-and-spoke model, comprising a single central data warehouse and multiple edges connected by a wide-area network (WAN). A key decision for a geo-distributed streaming service is how much of the computation should be performed at the edge versus the center. In this paper, we examine this question in the context of windowed grouped aggregation, an important and widely used primitive in streaming queries. Our work is focused on designing aggregation algorithms to optimize two key metrics of any geo-distributed streaming analytics service: WAN traffic and staleness (the delay in getting the result). Towards this end, we present a family of optimal offline algorithms that jointly minimize both staleness and traffic. Using this as a foundation, we develop practical online aggregation algorithms based on the observation that grouped aggregation can be modeled as a caching problem where the cache size varies over time. This key insight allows us to exploit well known caching techniques in our design of online aggregation algorithms. We demonstrate the practicality of these algorithms through an implementation in Apache Storm, deployed on the PlanetLab testbed. The results of our experiments, driven by workloads derived from anonymized traces of a popular web analytics service offered by a large commercial CDN, show that our online aggregation algorithms perform close to the optimal algorithms for a variety of system configurations, stream arrival rates, and query types.
Title: Principles of Dataset Versioning: Exploring the Recreation/Storage Tradeoff
Abstract: The desire to derive valuable insights from huge and diverse datasets produced from the Internet, smart phones and wireless sensors has given rise to collaboratory efforts between "data science" teams spanning across multiple organizations. This in turn has led to a proliferation of many thousands or millions of versions of the same datasets in many scientific and commercial domains. In such scenarios, it is essential to keep track of the datasets, the derived products, and be able to recreate them quickly on demand; and simultaneously minimize the storage costs by eliminating redundancy. The fundamental challenge here is the storage−recreation trade−off: the more storage we use, the faster it is to recreate or retrieve versions, while the less storage we use, the slower it is to recreate or retrieve versions. Despite the fundamental nature of this problem, there has been a surprisingly little amount of work on it. In this talk, I present our work on studying this trade-off in a principled manner and propose a suite of inexpensive heuristics drawing from techniques in delay-constrained scheduling, and spanning tree literature, to solve these problems. We have built a prototype version management system, that aims to serve as a foundation to our DataHub system for facilitating collaborative data science. We demonstrate,via extensive experiments, that our proposed heuristics provide efficient solutions in practical dataset versioning scenarios.
Title: SocialTrove: A Cloud-Backed Summarization Infrastructure for Social Sensing
Abstract:The increasing availability of smartphones, cameras, and wearables with instant data sharing capabilities, and the exploitation of social networks for information broadcast, heralds a future of real-time information overload. With the growing excess of worldwide streaming data through the social channel, an increasingly common problem is one of data summarization. The objective is to obtain a representative sampling of large data streams at a configurable granularity, in real-time, for subsequent consumption by a range of data-centric applications. This is an expensive task in terms of both computational capacity and implementation complexity. Global knowledge of the events and trends is necessary which is not readily available. In this talk, we argue the case for a summarization infrastructure as an indexing and serving layer for the applications. We show how Spark, HDFS, and Memcached has been used to build SocialTrove, the backend for UIUC's Apollo Real-time Social Sensing Stack. SocialTrove summarizes data streams from human sources, or sensors in their possession, by hierarchically clustering received information in accordance with an application-specific distance metric. It then serves a sampling of produced clusters at a configurable granularity in response to application queries. While SocialTrove is a general service, we illustrate its functionality and evaluate it in the specific context of workloads collected from Twitter. Results show that SocialTrove supports a high query throughput, while maintaining a low access latency to the produced real-time application-specific data summaries. We showcase various social sensing applications that have been implemented on top of SocialTrove.
Title: NetKV: Scalable, Self-Managing, Load Balancing as a Network Function
Abstract: Distributed key-value systems (e.g., memcached) are critical tools to cache popular content in memory, which avoids complex and expensive database queries and file system accesses. To efficiently use cache resources, balancing the load across a cluster of cache servers is important. Current approaches place a proxy at each client that can redirect requests across the cluster, but this requires modification to each client and makes dynamic replication of keys difficult. While a centralized proxy can be used, this traditionally has not been scalable.
This talk describes NetKV, a scalable, self-managing, load balancer for memcached clusters. NetKV exploits recent advances in Network Function Virtualization to provide efficient packet processing in software, producing a high performance, centralized proxy that can forward over 10.5 million requests per second, a hundred fold increase over Twitter’s twemproxy. NetKV efficiently and accurately detects hot keys using stream-analytic techniques, then replicates them to meet the allowed load imbalance bound set by administrators. NetKV uses “balls and bins” load analysis to adaptively determine the replication factor and set of hot keys. This project is one part of my research exploring how NFV and SDNs can be combined to build more flexible and intelligent software-based networks.
Title: Be Fast, Cheap, and in Control with SwitchKV
Abstract: SwitchKV is a new key-value store system design that combines high-performance cache nodes with resource-constrained backend nodes to provide load balancing in the face of unpredictable workload skew. The cache nodes absorb the hottest queries so that no individual backend node is over-burdened. Compared with previous designs, SwitchKV exploits SDN techniques and deeply optimized switch hardware to enable efficient content-based routing. Programmable network switches keep track of cached keys and route requests to the appropriate nodes at line speed, based on keys encoded in packet headers. A new hybrid caching strategy keeps cache and switch forwarding rules updated with low overhead and ensures that system load is always well-balanced under rapidly changing workloads. Our evaluation results demonstrate that SwitchKV can achieve up to 5x throughput and 3x latency improvements over traditional system designs.
Title: Flash Caching for Cloud Computing Systems
Abstract: As the size of cloud systems and the number of hosted virtual machines (VM) rapidly grow, the scalability of shared VM storage systems becomes a serious issue. Client-side flash-based caching has the potential to improve the performance of cloud VM storage by employing flash storage available on the client-side of the storage system to exploit the locality inherent in VM IOs.
However, because of the limited capacity and durability of flash storage, it introduces multiple challenges to the effective use of flash caching in cloud systems. First, because the cache configurations such as size, write policy, meta-data persistency and RAID level, have tremendous impact on the cost, performance and reliability of flash caching, it is critical to determine the proper configurations of flash caches. Second, because flash cache space is limited compared to the dataset of co-hosted VMs that share a single cache device, it is important to allocate the shared cache among competing VMs according to their demands.
In this talk, I will present my research to address these challenges. First, I will introduce my results from a thorough cache configuration study using a large amount of long-term traces collected from real-world public and private clouds. Second, I will present my on-demand flash cache management solution which allocates shared cache space efficiently according to the VMs’ cache demands and balances cache load across hosts by live migrating cached data along with the VMs.
Title: Network Scheduling Aware Task Placement in Cloud
Abstract: To improve the application performance in data centers and cloud, existing cluster schedulers try to optimize the task placement or schedule network flows. The task scheduler tries to place tasks close to data (maximizing data locality) to minimize network traffic, while assuming fair sharing of network resources. On the other hand, the network scheduler tries to schedule flows based on properties of tasks (such as flow size, deadline, which flows belong to the same task (co-flow), etc), while ignoring the decisions made by the task scheduler and as a result compromise the performance gains achievable by network scheduling. In this project, we develop, NEAT, a task scheduling framework that uses the information from the underlying network scheduler to make task placement decisions. NEAT leverages the task priority and predicted completion time information from the network scheduler to make placement decisions to minimize the overall FCT of applications. NEAT uses a novel FCT prediction model to estimate the completion times of a task under the current network conditions. Our evaluation, using a variety of workloads and network settings, shows that the FCT performance can be improved by upto 50% when using network scheduling aware task placement.
Title: Interruptible Tasks: Treating Memory Pressure as Interrupts for Highly Scalable Data-Parallel Programs
Abstract: Real-world data-parallel programs commonly suffer from great memory pressure, especially when they are executed to process large datasets. Memory problems lead to excessive garbage collection (GC) effort and out-of-memory errors, significantly hurting system performance and scalability. We propose a systematic approach that can help data-parallel tasks survive memory pressure, improving their performance and scalability without needing any manual effort to tune system parameters or additional hardware resources. Our approach advocates interruptible task (ITask), a new type of data-parallel tasks that can be interrupted upon memory pressure — with part or all of their used memory reclaimed—and resumed when the pressure goes away. To support ITasks, we propose a novel programming model and a runtime system, and have instantiated them on two state-of-the-art platforms Hadoop and Hyracks. A thorough evaluation demonstrates the effectiveness of ITask: it has helped real-world Hadoop programs survive 13 out-of-memory problems reported on StackOverflow; a second set of experiments with 5 already well-tuned programs in Hyracks on datasets of different sizes shows that the ITask-based versions are 1.5–3× faster and scale to 3–24× larger datasets than their regular counterparts.
Title: Scalable Reasoning and Querying over Semantic Data
Abstract: Providing structure and meaning to the data on the Web helps in cutting through the Big Data clutter. In the context of Semantic Web, OWL and RDF are used to capture the domain knowledge and provide structure to the data. Reasoning and querying are two of the most important operations that can be performed over semantically structured data. Popular reasoners work only on single machines and are thus constrained by the memory and computational capacity available to them. They cannot scale with the size of the data. We will present a distributed OWL reasoner and describe the partition techniques and optimizations used.
Next part of the talk focusses on scalable query processing over RDF data. There are nearly 90 billion triples in Linked Open Data cloud. There is a need for scalable RDF data management and querying mechanisms. We discuss a system that uses graph partitioning and implements query processing on top of a distributed NoSQL store.
Title: Automatically Recovering Human-Understandable Evidence from Memory Images
Abstract: Cyber forensics is rapidly transforming investigations, in the cyber and physical worlds, with new techniques that recover ever more complex evidence from a wide variety of digital crime scenes. Increasingly, forensics investigators are looking to access the wealth of digital evidence stored in a system's volatile memory, and uncovering this evidence (called memory image forensics) has become an essential capability in modern cyber investigations. However, efficiently locating and accurately analyzing such evidence locked in memory images remains an open research problem. In this talk, I will present a new direction to this state of the art challenge: Both recovering and rendering evidence from a memory image in simple, human-understandable formats. Based on algorithmic reconstruction and program analysis techniques, this line of work develops memory image forensics tools that automatically locate, reconstruct, and render the vast array of complex, visual evidence stored in a device's memory image. These techniques are capable of recovering many highly probative forms of evidence --- from images, documents, and user accounts, to full smartphone app graphical user interfaces. The raw in-memory contents of such data would previously be unfathomable to human investigators.
Abstract: Software Defined Networks and Network Function Virtualization are quickly becoming adopted in cloud computing. As the prevalence of these technologies increases, so will the need to secure them. The logical centralization of the control plane offers many benefits, but could also introduce a new attack surface. At the same time, the flexibility of SDN and scalability of NFV can help to improve attack detection approaches. In this talk I will discuss opportunities for securing large networks using SDN while ensuring security of the architecture itself.
Title: Mining Temporal Lag from Fluctuating Events for Correlation and Root Cause Analysis
Abstract: The importance of mining time lags of hidden temporal dependencies from sequential data is highlighted in many domains including system management, stock market analysis, climate monitoring, and more. Mining time lags of temporal dependencies provides useful insights into understanding the sequential data and predicting its evolving trend. Traditional methods mainly utilize the predefined time window to analyze the sequential items or employ statistic techniques to identify the temporal dependencies from the sequential data. However, it is a challenging task for existing methods to find time lag of temporal dependencies in the real world, where time lags are fluctuating, noisy, and tend to be interleaved with each other. This presentation introduces a parametric model to describe noisy time lags. Then an efficient expectation maximization approach is proposed to find the time lag with maximum likelihood. An approximation method for learning time lag is also presented to improve the scalability without incurring significant loss of accuracy. Extensive experiments on both synthetic and real data sets are conducted to demonstrate the effectiveness and efficiency of proposed methods. Additionally, we come up with a system named TDMS (Temporal Data Mining System) to demonstrate our methodology for discovering temporal dependencies with time lag from sequential event data.