2019 Student Workshop on Systems and Cloud - overview
The 2019 IBM Research Student Workshop on Systems and Cloud was held on Tuesday, November 19th, 2019 at IBM Research - Almaden.
New: Photos from the workshop and selected slides have been published!
The goal of the student workshop is to bring together the next generation of researchers from local universities and IBM Research to collectively discuss the imminent challenges and solutions in the area of cloud and systems. Both students and IBM staff will present their research work in the form of talks, which will be followed by joint discussions.
|9:30||Continental Breakfast @ Cafe West|
|10:00||Introduction by the workshop organizers|
|11:00||Distributed Memory for Futures with Ownership
Stephanie Wang (University of California, Berkeley)
Daniel Waddington (IBM Research)|
Stephanie Wang, University of California, Berkeley
Title: Distributed Memory for Futures with Ownership
Abstract: Futures are quickly becoming a popular model for distributed programming. A future is a proxy for a value that will become available at a later time, which makes futures a natural fit for expressing asynchronous RPCs. Many implementations also allow futures to be composed, i.e., passing a future into an RPC and receiving a second future as the result. By composing futures together, a program can naturally express both parallelism and pipelining to the underlying system. Meanwhile, since futures contain only a pointer to the value rather than the value itself, the system can then optimize data movement for large objects.
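As a toy illustration of the composition idea, here is a minimal sketch using Python's standard `concurrent.futures` (everything runs locally here; the system described in the talk manages futures across machines, which this sketch does not attempt):

```python
from concurrent.futures import ThreadPoolExecutor

# Toy illustration: futures as proxies for values, composed so that the
# result of one asynchronous call feeds the next, without the caller
# ever touching the intermediate value itself.
pool = ThreadPoolExecutor(max_workers=2)

def load(n):        # stand-in for a remote data-loading RPC
    return list(range(n))

def total(fut):     # accepts a future, not a value
    return sum(fut.result())

f1 = pool.submit(load, 5)     # first async call returns a future
f2 = pool.submit(total, f1)   # compose: pass the future into the next call
print(f2.result())            # -> 10
pool.shutdown()
```

In a distributed setting the scheduler can additionally use this dependency information to co-locate `total` with the data `f1` points to, which is the data-movement optimization the abstract mentions.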
Bio: Stephanie Wang is a fourth-year PhD student in computer science at UC Berkeley, in the RISELab. She is advised by Professor Ion Stoica, and her research interest is in distributed systems, specifically in problems of fault tolerance. She is especially interested in building reliable large-scale systems that can be used and programmed by everyday developers. Her work is entirely open-source, much of it on the Ray project, a distributed execution framework for Python that has been adopted both in academia and in industry. Stephanie received her B.S. degrees in computer science and mathematics from MIT in 2015 and her M.Eng. degree in computer systems in 2016, and held the UC Berkeley Chancellor's Fellowship from 2016 to 2018.
Kunal Lillaney, Johns Hopkins University
Title: Agni: An Efficient Dual-access File System over Object Storage
Abstract: Object storage is a low-cost, scalable component of cloud ecosystems. However, interface incompatibilities and performance limitations inhibit its adoption for emerging cloud-based workloads. Users are compelled to either run their applications over expensive block storage-based file systems or use inefficient file connectors over object stores. Dual access, the ability to read and write the same data through file system interfaces and object storage APIs, holds promise to improve performance and eliminate storage sprawl.
Bio: Kunal Lillaney received his PhD in Computer Science from Johns Hopkins University in 2019, working with Randal Burns in the Hopkins Storage Systems Lab (HSSL). His dissertation was titled “Building Hierarchical Storage Services in the Cloud” and his research focused on enabling cloud object storage for different applications. His research has been published in multiple journals and conferences including Nature, IEEE e-Science, USENIX HotStorage and the ACM Symposium on Cloud Computing (SoCC). Kunal has interned with IBM Research-Almaden, Lawrence Livermore National Laboratory and Cigital. He also served as Secretary of Upsilon Pi Epsilon (UPE) between 2015 and 2017, and won the UPE Executive Council Award in 2016.
Daniel Bittman, University of California, Santa Cruz
Title: A Sharing Problem we all Share: Data-Centric OSes, Persistent Memory, and OS evolution
Abstract: Byte-addressable non-volatile memory (NVM) is coming, and with it, our programming models and OSes need to evolve. The low latency of NVM demands a reconsideration of how programs operate on persistent data: we need to avoid expensive operations like system calls and serialization. Yet removing the kernel from the persistence path opens a number of challenges and opportunities, including how to effectively secure and share persistent data whose references point directly to other persistent data in a global address space. We are building Twizzler, an OS that presents a new interface for persistent memory programming while taking into account issues such as sharing, heterogeneous hardware, and security. Twizzler provides programming models for NVM that reduce complexity and remove software layers, thus dramatically improving application performance.
Bio: Daniel is a graduate student in the SSRC. His interests are in kernel programming and design, security, non-volatile memory, and highly concurrent programming. His current project is an operating system for non-volatile memories, with OS designs and interfaces for better programming and data models in such an environment.
Kostis Kaffes, Stanford University
Title: Request Scheduling for μsec-scale Tail Latency
Abstract: In the first part of the talk we will present Shinjuku, a single-address space operating system that uses hardware support for virtualization to make preemption practical at the microsecond scale. This allows Shinjuku to implement centralized scheduling policies that preempt requests as often as every 5µsec and work well for both light and heavy tailed request service time distributions. We demonstrate that Shinjuku provides significant tail latency and throughput improvements over state-of-the-art dataplane operating systems for a wide range of workload scenarios.
In the second part of the talk, we argue that as application demands continue to increase, scaling up is not enough, and serving larger demands requires these systems to scale out to multiple servers in a rack. We present RSCS, the first rack-level microsecond-scale scheduler that provides the abstraction of a rack-scale computer (i.e., a huge server with hundreds to thousands of cores) to an external service with network-system co-design. The core of RSCS is a two-layer scheduling framework that integrates inter-server scheduling in the top-of-rack (ToR) switch with intra-server scheduling in each server to approximate centralized scheduling policies, and achieves near optimal performance for both light-tailed and heavy-tailed workloads. We design a custom switch data plane for the inter-server scheduler, which realizes power-of-k-choices, ensures request affinity, and tracks server loads accurately and efficiently. We implement an RSCS prototype on Barefoot Tofino switches and commodity servers. End-to-end experiments on a twelve-server testbed show that RSCS is able to scale out throughput near linearly, while maintaining the same tail latency as one server until the system is saturated.
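The power-of-k-choices dispatch policy mentioned above can be sketched in a few lines (a sketch of the general technique only; RSCS implements it in the switch data plane, not in software):

```python
import random

def power_of_k_choices(loads, k=2, rng=random):
    """Sample k distinct servers and return the index of the least loaded.

    Toy model of the inter-server dispatch policy: probing only k queues
    approximates least-loaded dispatch at far lower cost than inspecting
    every server, and is known to close most of the gap to the optimum.
    """
    candidates = rng.sample(range(len(loads)), k)
    return min(candidates, key=loads.__getitem__)

loads = [7, 3, 9, 1]                 # hypothetical per-server queue lengths
idx = power_of_k_choices(loads, k=2)
loads[idx] += 1                      # the chosen server's queue grows by one
```

With `k` equal to the number of servers this degenerates to exact least-loaded dispatch; the interesting regime is small `k` (typically 2), where the sampling cost stays constant as the rack grows.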
Bio: Kostis is a PhD student in Electrical Engineering at Stanford University, advised by Christos Kozyrakis. His research interests lie in the areas of low-latency datacenter systems and scheduling. Previously, he completed his Diploma in Electrical and Computer Engineering at the National Technical University of Athens, Greece.
Neeraja Yadwadkar, Stanford University
Title: Machine Learning for Resource Management in the Cloud
Abstract: Traditional resource management techniques that rely on simple heuristics often fail to achieve predictable performance in contemporary complex systems that span physical servers, virtual servers, private and/or public clouds. My research aims to bring the benefits of Machine Learning (ML) models to optimize and manage such complex systems by deriving actionable insights from the performance and utilization data these systems generate. To realize this vision of model-based resource management, we need to deal with the following key challenges data-driven ML models raise: uncertainty in predictions, cost of training, generalizability from benchmark datasets to real-world systems datasets, and interpretability of the models.
In this talk, I will present our ML formulations to demonstrate how to handle these challenges for two main problem domains in distributed systems: (I) Scheduling in parallel data-intensive computational frameworks for improved tail latencies, and (II) Performance-aware resource allocation in public cloud environments for meeting user-specified performance and cost goals. Along the way, I will also share a list of guidelines for leveraging ML to solve problems in systems, based on my experience.
Bio: Neeraja graduated with a PhD in Computer Science from University of California, Berkeley. Her thesis was on automatic resource management in the datacenter and the cloud. She is now a post-doctoral researcher in the Computer Science Department at Stanford University where she continues to work on distributed systems, cloud computing, and machine learning. Neeraja received her masters in Computer Science from the Indian Institute of Science, Bangalore, India.
Most of Neeraja’s research straddles the boundary of systems and Machine Learning (ML). Advances in systems, ML, and hardware architectures are about to launch a new era in which we can use the entire cloud as a computer. New ML techniques are being developed for solving complex resource management problems in systems. Similarly, systems research is being influenced by the properties of emerging ML algorithms and evolving hardware architectures. Bridging these complementary fields, her research focuses on using and developing ML techniques for systems, and on building systems for ML.