Goal-oriented management of computation and data - overview
Our group focuses on a broad set of topics pertaining to the management of computation and data in large-scale distributed environments where resources are shared among multiple applications and users. Today, such systems are most often studied in the context of cloud computing. We are developing management technologies that enable advanced workload-centric pattern deployment and life-cycle policies in private and hybrid clouds and across multiple heterogeneous clouds. The workload-centric approach takes a holistic view of an application, including its compute, network, and storage components; recognizes the relationships among them; and permits diverse goals and policies to be associated with such workloads. Our most recent interest is in workload scheduling for AI workloads, which requires innovations both in batch scheduling and in the efficient management of batch workloads alongside interactive workloads such as AI inference.
To solve this interdisciplinary problem, we are pursuing research in several areas:
Distributed systems: We study concurrency, resilience, and scalability issues of workload scheduling in large-scale clouds.
Optimization theory: We develop new solutions for assignment problems with non-linear objectives.
Control theory: We study issues of stability and responsiveness.
Performance modeling: We develop analytic models of workload performance, security, and availability.
Analytics and AI technologies: We leverage AI and statistical models to derive new insights from observations when we do not have complete information about the managed system.
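To give a flavor of the optimization problems involved, consider assigning jobs to machines under a non-linear objective. The sketch below is purely illustrative (the quadratic cost, job sizes, and greedy heuristic are our own simplifications, not the group's actual algorithms): a sum-of-squared-loads objective penalizes imbalance, and a greedy pass places each job where the marginal cost increase is smallest.

```python
def quadratic_cost(loads):
    # Non-linear objective: the sum of squared machine loads
    # penalizes imbalance, so the optimum spreads load evenly.
    return sum(l * l for l in loads)

def greedy_assign(job_sizes, n_machines):
    # Greedy heuristic: process jobs largest-first and place each one
    # on the machine where the marginal increase in the quadratic
    # cost is smallest. Returns (assignment in sorted-job order, loads).
    loads = [0.0] * n_machines
    assignment = []
    for size in sorted(job_sizes, reverse=True):
        best = min(range(n_machines),
                   key=lambda m: (loads[m] + size) ** 2 - loads[m] ** 2)
        loads[best] += size
        assignment.append(best)
    return assignment, loads

assignment, loads = greedy_assign([4, 3, 3, 2], 2)
# loads end up perfectly balanced: [6.0, 6.0]
```

Because the objective is convex, minimizing the marginal cost at each step tends to balance the machines, unlike a linear objective, for which any feasible assignment costs the same.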
We like seeing our systems work. We prototype all of our solutions, leveraging open-source technologies such as Kubernetes, and we contribute to the kube-arbitrator project.
One key aspect of management tackled by our team in recent years has been the allocation of resources (CPU, memory, disk, storage, etc.) to applications. One resource allocation technique we investigate is application placement. We first studied this problem in the context of middleware environments, such as IBM WebSphere, where resource allocation may be performed by controlling the number and placement of application servers running replicas of an application. We developed the Application Placement Controller, which dynamically adjusts the number and placement of application replicas so as to maximize overall system utility while obeying a variety of configurable constraints.
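To illustrate the kind of decision such a controller makes, here is a deliberately simplified placement sketch. The greedy utility-density heuristic, the candidate-replica list, and the single-resource node model are all hypothetical stand-ins, not the actual Application Placement Controller logic:

```python
def place_replicas(apps, nodes):
    """Greedily place application replicas to maximize total utility.

    apps:  list of (name, cpu_demand, utility) candidate replicas
    nodes: list of remaining CPU capacities, mutated in place
    Returns a list of (app_name, node_index) placements.
    """
    placement = []
    # Consider replicas in order of utility per unit of demand -- a
    # simple stand-in for the utility-maximizing optimization, with
    # per-node capacity as the only constraint.
    for name, demand, utility in sorted(apps,
                                        key=lambda a: a[2] / a[1],
                                        reverse=True):
        for i, free in enumerate(nodes):
            if free >= demand:
                nodes[i] -= demand
                placement.append((name, i))
                break  # replica placed; skip remaining nodes
    return placement

nodes = [4, 4]
placement = place_replicas([("A", 2, 10), ("B", 3, 9), ("C", 2, 4)], nodes)
# placement: [("A", 0), ("B", 1), ("C", 0)]
```

A production controller would additionally handle multiple resource dimensions, replica-count decisions, and the configurable constraints mentioned above; the greedy density rule here only conveys the utility-versus-capacity trade-off.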
With the increasing popularity of virtualization, we applied our application placement technology to the placement of virtual machines. We developed Vespa, a highly efficient and scalable placement controller for virtual machines, along with technologies to estimate VM demand in a black-box manner, to model and optimize for software license limits, and to reduce electrical energy usage through dynamic server consolidation.
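Black-box demand estimation works from externally observable signals, such as hypervisor-level utilization samples, without instrumenting the guest. A minimal sketch, assuming an exponentially weighted moving average as the estimator (the actual techniques are more sophisticated than this):

```python
def smooth_demand(samples, alpha=0.5):
    # Exponentially weighted moving average over observed utilization
    # samples -- a simple black-box demand estimate that weights
    # recent observations more heavily (higher alpha = more reactive).
    est = samples[0]
    for s in samples[1:]:
        est = alpha * s + (1 - alpha) * est
    return est

# A burst from 0 to 2 cores is only partially reflected at alpha=0.5:
smooth_demand([0.0, 2.0], alpha=0.5)  # -> 1.0
```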
In the context of application server middleware platforms, we also studied techniques for request load balancing and flow control. We developed technology to differentiate the SLAs of request flows (such as response time or relative delay) through traffic shaping at an edge L7 proxy.
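Per-class traffic shaping at a proxy is often built from token buckets. A minimal sketch with hypothetical "gold" and "bronze" SLA classes (the class names, rates, and bucket model are illustrative, not taken from the system described above):

```python
class TokenBucket:
    # Classic token bucket: tokens accrue at a fixed rate up to a burst
    # capacity; a request is admitted only if a token is available.
    def __init__(self, rate, capacity):
        self.rate = rate          # tokens added per second
        self.capacity = capacity  # maximum burst size
        self.tokens = capacity
        self.last = 0.0

    def allow(self, now):
        # Refill based on elapsed time, then try to consume one token.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

# Hypothetical SLA classes: "gold" flows are shaped at a much higher
# rate than "bronze" flows, differentiating their delay under load.
shapers = {"gold": TokenBucket(rate=100, capacity=10),
           "bronze": TokenBucket(rate=10, capacity=2)}
```

Shaping lower classes more aggressively at the edge frees backend capacity, which is what lets the proxy enforce response-time or relative-delay targets for the higher classes.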
To implement these resource management systems, we also developed a scalable and low-overhead communication layer, called Bison.
Most recently, we investigated techniques to improve the performance and reliability of multi-stage data flows running in large-scale data analytics environments. We studied ways to better manage the intermediate data of MapReduce jobs, to dynamically control the replication factor of intermediate data in a data flow, and to perform resource- and SLA-aware scheduling of MapReduce jobs.
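The replication-factor decision can be framed as a cost trade-off: each extra replica of intermediate data costs storage and I/O, while too few replicas raise the expected cost of recomputing the upstream stages after a loss. A toy model, assuming independent replica failures (the cost parameters and formula are illustrative, not the published technique):

```python
def choose_replication(fail_prob, recompute_cost, store_cost, max_r=5):
    # Pick the replication factor r minimizing expected cost:
    # storage cost grows linearly with r, while the expected
    # recomputation cost falls as fail_prob ** r (all r replicas
    # must be lost, assuming independent failures).
    def expected_cost(r):
        return r * store_cost + (fail_prob ** r) * recompute_cost
    return min(range(1, max_r + 1), key=expected_cost)

# Cheap storage and an expensive upstream stage favor more replicas;
# reliable nodes (low fail_prob) favor fewer.
choose_replication(fail_prob=0.1, recompute_cost=1000, store_cost=10)  # -> 2
```

A dynamic controller would re-evaluate such a trade-off per data-flow stage as failure rates and downstream demand are observed, rather than using one static factor for all intermediate data.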