Cloud Infrastructure Services - overview
IBM offers cloud-based infrastructure services to enterprise customers. Such Infrastructure-as-a-Service (IaaS) offerings allow clients to obtain almost instantaneous access to significant compute and storage resources with no capital investment. There are, however, major challenges in providing enterprise cloud services, including much stricter security requirements, service-level agreement demands, the ability to customize the service management and related business processes, the need to enable legacy applications on the cloud, and the often complex deployment topology of business applications. These challenges generally reduce the opportunity for resource sharing and dampen the economy-of-scale benefits, and therefore threaten to weaken the potential value of the cloud to the enterprise.
IBM Research has been at the forefront of cloud infrastructure services from the very beginning, tackling many difficult technical problems. IBM Research has also played an instrumental role in IBM’s services offerings and customer engagements in this space. Innovations from IBM Research include contract-based offering and rating management, cloud business office automation and back-office integration, return-on-investment analytics for the cloud, the Fast Virtual Disk image format with copy-on-read and adaptive prefetching, redundancy-aware virtual image access and management, and the Virtual Hypervisor approach to fair and economical resource partitioning. These contributions have significantly increased the efficiency, flexibility, usability, and cost-effectiveness of IBM’s cloud services, enabled IBM to deliver the full potential of IaaS to customers, and provided the technical underpinnings needed to build an enterprise cloud services ecosystem.
Selected projects:
Redundancy-Aware VM Image Access and Management
Cloud service providers pursue a broad range of infrastructure optimizations in order to provide high-quality Infrastructure-as-a-Service (IaaS) at a low cost. One of the challenges faced by large-scale IT cloud providers is lifecycle management of virtual disks. The lifecycle operations include migration of virtual disks from customer premises to the provider's data center, transfer and distribution of virtual disks across multiple geographically distributed data centers, provisioning of virtual machine instances based on an image, cache management in a hypervisor, and frequent check-in and check-out of virtual disks to and from the cloud image library. All of these operations have to handle large amounts of data and contribute significantly both to the latency experienced by users and to the provider's cost base. Optimizing them is therefore critical to an IT cloud provider's business success.
Our research addresses several aspects of this broader problem by novel applications of data de-duplication to virtual image provisioning and transfer. Our work is motivated by empirical studies [1,2,4] of the similarity within and between VM image libraries from production IaaS clouds and virtual appliances.

The key findings show that the number of instances created from the same VM image is relatively small, so conventional peer-to-peer (P2P) file-based sharing approaches may not be effective for provisioning and transfer optimization. However, different VM image files often have many chunks of data in common, with the fraction of redundant content within a library typically reaching 70% for 4KB chunking. Based on these observations, we have developed an image library clustering model [2] and associated management algorithms that allow us to efficiently represent and manage image overlap information.
This clustering model improves virtual disk transfer time by reconstituting an image at the target data center from information about the overlap among images and the content already available there. Our evaluation shows an average sixfold reduction in network transfer volume and time, with even larger reductions for images that differ only by small configuration changes. We have also explored improvements to the agile DevOps development model that can be gained by leveraging image redundancy [3]. To provide more insight into the potential benefits, we have developed an analytical model that quantifies the efficiency of a VM provisioning process [4] that leverages virtual machine image similarity to reduce the data volume transferred from the storage server to the hypervisor.
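The chunk-level overlap and redundancy-aware transfer described above can be illustrated with a small sketch. It is not the implementation from [2]: it assumes fixed-size 4KB chunks identified by SHA-256 digests, and the names `redundancy_fraction`, `plan_transfer`, and `reconstitute` are hypothetical helpers introduced only for this example.

```python
import hashlib

CHUNK_SIZE = 4096  # 4KB fixed-size chunking (assumption for this sketch)


def chunks(image: bytes):
    """Yield (digest, chunk) pairs for each fixed-size chunk of an image."""
    for offset in range(0, len(image), CHUNK_SIZE):
        chunk = image[offset:offset + CHUNK_SIZE]
        yield hashlib.sha256(chunk).digest(), chunk


def redundancy_fraction(images) -> float:
    """Fraction of all chunks in a library that duplicate another chunk."""
    unique = set()
    total = 0
    for image in images:
        for digest, _ in chunks(image):
            total += 1
            unique.add(digest)
    return 0.0 if total == 0 else 1.0 - len(unique) / total


def plan_transfer(image: bytes, target_digests):
    """Return the image recipe and only the chunks missing at the target."""
    recipe = []   # ordered digests needed to rebuild the image
    to_send = {}  # digest -> chunk bytes not yet present at the target
    for digest, chunk in chunks(image):
        recipe.append(digest)
        if digest not in target_digests and digest not in to_send:
            to_send[digest] = chunk
    return recipe, to_send


def reconstitute(recipe, to_send, target_store: dict) -> bytes:
    """Rebuild the image at the target from local and transferred chunks."""
    target_store.update(to_send)
    return b"".join(target_store[digest] for digest in recipe)
```

In this toy setting, an image that differs from one already stored at the target by a single chunk requires only that chunk plus the recipe to cross the network, which is the effect the transfer optimization relies on.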

Moreover, we have developed a topology-aware data sharing mechanism that enables collaborative sharing [5] to speed up VM image provisioning. Our distribution scheme takes advantage of the explicit hierarchical network topology of data centers and enables efficient data sharing and metadata management. We have also studied cache optimization strategies [6] that use the combined guest and hypervisor caches more efficiently. Finally, we proposed an I/O optimization layer [1] tailored for the cloud. It generates a block translation map at VM image creation and capture time, and uses it to redirect accesses for identical blocks to the same filesystem address before they reach the OS. This greatly enhances the cache hit ratio of VM I/O requests and leads to performance gains of up to 55% in instantiating VM operating systems and up to 45% in loading application stacks. It also reduces I/O resource consumption by as much as 70%.
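The block translation idea can be sketched as follows. This is a toy model under stated assumptions, not the design from [1]: blocks are 4KB, identical blocks are detected by SHA-256 digest, the shared store is a plain Python list, and `build_translation_map` and `SharedBlockCache` are hypothetical names.

```python
import hashlib

BLOCK_SIZE = 4096  # assumed block size for this sketch


def build_translation_map(image: bytes, canonical: dict, store: list) -> list:
    """Map each block index of an image to a shared address in the store.

    Identical blocks, within or across images, get the same address.
    """
    block_map = []
    for offset in range(0, len(image), BLOCK_SIZE):
        block = image[offset:offset + BLOCK_SIZE]
        digest = hashlib.sha256(block).digest()
        if digest not in canonical:
            canonical[digest] = len(store)  # first copy becomes canonical
            store.append(block)
        block_map.append(canonical[digest])
    return block_map


class SharedBlockCache:
    """Cache keyed by translated address, so duplicate blocks share entries."""

    def __init__(self, store: list):
        self.store = store
        self.cache = {}
        self.hits = 0
        self.misses = 0

    def read(self, block_map: list, block_no: int) -> bytes:
        address = block_map[block_no]  # redirect before touching the cache
        if address in self.cache:
            self.hits += 1
        else:
            self.misses += 1
            self.cache[address] = self.store[address]
        return self.cache[address]
```

Because reads are redirected before the cache lookup, a second VM booting from a similar image finds most of its blocks already cached, even though it reads a different image file.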
Analytics for Cloud Infrastructure Optimization
Cloud computing promises unlimited, cost-effective and agile computing resources for users. To benefit from economies of scale, cloud providers have to set up and operate their cloud infrastructure in an unprecedentedly efficient manner. Specifically, cloud providers need to best utilize available resources and optimize their deployment processes, so as to achieve the best possible customer satisfaction, while minimizing operating cost.

To achieve this goal, we have been developing analytical approaches to mining operational data (e.g., service requests, workload time series, management events, and problem tickets) to gain insight into cloud operations. Specifically, by analyzing the history of VM provisioning requests, we are able to prepare VM instances in advance and deliver them to the customer instantly upon request, allowing real-time scaling. By analyzing how users request and release resources, we can forecast capacity requirements and take proactive actions that use fewer computing resources while achieving higher availability. Further, by mining historical infrastructure events, we can learn event patterns, predict incidents, and construct event-handling rules for proactive cloud management, so that operational disruptions can be avoided.
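As a rough illustration of preparing VM instances ahead of demand, the sketch below sizes a pool of pre-provisioned VMs from recent request counts. The forecasting method is an assumption: a plain moving average with a fixed headroom percentage stands in for whatever models the actual work uses, and `PreProvisionPlanner` is a hypothetical class.

```python
from collections import deque


class PreProvisionPlanner:
    """Size a pool of ready-made VMs from recent provisioning requests."""

    def __init__(self, window: int = 3, headroom_pct: int = 20):
        self.history = deque(maxlen=window)  # most recent interval counts
        self.headroom_pct = headroom_pct     # safety margin above forecast

    def observe(self, requests_in_interval: int) -> None:
        """Record how many VMs were requested in the latest interval."""
        self.history.append(requests_in_interval)

    def pool_target(self) -> int:
        """Moving-average forecast plus headroom, rounded up."""
        if not self.history:
            return 0
        numerator = sum(self.history) * (100 + self.headroom_pct)
        denominator = 100 * len(self.history)
        return -(-numerator // denominator)  # integer ceiling division


planner = PreProvisionPlanner(window=3, headroom_pct=20)
for count in (10, 20, 30):
    planner.observe(count)
ready_vms = planner.pool_target()  # number of VMs to keep prepared
```

A production forecaster would also need to account for seasonality and for the cost of idle instances; the moving average is only the simplest possible placeholder.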
Providing Infrastructure Agility with Cloud DevOps
The fast-evolving IaaS cloud technology (e.g., OpenStack) requires unprecedented agility in cloud infrastructure development. The DevOps method provides a single path to deliver the same source code and configurations from development, to testing, and eventually to operations. This can only be achieved by addressing the following research challenges:
- Infrastructure as code: all source code and configurations are managed under version control and shared across dev, test, and ops teams.
- Deeply modeled system dependencies: a model describes all of the components, policies, and dependencies related to a software system. This greatly simplifies reproduction of a system and enables changes without conflicts.
- Automated deployment and configuration: this includes dependency discovery and resolution, system construction and provisioning, and update and rollback.
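Dependency resolution for automated deployment can be sketched with a topological sort over a modeled dependency graph. This is a minimal illustration using Python's standard `graphlib` module (Python 3.9+), not the tooling the team built; the component names and the `deployment_order` and `rollback_order` helpers are hypothetical.

```python
from graphlib import CycleError, TopologicalSorter


def deployment_order(dependencies: dict) -> list:
    """Return components ordered so each one follows its dependencies.

    `dependencies` maps a component to the set of components it requires.
    Raises CycleError if the model contains a circular dependency.
    """
    return list(TopologicalSorter(dependencies).static_order())


def rollback_order(dependencies: dict) -> list:
    """Roll back in reverse, so dependents are removed before dependencies."""
    return list(reversed(deployment_order(dependencies)))


# Hypothetical system model: each key depends on the components in its set.
system_model = {
    "web-frontend": {"api-server"},
    "api-server": {"database", "message-queue"},
    "database": set(),
    "message-queue": set(),
}

order = deployment_order(system_model)
```

Keeping the graph explicit is what makes conflicts detectable before deployment: a circular dependency surfaces as a `CycleError` instead of a half-provisioned system.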

Leveraging open source technologies, such as xCAT and Chef, our team has been actively developing the technologies in these areas and providing DevOps solutions to internal partners, as well as external customers.
Highlighted publications:
[1] VMAR: Optimizing I/O Performance and Resource Utilization in the Cloud
Zhiming Shen, Zhe Zhang, Andrzej Kochut, Alexei Karve, Han Chen, Minkyong Kim, Hui Lei, Nicholas Fuller
ACM/IFIP/Usenix International Middleware Conference (Middleware 2013)
[2] Redundancy Aware Virtual Disk Mobility for Cloud Computing
Alexei Karve, Andrzej Kochut
IEEE 6th International Conference on Cloud Computing (CLOUD 2013)
[3] Image Transfer Optimization for Agile Development
Alexei Karve, Andrzej Kochut
IEEE/IFIP International Symposium on Integrated Network Management (IM 2013)
[4] Leveraging Local Image Redundancy for Efficient Virtual Machine Provisioning
Andrzej Kochut, Alexei Karve
IEEE/IFIP Network Operations and Management Symposium (NOMS 2012)
[Best Paper Award]
[5] VDN: Virtual Machine Image Distribution Network for Cloud Data Centers
Chunyi Peng, Minkyong Kim, Zhe Zhang, Hui Lei
31st IEEE International Conference on Computer Communications (INFOCOM 2012)
[6] Small is Big: Functionally Partitioned File Caching in Virtualized Environments
Zhe Zhang, Han Chen, Hui Lei
4th USENIX Workshop on Hot Topics in Cloud Computing (HotCloud 2012)
[7] Cloud Analytics for Capacity Planning and Instant VM Provisioning
Yexi Jiang, Chang-Shing Perng, Tao Li, Rong N. Chang
IEEE Transactions on Network and Service Management, 10(3), 2013
[8] Self-service Financial Control and Organizational Governance in Cloud
Chunqiang Tang, Chang-Shing Perng, Salman Abdul Baset
8th International Conference on Network Service and Management (CNSM 2012)
[9] On Impact of Dynamic Virtual Machine Reallocation on Data Center Efficiency
Andrzej Kochut
IEEE Conference on Measurement and Simulation of Computer and Telecommunication Systems (MASCOTS), Baltimore, Maryland, USA, September 8-10, 2008
[Runner-up to Best Paper Award]
[10] Dynamic Placement of Virtual Machines for Managing SLA Violations
Norman Bobroff, Andrzej Kochut, Kirk Beaty
IEEE/IFIP International Symposium on Integrated Network Management (IM), Munich, Germany, May 21-25, 2007
[Best Paper Award]
[11] Strategies for Dynamic Resource Management in Virtualized Server Environments
Andrzej Kochut, Kirk Beaty
IEEE Conference on Measurement and Simulation of Computer and Telecommunication Systems (MASCOTS), Istanbul, Turkey, October 24-26, 2007
[12] Managing Responsiveness of Virtual Desktops using Passive Monitoring
Rajdeep Bhowmik, Andrzej Kochut, Kirk Beaty
IEEE/IFIP International Symposium on Integrated Network Management (IM), Long Island, New York, USA, June 1-5, 2009
[13] Power and Performance Modeling of Virtualized Desktop Systems
Andrzej Kochut
IEEE/ACM International Symposium on Modeling, Analysis and Simulation of Computer and Telecommunication Systems (MASCOTS), London, England, September 21-23, 2009
Contact:
Hui Lei: hlei
Vijay Mann: vijamann
