IBM Research India - Reliable Hybrid Cloud
Reliable Hybrid Cloud
Cloud computing promises significant benefits to enterprises, IT professionals and users: on-demand self-service, broad network access, resource pooling, rapid elasticity, measured services and high availability. However, as organizations look to realize these benefits, they face the challenge of bridging disparate IT environments - for example, pre-existing on-premise data centers, newly adopted Public Clouds, and Edge and Telecom networks providing connectivity to smart devices. This has led to a hybrid Cloud computing model, or simply Hybrid Cloud, as a solution to this problem and a foundational enabler for the future of computing in the industry.
As an example, consider a gaming company with a massively popular multi-player online game. Critical portions of the gaming software, such as the game map, scene and the AI engine for in-game bots, are located at strategic edge locations close to end users for a hyper-real-time gaming experience. Other portions of the game, such as the in-game chat server, game and player statistics, and the leaderboard, are hosted on a Public Cloud. The game collects data about its users which, in certain jurisdictions, must be securely stored within sovereign boundaries and must comply with privacy and security regulations. The gaming company partners with a financial institution to support promotions, rewards and in-game purchases, which are served from a mainframe system in the financial institution's on-premise data center. The gaming company's developers, site reliability engineers (SREs), compliance engineers, architects and other IT personas need to work across these disparate venues. Hybrid Cloud, when fully realized, will provide a unified experience to these personas and seamless interoperability across such disparate IT environments.
This will make it much simpler and more cost effective for organizations to optimize their hybrid IT to the specific needs of their business. Such Hybrid Clouds will need to be highly reliable for organizations to adopt them at scale for mission-critical needs. Reliability here means the ability of such a large-scale hybrid environment to consistently perform according to the specifications required by the business. These specifications can span performance, security, compliance, cost of ownership, resilience and elasticity, among many other dimensions. Moreover, as businesses become increasingly agile to meet ever-evolving consumer and market needs, these specifications can themselves evolve at a rapid pace. Hence, reliability must also support continuous and agile modernization at all levels of the IT stack.
Automate Modernization: Infra, App, Data, DevOps
To make Hybrid Cloud environments agile through continuous modernization, the modernization process must be specific, repeatable, automated and verifiable. The evolution of microservice architectures and Infrastructure as Code (IaC) has enabled modernization to be a gradual, controlled process with versioning. Porting legacy systems to leverage these involves deep knowledge of both the legacy platforms and the target architectures, such as Kubernetes and Cloud Native technologies. To lower the barrier to entry and speed up modernization at scale, we develop innovations and tools that work alongside architects to automate modernization across all layers of the stack - the DevOps pipeline, applications, data and infrastructure. This research area is truly multi-disciplinary, since it brings together advances in cloud computing, software engineering research and artificial intelligence applied to code (AI4Code) to address the variety of challenges the industry faces.
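To make this concrete, the sketch below shows what "modernization as code" can look like: a legacy service descriptor is translated into a Kubernetes Deployment manifest by a repeatable, versionable function. The descriptor fields and service names are purely illustrative, not any specific tool's format.

```python
# Illustrative sketch: a modernization step expressed as code, so it is
# repeatable, automated and verifiable. A minimal legacy service descriptor
# is translated into a Kubernetes Deployment manifest (a plain dict here,
# which could then be serialized to YAML and version-controlled).

def to_k8s_deployment(service: dict) -> dict:
    """Translate a legacy service descriptor into a Deployment manifest."""
    return {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": service["name"]},
        "spec": {
            "replicas": service.get("instances", 1),
            "selector": {"matchLabels": {"app": service["name"]}},
            "template": {
                "metadata": {"labels": {"app": service["name"]}},
                "spec": {
                    "containers": [{
                        "name": service["name"],
                        "image": service["image"],
                        "ports": [{"containerPort": service["port"]}],
                    }]
                },
            },
        },
    }

# Hypothetical descriptor for the gaming example's leaderboard service.
legacy = {"name": "leaderboard", "image": "game/leaderboard:1.0",
          "port": 8080, "instances": 3}
manifest = to_k8s_deployment(legacy)
```

Because the translation is a pure function of the descriptor, every modernization step can be reviewed, diffed and re-run, which is exactly what makes the process controlled and verifiable.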
AI-enabled Verification and Testing
Hybrid cloud workloads will be increasingly distributed, elastically scaled and frequently updated. Verifying and testing such workloads opens interesting and challenging research opportunities. For example, we are developing AI-enabled automated test generation at multiple functional levels - the UI, APIs and the unit level - to achieve better functional and code coverage. This spans cloud-native workloads, such as those described by OpenAPI specifications, as well as high-performance transactional and batch workloads that run on modern mainframe systems, which brings the unique challenge of cross-programming-language approaches to unified test automation.
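As a simplified illustration of spec-driven test generation (not IBM's actual tooling), the sketch below enumerates API test stubs from an OpenAPI-style spec. A real AI-enabled generator would also synthesize input values and test oracles; here we only derive the skeletons that such a generator would fill in.

```python
# Illustrative sketch: deriving API test stubs from an OpenAPI-style spec
# represented as a dict. Each (path, method, response status) combination
# yields one test skeleton.

def generate_test_stubs(spec: dict) -> list[dict]:
    """Enumerate one test stub per documented path/method/status triple."""
    stubs = []
    for path, operations in spec.get("paths", {}).items():
        for method, operation in operations.items():
            for status in operation.get("responses", {}):
                stubs.append({
                    "name": f"test_{method}_{path.strip('/').replace('/', '_')}_{status}",
                    "method": method.upper(),
                    "path": path,
                    "expected_status": int(status),
                })
    return stubs

# A tiny made-up spec fragment for the gaming example's player API.
spec = {"paths": {"/players/{id}": {"get": {"responses": {"200": {}, "404": {}}}}}}
stubs = generate_test_stubs(spec)  # one stub expecting 200, one expecting 404
```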
Application Aware Networking
In order to meet the diverse reliability, performance and security needs of hybrid cloud applications, the network needs to evolve from a simple pipe that moves packets from one place to another into an intelligent fabric aware of the needs of the applications it supports. Application developers and DevOps practitioners specify their applications' connectivity needs as 'intents', which are translated into a multi-cloud networking fabric of connections fine-tuned and optimized for each application. Supporting diverse application needs on the same underlying network is an extremely challenging problem that we aim to address with a combination of technologies, including Software Defined Networking (SDN), eBPF-based network datapaths and programmable observability, and AI-based network management and optimization.
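The intent-to-fabric translation can be sketched at its simplest as a compilation step from a declarative intent to concrete per-connection settings. The intent vocabulary and the settings below are entirely illustrative, not a real API.

```python
# Hedged sketch: compiling an application connectivity 'intent' into concrete
# per-connection settings for a multi-cloud fabric. Profile names, paths and
# QoS classes are made up for illustration.

INTENT_PROFILES = {
    "low-latency":   {"path": "edge-direct", "qos_class": "expedited", "encrypt": True},
    "bulk-transfer": {"path": "backbone", "qos_class": "best-effort", "encrypt": True},
}

def compile_intent(app: str, intent: str) -> dict:
    """Translate a declarative intent into fabric settings for one app."""
    profile = INTENT_PROFILES.get(intent)
    if profile is None:
        raise ValueError(f"unknown intent: {intent}")
    return {"app": app, **profile}

# The gaming example's edge-hosted game map asks for low latency.
conn = compile_intent("game-map", "low-latency")
```

In practice this translation is far richer: the fabric must reconcile many applications' intents over shared links, which is where SDN control, eBPF datapaths and AI-based optimization come in.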
Security and Compliance as Code
Security and compliance officers struggle to get a handle on their organization's cybersecurity posture. The relevant information is usually spread across policy documents, spreadsheets and operational systems. As a result, whether the existing security program is sufficient becomes a hard question for these officers to answer. This is critical, since regulators often hold them personally responsible for lapses that result in data breaches. The problem's complexity has grown manifold with the advent of hybrid cloud infrastructures: an organization's infrastructure can now span multiple clouds and on-premise environments. This research area focuses on transforming an organization's security and compliance operations from a document-centric process to a data-centric one. This spans development-time source code scanning for security vulnerabilities and runtime continuous monitoring for compliance. Essentially, every security and compliance artifact - security policies, the infrastructure to be secured, specific configurations, raw compliance measurements - is expressed as code using standardized, machine-processable representations. This enables the entire security and compliance operation to be automated with the help of AI, keeping human experts in the loop.
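The sketch below illustrates the "as code" idea in miniature: a policy is a machine-processable list of rules, and posture is computed directly from raw configuration measurements rather than read out of documents. Rule names and configuration keys are invented for illustration.

```python
# Illustrative sketch of 'compliance as code': the policy is data plus
# executable checks, so posture assessment becomes a computation. Rules
# and config fields are hypothetical.

policy = [
    # Note: comparing version strings lexicographically is a simplification
    # that only works for single-digit minor versions like "1.2".
    {"rule": "tls_min_version", "check": lambda cfg: cfg.get("tls_min_version", "") >= "1.2"},
    {"rule": "encryption_at_rest", "check": lambda cfg: cfg.get("encryption_at_rest") is True},
]

def assess(config: dict) -> dict:
    """Evaluate every policy rule against raw configuration measurements."""
    results = {rule["rule"]: rule["check"](config) for rule in policy}
    results["compliant"] = all(results.values())
    return results

# One measured configuration: TLS is fine, but at-rest encryption is off.
report = assess({"tls_min_version": "1.2", "encryption_at_rest": False})
```

Because the report is structured data, it can feed automation directly: a failed rule can open a ticket or trigger remediation, with human experts reviewing the loop rather than compiling spreadsheets.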
Programmable Observability And Proactive AIOps
Modern applications (such as our gaming example) can consist of dozens or even hundreds of services (covering functions like collecting user data, the chat server, the game map, etc.) deployed on thousands of machines across multiple data centers. Supporting these applications is critical, because a single outage can adversely impact the business, putting billions of dollars at stake. Also, with the ever-rising volume and variety of data, another challenge is how to effectively discover insights from multiple heterogeneous data sources such as logs, alerts, metrics and traces. Artificial Intelligence for IT Operations (AIOps), an emerging field at the intersection of big data, machine learning and IT operations management, is about applying AI to transform IT operations. AIOps brings together heterogeneous data from disparate sources and application life-cycle stages, AI-driven insights, and automated remediation actions on one common platform. We proffer a vision for an AIOps system that provides end-to-end visibility into applications and the underlying infrastructure. We envision AI and automation incorporated throughout the DevSecOps lifecycle to proactively detect and prevent outages, identifying potential risks and vulnerabilities as early as possible. As we learn to predict outages ahead of time, the vision is a zero-downtime, resilient model where applications work seamlessly across hybrid environments while AI works in the background to minimize risks, proactively mitigate issues, forecast resource utilization, and optimize performance.
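As a toy illustration of proactive detection in the AIOps spirit, the sketch below flags anomalous samples in a single metric stream using a z-score. Production AIOps systems fuse logs, traces, alerts and metrics with learned models; this deliberately minimal version uses one made-up latency series.

```python
# Minimal sketch of metric anomaly detection: flag samples whose z-score
# (distance from the mean in standard deviations) exceeds a threshold.
import statistics

def detect_anomalies(samples: list[float], threshold: float = 2.5) -> list[int]:
    """Return indices of samples more than `threshold` std devs from the mean."""
    mean = statistics.fmean(samples)
    stdev = statistics.pstdev(samples)
    if stdev == 0:
        return []  # a perfectly flat series has no outliers
    return [i for i, x in enumerate(samples) if abs(x - mean) / stdev > threshold]

# Hypothetical request latencies (ms) with one outage-like spike at index 6.
latency_ms = [20, 21, 19, 22, 20, 21, 400, 20, 19]
anomalies = detect_anomalies(latency_ms)
```

In a real pipeline, such a detector would be one signal among many; correlating it with log and trace anomalies is what lets the platform localize a fault rather than merely notice it.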
Distributed Data Management
One of the primary drivers for hybrid cloud is the need to organize and manage data distributed across geographic locations, to minimize the movement of large volumes of data, and, in Edge computing scenarios, to move data processing as close as possible to where the data is generated. This includes providing standard storage abstractions across heterogeneous data sources and types; discovering, ingesting and transforming distributed streaming data in a reliable and performant manner; data lifecycle management; and policy-based data governance, sovereignty and compliance.
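A standard storage abstraction over heterogeneous sources can be sketched as a common read interface plus a catalog that routes logical names to whichever venue holds the data. The classes and names below are hypothetical, intended only to show the shape of such an abstraction.

```python
# Hypothetical sketch: applications address data by logical name; the catalog
# routes each name to the backend (edge cache, cloud bucket, on-premise store)
# that holds it, behind one uniform interface.
from abc import ABC, abstractmethod

class DataSource(ABC):
    """Uniform read interface every backend implements."""
    @abstractmethod
    def read(self, key: str) -> bytes: ...

class InMemorySource(DataSource):
    """Stand-in for a real backend such as an object store or edge cache."""
    def __init__(self, data: dict[str, bytes]):
        self._data = data
    def read(self, key: str) -> bytes:
        return self._data[key]

class Catalog:
    """Routes logical names of the form '<venue>/<key>' to a registered source."""
    def __init__(self):
        self._routes: dict[str, DataSource] = {}
    def register(self, prefix: str, source: DataSource):
        self._routes[prefix] = source
    def read(self, name: str) -> bytes:
        prefix, _, key = name.partition("/")
        return self._routes[prefix].read(key)

# Venues from the gaming example: scene data at the edge, stats in the cloud.
catalog = Catalog()
catalog.register("edge", InMemorySource({"scene": b"map-tiles"}))
catalog.register("cloud", InMemorySource({"stats": b"leaderboard"}))
data = catalog.read("edge/scene")
```

The same catalog is a natural place to enforce policy: a governance rule could, for example, refuse to route a read of sovereign data to a source outside its jurisdiction.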