IBM Research India - Internship Projects and Research Areas

Summer - 2023 Internship Projects and Research Areas

Artificial Intelligence (AI)

Number : IRL_Project_1

Title : Fairness with data quality constraints             

Description:  The quality of data commonly varies across user groups in our society. Common data pre-processing or data cleaning methods can introduce new biases or worsen existing biases towards some groups and lead to a trained model that is even more unfair. Existing algorithms that improve data quality tend to be fairness-agnostic and techniques that train a fair model from biased data tend to be data quality agnostic. This internship will investigate one or more of the following components: (a) Evaluate tabular & NLP datasets to measure the extent to which quality of data varies across groups, its impact on model fairness, and hypothesise reasons behind the underlying phenomenon. (b) Develop fairness-aware techniques to improve data quality. (c) Develop novel methods to optimally train a fair model.                   
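As a rough illustration of component (a), the sketch below measures how one data quality indicator (the missing-value rate) and one fairness indicator (the positive-prediction rate) differ across groups. The records, the group labels `A` and `B`, the `income` feature and the helper `per_group_rates` are all hypothetical; the project itself may use different quality and fairness measures.

```python
from collections import defaultdict

# Hypothetical toy records: a group label, a feature that may be missing,
# and a binary model prediction. None of these values come from a real dataset.
records = [
    {"group": "A", "income": 52000, "pred": 1},
    {"group": "A", "income": 48000, "pred": 1},
    {"group": "A", "income": None, "pred": 0},
    {"group": "A", "income": 61000, "pred": 1},
    {"group": "B", "income": None, "pred": 0},
    {"group": "B", "income": None, "pred": 0},
    {"group": "B", "income": 39000, "pred": 1},
    {"group": "B", "income": 41000, "pred": 0},
]


def per_group_rates(rows, feature):
    """Return the missing-value rate and positive-prediction rate per group."""
    counts = defaultdict(lambda: {"n": 0, "missing": 0, "positive": 0})
    for row in rows:
        c = counts[row["group"]]
        c["n"] += 1
        c["missing"] += int(row[feature] is None)
        c["positive"] += row["pred"]
    return {
        group: {
            "missing_rate": c["missing"] / c["n"],
            "positive_rate": c["positive"] / c["n"],
        }
        for group, c in counts.items()
    }


rates = per_group_rates(records, "income")

# Gap in data quality and gap in outcomes (demographic parity difference).
missing_gap = abs(rates["A"]["missing_rate"] - rates["B"]["missing_rate"])
parity_gap = abs(rates["A"]["positive_rate"] - rates["B"]["positive_rate"])
print(rates, missing_gap, parity_gap)
```

In this toy data the group with more missing values also receives fewer positive predictions, which is the kind of correlation the evaluation component would look for.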

Required Skills: Python programming, hands-on deep learning experience (PyTorch, TensorFlow), knowledge of trust/fairness/bias in machine learning, hands-on technical skills in optimisation or reinforcement learning.

Number : IRL_Project_2

Title : Identifying assisting languages for improving downstream task performance in the language of interest

Description:  Existing literature focuses on multilingual fine-tuning, i.e., fine-tuning on the combined data of one or more assisting languages to handle data sparsity in low-resource languages. However, it is unclear whether the assisting languages should come from the same language family as the low-resource language or whether English can serve as the assisting language. This work tries to answer this question empirically.

Required Skills: Natural Language Processing, Transformers, Deep Learning, PyTorch, Python

Number : IRL_Project_3

Title : API Testing             

Description:  APIs are widely replacing downloadable applications as cloud-based platforms become common. The project involves building a testing tool for APIs that performs functional testing while requiring minimal input from the user.

Required Skills: At least 2 years' programming experience in Python; well versed in using ML and NLP libraries such as scikit-learn, flair, transformers, etc.

Number : IRL_Project_4

Title : Automating Edge Disposition Design Life-cycle            

Description: The edge continuum is a spectrum of compute and allied resources, starting with a resource-rich cloud at one end and a resource-constrained population of devices at the other. There could be several stages of compute resources at various points between these two extremes.
Given this massive spectrum, it is a challenge to plan the necessary infrastructure support and software stacks across the continuum in an optimal manner, so as to ensure minimum management and orchestration overhead and maximum utilization and availability at all points of the continuum.
This proposal aims to address the above challenge in a systematic and automated manner.

Required Skills: Golang, Kubernetes

Number : IRL_Project_5

Title : Neural-ODE-based forecasting for efficient intervention modeling             

Description:  Time series data are timestamped observations of a process. A process represents a physical system, e.g., customer dynamics in a retail shop, the spread of disease in a population, the number of visits to a webpage, a stock market, a chemical plant, or readings from a health-monitoring wearable. Ordinary differential equations (ODEs) are an interpretable mathematical representation of processes: an ODE describes how the interactions between different process variables influence their time evolution. For most processes, the ODE representation is unknown.
Time series forecasting estimates the observed values at a future time point given current observations. Exogenous variables in time series forecasting are factors that influence the future evolution of the time-series observations without being affected by them. For example, precipitation (rainfall) affects umbrella sales, but not vice versa, so the precipitation forecast is an exogenous variable for the umbrella sales forecast. In a process control setup, these exogenous variables are the "process control variables". For example, in a distillation plant, the condensate reflux rate and boiler heat input are controlled to maintain the top product purity along with productivity; such control actions are called interventions. Interventions are infrequent in the observed data and often not well annotated.
Time series forecasting solutions are data-driven. These models learn approximate extrapolation functions to fill in future time points. In practice, there is no strict restriction on the form of the function a model learns. Due to the sparsity of intervention data, these models often fail to model interventions accurately. An ODE representation is universal and capable of representing the system behavior under both equilibrium conditions and intervention. In this project, we will attempt to derive the governing differential equation from observed data, most conservatively. This work will enable novel process insights and more reliable intervention modeling for time-series data.
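To make the idea of "deriving the governing equation from data" concrete, here is a minimal sketch under strong simplifying assumptions: the process is assumed to follow a first-order linear ODE dx/dt = -k*x + u(t), where u(t) is a known exogenous intervention and k is the unknown rate. The model form, the step intervention and the function names are all hypothetical; the project targets neural ODEs, which this toy least-squares fit does not implement.

```python
def step_intervention(t):
    """Hypothetical control input: switched on at t = 5."""
    return 1.0 if t >= 5.0 else 0.0


def simulate(rate, x0, control, dt, steps):
    """Forward-Euler integration of dx/dt = -rate * x + control(t)."""
    x = x0
    trajectory = [x]
    for i in range(steps):
        t = i * dt
        x = x + dt * (-rate * x + control(t))
        trajectory.append(x)
    return trajectory


def estimate_rate(trajectory, control, dt):
    """Least-squares estimate of the rate from finite differences.

    Rearranging the ODE gives (dx/dt - u) = -rate * x, so the rate is the
    negated slope of a regression through the origin.
    """
    numerator = 0.0
    denominator = 0.0
    for i in range(len(trajectory) - 1):
        x = trajectory[i]
        dxdt = (trajectory[i + 1] - x) / dt
        residual = dxdt - control(i * dt)
        numerator += x * residual
        denominator += x * x
    return -numerator / denominator


# Generate synthetic observations with a known rate, then recover it.
observations = simulate(rate=0.5, x0=0.0, control=step_intervention, dt=0.01, steps=2000)
rate_hat = estimate_rate(observations, step_intervention, dt=0.01)
print(observations[-1], rate_hat)
```

Before the intervention the state stays at its equilibrium of zero; afterwards it relaxes towards u/k = 2. Because the recovered equation includes the control term explicitly, it can be re-simulated under a different intervention, which is the property a purely extrapolating forecaster lacks.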

Required Skills: Python programming, hands-on deep learning experience (PyTorch, TensorFlow)

Number : IRL_Project_6

Title : Exploring self-supervision and cross-attention techniques for domain-generalizable time-series representation learning             

Description:  Considering the massive success of representation learning (like BERT) in the NLP domain, various studies are underway to test its effectiveness in the time-series domain. In this work, we plan to explore different transformer architectures and propose novel techniques for self-attention, cross-attention, masking, and positional embeddings that are contextualized to the time-series domain. We then plan to leverage these time-series fine-tuned transformers to build robust, self-supervised representation learning models that can be applied across datasets in a similar domain.

Required Skills: Deep learning, Machine Learning

Number : IRL_Project_7

Title : Low Latency Converged Data Processing Framework For Resource-Constrained Edge Computing             

Description:  Mobile edge computing is opening up the potential for a new class of low-latency, bandwidth-intensive "edge-native" applications such as connected vehicles, UAVs, augmented reality and 5G network analytics functions. Given the limited hardware resources available for computing at the edge, traditional cloud-native approaches, whose elasticity rests on the assumption of infinite resource pools, cannot be reapplied as is in the edge-native context. Further, several of these applications require dynamic adaptation of the data processing as user/device operating contexts change, as users/devices move about, as devices adapt to changing battery-life conditions, etc. Data processing frameworks today are not optimized for low-latency applications. Further, each domain addresses its data processing requirements at the application layer, resulting in high productivity costs in building edge-native apps and wasted resource overhead when multiple such applications are collocated in a common resource-constrained edge computing environment.

Required Skills: Big Data Frameworks, Database Query Engines, Stream Processing

Number : IRL_Project_8

Title : Investigating the applicability of complex deep learning models in time series forecasting             

Description:  There is a growing tendency to employ complex deep learning models (like Transformers) for time series forecasting. However, several benchmarks and research papers have shown degraded performance of these large models on multiple datasets and scenarios. On the other hand, simple and lightweight gradient boosting models like XGBoost and LightGBM have shown promising performance in various competitions with data from specific domains (such as retail data in the M5 competition). This raises some important questions: "In what scenarios are complex deep learning models preferred?" "When do gradient boosting models outperform deep learning models?" We plan to investigate these questions through extensive analysis and experimentation, and to help the research community with theoretical understanding and empirical evidence.

Required Skills: machine learning, deep neural networks, gradient boosting models, time series, Python, PyTorch, TensorFlow

Number : IRL_Project_9

Title : Explainable automated exploratory data analysis system that adapts itself based on user's interest             

Description: Exploratory data analysis is an important step of data preparation for AI. There has been a recent trend of building automated exploratory data analysis (AutoEDA) systems based on learning algorithms that can suggest to a user the best way to explore a new dataset. However, these systems do not take into account the interests of the user, which have been studied as an independent problem. The goal of this project is to build an algorithm that adapts its AutoEDA suggestions to the user's preferences. The algorithm should be explainable with respect to the actions that it suggests to the user.

Required Skills: Machine Learning, Python, AI

Number : IRL_Project_10

Title : Reinforcement Learning for Constrained Process Modelling             

Description: Industrial manufacturing processes like steel production are complex in nature, with non-linear interactions among different variables. Optimisation of these processes involves several process control choices over a long horizon and constraints on the allowed process state variables. It is further complicated by multiple objectives with respect to resource consumption, production efficiency, wastage, etc.

Required Skills: Reinforcement Learning, Machine Learning and Optimisation with strong programming skills.

Number : IRL_Project_11

Title : Human-in-the-Loop for AIOPS             

Description: An Artificial Intelligence (AI) lifecycle does not end when the first model is deployed. An AI model must continuously improve over time by learning from the mistakes it makes. The model evolves with each iteration of the feedback loop. In this project, we aim to improve the anomaly detector model for both logs and metrics by taking in feedback from the user. The key tasks here are:
1. Identify the samples for feedback
2. Translate the feedback from higher-level models to individual features
3. Incorporate the feedback to improve the overall pipeline
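One common way to approach the first task is uncertainty sampling from active learning: ask the user about the samples the detector is least sure of. The sketch below assumes the detector emits an anomaly probability per sample; the sample IDs, the scores and the helper `select_for_feedback` are hypothetical, and the project may choose a different selection strategy.

```python
def select_for_feedback(scored_samples, budget):
    """Pick the `budget` samples whose anomaly probability is closest to 0.5.

    `scored_samples` is a list of (sample_id, anomaly_probability) pairs.
    """
    ranked = sorted(scored_samples, key=lambda item: abs(item[1] - 0.5))
    return [sample_id for sample_id, _ in ranked[:budget]]


# Hypothetical detector outputs for a mix of log and metric samples.
scored = [
    ("log-001", 0.97),
    ("log-002", 0.52),
    ("metric-007", 0.08),
    ("log-003", 0.41),
    ("metric-011", 0.63),
]

selected = select_for_feedback(scored, budget=2)
print(selected)
```

Samples with scores near 0 or 1 are skipped because the detector is already confident about them, so user labels are spent where they are likely to change the model most.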

Required Skills: Active Learning, Python, ML

Number : IRL_Project_12

Title : Numerical Question Answering over the hybrid context of table and text             

Description: We aim to build a system for answering questions that require numerical reasoning over a hybrid context of tables and text.

Required Skills: Python, git, question answering, transformers, PyTorch, BERT, large language models, NLP

Number : IRL_Project_13

Title : Learning tabular data representations via language models for semantic automation tasks in enterprise data lakes            

Description: The aim of this work is to learn representations of tabular data using both the data values and the metadata of the tables (meta features), capturing the complete semantic context of the tabular data by customizing or adapting a large language model (e.g., a BERT-style transformer model). The model will be evaluated on enterprise-scale datasets for a select set of semantic automation tasks such as duplicate dataset identification, dataset clustering, column clustering and dataset integration. Additionally, the work aims to explore the applicability of foundation models (large pre-trained language models) like GPT-3 and Bloom to the given semantic automation tasks.
The problem is quite interesting as it requires application of state-of-the-art NLP techniques (specifically semantic analysis) to model the complex structured tabular data and metadata. In addition to the structure, the scale of the data, generalizability requirements and the pertinent data quality issues in an enterprise data lake setting pose compelling challenges.
Background of the work:
Enterprises have a lot of datasets collected in data lakes. Deriving insights from a data lake is a very challenging task for a variety of reasons, such as the scale of data, diversity of data, data quality issues and the evolving nature of data. A lot of human effort is needed for discovering, structuring, cleaning, enriching and validating datasets from lakes. Semantic automation aims to alleviate this problem by automating various tasks using deep semantic knowledge. Some of the tasks that have been explored on data lake tables (datasets) include table augmentation, table retrieval, type classification and table search, using both ML and non-ML based approaches. Within ML-based approaches, there have been some efforts to use language models from the NLP space to learn table data representations and use them for specific downstream tabular tasks. But the existing approaches have many limitations, such as a) limited applicability to select downstream tasks, b) challenges with handling numerical data, c) suitability for smaller web tables but not for large-scale enterprise tables, and d) model complexity. Datasets in a data lake are ingested by people in different business roles at different times, leading to data quality issues, duplicates, etc., which can severely dent model performance, making it a very challenging problem.

Required Skills: NLP, Deep Learning (Transformers), PyTorch / TensorFlow, Python

Number : IRL_Project_14

Title : Natural Language Interface to GraphQL             

Description: GraphQL is emerging as a widely adopted industry standard for secure role-based data access and is preferred by multiple IBM BUs (WO and Maximo are a few examples). A successful NL interface over GraphQL will cater to many upcoming needs both within and outside IBM.

Required Skills: NLP, broadly AI/ML.

Number : IRL_Project_15

Title : Counterfactual Reasoning to analyze the impact of IT metrics on Business KPIs

Description: Enterprise business processes are complex and involve the execution of various process steps. These process steps are supported by underlying IT systems and services, which are often shared across process steps. Process owners often need to understand the impact of errors in the IT systems on business process performance. It is often non-trivial to quantify the improvement from an IT or process change without implementing it or conducting randomized controlled trials. In several cases, the cost and time for implementing and evaluating the benefits of these changes are high. To address this problem, we want to explore a principled framework that formally codifies existing cause-effect assumptions about the business and IT systems and their dependencies, controls for confounding, and answers "what if" questions.
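To illustrate what "controlling for confounding" means here, the sketch below applies a simple backdoor adjustment (stratifying on a confounder) to made-up summary data. The scenario, in which peak load both raises the chance of an IT error and slows the business process, is an assumption for illustration only, as are all counts, means and names.

```python
# Hypothetical summary table, keyed by (peak_load, it_error):
# each entry holds the number of process runs and their mean processing time
# in minutes. Peak load is the assumed confounder.
strata = {
    (0, 0): {"n": 80, "mean_minutes": 10.0},
    (0, 1): {"n": 20, "mean_minutes": 12.0},
    (1, 0): {"n": 20, "mean_minutes": 15.0},
    (1, 1): {"n": 80, "mean_minutes": 17.0},
}


def naive_effect(table):
    """Difference in mean KPI between runs with and without an IT error."""
    def mean_for(error):
        rows = [v for (_, e), v in table.items() if e == error]
        total = sum(r["n"] * r["mean_minutes"] for r in rows)
        return total / sum(r["n"] for r in rows)
    return mean_for(1) - mean_for(0)


def adjusted_effect(table):
    """Backdoor adjustment: average the per-stratum effect over P(peak_load)."""
    total_runs = sum(v["n"] for v in table.values())
    effect = 0.0
    for load in (0, 1):
        runs_in_stratum = table[(load, 0)]["n"] + table[(load, 1)]["n"]
        weight = runs_in_stratum / total_runs
        difference = table[(load, 1)]["mean_minutes"] - table[(load, 0)]["mean_minutes"]
        effect += weight * difference
    return effect


naive = naive_effect(strata)
adjusted = adjusted_effect(strata)
print(naive, adjusted)
```

In this toy table the naive comparison attributes 5 extra minutes to IT errors, while the adjusted estimate is 2 minutes, because errors occur mostly under peak load, which slows the process on its own.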

Required Skills: Causal Analysis, Machine Learning, Deep Learning, Python

Number : IRL_Project_16

Title : Evolving Documentation Generation for GitHub Repositories             

Description: GitHub is one of the largest and most popular repository hosting services today, with about 14 million users and more than 54 million repositories as of March 2017. This makes it an excellent platform for finding projects that developers are interested in exploring. Documentation helps keep track of all aspects of an application and improves the quality of the software product. Its main purposes are to support development, maintenance and knowledge transfer to other developers. However, the majority of projects on GitHub lack well-defined documentation or READMEs, which makes them very difficult for newcomers to use. We explore the possibility of generating a brief description of these repositories (in the form of a README) by automatically extracting features from issue tracking systems, PRs, commits, developer forums, etc. This documentation will evolve over time as new issues and commits are pushed to GitHub.

Required Skills: Natural Language Processing, Python, PyTorch, TensorFlow, Deep Learning

Number : IRL_Project_17

Title : AIG 360: A toolkit to evaluate data and ML workflows for sustainability.             

Description: Workflows are ubiquitous in industry. IBM has a number of workflow products, namely data workflows in IBM CP4D, ML workflows in AI Openscale, business process workflows in CP4BA, and task workflows in Watson Orchestrate.
Optimizing the end-to-end workflow will lead to sustained gains in terms of both cost savings and reduced energy consumption. Evaluating the end-to-end cost and carbon footprint of a data workflow is the first step towards building sustainable workflows. While some solutions, such as CodeCarbon [1] and Strubell et al., 2019 [2], exist to evaluate the carbon footprint of individual programs, CPU emissions, and AI models, we want to improve the state of the art in evaluating and improving the efficiency of a production workflow end to end.
At IBM, we're solving different problems in the area of Green AI, such as training energy-efficient models and amortizing the cost of model training by using foundation models, among other things. In this internship, we'll create a toolkit to evaluate different workflows from a cost, energy and sustainability perspective, and estimate the benefits of using Green workflows for wider adoption in the industry.
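As a back-of-the-envelope illustration of end-to-end evaluation, the sketch below sums the energy of each workflow step and converts it to emissions with a single grid carbon intensity. The step names, runtimes, power draws and the intensity of 0.4 kg CO2 per kWh are made-up placeholders, and the `Step` class is hypothetical; a real toolkit would measure these values rather than assume them.

```python
from dataclasses import dataclass


@dataclass
class Step:
    """One stage of a workflow with assumed (not measured) resource usage."""
    name: str
    runtime_hours: float
    avg_power_watts: float


def energy_kwh(step):
    """Energy used by a step: hours times watts, converted to kilowatt-hours."""
    return step.runtime_hours * step.avg_power_watts / 1000.0


def workflow_footprint(steps, grid_kg_co2_per_kwh):
    """Return (total energy in kWh, total emissions in kg CO2) for a workflow."""
    total_kwh = sum(energy_kwh(step) for step in steps)
    return total_kwh, total_kwh * grid_kg_co2_per_kwh


# Hypothetical three-step ML workflow.
workflow = [
    Step("ingest", runtime_hours=0.5, avg_power_watts=200.0),
    Step("train", runtime_hours=4.0, avg_power_watts=1500.0),
    Step("evaluate", runtime_hours=0.25, avg_power_watts=400.0),
]

total_kwh, total_kg_co2 = workflow_footprint(workflow, grid_kg_co2_per_kwh=0.4)
print(total_kwh, total_kg_co2)
```

Breaking the total down per step shows where optimization would pay off; here the training step accounts for nearly all of the energy.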

Required Skills: Python, familiarity with the machine learning lifecycle, data pipelines, REST APIs.

Number : IRL_Project_18

Title : Automated exploratory data analysis for relational data using machine learning techniques             

Description: Exploratory data analysis (EDA) deals with examining and fixing issues in a new dataset so that it can be used for downstream machine learning tasks. It is usually a very time-consuming step in the data science pipeline. The problem becomes even more challenging when the input is relational data consisting of multiple tables that have relationships among them. These tables need to be optimally joined together to create good features that can be used for downstream ML tasks.
The goal of this project is to build a system that can automatically choose operators from the set of EDA and data quality operators to apply to the relational dataset, in order to show a view of the dataset that has one of the following two properties: (a) the selected view should show some interesting characteristics of the data that yield good insights about it, (b) the selected view should expose a data issue (such as missing values, unimportant tables, bias, granularity mismatch between tables, etc.).

Required Skills:  (Required) Hands-on experience with machine learning/deep learning algorithms, Python programming, (Desired) experience working with any relational database such as DB2, Oracle, etc.

Number : IRL_Project_19

Title : Hatespeech Detection and Mitigation using AI; Ethics in AI             

Description: With the multiplication of social media platforms, which offer anonymity, easy access, online community formation and online debate, hate speech detection and tracking has become a growing challenge for society, individuals, policy-makers and researchers. Despite efforts to leverage automatic techniques for detection and monitoring, their performance is still far from satisfactory, which calls for further research on the issue. Automatic detection and mitigation of hate speech (HS) suffers from many issues, including poor cross-dataset generalisability, a lack of good-quality training datasets, unintended bias against minority groups, and sampling bias, to name a few. This research area is active and timely, and needs research effort on both fronts: coming up with better detectors and with smarter methods for detoxifying already deployed models for language, images or both.
In this work, we will investigate one of the following directions:
1. Look into ways to develop a common framework for researchers, in which HS concepts follow common universal definitions and guidelines, which will be helpful for domain adaptation. Improve the generalizability of HS classifiers across datasets, languages (multilingual) and concepts. For example, an HS classifier trained on hateful data should perform well on abuse data.
2. HS detection also suffers from a lack of properly balanced datasets. There are several small datasets whose quality suffers because of sampling bias, identity bias or improper annotation. Develop new methods or algorithms for artificial HS dataset creation and expand the semantic coverage of existing datasets to help create a large-scale balanced dataset. Also, look into ways to use deep learning to create NLP resources that are currently developed with non-deep-learning methods and that will benefit HS detection.
3. Look into the problem of unintended bias, which can be studied by investigating the correlation between hate and bias in automatic HS detection methods or detoxification approaches. Here we can explore methods that make the model less biased against specific terms or language styles, from the perspective of either the training data or the training objective.
4. Lastly, tackle the problem of hate by detoxifying language models. Look into methods that work post hoc, during training, or at the fine-tuning level, and explore their merits over the data-centric approaches above, since post-hoc methods would be more environment-friendly.

Required Skills: PyTorch/TensorFlow, Deep Learning, NLP
Preferred: Knowledge of transformer architectures and the Hugging Face API is a plus

Number : IRL_Project_20

Title : Generating Faithful Responses Using Large Language Models             

Description: Large Language Models (LLMs) can be used directly (without fine-tuning) as conversational agents. Given a document and a dialog history as the prompt, LLMs can generate a response that is appropriate for the dialog history and uses the information present in the input document. A major concern when using LLMs as conversational agents is ensuring that generated responses are faithful to the content of the documents fed as prompts. This project aims to develop techniques that prevent LLMs from generating responses that are not faithful to the provided document.

Required Skills: PyTorch, Transformers, Prompt Engineering, In-Context Learning

Number : IRL_Project_21

Title : Prompt Tuning in Large Language Models with limited compute resources             

Description: Large language models (>100B parameters) are excellent few-shot learners that use in-context learning to generalize from a few examples of a given task. However, when one has access to many examples of a task, these models need to be fine-tuned. This often requires access to several compute nodes with multiple GPUs and is hence quite expensive.

Required Skills: Deep Learning (esp. transformers)

Number : IRL_Project_22

Title : Using Large Documents as Input to Large Language Models for Conversational Application

Description: Large Language Models (LLMs) can be used directly (without fine-tuning) as conversational agents. Given a document and a dialog history as the prompt, LLMs can generate a response that is appropriate for the dialog history and uses the information present in the input document. One of the major limitations of using LLMs as document-grounded conversational engines is their limited input size: we cannot provide documents that are longer than the supported input size (typically 2048 sub-words).
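A simple baseline for working around the input limit is to split the document into overlapping windows that each fit the model. The sketch below assumes the document is already tokenized; the placeholder tokens, the overlap of 256 and the helper `chunk_tokens` are hypothetical, and the project may pursue other approaches (for example retrieval or summarization) instead.

```python
def chunk_tokens(tokens, max_len, overlap):
    """Split a token list into windows of at most `max_len` tokens.

    Consecutive windows share `overlap` tokens so that sentences cut at a
    boundary still appear whole in one of the chunks.
    """
    if not 0 <= overlap < max_len:
        raise ValueError("overlap must satisfy 0 <= overlap < max_len")
    step = max_len - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + max_len])
        if start + max_len >= len(tokens):
            break  # the last window already reaches the end of the document
    return chunks


# Placeholder tokens standing in for the sub-words of a long document.
document_tokens = [f"w{i}" for i in range(5000)]
chunks = chunk_tokens(document_tokens, max_len=2048, overlap=256)
print(len(chunks), [len(chunk) for chunk in chunks])
```

Each chunk can then be paired with the dialog history in a separate prompt, at the cost of one model call per chunk and of leaving room in the window for the history itself.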

Required Skills: Deep Learning (esp. transformers)

Number : IRL_Project_23

Title : eBPF based Measurement and Network Acceleration             

Description: eBPF is an emerging Linux kernel technology that is widely used by hyperscalers for non-intrusive monitoring and for network acceleration by short-circuiting paths in the kernel network stack. In the proposed project, we plan to leverage eBPF for run-time monitoring of system state and for network acceleration in the context of a Kubernetes CNI (Container Network Interface) designed particularly for resource-constrained edge deployments.

Required Skills: eBPF, Kernel, Systems building, performance measurement

Number : IRL_Project_24

Title : Hamiltonian Evolution             

Description: One of the important problems in quantum mechanics is to determine how a state evolves in time under Schrödinger's equation. One of the most common approaches is the Trotterization technique. However, many others, such as Taylor series, quantum walks and signal-processing-based methods, have also been proposed, with different running times (see the Wikipedia article on Hamiltonian simulation). The project involves implementing these techniques in Qiskit, benchmarking them and coming up with other novel algorithms for the same problem.
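For intuition about Trotterization, the sketch below compares a first-order Trotter product with the exact evolution for a single-qubit toy Hamiltonian H = X + Z (Pauli matrices), chosen because exp(-iHt) has a closed form. It uses plain Python rather than Qiskit, and the Hamiltonian and function names are assumptions for illustration, not part of the project.

```python
import math


def matmul(a, b):
    """Product of two 2x2 complex matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]


def exp_x(theta):
    """exp(-i * theta * X) = cos(theta) I - i sin(theta) X."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -1j * s], [-1j * s, c]]


def exp_z(theta):
    """exp(-i * theta * Z) = diag(e^{-i theta}, e^{+i theta})."""
    c, s = math.cos(theta), math.sin(theta)
    return [[complex(c, -s), 0], [0, complex(c, s)]]


def exact_evolution(t):
    """exp(-i (X + Z) t), using (X + Z)^2 = 2 I."""
    w = math.sqrt(2.0)
    c, s = math.cos(w * t), math.sin(w * t)
    return [[c - 1j * s / w, -1j * s / w], [-1j * s / w, c + 1j * s / w]]


def trotter_evolution(t, n):
    """First-order Trotter product: (exp(-iXt/n) exp(-iZt/n))^n."""
    step = matmul(exp_x(t / n), exp_z(t / n))
    result = [[1, 0], [0, 1]]
    for _ in range(n):
        result = matmul(result, step)
    return result


def max_error(t, n):
    """Largest entrywise deviation of the Trotter product from the exact result."""
    exact = exact_evolution(t)
    approx = trotter_evolution(t, n)
    return max(abs(exact[i][j] - approx[i][j]) for i in range(2) for j in range(2))


for steps in (1, 10, 100, 1000):
    print(steps, max_error(1.0, steps))
```

The error shrinks roughly in proportion to 1/n because X and Z do not commute; benchmarking how quickly different methods reduce this error for a given gate budget is the kind of comparison the project describes.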

Required Skills: Qiskit, Quantum mechanics basics, Quantum simulation basics

Number : IRL_Project_25

Title : Logical Foundation Models             

Description: As part of this project, we will explore some aspects of logical foundation models, such as building models that induce logical rules from data and semantically parsing text into logical forms. We will try to solve the LogicQA task using logical foundation models, where the task is to convert the hypothesis into logical statements, check whether one of the 4 premises (converted from text to logical form) holds, and select the option that can be deduced from the hypothesis.

Required Skills: Deep learning, fundamentals of AI, Symbolic AI, NLP, Python, PyTorch

Number : IRL_Project_26

Title : Generating Test Data in Python/Java for Java/Cobol programs             

Description: Unit test generation is an important and challenging problem in software engineering because of the large number of possibilities involved. We have a dataset of Java programs with natural language descriptions of the problems and parallel COBOL code; one test input and output is also available for each problem. The main idea is to automatically generate test data (input/output) for Java/COBOL code.

Required Skills: Python, PyTorch, Deep Learning Architecture, NLP, Java

Number : IRL_Project_27

Title : Unsupervised Program Repair for Programming Language Translation             

Description: Migrating codebases from programming languages like COBOL to newer languages like Java and .NET is of significant business value to many companies. One way to do this is to write a rule-based system that translates code from one programming language to another. However, this approach is quite expensive, as it requires programmers who are expert in both the source and target languages and requires writing rules for every pair of source and target languages. The success of OpenAI Codex and DeepMind's AlphaCode has shown that Transformer-based large language models (LLMs) can be used to translate between different programming languages. In our experiments, we have observed that these models generate incorrect or non-compilable programs due to a lack of understanding of programming language semantics. In this project, we aim to design an unsupervised deep neural program repair module that uses programming language semantics to correct the generated programs.

Required Skills: Courses in Deep Learning and NLP, PyTorch and excellent programming skills

Number : IRL_Project_28

Title : Intent based learning for Data Exploration             

Description: Data Fabric and Data Marketplace are the most recent domains where dataset search, graph neural networks, and recommender systems are being tried. While Data Fabric is a framework to handle several data related tasks in a modern data stack, a data marketplace is to showcase datasets.
In such a data marketplace, business users want a data exploration capability to explore both a single dataset and multiple datasets. They may want to learn new insights about their data or find relevant data to perform their tasks. This data exploration research is about generating such insights from data and/or generating data views. We want to use semantic and deep learning techniques to tell users what is interesting in their data and/or generate data from multiple datasets for the users' intended downstream tasks.
The input to our system is a dataset upload or a user query. The output can be either insights from the uploaded dataset ("did you know this about your data?") or data created from one or multiple datasets for model training, business analytics and other use cases. While some exploratory data analysis techniques exist, like the reinforcement learning approach by [Milo et al, 2018], we want to explore intent-based learning (from both the current query and historical queries) and leverage knowledge graphs and large language models (LLMs) to improve the state of the art in the above-mentioned data exploration problems.

Required Skills: Python, PyTorch or similar libraries, familiarity with information retrieval techniques


Number : IRL_Project_29

Title : Enhancing Multi Cluster Networking with Extensible Internet             

Description: Multi-cluster networking across clusters running in different domains, such as multiple zones or multiple clouds, is becoming an important requirement in the world of hybrid-cloud transformation. Unfortunately, existing extensions of single-cluster networking solutions fall short of achieving truly independent multi-cluster deployment setups with a single-pane-of-glass view for controlling, managing and observing applications and their networking interactions. IBM has been building a solution for this for enterprise clients. Recently, there has also been interest in the Extensible Internet (EI), a novel backward-compatible approach to making new internet services available on the existing internet infrastructure such that older services are not impacted, while clients can use new services without being tied to particular clouds or providers.
In this internship, the aim is to enable easier and more transparent cross-cluster networking by leveraging the ideas of EI in the end switches/routers that connect clusters to the rest of the data center, in a way that is agnostic of any technology or provider so as to be usable by the widest range of end clusters.

Required Skills: Systems, Networking, Golang, Kubernetes CNI

Number : IRL_Project_30

Title : Scalable, Performance Efficient, Modular Observability pipeline for Edge Computing platforms             

Description: Edge computing is a distributed computing paradigm that tries to bring computation, storage and processing closer to the source of data. There are many emerging use cases, such as Industry 4.0 smart manufacturing, IoT, smart retail, automobile services and telco 5G, where the response time or latency of any action is critical. In this project, we investigate how to architect and build the key building blocks for system health monitoring that can handle auto-scaling, self-healing, and easy and efficient monitoring of the underlying distributed workloads running on edge devices connected to a near-edge platform.

Required Skills: Cloud Technologies e.g. Container, Docker, Kubernetes
Monitoring Technologies e.g. Prometheus, Grafana SW stacks
Exposure to Golang, Python programming
Excellent Programming and Problem solving aptitude

Number : IRL_Project_31

Title : AI assisted techniques to generate micro-services with explanations             

Description: At IBM Research Labs (India) we have developed AI techniques to partition old legacy monoliths into microservices (link) by reducing this problem to an unsupervised graph clustering task. From the quantitative and qualitative analysis of our solution, we observe two key challenges that, if addressed, will enable wider adoption of our solution. Challenge (1) - Though our model is optimised for many metrics, there is still a need to strike a fine balance between those metrics that exhibit a visible tradeoff. Challenge (2) - There is a need to provide explanations using explainable AI (XAI) for the recommendations.
What has been accomplished ?
A) Graph Neural Networks (GNN) for Graph Clustering : We currently employ a GNN that performs an unsupervised learning procedure. The GNN simultaneously learns the structure of the graph, identifies outliers in the graph and performs clustering on the graph. For more details, please refer to our AAAI'21 work. Most recently we have also published an improved version of our model at IJCAI'2022 that supports heterogeneous software representations and real-world constraints to create better quality micro-services.
B) A new curated dataset : The dataset contains features extracted from readme pages of public GitHub repositories. We will leverage this dataset along with our developed techniques to get explanations about the generated micro-services.
Approach to solve the problem
There are multiple ways to approach this problem. Examples include using GNN explainability models such as GNNExplainer, using hierarchical models to explain the functional slices found in each generated micro-service, and using reinforcement learning (RL) to generate explainable clusters.

Required Skills: Graph Neural Network, Natural Language Processing, Pytorch Framework, Python, Reinforcement Learning(optional)

Number : IRL_Project_32

Title : Carbon-Aware User-Allocation Policy Design with Incentivization for Multi-Access Edge Computing             

Description: With the rapid proliferation of mobile technologies, the number of connected devices has witnessed a manifold increase. As a consequence, the number of application services offering a variety of features in such a connected world has also skyrocketed in the last decade. These connected devices, envisioned to percolate towards an Internet-of-Things (IoT), together with the myriad of application service offerings, pose new challenges in effective resource management. Real-time application services such as Augmented Reality (AR), Virtual Reality (VR), on-the-fly object recognition for autonomous vehicles and so forth require high computational capacities. The connected devices are, however, limited by their computational resource capacity and battery life. Executing resource-intensive tasks locally on the devices is associated with high processing times as well as prolonged battery usage. To mitigate the effects of executing such complex tasks locally, executing tasks over cloud services serves as an effective complementary mechanism to enhance the Quality-of-Service (QoS) metrics of application services. However, such a mechanism does not necessarily conform to the QoS requirements of latency-sensitive services. Multi-Access Edge Computing (MEC) is a promising new paradigm that allows an application service provider to deploy its application services packaged as lightweight containers on MEC servers located near base stations. When IoT devices invoke these services, tasks can either be executed locally on the devices or routed to proximate MEC servers to curtail the high latencies of cloud communication networks.
Motivation and Objectives of this Work: Reducing carbon emissions in cloud and data center operations, including edge and MEC locations, is a growing priority for many organizations and users. MEC servers are not as resource-equipped as their cloud server counterparts, i.e., it is not viable to host all services at all edge servers. This is aggravated by the fact that users accessing these services are typically mobile, with variable service invocation rates. The workloads deployed can have diverse QoS and Quality-of-Experience (QoE) requirements, ranging from real-time applications to delay-tolerant ones. Additionally, MEC servers are heterogeneous, with varying system configurations leading to widely varying carbon footprints (CFP) and task execution times depending on the user allocation policy. These considerations necessitate developing mechanisms that can reduce the CFP at MEC locations while also being responsive to user preferences. The objective of this project is to synthesize user allocation policies in multi-access edge computing from a carbon-aware perspective while ensuring conformance to user QoE requirements.
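As a rough illustration of the allocation problem described above, the following Python sketch assigns users to MEC servers so that total carbon cost is minimized while each user's latency bound and each server's capacity are respected. The server names, carbon figures, latencies and the exhaustive search are all illustrative assumptions and not part of the project; a realistic formulation would more likely use an Integer Linear Programming solver.

```python
from itertools import product

# Hypothetical MEC servers: capacity (users) and carbon cost per served user (gCO2).
SERVERS = {
    "edge_a": {"capacity": 2, "carbon": 5.0},
    "edge_b": {"capacity": 2, "carbon": 2.0},
}
# Hypothetical per-user latency (ms) to each server and the user's QoE bound (ms).
USERS = {
    "u1": {"latency": {"edge_a": 10, "edge_b": 40}, "max_latency": 20},
    "u2": {"latency": {"edge_a": 15, "edge_b": 18}, "max_latency": 20},
    "u3": {"latency": {"edge_a": 30, "edge_b": 12}, "max_latency": 35},
}


def allocate(users, servers):
    """Exhaustively search for the feasible allocation with minimum total carbon."""
    names = list(users)
    best, best_cost = None, float("inf")
    for choice in product(servers, repeat=len(names)):
        load = {s: 0 for s in servers}
        feasible = True
        for user, server in zip(names, choice):
            load[server] += 1
            # Reject allocations that violate the user's latency (QoE) bound.
            if users[user]["latency"][server] > users[user]["max_latency"]:
                feasible = False
                break
        # Reject allocations that exceed any server's capacity.
        if not feasible or any(load[s] > servers[s]["capacity"] for s in servers):
            continue
        cost = sum(servers[s]["carbon"] for s in choice)
        if cost < best_cost:
            best, best_cost = dict(zip(names, choice)), cost
    return best, best_cost


allocation, carbon = allocate(USERS, SERVERS)
```

Exhaustive search is only workable for a handful of users; it is used here because it makes the constraints and the objective easy to read.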

Required Skills: Programming in Python, Basic understanding of Integer Linear Programming and Markov Chains is a plus but not essential.

Number : IRL_Project_33

Title : AI assisted techniques to extract business rules from legacy code base             

Description : Motivation: Several businesses, especially in the core banking and finance domain, still rely on the prowess of Mainframe systems. Most of the applications on Mainframes are written in COBOL running on native Z platforms. Cost of resources, the desire to leverage cloud platforms and skill availability have been driving much of the modernization effort for these mainframe applications. Modernizing such legacy systems involves multiple aspects, and one of the most challenging has been to separate out the business rules embedded in the code and create a design document that can be implemented in a more modern manner. The aspects involved include:
1. Code intermediate representation
2. Standardized definition of business rules in the context of the given problem
3. Code analysis to perform business rule extraction
4. Creating a design document out of the rules which can be used by developers to write code
5. Validating correctness and completeness of the extracted rules
What has been accomplished?
Extracting business rules from a complex, large monolithic piece of code where the different flows have intermingled over years of development is not an easy task. However, small dents have been made by various industry and academic groups to make this task achievable. One of the more recent works is COBREX, an open source tool for which a demo and a detailed description are publicly available. The tool finds business variables and then identifies context statements related to the business variables using limited code views. There is an opportunity to extend this work by applying other code views that capture more semantic properties of the code, and to create a usable business rule extraction tool for large open source communities.

Required Skills: Program analysis, Natural Language Processing,  Neural Networks, Pytorch Framework, Python

Number : IRL_Project_34

Title : Building Fingerprint repository using effective fault selection for AIOps             

Description : A fingerprint represents the state of an application at a particular instant of time. It is prepared by consolidating information artefacts such as log data, metric data, traces, entities and application topology. Fingerprinting is used for several AIOps processes in the closed-loop automation and remediation pipeline, such as anomaly detection, fault localization and incident remediation. Generally, models are trained on a repository of fingerprints corresponding to the healthy and unhealthy states of a system. One needs to prepare a fingerprint for each type of fault. Given the vast array of faults that can be injected, it is crucial to intelligently pick faults that provide comprehensive coverage of fingerprints.
However, there are several challenges in obtaining or building a quality fingerprint repository corresponding to the exponentially growing fault space. Specifically,
1. State space exploration: A large distributed application can have infinitely many states. It is not practically possible to obtain and work with all possible fingerprints corresponding to all the states. Therefore, a representative subset of the state space is required.
2. Optimized representative fingerprint subset: One way to approach this problem is to randomly sample a subset of states. However, doing so might not result in quality coverage, as some of the important states could be left out. Therefore, the goal is to select a subset that contains the most probable states and is a near-equivalent representation of the overall system states.
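One possible alternative to random sampling, shown here purely as a sketch, is greedy farthest-point (k-center) selection: start from one fingerprint and repeatedly add the fingerprint that is farthest from everything chosen so far. The 2-D vectors and the Euclidean distance are assumptions made for illustration; the project does not prescribe a fingerprint encoding or a selection method, and this heuristic targets spread of coverage rather than state probability.

```python
import math


def farthest_point_subset(fingerprints, k):
    """Greedy k-center selection: repeatedly add the fingerprint farthest from those chosen."""
    chosen = [0]  # start from the first fingerprint (an arbitrary choice)
    # Distance from every fingerprint to its nearest chosen representative.
    nearest = [math.dist(fp, fingerprints[0]) for fp in fingerprints]
    while len(chosen) < min(k, len(fingerprints)):
        idx = max(range(len(fingerprints)), key=nearest.__getitem__)
        chosen.append(idx)
        nearest = [
            min(d, math.dist(fp, fingerprints[idx]))
            for d, fp in zip(nearest, fingerprints)
        ]
    return chosen


# Hypothetical 2-D fingerprint vectors forming three loose groups of system states.
FINGERPRINTS = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0), (0.0, 9.0)]
representatives = farthest_point_subset(FINGERPRINTS, 3)
```

With these toy vectors the selection picks one index from each of the three groups, which random sampling of three points would not guarantee.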

Required Skills: DevOps, cloud skills, Kubernetes

Number : IRL_Project_35

Title : Artifact refactoring capability for Konveyor Move2Kube             

Description: Hybrid cloud enabled with a Kubernetes engine offers a high degree of automation in application orchestration and maintenance, along with interoperability between multiple cloud providers. Applications deployed on non-hybrid cloud platforms require changes to configuration and code to be compatible with the Hybrid cloud environment. Performing such changes at scale for applications with hundreds of components and thousands of artifacts is expensive in terms of time and skill.
Konveyor Move2Kube is a CNCF sandbox tool which automates this transformation of applications to Hybrid cloud. Move2Kube provides a framework for generating Infrastructure as Code (IaC) artifacts (like Dockerfiles, Kubernetes YAMLs, Helm charts, Tekton pipelines, etc.). However, there are use-cases (e.g. Netflix OSS, Ray framework) that require the input code and configuration files to be refactored such that they are suitable for deployment to Kubernetes, OpenShift, etc. Hence, there is a need for a domain-specific language that can express the refactoring logic in a programming-language-agnostic manner. This proposal addresses the above need and will involve working with the core team of Move2Kube maintainers; the solution will be a code contribution to the Move2Kube code base.
Proposed Deliverables:
Feasibility study on existing meta-programming languages.
Design a new language or extensions/modifications to current ones to suit Move2Kube's requirements.
Develop an engine that understands and can execute the domain-specific refactoring language scripts.                 

Required Skills: Programming language design

Number : IRL_Project_36

Title : Chaos Testing using Deep Learning             

Description : Chaos Engineering is being used extensively to test system resiliency. Although a lot of chaos engineering tools like Chaos Toolkit, Litmus, Chaos Mesh, etc. exist, most of the techniques involve users knowing the system and manually creating chaos experiments using the toolkit. This narrows the scope of the testing done where some complicated scenarios or faults are missed. There is a requirement to use AI techniques to find and test faults in the system, explore bottlenecks and dependencies in the system, and also recommend improvements based on the fault analysis. This internship will investigate and develop code related to the following modules:
- Develop novel deep learning techniques using reinforcement-learning methods (e.g. Q-networks) to optimally find failure-causing faults from a very large fault set. The goal is to submit a publication based on this work and develop code that can be integrated into existing toolkits, such as IGNITE dSTK.
- Develop a recommendation system for improvements based on the fault analysis. This involves using the results and logs of fault injection. For example, the system may recommend increasing memory or bandwidth for a particular pod or adding a backup for a pod.
- Develop techniques to optimally select a subset of faults from the available fault space whose injection is equivalent to injecting the original fault set.
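The last module above can be sketched as a set-cover problem, assuming each fault is already known to trigger a set of observable failure modes. The greedy heuristic below is a swapped-in classical technique, not the deep learning approach the project proposes, and the fault names and failure modes are hypothetical.

```python
def greedy_fault_cover(fault_effects):
    """Greedy set cover: pick faults until every observed failure mode is covered."""
    uncovered = set().union(*fault_effects.values())
    selected = []
    while uncovered:
        # Pick the fault that triggers the most still-uncovered failure modes;
        # sorting the names makes tie-breaking deterministic.
        fault = max(sorted(fault_effects), key=lambda f: len(fault_effects[f] & uncovered))
        selected.append(fault)
        uncovered -= fault_effects[fault]
    return selected


# Hypothetical mapping from injectable faults to the failure modes they trigger.
FAULT_EFFECTS = {
    "pod_kill": {"timeout", "5xx"},
    "cpu_stress": {"timeout"},
    "network_delay": {"timeout", "retry_storm"},
    "disk_fill": {"5xx", "data_loss"},
}
subset = greedy_fault_cover(FAULT_EFFECTS)
```

Here two of the four faults reproduce every failure mode of the full set. In practice the mapping from faults to effects is unknown up front, which is where learned exploration would come in.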

Required Skills: Tensorflow, Reinforcement learning, Deep Learning

Number : IRL_Project_37

Title : AI and Data assisted automation for legacy application testing             

Description: Legacy applications often lack well-documented requirements and other technical details that are consistent with the current code base of the application. Therefore, generating test cases directly from the code is often the only reliable option. We propose an AI and data-driven approach for generating tests. Key entities related to the application are extracted from the source code and represented as a knowledge graph. The control flow, data flow and functional slices are then extracted for generating test cases. AI and data analytics techniques are used to determine the input and target variables for test cases. AI techniques are also used to associate the variables with the descriptive text in the program and UI screens (variable names in legacy code are often not very meaningful).

Required Skills: Program analysis, Information extraction, Knowledge graph, Databases

Number : IRL_Project_38

Title : Unified Representation of Observability data for Continuous Health Check             

Description: It is important to continuously measure an IT system's health and look for unusual behaviour in order to maintain an optimal user experience. The system health can be gauged from observability data consisting of various modalities such as logs, metrics, and topology. In isolation, a single modality of observation may not be sufficient to capture the behaviour of a system. Therefore, it is important that techniques for continuous health optimization look holistically at all the modalities together.
We propose to learn a model that can embed observability data into a unified representation for continuous health optimization. The data elements have different flavours (e.g. logs are semi-structured whereas metrics are time series of continuous or categorical values), making it a challenging task to represent them in a unified space. To the best of our knowledge, there are currently no available models that can represent all these data elements in one common unified representation.

Required Skills: Deep Learning, Machine Learning, Timeseries Forecasting, Natural Language Processing, PyTorch/Tensorflow

Number : IRL_Project_39

Title : Modeling the Hybrid, Distributed, Multi-Cloud             

Description: A Cloud Solution Architect faces several challenges. First of all, there is too much information to digest and apply for every solution design. While there are several reference architectures available, each customer has unique requirements that need to be catered to. The skill level of architects varies a lot. And there is often a heavy price to pay when a solution is architected with incorrect assumptions or while missing a key function.
This complexity increases manifold as cloud environments evolve from a single cloud to hybrid cloud and to a world of distributed clouds, including those at the edge of the network. In this project, we shall aim to infuse automation into the hybrid, distributed, multi-cloud solutioning process to make the life of cloud architects and engineers easier while providing solution assurance to the organisations moving to cloud.

Required Skills: Software engineering and coding skills.
Data Structures and Algorithms.
Knowledge of systems concepts including networking, distributed systems.
Familiarity with one or more cloud platforms.


Number : IRL_Project_40

Title : Learning Geospatial embeddings using un-supervised objectives             

Description: To expand the scope of geospatial models, location-based information must be accounted for. A spatial coordinate has various properties such as land use land cover (LULC) information, soil characteristics and terrain surface features. Geospatial embedding refers to a neural network-based encoding which can represent a point/location in a high-dimensional space. These geospatial embeddings capture similar patterns/trends, analogous to the syntactic and semantic similarities captured by word embeddings. They can be verified on water quality prediction across multiple sensors or yield prediction across multiple farms. The problem of water pollution has the potential of erasing all forms of life on earth. Hence, considering the magnitude of this problem, our first focus is to use these geospatial embeddings with an AI model to predict water quality parameters for inland water bodies. Most of the location encoders present in the literature are trained in a supervised fashion, which prohibits applying the trained location embedding to other tasks. Our research focus will be on designing an unsupervised learning framework for location encoding. Properties of geo-spatial embeddings:
1. Distance and Direction Awareness- Two nearby locations should have similar location embeddings, and locations that point in similar directions should have more similar (relative) location embeddings than those that point in very different directions.
2. Geographic Awareness- Two locations sharing similar physical properties such as altitude, slope, land use land cover (LULC), climatic zone, coastal vs. non-coastal, etc. should cluster closely in the high-dimensional embedding space.
3. Multi-scale adaptation- The learned embeddings should be upscaled and downscaled using simple aggregations and interpolations.                  
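To make the distance-awareness property concrete, here is a small sketch using a fixed, hand-crafted multi-scale sinusoidal encoding. This is a stand-in for illustration only, not the learned unsupervised encoder the project aims to design; the length scales and the coordinates are assumptions.

```python
import math

SCALES = (1.0, 10.0, 100.0)  # assumed length scales, in degrees


def encode(lat, lon, scales=SCALES):
    """Hand-crafted multi-scale sinusoidal encoding of a (lat, lon) pair."""
    features = []
    for s in scales:
        features += [math.sin(lat / s), math.cos(lat / s),
                     math.sin(lon / s), math.cos(lon / s)]
    return features


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical locations: a reference point, a close neighbour, and a distant point.
anchor = encode(12.9, 77.6)
sim_near = cosine(anchor, encode(13.0, 77.7))
sim_far = cosine(anchor, encode(28.6, 77.2))
```

The nearby point yields a cosine similarity close to 1, while the distant point scores noticeably lower. A learned encoder would additionally need to satisfy the geographic-awareness and multi-scale properties, which this fixed encoding does not address.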

Required Skills: Understanding of Spatio-temporal data driven AI modelling, Experience of working with GNNs.

Number : IRL_Project_41

Title : Net Zero Planner             

Description: Enterprises are under significant pressure from investors, consumers, and policymakers to act on climate change mitigation by disclosing their GHG emissions and committing to reducing emissions from their industrial activities, including operations, manufacturing, logistics, and supply chains. Over 20% of the world's largest companies have set long-term net-zero targets. Enterprises need technology to measure, track, and reduce their emissions while building operational resiliency to the effects of climate change. While there is significant development on measuring and tracking carbon footprint at enterprise scale, there is hardly any focus on systematically reducing the carbon footprint. There is an opportunity to address this gap through a data-driven approach using enterprise and environmental data available in the open domain.

Required Skills: Python, ML, DL, AI Explainability

Number : IRL_Project_42

Title : Three-way multi-objective optimization in buildings (Energy, Emissions, Cost)

Description: Buildings cause 30% of GHG emissions and require 40% of all energy produced. Energy production accounts for 73% of GHG emissions. Therefore, buildings offer significant potential for reducing energy consumption as well as for creating a wider impact on mitigating climate change.
Study the multi-objective optimization problem of buildings with three competing variables (energy, cost, emissions). Examine what-if scenarios to suggest action items that reduce some or all of the optimization variables.
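Because the three variables compete, there is generally no single best scenario, only a set of non-dominated (Pareto-optimal) ones. The sketch below filters a few what-if scenarios down to that set; the scenario names and numbers are made up for illustration, and all three objectives are assumed to be minimized.

```python
def dominates(a, b):
    """True if option a is no worse than b on every objective and better on at least one."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def pareto_front(options):
    """Keep only the options that no other option dominates."""
    return {
        name: vals
        for name, vals in options.items()
        if not any(dominates(other, vals) for other in options.values())
    }


# Hypothetical what-if scenarios: (energy in kWh, cost, emissions in kgCO2).
SCENARIOS = {
    "baseline": (100, 50, 80),
    "led_retrofit": (80, 55, 64),
    "solar_ppa": (100, 60, 30),
    "oversized_hvac": (120, 70, 95),
}
front = pareto_front(SCENARIOS)
```

Only `oversized_hvac` is discarded, since the baseline beats it on all three objectives. Choosing among the remaining scenarios requires an explicit preference between energy, cost and emissions.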

Required Skills: Optimization, Machine Learning, Modelling


Number : IRL_Project_43

Title : Design and Implementation of automatic Ansatz selection using classical ML in Qiskit

Description: Selecting a good ansatz is crucial to obtaining a near-optimal solution in several applications of quantum computing in optimization and machine learning, and in domains like quantum chemistry and finance. Machine learning techniques such as reinforcement learning (RL) can be used to search for a good ansatz. RL training would involve designing a state space which includes the class of all possible circuits, a space of actions which includes the addition/removal of quantum gates, and a reward function based on ansatz quality. RL-based algorithms such as Q-learning can be used to train the parameters and simultaneously search for the best ansatz. The project involves the design and implementation of ML algorithms like the above for ansatz selection and training.

Required Skills: Qiskit, Variational Algorithms, (classical) Machine Learning, Reinforcement Learning

Number : IRL_Project_44

Title : Estimation of excited energy states in an electronic structure problem

Description: The objective is to evaluate different methods for estimating excited eigenstates and their energies in the electronic structure problem. Given a time-independent Hamiltonian which is a function of the geometry of the atoms in the molecule, we would like to estimate the eigenstates. There are a few methods available in the literature, namely i) QEOM, ii) subspace search, iii) time evolution/QPE, etc. We would like to evaluate these methods in terms of their accuracy and their suitability for implementation on real hardware.

Required Skills: Qiskit, Quantum computing basics, Preferable to have quantum chemistry basics


Number : IRL_Project_45

Title : Strengthening the Trust Models for Outsourced Homomorphic Encryption             

Description: The recent advent of cloud computing technologies enables individuals and organisations to outsource heavy computations over big data to third-party servers. However, this presents a security and privacy challenge particularly when the data contains sensitive information such as individual medical records, etc. For compliance, regulation, and other essential privacy requirements, the data must be kept secure at rest, in transit, and during computation. Gentry's groundbreaking discovery of fully homomorphic encryption demonstrated the feasibility of general computation over encrypted data. Recent years have seen increased adoption of homomorphic encryption based outsourced computation to better address the data privacy concerns in various domains (finance, healthcare etc).
Despite remarkable progress in making homomorphic encryption more practical, several practical concerns still remain unaddressed:
(i) Verifiability of Results: While FHE ensures privacy of clients' data by allowing the third-party cloud servers to compute over encrypted data, it still assumes that the cloud "correctly" performs the computation (i.e., the cloud is only passively corrupt). For many applications, the clients might want a stronger security model, which protects against malicious cloud servers.
(ii) Multi-party Homomorphic Computation: Typically, FHE considers one data owner who encrypts and outsources the data to the cloud. Several applications demand computations over data from several data providers. This is accomplished either by data providers sharing the FHE decryption key (a weaker trust model), or by using Multi-Key FHE schemes (with prohibitive performance for existing solutions).
The aim of this project is to explore solutions to address the above two concerns by potentially improving existing approaches for verifiable HE to allow "succinct" verification, and looking at performance optimisations to improve the efficiency of FHE in multi-key setting.                  
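For readers new to the idea of computing over encrypted data, the following toy sketch uses the Paillier cryptosystem, which is only additively homomorphic and therefore much weaker than the FHE schemes discussed above. The tiny primes, fixed randomness values and function names are illustrative assumptions; the code is insecure, needs Python 3.8 or later for the modular inverse, and addresses neither result verifiability nor the multi-key setting.

```python
import math


def keygen(p=61, q=53):
    """Toy Paillier key pair from two small primes (insecure; for illustration only)."""
    n = p * q
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)  # lcm(p-1, q-1)
    mu = pow(lam, -1, n)  # this shortcut is valid because the generator is g = n + 1
    return n, (lam, mu)


def encrypt(n, m, r):
    """Encrypt a message m < n using randomness r that is coprime to n."""
    return (pow(n + 1, m, n * n) * pow(r, n, n * n)) % (n * n)


def decrypt(n, key, c):
    """Recover the plaintext from ciphertext c using the private key."""
    lam, mu = key
    return ((pow(c, lam, n * n) - 1) // n * mu) % n


def add_encrypted(n, c1, c2):
    """Multiplying ciphertexts adds the underlying plaintexts modulo n."""
    return (c1 * c2) % (n * n)


n, private_key = keygen()
c_sum = add_encrypted(n, encrypt(n, 12, r=17), encrypt(n, 30, r=23))
total = decrypt(n, private_key, c_sum)
```

The party holding only `n` and the two ciphertexts can compute `c_sum` without learning 12 or 30; only the key holder recovers the sum. Note that the key holder has no way to check that the server actually performed the requested operation, which is the verifiability gap described in (i).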

Required Skills: Cryptography, understanding of HE schemes and hands-on experience with HE libraries (preferred).

Number : IRL_Project_46

Title : Secure database-as-a-service             

Description: Database-as-a-service (DBaaS) is a cloud computing service that allows a user to use a cloud database system without having to install and set it up themselves. It offers various benefits to the end user such as cost savings, scalability and rapid development. In fact, DBaaS is among the fastest growing Software-as-a-Service markets. However, one of the biggest concerns for users of DBaaS is data confidentiality. The key problem is how to protect sensitive data from being accessed by insiders such as DBAs (Database Administrators) and Cloud Administrators, who have elevated privileges. Various solutions have been proposed to solve this problem, such as Searchable Encryption, Fully Homomorphic Encryption and Trusted Hardware. These solutions vary in their trade-offs between security, performance and usability, and they rely on modified database engines, work only on specific platforms, or support only a limited set of operations.
The goal of our project is to build an open, secure and platform agnostic solution to maintain data confidentiality in DBaaS. The platform will be a reusable and extendable framework which can be extended to different data-stores and can integrate multiple protocols to protect the data. The platform will be built to have minimum impact on the application using it and minimal modifications to the database engine being used. These properties are critical to ensure wide adoption of the platform.
The primary aim of the internship is to design and build prototypes for the core components of the proposed platform. It will involve extensive research to find the best solution for a wide variety of challenges in building the proposed system. The main focus will be on building a prototype which can be easily integrated with various database engines and applications. The prototype will be extensible so that various cryptographic protocols can be plugged in. A stretch goal will be to tune the prototype for the best performance while maintaining the security of data.

Required Skills: Database, Cryptography, Software Engineering

