IBM Research India - Internship Projects and Research Areas


Summer 2022 Internship Projects and Research Areas

 

Artificial Intelligence (AI)

Number : IRL_Project_1

Title : Generating Task-Oriented Dialog Responses from Heterogeneous Knowledge Sources

Description: Existing task-oriented dialog systems generate responses using information from just one knowledge source. For example, a restaurant recommendation dialog system uses information in a database of restaurants to suggest restaurants. In many real-world applications, dialog systems are required to infer information from more than one knowledge source. For example, the restaurant recommendation system can be augmented with a corpus of restaurant reviews as an additional source of knowledge. In this project, we plan to build a deep neural network that can learn to generate responses by inferring from heterogeneous knowledge sources.
Required Skills: Pytorch, Natural Language Processing, Deep Learning

Number : IRL_Project_2

Title : Controlled response generation with learnable attributes

Description: Sequence-to-sequence models and their variants have often been employed to learn a mapping between a dialog context and the corresponding response. However, these models do not allow any flexibility in controlling the semantic content of the responses for a given dialog context. In actual conversations, there can be multiple diverse responses for a given dialog context that lead to multiple directions in which the conversation can flow. The aim of this project is to equip dialog models with learnable attributes that control the semantic content of the response for a given dialog history, thereby controlling the direction in which the conversation will flow.
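For illustration, a minimal PyTorch sketch of a response decoder conditioned on a discrete learnable attribute (all names, shapes, and the attribute set are illustrative; in the full model the attribute would be inferred with variational methods rather than supplied by hand):

    import torch
    import torch.nn as nn

    class AttributeConditionedDecoder(nn.Module):
        def __init__(self, vocab_size, hidden_size, num_attributes):
            super().__init__()
            self.embed = nn.Embedding(vocab_size, hidden_size)
            self.attr_embed = nn.Embedding(num_attributes, hidden_size)
            self.rnn = nn.GRU(hidden_size, hidden_size, batch_first=True)
            self.out = nn.Linear(hidden_size, vocab_size)

        def forward(self, context_state, attribute, response_tokens):
            # Initialize the decoder with the dialog context plus the chosen
            # attribute, so different attributes steer the response differently.
            h0 = (context_state + self.attr_embed(attribute)).unsqueeze(0)
            states, _ = self.rnn(self.embed(response_tokens), h0)
            return self.out(states)  # per-token vocabulary logits

    # Toy usage: batch of 4 contexts, responses of length 10, 3 attributes.
    dec = AttributeConditionedDecoder(vocab_size=1000, hidden_size=64, num_attributes=3)
    logits = dec(torch.randn(4, 64), torch.tensor([0, 1, 2, 0]),
                 torch.randint(0, 1000, (4, 10)))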
Required Skills: Probability, Deep Neural Networks, Variational Inference, PyTorch
 

Number : IRL_Project_3

Title : Tooling for Large Language Models

Description: Transformer-based language models have become the de facto standard in modern NLP. We have active threads on pre-training new large language models under various settings, including multilinguality and code-switching. The internship work revolves around devising exciting novel use cases, building interactive demos that showcase those use cases, serving these models, and evaluating their performance. The scope of the work extends to text-like sequential data as well. We are experimenting with various transformer models (such as GPT, BERT, T5) for generation, classification, and seq2seq tasks.
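For illustration, a minimal sketch of exercising the three task families through Hugging Face pipelines (the model names are illustrative public checkpoints, not the project's own models):

    from transformers import pipeline

    # Generation with a GPT-style model
    generator = pipeline("text-generation", model="gpt2")
    print(generator("Large language models can", max_length=30)[0]["generated_text"])

    # Classification with a BERT-style model (default checkpoint)
    classifier = pipeline("sentiment-analysis")
    print(classifier("This demo works well."))

    # Seq2seq with a T5-style model
    summarizer = pipeline("summarization", model="t5-small")
    print(summarizer("Transformer based language models have become the de facto "
                     "standard in modern NLP.", max_length=16, min_length=4))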
Required Skills: Interactive UI building, NLP, PyTorch, Hugging Face
 

Number : IRL_Project_4

Title : Dynamic Orchestration of Automation Services

Description: In recent times, ChatOps (i.e., conversation-driven collaboration for IT operations) has rapidly accelerated the efficiency of ITOps teams by providing a cross-organization and cross-domain platform to resolve and manage issues as soon as possible. SREs (site reliability engineers) leverage automation scripts to support user requests. The research problem is to set up a system that can assimilate all the automation scripts and, given a request from the user, dynamically execute one or more automation scripts in sequence to perform the task.
Required Skills: AI planning, NLP, Reinforcement learning.
 

Number : IRL_Project_5

Title : Concept Drift Detection for Attributed Process Models

Description: Process models offer powerful ways to represent real process behaviour as mined from execution logs. Rapidly changing business environments lead to continuous changes in process behaviour and performance over time. There is a need to automatically detect such changes at an integrated process level and to explain them. Research so far has focused on detecting and locating changes for a single process attribute at a time. There is a need to represent the process model using an attributed graph embedding with which drift can be detected and localised at a holistic level. Further, the detected changes need to be explained leveraging concepts from explainable AI. The scope of this research problem is to enable concept drift detection on the attributed graph embeddings generated to represent a process model.
Required Skills: Machine Learning, Deep Learning
 

Number : IRL_Project_6

Title : Incorporating Domain Knowledge in Neural Networks for Question Answering

Description: Natural language interfaces to databases/knowledge bases are very important for extracting the information stored in relational databases/knowledge bases in any industry. There is a variety of published research work based on either rule-based or neural methods. We would like to explore neural / Neural Machine Translation based approaches where domain knowledge can be incorporated. We would also like to explore the same in low-resource settings.

Required Skills: Python, PyTorch, NLP, Neural Network Architectures

Number : IRL_Project_7

Title : Semantic Parsing and Reasoning to Solve Math Word Problems

Description: Solving math word problems involves converting problem statements in NL text into math expressions, followed by mathematical reasoning. The semantic parsing step, which converts text into expressions (involving variables, constants, operators, and functions), involves identifying contextual mentions of the different components of the expression and their inter-relations. The reasoning step uses the obtained expressions together with a set of axioms and inference rules to solve the problem through multiple steps of inference, which is essentially a search problem where an efficient guidance mechanism is critical.
Required Skills: NLP, QA, Deep Neural Networks (Graph Neural Networks), PyTorch
 

Number : IRL_Project_8

Title : Targeted Fact Extraction from Textual Resources to Improve KBQA

Description: The goal is to explore the usefulness of textual resources towards improving Knowledge Base Question Answering (KBQA). KBQA answers complex natural language questions using facts found in Knowledge Bases (KBs). However, one of the main issues faced is incomplete KBs, i.e., all the facts needed to answer a question may not always be present. In this project, our goal is to use textual resources to compensate for this drawback. Knowing what is missing in the KB, it should be possible to perform targeted extraction of the specific missing facts from text. Such targeted extraction is likely to be more accurate than unconstrained extraction. The plan is to evaluate on standard KBQA datasets such as QALD, LC-QuAD, WebQSP, and TempQA using Wikipedia as the textual resource.
Required Skills: NLP, QA, Deep Neural Networks, PyTorch
 

Number : IRL_Project_9

Title : Expressive Reasoning Graph Store

Description: "Resource Description Framework (RDF) and Property Graph (PG) are the two most commonly used data models for representing, storing, and querying graph data. Recently, Expressive Reasoning Graph Store (ERGS) (https://github.com/IBM/expressive-reasoning-graph-store/) is presented, which is graph store built on top of JanusGraph (a Property Graph store) that allows storing and querying of RDF datasets through Tinkerpop framework. It captures translation of RDF data into a Property Graph representation and describes query translation module that converts SPARQL queries into a series of Gremlin traversals.
Currently, ERGS supports reasoning for fixed ontology, i.e., ontology is available in the beginning and cannot be modified during reasoning. Hence, we propose supporting incremental updates to the ontology. It would require modification in existing graph based on change in the ontology. Precisely, the aim is to support ontology updates though originating parallel graph algorithms.
Also, it has been observed that JanusGraph performance goes down for large node scans. As a result reasoning queries become slow which involves large node scans. Therefore, as second sub-problem, we propose using Neo4j as an alternative to JanusGraph to handle these cases. Neo4j supports graph operations through native language called Cypher. It also support Tinkerpop framework though external library. In this proposed work, challenge is to study supporting reasoning operations using Cypher and Tinkerpop on Neo4j from performance and functionality point of view. More specifically, aim is to study the scope of adding ingestion, querying and reasoning related algorithms in Neo4j."
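For illustration, the flavor of query translation involved, sketched with gremlinpython against a running Gremlin server (the "iri" property and the edge label are invented for this example; ERGS's actual RDF-to-property-graph mapping is richer):

    from gremlin_python.process.anonymous_traversal import traversal
    from gremlin_python.driver.driver_remote_connection import DriverRemoteConnection

    # SPARQL:  SELECT ?city WHERE { :IBM :headquarteredIn ?city }
    g = traversal().withRemote(
        DriverRemoteConnection("ws://localhost:8182/gremlin", "g"))
    # Equivalent Gremlin traversal under the simplified mapping:
    cities = g.V().has("iri", ":IBM").out("headquarteredIn").values("iri").toList()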
Required Skills: RDF/OWL Reasoning, Graph Databases, Graph Theory, Java
 

Number : IRL_Project_10

Title : Scalable Anomaly Detection in Cryptographic Networks via Graph Neural Networks

Description: "The topic of the project is anomaly detection in cryptographic networks such as block chains. Examples of anomalies are phishing, money laundering and fraudulent transactions. The subject has been well-studied using classical learning methods. Recent studies have explored applying Graph Neural Networks (GNN). We have three specific goals:i) Design new GNN models catered to anomaly detection problems in cryptographic networks with the aim of achieving improved accuracy; (ii) Prior work has applied GNN models meant for static networks. However, cryptographic networks are dynamic (evolving) in nature. We wish to explore Temporal GNN models in this context; (iii) Cryptographic networks are large in size and so, we wish to study the scalability aspects."
Required Skills: General understanding of machine learning concepts and frameworks such as PyTorch. Experience with high performance computing (HPC) is desirable, but not mandatory.
 

Number : IRL_Project_11

Title : A Neuro-symbolic Approach to Semantic Parsing

Description: "The goal of semantic parsing is to convert the natural language utterance to logic-form lambda DCS or query languages like SQL, SPARQL. Supervised training of a neural semantic parser requires a large number of training pairs (utterance, logic-form). Annotating logic-form corresponding to a natural utterance requires expertise in the logic-form or the query language and hence a very costly process. 
Recently, several works have argued that compositional generalization is essential for more human-like natural language understanding and generalization from very few examples. Compositional generalization is fundamental for a parser to generalize well on out-of-domain natural language utterances. By compositional generalization, we mean that by seeing ""jump"", ""walk"", and ""walk twice"" during training time, the parser should be able to generate the logic form for a not seen utterance ""jump twice"". In this project, we want to combine neural approaches that are good at recognizing patterns with symbolic approaches that are good at composing patterns to reduce the training data requirement of semantic parsing.
Required Skills: "Hands-on knowledge on NLP, Expert knowledge on Pytorch AI
 

Number : IRL_Project_12

Title : Detection and Mitigation of Profane Content in Training Data for Large Language Models

Description: "Large Language Models are used in many applications associated with Text generation, classification, conversation, etc. However, very few have tested whether they are under the influence of profane content or not whether at the data selection or model training phase. There are multiple works associated with profane content detection such as XHATE-999, Real Toxicity Prompts, etc. However, they are domain-specific and non-generalizable to wider datasets. SemiEval2019 and Caselli2020 define profane content into three dimensions i.e., Offensive, Abusive, and Hate. Works like HateBERT and MACAS propose a method for detecting all three dimensions of profanity. However, they are very limited in performance improvement. MACAS uses a pre-defined set of embedding layers which limits it. HateBERT has been pre-trained on only Reddit based RAL-E dataset which brings out scope for a lot of improvements in it.We identified profane content to fall under three dimensions: 
Offensive - A message is labeled as offensive if  “it contains any form of non-acceptable language or a targeted offense, which can be veiled or direct"". Example: What is wrong with these idiots?
Abusive - Abusive language is defined as a specific case of offensive language, namely “hurtful language that a speaker uses to insult or offend another individual or a group of individuals based on their personal qualities, appearance, social status, opinions, statements, or actions.” Example: He. Is. A. Sociopath. They are incapable of feeling empathy. Period.​
Hate - “Any communication that disparages a person​ or a group on the basis of some characteristic such as race, color, ethnicity, gender, sexual orientation, nationality, religion, or other characteristics"". Example: I am a white nationalist of a Christian faith but still am a white nationalist for racial survival the anti-racist Christians are the true Christian's enemy.
The proposal consists of two main components:
1. Flagging of Hateful, Offensive, Abusive and Toxic language (Profanity)
Phase 1: Explicit  profanity
Phase 2: Implicit  profanity (e.g., Sarcasm)
2.   Remediation
Phase 1: Removal of complete sentences containing profane content (Sentence level)
Phase 2: Masking/filtering of profane words and phrases (Sub-sentence level)
We also aim to publish the work done during the internship at AI conferences such as IJCAI, FAT*, AIES, AAAI, and NeurIPS.
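For illustration of flagging component 1, a minimal sketch built on a BERT-style encoder (assumes the transformers library; the four-way label set is an illustrative encoding of the three dimensions plus a clean class, and the classification head is untrained until fine-tuned on labeled data):

    import torch
    from transformers import AutoTokenizer, AutoModelForSequenceClassification

    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
    model = AutoModelForSequenceClassification.from_pretrained(
        "bert-base-uncased", num_labels=4)  # offensive, abusive, hate, clean

    def flag(sentence):
        inputs = tokenizer(sentence, return_tensors="pt", truncation=True)
        with torch.no_grad():
            logits = model(**inputs).logits
        return logits.softmax(-1)  # class probabilities (meaningful after fine-tuning)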
Required Skills: NLP, Deep Learning, experience working with BERT-based models, scalable deep learning concepts
 

Number : IRL_Project_13

Title : Context aware abbreviation expansion using deep learning

Description: "Abbreviation expansion is an important research problem as well as a very impactful business problem. Abbreviations are usually used in various domains such as shortening biomedical terms, in code blobs while defining variables, functions, parameters, in shortening column headers of tables. Often abbreviations are used in different organizations for defining business terms. However, the task for expanding an abbreviation is a difficult task and the state-of-the-art solutions have low accuracy scores. To concretize the problem, consider a set of column headers {h_1,h_2,‚Ķ,h_n}
appearing in a table which are abbreviated. The final goal is to determine the expanded forms of each of these columns. Of course, when considered each column {h_i, 1 ‚â§i‚â§n} in silo, there will be many expansions of h_i. However, the very fact that all the other column headers are associated with each other should minimize the choices and might help to determine the expanded column headers accurately. For example, consider a table with column headers {CUST_INFO, ADDR, STRT, PHN NMBR, SLRY}. Now each of these column headers have multiple expansions possible, such as, CUST_INFO can expand to Customer Information, Customized information, Customary Information. However, when all the headers are expanded together, then those give rise to a solution which seems to be accurate such as {CUSTOMER INFORMATION, ADDRESS, STREET, PHONE NUMBER, SALARY}.  Also, for answering questions such as which column involve customer related information, one needs to expand the abbreviated column headers first. This problem can be analogously extended to determining expansions for variables, parameters, function names of a code blob. Also, one can ask a question such as which portion of the code involves customer-centric code, then essentially one needs to understand the expanded function names, parameters and variables and then retrieve such information to answer the question.    In the recent times, Neural Machine Translation (NMT) based models have been used for translating text from one language to another language, but however, such models also work when there is almost no data on the target language. Such models are called NMT based models without parallel sentences [1], [2]. However, NMT based approaches can also be applied on abbreviations where a real scenario is that there are a large number of abbreviations available without the expanded forms. To solve this research problem such a NMT based approach can also be applied.  
The problem is quite challenging as it has a broad application in the domain of NLP. Also, a solution to this problem will involve models with Deep Learning Techniques such as LSTM, Attention, Transformer, RNN etc. 
References: [1]  ‚ÄúBilingual Dictionary Based Neural Machine Translation without Using Parallel Sentences‚Äù by Xiangyu Duan et.al. (2007.02671.pdf (arxiv.org)) [2]  ‚ÄúWORD TRANSLATION WITHOUT PARALLEL DATA‚Äù by Alexis Conneau et.al. (1710.04087.pdf (arxiv.org))"
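For illustration, a toy joint-expansion scorer that prefers the mutually most compatible set of expansions (the candidate lists and the compatibility heuristic are stand-ins; a real system would score candidates with embeddings or a language model):

    from itertools import product

    candidates = {
        "CUST_INFO": ["customer information", "customized information"],
        "ADDR": ["address", "adder"],
        "SLRY": ["salary", "slurry"],
    }

    def compatibility(expansions):
        # Stand-in score: how many expansions contain a word from a shared
        # domain lexicon; jointly coherent sets score higher.
        lexicon = {"customer", "address", "salary", "phone", "street"}
        return sum(any(w in lexicon for w in e.split()) for e in expansions)

    best = max(product(*candidates.values()), key=compatibility)
    print(dict(zip(candidates, best)))  # jointly most plausible expansions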
Required Skills:"Excellent coding skills with deep knowledge in NLP. Excellent knowledge in building DL models." AI
 

Number : IRL_Project_14

Title : Automatic Data Products Realization

Description: In the data fabric and the ACT phase of data modernization, the target data is realized/transformed in different schemas on disparate DB stores, differing from the source data. Complex co-existence and governance are necessary to avoid data duplication and to enable metadata reuse. Automatic data product realization that supports multiple data schemas and data stores, with multiple representations of the source data for target system needs, greatly reduces the ACT phase time and is immensely useful in a data fabric for downstream applications. This intern proposal will establish automatic data product realization using data and AI techniques.
Required Skills: Python, basic ML/DL skills
 

Number : IRL_Project_15

Title : Optimal Constrained Decision Tree Surrogates for Blackbox ML Models

Description: Given a blackbox model/oracle and its training data, build an interpretable surrogate decision tree maximizing fidelity under tree sparsity constraints. The constraints will be specified in terms of the maximum number of leaf nodes and the maximum depth/path length of the surrogate tree.
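For illustration, a minimal sketch of the setup (assumes scikit-learn; the blackbox can be any fitted model exposing predict):

    from sklearn.tree import DecisionTreeClassifier
    from sklearn.metrics import accuracy_score

    def fit_surrogate(blackbox, X_train, max_leaves=8, max_depth=4):
        y_bb = blackbox.predict(X_train)  # oracle labels, not ground truth
        tree = DecisionTreeClassifier(max_leaf_nodes=max_leaves,
                                      max_depth=max_depth)
        tree.fit(X_train, y_bb)
        fidelity = accuracy_score(y_bb, tree.predict(X_train))
        return tree, fidelity  # fidelity = agreement with the blackbox

Note that greedy CART fitting is only a baseline here; the project targets optimal trees under these sparsity constraints.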
Required Skills: Statistics, Discrete Mathematics, Machine Learning, Basics of Optimization
 

Number : IRL_Project_16

Title : Dynamic Inventory Planning in Assemble-to-Order Systems

Description: An assemble-to-order system has an inventory of N make-to-stock parts, which can be assembled into M products. Each of the N parts has different sourcing options (ordering cost, quantity, lead time) and is maintained as its own inventory (with its own holding cost, backorder cost, etc.). While steady-state analysis of the M demand profiles helps in strategic planning such as inventory capacity, we are focused on tactical/operational planning of inventory. Given a demand profile, we need to provide multiple inventory plans, each evaluated with respect to different criteria such as cost, robustness, and uncertainty.
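For illustration, a toy single-period ordering model in the spirit of the tactical plans described (assumes the pulp solver; the parts, costs, and no-backorder constraint are illustrative simplifications):

    import pulp

    parts = ["p1", "p2"]
    demand = {"p1": 40, "p2": 25}
    order_cost = {"p1": 2.0, "p2": 3.0}
    holding_cost = {"p1": 0.5, "p2": 0.4}

    prob = pulp.LpProblem("inventory_plan", pulp.LpMinimize)
    q = {p: pulp.LpVariable(f"order_{p}", lowBound=0, cat="Integer") for p in parts}
    # Minimize ordering cost plus holding cost of leftover stock.
    prob += pulp.lpSum(order_cost[p] * q[p]
                       + holding_cost[p] * (q[p] - demand[p]) for p in parts)
    for p in parts:
        prob += q[p] >= demand[p]  # toy variant: no backorders allowed
    prob.solve()
    print({p: q[p].value() for p in parts})

The actual project would extend such a formulation toward multi-period, stochastic, and multi-criteria variants.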
Required Skills:"Mandatory: Mixed Integer Linear Programming, Stochastic/Robust Optimization, Multi-Criteria Optimization. Programming Skills: Python, Optimization Solvers.
Preferred: Full stack development with React, Mongo, and NodeJS or DASH." AI

Number : IRL_Project_17

Title : Intermittent and hierarchical time series forecasting

Description: Intermittent time series (i.e., time series with many zeros) are often observed in retail sales. This is generally due to the very sporadic nature of sales in a particular store. Data scientists often face challenges in applying standard time series forecasting models to intermittent data because those series generally do not possess any meaningful trend or seasonality patterns. Furthermore, it has been observed in various open competitions (like M5) that tree-based gradient boosting regressors with no knowledge of time series history outperform classical forecasting models on intermittent data by utilizing only the exogenous attributes. This essentially confirms the lack of trend and seasonality patterns in such data. However, intermittent sales data are often hierarchical in nature (e.g., store, product group, department, state, etc., going up the hierarchy), and a hierarchically aggregated time series (e.g., state-level sales) generally does capture trend and seasonality. This creates a demand for more sophisticated and dedicated machine learning algorithms designed for intermittent and hierarchical time series data.
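For illustration, the "no time series history, exogenous features only" baseline mentioned above, sketched with scikit-learn (the columns and the toy intermittent target are illustrative):

    import pandas as pd
    from sklearn.ensemble import GradientBoostingRegressor

    df = pd.DataFrame({
        "day_of_week": [0, 1, 2, 3, 4, 5, 6] * 10,
        "promo": ([0, 0, 1] * 24)[:70],
        "sales": [0, 0, 0, 5, 0, 2, 0] * 10,  # intermittent target
    })
    X, y = df[["day_of_week", "promo"]], df["sales"]
    model = GradientBoostingRegressor().fit(X, y)  # regressor never sees lags
    print(model.predict(X.head()))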
Required Skills: Python, Pandas, TensorFlow, PyTorch, scikit-learn, Machine learning, Regression, Classification, Boosting, CART
 

Number : IRL_Project_18

Title : Enabling private and public blockchains to interoperate for trustworthy sharing of ledger state and assets

Description: "Distributed ledger technology (DLT) has gone beyond its experimental phase and is now actively managing several enterprise workflows around the world in areas like trade logistics, export finance, inter-bank payments, and regulatory compliance. But this has not led to convergence of a single global network that everyone runs applications on and has resulted in a fragmentation of the blockchain ecosystem. This situation severely undermines the value proposition of blockchain as the processes and assets that are artificially segregated in the blockchain world are interdependent in the real world. One way to maximize network effects of individual blockchain networks without forcing them to merge is enabling interoperability in a decentralized and secure manner while without relying on trusted authorities.
Our team has extensive research experience in this domain having proposed asset and data sharing protocols across private/permissioned networks, a decentralized identity management solution that provides a trust basis for those sharing and exchange protocols and a system to expose ledger state with freshness guarantees. We have been actively developing and maintaining Weaver, which began as a research prototype, implementing these proposals, and is now an open source project under the Hyperledger Labs umbrella. Under this project, we have demonstrated how to link independent permissioned networks built on diverse DLT stacks like Hyperledger Fabric, Corda, and Hyperledger Besu. To enable the original blockchain vision of decentralized trust at global scale, we need to expand our scope to incorporate open (permissionless) networks like Bitcoin and Ethereum in the Weaver interoperability framework. This will be the focus of the internship, whose expected outcome is to link a permissioned network (e.g., Hyperledger Fabric) with an open network (e.g., the Ethereum Mainnet) to run at least one protocol involving trustworthy data or asset exchanges. Though existing Weaver libraries and protocols will be used as a basis, this scenario presents new research challenges in ledger state proof generation and verification, distributed identity management, and security and privacy. The goal of this internship is to identify and select one or more problems to solve in these areas and produce a working research prototype that can be incorporated into Weaver. Additionally, we expect that the internship will a research paper and one or more patentable ideas."
Required Skills: We expect the student to be familiar with blockchains and distributed ledgers in theory. Hands-on development skills in Hyperledger Fabric and/or Ethereum would be a strong plus. We prefer PhD or Masters students in this area, but are also open to undergraduates who have a relatively strong background in relevant areas.
 

Number : IRL_Project_19

Title : Secure Multiparty Computation over Authenticated Data

Description: "The standard notion of Secure Multiparty Computation guarantees correct evaluation of a public function over private inputs of different participants, while preserving confidentiality. However, no restriction is placed on what private inputs are used by the individual parties. In many real life applications, it is desirable to ensure that while individual inputs are confidential, they are ""authentic"" in some sense (for e.g. have a valid signature from a trusted authority). In the special case of two-party zero knowledge protocols, the problem has existing solutions(AD-SNARK), wherein proofs can be made over authenticated data. In this project, our aim is to explore similar mechanisms to augment multiparty protocols to efficiently ensure privacy, correctness and additionally input authenticity.
A satisfactory resolution of the problem has implications for trustworthy AI where it can be ensured that certified and fair models were used to provide inference, for collaborative supply chains where it can be ensured that key indicators are derived using network authenticated data, and in general to the area of privacy preserving machine learning."
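For illustration, toy additive secret sharing over a prime field, the kind of building block such multiparty protocols rest on (illustrative only; the project would additionally bind the shared inputs to authenticity evidence, e.g., signatures from a trusted authority):

    import random

    P = 2**61 - 1  # prime modulus

    def share(secret, n):
        # Split a secret into n additive shares that sum to it mod P;
        # any n-1 shares reveal nothing about the secret.
        shares = [random.randrange(P) for _ in range(n - 1)]
        shares.append((secret - sum(shares)) % P)
        return shares

    def reconstruct(shares):
        return sum(shares) % P

    assert reconstruct(share(42, 3)) == 42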
Required Skills: Cryptography; some familiarity with basic ML concepts; competence in undergraduate mathematics such as discrete math, linear algebra, and basic algebra; and, most importantly, the ability to critically analyse technical papers and improvise on the techniques. Familiarity with programming and technical writing using LaTeX is a plus.

Number : IRL_Project_20

Title : Resource and Model building for Indic Languages

Description: The goal is to create resources, i.e., unlabelled and labelled data, and to build various NLP models for Indic languages.
Required Skills: Natural Language Processing, PyTorch, TensorFlow, Deep Learning
 

Number : IRL_Project_21

Title : Automated Exploratory Data Analysis for building AI pipelines

Description: The goal of this project is to build automated ranking methods for exploratory data analysis (EDA). Typically, a data scientist explores many aspects of the data to understand it and find the most interesting patterns. This can take a number of steps before all the patterns and issues in the data are discovered. The process is usually trial and error, without any set mechanism, and becomes more challenging as the size of the dataset increases. In this project, the goal will be to build algorithms for automated ranking of EDA operators, so that a data scientist can focus on the most important patterns and problems in the dataset.
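For illustration, a naive sketch that scores candidate EDA operators with simple "interestingness" heuristics (assumes pandas; the operators and scores are stand-ins for the learned rankers the project targets):

    import pandas as pd

    def rank_eda_operators(df):
        scores = {}
        for col in df.columns:
            missing = df[col].isna().mean()
            if pd.api.types.is_numeric_dtype(df[col]):
                # Skewed numeric columns are often worth inspecting first.
                scores[f"histogram({col})"] = abs(df[col].skew()) + missing
            else:
                imbalance = df[col].value_counts(normalize=True).max()
                scores[f"value_counts({col})"] = imbalance + missing
        return sorted(scores.items(), key=lambda kv: -kv[1])

    print(rank_eda_operators(pd.DataFrame(
        {"age": [20, 21, 95, None], "city": ["x", "x", "x", "y"]})))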
Required Skills: Machine Learning, Reinforcement Learning, Python, TensorFlow, PyTorch
 

Number : IRL_Project_22

Title : Hierarchical probabilistic forecasting for supply chains

Description: Probabilistic forecasts are essential for producing decisions that are robust to uncertain future conditions. In the context of supply chains, optimal decision making at different echelons and nodes should be based on coherent hierarchical probabilistic forecasts. However, an important downside of probabilistic forecasting is that it tends to be far more computationally intensive, limiting its application to at most a few hundred time series. Typically, large-scale retail supply chains consist of thousands of products being moved through thousands of stores and warehouses, generating millions of time series with potential interactions amongst themselves. This leaves scope for major innovations in computationally efficient generation of hierarchical probabilistic forecasts. This project aims to develop new context-driven probabilistic forecasting methods to enable real-time AI-driven decision making at different levels of the supply chain hierarchy.
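For illustration, a sample-based route to coherent hierarchical probabilistic forecasts: simulate sample paths at the bottom level, aggregate the paths, and read off quantiles at every level (assumes numpy; the Poisson rates are illustrative). Working with sample paths sidesteps the pitfall that quantiles themselves do not sum across a hierarchy:

    import numpy as np

    rng = np.random.default_rng(0)
    n_samples, store_rates = 1000, [5, 2, 8]
    store_paths = rng.poisson(lam=store_rates, size=(n_samples, len(store_rates)))
    region_paths = store_paths.sum(axis=1)  # coherent by construction

    print(np.quantile(store_paths, [0.1, 0.5, 0.9], axis=0))  # store-level bands
    print(np.quantile(region_paths, [0.1, 0.5, 0.9]))         # region-level band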
Required Skills: Python, Time series analysis, Probability
 

Number : IRL_Project_23

Title : Systematization of Knowledge of Research on Variational Quantum Algorithms

Description : Variational algorithms have had a profound impact on quantum computing, especially in the NISQ (Noisy Intermediate-Scale Quantum) era, where quantum hardware is constrained in terms of the connectivity between qubits and the (relatively high) noise level. Variational algorithms have found applications in a wide variety of domains ranging from quantum chemistry to machine learning and optimization.
The high-level steps of a variational algorithm are to (i) decide on an appropriate ansatz, (ii) decide the input state, (iii) design the Hamiltonian for the problem, (iv) choose a classical optimizer, and (v) repeat the classical and quantum routines with this setup. The goal of this internship is to produce a systematization of knowledge of research on variational algorithms. For each application domain where variational algorithms can be applied, the intern will explore, extensively benchmark, and reason about the benchmarking results for different ansätze, variational algorithms (VQE, QAOA, ...), and the classical optimization routines that are part of them. The benchmarking will be done in terms of the accuracy of the results, the circuit complexity (i.e., the number of single- and multi-qubit gates used and the circuit depth), and the running time of the algorithms.
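For illustration, a bare-bones variational loop covering steps (i)-(v) (assumes qiskit and scipy; the two-qubit ansatz and Hamiltonian are illustrative, and a statevector simulation stands in for quantum hardware):

    import numpy as np
    from scipy.optimize import minimize
    from qiskit import QuantumCircuit
    from qiskit.circuit import Parameter
    from qiskit.quantum_info import Statevector, SparsePauliOp

    theta = [Parameter(f"t{i}") for i in range(2)]
    ansatz = QuantumCircuit(2)          # (i) ansatz, (ii) input state |00>
    ansatz.ry(theta[0], 0)
    ansatz.ry(theta[1], 1)
    ansatz.cx(0, 1)

    hamiltonian = SparsePauliOp(["ZZ", "XI"], coeffs=[1.0, 0.5])  # (iii)

    def energy(params):                 # quantum routine
        bound = ansatz.assign_parameters(dict(zip(theta, params)))
        return np.real(Statevector(bound).expectation_value(hamiltonian))

    # (iv) classical optimizer, (v) repeated classical/quantum iterations
    result = minimize(energy, x0=[0.1, 0.1], method="COBYLA")
    print(result.fun)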
Required Skills : We expect the student to have the following background: 1) knowledge of quantum computing; 2) basic knowledge of variational methods; 3) programming in Python. Knowledge of Qiskit is a definite plus.
 

Hybrid Cloud (HC)

Number : IRL_Project_24

Title : Orchestration and Interoperability of Network slices spanning across multiple vendors

Description: "A network slice is an isolated end-to-end virtual network that runs on a shared infrastructure and provides a pre-negotiated service quality. Research in Network slicing for 5G has seen tremendous growth over the recent years as it serves as a means to provide network as a service (NaaS) for a plethora of services with very different use cases, traffic patterns and requirements. 
A network slice can span across different vendors with radio network being served by vendor one, core by vendor two and transport by a third vendor. Therefore, there is a requirement for a slice orchestrator & manager to translate the high-level business intent for a service and map it to complicated settings for individual infrastructure elements and network functions spanning across different vendors. The high-level intent needs to interpreted in terms of functional and SLA requirements including speed, quality, latency, reliability and security. Thereafter, requirements for a slice needs to be broken down into requirements for each slice subnet. The slice MANO should also define a domain-specific description language that allow the expression of service characteristics, KPIs and network element capabilities in a comprehensive manner, while also retaining a simple and intuitive syntax. This becomes even more challenging owing to the dynamic nature of slicing to support features such as auto scaling and seamless mobility across multiple service providers. In such scenarios, this slice MANO will also be responsible for modification and movement of slices or slice subnets. There is an additional challenge of finding an interoperable way to compose services with network functions implemented by different vendors communicating with each other. The brute force approach of defining standardized interfaces for each new functionality is not be scalable given the huge amount of use cases 5G aims to address."
Required Skills: Networking, Kubernetes, Strong Programming skills
 

Number : IRL_Project_25

Title : AI Driven logic extraction from legacy applications

Description: Legacy applications that evolved over several decades tend to accumulate features and functionality over the years, and often become very complex, making it hard to understand the logic flow for maintenance or modernization related activities. The documentation is often not up to date or missing altogether, and the people who developed the applications or added features have moved on.
Required Skills: Program analysis, NLP, knowledge graphs, databases
 

Number : IRL_Project_26

Title : Automating Generation Of Infrastructure-as-Code For Edge

 Description: "Infrastructure planning and provisioning at the edge involves interaction with a large number of computing devices spread across multiple de-coupled domains. Ensuring repeatability in such scenarios eases the deployment, maintenance and recovery from failures.¬†
Infrastructure-as-code (IaC) has emerged as a reliable mechanism to ensure repeatability in cloud environments. Due to the huge device population and complicated communication patterns between edge devices, there is also a pressing need to automatically generate IaC artifacts for a production-grade workload involving thousands of interacting services. Such a capability will not only introduce efficiency in the process but also help reduce human-error and increase adoption due to ease-of-use.
Our open-source tool Move2Kube already generates IaC artifacts for conventional cloud setting involving cloud-native cluster orchestration platforms such as Kubernetes and OpenShift. We intend to extend this capability to edge-friendly clusters (1-node, 3-node, remote-node) used for orchestrating workloads.
The work involves experimentation in the problem space, design, and development of the following two components:
Infrastructure planning:
    - Recommend cluster configurations and generate artifacts for creating clusters (3-node, 1-node, remote-node, regular).
    - Leverage and select existing patterns of infrastructure.
Workload mapping:
    - Understand workload characteristics:
        * delay-sensitive or delay-tolerant?
        * closer-to-data or closer-to-user?
        * cpu/memory/storage-intensive?
        * traffic-intensive?
    - Map workload components to clusters by creating deployments, services, ingress, and other artifacts.
    - Divide and plumb the CI/CD pipeline to adapt to the configuration of clusters and workload.
Required Skills: Golang, OpenShift, Kubernetes, Docker, Containers
 

Number : IRL_Project_27

Title : Response Generation and Story Summarization for Collaborative AIOps Multiturn Conversations

Description: Understanding multiturn conversations and generating responses is becoming a popular field of research in the natural language understanding and generation subdomain. There have been a few efforts toward generating responses for domain-specific conversations, especially technical conversations, whether informal or semi-formal. Such AIOps-related conversations are becoming prevalent with the increased usage of collaborative channels, e.g., Slack and Teams, for discussing operations-related issues. Each such conversation also represents an AIOps story with inherent similarity, indicating what kind of issue has been diagnosed and containing its description, anomalies, and resolution within the conversation. Story summarization captures this information, succinctly representing the conversation. Multi-turn conversations in the Ops domain contain rich information about the nature of the problem and the steps to be followed for an efficient solution. The technical jargon is often not part of the general English vocabulary, which makes the problem challenging. The nature of the domain and its vocabulary must be learned from the limited amount of training data available to us. We aim to understand the dialogue for solving many related downstream tasks, such as (i) understanding the complexity of the issue around which the conversation is taking place, (ii) determining if any important question needs to be answered regarding the issue, and (iii) finding conversations regarding similar issues in the historical dump.
Required Skills: AI, NLP
 

Number : IRL_Project_28

Title : Security and Privacy for Edge Computing

Description: "Edge computing promises to enable the next advancement in cloud computing with applications delivered at the edge of the network for low latencies. This paradigm brings along with it newer challenges as the computing and state moves from a centralized model to a distributed one.
Since the users’ privacy- sensitive information will be shared and/or stored at the edge computing servers, security and privacy become crucial challenges in such a distributed structure.
In this project, we intend to solve security and privacy challenges while users get connected at runtime to the closest edge device."
Required Skills: Cloud computing, Security and Privacy, Algorithms
 

Number : IRL_Project_29

Title : Logs Flow Control in Vector - a new log collector in Rust

Description: "Logs are an essential source of information to understand the application, system behaviour, or troubleshooting critical issues occurring during Cloud Deployment. They also play a crucial role in developing various algorithms such as fault localization and anomaly detection. There is a need for a centralized log management system in a distributed production environment that collects, transforms, and transports log data to storage systems that can retain these logs for a longer time. Logging pipeline reliability, resiliency is super critical to a robust troubleshooting environment. In this context, log collector of underlying logging pipeline plays a critical role in governing reliability of logs collection process. Current Logging pipeline has got log collector as Fluentd which has got many performance gaps in keeping track of all container log files. As an superior alternative log collector - built-in Rust, Vector, is fast, memory efficient, and is designed to handle the most demanding workloads. Vector supports logs and metrics, making it easy to collect and process all observability data. The Goal of this project is to enable inbound and outbound data traffic to be controlled using appropriate rate control policies in this new Vector collector and showcase its viability as a proof of concept.
Here is the task plan in detail :
1. Study of Vector input plugin which deals with tailing log files for reading
2. Study of current capability on Flow control at each container level, understanding its benefits
3. A new plugin implementation to do data logs flow control (rate control) at group level where groups are formed by meta data rules and metric rules
4. Investigate usage of cluster level metrics to control logging rates at inbound (logfiles to collection IO handler) and outbound (collected logs by the collector sending them to aggregator like Elasticsearch) stages level
5. Attempting PR submission to upstream Vector community"
Required Skills: Familiarity with containers and Docker; good at picking up new programming languages like Rust and Golang; some exposure to GitHub-hosted open source community projects is good to have
 

Number : IRL_Project_30

Title : Utilizing Active Learning for Persistent Anomalies Detection in Cloud Environment

Description: "With more and more applications migrating to the cloud, cloud systems are becoming much more complex with a significant number of micro-services to manage. Outage or incidents impacting the availability of cloud-native applications have become very common. An organization suffers a significant loss as a result of the outages. As a result, detecting and recognizing an anomaly before it leads to service or system-level outage is a key aspect of the AIOps platform. The goal of such efforts toward anomaly detection is to provide SREs with ample time to diagnose and troubleshoot the impending failure. Using system logs and metrics, automated ways of detecting anomalous behaviors as early indicators of system failure have become a common practice. Through this internship, we would like to achieve the following goals:
(1) Detect persistent anomalies using logs as a primary data source. The persistent detection algorithm aims to have an ensemble of supervised as well as unsupervised persistent detection (2) The supervised approach uses existing data sources to intelligently label the anomalies of interest for classification purposes (3) The persistent detection algorithm would work on top of anomaly detection, helping in filtering false positive otherwise invalid anomalies (4) Employ active learning techniques by incorporating external feedback by automated sources or humans in the loop"
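For illustration, one unsupervised member such an ensemble could include, sketched as an isolation forest over simple log-window features (assumes scikit-learn; the features and data are illustrative):

    import numpy as np
    from sklearn.ensemble import IsolationForest

    # One row per time window: [error_count, unique_templates, log_volume]
    windows = np.array([[2, 5, 100], [1, 4, 90], [3, 6, 110], [40, 30, 900]])
    clf = IsolationForest(random_state=0).fit(windows)
    print(clf.predict(windows))  # -1 flags anomalous windows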
Required Skills: AI, NLP, Time Series Analysis

Number : IRL_Project_31

Title : Host-network monitoring with eBPF/XDP

Description: eBPF (extended Berkeley Packet Filter) is emerging as a framework that makes it possible to perform high-speed packet processing in the Linux OS. XDP (eXpress Data Path) allows operators to attach eBPF programs to host network points, making eBPF/XDP a very powerful tool for host network monitoring. While there have been several research works (In-band Network Telemetry, sketch-based libraries, etc.) that focus on network debugging from the perspective of the data center network (switches), much work is needed to collect and coordinate this information with end-host container networks, especially since container networks encapsulate packets, losing the application's context. This project focuses on bridging this gap and providing application context for network debugging.
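For illustration, a minimal packet-counting sketch using the bcc Python bindings for XDP (assumes bcc is installed and run as root; the interface name eth0 is illustrative):

    import ctypes
    import time
    from bcc import BPF

    prog = r"""
    #include <uapi/linux/bpf.h>
    BPF_ARRAY(pkt_count, u64, 1);
    int xdp_counter(struct xdp_md *ctx) {
        int key = 0;
        u64 *value = pkt_count.lookup(&key);
        if (value)
            __sync_fetch_and_add(value, 1);
        return XDP_PASS;  /* count, then pass the packet up the stack */
    }
    """
    b = BPF(text=prog)
    b.attach_xdp("eth0", b.load_func("xdp_counter", BPF.XDP), 0)
    try:
        time.sleep(10)
        print("packets seen:", b["pkt_count"][ctypes.c_int(0)].value)
    finally:
        b.remove_xdp("eth0", 0)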
Required Skills: Networking, C programming, Linux, eBPF (optional)
 

Number : IRL_Project_32

Title : Heterogeneous Graph Clustering with fine-grained Code Comprehension

Description: "This task will help accelerate the pace at which we modernize legacy code into cloud ready microservices.
What has been accomplished ?
A) Heterogeneous Knowledge Graph : Given a legacy codebase, we employ a system that creates a heterogenous knowledge graph representation. We have heterogeneous nodes like Programs, UI, Tables and Files. We also have heterogeneous edges like CALLS (i.e. function calls) and CRUD (i.e. create, read, update, delete) . 
B) Graph Neural Networks for Graph Clustering : We currently employ a GNN that performs an unsupervised learning procedure. The GNN simultaneously learns the structure of the graph, identifies outliers in the graph and also performs clustering on the graph. For more details, please refer to our AAAI'21 work https://arxiv.org/abs/2102.03827. 
What will this task entail and what problem will this solve ?
Overloaded Nodes (programs) : While trying to group programs into clusters, we often run into large monolithic code snippets that are often 1) overloaded with functionality or 2) play an excessively important role in many up-stream business functionalities. Therefore in this work, we aim to study the impact of going from programs into functions for clustering. This task will involve representing these finer lever details as nodes in graphs and coming up with new node features that will help in more functionally aligned node clusters. We also aim to analyze and explain the edges and features that impact the clustering."
Required Skills: Graph Neural Networks, Program Analysis, Knowledge Graphs, Graph Theory, PyTorch Framework, Fluency in Java and Python
 

Impact Science (IS)

Number : IRL_Project_33

Title : Pre-trained representation of geo-spatial (weather/climate) data for accelerated climate aware applications

Description: Climate-aware applications in supply chains and asset management require incorporating a diverse set of climate and geo-spatial data to enable climate resiliency and informed decision-making. Climate and weather data are characterized by complex spatio-temporal relationships, inherent uncertainties, and the true nature of big data, which limits the pace of development and adoption of climate-aware applications.
Required Skills: AI/ML, Geo-spatial modelling, Experience working with weather and climate data
 

Number : IRL_Project_34

Title : Metamodeling of process-based biogeochemical model for GHG estimation using physics-informed multitask learning

Description: The farming stage of the food supply chain accounts for 69% of its total GHG emissions. Process-based biogeochemical models provide an opportunity to estimate the GHG emissions from any farm field by simulating the complex interaction between climate, soil, plant growth, and microbial activity. But scaling a biogeochemical model is limited by expensive computational resources and detailed data requirements. To overcome these challenges, we aim to build a metamodel that learns the complex interactions between all the physical processes driving GHG emissions, using the well-validated biogeochemical model DNDC. Recent studies show that physics-informed multitask learning is very effective at approximating partial differential equations. In this project, we will apply the physics-informed multitask learning approach to build the metamodel and compare its performance with other approaches.
Required Skills: Machine learning, Python
 

Number : IRL_Project_35

Title : Spectral-spatial-temporal fusion of satellite data to increase the spatio-temporal resolution of methane concentration maps

Description: Many greenhouse gas (GHG) measuring satellite missions, such as Sentinel-5P and GOSAT, are in place to globally monitor the concentration of methane, a potent greenhouse gas, using radiances in the short wave infrared (SWIR) bands. Non-GHG satellites such as Sentinel-2 also measure radiances in the SWIR bands but are not optimized for methane estimation. Together, the GHG and non-GHG satellites offer opportunities to fuse satellite data and obtain global methane concentration maps. In particular, non-GHG satellites weakly sense methane at 10x10 sq. meter spatial and 3-5 day temporal resolution, while GHG satellites strongly sense methane at 7x7 sq. kilometer spatial and daily temporal resolution. Hence, differences in the spectral, spatial, and temporal resolutions and overpass times of the satellite missions pose a challenge in fusing the data.
Required Skills: Statistics, Deep learning, Geo-spatial analysis





For careers and internships, see: http://www-03.ibm.com/employment/index.shtml