IBM Research India - Summer 2021 Internship Projects and Research Areas
Number : IRL_Project_1
Title : Relationship Extraction and Linking in Knowledge Base Question Answering.
Description: With the advancement of the Semantic Web and the increasing availability of structured data, question answering systems over knowledge bases have evolved considerably over the last decade. In any KBQA task, parsing the input question to extract entities/relations and linking them with nodes/edges of the given KG is a critical and challenging sub-task. While there are many off-the-shelf entity/relation linkers available, state-of-the-art approaches for KBQA use custom-designed, semantic-parsing-style linkers built into their approach. Our study shows that the performance of a modular architecture (i.e., one using an off-the-shelf linker) critically depends on the performance of the linker. We would like to address the challenge of relationship extraction and linking here and provide an innovative solution.
Required Skills: Python, Natural Language Processing, PyTorch, Machine Learning, Deep Learning techniques, etc.
Number : IRL_Project_2
Title : Semantic Parsing and Reasoning for Math Word Problems
Description: Semantic parsing is a key component in solving math word problems, where the aim is to convert problems stated in natural-language text into math expressions involving variables, constants, operators, and functions. The focus is on identifying the (contextual) mentions of the various components of the expressions and their inter-relations, so as to faithfully translate the NL problem statement into a valid math expression that can then be resolved through mathematical reasoning.
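The pipeline above (parse the NL statement into an expression, then resolve it by math reasoning) can be illustrated with a deliberately tiny sketch. The pattern, vocabulary, and single supported problem form below are illustrative assumptions, not the project's actual parser:

```python
import re

# Toy semantic parser: map a very restricted class of NL math problems
# of the form "<k> a number <plus|minus> <b> equals <c>" to the equation
# k*x + sign*b = c, then solve for x by simple algebra.
WORD_NUMS = {"twice": 2, "thrice": 3}

def parse_and_solve(text):
    m = re.match(r"(\w+) a number (plus|minus) (\d+) equals (\d+)", text.lower())
    if not m:
        raise ValueError("unsupported problem form")
    k = WORD_NUMS[m.group(1)]               # coefficient of the unknown
    b, c = int(m.group(3)), int(m.group(4))
    sign = 1 if m.group(2) == "plus" else -1
    # k*x + sign*b = c  =>  x = (c - sign*b) / k
    return (c - sign * b) / k

print(parse_and_solve("Twice a number plus 3 equals 11"))  # 4.0
```

A real system would replace the regex with a learned parser (e.g., a graph neural network over the problem text), but the contract is the same: NL in, a resolvable math expression out.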
Required Skills: NLP, Deep Neural Networks (Graph Neural Networks), Pytorch
Number : IRL_Project_3
Title : Goal driven next best action recommendation models for business processes
Description: Business processes and business process management (BPM) are an essential constituent of many modern enterprises. They encode organizational and operational knowledge and often persist independently of changes to personnel and infrastructure. As such, their design, execution, and responsiveness to change are of critical significance to establishing and maintaining efficient business operations. In addition, current trends toward flexible methods of working, just-in-time organizational reaction times, distributed intra- and inter-organization collaboration, and constantly changing markets are creating new and complex business landscapes. This brings about increased complexity, further motivating the need for real-time dynamic change throughout an enterprise's business processes.
In this problem we would like to enhance AI-driven recommendation models for business process tasks (such as next best agent, next best task, and decision recommendation) to be aware of goals and KPIs specified by business personas. We would like to explore techniques to make the entire AI model-building pipeline (data preparation, process-aware feature engineering, model building, online learning, drift detection) goal-oriented.
Required Skills: Machine learning, Data science, recommender systems, reinforcement learning
Number : IRL_Project_4
Title : A hybrid NLQ engine for robust QA
Description: The NLU layer is best designed via state-of-the-art semantic parsing, while complex query translation is best achieved via deterministic AI applications. Current approaches in the NLP/DL community treat text-to-SQL as a single end-to-end problem, including query translation. Query translation involves many domain- and/or target-language-specific constructs that are hard to learn for complex queries and hard to generalize across domains and backends.
We propose a common semantic parsing layer that takes an NL question and represents the intent in a domain-agnostic way, leaving query translation to a deterministic AI application that can derive target queries far more precisely and accurately. Such a hybrid model is expected to achieve state-of-the-art results on real datasets with BI queries, beyond single-table/simple-query benchmarks like WikiSQL/WikiTable.
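The proposed split can be sketched minimally: a (stubbed) semantic-parsing layer emits a domain-agnostic intent, and a deterministic translator renders it as target SQL. The intent schema and table/column names here are illustrative assumptions:

```python
def translate_to_sql(intent):
    """Deterministically render a domain-agnostic intent dict as SQL."""
    cols = ", ".join(intent["select"])
    sql = f"SELECT {cols} FROM {intent['from']}"
    if intent.get("filters"):
        conds = " AND ".join(f"{c} {op} {v!r}" for c, op, v in intent["filters"])
        sql += f" WHERE {conds}"
    if intent.get("group_by"):
        sql += " GROUP BY " + ", ".join(intent["group_by"])
    return sql

# In the full system this intent would come from a learned semantic parser;
# here we hard-code what such a parser might emit for
# "total revenue per region for 2020".
intent = {
    "select": ["region", "SUM(revenue)"],
    "from": "sales",
    "filters": [("year", "=", 2020)],
    "group_by": ["region"],
}
print(translate_to_sql(intent))
```

The point of the design is that only the parser needs learning; swapping the backend (SQL dialect, SPARQL, etc.) means swapping the deterministic translator, not retraining the model.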
Required Skills: NLP, DL, solid algorithm design skills.
Number : IRL_Project_5
Title : Interpretable latent variables for fine-grained control over response generation
Description: Sequence to sequence models and their variants are often used for generating responses in conversations. These models are often treated as black-box, thereby making it impossible to determine why a response was generated for a given dialog. Moreover, these models provide little control over the attributes of the response, for instance, the grammatical structure, the tone, the action expressed in the response, etc. In this work, we intend to provide the user with a set of knobs for fine-grained control over response generation. Each of these knobs corresponds to an interpretable latent variable that is learnt from the data in an unsupervised manner.
Required Skills: Machine Learning, Probability
Number : IRL_Project_6
Title : Custom data synthesis using GAN
Description: The problem is to understand structured data generation techniques and create a GAN-based synthesizer that not only captures all the details in the input data but also allows external customizations.
These customizations can be extrinsic (on the structure of the data set), explicitly semantic, or implicit (statistical properties and relationships in the data structure). For example, consider a case where the training data is taken from a North American region where the male-to-female ratio is, say, 3:2. To adapt the data for a different geographic region that has a different ratio (say 1:1), it is important to generate synthetic data with a constraint on the male-to-female ratio.
Required Skills: Tensorflow, GAN, VAE
Number : IRL_Project_7
Title : Data Transformation using Programming By Example
Description: Today, data scientists, data stewards, and business analysts increasingly need to clean and transform diverse data sets before they can perform analysis. This process of data transformation is an important part of data preparation and is known to be difficult and time-consuming for end users. In this problem space, we aim to solve such problems automatically, with the analyst describing the transformation or cleaning intent through a few input-output example pairs. This technique reduces the burden of writing manual annotators/programs/logic/code.
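The example-pair workflow can be sketched as a toy programming-by-example loop: enumerate a tiny DSL of string transformations and keep the first program consistent with every user-supplied pair. The DSL here is an illustrative assumption; real systems search vastly larger, compositional program spaces:

```python
# Tiny DSL of candidate transformations (illustrative).
DSL = {
    "upper": str.upper,
    "lower": str.lower,
    "strip": str.strip,
    "title": str.title,
}

def synthesize(examples):
    """Return the name of a DSL program matching every (input, output) pair."""
    for name, fn in DSL.items():
        if all(fn(i) == o for i, o in examples):
            return name
    return None  # no consistent program in this DSL

examples = [("john smith", "John Smith"), ("ada lovelace", "Ada Lovelace")]
print(synthesize(examples))  # 'title'
```

The analyst never writes code: two examples suffice to pin down the transformation, and adding a counterexample would prune wrong candidates.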
Required Skills: Python, Machine Learning and Deep Learning, Graph Theory, Optimisation, NLP
Number : IRL_Project_8
Title : Privacy-Preserving Explainable Hierarchical TimeSeries Forecasting Framework
Description: Sales and inventory data in supply chains have spatio-temporal characteristics, leading to a hierarchical time series framework. Accurate forecasting and decision making in a hierarchical framework depend on the data shared amongst participants at the same hierarchical level as well as on data shared across different levels of the hierarchy. Although conceptually attractive, many participants in the supply chain are unwilling to share detailed information for fear of unfair exploitation by competing parties. There may also be regulatory requirements that limit sharing raw data outside the data owner. Hence, privacy-preserving protocols are the need of the hour. The scalability of the solution infrastructure is essential, as it involves handling a large volume of time-series data.
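To make the hierarchical structure concrete, here is a minimal sketch of bottom-up reconciliation for an assumed two-level hierarchy (total = store A + store B): independent base forecasts are made coherent by re-deriving every aggregate from the leaf-level forecasts. The series names and values are made up:

```python
def bottom_up(leaf_forecasts, hierarchy):
    """Reconcile forecasts bottom-up.

    hierarchy maps each aggregate series to its leaf children; every
    aggregate forecast is recomputed as the sum of its leaves, so the
    result is coherent by construction.
    """
    coherent = dict(leaf_forecasts)
    for agg, children in hierarchy.items():
        coherent[agg] = sum(leaf_forecasts[c] for c in children)
    return coherent

leaves = {"store_A": 60.0, "store_B": 42.0}
hierarchy = {"total": ["store_A", "store_B"]}
print(bottom_up(leaves, hierarchy))  # total becomes 102.0
```

In the privacy-preserving setting, the interesting part is computing such aggregates without any party revealing its raw leaf series, e.g., via secure aggregation or differentially private releases.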
Required Skills: Python Programming, Kubernetes, Timeseries Forecasting
Number : IRL_Project_9
Title : Advanced drought forecasting methods at seasonal and sub-seasonal scales
Description: Droughts are amongst the most disastrous natural hazards and affect several businesses such as agriculture and food supply chains, hydropower generation, and numerous water-intensive manufacturing industries. Accurate forecasting of droughts, in particular agricultural and hydrological droughts, requires a deep understanding of physical aspects such as weather, soil, vegetation, runoff, groundwater, etc., and the interactions between them. State-of-the-art forecasting methods take only a subset of these processes and interactions into account. In this project, we would focus on developing hybrid methods (data- and simulation-based) that account for these non-linear interactions by fusing data from diverse sources, such as improved physical model simulations and high-resolution satellite data on vegetation, soil, and land characteristics, in conjunction with (sub)seasonal forecasts. With this approach, we expect to enhance drought prediction skill compared to state-of-the-art results reported in the literature.
Required Skills: Machine Learning, Numerical modelling, Familiarity with geo-spatial data (satellite, weather)
Number : IRL_Project_10
Title : Automatic conversation disentanglement
Description: Collaboration platforms like Slack and IRC are used heavily for IT support. On these platforms, multiple conversations happen simultaneously, mixed together. Automatic disentanglement can extract isolated conversations, which can then be used for various downstream tasks like intent extraction and resolution classification. This task requires manually annotated datasets for building machine-learning-based models, and various manually annotated public datasets are available. In this proposal we would like to experiment with and build new deep learning models for disentanglement.
Kummerfeld, Jonathan K., Gouravajhala, Sai R., Peper, Joseph J., Athreya, Vignesh, Gunasekara, Chulaka, Ganhotra, Jatin, Patel, Siva Sankalp, Polymenakos, Lazaros C., and Lasecki, Walter. "A Large-Scale Corpus for Conversation Disentanglement." ACL 2019.
Required Skills: Natural Language Processing, Pytorch/Tensorflow, Deep learning
AI and Security
Number : IRL_Project_11
Title : PPHTS: Privacy Preserving Hierarchical Time Series Forecasting Framework
Description: Sales and inventory data in supply chains have spatio-temporal characteristics, leading to a hierarchical time series framework. Accurate forecasting and decision making in a hierarchical framework depend on the data shared amongst retailers at the same hierarchical level as well as on data shared across different levels of the hierarchy. Although conceptually attractive, many participants in the supply chain are unwilling to share detailed information for fear of unfair exploitation by competing parties. There may also be regulatory requirements that limit the extent of data shared outside the data owner.
A candidate with a research background in differential privacy will be preferred. For undergraduate and master's applicants, we expect mathematical maturity and strong programming skills.
Required Skills: differential privacy, mathematical maturity and strong programming skills
Number : IRL_Project_12
Title : Performance benchmarking of containerized network function chains
Description: Network Functions (NFs) are components that facilitate communication between two end points. For example, suppose you are accessing a cloud-based application like Google Drive from your laptop. Your edits and requests are accurately received by the Drive application thanks to functions like load balancers and firewalls, which are part of the network that lies between your home's modem and the cloud servers on which the application runs.
With the objective of optimising the performance of network functions on Kubernetes, this internship will study problems pertaining to resource allocation and placement of CNF workloads. A further question is whether, based on this analysis, we can provide an algorithm that predicts which CNFs can be placed together without degrading their performance, and how chained or interacting CNFs should be placed. We will be working towards a research publication in this regard.
Required Skills: 1. Virtualization concepts, Containers, Kubernetes
2. Basics of networking
Number : IRL_Project_13
Title : Closed-loop fault management for telecom network functions
Description: Network functions, such as the workloads that process the voice, video and data traffic from innumerable mobile devices at a telecom vendor's infrastructure, are extremely complex and have stringent requirements on resiliency and fault tolerance. Today's high speed networks require extremely quick detection, root cause analysis and problem remediation through closed-loop orchestration. The large number of interacting components and highly dynamic environment make this an extremely challenging problem.
This internship will involve the following interesting and challenging facets:
- Real-time monitoring of metrics, logs and changes in deployment topology using modern cloud telemetry tools, including addressing what should be monitored and how
- Cutting-edge AI-based analytics combining insights from metrics, logs and topology for root cause analysis and fault localisation
- Orchestration of remediation actions to remedy or mitigate failures using state-of-the-art orchestration tools
Required Skills: Strong systems development and software engineering skills; Cloud and distributed systems background
Number : IRL_Project_14
Title : Data Centric Modernization
Description: Application modernization focuses on application artifacts such as source code, binaries, and configuration files to analyze and plan the modernization strategy. The data and resources used are the elephant in the room: they are tightly coupled with the application being migrated. Data modernization is often left to architects and developers, and application modernization recommendations do not completely address it. Here, we focus on the data as the key resource to be modernized, identify strategies for modernizing the data infrastructure, and thereby recommend application modernization in line with the data modernization.
Required Skills: Unsupervised learning methods, Good Database fundamentals
Number : IRL_Project_15
Title : Reactive Microservices
Description: Reactive systems are designed to be responsive, resilient, elastic, and message driven. They rely on asynchronous message passing, which allows for loose coupling (publish/subscribe, async streams, or async calls) between established service boundaries, isolation (one failure does not ripple through to upstream services and clients), and improved, responsive error handling. We will explore the reactive-systems paradigm to design stateful microservices, as there is a growing realization that not every application can or should be modeled as stateless in cloud-native environments. Eventually, we would like to evaluate how such reactive microservices fare on the properties of resilience, elasticity, and responsiveness.
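The loose coupling and isolation properties described above can be shown with a minimal in-process publish/subscribe sketch (a stand-in for a real message broker): one subscriber's failure does not affect other subscribers or the publisher.

```python
class Bus:
    """Toy in-process pub/sub bus illustrating loose coupling + isolation."""

    def __init__(self):
        self.subscribers = {}

    def subscribe(self, topic, handler):
        self.subscribers.setdefault(topic, []).append(handler)

    def publish(self, topic, message):
        for handler in self.subscribers.get(topic, []):
            try:
                handler(message)   # isolation: one failing handler...
            except Exception:
                pass               # ...does not ripple to the others

bus = Bus()
seen = []
bus.subscribe("orders", lambda m: seen.append(m))
bus.subscribe("orders", lambda m: 1 / 0)  # a faulty subscriber
bus.publish("orders", "order-42")
print(seen)  # ['order-42'] - the healthy subscriber still got the message
```

In a real reactive system the bus would be an external, asynchronous broker and the handlers independent services, but the decoupling contract is the same.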
Required Skills: Programming Models, Cloud Computing, Algorithms
Number : IRL_Project_16
Title : Identifying target component mapping when re-platforming to Openshift
Description: When re-platforming applications to Openshift, there are multiple ways in which each component can be moved. For example, for a database component, the choices range from creating containers, to using IBM CloudPaks, to SaaS offerings. The right mapping depends on the features used by the application and on its compliance requirements.
The goal of this project would be to use program analysis techniques, natural language understanding and knowledge of the target environments to help users make the best choice for their specific deployment scenarios and to help in configuring the service with the appropriate configurations based on the current deployment.
Required Skills: Golang, Docker, Research experience in container fundamentals, source code and program analysis is very useful.
Number : IRL_Project_17
Title : Hybrid Cloud Platform & Observability Pipeline Performance Improvements
Description: OpenShift Container Platform (OCP) is a flagship platform developed by Red Hat. It is well positioned to win a wide variety of hybrid cloud deployments of services and solutions in the near future, and is designed on top of the Kubernetes container-orchestration system. In OCP, observability of applications and provisioned infrastructure plays a key role in monitoring application performance and infrastructure health in real time. In this project we will explore and investigate gaps in OCP's observability features, logging pipeline, and several key performance metrics. The work will give good exposure to many popular and thriving open-source community projects and source code repositories, and allow one to contribute key new features that improve observability.
Required Skills: Decent exposure to kernel-level programming; well versed in Linux and programming languages such as Go, Ruby, or Python; some exposure to Elasticsearch and database technologies is good to have. Excellent programming aptitude with a problem-solving attitude.
Number : IRL_Project_18
Title : Inter-component call refactoring
Description: Porting web applications to Kubernetes brings multiple advantages for application owners in terms of scalability and agility. Monoliths, when hosted on Kubernetes, do not allow the same level of agility as an application that has been developed cloud-native.
In this project we will take a specific case of monoliths that are deployed as EARs and use program analysis techniques to help refactor the constituent components as containers. As part of this journey, the various communication mechanisms used by the components will need to be refactored; these might include function calls, Web APIs, file sharing, etc. By understanding the current system through program analysis, this project will attempt to help users find the right refactorings for the target system.
Required Skills: Program Analysis
Number : IRL_Project_19
Title : Fault Localization using Heterogeneous Graph in Cloud Environment
Description: This project addresses fault localization in heterogeneous cloud environments using graph-based approaches. In a heterogeneous environment, an application topology can have heterogeneity in both nodes and edges. The edges can be of type causal, dependency, association, linksTo, etc. The nodes can have multiple types, such as service, server, or router, depending on the component each represents. Standard graph metrics therefore cannot be applied directly in this setting, since which edges and nodes play a role differs depending on the type of error. We aim to accurately localize faults in heterogeneous cloud environments by customizing various graph centrality metrics to the heterogeneous graph setting.
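A minimal sketch of "customizing centrality to the heterogeneous setting" is a type-weighted degree centrality, where each edge type contributes differently depending on the fault type being localized. The weights, fault type, and topology below are illustrative assumptions:

```python
# Per fault type: how much each edge type matters (assumed weights).
EDGE_WEIGHTS = {
    "latency": {"causal": 1.0, "dependency": 0.5, "linksTo": 0.1},
}

def weighted_centrality(edges, fault_type):
    """edges: list of (src, dst, edge_type); returns node -> score."""
    w = EDGE_WEIGHTS[fault_type]
    scores = {}
    for src, dst, etype in edges:
        for node in (src, dst):
            scores[node] = scores.get(node, 0.0) + w.get(etype, 0.0)
    return scores

edges = [
    ("svc_a", "svc_b", "causal"),
    ("svc_b", "db1", "dependency"),
    ("svc_a", "router1", "linksTo"),
]
scores = weighted_centrality(edges, "latency")
print(max(scores, key=scores.get))  # 'svc_b' ranks highest for latency faults
```

Swapping the weight table per fault type is exactly the kind of customization a plain (type-blind) centrality metric cannot express.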
Required Skills: systems, graph theory
Number : IRL_Project_20
Title : Lifecycle-aware supply chains
Description: Supply chains typically focus on 'here and now' benefits to determine how to operate optimally. However, this perspective might change if we take a lifecycle perspective. For example, consider transportation of beef (A) vs. raw corn (B): in A, beef is transported from the processing unit to the store over only 10 km, while in B the same mass of raw corn is transported over 100 km. The 'here and now' outlook would rate B as more polluting than A. From a lifecycle perspective, however, beef is a very carbon-intensive product that has already accrued a large amount of emissions, so A is actually far more polluting. What is the best way to introduce these ideas into the current supply chain paradigm?
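The beef-vs-corn comparison can be worked through numerically. All emission factors below are made-up stand-ins, not real LCA figures; the point is only the contrast between transport-only ('here and now') and lifecycle accounting:

```python
# Assumed factors (illustrative only).
TRANSPORT_KG_CO2_PER_TONNE_KM = 0.1
EMBODIED_KG_CO2_PER_TONNE = {"beef": 27000.0, "corn": 1000.0}

def transport_only(product, tonnes, km):
    """'Here and now' view: emissions from this transport leg only."""
    return tonnes * km * TRANSPORT_KG_CO2_PER_TONNE_KM

def lifecycle(product, tonnes, km):
    """Lifecycle view: transport leg plus emissions already embodied upstream."""
    return transport_only(product, tonnes, km) + tonnes * EMBODIED_KG_CO2_PER_TONNE[product]

# 1 tonne of beef over 10 km vs 1 tonne of corn over 100 km.
print(transport_only("beef", 1, 10) < transport_only("corn", 1, 100))  # True: corn looks worse
print(lifecycle("beef", 1, 10) > lifecycle("corn", 1, 100))            # True: beef is worse
```

The two views reverse the ranking, which is precisely the gap the project proposes to close in supply chain optimization.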
Required Skills: Life-cycle assessment, Mathematical Optimization, Python, Numpy, Scipy
Number : IRL_Project_21
Title : Towards Energy-Aware Resource Allocation in GPU Clusters
Description: Environmental directives aimed at enhancing the sustainability of our planet by limiting the carbon emissions of human activities have led to stepped-up efforts to reduce the energy consumption of computing workloads and of datacenter and cloud operations. With the increasing prominence of AI/ML in solving a wide range of problems, the fraction of ML-based workloads in the compute mix is steadily increasing. Hence, efforts to quantify and mitigate the carbon footprint of ML workloads have started receiving the attention of the research community.
ML workloads, especially those that train deep-learning-based AI models, can be extremely compute intensive and are executed on specialized hardware accelerators, most commonly clusters of GPUs. There is already some work underway to devise scheduling mechanisms suited to a mix of ML model-building and training workloads, with the objective of improving the resource utilization of GPUs while providing fairness to jobs. These efforts, however, do not currently consider improving the energy efficiency of GPU clusters while executing ML workloads, which we intend to explore. The effort will involve profiling and characterizing the energy consumption of popular GPUs against resource utilization for different types of ML workloads, identifying energy-optimal operating points, developing lightweight energy models, and discerning insights that can inform offline resource allocation mechanisms or provide guidelines for online scheduling algorithms. This research will complement other efforts on mitigating AI's carbon footprint (such as Green AI) that focus on designing compact models and reducing the computational demands of training per se.
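The "energy-optimal operating point" step can be sketched from profiling data. The numbers below are made-up stand-ins for measured (GPU clock, power draw, training throughput) triples; energy per sample is simply power divided by throughput:

```python
# Assumed profiling data: (sm_clock_mhz, power_watts, samples_per_sec).
profile = [
    (900,  180.0, 1500.0),
    (1200, 250.0, 2100.0),
    (1500, 330.0, 2300.0),
]

def energy_per_sample(point):
    """Joules spent per training sample at this operating point."""
    _, power, throughput = point
    return power / throughput

best = min(profile, key=energy_per_sample)
print(best[0])  # 1200 - highest clock is fastest but not most energy-efficient
```

Note that the energy-optimal clock (1200 MHz here) is neither the fastest nor the lowest-power point, which is why profiling, rather than a fixed policy, is needed.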
Required Skills: ML, systems; should be comfortable with Linux and GPUs; systems programming desirable.