2022
Document Structure aware Relational Graph Convolutional Networks for Ontology Population
Abhay M Shalghar, Ayush Kumar, Balaji Ganesan, Aswin Kannan, Akshay Parekh, Shobha G
DLG4NLP Workshop at ICLR 2022
Abstract
Ontologies comprising concepts, their attributes, and relationships form the quintessential backbone of many knowledge-based AI systems. These systems manifest in the form of question answering or dialogue in a number of business analytics and master data management applications. While there have been efforts towards populating domain-specific ontologies, we examine the role of document structure in learning ontological relationships between concepts in any document corpus. Inspired by ideas from hypernym discovery and explainability, our method is about 15 points more accurate than a stand-alone R-GCN model for this task.
Fine Grained Classification of Personal Data Entities with Language Models
Abhinav Nagpal, Riddhiman Dasgupta, Balaji Ganesan
5th Joint International Conference on Data Science & Management of Data (9th ACM IKDD CODS and 27th COMAD), pp. 130-134, 2022
Abstract
Fine grained entity classification is the task of assigning context-specific, fine grained labels to entities extracted in an NLP Pipeline. Before the advent of language models, several artificial neural network models were proposed for this task. We revisit these models and compare them with BERT-based models for the specific task of classifying Personal Data Entities (PDE). We observe that using side information from rule-based annotators improves neural model performance on this task and can complement language models.
2021
Fair Data Generation using Language Models with Hard Constraints
SK Mainul Islam, Abhinav Nagpal, Balaji Ganesan, Pranay Kumar Lohia
CtrlGen Workshop at NeurIPS 2021
Abstract
Natural language text generation has seen significant improvements with the advent of pre-trained language models. Using such language models to predict personal data entities, in place of redacted spans in text, could help generate synthetic datasets. In order to address privacy and ethical concerns with such datasets, we need to ensure that the masked entity predictions are also fair and controlled by application specific constraints. We introduce new ways to inject hard constraints and knowledge into the language models that address such concerns and also improve performance on this task.
Reimagining GNN Explanations with ideas from Tabular Data
Anjali Singh, Shamanth K Nayak, Balaji Ganesan
XAI Workshop at ICML 2021
Abstract
Explainability techniques for Graph Neural Networks still have a long way to go compared to the explanations available for both neural and decision tree-based models trained on tabular data. Using a task that straddles both graphs and tabular data, namely Entity Matching, we comment on key aspects of explainability that are missing in GNN model explanations.
Towards Automated Evaluation of Explanations in Graph Neural Networks
Vanya BK, Balaji Ganesan, Aniket Saxena, Devbrat Sharma, Arvind Agarwal
XAI Workshop at ICML 2021
Abstract
Explaining Graph Neural Network predictions to end users of AI applications in easily understandable terms remains an unsolved problem. In particular, we do not have well-developed methods for automatically evaluating explanations in ways that are closer to how users consume those explanations. Based on recent application trends and our own experiences with real-world problems, we propose automatic evaluation approaches for GNN explanations.
Similar Cases Recommendation using Legal Knowledge Graphs
Jaspreet Singh Dhani, Ruchika Bhatt, Balaji Ganesan, Parikshet Sirohi, Vasudha Bhatnagar
KG Workshop at KDD 2021
Abstract
A legal knowledge graph constructed from court cases, judgments, laws and other legal documents can enable a number of applications like question answering, document similarity, and search. While the use of knowledge graphs for distant supervision in NLP tasks is well researched, using knowledge graphs for downstream graph tasks like node similarity presents challenges in selecting node types and their features. In this demo, we describe our solution for predicting similar nodes in a case graph derived from our legal knowledge graph.
Short Text Clustering in Continuous Time Using Stacked Dirichlet-Hawkes Process with Inverse Cluster Frequency Prior
Avirup Saha, Balaji Ganesan
MiLeTS Workshop at KDD 2021
Abstract
Traditional models for short text clustering ignore the time information associated with the text documents. However, existing works have shown that temporal characteristics of streaming documents are significant features for clustering. In this paper we propose a stacked Dirichlet-Hawkes process with inverse cluster frequency prior as a simple but effective solution for the task of short text clustering using temporal features in continuous time. Based on the classical formulation of the Dirichlet-Hawkes process, our model provides an elegant, theoretically grounded and interpretable solution while performing on par with recent state-of-the-art models in short text clustering.
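As a rough illustration of the classical Dirichlet-Hawkes idea the abstract builds on, the sketch below computes the temporal prior over cluster assignments for a newly arriving document. It assumes an exponential triggering kernel; the function name `cluster_prior` and the parameters `base_rate` and `decay` are illustrative assumptions, and the stacked construction and inverse cluster frequency prior described in the paper are not modeled here.

```python
import math


def cluster_prior(t, clusters, base_rate=1.0, decay=1.0):
    """Prior over cluster assignments for a document arriving at time t.

    clusters: one list of past arrival times per existing cluster.
    Returns one probability per existing cluster, followed by the
    probability of opening a new cluster (last entry).
    base_rate and decay are illustrative parameters, not values from the paper.
    """
    # Hawkes-style intensity per cluster: recent documents contribute more.
    intensities = [
        sum(math.exp(-decay * (t - ti)) for ti in times if ti < t)
        for times in clusters
    ]
    # The base rate plays the role of the Dirichlet process concentration.
    weights = intensities + [base_rate]
    total = sum(weights)
    return [w / total for w in weights]


# Cluster 0 was active just before t = 10, cluster 1 only long ago.
probs = cluster_prior(t=10.0, clusters=[[9.0, 9.5], [1.0]])
```

In this toy setting the recently active cluster receives far more prior mass than the stale one, which is the temporal effect the abstract refers to.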
Data Augmentation for Fairness in Personal Knowledge Graph Population
Lingraj S Vannur, Balaji Ganesan, Lokesh Nagalapatti, Hima Patel, MN Thippeswamy
Data Assessment and Readiness for AI workshop at PAKDD 2021, arXiv
Abstract
A personal knowledge graph comprising people as nodes, their personal data as node attributes, and their relationships as edges has a number of applications in de-identification, master data management, and fraud prevention. While artificial neural networks have led to significant improvements in different tasks in cold start knowledge graph population, the overall F1 of the system remains quite low. This problem is more acute in personal knowledge graph population which presents additional challenges with regard to data protection, fairness and privacy. In this work, we present a system that uses rule based annotators to augment training data for neural models, and for slot filling to increase the diversity of the populated knowledge graph. We also propose a representative set sampling method to use the populated knowledge graph data for downstream applications. We introduce new resources and discuss our results.
2020
Explainable Link Prediction for Privacy-Preserving Contact Tracing
Balaji Ganesan, Hima Patel, Sameep Mehta
SpicyFL Workshop at NeurIPS, 2020
Abstract
Contact tracing has been used to identify people who were in close proximity to those infected with the SARS-CoV-2 coronavirus. A number of digital contact tracing applications have been introduced to facilitate or complement physical contact tracing. However, there are a number of privacy issues in the implementation of contact tracing applications, which make people reluctant to install these applications or update their infection status on them. In this concept paper, we present ideas from Graph Neural Networks and explainability that could improve trust in these applications and encourage adoption by people.
Anu Question Answering System
Balaji Ganesan, Avirup Saha, Jaydeep Sen, Matheen Ahmed Pasha, Sumit Bhatia, Arvind Agarwal
ISWC 2020, CEUR-WS
Abstract
AnuQA is a question answering system built on top of a search index and an enterprise knowledge graph. In this work, we describe five semantic technologies that have helped us address real-world challenges in deploying this system. These challenges include bias in knowledge base population, entity re-resolution on streaming data, ontology alignment across data sources, explaining relationships, and providing a single unified query interface for business analytics.
Link Prediction using Graph Neural Networks for Master Data Management
Balaji Ganesan, Gayatri Mishra, Srinivas Parkala, Neeraj R Singh, Hima Patel, Somashekar Naganna
Technical Report, arXiv, 2020
Abstract
Learning graph representations of n-ary relational data has a number of real-world applications such as anti-money laundering, fraud detection, and risk assessment. Graph Neural Networks have been shown to be effective in predicting links with few or no node features. While a number of datasets exist for link prediction, their features are considerably different from real-world applications. Temporal information on entities and relations is often unavailable. We introduce a new dataset with 10 subgraphs, 20912 nodes, 67564 links, 70 attributes and 9 relation types. We also present novel improvements to graph models to adapt them for industry-scale applications.
An Integrated Graph Neural Network for Supervised Non-obvious Relationship Detection in Knowledge Graphs
Phillipp Muller, Xiao Qin, Balaji Ganesan, Nasrullah Sheikh and Berthold Reinwald
EDBT, 2020
Abstract
Non-obvious relationship detection (NORD) in a knowledge graph is the problem of finding hidden relationships between the entities by exploiting their attributes and connections to each other. Existing solutions either only focus on entity attributes or on certain aspects of the graph structural information but ultimately do not provide sufficient modeling power for NORD.
In this paper, we propose KGMatcher -- an integrated graph neural network-based system for NORD. KGMatcher characterizes each entity by extracting features from its attributes, local neighborhood, and global position information essential for NORD. It supports arbitrary attribute types by providing a flexible interface to dedicated attribute embedding layers.
The neighborhood features are extracted by adopting aggregation-based graph layers, and the position information is obtained from sampling-based position aware graph layers. KGMatcher is trained end-to-end in the form of a Siamese network for producing a symmetric scoring function with the goal of maximizing the effectiveness of NORD. Our experimental evaluation with a real-world data set demonstrates KGMatcher's 6% to 35% improvement in AUC and 3% to 15% improvement in F1 over the state-of-the-art.
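To make the "Siamese network producing a symmetric scoring function" idea concrete, here is a minimal sketch, not KGMatcher's implementation. It assumes a toy shared encoder (one linear layer with tanh) and cosine similarity as the score; the names `encode`, `score`, and the weight matrix `W` are hypothetical.

```python
import math


def encode(features, weights):
    """Shared encoder: one linear layer followed by tanh (toy stand-in)."""
    return [math.tanh(sum(w * x for w, x in zip(row, features)))
            for row in weights]


def score(a, b, weights):
    """Cosine similarity of the two embeddings.

    Both entities pass through the same encoder with the same weights,
    so swapping a and b cannot change the result.
    """
    ea, eb = encode(a, weights), encode(b, weights)
    dot = sum(x * y for x, y in zip(ea, eb))
    norm = math.sqrt(sum(x * x for x in ea)) * math.sqrt(sum(y * y for y in eb))
    return dot / norm if norm else 0.0


# Hypothetical weights and entity feature vectors.
W = [[0.5, -0.2, 0.1], [0.3, 0.8, -0.5]]
a, b = [1.0, 0.0, 2.0], [0.5, 1.5, -1.0]
s_ab = score(a, b, W)
s_ba = score(b, a, W)
```

Weight sharing is what gives the symmetry: a model with two separate encoders could score the pair (a, b) differently from (b, a).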
2019
A Neural Architecture for Person Ontology population
Balaji Ganesan, Riddhiman Dasgupta, Akshay Parekh, Hima Patel, Berthold Reinwald
Technical Report, arXiv, 2019
Abstract
A person ontology comprising concepts, attributes and relationships of people has a number of applications in data protection, de-identification, population of knowledge graphs for business intelligence and fraud prevention. While artificial neural networks have led to improvements in Entity Recognition, Entity Classification, and Relation Extraction, creating an ontology largely remains a manual process, because it requires a fixed set of semantic relations between concepts. In this work, we present a system for automatically populating a person ontology graph from unstructured data using neural models for Entity Classification and Relation Extraction. We introduce a new dataset for these tasks and discuss our results.
Collective Learning From Diverse Datasets for Entity Typing in the Wild
Abhishek, Abhishek, Amar Prakash Azad, Balaji Ganesan, Ashish Anand, and Amit Awekar.
EYRE Workshop co-located with CIKM, CEUR, 2019
Abstract
Entity typing (ET) is the problem of assigning labels to given entity mentions in a sentence. Existing works for ET require knowledge about the domain and target label set for a given test instance. ET in the absence of such knowledge is a novel problem that we address as ET in the wild. We hypothesize that the solution to this problem is to build supervised models that generalize better on the ET task as a whole, rather than a specific dataset. In this direction, we propose a Collective Learning Framework (CLF), which enables learning from diverse datasets in a unified way. The CLF first creates a unified hierarchical label set (UHLS) and a label mapping by aggregating label information from all available datasets. Then it builds a single neural network classifier using UHLS, label mapping and a partial loss function. The single classifier predicts the finest possible label across all available domains even though these labels may not be present in any domain-specific dataset. We also propose a set of evaluation schemes and metrics to evaluate the performance of models in this novel problem. Extensive experimentation on seven diverse real-world datasets demonstrates the efficacy of our CLF.
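One way to picture training against a unified hierarchical label set with a partial loss is sketched below. This is an assumption-laden illustration, not the CLF's actual loss: it treats a coarse dataset label as compatible with itself and every finer label beneath it, and penalizes the negative log of the probability mass placed on that candidate set. The hierarchy `PARENT` and the function names are hypothetical.

```python
import math

# Hypothetical unified hierarchy: child label -> parent label.
PARENT = {
    "person": None,
    "artist": "person",
    "musician": "artist",
    "organization": None,
    "company": "organization",
}


def descendants_and_self(label):
    """Return the label plus every label below it in the hierarchy."""
    out = {label}
    changed = True
    while changed:
        changed = False
        for child, parent in PARENT.items():
            if parent in out and child not in out:
                out.add(child)
                changed = True
    return out


def partial_loss(probs, observed_label):
    """Negative log of the probability mass on labels compatible with the
    (possibly coarse) observed label. Finer predictions are not penalized."""
    mass = sum(probs[l] for l in descendants_and_self(observed_label))
    return -math.log(mass)


# Illustrative model output over the unified label set.
probs = {"person": 0.1, "artist": 0.2, "musician": 0.5,
         "organization": 0.1, "company": 0.1}
loss_coarse = partial_loss(probs, "person")
loss_fine = partial_loss(probs, "musician")
```

Under this sketch, a dataset annotated only with "person" does not punish the model for predicting "musician", which is how a single classifier could learn finer labels than some training sets provide.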
2018
Document Structure Measure for Hypernym discovery
Aswin Kannan, Shanmukha C Guttula, Balaji Ganesan, Hima P Karanam, Arun Kumar
Technical Report, arXiv, 2018
Abstract
Hypernym discovery is the problem of finding terms that have an is-a relationship with a given term. We introduce a new context type and a relatedness measure to differentiate hypernyms from other types of semantic relationships. Our Document Structure measure is based on the hierarchical position of terms in a document, and their presence or otherwise in definition text. This measure quantifies the document structure using multiple attributes and classes of weighted distance functions.
A Unified Labeling Approach by Pooling Diverse Datasets for Entity Typing
Abhishek, Amar Prakash Azad, Balaji Ganesan, Ashish Anand, and Amit Awekar
Technical Report, arXiv, 2018
Abstract
Evolution of entity typing (ET) has led to the generation of multiple datasets. These datasets span from being coarse-grained to fine-grained encompassing numerous domains. Existing works primarily focus on improving the performance of a model on an individual dataset, independently. This narrowly focused view of ET causes two issues: 1) type assignment when information about the test data domain or target label set is not available; 2) fine-grained type prediction when there is no dataset in the same domain with finer-type annotations. Our goal is to shift the focus from individual domain-specific datasets to all the datasets available for ET. In our proposed approach, we convert the label set of all datasets to a unified hierarchical label set while preserving the semantic properties of the individual labels. Then utilizing a partial label loss, we train a single neural network based classifier using every available dataset for the ET task. We empirically evaluate the effectiveness of our approach on seven real-world diverse ET datasets. The results convey that the combined training on multiple datasets helps the model to generalize better and to predict fine-types across all domains without relying on a specific domain or label set information during evaluation.
Fine Grained Classification of Personal Data Entities
Riddhiman Dasgupta, Balaji Ganesan, Aswin Kannan, Berthold Reinwald, and Arun Kumar.
Technical Report, arXiv, 2018
Abstract
Entity Type Classification can be defined as the task of assigning category labels to entity mentions in documents. While neural networks have recently improved the classification of general entity mentions, pattern matching and other systems continue to be used for classifying personal data entities (e.g. classifying an organization as a media company or a government institution for GDPR and HIPAA compliance). We propose a neural model to expand the class of personal data entities that can be classified at a fine-grained level, using the output of existing pattern matching systems as additional contextual features. We introduce new resources: a personal data entities hierarchy with 134 types, and two datasets from the Wikipedia pages of elected representatives and Enron emails. We hope these resources will aid research in the area of personal data discovery, and to that effect, we provide baseline results on these datasets and compare our method with state-of-the-art models on the OntoNotes dataset.
2017
Cognitive Compliance for Financial Regulations
Arvind Agarwal, Balaji Ganesan, Ankush Gupta, Nitisha Jain, Hima P Karanam, Arun Kumar, Nishtha Madaan, Vitobha Munigala, Srikanth G Tamilselvam
IT Professional 19(4), pp. 28-35, IEEE, 2017
Abstract
Regulations are rules and directives published by authorities to safeguard consumer interest in an industry. Compliance with such regulations is getting increasingly hard due both to the complexity of these documents, which require experts to read, understand, and interpret