Knowledge Discovery and Data Mining Publications
2018
Discovering Urban Travel Demands Through Dynamic Zone Correlation in Location-Based Social Networks
Wangsu Hu, Zijun Yao, Sen Yang, Shuhong Chen, Peter J Jin
Joint European Conference on Machine Learning and Knowledge Discovery in Databases (ECML-PKDD), pp. 88--104, 2018
A Probabilistic Hough Transform for Opportunistic Crowd-sensing of Moving Traffic Obstacles
Michiaki Tatsubori, Aisha Walcott-Bryant, Reginald Bryant, John Wamburu
2018 SIAM International Conference on Data Mining (SDM 2018)
Abstract (to appear)
Traffic congestion in developing cities like Nairobi, Kenya can be significantly impacted by the presence of Moving Traffic Obstacles (MTOs). These MTOs are events that temporarily exist on the road, moving with or against the direction of traffic at slower speeds. They include two-wheelers, pushcarts, animals, and pedestrians, which influence traffic quite differently from static obstacles such as potholes and speed bumps. As smartphones and supporting 3G infrastructure are widespread even in developing countries, recent studies have enabled frugal traffic obstacle data collection from smartphones in probe cars. Assuming opportunistic, unevenly distributed, sparse, and error-prone observations of traffic obstacles, we propose an MTO detection algorithm that extends an image analysis technique called the Probabilistic Hough Transform to take collective observations as input. Based on our experience with a small set of real-world data collected in a smartphone-based probe car project with Nairobi City County, we conducted experiments with simulated observation data to assess the effectiveness of the algorithm.
Scalable Spectral Clustering Using Random Binning Features
Wu, Lingfei and Chen, Pin-Yu and Yen, Ian En-Hsu and Xu, Fangli and Xia, Yinglong and Aggarwal, Charu
ACM KDD (oral paper), 2018
Abstract
Spectral clustering is one of the most effective clustering approaches that capture hidden cluster structures in the data. However, it does not scale well to large-scale problems due to its quadratic complexity in constructing similarity graphs and computing subsequent …
Distributed Ledger Technology for Document and Workflow Management in Trade and Logistics
Z. Wang, D. Y. Liffman, D. Karunamoorthy, and E. Abebe
ACM CIKM, 2018
Bug Localization by Learning to Rank and Represent Bug Inducing Changes
Pablo Loyola, Kugmoorthy Gajananan, Fumiko Satoh
International Conference on Information and Knowledge Management (CIKM), 2018
E-tail product return prediction via hypergraph-based local graph cut
Jianbo Li, Jingrui He, Yada Zhu
KDD, 2018
Abstract
Recent decades have witnessed the rapid growth of E-commerce. In particular, E-tail has provided customers with great convenience by allowing them to purchase retail products anywhere without visiting the actual stores. A recent trend in E-tail is to allow free shipping and hassle-free returns to further attract online customers. However, a downside of such a customer-friendly policy is the rapidly increasing return rate as well as the associated costs of handling returned online orders. Therefore, it has become imperative to take proactive measures for reducing the return rate and the associated cost. Despite the large amount of data available from historical purchase and return records, up until now, the problem of E-tail product return prediction has not attracted much attention from the data mining community.
To address this problem, in this paper, we propose a generic framework for E-tail product return prediction named HyperGo. It aims to predict the customer's intention to return after s/he has put together the shopping basket. For the baskets with a high return intention, the E-tailers can then take appropriate measures to incentivize the customer not to issue a return and/or prepare for reverse logistics. The proposed HyperGo is based on a novel hypergraph representation of historical purchase and return records, effectively leveraging the rich information of basket composition. For a given basket, we propose a local graph cut algorithm using truncated random walk on the hypergraph to identify similar historical baskets. Based on these baskets, HyperGo is able to estimate the return intention on two levels: basket-level vs. product-level, which provides the E-tailers with detailed information regarding the reason for a potential return (e.g., duplicate products with different colors). One major benefit of the proposed local algorithm lies in its time complexity, which is linearly dependent on the size of the output cluster and polylogarithmically dependent on the volume of the hypergraph. This makes HyperGo particularly suitable for processing large-scale data sets. The experimental results on multiple real-world E-tail data sets demonstrate the effectiveness and efficiency of HyperGo.
2017
Ranking based multitask learning of scoring functions
Ivan Stojkovic, Mohamed Ghalwash, Zoran Obradovic
Joint European Conference on Machine Learning and Knowledge Discovery in Databases (ECML), pp. 721--736, 2017
Cost Sensitive Time-series Classification
Shoumik Roychoudhury, Mohamed Ghalwash, Zoran Obradovic
Joint European Conference on Machine Learning and Knowledge Discovery in Databases (ECML), pp. 495--511, 2017
Computational Drug Discovery with Dyadic Positive-Unlabeled Learning
Yashu Liu, Shuang Qiu, Ping Zhang, Pinghua Gong, Fei Wang, Guoliang Xue, Jieping Ye
SIAM International Conference on Data Mining (SDM), 2017
Abstract
Computational Drug Discovery, which uses computational techniques to facilitate and improve the drug discovery process, has aroused considerable interest in recent years. Drug Repositioning (DR) and Drug-Drug Interaction (DDI) prediction are two key problems in drug discovery and many computational techniques have been proposed for them in the last decade. Although these two problems have mostly been researched separately in the past, both DR and DDI can be formulated as the problem of detecting positive interactions between data entities (DR is between drug and disease, and DDI is between pairwise drugs). The challenge in both problems is that we can only observe a very small portion of positive interactions. In this paper, we propose a novel framework called Dyadic Positive-Unlabeled learning (DyPU) to solve the problem of detecting positive interactions. DyPU forces positive data pairs to rank higher than the average score of unlabeled data pairs. Moreover, we also derive the dual formulation of the proposed method with the rectifier scoring function and we show that the associated non-trivial proximal operator admits a closed form solution. Extensive experiments are conducted on real drug data sets and the results show that our method achieves superior performance compared with the state of the art.
Polyadic Regression and its Application to Chemogenomics
Ioakeim Perros, Fei Wang, Ping Zhang, Peter Walker, Richard Vuduc, Jyotishman Pathak, Jimeng Sun
SIAM International Conference on Data Mining (SDM), 2017
Abstract
We study the problem of Polyadic Prediction, where the input consists of an ordered tuple of objects, and the goal is to predict a measurement associated with them. Many tasks can be naturally framed as Polyadic Prediction problems. In drug discovery, for instance, it is important to estimate the treatment effect of a drug on various tissue-specific diseases, as it is expressed over the available genes. Thus, we essentially predict the expression value measurements for several (drug, gene, tissue) triads. To tackle Polyadic Prediction problems, we propose a general framework, called Polyadic Regression, predicting measurements associated with multiple objects. Our framework is inductive, in the sense of enabling predictions for new objects, unseen during training. Our model is expressive, exploring high-order, polyadic interactions in an efficient manner. An alternating Proximal Gradient Descent procedure is proposed to fit our model. We perform an extensive evaluation using real-world chemogenomics data, where we illustrate the superior performance of Polyadic Regression over the prior art. Our method achieves an increase of 0.06 and 0.1 in Spearman correlation between the predicted and the actual measurement vectors, for predicting missing polyadic data and predicting polyadic data for new drugs, respectively.
Revisiting Spectral Graph Clustering with Generative Community Models
Chen, Pin-Yu and Wu, Lingfei
ICDM (regular paper), 2017
Abstract
The methodology of community detection can be divided into two principles: imposing a network model on a given graph, or optimizing a designed objective function. The former provides guarantees on theoretical detectability but falls short when the graph is …
Gadei: On scale-up training as a service for deep learning
Wei Zhang, Minwei Feng, Yunhui Zheng, Yufei Ren, Yandong Wang, Ji Liu, Peng Liu, Bing Xiang, Li Zhang, Bowen Zhou, Fei Wang
IEEE International Conference on Data Mining (ICDM), 2017
A Method to Accelerate Human in the Loop Clustering
Anni Coden, Marina Danilevsky, Daniel Gruhl, Linda Kato, and Meena Nagarajan
SDM, 2017
GELL: Automatic Extraction of Epidemiological Line Lists from Open Sources
Saurav Ghosh, Prithwish Chakraborty, Bryan L Lewis, Maimuna S Majumder, Emily Cohn, John S Brownstein, Madhav V Marathe, Naren Ramakrishnan
Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1477--1485, 2017
Learning from multi-modality multi-resolution data: an optimization approach
Y. Zhu, J. Li, and J. He
SIAM International Conference on Data Mining, 2017
Abstract
Many complex real applications involve the collection of time series data with multiple modalities and of multiple resolutions. For example, in aluminum smelting processes, the recorded process variables typically reflect various aspects of these processes, such as pressure and temperature, and they are often obtained with different time resolutions, such as every 5 minutes and every day. How can we effectively leverage both the multi-modality property and the multi-resolution property of the data for the sake of more accurate prediction of key process indicators (e.g., the cell temperature of the aluminum smelting processes)?
Different from existing techniques, which can only model the multi-modality property or the multi-resolution property, in this paper, for the first time, we propose to jointly model the two properties such that the prediction results are consistent across multiple modalities and multiple resolutions. To this end, we construct an optimization framework, which is based on a novel regularizer imposing such consistency. Then, we design an effective and efficient optimization algorithm based on randomized block coordinate descent. Its performance is evaluated on both synthetic and real data sets, outperforming state-of-the-art techniques.
HiMuV: Hierarchical Framework for Modeling Multi-Modality Multi-Resolution Data
J. Li, Y. Zhu, and J. He
IEEE-ICDM, 2017
Local Algorithm for User Action Prediction Towards Display Ads
H. Yang, Y. Zhu, and J. He
KDD, 2017
2016
Discovering Spatial Regions of High Correlation
P. Agarwal, R. Verma, V. M. V. Gunturi
2016 IEEE 16th International Conference on Data Mining Workshops (ICDMW)
Sparse Gaussian Markov Random Field Mixtures for Anomaly Detection
Idé, Tsuyoshi and Khandelwal, Ankush and Kalagnanam, Jayant
Data Mining (ICDM), 2016 IEEE 16th International Conference on, pp. 955--960
Abstract
We propose a new approach to anomaly detection from multivariate noisy sensor data. We address two major challenges: To provide variable-wise diagnostic information and to automatically handle multiple operational modes. Our task is a practical extension of …
LSHDB: a parallel and distributed engine for record linkage and similarity search
Karapiperis, Dimitrios and Gkoulalas-Divanis, Aris and Verykios, Vassilios S
Data Mining Workshops (ICDMW), 2016 IEEE 16th International Conference on, pp. 1--4
Abstract
In this paper, we present LSHDB, the first parallel and distributed engine for record linkage and similarity search. LSHDB materializes an abstraction layer to hide the mechanics of the Locality-Sensitive Hashing (a popular method for detecting similar items in …
POI Recommendation: A Temporal Matching between POI Popularity and User Regularity
Zijun Yao, Yanjie Fu, Bin Liu, Yanchi Liu, Hui Xiong
2016 IEEE Conference on Data Mining (ICDM 2016)
Applicability of Latent Dirichlet Allocation for Company Modeling
Katsiaryna Mirylenka, Christoph Miksovic, Paolo Scotton
In Proceedings of 16th Industrial Conference on Data Mining (ICDM), pp. 55-60, 2016
Probabilistic-mismatch anomaly detection: do one's medications match with the diagnoses?
Lingxiao Zhang, Xiang Li, Haifeng Liu, Jing Mei, Gang Hu, Junfeng Zhao, Bing Xie, Guotong Xie
IEEE International Conference on Data Mining (ICDM), 2016
RelSim: Relation similarity search in schema-rich heterogeneous information networks
Wang, Chenguang and Sun, Yizhou and Song, Yanglei and Han, Jiawei and Song, Yangqiu and Wang, Lidan and Zhang, Ming
Proceedings of the 2016 SIAM International Conference on Data Mining, pp. 621--629
Abstract
Recent studies have demonstrated the power of modeling real world data as heterogeneous information networks (HINs) consisting of multiple types of entities and relations. Unfortunately, most of such studies (e.g., similarity search) confine discussions on …
Co-Clustering Structural Temporal Data with Applications to Semiconductor Manufacturing
Y. Zhu and J. He
ACM Transactions on Knowledge Discovery from Data (TKDD) 10, 2016
Abstract
Recent years have witnessed data explosion in semiconductor manufacturing due to advances in instrumentation and storage techniques. The large amount of data associated with process variables monitored over time form a rich reservoir of information, which can be used for a variety of purposes, such as anomaly detection, quality control, and fault diagnostics. In particular, following the same recipe for a certain Integrated Circuit device, multiple tools and chambers can be deployed for the production of this device, during which multiple time series can be collected, such as temperature, impedance, gas flow, electric bias, etc. These time series naturally fit into a two-dimensional array (matrix), i.e., each element in this array corresponds to a time series for one process variable from one chamber. To leverage the rich structural information in such temporal data, in this article, we propose a novel framework named C-Struts to simultaneously cluster on the two dimensions of this array. In this framework, we interpret the structural information as a set of constraints on the cluster membership, introduce an auxiliary probability distribution accordingly, and design an iterative algorithm to assign each time series to a certain cluster on each dimension. Furthermore, we establish the equivalence between C-Struts and a generic optimization problem, which is able to accommodate various distance functions. Extensive experiments on synthetic, benchmark, as well as manufacturing datasets demonstrate the effectiveness of the proposed method.
Revisiting Random Binning Feature: Fast Convergence and Strong Parallelizability
Lingfei Wu*, Ian E.H. Yen*, Jie Chen, and Rui Yan (*equally contributed)
In Proceedings of the 22nd ACM SIGKDD Conference on Knowledge Discovery and Data Mining, 2016
Unified Point-of-Interest Recommendation with Temporal Interval Assessment
Yanchi Liu, Chuanren Liu, Bin Liu, Meng Qu, Hui Xiong
Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1015--1024, 2016
Risk Prediction with Electronic Health Records: A Deep Learning Approach
Yu Cheng, Fei Wang, Ping Zhang, Jianying Hu
SIAM International Conference on Data Mining (SDM), 2016
Abstract
Recent years have witnessed a surge of interest in data analytics with patient Electronic Health Records (EHR). Data-driven healthcare, which aims at effective utilization of big medical data, representing the collective learning in treating hundreds of millions of patients, to provide the best and most personalized care, is believed to be one of the most promising directions for transforming healthcare. EHR is one of the major carriers for making this data-driven healthcare revolution successful. There are many challenges in working directly with EHR, such as temporality, sparsity, noisiness, bias, etc. Thus effective feature extraction, or phenotyping, from patient EHRs is a key step before any further applications. In this paper, we propose a deep learning approach for phenotyping from patient EHRs. We first represent the EHRs for every patient as a temporal matrix with time on one dimension and event on the other dimension. Then we build a four-layer convolutional neural network model for extracting phenotypes and performing prediction. The first layer is composed of those EHR matrices. The second layer is a one-side convolution layer that can extract phenotypes from the first layer. The third layer is a max pooling layer introducing sparsity on the detected phenotypes, so that only those significant phenotypes will remain. The fourth layer is a fully connected softmax prediction layer. In order to incorporate the temporal smoothness of the patient EHR, we also investigated three different temporal fusion mechanisms in the model: early fusion, late fusion and slow fusion. Finally the proposed model is validated on a real world EHR data warehouse under the specific scenario of predictive modeling of chronic diseases.
The SPMF Open-Source Data Mining Library Version 2
Philippe Fournier-Viger, Jerry Chun-Wei Lin, Antonio Gomariz, Ted Gueniche, Azadeh Soltani, Zhihong Deng, Hoang Thanh Lam
ECML/PKDD, Springer, 2016
Open Problem: Accurately Measuring Event Impacts on Time Series.
Lianhua Chi; Bo Han; Yun Wang
2nd SIGKDD Workshop on Mining and Learning from Time Series in KDD16, 2016
Predicting Disk Replacement towards Reliable Data Centers
Mirela Botezatu, Ioana Giurgiu, Jasmina Bogojeska, Dorothea Wiesmann
ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD), 2016
World Knowledge as Indirect Supervision for Document Clustering
Wang, Chenguang and Song, Yangqiu and Roth, Dan and Zhang, Ming and Han, Jiawei
ACM Transactions on Knowledge Discovery from Data (TKDD) 11(2), 13, ACM, 2016
Abstract
One of the key obstacles in making learning protocols realistic in applications is the need to supervise them, a costly process that often requires hiring domain experts. We consider the framework to use the world knowledge as indirect supervision. World …
INSIGHT: Dynamic Traffic Management Using Heterogeneous Urban Data
Nikolaos Panagiotou and 24 co-authors
ECML/PKDD, the European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases, 2016
NVMcached: An NVM-based Key-Value Cache
Xingbo Wu, Fan Ni, Li Zhang, Yandong Wang, Yufei Ren, Michel Hack, Zili Shao, Song Jiang
Proceedings of the 7th ACM SIGOPS Asia-Pacific Workshop on Systems, pp. 18:1--18:7, ACM, 2016
An Empirical Study on Hybrid Recommender System with Implicit Feedback
Lee, Sunhwan and Chandra, Anca and Jadav, Divyesh
Pacific-Asia Conference on Knowledge Discovery and Data Mining, pp. 514--526, 2016
Abstract
The amount of data generated by systems is growing quickly because of the appearance of mobile devices, wearable devices, and the Internet of Things (IoT), to name a few. Because of that, the importance of personalized recommendations by recommender …
2015
Toward Comprehensive Attribution of Healthcare Cost Changes
Dmitriy Katz-Rogozhnikov, Dennis Wei, Gigi Y. Yuen-Reed, Karthikeyan Natesan Ramamurthy, Aleksandra Mojsilović
IEEE ICDM Workshops, 2015
Identifying Employees for Re-Skilling using an Analytics-Based Approach
Karthikeyan Natesan Ramamurthy, Moninder Singh, Michael Davis, J. Alex Kevern, Michael Peran
IEEE ICDM Workshops, 2015
Informative Prediction based on Ordinal Questionnaire Data
Tsuyoshi Idé, Amit Dhurandhar
Proceedings of 2015 IEEE International Conference on Data Mining (ICDM 15), pp. 191--200
Knowsim: A document similarity measure on structured heterogeneous information networks
Wang, Chenguang and Song, Yangqiu and Li, Haoran and Zhang, Ming and Han, Jiawei
Data Mining (ICDM), 2015 IEEE International Conference on, pp. 1015--1020
Abstract
As a fundamental task, document similarity measure has broad impact to document-based classification, clustering and ranking. Traditional approaches represent documents as bag-of-words and compute document similarities using measures like cosine, Jaccard, and …
Identifying Employees for Re-Skilling using an Analytics-Based Approach
Karthikeyan N. Ramamurthy, Moninder Singh, Michael Davis, Jason A. Kevern, Uri Klein and Michael Peran
IEEE International Conference on Data Mining (ICDM) - Workshop on Data Mining for Service, 2015
Flexible Sliding Windows for Kernel Regression Based Bus Arrival Time Prediction Algorithms
Hoang Thanh Lam and Eric Bouillet
ECML/PKDD 2015, Springer
Multi-View Incident Ticket Clustering for Optimal Ticket Dispatching
Mirela Botezatu, Jasmina Bogojeska, Ioana Giurgiu, Hagen Voelzer, Dorothea Wiesmann
ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD), 2015
Support Measure Data Description for group anomaly detection
Guevara, Jorge and Canu, Stéphane and Hirata, R
ODDx3 Workshop on Outlier Definition, Detection, and Description at the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD 2015)
Abstract
We address the problem of learning a data description model from a dataset containing probability measures as observations. We estimate the data description model by optimizing volume-sets of probability measures where each volume-set is defined as a set of probability …
Mobility Mining for Journey Planning in Rome
Michele Berlingerio, Veli Bicer, Adi Botea, Stefano Braghin, Nuno Lopes, Riccardo Guidotti, Francesca Pratesi
Machine Learning and Knowledge Discovery in Databases - European Conference, ECML PKDD 2015, Porto, Portugal, September 7-11, 2015, Proceedings, Part III, pp. 222--226
S&P360: Multidimensional Perspective on Companies from Online Data Sources
Michele Berlingerio, Stefano Braghin, Francesco Calabrese, Cody Dunne, Yiannis Gkoufas, Mauro Martino, Jamie C. Rasmussen, Steven I. Ross
Machine Learning and Knowledge Discovery in Databases - European Conference, ECML PKDD 2015, Porto, Portugal, September 7-11, 2015, Proceedings, Part III, pp. 320--324
Shedding light on the performance of solar panels: a data-driven view
S. A. Chen, A. Vishwanath, S. Sathe and S. Kalyanaraman
ACM SIGKDD Explorations 17(2), pp. 24--36, 2015
Abstract
The significant adoption of solar photovoltaic (PV) systems in both commercial and residential sectors has spurred an interest in monitoring the performance of these systems. This is facilitated by the increasing availability of regularly logged PV performance data in recent years. In this paper, we present a data-driven framework to systematically characterise the relationship between performance of an existing photovoltaic (PV) system and various environmental factors. We demonstrate the efficacy of our proposed framework by applying it to a PV generation data set from a building located in northern Australia. We show how, in light of limited site-specific weather information, this data set may be coupled with publicly available data to yield rich insights on the performance of the building's PV system.
Predicting Future Scientific Discoveries Based on a Networked Analysis of the Past Literature
Meenakshi Nagarajan, Angela D Wilkins, Benjamin J Bachman, Ilya B Novikov, Shenghua Bao, Peter J Haas, María E Terrón-Díaz, Sumit Bhatia, Anbu K Adikesavan, Jacques J Labrie, and others
Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 2019--2028, 2015
Opinion Marks: A Human-Based Computation Approach to Instill Structure into Unstructured Text on the Web
Bum Chul Kwon, Jaegul Choo, Sung-Hee Kim, Daniel Keim, Haesun Park, Ji Soo Yi
KDD 2015 Workshop on Interactive Data Exploration and Analytics (IDEA’15), pp. 47--55
LINKAGE: An Approach for Comprehensive Risk Prediction for Care Management
Sun, Zhaonan and Wang, Fei and Hu, Jianying
Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1145--1154, 2015
Abstract
Comprehensive risk assessment lies in the core of enabling proactive healthcare delivery systems. In recent years, data-driven predictive modeling approaches have been increasingly recognized as promising techniques to help enhance healthcare quality and …
Incorporating world knowledge to document clustering via heterogeneous information networks
Wang, Chenguang and Song, Yangqiu and El-Kishky, Ahmed and Roth, Dan and Zhang, Ming and Han, Jiawei
Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1215--1224, 2015
Abstract
One of the key obstacles in making learning protocols realistic in applications is the need to supervise them, a costly process that often requires hiring domain experts. We consider the framework to use the world knowledge as indirect supervision. World …
Online topic-based social influence analysis for the Wimbledon Championships
Embar, Varun R and Bhattacharya, Indrajit and Pandit, Vinayaka and Vaculin, Roman
Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1759--1768, 2015
Abstract
Various industries are turning to social media to identify key influencers on topics of interest. Following this trend, the All England Lawn Tennis and Croquet Club (AELTC) is keen to analyze the 'social pulse' around the famous Wimbledon Championships. IBM …
Predicting future scientific discoveries based on a networked analysis of the past literature
Nagarajan, Meenakshi and Wilkins, Angela D and Bachman, Benjamin J and Novikov, Ilya B and Bao, Shenghua and Haas, Peter J and Terrón-Díaz, María E and Bhatia, Sumit and Adikesavan, Anbu K and Labrie, Jacques J and others
Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 2019--2028, 2015
Abstract
We present KnIT, the Knowledge Integration Toolkit, a system for accelerating scientific discovery and predicting previously unknown protein-protein interactions. Such predictions enrich biological research and are pertinent to drug discovery and the …
Big Data System for Analyzing Risky Procurement Entities
Dhurandhar, Amit and Graves, Bruce and Ravi, Rajesh and Maniachari, Gopikrishanan and Ettl, Markus
Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1741--1750, 2015
Abstract
An accredited biennial 2014 study by the Association of Certified Fraud Examiners claims that on average 5% of a company's revenue is lost because of unchecked fraud every year. The reason for such heavy losses is that it takes around 18 months for a fraud to be …
On Data Publishing with Clustering Preservation
Michail Vlachos, Johannes Schneider, Vassilios G. Vassiliadis
TKDD 9(3): 23:1-23:30, 2015
Exploiting relevance feedback in knowledge graph search
Su, Yu and Yang, Shengqi and Sun, Huan and Srivatsa, Mudhakar and Kase, Sue and Vanni, Michelle and Yan, Xifeng
Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1135--1144, 2015
Abstract
The big data era is witnessing a prevalent shift of data from homogeneous to heterogeneous, from isolated to linked. Exemplar outcomes of this shift are a wide range of graph data such as information, social, and knowledge graphs. The unique characteristics of …
Predicting Future Scientific Discoveries Based on a Networked Analysis of the Past Literature
Olivier Lichtarge, Meenakshi Nagarajan
ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD), 2015
Dynamic Poisson Autoregression for Influenza-Like-Illness Case Count Prediction
Zheng Wang, Prithwish Chakraborty, Sumiko R. Mekaru, John S. Brownstein, Jieping Ye, Naren Ramakrishnan
Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1285--1294, ACM, 2015
Voltage correlations in smart meter data
Rajendu Mitra, Ramachandra Kota, Sambaran Bandyopadhyay, Vijay Arya, Brian Sullivan, Richard Mueller, Heather Storey, Gerard Labut
Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1999--2008, 2015
Co-Clustering based Dual Prediction for Cargo Pricing Optimization
Y. Zhu, H. Yang and J. He
KDD, 2015
Abstract
This paper targets the problem of cargo pricing optimization in the air cargo business. Given the features associated with a pair of origination and destination, how can we simultaneously predict both the optimal price for the bid stage and the outcome of the transaction (win rate) in the decision stage? In addition, it is often the case that the matrix representing pairs of originations and destinations has a block structure, i.e., the originations and destinations can be co-clustered such that the predictive models are similar within the same co-cluster, and exhibit significant variation among different co-clusters. How can we uncover the co-clusters of originations and destinations while constructing the dual predictive models for the two stages?
We take the first step at addressing these problems. In particular, we propose a probabilistic framework to simultaneously construct dual predictive models and uncover the co-clusters of originations and destinations. It maximizes the conditional probability of observing the responses from both the quotation stage and the decision stage, given the features and the co-clusters. By introducing an auxiliary distribution based on the co-clustering assumption, such conditional probability can be converted into an objective function. To minimize the objective function, we propose the COCOA algorithm, which will generate both the suite of predictive models for all the pairs of originations and destinations, as well as the co-clusters consisting of similar pairs. Experimental results on both synthetic data and real data from cargo price bidding demonstrate the effectiveness and efficiency of the proposed algorithm.
Modelling trajectories for diabetes complications
Yadav, Pranjul and Pruinelli, Lisiane and Hangsleben, Andrew and Dey, Sanjoy and Hauwiller, Katherine and Westra, Bonnie L and Delaney, Connie W and Kumar, Vipin and Steinbach, Michael and Simon, Gyorgy J
Proceedings of the 4th Workshop on Data Mining for Medicine and Healthcare. 2015 SIAM International Conference on Data Mining
Abstract
Diabetes mellitus (DM) is a prevalent and costly disease and if not managed effectively, it leads to complications in almost every body system. Evidence-based guidelines for prevention and management of DM exist, but they ignore the trajectory along which the …