**2016**

SystemML: Declarative Machine Learning on Spark
Matthias Boehm, Michael Dusenberry, Deron Eriksson, Alexandre Evfimievski, Faraz Makari Manshadi, Niketan Pansare, Berthold Reinwald, Frederick Reiss, Prithviraj Sen, Arvind Surve, Shirish Tatikonda
*Proceedings of the VLDB Endowment*, *pp. 1425–1436*, 2016
Abstract
The rising need for custom machine learning (ML) algorithms and the growing data sizes that require the exploitation of distributed, data-parallel frameworks such as MapReduce or Spark pose significant productivity challenges to data scientists. Apache SystemML addresses these challenges through declarative ML by (1) increasing the productivity of data scientists, as they are able to express custom algorithms in a familiar domain-specific language covering linear algebra primitives and statistical functions, and (2) transparently running these ML algorithms on distributed, data-parallel frameworks by applying cost-based compilation techniques to generate efficient, low-level execution plans with in-memory single-node and large-scale distributed operations. This paper describes SystemML on Apache Spark, end to end, including insights into various optimizer and runtime techniques as well as performance characteristics. We also share lessons learned from porting SystemML to Spark and declarative ML in general. Finally, SystemML is open source, which allows the database community to leverage it as a testbed for further research.

Declarative Machine Learning - A Classification of Basic Properties and Types
Matthias Boehm, Alexandre V. Evfimievski, Niketan Pansare, Berthold Reinwald
Technical Report, 2016
Abstract
Declarative machine learning (ML) aims at the high-level specification of ML tasks or algorithms, and automatic generation of optimized execution plans from these specifications. The fundamental goal is to simplify the usage and/or development of ML algorithms, which is especially important in the context of large-scale computations. However, ML systems at different abstraction levels have emerged over time and accordingly there has been a controversy about the meaning of this general definition of declarative ML. Specification alternatives range from ML algorithms expressed in domain-specific languages (DSLs) with optimization for performance, to ML task (learning problem) specifications with optimization for performance and accuracy. We argue that these different types of declarative ML complement each other as they address different users (data scientists and end users). This paper makes an attempt to create a taxonomy for declarative ML, including a definition of essential basic properties and types of declarative ML. Along the way, we provide insights into implications of these properties. We also use this taxonomy to classify existing systems. Finally, we draw conclusions on defining appropriate benchmarks and specification languages for declarative ML.

**2014**

Large-Scale Online Aggregation Via Distributed Systems
Niketan Pansare
2014
Abstract
To deal with huge amounts of data, many analysts use MapReduce, a software framework that parallelizes computations across a compute cluster. However, MapReduce is sometimes still not fast enough to perform complicated analyses. In this thesis, I address this problem by developing a statistical estimation framework on top of MapReduce to provide for interactive data analysis. I present three projects that I have worked on on this topic.
My approach is based on Online Aggregation (OLA), a method that allows the user to compute an arbitrary aggregation function over the dataset and output probabilistic bounds in an online fashion. However, implementing Online Aggregation on MapReduce is non-trivial because classical sampling theory suffers from the "inspection paradox": at any random time, a block that takes longer to process is more likely to be the one under observation. Since there is usually a correlation between the processing time of a data block and its output value, classical sampling theory will output biased estimates. Therefore, in the first project of my thesis, I propose a Bayesian model that addresses the inspection paradox in the MapReduce setting and outputs unbiased online estimates for the given aggregation function.
In the second project, I focus on applying OLA techniques to gradient descent, an optimization algorithm that finds a local minimum of a function $L(\Theta)$ by starting from an initial point $\Theta_0$ and taking steps in the direction of the negative gradient of the function being optimized. Since the gradient descent algorithm is essentially a user-defined aggregate function, the OLA framework developed in the first part of my thesis can be used to speed it up in a MapReduce framework. The key technical question that must be answered is: "When do we stop the OLA estimation for a given step (or epoch)?" In this thesis, I propose and evaluate a new statistical model for addressing this question.
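The update rule described above, $\Theta_{k+1} = \Theta_k - \eta \nabla L(\Theta_k)$, can be sketched in a few lines. This is plain batch gradient descent under assumed names; in the thesis's setting each epoch's gradient is itself an aggregate over the data, so OLA can estimate it from a sample and stop the epoch early, which this sketch does not attempt.

```python
def gradient_descent(grad, theta0, lr=0.1, epochs=100):
    """Batch gradient descent: theta_{k+1} = theta_k - lr * grad(theta_k).
    `grad` is the exact gradient here; with OLA, it would be an online
    estimate aggregated over data blocks, with a statistical stopping
    rule deciding when the estimate is good enough for this epoch."""
    theta = theta0
    for _ in range(epochs):
        theta = theta - lr * grad(theta)   # step against the gradient
    return theta

# Minimize L(theta) = (theta - 3)^2, whose gradient is 2 * (theta - 3);
# the minimum is at theta = 3.
theta_star = gradient_descent(lambda t: 2 * (t - 3), theta0=0.0)
```

Each step contracts the error by a constant factor here, so the iterate converges to the minimizer; the OLA question is how cheaply each `grad(theta)` can be estimated without derailing that convergence.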
As feature selection is an important step in machine learning, the third project in this thesis focuses on building better topic models, a popular feature selection technique, for audio data by taking into account the inherent uncertainty of the speech recognizer.

**2009**

Multi-Query Optimization in the DataPath System
Niketan Pansare
2009
Abstract
The DataPath system is a novel database implemented from the ground up using a data-centric approach. In this thesis, I describe and evaluate a multi-query optimizer for the DataPath system. Unlike traditional multi-query optimizers, which only try to overlap common sub-expressions, I propose an efficient optimization algorithm that minimizes the data (i.e., the overall number of tuples) flowing through the system. Using this objective function, I present a qualitative and quantitative study comparing commonly used algorithms against the proposed multi-query optimization algorithm.
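The objective of minimizing total tuples flowing through the system can be made concrete with a toy cost model. The function below, and the selectivity numbers, are hypothetical illustrations, not the thesis's actual optimizer: they only show why sharing a scan between two queries reduces the objective.

```python
def total_tuple_flow(base_rows, selectivities):
    """Total tuples flowing through a linear operator pipeline: the scan
    emits base_rows, and each subsequent operator receives the previous
    operator's output (base_rows times the product of selectivities so
    far). A toy stand-in for a tuple-flow cost model."""
    flow, rows = base_rows, float(base_rows)
    for s in selectivities:
        rows *= s          # tuples surviving this operator
        flow += rows       # ...all of which flow to the next one
    return flow

# Two queries over the same 1M-row table, each with one filter of
# selectivity 0.1. Run separately, the scan output is paid twice;
# with a shared scan, it is paid once and routed to both filters.
separate = 2 * total_tuple_flow(1_000_000, [0.1])
shared = 1_000_000 + 2 * (1_000_000 * 0.1)
```

Under this model the shared plan moves noticeably fewer tuples, which is exactly the kind of saving a tuple-flow objective rewards even when the queries share no sub-expression beyond the base table.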