Randomized Numerical Linear Algebra for Large-Scale Data Analysis: Overview


The Sketching Linear Algebra Kernel is a library for matrix computations suitable for general statistical data analysis and optimization applications.

Many tasks in machine learning and statistics ultimately reduce to problems involving matrices. Whether you're matching lenders and loans in the microfinance space, finding the key players in the bitcoin market, or inferring where tweets came from, you'll want a toolkit for low-rank matrix approximation, least-squares and robust regression, eigenvector analysis, CUR and non-negative matrix factorizations, and other matrix computations.

Sketching is a way to compress matrices that preserves key matrix properties; it can be used to speed up many matrix computations. Sketching takes a given matrix A and produces a sketch matrix B that has fewer rows and/or columns than A. For a good sketch B, a solution computed from B is also close to optimal for the original input A. For some problems, sketches can also be used to find high-precision solutions to the original problem faster. In other cases, sketches can be used to summarize the data by identifying the most important rows or columns.
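To make the idea concrete, here is a minimal sketch-and-solve example for least squares, written with NumPy rather than with this library's own API. It assumes a dense Gaussian sketch; the sizes n, d, s and the names S, B, c, ratio are illustrative choices, not taken from the project.

```python
import numpy as np

rng = np.random.default_rng(0)

# Tall least-squares problem: minimize ||A x - b|| with n >> d.
n, d, s = 2000, 10, 200          # s is the (assumed) sketch size, d < s << n
A = rng.standard_normal((n, d))
b = A @ rng.standard_normal(d) + 0.1 * rng.standard_normal(n)

# Gaussian sketching matrix, scaled so that E[S.T @ S] is the identity.
S = rng.standard_normal((s, n)) / np.sqrt(s)
B = S @ A                        # sketch of A: s rows instead of n
c = S @ b                        # the right-hand side is sketched the same way

# Solve the original problem and the much smaller sketched problem.
x_full, *_ = np.linalg.lstsq(A, b, rcond=None)
x_sketch, *_ = np.linalg.lstsq(B, c, rcond=None)

# Judge both solutions on the ORIGINAL problem.
res_full = np.linalg.norm(A @ x_full - b)
res_sketch = np.linalg.norm(A @ x_sketch - b)
ratio = res_sketch / res_full    # at least 1; near 1 means the sketch was good

print(f"residual ratio (sketched / exact): {ratio:.3f}")
```

The sketched problem has s rows instead of n, so the solve itself is cheaper. A dense Gaussian sketch still costs a full matrix product to apply, which is one reason faster transforms such as sparse embeddings are of interest.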

A simple example of sketching is just sampling the rows (and/or columns) of the matrix, where each row (and/or column) is equally likely to be sampled. This uniform sampling is quick and easy, but doesn't always yield good sketches; however, there are more sophisticated sampling methods that do yield good sketches.
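The difference between uniform and non-uniform sampling can be illustrated with a small NumPy experiment. The non-uniform scheme used here is leverage-score sampling, chosen as one well-known example; the text above does not say which methods it has in mind. The helper names leverage_scores and sample_rows and the test matrix are hypothetical, built so that a single row carries all the information in one column.

```python
import numpy as np

def leverage_scores(A):
    """Row leverage scores: squared row norms of an orthonormal basis for range(A)."""
    Q, _ = np.linalg.qr(A)       # thin QR; Q has orthonormal columns
    return np.sum(Q**2, axis=1)

def sample_rows(A, s, probs, rng):
    """Draw s rows with replacement according to probs, rescaled so that
    the expectation of B.T @ B equals A.T @ A."""
    idx = rng.choice(A.shape[0], size=s, replace=True, p=probs)
    scale = 1.0 / np.sqrt(s * probs[idx])
    return A[idx] * scale[:, None]

rng = np.random.default_rng(0)
n, d, s = 10_000, 5, 50

# Row 0 is the only row with a nonzero entry in the last column,
# so any sketch that misses it loses that column entirely.
A = np.zeros((n, d))
A[1:, : d - 1] = rng.standard_normal((n - 1, d - 1))
A[0, d - 1] = 1.0

uniform_probs = np.full(n, 1.0 / n)
lev = leverage_scores(A)
lev_probs = lev / lev.sum()      # leverage scores sum to rank(A) = d

B_uniform = sample_rows(A, s, uniform_probs, rng)
B_lev = sample_rows(A, s, lev_probs, rng)

rank_uniform = np.linalg.matrix_rank(B_uniform)
rank_lev = np.linalg.matrix_rank(B_lev)
print("rank of A:", d)
print("rank of uniform sketch:", rank_uniform)
print("rank of leverage-score sketch:", rank_lev)
```

Uniform sampling picks row 0 with probability 1/n per draw, so it will usually miss it and return a rank-deficient sketch. Row 0 has leverage score 1, so leverage-score sampling picks it with probability 1/d per draw. Computing exact leverage scores via QR costs as much as solving the original problem, so this only illustrates the principle; practical methods approximate the scores.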

The goal of this project is to build a sketching-based open-source software stack for numerical linear algebra (NLA) and its applications. The layers of the stack, from top to bottom, are:

Applications: Matrix Completion, Nonlinear RLS, SVM, PCA, Robust Regression, Other applications
Python: Python-based data analytics scripting layer
PythonBinding: C++ to Python bindings
NLA: Numerical Linear Algebra primitives (Least squares regression, low-rank approximation, randomized estimators)
Sketch: Sketching kernels (JL, FJL, Gaussian, Sign, Sparse Embedding)
Third-Party Libraries: MPI, Elemental, BLAS, CombBLAS, FFTW, Boost