**2022**

Approximate Euclidean lengths and distances beyond Johnson-Lindenstrauss
Aleksandros Sobczyk, Mathieu Luisier
*arXiv*, arXiv, 2022
Abstract
A classical result of Johnson and Lindenstrauss states that a set of $n$ high dimensional data points can be projected down to $O(\log n/\epsilon^2)$ dimensions such that the square of their pairwise distances is preserved up to a small distortion $\epsilon\in(0,1)$. It has been proved that the JL lemma is optimal for the general case, therefore, improvements can only be explored for special cases.
This work aims to improve the $\epsilon^{-2}$ dependency based on techniques inspired by the Hutch++ Algorithm, which reduces $\epsilon^{-2}$ to $\epsilon^{-1}$ for the related problem of implicit matrix trace estimation. For $\epsilon=0.01$, for example, this translates to $100$ times less matrix-vector products in the matrix-vector query model to achieve the same accuracy as other previous estimators. We first present an algorithm to estimate the Euclidean lengths of the rows of a matrix. We prove element-wise probabilistic bounds that are at least as good as standard JL approximations in the worst-case, but are asymptotically better for matrices with decaying spectrum. Moreover, for any matrix, regardless of its spectrum, the algorithm achieves $\epsilon$-accuracy for the total, Frobenius norm-wise relative error using only $O(\epsilon^{-1})$ queries. This is a quadratic improvement over the norm-wise error of standard JL approximations. We finally show how these results can be extended to estimate the Euclidean distances between data points and to approximate the statistical leverage scores of a tall-and-skinny data matrix, which are ubiquitous for many applications. Proof-of-concept numerical experiments are presented to validate the theoretical analysis.

Pylspack: Parallel Algorithms and Data Structures for Sketching, Column Subset Selection, Regression and Leverage Scores
Aleksandros Sobczyk, Efstratios Gallopoulos
*ACM Trans. Math. Softw.*, Association for Computing Machinery, 2022
Abstract Just Accepted
We present parallel algorithms and data structures for three fundamental operations in Numerical Linear Algebra: (i) Gaussian and CountSketch random projections and their combination, (ii) computation of the Gram matrix and (iii) computation of the squared row norms of the product of two matrices, with a special focus on tall-and-skinny matrices, which arise in many applications. We provide a detailed analysis of the ubiquitous CountSketch transform and its combination with Gaussian random projections, accounting for memory requirements, computational complexity and workload balancing. We also demonstrate how these results can be applied to column subset selection, least squares regression and leverage scores computation. These tools have been implemented in pylspack, a publicly available Python package,1 whose core is written in C++ and parallelized with OpenMP, and which is compatible with standard matrix data structures of SciPy and NumPy. Extensive numerical experiments indicate that the proposed algorithms scale well and significantly outperform existing libraries for tall-and-skinny matrices.

**2021**

Estimating leverage scores via rank revealing methods and randomization
Aleksandros Sobczyk, Efstratios Gallopoulos
*SIAM Journal on Matrix Analysis and Applications* *42*(*3*), 1199-1228, Society for Industrial and Applied Mathematics, 2021
Abstract
We study algorithms for estimating the statistical leverage scores of rectangular dense or sparse matrices of arbitrary rank. Our approach is based on combining rank revealing methods with compositions of dense and sparse randomized dimensionality reduction transforms. We first develop a set of fast novel algorithms for rank estimation, column subset selection and least squares preconditioning. We then describe the design and implementation of leverage score estimators based on these primitives. These estimators are also effective for rank deficient input, which is frequently the case in data analytics applications. We provide detailed complexity analyses for all algorithms as well as meaningful approximation bounds and comparisons with the state-of-the-art. We conduct extensive numerical experiments to evaluate our algorithms and to illustrate their properties and performance using synthetic and real world data sets.

**2015**

A general tridiagonal solver for coprocessors: Adapting g-Spike for the Intel Xeon Phi
Ioannis E. Venetis, Alexandros Sobczyk, Alexandros Kouris, Alexandros Nakos, Nikolaos Nikoloutsakos, Efstratios Gallopoulos
*Parallel Computing: On the Road to Exascale*, *pp. 371-380*, IOS Press, 2015
Abstract
Manycores like the Intel Xeon Phi and graphics processing units like the NVIDIA Tesla series are prime examples of systems for accelerating applications that run on current CPU multicores. It is therefore of interest to build fast, reliable linear system solvers targeting these architectures. Moreover, it is of interest to conduct cross comparisons between algorithmic implementations in order to organize the types of optimizations and transformations that are necessary when porting in order to succeed in obtaining performance portability. In this work we aim to present a detailed study of the adaptation and implementation of g-Spike for the Xeon Phi. g-Spike was originally developed to solve general tridiagonal systems on GPUs, on which it returns high performance while also solving systems for which other state-of-the-art general tridiagonal GPU solvers do not succeed. The solver is based on the Spike framework, using QR factorization without pivoting implemented via Givens rotations. We show the necessary adaptations on the Xeon Phi because of the significant differences in the programming models and the underlying architectures as well as the relative performance differences for data access and processing operations.

A direct tridiagonal solver based on Givens rotations for GPU architectures
I.E. Venetis, A. Kouris, A. Sobczyk, E. Gallopoulos, A.H. Sameh
*Parallel Computing* *49*(*C*), 101-116, Elsevier, 2015
Abstract
A parallel solver for general tridiagonal irreducible systems is described.Solver based on Spike framework and Givens-QR with occasional low-rank modification.Modifications handle singularities exposed by QR in blocks of the parallel partition.The GPU implementation has similar performance to existing methods.Method returns accurate results when current GPU tridiagonal solvers fail. g-Spike, a parallel algorithm for solving general nonsymmetric tridiagonal systems for the GPU, and its CUDA implementation are described. The solver is based on the Spike framework, applying Givens rotations and QR factorization without pivoting. It also implements a low-rank modification strategy to compute the Spike DS decomposition even when the partitioning defines singular submatrices along the diagonal. The method is also used to solve the reduced system resulting from the Spike partitioning. Numerical experiments with problems of high order indicate that g-Spike is competitive in runtime with existing GPU methods, and can provide acceptable results when other methods cannot be applied or fail.