2022
Approximate Computing for Scientific Applications
H. Anzt, M. Casas, A. C. I. Malossi, E. S. Quintana-Orti, F. Scheidegger, S. Zhuang
Approximate Computing Techniques: From Component- to Application-Level, Springer Nature, 2022
Abstract
This chapter illustrates the performance advantages that can be obtained by exploiting approximate computing techniques in the solution of scientific applications connected to linear algebra and deep learning. For the linear algebra case, we address the solution of sparse linear systems via iterative methods, giving experimental evidence of how approximate computing can help to reduce the dominant cost factor, which is generally due to memory access, for Jacobi solvers, Krylov subspace methods, as well as simple block-Jacobi preconditioners. In addition, this chapter proposes a technique to reduce the training cost of Deep Neural Networks by decreasing data movement across heterogeneous architectures composed of several GPUs and multicore CPU devices. In particular, this chapter proposes an algorithm to dynamically adapt the data representation format of network weights during training. This algorithm drives a compression procedure that reduces the size of the network parameters before sending them over the parallel system.
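As an illustration of the kind of reduced-precision technique the chapter discusses for Jacobi solvers, the sketch below (NumPy; not the chapter's implementation, and the function name and tolerances are ours) stores the off-diagonal part of the matrix in single precision to shrink its memory footprint while the iteration accumulates in double precision.

    import numpy as np

    def jacobi_mixed(A, b, tol=1e-8, max_iter=1000):
        """Jacobi iteration with the off-diagonal block stored in float32.

        Illustrative sketch: storing R in float32 halves its memory footprint
        (a tuned kernel would stream it in reduced precision), while the
        update and residual check are accumulated in float64.
        """
        D = np.diag(A).astype(np.float64)          # diagonal kept in full precision
        R = (A - np.diag(D)).astype(np.float32)    # off-diagonal part in reduced precision
        x = np.zeros_like(b, dtype=np.float64)
        for _ in range(max_iter):
            x_new = (b - R @ x) / D                # NumPy promotes the product to float64
            if np.linalg.norm(x_new - x) < tol * np.linalg.norm(b):
                return x_new
            x = x_new
        return x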
Model-Assisted Labeling via Explainability for Visual Inspection of Civil Infrastructures
K. Janouskova, M. Rigotti, I. Giurgiu, C. Malossi
International Joint Conference on Artificial Intelligence (IJCAI), 2022
Enabling Reproducibility and Meta-learning Through a Lifelong Database of Experiments (LDE)
J. Tsay, A. Bartezzaghi, A. Nolte, A. C. I. Malossi
arXiv, 2022
Abstract
Artificial Intelligence (AI) development is inherently iterative and experimental. Over the course of normal development, especially with the advent of automated AI, hundreds or thousands of experiments are generated and are often lost or never examined again. There is a lost opportunity to document these experiments and learn from them at scale, but the complexity of tracking and reproducing these experiments is often prohibitive to data scientists. We present the Lifelong Database of Experiments (LDE) that automatically extracts and stores linked metadata from experiment artifacts and provides features to reproduce these artifacts and perform meta-learning across them. We store context from multiple stages of the AI development lifecycle including datasets, pipelines, how each is configured, and training runs with information about their runtime environment. The standardized nature of the stored metadata allows for querying and aggregation, especially in terms of ranking artifacts by performance metrics. We exhibit the capabilities of the LDE by reproducing an existing meta-learning study and storing the reproduced metadata in our system. Then, we perform two experiments on this metadata: 1) examining the reproducibility and variability of the performance metrics and 2) implementing a number of meta-learning algorithms on top of the data and examining how variability in experimental results impacts recommendation performance. The experimental results suggest significant variation in performance, especially depending on dataset configurations; this variation carries over when meta-learning is built on top of the results, with performance improving when using aggregated results. This suggests that a system that automatically collects and aggregates results such as the LDE not only assists in implementing meta-learning but may also improve its performance.
Design of a Cloud-Based Data Platform for Standardized Machine Learning Workflows With Applications to Transport Infrastructure
A. Bartezzaghi, I. Giurgiu, C. Marchiori, M. Rigotti, R. Sebastian, A. C. I. Malossi
IEEE Mediterranean Electrotechnical Conference (Melecon), 2022
Abstract
This paper describes the IBM One-Click Learning (OCL) platform, an ongoing internal effort in IBM Research aimed at providing end-to-end support for the full data science process. The design decisions behind the platform are presented, specifically with regard to promoting adoption and applicability within application domains where the potential of machine learning (ML) is recognized but still not fully harnessed. We focus on the representative case study of applying ML to civil engineering, a domain of growing interest also thanks to the EU coordination and support action to improve the European standards for inspection, monitoring and maintenance of bridges, tunnels and other types of transport infrastructure. This example is illustrative of the major roadblocks to adoption of ML in new application domains: the diversity in user profiles and familiarity with data science among domain practitioners; the variety in available hardware infrastructure and computing needs; and the heterogeneity, specificity and unsettled evolution of the use case landscape. Removing these roadblocks requires designing a platform explicitly geared towards usability goals of broad accessibility on the one hand, and extensibility and specialization on the other. We elaborate on a set of functionalities to support these usability goals and thereby enable the design of a task-agnostic end-to-end platform to drive ML adoption and workflow standardization in new dynamic application domains. Finally, we present the IBM OCL platform as a proof-of-concept implementation of these functionalities and validate it in a use case where computer vision models are deployed to aid visual inspection of a bridge.
2021
AutoText: An End-to-End AutoAI Framework for Text
A. Chaudhary, A. Issak, K. Kate, Y. Katsis, A. Valente, D. Wang, A. Evfimievski, S. Gurajada, B. Kawas, A. C. I. Malossi, L. Popa, T. Pedapati, H. Samulowitz, M. Wistuba, Y. Li
AAAI Conference on Artificial Intelligence, 2021
Abstract
Building models for natural language processing (NLP) tasks remains a daunting task for many, requiring significant technical expertise, effort, and resources. In this demonstration, we present AutoText, an end-to-end AutoAI framework for text, to lower the barrier of entry in building NLP models. AutoText combines state-of-the-art AutoAI optimization techniques and learning algorithms for NLP tasks into a single extensible framework. Through its simple, yet powerful UI, non-AI experts (e.g., domain experts) can quickly generate performant NLP models with support both to control (e.g., by specifying constraints) and to understand the learned models.
2020
Generating Efficient DNN-Ensembles with Evolutionary Computation
M. Ortiz, F. Scheidegger, M. Casas, C. Malossi, E. Ayguadé
arXiv, 2020
Abstract
In this work, we leverage ensemble learning as a tool for the creation of faster, smaller, and more accurate deep learning models. We demonstrate that we can jointly optimize for accuracy, inference time, and the number of parameters by combining DNN classifiers. To achieve this, we combine multiple ensemble strategies: bagging, boosting, and an ordered chain of classifiers. To reduce the number of DNN ensemble evaluations during the search, we propose EARN, an evolutionary approach that optimizes the ensemble according to three objectives, subject to the constraints specified by the user. We run EARN on 10 image classification datasets with an initial pool of 32 state-of-the-art DCNNs on both CPU and GPU platforms, and we generate models with speedups up to 7.60×, parameter reductions of up to 10×, or accuracy increases of up to 6.01% with respect to the best DNN in the pool. In addition, our method generates models that are 5.6× faster than the state-of-the-art methods for automatic model generation.
XwattPilot: A Full-stack Cloud System Enabling Agile Development of Transprecision Software for Low-power SoCs
Dionysios Diamantopoulos, Florian Scheidegger, Stefan Mach, Fabian Schuiki, Germain Haugou, Michael Schaffner, Frank K. Gurkaynak, Christoph Hagleitner, A. Cristiano I. Malossi, Luca Benini
2020 IEEE Symposium in Low-Power and High-Speed Chips (COOL CHIPS)
Abstract
The performance improvement rate of conventional von Neumann processors has slowed as Moore's Law grinds to an economic halt, giving rise to a new age of heterogeneity for energy-efficient computing. Extending processors with finely tunable precision instructions has emerged as a form of heterogeneity that trades off computation precision against power consumption. However, the prolonged design time due to customization of the supported framework for a system-on-a-chip may counteract the advantages of transprecision computing. We propose XwattPilot, a system that aims to accelerate the transprecision software development of low-power processors using cloud technology. We show that the total energy-to-solution can be significantly decreased by using transprecision computations, and that the proposed system accelerates the energy-efficiency evaluation runtime by 10.3×.
torcpy: Supporting task parallelism in Python
P. E. Hadjidoukas, A. Bartezzaghi, F. Scheidegger, R. Istrate, C. Bekas, A. C. I. Malossi
SoftwareX 12, 2020
Abstract
Task-based parallelism has been established as one of the main forms of code parallelization, where asynchronous tasks are launched and distributed across the processing units of a local machine, a cluster or a supercomputer. The tasks can be either completely decoupled, corresponding to a set of independent jobs, or be part of an iterative algorithm where the task results are processed and drive the next step. Typical use cases include the application of the same function to different data, parametric searches and algorithms used in numerical optimization and Bayesian uncertainty quantification. In this work, we introduce torcpy, a platform-agnostic adaptive load balancing library that orchestrates the asynchronous execution of tasks, expressed as callables with arguments, on both shared and distributed memory platforms. The library is implemented on top of MPI and multithreading and provides lightweight support for nested loops and map functions. Experimental results using representative applications demonstrate the flexibility and efficiency of the proposed Python package.
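The map-style task parallelism described above can be sketched with Python's standard library; the snippet below uses concurrent.futures purely to illustrate the programming model and is not torcpy's API (function and parameter names are ours).

    from concurrent.futures import ProcessPoolExecutor

    def simulate(theta):
        """Stand-in for an expensive, independent task (e.g. one parameter sample)."""
        return sum(x * theta for x in range(1000))

    if __name__ == "__main__":
        params = [0.1 * i for i in range(32)]
        # Map the same function over different data: tasks run asynchronously
        # on the available workers and results are gathered in submission order.
        with ProcessPoolExecutor(max_workers=4) as pool:
            results = list(pool.map(simulate, params))
        print(results[:3])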
Reducing Data Motion to Accelerate the Training of Deep Neural Networks
Sicong Zhuang, A. Cristiano I. Malossi, Marc Casas
arXiv, 2020
Abstract
This paper reduces the cost of DNN training by decreasing the amount of data movement across heterogeneous architectures composed of several GPUs and multicore CPU devices. In particular, this paper proposes an algorithm to dynamically adapt the data representation format of network weights during training. This algorithm drives a compression procedure that reduces the data size before it is sent over the parallel system. We run an extensive evaluation campaign considering several up-to-date deep neural network models and two high-end parallel architectures composed of multiple GPUs and CPU multicore chips. Our solution achieves average performance improvements from 6.18% up to 11.91%.
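A minimal sketch of the kind of compression step such an algorithm drives (illustrative only; the paper adapts the representation format dynamically during training): weights are cast to a narrower format before being exchanged across devices and restored to the working precision on arrival.

    import numpy as np

    def compress_weights(w, dtype=np.float16):
        """Cast a float32 weight tensor to a narrower format before communication."""
        return w.astype(dtype)

    def decompress_weights(w_compressed):
        """Restore the working precision on the receiving device."""
        return w_compressed.astype(np.float32)

    w = np.random.randn(1024, 1024).astype(np.float32)
    payload = compress_weights(w)                  # half the bytes to move
    w_restored = decompress_weights(payload)
    print(payload.nbytes / w.nbytes)               # 0.5
    print(np.abs(w - w_restored).max())            # small representation error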
Efficient Image Dataset Classification Difficulty Estimation for Predicting Deep-Learning Accuracy
Florian Scheidegger, Roxana Istrate, Giovanni Mariani, Luca Benini, Costas Bekas, A. Cristiano I. Malossi
The Visual Computer, 2020
Abstract
In the deep-learning community new algorithms are published at an incredible pace. Therefore, solving an image classification problem for new datasets becomes a challenging task, as it requires re-evaluating published algorithms and their different configurations in order to find a close-to-optimal classifier. To facilitate this process, before biasing our decision towards a class of neural networks or running an expensive search over the network space, we propose to estimate the classification difficulty of the dataset. Our method computes a single number that characterizes the dataset difficulty 27× faster than training state-of-the-art networks. The proposed method can be used in combination with network topology and hyper-parameter search optimizers to efficiently drive the search towards promising neural-network configurations.
2019
FloatX: A C++ Library for Customized Floating-Point Arithmetic
G. Flegar, F. Scheidegger, V. Novakovic, G. Mariani, A. Tomas, A. C. I. Malossi, E. S. Quintana-Orti
ACM Transactions on Mathematical Software (TOMS) 45(4), 2019
Abstract
We present FloatX (Float eXtended), a C++ framework to investigate the effect of leveraging customized floating-point formats in numerical applications. FloatX formats are based on binary IEEE 754 with smaller significand and exponent bit counts specified by the user. Among other properties, FloatX facilitates an incremental transformation of the code, relies on hardware-supported floating-point types as the back-end to preserve efficiency, and incurs no storage overhead. The article discusses in detail the design principles, programming interface, and datatype casting rules behind FloatX. Furthermore, it demonstrates FloatX's usage and benefits via several case studies from well-known numerical dense linear algebra libraries, such as BLAS and LAPACK; the Ginkgo library for sparse linear systems; and two neural network applications related to image processing and text recognition.
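FloatX itself is a C++ library; as a language-neutral illustration of what a user-specified, smaller significand means (this is not FloatX's API, and the function below is ours), the sketch emulates a reduced-significand double by truncating mantissa bits.

    import numpy as np

    def truncate_significand(x, kept_bits):
        """Emulate a float64 with only `kept_bits` significand bits (truncation,
        not round-to-nearest); illustrative of customized floating-point formats."""
        assert 0 < kept_bits <= 52
        bits = np.asarray(x, dtype=np.float64).view(np.uint64)
        mask = np.uint64(0xFFFFFFFFFFFFFFFF) << np.uint64(52 - kept_bits)
        return (bits & mask).view(np.float64)

    x = np.array([np.pi, np.e, 1.0 / 3.0])
    print(truncate_significand(x, 10))   # half-precision-like significand
    print(truncate_significand(x, 23))   # single-precision-like significand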
Constrained deep neural network architecture search for IoT devices accounting for hardware calibration
Florian Scheidegger, Luca Benini, Costas Bekas, A. Cristiano I. Malossi
NeurIPS - Thirty-third Conference on Neural Information Processing Systems, 2019
Abstract
Deep neural networks achieve outstanding results in challenging image classification tasks. However, the design of network topologies is a complex task, and the research community makes a constant effort to discover top-accuracy topologies, either manually or by employing expensive architecture searches. In this work, we propose a unique narrow-space architecture search that focuses on delivering low-cost and fast-executing networks that respect strict memory and time requirements typical of Internet-of-Things (IoT) near-sensor computing platforms. Our approach provides solutions with classification latencies below 10ms running on a $35 device with 1GB RAM and 5.6GFLOPS peak performance. The narrow-space search of floating-point models improves the accuracy on CIFAR10 of an established IoT model from 70.64% to 74.87% while respecting the same memory constraints. We further improve the accuracy to 82.07% by including 16-bit half types, and we obtain the best accuracy of 83.45% by extending the search with model-optimized IEEE 754 reduced types. To the best of our knowledge, we are the first to empirically demonstrate, on over 3000 trained models, that running with reduced precision pushes the Pareto optimal front by a wide margin. Under a given memory constraint, accuracy improves by over 7 percentage points with half precision, and by over 1 point further when each model runs with its best individual format.
NeuNetS: An Automated Synthesis Engine for Neural Network Design
Atin Sood, Benjamin Elder, Benjamin Herta, Chao Xue, Costas Bekas, A. Cristiano I. Malossi, Debashish Saha, Florian Scheidegger, Ganesh Venkataraman, Gegi Thomas, Giovanni Mariani, Hendrik Strobelt, Horst Samulowitz, Martin Wistuba, Matteo Manica, Mihir Choudhury, Rong Yan, Roxana Istrate, Ruchir Puri, Tejaswini Pedapati
2019
Abstract
Application of neural networks to a vast variety of practical applications is transforming the way AI is applied in practice. Pre-trained neural network models available through APIs, and the capability to custom-train pre-built neural network architectures with customer data, have made the consumption of AI by developers much simpler and resulted in broad adoption of these complex AI models. While prebuilt network models exist for certain scenarios, to meet the constraints that are unique to each application, AI teams need to think about developing custom neural network architectures that can meet the tradeoff between accuracy and memory footprint to achieve the tight constraints of their unique use-cases. However, only a small proportion of data science teams have the skills and experience needed to create a neural network from scratch, and the demand far exceeds the supply. In this paper, we present NeuNetS: an automated Neural Network Synthesis engine for custom neural network design that is available as part of IBM's AI OpenScale product. NeuNetS is available for both text and image domains and can build neural networks for specific tasks in a fraction of the time it takes today with human effort, and with accuracy similar to that of human-designed AI models.
TAPAS: Train-less Accuracy Predictor for Architecture Search
Roxana Istrate, Florian Scheidegger, Giovanni Mariani, Dimitrios S. Nikolopoulos, Costas Bekas, A. Cristiano I. Malossi
AAAI Conference on Artificial Intelligence, 2019
Abstract
In recent years an increasing number of researchers and practitioners have been suggesting algorithms for large-scale neural network architecture search: genetic algorithms, reinforcement learning, learning curve extrapolation, and accuracy predictors. None of them, however, has demonstrated high performance without training new experiments in the presence of unseen datasets. We propose a new deep neural network accuracy predictor that estimates, in fractions of a second, the classification performance for unseen input datasets, without training. In contrast to previously proposed approaches, our prediction is calibrated not only on the topological network information, but also on a characterization of the dataset difficulty, which allows us to re-tune the prediction without any training. Our predictor achieves a throughput exceeding 100 networks per second on a single GPU, thus creating the opportunity to perform large-scale architecture search within a few minutes. We present results of two searches performed in 400 seconds on a single GPU. Our best discovered networks reach 93.67% accuracy for CIFAR-10 and 81.01% for CIFAR-100, verified by training. These networks are performance-competitive with other automatically discovered state-of-the-art networks; however, we needed only a small fraction of the time-to-solution and computational resources.
2018
BAGAN: Data Augmentation with Balancing GAN
Giovanni Mariani, Florian Scheidegger, Roxana Istrate, Costas Bekas, A. Cristiano I. Malossi
ICML Workshop on Theoretical Foundations and Applications of Deep Generative Models, 2018
Abstract
Image classification datasets are often imbalanced, a characteristic that negatively affects the accuracy of deep-learning classifiers. In this work we propose balancing GAN (BAGAN) as an augmentation tool to restore balance in imbalanced datasets. This is challenging because the few minority-class images may not be enough to train a GAN. We overcome this issue by including during the adversarial training all available images of majority and minority classes. The generative model learns useful features from majority classes and uses these to generate images for minority classes. We apply class conditioning in the latent space to drive the generation process towards a target class. The generator in the GAN is initialized with the encoder module of an autoencoder, which enables us to learn an accurate class-conditioning in the latent space. We compare the proposed methodology with state-of-the-art GANs and demonstrate that BAGAN generates images of superior quality when trained with an imbalanced dataset.
The transprecision computing paradigm: Concept, design, and applications
A. Cristiano I. Malossi, Michael Schaffner, Anca Molnos, Luca Gammaitoni, Giuseppe Tagliavini, Andrew Emerson, Andres Tomas, Dimitrios S. Nikolopoulos, Eric Flamand, Norbert Wehn
Design, Automation & Test in Europe Conference & Exhibition (DATE), 2018, pp. 1105--1110
Abstract
Guaranteed numerical precision of each elementary step in a complex computation has been the mainstay of traditional computing systems for many years. This era, fueled by Moore's law and the constant exponential improvement in computing efficiency, is at its twilight: from tiny nodes of the Internet-of-Things, to large HPC computing centers, sub-picoJoule/operation energy efficiency is essential for practical realizations. To overcome the power wall, a shift from traditional computing paradigms is now mandatory. In this paper we present the driving motivations, roadmap, and expected impact of the European project OPRECOMP. OPRECOMP aims to (i) develop the first complete transprecision computing framework, (ii) apply it to a wide range of hardware platforms, from the sub-milliWatt up to the MegaWatt range, and (iii) demonstrate impact in a wide range of computational domains, spanning IoT, Big Data Analytics, Deep Learning, and HPC simulations. By combining advances in devices, circuits, software tools, and algorithms into a seamless transprecision design, we expect to achieve major energy efficiency improvements, even when there is no freedom to relax end-to-end application quality of results. Indeed, OPRECOMP aims at demolishing the ultra-conservative "precise" computing abstraction, replacing it with a more flexible and efficient one, namely transprecision computing.
A scalable iterative dense linear system solver for multiple right-hand sides in data analytics
Vassilis Kalantzis, A. Cristiano I. Malossi, Costas Bekas, Alessandro Curioni, Efstratios Gallopoulos, Yousef Saad
Parallel Computing, Elsevier, 2018
Abstract
We describe Parallel-Projection Block Conjugate Gradient (pp-bcg), a distributed iterative solver for the solution of dense and symmetric positive definite linear systems with multiple right-hand sides. In particular, we focus on linear systems appearing in the context of stochastic estimation of the diagonal of the matrix inverse in Uncertainty Quantification. pp-bcg is based on the block Conjugate Gradient algorithm combined with Galerkin projections to accelerate the convergence rate of the solution process of the linear systems. Numerical experiments on massively parallel architectures illustrate the performance of the proposed scheme in terms of efficiency and convergence rate, as well as its effectiveness relative to the (block) Conjugate Gradient and the Cholesky-based ScaLAPACK solver. In particular, on a 4-rack BG/Q with up to 65,536 processor cores, using dense matrices of order as high as 524,288 and 800 right-hand sides, pp-bcg can be 2×-3× faster than the aforementioned techniques.
2017
Incremental Training of Deep Convolutional Neural Networks
Roxana Istrate, A. Cristiano I. Malossi, Costas Bekas, Dimitrios Nikolopoulos
International Workshop on Automatic Selection, Configuration and Composition of Machine Learning Algorithms, 2017
Abstract
We propose an incremental training method that partitions the original network into sub-networks, which are then gradually incorporated into the running network during the training process. To allow for a smooth dynamic growth of the network, we introduce a look-ahead initialization that outperforms random initialization. We demonstrate that our incremental approach reaches the reference network baseline accuracy. Additionally, it allows us to identify smaller partitions of the original state-of-the-art network that deliver the same final accuracy while using only a fraction of the global number of parameters. This allows for a potential speedup of the training time by several factors. We report training results on CIFAR-10 for ResNet and VGGNet.
Impact of temporal subsampling on accuracy and performance in practical video classification
Florian Scheidegger, Lukas Cavigelli, Michael Schaffner, A. Cristiano I. Malossi, Costas Bekas, Luca Benini
25th European Signal Processing Conference (EUSIPCO), 2017, pp. 996-1000
Abstract
In this paper we evaluate three state-of-the-art neural-network-based approaches for large-scale video classification, where the computational efficiency of the inference step is of particular importance due to the ever-increasing data throughput of video streams. Our evaluation focuses on finding good efficiency vs. accuracy tradeoffs by evaluating different network configurations and parameterizations. In particular, we investigate the use of different temporal subsampling strategies, and show that they can be used to effectively trade computational workload against classification accuracy. Using a subset of the YouTube-8M dataset, we demonstrate that workload reductions on the order of 10×, 50× and 100× can be achieved with accuracy reductions of only 1.3%, 6.2% and 10.8%, respectively. Our results show that temporal subsampling is a simple and generic approach that behaves consistently over the considered classification pipelines and does not require retraining of the underlying networks.
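A temporal subsampling step of the kind evaluated here is straightforward to express; the sketch below (illustrative only, with hypothetical tensor shapes) keeps every k-th frame of a clip before it is fed to the classification pipeline.

    import numpy as np

    def temporal_subsample(frames, factor):
        """Keep every `factor`-th frame of a video tensor shaped (T, H, W, C)."""
        return frames[::factor]

    video = np.zeros((300, 224, 224, 3), dtype=np.uint8)   # 300-frame clip
    print(temporal_subsample(video, 50).shape)              # 50x fewer frames -> (6, 224, 224, 3)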
Energy-Aware High Performance Computing
Martin Wlotzka, Vincent Heuveline, Manuel F. Dolz, M. Reza Heidari, Thomas Ludwig, A. Cristiano I. Malossi, Enrique S. Quintana-Orti
ICT-Energy Concepts for Energy Efficiency and Sustainability, InTech, 2017
Abstract
High performance computing centres consume substantial amounts of energy to power large-scale supercomputers and the necessary building and cooling infrastructure. Recently, considerable performance gains resulted predominantly from developments in multi-core, many-core and accelerator technology. Computing centres rapidly adopted this hardware to serve the increasing demand for computational power. However, further performance increases in large-scale computing systems are limited by the aggregate energy budget required to operate them. Power consumption has become a major cost factor for computing centres. Furthermore, energy consumption results in carbon dioxide emissions, a hazard for the environment and public health; and heat, which reduces the reliability and lifetime of hardware components. Energy efficiency is therefore crucial in high performance computing.
This chapter addresses key issues of energy-aware high performance computing. We outline some numerical methods which are often used in scientific applications, and present an energy profiling and tracing technique suitable to analyse the power consumption of applications. The next section is devoted to the performance and energy characterization of the sparse matrix-vector product, a basic numerical building block. Finally, we discuss opportunities for saving energy in computations by means of two examples. First, we present energy-aware runtimes on shared memory multi-core platforms for the Conjugate Gradient method. Second, we present energy-efficient techniques for multigrid methods on distributed memory clusters.
2016
First Experiences with ab initio Molecular Dynamics on OpenPOWER: The Case of CPMD
Valery Weber, A. Cristiano I. Malossi, Ivano Tavernelli, Teodoro Laino, Costas Bekas, Manish Modani, Nina Wilner, Tom Heller, Alessandro Curioni
Proceedings of the ISC High Performance 2016 International Workshops, Frankfurt, Germany, June 19-23, 2016, pp. 228-234
Abstract
In this article, we present the algorithmic adaptation and code re-engineering required for porting highly successful and popular planewave codes to next-generation heterogeneous OpenPOWER architectures that foster acceleration and high-bandwidth links to GPUs. Here we focus on CPMD as the most representative software for ab initio molecular dynamics simulations. We have ported the construction of the electronic density, the application of the potential to the wavefunctions, and the orthogonalization procedure to the GPU. The different GPU kernels consist mainly of fast Fourier transforms (FFT) and basic linear algebra operations (BLAS). The performance of the new implementation obtained on Firestone (POWER8/Tesla) is discussed. We show that the communication between the host and the GPU contributes a large fraction of the total run time. We expect a strong attenuation of the communication bottleneck once the NVLink high-speed interconnect becomes available.
The Impact of Voltage-Frequency Scaling for the Matrix-Vector Product on the IBM POWER8
Sandra Catalan, A. Cristiano I. Malossi, Costas Bekas, Enrique S. Quintana-Orti
Proceedings of the 22nd International Conference on Parallel and Distributed Computing (Euro-Par 2016), Grenoble, France, August 24-26, 2016, pp. 103-116
Abstract
The physical limitations of CMOS miniaturization have turned the understanding of the interplay between performance and energy into a primary challenge. In this paper we contribute towards this goal by assessing the effect of voltage and frequency scaling (VFS) on the energy consumption of the dense and sparse matrix-vector products. The optimization of the sparse kernel, from the perspective of both performance and energy efficiency, is especially difficult due to its irregular memory access pattern, but the potential benefits are remarkable because of its varied applications.
Our experiments with a small synthetic training set show that it is possible to build a general classification of sparse matrices that governs the optimal VFS level from the point of view of energy efficiency. More importantly, this characterization can be leveraged to tune VFS for a major portion of the University of Florida Matrix Collection, when executed on the IBM Power8, yielding significant gains with respect to a (power-hungry) configuration that simply favours performance.
Stochastic Matrix-Function Estimators: Scalable Big-Data Kernels with High Performance
Peter W. J. Staar, Panagiotis K. Barkoutsos, Roxana Istrate, A. Cristiano I. Malossi, Ivano Tavernelli, Nikolaj Moll, Heiner Giefers, Christoph Hagleitner, Costas Bekas, Alessandro Curioni
2016 IEEE International Parallel and Distributed Processing Symposium (IPDPS), pp. 812-821
Best Paper Award Winner
Abstract
In this era of Big Data, large graphs appear in many scientific domains. To extract the hidden knowledge and correlations in these graphs, novel methods need to be developed to analyse them quickly. In this paper, we present a unified framework of stochastic matrix-function estimators, which allows one to compute a subset of the elements of the matrix f(A), where f is an arbitrary function and A is the adjacency matrix of the graph. The new framework has a computational cost proportional to the size of the subset; i.e., to obtain the diagonal of f(A) for a matrix of size N, the computational cost is proportional to N, in contrast to the traditional N^3 cost of diagonalization. Furthermore, we show that the new framework allows us to write implementations of the algorithm that scale naturally with the number of compute nodes and are easily ported to accelerators, where the kernels perform very well.
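A minimal sketch of the stochastic estimation idea (Hutchinson-style probing with Rademacher vectors; the dense f(A) below is computed explicitly only to validate this small demo, whereas the framework applies f(A)v through expansions so the cost stays proportional to the requested entries):

    import numpy as np
    from scipy.linalg import expm

    rng = np.random.default_rng(0)
    N, num_probes = 200, 400

    # Symmetric adjacency-like matrix and f(A) = exp(A) as the example function.
    A = rng.standard_normal((N, N)) * 0.05
    A = (A + A.T) / 2
    FA = expm(A)                       # dense reference, only for this small demo

    # Stochastic diagonal estimator: diag(f(A)) ~ mean over probes of v * (f(A) v),
    # with Rademacher probe vectors v in {-1, +1}^N.
    V = rng.choice([-1.0, 1.0], size=(N, num_probes))
    diag_est = np.mean(V * (FA @ V), axis=1)

    print(np.linalg.norm(diag_est - np.diag(FA)) / np.linalg.norm(np.diag(FA)))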
2015
An Extreme-scale Implicit Solver for Complex PDEs: Highly Heterogeneous Flow in Earth's Mantle
Johann Rudi, A. Cristiano I. Malossi, Tobin Isaac, Georg Stadler, Michael Gurnis, Peter W. J. Staar, Yves Ineichen, Costas Bekas, Alessandro Curioni, Omar Ghattas
Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, pp. 1-12, ACM, 2015
ACM Gordon Bell Prize Winner 2015
Abstract
Mantle convection is the fundamental physical process within earth's interior responsible for the thermal and geological evolution of the planet, including plate tectonics. The mantle is modeled as a viscous, incompressible, non-Newtonian fluid. The wide range of spatial scales, extreme variability and anisotropy in material properties, and severely nonlinear rheology have made global mantle convection modeling with realistic parameters prohibitive. Here we present a new implicit solver that exhibits optimal algorithmic performance and is capable of extreme scaling for hard PDE problems, such as mantle convection. To maximize accuracy and minimize runtime, the solver incorporates a number of advances, including aggressive multi-octree adaptivity, mixed continuous-discontinuous discretization, arbitrarily-high-order accuracy, hybrid spectral/geometric/algebraic multigrid, and novel Schur-complement preconditioning. These features present enormous challenges for extreme scalability. We demonstrate that---contrary to conventional wisdom---algorithmically optimal implicit solvers can be designed that scale out to 1.5 million cores for severely nonlinear, ill-conditioned, heterogeneous, and anisotropic PDEs.
Systematic derivation of time and power models for linear algebra kernels on multicore architectures
A. Cristiano I. Malossi, Yves Ineichen, Costas Bekas, Alessandro Curioni, Enrique S. Quintana-Orti
Sustainable Computing: Informatics and Systems 7, 24-40, 2015
Abstract
The power wall calls for a holistic effort from the high-performance computing and scientific communities to develop power-aware tools and applications which ultimately drive the design of energy-efficient hardware. Toward this goal, we introduce a systematic methodology to derive reliable time and power models for algebraic kernels employing a bottom-up approach. This strategy helps to understand the contribution of the different kernels to the total energy consumption of applications, as well as to distinguish between the cost of fine-grain components such as arithmetic, memory access, and overheads introduced by, e.g., multithreading or reductions.
To study and validate our methodology, we initially focus on two key memory-bound BLAS-1 vector kernels: the dot product and the axpy operation. Subsequently, we show how these kernels can be composed to accurately predict the energy consumption of more heterogeneous algorithms, such as the Conjugate Gradient method, while tackling the elaborate memory hierarchy and the high degree of concurrency of today's processors; in particular, the evaluation of the models on the IBM Blue Gene/Q supercomputer, as well as on the IBM Power 755 server, reveals that average power consumption is captured with high accuracy, yet the models and the methodology are general enough to be portable to any general-purpose multicore architecture.
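The bottom-up composition can be sketched as follows (the coefficients are made up purely for illustration; the paper calibrates per-kernel time and power models on real hardware): once time and average power are modelled for dot and axpy, the energy of a composite algorithm such as CG is predicted by summing the contributions of its kernel invocations.

    # Hypothetical per-kernel models: time [s] and average power [W] as a function
    # of the vector length n (coefficients invented for illustration only).
    def t_dot(n):  return 2.1e-10 * n + 1.5e-6
    def p_dot(n):  return 85.0
    def t_axpy(n): return 3.0e-10 * n + 1.2e-6
    def p_axpy(n): return 90.0

    def cg_iteration_energy(n, n_dot=2, n_axpy=3):
        """Energy of one CG iteration predicted by composing kernel models
        (the SpMV contribution would be added in the same way)."""
        e_dot  = n_dot  * t_dot(n)  * p_dot(n)
        e_axpy = n_axpy * t_axpy(n) * p_axpy(n)
        return e_dot + e_axpy

    print(f"{cg_iteration_energy(10**7):.4f} J per iteration (model prediction)")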
Fast Exponential Computation on SIMD Architectures
A. Cristiano I. Malossi, Yves Ineichen, Costas Bekas, Alessandro Curioni
HiPEAC 2015 - 1st Workshop On Approximate Computing (WAPCO)
Abstract
State-of-the-art implementations of the exponential function rely on interpolation tables, Taylor expansions, or IEEE manipulations containing a small fraction of integer operations. Unfortunately, none of these methods is able to maximize the benefit of vectorization and, at the same time, provide sufficient accuracy. Indeed, many applications, such as solving PDEs, simulations of neuronal networks, Fourier transforms, and many more, involve a large number of exponentials that have to be computed efficiently. In this paper we devise and demonstrate the usefulness of a novel formulation to compute the exponential employing only floating-point operations, with a flexible accuracy ranging from a few digits up to full machine precision. Using the presented algorithm we can compute exponentials of large vectors, in any application setting, maximizing the performance gains of the vectorization units available in modern processors. This immediately results in a speedup for all applications.
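For context, the classic IEEE-manipulation approach mentioned in the abstract (a Schraudolph-style trick; it relies on an integer cast and bit reinterpretation, exactly what the paper's floating-point-only formulation avoids) can be sketched as follows; without correction terms its relative error is several percent.

    import numpy as np

    def approx_exp(x):
        """Schraudolph-style exp approximation for float64 arrays (illustrative).

        Writes x / ln(2) into the exponent field of a double via an integer cast;
        relative error is up to several percent without correction constants.
        """
        x = np.asarray(x, dtype=np.float64)
        a = 2.0**52 / np.log(2.0)       # scale so the product lands in the exponent bits
        b = 1023.0 * 2.0**52            # exponent bias of IEEE 754 double precision
        i = (a * x + b).astype(np.int64)
        return i.view(np.float64)

    x = np.linspace(-5.0, 5.0, 5)
    print(approx_exp(x))
    print(np.exp(x))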
2014
Performance and energy-aware characterization of the sparse matrix-vector multiplication on multithreaded architectures
A. Cristiano I. Malossi, Yves Ineichen, Costas Bekas, Alessandro Curioni, Enrique S. Quintana-Orti
43rd International Conference on Parallel Processing (ICPP), 3rd International Workshop on Power-aware Algorithms, Systems, and Architectures, pp. 139-148, 2014
Massively Parallel and Near Linear Time Graph Analytics
Fazle E. Faisal, Yves Ineichen, A. Cristiano I. Malossi, Peter Staar, Costas Bekas, Alessandro Curioni
SC'14, 2014
Abstract
Graph theory and graph analytics are essential in many research problems and have applications in various domains. The rapid growth in data can lead to very large sparse graphs consisting of billions of nodes. This poses many challenges in designing fast algorithms for large-scale graph analytics. We believe that in the big data regime accuracy has to be traded for algorithmic complexity and scalability to drive big data analytics. We discuss the underlying mathematical models of two novel near-linear (O(N)) methods for graph analytics: estimating subgraph centralities, and spectrograms. Parallelism on multiple levels is used to attain good scalability. We benchmarked the proposed algorithms on huge datasets, e.g. the European street network containing 51 million nodes and 108 million non-zeros, on up to 32,768 cores. Due to the iterative behavior of the methods, a very good estimate can be computed in a matter of seconds.
Machine Learning Algorithms for the Performance and Energy-Aware Characterization of Linear Algebra Kernels on Multithreaded Architectures
A. Cristiano I. Malossi, Yves Ineichen, Costas Bekas, Alessandro Curioni, Enrique S Quintana-Orti
SC'14, 2014
Abstract
The performance and energy optimization of the 13 "dwarfs", proposed by UC Berkeley in 2006, can have a tremendous impact on a vast number of scientific applications and existing computational libraries. To achieve this goal, scientists and software engineers need tools for analyzing and modeling the performance-power-energy interactions of their kernels on real HPC systems. In this poster we present systematic methods to derive reliable time-power-energy models for dense and sparse linear algebra operations. Our strategy is based on decomposing the kernels into sub-components (e.g., arithmetic and memory accesses) and identifying the critical features that drive their performance, power, and energy consumption. The proposed techniques provide tools for analyzing and reengineering algorithms for the desired power- and energy-efficiency, as well as to reduce the operational costs of HPC supercomputers and cloud systems with thousands of concurrent users.
2013
On the continuity of mean total normal stress in geometrical multiscale cardiovascular problems
Pablo J. Blanco, Simone Deparis, A. Cristiano I. Malossi
Journal of Computational Physics 251, 136-155, Elsevier, 2013
Abstract
In this work an iterative strategy to implicitly couple dimensionally-heterogeneous blood flow models accounting for the continuity of mean total normal stress at interface boundaries is developed. Conservation of mean total normal stress in the coupling of heterogeneous models is mandatory to satisfy energetic consistency between them. Nevertheless, existing methodologies are based on modifications of the Navier-Stokes variational formulation, which are undesired when dealing with fluid-structure interaction or black box codes. The proposed methodology makes it possible to couple one-dimensional and three-dimensional fluid-structure interaction models, enforcing the continuity of mean total normal stress while just imposing flow rate data or even the classical Neumann boundary data to the models. This is accomplished by modifying an existing iterative algorithm, which is also able to account for the continuity of the vessel area, when required. Comparisons are performed to assess differences in the convergence properties of the algorithms when considering the continuity of mean normal stress and the continuity of mean total normal stress for a wide range of flow regimes. Finally, examples in the physiological regime are shown to evaluate the importance, or not, of considering the continuity of mean total normal stress in hemodynamics simulations.
Numerical comparison and calibration of geometrical multiscale models for the simulation of arterial flows
A. Cristiano I. Malossi, Jean Bonnemain
Cardiovascular Engineering and Technology 4(4), 440-463, Springer, 2013
Abstract
Arterial tree hemodynamics can be simulated by means of several models of different level of complexity, depending on the outputs of interest and the desired degree of accuracy. In this work, several numerical comparisons of geometrical multiscale models are presented with the aim of evaluating the benefits of such complex dimensionally-heterogeneous models compared to other simplified simulations. More precisely, we present flow rate and pressure wave form comparisons between three-dimensional patient-specific geometries implicitly coupled with one-dimensional arterial tree networks and (i) a full one-dimensional arterial tree model and (ii) stand-alone three-dimensional fluid-structure interaction models with boundary data taken from precomputed full one-dimensional network simulations. In a slightly different context, we also focus on the set up and calibration of cardiovascular simulations. In particular, we perform sensitivity analyses of the main quantities of interest (flow rate, pressure, and solid wall displacement) with respect to the parameters accounting for the elastic and viscoelastic responses of the tissues surrounding the external wall of the arteries. Finally, we also compare the results of geometrical multiscale models in which the boundary solid rings of the three-dimensional geometries are fixed, with respect to those where the boundary interfaces are scaled to enforce the continuity of the vessels size with the surrounding one-dimensional arteries.
Implicit coupling of one-dimensional and three-dimensional blood flow models with compliant vessels
A. Cristiano I. Malossi, Pablo J. Blanco, Paolo Crosetto, Simone Deparis, Alfio Quarteroni
Multiscale Modeling & Simulation 11(2), 474-506, SIAM, 2013
Abstract
The blood flow in arterial trees in the cardiovascular system can be simulated with the help of different models, depending on the outputs of interest and the desired degree of accuracy. In particular, one-dimensional fluid-structure interaction models for arteries are very effective in reproducing physiological pressure wave propagation and in providing quantities like pressure and velocity, averaged on the cross section of the arterial lumen. In locations where one-dimensional models cannot capture the complete flow dynamics, e.g., in the presence of stenoses and aneurysms or other strong geometric perturbations, three-dimensional coupled fluid-structure interaction models are necessary to evaluate more accurately, for instance, critical factors responsible for pathologies which are associated with hemodynamics, such as wall shear stress. In this work we formalize and investigate the geometrical multiscale problem, where heterogeneous fluid-structure interaction models for arteries are implicitly coupled. We introduce new coupling algorithms, describe their implementation, and investigate on simple geometries the numerical reflections that occur at the interface between the heterogeneous models. We also simulate on a supercomputer a three-dimensional abdominal aorta under physiological conditions, coupled with up to six one-dimensional models representing the surrounding arterial branches. Finally, we compare CPU times and number of coupling iterations for different algorithms and time discretizations.
Numerical simulation of left ventricular assist device implantations: comparing the ascending and the descending aorta cannulations
Jean Bonnemain, A. Cristiano I. Malossi, Matteo Lesinigo, Simone Deparis, Alfio Quarteroni, Ludwig K. von Segesser
Medical Engineering & Physics 35(10), 1465-1475, Elsevier, 2013
Abstract
In this work we present numerical simulations of continuous flow left ventricle assist device implantation with the aim of comparing differences in flow rates and pressure patterns depending on the location of the anastomosis and the rotational speed of the device. Despite the fact that the descending aorta anastomosis approach is less invasive, since it does not require a sternotomy and a cardiopulmonary bypass, its benefits are still controversial. Moreover, the device rotational speed should be correctly chosen to avoid anomalous flow rates and pressure distributions in specific locations of the cardiovascular tree. With the aim of assessing the differences between these two approaches and device rotational speeds in terms of flow rate and pressure waveforms, we set up numerical simulations of a network of one-dimensional models where we account for the presence of an outflow cannula anastomosed to different locations of the aorta. Then, we use the resulting network to compare the results of the two different cannulations for several stages of heart failure and different rotational speeds of the device. The inflow boundary data for the heart and the cannulas are obtained from a lumped parameters model of the entire circulatory system with an assist device, which is validated with clinical data. The results show that ascending and descending aorta cannulations lead to similar waveforms and mean flow rate in all the considered cases. Moreover, regardless of the anastomosis region, the rotational speed of the device has an important impact on wave profiles; this effect is more pronounced at high RPM.
2012
Partitioned Solution of Geometrical Multiscale Problems for the Cardiovascular System: Models, Algorithms, and Applications
A. Cristiano I. Malossi
Ph.D. Thesis, Swiss Federal Institute of Technology in Lausanne (EPFL), 2012
Abstract
The aim of this work is the development of a geometrical multiscale framework for the simulation of the human cardiovascular system under either physiological or pathological conditions. More precisely, we devise numerical algorithms for the partitioned solution of geometrical multiscale problems made of different heterogeneous compartments that are implicitly coupled with each other. The driving motivation is the awareness that cardiovascular dynamics are governed by the global interplay between the compartments in the network. Thus, numerical simulations of stand-alone local components of the circulatory system cannot always predict effectively the physiological or pathological states of the patients, since they do not account for the interaction with the missing elements in the network. As a matter of fact, the geometrical multiscale method provides an automatic way to determine the boundary (more precisely, the interface) data for the specific problem of interest in absence of clinical measures, and it also offers a platform on which to study the interaction between local changes (due, for instance, to pathologies or surgical interventions) and the global systemic dynamics. To set up the framework an abstract setting is devised; the local specific mathematical equations (partial differential equations, differential algebraic equations, etc.) and the numerical approximation (finite elements, finite differences, etc.) of the heterogeneous compartments are hidden behind generic operators. Consequently, the resulting global interface problem is formulated and solved in a completely transparent way. The coupling between models of different dimensional scale (three-dimensional, one-dimensional, etc.) and type (Navier-Stokes, fluid-structure interaction, etc.) is addressed writing the interface equations in terms of scalar quantities, i.e., area, flow rate, and mean (total) normal stress. In the resulting flexible framework the heterogeneous models are treated as black boxes, each one equipped with a specific number of compatible interfaces such that (i) the arrangement of the compartments in the network can be easily manipulated, thus allowing a high level of customization in the design and optimization of the global geometrical multiscale model, (ii) the parallelization of the solution of the different compartments is straightforward, leading to the opportunity to make use of the latest high-performance computing facilities, and (iii) new models can be easily added and connected to the existing ones. The methodology and the algorithms devised throughout the work are tested over several applications, ranging from simple benchmark examples to more complex cardiovascular networks. In addition, two real clinical problems are addressed: the simulation of a patient-specific left ventricle affected by myocardial infarction and the study of the optimal position for the anastomosis of a left ventricle assist device cannula.
Multiscale fluid-structure interaction simulation of anatomically correct left ventricle fluid dynamics with fictitious elastic structure regularization
Toni Lassila, A. Cristiano I. Malossi, M. Stevanella, E. Votta, A. Redaelli, S. Deparis
Technical Report, 2012
Abstract
We propose a new approach for the data assimilation and simulation of anatomically correct left ventricle fluid dynamics based on cardiac magnetic resonance images. The movement of the ventricle is captured by a set of tracking points on the endocardium identified from a time series of magnetic resonance images. The displacement of the endocardium is interpolated using radial basis functions to produce a smooth global deformation field in both space and time. To regularize the motion of the fluid inside the ventricle we impose the displacement on a fictitious elastic structure surrounding the fluid domain, and then solve the problem in a fluid-structure interaction formulation. This allows us to provide physiological flow and pressure levels inside the left ventricle. In order to have physically reasonable outflow conditions, we couple the left ventricle to a network of one-dimensional models, where we can simulate the pressure and flow rate waveforms in the major arteries, thus obtaining a geometrical multiscale model. The end result is a baseline model for the left ventricle that captures the basic fluidic phenomena and that can in the future be extended to consider clinical patient-specific studies by integrating more complex models into the multiscale framework.
A two-level time step technique for the partitioned solution of one-dimensional arterial networks
A. Cristiano I. Malossi, Pablo J. Blanco, Simone Deparis
Computer Methods in Applied Mechanics and Engineering 237-240, 212-226, Elsevier, 2012
Abstract
In this work a numerical strategy to address the solution of the blood flow in one-dimensional arterial networks through a topology-based decomposition is presented. Such decomposition results in the local analysis of the blood flow in simple arterial segments. Hence, iterative methods are used to perform the strong coupling among the segments, which communicate through non-overlapping interfaces. Specifically, two approaches are considered to solve the associated nonlinear interface problem: (i) the Newton method and (ii) the Broyden method. Moreover, since the modeling of blood flow in compliant vessels is tackled using explicit finite element methods, we formulate the coupling problem using a two-level time stepping technique. A local (inner) time step is used to solve the local problems in single arteries, meeting thus local stability conditions, while a global (outer) time step is employed to enforce the continuity of physical quantities of interest among the one-dimensional segments. Several examples of application are presented. First, a study of the spurious reflections produced at interfaces as a consequence of the two-level time stepping technique is carried out. Second, the application of the methodologies to physiological scenarios is presented, specifically addressing the solution of the blood flow in a model of the entire arterial network. The effects of non-uniformities of the material properties, of the variation of the radius, and of viscoelasticity are taken into account in the model and in the (local) numerical scheme; they are quantified and discussed in the arterial network simulation.
2011
Algorithms for the coupling of one-dimensional arterial networks with three-dimensional fluid-structure interaction problems
A. Cristiano I. Malossi, Simone Deparis, Pablo J. Blanco
Proceedings of the ECCOMAS Thematic International Conference on Simulation and Modeling of Biological Flows (SIMBIO 2011), September, 21-23
Abstract
This work is focused on the development of a geometrical multiscale framework for modeling the human cardiovascular system. This approach is designed to deal with different geometrical and mathematical models at the same time, without any preliminary hypotheses on the layout of the general multiscale problem. This flexibility allows us to set up a complete arterial tree model of the circulatory system, assembling first a network of one-dimensional models, described by non-linear hyperbolic equations, and then replacing some elements with more detailed (and expensive) three-dimensional models, where the Navier-Stokes equations are coupled with structural models through fluid-structure interaction algorithms. The coupling between models of different scale and type is addressed by imposing the conservation equations in terms of averaged/integrated quantities (i.e., the flow rate and the normal component of the traction vector); in particular, three coupling strategies have been explored for the fluid problem. In all the cases, these strategies lead to small non-linear interface problems, which are solved using classical iterative algorithms.
Geometrical multiscale model of an idealized left ventricle with fluid-structure interaction effects coupled to a one-dimensional viscoelastic arterial network
Toni Lassila, A. Cristiano I. Malossi, Matteo Astorino, Simone Deparis
Proceedings of the ECCOMAS Thematic International Conference on Simulation and Modeling of Biological Flows (SIMBIO 2011), September, 21-23
Abstract
A geometrical multiscale model for blood flow through an ideal left ventricle and the main arteries is presented. The blood flow in the three-dimensional idealized left ventricle is solved through a monolithic fluid-structure interaction solver. To account for the interaction between the heart and the circulatory system the heart flow is coupled through an ideal valve with a network of viscoelastic one-dimensional models representing the arterial network. The geometrical multiscale approach used in this work is based on the exchange of averaged/integrated quantities between the fluid problems. The peripheral circulation is modelled by zero-dimensional windkessel terminals. We demonstrate that the geometrical multiscale model is (i) highly modular in that component models can be easily replaced with higher-fidelity ones whenever the user has a specific interest in modelling a particular part of the system, (ii) passive in that it reaches a stable limit cycle of flow rate and pressure in a few heartbeat cycles when driven by a periodic force acting on the epicardium, and (iii) capable of operating at physiological regimes.
An ALE-based numerical technique for modeling sedimentary basin evolution featuring layer deformations and faults
Matteo Longoni, A. Cristiano I. Malossi, Alfio Quarteroni, Andrea Villa, Paolo Ruffo
Journal of Computational Physics 230(8), 3230-3248, Elsevier, 2011
Abstract
In this paper we present a numerical tool to simulate dynamics of stratified sedimentary basins, i.e. depressions on the Earth's surface filled by sediments. The basins are usually complicated by crustal deformations and faulting of the sediments. The balance equations, the non-Newtonian rheology of the sediments, and the depth-porosity compaction laws describe here a model of basin evolution. We propose numerical schemes for the basin boundary movement and for the fault tracking. In addition, a time splitting algorithm is employed to reduce the original model into some simpler mathematical problems. The numerical stability and the other features of the developed methodology are shown using simple test cases and some realistic configurations of sedimentary basins.
Algorithms for the partitioned solution of weakly coupled fluid models for cardiovascular flows
A. Cristiano I. Malossi, Pablo J. Blanco, Simone Deparis, Alfio Quarteroni
International Journal for Numerical Methods in Biomedical Engineering 27(12), 2035-2057, Wiley Online Library, 2011
Abstract
The main goal of this work is to devise robust iterative strategies to partition the solution of the Navier-Stokes equations in a three-dimensional (3D) domain, into non-overlapping 3D subdomains, which communicate through the exchange of averaged/integrated quantities across the interfaces. The novel aspect of the present approach is that at coupling boundaries, the conservation of flow rates and of the associated dual variables is implicitly imposed, entailing a weak physical coupling. For the solution of the non-linear interface problem two strategies are compared: relaxed fixed-point and Newton iterations. The algorithm is tested in several configurations for problems ranging from academic ones to some related to the computational haemodynamics field, which involve more than two components at each coupling interface. In some cases, it is shown that relaxed fixed-point methods are not convergent, whereas the Newton method leads in all tested cases to convergent schemes. One of the appealing aspects of the strategy proposed here is the flexibility in the setting of boundary conditions at the interfaces, where no hierarchy is established a priori (unlike Gauss-Seidel methods). The usefulness of this methodology is also discussed in the context of dimensionally heterogeneous coupling and preconditioning for domain decomposition methods.
2010
A robust and efficient conservative technique for simulating three-dimensional sedimentary basins dynamics
Matteo Longoni, A. Cristiano I. Malossi, Andrea Villa
Computers & Fluids 39(10), 1964-1976, Elsevier, 2010
Abstract
The development of new efficient numerical techniques is a key point in computational fluid dynamics, and as a consequence in geological simulations. In this paper we present a model for simulating the dynamics of three-dimensional stratified sedimentary basins. This kind of problem contains several numerical complexities, such as the presence of high viscosity jumps, or the necessity of tracking multiple surfaces of interface (horizons) independently. To overcome these difficulties, we introduce a new preconditioner, which significantly reduces the amount of time required to solve the finite element linear system resulting from the Stokes problem, and a new tracking method. Indeed, using a coupled level-set/volume tracking method, an unlimited number of layers can be tracked with good mass conservation properties. Finally, to prove the efficiency of these new techniques we present the results and the computational performance obtained in simulations of a realistic case with four horizons, together with a complete description of the main physical quantities involved.