2021
Marvel: A Data-Centric Approach for Mapping Deep Learning Operators on Spatial Accelerators
Prasanth Chatarasi, Hyoukjun Kwon, Angshuman Parashar, Michael Pellauer, Tushar Krishna, Vivek Sarkar
ACM Transactions on Architecture and Code Optimization, 2021
Union: A Unified HW-SW Co-Design Ecosystem in MLIR for Evaluating Tensor Operations on Spatial Accelerators
Geonhwa Jeong, Gokcen Kestor, Prasanth Chatarasi, Angshuman Parashar, Po-An Tsai, Sivasankaran Rajamanickam, Roberto Gioiosa, Tushar Krishna
30th International Conference on Parallel Architectures and Compilation Techniques (PACT), 2021
Abstract
To meet the extreme compute demands of deep learning across commercial and scientific applications, dataflow accelerators are becoming increasingly popular. While these "domain-specific" accelerators are not fully programmable like CPUs and GPUs, they retain varying levels of flexibility with respect to data orchestration, i.e., dataflow and tiling optimizations that enhance efficiency. Designing new algorithms and mapping approaches to execute them on new hardware poses several challenges, which previous works have addressed individually. To address these challenges as a whole, in this work we present Union, a HW-SW co-design ecosystem for spatial accelerators built within the popular MLIR compiler infrastructure. Our framework allows exploring different algorithms and their mappings on several accelerator cost models. Union also includes a plug-and-play library of accelerator cost models and mappers which can easily be extended. The algorithms and accelerator cost models are connected via a novel mapping abstraction that captures the map space of spatial accelerators, which can be systematically pruned based on constraints from the hardware, workload, and mapper. We demonstrate the value of Union for the community with several case studies that examine offloading different tensor operations (CONV/GEMM/tensor contraction) onto diverse accelerator architectures using different mapping schemes.
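As a concrete flavor of the mapping abstraction, the sketch below enumerates a GEMM tile-size map space and prunes it with a workload constraint (divisibility) and a hardware constraint (buffer capacity). The constraints and numbers are illustrative assumptions, not Union's actual interface.

/* A minimal sketch of constraint-based map-space pruning in the spirit of
 * Union's mapping abstraction. The constraints below (divisibility from the
 * workload, buffer capacity from the hardware) are illustrative assumptions,
 * not Union's actual interface. */
#include <stdio.h>

int main(void) {
    const int M = 128, N = 128, K = 64;   /* GEMM workload */
    const int BUF = 4096;                 /* on-chip buffer capacity (words) */
    int kept = 0, total = 0;
    for (int tm = 1; tm <= M; tm++)
        for (int tn = 1; tn <= N; tn++)
            for (int tk = 1; tk <= K; tk++) {
                total++;
                if (M % tm || N % tn || K % tk) continue; /* workload constraint */
                int footprint = tm*tk + tk*tn + tm*tn;    /* A, B, C tiles */
                if (footprint > BUF) continue;            /* hardware constraint */
                kept++;                                   /* survives pruning */
            }
    printf("map points kept after pruning: %d of %d\n", kept, total);
    return 0;
}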
Evaluating Spatial Accelerator Architectures with Tiled Matrix-Matrix Multiplication
Gordon Euhyun Moon, Hyoukjun Kwon, Geonhwa Jeong, Prasanth Chatarasi, Sivasankaran Rajamanickam, Tushar Krishna
IEEE Transactions on Parallel and Distributed Systems, IEEE, 2021
Abstract
There is a growing interest in custom spatial accelerators for machine learning applications. These accelerators employ a spatial array of processing elements (PEs) interacting via custom buffer hierarchies and networks-on-chip. The efficiency of these accelerators comes from employing optimized dataflow (i.e., spatial/temporal partitioning of data across the PEs and fine-grained scheduling) strategies to optimize data reuse. The focus of this work is to evaluate these accelerator architectures using a tiled general matrix-matrix multiplication (GEMM) kernel. To do so, we develop a framework that finds optimized mappings (dataflow and tile sizes) for a tiled GEMM for a given spatial accelerator and workload combination, leveraging an analytical cost model for runtime and energy. Our evaluations over five spatial accelerators demonstrate that the tiled GEMM mappings systematically generated by our framework achieve high performance on various GEMM workloads and accelerators.
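The following minimal C sketch shows what such a mapping looks like for a tiled GEMM: the tile sizes and the order of the tile loops are the knobs the framework searches over. The tile values here are placeholders, not outputs of the paper's framework.

/* A minimal tiled GEMM sketch: the tile sizes (TM, TN, TK) and the order of
 * the outer tile loops are the "mapping"; the values here are illustrative. */
#include <string.h>

#define TM 16
#define TN 16
#define TK 16

void tiled_gemm(int M, int N, int K,
                const float *A, const float *B, float *C) {
    memset(C, 0, sizeof(float) * M * N);
    for (int i0 = 0; i0 < M; i0 += TM)        /* outer tile loops: dataflow */
        for (int j0 = 0; j0 < N; j0 += TN)
            for (int k0 = 0; k0 < K; k0 += TK)
                /* intra-tile loops: work mapped onto the PE array */
                for (int i = i0; i < i0 + TM && i < M; i++)
                    for (int j = j0; j < j0 + TN && j < N; j++)
                        for (int k = k0; k < k0 + TK && k < K; k++)
                            C[i * N + j] += A[i * K + k] * B[k * N + j];
}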
2020
Vyasa: A High-Performance Vectorizing Compiler for Tensor Convolutions on the Xilinx AI Engine
Prasanth Chatarasi, Stephen Neuendorffer, Samuel Bayliss, Kees Vissers, Vivek Sarkar
2020 IEEE High Performance Extreme Computing Conference (HPEC), pp. 1-10, IEEE
Abstract
Xilinx's AI Engine is a recent industry example of energy-efficient vector processing that includes novel support for 2D SIMD datapaths and a shuffle interconnection network. The current approach to programming the AI Engine relies on a C/C++ API for vector intrinsics. While an advance over assembly-level programming, it requires the programmer to specify a number of low-level operations based on detailed knowledge of the hardware. To address these limitations, we introduce Vyasa, a new programming system that extends the Halide DSL compiler to automatically generate code for the AI Engine. We evaluated Vyasa on 36 CONV2D workloads and achieved geometric means of 7.6 and 24.2 MACs/cycle for 32-bit and 16-bit operands (95.9% and 75.6% of peak performance, respectively).
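For reference, the sketch below writes out the CONV2D computation that Vyasa compiles from a Halide specification as a plain C loop nest, with comments marking the loop dimensions a 2D SIMD datapath could vectorize. The structure and names are illustrative, not Vyasa's generated code.

/* The CONV2D workload as a reference C loop nest. Vyasa itself emits AI
 * Engine vector intrinsics from a Halide specification; loop structure and
 * names here are illustrative only. A 2D SIMD datapath can, for example,
 * vectorize across output columns (x) while accumulating across input
 * channels (c) in the same vector MAC. */
void conv2d(int H, int W, int C, int R, int S, int K,
            const float *in,    /* [C][H+R-1][W+S-1] */
            const float *wt,    /* [K][C][R][S] */
            float *out) {       /* [K][H][W] */
    for (int k = 0; k < K; k++)
        for (int y = 0; y < H; y++)
            for (int x = 0; x < W; x++) {        /* candidate SIMD dimension */
                float acc = 0.0f;
                for (int c = 0; c < C; c++)      /* candidate second SIMD dim */
                    for (int r = 0; r < R; r++)
                        for (int s = 0; s < S; s++)
                            acc += in[(c*(H+R-1) + (y+r))*(W+S-1) + (x+s)]
                                 * wt[((k*C + c)*R + r)*S + s];
                out[(k*H + y)*W + x] = acc;
            }
}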
MAESTRO: A Data-Centric Approach to Understand Reuse, Performance, and Hardware Cost of DNN Mappings
Hyoukjun Kwon, Prasanth Chatarasi, Vivek Sarkar, Tushar Krishna, Michael Pellauer, Angshuman Parashar
IEEE Micro 40(3), 20-29, IEEE, 2020
Abstract
The efficiency of an accelerator depends on three factors: the mapping, the deep neural network (DNN) layers, and the hardware. Together, these construct an extremely complicated design space of DNN accelerators. To demystify this design space and guide DNN accelerator design for better efficiency, we propose an analytical cost model, MAESTRO. MAESTRO takes as inputs a DNN model description, hardware resource information, and a mapping described in a data-centric representation we propose. The data-centric representation consists of three directives that enable concise description of mappings in a compiler-friendly form. Based on these inputs, MAESTRO quickly analyzes various forms of data reuse in an accelerator and generates more than 20 statistics, including total latency, energy, and throughput, as outputs. MAESTRO's fast analysis enables various optimization tools for DNN accelerators, such as the hardware design space exploration tool we present as an example.
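As a rough illustration of the data-centric style, the sketch below encodes a simple output-channel-parallel mapping using the directive names from the paper (SpatialMap, TemporalMap, Cluster). The C encoding itself is hypothetical and is not MAESTRO's input format.

/* A sketch of a mapping in MAESTRO's data-centric style. The directive names
 * follow the paper; this C encoding is a hypothetical illustration. */
#include <stdio.h>

typedef enum { TEMPORAL_MAP, SPATIAL_MAP, CLUSTER } Directive;

typedef struct {
    Directive kind;
    int size, offset;     /* tile size and stride of the mapped dimension */
    const char *dim;      /* loop dimension, e.g. output channel "K" */
} MapEntry;

/* An output-channel-parallel mapping for a convolution layer:
 * distribute K across PEs, iterate the remaining dims in time on each PE. */
static const MapEntry mapping[] = {
    { SPATIAL_MAP,  1, 1, "K" },   /* each PE owns one output channel */
    { TEMPORAL_MAP, 1, 1, "C" },   /* input channels stream over time  */
    { TEMPORAL_MAP, 3, 3, "R" },
    { TEMPORAL_MAP, 3, 3, "S" },
};

int main(void) {
    for (unsigned i = 0; i < sizeof mapping / sizeof mapping[0]; i++)
        printf("%s(%d,%d) %s\n",
               mapping[i].kind == SPATIAL_MAP ? "SpatialMap" :
               mapping[i].kind == CLUSTER     ? "Cluster"    : "TemporalMap",
               mapping[i].size, mapping[i].offset, mapping[i].dim);
    return 0;
}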
2019
Experimental Insights from the Rogues Gallery
Jeffrey S. Young, Jason Riedy, Thomas M. Conte, Vivek Sarkar, Prasanth Chatarasi, Sriseshan Srikanth
2019 IEEE International Conference on Rebooting Computing (ICRC), pp. 80-87, IEEE
Abstract
The Rogues Gallery is a new deployment for understanding next-generation hardware with a focus on unorthodox and uncommon technologies. This testbed project was initiated in 2017 in response to Rebooting Computing efforts and initiatives. The Gallery's focus is to acquire new and unique hardware (the rogues) from vendors, research labs, and start-ups and to make this hardware widely available to students, faculty, and industry collaborators within a managed data center environment. By exposing students and researchers to this set of unique hardware, we hope to foster cross-cutting discussions about hardware designs that will drive future performance improvements in computing long after the Moore's Law era of cheap transistors ends. We have defined an initial vision of the infrastructure and driving engineering challenges for such a testbed in a separate document, so here we present highlights of the first one to two years of post-Moore-era research with the Rogues Gallery and give an indication of where we see future growth for this testbed and related efforts.
Understanding Reuse, Performance, and Hardware Cost of DNN Dataflow: A Data-Centric Approach
Hyoukjun Kwon, Prasanth Chatarasi, Michael Pellauer, Angshuman Parashar, Vivek Sarkar, Tushar Krishna
Proceedings of the 52nd Annual IEEE/ACM International Symposium on Microarchitecture, pp. 754-768, ACM, 2019
Abstract
The data partitioning and scheduling strategies used by DNN accelerators to leverage reuse and perform staging are known as dataflow, which directly impacts the performance and energy efficiency of DNN accelerators. An accelerator microarchitecture dictates the dataflow(s) that can be employed to execute layers in a DNN. Selecting a dataflow for a layer can have a large impact on utilization and energy efficiency, but there is a lack of understanding about the choices and consequences of dataflow, and of tools and methodologies to help architects explore the co-optimization design space. In this work, we first introduce a set of data-centric directives to concisely specify the DNN dataflow space in a compiler-friendly form. We then show how these directives can be analyzed to infer various forms of reuse and to exploit them using hardware capabilities. We codify this analysis into an analytical cost model, MAESTRO (Modeling Accelerator Efficiency via Spatio-Temporal Reuse and Occupancy), that estimates various cost-benefit tradeoffs of a dataflow, including execution time and energy efficiency, for a DNN model and hardware configuration. We demonstrate the use of MAESTRO to drive a hardware design space exploration experiment, which searches across 480M designs to identify 2.5M valid designs at an average rate of 0.17M designs per second, including Pareto-optimal throughput- and energy-optimized design points.
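The sketch below gives a toy version of such a design space exploration loop: sweep hardware parameters, discard invalid designs, and keep the best point under an analytical estimate. The cost model here is a stand-in stub, not MAESTRO's analysis.

/* A toy design space exploration loop in the spirit of the MAESTRO-driven
 * experiment. The latency estimate and validity constraint are stand-ins. */
#include <stdio.h>

/* Stub: latency improves with PEs until memory bandwidth dominates. */
static double est_latency(int pes, int bw) {
    double compute = 1.0e9 / pes;   /* MACs spread across PEs */
    double memory  = 4.0e8 / bw;    /* bytes over the NoC     */
    return compute > memory ? compute : memory;
}

int main(void) {
    double best = 1.0e30; int best_pes = 0, best_bw = 0;
    for (int pes = 64; pes <= 4096; pes *= 2)
        for (int bw = 8; bw <= 256; bw *= 2) {
            if (pes * bw > 16384) continue;   /* example area/wiring budget */
            double lat = est_latency(pes, bw);
            if (lat < best) { best = lat; best_pes = pes; best_bw = bw; }
        }
    printf("best: %d PEs, bw=%d, latency=%.3g cycles\n", best_pes, best_bw, best);
    return 0;
}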
2018
A Unified Approach to Variable Renaming for Enhanced Vectorization
Prasanth Chatarasi, Jun Shirako, Albert Cohen, Vivek Sarkar
International Workshop on Languages and Compilers for Parallel Computing, pp. 1-20, Springer, Cham, 2018
Abstract
Despite the fact that compiler technologies for automatic vectorization have been under development for over four decades, there are still considerable gaps in the capabilities of modern compilers to perform automatic vectorization for SIMD units. One such gap can be found in the handling of loops with dependence cycles that involve memory-based anti (write-after-read) and output (write-after-write) dependences. Past approaches, such as variable renaming and variable expansion, break such dependence cycles by either eliminating or repositioning the problematic memory-based dependences. However, the past work suffers from three key limitations: (1) Lack of a unified framework that synergistically integrates multiple storage transformations, (2) Lack of support for bounding the additional space required to break memory-based dependences, and (3) Lack of support for integrating these storage transformations with other code transformations (e.g., statement reordering) to enable vectorization.
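The toy example below illustrates the kind of dependence cycle in question: a scalar temporary reused across statements creates anti and output dependences that block vectorization, and renaming it into two temporaries breaks the cycle. The code is illustrative and not taken from the paper.

/* Before: the scalar t carries anti (write-after-read) and output
 * (write-after-write) dependences across statements, blocking vectorization. */
void before(int n, float *a, float *b, float *c, float *d) {
    float t;
    for (int i = 0; i < n; i++) {
        t = a[i] + 1.0f;   /* S1 writes t */
        b[i] = t;          /* S2 reads t  */
        t = c[i] * 2.0f;   /* S3 rewrites t: output dependence with S1,
                              anti dependence with S2 */
        d[i] = t;
    }
}

/* After variable renaming: the memory-based dependence cycle is gone and
 * each statement pair vectorizes freely. */
void after(int n, float *a, float *b, float *c, float *d) {
    for (int i = 0; i < n; i++) {
        float t1 = a[i] + 1.0f;
        float t2 = c[i] * 2.0f;
        b[i] = t1;
        d[i] = t2;
    }
}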
A Preliminary Study of Compiler Transformations for Graph Applications on the Emu System
Prasanth Chatarasi, Vivek Sarkar
Proceedings of the Workshop on Memory Centric High Performance Computing, pp. 37-44, ACM, 2018
Abstract
Unlike dense linear algebra applications, graph applications typically suffer from poor performance because of 1) inefficient utilization of memory systems through random memory accesses to graph data, and 2) the overhead of executing atomic operations. Hence, there is rapid growth in improving both software and hardware platforms to address these challenges. One such improvement in the hardware platform is the Emu system, a thread-migratory and near-memory processor. In the Emu system, a thread responsible for computation on a datum is automatically migrated to the node where the data resides, without any intervention from the programmer. The idea of thread migration is well suited to graph applications, since their memory accesses are irregular. However, thread migrations can hurt the performance of graph applications if the overhead from migrations dominates the benefits they achieve. In this preliminary study, we explore two high-level compiler optimizations, i.e., loop fusion and edge flipping, and one low-level compiler transformation that leverages hardware support for remote atomic updates, to address overheads arising from thread migration, creation, synchronization, and atomic operations. We performed a preliminary evaluation of these compiler transformations by manually applying them on three graph applications (Conductance, Bellman-Ford's algorithm for the single-source shortest path problem, and Triangle Counting) over a set of RMAT graphs from Graph500. Our evaluation targeted a single node of the Emu hardware prototype and showed an overall geometric mean reduction of 22.08% in thread migrations.
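A minimal sketch of the edge-flipping idea, under assumed CSR-style data structures: a push-style traversal scatters to neighbors and needs atomics (and, on Emu, triggers migrations to remote vertices), while the pull style over the flipped (transposed) edge list gathers into locally owned vertices without atomics.

/* Push: each vertex scatters a contribution to its neighbors; the remote
 * writes require atomics and cause thread migrations to remote nodes. */
void push(int n, const int *row, const int *col,
          const float *contrib, float *sum) {
    for (int u = 0; u < n; u++)
        for (int e = row[u]; e < row[u + 1]; e++) {
            int v = col[e];
            #pragma omp atomic
            sum[v] += contrib[u];       /* remote write: atomic + migration */
        }
}

/* Pull over the flipped edge list (row_t/col_t: transpose of row/col):
 * each vertex gathers from its in-neighbors, so all writes are local. */
void pull(int n, const int *row_t, const int *col_t,
          const float *contrib, float *sum) {
    for (int v = 0; v < n; v++) {
        float acc = 0.0f;
        for (int e = row_t[v]; e < row_t[v + 1]; e++)
            acc += contrib[col_t[e]];   /* local accumulation, no atomics */
        sum[v] += acc;
    }
}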
2016
An Extended Polyhedral Model for SPMD Programs and Its Use in Static Data Race Detection
Prasanth Chatarasi, Jun Shirako, Martin Kong, Vivek Sarkar
International Workshop on Languages and Compilers for Parallel Computing, pp. 106-120, Springer, Cham, 2016
Abstract
Despite its age, SPMD (Single Program Multiple Data) parallelism continues to be one of the most popular parallel execution models in use today, as exemplified by OpenMP for multicore systems and CUDA and OpenCL for accelerator systems. The basic idea behind the SPMD model, which makes it different from task-parallel models, is that all logical processors (worker threads) execute the same program with sequential code executed redundantly and parallel code executed cooperatively. In this paper, we extend the polyhedral model to enable analysis of explicitly parallel SPMD programs and provide a new approach for static detection of data races in SPMD programs using the extended polyhedral model. We evaluate our approach using 34 OpenMP programs from the OmpSCR and PolyBench-ACC (PolyBench-ACC derives from the PolyBench benchmark suite and provides OpenMP, OpenACC, CUDA, OpenCL and HMPP implementations.) benchmark suites.
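A small example of the class of races the analysis targets, written here for illustration (it is not drawn from the evaluated benchmarks): under OpenMP semantics the iterations of the parallel loop may happen in parallel, so the write to a[i] races with the read of a[i+1] from the neighboring iteration.

#include <stdio.h>

int main(void) {
    float a[100];
    for (int i = 0; i < 100; i++) a[i] = (float)i;

    #pragma omp parallel for
    for (int i = 0; i < 99; i++)
        a[i] = a[i + 1] * 0.5f;   /* the read of a[i+1] races with its
                                     write in iteration i+1 */
    printf("%f\n", a[0]);
    return 0;
}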
2015
Extending Polyhedral Model for Analysis and Transformation of OpenMP Programs
Prasanth Chatarasi, Vivek Sarkar
2015 International Conference on Parallel Architecture and Compilation (PACT), pp. 490-491, IEEE
Abstract
The polyhedral model is a powerful algebraic framework that has enabled significant advances in analysis and transformation of sequential affine (sub)programs, relative to traditional AST-based approaches. However, given the rapid growth of parallel software, there is a need for increased attention to using polyhedral compilation techniques to analyze and transform explicitly parallel programs. In our PACT'15 paper titled "Polyhedral Optimizations of Explicitly Parallel Programs" [1, 2], we addressed the problem of analyzing and transforming programs with explicit parallelism that satisfy the serial-elision property, i.e., the property that removal of all parallel constructs results in a sequential program that is a valid (albeit inefficient) implementation of the parallel program semantics. In this poster, we address the problem of analyzing and transforming more general OpenMP programs that do not satisfy the serial-elision property. Our contributions include the following: 1) an extension of the polyhedral model to represent input OpenMP programs, 2) formalization of the May-Happen-in-Parallel (MHP) and Happens-Before (HB) relations in the extended model, 3) an approach for static detection of data races in OpenMP programs by generating race constraints that can be solved by an SMT solver such as Z3, and 4) an approach for transforming OpenMP programs.
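Schematically, contribution 3 reduces race detection to satisfiability: a race exists between two statement instances iff they may happen in parallel, access the same location, and at least one access is a write. The notation below (iteration domains $D_1, D_2$ and subscript functions $f_1, f_2$ for statements $S_1, S_2$) is introduced here for illustration and is not taken verbatim from the poster:

$$\exists\, \vec{i} \in D_1,\ \vec{j} \in D_2 \;:\; f_1(\vec{i}) = f_2(\vec{j}) \;\wedge\; \mathrm{MHP}(\vec{i},\vec{j}) \;\wedge\; \big(S_1 \text{ writes} \,\vee\, S_2 \text{ writes}\big)$$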
Polyhedral Optimizations of Explicitly Parallel Programs
Prasanth Chatarasi, Jun Shirako, Vivek Sarkar
2015 International Conference on Parallel Architecture and Compilation (PACT), pp. 213-226, IEEE
Abstract
The polyhedral model is a powerful algebraic framework that has enabled significant advances in analysis and transformation of sequential affine (sub)programs, relative to traditional AST-based approaches. However, given the rapid growth of parallel software, there is a need for increased attention to using polyhedral frameworks to optimize explicitly parallel programs. An interesting side effect of supporting explicitly parallel programs is that doing so can also enable optimization of programs with unanalyzable data accesses within a polyhedral framework. In this paper, we address the problem of extending polyhedral frameworks to enable analysis and transformation of programs that contain both explicit parallelism and unanalyzable data accesses. As a first step, we focus on OpenMP loop parallelism and task parallelism, including task dependences from OpenMP 4.0. Our approach first enables conservative dependence analysis of a given region of code. Next, we identify happens-before relations from the explicitly parallel constructs, such as tasks and parallel loops, and intersect them with the conservative dependences. Finally, the resulting set of dependences is passed on to a polyhedral optimizer, such as PLuTo or PolyAST, to enable transformation of explicitly parallel programs with unanalyzable data accesses. We evaluate our approach using eleven OpenMP benchmark programs from the KASTORS and Rodinia benchmark suites. We show that 1) these benchmarks contain unanalyzable data accesses that prevent polyhedral frameworks from performing exact dependence analysis, 2) explicit parallelism can help mitigate the imprecision, and 3) polyhedral transformations with the resulting dependences can further improve the performance of the manually-parallelized OpenMP benchmarks. Our experimental results show geometric mean performance improvements of 1.62x and 2.75x on the Intel Westmere and IBM Power8 platforms respectively (relative to the original OpenMP versions).
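The fragment below, written for illustration, shows why explicit parallelism helps with unanalyzable accesses: the subscript idx[i] defeats exact dependence analysis, so a conservative analysis must assume loop-carried dependences; the parallel-for annotation contributes happens-before facts (no ordering among iterations) under which those assumed dependences are intersected away.

/* The subscript idx[i] is unanalyzable, so conservative analysis assumes
 * cross-iteration dependences on y. The "omp parallel for" annotation
 * provides no happens-before ordering between iterations, so intersecting
 * the conservative dependences with HB removes the loop-carried ones. */
void scatter_scale(int n, const int *idx, const float *x, float *y) {
    #pragma omp parallel for
    for (int i = 0; i < n; i++)
        y[idx[i]] = 2.0f * x[i];
}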