I am interested in compiler technology, computer architecture, and programming models.
My work has focused on parallel computing, multithreading, and memory access optimizations for emerging multicore systems, including heterogeneous and accelerator-based systems. I have worked on several high performance computing architectures, including systems with GPUs, processor-in-memory systems, the BlueGene/Q system, and the Cell Broadband Engine.
I currently work on System-level Performance Modeling for Machine Learning Applications, developing tools to help design custom “composable systems” (i.e., systems configured from a choice of available accelerator devices and interconnect types) for specific applications. This work defines an API to specify the target application as a workflow graph, and an API to specify a system configuration graph. The tools automatically explore different ways of executing the application on the system. The user can vary application parameters (e.g., data sizes, degree of parallelism, iteration counts) and system parameters (e.g., number and type of compute nodes, interconnect topology, latencies and bandwidths) to determine their effect on various performance metrics (such as execution time and resource utilization).
PREVIOUS RESEARCH PROJECTS
LLVM-based OpenMP Compiler for High Performance Computing on GPU-based Systems
The IBM CORAL systems are heterogeneous systems that deliver high performance in the 100-200 Petaflop range, and closely integrate IBM Power processors and NVIDIA GPUs. I helped develop LLVM-based OpenMP compilers to program these systems. Since the CORAL systems primarily target the HPC domain, the compilers support Fortran as well as C/C++. My work also explored the capabilities of a compiler to support implicit data movement between different compute devices (GPUs and host processors) in the system. Features of modern Fortran such as smart pointers, combined with data movement clauses for devices in the OpenMP standard, enabled new techniques for automatic deep-copying of pointer-based data structures.
- T. Chen, Z. Sura, and H. Sung, Automatic Copying of Pointer-Based Data Structures, 29th International Workshop on Languages and Compilers for Parallel Computing (LCPC), 2016
- S. Antao, et al., Offloading Support for OpenMP in Clang and LLVM, Third Workshop on the LLVM Compiler Infrastructure in HPC, LLVM-HPC@SC 2016
- M. Martineau, et al., Performance Analysis and Optimization of Clang's OpenMP 4.5 GPU Support, 7th International Workshop on Performance Modeling, Benchmarking and Simulation of High Performance Computer Systems, PMBS@SC 2016
Cross-Layer Approximate Computing
This work explored the potential and limits of the approximate computing paradigm when applying it across multiple layers in the system stack, including architectural, programming model, and algorithmic layers. The study was performed using several applications in the data analytics and cognitive domains.
- A. Agrawal, et al., Approximate Computing: Challenges and Opportunities, IEEE Conference on Rebooting Computing (ICRC), 2016
Compiler for the Active Memory Cube (AMC) Processing-in-Memory System
The AMC system is a heterogeneous system design that integrates in-memory processors in a logic layer within 3D DRAM. The in-memory processors were custom-designed for high performance and power efficiency. Several architectural features made it challenging to compile for an AMC system, including the need to exploit multiple dimensions of parallelism, an exposed pipeline, non-conventional register files, no caches, and a software managed instruction buffer. The AMC compiler was able to match or beat hand-optimized code for several targeted applications.
- R. Nair, et al., Active Memory Cube: A Processing-in-Memory Architecture for Exascale Systems, IBM Journal of Research and Development (IBM JRD), Vol. 59, Issue 2/3, March 2015
- Z. Sura, et al., Data Access Optimization in a Processing-in-Memory System, Computing Frontiers (CF), May 2015
- Z. Sura, Keynote talk at the 8th Euro-Par Workshop on UnConventional High Performance Computing (UCHPC), August 2015
- A. Jacob, et al., Progressive Codesign of an Architecture and Compiler using a Proxy Application, International Symposium on Computer Architecture and High Performance Computing, October 2015
Performance Acceleration of a Single Thread Using Fine-grained Parallelism
This work defined an execution model using groups of cores, each group with a primary core and some associated secondary cores that collaborate to speed up execution of sequential code. Cores within a group have dedicated queues for low-latency transfer of values between them. Compiler analyses and transformations were developed to automatically derive fine-grained parallel code from sequential code, in order to target such groups of cores. The execution model was simulated to determine speedup and transfer latency tradeoffs (observed 2.05x average speedup using groups of 4 cores).
- Z. Sura, K. O’Brien, and J. Brunheroto, Using Multiple Threads to Accelerate Single Thread Performance, International Parallel and Distributed Processing Symposium (IPDPS), May 2014
Dynamic Optimization with the XL Open Framework (XLOF)
XLOF is an Eclipse Java plugin that allows users to reuse and customize transformations implemented in the IBM XL compiler. XLOF provides users with the capability to both view and modify code at intermediate stages in the compilation process. The XLOF framework was used to implement an online optimization pass that dynamically profiles, analyzes, recompiles and patches executing code to improve data prefetching.
Assist Threads for Software Data Prefetching
This work explored data prefetching using separate hardware threads (assist threads) running asynchronously with the main application thread. Data prefetching brings data into a processor’s cache ahead of time, reducing the number of memory stall cycles and improving performance. Compiler transformations were used to automatically generate code for the assist threads, and to synchronize their execution with the application thread.
- G. Pekhimenko, Z. Sura, Y. Gao, T. Chen, and K. O’Brien, Assist Threads for Data Prefetching in IBM XL Compilers, 8th Workshop on Compiler-Driven Performance, CASCON, Toronto, November 2009
OpenMP Compiler for the Cell Broadband Engine (BE) Architecture
This work developed the first single-source compiler for the Cell BE architecture. The Cell BE, used in Sony PlayStation 3 systems, has multiple heterogeneous cores on a single chip. The compiler automatically handled the complexity of multiple ISAs and multiple levels of available parallelism (SIMD, multithreading, multiple cores, and cores with heterogeneous capabilities). Static buffers were used to optimize data transfers between cores and to overlap computation with communication. The sizes of static buffers were also optimized for the limited local memory available.
This work also explored automatically configuring a software cache to maximize its performance on a per-application basis. Compiler analysis was used to estimate data access properties, including re-use distance, the size of data accessed before the next use of a data item, and the size of data spatially and temporally co-located.
- T. Chen, Z. Sura, K.M. O'Brien, and J.K. O'Brien, Optimizing the Use of Static Buffers for DMA on a Cell Chip, Languages and Compilers for Parallel Computing (LCPC), Nov 2006
- T. Chen, et al., Prefetching Irregular References for Software Cache on Cell, International Symposium on Code Generation and Optimization (CGO), April 2008
- J.K. O'Brien, K.M. O'Brien, Z. Sura, T. Chen, and T. Zhang, Supporting OpenMP on Cell, International Journal of Parallel Programming (IJPP) 36(3):289-311, April 2008
- M. Tallada, et al., Access-Specific Software Cache Techniques for the Cell BE Architecture, Parallel Architecture and Compilation Techniques (PACT), October 2008
- J. Lee, S. Seo, C. Kim, J. Kim, P. Chun, Z. Sura, J. Kim, and S. Han, COMIC: A Coherent Shared Memory Interface for Cell BE, Parallel Architecture and Compilation Techniques (PACT), October 2008
- S. Seo, J. Lee, and Z. Sura, Design and Implementation of Software-Managed Caches for Multicores with Local Memory, Intl Symposium on High-Performance Computer Architecture (HPCA), Feb 2009
Analysis of Inter-thread Dependences for Parallel Execution
This work used escape analysis, synchronization analysis, and delay set analysis to determine inter-thread dependences in a parallel program. Novel analysis algorithms were designed to be efficient and incremental, making them suitable for just-in-time compilation. They were used in a software implementation of memory consistency to determine the ordering constraints imposed by the consistency model. This work demonstrated the feasibility of providing sequential consistency in a Java virtual machine with tolerable performance degradation (10% on average for an Intel Xeon system).
- Z. Sura, X. Fang, C.-L. Wong, S. Midkiff, J. Lee, and D. Padua, Compiler Techniques for High Performance Sequentially Consistent Java Programs, Principles and Practice of Parallel Programming (PPoPP), June 2005
Pointer Analysis Extended for Array Elements
A novel pointer analysis algorithm was developed for improving precision when analyzing array elements in Java. This pointer analysis was used to automatically parallelize numerical codes, and to generate optimized executable code.
- P. Wu, P. Feautrier, D. Padua, and Z. Sura, Instance-wise Points-to Analysis for Loop-based Dependence Testing, International Conference on Supercomputing (ICS), June 2002
Performance of Numerical Java Codes
This work identified a set of kernel templates that are often used in the domain of numerical computing. A static compiler was used to expose (via code transformations) and recognize code sections conforming to a kernel template, and mark them for specific optimization. This technique enabled a Java virtual machine runtime interpreter to deliver performance comparable to pre-compiled code for compute-intensive kernels.