Zehra Sura


Research Scientist, Systems and Architecture for Machine Learning
Thomas J. Watson Research Center, Yorktown Heights, NY USA



I am interested in compiler technology, computer architecture, and programming models.

My work has focused on parallel computing, multithreading, and memory access optimizations for emerging multicore systems, including heterogeneous and accelerator-based systems. I have worked on several high performance computing architectures, including systems with GPUs, processor-in-memory systems, the BlueGene/Q system, and the Cell Broadband Engine.

I currently work on System-level Performance Modeling for Machine Learning Applications, developing tools to help design custom composable systems (i.e., systems configured from a choice of available accelerator devices and interconnect types) for specific applications. This work defines an API to specify the target application as a workflow graph, and an API to specify a system configuration graph. The tools automatically explore different ways of executing the application on the system. The user can vary application parameters (e.g., data sizes, degree of parallelism, iteration counts) and system parameters (e.g., number and type of compute nodes, interconnect topology, latencies and bandwidths) to determine their effect on performance metrics such as execution time and resource utilization.
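As a rough illustration of this kind of exploration, the sketch below describes a small workflow graph and a two-device system, enumerates every task-to-device mapping, and estimates the execution time of each. It is not the actual API of these tools: the names WORKFLOW, DEVICES, LINK_BANDWIDTH, LINK_LATENCY, execution_time, and explore, as well as all numbers and the simple cost model, are assumptions made up for this example.

```python
from itertools import product

# Hypothetical workflow graph: task -> (work units, output size, predecessors).
# Tasks are listed in topological order.
WORKFLOW = {
    "load":  (2.0, 8.0, []),
    "prep":  (4.0, 8.0, ["load"]),
    "train": (32.0, 1.0, ["prep"]),
}

# Hypothetical system configuration: device -> throughput (work units per second),
# plus a single interconnect type shared by all device pairs.
DEVICES = {"cpu": 1.0, "gpu": 8.0}
LINK_BANDWIDTH = 4.0  # size units per second
LINK_LATENCY = 0.5    # seconds per transfer


def execution_time(mapping):
    """Estimate the finish time of the workflow for one task-to-device mapping."""
    finish = {}
    device_free = {dev: 0.0 for dev in DEVICES}
    for task, (work, _size, preds) in WORKFLOW.items():
        dev = mapping[task]
        ready = 0.0
        for pred in preds:
            arrival = finish[pred]
            if mapping[pred] != dev:
                # Data produced on another device must cross the interconnect.
                arrival += LINK_LATENCY + WORKFLOW[pred][1] / LINK_BANDWIDTH
            ready = max(ready, arrival)
        start = max(ready, device_free[dev])  # a device runs one task at a time
        finish[task] = start + work / DEVICES[dev]
        device_free[dev] = finish[task]
    return max(finish.values())


def explore():
    """Evaluate every mapping and return (time, mapping) pairs, fastest first."""
    tasks = list(WORKFLOW)
    results = []
    for choice in product(DEVICES, repeat=len(tasks)):
        mapping = dict(zip(tasks, choice))
        results.append((execution_time(mapping), mapping))
    return sorted(results, key=lambda r: r[0])


best_time, best_mapping = explore()[0]
```

Changing a value such as LINK_BANDWIDTH or a device throughput and calling explore() again shows how a system parameter shifts the preferred mapping. Exhaustive enumeration is only practical for very small graphs.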



LLVM-based OpenMP Compiler for High Performance Computing on GPU-based Systems

The IBM CORAL systems are heterogeneous systems that deliver high performance in the 100-200 Petaflop range, and closely integrate IBM Power processors and NVIDIA GPUs. I helped develop LLVM-based OpenMP compilers to program these systems. My work also explored the capabilities of a compiler to support implicit data movement between different compute devices (GPUs and host processors) in the system. Features such as smart pointers, combined with data movement clauses for devices in the OpenMP standard, enabled new techniques for automatic deep-copying of pointer-based data structures.


Cross-Layer Approximate Computing

This work explored the potential and limits of the approximate computing paradigm when applying it across multiple layers in the system stack, including architectural, programming model, and algorithmic layers. The study was performed using several applications in the data analytics and cognitive domains.


Compiler for the Active Memory Cube (AMC) Processing-in-Memory System

The AMC system is a heterogeneous system design that integrates in-memory processors in a logic layer within 3D DRAM. The in-memory processors were custom-designed for high performance and power efficiency. Several architectural features made it challenging to compile for an AMC system, including the need to exploit multiple dimensions of parallelism, an exposed pipeline, non-conventional register files, no caches, and a software managed instruction buffer. The AMC compiler was able to match or beat hand-optimized code for several targeted applications.


Performance Acceleration of a Single Thread Using Fine-grained Parallelism

This work defined an execution model using groups of cores, each group with a primary core and some associated secondary cores that collaborate to speed up execution of sequential code. Cores within a group have dedicated queues for low-latency transfer of values between them. Compiler analyses and transformations were developed to automatically derive fine-grained parallel code from sequential code, in order to target such groups of cores. The execution model was simulated to determine speedup and transfer latency tradeoffs (observed 2.05x average speedup using groups of 4 cores).


Dynamic Optimization with the XL Open Framework (XLOF)

XLOF is an Eclipse Java plugin that allows users to reuse and customize transformations implemented in the IBM XL compiler. XLOF provides users with the capability to both view and modify code at intermediate stages in the compilation process. The XLOF framework was used to implement an online optimization pass that dynamically profiles, analyzes, recompiles and patches executing code to improve data prefetching.


Assist Threads for Software Data Prefetching

This work explored data prefetching using separate hardware threads (assist threads) running asynchronously with the main application thread. Data prefetching is a technique used to bring data into a processor’s cache ahead of time, thus reducing the number of memory stall cycles and improving performance. Compiler transformations were used to automatically generate code for the assist threads, and to synchronize their execution with the application thread.


OpenMP Compiler for the Cell Broadband Engine (BE) Architecture

This work developed the first single-source compiler for the Cell BE architecture. The Cell BE, used in Sony PlayStation 3 systems, has multiple heterogeneous cores on a single chip. The compiler automatically handled the complexity of multiple ISAs and multiple levels of available parallelism (SIMD, multithreading, multiple cores, and cores with heterogeneous capabilities). Static buffers were used to optimize data transfers between cores and to overlap computation with communication. The sizes of static buffers were also optimized for the limited local memory available.

This work also explored automatically configuring a software cache to maximize its performance on a per-application basis. Compiler analysis was used to estimate data access properties, including re-use distance, the size of data accessed before the next use of a data item, and the size of data spatially and temporally co-located.
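To make the reuse-distance idea concrete, the sketch below uses one common definition: the number of distinct other data items accessed between two consecutive uses of the same item. It is an illustration only. The work described above estimated such properties with compiler analysis, whereas this example substitutes a recorded access trace, and the function names reuse_distances and lru_hit_ratio are hypothetical.

```python
def reuse_distances(trace):
    """For each access, count the distinct other items touched since the
    previous access to the same item (None for a first access)."""
    last_seen = {}
    distances = []
    for i, item in enumerate(trace):
        if item in last_seen:
            between = set(trace[last_seen[item] + 1:i])
            distances.append(len(between))
        else:
            distances.append(None)
        last_seen[item] = i
    return distances


def lru_hit_ratio(trace, capacity):
    """Fraction of accesses that hit in a fully associative LRU cache holding
    `capacity` items: an access hits when its reuse distance is below capacity."""
    distances = reuse_distances(trace)
    hits = sum(1 for d in distances if d is not None and d < capacity)
    return hits / len(trace)


trace = ["a", "b", "c", "a", "b", "b"]
distances = reuse_distances(trace)
ratio_small = lru_hit_ratio(trace, 2)
ratio_large = lru_hit_ratio(trace, 3)
```

Comparing the hit ratio at different capacities is one simple way such estimates could guide the choice of a software cache size for a given application.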


Analysis of Inter-thread Dependences for Parallel Execution

This work used escape analysis, synchronization analysis, and delay set analysis to determine inter-thread dependences in a parallel program. Novel analysis algorithms were designed for efficient, incremental, and just-in-time compilation. They were used in a software implementation of memory consistency to determine the ordering constraints imposed by the consistency model. This work demonstrated the feasibility of providing sequential consistency in a Java virtual machine with tolerable performance degradation (10% on average for an Intel Xeon system).


Pointer Analysis Extended for Array Elements

A novel pointer analysis algorithm was developed for improving precision when analyzing array elements in Java. This pointer analysis was used to automatically parallelize numerical codes, and to generate optimized executable code.


Performance of Numerical Java Codes

This work determined a set of kernel templates that are often used in the domain of numerical computing. A static compiler was used to expose (via code transformations) and recognize code sections conforming to a kernel template, and mark them for specific optimization. This technique enabled a Java virtual machine runtime interpreter to deliver performance comparable to pre-compiled code for compute-intensive kernels.