2018 IBM Research Workshop on Architectures for Secure, Cognitive, and Datacenter Computing - Talk Details

Sagar Karandikar

Sagar Karandikar - University of California at Berkeley, advised by Krste Asanović

Talk title: FireSim: Scalable FPGA-accelerated Cycle-Accurate Hardware Simulation in the Cloud

Abstract: This talk describes FireSim (https://fires.im), an open-source simulation platform that enables cycle-accurate microarchitectural simulation of large scale-out clusters by combining FPGA-accelerated simulation of silicon-proven RTL designs with a scalable, distributed network simulation. To simulate servers, FireSim automatically transforms and instruments open-hardware designs (e.g. RISC-V Rocket Chip and BOOM) into fast FPGA-based simulators that can be used to measure the performance of and validate the functionality of a hardware design before fabrication in silicon. Unlike prior FPGA-accelerated simulation tools, FireSim runs on Amazon EC2 F1, a public cloud FPGA platform, which greatly improves usability, provides elasticity, and lowers the cost of large-scale FPGA-based experiments. We describe the design and implementation of FireSim and show how it can provide sufficient performance to run modern applications at scale, enabling true hardware-software co-design. As an example, we demonstrate automatically generating and deploying a target cluster of 1,024 3.2 GHz quad-core server nodes, each with 16 GB of DRAM, interconnected by a 200 Gbit/s network with 2 microsecond latency, which simulates at a 6.6 MHz processor clock rate (less than 500x slowdown over real-time). In aggregate, this FireSim instantiation simulates 4,096 cores and 16 TB of memory, runs ~27 billion instructions per second, and harnesses millions of dollars' worth of FPGAs - at a total cost of only ~$100 per simulation hour to the user.
We present several examples to show how FireSim can be used to explore various research directions in warehouse-scale machine design, including modeling high-bandwidth, low-latency networks, integrating arbitrary RTL designs for a variety of commodity and specialized datacenter nodes, and modeling a variety of datacenter organizations, as well as reusing the scale-out FireSim infrastructure to enable fast, massively parallel cycle-exact single-node microarchitectural experimentation.
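As a sanity check, the aggregate figures quoted above follow directly from the per-node parameters. A quick back-of-the-envelope sketch (the ~1 instruction-per-cycle rate used to recover the aggregate instruction throughput is our assumption, not stated in the abstract):

```python
# Back-of-the-envelope check of the aggregate FireSim figures quoted above.
nodes = 1024
cores_per_node = 4
dram_per_node_gb = 16

target_clock_hz = 3.2e9     # simulated processor clock
effective_clock_hz = 6.6e6  # rate at which the simulation actually advances

total_cores = nodes * cores_per_node             # 4,096 cores
total_dram_tb = nodes * dram_per_node_gb / 1024  # 16 TB

slowdown = target_clock_hz / effective_clock_hz  # ~485x, under 500x
# Assuming roughly 1 instruction per simulated cycle per core:
aggregate_ips = total_cores * effective_clock_hz # ~2.7e10, i.e. ~27 billion

print(total_cores, total_dram_tb, round(slowdown), aggregate_ips)
```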

Bio: Sagar Karandikar is a fourth-year Ph.D. student in Computer Science at the University of California, Berkeley, focusing on Computer Architecture and Systems. His research focuses on hardware/software co-design in warehouse-scale machines. His work on the FireSim project was recently published at ISCA'18, and he has also contributed to papers at OSDI, NSDI, and FPL. He works in the Berkeley Architecture Research group and the ADEPT and RISE Labs, advised by Krste Asanović. He received the M.S. and B.S. degrees in Electrical Engineering and Computer Science from UC Berkeley in 2018 and 2015, respectively.

Khaled Khasawneh

Khaled Khasawneh - University of California at Riverside, advised by Nael Abu-Ghazaleh

Talk Title: Adversarial Evasion-Resilient Hardware Malware Detectors

Abstract: Machine learning approaches are increasingly used in security-sensitive applications such as malware detection and network intrusion detection. However, adversaries can learn the behavior of classifiers and construct adversarial examples that cause them to make wrong decisions, with potentially disastrous consequences. We explore this space in the context of Hardware Malware Detectors (HMDs), which have recently been proposed as a defense against the proliferation of malware. These detectors use low-level features that can be collected by the hardware performance monitoring units on modern CPUs to detect malware as a computational anomaly. An adversary can effectively reverse engineer existing HMDs and use the reverse-engineered model to create malware that evades detection. To address this critical problem, we developed evasion-resilient detectors that leverage recent results in adversarial machine learning to provide a theoretically quantifiable advantage in resilience to reverse engineering and evasion. Specifically, these detectors use multiple base detectors and switch between them stochastically, providing protection against reverse engineering and therefore evasion. The detectors rely on diversity among the baseline classifiers, and their evasion advantage correlates with how often the classifiers disagree. Thus, it is critical to study how correlated the decisions of different baseline detectors are: a characteristic called transferability. We study transferability across different classifier algorithms and internal settings, discovering that non-differentiable algorithms make the best candidates for operation in adversarial settings.
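To illustrate the stochastic-switching idea described above, here is a minimal sketch (the detectors, feature, and thresholds are invented for illustration and are not the authors' implementation): each classification query is served by a base detector drawn at random, so an adversary probing the system observes a mixture of decision boundaries rather than a single one.

```python
import random

# Toy "base detectors": threshold rules over a single hardware-counter
# feature. Real HMDs use richer low-level features and trained classifiers.
def detector_a(feature):
    return feature > 0.6   # flag as malware above this threshold

def detector_b(feature):
    return feature > 0.4   # a differently tuned detector

class StochasticHMD:
    """Switch stochastically between base detectors on every query."""
    def __init__(self, detectors, rng=None):
        self.detectors = detectors
        self.rng = rng or random.Random()

    def classify(self, feature):
        # Pick one base detector at random for this decision.
        return self.rng.choice(self.detectors)(feature)

hmd = StochasticHMD([detector_a, detector_b], rng=random.Random(0))
verdict = hmd.classify(0.7)   # both detectors agree here, so True either way
```

The evasion advantage comes from inputs like `feature = 0.5`, where the two base detectors disagree, so the adversary cannot predict the outcome of any single query.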

Bio: Khaled Khasawneh is a 5th-year Ph.D. candidate in the Department of Computer Science & Engineering at the University of California, Riverside. He is advised by Professor Nael Abu-Ghazaleh. He received his BSc degree in Computer Engineering from Jordan University of Science and Technology in 2012 and his MS degree in Computer Science from SUNY Binghamton in 2014. His work has been widely reported by technical news outlets and won the best paper award at the 2018 USENIX Workshop on Offensive Technologies. In the summer of 2018, he was an intern on the Community Integrity team at Facebook. His research interests are in architecture support for security, malware detection, adversarial machine learning, side channels, covert channels, and speculative attacks. He is expected to graduate by June 2019.

Hyoukjun Kwon

Hyoukjun Kwon - Georgia Institute of Technology, advised by Tushar Krishna

Talk Title: An Open Source Framework for Exploring Dataflow and Generating DNN Accelerators Supporting Flexible Dataflow

Abstract: Efficiently tiling and mapping high-dimensional convolutions onto limited execution and buffering resources is a challenge faced by all deep learning accelerators today. Because the tiling and mapping together determine the data movement and staging schedule, we term each unique approach a dataflow. The dataflow determines overall throughput (utilization of the compute units) and energy efficiency (reads, writes, and reuse of model parameters and partial sums across the accelerator’s memory hierarchy). The research community today lacks an infrastructure to systematically evaluate deep neural network (DNN) dataflows and architectures and reason about the performance, power, and area implications of various design choices.

In this work, we first present a framework called MAESTRO [1] to (1) formally describe and analyze DNN dataflows, (2) estimate performance and energy-efficiency when running neural network layers, and (3) report the hardware resources (size of buffers across the memory hierarchy, and network-on-chip (NoC) bandwidth) needed to support a given dataflow. Using a set of case studies, we demonstrate that adapting dataflow at runtime can provide significant benefits.

Adaptive dataflows require flexibility from the interconnect within the DNN accelerator. To address this need, we next present MAERI [2], a programmable DNN accelerator built with a set of modular and configurable building blocks that can easily support myriad DNN partitions and mappings by appropriately configuring the interconnect within the accelerator. MAERI can run various DNN layers (CNNs, LSTMs, max pooling, fully-connected) and supports myriad loop orderings, tiling strategies, and optimizations (such as weight pruning), while providing near-100% utilization of the compute resources. MAERI is written in Bluespec System Verilog and distributed as open source [3].

[1] MAESTRO: http://synergy.ece.gatech.edu/tools/maestro/
[2] MAERI: http://synergy.ece.gatech.edu/tools/maeri/
[3] https://github.com/hyoukjun/MAERI
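To illustrate why the choice of dataflow matters, here is a toy accounting of buffer fetches for two schedules of a small matrix multiply (the example and one-register cost model are ours and far simpler than MAESTRO's analysis):

```python
# Count scalar fetches from a backing buffer for two schedules of a tiny
# matrix multiply C[M,N] += A[M,K] * B[K,N], assuming one register holds
# exactly one reused operand. Purely illustrative of the "dataflow" idea.
M, N, K = 4, 4, 4

def fetches_output_stationary():
    # i-j outer loops: C[i][j] stays in a register across the whole k loop.
    a = b = c = 0
    for i in range(M):
        for j in range(N):
            c += 1                 # load C[i][j] once per (i, j)
            for k in range(K):
                a += 1             # A[i][k] fetched every iteration
                b += 1             # B[k][j] fetched every iteration
    return a + b + c

def fetches_no_reuse():
    # Naive schedule: every operand fetched on every innermost iteration.
    return 3 * M * N * K

print(fetches_output_stationary(), fetches_no_reuse())
```

Even at this toy scale, keeping the partial sum stationary cuts buffer traffic by a quarter; at real layer sizes and across a full memory hierarchy, such choices dominate energy efficiency.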

Bio: Hyoukjun Kwon is a Ph.D. student in the School of Computer Science at the Georgia Institute of Technology, advised by Dr. Tushar Krishna. He received a bachelor’s degree in computer science and engineering and environmental material science from Seoul National University. His research interests include computer architecture, networks-on-chip, and spatial accelerators for deep learning and graph applications.

Yu Liu

Yu Liu - Texas A&M University, advised by Peng Li

Talk Title: Energy-Efficient FPGA Recurrent Spiking Neural Accelerators with Bio-inspired Learning Mechanisms

Abstract: Spiking neural networks (SNNs) have recently been targeted for VLSI realization; however, due to the difficulty of training, most realizations lack on-chip learning capability. Naive gradient-based learning mechanisms do not work well for SNNs, and their hardware implementation can lead to high design complexity and excessive hardware overhead. The Liquid State Machine (LSM) is a model of recurrent SNNs that provides an appealing brain-inspired computing paradigm for machine learning applications such as pattern recognition. Processing information directly on neural spiking activities, the LSM naturally lends itself to an efficient hardware implementation by exploiting typical sparse firing patterns. As such, the LSM is considered a good tradeoff between tapping the computational power of recurrent SNNs and engineering tractability.

We present an LSM neural accelerator on a Xilinx Zynq ZC-706 board with an efficient implementation of spike-dependent on-chip training algorithms. In our system, the entire training and inference processes run online on the FPGA-based LSM neural processor, and data is transmitted between the neural accelerator and the on-board ARM microprocessor host through the AMBA AXI interface. This platform allows efficient exploration of different LSM neural processor architectures and training algorithms, measured directly on hardware. As an example, we built a demonstration neural processor that integrates a biologically inspired calcium-modulated supervised spike-timing-dependent plasticity (STDP) algorithm and applied it to the real-world application of speech recognition, benchmarked with spoken English letters adopted from the TI46 speech corpus. The results show that the hardware LSM neural processor achieves 95% learning performance, as high as the software simulator, with improved training speed and energy efficiency.
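The spike-timing-dependent plasticity at the heart of the training algorithm can be sketched with the classic pair-based STDP rule (the calcium modulation and supervision terms of the actual algorithm are omitted, and the constants below are illustrative, not from the talk):

```python
import math

# Pair-based STDP: potentiate the synapse when the presynaptic spike
# precedes the postsynaptic spike (dt > 0), depress it otherwise.
A_PLUS, A_MINUS = 0.01, 0.012      # illustrative learning-rate amplitudes
TAU_PLUS, TAU_MINUS = 20.0, 20.0   # illustrative time constants (ms)

def stdp_dw(t_pre, t_post):
    """Weight change for one pre/post spike pair (times in ms)."""
    dt = t_post - t_pre
    if dt > 0:
        return A_PLUS * math.exp(-dt / TAU_PLUS)    # potentiation
    return -A_MINUS * math.exp(dt / TAU_MINUS)      # depression

dw_pot = stdp_dw(t_pre=10.0, t_post=15.0)   # pre before post -> dw > 0
dw_dep = stdp_dw(t_pre=15.0, t_post=10.0)   # post before pre -> dw < 0
```

Because each update depends only on local spike times and an exponential of their difference, rules of this family map naturally to the kind of efficient on-chip training hardware described above.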

Bio: Yu Liu is a senior Ph.D. student at Texas A&M University under the supervision of Prof. Peng Li. Her research focuses mainly on the hardware design and prototyping of energy-efficient neuromorphic processors. Currently, she is working on recurrent spiking neural processors with on-chip learning ability, with an emphasis on hardware overhead and energy efficiency. During her Ph.D., she has published several papers and built a demonstration platform on her research topics. Yu also has broad background knowledge of hardware for AI and is interested in exploring different approaches to enabling the next generation of computing platforms with high speed, high performance, and low power. She has a strong background in digital design with both academic and industrial experience, including several industrial internships in ASIC and FPGA design and verification.

Divya Mahajan

Divya Mahajan - Georgia Institute of Technology, advised by Hadi Esmaeilzadeh

Talk Title: Balancing Generality and Specialization for Machine Learning in the Post-ISA Era

Abstract: Advances in Machine Learning (ML) and Artificial Intelligence (AI) are set to revolutionize medicine, robotics, commerce, transportation, and other key aspects of our lives. However, such transformative effects are predicated on providing both high-performance compute capabilities that enable these learning algorithms, and the simultaneous advancement in algorithms that can adapt to the continuously changing landscape of the data revolution. To this end, hardware acceleration is seen as an efficient technique to meet the compute requirements in the areas of ML and AI.

In this talk, I will discuss my work that devises a comprehensive full-stack solution for enabling hardware acceleration for machine learning. The goal is to expose a high-level mathematical programming interface to users who have limited knowledge of hardware design but can nevertheless benefit from large performance and efficiency gains through acceleration. I aim to strike a balance between generality and specialization by breaking the long-held traditional abstraction of the Instruction Set Architecture (ISA) and delving into the algorithmic foundations of Machine Learning and Deep Learning. I will also discuss my vision of incorporating hardware acceleration for advanced analytics within a multitude of data management mechanisms such as columnar databases, unstructured, and semi-structured systems.

Bio: Divya Mahajan is a PhD candidate in the Computer Science Department at the Georgia Institute of Technology, where she is advised by Professor Hadi Esmaeilzadeh. She received her Bachelors (2012) in Electrical Engineering from the Indian Institute of Technology Ropar, India, where she was honored with the President of India Gold Medal for her outstanding academic performance. Subsequently, she completed her Masters (2014) in Electrical and Computer Engineering at the University of Texas at Austin. She began her PhD studies in Fall 2014 and has since been a part of the Alternative Computing Technologies lab directed by Hadi Esmaeilzadeh. Her research interests include computer architecture, microarchitecture design, and developing alternative technologies for efficient computing. She is continuously working towards designing full-stack solutions and template-based architectures for accelerating Machine Learning and Deep Learning algorithms on FPGAs. Besides her primary research area of computer architecture, she has also worked at the intersection of machine learning, hardware design, programming languages, and databases. In her free time, she likes oil painting, cooking, and reading novels.

Caroline Trippel

Caroline Trippel - Princeton University, advised by Margaret Martonosi

Talk Title: CheckMate: Automated Synthesis of Hardware Exploits and Security Litmus Tests

Abstract: Recent research has uncovered a broad class of security vulnerabilities in which confidential data is leaked through programmer-observable microarchitectural state. In this talk, I will present CheckMate, a rigorous approach and automated tool for determining whether a microarchitecture is susceptible to specified classes of security exploits, and for synthesizing proof-of-concept exploit code when it is. Our approach adopts “microarchitecturally happens-before” (µhb) graphs, which prior work designed to capture the subtle orderings and interleavings of hardware execution events when programs run on a microarchitecture. CheckMate extends µhb graphs to facilitate modeling of security exploit scenarios and of hardware execution patterns indicative of classes of exploits. Furthermore, it leverages relational model finding techniques to enable automated exploit program synthesis from microarchitecture and exploit pattern specifications.

In addition to presenting CheckMate, I will describe a case study where we use CheckMate to evaluate the susceptibility of a speculative out-of-order processor to FLUSH+RELOAD cache side-channel attacks. The automatically synthesized results are programs representative of Meltdown and Spectre attacks. We also evaluate the same processor on its susceptibility to a different timing side-channel attack: PRIME+PROBE. Here, CheckMate synthesized new exploits that are similar to Meltdown and Spectre in that they leverage speculative execution, but unique in that they exploit distinct microarchitectural behaviors (speculative cache line invalidations rather than speculative cache pollution) to form a side-channel. Most importantly, these results validate the CheckMate approach to formal hardware security verification and the ability of the CheckMate tool to detect real-world vulnerabilities.
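At the core of the µhb approach is reasoning about orderings as a directed graph: an execution is realizable only if its happens-before edges form no cycle. A minimal sketch of that check (our toy representation; CheckMate itself works over relational models, not this Python structure):

```python
# A µhb graph relates hardware execution events with happens-before edges;
# a cyclic graph corresponds to an execution that cannot occur.
def is_realizable(events, edges):
    """Return True iff the happens-before graph is acyclic (DFS coloring)."""
    succ = {e: [] for e in events}
    for src, dst in edges:
        succ[src].append(dst)
    WHITE, GRAY, BLACK = 0, 1, 2
    color = {e: WHITE for e in events}

    def dfs(node):
        color[node] = GRAY
        for nxt in succ[node]:
            if color[nxt] == GRAY:
                return False              # back edge -> cycle
            if color[nxt] == WHITE and not dfs(nxt):
                return False
        color[node] = BLACK
        return True

    return all(dfs(e) for e in events if color[e] == WHITE)

# An event that must both precede and follow another cannot happen:
impossible = is_realizable(["flush", "load"],
                           [("flush", "load"), ("load", "flush")])
```

Exploit synthesis inverts this question: instead of checking one graph, the tool searches for an acyclic graph that also matches a specified exploit pattern, and emits the corresponding program.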

Bio: Caroline Trippel is a Ph.D. candidate in the Computer Science department at Princeton University. She is advised by Professor Margaret Martonosi on her computer architecture dissertation research, specifically on the topic of concurrency and security verification in heterogeneous parallel systems. Her work has resulted in formal, automated tools and techniques for specifying and verifying the correct and secure execution of software running on such systems. She has influenced the design of the RISC-V ISA memory consistency model (MCM) both via full-stack MCM analysis of its draft specification and her subsequent participation in the RISC-V Memory Model Task Group. Additionally, she has developed a novel methodology and tool that synthesized two new variants of the recently publicized Meltdown and Spectre attacks. Caroline received her B.S. in Computer Engineering from Purdue University in 2013, her M.A. in Computer Science from Princeton University in 2015, and was a 2017-2018 NVIDIA Graduate Fellow.

Mengjia Yan

Mengjia Yan - University of Illinois at Urbana-Champaign, advised by Josep Torrellas

Talk Title: InvisiSpec: Making Speculative Execution Invisible in the Cache Hierarchy

Abstract: Hardware speculation offers a major surface for micro-architectural covert and side channel attacks. Unfortunately, defending against speculative execution attacks is challenging. The reason is that speculations destined to be squashed execute incorrect instructions, outside the scope of what programmers and compilers reason about. Further, any change to micro-architectural state made by speculative execution can leak information.

In this work, we propose InvisiSpec, a novel strategy to defend against hardware speculation attacks in multiprocessors by making speculation invisible in the data cache hierarchy. InvisiSpec blocks micro-architectural covert and side channels through the multiprocessor data cache hierarchy that are due to speculative loads. In InvisiSpec, unsafe speculative loads read data into a speculative buffer, without modifying the cache hierarchy. When the loads become safe, InvisiSpec makes them visible to the rest of the system. InvisiSpec identifies loads that might have violated memory consistency and, at this point, forces them to perform a validation step. We propose two InvisiSpec designs: one to defend against Spectre-like attacks and another to defend against futuristic attacks, where any speculative load may pose a threat. Our simulations with 23 SPEC and 10 PARSEC workloads show that InvisiSpec is effective. Under TSO, using fences to defend against Spectre attacks slows down execution by 74% relative to a conventional, insecure processor; InvisiSpec reduces the execution slowdown to only 21%. Using fences to defend against futuristic attacks slows down execution by 208%; InvisiSpec reduces the slowdown to 72%.
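The slowdown figures above can be restated as InvisiSpec's advantage over fence-based defenses; a quick arithmetic sketch using only the numbers from the abstract:

```python
# Runtimes relative to an insecure baseline (slowdown% -> multiplier).
fence_spectre, invisi_spectre = 1.74, 1.21   # 74% vs. 21% slowdown
fence_future, invisi_future = 3.08, 1.72     # 208% vs. 72% slowdown

# How much faster InvisiSpec runs than the fence-based defense:
speedup_spectre = fence_spectre / invisi_spectre   # ~1.44x over fences
speedup_future = fence_future / invisi_future      # ~1.79x over fences

print(round(speedup_spectre, 2), round(speedup_future, 2))
```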

Bio: Mengjia Yan is a PhD student at the University of Illinois at Urbana-Champaign, advised by Professor Josep Torrellas. Her research interests lie in the areas of computer architecture and security, and she has been working on cache-based side channel attacks and defenses. She was selected as a Mavis Future Faculty Fellow in 2018, and received the W.J. Poppelbaum Memorial Award and an ACM SIGARCH Student Scholarship for the Celebration of 50 Years of the ACM Turing Award in 2017. She earned her Master's degree from UIUC in 2016 and her Bachelor's degree from Zhejiang University in 2013.

Xiaofan Zhang

Xiaofan Zhang - University of Illinois at Urbana-Champaign, advised by Deming Chen

Talk Title: DNNBuilder: An Automated Tool for Building High-Performance DNN Hardware Accelerators for FPGAs

Abstract: Deep Neural Networks (DNNs) are widely used in machine learning applications. As DNNs grow more advanced, their computation demands increase, slowing down inference; it is therefore necessary to develop hardware accelerators to boost DNN inference performance. FPGAs are promising candidates, offering improved latency and energy efficiency. However, building a high-performance FPGA accelerator for DNNs often requires RTL programming, hardware verification, and precise resource allocation, all of which can be time-consuming and challenging even for seasoned FPGA developers. To bridge the gap between fast DNN construction in software (e.g., Caffe, TensorFlow) and slow hardware implementation, we propose DNNBuilder for automatically building high-performance DNN hardware accelerators on FPGAs. Novel techniques are developed to meet the throughput and latency requirements of cloud computing, including high-quality RTL neural network components, a fine-grained layer-based pipeline architecture, and a column-based cache scheme, which together boost throughput, reduce latency, and save FPGA on-chip memory. To address the limited-resource challenge, we design an automatic design space exploration tool that generates optimized parallelism guidelines by considering external memory access bandwidth, data reuse behaviors, FPGA resource availability, and DNN complexity. DNNBuilder is demonstrated on four DNNs (AlexNet, ZF, VGG16, and YOLO) with the best performance (up to 5.15x faster) and power efficiency (up to 5.87x more efficient) compared to previously published FPGA-based DNN accelerators. DNNBuilder provides millisecond-scale real-time performance for processing HD video input and delivers higher efficiency (up to 4.35x) than GPU-based solutions.
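The design space exploration step described above can be sketched as a search over per-layer parallelism under a resource budget. The greedy policy, cost model, and all numbers below are invented placeholders, not DNNBuilder's actual algorithm:

```python
# Pick a power-of-two parallelism factor per layer such that total DSP
# usage fits a budget. Toy stand-in for an automatic design space explorer.
def explore(layers, dsp_budget):
    """layers: list of (name, ops_per_output); returns {name: lanes}."""
    DSP_PER_LANE = 5                  # invented resource cost per lane
    choice = {}
    remaining = dsp_budget
    # Give the heaviest layers lanes first (greedy, purely illustrative).
    for name, ops in sorted(layers, key=lambda l: -l[1]):
        lanes = 1
        while (lanes * 2) * DSP_PER_LANE <= remaining and lanes * 2 <= ops:
            lanes *= 2
        choice[name] = lanes
        remaining -= lanes * DSP_PER_LANE
    return choice

plan = explore([("conv1", 64), ("conv2", 256), ("fc", 16)], dsp_budget=420)
```

A real explorer would additionally weigh external memory bandwidth and data reuse, as the abstract notes, but the shape of the problem, allocating finite fabric resources across layers to balance a pipeline, is the same.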

Bio: Xiaofan Zhang is a Ph.D. student in the Department of Electrical and Computer Engineering at the University of Illinois at Urbana-Champaign. He received the B.S. and M.S. degrees from the University of Electronic Science and Technology of China in 2013 and 2016, respectively. He is a Research Assistant in the CAD for Emerging Systems Group (ES-CAD) and the IBM-Illinois Center for Cognitive Computing Systems Research (C3SR) at the University of Illinois at Urbana-Champaign, advised by Prof. Deming Chen. He received a third-place award in the System Design Contest at the 2018 IEEE/ACM Design Automation Conference and a Best Paper Award nomination at the 2018 IEEE/ACM International Conference on Computer-Aided Design. His research interests include accelerator design for deep learning, hardware-software co-design, and energy-efficient computing.