2017 IBM Research Workshop on Architectures for Cognitive Computing and Datacenters - Talk Details

Hari Cherupalli profile

Hari Cherupalli - University of Minnesota, advised by Prof. John Sartori

Talk title: Application-specific Design and Optimization for Ultra-Low-Power Embedded Systems

Abstract: For many important and emerging applications, including the Internet of Things (IoT), smart sensors, health monitors, and wearable electronics, system characteristics such as lifetime, size, cost, and security are of paramount importance. These applications rely on ultra-low-power (ULP) microcontrollers and microprocessors that are already the most widely used type of processor in production today and are projected to increase their market dominance in the near future. In the low-power embedded systems used by these applications, energy efficiency is the primary factor that determines critical system characteristics such as size, weight, cost, reliability, and lifetime. One option to target these applications is to use application-specific integrated circuits (ASICs), given their energy efficiency advantages over general purpose processors (GPPs). However, GPPs are the preferred solution for many low-power applications, because custom IC design is costly and GPPs support future updates. Unfortunately, conventional power reduction techniques for GPPs reduce power by sacrificing performance. As such, their impact is limited to the point where performance degradation becomes unacceptable.

In this talk, I present several techniques that reduce the size, cost, and power of low-power embedded systems and improve their security. These techniques exploit the embedded nature of these applications using a novel HW/SW co-analysis technique that identifies the parts of a processor that an application is guaranteed not to exercise, regardless of its inputs.

Bio: Hari Cherupalli is a final-year PhD student at the University of Minnesota, working under Prof. John Sartori, and is currently an intern at ARM Research. He has a B.Tech and an M.Tech in Electrical Engineering from IIT Kharagpur. His research focuses on developing techniques for power management, cost reduction, and security in ultra-low-power processors. He has published in several computer architecture and CAD conferences such as ISCA, ASPLOS, MICRO, HPCA, ICCAD, and DAC. His research has been recognized by a Best Paper Award and coverage on several blogs and press outlets, including IEEE Spectrum. Hari enjoys playing the piano in his free time.

Kevin Hsieh profile

Kevin Hsieh - Carnegie Mellon University, advised by Prof. Phillip Gibbons and Prof. Onur Mutlu

Talk title: Gaia: Geo-Distributed Machine Learning Approaching LAN Speeds

Abstract: Machine learning (ML) is widely used to derive useful information from large-scale data (such as user activities, pictures, and videos) generated at increasingly rapid rates, all over the world. As it is costly and infeasible to move all this globally-generated data to a centralized data center, we need a geo-distributed ML system spanning multiple data centers. Unfortunately, communicating over wide-area networks (WANs) can significantly degrade ML system performance (by as much as 53.7x in our study) because the communication overwhelms the limited WAN bandwidth. In this talk, I will introduce Gaia, a geo-distributed ML system that (1) employs an intelligent communication mechanism over WANs to efficiently utilize the scarce WAN bandwidth, while retaining the accuracy and correctness guarantees of an ML algorithm; and (2) is generic and flexible enough to run a wide range of ML algorithms, without requiring any changes to the algorithms. Gaia decouples the communication within a data center from the communication between data centers, enabling different communication and consistency models for each. Gaia employs a new ML synchronization model, Approximate Synchronous Parallel (ASP), whose key idea is to dynamically eliminate insignificant communication between data centers while still guaranteeing the correctness of ML algorithms. Deployments across 11 Amazon EC2 global regions show that Gaia provides 1.8–53.5x speedup over two state-of-the-art distributed ML systems, and is within 0.94–1.40x of the speed of running the same ML algorithm on machines on a local area network (LAN).
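To make the idea of eliminating insignificant communication concrete, here is a minimal sketch of a significance filter in the spirit of ASP. It is an illustration under stated assumptions, not Gaia's implementation: the class name, the relative-change test, and the threshold value are all hypothetical. Each data center applies updates locally right away, but forwards the accumulated update across the WAN only once it has grown large relative to the parameter's current value.

```python
class SignificanceFilter:
    """Hypothetical per-data-center filter; names and threshold are assumptions."""

    def __init__(self, threshold=0.01):
        self.threshold = threshold  # assumed relative-change threshold
        self.pending = {}           # updates accumulated but not yet sent over the WAN

    def local_update(self, params, key, delta):
        """Apply delta locally; return the aggregate to send over the WAN, or None."""
        params[key] += delta
        self.pending[key] = self.pending.get(key, 0.0) + delta
        base = abs(params[key]) or 1e-12  # guard against division by zero
        if abs(self.pending[key]) / base >= self.threshold:
            return self.pending.pop(key)  # significant: ship the aggregated update
        return None                       # insignificant: keep accumulating locally
```

With a threshold of 0.05 and a parameter starting at 1.0, updates of 0.01 and 0.02 stay local, and a third update of 0.03 pushes the aggregate over the threshold so that a single combined message is sent.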

Bio: Kevin Hsieh is a Ph.D. student at Carnegie Mellon University, working with Phil Gibbons and Onur Mutlu. He is interested in research problems that lie at the intersection of Machine Learning, Distributed Systems, and Computer Architecture. His recent research spans software and hardware, including large-scale video analytics, distributed machine learning systems, and near-data processing architectures. Prior to the Ph.D. program, he managed an engineering team at MediaTek, Taiwan, where he worked on the processor and system architecture of mobile SoCs.

Patrick Judd profile

Patrick Judd - University of Toronto, advised by Prof. Andreas Moshovos

Talk title: Bit-Pragmatic Deep Neural Network Computing

Abstract: Deep Neural Networks expose a high degree of parallelism, making them amenable to highly data-parallel architectures. However, data-parallel architectures often accept inefficiency in individual computations for the sake of overall efficiency. We show that on average, activation values of convolutional layers during inference in modern Deep Convolutional Neural Networks (CNNs) contain 92% zero bits. Processing these zero bits entails ineffectual computations that could be skipped. We propose Pragmatic, a massively data-parallel architecture that eliminates most of the ineffectual computations on-the-fly, improving performance and energy efficiency compared to state-of-the-art high-performance accelerators. The idea behind Pragmatic is deceptively simple: use serial-parallel shift-and-add multiplication while skipping the zero bits of the serial input. However, a straightforward implementation based on shift-and-add multiplication yields unacceptable area, power, and memory access overheads compared to a conventional bit-parallel design. Pragmatic incorporates a set of design decisions to yield a practical, area- and energy-efficient design. Measurements demonstrate that for convolutional layers, Pragmatic is 4.31x faster than DaDianNao using a 16-bit fixed-point representation. While Pragmatic requires 1.68x more area than DaDianNao, the performance gains yield a 1.70x increase in energy efficiency in a 65 nm technology. With 8-bit quantized activations, Pragmatic is 2.25x faster and 1.31x more energy efficient than an 8-bit version of DaDianNao.
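The arithmetic behind skipping zero bits can be illustrated in a few lines. The sketch below is only a software model of shift-and-add multiplication, not the Pragmatic hardware design; the function name is hypothetical and the activation is assumed to be a non-negative integer. It adds one shifted copy of the weight per 1 bit of the activation and never visits the zero bits, so the number of additions equals the number of set bits.

```python
def multiply_skipping_zero_bits(activation, weight):
    """Return (product, terms): shift-and-add over the 1 bits of activation only.

    Assumes activation is a non-negative integer; weight may be any integer.
    """
    product, terms = 0, 0
    while activation:
        lowest = activation & -activation               # isolate the lowest set bit
        product += weight << (lowest.bit_length() - 1)  # add the shifted weight
        terms += 1
        activation ^= lowest                            # clear that bit and continue
    return product, terms
```

For a 16-bit activation such as 0b0000000000010010, this performs two additions instead of sixteen bit-serial steps, which is where the potential saving comes from when most activation bits are zero.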

Bio: Patrick Judd is a 4th-year PhD candidate with Prof. Andreas Moshovos at the University of Toronto, where he also received his MASc and BASc. He has also done internships at Actel and Nvidia. Patrick’s research focuses on the design of ASIC accelerators for deep neural networks. His work examines the numerical properties of neural networks and how they can be exploited in hardware to improve performance and energy efficiency. These accelerators leverage numerical approximation to minimize memory footprint or computation while maintaining the network's output classification accuracy. Furthermore, he is interested in accelerator designs with the flexibility to trade classification accuracy for further improvements in performance and energy at a fine granularity.

David Marquez profile

David González Márquez - University of Buenos Aires, advised by Adrian Cristal (BSC) and Esteban Mocskos (UBA)

Talk title: Towards Efficient Parallelization Hardware-Software Co-Designed Solutions

Abstract: Multi-core processors are ubiquitous in all market segments, from embedded to high performance computing, but only a few applications can utilize them efficiently. Existing parallel frameworks aim to support thread-level parallelism in applications, but the overhead they impose prevents their use for small problem instances or small sections of code.

In this talk, I present Micro-threads (Mth), a hardware-software co-designed proposal built around a shared thread management model. It enables the use of parallel resources in applications that have small chunks of parallel code or small problem inputs through a combination of software and hardware: delegation of resource control to the application, an improved mechanism to store and fill the processor's context, and an efficient synchronization system.

I will show the feasibility of the proposal on a set of well-defined algorithms. Our results show remarkable speedups at all sizes, even when the instance size is very small. I will also show how Mth can be used in a real, widely used application: adding support for parallel query planning in an embedded database engine. Mth enables the efficient execution of very small queries, showing remarkable gains in execution time and energy consumption.

The results encourage the adoption of Mth and show how it can ease the use of multiple cores for applications that currently cannot take advantage of the growing number of parallel resources available on each chip.

Bio: David González Márquez is a researcher and graduate teaching assistant at the Computer Science Department (DC), FCEyN-UBA. During his degree, he worked on Grid Computing and Resource Distribution using SimGrid, a discrete-event simulation platform. He was the sysadmin of the first low-latency cluster in the DC, where he installed and set up the HPC system in 2008. He has broad teaching experience, with more than 10 years mainly in computer organization courses. For the last five years, he has been working on processor architecture topics under the direction of Esteban Mocskos (UBA, Argentina) and Adrian Cristal (BSC, Spain). He has published several papers in journals and specialized international conferences focused on parallel resources and distributed systems.

Apoorve Mohan profile

Apoorve Mohan - Northeastern University, advised by Gene Cooperman

Talk title: Marrying HPC and Cloud for Long-Term Happiness

Abstract: Cloud providers plan their hardware purchases by considering peak user utilization and adding some slack so as to not turn down customers and to be able to guarantee as much isolation as possible. This leads to under-utilized cloud deployments and an increase in total cost of ownership for cloud systems. Even though solutions such as auctioning unused CPU cycles (e.g., Amazon spot market) or offering short-lived preemptible virtual machines (e.g., Google preemptible instances) can mitigate the impact of this under-utilization problem, they do not completely address it. Studies show that even large public cloud datacenters have utilization levels below 50%.

High Performance Computing (HPC) systems, on the other hand, have a totally different and somewhat complementary workload pattern. HPC workloads generally use a batch submission model, with the inherent understanding that the resources they demand may not be assigned to them right away. Thus, HPC datacenters generally boast very high utilization levels (∼90%) but can suffer from long queue wait times.

In this talk, we present our efforts towards designing a combined HPC and cloud deployment system that can significantly improve overall data center utilization and provide sustainable, scalable, cheap, and performant bare-metal resources to HPC.

Bio: Apoorve Mohan is a 4th-year PhD student at the Electrical and Computer Engineering Department, Northeastern University. He received his BSc and MSc degrees from the University of Delhi, India. Broadly, his research interests include systems and networking, with a current focus on the intersection of Cloud and High Performance Computing. He has been associated with the Massachusetts Open Cloud project since 2015. He was also an intern at IBM Research, Yorktown Heights, during summer 2017.

Bert Moons profile

Bert Moons - KU Leuven, advised by Prof. Marian Verhelst

Talk title: Algorithmic-, Processor-, and Circuit-techniques for Embedded Deep Learning

Abstract: Deep learning algorithms are state-of-the-art for many pattern recognition tasks. Their performance, however, comes at a significant computational cost, which until recently made them feasible only on power-hungry server platforms in the cloud. Today, there is a trend towards embedded processing of deep learning networks in edge devices (mobiles, wearables, and IoT nodes), which is only possible through significant improvements in hardware and algorithmic energy efficiency. This talk covers the KU Leuven ESAT/MICAS research efforts towards this goal. We'll discuss several generations of energy-scalable convolutional neural network processors, a ULP visual wake-up sensor, and an LSTM keyword spotter, as well as our work on the design of hardware-software optimized minimum-energy Quantized Neural Networks.

Bio: Bert Moons (S'13) was born in Antwerp, Belgium in 1991. He received the B.S. and M.S. degrees in Electrical Engineering from KU Leuven, Leuven, Belgium in 2011 and 2013, respectively. In 2013, he joined the ESAT-MICAS laboratories of KU Leuven as a Research Assistant funded through an individual grant from FWO (formerly IWT). In 2016, he was a Visiting Research Student at Stanford University in the Murmann Mixed-Signal Group. Currently, he is working towards the Ph.D. degree on energy-scalable and run-time adaptable digital circuits for embedded Deep Learning applications.

Ozan Tuncer profile

Ozan Tuncer - Boston University, advised by Prof. Ayse Coskun

Talk title: Analytics for Detection of Performance and Configuration Anomalies in Large-scale Systems

Abstract: As system size and complexity grow in both the cloud and HPC, applications suffer from performance variations and failures due to anomalies such as misconfigurations and software-related problems. These anomalies are among the main challenges to efficient and reliable operation. To be able to take preventive actions, one needs to quickly and accurately detect and diagnose anomalies.

In the first part of this talk, I will present a machine-learning-based framework to diagnose software-related anomalies such as orphan processes. Our framework leverages historical resource usage and performance counter data to generate and detect anomaly signatures. In the second part, I will focus on configuration anomalies, and present a methodology to discover and extract configurations from VMs and containers to enable configuration analysis in the cloud. Using configuration data extracted from popular Docker images, I will demonstrate how misconfigurations can be found via outlier analysis.
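As a rough illustration of finding misconfigurations via outlier analysis, the sketch below flags a configuration value when only a small share of images use it for a given key. This is a simple frequency-based heuristic chosen for illustration, not necessarily the method presented in the talk; the function name, the `max_share` cutoff, and the input layout (a mapping from image name to its key/value settings) are assumptions.

```python
from collections import Counter


def rare_settings(configs, max_share=0.2):
    """Return (image, key, value) triples whose value is rare for that key.

    configs maps an image name to a dict of configuration key -> value.
    max_share is an assumed cutoff on the fraction of images sharing a value.
    """
    counts = {}
    for cfg in configs.values():
        for key, value in cfg.items():
            counts.setdefault(key, Counter())[value] += 1

    flagged = []
    for image, cfg in configs.items():
        for key, value in cfg.items():
            total = sum(counts[key].values())
            if counts[key][value] / total <= max_share:
                flagged.append((image, key, value))
    return flagged
```

If nine out of ten images set a key to one value and a single image sets it to another, only that single image is reported, which gives an operator a short list of settings to review.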

Bio: Ozan Tuncer is a 6th-year PhD student at the Electrical and Computer Engineering Department, Boston University. He received his BSc degree from Middle East Technical University, Turkey. He has worked as a summer intern at Sandia National Laboratories (2014), at Oracle (2015), and at IBM (2016, 2017). His research interests include power and workload management for high performance computing and reliability in large-scale systems.

Ritchie Zhao profile

Ritchie Zhao - Cornell University, advised by Prof. Zhiru Zhang

Talk title: Accelerating Binarized Neural Networks with Software-Programmable FPGAs

Abstract: Convolutional neural networks (CNNs) are the current state-of-the-art for many computer vision tasks, but are very expensive in terms of computation and memory. Existing CNN applications are typically executed on clusters of CPUs or GPUs. Research on FPGA acceleration of CNNs has achieved impressive reductions in power and gains in energy efficiency. However, modern GPUs outperform FPGAs in throughput, and are significantly easier to program due to compatibility with deep learning frameworks (e.g., Caffe, TensorFlow, and CNTK).

Recent work in machine learning demonstrates the potential of CNNs with binarized weights and activations. These binarized neural networks (BNNs) appear well-suited for FPGA implementation, as their dominant operations are bitwise logic operations, which map efficiently to the FPGA fabric. In this talk, I present the hardware implementation of a BNN accelerator, which achieves very high operation density and energy efficiency on the embedded ZedBoard.
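To see why bitwise logic dominates in a BNN, consider the dot product of two vectors whose entries are all +1 or -1. A common formulation, sketched below as a software model and not as the accelerator's actual datapath, packs each vector into an integer (bit 1 for +1, bit 0 for -1), computes XNOR to find the positions where the two vectors agree, and recovers the dot product from a population count. The function names are hypothetical.

```python
def pack(values):
    """Pack a list of +1/-1 values into an int; bit i is 1 when values[i] == +1."""
    word = 0
    for i, v in enumerate(values):
        if v == 1:
            word |= 1 << i
    return word


def binary_dot(a_bits, b_bits, n):
    """Dot product of two packed +1/-1 vectors of length n via XNOR and popcount."""
    mask = (1 << n) - 1
    agree = ~(a_bits ^ b_bits) & mask      # XNOR: 1 where the signs match
    matches = bin(agree).count("1")        # population count
    return 2 * matches - n                 # matches minus mismatches
```

Each matching position contributes +1 and each mismatch contributes -1, so the result is `2 * matches - n`; no multipliers are needed.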

To address the difficulty of programming FPGAs, we propose combining a high-level design methodology with tools that automate hardware optimization. We introduce DATuner, an extensible distributed autotuning framework that can optimize design parameters and tool settings using a multi-armed bandit approach. We also describe work-in-progress ideas on a domain-specific language that can greatly improve the productivity of generating and verifying hardware architectures for neural network acceleration.
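For readers unfamiliar with multi-armed bandits, the sketch below shows how such an approach can split a fixed evaluation budget among candidate settings. It uses the standard UCB1 policy as a stand-in and is not DATuner's actual algorithm or API; the function name, the `evaluate` callback, and the assumption that rewards lie in [0, 1] are all hypothetical.

```python
import math


def ucb1_tune(arms, evaluate, budget):
    """Spend `budget` evaluations across `arms` using the UCB1 policy.

    arms: list of candidate settings (hypothetical, e.g. tool option sets).
    evaluate(arm): returns a quality score, assumed to lie in [0, 1].
    Returns (best_arm, pulls), where pulls[i] counts evaluations of arms[i].
    """
    pulls = [0] * len(arms)
    total = [0.0] * len(arms)
    for t in range(1, budget + 1):
        if t <= len(arms):
            i = t - 1  # evaluate every arm once before comparing them
        else:
            # mean reward plus an exploration bonus that shrinks with more pulls
            i = max(range(len(arms)),
                    key=lambda k: total[k] / pulls[k]
                    + math.sqrt(2 * math.log(t) / pulls[k]))
        pulls[i] += 1
        total[i] += evaluate(arms[i])
    best = max(range(len(arms)), key=lambda k: total[k] / pulls[k])
    return arms[best], pulls
```

The budget is assumed to be at least the number of arms. The policy keeps trying weaker settings occasionally, but most of the budget flows to whichever setting has looked best so far.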

Bio: Ritchie Zhao is a fourth-year ECE PhD student at Cornell University under Professor Zhiru Zhang. He received his B.S. from the University of Toronto in 2014. His research interests include hardware specialization for deep learning on FPGAs, as well as high-level synthesis for productive hardware design.