2021
Energy Efficient In-memory Hyperdimensional Encoding for Spatio-temporal Signal Processing
Geethan Karunaratne, Manuel Le Gallo, Michael Hersche, Giovanni Cherubini, Luca Benini, Abu Sebastian, Abbas Rahimi
IEEE Transactions on Circuits and Systems II: Express Briefs, 2021
Abstract
The emerging brain-inspired computing paradigm known as hyperdimensional computing (HDC) has been proven to provide a lightweight learning framework for various cognitive tasks compared to the widely used deep learning-based approaches. Spatio-temporal (ST) signal processing, which encompasses biosignals such as electromyography (EMG) and electroencephalography (EEG), is one family of applications that could benefit from an HDC-based learning framework. At the core of HDC lie manipulations and comparisons of large bit patterns, which are inherently ill-suited to conventional computing platforms based on the von-Neumann architecture. In this work, we propose an architecture for ST signal processing within the HDC framework using predominantly in-memory compute arrays. In particular, we introduce a methodology for the in-memory hyperdimensional encoding of ST data to be used together with an in-memory associative search module. We show that the in-memory HDC encoder for ST signals offers at least 1.80x energy efficiency gains, 3.36x area gains, as well as 9.74x throughput gains compared with a dedicated digital hardware implementation. At the same time it achieves a peak classification accuracy within 0.04% of that of the baseline HDC framework.
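For readers unfamiliar with how spatio-temporal biosignals are typically encoded in the HDC framework, the sketch below illustrates the general idea in software rather than the in-memory hardware described above: each channel identity is bound (XOR) to a hypervector for its quantized sample, channels are bundled per time step, and time steps are combined through permutation before a final bundling. Dimensions, channel count, and quantization levels are illustrative assumptions, not the paper's configuration.

```python
import numpy as np

D = 10000          # hypervector dimensionality (illustrative)
CHANNELS = 4       # e.g., EMG channels (illustrative)
LEVELS = 16        # quantization levels per sample (illustrative)
rng = np.random.default_rng(0)

# Random binary "item memory": one hypervector per channel and per quantization level.
channel_hv = rng.integers(0, 2, size=(CHANNELS, D), dtype=np.uint8)
level_hv = rng.integers(0, 2, size=(LEVELS, D), dtype=np.uint8)

def encode_timestep(samples):
    """Bind each channel id to its quantized value (XOR), then bundle by majority."""
    bound = np.bitwise_xor(channel_hv, level_hv[samples])      # (CHANNELS, D)
    return (bound.sum(axis=0) > CHANNELS / 2).astype(np.uint8)

def encode_window(window):
    """Permute (cyclic shift) each time step's hypervector by its index, then bundle."""
    steps = np.stack([np.roll(encode_timestep(s), t) for t, s in enumerate(window)])
    return (steps.sum(axis=0) > len(window) / 2).astype(np.uint8)

# Usage: a window of 5 time steps, each holding CHANNELS quantized samples in [0, LEVELS).
window = rng.integers(0, LEVELS, size=(5, CHANNELS))
print(encode_window(window)[:16])
```

The resulting hypervector is what an associative search module would then compare against learned prototype hypervectors.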
A wearable biosensing system with in-sensor adaptive machine learning for hand gesture recognition
Ali Moin, Andy Zhou, Abbas Rahimi, Alisha Menon, Simone Benatti, George Alexandrov, Senam Tamakloe, Jonathan Ting, Natasha Yamamoto, Yasser Khan, Fred Burghardt, Luca Benini, Ana C. Arias, Jan M. Rabaey
Nature Electronics, 2021
Robust high-dimensional memory-augmented neural networks
Geethan Karunaratne, Manuel Schmuck, Manuel Le Gallo, Giovanni Cherubini, Luca Benini, Abu Sebastian, Abbas Rahimi
Nature Communications 12(1), 2468, 2021
Abstract
Traditional neural networks require enormous amounts of data to build their complex mappings during a slow training procedure that hinders their abilities for relearning and adapting to new data. Memory-augmented neural networks enhance neural networks with an explicit memory to overcome these issues. Access to this explicit memory, however, occurs via soft read and write operations involving every individual memory entry, resulting in a bottleneck when implemented using the conventional von Neumann computer architecture. To overcome this bottleneck, we propose a robust architecture that employs a computational memory unit as the explicit memory performing analog in-memory computation on high-dimensional (HD) vectors, while closely matching 32-bit software-equivalent accuracy. This is achieved by a content-based attention mechanism that represents unrelated items in the computational memory with uncorrelated HD vectors, whose real-valued components can be readily approximated by binary, or bipolar components. Experimental results demonstrate the efficacy of our approach on few-shot image classification tasks on the Omniglot dataset using more than 256,000 phase-change memory devices. Our approach effectively merges the richness of deep neural network representations with HD computing that paves the way for robust vector-symbolic manipulations applicable in reasoning, fusion, and compression.
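As a rough, software-only illustration of the content-based attention idea (a minimal sketch with assumed sizes, not the phase-change-memory implementation): support examples are stored as keys in an explicit memory, a query attends to them by cosine similarity, and the real-valued keys can be replaced by their bipolar signs with little change in the prediction when the keys are quasi-orthogonal.

```python
import numpy as np

rng = np.random.default_rng(1)
D, N_SUPPORT, N_CLASSES = 512, 25, 5   # illustrative sizes

# Explicit memory: one real-valued key per support example, plus its class label.
keys = rng.standard_normal((N_SUPPORT, D))
labels = rng.integers(0, N_CLASSES, size=N_SUPPORT)

def attend(query, memory):
    """Content-based attention: cosine similarity followed by a sharpening softmax."""
    sims = memory @ query / (np.linalg.norm(memory, axis=1) * np.linalg.norm(query) + 1e-9)
    weights = np.exp(10 * sims)                # temperature sharpens the attention
    weights /= weights.sum()
    scores = np.zeros(N_CLASSES)
    np.add.at(scores, labels, weights)         # sum attention mass per class
    return scores.argmax()

query = keys[3] + 0.3 * rng.standard_normal(D) # a noisy view of support example 3

# Full-precision memory vs. bipolar (+1/-1) approximation of the same keys and query.
print(attend(query, keys))                     # expected: labels[3]
print(attend(np.sign(query), np.sign(keys)))   # typically the same prediction
```

That both calls typically agree is the property that allows the explicit memory to be stored and searched with binary or bipolar in-memory devices.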
2020
Explainable deep learning for medical time series data
Thomas Frick, Stefan Gluge, Abbas Rahimi, Luca Benini, Thomas Brunschwiler
International Conference on Wireless Mobile Communication and Healthcare (MobiHealth), Springer, 2020
Abstract
Neural Networks are powerful classifiers. However, they are black boxes and do not provide explicit explanations for their decisions. For many applications, particularly in health care, explanations are essential for building trust in the model. In the field of computer vision, a multitude of explainability methods have been developed to analyze Neural Networks by explaining what they have learned during training and what factors influence their decisions. This work provides an overview of these explanation methods in the form of a taxonomy. We adapt and benchmark the different methods to time series data. Further, we introduce quantitative explanation metrics that enable us to build an objective benchmarking framework with which we extensively rate and compare explainability methods. As a result, we show that the Grad-CAM++ algorithm outperforms all other methods. Finally, we identify the limits of existing explanation methods for specific datasets with feature values close to zero.
Binarization Methods for Motor-Imagery Brain-Computer Interface Classification
Michael Hersche, Luca Benini, Abbas Rahimi
IEEE Journal on Emerging and Selected Topics in Circuits and Systems, 2020
Abstract
Successful motor-imagery brain-computer interface (MI-BCI) algorithms either extract a large number of handcrafted features and train a classifier, or combine feature extraction and classification within deep convolutional neural networks (CNNs). Both approaches typically result in a set of real-valued weights that pose challenges when targeting real-time execution on tightly resource-constrained devices. We propose methods for each of these approaches that allow transforming real-valued weights to binary numbers for efficient inference. Our first method, based on sparse bipolar random projection, projects a large number of real-valued Riemannian covariance features to a binary space, where a linear SVM classifier can be learned with binary weights too. By tuning the dimension of the binary embedding, we achieve almost the same accuracy in 4-class MI ($\leq$1.27% lower) compared to models with float16 weights, yet delivering a more compact model with simpler operations to execute. Second, we propose to use memory-augmented neural networks (MANNs) for MI-BCI such that the augmented memory is binarized. Our method replaces the fully connected layer of CNNs with a binary augmented memory using bipolar random projection, or learned projection. Our experimental results on EEGNet, an already compact CNN for MI-BCI, show that it can be compressed by 1.28x at iso-accuracy using the random projection. On the other hand, using the learned projection provides 3.89% higher accuracy but increases the memory size by 28.10x.
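A minimal sketch of the first method's core step, sparse bipolar random projection followed by binarization; the feature count, embedding dimension, and sparsity below are illustrative assumptions rather than the paper's settings.

```python
import numpy as np

rng = np.random.default_rng(2)
N_FEAT, D_BIN = 2000, 8192      # real-valued feature count, binary embedding size (illustrative)
SPARSITY = 0.1                  # fraction of nonzero projection entries (illustrative)

# Sparse bipolar projection matrix: entries in {-1, 0, +1}, mostly zero.
proj = rng.choice([-1, 0, 1], size=(N_FEAT, D_BIN),
                  p=[SPARSITY / 2, 1 - SPARSITY, SPARSITY / 2]).astype(np.int8)

def binarize(features):
    """Project real-valued features and keep only the sign -> binary embedding."""
    return (features @ proj > 0).astype(np.uint8)

x = rng.standard_normal(N_FEAT)          # e.g., flattened Riemannian covariance features
x_bin = binarize(x)
print(x_bin.shape, x_bin.mean())         # roughly balanced 0/1 components
```

A linear SVM with binary weights can then be trained directly on such binary embeddings, as described above.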
Integrating event-based dynamic vision sensors with sparse hyperdimensional computing: a low-power accelerator with online learning capability
Michael Hersche, Edoardo Mello Rella, Alfio Di Mauro, Luca Benini, Abbas Rahimi
Proceedings of the ACM/IEEE International Symposium on Low Power Electronics and Design, pp. 169-174, 2020
Abstract
We propose to embed features extracted from event-driven dynamic vision sensors to binary sparse representations in hyperdimensional (HD) space for regression. This embedding compresses events generated across 346x260 differential pixels to a sparse 8160-bit vector by applying random activation functions. The sparse representation not only simplifies inference, but also enables online learning with the same memory footprint. Specifically, it allows efficient updates by retaining binary vector components over the course of online learning that cannot be otherwise achieved with dense representations demanding multibit vector components. We demonstrate online learning capability: using estimates and confidences of an initial model trained with only 25% of data, our method continuously updates the model for the remaining 75% of data, resulting in a close match with accuracy obtained with an oracle model on ground truth labels. When mapped on an 8-core accelerator, our method also achieves lower error, latency, and energy compared to other sparse/dense alternatives. Furthermore, it is 9.84x more energy-efficient and 6.25x faster than an optimized 9-layer perceptron with comparable accuracy.
An Ensemble of Hyperdimensional Classifiers: Hardware-Friendly Short-Latency Seizure Detection with Automatic iEEG Electrode Selection
Alessio Burrello, Simone Benatti, Kaspar Anton Schindler, Luca Benini, Abbas Rahimi
IEEE Journal of Biomedical and Health Informatics, 2020
Abstract
We propose an intracranial electroencephalography (iEEG) based algorithm for detecting epileptic seizures with short latency and for identifying the most relevant electrodes. Our algorithm first extracts three features, namely mean amplitude, line length, and local binary patterns, which are fed to an ensemble of classifiers using hyperdimensional (HD) computing. These features are embedded into an HD space where well-defined vector-space operations are used to construct prototype vectors representing ictal (during seizures) and interictal (between seizures) brain states. Prototype vectors can be computed at different spatial scales ranging from a single electrode up to many electrodes covering different brain regions. This flexibility allows our algorithm to identify the iEEG electrodes that discriminate best between ictal and interictal brain states. We assess our algorithm on the SWEC-ETHZ iEEG dataset that includes 99 short-time iEEG seizures recorded with 36 to 100 electrodes from 16 drug-resistant epilepsy patients. Using k-fold cross-validation and all electrodes, our algorithm surpasses state-of-the-art algorithms yielding significantly shorter latency (8.81 s vs. 9.94 s) in seizure onset detection, and higher sensitivity (96.38 % vs. 92.72 %) and accuracy (96.85 % vs. 95.43 %). We can further reduce the latency of our algorithm to 3.74 s by allowing a slightly higher percentage of false alarms (2 % specificity loss). Using only the top 10 % of the electrodes ranked by our algorithm, we still maintain superior latency, sensitivity, and specificity compared to the other algorithms with all the electrodes. We finally demonstrate the suitability of our algorithm for deployment on low-cost embedded hardware platforms, thanks to its robustness to noise/artifacts affecting the signal, its low computational complexity, and its small memory footprint on a RISC-V microcontroller.
Autoscaling Bloom filter: controlling trade-off between true and false positives
Denis Kleyko, Abbas Rahimi, Ross W. Gayler, Evgeny Osipov
Neural Computing and Applications 32(8), 3675-3684, 2020
Abstract
A Bloom filter is a special case of an artificial neural network with two layers. Traditionally, it is seen as a simple data structure supporting membership queries on a set. The standard Bloom filter does not support the delete operation, and therefore, many applications use a counting Bloom filter to enable deletion. This paper proposes a generalization of the counting Bloom filter approach, called "autoscaling Bloom filters", which allows adjustment of its capacity with probabilistic bounds on false positives and true positives. Thus, by relaxing the requirement on perfect true positive rate, the proposed autoscaling Bloom filter addresses the major difficulty of Bloom filters with respect to their scalability. In essence, the autoscaling Bloom filter is a binarized counting Bloom filter with an adjustable binarization threshold. We present the mathematical analysis of its performance and provide a procedure for minimizing its false positive rate.
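To make the notion of a binarized counting Bloom filter with an adjustable threshold concrete, here is a small self-contained sketch; the hashing scheme, array size, and number of hash functions are illustrative assumptions. Membership is declared only if every addressed counter meets the threshold, so raising the threshold trades true positives for fewer false positives.

```python
import hashlib
import numpy as np

M, K = 4096, 3          # counter array size and number of hash functions (illustrative)

def hashes(item):
    """K index positions derived from a single digest (illustrative hashing scheme)."""
    digest = hashlib.sha256(item.encode()).digest()
    return [int.from_bytes(digest[4*i:4*i+4], "big") % M for i in range(K)]

class AutoscalingBloomFilter:
    def __init__(self):
        self.counters = np.zeros(M, dtype=np.int32)

    def add(self, item):
        for idx in hashes(item):
            self.counters[idx] += 1

    def query(self, item, threshold=1):
        # threshold = 1 behaves like a standard counting Bloom filter;
        # a larger threshold lowers false positives at the cost of true positives.
        return all(self.counters[idx] >= threshold for idx in hashes(item))

abf = AutoscalingBloomFilter()
for word in ["alpha", "beta", "gamma"] * 3:   # repeated insertions raise the counters
    abf.add(word)
print(abf.query("alpha", threshold=2))        # True: inserted several times
print(abf.query("delta", threshold=2))        # very likely False
```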
Compressing Subject-specific Brain-Computer Interface Models into One Model by Superposition in Hyperdimensional Space
Michael Hersche, Philipp Rupp, Luca Benini, Abbas Rahimi
2020 Design, Automation & Test in Europe Conference & Exhibition (DATE), pp. 246-251
Abstract
Accurate multiclass classification of electroencephalography (EEG) signals is still a challenging task towards the development of reliable motor imagery brain-computer interfaces (MI-BCIs). Deep learning algorithms have been recently used in this area to deliver a compact and accurate model. Reaching a high level of accuracy requires storing subject-specific trained models, which cannot be achieved with an otherwise compact model trained globally across all subjects. In this paper, we propose a new methodology that closes the gap between these two extreme modeling approaches: we reduce the overall storage requirements by superimposing many subject-specific models into one single model such that it can be reliably decomposed, after retraining, to its constituent models while providing a trade-off between compression ratio and accuracy. Our method makes use of the unexploited capacity of trained models by orthogonalizing parameters in a hyperdimensional space, followed by iterative retraining to compensate for the noisy decomposition. This method can be applied to various layers of deep inference models. Experimental results on the 4-class BCI competition IV-2a dataset show that our method exploits unutilized capacity for compression and surpasses the accuracy of two state-of-the-art networks: (1) it compresses the smallest network, EEGNet [1], by 1.9x, and increases its accuracy by 2.41% (74.73% vs. 72.32%); (2) using a relatively larger Shallow ConvNet [2], our method achieves 2.95x compression as well as 1.4% higher accuracy (75.05% vs. 73.59%).
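The superposition idea can be illustrated in a few lines of NumPy under strong simplifications: each subject-specific weight matrix is bound to a random bipolar context (element-wise sign pattern), the bound matrices are summed into one container, and an individual model is approximately recovered by binding with the same context again. The shapes are assumptions, and the iterative retraining that the paper uses to clean up the decomposition noise is deliberately omitted, which is why the recovered weights below are noisy.

```python
import numpy as np

rng = np.random.default_rng(3)
N_SUBJECTS, ROWS, COLS = 8, 64, 128        # illustrative layer shape

# Subject-specific weight matrices and one random bipolar context per subject.
weights = [rng.standard_normal((ROWS, COLS)) for _ in range(N_SUBJECTS)]
contexts = [rng.choice([-1.0, 1.0], size=(ROWS, COLS)) for _ in range(N_SUBJECTS)]

# Superimpose: bind (element-wise multiply) and sum into a single container matrix.
container = sum(w * c for w, c in zip(weights, contexts))

# Retrieve subject 2: binding with its context recovers its weights plus crosstalk noise,
# which the paper removes with a short retraining pass (not shown here).
recovered = container * contexts[2]
noise = recovered - weights[2]
print(f"signal std {weights[2].std():.2f}, crosstalk std {noise.std():.2f}")
```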
Binary Models for Motor-Imagery Brain-Computer Interfaces: Sparse Random Projection and Binarized SVM
Michael Hersche, Luca Benini, Abbas Rahimi
2020 2nd IEEE International Conference on Artificial Intelligence Circuits and Systems (AICAS), pp. 163-167
Abstract
Successful motor imagery brain-computer interface (MI-BCI) algorithms typically rely on a large number of features used in a classifier with real-valued weights that render them unsuitable for real-time execution on a resource-limited device. We propose a new method that randomly projects a large number of real-valued Riemannian covariance features to a binary space, where a linear SVM classifier can be learned with binary weights too. Flexibly increasing the dimension of the binary embedding achieves almost the same accuracy (1.27% lower) compared to models with float16 weights in 4-class and 3-class MI, while delivering a more compact model with simpler operations to execute.
Evolvable Hyperdimensional Computing: Unsupervised Regeneration of Associative Memory to Recover Faulty Components
Michael Hersche, Sara Sangalli, Luca Benini, Abbas Rahimi
2020 2nd IEEE International Conference on Artificial Intelligence Circuits and Systems (AICAS), pp. 281-285
Abstract
This paper proposes evolvable hyperdimensional (HD) computing to maintain high classification accuracy as permanent faults occur in emerging non-volatile memory fabrics. Our proposed HD architecture can detect, localize, and isolate faulty PCM blocks in discriminative classifiers, followed by unsupervised regeneration of new blocks to compensate for the accuracy loss. We demonstrate its application on a language recognition task: it is able to quickly relearn and fully recover the accuracy from 90.48% to 96.86% at fault rates as high as 42% by using only 4.2 MB of text for regeneration. The new evolved model is still 285x more compact than state-of-the-art fastText.
Hyperdimensional Computing With Local Binary Patterns: One-Shot Learning of Seizure Onset and Identification of Ictogenic Brain Regions Using Short-Time iEEG Recordings
Alessio Burrello, Kaspar Schindler, Luca Benini, Abbas Rahimi
IEEE Transactions on Biomedical Engineering 67(2), 601-613, 2020
Abstract
Objective: We develop a fast learning algorithm combining symbolic dynamics and brain-inspired hyperdimensional computing for both seizure onset detection and identification of ictogenic (seizure generating) brain regions from intracranial electroencephalography (iEEG). Methods: Our algorithm first transforms iEEG time series from each electrode into symbolic local binary pattern codes, from which a holographic distributed representation of the brain state of interest is constructed across all the electrodes and over time in a hyperdimensional space. The representation is used to quickly learn from few seizures, detect their onset, and identify the spatial brain regions that generated them. Results: We assess our algorithm on our dataset that contains 99 short-time iEEG recordings from 16 drug-resistant epilepsy patients implanted with 36-100 electrodes. For the majority of the patients (ten out of 16), our algorithm quickly learns from one or two seizures and perfectly (100%) generalizes on novel seizures using $k$-fold cross-validation. For the remaining six patients, the algorithm requires three to six seizures for learning. Our algorithm surpasses the state-of-the-art, including deep learning algorithms, by achieving higher specificity (94.84% versus 94.77%) and macroaveraging accuracy (95.42% versus 94.96%), and a lower memory footprint, but slightly higher average latency in detection (15.9 s versus 14.7 s). Moreover, the algorithm can reliably identify (with a statistically significant $p$-value) the relevant electrodes covering an ictogenic brain region at two levels of granularity: cerebral hemispheres and lobes. Conclusion and significance: Our algorithm provides: 1) a unified method for both learning and classification tasks with end-to-end binary operations; 2) one-shot learning from seizure examples; 3) linear computational scalability for an increasing number of electrodes; and 4) generation of transparent codes that enables post-translational support for clinical decision making. Our source code and anonymized iEEG dataset are freely available at http://ieeg-swez.ethz.ch.
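A compact, software-only sketch of the encoding pipeline outlined in the Methods, with illustrative sizes: each electrode's samples are turned into local binary pattern (LBP) codes, each code indexes a random hypervector, the electrode identity is bound in via XOR, and a prototype for a brain state (e.g., ictal) is the bitwise majority of all such bound hypervectors.

```python
import numpy as np

rng = np.random.default_rng(4)
D, LBP_BITS, N_ELECTRODES = 10000, 6, 36     # illustrative sizes

code_hv = rng.integers(0, 2, size=(2 ** LBP_BITS, D), dtype=np.uint8)
elec_hv = rng.integers(0, 2, size=(N_ELECTRODES, D), dtype=np.uint8)

def lbp_codes(signal):
    """Local binary patterns: each code packs the signs of successive differences."""
    diffs = (np.diff(signal) > 0).astype(np.uint8)
    windows = np.lib.stride_tricks.sliding_window_view(diffs, LBP_BITS)
    return windows @ (1 << np.arange(LBP_BITS))      # pack bits into an integer code

def encode_state(signals):
    """Bundle (majority vote) LBP hypervectors bound to their electrode identities."""
    acc = np.zeros(D, dtype=np.int64)
    count = 0
    for e, sig in enumerate(signals):                 # one time series per electrode
        for c in lbp_codes(sig):
            acc += np.bitwise_xor(code_hv[c], elec_hv[e])
            count += 1
    return (acc > count / 2).astype(np.uint8)

signals = rng.standard_normal((N_ELECTRODES, 256))    # toy iEEG window
ictal_prototype = encode_state(signals)
print(ictal_prototype[:20])
```

A query window encoded the same way is then classified by its Hamming distance to the ictal and interictal prototypes.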
Hyperdimensional computing nanosystem: in-memory computing using monolithic 3D integration of RRAM and CNFET
Abbas Rahimi, Tony F. Wu, Haitong Li, Jan M. Rabaey, H.-S. Philip Wong, Max M. Shulaker, Subhasish Mitra
Memristive Devices for Brain-Inspired Computing, pp. 195-219, 2020
Abstract
One viable solution for continuous reduction in energy-per-operation is to rethink functionality to cope with uncertainty by adopting computational approaches that are inherently robust to uncertainty. It requires a novel look at data representations, associated operations, and circuits, and at materials and substrates that enable them. 3D integrated nanotechnologies combined with novel brain-inspired computational paradigms that support fast learning and fault tolerance could lead the way. Recognizing the very size of the brain's circuits, hyperdimensional (HD) computing can model neural activity patterns with points in a HD space, that is, with hypervectors as large randomly generated patterns. At its very core, HD computing is about manipulating and comparing these patterns inside memory. Emerging nanotechnologies such as carbon nanotube field-effect transistors (CNFETs) and resistive RAM (RRAM), and their monolithic 3D integration offer opportunities for hardware implementations of HD computing through tight integration of logic and memory, energy-efficient computation, and unique device characteristics. We experimentally demonstrate and characterize an end-to-end HD computing nanosystem built using monolithic 3D integration of CNFETs and RRAM. With our nanosystem, we experimentally demonstrate classification of 21 languages with measured accuracy of up to 98% on >20,000 sentences (6.4 million characters), training using one text sample (100,000 characters) per language, and resilient operation (98% accuracy) despite 78% hardware errors in HD representation (outputs stuck at 0 or 1). By exploiting the unique properties of the underlying nanotechnologies, we show that HD computing, when implemented with monolithic 3D integration, can be up to 420x more energy-efficient while using 25x less area compared to traditional silicon complementary metal-oxide-semiconductor (CMOS) implementations.
In-memory hyperdimensional computing
G. Karunaratne, M. Le Gallo, G. Cherubini, L. Benini, A. Rahimi, A. Sebastian
Nature Electronics, 2020
Abstract
Hyperdimensional computing is an emerging computational framework that takes inspiration from attributes of neuronal circuits including hyperdimensionality, fully distributed holographic representation and (pseudo)randomness. When employed for machine learning tasks, such as learning and classification, the framework involves manipulation and comparison of large patterns within memory. A key attribute of hyperdimensional computing is its robustness to the imperfections associated with the computational substrates on which it is implemented. It is therefore particularly amenable to emerging non-von Neumann approaches such as in-memory computing, where the physical attributes of nanoscale memristive devices are exploited to perform computation. Here, we report a complete in-memory hyperdimensional computing system in which all operations are implemented on two memristive crossbar engines together with peripheral digital complementary metal-oxide-semiconductor (CMOS) circuits. Our approach can achieve a near-optimum trade-off between design complexity and classification accuracy based on three prototypical hyperdimensional computing-related learning tasks: language classification, news classification and hand gesture recognition from electromyography signals. Experiments using 760,000 phase-change memory devices performing analog in-memory computing achieve comparable accuracies to software implementations.
2019
Online Learning and Classification of EMG-Based Gestures on a Parallel Ultra-Low Power Platform Using Hyperdimensional Computing
Simone Benatti, Fabio Montagna, Victor Kartsch, Abbas Rahimi, Davide Rossi, Luca Benini
IEEE Transactions on Biomedical Circuits and Systems 13(3), 516-528, 2019
Abstract
This paper presents a wearable electromyographic gesture recognition system based on the hyperdimensional computing paradigm, running on a programmable parallel ultra-low-power (PULP) platform. The processing chain includes efficient on-chip training, which leads to a fully embedded implementation with no need to perform any offline training on a personal computer. The proposed solution has been tested on 10 subjects in a typical gesture recognition scenario, achieving 85% average accuracy on 11-gesture recognition, which is aligned with the state-of-the-art, with the unique capability of performing online learning. Furthermore, by virtue of the hardware-friendly algorithm and of the efficient PULP system-on-chip (Mr. Wolf) used for prototyping and evaluation, the energy budget required to run the learning part with 11 gestures is 10.04 mJ, and 83.2 $\mu$J per classification. The system works with an average power consumption of 10.4 mW in classification, ensuring around 29 h of autonomy with a 100 mAh battery. Finally, the scalability of the system is explored by increasing the number of channels (up to 256 electrodes), demonstrating the suitability of our approach as a universal, energy-efficient wearable biopotential recognition framework.
Hyperdimensional Computing-based Multimodality Emotion Recognition with Physiological Signals
En-Jui Chang, Abbas Rahimi, Luca Benini, An-Yeu Andy Wu
2019 IEEE International Conference on Artificial Intelligence Circuits and Systems (AICAS), pp. 137-141
Abstract
To interact naturally and achieve mutual sympathy between humans and machines, emotion recognition is one of the most important functions for realizing advanced human-computer interaction devices. Due to the high correlation between emotion and involuntary physiological changes, physiological signals are a prime candidate for emotion analysis. However, due to the need for a huge amount of training data for a high-quality machine learning model, computational complexity becomes a major bottleneck. To overcome this issue, brain-inspired hyperdimensional (HD) computing, an energy-efficient and fast learning computational paradigm, has a high potential to achieve a balance between accuracy and the amount of necessary training data. We propose an HD Computing-based Multimodality Emotion Recognition (HDC-MER). HDC-MER maps real-valued features to binary HD vectors using a random nonlinear function, further encodes them over time, and fuses across different modalities including GSR, ECG, and EEG. The experimental results show that, compared to the best method using the full training data, HDC-MER achieves higher classification accuracy for both valence (83.2% vs. 80.1%) and arousal (70.1% vs. 68.4%) using only 1/4 of the training data. HDC-MER also achieves at least 5% higher averaged accuracy compared to all the other methods at any point along the learning curve.
Laelaps: An Energy-Efficient Seizure Detection Algorithm from Long-term Human iEEG Recordings without False Alarms
Alessio Burrello, Lukas Cavigelli, Kaspar Schindler, Luca Benini, Abbas Rahimi
2019 Design, Automation & Test in Europe Conference & Exhibition (DATE), pp. 752-757
Abstract
We propose Laelaps, an energy-efficient and fast learning algorithm with no false alarms for epileptic seizure detection from long-term intracranial electroencephalography (iEEG) signals. Laelaps uses end-to-end binary operations by exploiting symbolic dynamics and brain-inspired hyperdimensional computing. Laelaps's results surpass those yielded by state-of-the-art (SoA) methods [1], [2], [3], including deep learning, on a new very large dataset containing 116 seizures of 18 drug-resistant epilepsy patients in 2656 hours of recordings, with each patient implanted with 24 to 128 iEEG electrodes. Laelaps trains 18 patient-specific models by using only 24 seizures: 12 models are trained with one seizure per patient, the others with two seizures. The trained models detect 79 out of 92 unseen seizures without any false alarms across all the patients, a big step forward in practical seizure detection. Importantly, a simple implementation of Laelaps on the Nvidia Tegra X2 embedded device achieves 1.7x-3.9x faster execution and 1.4x-2.9x lower energy consumption compared to the best result from the SoA methods. Our source code and anonymized iEEG dataset are freely available at http://ieeg-swez.ethz.ch.
Hardware Optimizations of Dense Binary Hyperdimensional Computing: Rematerialization of Hypervectors, Binarized Bundling, and Combinational Associative Memory
Manuel Schmuck, Luca Benini, Abbas Rahimi
ACM Journal on Emerging Technologies in Computing Systems 15(4), 2019
Abstract
Brain-inspired hyperdimensional (HD) computing models neural activity patterns of the very size of the brain's circuits with points of a hyperdimensional space, that is, with hypervectors. Hypervectors are D-dimensional (pseudo)random vectors with independent and identically distributed (i.i.d.) components constituting ultra-wide holographic words: D=10,000 bits, for instance. At its very core, HD computing manipulates a set of seed hypervectors to build composite hypervectors representing objects of interest. It demands memory optimizations with simple operations for an efficient hardware realization. In this article, we propose hardware techniques for optimizations of HD computing, in a synthesizable open-source VHDL library, to enable co-located implementation of both learning and classification tasks on only a small portion of Xilinx UltraScale FPGAs: (1) We propose simple logical operations to rematerialize the hypervectors on the fly rather than loading them from memory. These operations massively reduce the memory footprint by directly computing the composite hypervectors whose individual seed hypervectors do not need to be stored in memory. (2) Bundling a series of hypervectors over time requires a multibit counter per every hypervector component. We instead propose a binarized back-to-back bundling without requiring any counters. This truly enables on-chip learning with minimal resources as every hypervector component remains binary over the course of training to avoid otherwise multibit components. (3) For every classification event, an associative memory is in charge of finding the closest match between a set of learned hypervectors and a query hypervector by using a distance metric. This operator is proportional to hypervector dimension (D), and hence may take O(D) cycles per classification event. Accordingly, we significantly improve the throughput of classification by proposing associative memories that steadily reduce the latency of classification to the extreme of a single cycle. (4) We perform a design space exploration incorporating the proposed techniques on FPGAs for a wearable biosignal processing application as a case study. Our techniques achieve up to 2.39x area saving, or 2,337x throughput improvement. The Pareto optimal HD architecture is mapped on only 18,340 configurable logic blocks (CLBs) to learn and classify five hand gestures using four electromyography sensors.
Applications of computation-in-memory architectures based on memristive devices
Said Hamdioui, Hoang Anh Du Nguyen, Mottaqiallah Taouil, Abu Sebastian, Manuel Le Gallo, Sandeep Pande, Siebren Schaafsma, Francky Catthoor, Shidhartha Das, Fernando G. Redondo, et al.
2019 Design, Automation & Test in Europe Conference & Exhibition (DATE), pp. 486-491
Abstract
Today's computing architectures and device technologies are unable to meet the increasingly stringent demands on energy and performance posed by emerging applications. Therefore, alternative computing architectures are being explored that leverage novel post-CMOS device technologies. One of these is a Computation-in-Memory architecture based on memristive devices. This paper describes the concept of such an architecture and shows different applications that could significantly benefit from it. For each application, the algorithm, the architecture, the primitive operations, and the potential benefits are presented. The applications cover the domains of data analytics, signal processing, and machine learning.
Efficient Biosignal Processing Using Hyperdimensional Computing: Network Templates for Combined Learning and Classification of ExG Signals
Abbas Rahimi, Pentti Kanerva, Luca Benini, Jan M. Rabaey
Proceedings of the IEEE 107(1), 123-143, 2019
Abstract
Recognizing the very size of the brain's circuits, hyperdimensional (HD) computing can model neural activity patterns with points in a HD space, that is, with HD vectors. Key examined properties of HD computing include: a versatile set of arithmetic operations on HD vectors, generality, scalability, analyzability, one-shot learning, and energy efficiency. These make it a prime candidate for efficient biosignal processing where signals are noisy and nonstationary, training data sets are not huge, individual variability is significant, and energy-efficiency constraints are tight. Purely based on native HD computing operators, we describe a combined method for multiclass learning and classification of various ExG biosignals such as electromyography (EMG), electroencephalography (EEG), and electrocorticography (ECoG). We develop a full set of HD network templates that comprehensively encode body potentials and brain neural activity recorded from different electrodes into a single HD vector without requiring domain expert knowledge or an ad hoc electrode selection process. Such an encoded HD vector is processed as a single unit for fast one-shot learning and robust classification. It can be interpreted to identify the most useful features as well. Compared to state-of-the-art counterparts, HD computing enables online, incremental, and fast learning as it demands less than a third as much training data as well as less preprocessing.
Analysis of Contraction Effort Level in EMG-Based Gesture Recognition Using Hyperdimensional Computing
Ali Moin, Andy Zhou, Simone Benatti, Abbas Rahimi, Luca Benini, Jan M. Rabaey
arXiv: Human-Computer Interaction, 2019
Abstract
Varying contraction levels of muscles is a big challenge in electromyography-based gesture recognition. Some use cases require the classifier to be robust against varying force changes, while others demand distinguishing between different effort levels of performing the same gesture. We use the brain-inspired hyperdimensional computing paradigm to build classification models that are both robust to these variations and able to recognize multiple contraction levels. Experimental results on 5 subjects performing 9 gestures with 3 effort levels show up to a 39.17% accuracy drop when training and testing across different effort levels, with up to 30.35% recovery after applying our algorithm.
2018
HDNA: Energy-efficient DNA sequencing using hyperdimensional computing
Mohsen Imani, Tarek Nassar, Abbas Rahimi, Tajana Rosing
2018 IEEE EMBS International Conference on Biomedical & Health Informatics (BHI), pp. 271-274
Abstract
DNA sequencing has a vast number of applications in a multitude of applied fields including, but not limited to, medical diagnosis and biotechnology. In this paper, we propose HDNA to apply the concepts of hyperdimensional (HD) computing (computing with hypervectors) to DNA sequencing. HDNA first assigns holographic and (pseudo)random hypervectors to DNA bases. Using an encoder, it then exploits the orthogonality of these hypervectors to represent a DNA sequence by generating a class hypervector. The class hypervector keeps the information of the combined individual hypervectors (i.e., the DNA bases) with high probability. HDNA uses the same encoding to map a DNA sequence with unknown labels to a query hypervector and performs the classification task by checking the similarity of the query hypervector against all class hypervectors. Our experimental evaluation shows that HDNA can achieve 99.7% classification accuracy on the Empirical dataset, which is 5.2% higher than state-of-the-art techniques for the same dataset. Moreover, HDNA can improve the execution time and energy consumption of classification by 4.32x and 2.05x, respectively, when compared against prior techniques.
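As a hedged illustration of the encoder and classifier described above (the use of trigrams, the dimensionality, and the toy sequences are assumptions): each DNA base gets a random hypervector, n-grams are formed by permuting and binding, a class hypervector is the bundle of all n-gram hypervectors of the training sequences, and a query is labeled by its nearest class hypervector in Hamming distance.

```python
import numpy as np

rng = np.random.default_rng(6)
D, N = 10000, 3                                  # dimensionality and n-gram size (assumed)
base_hv = {b: rng.integers(0, 2, size=D, dtype=np.uint8) for b in "ACGT"}

def ngram_hv(ngram):
    """Bind the bases of an n-gram: permute by position, then XOR everything together."""
    hv = np.zeros(D, dtype=np.uint8)
    for i, b in enumerate(ngram):
        hv = np.bitwise_xor(hv, np.roll(base_hv[b], i))
    return hv

def encode(seq):
    """Bundle all n-gram hypervectors of a sequence into one (majority vote)."""
    grams = [ngram_hv(seq[i:i+N]) for i in range(len(seq) - N + 1)]
    return (np.sum(grams, axis=0) > len(grams) / 2).astype(np.uint8)

def hamming(a, b):
    return np.count_nonzero(a != b)

classes = {"classA": encode("ACGTACGTGGCA" * 20), "classB": encode("TTGACCATGACC" * 20)}
query = encode("ACGTACGTGGCA" * 5 + "TTGA")       # mostly class A content
print(min(classes, key=lambda c: hamming(classes[c], query)))
```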
Hyperdimensional Computing Nanosystem
Abbas Rahimi, Tony F. Wu, Haitong Li, Jan M. Rabaey, H. S. Philip Wong, Max M. Shulaker, Subhasish Mitra
arXiv preprint arXiv:1811.09557, 2018
Abstract
One viable solution for continuous reduction in energy-per-operation is to rethink functionality to cope with uncertainty by adopting computational approaches that are inherently robust to uncertainty. It requires a novel look at data representations, associated operations, and circuits, and at materials and substrates that enable them. 3D integrated nanotechnologies combined with novel brain-inspired computational paradigms that support fast learning and fault tolerance could lead the way. Recognizing the very size of the brain's circuits, hyperdimensional (HD) computing can model neural activity patterns with points in a HD space, that is, with hypervectors as large randomly generated patterns. At its very core, HD computing is about manipulating and comparing these patterns inside memory. Emerging nanotechnologies such as carbon nanotube field-effect transistors (CNFETs) and resistive RAM (RRAM), and their monolithic 3D integration offer opportunities for hardware implementations of HD computing through tight integration of logic and memory, energy-efficient computation, and unique device characteristics. We experimentally demonstrate and characterize an end-to-end HD computing nanosystem built using monolithic 3D integration of CNFETs and RRAM. With our nanosystem, we experimentally demonstrate classification of 21 languages with measured accuracy of up to 98% on >20,000 sentences (6.4 million characters), training using one text sample (~100,000 characters) per language, and resilient operation (98% accuracy) despite 78% hardware errors in HD representation (outputs stuck at 0 or 1). By exploiting the unique properties of the underlying nanotechnologies, we show that HD computing, when implemented with monolithic 3D integration, can be up to 420X more energy-efficient while using 25X less area compared to traditional silicon CMOS implementations.
An 826 MOPS, 210uW/MHz Unum ALU in 65 nm
Florian Glaser, Stefan Mach, Abbas Rahimi, Frank K. Gurkaynak, Qiuting Huang, Luca Benini
2018 IEEE International Symposium on Circuits and Systems (ISCAS)
Abstract
To overcome the limitations of conventional floating-point number formats, an interval arithmetic and variable-width storage format called universal number (unum) has been recently introduced [1]. This paper presents the first (to the best of our knowledge) silicon implementation measurements of an application-specific integrated circuit (ASIC) for unum floating-point arithmetic. The designed chip includes a 128-bit wide unum arithmetic unit to execute additions and subtractions, while also supporting lossless (for intermediate results) and lossy (for external data movements) compression units to exploit the memory usage reduction potential of the unum format. Our chip, fabricated in a 65 nm CMOS process, achieves a maximum clock frequency of 413 MHz at 1.2 V with an average measured power of 210uW/MHz.
Fast and Accurate Multiclass Inference for MI-BCIs Using Large Multiscale Temporal and Spectral Features
Michael Hersche, Tino Rellstab, Pasquale Davide Schiavone, Lukas Cavigelli, Luca Benini, Abbas Rahimi
arXiv preprint arXiv:1806.06823, 2018
Abstract
Accurate, fast, and reliable multiclass classification of electroencephalography (EEG) signals is a challenging task towards the development of motor imagery brain-computer interface (MI-BCI) systems. We propose enhancements to different feature extractors, along with a support vector machine (SVM) classifier, to simultaneously improve classification accuracy and execution time during training and testing. We focus on the well-known common spatial pattern (CSP) and Riemannian covariance methods, and significantly extend these two feature extractors to multiscale temporal and spectral cases. The multiscale CSP features achieve 73.70$\pm$15.90% (mean$\pm$ standard deviation across 9 subjects) classification accuracy that surpasses the state-of-the-art method [1], 70.6$\pm$14.70%, on the 4-class BCI competition IV-2a dataset. The Riemannian covariance features outperform the CSP by achieving 74.27$\pm$15.5% accuracy and executing 9x faster in training and 4x faster in testing. Using more temporal windows for Riemannian features results in 75.47$\pm$12.8% accuracy with 1.6x faster testing than CSP.
CLIM: A Cross-Level Workload-Aware Timing Error Prediction Model for Functional Units
Xun Jiao, Abbas Rahimi, Yu Jiang, Jianguo Wang, Hamed Fatemi, Jose Pineda de Gyvez, Rajesh K. Gupta
IEEE Transactions on Computers 67(6), 771-783, 2018
Abstract
Timing errors, which are caused by the timing violations of sensitized circuit paths, have emerged as an important threat to the reliability of synchronous digital circuits. To protect circuits from these timing errors, designers typically use a conservative timing margin, which leads to operational inefficiency. Existing adaptive approaches reduce such conservative margins by predicting the timing errors in advance and adjusting the timing margin adaptively. However, these error prediction approaches overlook the impact of input workload (i.e., operands) on path sensitization, thereby resulting in a loss of accuracy. The diversity of input operands leads to complex path sensitization behaviors, making them hard to represent in timing error modeling. In this paper, we propose CLIM, a cross-level workload-aware timing error prediction model for functional units (FUs). CLIM predicts whether there are timing errors in an FU at two levels: bit-level and value-level. At the bit level or value level, CLIM predicts each output bit or the entire output value as one of two classes, {timing correct, timing erroneous}, as a function of input workload and clock period, respectively. We apply supervised learning methods to construct CLIM, by using input operands, computation history, and circuit toggling as input features, and output timing classes as labels. These training data are collected from gate-level simulations (GLS) of post place-and-route designs in a TSMC 45nm process. We evaluate CLIM prediction accuracy for various FUs and compare it with baseline models. On average, CLIM exhibits 95 percent prediction accuracy at value-level, 97 percent at bit-level, and executes at a rate 173X faster than GLS. We utilize CLIM to analyze the value-level and bit-level reliability of FUs under random and real-world application workloads. At value-level, CLIM-based reliability estimation is within 2.8 percent deviation on average of detailed GLS ground truth. At bit-level, we introduce the concept of bit-level reliability specification of error-tolerant applications and compare this with the CLIM-based bit-level reliability estimation. By comparison, CLIM classifies the application quality into two classes: {acceptable, non-acceptable}. On average, 97 percent of application quality classifications are consistent with GLS ground truth.
One-shot Learning for iEEG Seizure Detection Using End-to-end Binary Operations: Local Binary Patterns with Hyperdimensional Computing
Alessio Burrello, Kaspar Schindler, Luca Benini, Abbas Rahimi
2018 IEEE Biomedical Circuits and Systems Conference (BioCAS)
Abstract
This paper presents an efficient binarized algorithm for both learning and classification of human epileptic seizures from intracranial electroencephalography (iEEG). The algorithm combines local binary patterns with brain-inspired hyperdimensional computing to enable end-to-end learning and inference with binary operations. The algorithm first transforms iEEG time series from each electrode into local binary pattern codes. Then atomic high-dimensional binary vectors are used to construct composite representations of seizures across all electrodes. For the majority of our patients (10 out of 16), the algorithm quickly learns from one or two seizures (i.e., one-/few-shot learning) and perfectly generalizes on 27 further seizures. For other patients, the algorithm requires three to six seizures for learning. Overall, our algorithm surpasses the state-of-the-art methods [1] for detecting 65 novel seizures with higher specificity and sensitivity, and lower memory footprint.
Classification and Recall With Binary Hyperdimensional Computing: Tradeoffs in Choice of Density and Mapping Characteristics
Denis Kleyko, Abbas Rahimi, Dmitri A. Rachkovskij, Evgeny Osipov, Jan M. Rabaey
IEEE Transactions on Neural Networks and Learning Systems 29(12), 5880-5898, 2018
Abstract
Hyperdimensional (HD) computing is a promising paradigm for future intelligent electronic appliances operating at low power. This paper discusses tradeoffs of selecting parameters of binary HD representations when applied to pattern recognition tasks. Particular design choices include density of representations and strategies for mapping data from the original representation. It is demonstrated that for the considered pattern recognition tasks (using synthetic and real-world data) both sparse and dense representations behave nearly identically. This paper also discusses implementation peculiarities which may favor one type of representations over the other. Finally, the capacity of representations of various densities is discussed.
Brain-inspired computing exploiting carbon nanotube FETs and resistive RAM: Hyperdimensional computing case study
Tony F. Wu, Haitong Li, Ping-Chen Huang, Abbas Rahimi, Jan M. Rabaey, H.-S. Philip Wong, Max M. Shulaker, Subhasish Mitra
2018 IEEE International Solid - State Circuits Conference - (ISSCC), pp. 492-494
Abstract
We demonstrate an end-to-end brain-inspired hyperdimensional (HD) computing nanosystem, effective for cognitive tasks such as language recognition, using heterogeneous integration of multiple emerging nanotechnologies. It uses monolithic 3D integration of carbon nanotube field-effect transistors (CNFETs, an emerging logic technology with significant energy-delay product (EDP) benefit vs. silicon CMOS [1]) and Resistive RAM (RRAM, an emerging memory that promises dense non-volatile and analog storage [2]), enabled by their low fabrication temperature. We demonstrate: 1. Classification of 21 languages with measured accuracy of up to 98% on >20,000 sentences (6.4 million characters) per language pair. 2. One-shot learning (i.e., learning from few examples) using one text sample (100,000 characters) per language. 3. Resilient operation (98% accuracy) despite 78% hardware errors (circuit outputs stuck at 0 or 1). Our HD nanosystem consists of 1,952 CNFETs integrated with 224 RRAM cells.
PULP-HD: accelerating brain-inspired high-dimensional computing on a parallel ultra-low power platform
Fabio Montagna, Abbas Rahimi, Simone Benatti, Davide Rossi, Luca Benini
Proceedings of the 55th Annual Design Automation Conference on , pp. 1-6, 2018
Abstract
Computing with high-dimensional (HD) vectors, also referred to as hypervectors, is a brain-inspired alternative to computing with scalars. Key properties of HD computing include a well-defined set of arithmetic operations on hypervectors, generality, scalability, robustness, fast learning, and ubiquitous parallel operations. HD computing is about manipulating and comparing large patterns---binary hypervectors with 10,000 dimensions---making its efficient realization on minimalistic ultra-low-power platforms challenging. This paper describes the acceleration of HD computing and the optimization of its memory accesses and operations on a silicon prototype of the PULPv3 4-core platform (1.5 mm2, 2 mW), surpassing the state-of-the-art classification accuracy (on average 92.4%) with a simultaneous 3.7x end-to-end speed-up and 2x energy saving compared to its single-core execution. We further explore the scalability of our accelerator by increasing the number of inputs and the classification window on a new generation of the PULP architecture featuring bit-manipulation instruction extensions and a larger number of cores (8). These together enable a near-ideal speed-up of 18.4x compared to the single-core PULPv3.
An EMG Gesture Recognition System with Flexible High-Density Sensors and Brain-Inspired High-Dimensional Classifier
Ali Moin, Andy Zhou, Abbas Rahimi, Simone Benatti, Alisha Menon, Senam Tamakloe, Jonathan Ting, Natasha Yamamoto, Yasser Khan, Fred Burghardt, Luca Benini, Ana C. Arias, Jan M. Rabaey
2018 IEEE International Symposium on Circuits and Systems (ISCAS), pp. 1-5
Abstract
EMG-based gesture recognition shows promise for human-machine interaction. Systems are often afflicted by signal and electrode variability which degrades performance over time. We present an end-to-end system combating this variability using a large-area, high-density sensor array and a robust classification algorithm. EMG electrodes are fabricated on a flexible substrate and interfaced to a custom wireless device for 64-channel signal acquisition and streaming. We use brain-inspired high-dimensional (HD) computing for processing EMG features in one-shot learning. The HD algorithm is tolerant to noise and electrode misplacement and can quickly learn from few gestures without gradient descent or back-propagation. We achieve an average classification accuracy of 96.64% for five gestures, with only 7% degradation when training and testing across different days. Our system maintains this accuracy when trained with only three trials of gestures; it also demonstrates comparable accuracy with the state-of-the-art when trained with one trial.
2017
VoiceHD: Hyperdimensional Computing for Efficient Speech Recognition
Mohsen Imani, Deqian Kong, Abbas Rahimi, Tajana Rosing
2017 IEEE International Conference on Rebooting Computing (ICRC), pp. 1-8
Abstract
In this paper, we propose VoiceHD, a novel speech recognition technique based on brain-inspired hyperdimensional (HD) computing. VoiceHD maps preprocessed voice signals in the frequency domain to random hypervectors and combines them to compute a hypervector (as learned patterns) representing each class. During inference, VoiceHD similarly computes a query hypervector; the classification task is done by checking the similarity of the query hypervector with all learned hypervectors and finding the class with the highest similarity. We further extend VoiceHD to VoiceHD+NN, which uses a neural network with a single small hidden layer to improve the similarity measures. This neural network is a small block directly operating on the similarity outputs of VoiceHD to slightly improve the classification accuracy. We evaluate the efficiency of VoiceHD and VoiceHD+NN compared to a deep neural network with three large hidden layers on the Isolet spoken letter dataset. Our benchmarking results on CPU show that VoiceHD and VoiceHD+NN provide 11.9X and 8.5X higher energy efficiency, 5.3X and 4.0X faster testing time, and 4.6X and 2.9X faster training time compared to the deep neural network, while providing marginally better classification accuracy.
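The training and inference flow can be summarized as follows (a simplified sketch: the frequency-bin encoder, the quantization, and the synthetic two-class data are stand-ins, not the Isolet setup): encoded training examples of a class are bundled into one class hypervector, and a query is assigned to the class whose hypervector is most similar.

```python
import numpy as np

rng = np.random.default_rng(9)
D, N_BINS, LEVELS = 10000, 64, 10          # illustrative: 64 frequency bins, 10 amplitude levels

bin_hv = rng.integers(0, 2, size=(N_BINS, D), dtype=np.uint8)
lvl_hv = rng.integers(0, 2, size=(LEVELS, D), dtype=np.uint8)

def encode(spectrum):
    """Bind each frequency bin to its quantized amplitude and bundle across bins."""
    q = np.clip((spectrum * LEVELS).astype(int), 0, LEVELS - 1)
    bound = np.bitwise_xor(bin_hv, lvl_hv[q])
    return (bound.sum(axis=0) > N_BINS / 2).astype(np.uint8)

def train(examples_per_class):
    """One class hypervector per class: the bitwise majority of its encoded examples."""
    return {c: (np.sum([encode(x) for x in xs], axis=0) > len(xs) / 2).astype(np.uint8)
            for c, xs in examples_per_class.items()}

def classify(spectrum, class_hvs):
    q = encode(spectrum)
    return min(class_hvs, key=lambda c: np.count_nonzero(class_hvs[c] != q))

# Synthetic 2-class toy data: spectra drawn around two different templates.
templates = {"A": rng.random(N_BINS), "B": rng.random(N_BINS)}
data = {c: [np.clip(t + 0.05 * rng.standard_normal(N_BINS), 0, 1) for _ in range(20)]
        for c, t in templates.items()}
model = train(data)
print(classify(np.clip(templates["A"] + 0.05 * rng.standard_normal(N_BINS), 0, 1), model))
```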
Human-centric computing: The case for a Hyper-Dimensional approach
Jan Rabaey, Abbas Rahimi, Sohum Datta, Miles Rusch, Pentti Kanerva, Bruno Olshausen
2017 7th IEEE International Workshop on Advances in Sensors and Interfaces (IWASI), pp. 29-29
Abstract
Some of the most compelling application domains of the IoT and Swarm concepts relate to how humans interact with the world around them and the cyberworld beyond. While the proliferation of communication and data processing devices has profoundly altered our interaction patterns, little has changed in the way we process inputs (sensory) and outputs (actuation). The combination of IoT (Swarms) and wearable devices offers the potential for changing all of this, opening the door for true human augmentation. The epitome of this would be a direct interface to the human brain. Yet, making sense of the plethora of information received from often noisy sensors and making reliable decisions within very tight latency bounds (< 10 ms) typically demands huge computational workloads to be performed in wearable form factors at extreme energy efficiency. In this presentation, we will make the case why alternative non-Von Neumann computational paradigms and architectures may be the right choice for these cognitive processing tasks. Even more, we will focus on a computational model called Hyper-Dimensional Computing (HDC), and illustrate with concrete examples why this approach may be the right one in a post-Moore data-driven arena.
Hyperdimensional computing for noninvasive brain-computer interfaces: Blind and one-shot classification of EEG error-related potentials
Abbas Rahimi, Pentti Kanerva, Jose del R. Millan, Jan M. Rabaey
Proceedings of the 10th EAI International Conference on Bio-inspired Information and Communications Technologies (formerly BIONETICS), 2017
Abstract
The mathematical properties of high-dimensional (HD) spaces show remarkable agreement with behaviors controlled by the brain. Computing with HD vectors, referred to as "hypervectors," is a brain-inspired alternative to computing with numbers. HD computing is characterized by generality, scalability, robustness, and fast learning, making it a prime candidate for utilization in application domains such as brain-computer interfaces. We describe the use of HD computing to classify electroencephalography (EEG) error-related potentials for noninvasive brain-computer interfaces. Our algorithm encodes neural activity recorded from 64 EEG electrodes to a single temporal-spatial hypervector. This hypervector represents the event of interest and is used for recognition of the subject's intentions. Using the full set of training trials, HD computing achieves on average 5% higher accuracy compared to a conventional machine learning method on this task (74.5% vs. 69.5%) and offers further advantages: (1) Our algorithm learns fast, surpassing the conventional method with an average accuracy of 70.5% while using only 34% of the training trials. (2) The conventional method requires prior domain expert knowledge to carefully select a subset of electrodes for a subsequent preprocessor and classifier, whereas our algorithm blindly uses all 64 electrodes, tolerates noise in the data, and the resulting hypervector is intrinsically clustered in HD space; in addition, most preprocessing of the electrode signal can be eliminated while maintaining an average accuracy of 71.7%.
SLoT: A supervised learning model to predict dynamic timing errors of functional units
Xun Jiao, Yu Jiang, Abbas Rahimi, Rajesh K. Gupta
Design, Automation & Test in Europe Conference & Exhibition (DATE), 2017, pp. 1183-1188
Abstract
Dynamic timing errors (DTEs), which are caused by the timing violations of sensitized critical timing paths, have emerged as an important threat to the reliability of digital circuits. Existing approaches model DTEs without considering the impact of input operands on dynamic path sensitization, resulting in a loss of accuracy. The diversity of input operands leads to complex path sensitization behaviors, making them hard to represent in DTE modeling. In this paper, we propose SLoT, a supervised learning model that predicts the output of functional units (FUs) as one of two timing classes, {timing correct, timing erroneous}, as a function of input operands and clock period. We apply the random forest classification (RFC) method to construct SLoT, using input operands, computation history, and circuit toggling as input features, and output timing classes as labels. The output timing classes are measured using gate-level simulation (GLS) of a post place-and-route design in a TSMC 45nm process. For evaluation, we apply SLoT to several FUs; on average 95% of predictions are consistent with GLS, which is 6.3X higher than the existing instruction-level model. SLoT-based reliability analysis of FUs under different datasets achieves 0.7-4.8% average difference compared with GLS-based analysis, and executes more than 20X faster than GLS.
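A toy reconstruction of the modeling setup with scikit-learn, using synthetic operands and a synthetic labeling rule purely for illustration (the real labels come from gate-level simulation of the post place-and-route design): operand values, operand history, toggle counts, and clock period serve as features for a random forest that predicts the timing class.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(7)
N = 5000
# Illustrative features: current operands, previous operands (history), and toggled bits.
op_a, op_b = rng.integers(0, 2**16, size=(2, N))
prev_a, prev_b = np.roll(op_a, 1), np.roll(op_b, 1)
toggles = np.array([bin(a ^ pa).count("1") + bin(b ^ pb).count("1")
                    for a, b, pa, pb in zip(op_a, op_b, prev_a, prev_b)])
clock_period = rng.choice([0.8, 0.9, 1.0], size=N)          # ns, illustrative

X = np.column_stack([op_a, op_b, prev_a, prev_b, toggles, clock_period])
# Synthetic label: pretend heavy toggling at tight clock periods causes timing errors.
y = ((toggles > 18) & (clock_period < 1.0)).astype(int)

model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X[:4000], y[:4000])
print("held-out accuracy:", model.score(X[4000:], y[4000:]))
```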
Hyperdimensional Computing for Blind and One-Shot Classification of EEG Error-Related Potentials
Abbas Rahimi, Artiom Tchouprina, Pentti Kanerva, Jose del R. Millan, Jan M. Rabaey
Mobile Networks and Applications, 1-12, 2017
Abstract
The mathematical properties of high-dimensional (HD) spaces show remarkable agreement with behaviors controlled by the brain. Computing with HD vectors, referred to as "hypervectors," is a brain-inspired alternative to computing with numbers. HD computing is characterized by generality, scalability, robustness, and fast learning, making it a prime candidate for utilization in application domains such as brain-computer interfaces. We describe the use of HD computing to classify electroencephalography (EEG) error-related potentials for noninvasive brain-computer interfaces. Our algorithm naturally encodes neural activity recorded from 64 EEG electrodes to a single temporal-spatial hypervector without requiring any electrode selection process. This hypervector represents the event of interest, can be analyzed to identify the most discriminative electrodes, and is used for recognition of the subject's intentions. Using the full set of training trials, HD computing achieves on average 5% higher single-trial classification accuracy compared to a conventional machine learning method on this task (74.5% vs. 69.5%) and offers further advantages: (1) Our algorithm learns fast: using only 34% of training trials it achieves an average accuracy of 70.5%, surpassing the conventional method. (2) The conventional method requires prior domain expert knowledge, or a separate process, to carefully select a subset of electrodes for a subsequent preprocessor and classifier, whereas our algorithm blindly uses all 64 electrodes, tolerates noise in the data, and the resulting hypervector is intrinsically clustered in HD space; in addition, most preprocessing of the electrode signal can be eliminated while maintaining an average accuracy of 71.7%.
High-Dimensional Computing as a Nanoscalable Paradigm
Abbas Rahimi, Sohum Datta, Denis Kleyko, Edward Paxon Frady, Bruno Olshausen, Pentti Kanerva, Jan M. Rabaey
IEEE Transactions on Circuits and Systems I-regular Papers 64(9), 2508-2521, 2017
Abstract
We outline a model of computing with high-dimensional (HD) vectors, where the dimensionality is in the thousands. It is built on ideas from traditional (symbolic) computing and artificial neural nets/deep learning, and complements them with ideas from probability theory, statistics, and abstract algebra. Key properties of HD computing include a well-defined set of arithmetic operations on vectors, generality, scalability, robustness, fast learning, and ubiquitous parallel operation, making it possible to develop efficient algorithms for large-scale real-world tasks. We present a 2-D architecture and demonstrate its functionality with examples from text analysis, pattern recognition, and biosignal processing, while achieving high levels of classification accuracy (close to or above conventional machine-learning methods), energy efficiency, and robustness with simple algorithms that learn fast. HD computing is ideally suited for 3-D nanometer circuit technology, vastly increasing circuit density and energy efficiency, and paving a way to systems capable of advanced cognitive tasks.
Exploring Hyperdimensional Associative Memory
Mohsen Imani, Abbas Rahimi, Deqian Kong, Tajana Rosing, Jan M. Rabaey
2017 IEEE International Symposium on High Performance Computer Architecture (HPCA), pp. 445-456
Abstract
Brain-inspired hyperdimensional (HD) computing emulates cognition tasks by computing with hypervectors as an alternative to computing with numbers. At its very core, HD computing is about manipulating and comparing large patterns, stored in memory as hypervectors: the input symbols are mapped to a hypervector and an associative search is performed for reasoning and classification. For every classification event, an associative memory is in charge of finding the closest match between a set of learned hypervectors and a query hypervector by using a distance metric. Hypervectors with i.i.d. components allow such a memory-centric architecture to tolerate a massive number of errors, which eases the combination of various methodological design approaches for boosting energy efficiency and scalability. This paper proposes architectural designs for a hyperdimensional associative memory (HAM) to facilitate energy-efficient, fast, and scalable search operations using three widely used design approaches. These HAM designs search for the nearest Hamming distance, scale linearly with the number of dimensions in the hypervectors, and explore a large design space with orders of magnitude higher efficiency. First, we propose a digital CMOS-based HAM (D-HAM) that modularly scales to any dimension. Second, we propose a resistive HAM (R-HAM) that exploits the timing-discharge characteristic of nonvolatile resistive elements to approximately compute Hamming distances at a lower cost. Finally, we combine this resistive characteristic with a current-based search method to design an analog HAM (A-HAM) that results in a faster and denser alternative. Our experimental results show that R-HAM and A-HAM improve the energy-delay product by 9.6x and 1347x, respectively, compared to D-HAM, while maintaining a moderate accuracy of 94% in language recognition.
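The search itself can be stated in a few lines of software; the sketch below is a behavioral model of nearest-Hamming-distance matching over learned class hypervectors (with made-up class names and noise), not a model of the D-HAM/R-HAM/A-HAM circuits.

import numpy as np

rng = np.random.default_rng(2)
D = 10000
classes = {name: rng.integers(0, 2, D, dtype=np.uint8)
           for name in ["english", "french", "german"]}     # hypothetical learned hypervectors

def hamming(a, b):
    return int(np.count_nonzero(a != b))

def associative_search(query):
    # return the learned class whose hypervector is closest to the query
    return min(classes, key=lambda name: hamming(classes[name], query))

query = classes["french"].copy()
flip = rng.choice(D, size=2000, replace=False)               # corrupt 20% of the bits
query[flip] ^= 1
print(associative_search(query))                             # still recovers "french"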
2016
Resistive Bloom filters: From approximate membership to approximate computing with bounded errors
Vahideh Akhlaghi, Abbas Rahimi, Rajesh K. Gupta
2016 Design, Automation & Test in Europe Conference & Exhibition (DATE), pp. 1441-1444
Abstract
Approximate computing provides an opportunity to exploit application characteristics and trade accuracy for gains in energy efficiency. To exploit such an opportunity, however, the system designer must be able to bound the error exposed to the application developer. Space-efficient probabilistic data structures such as Bloom filters can provide one such means. A Bloom filter supports approximate set membership queries with a tunable rate of false positives (i.e., errors) and no false negatives. We propose a resistive Bloom filter (ReBF) to approximate a function by tightly integrating it with the functional unit (FU) implementing that function. The ReBF approximately mimics partial functionality of the FU by recalling its frequent input patterns for computational reuse. The accuracy of the target FU is guaranteed by bounding the ReBF error behavior at design time. We further lower the energy consumption of an FU by designing its ReBF using low-power memristor arrays. The experimental results show that function approximation using ReBF for five image processing kernels running on the AMD Southern Islands GPU yields on average 24.1% energy savings in 45 nm technology compared to exact computation.
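The following toy software Bloom filter illustrates the reuse idea in the abstract: frequent operand patterns of a function are remembered so a later query can skip recomputation, with a false-positive rate tuned by the filter size and hash count. The sizes, hash choice, and operand pairs are assumptions for the example; the paper realizes the filter with memristive arrays next to the FU.

import hashlib

class BloomFilter:
    def __init__(self, m_bits=1024, k_hashes=3):
        self.m, self.k = m_bits, k_hashes
        self.bits = bytearray(m_bits)

    def _positions(self, item):
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, item):
        for p in self._positions(item):
            self.bits[p] = 1

    def __contains__(self, item):
        # no false negatives; false positives occur at a tunable rate
        return all(self.bits[p] for p in self._positions(item))

frequent_inputs = BloomFilter()
for operands in [(3, 5), (7, 2), (3, 5)]:      # profiled frequent operand pairs (made up)
    frequent_inputs.add(operands)

print((3, 5) in frequent_inputs)               # True: reuse the memoized result
print((9, 9) in frequent_inputs)               # almost certainly False: execute the FU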
Grater: An approximation workflow for exploiting data-level parallelism in FPGA acceleration
Atieh Lotfi, Abbas Rahimi, Amir Yazdanbakhsh, Hadi Esmaeilzadeh, Rajesh K. Gupta
2016 Design, Automation & Test in Europe Conference & Exhibition (DATE), pp. 1279-1284
Abstract
Modern applications including graphics, multimedia, web search, and data analytics not only benefit from acceleration, but also exhibit significant degrees of tolerance to imprecise computation. This amenability to approximation provides an opportunity to trade quality of results for higher performance and better resource utilization. Exploiting this opportunity is particularly important for FPGA accelerators, which are inherently subject to many resource constraints. To better utilize the FPGA resources, we devise Grater, an automated design workflow for FPGA accelerators that leverages imprecise computation to increase data-level parallelism and achieve higher computational throughput. The core of our workflow is a source-to-source compiler that takes in an input kernel and applies a novel optimization technique that selectively reduces the precision of the kernel's data and operations. By selectively reducing the precision of the data and operations, the area required to synthesize the kernels on the FPGA decreases, allowing a larger number of operations and parallel kernels to be integrated in the fixed area of the FPGA. The larger number of integrated kernels provides more hardware context to better exploit data-level parallelism in the target applications. To effectively explore the design space of approximate kernels, we use a genetic algorithm to find a subset of safe-to-approximate operations and data elements and then tune their precision levels until the desired output quality is achieved. Grater is a fully software technique and does not require any changes to the underlying FPGA hardware. We evaluate Grater on a diverse set of data-intensive OpenCL benchmarks from the AMD SDK. The synthesis results on a modern Altera FPGA show that our approximation workflow yields 1.4-3.0x higher throughput with less than 1% quality loss.
Variability Mitigation in Nanometer CMOS Integrated Systems: A Survey of Techniques From Circuits to Software
Abbas Rahimi, Luca Benini, Rajesh K. Gupta
Proceedings of the IEEE 104(7), 1410-1448, 2016
Abstract
Variation in performance and power across manufactured parts and their operating conditions is an accepted reality in modern microelectronic manufacturing processes with geometries in nanometer scales. This article surveys challenges and opportunities in identifying variations, their effects, and methods to combat these variations for improved microelectronic devices. We focus on computing devices and their design at various levels to combat variability. First, we provide a review of key concepts with particular emphasis on timing errors caused by various variability sources. We consider methods to predict and prevent, detect and correct, and finally conditions under which such errors can be accepted; we also consider their implications on cost, performance, and quality. We provide a comparative evaluation of methods for deployment across various layers of the system from circuits and architecture to application software. These can be combined in various ways to achieve specific goals related to observability and controllability of the variability effects, providing means to achieve cross-layer or hybrid resilience. We then provide examples of real-world resilient single-core and parallel architectures. We find that parallel architectures, and parallelism in general, provide the best means to combat and exploit variability to design resilient and efficient systems. Using programmable accelerator architectures such as clustered processing elements and GP-GPUs, we show how system designers can coordinate the propagation of timing error information and its effects along with new techniques for memoization (i.e., spatial or temporal reuse of computation). This discussion naturally leads to the use of these techniques in the emerging area of "approximate computing," and to how they can be used in building resilient and efficient computing systems. We conclude with an outlook for the emerging field.
Associative Memristive Memory for Approximate Computing in GPUs
Amirali Ghofrani, Abbas Rahimi, Miguel A. Lastras-Montano, Luca Benini, Rajesh K. Gupta, Kwang-Ting Cheng
IEEE Journal on Emerging and Selected Topics in Circuits and Systems 6(2), 222-234, 2016
Abstract
Using associative memories to enable computing-with-memory is a promising approach to improve energy efficiency. Associative memories can be tightly coupled with processing elements to store and later recall function responses for a subset of input values. This approach avoids the actual function execution on the processing element to save energy. The challenge, however, is to reduce the energy consumption of the associative memory modules themselves. Here we address the challenge of designing ultra-low-power associative memories. We use memristive parts for memory implementation and demonstrate the energy-saving potential of integrating associative memristive memory (AMM) into graphics processing units (GPUs). To reduce the energy consumption of AMM modules, we leverage approximate computing, which benefits from application-level tolerance to errors: we employ voltage overscaling on AMM modules, which deliberately relaxes their search criteria to approximately match stored patterns within a 2-bit Hamming distance of the search pattern. This introduces some errors to the computation that are tolerable for the target applications. We further reduce energy consumption by employing purely resistive crossbar architectures for AMM modules. To evaluate the proposed architecture, we integrate AMM modules with floating point units in an AMD Southern Islands GPU and run four image processing kernels on the AMM-integrated GPU. Our experimental results show that employing AMM modules reduces the energy consumption of running these kernels by 23%-45%, compared to a baseline GPU without AMM. The image processing kernels tolerate errors resulting from approximate search operations, maintaining an acceptable image quality, i.e., a PSNR above 30 dB.
Hyperdimensional computing with 3D VRRAM in-memory kernels: Device-architecture co-design for energy-efficient, error-resilient language recognition
Haitong Li, Tony F. Wu, Abbas Rahimi, Kai-Shin Li, Miles Rusch, Chang-Hsien Lin, Juo-Luen Hsu, Mohamed M. Sabry, S. Burc Eryilmaz, Joon Sohn, Wen-Cheng Chiu, Min-Cheng Chen, Tsung-Ta Wu, Jia-Min Shieh, Wen-Kuan Yeh, Jan M. Rabaey, Subhasish Mitra, H.-S. Philip Wong
2016 IEEE International Electron Devices Meeting (IEDM)
Abstract
The ability to learn from few examples, known as one-shot learning, is a hallmark of human cognition. Hyperdimensional (HD) computing is a brain-inspired computational framework capable of one-shot learning, using random binary vectors with high dimensionality. Device-architecture co-design of HD cognitive computing systems using 3D VRRAM/CMOS is presented for language recognition. Multiplication-addition-permutation (MAP), the central operations of HD computing, are experimentally demonstrated on 4-layer 3D VRRAM/FinFET as non-volatile in-memory MAP kernels. Extensive cycle-to-cycle (up to 10^12 cycles) and wafer-level device-to-device (256 RRAMs) experiments are performed to validate reproducibility and robustness. For the 28-nm node, the 3D in-memory architecture reduces total energy consumption by 52.2% with 412 times less area compared with an LP digital design (using registers as memory), owing to the energy-efficient VRRAM MAP kernels and dense connectivity. Meanwhile, the system trained with 21 sample texts achieves 90.4% accuracy recognizing 21 European languages on 21,000 test sentences. Hard-error analysis shows the HD architecture is remarkably resilient to RRAM endurance failures, making the use of various types of RRAMs/CBRAMs (1k-10M endurance) feasible.
Resistive configurable associative memory for approximate computing
Mohsen Imani, Abbas Rahimi, Tajana S. Rosing
2016 Design, Automation & Test in Europe Conference & Exhibition (DATE), pp. 1327-1332
Abstract
Modern computing machines are increasingly characterized by large-scale parallelism in hardware (such as GPGPUs) and the advent of large-scale and innovative memory blocks. Parallelism enables expanded performance tradeoffs, whereas memories enable reuse of computational work. To be effective, however, one needs to ensure energy efficiency with minimal reuse overheads. In this paper, we describe a resistive configurable associative memory (ReCAM) that enables selective approximation and asymmetric voltage overscaling to manage delivered efficiency. The ReCAM structure matches an input pattern with pre-stored ones by applying an approximate search on selected bit indices (bitline-configurable) or on selected pre-stored patterns (row-configurable). To further reduce energy, we explore proper ReCAM sizing, various configurable search operations with low-overhead voltage overscaling, and different ReCAM update policies. Experimental results on the AMD Southern Islands GPUs for eight applications show that bitline-configurable and row-configurable ReCAM achieve on average 43.6% and 44.5% energy savings, respectively, with an acceptable quality loss of 10%.
Hyperdimensional biosignal processing: A case study for EMG-based hand gesture recognition
Abbas Rahimi, Simone Benatti, Pentti Kanerva, Luca Benini, Jan M. Rabaey
2016 IEEE International Conference on Rebooting Computing (ICRC), pp. 1-8
Abstract
The mathematical properties of high-dimensional spaces seem remarkably suited for describing behaviors produced by brains. Brain-inspired hyperdimensional computing (HDC) explores the emulation of cognition by computing with hypervectors as an alternative to computing with numbers. Hypervectors are high-dimensional, holographic, and (pseudo)random with independent and identically distributed (i.i.d.) components. These features provide an opportunity for energy-efficient computing applied to cyberbiological and cybernetic systems. We describe the use of HDC in a smart prosthetic application, namely hand gesture recognition from a stream of electromyography (EMG) signals. Our algorithm encodes a stream of analog EMG signals, simultaneously generated from four channels, into a single hypervector. The proposed encoding effectively captures spatial and temporal relations across and within the channels to represent a gesture. This HDC encoder achieves a high level of classification accuracy (97.8%) with only 1/3 the training data required by a state-of-the-art SVM on the same task. HDC exhibits fast and accurate learning, explicitly allowing online and continuous learning. We further enhance the encoder to adaptively mitigate the effect of gesture-timing uncertainties across different subjects endogenously; moreover, the encoder inherently maintains the same accuracy when there is up to 30% overlap between two consecutive gestures in a classification window.
A Robust and Energy-Efficient Classifier Using Brain-Inspired Hyperdimensional Computing
Abbas Rahimi, Pentti Kanerva, Jan M. Rabaey
Proceedings of the 2016 International Symposium on Low Power Electronics and Design, pp. 64-69
Abstract
The mathematical properties of high-dimensional (HD) spaces show remarkable agreement with behaviors controlled by the brain. Computing with HD vectors, referred to as "hypervectors," is a brain-inspired alternative to computing with numbers. Hypervectors are high-dimensional, holographic, and (pseudo)random with independent and identically distributed (i.i.d.) components. They provide for energy-efficient computing while tolerating hardware variation typical of nanoscale fabrics. We describe a hardware architecture for a hypervector-based classifier and demonstrate it with language identification from letter trigrams. The HD classifier is 96.7% accurate, 1.2% lower than a conventional machine learning method, operating with half the energy. Moreover, the HD classifier is able to tolerate 8.8-fold probability of failure of memory cells while maintaining 94% accuracy. This robust behavior with erroneous memory cells can significantly improve energy efficiency.
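A minimal software sketch of the letter-trigram encoding described here is shown below; the tiny training snippets and the XOR/permutation choices are placeholders for illustration, whereas a real run would use full language corpora.

import numpy as np

D = 10000
rng = np.random.default_rng(3)
letters = {c: rng.integers(0, 2, D, dtype=np.int8) for c in "abcdefghijklmnopqrstuvwxyz "}

def trigram_hv(text):
    # permute letter hypervectors by position, bind with XOR, bundle by majority
    grams = list(zip(text, text[1:], text[2:]))
    acc = np.zeros(D, dtype=np.int32)
    for a, b, c in grams:
        acc += np.roll(letters[a], 2) ^ np.roll(letters[b], 1) ^ letters[c]
    return (acc > len(grams) / 2).astype(np.int8)

train = {"en": "the quick brown fox jumps over the lazy dog",
         "de": "der schnelle braune fuchs springt ueber den faulen hund"}
profiles = {lang: trigram_hv(text) for lang, text in train.items()}   # one hypervector per class

def classify(snippet):
    q = trigram_hv(snippet)
    return min(profiles, key=lambda lang: np.count_nonzero(profiles[lang] != q))

print(classify("the lazy dog jumps"))          # expected to lean toward "en"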
2015
Axilog: Abstractions for Approximate Hardware Design and Reuse
Divya Mahajan, Kartik Ramkrishnan, Rudra Jariwala, Amir Yazdanbakhsh, Jongse Park, Bradley Thwaites, Anandhavel Nagendrakumar, Abbas Rahimi, Hadi Esmaeilzadeh, Kia Bazargan
IEEE Micro 35(5), 16-30, 2015
Abstract
Relaxing the traditional abstraction of "near-perfect" accuracy in hardware design can yield significant gains in efficiency, area, and performance. To exploit this opportunity, there is a need for design abstractions and synthesis tools that can systematically incorporate approximation in hardware design. The authors define Axilog, a set of language extensions for Verilog that provides the necessary syntax and semantics for approximate hardware design and reuse. Axilog lets designers safely relax the accuracy requirements in the design while keeping the critical parts strictly precise. Axilog is coupled with a Safety Inference Analysis that automatically infers the safe-to-approximate gates and connections from the annotations. The analysis provides formal guarantees that the safe-to-approximate parts of the design strictly adhere to the designer's intentions. The authors devise two synthesis flows that leverage Axilog's framework for safe approximation: one relaxes the timing requirements and the other resizes gates. They evaluate Axilog using a diverse set of benchmarks that gain 1.54x average energy savings and 1.82x average area reduction with 10 percent output quality loss. The results show that the intuitive nature of the language extensions, coupled with the automated analysis, enables safe approximation of designs even with thousands of lines of code.
Supervised learning based model for predicting variability-induced timing errors
Xun Jiao, Abbas Rahimi, Balakrishnan Narayanaswamy, Hamed Fatemi, Jose Pineda de Gyvez, Rajesh K. Gupta
2015 IEEE 13th International New Circuits and Systems Conference (NEWCAS), pp. 1-4
Abstract
Circuit designers typically combat variations in hardware and workload by increasing conservative guardbanding that leads to operational inefficiency. Reducing this excessive guardband is highly desirable, but causes timing errors in synchronous circuits. We propose a methodology for supervised learning based models to predict timing errors at bit-level. We show that a logistic regression based model can effectively predict timing errors, for a given amount of guardband reduction. The proposed methodology enables a model-based rule method to reduce guardband subject to a required bit-level reliability specification. For predicting timing errors at bit-level, the proposed model generation automatically uses a binary classifier per output bit that captures the circuit path sensitization. We train and test our model on gate-level simulations with timing error information extracted from an ASIC flow that considers physical details of placed-and-routed single-precision pipelined floating-point units (FPUs) in 45nm TSMC technology. We further assess the robustness of our modeling methodology by considering various operating voltage and temperature corners. Our model predicts timing errors with an average accuracy of 95% for unseen input workload. This accuracy can be used to achieve a 0%-15% guardband reduction for FPUs, while satisfying the reliability specification for four error-tolerant applications.
Aging-Aware Compilation for GP-GPUs
Atieh Lotfi, Abbas Rahimi, Luca Benini, Rajesh K. Gupta
ACM Transactions on Architecture and Code Optimization 12(2), 2015
Abstract
General-purpose graphic processing units (GP-GPUs) offer high computational throughput using thousands of integrated processing elements (PEs). These PEs are stressed during workload execution, and negative bias temperature instability (NBTI) adversely affects their reliability by introducing new delay-induced faults. However, the effect of these delay variations is not uniformly spread across the PEs: some are affected more (hence less reliable) than others. This variation causes a significant reduction in the lifetime of GP-GPU parts. In this article, we address the problem of "wear leveling" across processing units to mitigate lifetime uncertainty in GP-GPUs. We propose innovations in the static compiled code that can improve healing in PEs and stream cores (SCs) based on their degradation status. PE healing is a fine-grained very long instruction word (VLIW) slot assignment scheme that balances the stress of instructions across the PEs within an SC. SC healing is a coarse-grained workload allocation scheme that distributes workload across SCs in GP-GPUs. Both schemes share a common property: they adaptively shift workload from less reliable units to more reliable units, either spatially or temporally. These software schemes are based on online calibration with NBTI monitoring that equalizes the expected lifetime of PEs and SCs by regenerating adaptive compiled code to respond to the specific health state of the GP-GPUs. We evaluate the effectiveness of the proposed schemes for various OpenCL kernels from the AMD APP SDK on the Evergreen and Southern Islands GPU architectures. The aging-aware healthy kernels generated by the PE (or SC) healing scheme reduce the NBTI-induced voltage threshold shift by 30% (77% in the case of SCs), with no (moderate) performance penalty compared to the naive kernels.
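The wear-leveling intuition behind PE healing can be captured in a few lines: steer each new instruction slot to the currently least-stressed processing element so that accumulated stress, a proxy for NBTI aging, stays balanced. The stress values and instruction stream below are fabricated for illustration and do not reflect the paper's compiler flow.

import heapq

def assign_slots(instruction_stress, n_pes=4):
    pes = [(0.0, pe) for pe in range(n_pes)]      # (accumulated stress, PE id)
    heapq.heapify(pes)
    assignment = []
    for stress in instruction_stress:
        total, pe = heapq.heappop(pes)            # least-aged PE so far
        assignment.append(pe)
        heapq.heappush(pes, (total + stress, pe))
    return assignment, sorted(pes)

stream = [1.0, 0.5, 0.8, 1.2, 0.3, 0.9, 1.1, 0.4]  # per-instruction stress estimates (made up)
assignment, final_stress = assign_slots(stream)
print(assignment)        # which PE each slot was mapped to
print(final_stress)      # accumulated stress stays roughly even across PEs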
Axilog: language support for approximate hardware design
Amir Yazdanbakhsh, Divya Mahajan, Bradley Thwaites, Jongse Park, Anandhavel Nagendrakumar, Sindhuja Sethuraman, Kartik Ramkrishnan, Nishanthi Ravindran, Rudra Jariwala, Abbas Rahimi, Hadi Esmaeilzadeh, Kia Bazargan
Proceedings of the 2015 Design, Automation & Test in Europe Conference & Exhibition, pp. 812-817
Abstract
Relaxing the traditional abstraction of "near-perfect" accuracy in hardware design can lead to significant gains in energy efficiency, area, and performance. To exploit this opportunity, there is a need for design abstractions that can systematically incorporate approximation in hardware design. We introduce Axilog, a set of language annotations that provides the necessary syntax and semantics for approximate hardware design and reuse in Verilog. Axilog enables the designer to relax the accuracy requirements in certain parts of the design, while keeping the critical parts strictly precise. Axilog is coupled with a Relaxability Inference Analysis that automatically infers the relaxable gates and connections from the designer's annotations. The analysis provides formal safety guarantees that approximation will only affect the parts that the designer intended to approximate, referred to as relaxable elements. Finally, the paper describes a synthesis flow that approximates only the relaxable elements. Axilog enables applying approximation in the synthesis process while abstracting away the details of approximate synthesis from the designer. We evaluate Axilog, its analysis, and the synthesis flow using a diverse set of benchmark designs. The results show that the intuitive nature of the language extensions, coupled with the automated analysis, enables safe approximation of designs even with thousands of lines of code. Applying our approximate synthesis flow to these designs yields, on average, 54% energy savings and 1.9X area reduction with 10% output quality loss.
Approximate associative memristive memory for energy-efficient GPUs
Abbas Rahimi, Amirali Ghofrani, Kwang-Ting Cheng, Luca Benini, Rajesh K. Gupta
Proceedings of the 2015 Design, Automation & Test in Europe Conference & Exhibition, pp. 1497-1502
Abstract
Multimedia applications running on thousands of deep and wide pipelines working concurrently in GPUs have been an important target for power minimization at both the architectural and algorithmic levels. At the hardware level, energy-efficiency techniques that employ voltage overscaling face a barrier, the so-called "path wall": reducing the operating voltage beyond a certain point generates a massive number of timing errors that are impractical to tolerate. We propose an architectural innovation, called the A2M2 module (approximate associative memristive memory), that exhibits few, tolerable timing errors suitable for GPU applications under voltage overscaling. A2M2 is integrated with every floating point unit (FPU) and performs partial functionality of the associated FPU by pre-storing high-frequency patterns for computational reuse, which avoids the overhead of re-execution. The voltage-overscaled A2M2 is designed to match an input search pattern with any of the stored patterns within a Hamming distance range of 0-2. This matching behavior under voltage overscaling leads to controllable approximate computing for multimedia applications. Our experimental results for the AMD Southern Islands GPU show that four image processing kernels tolerate the mismatches during pattern matching, resulting in a PSNR above 30 dB. The 8-row A2M2 module enables 28% voltage overscaling in 45nm technology, resulting in 32% average energy savings for the kernels while delivering an acceptable quality of service.
2014
Improving Resilience to Timing Errors by Exposing Variability Effects to Software in Tightly-Coupled Processor Clusters
Abbas Rahimi, Daniele Cesarini, Andrea Marongiu, Rajesh K. Gupta, Luca Benini
IEEE Journal on Emerging and Selected Topics in Circuits and Systems 4(2), 216-229, 2014
Abstract
Manufacturing and environmental variations cause timing errors in microelectronic processors that are typically avoided by ultra-conservative multi-corner design margins or corrected by error detection and recovery mechanisms at the circuit level. In contrast, we present here runtime software support for cost-effective countermeasures against hardware timing failures during system operation. We propose a variability-aware OpenMP (VOMP) programming environment, suitable for tightly-coupled shared memory processor clusters, that relies upon modeling across the hardware/software interface. VOMP is implemented as an extension to the OpenMP v3.0 programming model that covers various parallel constructs, including task, sections, and for. Using the notion of work-unit vulnerability (WUV) proposed here, we capture timing errors caused by circuit-level variability as high-level software knowledge. WUV consists of descriptive metadata that characterizes the impact of variability on different work-unit types running on various cores. As such, WUV provides a useful abstraction of hardware variability for efficiently allocating a given work-unit to a suitable core for execution. VOMP enables hardware/software collaboration with online variability monitors in hardware and runtime scheduling in software. The hardware provides online per-core characterization of WUV metadata. This metadata is made available by carefully placing key data structures in a shared L1 memory and is used by the VOMP schedulers. Our results show that VOMP greatly reduces the cost of timing error recovery compared to the baseline schedulers of OpenMP, yielding speedups of 3%-36% for tasks and 26%-49% for sections. Further, VOMP reaches energy savings of 2%-46% and 15%-50% for tasks and sections, respectively.
Energy-Efficient GPGPU Architectures via Collaborative Compilation and Memristive Memory-Based Computing
Abbas Rahimi, Amirali Ghofrani, Miguel Angel Lastras-Montano, Kwang-Ting Cheng, Luca Benini, Rajesh K. Gupta
Proceedings of the 51st Annual Design Automation Conference, pp. 1-6, 2014
Abstract
Thousands of deep and wide pipelines working concurrently make GPGPUs high-power-consuming parts. Energy-efficiency techniques employ voltage overscaling, which increases timing sensitivity to variations and hence aggravates energy-use issues. This paper proposes a method to increase spatiotemporal reuse of computational effort through a combination of compilation and micro-architectural design. An associative memristive memory (AMM) module is integrated with the floating point units (FPUs). Together, we enable fine-grained partitioning of values and find high-frequency sets of values for the FPUs by searching the space of possible inputs with the help of application-specific profile feedback. For every kernel execution, the compiler pre-stores these high-frequency sets of values in AMM modules, representing partial functionality of the associated FPU, which are concurrently evaluated over two clock cycles. Our simulation results show high hit rates with 32-entry AMM modules that enable a 36% reduction in average energy use by the kernel codes. Compared to voltage overscaling, this technique enhances robustness against timing errors with 39% average energy savings.
Temporal memoization for energy-efficient timing error recovery in GPGPUs
Abbas Rahimi, Luca Benini, Rajesh K. Gupta
2014 Design, Automation & Test in Europe Conference & Exhibition (DATE)
Abstract
Manufacturing and environmental variability lead to timing errors in computing systems that are typically corrected by error detection and correction mechanisms at the circuit level. The cost and speed of recovery can be improved by memoization-based optimization methods that exploit spatial or temporal parallelism in suitable computing fabrics such as general-purpose graphics processing units (GPGPUs). We propose here a temporal memoization technique for use in floating-point units (FPUs) in GPGPUs that uses value locality inside data-parallel programs. The technique recalls (memoizes) the context of the error-free execution of an instruction on an FPU. To enable scalable and independent recovery, a single-cycle lookup table (LUT) is tightly coupled to every FPU to maintain contexts of recent error-free executions. The LUT reuses these memoized contexts to exactly, or approximately, correct errant FP instructions based on application needs. In real-world applications, the temporal memoization technique achieves an average energy saving of 8%-28% for a wide range of timing error rates (0%-4%) and outperforms recent advances in resilient architectures. This technique also enhances robustness in the voltage overscaling regime and achieves a relative average energy saving of 66% with 11% voltage overscaling.
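A behavioral sketch of the idea, with an assumed table size and a stand-in FP operation, is given below: a small lookup table attached to an FPU recalls the result of a recent error-free execution when the same operand context recurs, so an errant instruction can be corrected without a replay.

from collections import OrderedDict

class MemoLUT:
    def __init__(self, entries=8):
        self.entries = entries
        self.table = OrderedDict()               # operand context -> memoized result

    def lookup(self, a, b):
        return self.table.get((a, b))

    def update(self, a, b, result):
        self.table[(a, b)] = result
        self.table.move_to_end((a, b))
        if len(self.table) > self.entries:
            self.table.popitem(last=False)       # evict the oldest context

def fpu_mul(a, b, lut, timing_error=False):
    hit = lut.lookup(a, b)
    if hit is not None and timing_error:
        return hit                               # reuse the memoized result instead of replaying
    result = a * b                               # modelled error-free execution
    lut.update(a, b, result)
    return result

lut = MemoLUT()
print(fpu_mul(1.5, 2.0, lut))                    # miss: execute and memoize
print(fpu_mul(1.5, 2.0, lut, timing_error=True)) # hit: recall the earlier error-free result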
Application-Adaptive Guardbanding to Mitigate Static and Dynamic Variability
Abbas Rahimi, Luca Benini, Rajesh K. Gupta
IEEE Transactions on Computers 63(9), 2160-2173, 2014
Abstract
Traditional application execution assumes error-free execution hardware and environment. Such guarantees are achieved by providing guardbands in the design of microelectronic processors. In reality, applications exhibit varying degrees of tolerance to errors in computation. This paper proposes an adaptive guardbanding technique to combat CMOS variability for error-tolerant (probabilistic) applications as well as traditional error-intolerant applications. The proposed technique leverages a combination of accurate design-time analysis and a minimally intrusive runtime technique to mitigate Process, Voltage, and Temperature (PVT) variations at near-zero area overhead. We demonstrate our approach on a 32-bit in-order RISC processor with full post Placement and Routing (P&R) layout results in TSMC 45nm technology. The adaptive guardbanding technique eliminates traditional guardbands on operating frequency using information from PVT variations and application-specific requirements on computational accuracy. For error-intolerant applications, we introduce the notion of Sequence-Level Vulnerability (SLV) that utilizes circuit-level vulnerability to construct high-level software knowledge as metadata. In effect, the SLV metadata partitions sequences of integer SPARC instructions into two equivalence classes, enabling the adaptive guardbanding technique to adapt the frequency to dynamic voltage and temperature variations as well as to the different classes of instruction sequences. The proposed technique achieves on average a 1.6x speedup for error-intolerant applications compared to recent work. For probabilistic applications, the adaptive technique guarantees the error-free operation of the set of processor paths that always require correct timing (Vulnerable Paths) while reducing the cost of guardbanding for the rest of the paths (Invulnerable Paths). This increases the throughput of probabilistic applications by up to 1.9x over the traditional worst-case design. The proposed technique has 0.022% area overhead and imposes only 0.034% and 0.031% total power overhead for intolerant and probabilistic applications, respectively.
2013
Aging-aware compiler-directed VLIW assignment for GPGPU architectures
Abbas Rahimi, Luca Benini, Rajesh K. Gupta
Proceedings of the 50th Annual Design Automation Conference, 2013
Abstract
Negative bias temperature instability (NBTI) adversely affects the reliability of a processor by introducing new delay-induced faults. However, the effect of these delay variations is not uniformly spread across functional units and instructions: some are affected more (hence less reliable) than others. This paper proposes an NBTI-aware compiler-directed very long instruction word (VLIW) assignment scheme that uniformly distributes the stress of instructions with the aim of minimizing aging of the GPGPU architecture without any performance penalty. The proposed solution is an entirely software technique based on static workload characterization and online execution with NBTI monitoring that equalizes the expected lifetime of each processing element by regenerating aging-aware healthy kernels that respond to the specific health state of the GPGPU. We demonstrate our approach on the AMD Evergreen architecture, where iso-throughput executions of the healthy kernels reduce the NBTI-induced voltage threshold shift by up to 49% (11%) compared to naive kernel executions, with (without) architectural support for power-gating. The kernel adaptation flow takes on average 13 milliseconds on a typical host machine, making it suitable for practical implementation.
Variation-tolerant OpenMP tasking on tightly-coupled processor clusters
Abbas Rahimi, Andrea Marongiu, Paolo Burgio, Rajesh K. Gupta, Luca Benini
2013 Design, Automation & Test in Europe Conference & Exhibition (DATE), pp. 541-546
Abstract
We present a variation-tolerant tasking technique for tightly-coupled shared memory processor clusters that relies upon modeling advances across the hardware/software interface. This is implemented as an extension to the OpenMP 3.0 tasking programming model. Using the notion of Task-Level Vulnerability (TLV) proposed here, we capture dynamic variations caused by circuit-level variability as high-level software knowledge. This is accomplished through a variation-aware hardware/software codesign where: (i) hardware features variability monitors in conjunction with online per-core characterization of TLV metadata; (ii) software supports a Task-level Errant Instruction Management (TEIM) technique to utilize TLV metadata in the runtime OpenMP task scheduler. This method greatly reduces the number of recovery cycles compared to the baseline scheduler of OpenMP [22]; consequently, the instructions per cycle (IPC) of a 16-core processor cluster is increased by up to 1.51x (1.17x on average). We evaluate the effectiveness of our approach with various numbers of cores (4, 8, 12, 16) and across a wide temperature range (T = 90°C).
ARGO: aging-aware GPGPU register file allocation
Majid Namaki-Shoushtari, Abbas Rahimi, Nikil Dutt, Puneet Gupta, Rajesh K. Gupta
Proceedings of the Ninth IEEE/ACM/IFIP International Conference on Hardware/Software Codesign and System Synthesis, 2013
Abstract
State-of-the-art general-purpose graphic processing units (GPGPUs) implemented in nanoscale CMOS technologies offer very high computational throughput for highly-parallel applications using hundreds of integrated on-chip resources. These resources are stressed during application execution, subjecting them to degradation mechanisms such as negative bias temperature instability (NBTI) that adversely affect their reliability. To support highly parallel execution, GPGPUs contain large register files (RFs) that are among the most highly stressed GPGPU components; however, we observe heavy underutilization of RFs (on average only 46%) for typical general-purpose kernels. We present ARGO, an Aging-awaRe GPGPU RF allOcator that opportunistically exploits this RF underutilization by distributing the stress throughout the RF. ARGO achieves proper leveling of RF banks through deliberate power-gating of stressed banks. We demonstrate our technique on the AMD Evergreen GPGPU architecture and show that ARGO mitigates NBTI-induced threshold voltage degradation by up to 43% (on average 27%), which improves the RFs' static noise margin by up to 46% (on average 30%). Furthermore, we estimate a simultaneous reduction in leakage power of 54% by providing sleep states for unused banks.
A variability-aware OpenMP environment for efficient execution of accuracy-configurable computation on shared-FPU processor clusters
Abbas Rahimi, Andrea Marongiu, Rajesh K. Gupta, Luca Benini
Proceedings of the Ninth IEEE/ACM/IFIP International Conference on Hardware/Software Codesign and System Synthesis, pp. 1-10, 2013
Abstract
We propose a tightly-coupled, multi-core cluster architecture with shared, variation-tolerant, and accuracy-reconfigurable floating-point units (FPUs). The resilient shared-FPUs dynamically characterize FP pipeline vulnerability (FPV) and expose it as metadata to a software scheduler for reducing the cost of error correction. To further reduce this cost, our programming and runtime environment also supports controlled approximate computation through a combination of design-time and runtime techniques. We provide OpenMP extensions (as custom directives) for FP computations to specify parts of a program that can be executed approximately. We use a profiling technique to identify tolerable error significance and error rate thresholds in error-tolerant image processing applications. This information guides an application-driven hardware FPU synthesis and optimization design flow to generate efficient FPUs. At runtime, the scheduler utilizes FPV metadata and promotes FPUs to accurate mode, or demotes them to approximate mode depending upon the code region requirements. We demonstrate the effectiveness of our approach (in terms of energy savings) on a 16-core tightly-coupled cluster with eight shared-FPUs for both error-tolerant and general-purpose error-intolerant applications.
Hierarchically focused guardbanding: an adaptive approach to mitigate PVT variations and aging
Abbas Rahimi, Luca Benini, Rajesh K. Gupta
2013 Design, Automation & Test in Europe Conference & Exhibition (DATE), pp. 1695-1700
Abstract
This paper proposes a new model of functional units for variation-induced timing errors due to PVT variations and device Aging (PVTA). The model takes into account PVTA parameter variations, clock frequency, and the physical details of Placed-and-Routed (P&R) functional units. Guardbanding is then focused hierarchically and applied adaptively at two levels of granularity: (i) fine-grained instruction-level and (ii) coarse-grained kernel-level. Using coarse-grained PVTA monitors with kernel-level adaptation, the throughput increases by 70% on average. By comparison, instruction-by-instruction monitoring and adaptation enhances throughput by a factor of 1.8-2.1, depending on the configuration of the PVTA monitors and the type of instructions executed in the kernels.
Spatial Memoization: Concurrent Instruction Reuse to Correct Timing Errors in SIMD Architectures
Abbas Rahimi, Luca Benini, Rajesh K. Gupta
IEEE Transactions on Circuits and Systems II: Express Briefs 60(12), 847-851, 2013
Abstract
This brief proposes a novel technique to alleviate the cost of timing error recovery, building upon the lockstep execution of single-instruction-multiple-data (SIMD) architectures. To support spatial memoization at the instruction level, we propose a single-strong-lane-multiple-weak-lane (SSMW) architecture. Spatial memoization exploits the value locality inside parallel programs: it memoizes the result of an error-free execution of an instruction on the SS lane and concurrently reuses that result to spatially correct errant instructions across the MW lanes. Experimental results in Taiwan Semiconductor Manufacturing Company 45-nm technology confirm that this technique avoids recovery for 62% of the errant instructions on average, for both error-tolerant and error-intolerant general-purpose applications.
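The lane-level reuse rule can be modelled in a few lines of Python; the error flags and operand values below are stand-ins for circuit-level error detection and real SIMD state, so this is only an illustration of the SSMW reuse idea.

def ssmw_execute(op, operands, error_flags):
    # Lane 0 is the single strong lane; its result is assumed error-free.
    strong_args, strong_result = operands[0], op(*operands[0])
    results = []
    for lane, (args, erred) in enumerate(zip(operands, error_flags)):
        if lane == 0 or not erred:
            results.append(op(*args))            # normal lockstep execution
        elif args == strong_args:
            results.append(strong_result)        # concurrent reuse: recovery avoided
        else:
            results.append(op(*args))            # stand-in for the usual replay/recovery
    return results

# Four lanes executing a multiply; lanes 1 and 2 flag timing errors.
operands = [(2.0, 3.0), (2.0, 3.0), (1.0, 4.0), (2.0, 3.0)]
flags = [False, True, True, False]
print(ssmw_execute(lambda a, b: a * b, operands, flags))   # [6.0, 6.0, 4.0, 6.0]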
2012
Procedure hopping: a low overhead solution to mitigate variability in shared-L1 processor clusters
Abbas Rahimi, Luca Benini, Rajesh Gupta
Proceedings of the 2012 ACM/IEEE international symposium on Low power electronics and design, pp. 415-420
Abstract
Variation in performance and power across manufactured parts and their operating conditions is a well-known issue in advanced CMOS processes. This paper proposes a resilient HW/SW architecture for shared-L1 processor clusters to combat both static and dynamic variations. We first introduce the notion of procedure-level vulnerability (PLV) to expose fast dynamic voltage variation and its effects to the software stack for use in runtime compensation. To assess PLV, we quantify the effect of full operating conditions on the dynamic voltage variation of a post-layout processor in 45nm TSMC technology. Based on our analysis, PLV shows 18mV-63mV of inter-corner variation in the maximum voltage droop across procedures. To exploit this variation we propose a low-cost procedure hopping technique within the processor clusters, utilizing compile-time-characterized metadata related to PLV. Our results show that procedure hopping avoids critical voltage droops during the execution of all procedures while incurring less than 1% latency penalty.
Analysis of instruction-level vulnerability to dynamic voltage and temperature variations
Abbas Rahimi, Luca Benini, Rajesh K. Gupta
2012 Design, Automation & Test in Europe Conference & Exhibition (DATE), pp. 1102-1105
Abstract
Variation in performance and power across manufactured parts and their operating conditions is an accepted reality in aggressive CMOS processes. This paper considers challenges and opportunities in identifying this variation and methods to combat it for improved computing systems. We introduce the notion of instruction-level vulnerability (ILV) to expose variation and its effects to the software stack for use in architectural/compiler optimizations. To compute ILV, we quantify the effect of voltage and temperature variations on the performance and power of a 32-bit, RISC, in-order processor in 65nm TSMC technology at the level of individual instructions. Results show 3.4ns (68FO4) delay variation and 26.7x power variation among instructions, and across extreme corners. Our analysis shows that ILV is not uniform across the instruction set. In fact, ILV data partitions instructions into three equivalence classes. Based on this classification, we show how low-overhead robustness enhancement techniques can be used to enhance performance by a factor of 1.1-5.5.
2011
A fully-synthesizable single-cycle interconnection network for Shared-L1 processor clusters
Abbas Rahimi, Igor Loi, Mohammad Reza Kakoee, Luca Benini
2011 Design, Automation & Test in Europe, pp. 1-6
Abstract
Shared L1 memory is an interesting architectural option for building tightly-coupled multi-core processor clusters. We designed a parametric, fully combinational Mesh-of-Trees (MoT) interconnection network to support high-performance, single-cycle communication between processors and memories in L1-coupled processor clusters. Our interconnect IP is described in synthesizable RTL and is coupled with a design automation strategy mixing advanced synthesis and physical optimization to achieve optimal delay, power, and area (DPA) under a wide range of design constraints. We explore DPA for a large set of network configurations in 65nm technology. Post placement and routing, when the number of both processors and memories is increased by a factor of 4, the delay increases almost logarithmically to 84FO4, confirming scalability across a significant range of configurations. DPA tradeoff flexibility is also promising: in comparison to the max-performance 16x32 configuration, there is potential to save power and area by 45% and 12%, respectively, at the expense of 30% performance degradation.