eLite DSP Project - Publications


  • V. Zyuban et al., "Design Methodology for Low Power High Performance Semi Custom Processor Cores," Great Lakes Symposium on VLSI, pp. 448-453, April 2004.

    The paper describes a semi-custom design methodology for embedded processor cores that was prototyped through the development of a low-power, high-performance DSP core. When compared to the standard ASIC design flow, this methodology enables significant improvement in speed and power; such benefits are obtained without compromising the generality and flexibility that characterize ASIC-based design techniques. Our methodology achieves fast turn-around time in the process from RTL description to post-PD timing results, and exhibits stable convergence on timing; these characteristics enable the application of optimizations spanning multiple levels of the design hierarchy. Such optimizations proved to be much more effective than those that focus only on a single design stage, which makes the described flow a compelling choice for the development of embedded processor cores.

  • V. Zyuban et al., "Design Methodology for Low Power High Performance Semi Custom Processor Cores," IBM Research Report, December 2003.

    This paper describes a semi-custom design methodology that was prototyped and demonstrated through the development of a low-power, high-performance DSP core. The developed methodology achieves significant speed improvement and reduction in power and area, compared with the standard ASIC flow, without compromising its generality and high productivity. Because of the fast turn-around time from RTL description to post-PD timing results, and stable convergence on timing, the developed flow enables optimizations spanning multiple levels of the design hierarchy. Such optimizations proved much more effective than those that focus on any single phase of the design, which makes the described flow a compelling choice for the development of embedded processor cores.

  • J. A. Rivers, S. Asaad, J-D. Wellman, and J.H. Moreno, "Reducing Instruction Fetch Energy with Backwards Branch Control Information and Buffering," International Symposium on Low-Power Electronics and Design, Seoul, South Korea, August 2003.

    Many emerging applications, e.g. in the embedded and DSP space, are often characterized by their loopy nature, where a substantial part of the execution time is spent within a few program phases. Loop buffering techniques have been proposed for capturing and processing these loops in small buffers to reduce the processor's instruction fetch energy. However, these schemes are limited to straight-line or innermost loops and fail to adequately handle complex loops. In this paper, we propose a dynamic loop buffering (DLB) mechanism that uses backwards branch control information to identify, capture and process complex loop structures. The DLB controller has been fully implemented in VHDL, synthesized and timed with the IBM BooleDozer and EinsTimer tools, and analyzed for power with the Sequence PowerTheater tool. Our experiments show that the DLB approach, on average, results in a factor of 3 reduction in energy consumption compared to a traditional instruction memory design, at an area overhead of about 9%.

  • H.C. Hunter and J.H. Moreno, "A New Look at Exploiting Data Parallelism in Embedded Systems," International Conference on Compilers, Architectures and Synthesis for Embedded Systems (CASES '03), October, 2003.

    This paper describes and evaluates three architectural methods for accomplishing data parallel computation in a programmable embedded system. Comparisons are made between the well-studied Very Long Instruction Word (VLIW) and Single Instruction Multiple Packed Data (SIMpD) paradigms; the less-common Single Instruction Multiple Disjoint Data (SIMdD) architecture is described and evaluated. A taxonomy is defined for data-level parallel architectures, and patterns of data access for parallel computation are studied, with measurements presented for over 40 essential telecommunication and media kernels. While some algorithms exhibit data-level parallelism suited to packed vector computation, it is shown that other kernels are most efficiently scheduled with more flexible vector models. This motivates exploration of non-traditional processor architectures for the embedded domain.

  • D. Naishlos et al., "Vectorizing for a SIMdD architecture," International Conference on Compilers, Architectures and Synthesis for Embedded Systems (CASES '03), October/November 2003.

    The Single Instruction Multiple Data (SIMD) model for fine-grained parallelism was recently extended to support SIMD operations on disjoint vector elements. In this paper we demonstrate how SIMdD (SIMD on disjoint data) supports effective vectorization of digital signal processing (DSP) benchmarks, by facilitating data reorganization and reuse. In particular we show that this model can be adopted by a compiler to achieve near-optimal performance for important classes of kernels.

  • J.H. Moreno et al., "An innovative low-power high-performance programmable signal processor for digital communications," IBM Journal of Research & Development, Vol. 47, No. 2/3, pp. 299-326, March/May 2003.

    We describe an innovative low-power high-performance programmable signal processor (DSP) for digital communications. The architecture of this processor is characterized by its explicit design for low-power implementations, its innovative ability to jointly exploit instruction-level parallelism (ILP) and data-level parallelism (SIMD) to achieve high performance, its suitability as a target for an optimizing high-level language compiler, and its explicit replacement of hardware resources by compile-time practices. The report describes the methodology used in the development of the processor, highlighting the techniques deployed to enable architecture/compiler/implementation co-development, and the approach used for power-performance evaluation and trade-off analysis. It summarizes the salient features of the architecture, provides a brief description of the hardware organization, and discusses the compiler techniques used to exercise these features. It also summarizes the simulation environment and associated software development tools. Coding examples from two representative kernels in the digital communications domain are also provided. The resulting design methodology, architecture and compiler represent an advance of the state of the art in the area of low-power, domain-specific microprocessors.

  • J. H. Derby et al., "A Low-power High-Performance Embedded DSP Core with Novel SIMD Features," GSPx Conference, 2003.

    We describe a low-power, high-performance, compiler-friendly digital signal processor (DSP) intended for use as an embedded core in systems-on-chip targeted at communication and media-related applications. We present an overview of the DSP architecture, provide insight into the hardware-software and power-performance codesign methodologies employed in its design, and describe some application-oriented examples. This DSP core represents an advance in the state of the art in power-efficient, high-performance, programmable DSP architectures, as well as in methodologies for implementing such architectures. Index terms: digital signal processor (DSP), low-power, single-instruction multiple data (SIMD).

  • J. H. Derby and J. H. Moreno, "A High-Performance Embedded DSP Core with Novel SIMD Features," International Conference on Acoustics, Speech, and Signal Processing (ICASSP'03), April 2003.

    A low-power, high-performance, compiler-friendly DSP core has been under development in the IBM Communications Research & Development Center, as part of its eLite DSP project. This DSP incorporates instruction-level parallelism through the packing of multiple instructions in 64-bit long-instruction words, while data-level parallelism is realized through the use of SIMD techniques, such that SIMD operations can be applied to both dynamically composed vectors and packed vectors. Dynamic composition of vectors is made possible through the use of a vector pointer mechanism, which permits very flexible addressing of groups of four 16-bit elements in a large, multiport, scalar register file. This paper provides an overview of the architecture of this DSP core, with a focus on its SIMD features. We describe these features in some detail and discuss how they are used, with a block FIR filter and a radix-4 FFT taken as examples. Copyright © (2003) by IEEE

  • D. Naishlos et al., "Compiler Vectorization Techniques for a Disjoint SIMD Architecture," IBM Research Report, November 2002.

    This paper presents compiler technology that targets a novel low-power Digital Signal Processor (DSP) architecture. The architecture is characterized by the exploitation of data and instruction level parallelism, and uses a large register file with dynamically composed vectors for data manipulation. We describe how an optimizing compiler can make use of the vector register file with its flexible addressing to efficiently support a range of data access patterns that are present in the digital processing application domain. We describe new challenges presented by this novel DSP architecture, as well as new opportunities for aggressive yet low-overhead optimizations that it introduces. Experiments show that an optimizing compiler can target such an architecture efficiently to achieve performance that is comparable to the optimal hand-generated code for key benchmarks. The resulting compiler technology represents an advance of the state of the art in the area of DSP compilation.

  • J.H. Moreno et al., "An innovative low-power high-performance programmable signal processor for digital communications," IBM Research Report, September 2002.

    This report describes our innovative low-power high-performance programmable signal processor (DSP) for digital communications. The architecture of this processor is characterized by its explicit design for low-power implementations, its innovative ability to jointly exploit instruction-level parallelism (ILP) and data-level parallelism (SIMD) to achieve high performance, its suitability as a target for an optimizing high-level language compiler, and its explicit replacement of hardware resources by compile-time practices. The report describes the methodology used in the development of the processor, highlighting the techniques deployed to enable architecture/compiler/implementation co-development, and the approach used for power-performance evaluation and trade-off analysis. It summarizes the salient features of the architecture, provides a brief description of the hardware organization, and discusses the compiler techniques used to exercise these features. It also summarizes the simulation environment and associated software development tools. Coding examples from two representative kernels in the digital communications domain are also provided. The resulting design methodology, architecture and compiler represent an advance of the state of the art in the area of low-power, domain-specific microprocessors.

  • V. Zyuban, P. Strenski, "Unified methodology for resolving power-performance tradeoffs at the microarchitectural and circuit levels," International Symposium on Low-Power Electronics and Design, pp. 166-171, August 2002.

    This paper proposes a metric of hardware intensity, which is useful for evaluating issues that affect both circuits and architecture. Analyzing the data from actual designs, the paper shows how to measure the parameters introduced and discusses variations between observed results and common theoretical assumptions. Copyright (2002) by Association for Computing Machinery, Inc.

  • V. Zyuban, S. Kosonocky, "Low power integrated scan-retention mechanism," International Symposium on Low-Power Electronics and Design, pp. 98-102, August 2002.

    This paper presents a methodology for unifying the scan mechanism and data retention in latches, which leads to scannable latches with the data retention capability achieved at a very low power overhead during the active mode. A detailed analysis of power and area overhead is presented, with layout examples for various common latch styles. Implications of using different power gating techniques for reducing leakage during sleep mode are considered, including well-bias for leakage control and sharing wells between gated logic and retention latch devices. Copyright (2002) by Association for Computing Machinery, Inc.

  • V. Zyuban, "Unified Architecture Level Energy-Efficiency Metric," Great Lakes Symposium on VLSI, April 2002.

    This paper derives a unified energy-efficiency metric for evaluating ISA and microarchitecture features, which subsumes other commonly used power-performance metrics as special cases of a more general equation. This new metric is derived based on an analysis of a multi-dimensional power optimization problem, and the resulting formula involves only relative changes in the characteristics of a processor, enabling its application at the early stages of the design. Copyright (2002) by Association for Computing Machinery, Inc.

  • V. Zyuban, D. Meltzer, "Clocking strategies and scannable latches for low-power applications," International Symposium on Low-Power Electronics and Design, pp. 346-351, 2001.

    This paper covers a range of issues in the design of clocking schemes for low-power applications. First we revisit, extend and improve the power-performance optimization methodology for latches, attempting to make it more formal and comprehensive. Data switching factor and the glitching activity are taken into consideration, using a formal analytical approach. A notion of an energy-efficient family of configurations is then introduced to make the comparison of different latch styles in the power-performance space fairer, and the power of the clock distribution is also taken into account. Practical issues of building a low-overhead scan mechanism are considered, and the power overhead of the scannable design is analyzed. A low-power LSSD extension to single-phase latches is proposed, and results of a comparative study of LSSD-scannable latches are shown, supported by experimental data measured on a 0.18 µm test chip. Copyright (2001) by Association for Computing Machinery, Inc.

  • M. Moudgill, A. Zaks, "Minimizing inter-files transfers in architectures with separate address registers," IBM Research Report RC 21884, November 2000.

    This report describes instruction selection in architectures where a general-purpose register file is replaced by separate address and integer register files, each feeding a separate execution unit. In such architectures, load and store operations use address registers to compute the location being accessed. Further, values in address registers can be manipulated in only a limited number of ways. In general, a value may need to be transferred from an address register to an integer register, operated on by the integer unit, and then transferred back. In this paper, we describe an optimal polynomial-time algorithm to partition operations between address and integer units that minimizes the number of inter-file transfers.

  • J. Glossner, J.H. Moreno, et al., "Trends in compilable DSP architectures," IEEE Workshop on Signal Processing Systems (SiPS/2000), October 2000.

    This paper reviews the evolution of DSP architectures and compiler technology, and describes how compiler techniques are being used to optimize emerging DSP architectures. Such new architectures are characterized by the exploitation of data and instruction level parallelism while being an amenable target for a compiler, thereby reducing or eliminating the need to rely on assembly language programming and/or architecture-specific compiler intrinsics to achieve highly efficient code. It also summarizes our research results on an ultra-low power compilable DSP architecture. Copyright (2000) by IEEE
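
Illustrative note: several abstracts above (ICASSP'03, CASES '03) refer to SIMD operations on dynamically composed vectors, where a vector pointer selects groups of four 16-bit elements from a scalar register file. The following is a minimal sketch of that idea only, not the eLite instruction set. The class and function names, the register-file size of 64, and the use of wraparound (rather than saturating) arithmetic are all assumptions made for illustration.

```python
from typing import List, Sequence

LANES = 4  # the abstracts describe groups of four 16-bit elements


def wrap16(value: int) -> int:
    """Wrap an integer to signed 16-bit range (assumed overflow behavior)."""
    value &= 0xFFFF
    return value - 0x10000 if value & 0x8000 else value


class ScalarRegisterFile:
    """Hypothetical register file of 16-bit scalars addressed by index."""

    def __init__(self, size: int = 64) -> None:  # size is an assumption
        self.regs: List[int] = [0] * size

    def gather(self, vp: Sequence[int]) -> List[int]:
        """Compose a vector from the four registers a vector pointer names."""
        if len(vp) != LANES:
            raise ValueError("a vector pointer must name exactly four registers")
        return [self.regs[i] for i in vp]

    def scatter(self, vp: Sequence[int], values: Sequence[int]) -> None:
        """Write four lane results back to the registers named by vp."""
        for index, value in zip(vp, values):
            self.regs[index] = wrap16(value)


def simd_add(rf: ScalarRegisterFile, vp_dst: Sequence[int],
             vp_a: Sequence[int], vp_b: Sequence[int]) -> None:
    """Lane-wise add; operands may be packed or disjoint register groups."""
    a = rf.gather(vp_a)
    b = rf.gather(vp_b)
    rf.scatter(vp_dst, [x + y for x, y in zip(a, b)])


rf = ScalarRegisterFile()
rf.regs[0:8] = [1, 2, 3, 4, 5, 6, 7, 8]
# First operand is a packed group (0..3); second is a disjoint group (7, 5, 3, 1).
simd_add(rf, (8, 9, 10, 11), (0, 1, 2, 3), (7, 5, 3, 1))
result = rf.gather((8, 9, 10, 11))
print(result)
```

In this sketch the same `simd_add` handles both packed and disjoint operands because the vector pointer, not the instruction, decides which registers form a vector; how the actual hardware encodes and updates vector pointers is described in the cited papers, not here.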