eLite DSP Project - Architecture


eLite DSP architecture

The eLite DSP architecture is an advanced architecture tailored for the execution of algorithms in digital signal processing, such as those found in wireless communication standards (2.5G, 3G), DSL, wireless local area networks, voice-over-IP, video decoding and encoding, audio decoding and encoding, and so on. The architecture, which includes support for executing control code and data operations, contains facilities for exploiting instruction-level parallelism (simultaneous execution of multiple independent instructions packed as long-instruction words, LIWs), data parallelism through vector operations with independent access to the elements of the vectors (single-instruction multiple-disjoint-data, SIMdD), and data parallelism through vector instructions with data packed in registers (single-instruction multiple-data with subword parallelism or packed data, SIMpD). The combination of instruction-level parallelism and data parallelism enables high performance levels at low clock frequency.

The eLite DSP architecture is especially targeted for ultra-low power implementations. To that effect, the architecture relies on static scheduling of instructions, static vectorization of operations, predicated instructions, visible latencies (i.e., an "exposed pipeline"), and other related techniques that are applied during code generation. As a result, the code executed in a processor that implements the architecture is expected to be generated mostly by an optimizing compiler, which ensures that the constraints imposed on the code by an implementation are properly fulfilled. Code can also be written and optimized at the assembly language level, of course; such code can be mixed with code generated by a compiler.

The architecture also includes support for reducing power consumption during the execution of a program, such as the ability to shut down functional units, or to enable hardware blocks on demand by hardware or software, depending on the specific block.

The figure above is a logical representation of an eLite processor, which consists of the following functional units:

  • Branch unit: generates the storage address for the next long-instruction to be fetched from memory, either sequential addressing or branches, and performs logic operations on the Condition Registers.
  • Integer unit: performs operations on data in Integer (General-Purpose) Registers.
  • Storage Access unit: interacts with the data storage to transfer data (8-bit, 16-bit, 32-bit, and 64-bit) to/from internal registers and storage, and performs operations on data in Address Registers, which are used to address storage.
  • Vector Pointer unit: performs operations on Vector Pointer Registers, which are used to access the contents of Vector Element Registers.
  • Vector Element unit: performs operations on 16-bit integer and fractional data stored in Vector Element Registers. The number of these registers is implementation-dependent, and ranges from 64 to 4096.
  • Vector Accumulator unit: performs operations on 32-bit integer and 32-bit or 40-bit fractional data stored in Vector Accumulator Registers, including reduction operations on the elements of a vector.
  • Vector Mask unit: performs logic operations on the Vector Mask Registers.

The Integer unit and the Storage Access unit correspond to scalar units, operating mostly on integer data. In contrast, the Vector Element unit and the Vector Accumulator unit operate on 4-element vectors in SIMD fashion.

Program execution

A program in the eLite DSP architecture consists of a sequence of long-instruction words (LIWs), each containing a 4-bit prefix and either one, two or three instructions whose individual size can be 16, 20, 24, 30 or 60 bits. A long-instruction word is the minimum unit of program addressing possible, and is represented in memory as a 64-bit entity. Branching into an instruction other than the first instruction of a LIW is not possible. A processor fetches LIWs from instruction storage for execution; all instructions contained in a LIW are dispatched for simultaneous execution, unless they are specified as executable in serial manner, as indicated in the LIW prefix.
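The size constraints above can be explored with a short sketch. It enumerates the combinations of one to three instruction sizes that fit the 60 bits left after the prefix. The function name exact_packings is hypothetical, and the sketch assumes that instructions fill the remaining bits exactly, with no padding, which the text does not state.

```python
from itertools import combinations_with_replacement

LIW_BITS = 64                             # LIW size in memory (from the text)
PREFIX_BITS = 4                           # LIW prefix size (from the text)
INSTRUCTION_SIZES = (16, 20, 24, 30, 60)  # instruction sizes in bits (from the text)


def exact_packings(payload_bits=LIW_BITS - PREFIX_BITS):
    """Return the size combinations of 1 to 3 instructions that sum to payload_bits.

    Assumption (not from the source): the instructions fill the payload
    exactly, without padding. The order of instructions is ignored.
    """
    packings = []
    for count in (1, 2, 3):
        for sizes in combinations_with_replacement(INSTRUCTION_SIZES, count):
            if sum(sizes) == payload_bits:
                packings.append(sizes)
    return packings


for sizes in exact_packings():
    print(sizes)
```

Under that assumption, only four combinations remain: one 60-bit instruction, two 30-bit instructions, three 20-bit instructions, or one instruction each of 16, 20 and 24 bits.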

All instructions, regardless of their length, contain a fixed-size opcode in bits 0:7 specifying the operation to be performed. Some instructions specify an expanded opcode field in bits 18:19. Instructions whose length is 30 bits specify additional opcode information in bits 28:29.
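As an illustration of these fields, the following sketch extracts a bit range from an instruction held as an integer. The helper names are hypothetical. It assumes that bit 0 is the most significant bit of the instruction; the text does not state the numbering convention, so the shift computation would change under the opposite convention.

```python
def extract_field(instruction, length, first, last):
    """Extract bits first..last (inclusive) from an instruction of `length` bits.

    Assumption (not from the source): bit 0 is the most significant bit.
    """
    if not (0 <= first <= last < length):
        raise ValueError("field does not fit in the instruction")
    width = last - first + 1
    shift = length - 1 - last  # distance of the field from the low-order end
    return (instruction >> shift) & ((1 << width) - 1)


def opcode(instruction, length):
    """Fixed-size opcode in bits 0:7, present in every instruction."""
    return extract_field(instruction, length, 0, 7)


def expanded_opcode(instruction, length):
    """Expanded opcode in bits 18:19; only meaningful for instructions that define it."""
    return extract_field(instruction, length, 18, 19)
```

For example, with a 20-bit instruction 0xAB123, opcode(0xAB123, 20) gives 0xAB. A 16-bit instruction has no bits 18:19, so expanded_opcode raises ValueError for it.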

Similar to RISC processors, no instruction in the eLite architecture can perform a computational operation on data in memory. Conversely, no instruction other than store instructions can modify storage. To use a storage operand in a computation, the contents of storage must first be loaded into a register, and the operation is performed on the contents of the register. Similarly, to use a storage operand in a computation and then modify the same or another storage location, the contents of storage must be loaded into a register, modified, and then stored back to the target location. Direct Memory Access (DMA) operations may alter the storage contents independently.

The preferred eLite programming model consists of loading many data elements from storage into the registers, in particular into the Vector Element Registers and Vector Accumulator Registers, and then operating on the contents of the registers with few intervening accesses to storage. Vector Element Registers are accessed indirectly through Vector Pointer Registers, so that vectors are dynamically composed from four arbitrary Vector Element Registers (SIMdD operation). Every Vector Element instruction specifies one or two Vector Pointer Registers, which in turn specify the four Vector Element Registers actually used by the instruction. In contrast, Vector Accumulator Registers (VARs) contain four 40-bit elements each, which are accessed simultaneously in the order they are stored in the VARs, and used in that same order as operands to Vector Accumulator unit instructions (SIMpD operation). The elements from Vector Accumulator Registers, in conjunction with a Reduction Register, are also used as operands to a special reduction unit.

Vector units are characterized by a cascaded SIMD programming model: 16-bit data are loaded from adjacent memory locations into arbitrary locations in the Vector Element Register file (VER), 16-bit operations are performed on data from arbitrary locations within the VER (SIMdD), and the results are placed into the 4x40-bit Vector Accumulator Register file (VAR), in packed manner (in a single register). 32-bit data can also be loaded directly from memory, in packed form, into the VAR; operations are performed on packed data read from the VAR (SIMpD), and the results are placed in the same register file. Packed data can be transferred from the VAR file into the 16-bit VER file with arbitrary placement, after a suitable size reduction operation, or can be placed into adjacent memory locations. Because data sizes vary, data is allocated to units according to its natural data type (size) throughout the computations.

Instructions are statically scheduled taking into consideration their utilization of resources throughout the pipeline, and the data dependencies with their dependent instructions (i.e., an "exposed pipeline" execution model). Most instructions are processed in six stages. Vector element instructions use one extra stage to read the Vector Pointer Registers and the succeeding stage to read the Vector Element Registers, whereas memory instructions use dedicated stages for transferring the address and data from the processor to the memory subsystem. All instructions that are dispatched in the same cycle read the contents of their source registers at the same time, with the exception of Vector Element Registers, which are read in the cycle after the associated Vector Pointer Registers are read. An instruction completes execution when its results are placed in their destination locations; instructions complete execution according to their individual latencies. Instructions contained wholly within a functional unit have the same latency, with the exception of branches, which are resolved earlier; instructions in different units, or instructions that place the result in a register in a different unit, may exhibit different latencies.

Instructions other than vector instructions can be predicated by specifying a condition that is evaluated dynamically, at execution time. The predicate is specified in a Condition Register. An instruction whose predicate evaluates to false is not completed; such an instruction is simply discarded. Vector instructions are not predicated as a whole; instead, each individual operation within a vector instruction can be executed conditionally (i.e., predicated) under control of a mask which is evaluated dynamically. The mask is specified in a Vector Mask Register.
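The difference between the two forms of predication can be sketched as follows. The function names are hypothetical, and the sketch assumes that a masked-out vector element leaves the corresponding destination element unchanged, by analogy with a discarded scalar instruction; the text does not spell out that detail.

```python
def predicated(condition, operation, destination):
    """Scalar predication: the whole instruction either completes or is discarded.

    `condition` stands in for the value of a Condition Register.
    """
    return operation() if condition else destination


def masked_vector_add(a, b, destination, mask):
    """Vector predication: each of the four element operations is controlled separately.

    `mask` stands in for a Vector Mask Register. Assumption (not from the
    source): masked-out elements keep the old destination value.
    """
    return [x + y if m else d for x, y, d, m in zip(a, b, destination, mask)]


old = [0, 0, 0, 0]
result = masked_vector_add([1, 2, 3, 4], [10, 20, 30, 40], old, [True, False, True, False])
print(result)  # [11, 0, 33, 0]
```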

Vector units

The most salient computing resources within the eLite architecture are the three types of Vector Units. All these units can operate in parallel.

Source operands for 16-bit Vector Element arithmetic and logic instructions (16-bit datapath) always originate from the Vector Element Register file. The destination of all 16-bit vector element operations is a Vector Accumulator Register (40-bit result). Each block within the 16-bit datapath consists of a multiplier and an ALU that performs arithmetic, logic, shift, and select operations on the contents of Vector Element Registers.

Every access to the Vector Element Register file is performed indirectly through a Vector Pointer Register. Each Vector Pointer Register contains four elements which are used as indexes into the Vector Element Register file, allowing the access to four independent Vector Element Registers. The Vector Pointer Registers can be automatically updated when used to access Vector Element Registers, and automatically implement "circular addressing" within a range of the Vector Element Register file.
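A minimal model of this indirect access is sketched below. The register file is a plain list, and the names gather and update_circular are hypothetical. The stride, base and size parameters of the update are assumptions made for illustration; the text only says that pointers can be updated automatically with circular addressing within a range of the register file.

```python
def gather(ver, vpr):
    """Read the four Vector Element Registers selected by the pointer elements."""
    return [ver[index] for index in vpr]


def update_circular(vpr, stride, base, size):
    """Advance each pointer element by `stride`, wrapping inside [base, base + size).

    Assumption (not from the source): one common stride and one common
    circular range apply to all four pointer elements.
    """
    return [base + (index - base + stride) % size for index in vpr]


ver = list(range(100, 164))   # 64 registers, the smallest file size named in the text
vpr = [0, 5, 2, 7]            # four independent indexes into the register file
print(gather(ver, vpr))       # [100, 105, 102, 107]

# Pointers at the end of an 8-register circular range wrap back to its start.
print(update_circular([4, 5, 6, 7], stride=4, base=0, size=8))  # [0, 1, 2, 3]
```

Because the four indexes are unrelated, the operand vector does not need to occupy adjacent registers, which is the "disjoint data" aspect of SIMdD.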

Vector Accumulator Registers are used as source and destination for 40-bit Vector Accumulator arithmetic and logic instructions (40-bit datapath). Vector Accumulator Registers are also used as destination for 16-bit vector operations, as well as source operands in instructions to convert data from 40-bit to 16-bit whose result is placed in Vector Element Registers. Regardless of the instruction type, Vector Accumulator Registers are accessed as 40-bit quantities. In the case of conversion to 16-bit, saturation and rounding rules can be applied.
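The conversion step can be illustrated with a generic saturating, rounding narrow operation. The specific rules used by eLite are not given in the text, so everything here is an assumption: the function names, the optional right shift, round-half-up rounding, and clamping to the signed 16-bit range.

```python
INT16_MIN = -(1 << 15)
INT16_MAX = (1 << 15) - 1


def to_int16(accumulator, shift=0, rounding=True):
    """Narrow a signed accumulator value to the signed 16-bit range.

    Assumptions (not from the source): the low `shift` bits are dropped,
    rounding is round-half-up, and out-of-range results saturate.
    """
    if shift > 0:
        if rounding:
            accumulator += 1 << (shift - 1)  # add half of the dropped range
        accumulator >>= shift                # arithmetic shift on Python integers
    return max(INT16_MIN, min(INT16_MAX, accumulator))


def convert_vector(var, shift=0, rounding=True):
    """Apply the conversion to the four elements of one accumulator register."""
    return [to_int16(element, shift, rounding) for element in var]


print(convert_vector([40000, -40000, 123, -1]))  # [32767, -32768, 123, -1]
```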

Computing capabilities in a LIW

Since a LIW may contain up to three instructions, the architecture supports "3-wide" instruction-level parallelism. Some of these instructions are compounded instructions which specify more than one operation. Moreover, vector instructions specify operations on vectors with four elements. As a result, the total parallel computing capability available in a single LIW is a large number of basic operations. For example,

  • a single LIW may contain a Load Vector with Update instruction, a Vector Element Multiply instruction, and a Vector Accumulator Add instruction, with each one of them producing as result a vector with four elements; that is, four operations each, for a total of twelve operations;
  • the Load Vector with Update instruction implicitly specifies the automatic update of the Address Register and the four elements of the Vector Pointer Register used by the instruction, including support for circular addressing on both, adding another five operations to the set performed within the LIW;
  • the Vector Element Multiply instruction implicitly specifies the update of the two Vector Pointer Registers used to access the Vector Element Registers containing the operands for the instruction, including circular addressing within the Vector Element Register file, adding two sets of four update operations to the computations specified in the LIW.

Consequently, such a single LIW actually specifies transformations on 25 data items, for a total of 25 basic operations.
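The tally behind this figure can be reproduced in a few lines. The variable names are chosen for illustration only; the counts come from the list above.

```python
VECTOR_LENGTH = 4  # elements per vector (from the text)

# Load, multiply and add each produce a four-element result.
vector_results = 3 * VECTOR_LENGTH

# The load updates one Address Register and four Vector Pointer elements.
load_updates = 1 + VECTOR_LENGTH

# The multiply updates two Vector Pointer Registers of four elements each.
multiply_updates = 2 * VECTOR_LENGTH

total = vector_results + load_updates + multiply_updates
print(vector_results, load_updates, multiply_updates, total)  # 12 5 8 25
```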

See the Publications and Presentations sections for further information regarding the eLite DSP compiler.