eLite DSP Project - Compiler


eLite DSP compiler

The eLite DSP compiler is based on the IBM VLIW Research Compiler originally designed for tree-VLIW processors. The compiler uses an enhanced form of Dependence Flow Graph (DFG) for its internal representation, and also extensively uses static single assignment (SSA) and reverse-SSA forms. It has a repertoire of standard SSA-based optimizations, and provides a rich collection of functions/macros for compiler development. Among other features, the eLite DSP compiler provides an excellent platform for developing and exploring new compilation techniques.

New optimizations and enhancements were introduced to the compiler in order to generate efficient code for the eLite DSP architecture. After transforming the dependence-flow graph through a number of optimizations, a novel vectorization phase tries to identify vectorizable code sequences and updates the DFG with a vectorized version of loops. Separating vectorization from instruction scheduling and register allocation has advantages such as reducing the complexity of the code generator, and offering flexibility to schedule and software-pipeline vectorized code along with scalar code. The vectorized code is scheduled and register allocated before emitting the final Assembly code.

The compiler also makes use of predication to collapse small blocks of conditionally executed code, thus eliminating branches and their overhead. Among other optimizations, the compiler does function inlining, software-pipelining, synthetic branch frequency based optimizations, as well as inter-procedural analysis.

The internal representation of the compiler uses primitive operations similar to those found in RISC processors. Optimizations prior to scheduling and register allocation are performed on a DFG with such operations as nodes. The instruction selector provides a binding between such RISC operations and instructions in the eLite architecture. Instruction selection is carried out along with instruction scheduling based on a number of factors such as data type and size of operands, automatic update of registers, resource requirements, etc.

A vectorization phase has been developed to make efficient use of the SIMD instructions and the unique Vector Element Register file (VER) of the eLite DSP. The VER is modeled as a compiler-controlled memory which can be indirectly addressed using vector pointer registers. Because all VER allocations are done by the compiler, complete aliasing information for VER accesses is available. The compiler takes advantage of this fact by applying aggressive memory-related optimizations.

The high-level vectorization scheme is similar to the unroll-and-jam technique - the loop is unrolled by a factor of 4, and scalar operations are replaced with their SIMD versions. In some cases, the compiler performs additional optimizations, such as pre-loading data into the VER before the loop to eliminate redundant loads. An important stage of vectorization is setting up vector pointers. Flexibility of VER access through vector pointers allows efficient compilation of computations that involve non-consecutive data access patterns. This aggressive vectorization optimization is based on a set of innovative analyses, on top of the standard vectorization tests such as . The compiler analyzes memory access patterns in order to effectively vectorize loads and stores within each iteration. Loop context analysis eliminates memory loads and stores across iterations and between different loops. Additional transformations which expose more parallelism, such as loop transformations, continue to undergo development.

The register name space in the eLite DSP architecture is functionally partitioned into multiple register files associated with respective functional units. This results in a set of register types such as integer register (IR), address register (AR), vector element register (VER), and vector accumulator register (VAR). The compiler is responsible for assigning each register Def to a specific register file/type, and for inserting explicit instructions moving the data between different register files. Given a register Def, the operation defining it, and the operations using it, the choice of a register file is guided by the following considerations:

  1. Availability of operations in different functional units
  2. Performance issues, such as pipeline latencies of instructions, possibilities of VER reuse, etc.

The partitioning process based on the above is augmented with an efficient min-cut graph partitioning based scheme that minimizes the communication between the register files in order to reduce the number of move instructions

Resource modeling

The primary goal of the resource model in a compiler is one of providing an abstract view of the processor to the machine-dependent optimization routines. A cycle-accurate model of the processor has been developed to model pipeline resources, register file ports, and the interconnect bus. We have also developed some low-overhead techniques to model non-uniform port access delays resulting from microarchitecture techniques that reduce power and hardware complexity, such as restricted bypasses. In addition, the resource model provides internal-to-ISA instruction mapping information for the instruction selector, resource conflict checking and reservation functions required during scheduling, and register Def/Use timing information required during register allocation. The entire resource model generation is automated and table-driven, based on machine-generated architectural descriptions shared by other tools, which also helps reduce the turn-around time for architectural exploration.

Scheduling and register allocation

Generating efficient code for long non-interlocked pipelines is a big challenge by itself. When scheduling instructions, the compiler must have accurate timing information about the access to data (read and write) and to the shared resources used by the instructions. Conservative bounds are used while compiling certain regions of code (e.g., across calls) where complete timing information is not available, which in turn prevent the compiler from carrying out aggressive scheduling techniques, such as scheduling instructions in the shadow of other instructions, in those regions.

The code generator considers reducing the code size as an objective in addition to minimizing the schedule length. We use several new techniques for scheduling and cycle-level register allocation. These techniques include schemes for scheduling instructions in the branch delay slots and instruction shadows, for scheduling predicated code, and for software pipelining.

Inter-procedure optimizations

The compiler performs inter-procedural analysis for reducing the calling convention overhead by allowing callee to modify the calling conventions according to the intended usage of the parameters. Other inter-procedure optimizations being developed include schemes for sharing VER among procedures and for obtaining better bounds on resource utilization at procedure calls and returns.

We believe that the eLite DSP compiler, with the help of its powerful optimizations and further tuning, can generate executable code from source code written in C with performance and code size comparable to hand-optimized assembly-language code. Preliminary results show that, for a set of signal-processing kernels, the compiler generates vectorized code of comparable performance (well within a factor of 2) to carefully hand-coded assembly.

See the Publications and Presentations sections for further information regarding the eLite DSP compiler.