VLIW Architecture - VLIW Compiler


Our VLIW compiler has the following goals:

1. Performance of the compiled code

In addition to the standard optimizations, our compiler is built around a software-pipelining scheduler that is an enhanced version of a previously published algorithm. The performance enhancements include (see the sketch after this list):

  • heuristics based on modulo scheduling;
  • infinite scheduling window.
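
To illustrate the scheduler's starting point, the sketch below computes the usual lower bound MII = max(ResMII, RecMII) from resource usage and dependence cycles. It is a minimal C sketch; the array layouts and example numbers are assumptions for illustration, not the compiler's actual interfaces.

    /* Minimal sketch of the modulo scheduler's lower bound,
     * MII = max(ResMII, RecMII).  Layouts and numbers are illustrative. */
    #include <stdio.h>

    /* ResMII: per resource class, ops needed / units available, rounded
     * up; the largest ratio bounds the initiation interval. */
    static int res_mii(const int ops[], const int units[], int nclasses)
    {
        int mii = 1;
        for (int i = 0; i < nclasses; i++) {
            int r = (ops[i] + units[i] - 1) / units[i];
            if (r > mii) mii = r;
        }
        return mii;
    }

    /* RecMII: per dependence cycle, summed latency / summed iteration
     * distance, rounded up; distance is >= 1 for a loop-carried cycle. */
    static int rec_mii(const int latency[], const int distance[], int ncycles)
    {
        int mii = 1;
        for (int i = 0; i < ncycles; i++) {
            int r = (latency[i] + distance[i] - 1) / distance[i];
            if (r > mii) mii = r;
        }
        return mii;
    }

    int main(void)
    {
        int ops[]   = { 6, 2 };  /* 6 ALU ops, 2 memory ops per iteration */
        int units[] = { 4, 2 };  /* 4 ALUs, 2 load/store units            */
        int lat[]   = { 5 };     /* one recurrence: 5 cycles of latency   */
        int dist[]  = { 2 };     /* carried across 2 iterations           */
        int res = res_mii(ops, units, 2), rec = rec_mii(lat, dist, 1);
        printf("MII = %d\n", res > rec ? res : rec);   /* prints MII = 3 */
        return 0;
    }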

Other optimizations center on exposing as much instruction-level parallelism as possible. Memory disambiguation is critical for this, so we apply a combination of strategies (one of them is sketched after this list):

  • sophisticated address computation/disambiguation;
  • cloning of loops into parallel/sequential versions;
  • address-compare buffers.
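
For example, loop cloning might produce code of the following shape. This is a hand-written C sketch assuming a simple runtime overlap test, not actual compiler output; the function scale and its arguments are invented for illustration.

    /* Sketch of cloning a loop into parallel/sequential versions: a
     * runtime overlap test guards a freely schedulable copy. */
    void scale(double *a, const double *b, int n)
    {
        if (a + n <= b || b + n <= a) {
            /* Provably disjoint: restrict tells the scheduler it may
             * overlap loads and stores across iterations. */
            double *restrict ap = a;
            const double *restrict bp = b;
            for (int i = 0; i < n; i++)
                ap[i] = 2.0 * bp[i];
        } else {
            /* Possible overlap: conservative sequential version. */
            for (int i = 0; i < n; i++)
                a[i] = 2.0 * b[i];
        }
    }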

We also apply loop-rewriting optimizations to expose parallelism within loops, such as the following (combined in the sketch after this list):

  • loop unrolling;
  • expression-tree rewriting to reduce the minimum initiation interval (MII);
  • rewriting associative operation trees.
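
The sketch below combines unrolling with rewriting of an associative reduction tree: four partial sums shorten the loop-carried add chain, cutting the recurrence-bound MII roughly by four. It is illustrative C, not compiler output.

    /* One running sum carries an add per iteration, so RecMII >= the add
     * latency; four partial sums cut the carried chain by 4x.  Assumes
     * n % 4 == 0, and that reassociating the reduction is acceptable
     * (it changes rounding for floating point). */
    double sum4(const double *x, int n)
    {
        double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
        for (int i = 0; i < n; i += 4) {   /* unrolled by 4 */
            s0 += x[i];
            s1 += x[i + 1];
            s2 += x[i + 2];
            s3 += x[i + 3];
        }
        return (s0 + s1) + (s2 + s3);      /* balanced final tree */
    }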

2. Validation

Our compiler is flexible, in the sense that changes in the target architecture can be quickly accommodated. A major use of our compiler is as follows:

  • propose an architectural change;
  • add an optimization to exploit/work around that feature;
  • measure the performance impact and thus accept/reject the change.

Architectural features that have been considered include:

  • address compare buffers;
  • combining dependent operations;
  • various flavors of prefetching;
  • select (conditional) operations.

The compiler is capable of handling different processor resource models. Run-time switches change the number and type of functional units and the size of the register files, and enable or disable individual architectural features.
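
A minimal sketch of what such a switch-driven resource model could look like; every field name and switch spelling below is an invented assumption, not the compiler's actual interface.

    /* Sketch of a run-time configurable resource model. */
    #include <stdlib.h>
    #include <string.h>

    struct machine_model {
        int n_alus;        /* integer ALUs                      */
        int n_mem_units;   /* load/store units                  */
        int int_regs;      /* integer register-file size        */
        int fp_regs;       /* floating-point register-file size */
        int has_acb;       /* address-compare buffers on/off    */
        int has_select;    /* select (conditional) ops on/off   */
    };

    /* e.g., "-malus=8", "-mregs=64", "-macb", "-mno-acb" */
    static void parse_switch(struct machine_model *m, const char *arg)
    {
        if (strncmp(arg, "-malus=", 7) == 0)      m->n_alus   = atoi(arg + 7);
        else if (strncmp(arg, "-mregs=", 7) == 0) m->int_regs = atoi(arg + 7);
        else if (strcmp(arg, "-macb") == 0)       m->has_acb  = 1;
        else if (strcmp(arg, "-mno-acb") == 0)    m->has_acb  = 0;
    }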

3. Intermediate Form

The dominant cost in achieving the two goals listed above is simply implementation and debugging time. This is reflected in the compiler's design, in particular in the mechanisms chosen to support the desired transformations.

DFGs

The intermediate code representation and data structures are the key factors in the ability to quickly add optimizations and experiment with them. We have chosen a modified version of the dependence flow graph (DFG), a representation originally developed by Dr. Pingali's group at Cornell University [Pingali et al., "Dependence Flow Graphs: An Algebraic Approach to Program Dependencies," 18th ACM Symposium on Principles of Programming Languages, pp. 67-78, 1991].

The DFG combines control-flow and dependence information in a single coherent form, so most optimizations get all the dependence information they need in a form that is easy to analyze. The DFG is also an executable representation, which makes it well suited to analysis techniques based on partial evaluation.

The dependence information is fully "pinned": the graph is in both static single-assignment (SSA) and reverse SSA form, which simplifies code motion across blocks. This is our major change to DFGs: we do not perform bypassing, since bypassing makes code motion too difficult.
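
As a rough illustration only, a pinned DFG node might be represented along these lines; all types and field names are assumptions, not the actual implementation.

    /* Each edge has exactly one producer (SSA) and exactly one consumer
     * (reverse SSA), with explicit switch/merge nodes carrying any
     * fan-out or fan-in. */
    enum dfg_op { OP_ADD, OP_LOAD, OP_STORE, OP_SWITCH, OP_MERGE /* ... */ };

    struct dfg_edge;

    struct dfg_node {
        enum dfg_op      op;
        struct dfg_edge *inputs;   /* incoming data/memory dependences */
        struct dfg_edge *outputs;  /* outgoing dependences             */
        int              block;    /* "pinned": home basic block       */
    };

    struct dfg_edge {
        struct dfg_node *def;      /* unique producer (SSA)         */
        struct dfg_node *use;      /* unique consumer (reverse SSA) */
        struct dfg_edge *next;     /* next edge on the same node    */
    };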

Aliasing

Another advantage of our intermediate form is its representation of load/store dependences. Explicit anti, output, and flow dependence edges let us perform memory anti-aliasing once and record the result in a form that all subsequent optimizations can use.
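
For reference, the three kinds of memory dependence edges, with an illustrative C fragment marking where each edge would attach:

    enum mem_dep { DEP_FLOW, DEP_ANTI, DEP_OUTPUT };

    void example(int *p)
    {
        *p = 1;        /* S1 */
        int t = *p;    /* S2: DEP_FLOW   S1 -> S2 (read after write)  */
        *p = t + 1;    /* S3: DEP_ANTI   S2 -> S3 (write after read),
                              DEP_OUTPUT S1 -> S3 (write after write) */
    }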

Intervals

We use only one other global data structure: a (pseudo-)interval hierarchy. For reducible graphs this is the usual interval hierarchy; in irreducible graphs, the irreducible portions become pseudo-intervals, which have multiple entry points from outside the loop.
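
A (pseudo-)interval node could be sketched as follows; the representation is an assumption for illustration, not the actual data structure.

    struct interval {
        int              *entries;    /* one header if reducible ...       */
        int               n_entries;  /* ... > 1 marks a pseudo-interval   */
        int              *blocks;     /* basic blocks owned by this region */
        int               n_blocks;
        struct interval **children;   /* nested (pseudo-)intervals         */
        int               n_children;
    };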

Edge Split

We maintain the following invariant: no control-flow edge runs from a basic block with multiple successors to a basic block with multiple predecessors (a "critical" edge). If such an edge exists in the input, we split it by introducing a dummy basic block. This turns out to be extremely useful, since the dummy block gives later phases a place to insert code on exactly that path.
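
A minimal C sketch of the splitting pass over a toy CFG; the types and fixed fan-out bounds are invented for illustration.

    #include <stdlib.h>

    struct block {
        struct block *succs[4]; int n_succs;
        struct block *preds[4]; int n_preds;
    };

    static void replace_pred(struct block *t, struct block *old,
                             struct block *repl)
    {
        for (int i = 0; i < t->n_preds; i++)
            if (t->preds[i] == old)
                t->preds[i] = repl;
    }

    /* For every edge whose source has >= 2 successors and whose target
     * has >= 2 predecessors, insert an empty dummy block on the edge. */
    static void split_critical_edges(struct block **blocks, int n)
    {
        for (int i = 0; i < n; i++) {
            struct block *b = blocks[i];
            if (b->n_succs < 2)
                continue;
            for (int s = 0; s < b->n_succs; s++) {
                struct block *t = b->succs[s];
                if (t->n_preds < 2)
                    continue;
                struct block *mid = calloc(1, sizeof *mid); /* dummy block */
                mid->preds[mid->n_preds++] = b;
                mid->succs[mid->n_succs++] = t;
                b->succs[s] = mid;          /* b -> mid */
                replace_pred(t, b, mid);    /* mid -> t */
            }
        }
    }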


Compiler/architecture interaction in a tree-based VLIW processor (foils, PDF, 136 KB).
Workshop on Interaction between Compilers and Computer Architectures,
1997 High-Performance Computer Architecture Conference (HPCA97),
San Antonio, February 1997.

Implementing an experimental VLIW compiler (foils, PDF, 154 KB).
Workshop on Computer Architecture Education,
1997 High-Performance Computer Architecture Conference (HPCA97),
San Antonio, February 1997.