VLIW Architecture - A VLIW based on tree instructions

In our VLIW architecture, a program consists of a sequence of tree-instructions, or simply trees, each of which corresponds to an unlimited multiway branch with multiple branch targets and an unlimited set of primitive operations. All operations and branches are independent and executable in parallel. The multiway branch is associated with the internal nodes of the tree, whereas the operations are associated with the arcs. The multiway branch is the result of a set of binary tests on condition codes; the left outgoing arc from a tree node corresponds to the false outcome of the associated test, whereas the right outgoing arc corresponds to its true outcome.

Based on the evaluation of the multiway branch, a single path within a tree-instruction is selected at execution time as the taken path. Operations on the taken path are executed to completion, and their results placed in the corresponding target registers or storage locations. In contrast, operations not on the taken path of the multiway-branch are inhibited from committing their results to storage or registers. Such operations do not produce any effect on the state of the processor.

The primitive operations in a tree are subject to sequential semantics for each path of the tree, as if each primitive operation were executed in the order in which it appears in the tree-path (a tree-path starts from the root of the tree and ends in a branch target). Consequently, a primitive operation cannot use a resource which has been set by a previous operation in the same tree-path. If this requirement is not fulfilled within a tree-instruction, the results from primitive operations having an operand set by a previous operation in the same tree-path are undefined. Sequential semantics in each tree-path guarantees binary compatibility among different implementations of this architecture, each with varying degrees of parallel execution capabilities, by allowing the decomposition of a large tree into subtrees which are executed in different cycles.

Each primitive operation is encoded as a primitive instruction in one memory word. Each tree-instruction is represented in main storage as a contiguous sequence of tree-chunks (or simply chunks), with each chunk containing the primitive instructions. A chunk is the minimum unit of program specification, and the size of a tree-instruction is a multiple of the size of a chunk. For example, a chunk could correspond just to a single memory word, or to four memory words (a quadword). The chunk size determines tradeoffs in memory space and decoding complexity, but it is fixed for a particular architecture.

The sequence of chunks is obtained from the depth-first traversal of a tree-instruction, listing tests on Condition Register fields, and primitive instructions that are executed when the corresponding path of the tree is selected. The following code sequence represents a 6-way tree instructions whose branch targets are (A,B,D,C,C,E) and whose chunk size is one memory word:

Memory representation image

Testing a Condition Register field is performed with a skip conditional instruction, which corresponds to a flow-control operation within the tree, and which indicates where the tree-path corresponding to the true outcome of the test continues in storage; as a result, a skip conditional instruction is a branch with a (short) positive displacement. On the other hand, the branch targets of a tree are represented as unconditional branch instructions, which specify the next tree to be executed when the corresponding path of the current tree is selected. In the case when the target of a skip instruction contains just an unconditional branch to another tree, then the skip instruction may be replaced by a branch conditional instruction (a composition of a skip conditional immediately followed by a branch unconditional). The compiler assigns execution paths to tree paths in left-to-right order; that is, the most-probable path is the leftmost one, whereas the least-probable path is the rightmost one. As a result, all operations in the most-probable path appear in adjacent locations in memory (from L0 up to "branch A" in the example above).

The program representation illustrated above is directly executable by a superscalar processor, which can use the hardware dependence-checking and scheduling resources to determine which operations can be executed simultaneously. In the case of a VLIW processor, the end of a tree-instruction is delimited by a primitive instruction that follows an unconditional branch which is not reachable by any skip conditional instruction within the tree. In the tree-instruction above, the end of the tree is detected at label "L1:" because such a label does not appear in any of the skip instructions. The detection of this boundary in hardware is simple.

Note that any word within a tree-instruction can also correspond to the starting point of another tree. As a result, branching into a tree-instruction leads to the execution of a tree which is a subset of a larger tree.

Pruning image If the size of a tree-instruction exceeds the resources available in a processor implementation (such as branching degree, number of fixed-point or floating-point operations), then the tree-instruction is dynamically decomposed (pruned) to fit the resources. The resulting subtrees are executed in successive cycles, unless the taken path is completely contained within the first subtree. The pruning process transforms arbitrary-size tree-instructions into subtrees which fit the resources available in a processor implementation. These subtrees have the same general structure as the original trees (that is, a multiway-branch tree with operations in the tree-paths), but their size is limited. These subtrees correspond to Very Long Instruction Words (VLIWs) which are directly executed by the processor.

As an example of the pruning process, consider the execution of the tree-instruction listed earlier in a VLIW processor capable of executing only a 4-way branch. Since the tree-instruction contains a 6-way branch, it must be pruned to fit the constraint. In this case, the tree-instruction is pruned at two of the skip instructions, leading to the following representation:

Pruning image

As illustrated, the original tree-instruction is decomposed into three smaller tree-instructions, the first one containing a 4-way branch (targets T1,T2,A,B), and the other two containing 2-way branches (targets D,C and C,E, respectively).

Since only the taken path in a tree is executed to completion, the execution of all untaken paths is speculative; in fact, the compiler places instructions on those paths to take advantage of available resources, if they exist. For the case of processor implementations with fewer resources, the pruning process automatically eliminates some of the speculative operations. That is, pruning a tree-instruction implicitly performs unspeculation. Note that, if the taken path is fully contained within the first subtree, then the remaining subtrees are not accessed at all.

ForestaPC User Instruction Set Architecture [fulltext]
IBM Research Report RC20733, February 1997.

Scalable instruction-level parallelism through tree-instructions [fulltext]
IBM Research Report RC20661, December 1996

Tree-based VLIW architecture (foils, pdf, 46 KB)