Project Name
In our VLIW architecture, a program consists of a sequence of
tree-instructions, or simply trees,
each of which corresponds to an unlimited multiway branch
with multiple branch targets and an unlimited set of primitive
operations. All operations and branches are independent and
executable in parallel. The multiway branch is associated with
the internal nodes of the tree, whereas the operations are
associated with the arcs. The multiway branch is the result of
a set of binary tests on condition codes; the left outgoing
arc from a tree node corresponds to the false outcome of the
associated test, whereas the right outgoing arc corresponds to its
true outcome.
Based on the evaluation of the multiway branch, a single path within
a tree-instruction is selected at execution time as the taken path.
Operations on the taken path are executed to completion, and their results
placed in the corresponding target registers or storage locations. In
contrast, operations not on the taken path of the multiway-branch are
inhibited from committing their results to storage or registers. Such
operations do not produce any effect on the state of the processor.
The primitive operations in a tree are subject to sequential
semantics for each path of the tree, as if each primitive
operation were executed in the order in which it appears in
the tree-path (a tree-path starts from the root of the tree and
ends in a branch target). Consequently, a primitive operation
cannot use a resource which has been set by a previous operation
in the same tree-path. If this requirement is not fulfilled within
a tree-instruction, the results from primitive operations having
an operand set by a previous operation in the same tree-path are
undefined. Sequential semantics in each tree-path guarantees
binary compatibility among different implementations of this
architecture, each with varying degrees of parallel execution
capabilities, by allowing the decomposition of a large tree into
subtrees which are executed in different cycles.
Each primitive operation is encoded as a primitive instruction in one
memory word. Each tree-instruction is represented in main storage as a
contiguous sequence of tree-chunks (or simply chunks), with
each chunk containing the primitive instructions.
A chunk is the minimum unit of program specification, and the size
of a tree-instruction is a multiple of the size of a chunk. For
example, a chunk could correspond just to a single memory word, or
to four memory words (a quadword). The chunk size determines
tradeoffs in memory space and decoding complexity, but it is
fixed for a particular architecture.
The sequence of chunks is obtained from the depth-first
traversal of a tree-instruction, listing tests on Condition
Register fields, and primitive instructions that are executed
when the corresponding path of the tree is selected. The following
code sequence represents a 6-way tree instructions whose branch
targets are (A,B,D,C,C,E) and whose chunk size is one memory word:
Testing a Condition Register field is performed with a skip
conditional instruction, which corresponds to a flow-control
operation within the tree, and which indicates where the tree-path
corresponding to the true outcome of the test continues in storage;
as a result, a skip conditional instruction is a branch with a
(short) positive displacement. On the other hand, the branch targets of a
tree are represented as unconditional branch instructions, which
specify the next tree to be executed when the corresponding path of the
current tree is selected. In the case when the target of a skip instruction
contains just an unconditional branch to another tree, then the skip
instruction may be replaced by a branch conditional instruction (a
composition of a skip conditional immediately followed by a branch
unconditional). The compiler assigns execution paths to tree paths
in left-to-right order; that is, the most-probable path is the leftmost
one, whereas the least-probable path is the rightmost one. As a result,
all operations in the most-probable path appear in adjacent locations
in memory (from L0 up to "branch A" in the example above).
The program representation illustrated above is directly executable by a
superscalar processor, which can use the hardware dependence-checking
and scheduling resources to determine which operations can be executed
simultaneously. In the case of a VLIW processor, the end of
a tree-instruction is delimited by a primitive instruction that
follows an unconditional branch which is not reachable by any skip
conditional instruction within the tree. In the tree-instruction
above, the end of the tree is detected at label "L1:" because
such a label does not appear in any of the skip instructions.
The detection of this boundary in hardware is simple.
Note that any word within a tree-instruction can also correspond to
the starting point of another tree. As a result, branching into a
tree-instruction leads to the execution of a tree which is a subset
of a larger tree.
As an example of the pruning process, consider the execution of
the tree-instruction listed earlier in a VLIW processor capable
of executing only a 4-way branch. Since the tree-instruction
contains a 6-way branch, it must be pruned to fit the constraint.
In this case, the tree-instruction is pruned at two of the skip
instructions, leading to the following representation:
As illustrated, the original tree-instruction is decomposed into
three smaller tree-instructions, the first one containing a 4-way
branch (targets T1,T2,A,B), and the other two containing 2-way
branches (targets D,C and C,E, respectively).
Since only the taken path in a tree is executed to completion, the
execution of all untaken paths is speculative; in fact, the compiler
places instructions on those paths to take advantage of available
resources, if they exist. For the case of processor implementations
with fewer resources, the pruning process automatically eliminates
some of the speculative operations. That is, pruning a tree-instruction
implicitly performs unspeculation. Note that, if the taken path
is fully contained within the first subtree, then the remaining subtrees
are not accessed at all.
If the size of a tree-instruction exceeds the resources available in a
processor implementation (such as branching degree, number of fixed-point
or floating-point operations), then the tree-instruction is dynamically
decomposed (pruned) to fit the resources. The resulting subtrees
are executed in successive cycles, unless the taken path is completely
contained within the first subtree. The pruning process transforms
arbitrary-size tree-instructions into subtrees which fit the resources
available in a processor implementation. These subtrees have the same
general structure as the original trees (that is, a multiway-branch tree
with operations in the tree-paths), but their size is limited.
These subtrees correspond to Very Long Instruction Words (VLIWs)
which are directly executed by the processor.
ForestaPC User Instruction Set Architecture [fulltext]
IBM Research Report RC20733, February 1997.
Scalable instruction-level parallelism through tree-instructions [fulltext]
IBM Research Report RC20661, December 1996
Tree-based VLIW architecture
(foils, pdf, 46 KB)



