VLIW Architecture - Basic Principles

Very-Long Instruction Word (VLIW) architectures are a suitable alternative for exploiting instruction-level parallelism (ILP) in programs, that is, for executing more than one basic (primitive) instruction at a time. These processors contain multiple functional units, fetch from the instruction cache a Very-Long Instruction Word containing several primitive instructions, and dispatch the entire VLIW for parallel execution. These capabilities are exploited by compilers which generate code that has grouped together independent primitive instructions executable in parallel. The processors have relatively simple control logic because they do not perform any dynamic scheduling nor reordering of operations (as is the case in most contemporary superscalar processors).

VLIW has been described as a natural successor to RISC, because it moves complexity from the hardware to the compiler, allowing simpler, faster processors. As stated in Microprocessor Report (2/14/94):
"The objective of VLIW is to eliminate the complicated instruction scheduling and parallel dispatch that occurs in most modern microprocessors. In theory, a VLIW processor should be faster and less expensive than a comparable RISC chip."

The instruction set for a VLIW architecture tends to consist of simple instructions (RISC-like). The compiler must assemble many primitive operations into a single "instruction word" such that the multiple functional units are kept busy, which requires enough instruction-level parallelism (ILP) in a code sequence to fill the available operation slots. Such parallelism is uncovered by the compiler through scheduling code speculatively across basic blocks, performing software pipelining, reducing the number of operations executed, among others.

VLIW has been perceived as suffering from important limitations, such as the need for a powerful compiler, increased code size arising from aggresive scheduling policies, larger memory bandwidth and register-file bandwidth, limitations due to the lock-step operation, binary compatibility across implementations with varying number of functional units and latencies. In recent years, there has been significant progress regarding these issues, due to general advances in semiconductor technology as well as to VLIW-specific activities. For example, our tree-based VLIW architecture provides binary compatibility for VLIW implementations of varying width, and our VLIW compiler contains state-of-the-art parallelizing/optimizing algorithms.