Saturday, January 1, 2011

Limitations of ILP

To see what the limits of ILP might be, we first need to define an ideal processor. An
ideal processor is one where all artificial constraints on ILP are removed. The only limits
on ILP in such a processor are those imposed by the actual data flows either through
registers or memory.
The assumptions made for an ideal or perfect processor are as follows:
1. Register renaming—There are an infinite number of virtual registers available
and hence all WAW and WAR hazards are avoided and an unbounded number
of instructions can begin execution simultaneously.
2. Branch prediction—Branch prediction is perfect. All conditional branches are
predicted exactly.
3. Jump prediction—All jumps (including jump register used for return and
computed jumps) are perfectly predicted. When combined with perfect branch
prediction, this is equivalent to having a processor with perfect speculation
and an unbounded buffer of instructions available for execution.
4. Memory-address alias analysis—All memory addresses are known exactly and
a load can be moved before a store provided that the addresses are not identical.
Assumptions 2 and 3 eliminate all control dependences. Likewise, assumptions
1 and 4 eliminate all but the true data dependences. Together, these four assumptions
mean that any instruction in the program’s execution can be
scheduled on the cycle immediately following the execution of the predecessor
on which it depends. It is even possible, under these assumptions, for the last
dynamically executed instruction in the program to be scheduled on the very first
cycle! Thus, this set of assumptions subsumes both control and address speculation
and implements them as if they were perfect. Initially, we examine a processor that can
issue an unlimited number of instructions at once, looking arbitrarily far ahead in the
computation. For all the processor models we examine, there are no restrictions on what
types of instructions can execute in a cycle. For the unlimited-issue case, this means there
may be an unlimited number of loads or stores issuing in one clock cycle. In addition,
all functional unit latencies are assumed to be one cycle, so that any sequence of
dependent instructions can issue on successive cycles. Latencies longer than one
cycle would decrease the number of issues per cycle, although not the number of
instructions under execution at any point. (The instructions in execution at any
point are often referred to as in-flight.) Finally, we assume perfect caches, which is
equivalent to saying that all loads and stores always complete in one cycle. This
assumption allows our study to focus on fundamental limits to ILP. The resulting data,
however, will be very optimistic, because realistic caches would significantly reduce the
amount of ILP that could be successfully exploited, even if the rest of the processor were
perfect!
To measure the available parallelism, a set of programs were compiled and optimized
with the standard MIPS optimizing compilers. The programs were instrumented and
executed to produce a trace of the instruction and data references. Every instruction in the
trace is then scheduled as early as possible, limited only by the data dependences. Since a
trace is used, perfect branch prediction and perfect alias analysis are easy to do. With
these mechanisms, instructions may be scheduled much earlier than they would otherwise,
moving across large numbers of instructions on which they are not data dependent,
including branches, since branches are perfectly predicted.
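The trace-scheduling measurement described above can be sketched in a few lines. This is a toy illustration, not the study's actual tool: each instruction is an invented (destination, sources) pair, every latency is one cycle, and only true data dependences constrain the schedule.

```python
# Toy sketch: schedule a dynamic trace as early as possible, limited
# only by true data dependences, and report the resulting ILP.
# The instruction format and register names are made up for illustration.

def ideal_ilp(trace):
    """trace: list of (dest_reg, [src_regs]) in dynamic order."""
    ready = {}          # register -> cycle its value becomes available
    cycles_used = 0
    for dest, srcs in trace:
        # With 1-cycle latencies, an instruction issues one cycle after
        # the latest producer of its operands (cycle 1 if it has none).
        issue = max((ready.get(s, 0) for s in srcs), default=0) + 1
        ready[dest] = issue
        cycles_used = max(cycles_used, issue)
    return len(trace) / cycles_used   # instructions per cycle

# Two independent chains of three dependent instructions each:
trace = [("r1", []), ("r2", ["r1"]), ("r3", ["r2"]),
         ("r4", []), ("r5", ["r4"]), ("r6", ["r5"])]
print(ideal_ilp(trace))   # 6 instructions in 3 cycles -> 2.0
```

Because branches and aliases are resolved perfectly by the trace, nothing in this schedule ever waits on control flow, only on operand values.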
Limitations on the Window Size and Maximum Issue Count
To build a processor that even comes close to perfect branch prediction and perfect alias
analysis requires extensive dynamic analysis, since static compile-time schemes cannot
be perfect. Of course, most realistic dynamic schemes will not be perfect, but the use of
dynamic schemes will provide the ability to uncover parallelism that cannot be analyzed
by static compile-time analysis. Thus, a dynamic processor might be able to more closely
match the amount of parallelism uncovered by our ideal processor.
To gain insight into this question, consider what the perfect processor must do:
1. Look arbitrarily far ahead to find a set of instructions to issue, predicting all
branches perfectly.
2. Rename all register uses to avoid WAR and WAW hazards.
3. Determine whether there are any data dependences among the instructions in
the issue packet; if so, rename accordingly.
4. Determine if any memory dependences exist among the issuing instructions
and handle them appropriately.
5. Provide enough replicated functional units to allow all the ready instructions
to issue.
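Step 2 in the list above, register renaming, can be sketched briefly. In this hedged illustration (the instruction format and virtual-register names are invented), each architectural destination gets a fresh virtual register, so WAW and WAR hazards disappear and only true (RAW) dependences remain.

```python
# Sketch of register renaming: map each architectural destination to a
# fresh virtual register drawn from an unbounded pool.
import itertools

def rename(trace):
    """trace: list of (dest, [srcs]); returns the renamed trace."""
    fresh = (f"v{i}" for i in itertools.count())
    latest = {}                      # architectural reg -> current virtual reg
    out = []
    for dest, srcs in trace:
        new_srcs = [latest.get(s, s) for s in srcs]  # read the latest mapping
        latest[dest] = next(fresh)                   # fresh name removes WAW/WAR
        out.append((latest[dest], new_srcs))
    return out

# r1 is written twice (a WAW hazard); renaming gives the two writes
# distinct virtual registers:
print(rename([("r1", []), ("r2", ["r1"]), ("r1", ["r2"])]))
# [('v0', []), ('v1', ['v0']), ('v2', ['v1'])]
```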
Obviously, this analysis is quite complicated. For example, to determine
whether n issuing instructions have any register dependences among them, assuming
all instructions are register-register and the total number of registers is
unbounded, requires 2n − 2 + 2n − 4 + ... + 2 = n² − n
comparisons. Thus, to detect dependences among the next 2000 instructions—the default
size we assume in several figures—requires almost four million comparisons! Even
issuing only 50 instructions requires 2450 comparisons. This cost obviously limits the
number of instructions that can be considered for issue at once. The window size directly
limits the number of instructions that begin execution in a given cycle. In practice, real
processors will have a more limited number of functional units (e.g., no processor has
handled more than two memory references per clock or more than two FP operations), as
well as limited numbers of buses and register access ports, which serve as limits on the
number of instructions initiated in the same clock. Thus, the maximum number of
instructions that may issue, begin execution, or commit in the same clock cycle is usually
much smaller than the window size.
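The comparison counts quoted above follow directly from the n² − n formula; a two-line check confirms them:

```python
# Check the n^2 - n dependence-comparison count from the text.
def comparisons(n):
    # Each instruction's two source registers must be checked against
    # every earlier destination: 2*1 + 2*2 + ... + 2*(n-1) = n^2 - n.
    return n * n - n

print(comparisons(50))     # 2450
print(comparisons(2000))   # 3998000, i.e. almost four million
```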

Addressing Modes in Computer Architecture

Addressing Modes:
Given an address, we now know what bytes to access in memory. In this
section we will look at addressing modes—how architectures specify the
address of an object they will access. Addressing modes specify constants
and registers in addition to locations in memory. When a memory location is
used, the actual memory address specified by the addressing mode is called
the effective address.
Immediates or literals are usually considered memory-addressing modes
(even though the value they access is in the instruction stream), although
registers are often separated. We have kept addressing modes that depend
on the program counter, called PC-relative addressing, separate. PC-relative
addressing is used primarily for specifying code addresses in control
transfer instructions.
The figure shows the most common names for the addressing modes, though the
names differ among architectures. In this figure and throughout the book,
we will use an extension of the C programming language as a hardware
description notation. In this figure, only one non-C feature is used: The left
arrow (←) is used for assignment. We also use the array Mem as the name
for main memory and the array Regs for registers. Thus, Mem[Regs[R1]]
refers to the contents of the memory location whose address is given by the
contents of register 1 ( R1 ). Later, we will introduce extensions for
accessing and transferring data smaller than a word. Addressing modes have
the ability to significantly reduce instruction counts; they also add to the
complexity of building a computer and may increase the average CPI (clock
cycles per instruction) of computers that implement those modes. Thus, the
usage of various addressing modes is quite important in helping the architect
choose what to include.
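The Mem/Regs notation above translates directly into code. The following sketch shows how a few common addressing modes compute an effective address; the memory size, register contents, and mode selection are invented for illustration and do not model any real ISA.

```python
# Illustrative sketch of effective-address computation for common
# addressing modes, using the book's Mem/Regs notation.
Mem = [0] * 1024          # main memory, word-addressed for simplicity
Regs = {"R1": 100, "R2": 8}
Mem[100] = 42
Mem[108] = 7

def register_indirect(reg):          # Mem[Regs[reg]]
    return Mem[Regs[reg]]

def displacement(disp, reg):         # Mem[disp + Regs[reg]]
    return Mem[disp + Regs[reg]]

def indexed(reg1, reg2):             # Mem[Regs[reg1] + Regs[reg2]]
    return Mem[Regs[reg1] + Regs[reg2]]

print(register_indirect("R1"))       # 42: address 100 comes from R1
print(displacement(8, "R1"))         # 7: effective address 8 + 100 = 108
print(indexed("R1", "R2"))           # 7: effective address 100 + 8 = 108
```

Note how displacement and indexed addressing reach the same effective address here; the difference is only whether the offset is a constant in the instruction or a value in a second register.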