The problem. It used to be that understanding microprocessors was easy. Testing sufficed to verify their correctness, and linear
formulas accurately explained their performance. Today, processors baffle their own creators.
The reason is Moore's Law: to obtain speedup, processor designers turn the growing supply of transistors into growing parallelism, at a
growing number of levels (e.g., out-of-order execution, pipelined memory hierarchy, multi-threading). While the effects of parallelism
on verification have already been recognized (e.g., via model checking), the problem of
performance complexity has been attacked only with ad hoc methods.
Our approach. The overall goal of the BAFL project is to develop a robust foundation for guiding micro-architectural
innovations as transistor counts surpass one billion. Specifically, we are developing methods for finding and eliminating
bottlenecks---program instructions and processor resources responsible for lost performance and wasted power. This task is
complex because the more parallel the machine, the harder it is to tell which execution events
(e.g., cache misses, ALU operations, message transactions) constrained the execution, and which had their
latencies tolerated.
Our framework is built on dependence-graph analysis of a program's execution, implementable entirely in hardware. The framework enables a
qualitatively new way of thinking about performance. For example, once micro-execution events and dependences are represented as a suitable
dependence graph, the graph's critical path automatically identifies which processor stage (e.g., fetch, execute, or commit) is the
bottleneck, and for which dynamic instructions.
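As a concrete illustration, the critical path of such a dependence graph can be computed with a single longest-path pass in topological order. The sketch below is only that textbook computation, not the hardware mechanism from our papers; the event names (F/E/C for fetch, execute, commit of two instructions) and latencies are invented.

```python
# Longest (critical) path through a small dependence DAG.
# Nodes are micro-execution events; edge weights are latencies in cycles.
from collections import defaultdict

def critical_path(edges, source, sink):
    """edges: list of (src, dst, latency) triples forming a DAG."""
    succ = defaultdict(list)
    indeg = defaultdict(int)
    nodes = set()
    for u, v, w in edges:
        succ[u].append((v, w))
        indeg[v] += 1
        nodes.update((u, v))
    # Kahn's algorithm: produce a topological order of the events.
    order, stack = [], [n for n in nodes if indeg[n] == 0]
    while stack:
        u = stack.pop()
        order.append(u)
        for v, _ in succ[u]:
            indeg[v] -= 1
            if indeg[v] == 0:
                stack.append(v)
    # Longest-path relaxation with predecessor links.
    dist = {n: float("-inf") for n in nodes}
    pred = {}
    dist[source] = 0
    for u in order:
        for v, w in succ[u]:
            if dist[u] + w > dist[v]:
                dist[v] = dist[u] + w
                pred[v] = u
    # Walk back from the sink to recover the critical path itself.
    path, n = [sink], sink
    while n != source:
        n = pred[n]
        path.append(n)
    return dist[sink], path[::-1]

# Hypothetical events: fetch (F), execute (E), commit (C) of two instructions.
edges = [("F1", "E1", 3), ("E1", "C1", 1),
         ("F1", "F2", 1), ("F2", "E2", 1),
         ("E1", "E2", 2), ("E2", "C2", 1), ("C1", "C2", 1)]
length, path = critical_path(edges, "F1", "C2")
print(length, path)  # prints: 6 ['F1', 'E1', 'E2', 'C2']
```

Here the critical path runs through the execute events, so the execute stage, not fetch or commit, is the bottleneck for these (invented) latencies.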
Our solutions attack performance-understanding problems for which no methods (or no efficient ones) previously existed. These problems span the entire
processor life cycle, for example:
Criticality. Our
ISCA'01 paper made three notable contributions to computer architecture: First, it
developed a dependence-graph abstraction of a program executing on a given micro-architecture. The abstraction established a formal basis
for our work and also demystified the fuzzy notion of the ``critical path,'' previously used informally (and sometimes incorrectly) in
justifying new processor designs.
Second, we solved the open problem of how the processor itself can compute the critical path. Based on a novel randomized algorithm, our
critical-path analyzer, implemented in hardware, does so very efficiently, without ever building the dependence graph, and with
only a tiny storage array.
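The underlying idea can be sketched in software as token passing: plant a token at a node, and forward it only along each later node's last-arriving edge; if the token survives far into the future, the planting node was on the critical path. The sketch below is a simplified software rendering of that idea, and its event numbering and arrival times are invented (in hardware, planting nodes are chosen at random and only the token state is stored).

```python
def token_survives(arrival, preds, start, end):
    """arrival[n]: cycle at which event n completes.
    preds[n]: predecessor events of n (events are numbered in topological order).
    Returns True if a token planted at `start` propagates to `end`."""
    tokened = {start}
    for n in sorted(preds):
        if n <= start:
            continue
        # The last-arriving predecessor is the one that determined when n happened.
        last = max(preds[n], key=lambda p: arrival[p])
        if last in tokened:
            tokened.add(n)
    return end in tokened

# Invented example: event 1 arrives late (cycle 3) and gates event 3;
# event 2 arrives early (cycle 1) and its latency is tolerated.
arrival = {0: 0, 1: 3, 2: 1, 3: 4, 4: 5}
preds = {1: [0], 2: [0], 3: [1, 2], 4: [3]}
print(token_survives(arrival, preds, 1, 4))  # True: event 1 is critical
print(token_survives(arrival, preds, 2, 4))  # False: event 2 was off the critical path
```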
Finally, observing that the past criticality of an instruction correlates with its future criticality, we turned the critical-path analyzer into
a criticality predictor, thus enabling the first truly cost-aware processor policies. One such policy, for instruction
scheduling, sped up programs by up to 22% simply by changing the execution order of instructions, without adding any
resources to the design.
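Such a predictor can be approximated in software by a PC-indexed table of saturating counters trained with the analyzer's verdicts; the table size, counter width, and threshold below are illustrative assumptions, not the parameters of our hardware.

```python
class CriticalityPredictor:
    """Illustrative sketch: PC-indexed table of saturating counters.
    train() is called with the analyzer's verdict for a retired instruction;
    predict() guesses whether the next dynamic instance of the same static
    instruction will be critical."""

    def __init__(self, entries=1024, max_count=3, threshold=2):
        self.entries = entries
        self.max_count = max_count
        self.threshold = threshold
        self.table = [0] * entries

    def _index(self, pc):
        # Direct-mapped indexing; a real design might hash or tag entries.
        return pc % self.entries

    def train(self, pc, was_critical):
        i = self._index(pc)
        if was_critical:
            self.table[i] = min(self.table[i] + 1, self.max_count)
        else:
            self.table[i] = max(self.table[i] - 1, 0)

    def predict(self, pc):
        return self.table[self._index(pc)] >= self.threshold
```

The saturating counters provide hysteresis, so a single non-critical occurrence does not immediately flip a reliably critical instruction's prediction.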
Slack. Perhaps the greatest challenges facing upcoming processors are technological constraints, such as limited power consumption and long
wire delays. To accommodate them, future designs will be non-uniform: rather than constraining all resources equally,
they will offer resources at multiple quality levels, with the fast, power-hungry ones
(e.g., high-speed functional units) reserved for the more critical instructions. The key problem is how to
dynamically assign instructions to resources so that the performance penalty of the lower-quality resources is completely hidden.
Our ISCA'02 paper solves this problem by exploring a refined
notion of criticality, slack. Defined as the number of cycles an instruction can be delayed without slowing down the overall
execution, slack provides a natural solution---if only we could find it at run time. Our paper distinguished several notions of slack and
showed that slack is plentiful in the SPEC2000 workload. To exploit it, we developed a hardware slack analyzer (thanks to a reduction
trick, as simple as the criticality analyzer) and used it to systematically devise a control policy for a power-friendly processor
in which one half is clocked at half the frequency. With existing policies, this processor executes programs
10% slower than a full-frequency one; with our slack-based policy, it incurs no performance degradation.
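On a dependence graph, per-node slack can be computed off-line with the classic forward/backward passes (earliest versus latest completion times). The sketch below is that textbook computation, not our hardware analyzer, and it reuses invented event names and latencies.

```python
# Per-node slack in a dependence DAG: how many cycles an event's
# completion may slip without delaying the sink (program end).
from collections import defaultdict

def slacks(edges, sink):
    """edges: list of (src, dst, latency) triples forming a DAG."""
    succ, pred = defaultdict(list), defaultdict(list)
    nodes = set()
    for u, v, w in edges:
        succ[u].append((v, w))
        pred[v].append((u, w))
        nodes.update((u, v))
    # Topological order via Kahn's algorithm.
    indeg = {n: len(pred[n]) for n in nodes}
    order, stack = [], [n for n in nodes if indeg[n] == 0]
    while stack:
        u = stack.pop()
        order.append(u)
        for v, _ in succ[u]:
            indeg[v] -= 1
            if indeg[v] == 0:
                stack.append(v)
    # Forward pass: earliest completion time of each event.
    early = {n: 0 for n in nodes}
    for u in order:
        for v, w in succ[u]:
            early[v] = max(early[v], early[u] + w)
    # Backward pass: latest completion time that keeps the sink on schedule.
    late = {n: early[sink] for n in nodes}
    for v in reversed(order):
        for u, w in pred[v]:
            late[u] = min(late[u], late[v] - w)
    return {n: late[n] - early[n] for n in nodes}

# Same invented two-instruction example as before: fetch/execute/commit events.
edges = [("F1", "E1", 3), ("E1", "C1", 1),
         ("F1", "F2", 1), ("F2", "E2", 1),
         ("E1", "E2", 2), ("E2", "C2", 1), ("C1", "C2", 1)]
print(slacks(edges, "C2"))
```

Critical-path events get slack 0, while off-path events such as F2 and C1 report how many cycles they could tolerate a slower, lower-power resource; a control policy can steer exactly those instructions to the half-frequency half of the machine.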