Modern architectural trends have led to an increasing gap in performance between untuned and highly tuned programs. This is particularly true of multiprocessors, but it is also of concern on today's uniprocessors, where there can be more than an order of magnitude difference in performance between a processor cache hit and a cache miss, and four to five orders of magnitude between a memory reference and a page fault.
Effective tools for tuning parallel program performance.
Initial implementations of parallel programs typically yield disappointing
performance, but most existing program tuning tools have one of two problems.
They either measure everything that could be of relevance to program
performance, and are consequently inefficient, or they measure some aspect of
behavior in isolation from other metrics and from program structure,
limiting their usefulness. We have developed a parallel program tuning tool
that addresses these problems. The principal metric of the tool is
normalized processor time: the total processor time spent in each
section of code divided by the number of other processors that are
concurrently busy when that section of code is being executed. Tied to
the logical structure of the program, this metric provides a ``smoking gun''
pointing towards those areas of the program most responsible for poor
performance; it can be efficiently measured using sampling by a
dedicated processor.
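The metric described above can be sketched in a few lines. The snapshot format here is a hypothetical one (a list of (section, busy) pairs, one per worker processor, recorded by the dedicated sampling processor); the actual tool's data layout is not specified in this summary.

```python
from collections import defaultdict

def normalized_processor_time(samples):
    """Compute normalized processor time from sampled snapshots.

    Each snapshot is a list of (section, busy) pairs, one per worker
    processor (hypothetical format). Each busy processor's sample is
    weighted by 1 / (number of OTHER processors busy at that instant),
    so code executed while the rest of the machine sits idle -- a
    likely bottleneck -- accumulates a large normalized time.
    """
    totals = defaultdict(float)
    for snapshot in samples:
        busy_count = sum(1 for _, busy in snapshot if busy)
        for section, busy in snapshot:
            if not busy:
                continue
            others = max(busy_count - 1, 1)  # avoid division by zero
            totals[section] += 1.0 / others
    return dict(totals)
```

A section in which one processor runs alone thus scores as heavily per sample as several processors running concurrently, which is what makes the metric point at serialization.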
Effective tools for tuning memory system performance.
As computer system manufacturers build systems with a larger and larger
gap between processor cycle time and main memory latency, program
a program's performance can be improved significantly by improving its cache behavior.
Traditional performance tools allow programmers to tune CPU performance;
the question is how to build a tool to tune memory performance. We have
developed a simple yet effective approach: for tuning memory system
behavior, as opposed to optimizing CPU usage, it helps to present both
code-oriented and data-oriented performance metrics (instead of just
code-oriented metrics, as in UNIX gprof). It also helps to provide
information about the causes of poor memory system behavior, such as
whether the cache misses are due to thrashing between two data structures
mapped to the same cache blocks. We have also shown how to sample memory
behavior, so that these statistics can be collected with low runtime
overhead, demonstrating that the approach is both useful and practical.
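The kind of data-oriented diagnosis described above can be illustrated with a toy direct-mapped cache model that attributes each miss to a named data structure and counts evictions caused by a different structure. This is a sketch of the idea only; the class name, parameters, and attribution scheme are assumptions, not the tool's actual design.

```python
class ConflictProfiler:
    """Toy direct-mapped cache that attributes misses to named data
    structures and counts conflict evictions between pairs of them
    (a hypothetical sketch of data-oriented miss attribution)."""

    def __init__(self, num_blocks=64, block_size=32):
        self.num_blocks = num_blocks
        self.block_size = block_size
        self.tags = [None] * num_blocks   # memory tag cached in each block
        self.owner = [None] * num_blocks  # data structure occupying each block
        self.misses = {}                  # structure name -> miss count
        self.conflicts = {}               # (evictor, evicted) -> count

    def access(self, name, addr):
        block = (addr // self.block_size) % self.num_blocks
        tag = addr // (self.block_size * self.num_blocks)
        if self.tags[block] == tag:
            return  # cache hit
        self.misses[name] = self.misses.get(name, 0) + 1
        prev = self.owner[block]
        if prev is not None and prev != name:
            # a different structure is being evicted: a conflict miss
            key = (name, prev)
            self.conflicts[key] = self.conflicts.get(key, 0) + 1
        self.tags[block] = tag
        self.owner[block] = name
```

Two arrays whose base addresses map to the same cache blocks will show up as a large count for their pair in `conflicts`, which is exactly the "thrashing between two data structures" diagnosis mentioned above.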
Tools for application-specific virtual memory management.
The dramatic improvement in CPU performance over the past decade
has led to a gap of many orders of magnitude between CPU cycle time and
disk latency. At the same time, some applications, such as scientific
applications, databases and garbage collected programs, make use of large
amounts of virtual memory in ways that interact poorly with standard
operating system virtual memory policies. The result is a huge difference
in performance depending on the exact policy used to manage memory.
One solution is to re-structure the operating system to allow each application
to manage its own physical memory, but this requires normal application
programmers to write their own virtual memory managers, not a trivial
task. We have built a toolkit to make it easy to develop these
application-specific virtual memory managers, by building a generic
manager in an object-oriented fashion so that it is easy to modify to suit
the specific needs of an application.
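The object-oriented structure described above can be sketched as a generic manager with an overridable replacement policy. The class names and the MRU example are illustrative assumptions: sequential scans (as in databases) are a classic case where a most-recently-used policy beats the default LRU.

```python
class GenericVMManager:
    """Generic pager sketch: subclasses override victim() to plug in
    an application-specific replacement policy (hypothetical API)."""

    def __init__(self, num_frames):
        self.num_frames = num_frames
        self.resident = []  # resident pages, least recently used first
        self.faults = 0

    def reference(self, page):
        if page in self.resident:
            self.resident.remove(page)
            self.resident.append(page)  # record recency on a hit
            return
        self.faults += 1
        if len(self.resident) >= self.num_frames:
            self.resident.remove(self.victim())
        self.resident.append(page)

    def victim(self):
        return self.resident[0]  # default policy: evict the LRU page


class MRUManager(GenericVMManager):
    """Evict the most-recently-used page instead: better for large
    cyclic sequential scans that overflow physical memory."""

    def victim(self):
        return self.resident[-1]
```

On a repeated sequential scan of four pages with three frames, the default LRU policy faults on every reference while the three-line MRU subclass does not, which is the kind of application-specific win the toolkit is meant to make easy.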