Tools for Tuning Program Performance

Modern architectural trends have led to an increasing gap in performance between untuned and highly tuned programs. This is particularly true of multiprocessors, but it is also of concern on today's uniprocessors, where there can be more than an order of magnitude difference in performance between a processor cache hit and a cache miss, and four to five orders of magnitude between a memory reference and a page fault.

Effective tools for tuning parallel program performance.
Initial implementations of parallel programs typically yield disappointing performance, but most existing program tuning tools have one of two problems. They either measure everything that could be of relevance to program performance, and are consequently inefficient, or they measure some aspect of behavior in isolation from other metrics and from program structure, limiting their usefulness. We have developed a parallel program tuning tool that addresses these problems. The principal metric of the tool is normalized processor time: the total processor time spent in each section of code divided by the number of other processors that are concurrently busy when that section of code is being executed. Tied to the logical structure of the program, this metric provides a ``smoking gun'' pointing towards those areas of the program most responsible for poor performance; it can be efficiently measured using sampling by a dedicated processor.
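As a rough illustration of the metric, the sketch below computes normalized processor time from periodic samples of what each processor is running. It is not the tool's implementation: the snapshot format, the section names, and the function name are hypothetical, and the divisor is assumed to be the number of processors busy at the sample, counting the one executing the section, so that a section running alone has a divisor of one.

```python
from collections import defaultdict

def normalized_processor_time(samples, sample_interval=1.0):
    """Estimate normalized processor time per code section.

    samples: list of snapshots, each mapping a processor id to the name of
    the section it is executing, or None if that processor is idle.
    Assumption: each busy processor's time slice is divided by the number
    of processors busy in the same snapshot (including itself).
    """
    totals = defaultdict(float)
    for snapshot in samples:
        busy = [section for section in snapshot.values() if section is not None]
        if not busy:
            continue  # nobody was running; nothing to charge
        for section in busy:
            totals[section] += sample_interval / len(busy)
    return dict(totals)

# Hypothetical run on 4 processors: three samples of a serial "init"
# section followed by two samples of a fully parallel "solve" section.
samples = [
    {0: "init", 1: None, 2: None, 3: None},
    {0: "init", 1: None, 2: None, 3: None},
    {0: "init", 1: None, 2: None, 3: None},
    {0: "solve", 1: "solve", 2: "solve", 3: "solve"},
    {0: "solve", 1: "solve", 2: "solve", 3: "solve"},
]
result = normalized_processor_time(samples)
```

In this made-up run, raw processor time would rank "solve" first (8 sample-intervals against 3), while the normalized figure ranks the serial "init" section first, which is the kind of signal the metric is meant to surface.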

Effective tools for tuning memory system performance.
As computer system manufacturers build systems with a larger and larger gap between processor cycle time and main memory latency, a program's performance can be improved significantly by improving its cache behavior. Traditional performance tools allow programmers to tune CPU performance; the question is how to build a tool to tune memory performance. We have developed a simple yet effective approach. For tuning memory system behavior, as opposed to optimizing CPU usage, it helps to present both code-oriented and data-oriented performance metrics, instead of just code-oriented metrics as in UNIX gprof. It also helps to provide information about the causes of poor memory system behavior, such as whether the cache misses are due to thrashing between two data structures mapped to the same cache blocks. We have also shown that these statistics can be collected with low runtime overhead by sampling memory behavior, demonstrating that the approach is both useful and practical.
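To make the idea of data-oriented metrics and miss causes concrete, here is a minimal sketch that simulates a direct-mapped cache, counts misses per data structure, and records which structure evicted which. It is only an illustration under stated assumptions: the cache geometry (256 sets of 64-byte lines), the base addresses, and all function names are hypothetical, and this is not how the tool itself gathers its statistics.

```python
from collections import Counter

LINE_SIZE = 64                      # bytes per cache line (assumed)
NUM_SETS = 256                      # direct-mapped: one line per set (assumed)
CACHE_BYTES = LINE_SIZE * NUM_SETS  # 16 KiB

def simulate(accesses):
    """Replay (structure_name, byte_address) pairs through the cache.

    Returns (misses, conflicts): misses counts misses per structure;
    conflicts counts (victim, evictor) pairs, i.e. whose line was
    displaced by whose access.
    """
    cache = {}            # set index -> (tag, owning structure)
    misses = Counter()
    conflicts = Counter()
    for name, addr in accesses:
        line = addr // LINE_SIZE
        index, tag = line % NUM_SETS, line // NUM_SETS
        resident = cache.get(index)
        if resident is not None and resident[0] == tag:
            continue      # hit
        misses[name] += 1
        if resident is not None:
            conflicts[(resident[1], name)] += 1
        cache[index] = (tag, name)
    return misses, conflicts

def interleaved(a_base, b_base, n=128, elem_size=8):
    """Access pattern of a loop that touches a[i] and then b[i]."""
    trace = []
    for i in range(n):
        trace.append(("a", a_base + i * elem_size))
        trace.append(("b", b_base + i * elem_size))
    return trace

# b starts a multiple of the cache size after a, so a[i] and b[i]
# always map to the same cache set and keep evicting each other.
bad_misses, bad_conflicts = simulate(interleaved(0, 4 * CACHE_BYTES))

# Shifting b by half the cache size places the two arrays in disjoint sets.
good_misses, good_conflicts = simulate(
    interleaved(0, 4 * CACHE_BYTES + CACHE_BYTES // 2))
```

A code-oriented profile would attribute all of these misses to the one loop. The per-structure counts and the (victim, evictor) pairs additionally show that the two arrays are displacing each other, which suggests a fix such as changing their relative placement.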

Tools for application-specific virtual memory management.
The dramatic improvement in CPU performance over the past decade has led to a gap of many orders of magnitude between CPU cycle time and disk latency. At the same time, some applications, such as scientific applications, databases, and garbage-collected programs, make use of large amounts of virtual memory in ways that interact poorly with standard operating system virtual memory policies. The result is a huge difference in performance depending on the exact policy used to manage memory. One solution is to restructure the operating system to allow each application to manage its own physical memory, but this requires ordinary application programmers to write their own virtual memory managers, which is not a trivial task. We have built a toolkit that makes it easy to develop these application-specific virtual memory managers: it provides a generic manager, built in an object-oriented fashion, that is easy to modify to suit the specific needs of an application.
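The object-oriented idea can be sketched as a generic manager whose replacement policy is a single overridable method. This is a toy simulation, not the toolkit's interface: the class names, the `choose_victim` hook, and the choice of LRU as the default policy are all assumptions made for illustration.

```python
from collections import OrderedDict

class GenericPager:
    """Generic manager: tracks resident pages and counts page faults.

    The default policy (an assumption for this sketch) is LRU.
    """

    def __init__(self, num_frames):
        self.num_frames = num_frames
        self.resident = OrderedDict()  # ordered least to most recently used
        self.faults = 0

    def choose_victim(self):
        """Policy hook: return the resident page to evict (LRU by default)."""
        return next(iter(self.resident))

    def access(self, page):
        if page in self.resident:
            self.resident.move_to_end(page)  # mark as most recently used
            return
        self.faults += 1
        if len(self.resident) >= self.num_frames:
            del self.resident[self.choose_victim()]
        self.resident[page] = None

class ScanPager(GenericPager):
    """Application-specific manager for repeated sequential scans:
    only the policy hook changes, evicting the most recently used page."""

    def choose_victim(self):
        return next(reversed(self.resident))

def run(pager, trace):
    for page in trace:
        pager.access(page)
    return pager.faults

# Hypothetical workload: scan 5 pages three times with only 4 frames.
trace = list(range(5)) * 3
lru_faults = run(GenericPager(4), trace)
scan_faults = run(ScanPager(4), trace)
```

On this trace the default LRU policy faults on all 15 accesses, because each page is evicted just before it is needed again, while the one-method override faults 7 times. The point of the sketch is only that a policy tailored to the application's access pattern can be expressed as a small modification of a generic manager.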

Selected Publications

M. Martonosi, A. Gupta, and T. Anderson. ``Tuning Memory Performance in Sequential and Parallel Programs.'' To appear, IEEE Computer Magazine.

K. Krueger, D. Loftesness, A. Vahdat, and T. Anderson. ``Tools for the Development of Application-Specific Virtual Memory Management.'' Proc. 1993 Conference on Object Oriented Programming: Systems, Languages, and Applications (OOPSLA '93) (September 1993), pp. 48--64.

M. Martonosi, A. Gupta, and T. Anderson. ``Effectiveness of Trace Sampling for Performance Debugging Tools.'' Proc. 1993 ACM SIGMETRICS Conference on the Measurement and Modeling of Computer Systems (May 1993), pp. 248--259.

M. Martonosi, A. Gupta, and T. Anderson. ``MemSpy: Analyzing Memory System Bottlenecks in Programs.'' Proc. 1992 ACM SIGMETRICS and Performance '92 Conference on the Measurement and Modeling of Computer Systems (May 1992), pp. 1--12.

T. Anderson and E. Lazowska. ``Quartz: A Tool for Tuning Parallel Program Performance.'' Proc. 1990 ACM SIGMETRICS Conference on the Measurement and Modeling of Computer Systems (May 1990), pp. 115--125. Also appeared as University of Washington Technical Report 89-09-05 (September 1989).