A primary motivation behind building multiprocessors is to improve system performance cost-effectively, but in practice it has proven difficult to obtain good performance from parallel applications. The goal of this project has been to devise and prototype solutions to several problems that have limited multiprocessor system performance. A common theme in these solutions is that a multiprocessor operating system should be different from (although not necessarily more complicated than) a uniprocessor operating system that happens to run on a multiprocessor. Specifically, the research has concerned:
Efficient threads.
Threads are the vehicle for concurrency in many approaches to parallel
programming. I have shown that thread management overhead can be reduced
to within an order of magnitude of the cost of a procedure call. The goal
is to make the expression and control of parallelism sufficiently cheap
that the ``natural parallel decomposition'' of an application can be
exploited by the programmer or compiler with acceptable overhead,
even in the case where this decomposition is relatively fine-grained.
With high-performance threads, however, small differences in thread
management can have a significant performance impact for applications
with fine-grained parallelism, often posing a tradeoff between throughput
and latency. Per-processor data structures can be used to improve
throughput and, in some circumstances, to avoid locking, improving
latency as well.
Efficient synchronization.
Efficient thread management is impossible without efficient low-level
synchronization primitives. In the work on thread management we
discovered that, contrary to intuition, the obvious methods of
spin-waiting for busy critical sections have poor performance.
Instead, slightly more sophisticated methods, such as Ethernet-style
backoff or explicit software queueing, are needed to avoid having
busy processors slowed because of the activity of spin-waiting processors.
Multiprocessor operating system structure.
Threads can be supported either by the operating system kernel or
by user-level library code in each application's address space,
but neither approach has been fully satisfactory. The performance of
kernel-level threads, although at least an order of magnitude better
than that of traditional UNIX-like processes, has been at least an order
of magnitude worse than that of user-level threads. User-level
threads, on the other hand, have suffered from a lack of system
integration, leading to poor performance or even incorrect behavior in
the presence of ``real world'' factors such as multiprogramming, I/O, and
page faults. Thus the parallel programmer has been faced with a difficult
dilemma: employ kernel-level threads, which ``work right'' but
perform poorly, or employ user-level threads, which typically perform
well but sometimes are seriously deficient. We have designed and prototyped
a new kernel interface, called Scheduler Activations, and a user-level
thread package that together provide the same functionality as kernel-level
threads without compromising the performance and flexibility advantages of
managing parallelism at the user level within each application's address space.