In this paper, we study how the overhead of a software cache coherence scheme can be reduced in the context of a shared-memory system consisting of SMP clusters. We study various design choices, including hardware assists such as forwarding logic in the protocol processor and software hints in the form of explicit communication primitives. We conduct our experiments via trace-driven simulation and compare the execution of three programs from the SPLASH-2 suite. We found that small cluster sizes (up to 4 processors/node) work well for both hardware and software implementations. When the forwarding logic is incorporated in the software scheme, its performance is competitive with that of the hardware scheme. When enhanced further by explicit communication primitives, the software scheme can perform even better than a pure hardware implementation, particularly when the network latency is high.
In this paper, we explore a spectrum of latency hiding techniques in the context of cache-coherent parallel systems where each node is an SMP cluster. While pure hardware coherence mechanisms might be most efficient for a given protocol and system size, they provide no means to adapt to the various behavioral patterns found in many parallel applications and they do not scale easily. A recent trend [7, 5, 10, 15, 8] has been to advocate the use of programmable protocol processors and software to enforce cache coherence policies. The goal of this paper is to show that we can take advantage of the flexibility of the software implementation to obtain performance as good as that of the hardware implementation.
Our study considers various design choices such as hardware assists in the programmable processors and software hints in the form of explicit communication primitives [13]. We evaluate, via trace-driven simulation, the performance gains due to the hardware assists and those due to the communication primitives. We also consider the impact of cluster size, an important parameter since increasing the number of processors in a cluster has two contradictory effects: it reduces inter-cluster references but increases intra-cluster contention on shared resources.
The remainder of the paper is organized as follows. In Section 2, we present the baseline architectural model and the hardware and software optimizations. Section 3 describes our experimental methodology and presents the simulation results. The results reveal that, in the baseline architecture, contention on the protocol processor can significantly degrade performance if the cluster size is too large. However, the hardware assists, combined with the judicious use of communication primitives, enable the software scheme to achieve performance comparable to that of the hardware implementation. We summarize the related work in Section 4 and conclude in Section 5.
The remote cache, an order of magnitude larger than the processor's private cache, stores only data homed at remote nodes. A partial inclusion policy [2] is enforced between the remote and local caches with the default write policy between these caches being write-back. It has been shown that a remote cache can eliminate a good portion of inter-cluster misses [3, 8].
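As a rough illustration of where the remote cache fits in the miss path, the sketch below (in C) shows one plausible ordering of lookups for a shared read. The function names and the home test are our own assumptions for illustration, not the paper's implementation.

```c
/* Hypothetical sketch of how a shared read might be serviced in a cluster
 * with a remote cache; all interfaces below are illustrative only. */

#include <stdbool.h>
#include <stdint.h>

typedef uint64_t addr_t;

bool l2_lookup(addr_t a);            /* processor's private cache            */
bool remote_cache_lookup(addr_t a);  /* cluster-level cache of remote data   */
bool is_locally_homed(addr_t a);     /* home test based on the address range */
void local_memory_read(addr_t a);    /* supply data from local memory        */
void cpp_remote_read(addr_t a);      /* CPP issues an inter-cluster request  */

void service_read(addr_t a)
{
    if (l2_lookup(a))
        return;                       /* hit in the private cache */

    if (is_locally_homed(a)) {
        local_memory_read(a);         /* local home: served within the cluster */
    } else if (remote_cache_lookup(a)) {
        return;                       /* remote-homed line already cached in the
                                         cluster: no inter-cluster miss        */
    } else {
        cpp_remote_read(a);           /* inter-cluster miss handled by the CPP */
    }
}
```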
The communication processor (CPP) performs under software control the functions of a conventional directory protocol and of the communication primitives. In the base architecture, Figure 1(b), the CPP contains an embedded protocol processor (PP), a network interface (NI), a bus interface (BI), and a memory controller (MC) used by the protocol processor to access memory. The BI and NI each have two queues that store requests/replies from/to the compute processors and the network respectively. Interrupts are employed to notify the protocol processor of message arrival [13]. To handle interrupts efficiently, we require the communication processor to have at least one hardware context dedicated to the message-receiving interrupt handler and mandatory cache coherence handlers.
The cluster's coherence directory is stored in the cluster local memory. We use a full directory scheme with directory entries corresponding to cluster states. We assume that applications can distinguish between shared and private data references. For private references, the memory can supply the data without consulting the CPP. For shared references, the memory is prevented from answering the bus request directly; the CPP is always involved in looking up the memory state and, if necessary, supplying the data. In addition, the CPP has a dedicated link to memory, thereby avoiding contention for the cluster bus.
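To make the protocol processor's work concrete, here is a minimal sketch of a full-map directory entry and a home-node handler for a remote shared read, assuming per-cluster presence bits. The data structure fields and helper routines are illustrative and are not taken from the paper.

```c
/* Minimal sketch of a full-map directory with per-cluster presence bits;
 * names and helpers are illustrative, not the paper's protocol code. */

#include <stdint.h>

typedef enum { DIR_UNCACHED, DIR_SHARED, DIR_EXCLUSIVE } dir_state_t;

typedef struct {
    dir_state_t state;     /* cluster-level coherence state       */
    uint32_t    sharers;   /* one presence bit per cluster        */
    uint32_t    owner;     /* owning cluster if DIR_EXCLUSIVE     */
} dir_entry_t;

/* Assumed interfaces of the protocol software. */
dir_entry_t *dir_lookup(uint64_t line_addr);
void send_data_reply(int cluster, uint64_t line_addr);
void forward_to_owner(int owner, int requester, uint64_t line_addr);

/* Handler run by the protocol processor at the home cluster
 * when a remote shared-read request arrives. */
void handle_remote_read(int requester, uint64_t line_addr)
{
    dir_entry_t *e = dir_lookup(line_addr);

    switch (e->state) {
    case DIR_UNCACHED:
    case DIR_SHARED:
        e->state    = DIR_SHARED;
        e->sharers |= 1u << requester;      /* record the new sharer */
        send_data_reply(requester, line_addr);
        break;
    case DIR_EXCLUSIVE:
        /* The owner holds the only valid copy: ask it to supply the data. */
        forward_to_owner((int)e->owner, requester, line_addr);
        break;
    }
}
```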
Since the CPP is shared by the processors within a cluster, its protocol engine may quickly become overloaded. One option to decrease the load on the protocol processor is to add a forwarding logic module (Figure 2(a)), reminiscent of similar functionality found in earlier shared-memory machines [11, 4]. The module allows requests that need not access the coherence directory of a node to bypass the protocol engine of that node. Examples of such requests are a read for remote data (Figure 2(b)) and a ``writeback'' from a home cluster (Figure 2(c)). With the forwarding logic, every cache miss or communication primitive uses only the protocol engine of the home node, which improves performance in two ways: first, the inter-cluster miss latency is reduced by lessening the number of cycles spent on message passing; second, there is less contention on the protocol processor.
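The following sketch conveys the kind of routing decision the forwarding logic could make; the message types, fields, and queue interfaces are hypothetical. The point is simply that messages which do not touch the local directory go straight to the network interface instead of the protocol engine.

```c
/* Hedged sketch of a forwarding decision; message format is assumed. */

typedef enum {
    MSG_READ_REQ,     /* read request for data homed elsewhere        */
    MSG_WRITEBACK,    /* writeback destined for a remote home cluster */
    MSG_OTHER         /* anything that needs the local directory      */
} msg_type_t;

typedef struct {
    msg_type_t type;
    int        home_cluster;   /* home node of the referenced line */
} message_t;

int  this_cluster(void);
void enqueue_to_network(const message_t *m);   /* NI output queue        */
void enqueue_to_protocol(const message_t *m);  /* protocol-engine queue  */

void route_message(const message_t *m)
{
    /* A read request or writeback whose home is another cluster never
     * needs this node's directory, so it can bypass the protocol engine. */
    if (m->home_cluster != this_cluster() &&
        (m->type == MSG_READ_REQ || m->type == MSG_WRITEBACK)) {
        enqueue_to_network(m);
    } else {
        enqueue_to_protocol(m);   /* local home: the protocol engine handles it */
    }
}
```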
Primitive | Semantics
get(addr,size) | fetch data into the requesting processor's cache
getex(addr,size) | fetch data with ownership
put(pid,addr,size) | place data in the cache of processor pid
putex(pid,addr,size) | transfer data with ownership to processor pid
multicast(pids,addr,size) | disseminate data to a set of processors
putmem(addr,size) | return data to memory
writemem(mode,addr,size) | set the write policy
The first two primitives, get(addr,size) and getex(addr,size), (pre)fetch a set of consecutive cache lines in either shared or exclusive state. The next two, put(pid,addr,size) and putex(pid,addr,size), place data in a consumer's cache when the identity of the consumer is known, while multicast(pids,addr,size) serves the same purpose for multiple consumers of known identities. When the producer knows that it will not use the data any longer, but does not know which process will consume it, the data can be stored in memory with putmem(addr,size). This may save half of the request-reply bandwidth in the network. A similar effect can be achieved on a word-per-word basis by changing the default write-back policy to write-through via writemem, which can be applied to a range of addresses or to all write misses.
When utilized appropriately, these primitives benefit the application in three ways: (1) overlap of communication with computation, (2) bulk data transfers, which better utilize the network by pipelining transmissions, and (3) tailoring of the cache coherence protocol.
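As a hedged usage sketch, the fragment below (in C) shows how a producer might push a buffer to a known consumer with putex, park it in memory with putmem when the consumer is unknown, and how the consumer could bulk-prefetch it with get. The buffer, the barrier call, and the application routines compute_block and use_block are hypothetical; only the primitive signatures follow the list above.

```c
/* Illustrative producer-consumer use of the communication primitives;
 * everything except the primitive prototypes is assumed for the example. */

void get(void *addr, int size);              /* bulk prefetch, shared state  */
void putex(int pid, void *addr, int size);   /* push data with ownership     */
void putmem(void *addr, int size);           /* return data to memory        */
void barrier(void);                          /* assumed synchronization call */

void compute_block(double *b, int n);        /* application code (assumed)   */
void use_block(double *b, int n);

#define N (64 * 1024)
static double buf[N];

void producer(int consumer_pid)
{
    compute_block(buf, N);                       /* produce the data          */

    if (consumer_pid >= 0)
        putex(consumer_pid, buf, (int)sizeof(buf)); /* consumer known: push   */
    else
        putmem(buf, (int)sizeof(buf));           /* consumer unknown: park at
                                                    memory, saving a request  */
    barrier();
}

void consumer(void)
{
    barrier();
    get(buf, (int)sizeof(buf));                  /* bulk prefetch before use  */
    use_block(buf, N);
}
```

The producer-side push overlaps communication with computation of the next block, and the bulk get pipelines the transfer of many consecutive lines, which is the intended benefit of these primitives.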
Architectural parameters. We give a sample of the architectural parameters used in our simulations in Table 2. The remote cache is 1 MB per processor, i.e., cluster size × 1 MB per cluster. The access latency of the remote cache equals that of the local memory. To give an idea of the overhead of the software implementation, an inter-cluster read miss (2 hops) takes, assuming no contention, 118 cycles in the hardware implementation, 197 cycles in the software implementation, and 148 cycles when the forwarding logic is added.
Workloads. We select a subset of applications from the SPLASH-2 suite [17] for our performance evaluation: FFT, LU, and RADIX. Our prior work [13] indicated that user-guided optimizations should depend on the cache configuration: to avoid cache pollution, optimizations applied to systems with small caches should be more conservative than those applied to systems with large caches. Since each node in this study has a large remote cache, we apply the aggressive optimizations to the software-based cache-coherent architectures.
Table 2: Architectural parameters (sample). The parameters include the number of processors (16), the L2 and remote cache configurations, the CPU-to-bus clock ratio, the times to transfer a request on the address bus and data on the data bus, the latencies to retrieve data from the L2 cache, from memory, and from the remote cache, the bus interface (in/out) and network interface (in and out) latencies, the network latency, the interrupt and context switch cost, the cost to schedule a protocol handler, the directory operations (status lookup, adding, deleting, and retrieving a sharing node), inserting and removing a message in the NI or BI, and routing a message through the forwarding logic.
FFT (cf. Figure 3)
Hardware implementation. Performance degrades slightly with increasing cluster size. Since the communication pattern of this application is all-to-all, the number of inter-cluster misses decreases with larger clusters; meanwhile, contention over the shared resources intensifies because of the growth in intra-cluster misses. The contention effect slightly dominates. From a cost-performance viewpoint, clusters of size 4 or 8 seem best since they require fewer hardware resources for the inter-cluster connection.
Software implementations. The software implementation in the baseline architecture increases execution time (with respect to the baseline hardware case) by 25% for single node clusters and up to 100% for clusters of size 8. The contention on the protocol processor, whose utilization is always above 65% for clusters of sizes 4 and 8, is the reason for the more severe increase.
Figure 3: Execution times for FFT. (``instruction overhead'' corresponds to processor busy time.)
The use of the forwarding logic reduces the memory overhead by 20-35%. First, it directly reduces the inter-cluster miss latency by lessening the number of cycles spent on message passing. The majority of the FFT's misses are coherence misses that can take advantage of this simple and fast mechanism. Second, it reduces the cache miss latency indirectly by mitigating contention on the protocol processor. When the forwarding logic is employed, the performance of the software implementation is close to that of the hardware implementation.
Enhancement resulting from the communication primitives. In the baseline architecture, the communication primitives improve the performance over the software implementation without forwarding logic by 30% (respectively 20% when forwarding logic is present) for clusters of size 4 or less. When the cluster size increases, the communication primitives become less useful because a larger fraction of cache misses are intra-cluster misses, for which the data movements initiated by the communication primitives are not executed. The easing of contention due to the communication primitives is not sufficient to compensate for this factor.
Note that the performance of the enhanced software scheme is not bounded by the best fixed hardware implementation. For FFT, the enhanced software implementations with forwarding logic perform as well or better than the corresponding hardware implementations at cluster sizes 1, 2, and 4.
LU (cf. Figure 4)
Figure 4: Execution times for LU. The dashed line at the bottom represents the processor busy time and the line above it is the synchronization overhead under a PRAM model.
Hardware implementation. LU's performance is practically independent of the cluster size because its execution time is dominated by computation time and synchronization overhead. In addition, for the input set simulated, the remote cache retains a large part of the working data set within a cluster. The low intra-cluster miss ratio does not put pressure on the local memory bandwidth. From a cost-performance standpoint, a cluster size of 4 or 8 is best.
Software implementations. In the base software implementation, the performance of LU does not degrade as significantly as that of FFT when the cluster size increases because (1) the application has low communication to computation ratio and (2) intra-node sharing produces a positive effect. Adding the forwarding logic eliminates a good portion of the software cache-coherence overhead, although the enhancement is small relative to the total execution time for the reason stated above: LU has insignificant memory overhead.
Enhancement resulting from the communication primitives. The improvements by the communication primitives are negligible in all cases. The main reasons for the lack of improvement are: (1) a large portion of the memory overhead consists of barrier synchronization time (cf. the top portion of each bar in Figure 5) for which the communication primitives are useless, and (2) the communication primitives do not reduce the latency due to conflict misses since the latter are unpredictable. This application reveals the limitation of the user primitives which are targeted primarily at bulk data communication.
In summary, although the communication primitives are not of much use for LU decomposition, the software implementation assisted with the forwarding logic delivers performance competitive with that of the hardware scheme.
Figure 5: Execution times of LU: software implementations with and without communication primitives at cluster size 2. In each group, the left and right bars correspond to the baseline architecture and to the architecture with the forwarding logic, respectively. ``CPU busy'' is the compute processor execution time. ``Sync-algo'' is the synchronization time due to load imbalance intrinsic to the algorithm. ``Mem latency'' is the amount of time the compute processor waits because of memory latency and cache coherence effects. ``Sync-Mem'' is the synchronization time, including load imbalance, due to the effect of memory latency.
RADIX (cf. Figure 6)
Hardware implementation. As in FFT, the execution times are best for the smaller cluster sizes. The degradation with increasing cluster size is more pronounced for RADIX, which has a higher miss ratio and hence higher contention on the shared resources.
Software implementations. Here also, the patterns are similar to those found for FFT, namely a performance loss ranging from 30% for single-node clusters to 100% for 8-processor clusters. Adding the forwarding logic improves performance by 20-30%. The improvement is most significant for the large cluster sizes because the forwarding logic reduces contention on the protocol processor.
Enhancement resulting from the communication primitives. In a system where each node has limited cache capacity, two communication primitives are applicable to RADIX: get (for prefetching) and writemem (to write-through when each processor writes the keys into new locations based on their ``global ranks'' [13]). The write-through policy, however, is not worthwhile when the remote cache is present because the write-back policy makes it possible to reuse the data that are kept within a cluster by the remote cache. The get primitive by itself yields an improvement of 10% for small cluster sizes.
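As a rough sketch of how get could be applied here, the fragment below prefetches the contiguous block of source keys a processor is about to scan in the permutation phase. The loop structure, array names, and rank array are ours for illustration and are not taken from the benchmark code.

```c
/* Hedged sketch of prefetching with get before the permutation phase of a
 * radix sort; arrays and index bounds are illustrative. */

void get(void *addr, int size);   /* bulk prefetch into the local cache */

void permute_keys(int *src, int *dst, const int *global_rank,
                  int first, int last)
{
    /* Prefetch this processor's contiguous portion of the source keys. */
    get(&src[first], (last - first + 1) * (int)sizeof(int));

    /* Scatter each key to the slot given by its precomputed global rank. */
    for (int i = first; i <= last; i++)
        dst[global_rank[i]] = src[i];
}
```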
The results for RADIX show that the software cache coherence scheme supplemented with the forwarding logic performs nearly as well as the hardware implementation.
Figure 7 shows the execution times for FFT with increasing network latencies. The results for the other two benchmarks show the same trend: the benefit of the communication primitives grows as the network latency increases.
Michael et al. [9] studied the performance gained by using dual protocol engines, one for dealing with local memory accesses and the other for remote accesses. The goal was to reduce contention on the protocol processor, an idea similar to the one used in S3.mp [10]. Our forwarding logic provides some of this dual functionality with a simpler implementation.
Explicit communication primitives within the context of a shared-memory paradigm have also been studied in [14, 1]. Our implementation differs by the use of a protocol processor and by the fact that the execution of the optional communication primitives is interruptible by mandatory shared-memory requests, which have higher priority.
The basic software scheme is always at a disadvantage, performance-wise, when compared with a hardware solution. We have shown, however, that a software scheme can be competitive with, or even outperform, a pure hardware implementation when provided with hardware assists such as forwarding logic in the communication processor and when the software coherence protocols are enhanced with explicit communication primitives.
With the aggressive network technology assumed in our study, the best cluster size is a single processor per node. Nevertheless, both hardware and software implementations assisted with forwarding logic work well with small cluster sizes, up to 4 processors/node. With a higher processor-to-network speed ratio, hardware implementations benefit from larger cluster sizes, while enhanced software implementations show little variation up to a cluster size of 4 and perform better than the corresponding hardware implementations.
Finally, we note that the communication primitives are most desirable when the network latency is high and that the forwarding logic matters most for large cluster sizes. We also remark that the applications we simulated were optimized for flat architectures. It would be interesting to see whether a study comparing the various schemes for applications modified to better fit the cluster paradigm would reach the same conclusions.
* This work was completed while Xiaohan Qin was studying at the University of Washington. It was supported in part by NSF Grant MIP 9700970.
Jean-Loup Baer is Boeing Pennell Professor of Computer Science and Engineering and Adjunct Professor of Electrical Engineering at the University of Washington. He joined the University of Washington faculty in 1969 after undergraduate studies in Grenoble (France) and after receiving his Ph.D. from UCLA. His present research interests are in computer systems architecture, with a concentration on the design and evaluation of memory hierarchies, and in parallel and distributed processing.