Optimizing Software Cache-coherent Cluster Architectures*

Xiaohan Qin
IBM T.J. Watson Research Center
xqin@watson.ibm.com
Jean-Loup Baer
Dept. of Computer Science and Engineering
University of Washington
baer@cs.washington.edu

Abstract:

Software cache-coherent systems using programmable protocol processors provide a flexible infrastructure to expand systems in size and function. However, this flexibility comes at a cost in performance. First, a software implementation of protocols is inherently slower than a hardware implementation. Second, when multiple processors share a protocol processor, contention may result in a substantial increase in memory latency.

In this paper, we study how the overhead of a software scheme can be reduced in the context of a shared-memory system consisting of SMP clusters. We study various design choices including hardware assists such as forwarding logic in the protocol processor and software hints through explicit communication primitives. We conduct our experiments via trace-driven simulation and compare the execution of three programs from the SPLASH-2 suite.

We found that small cluster sizes (up to 4 processors/node) work well for both hardware and software implementations. When the forwarding logic is incorporated in the software scheme, its performance is competitive with that of the hardware scheme. When further enhanced by explicit communication primitives, the software scheme can perform even better than a pure hardware implementation. This is particularly noticeable when the network latency is high.
 

Keywords:

communication primitives, software-controlled cache coherence, performance evaluation.
 

Introduction

Packaging technology makes it possible to build systems from off-the-shelf clusters, or Symmetric Multiprocessors (SMPs), today on a board and tomorrow on a chip. It is thus natural to consider a medium-scale shared-memory parallel processing system as an interconnection of SMP clusters. Independently of the topology of the interconnection network, it is clear that the cost of inter-cluster communication, mostly memory latency, will dominate. Techniques to hide inter-cluster memory latency are therefore extremely important.

In this paper, we explore a spectrum of latency hiding techniques in the context of cache-coherent parallel systems where each node is an SMP cluster. While pure hardware coherence mechanisms might be most efficient for a given protocol and system size, they provide no means to adapt to the various behavioral patterns found in many parallel applications and they do not scale easily. A recent trend [7, 5, 10, 15, 8] has been to advocate the use of programmable protocol processors and software to enforce cache coherence policies. The goal of this paper is to show that we can take advantage of the flexibility of the software implementation to obtain performance as good as that of the hardware implementation.

Our study considers various design choices such as hardware assists in the programmable processors and software hints in the form of explicit communication primitives [13]. We evaluate, via trace-driven simulation, the performance gains due to the hardware assists and those due to the communication primitives. We also consider the impact of cluster size, an important parameter since increasing the number of processors in a cluster has two contradictory effects: it reduces inter-cluster references but increases intra-cluster contention on shared resources.

The remainder of the paper is organized as follows. In Section 2, we present the baseline architectural model together with the hardware and software optimizations. Section 3 describes our experimental methodology and the performance results from the simulations. The results reveal that, in the baseline architecture, contention on the protocol processor can significantly degrade performance if the cluster size is too large. However, adding the hardware assists, together with the judicious use of communication primitives, enables the software scheme to achieve performance comparable to that of the hardware implementation. We summarize related work in Section 4 and conclude in Section 5.

Baseline architectural model and optimizations

Baseline architectural model

  The baseline architecture consists of clusters connected to each other by an interconnection network such as a mesh (Figure 1(a)). Each cluster is a bus-based shared-memory multiprocessor augmented with a communication processor [7, 8] and a remote cache [8, 10]. Each processor has a private cache (or cache hierarchy). Cache coherence is maintained at two levels. Within a cluster node, a snoopy protocol keeps the local caches coherent. Across cluster nodes, a directory-based protocol tracks the caching information and maintains coherence on a cache-line basis.


Figure 1: The baseline architectural model.

The remote cache, an order of magnitude larger than the processor's private cache, stores only data homed at remote nodes. A partial inclusion policy [2] is enforced between the remote and local caches with the default write policy between these caches being write-back. It has been shown that a remote cache can eliminate a good portion of inter-cluster misses [3, 8].

The communication processor (CPP) performs, under software control, the functions of a conventional directory protocol and of the communication primitives. In the base architecture (Figure 1(b)), the CPP contains an embedded protocol processor (PP), a network interface (NI), a bus interface (BI), and a memory controller (MC) used by the protocol processor to access memory. The BI and NI each have two queues that store requests/replies from/to the compute processors and the network, respectively. Interrupts are employed to notify the protocol processor of message arrivals [13]. To handle interrupts efficiently, we require the communication processor to have at least one hardware context dedicated to the message-receiving interrupt handler and to the mandatory cache coherence handlers.
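
A minimal sketch may help fix ideas. The toy C program below shows how the dedicated hardware context might drain the NI and BI queues and dispatch to handlers, serving mandatory coherence messages ahead of optional primitives. The message types, the toy queue layout, and the handler bodies are our assumptions, not the actual CPP firmware.

  /* A minimal, hypothetical sketch of the protocol processor's message
   * dispatch.  Message types, queues, and handler bodies are assumed. */
  #include <stdio.h>

  typedef enum { MSG_READ_REQ, MSG_REPLY_DATA, MSG_INVALIDATE,
                 MSG_PRIMITIVE } msg_type_t;

  typedef struct { msg_type_t type; unsigned addr; int src; } message_t;

  /* Toy stand-ins for the NI and BI input queues. */
  static message_t ni_q[8], bi_q[8];
  static int ni_n, bi_n;

  static message_t *dequeue(message_t *q, int *n)
  {
      return *n > 0 ? &q[--*n] : NULL;
  }

  static void handle_coherence(message_t *m)    /* mandatory handler */
  {
      printf("coherence: type %d, addr 0x%x, cluster %d\n",
             m->type, m->addr, m->src);
  }

  static void handle_primitive(message_t *m)    /* optional handler */
  {
      printf("primitive: addr 0x%x\n", m->addr);
  }

  /* Entered from the message-arrival interrupt; runs in the hardware
   * context dedicated to the mandatory coherence handlers.  Mandatory
   * requests are served ahead of the optional primitives, mirroring
   * the priority rule noted in the related-work section. */
  static void pp_dispatch(void)
  {
      message_t *m;
      while ((m = dequeue(ni_q, &ni_n)) || (m = dequeue(bi_q, &bi_n))) {
          if (m->type == MSG_PRIMITIVE)
              handle_primitive(m);
          else
              handle_coherence(m);
      }
  }

  int main(void)
  {
      ni_q[ni_n++] = (message_t){ MSG_READ_REQ, 0x1000, 2 };
      bi_q[bi_n++] = (message_t){ MSG_PRIMITIVE, 0x2000, 0 };
      pp_dispatch();
      return 0;
  }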

The cluster's coherence directory is stored in the cluster's local memory. We use a full directory scheme whose entries record caching state at cluster granularity. We assume that applications can distinguish between shared and private data references. For private references, the memory can supply the data without consulting the CPP. For shared references, the memory is prevented from answering the bus request directly; the CPP is always involved in looking up the memory state and supplying data, if necessary. Additionally, the CPP has a dedicated link to memory, so it does not compete for the cluster bus.
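
To make the directory organization concrete, the following toy C program shows one possible encoding of a full-directory entry and the home-node action on an inter-cluster read. The state names and bit layout are our assumptions, not the paper's actual format.

  /* One possible encoding of a full-directory entry, kept per memory
   * line in the home cluster's local memory.  Layout is assumed. */
  #include <stdint.h>
  #include <stdio.h>

  typedef enum { DIR_UNCACHED, DIR_SHARED, DIR_EXCLUSIVE } dir_state_t;

  typedef struct {
      dir_state_t state;
      uint16_t    sharers;   /* full bit vector, one bit per cluster */
      uint8_t     owner;     /* meaningful only in DIR_EXCLUSIVE */
  } dir_entry_t;

  /* Home-node action on an inter-cluster read: record the requester
   * and report whether the current owner must supply the line. */
  static int home_read(dir_entry_t *e, int req_cluster)
  {
      if (e->state == DIR_EXCLUSIVE && e->owner != (uint8_t)req_cluster)
          return 1;                            /* forward to owner */
      e->sharers |= (uint16_t)(1u << req_cluster);
      if (e->state == DIR_UNCACHED)
          e->state = DIR_SHARED;
      return 0;                                /* memory supplies data */
  }

  int main(void)
  {
      dir_entry_t e = { DIR_UNCACHED, 0, 0 };
      printf("forward? %d  sharers=0x%x\n",
             home_read(&e, 3), (unsigned)e.sharers);
      return 0;
  }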

Forwarding logic

Figure 2: The communication processor augmented with forwarding logic.

Since the CPP is shared by the processors within a cluster, its protocol engine may quickly become overloaded. One option for decreasing the load on the protocol processor is to add a forwarding logic module (Figure 2(a)), reminiscent of similar functionality found in earlier shared-memory machines [11, 4]. The module allows requests that need not access the coherence directory of a node to bypass the protocol engine of that node. Examples of such requests are a read for remote data (Figure 2(b)) and a ``writeback'' from a home cluster (Figure 2(c)). With the forwarding logic, every cache miss or communication primitive uses only the protocol engine of the home node, improving performance in two ways: first, the inter-cluster miss latency is reduced because fewer cycles are spent on message passing; second, there is less contention on the protocol processor.
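
The bypass test itself can be stated in a few lines. The sketch below, with an assumed message taxonomy, captures the rule that only messages needing a node's coherence directory visit its protocol engine.

  /* Hypothetical forwarding-logic test: messages that do not need this
   * node's coherence directory are routed directly between the bus and
   * network interfaces.  The message taxonomy is our assumption. */
  #include <stdio.h>

  typedef enum {
      REQ_REMOTE_READ,   /* local processor reads remote data, Fig. 2(b) */
      WB_FROM_HOME,      /* ``writeback'' from a home cluster, Fig. 2(c) */
      REQ_HOME_READ,     /* request for data homed at this node */
      REPLY_TO_HOME      /* reply that must update this node's directory */
  } fwd_msg_t;

  /* Nonzero when the message may bypass this node's protocol engine;
   * only the home node's engine is then involved in the transaction. */
  static int can_forward(fwd_msg_t t, int this_is_home)
  {
      switch (t) {
      case REQ_REMOTE_READ: return !this_is_home;
      case WB_FROM_HOME:    return !this_is_home;
      default:              return 0;   /* needs the coherence directory */
      }
  }

  int main(void)
  {
      printf("remote read at requester: forward=%d\n",
             can_forward(REQ_REMOTE_READ, 0));
      printf("read of home data at home: forward=%d\n",
             can_forward(REQ_HOME_READ, 1));
      return 0;
  }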

Communication primitives

  We now briefly review the communication primitives introduced in [13] (for implementation issues, see [12]). The communication primitives extend, under programmer or compiler control, the basic on-demand data movement operations (fetch, store) of the shared-memory model. The primitives, summarized along with their semantics in Table 1, give the user some of the advantageous features of asynchronous message-passing, such as non-blocking operations, while retaining the simplicity of the cache-coherent shared-memory paradigm, i.e., global cache coherence is maintained.
 
Primitive                    Semantics
get(addr,size)               fetch data into the requesting processor's cache
getex(addr,size)             fetch data with ownership
put(pid,addr,size)           place data in the cache of processor pid
putex(pid,addr,size)         transfer data with ownership to processor pid
multicast(pids,addr,size)    disseminate data to a set of processors
putmem(addr,size)            return data to memory
writemem(mode,addr,size)     set the write policy

Table 1: Communication primitives.

The first two primitives, get(addr,size) and getex(addr,size), allow the user to (pre)fetch a set of consecutive cache lines in either shared or exclusive state. The next two, put(pid,addr,size) and putex(pid,addr,size), place data in a consumer's cache when the identity of the consumer is known, while multicast(pids,addr,size) serves the same purpose for multiple consumers of known identities. When the producer knows that it will not use the data any longer, but does not know which process will consume it, the data can be stored in memory with putmem(addr,size). This may save half of the request-reply bandwidth in the network. A similar effect can be achieved on a word-per-word basis by changing the default write-back policy to write-through via writemem, which can be applied to a range of addresses or to all write misses.
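
To make the interface concrete, Table 1 can be read as the following C declarations. The argument types, the proc_id_t name, and the write-policy mode encoding are our assumptions; only the primitive names and argument lists come from the paper.

  /* Table 1 expressed as a C interface.  All operations are
   * non-blocking and preserve global cache coherence. */
  #include <stddef.h>

  typedef int proc_id_t;                          /* compute-processor id */

  void get(void *addr, size_t size);              /* (pre)fetch, shared state */
  void getex(void *addr, size_t size);            /* (pre)fetch with ownership */
  void put(proc_id_t pid, void *addr, size_t size);   /* deliver to one consumer */
  void putex(proc_id_t pid, void *addr, size_t size); /* deliver with ownership */
  void multicast(const proc_id_t *pids, void *addr, size_t size); /* many consumers */
  void putmem(void *addr, size_t size);           /* return data to memory */

  enum wr_mode { WR_WRITEBACK, WR_WRITETHROUGH }; /* assumed encoding of 'mode' */
  void writemem(enum wr_mode mode, void *addr, size_t size); /* set write policy */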

When utilized appropriately, these primitives benefit an application in three ways: (1) overlap of communication with computation, (2) bulk data transfers, which better utilize the network by pipelining transmissions, and (3) tailoring of the cache coherence protocol; the sketch below illustrates the first two.
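
As a usage illustration, here is a hypothetical producer-consumer fragment built on the declarations above. The block size, the data array, and the compute/use helpers are invented for this sketch.

  /* Hypothetical producer-consumer fragment using get() and put(). */
  #include <stddef.h>

  void get(void *addr, size_t size);     /* primitives, as declared above */
  void put(int pid, void *addr, size_t size);

  #define BLOCK 512                      /* doubles per transfer block */
  extern double data[];                  /* globally shared array */
  extern void compute_block(double *b);
  extern void use_block(const double *b);

  /* Producer: push each finished block to its (known) consumer so that
   * the bulk transfer is pipelined and overlaps later computation. */
  void producer_step(int consumer, int i)
  {
      compute_block(&data[i * BLOCK]);
      put(consumer, &data[i * BLOCK], BLOCK * sizeof(double));
  }

  /* Consumer: prefetch the block needed next before consuming the
   * current one, hiding the inter-cluster miss latency. */
  void consumer_step(int i, int nblocks)
  {
      if (i + 1 < nblocks)
          get(&data[(i + 1) * BLOCK], BLOCK * sizeof(double));
      use_block(&data[i * BLOCK]);
  }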

Simulation: Methodology and performance results

Experimental methodology

Simulations. Since our interest is primarily in the performance aspects of the memory system, we use MINT [16], an execution-driven simulation tool tailored to memory event simulation. Our evaluation compares five sets of simulation results: a hardware implementation, a baseline software implementation, and a software implementation enhanced with the communication primitives, with the latter two also simulated with the addition of forwarding logic. The difference between the first two cases shows the software protocol processing overhead; the difference between the two software cases reveals the performance gains obtained by the communication primitives. We limit our experiments to systems with 16 processors and vary the cluster size from one processor per cluster (16 clusters) to 8 processors per cluster (2 clusters). We will see how contention on shared resources in large clusters can offset the benefit of shorter intra-cluster miss latencies. Our study also gauges the improvements brought by the forwarding logic as the cluster size varies.

Architectural parameters. Table 2 gives a sample of the architectural parameters used in our simulations. The remote cache is 1MB per processor, i.e., cluster size * 1MB per cluster. The access latency of the remote cache equals that of local memory. To give an idea of the overhead of the software implementation, an inter-cluster read miss (2 hops) takes, assuming no contention, 118 cycles in the hardware implementation, 197 cycles in the software implementation, and 148 cycles when forwarding logic is added.

Workloads. We select a subset of applications from the SPLASH-2 suite [17] for our performance evaluation: FFT, LU and RADIX. Our prior work [13] indicated that user-guided optimizations should depend on the cache configuration: to avoid cache pollution, optimizations applied to systems with small caches should be more conservative than those applied to systems with large caches. For this study, since each node has a large remote cache, we apply the aggressive optimizations to the software-based cache-coherent architectures.
 

Parameter                                     Value (Software/Hardware)
Number of processors                          16
L2 cache                                      size = 64KB, assoc = 1, line size = 64B
Remote cache                                  size = 1MB * cluster size, assoc = 4, line size = 64B
CPU clock / bus clock                         2
Transfer request on the address bus           2*
Transfer data on the data bus                 2*
Retrieving data from L2 cache                 4
Retrieving data from memory                   14
Retrieving data from remote cache             14
Bus interface (in/out)                        2
Network interface (in and out)                12
Network latency                               24
Interrupt and context switch                  8/0
Schedule a protocol handler                   4/0
Directory status lookup                       5/1
Add a sharing node to directory               6/1
Delete a sharing node from directory          6/1
Retrieve a sharing node                       3/1
Insert a message in the NI or BI              4/2
Remove a message from the NI or BI            4/2
Route a message through forwarding logic      3

Table 2: Architectural parameters. Bus transactions (marked with an asterisk) are counted in bus clock cycles; all other operations are counted in CPU cycles. The latencies of data transactions are for the first 8 bytes.
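
For back-of-the-envelope latency models, the table transcribes directly into constants. The C enumeration below is just Table 2 restated, with the software/hardware pairs split into separate names.

  /* Table 2 restated as constants.  Values are CPU cycles except the
   * two bus transactions, which are bus clock cycles (1 bus clock =
   * 2 CPU cycles here).  _SW/_HW mark the software and hardware
   * implementations where they differ. */
  enum {
      NUM_PROCESSORS      = 16,
      CPU_PER_BUS_CLOCK   = 2,
      ADDR_BUS_XFER       = 2,   /* bus clocks */
      DATA_BUS_XFER       = 2,   /* bus clocks, first 8 bytes */
      L2_ACCESS           = 4,
      MEMORY_ACCESS       = 14,
      REMOTE_CACHE_ACCESS = 14,
      BUS_IF              = 2,   /* in or out */
      NET_IF              = 12,  /* in and out */
      NET_LATENCY         = 24,
      INTR_CTX_SWITCH_SW  = 8,   INTR_CTX_SWITCH_HW  = 0,
      SCHED_HANDLER_SW    = 4,   SCHED_HANDLER_HW    = 0,
      DIR_LOOKUP_SW       = 5,   DIR_LOOKUP_HW       = 1,
      DIR_ADD_SHARER_SW   = 6,   DIR_ADD_SHARER_HW   = 1,
      DIR_DEL_SHARER_SW   = 6,   DIR_DEL_SHARER_HW   = 1,
      DIR_GET_SHARER_SW   = 3,   DIR_GET_SHARER_HW   = 1,
      MSG_INSERT_SW       = 4,   MSG_INSERT_HW       = 2,
      MSG_REMOVE_SW       = 4,   MSG_REMOVE_HW       = 2,
      FWD_ROUTE           = 3
  };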

Performance results

FFT (cf. Figure 3 plotting execution time vs. cluster size)

Hardware implementation. Performance degrades slightly with increasing cluster size. Because the communication pattern in this application is all-to-all, the number of inter-cluster misses decreases with larger clusters; meanwhile, contention on the shared resources intensifies because of the growth in intra-cluster misses. The contention is the slightly dominant factor. From a cost-performance viewpoint, clusters of size 4 or 8 seem best since they require fewer hardware resources for the inter-cluster connection.

Software implementations. The software implementation in the baseline architecture increases execution time (with respect to the baseline hardware case) by 25% for single-node clusters and by up to 100% for clusters of size 8. Contention on the protocol processor, whose utilization is always above 65% for clusters of sizes 4 and 8, is the reason for the steeper increase.

Figure 3: Execution times for FFT. (``instruction overhead'' corresponds to processor busy time)

The use of the forwarding logic reduces the memory overhead by 20-35%. First, it directly reduces the inter-cluster miss latency by lessening the number of cycles spent on message passing; the majority of FFT's misses are coherence misses that can take advantage of this simple and fast mechanism. Second, it reduces the cache miss latency indirectly by mitigating contention on the protocol processor. When the forwarding logic is employed, the performance of the software implementation is close to that of the hardware implementation.

Enhancement resulting from the communication primitives. In the baseline architecture, the communication primitives improve performance over the software implementation without forwarding logic by 30% (respectively, 20% when forwarding logic is present) for clusters of size 4 or less. As the cluster size increases, the communication primitives become less useful since more cache misses are intra-cluster misses, for which the data movements initiated by the primitives would not be executed. The easing of contention due to the communication primitives is not sufficient to compensate for this factor.

Note that the performance of the enhanced software scheme is not bounded by that of the best fixed hardware implementation. For FFT, the enhanced software implementations with forwarding logic perform as well as or better than the corresponding hardware implementations at cluster sizes 1, 2, and 4.

LU (cf. Figure 4)

Figure 4: Execution times for LU. The dashed line at the bottom represents the processor busy time and the line above it is the synchronization overhead under a PRAM model.

Hardware implementation. LU's performance is practically independent of the cluster size because its execution time is dominated by computation time and synchronization overhead. In addition, for the input set simulated, the remote cache retains a large part of the working set within a cluster. The low intra-cluster miss ratio does not put pressure on the local memory bandwidth. From a cost-performance standpoint, a cluster size of 4 or 8 is best.

Software implementations. In the base software implementation, the performance of LU does not degrade as significantly as that of FFT when the cluster size increases because (1) the application has a low communication-to-computation ratio and (2) intra-node sharing has a positive effect. Adding the forwarding logic eliminates a good portion of the software cache-coherence overhead, although the enhancement is small relative to the total execution time for the reason stated above: LU has insignificant memory overhead.

Enhancement resulting from the communication primitives. The improvements from the communication primitives are negligible in all cases. The main reasons for the lack of improvement are: (1) a large portion of the memory overhead consists of barrier synchronization time (cf. the top portion of each bar in Figure 5), for which the communication primitives are useless, and (2) the communication primitives do not reduce the latency due to conflict misses, since the latter are unpredictable. This application reveals the limitation of the user primitives, which are targeted primarily at bulk data communication.

In summary, although the communication primitives are not of much use for LU decomposition, the software implementation assisted by the forwarding logic does deliver performance competitive with that of the hardware scheme.

Figure 5: Execution times of LU: software implementations with and without communication primitives at cluster size 2. In each group, the left and right bars correspond to the baseline architecture and that with the forwarding logic respectively. ``CPU busy'' is the compute processor execution time. ``Sync-algo'' is the synchronization time due to load imbalance intrinsic to the algorithm. ``Mem latency'' is the amount of time the compute processor waits because of memory latency and cache coherence effects. ``Sync-Mem'' is the synchronization time, including load imbalance, due to the effect of memory latency.

RADIX (cf. Figure 6)

Hardware implementation. As in FFT, execution times are best for smaller cluster sizes. The degradation with increased cluster size is more pronounced for RADIX, which has a higher miss ratio and hence higher contention on shared resources.

Software implementations. Here also, the patterns are similar to those found in FFT, namely a performance loss ranging from 30% for single-node clusters to 100% for 8-processor clusters. Adding the forwarding logic enhances performance by 20-30%. The improvement is most significant for large cluster sizes because of the forwarding logic's effect in reducing contention on the protocol processor.

Enhancement resulting from the communication primitives. In a system where each node has limited cache capacity, two communication primitives are applicable to RADIX: get (for prefetching) and writemem (to write through when each processor writes the keys into new locations based on their ``global ranks'' [13]). The write-through policy, however, is not worthwhile when the remote cache is present, because the write-back policy makes it possible to reuse the data kept within a cluster by the remote cache. The get primitive by itself yields an improvement of 10% for small cluster sizes.
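
As an illustration of the surviving optimization, here is one plausible placement of get() as a prefetch over a processor's key block. The key count, mask, chunk size, and arrays are invented; the paper says only that get is used for prefetching in RADIX.

  /* Hypothetical prefetch with get() in RADIX: fetch the next chunk of
   * keys while counting the current one, so inter-cluster miss latency
   * overlaps computation. */
  #include <stddef.h>

  void get(void *addr, size_t size);     /* primitive from Table 1 */

  #define NKEYS 65536
  #define RADIX_MASK 0x3ff               /* 1024 buckets, illustrative */
  extern int keys[NKEYS];                /* this processor's key block */
  extern int histogram[RADIX_MASK + 1];

  void count_keys(void)
  {
      enum { CHUNK = 16 };               /* 16 ints = one 64B cache line */
      for (int i = 0; i < NKEYS; i += CHUNK) {
          if (i + CHUNK < NKEYS)         /* prefetch the lines read next */
              get(&keys[i + CHUNK], CHUNK * sizeof(int));
          for (int j = i; j < i + CHUNK; j++)
              histogram[keys[j] & RADIX_MASK]++;
      }
  }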

The results for RADIX show that the software cache coherence scheme supplemented with the forwarding logic performs nearly as well as the hardware implementation.


Figure 6: Execution times for RADIX.

Effect of network latency

The previous results indicate that, for our simulation parameters, single-processor clusters give the best execution times and that, from a cost/performance viewpoint, clusters of size 2 or 4 might be preferable. These results were obtained in the context of a very aggressive network with a round-trip latency (denoted L_base in the following) of 0.4µs, assuming 200MHz processors. We now examine the effect of increasing the network latency to 2*L_base, 4*L_base, and 8*L_base. These experiments are justified by the current trend of processor speeds, for both compute and communication processors, improving faster than network latencies.

Figure 7 shows the execution times for FFT with increasing network latencies. The results for the other two benchmarks show the same trends, namely: the benefit of the communication primitives grows as the network latency increases; the hardware implementations benefit from larger cluster sizes; and the enhanced software implementations show little variation up to a cluster size of 4 and outperform the corresponding hardware implementations.

Figure 7: Varying network latency for FFT.

Related Work

  Protocol processor efficiency has been studied in [6] in the context of FLASH [7]. Simulation prior to the design of STiNG [8] showed that multiple-processor clusters can increase contention on the protocol processor to the point where the remote access latency can double.

Michael et al. [9] studied the performance gained by using dual protocol engines, one for dealing with local memory accesses and the other for remote accesses. The goal was to reduce contention on the protocol processor, an idea similar to the one used in S3.mp [10]. Our forwarding logic provides some of this dual functionality with a simpler implementation.

Explicit communication primitives within the context of a shared-memory paradigm have also been studied in [14, 1]. Our implementation differs in its use of a protocol processor and in the fact that the execution of the optional communication primitives is interruptible by mandatory shared-memory requests, which have higher priority.

Conclusion

  The use of programmable protocol processors and of software cache coherence opens a promising and flexible avenue in the implementation of fine-grained, SMP cluster based, shared-memory systems at the cost of a higher protocol processing overhead.

The basic software scheme is always at a disadvantage, performance-wise, when compared with a hardware solution. We have shown, however, that a software scheme can be competitive with, or even outperform, a pure hardware implementation when it is provided with hardware assists, such as forwarding logic in the communication processor, and when the software coherence protocols are enhanced with explicit communication primitives.

With the aggressive network technology assumed in our study, the best cluster size is a single processor per node. Nevertheless, both hardware and software implementations assisted with forwarding logic work well with small cluster sizes, up to 4 processors/node. With a faster processor-to-network speed ratio, hardware implementations benefit from larger cluster sizes, while enhanced software implementations present little variation up to a cluster size of 4 and perform better than the corresponding hardware implementations.

Finally, we note that the communication primitives are most desirable when the network latency is high and that the forwarding logic is most important for large cluster sizes. We also remark that the applications we simulated were optimized for flat architectures. It would be interesting to see whether a study comparing the various schemes on applications modified to better fit the cluster paradigm would reach the same conclusions.

* This work was completed while Xiaohan Qin was studying at the University of Washington. It was supported in part by NSF Grant MIP 9700970.

References

1
 H. Abdel-Shafi, J. Hall, S. V. Adve, and V. S. Adve. An evaluation of fine-grain producer-initiated communication in cache-coherent multiprocessors. In Proc. of the 3rd Int. Symp. on High-Performance Computer Architecture, pages 204-215, 1997.
2
 J.-L. Baer and W.-H. Wang. On the inclusion properties for multi-level cache hierarchies. In Proc. of the 15th Int. Symp. on Comp. Architecture, pages 73-80, 1988.
3
 F. Baetke. CONVEX Exemplar SPP1000 series MPPC systems with a new virtual shared memory concept. In Proc. of the 2nd Austrian Hungarian Workshop on Transputer Applications, pages 14-22, 1995.
4
 BBN Advanced Computer Inc. Butterfly CP1000 switch tutorial, 1989.
5
 M. Heinrich et al. The performance impact of flexibility in the Stanford FLASH multiprocessor. In Proc. of the 6th Int. Conf. on Architectural Support for Programming Languages and Operating Systems, pages 274-285, 1994.
6
 C. Holt. The effects of latency, occupancy, and bandwidth in distributed shared memory multiprocessors. Technical report, Dept. of Computer Science, Stanford Univ., 1995. CSL-TR-95-660.
7
 J. Kuskin et al. The Stanford FLASH multiprocessor. In Proc. of the 21st Int. Symp. on Comp. Architecture, pages 302-313, 1994.
8
 T. Lovett and R. Clapp. STiNG: A CC-NUMA computer system for the commercial marketplace. In Proc. of the 23rd Int. Symp. on Comp. Architecture, pages 308-317, 1996.
9
 M. M. Michael, A. K. Nanda, B.-H. Lim, and M. L. Scott. Coherence controller architectures for SMP-based CC-NUMA multiprocessors. In Proc. of the 24th Int. Symp. on Comp. Architecture, pages 219-228, 1997.
10
 A. Nowatzyk et al. Exploiting parallelism in cache coherency protocol engines. In Proceedings of EURO-PAR '95 Parallel Processing, pages 269-286, 1995.
11
 G. F. Pfister, W. C. Brantley, D. A. George, and S. L. Harvey. The IBM research parallel processor prototype (RP3): introduction and architecture. In Proc. of Int. Conf. on Parallel Processing, pages 764-771, 1985.
12
 X. Qin. On the use and performance of communication primitives in software controlled cache-coherent cluster architectures. PhD thesis, Dept. of Computer Science, Univ. of Washington, 1997.
13
 X. Qin and J.-L. Baer. On the use and performance of explicit communication primitives in cache-coherent multiprocessor systems. In Proc. of the 3rd Int. Symp. on High-Performance Computer Architecture, pages 182-193, 1997.
14
 U. Ramachandran, G. Shah, A. Sivasubramaniam, A. Singla, and I. Yanasak. Architectural mechanisms for explicit communication in shared memory multiprocessors. In Proc. of Supercomputing '95, 1995.
15
 S. K. Reinhardt, R. W. Pfile, and D. A. Wood. Decoupled hardware support for distributed shared memory. In Proc. of the 23rd Int. Symp. on Comp. Architecture, pages 34-43, 1996.
16
 J. E. Veenstra and R. J. Fowler. MINT: a front end for efficient simulation of shared-memory multiprocessors. In Proc. of the 2nd Int. Workshop on Modeling, Analysis, and Simulation of Computer and Telecommunication Systems, pages 201-207, 1994.
17
 S. C. Woo, M. Ohara, E. Torrie, J. P. Singh, and A. Gupta. The SPLASH-2 programs: Characterization and methodological considerations. In Proc. of the 22nd Int. Symp. on Comp. Architecture, pages 24-36, 1995.

Author Biographies

Xiaohan Qin is a research staff member at the IBM T.J. Watson Research Center. Before joining IBM, she studied at the University of Washington, where she received a PhD in computer science in 1997. Her current research interests include commercial workload characterization, system performance evaluation, and analytical modelling.

Jean-Loup Baer is Boeing Pennell Professor of Computer Science and Engineering and Adjunct Professor of Electrical Engineering at the University of Washington. He joined the University of Washington faculty in 1969 after undergraduate studies in Grenoble (France) and after receiving his Ph.D. from UCLA. His present research interests are in computer systems architecture, with a concentration on the design and evaluation of memory hierarchies, and in parallel and distributed processing.