Amin M. Vahdat, Paul C. Eastham, and Thomas E. Anderson
Computer
Science Division
University of California
Berkeley, CA 94720
The immense popularity of the World Wide Web has lead to the development of a vast number of new distributed applications. These distributed Internet applications often require common underlying functionality, including: a global name space, cache coherence, fault tolerance, security, and authentication. Unfortunately, this functionality must currently be reimplemented on a per-application basis. We are building WebFS, a global cache coherent file system to provide a common substrate for developing Internet applications. A prototype of WebFS has been completed for the Solaris operating system and provides access to the HTTP name space for unmodified binaries. A number of cache coherence policies have also been implemented providing support for two prototype applications, scalable Internet chat and a stock ticker. In particular, we have found the use of multicast updates appropriate for applications which broadcast regular updates to a dynamic, global audience.
The immense popularity of the World Wide Web has lead to a need for the design of Internet services. We believe that such services should be able to grow, shrink, and migrate geographically in response to offered load. Further, Internet services should deliver data to end applications as efficiently as possible. We have found that such Internet services require some common pieces of core functionality, including:
Currently a number of efforts are under way to address these requirements. Cache coherence is currently under investigation by a number of groups[8,18]. Existing Internet services such as Inktomi[3] and Alta Vista[12] provide a level of fault tolerance through both local and geographical replication. Security and authentication can be provided through packages such as the Secure Socket Layer[15] and SHTTP[24]. While all of these efforts address deficiencies in the Web infrastructure, our goal is to provide a general substrate for the development of Internet services. That is, we aim to develop an interface which packages a common set of solutions for distributed applications and which can be customized to match the semantics of individual applications.
In this paper, we will argue that the above requirements can be addressed in a consistent and natural way using the file system as the basic abstraction. To validate this claim, we are building WebFS, a global cache coherent file system. Given the immense popularity of the Web, it is essential that this filesystem be compatible with the existing HTTP name space. An initial WebFS prototype has been completed and is operational on the Solaris operating system; it is available for download from the WebOS homepage. WebFS is implemented as a loadable kernel module and can be mounted just like a normal filesystem. For example, if WebFS is mounted on /http, the following sequence of operations is supported for unmodified applications:
cd /http/www6conf.slac.stanford.edu
cat index.html
cd
img
ls
cd logo
xv 200x200.jpg
Given this base-level functionality of access to the global HTTP namespace, WebFS provides a testbed for exploration of the cache coherence, security, and authentication issues associated with Internet services. We have implemented two sample distributed applications, Internet chat and a stock ticker (providing regular price updates) to explore a number of cache coherence policies in the context of the Web. We have identified three policies appropriate for different classes of applications: last writer wins, append only, and multicast updates. Our experience in developing these applications and the matching coherence policies suggests that the use of application-specific cache coherence policies as an underlying abstraction can vastly simplify the implementation of our target Internet services.
In particular, the integration of multicast into WebFS allows for efficient delivery of frequently updated files to large numbers of clients interested in the most up to date file contents. As an example, the recent 1996 United States general election was widely covered on the Internet. Unfortunately, many Internet servers were unable to keep up with the massive request stream from around the world. Given the fact that, in many instances, users are interested in all file updates (e.g. election results or stock prices), we hypothesize that use of multicast to deliver updates can improve client latency while simultaneously reducing server load.
The rest of this paper is organized as follows. Section 2 details our implementation of WebFS. In Section 3, we describe the three cache coherence policies currently implemented in WebFS and two sample Internet services built on top of these policies, chat and a stock ticker. Section 4 presents related work. In Section 5, we present our future work and conclude in Section 6.
Our goals for WebFS were the following: (i) cache coherent access to the global name space for unmodified application, (ii) a fully functional file system interface supporting arbitrary file/directory operations, and (iii) performance comparable to standard file systems for cached access. To accomplish all of these goals, we decided to build WebFS at the UNIX vnode layer[21] with tight integration to the virtual memory system for fast cached accesses. Using the Vnode layer allows for portability across various UNIX variants and potentially Windows NT. For example, both NFS[28] and AFS[19] were developed at the vnode layer.
|
Figure 1: This figure describes the high-level WebFS architecture. |
Figure 1 summarizes the WebFS system architecture, which consists of two parts: a user-level daemon and a loadable vnode module. Consider the example of a read to a file in the WebFS namespace. Initially (step 0), the user-level daemon spawns a thread which makes a system call intercepted by the WebFS vnode layer. The layer then puts that process to sleep until work becomes available for it. When an application makes the read system call requesting on a WebFS file, the operating system translates the call into a vnode read operation (step 1). The vnode operation checks to see if the read can be satisfied in the kernel (i.e. if the page being requested is cached in the kernel). If the operation cannot be satisfied in the kernel, the vnode operation fills in a structure requesting a read of the file page in question and wakes up one of the sleeping threads in the work queue (step 3). The user level daemon is then responsible for retrieving the page, by contacting a remote HTTP or WebFS daemon (step 4).
Once the page is retrieved remotely, the reverse path is followed into the kernel. The WebFS daemon packages the reply into another WebFS system call. The Vnode operation then copies the page into the virtual memory system, ensuring that subsequent accesses to the page can be handled entirely in the kernel. To ensure that concurrent WebFS file accesses are completed as quickly as possible. For maximum concurrency, the WebFS daemon is multi-threaded: multiple threads simultaneously make themselves available for kernel requests (step 0 from Figure 1).
In summary, the WebFS architecture provides a number of key advantages. First, interfacing at the vnode layer allows unmodified applications access to the global HTTP namespace. Second, tight integration with the virtual memory system provides for excellent performance on cached accesses. Next, the vnode interface is supported by most UNIX variants and potentially Windows NT, simplifying ports to new platforms. Finally, the user level daemon allows for simplified development: much of the file system functionality and data structures are maintained at the user level, facilitating experimentation and simplifying debugging.
Upon mounting WebFS, the root directory is defined to contain all HTTP/WebFS sites that the filesystem is initially aware of. Thus, the root directory is initially empty. Directory entries are created on reference. That is, the first time an application attempts access to an HTTP or WebFS site, the system checks for the presence of first a WebFS server, and second, an HTTP server. If either server exists, a directory entry is created in the root WebFS directory. The name of the directory is equal to the hostname of the remote server. The URL naming convention of appending a colon plus a port number is maintained for naming servers running on non-standard ports.
In the future, we hope to present a view of all sites in the global name space initially. Such a view requires maintaining a notion of the current state of the name space. It would also be advantageous to allow users to impose a hierarchy on the namespace since a flat namespace containing all Web sites is unmanageable. Even using a scheme similar to Alex[7] where IP address components (for example /http/edu/stanford/logic) are used to impose a hierarchy on the Web would not allow intelligent searches for particular web sites. Our current plan is for WebFS to retrieve a file containing a hint for the Web name space as gathered by a regular Web crawling agent.
In cases where WebFS is not running on remote nodes, our system falls back to HTTP to retrieve directory and file contents. This technique has the advantage of allowing read access to the existing HTTP namespace. Providing access to this namespace simplifies the implementation of many existing applications, including web crawlers. Unfortunately, its very simplicity presents a number of challenges in layering UNIX file semantics on top of HTTP.
On a client request for the contents of a directory containing a pre-determined filename (e.g. index.html), HTTP returns the contents of the file rather than the contents of the directory. In this case, WebFS presents a single entry for the directory with the same predetermined filename. The following techniques are utilized to work around this limitation. WebFS allows creation of directory entries on the fly: on access to a file that WebFS is unaware of, it attempts to retrieve the file from the remote site. If the file does indeed exist, a directory entry is created (failure is returned otherwise). Building on this technique, we also provide a utility program which parses an HTML file (e.g. index.html) creating entries for all links found in the file. Using these techniques, applications can use WebFS to access all files normally available through HTTP.
A second limitation that WebFS must address is the limited file statistics exported by HTTP when providing directory information. For example, HTTP server express file size in units of kilobytes or megabytes, making it impossible to ascertain exact file size unless the file is actually retrieved. Upon entering a previously unknown directory, WebFS uses the values retrieved from the HTTP server as the initial file sizes. Upon read access, the file is retrieved from the remote site allowing WebFS to update the actual size of the file. While this may confuse applications which attempt to stat a file before reading it, we have found that in practice standard UNIX applications gracefully deal with such semantics. Thus, we believe the performance savings associated with not retrieving files before access to determine file size is well worth the slightly different semantics. Note that both these limitations are only associated with providing compatibility with HTTP servers; they are eliminated if WebFS is running on the remote site.
WebFS uses Public Key Cryptography [11,25] to authenticate the identity of individual servers as well as clients. Access control lists (ACLs) are associated with each file that enumerate users who have read, write, or execute permission on individual files. Users are uniquely identified by their public keys. Files inherit default permissions from the containing directory. Thus, in the common case, no action is necessary on the user's part when creating a new file or directory. Of course, utility programs are provided to inspect and modify the ACL as necessary.
For wide area distributed applications, the choice of caching policy is crucial for application correctness, performance, and development ease. We expect to implement application-specific caching policies based on our experience with the development of various distributed applications. In the future, we hope to allow for application programmers to develop their own coherence policies if none of the provided techniques are appropriate (such a technique has been explored in other contexts[5,6]). Currently, the cache coherence policy is determined on a per-file basis. We plan to allow programs to select caching policies on a per file descriptor basis allowing different applications to access the same file with different coherence policies. In this section, we will describe three contexts for distributed computation and file sharing. We describe a cache coherence policy appropriate for each application.
Figure 2: Description of implementation of last writer
wins cache coherence policy.A file's host site maintains a directory of all sites currently caching the file. When the file is updated, invalidation messages are sent to all caching sites. |
For general file sharing in such cases as distributed program development, WebFS provides last writer wins cache coherence similar to what was originally developed for AFS[19]. Figure 2 describes the WebFS implementation of last writer wins. Each WebFS daemon maintains a directory of files exported from their site. For each file, a list of all sites caching the file is maintained. In this example, site B commits a write (corresponding to either a close or flush of a modified file) of a file foo hosted at site A. The WebFS daemon consults its directory and discovers that sites B and C are caching foo. An invalidation message is sent to site C, which is responsible for invalidating any cached kernel pages for file foo through a WebFS system call. As an optimization, the invalidate for site B is omitted since its cache value is current. Finally, the directory is cleared of all cachers which were sent invalidate messages (in this case, only site C).
The last writer wins coherence policy is appropriate in the case where a file is being widely shared for read access and only a single site is likely to be modifying a file at any given time. The "callbacks" provided for cache invalidates make this policy scalable to hundreds of simultaneous read cachers [19]. In the common case, clients access cached files without contacting the server and have reasonable guarantees of accessing the most recent version of the file. Note however that last writer wins is unlikely to scale well to the context of providing cache coherence for generalized Web browsing. Maintaining directory information for potentially millions of simultaneous file cachers (many of which will access the file exactly once) will impose severe storage and performance bottlenecks.
While last writer wins provides coherence for relatively small-scale file sharing, it does not preserve strict UNIX write semantics in the following case. If two sites simultaneously open a file for writing, then the updates of the last site to flush changes will overwrite any earlier modifications. Since last writer wins is most appropriate for cases where write sharing is the uncommon case, we believe that the advantages of simple, scalable implementation and coherence guarantees outweigh any differences from strict UNIX semantics. In practice, it has been found that simultaneous write sharing is fairly rare for UNIX workloads[26].
Figure 3: Chat rooms are WebFS files. Reads correspond to receiving conversation updates, while writes correspond to sending out a message (appending to the file). On receiving a write, WebFS updates all its local clients; updates are collected and propagated to other servers in a lazy fashion. |
The second cache coherence policy implemented in WebFS was driven by the implementation of an Internet chat application[29]. For this application, we required scalability to thousands of geographically distributed users. For chat, we determined that updates need not appear in the same order at all sites as in IRC. The append only coherence policy satisfies the needs of the chat application as summarized in Figure 3. A WebFS file is associated with each chat room. However, multiple WebFS servers, called a server group, can be contacted for sending and retrieving updates to the chat log. A read operation on the chat file corresponds to a poll for availability of new messages, while a write corresponds to appending a new message to the chat file. The client application is responsible for choosing the closest server in the group. WebFS servers package multiple file updates and synchronize their updates with the rest of the server group periodically. This policy ensures that clients connected to the same server will all see updates in the same order. However, clients connected to different server groups can potentially see updates out of order. With a synchronization period of 300 ms, we found that out of order updates were not a perceptible semantic problem.
An initial prototype of the chat application has been completed. We plan to conduct further tests to determine the scalability of our approach. We plan to use this append-only caching policy to implement a number of other distributed applications, including a shared whiteboard. Further, we believe distributed applications which are not necessarily bound by connection rate to the server, but rather suffer from weak connectivity (i.e. mobile clients), can benefit from the use of WebFS server groups which synchronize only when network connectivity becomes available. For example, a meeting room scheduler and a distributed bibliography server were implemented in the Bayou project [27] using a similar notion of server groups.
A number of emerging Internet services including whiteboards, video conferencing, interactive games, and news services have two interesting properties. First, all clients of the system are interested in all updates, and second, a significant amount of overlap is present among the updates. Unfortunately, the current practice of individually sending updates to each client is wasteful of Internet resources. A better solution is to use IP Multicast [9] to transmit updates for applications which in essence desire to periodically broadcast updates. Using router support, multicast replicates packets only along the necessary routes. For example, if a message is destined for multiple clients on a single subnet across the continent, a single packet is transmitted across the continent and is replicated only at the destination subnet. This approach is in contrast to the traditional approach of unicasting individual updates to each of the clients, wasting bandwidth and increasing average latency.
We have added multicast support to WebFS to experiment with various caching policies for broadcast applications. Currently, WebFS associates two multicast addresses with each file designated to use multicast for coherency. The first address, called channel A, is used to send invalidates of the file contents to interested clients. The second address, channel B, is used to send out updates of the file to all clients subscribing to the address. Channel A is used by clients which need to ensure that stale cache entries are not used, while channel B is used by clients needing the most current file contents (i.e. subscribers to broadcast services). The use of these multicast channels also simplifies server design and fault recovery. Since interested clients subscribe to either the invalidate or update channel, the server need not maintain a directory of file cachers as was necessary for the last writer wins protocol. On a server crash for example, the client directory does not need to be reconstructed as any updates or invalidates are simply sent to the appropriate multicast address.
Figure 4: A screenshot of the WebFS stock ticker
which uses multicast to send updates to interested clients. |
To explore the utility of multicast support in WebFS, we implemented a stock ticker which graphs stock trading price over the course of the day. A screen shot for the application is shown in Figure 4. The WebFS stock ticker is a Java[16] applet which uses the WebFS protocol to receive periodic updates of stock charts. A centralized daemon continually retrieves current stock prices from a stock quote service, updates the graphs for the day, and uses WebFS to multicast a bitmap representation of a new stock chart every 10 seconds. The use of a Java interface to WebFS not only makes the code portable across multiple platforms but also makes it unnecessary to install WebFS on the client machine for this application. Unfortunately, popular Web browsers such as Netscape's Navigator 3.0 and Microsoft's Internet Explorer 3.0 do not yet contain multicast support. Thus, the applet must currently be run through the appletviewer which is part of the Java Development Kit (JDK) provided by JavaSoft. We hope that future versions of browsers will include native multicast support. An alternative is to develop a browser plug-in supporting Java applets which require multicast support.
Multicast support can also be useful in the context of Web browsing and wide-scale shared access in general. For example, users can select coherency for sites where they desire to see the most current page contents on every visit (news sites for example). In the background, the browser can listen on the invalidate channel; on the first invalidate, the page is flushed from the browser's cache. At this point, the browser can cease listening on the channel as it is not interested in further invalidates. On the next client access to the page, the browser misses in its cache and retrieves the current page from the server. This argument can be extended to cache proxies in a straightforward manner.
A number of concerns are associated with such widespread deployment of multicast. First, the number of IP addresses reserved for multicast is limited to about one million in the current version of IP. However, with the impending release of IPv6[10] will increase this number to 2^112. Further, the number of addresses currently available more than suffice for any initial deployment of multicast for Web updates. A second concern is the space required in routing tables for each multicast address. If multicast use becomes very popular, the amount of memory required for routing tables of heavily used routers in the Internet backbone can become prohibitive. We believe that once multicast achieves such popularity, it is likely that compression techniques can be developed to ensure that the size of routing tables remains manageable. Next, multicast is not currently ubiquitously deployed. However, all major IP router manufacturers have implemented or are implementing multicast support in their products and all major OS vendors have added multicast support to their kernels. Finally, reliability is a concern in multicast since it is layered on top of UDP making duplicate and lost packets fairly common. We plan to leverage techniques from Scalable Reliable Multicast (SRM)[13] for applications requiring guarantees with respect to lost or duplicated messages.
A number of previous efforts including AFS[19], Alex[7], and UFO[1] have produced file systems which export a global name space. One of the first distributed file systems to provide access to a global name space across the wide area was AFS. The original goal of AFS was to improve the scalability of local area file servers. WebFS uses the notion of server-callbacks to implement its last writer wins cache coherency. While AFS possesses many strengths, a number of system limitations prevents AFS from supporting distributed applications. AFS source code is not freely available and thus does not facilitate experimentation. Further, AFS assumes the traditional UNIX file access model where widespread read/write sharing is the uncommon case. For many distributed applications, this assumption is invalid. We hope to use WebFS to determine cache coherence policies appropriate for a number of different distributed applications. Finally, AFS does not currently support the HTTP name space making it less attractive for many applications.
UFO provides access to a read-only HTTP name space, while providing read/write access to remote files using the FTP protocol. UFO has the advantage of being implemented entirely at the user-level using the Solaris proc filesystem to intercept filesystem relevant system calls. Thus, installation of UFO is straightforward and does not require root privileges. However, use of the proc filesystem limits performance. Further, in contrast to WebFS, UFO is only able to make very loose consistency guarantees in the case of FTP, and no guarantees for HTTP. Finally, accounts are required to access files on remote machines through FTP. We believe the requirement of providing a full user account to individuals hoping to access remote resources may limit widespread deployment of the system.
Similar to UFO, Alex is a user-level NFS server providing read only access to global anonymous FTP servers. The use of an NFS server implies improved performance of Alex relative to UFO. However, once again, Alex can only provide limited cache coherence guarantees. Further, the read-only nature of the file system makes Alex inappropriate for many distributed applications.
Recently, research on WebNFS[22] has proposed replacing HTTP with NFS as the transport layer for the Web. In some cases, an order of magnitude speedup over HTTP is claimed because of the great amount of tuning that has gone into NFS. WebFS is largely independent of the transport protocol underneath. If in the future WebNFS gains popularity, it should be relatively straightforward to support the new protocols in WebFS.
A number of research efforts, including Coda[23] and Bayou[27] have addressed file system and application semantics in the context of disconnected operation (mobile machines connected to the network intermittently or over slow links). We plan to leverage the techniques utilized for disconnected operation to make WebFS resilient to failures (which we believe will likely be the common case in the scale of the Internet). It is interesting to note that many of the challenges of mobile computing can also apply in the context of the Web.
Projects such as Legion [17], Globus[14], and Atlas[2] aim to provide an infrastructure for global computation (i.e. a world wide supercomputer). WebFS is complimentary to such projects in the sense that it can provide the global name space, authentication, and cache coherence necessary to execute such global applications. We plan to explore cache coherence policies appropriate for different classes of parallel applications. For example, one can imagine a distributed shared memory (DSM) abstraction built on top of a memory-mapped WebFS file.
WebFS is a part of the larger WebOS project at UC Berkeley. In the future, we plan to support multiple application-specific cache coherency policies. Cache policies may be determined by the applications or by WebFS in response to observed file access patterns. We further plan to investigate fault tolerance policies such as hoarding[20] and eventual consistency[27]. In the scale of the Web, we expect network partitions and host failures to be the common case. As such, we plan to implement hoarding and reintegration policies similar to techniques presented in Coda and Bayou.
In addition to improving the basic file system functionality and performance, we believe that WebFS will simplify the implementation of many new distributed applications. We plan to investigate the implementation of a number of such applications, including: a distributed whiteboard applet, dynamically migratable HTTP servers, interactive games, and multicasts of news sources.
In conclusion, WebFS is a new global file system providing read/write access to the web name space to unmodified UNIX applications. An initial prototype has been completed for the Solaris operating system and is available for download from the Webos homepage. WebFS is backwardly compatible with HTTP and provides access to the Web name space for unmodified applications. By interfacing at the vnode layer, WebFS provides cached access performance comparable to existing file systems. Using multicast technology, WebFS is able to efficiently distribute updates of widely shared Internet data. To demonstrate the functionality of our system, we have developed two distributed applications: Internet chat and a stock ticker. Our initial results indicate that a global cache coherent file system is a useful substrate and abstraction for implementation of scalable Internet services.