
Gnutella: To the Bandwidth Barrier and Beyond
November 6, 2000


Overview
In this report, Clip2 DSS describes the evolution and present condition of the Gnutella peer-to-peer file-sharing network based on substantial data gathered over a five-month period. Significantly, we find the network has neither smoothly scaled nor catastrophically collapsed since average traffic grew to regularly exceed dial-up modem bandwidth in August 2000. Instead, the network persists in a fragmented state composed of numerous continuously evolving responsive segments, the largest of which typically contains hundreds of hosts. We estimate at present that unique Gnutella users per day number no fewer than 10,000 and may range as high as 30,000. We suggest that further technical innovation and wide adoption of this innovation are necessary for the Gnutella network to scale beyond its present state.

The fraction of hosts providing files for download has varied between 15% and 40%, rebounding in October following a substantial slide in September. The majority of hosts are online for minutes to hours, with a minority component (measured at 30% in early August) connected on timescales of a day or longer. Users continue to query the network primarily for audio, video, image, and program files, and automated indexing schemes regularly constitute a few percent of query traffic.

In a large-scale survey, one out of three network hosts was found outside the US-centric top-level domains, with Germany and Japan dominating the non-US population. Internet Service Provider second-level domains accounted for most of the top sources of hosts on the COM and NET domains, and hosts on the @Home and Road Runner broadband services were especially prevalent. Gnutella hosts were found at over 550 institutions in the EDU domain, with MIT and Virginia Tech leading a broadly distributed pack.



Introduction
The history of Gnutella can be divided into two eras: "pre-barrier" and "post-barrier," where the terms refer to the traffic level on the public Gnutella network relative to dial-up modem bandwidth, and the dividing line falls in August 2000. In this article, Clip2 DSS publishes previously unreleased data on the state of the pre-barrier network and provides additional information illustrating the barrier transition on which we first reported on September 8. In addition, we illuminate conditions on the poorly understood post-barrier network and examine the locations of network hosts. Finally, we introduce an analogy to describe the current state of the network and indicate future scenarios for network development.

Gnutella Genesis
On March 14, 2000 at 8:31 AM PST, a message posted on Slashdot spread the news that AOL's Nullsoft division had released an "open source Napster clone" named "Gnutella." On March 15 at 1:25 PM PST, Wired News reported that Nullsoft's distribution of "a file-sharing software tool which could be even more potent than Napster" had ceased. Nonetheless, third parties redistributed copies of Gnutella downloaded from Nullsoft, and in extraordinarily short order a number of "Gnutella clones" appeared to augment the "official" Nullsoft version. Clones spoke the same language - the Gnutella protocol - as the Nullsoft application, and could therefore communicate with each other and the Nullsoft version. As users began to run these programs and connect them to one another via the Internet, a network comprised of a mixture of Gnutella-speaking applications formed. Because these applications acted simultaneously as both servers and clients in a fully decentralized system, the term "servent" was adopted to describe them. In addition, "Gnutella" came to have several meanings:

  • Gnutella = the servent produced by Nullsoft. A program identified only as "Gnutella" (often "Gnutella v0.56") has seen wide distribution; one venue alone, Download.com, reports approximately 250,000 downloads to date.
  • Gnutella = the protocol, of which Version 0.4 is in widespread use. Clip2 DSS has published a protocol specification document.
  • Gnutella = the network. At any time there is a single major publicly accessible network of Gnutella-speaking applications ("the" network) and a number of smaller, fully disconnected (and often private) networks. Clip2 DSS tracks the major network, and we report related information on our home page.
  • Gnutella = any combination of the above, or the system as a whole.
In this article, we discuss the evolution of Gnutella, the network.

Pre-Barrier Gnutella
Clip2 DSS began conducting systematic studies of the Gnutella network in June 2000. Most Gnutella servents provide a "host count" feature in which they report to the user a dynamically updated count of the hosts they have identified on the network. During the period from June through mid-August, servents connected to the public Gnutella network for at least a matter of minutes typically reported host counts on the order of 1,000 to 4,000 hosts. Clip2 DSS's Gnutella network crawler regularly found 1,000 to 8,000 hosts online during the course of a sub-1-hour full-network traversal. Were these hosts connected in a tight concentration or in a loosely knit and far-flung configuration? How many connections did a typical host have? We were able to answer these questions using the network graphs generated by our crawler.

The "concentration" of the network is a particularly interesting question because Gnutella servents typically issue queries with a "TTL" of 7, meaning a user's query will travel up to 7 hosts away on the Gnutella network from the originating computer. There is always a (not necessarily unique) shortest path between any two hosts on the network, and the longest such path is known in mathematical graph theory as the "diameter" of the network. Clearly, if the diameter were greater than (7*2+1)=15, there would be hosts from which a query could be launched and not reach the entire network before expiring. We saw such high-diameter networks in early July. Below, we show data for a crawl of 1,959 hosts on July 7. Here, we plot on a semi-log scale the number of pairs of hosts that had a given shortest-path distance (or "separation") between them. The diameter of the network discovered by this crawl was 22, indicating some regions were not in communication with others (assuming messages used the common TTL value). We also note that most pairs of hosts were separated by 7 hops.
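The separation and diameter measurements described above can be illustrated with a breadth-first search over a crawl graph. The following is a minimal sketch, not the Clip2 DSS crawler's code: the function names, the adjacency-dictionary representation, and the 20-host chain are all assumptions made for illustration, and a query with a given TTL is assumed to reach hosts up to that many hops away.

```python
from collections import Counter, deque


def hop_distances(graph, source):
    """Breadth-first search: hop count from source to every reachable host."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        host = queue.popleft()
        for neighbor in graph[host]:
            if neighbor not in dist:
                dist[neighbor] = dist[host] + 1
                queue.append(neighbor)
    return dist


def separation_histogram(graph):
    """Count unordered host pairs at each shortest-path separation."""
    hist = Counter()
    for source in graph:
        for target, d in hop_distances(graph, source).items():
            if source < target:  # count each pair once (host ids must be orderable)
                hist[d] += 1
    return hist


def reach_fraction(graph, source, ttl=7):
    """Fraction of the other hosts that lie within ttl hops of source."""
    dist = hop_distances(graph, source)
    reached = sum(1 for d in dist.values() if 0 < d <= ttl)
    return reached / (len(graph) - 1)


# Toy network (hypothetical, not crawl data): 20 hosts in a chain, 0-1-...-19.
chain = {i: [j for j in (i - 1, i + 1) if 0 <= j < 20] for i in range(20)}
hist = separation_histogram(chain)
diameter = max(hist)  # the largest separation found between any pair
print(diameter, round(reach_fraction(chain, 0), 2))  # 19 0.37
```

In this toy chain the diameter is 19, so a TTL=7 query launched from the end host reaches only 7 of the 19 other hosts. Plotting the values of hist on a logarithmic axis gives a separation plot of the kind described for the July 7 crawl.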

In the latter days of the pre-barrier period, the network diameter fell to smaller values, typically 8 or 9. Below, we show network diameters for crawls made between July 25 and August 18. Note that while the period of larger diameters around July 28 corresponds to the "Napster Flood", a larger diameter network is not necessarily a direct result of more hosts connecting to the network. A network with few hosts can have a large diameter and vice versa; the diameter is purely a function of how the hosts are connected, not the number of hosts. The smaller diameters are significant in that they imply any host on the network could easily reach all other hosts using TTL=7.
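The point that diameter depends on how hosts are connected, rather than on how many there are, can be checked on two small hypothetical topologies. This sketch assumes a connected graph stored as an adjacency dictionary; neither topology is taken from crawl data.

```python
from collections import deque


def diameter(graph):
    """Longest shortest path (in hops) over all host pairs of a connected graph."""
    longest = 0
    for source in graph:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            host = queue.popleft()
            for neighbor in graph[host]:
                if neighbor not in dist:
                    dist[neighbor] = dist[host] + 1
                    queue.append(neighbor)
        longest = max(longest, max(dist.values()))
    return longest


# 12 hosts in a chain versus 500 hosts around a single hub (both hypothetical).
chain = {i: [j for j in (i - 1, i + 1) if 0 <= j < 12] for i in range(12)}
star = {0: list(range(1, 500))}
star.update({i: [0] for i in range(1, 500)})

print(diameter(chain))  # 11
print(diameter(star))   # 2
```

The 12-host chain has a diameter of 11, while the 500-host star has a diameter of only 2.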

Further examining host connectivity, we found a host was most likely to have a single connection, and hosts with higher numbers of connections were increasingly uncommon. In more technical terms, we saw roughly power-law degree distributions, where "power-law" means the number of hosts having a given degree varied as a power of the degree, and "degree" is shorthand for the total number of incoming and outgoing connections a given host has open. In addition to host separations, the degree distribution is another means of quantitatively assessing network connectivity, and we present below the degree distribution based on a 1,813-host crawl made on July 7. One particularly interesting coincidence is that other researchers (e.g., Broder et al. 1999) have found power-law degree distributions for graphs in which the nodes are Web pages and the connections are Web links. In the plot, we compare measured data with a least-squares best fit of slope (power-law index) = -2.3.
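A power-law index of this kind is the slope of a straight line fitted to the degree distribution on log-log axes. The sketch below shows one way to compute it with an ordinary least-squares fit. The function names are hypothetical, and the counts are synthetic values constructed to follow an exact power law of index -2, not the July 7 measurements.

```python
import math
from collections import Counter


def degree_counts(graph):
    """Map each degree to the number of hosts with that many connections."""
    return Counter(len(neighbors) for neighbors in graph.values())


def power_law_index(counts):
    """Least-squares slope of log10(host count) versus log10(degree)."""
    points = [(math.log10(k), math.log10(n))
              for k, n in counts.items() if k > 0 and n > 0]
    mean_x = sum(x for x, _ in points) / len(points)
    mean_y = sum(y for _, y in points) / len(points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    return sxy / sxx


# Synthetic counts (not crawl data) that follow n = 1000 * k**-2 exactly.
synthetic = {k: 1000 * k ** -2 for k in (1, 2, 4, 5, 10)}
print(round(power_law_index(synthetic), 3))  # -2.0
```

Here degree_counts expects an undirected adjacency dictionary in which every connection is listed at both of its endpoints, so the length of a host's neighbor list equals its total number of open connections.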

As the above data show, the structure of the Gnutella network is in a continuous state of flux. Hosts come and go as quickly as users open and close Gnutella applications, and connections between hosts may only last for seconds. We found that half of an initial host population persisted after five hours, and that approximately 30% of the initial host population was stable on the timescale of 24 hours. We determined this result by comparing host populations found in a succession of crawls to an initial reference crawl and by repeating this analysis for multiple reference crawls; the plot below illustrates our findings.
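The crawl-comparison procedure described above amounts to intersecting sets of host addresses. The following sketch is an illustration under stated assumptions, not the analysis code used for the study: crawls are assumed to be equally spaced in time, each crawl is a set of host identifiers, every crawl serves in turn as a reference, and the host names are made up.

```python
def persistence_by_lag(crawls):
    """Average surviving fraction at each lag, using every crawl as a reference.

    crawls is a time-ordered list of host sets; lag 1 means "one crawl later".
    """
    sums, counts = {}, {}
    for i, reference in enumerate(crawls):
        ref = set(reference)
        for lag, crawl in enumerate(crawls[i + 1:], start=1):
            sums[lag] = sums.get(lag, 0.0) + len(ref & set(crawl)) / len(ref)
            counts[lag] = counts.get(lag, 0) + 1
    return {lag: sums[lag] / counts[lag] for lag in sums}


# Three hypothetical crawls of four hosts each.
crawls = [
    {"a", "b", "c", "d"},
    {"a", "b", "c", "e"},
    {"a", "b", "f", "g"},
]
print(persistence_by_lag(crawls))  # {1: 0.625, 2: 0.5}
```

Multiplying each lag by the interval between crawls converts the result into a persistence-versus-time curve.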

Gnutella Hits the Wall
As we first reported in "Bandwidth Barriers to Gnutella Network Scalability" (September 8), average Gnutella network traffic began to regularly exceed the throughput capacity of dial-up modems in August. Hosts connected to the Internet via dial-up modems ceased to be able to effectively participate as peers on the Gnutella network. These hosts essentially became dead-ends, resulting in a widespread fragmentation of the Gnutella network into effectively disconnected components comprised of hosts with higher-speed Internet connections. The animation below illustrates the evolution of the network as traffic passed the dial-up barrier.
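The fragmentation mechanism can be modeled by treating saturated hosts as nodes that forward nothing and then finding the connected components that remain. This is a simplified sketch with a hypothetical six-host network. Real dial-up hosts stay attached as dead ends rather than vanishing, and the function name and the "saturated" set are assumptions made for illustration.

```python
def responsive_segments(graph, saturated):
    """Connected components among hosts that can still relay traffic.

    graph maps each host to its neighbors; saturated is the set of hosts
    (for example, dial-up hosts) treated as dead ends that forward nothing.
    Segments are returned largest first.
    """
    seen, segments = set(saturated), []
    for start in graph:
        if start in seen:
            continue
        segment, stack = set(), [start]
        seen.add(start)
        while stack:
            host = stack.pop()
            segment.add(host)
            for neighbor in graph[host]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    stack.append(neighbor)
        segments.append(segment)
    return sorted(segments, key=len, reverse=True)


# Hypothetical chain a-b-m-c-d-e in which m is a saturated dial-up host.
network = {
    "a": ["b"], "b": ["a", "m"], "m": ["b", "c"],
    "c": ["m", "d"], "d": ["c", "e"], "e": ["d"],
}
segments = responsive_segments(network, saturated={"m"})
print([sorted(s) for s in segments])  # [['c', 'd', 'e'], ['a', 'b']]
```

Once host m stops relaying, the single network splits into two segments that cannot see each other, even though no host has left.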

Data gathered by Clip2 DSS clearly illustrate the effective loss of dial-up hosts from the responsive portion of the Gnutella network. In one long-term experiment, we regularly visited hosts and issued probe-like Gnutella "ping" messages of TTL=2 to discover their neighbors. As shown below, unique hosts sending Gnutella "pong" messages in response to these small-TTL pings declined substantially and permanently in mid-to-late August.

As noted earlier, most Gnutella servents display a host count based on the number of pongs received since the application began running. As the barrier was passed, numerous postings on public user forums noted servents were displaying lower-than-usual host counts, often only in the tens or hundreds. Many users interpreted the decrease in host counts as meaning the total number of hosts on the network had decreased. However, as the above data show, the change in network responsiveness was rather abrupt. Such an abrupt transition is much more plausibly explained by the reaching of a technical barrier than a mass change in user behavior. Had users departed the network, we would have expected to see a decrease in the usage rate of the Clip2 DSS host list service; instead, we saw an unabated rise in usage. In addition, even though responses in the ping experiment had dropped, this experiment continued to find ten to twenty times the number of hosts that would be reported in a servent's host counter over a similar time interval. In sum, we found no evidence supporting a user exodus and multiple indications that the host population remained sizable but fragmented. The total number of hosts online can be substantially larger than the number reported in a servent's host counter, since a servent only sees out as far as the boundary of the responsive region to which the servent is connected.

The preceding discussion raises the question of the source of the traffic that caused the network to reach the dial-up modem bandwidth barrier. Was it the result of continued growth in the number of Gnutella users, or was it the result of the introduction of programmatic sources of traffic, such as machine-generated spam? The question is difficult to answer due to a lack of comprehensive data. Clip2 DSS reported on one anomalous form of traffic seen on the network in early September that may have existed for some time prior. Since that report, we have regularly observed other forms of apparently automated messages on the network, including repeated series such as {a.mp3, b.mp3, c.mp3, ...} that appear to be network-indexing attempts. While this is an unresolved question, it is moot in a sense, because with continued growth in the user base, user-generated network traffic would have eventually reached the barrier level of its own accord.

Gnutella Beyond the Barrier
What is the state of the post-barrier Gnutella network? Among other sources, we can find some clues in data from the Clip2 DSS-operated "gnutellahosts.com" host list service, which publishes IP addresses of live Gnutella hosts. Approximately 10% of users access this list by visiting the Clip2 DSS home page; the remainder retrieve addresses by connecting their servents to a special-purpose Gnutella server operated by Clip2 DSS at gnutellahosts.com, port 6346. After responding to the incoming Gnutella servent with multiple IP addresses, the gnutellahosts.com server disconnects, and the servent proceeds to connect directly to the hosts at the provided addresses. As noted in the previous section, we have not observed any decrease in the traffic to gnutellahosts.com due to Gnutella having hit the barrier. On the contrary, traffic has continued to grow in the post-barrier period. Below, we show the long-term (3-month) straight-line trend in gnutellahosts.com usage.

How many users are there on the post-barrier network? How are they connected? From the number of callers to gnutellahosts.com and our probing experiments, we found no evidence of a sudden population collapse as the barrier was passed. Instead, we found evidence that the population had fragmented into multiple dynamically changing responsive and unresponsive segments. The sum of all data sources leads us to estimate that the total number of daily users of Gnutella numbers between 10,000 and 30,000, where the lower bound is a much better approximation than the upper bound. Below, we show the numbers of hosts in the largest responsive network segments we were able to identify over a range of post-barrier dates.

Since the fragmentation occurred, it has been a matter of chance whether or not a user manages to find and remain connected to a responsive segment. Typically, the gnutellahosts.com host list service has provided addresses of hosts in the largest identifiable responsive segment, although this region is a moving target. In order to track it, Clip2 DSS has refined its crawling strategy in recent weeks and regularly crawls the network on the timescale of every 15 minutes.

On the post-barrier network, dial-up modem users cannot effectively participate as peers throughout the network. What can be done to alleviate this situation? One solution is to connect these users to high-speed proxies that handle network traffic on their behalf. This is an underlying concept of the Clip2 Reflector(TM), a special-purpose Gnutella server, and the network architecture that results is illustrated below:

Reflectors are programmed to maintain connectivity to the most responsive segment of the network by calling gnutellahosts.com (by default). Dial-up users singly connect to Reflectors that in turn maintain multiple outgoing network connections on their behalf. A list of running public-access Reflectors can be found on the Clip2 DSS home page.

How different is user behavior in the post-barrier period relative to the pre-barrier period? One measure of behavior is the fraction of hosts serving a non-zero number of files. In a 24-hour period in early August, during the late pre-barrier era, researchers at Xerox PARC found 30% of hosts made available one or more files for download (Adar & Huberman 2000). Using a different methodology, Clip2 DSS independently measured the "serving fraction" before, during, and since the period of the PARC study. Our results confirm theirs during the period of their study. In the final days of the pre-barrier era, we observed an increase in the serving fraction to a maximum in excess of 40%. However, as the network evolved into the post-barrier period, we saw a substantial decrease in the serving fraction, down to a low of less than 15%. Notably, since early October we have observed a general rise in the serving fraction back to near pre-barrier levels. Below, we plot the serving fraction over time.

What are users searching for in the post-barrier period? Clip2 DSS analyzed three query stream samples of varying sizes taken on three different dates. Applying a subjective analysis to categorize 2,000 queries heard on September 19, we found the following breakdown:

Notes on these categories: The "gibberish" category includes queries consisting of non-alphanumeric characters; among other sources, we have seen such queries generated by poorly programmed clients that do not properly read and forward Gnutella query messages. The "automated indexing" category includes queries that appeared to be programmatically generated with the intent of indexing network content, such as "a.mp3", "b.mp3", etc. "File extension only" queries are just that, containing extensions such as "mp3" or "mpg" (possibly with accompanying periods, asterisks, or both) but no other content. "Song and artist+song" counts queries either containing only a song title or a song title along with an artist name. "Artist" counts queries containing just an artist name.
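The three mechanical categories above (gibberish, automated indexing, and file extension only) lend themselves to simple pattern rules, whereas the song and artist categories require human judgment. The sketch below is a rough heuristic and not the subjective procedure applied to the September 19 sample. The extension list and the single-character rule for indexing queries are assumptions.

```python
import re

# Illustrative extensions only; the analysis may have used a different list.
EXTENSIONS = ("mp3", "mpg", "avi", "jpg", "exe")
_EXT = "|".join(EXTENSIONS)


def classify(query):
    """Assign a query to one of three mechanical categories, or 'other'."""
    q = query.strip().lower()
    if not re.search(r"[a-z0-9]", q):
        return "gibberish"  # no alphanumeric characters at all
    if re.fullmatch(r"[a-z0-9]\.(?:%s)" % _EXT, q):
        return "automated indexing"  # e.g. "a.mp3", "b.mp3"
    if re.fullmatch(r"[.*\s]*(?:%s)[.*\s]*" % _EXT, q):
        return "file extension only"  # e.g. "mp3", "*.mpg"
    return "other"  # songs, artists, and everything else


for sample in ("a.mp3", "*.mpg", "@#$%", "some artist"):
    print(sample, "->", classify(sample))
```

Queries that fall through to "other" would still need manual review to separate song titles from artist names.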

By objectively analyzing only those queries containing a file extension anywhere in the query string, we are able to analyze much larger data sets. We found from 20% to 40% of queries in three separate stream samples of 2,000, 30,000, and 150,000 queries contained a popular file extension somewhere in the query string. Below, among the set of queries that contained an extension, we show the frequencies of specific types of extensions.
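An extension count of this kind can be automated with a single regular expression. The following is a minimal sketch, assuming a hypothetical list of "popular" extensions and a handful of made-up queries. It reports the fraction of queries containing at least one extension and the number of queries in which each extension appears.

```python
import re
from collections import Counter

# Assumed list of popular extensions; the list used in the analysis is not given.
POPULAR = ("mp3", "mpg", "mpeg", "avi", "jpg", "gif", "exe", "zip")

# Match an extension as a whole token anywhere in the query string.
PATTERN = re.compile(
    r"(?<![a-z0-9])(%s)(?![a-z0-9])" % "|".join(POPULAR), re.IGNORECASE
)


def extension_stats(queries):
    """Return (fraction of queries with an extension, per-extension counts)."""
    counts = Counter()
    matched = 0
    for query in queries:
        found = {ext.lower() for ext in PATTERN.findall(query)}
        if found:
            matched += 1
            counts.update(found)  # each extension counted once per query
    return matched / len(queries), counts


# Made-up queries for illustration.
sample = ["some song.mp3", "*.MPG", "an artist", "holiday jpg", "a.mp3"]
fraction, counts = extension_stats(sample)
print(fraction, dict(counts))
```

Because the rule is purely mechanical, it scales to samples far larger than a manual categorization can cover.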

Note the programmatically generated queries of the form "a.mp3", "b.mp3", etc. were a recurring feature in every sample, and the samples were spread over a 21-day period. These queries likely originated from a single source and amounted to a few percent of total network query traffic.

Global Gnutella
We complete our survey of the Gnutella network by examining the access points of Gnutella hosts. Where are Gnutella hosts? To arrive at an answer, we utilized 3.3 million non-unique IP addresses gathered continuously by Clip2 DSS between July 27 and November 3, spanning both Gnutella eras. Of these addresses, 1.3 million (39%) were resolvable to non-numeric hostnames, and our reports below are on this resolvable subset. The populations we report below should be interpreted in probabilistic terms; the data have not been de-duplicated, so that the relative populations represent relative probabilities of host discovery within a given domain.

We divided top-level domains into US-Centric (COM, NET, EDU, ORG, MIL, GOV, US) and Non-US-Centric (all others) categories, although we note a number of non-US-based organizations operate domains in the US-Centric set. Our first finding is that Gnutella is a truly international phenomenon, with one out of three hosts located on a non-US-centric domain.
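The division by top-level domain can be sketched as a tally over resolved hostnames. The hostnames below are placeholders rather than survey data, and the function name is hypothetical. ORG is included in the US-Centric set as an assumption, because ORG hosts are reported among the US-Centric results later in this section.

```python
from collections import Counter

# ORG is an assumption here: it is reported among US-Centric hosts further on.
US_CENTRIC = {"com", "net", "edu", "org", "mil", "gov", "us"}


def tld_breakdown(hostnames):
    """Tally hostnames by category and by top-level domain."""
    categories, tlds = Counter(), Counter()
    for name in hostnames:
        tld = name.rstrip(".").rsplit(".", 1)[-1].lower()
        tlds[tld] += 1
        label = "US-Centric" if tld in US_CENTRIC else "Non-US-Centric"
        categories[label] += 1
    return categories, tlds


# Placeholder hostnames (not from the survey).
hosts = ["cpe-1.example.com", "dsl.example.de", "host.example.co.jp", "lab.example.edu"]
categories, tlds = tld_breakdown(hosts)
print(dict(categories))
```

Because the address list was not de-duplicated, tallies like these measure relative probabilities of host discovery rather than counts of unique hosts.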

Among Non-US-Centric domains, 95% of hosts were found in just 17 country-specific top-level domains. European domains comprised 67% and the Asia-Pacific domains comprised 20%.

Among US-Centric hosts, COM, NET, and EDU domains predictably dominated, although non-zero numbers of hosts were found on each of ORG, US, GOV, and MIL domains (in ratios 19:8:2:1, respectively).

In the case of the COM, NET, and EDU domains, we dug deeper to examine populations at the level of second-level domain names. The popular second-level COM domains were primarily broadband Internet service providers. The ISP @Home accounted for half of all Gnutella hosts in the COM domain, and Road Runner trailed in second place at nearly one quarter of hosts. Gnutella hosts with resolvable host names in the COM domain were therefore strongly concentrated in a small number of second-level domains, with 87% of hosts residing in the top 25. Among the top 25 were six second-level domains either representing non-ISP companies or organizations whose nature we could not determine. In total, Gnutella hosts were found on 1750 unique second-level domains within the COM domain.
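Concentration figures such as "87% of hosts residing in the top 25" follow from grouping hostnames by their last two labels and summing the largest groups. The sketch below assumes hostnames already restricted to one top-level domain. The domain names are invented placeholders, not the ISPs named above.

```python
from collections import Counter


def second_level_domain(hostname):
    """Return the last two labels of a hostname, e.g. 'bigisp.com'."""
    labels = hostname.rstrip(".").lower().split(".")
    return ".".join(labels[-2:])


def top_share(hostnames, top_n=25):
    """Return (share of hosts in the top_n second-level domains, unique domains)."""
    counts = Counter(second_level_domain(h) for h in hostnames)
    in_top = sum(n for _, n in counts.most_common(top_n))
    return in_top / len(hostnames), len(counts)


# Invented hostnames for illustration.
com_hosts = ["a.bigisp.com", "b.bigisp.com", "c.cableco.com", "d.smallshop.com"]
share, unique = top_share(com_hosts, top_n=2)
print(share, unique)  # 0.75 3
```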

The popular second-level NET domains were exclusively Internet service providers, both broadband and dial-up. The distribution among NET domains was less concentrated than among COM domains, with the leader, Road Runner, claiming less than 10% of Gnutella hosts on the NET domain. Note the second-place second-level NET domain was a German company, illustrating that while the NET domain is US-centric, it is not US-exclusive. Over 2000 unique second-level domain names were represented in the Gnutella host population.

The distribution of hosts among EDU domains was even less concentrated than among NET domains. The Massachusetts Institute of Technology led the pack at 3.8% by a sizable margin. Virginia Tech made a strong showing at slightly less than 3%, and the distribution declined smoothly from 3rd-place University of Southern California through the remainder of the list. In all, Gnutella hosts were discovered on over 550 second-level domains within the EDU domain.

In summary, major conclusions are that (1) the Gnutella network is an international phenomenon led by the US, Germany, and Japan; (2) substantial populations of hosts in COM and NET domains are on ISP second-level domains; and (3) hosts are widely distributed among EDU domains.

Gnutella Tomorrow
The Gnutella network is analogous to a continuous global rave: an informal, decentralized, unregulated gathering without a permanent location. Network hosts, like rave attendees, come and go unpredictably, connecting and disconnecting as fast as ravers switch dance partners. In the pre-barrier era, the Gnutella rave could accommodate all comers. The effect of the network hitting the dial-up modem barrier is analogous to a rave venue reaching capacity, with many would-be revelers being crammed shoulder-to-shoulder and left unable to dance, and still more spilling out the doors. Small regions within the crowd, corresponding to responsive segments of the Gnutella network, remain sufficiently open to enable movement. These pockets form and vanish, grow and shrink, and merge and split on a variety of timescales. In the post-barrier era, the Gnutella rave remains more popular than the typical raver, pressed in among the crowd, might realize. The crowding results in the potential of the gathering being released in isolated bursts rather than in a continuous widespread discharge. While Gnutella has not scaled, we call attention to the fact that it has also not collapsed. Like many simple decentralized systems, it has remained remarkably robust in the face of technical adversity. In the present post-barrier period, Gnutella exists in an intermediate state between scaling and collapsing.

In the opinion of Clip2 DSS, this situation is likely to persist until either (1) the user population collapses and traffic falls to pre-barrier levels or (2) an "organizing principle" for connectivity that enables scaling takes root. In the former case, the improved performance that would result could potentially drive alienated users to return to the network, driving resurgence in traffic, another barrier crossing, and repetition of the entire cycle. In the latter case, examples of organizing connectivity principles include (1) dial-up users regularly connecting to broadband Reflectors and (2) widespread adoption of servents with sophisticated and consistently implemented connection management rules. However, to be widely and rapidly successful, any organizing principle must require no user action or change in behavior that is not immediately and powerfully rewarded, and it must not involve a change to the protocol that breaks the considerable installed application base. In the post-barrier era, there have been various initiatives to create new and smaller networks - spin-off raves - enabled by user adjustment of the "Gnutella handshake" mechanism in servents that support this feature. However, because the underlying technology is no different, if traffic on these networks were to grow sufficiently large, they would be subject to bandwidth barriers as well. These attempts to re-create pre-barrier conditions do so at the cost of sacrificing the relatively large user base on the main network and do not directly address the problem. To move beyond its present state, Gnutella awaits widely adopted technical innovation.


Copyright © 2001 Clip2.com, Inc. All rights reserved.