Overview In this report, Clip2 DSS describes the evolution and
present condition of the Gnutella peer-to-peer file-sharing network based on
substantial data gathered over a five-month period. Significantly, we find the
network has neither smoothly scaled nor catastrophically collapsed since
average traffic grew to regularly exceed dial-up modem bandwidth in August
2000. Instead, the network persists in a fragmented state comprised of numerous
continuously evolving responsive segments, the largest of which typically
contains hundreds of hosts. We estimate at present that unique Gnutella users
per day number no less than 10,000 and may range as high as 30,000. We suggest
that further technical innovation and wide adoption of this innovation are
necessary for the Gnutella network to scale beyond its present state.
The fraction of hosts providing files for download has varied between 15% and
40%, rebounding in October following a substantial slide in September. The
majority of hosts are online for minutes to hours, with a minority component
(measured at 30% in early August) connected on timescales of a day or longer.
Users continue to query the network primarily for audio, video, image, and
program files, and automated indexing schemes regularly constitute a few
percent of query traffic.
In
a large-scale survey, one out of three network hosts was found outside the
US-centric top-level domains, with Germany and Japan dominating the non-US
population. Internet Service Provider second-level domains accounted for most
of the top sources of hosts on the COM and NET domains, and hosts on the @Home
and Road Runner broadband services were especially prevalent. Gnutella hosts
were found at over 550 institutions in the EDU domain, with MIT and Virginia
Tech leading a broadly distributed pack.
Introduction The history of Gnutella can be divided
into two eras: "pre-barrier" and "post-barrier," where the terms refer to the
traffic level on the public Gnutella network relative to dial-up modem
bandwidth, and the dividing line falls in August 2000. In this article, Clip2
DSS publishes previously unreleased data on the state of the pre-barrier
network and provides additional information illustrating the barrier transition
on which we first reported
on September 8. In addition, we illuminate conditions on the poorly understood
post-barrier network and examine the locations of network hosts. Finally, we
introduce an analogy to describe the current state of the network and indicate
future scenarios for network development.
Gnutella Genesis On March 14, 2000 at
8:31 AM PST, a message posted
on Slashdot spread the news that AOL's
Nullsoft division had released an "open
source Napster clone" named "Gnutella." On March 15 at 1:25 PM PST, Wired News
reported
that Nullsoft's distribution of "a file-sharing software tool which could be
even more potent than Napster" had ceased. Nonetheless, third parties
redistributed copies of Gnutella downloaded from Nullsoft, and in
extraordinarily short order a number of "Gnutella clones" appeared to augment
the "official" Nullsoft version. Clones spoke the same language - the Gnutella
protocol - as the Nullsoft application, and could therefore communicate with
each other and the Nullsoft version. As users began to run these programs and
connect them to one another via the Internet, a network comprised of a mixture
of Gnutella-speaking applications formed. Because these applications acted
simultaneously as both servers and clients in a fully decentralized system, the
term "servent" was adopted to describe them. In addition, "Gnutella" came to
have several meanings:
- Gnutella = the servent produced by Nullsoft.
A program identified only as "Gnutella" (often "Gnutella v0.56") has seen wide
distribution; one venue alone, Download.com, reports approximately 250,000
downloads to date.
- Gnutella = the protocol, of which Version 0.4
is in widespread use. Clip2 DSS has published a
protocol specification
document.
- Gnutella = the network. At any time there is
a single major publicly accessible network of Gnutella-speaking applications
("the" network) and a number of smaller, fully disconnected (and often private)
networks. Clip2 DSS tracks the major network, and we report
related information on our home page.
- Gnutella = any combination of the above, or
the system as a whole.
In this article, we discuss the evolution of
Gnutella, the network.
Pre-Barrier Gnutella Clip2 DSS began conducting
systematic studies of the Gnutella network in June 2000. Most Gnutella servents
provide a "host count" feature in which they report to the user a dynamically
updated count of the hosts they have identified on the network. During the
period from June through mid-August, servents connected to the public Gnutella
network for at least a matter of minutes typically reported host counts on the
order of 1,000 to 4,000 hosts. Clip2 DSS's Gnutella network crawler regularly
found 1,000 to 8,000 hosts online during the course of a sub-1-hour
full-network traversal. Were these hosts connected in a tight concentration or
in a loosely knit and far-flung configuration? How many connections did a
typical host have? We were able to answer these questions using the network
graphs generated by our crawler.
The "concentration" of the network is a particularly
interesting question because Gnutella servents typically issue queries with a
"TTL" of 7, meaning a user's query will travel up to 7 hosts away on the
Gnutella network from the originating computer. There is always a (not
necessarily unique) shortest path between any two hosts on the network, and the
longest such path is known in mathematical graph theory as the "diameter" of
the network. Clearly, if the diameter were greater than (7*2+1)=15, there would
be hosts from which a query could be launched and not reach the entire network
before expiring. We saw such high-diameter networks in early July. Below, we
show data for a crawl of 1,959 hosts on July 7. Here, we plot on a semi-log
scale the number of pairs of hosts that had a given shortest-path distance (or
"separation") between them. The diameter of the network discovered by this
crawl was 22, indicating some regions were not in communication with others
(assuming messages used the common TTL value). We also note that most pairs of
hosts were separated by 7 hops.
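The separation histogram and diameter described above can be computed from a crawl graph with a breadth-first search from every host. The following is a minimal sketch of that calculation, not the Clip2 DSS crawler itself; the function names and the five-host toy network are hypothetical, and the graph is assumed to be an undirected adjacency mapping.

```python
from collections import Counter, deque


def separations_from(graph, source):
    """Breadth-first search: hop count from source to every reachable host."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        host = queue.popleft()
        for neighbor in graph[host]:
            if neighbor not in dist:
                dist[neighbor] = dist[host] + 1
                queue.append(neighbor)
    return dist


def separation_histogram(graph):
    """Count unordered, connected host pairs at each shortest-path separation."""
    histogram = Counter()
    for source in graph:
        for target, hops in separations_from(graph, source).items():
            if source < target:  # count each unordered pair once
                histogram[hops] += 1
    return histogram


def diameter(graph):
    """Longest shortest path among connected pairs (graph needs >= 1 edge)."""
    return max(separation_histogram(graph))


def reach_within(graph, source, max_hops):
    """Number of other hosts a message from source reaches within max_hops."""
    dist = separations_from(graph, source)
    return sum(1 for hops in dist.values() if 0 < hops <= max_hops)


# Hypothetical toy network: a chain of five hosts, h0 - h1 - h2 - h3 - h4.
names = [f"h{i}" for i in range(5)]
chain = {name: set() for name in names}
for a, b in zip(names, names[1:]):
    chain[a].add(b)
    chain[b].add(a)

hist = separation_histogram(chain)
print(sorted(hist.items()))           # pairs at each separation
print(diameter(chain))                # longest shortest path
print(reach_within(chain, "h0", 2))   # hosts an end host reaches in 2 hops
```

On a real crawl, plotting the histogram on a semi-log scale gives the kind of separation plot discussed above, and `reach_within` with `max_hops` set to the TTL shows how much of the graph a given host can query.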
In
the latter days of the pre-barrier period, the network diameter fell to smaller
values, typically 8 or 9. Below, we show network diameters for crawls made
between July 25 and August 18. Note that while the period of larger diameters
around July 28 corresponds to the
"Napster Flood", a
larger diameter network is not necessarily a direct result of more hosts
connecting to the network. A network with few hosts can have a large diameter
and vice versa; the diameter is purely a function of how the hosts are
connected, not the number of hosts. The smaller diameters are significant in
that they imply any host on the network could easily reach all other hosts
using TTL=7.
Further examining host connectivity, we found a host was most likely to have a
single connection, and hosts with higher numbers of connections were
increasingly uncommon. In more technical terms, we saw roughly power-law degree
distributions, where "power-law" means the number of hosts having a given
degree varied as a power of the degree, and "degree" is shorthand for the total
number of incoming and outgoing connections a given host has open. In addition
to host separations, the degree distribution is another means of quantitatively
assessing network connectivity, and we present below the degree distribution
based on a 1,813-host crawl made on July 7. One particularly interesting
coincidence is that other researchers (e.g.,
Broder et al. 1999)
have found power-law degree distributions for graphs in which the nodes are Web
pages and the connections are Web links. In the plot, we compare measured data
with a least-squares best fit of slope (power-law index) = -2.3.
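A power-law index of this kind is the slope of a straight line fitted to the degree distribution on log-log axes. The sketch below shows one way to perform such a least-squares fit; it is an illustration under our own assumptions rather than the fitting code used for the plot, and the function name and the toy degree counts are hypothetical.

```python
import math


def fit_power_law(degree_counts):
    """Least-squares line through (log10 degree, log10 host count).

    degree_counts maps a degree k to the number of hosts with that degree.
    Returns (slope, intercept); the slope is the power-law index.
    """
    points = [
        (math.log10(k), math.log10(n))
        for k, n in degree_counts.items()
        if k > 0 and n > 0  # logarithms need positive values
    ]
    if len(points) < 2:
        raise ValueError("need at least two (degree, count) points")
    mean_x = sum(x for x, _ in points) / len(points)
    mean_y = sum(y for _, y in points) / len(points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical toy distribution following count = 1024 / degree**2 exactly.
toy_counts = {1: 1024, 2: 256, 4: 64, 8: 16}
slope, intercept = fit_power_law(toy_counts)
print(round(slope, 3))
```

Measured distributions are noisy at high degree, where only a handful of hosts contribute to each point, so the fitted index depends somewhat on which degrees are included in the fit.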
As the above data show, the structure of the Gnutella network is
in a continuous state of flux. Hosts come and go as quickly as users open and
close Gnutella applications, and connections between hosts may only last for
seconds. We found that half of an initial host population persisted after five
hours, and that approximately 30% of the initial host population was stable on
the timescale of 24 hours. We determined this result by comparing host
populations found in a succession of crawls to an initial reference crawl and
by repeating this analysis for multiple reference crawls; the plot below
illustrates our findings.
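The comparison of later crawls against a reference crawl amounts to a set intersection. A minimal sketch follows, assuming each crawl is reduced to a timestamp in hours and a set of host addresses; the function names and the toy addresses (taken from a documentation-only address range) are hypothetical.

```python
def surviving_fraction(reference_hosts, later_hosts):
    """Fraction of the reference population still present in a later crawl."""
    reference = set(reference_hosts)
    if not reference:
        raise ValueError("reference crawl is empty")
    return len(reference & set(later_hosts)) / len(reference)


def persistence_curve(crawls):
    """crawls: list of (hours, hosts) sorted by time; the first is the reference.

    Returns a list of (elapsed_hours, surviving_fraction) pairs.
    """
    start, reference = crawls[0]
    return [
        (hours - start, surviving_fraction(reference, hosts))
        for hours, hosts in crawls[1:]
    ]


# Hypothetical toy crawls at 0, 5, and 24 hours.
crawls = [
    (0, {"192.0.2.1", "192.0.2.2", "192.0.2.3", "192.0.2.4"}),
    (5, {"192.0.2.1", "192.0.2.2", "192.0.2.9"}),
    (24, {"192.0.2.1", "192.0.2.10", "192.0.2.11"}),
]
curve = persistence_curve(crawls)
print(curve)
```

Repeating the calculation with each crawl in turn as the reference, and averaging the curves at equal elapsed times, reduces the dependence on any single starting population.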
Gnutella Hits the Wall As we first
reported in "Bandwidth Barriers
to Gnutella Network Scalability" (September 8), average Gnutella network
traffic began to regularly exceed the throughput capacity of dial-up modems in
August. Hosts connected to the Internet via dial-up modems ceased to be able to
effectively participate as peers on the Gnutella network. These hosts
essentially became dead-ends, resulting in a widespread fragmentation of the
Gnutella network into effectively disconnected components comprised of hosts
with higher-speed Internet connections. The animation below illustrates the
evolution of the network as traffic passed the dial-up barrier.
Data gathered by Clip2 DSS clearly illustrate the effective loss of dial-up
hosts from the responsive portion of the Gnutella network. In one long-term
experiment, we regularly visited hosts and issued probe-like Gnutella "ping"
messages of TTL=2 to discover their neighbors. As shown below, unique hosts
sending Gnutella "pong" messages in response to these small-TTL pings declined
substantially and permanently in mid-to-late August.
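For readers unfamiliar with the message format, the sketch below builds the header of a TTL=2 ping. It assumes the 23-byte descriptor header of protocol Version 0.4 as we understand it: a 16-byte descriptor ID, one byte each for payload type, TTL, and hops, and a 4-byte little-endian payload length, with payload type 0x00 for ping and 0x01 for pong. The function names are hypothetical, and the layout should be checked against the protocol specification before use.

```python
import os
import struct

PING = 0x00  # assumed v0.4 payload type for ping
PONG = 0x01  # assumed v0.4 payload type for pong
HEADER_LENGTH = 23


def build_header(payload_type, ttl, hops=0, payload_length=0, descriptor_id=None):
    """Pack a descriptor header: 16-byte ID, then type, TTL, hops, length."""
    if descriptor_id is None:
        descriptor_id = os.urandom(16)  # random ID identifying this message
    if len(descriptor_id) != 16:
        raise ValueError("descriptor ID must be 16 bytes")
    return descriptor_id + struct.pack("<BBBI", payload_type, ttl, hops, payload_length)


def parse_header(data):
    """Unpack a header into (descriptor_id, payload_type, ttl, hops, length)."""
    if len(data) < HEADER_LENGTH:
        raise ValueError("header too short")
    payload_type, ttl, hops, length = struct.unpack("<BBBI", data[16:HEADER_LENGTH])
    return data[:16], payload_type, ttl, hops, length


# A probe-like ping with TTL=2 and no payload.
probe = build_header(PING, ttl=2)
print(len(probe), parse_header(probe)[1:])
```

A ping sent with TTL=2 is answered by the visited host and relayed one hop further, so the pongs that return identify that host and its immediate neighbors.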
As noted earlier, most Gnutella servents display a host count
based on the number of pongs received since the application began running. As
the barrier was passed, numerous
postings
on public user forums noted servents were displaying lower-than-usual host
counts, often only in the tens or hundreds. Many users interpreted the decrease
in host counts as meaning the total number of hosts on the network had decreased.
However, as the above data show, the change in network responsiveness was
rather abrupt. Such an abrupt transition is much more plausibly explained
by the reaching of a technical barrier than by a mass change in user behavior.
Had users departed the network, we would have expected to see a decrease
in the usage rate of the Clip2 DSS host list service; instead, we saw an
unabated rise in usage. In addition, even though responses in the ping experiment
had dropped, this experiment continued to find ten to twenty times the number
of hosts that would be reported in a servent's host counter over a similar
time interval. In sum, we found no evidence supporting a user exodus and
multiple indications that the host population remained sizable but fragmented.
The total number of hosts online can be substantially larger than the number
reported in a servent's host counter, since a servent only sees out as far
as the boundary of the responsive region to which the servent is connected.
The preceding discussion raises the
question of the source of the traffic that caused the network to reach the
dial-up modem bandwidth barrier. Was it the result of continued growth in the
number of Gnutella users, or was it the result of the introduction of
programmatic sources of traffic, such as machine-generated spam? The question
is difficult to answer due to a lack of comprehensive data. Clip2 DSS
reported on one
anomalous form of traffic seen on the network in early September that may have
existed for some time prior. Since that report, we have regularly observed
other forms of apparently automated messages on the network, including repeated
series such as {a.mp3, b.mp3, c.mp3, ...} that appear to be network-indexing
attempts. While this is an unresolved question, it is moot in a sense, because
with continued growth in the user base, user-generated network traffic would
have eventually reached the barrier level of its own accord.
Gnutella Beyond the Barrier What is the state of
the post-barrier Gnutella network? Among other sources, we can find some clues
in data from the Clip2 DSS-operated "gnutellahosts.com" host list service,
which publishes IP addresses of live Gnutella hosts. Approximately 10% of users
access this list by visiting the Clip2 DSS home page; the remainder retrieve
addresses by connecting their servents to a special-purpose Gnutella server
operated by Clip2 DSS at gnutellahosts.com, port 6346. After responding to the
incoming Gnutella servent with multiple IP addresses, the gnutellahosts.com
server disconnects, and the servent proceeds to connect directly to the hosts
at the provided addresses. As noted in the previous section, we have not
observed any decrease in the traffic to gnutellahosts.com due to Gnutella
having hit the barrier. On the contrary, traffic has continued to grow in the
post-barrier period. Below, we show the long-term (3-month) straight-line trend
in gnutellahosts.com usage.
How many users are there on the post-barrier
network? How are they connected? From the number of callers to
gnutellahosts.com and our probing experiments, we found no evidence of a sudden
population collapse as the barrier was passed. Instead, we found evidence that
the population had fragmented into multiple dynamically changing responsive and
unresponsive segments. The sum of all data sources leads us to estimate that
the total number of daily users of Gnutella is between 10,000 and 30,000,
with the true figure likely closer to the lower bound than the upper bound.
Below, we show the numbers of hosts in the largest responsive network segments
we were able to identify over a range of post-barrier dates.
Since the fragmentation occurred, it has been a matter of chance whether or not
a user manages to find and remain connected to a responsive segment. Typically,
the gnutellahosts.com host list service has provided addresses of hosts in the
largest identifiable responsive segment, although this region is a moving
target. In order to track it, Clip2 DSS has refined its crawling strategy in
recent weeks and regularly crawls the network on the timescale of every 15
minutes.
On the post-barrier network, dial-up modem users cannot
effectively participate as peers throughout the network. What can be done to
alleviate this situation? One solution is to connect these users to high-speed
proxies that handle network traffic on their behalf. This is an underlying
concept of the Clip2
Reflector(TM), a special-purpose Gnutella server, and the network
architecture that results is illustrated below:
Reflectors are programmed to maintain connectivity to the most responsive
segment of the network by calling gnutellahosts.com (by default). Dial-up users
singly connect to Reflectors that in turn maintain multiple outgoing network
connections on their behalf. A list of running public-access Reflectors can be
found on the Clip2 DSS home page.
How different is user behavior in the post-barrier period
relative to the pre-barrier period? One measure of behavior is the fraction of
hosts serving a non-zero number of files. In a 24-hour period in early August,
during the late pre-barrier era, researchers at Xerox PARC found 30% of hosts
made available one or more files for download (Adar
& Huberman 2000). Using a different methodology, Clip2 DSS
independently measured the "serving fraction" before, during, and since the
period of the PARC study. Our results confirm theirs during the period of their
study. In the final days of the pre-barrier era, we observed an increase in the
serving fraction to a maximum in excess of 40%. However, as the network evolved
into the post-barrier period, we saw a substantial decrease in the serving
fraction, down to a low of less than 15%. Notably, since early October we have
observed a general rise in the serving fraction back to near pre-barrier
levels. Below, we plot the serving fraction over time.
What are users searching for in the post-barrier period? Clip2
DSS analyzed three query stream samples of varying sizes taken on three
different dates. Applying a subjective analysis to categorize 2,000 queries
heard on September 19, we found the following breakdown:
Notes on these categories: The "gibberish" category includes queries consisting
of non-alphanumeric characters; among other sources, we have seen such queries
generated by poorly programmed clients that do not properly read and forward
Gnutella query messages. The "automated indexing" category includes queries
that appeared to be programmatically generated with the intent of indexing
network content, such as "a.mp3", "b.mp3", etc. "File extension only" queries
are just that, containing extensions such as "mp3" or "mpg" (possibly with
accompanying periods, asterisks, or both) but no other content. "Song and
artist+song" counts queries either containing only a song title or a song title
along with an artist name. "Artist" counts queries containing just an artist
name.
By
objectively analyzing only those queries containing a file extension anywhere
in the query string, we are able to analyze much larger data sets. We found
from 20% to 40% of queries in three separate stream samples of 2,000, 30,000,
and 150,000 queries contained a popular file extension somewhere in the query
string. Below, among the set of queries that contained an extension, we show
the frequencies of specific types of extensions.
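An objective extension count of this kind is straightforward to automate. The sketch below is illustrative only: the list of "popular" extensions is our own assumption, not the list used for the samples above, and the pattern only matches an extension preceded by a period, so a bare "mp3" query is not counted.

```python
import re
from collections import Counter

# Hypothetical list of popular extensions; the report does not enumerate them.
POPULAR_EXTENSIONS = ("mp3", "mpg", "mpeg", "avi", "jpg", "gif", "exe", "zip")
EXTENSION_PATTERN = re.compile(
    r"\.(" + "|".join(POPULAR_EXTENSIONS) + r")\b", re.IGNORECASE
)


def extension_frequencies(queries):
    """Return (fraction of queries with an extension, per-extension counts).

    Each extension is counted at most once per query.
    """
    counts = Counter()
    with_extension = 0
    for query in queries:
        found = {ext.lower() for ext in EXTENSION_PATTERN.findall(query)}
        if found:
            with_extension += 1
            counts.update(found)
    fraction = with_extension / len(queries) if queries else 0.0
    return fraction, counts


# Hypothetical toy query stream.
sample = ["a.mp3", "b.mp3", "madonna", "movie.MPG", "holiday.jpg photo.jpg", "mp3"]
fraction, counts = extension_frequencies(sample)
print(round(fraction, 2), counts.most_common())
```

Dividing each entry of `counts` by the number of queries that contained an extension gives the relative frequencies among extension-bearing queries.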
Note the programmatically generated queries of the form "a.mp3", "b.mp3", etc.
were a recurring feature in every sample, and the samples were spread over a
21-day period. These queries likely originated from a single source and
amounted to a few percent of total network query traffic.
Global Gnutella We complete our survey of the
Gnutella network by examining the access points of Gnutella hosts. Where are
Gnutella hosts? To arrive at an answer, we utilized 3.3 million non-unique IP
addresses gathered continuously by Clip2 DSS between July 27 and November 3,
spanning both Gnutella eras. Of these addresses, 1.3 million (39%) were
resolvable to non-numeric hostnames, and our reports below are on this
resolvable subset. The populations we report below should be interpreted in
probabilistic terms; the data have not been de-duplicated, so that the relative
populations represent relative probabilities of host discovery within a given
domain.
We divided top-level domains into US-Centric (COM, NET, EDU,
MIL, GOV, US) and Non-US-Centric (all others) categories, although we note a
number of non-US-based organizations operate domains in the US-Centric set. Our
first finding is that Gnutella is a truly international phenomenon, with one
out of three hosts located on a non-US-centric domain.
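The division into the two categories depends only on the last label of each resolved hostname. A minimal sketch of the classification, using the US-Centric set defined above; the function names and the toy hostnames are hypothetical.

```python
from collections import Counter

# US-Centric top-level domains as defined in the text.
US_CENTRIC = {"com", "net", "edu", "mil", "gov", "us"}


def top_level_domain(hostname):
    """Last dot-separated label of a hostname, lowercased."""
    return hostname.rstrip(".").rsplit(".", 1)[-1].lower()


def classify_hosts(hostnames):
    """Count hostnames in the US-Centric and Non-US-Centric categories."""
    counts = Counter()
    for hostname in hostnames:
        if top_level_domain(hostname) in US_CENTRIC:
            counts["US-Centric"] += 1
        else:
            counts["Non-US-Centric"] += 1
    return counts


# Hypothetical toy hostnames, not drawn from the survey data.
hosts = ["pc1.example.com", "dsl7.example.de", "cable.example.NET", "lab.example.edu"]
categories = classify_hosts(hosts)
print(dict(categories))
```

Because the input addresses are not de-duplicated, counts produced this way measure how often a host was discovered in each category rather than the number of distinct machines.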
Among Non-US-Centric domains, 95% of hosts were found in just
17 country-specific top-level domains. European domains comprised 67% and the
Asia-Pacific domains comprised 20%.
Among US-Centric hosts, COM, NET, and EDU domains predictably
dominated, although non-zero numbers of hosts were found on each of ORG, US,
GOV, and MIL domains (in ratios 19:8:2:1, respectively).
In the case of the COM, NET, and EDU domains, we dug deeper to
examine populations at the level of second-level domain names. The popular
second-level COM domains were primarily broadband Internet service providers.
The ISP @Home accounted for half of all Gnutella hosts in the COM domain, and
Road Runner trailed in second place at nearly one quarter of hosts. Gnutella
hosts with resolvable host names in the COM domain were therefore strongly
concentrated in a small number of second-level domains, with 87% of hosts
residing in the top 25. Among the top 25 were six second-level domains either
representing non-ISP companies or organizations whose nature we could not
determine. In total, Gnutella hosts were found on 1,750 unique second-level
domains within the COM domain.
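Concentration figures such as the share held by the top 25 second-level domains can be derived by grouping hostnames on their last two labels. The sketch below illustrates the calculation under our own assumptions; the function names are hypothetical, and the toy hostnames use the reserved "example" top-level domain in place of COM.

```python
from collections import Counter


def second_level_domain(hostname):
    """Last two dot-separated labels, lowercased, or None if there are fewer."""
    labels = hostname.rstrip(".").lower().split(".")
    if len(labels) < 2:
        return None
    return ".".join(labels[-2:])


def top_share(hostnames, tld, top_n):
    """Return (top_n domains with counts, their combined share, unique domains)."""
    suffix = "." + tld.lower()
    counts = Counter(
        domain
        for domain in map(second_level_domain, hostnames)
        if domain is not None and domain.endswith(suffix)
    )
    total = sum(counts.values())
    top = counts.most_common(top_n)
    share = sum(n for _, n in top) / total if total else 0.0
    return top, share, len(counts)


# Hypothetical toy hostnames, not drawn from the survey data.
hosts = [
    "a.isp-one.example", "b.isp-one.example", "c.isp-one.example",
    "d.isp-one.example", "x.isp-two.example", "y.isp-two.example",
    "pc.campus.example", "ignored.other.test",
]
top, share, unique = top_share(hosts, "example", top_n=2)
print(top, round(share, 3), unique)
```

The same function applies unchanged to the NET and EDU domains by passing a different `tld`.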
The popular second-level NET domains were exclusively Internet
service providers, both broadband and dial-up. The distribution among NET
domains was less concentrated than among COM domains, with the leader, Road
Runner, claiming less than 10% of Gnutella hosts on the NET domain. Note the
second-place second-level NET domain was a German company, illustrating that
while the NET domain is US-centric, it is not US-exclusive. Over 2,000 unique
second-level domain names were represented in the Gnutella host population.
The distribution of hosts among EDU domains was even less
concentrated than among NET domains. The Massachusetts Institute of Technology
led the pack at 3.8% by a sizable margin. Virginia Tech made a strong showing
at slightly less than 3%, and the distribution declined smoothly from 3rd-place
University of Southern California through the remainder of the list. In all,
Gnutella hosts were discovered on over 550 second-level domains within the EDU
domain.
In
summary, major conclusions are that (1) the Gnutella network is an
international phenomenon led by the US, Germany, and Japan; (2) substantial
populations of hosts in COM and NET domains are on ISP second-level domains;
and (3) hosts are widely distributed among EDU domains.
Gnutella Tomorrow The Gnutella
network is analogous to a continuous global rave: an informal, decentralized,
unregulated gathering without a permanent location. Network hosts, like rave
attendees, come and go unpredictably, connecting and disconnecting as fast as
ravers switch dance partners. In the pre-barrier era, the Gnutella rave could
accommodate all comers. The effect of the network hitting the dial-up modem
barrier is analogous to a rave venue reaching capacity, with many would-be
revelers being crammed shoulder-to-shoulder and left unable to dance, and still
more spilling out the doors. Small regions within the crowd, corresponding to
responsive segments of the Gnutella network, remain sufficiently open to enable
movement. These pockets form and vanish, grow and shrink, and merge and split
on a variety of timescales. In the post-barrier era, the Gnutella rave remains
more popular than the typical raver, pressed in among the crowd, might realize.
The crowding results in the potential of the gathering being released in isolated
bursts rather than in a continuous widespread discharge. While Gnutella has not
scaled, we call attention to the fact that it has also not collapsed. Like many
simple decentralized systems, it has remained remarkably robust in the face of
technical adversity. In the present post-barrier period, Gnutella exists in an
intermediate state between scaling and collapsing.
In
the opinion of Clip2 DSS, this situation is likely to persist until either
(1) the user population collapses and traffic falls to pre-barrier levels or
(2) an "organizing principle" for connectivity that enables scaling takes root.
In the former case, the improved performance that would result could
potentially drive alienated users to return to the network, driving resurgence
in traffic, another barrier crossing, and repetition of the entire cycle. In
the latter case, examples of organizing connectivity principles include (1)
dial-up users regularly connecting to broadband
Reflectors and (2) widespread
adoption of servents with sophisticated and consistently implemented connection
management rules. However, to be widely and rapidly successful, any organizing
principle must require no user action or change in behavior that is not
immediately and powerfully rewarded, and it must not involve a change to the
protocol that breaks the considerable installed application base. In the
post-barrier era, there have been various initiatives to create new and smaller
networks - spin-off raves - enabled by user adjustment of the "Gnutella
handshake" mechanism in servents that support this feature. However, because
the underlying technology is no different, if traffic on these networks were to
grow sufficiently large, they would be subject to bandwidth barriers as well.
These attempts to re-create pre-barrier conditions do so at the cost of
sacrificing the relatively large user base on the main network and do not
directly address the problem. To move beyond its present state, Gnutella awaits
widely adopted technical innovation.