Rebuttal of Sproat, Farmer, et al.’s supposed “refutation”
[Updated: September, 2010]
Our response to
Sproat, Farmer, and other
In 2004, Steve Farmer, Richard Sproat, and Michael Witzel
published a paper in the Electronic Journal of Vedic Studies (entitled
"The Collapse of the Indus-Script Thesis: The Myth of a Literate Harappan
Civilization") claiming that the Indus Valley civilization was illiterate
and that
The publication of our paper in Science elicited
hostile reactions from them, ranging from off-the-cuff dismissive remarks such
as “garbage in, garbage out” (Witzel) to ad-hominem attacks (labeling us
“Dravidian nationalists”) and a vicious campaign on internet discussion groups
and blogs to discredit our work. Their first knee-jerk reaction was to call the
two artificial control datasets in our study “invented data sets” (Farmer).
Sproat and others then claimed on a blog to have constructed “counterexamples”
to our result. Sproat has also created a web page entitled "Why Rao et
al.'s work proves nothing," a title that reflects a misunderstanding
of our work: we do not claim to have “proved” any statement regarding the Indus
script. Our work presents evidence that supports the linguistic
hypothesis (in an inductive framework) but does not prove it. In fact, we
believe such a “proof” would require a decipherment of the script (more
detailed explanation here). Sproat repeats
some of his claims in an opinion piece in the journal Computational
Linguistics where, after dismissing the work of yet another group, he ends
up calling the editors of "the general science journals" and their
reviewers incompetent. Computational Linguistics has kindly agreed to
publish our response, a preprint of which can be accessed here.
Here, we respond to the arguments of Sproat, Farmer, and
other
(1) Two datasets, used as controls in our work, are artificial.
(2) Counterexamples can be given, of non-linguistic systems, which produce conditional entropy plots like those presented in our Science paper.
(3) Conditional entropy cannot even differentiate between language families.
(4) The absence of writing material and long texts is
“proof” that the
We view arguments (1)-(3) as arising from a misunderstanding of our approach and
an overinterpretation of the conditional entropy result. Some of these
arguments take a myopic view, without considering other
properties of the Indus script and the context of its use in the
Here is the point-by-point rebuttal:
(1) As stated in our Science paper, the two artificial data sets (which Farmer et al. call “invented data sets”) simply represent controls, necessary in any scientific investigation, to delineate the limits of what is possible. The two controls in our work represent sequences with maximum and minimum flexibility for a given number of tokens. Although these limits can be computed analytically, we generated the data sets so that they would be subjected to the same parameter estimation process as the other data sets. Our conclusions do not depend on the controls but are based on comparisons with real-world data: DNA and protein sequences, various natural languages, and FORTRAN computer code. All our real-world examples are bounded by the maximum and the minimum provided by the controls, which thus serve as a check on the computation.
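As a rough illustration of why such controls bracket everything else, the following Python sketch builds one maximally flexible and one minimally flexible sequence and estimates the conditional entropy of each. It is a sketch under stated assumptions only: the function names, the toy alphabet size, the two constructions, and the naive count-based estimator are hypothetical choices made for this illustration and are not the estimation procedure used in the Science paper.

```python
import math
import random
from collections import Counter


def conditional_entropy(seq):
    """Naive plug-in estimate of H(next | current) in bits from bigram counts."""
    pair_counts = Counter(zip(seq, seq[1:]))
    first_counts = Counter(seq[:-1])
    total_pairs = len(seq) - 1
    h = 0.0
    for (current, _nxt), count in pair_counts.items():
        p_pair = count / total_pairs
        p_next_given_current = count / first_counts[current]
        h -= p_pair * math.log2(p_next_given_current)
    return h


def max_flexibility_control(num_tokens, length, rng):
    """Assumed construction: every token is drawn independently and uniformly."""
    return [rng.randrange(num_tokens) for _ in range(length)]


def min_flexibility_control(num_tokens, length):
    """Assumed construction: each token is always followed by the same token."""
    return [i % num_tokens for i in range(length)]


rng = random.Random(0)  # fixed seed so the sketch is repeatable
NUM_TOKENS = 8          # toy alphabet size, chosen only for illustration
LENGTH = 20000

h_max = conditional_entropy(max_flexibility_control(NUM_TOKENS, LENGTH, rng))
h_min = conditional_entropy(min_flexibility_control(NUM_TOKENS, LENGTH))
print(f"analytic upper limit log2(N): {math.log2(NUM_TOKENS):.3f} bits")
print(f"max-flexibility control:      {h_max:.3f} bits")
print(f"min-flexibility control:      {h_min:.3f} bits")
```

Any other sequence over the same number of tokens, passed through the same estimator, should land between these two values; that is the sense in which the controls act as a check on the computation.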
(2) Counterexamples matter only if we claim that conditional
entropy by itself is a sufficient criterion to distinguish between language and
non-language. We do not make this claim in our Science paper. As
clearly stated in the last sentence of the paper, our results provide evidence
which, given the rich syntactic structure in the script (and other evidence as
listed below), increases the probability that the script represents
language.
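One way to make "increases the probability" concrete is Bayes' rule in odds form. The notation is an assumption introduced here for illustration and does not appear in the Science paper: L denotes the hypothesis that the script represents language, and E denotes an observed property such as language-like conditional entropy. No numerical values for these probabilities are implied.

```latex
\[
\frac{P(L \mid E)}{P(\neg L \mid E)}
  = \frac{P(E \mid L)}{P(E \mid \neg L)} \cdot \frac{P(L)}{P(\neg L)}
\]
```

Under this reading, an observation supports the linguistic hypothesis whenever its likelihood ratio exceeds 1; it does not have to be a sufficient criterion on its own. The existence of a nonlinguistic system that also shows E establishes only that P(E | ¬L) is greater than zero, not that the ratio is 1 or less.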
The methodology, which is Bayesian in nature, can be
summarized as follows. We begin with the fact that the
· The Indus texts are linearly written, like the vast majority of linguistic scripts (and unlike nonlinguistic systems such as medieval heraldry or traffic signs);
·
· The script obeys the Zipf-Mandelbrot law, a power-law distribution on ranked data, which is often considered a necessary (though not sufficient) condition for language (see our PLoS One paper);
· The script exhibits rich syntactic structure, such as the clear presence of beginners and enders, preferences of symbol clusters for particular positions within texts, etc. (see References), not unlike linguistic sequences;
· Indus texts that have been discovered in Mesopotamia and the Persian Gulf use the same signs as texts found in the
Given that the Indus script shares the above properties with
linguistic scripts, we claim that the similarity in conditional entropy of the
We have recently extended the result in our Science
paper to block entropies for sequences of up to 6 symbols (see IEEE Computer paper and this link for details):
The language-like scaling behavior of block entropies in the
above figure, in combination with the other properties of language enumerated above,
could be viewed in a Bayesian framework as further evidence for the linguistic
nature of the
The above figure also addresses objections raised by some
(e.g., Fernando Pereira) who felt conditional entropy (which considers only
pairwise dependencies) was not a sufficiently rich measure.
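For readers unfamiliar with the quantity, the block entropy H_n is the Shannon entropy of the distribution over blocks of n consecutive symbols, and the pairwise conditional entropy corresponds to H_2 - H_1. The Python sketch below estimates H_n for n = 1 to 6 on two toy sequences. The sequences, the alphabet size, and the naive count-based estimator are assumptions made for this illustration; such plug-in estimates are biased downward when the number of possible blocks is large relative to the data, so this is not a reproduction of the published analysis.

```python
import math
import random
from collections import Counter


def block_entropy(seq, n):
    """Naive plug-in Shannon entropy (bits) of the overlapping length-n blocks."""
    blocks = Counter(tuple(seq[i:i + n]) for i in range(len(seq) - n + 1))
    total = sum(blocks.values())
    return -sum((c / total) * math.log2(c / total) for c in blocks.values())


rng = random.Random(1)  # fixed seed so the sketch is repeatable
NUM_TOKENS = 4          # toy alphabet size, chosen only for illustration
LENGTH = 50000

# Toy stand-ins: an unordered sequence and a rigidly ordered one.
unordered = [rng.randrange(NUM_TOKENS) for _ in range(LENGTH)]
rigid = [i % NUM_TOKENS for i in range(LENGTH)]

print("n  unordered  rigid")
for n in range(1, 7):
    h_unordered = block_entropy(unordered, n)
    h_rigid = block_entropy(rigid, n)
    print(f"{n}  {h_unordered:9.3f}  {h_rigid:5.3f}")
```

In this toy setting the unordered sequence grows by roughly log2(4) = 2 bits per added symbol, while the rigid sequence stays flat; a sequence with intermediate correlations would be expected to scale somewhere between the two.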
Let us now consider the nonlinguistic systems that have been
suggested:
· Mark Liberman, Sproat, and Cosmo Shalizi constructed, in a blog, artificial examples of nonlinguistic systems whose conditional entropy is similar to that of the Indus script. Their examples, however, have no correlations between symbols: they do not exhibit the entropy scaling property shown by the Indus script and languages in the above figure, let alone the other language-like properties of the Indus script.
· Two natural nonlinguistic systems that have been suggested, medieval heraldry and traffic signs, are not even linear, nor do they exhibit other script-like properties such as those listed above.
· The Vinča markings on pottery are linear, but scholars have established that the symbols do not appear to follow any order; the system can thus be expected to fall in the maximum entropy (MaxEnt) range in the above figure.
· The carvings of deities on Mesopotamian boundary stones are also linear, but the ordering of symbols appears to be more rigid than in natural languages, following, for example, the hierarchical ordering of the deities. This system can thus be expected to fall closer to the minimum entropy (MinEnt) range in the above entropy scaling figure than to natural languages.
We therefore believe that the new result above from our IEEE Computer paper, showing that the block
entropies of the
(3) Sproat has endeavored to produce a plot where languages
belonging to different language families have similar conditional entropies,
thereby claiming that the conditional entropy result “proves nothing.” This
claim is once again based on an overinterpretation of the result in our Science
paper. We specifically note on page 10 in the supplementary information that
“answering the question of linguistic affinity of the Indus texts requires a
more sophisticated approach, such as statistically inferring an underlying
grammar for the
(4) With regard to the length of texts, several West Asian
writing systems such as Proto-Cuneiform, Proto-Sumerian, and the Uruk script
have statistical regularities in sign frequencies and text lengths which are
remarkably similar to the
As regards the argument for literacy from the point of view of cultural sophistication of the Indus people, we believe Iravatham Mahadevan has addressed this adequately in his op-ed piece below (see also Massimo Vidale's entertaining article).
References
· Final version of the Science paper (including Supplementary Information), 2009:
o http://www.cs.washington.edu/homes/rao/ScienceIndus.pdf
· IEEE Computer review article with new block entropy result: Probabilistic analysis of an ancient undeciphered script, 2010:
o http://www.cs.washington.edu/homes/rao/ieeeIndus.pdf
· PLoS One paper: Statistical Analysis of the
o http://dx.plos.org/10.1371/journal.pone.0009506
· PNAS paper: A Markov model of the
o http://www.cs.washington.edu/homes/rao/PNASIndus.pdf
· Asko Parpola’s point-by-point rebuttal of Farmer, Sproat, and Witzel:
o Parpola A (2008) Is the
http://www.harappa.com/script/indus-writing.pdf
· Massimo Vidale’s "The collapse melts down: a reply to Farmer, Sproat and Witzel":
o http://www.docstoc.com/docs/document-preview.aspx?doc_id=9163376
· Iravatham Mahadevan’s "The
o http://www.hindu.com/mag/2009/05/03/stories/2009050350010100.htm
· Syntactic structure in the
o Koskenniemi K (1981) Syntactic methods in the study of the
o Parpola A (1994) Deciphering the
o Yadav N,
http://www.harappa.com/script/tata-writing/indus-script-paper.pdf
o Yadav N,
http://www.harappa.com/script/tata-writing/indus-texts.pdf