
5. Conclusions

The definition of three-dimensional ants (3D ants) given here results in ants that build three-dimensional highways. This definition shows that the rules of Langton's ant are elementary and in some cases universally applicable. An interesting result is that there is more than one style of highway. First, classification systems were presented. The research presented here covers only a small part of the infinite world of 3D ants. But the search space for ants grows exponentially with the rule length and big three-dimensional worlds are hard to simulate (with respect to memory, computing time, and costly visualization). Some additional questions come up. Are there 3D ants that build arbitrarily complex highways? Does every 3D ant build a highway? Is there an ant and a number of steps for every arbitrary shape? The consumption of which resource (time or space) grows faster with increasing rule length?

A simple software tool to simulate 3D ants made this work possible and will be made available by the author to those who are interested.

References

[1] C. G. Langton, "Studying Artificial Life with Cellular Automata," Physica D, 22 (1986) 120-149.
[2] L. A. Bunimovich and S. E. Troubetzkoy, "Rotators, Periodicity and Absence of Diffusion in Cyclic Cellular Automata," Journal of Statistical Physics, 74 (1994) 1-14.
[3] J. Propp, "Further Ant-ics," Mathematical Intelligencer, 16 (1994) 37-42.
[4] D. Gale, J. Propp, S. Sutherland, and S. Troubetzkoy, "Further Travels with My Ant," The Mathematical Intelligencer, 17(3) (1995) 48-56.
[5] O. Beuret and M. Tomassini, "Behavior of Multiple Generalized Langton's Ants," Proceedings of the Artificial Life V Conference, edited by C. Langton and K. Shimohara (MIT Press, 1998).
[6] L. A. Bunimovich, "Many Dimensional Lorentz Cellular Automata and Turing Machines," International Journal of Bifurcation and Chaos, 6 (1996) 1127-1135.

Complex Systems, 14 (2003) 263-268

Read Before You Cite!

M. V. Simkin
V. P. Roychowdhury
Department of Electrical Engineering, University of California, Los Angeles, CA 90095-1594

Complex Systems, 14 (2003) 269-274; © 2003 Complex Systems Publications, Inc.

We report a method for estimating what percentage of people who cited a paper had actually read it. The method is based on a stochastic modeling of the citation process that explains empirical studies of misprint distributions in citations (which we show follow a Zipf law). Our estimate is that only about 20% of citers read the original.

Many psychological tests have the so-called "lie-scale." A small but sufficient number of questions that admit only one true answer, such as: "Do you always reply to letters immediately after reading them?" are inserted among others that are central to the particular test. A wrong reply for such a question adds a point on the lie-scale, and when the lie-score is high, the overall test results are discarded as unreliable. Perhaps, for a scientist the best candidate for such a lie-scale is the question: "Do you read all of the papers that you cite?"

Comparative studies of the popularity of scientific papers have been a subject of much recent interest [1-8], but the scope has been limited to citation distribution analysis. We have discovered a method of estimating what percentage of people who cited the paper had actually read it. Remarkably, this can be achieved without any testing of the scientists, but solely on the basis of the information available in the ISI citation database.

Freud [9] had discovered that the application of his technique of psychoanalysis to slips in speech and writing could reveal a lot of hidden information about human psychology. Similarly, we find that the application of statistical analysis to misprints in scientific citations can give an insight into the process of scientific writing. As in the Freudian case, the truth revealed is embarrassing. For example, an interesting statistic revealed in our study is that a lot of misprints are identical. Consider, for example, a four-digit page number with one digit misprinted. There can be $10^4$ such misprints. The probability of repeating someone else's misprint accidentally is $10^{-4}$. There should be almost no repeat misprints by coincidence. One concludes that repeat misprints are due to copying someone else's reference, without reading the paper in question.

In principle, one can argue that an author might copy a citation from an unreliable reference list, but still read the paper. A modest
reflection would convince one that this is relatively rare, and cannot apply to the majority. Surely, in the pre-internet era it took almost equal effort to copy a reference as to type in one's own based on the original, thus providing little incentive to copy if someone has indeed read, or at the very least has procured access to the original. Moreover, if someone accesses the original by tracing it from the reference list of a paper with a misprint, then with a high likelihood, the misprint has been identified and will not be propagated. In the past decade, with the advent of the Internet, it has become equally convenient for would-be nonreaders to copy from unreliable sources and for would-be readers to access the original. But there is no increased incentive for those who read the original to also make verbatim copies, especially from unreliable resources.¹ In the rest of this paper, giving the benefit of doubt to potential nonreaders, we adopt a much more generous view of a "reader" of a cited paper as someone who at the very least consulted a trusted source (e.g., the original paper or heavily-used and authenticated databases) in putting together the citation list.

¹ According to many researchers the Internet may end up even aggravating the copying problem: more users are copying second-hand material without verifying or referring to the original sources.

As misprints in citations are not too frequent, only celebrated papers provide enough statistics to work with. Figure 1 shows a distribution of misprints in citations to one such paper [10] in the rank-frequency representation, introduced by Zipf [11]. The most popular misprint in a page number propagated 78 times. Figure 2 shows the same data, but in a number-frequency format.

Figure 1. Rank-frequency distribution of misprints referencing a paper, which had acquired 4300 citations. There are 196 misprints total, out of which 45 are distinct. The most popular misprint propagated 78 times. A good fit to Zipf's law is evident. [Plot not reproduced; axis label: rank.]

Figure 2. Same data as in Figure 1, but in the number-frequency representation. Misprints follow a power-law distribution with exponent close to 2. [Plot not reproduced; axis label: repeat number.]

As a preliminary attempt, one can estimate an upper bound on the ratio of the number of readers to the number of citers R as the ratio of the number of distinct misprints D to the total number of misprints T. Clearly, among T citers, T - D copied, because they repeated someone else's misprint. For the D others, with the information at hand we have no evidence that they did not read, so according to the presumed innocent principle, we assume that they did. Then in our sample, we have D readers and T citers, which leads to:

$$R \approx \frac{D}{T}. \tag{1}$$

Substituting D = 45 and T = 196 in equation (1) we obtain that R ≈ 0.23. This estimate would be correct if the people who introduced original misprints had always read the original paper. However, given the low value of the upper bound on R, it is obvious that many original misprints were introduced while copying references. Therefore a more careful analysis is necessary. We need a model to accomplish it.

Our model for misprint propagation, which was stimulated by Simon's explanation of the Zipf law [12] and the idea of link redirection by Krapivsky and Redner [4], is as follows. Each new citer finds the reference to the original in any of the papers that already cite it. With probability R he reads the original. With probability 1 - R he copies the citation from the paper he found it in. In any case, with probability M he introduces a new misprint.
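As a quick arithmetic check of equation (1), the short Python sketch below recomputes the upper bound from the counts quoted in the caption of Figure 1. It also estimates how many identical misprint pairs pure chance would produce, under the added assumption (made here for illustration only) that the 196 misprints are independent and spread uniformly over the $10^4$ possibilities mentioned earlier. The function names are hypothetical.

```python
from math import comb


def upper_bound_on_readers(distinct: int, total: int) -> float:
    """Equation (1): the reader fraction R is bounded above by D / T."""
    return distinct / total


def expected_coincident_pairs(total: int, n_possible: int) -> float:
    """Expected number of identical pairs if misprints were independent
    and uniform over n_possible values (an illustrative assumption)."""
    return comb(total, 2) / n_possible


D, T = 45, 196  # distinct and total misprints from the Figure 1 caption
bound = upper_bound_on_readers(D, T)
pairs = expected_coincident_pairs(T, 10**4)

print(f"upper bound on R: {bound:.2f}")
print(f"citers who repeated an existing misprint: {T - D}")
print(f"expected chance pairs: {pairs:.2f}")
```

Under these assumptions chance accounts for only about two coinciding pairs, far fewer than the 151 repeats in the data.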

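The copying model described above lends itself to a direct Monte Carlo check. The following Python sketch is one possible reading of that model, not the authors' code: each new citer introduces a fresh misprint with probability M; otherwise he reads the original with probability R and cites it correctly, or copies the citation (misprint included) from a randomly chosen earlier citer. The function name `simulate_citations`, the seed, the assumption that the very first citation is correct, and the choice M = 45/4300 (distinct misprints per citation in the Figure 1 data) are all illustrative.

```python
import random
from collections import Counter


def simulate_citations(n_citations, read_prob, misprint_prob, seed=0):
    """Return (T, D): total and distinct misprints after n_citations citations."""
    rng = random.Random(seed)
    # None marks a correct citation; an integer identifies one particular misprint.
    citations = [None]  # assumption: the very first citation is correct
    next_misprint_id = 0
    for _ in range(n_citations - 1):
        if rng.random() < misprint_prob:
            # A brand-new misprint enters the literature.
            citations.append(next_misprint_id)
            next_misprint_id += 1
        elif rng.random() < read_prob:
            # The citer consulted the original and cites it correctly.
            citations.append(None)
        else:
            # The citer copies an earlier citation, propagating any misprint in it.
            citations.append(rng.choice(citations))
    counts = Counter(c for c in citations if c is not None)
    return sum(counts.values()), len(counts)


if __name__ == "__main__":
    total, distinct = simulate_citations(
        4300, read_prob=0.2, misprint_prob=45 / 4300
    )
    print(f"T = {total}, D = {distinct}")
    if total:
        print(f"D/T = {distinct / total:.2f}")
```

A single run is noisy, so averaging T and D over many seeds gives a fairer comparison with the observed values of 196 and 45.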

The evolution of the misprint distribution (here $N_K$ denotes the number of misprints that propagated K times, and N is the total number of citations) is described by the following rate equations:

$$\frac{dN_1}{dN} = M - (1 - R) \times (1 - M) \times \frac{N_1}{N},$$

$$\frac{dN_K}{dN} = (1 - R) \times (1 - M) \times \frac{(K - 1) \times N_{K-1} - K \times N_K}{N} \quad (K > 1). \tag{2}$$

These equations can be easily solved using methods developed in [4] to get:

$$\gamma = 1 + \frac{1}{(1 - R) \times (1 - M)}. \tag{3}$$

As the exponent of the number-frequency distribution $\gamma$ is related to the exponent of the rank-frequency distribution $\alpha$ by a relation $\gamma = 1 + (1/\alpha)$, equation (3) implies that:

$$\alpha = (1 - R) \times (1 - M). \tag{4}$$

The rate equation for the total number of misprints is:

$$\frac{dT}{dN} = M + (1 - R) \times (1 - M) \times \frac{T}{N}. \tag{5}$$

The stationary solution of equation (5) is:

$$T = N \times \frac{M}{R + M - M \times R}. \tag{6}$$

The expectation value for the number of distinct misprints is obviously

$$D = N \times M. \tag{7}$$

From equations (6) and (7) we obtain:

$$R = \frac{D}{T} \times \frac{N - T}{N - D}. \tag{8}$$

Substituting D = 45, T = 196, and N = 4300 in equation (8), we obtain R ≈ 0.22, which is very close to the initial estimate obtained using equation (1). This low value of R is consistent with the "Principle of Least Effort" [11].

One can ask: Why did we not choose to extract R using equations (3) or (4)? This is because $\alpha$ and $\gamma$ are not very sensitive to R when it is small. In contrast, T scales as 1/R.

We can slightly modify our model and assume that original misprints are only introduced when the reference is derived from the original paper, while those who copy references do not introduce new misprints (e.g., they cut-and-paste). In this case one can show that T = N × M and D = N × M × R. As a consequence, equation (1) becomes exact (in terms of expectation values, of course).

The preceding analysis assumes that the stationary state had been reached. Is this reasonable? Equation (5) can be rewritten as:

$$\frac{d\left(\frac{T}{N}\right)}{d \ln N} = M - \left(\frac{T}{N}\right) \times (R + M - M \times R). \tag{9}$$

As long as M is small it is natural to assume that the first citation was correct. Then the initial condition is N = 1; T = 0. Equation (9) can be solved to get:

$$T = N \times \frac{M}{R + M - M \times R} \times \left(1 - \frac{1}{N^{R + M - M \times R}}\right). \tag{10}$$

This should be solved numerically for R. For our guinea pig equation (10) gives R = 0.17.

Just as a cautionary note, equation (10) can be rewritten as:

$$\frac{T}{D} = \frac{1}{x} \times \left(1 - \frac{1}{N^{x}}\right); \qquad x = R + M - M \times R. \tag{11}$$

The definition of the natural logarithm is:

$$\ln a = \lim_{x \to 0} \frac{a^{x} - 1}{x}.$$

Comparing this with equation (11) we see that when R is small (M is obviously always small):

$$\frac{T}{D} \approx \ln N. \tag{12}$$

This means that a naive analysis using equations (1) or (8) can lead to an erroneous belief that more cited papers are less read.

One can augment our results with a closer scrutiny of the data. In order to make sure that misprints have not been introduced by the ISI as it sometimes happens [13], we explicitly verified a dozen misprinted citations in the original articles. All of them were exactly as in the ISI database. There are also occasional repeat identical misprints in papers, which share individuals in their author lists. Such events constitute a minority of repeat misprints. It is not obvious what to do with such cases when the author lists are not identical: Should the set of citations be counted as a single occurrence (under the premise that the common co-author is the only source of the misprint); or as multiple repetitions? However, even if we count all such repetitions as only a single misprint occurrence, then the number of citation-copiers (i.e., T - D) shall drop from 151 to 112, bringing the upper bound for R (equation (1)) from 23% up to 29%. However a more detailed analysis via our model

[14] will bring down the estimate closer to 20%, keeping the original conclusions unaltered.

We conclude that misprints in scientific citations should not be discarded as a mere happenstance, but, similar to Freudian slips, analyzed.

Acknowledgments

We are grateful to J. M. Kosterlitz, A. V. Melechko, N. Sarshar, H. Muir, and many others for correspondence.

References

[1] Z. K. Silagadze, "Citations and the Zipf-Mandelbrot Law," Complex Systems, 11 (1997) 487-499; http://arxiv.org/abs/physics/9901035.
[2] S. Redner, European Physics Journal B, 4 (1998) 131-134; http://arxiv.org/abs/cond-mat/9804163.
[3] C. Tsallis and M. P. de Albuquerque, European Physics Journal B, 13 (2000) 777-780; http://arxiv.org/abs/cond-mat/9903433.
[4] P. L. Krapivsky and S. Redner, Physical Review E, 63 (2001) Art. No. 066123; http://arxiv.org/abs/cond-mat/0011094.
[5] H. Jeong, Z. Neda, and A.-L. Barabasi, http://arxiv.org/abs/cond-mat/0104131.
[6] A. Vazquez, http://arxiv.org/abs/cond-mat/0105031.
[7] H. M. Gupta, J. R. Campanha, and B. A. Ferrari, http://arxiv.org/abs/cond-mat/0112049.
[8] S. Lehmann, B. Lautrup, and A. D. Jackson, http://arxiv.org/abs/physics/0211010.
[9] S. Freud, Zur Psychopathologie des Alltagslebens (Internationaler psychoanalytischer Verlag, Leipzig, 1920).
[10] Our guinea pig is the Kosterlitz-Thouless paper (J. M. Kosterlitz and D. J. Thouless, Journal of Physics C, 6 (1973) 1181-1203). The misprint distributions for a dozen other studied papers look very similar.
[11] G. K. Zipf, Human Behavior and the Principle of Least Effort: An Introduction to Human Ecology (Addison-Wesley, Cambridge, MA, 1949).
[12] H. A. Simon, Models of Man (Wiley, New York, 1957).
[13] A. Smith, New Library World, 84 (1983) 198.
[14] M. V. Simkin and V. P. Roychowdhury, to be published.

Creating Large Life Forms with Interactive Life

William H. Paulsen
Department of Mathematics and Statistics, Arkansas State University, State University, AR 72467

Complex Systems, 14 (2003) 275-283; © 2003 Complex Systems Publications, Inc.

This paper demonstrates how very complicated Life forms can easily be created using the interactive Life program introduced by James Gilbert [1]. By having control over just a single cell, called the intelligent cell, a glider gun can be created in under 250 generations. Note that the standard rules of Life apply to the intelligent cell as well as the other cells.

Ever since John Conway introduced the game of Life in 1970 [2], amateur and professional mathematicians have been obsessed by finding new complicated structures, and finding new applications for the seemingly chaotic behavior of Life forms. Perhaps its addictive nature stems from the simplistic rules: each cell on a square lattice is either on (alive) or off (dead). The Moore neighborhood of a cell is the eight surrounding cells (counting diagonals) [3]. If a dead cell has exactly three neighboring cells which are alive, then the cell becomes alive in the next generation. On the other hand, a living cell must have either two or three living neighbors to remain alive, otherwise the cell will die in the next generation.

Figure 1 shows four generations of the Life form known as the "glider." Because this glider moves one space diagonally every four generations, we say that it moves at 1/4 the speed of light. (In the Life model, light is said to travel at 1 square per generation.)

Figure 1. Four generations of a glider.

The first open problem that Conway asked was whether any Life form could be proven to have unbounded growth. This was solved by Bill Gosper's discovery of the glider gun, a formation that spews a glider every 30 generations [4]. Other glider guns have since been discovered, but none have been as efficient as the original, now referred to as the "p30 glider gun."

Using glider guns, one in fact can engineer even more complex Life forms. Any computer circuit, and even a Turing machine, can be con-
