```
Statistical Science
1994, Vol. 9, No. 3, 429-438

Equidistant Letter Sequences in the
Book of Genesis

Doron Witztum, Eliyahu Rips and Yoav Rosenberg

Abstract. It has been noted that when the Book of Genesis is written
as two-dimensional arrays, equidistant letter sequences spelling
words with related meanings often appear in close proximity.
Quantitative tools for measuring this phenomenon are developed.
Randomization analysis shows that the effect is significant at the
level of 0.00002.

Key words and phrases: Genesis, equidistant letter sequences,
cylindrical representations, statistical analysis.

1. INTRODUCTION
The phenomenon discussed in this paper was first discovered several
decades ago by Rabbi Weissmandel [7].  He found some interesting patterns in
the Hebrew Pentateuch (the Five Books of Moses), consisting of words or
phrases expressed in the form of equidistant letter sequences (ELS's)--that
is, by selecting sequences of equally spaced letters in the text.

As impressive as these seemed, there was no rigorous way of
determining if these occurrences were not merely due to the enormous quantity
of combinations of words and expressions that can be constructed by searching
out arithmetic progressions in the text.  The purpose of the research reported
here is to study the phenomenon systematically.  The goal is to clarify
whether the phenomenon in question is a real one, that is, whether it can or
cannot be explained purely on the basis of fortuitous combinations.

The approach we have taken in this research can be illustrated by the
following example.  Suppose we have a text written in a foreign language that
we do not understand.  We are asked whether the text is meaningful (in that
foreign language) or meaningless.  Of course, it is very difficult to decide
between these possibilities, since we do not understand the language.  Suppose
now that we are equipped with a very partial dictionary, which enables us to
recognise a small portion of the words in the text: "hammer" here and "chair"
there, and maybe even "umbrella" elsewhere.  Can we now decide between the two
possibilities?

Not yet.  But suppose now that, aided by the partial dictionary, we
can recognise in the text a pair of conceptually related words, like "hammer"
and "anvil."  We check if there is a tendency of their appearances in the text
to be in "close proximity."  If the text is meaningless, we do not expect to
see such a tendency, since there is no reason for it to occur.  Next, we widen
our check; we may identify some other pairs of conceptually related words:
like "chair" and "table," or "rain" and "umbrella."  Thus we have a sample of
such pairs, and we check the tendency of each pair to appear in close
proximity in the text.  If the text is meaningless, there is no reason to
expect such a tendency.  However, a strong tendency of such pairs to appear
in close proximity indicates that the text might be meaningful.

Note that even in an absolutely meaningful text we do not expect that,
deterministically, every such pair will show such a tendency.  Note also that
we have not yet decoded the foreign language of the text: we do not recognise
its syntax and we cannot read the text.

This is our approach in the research described in the paper.  To test
whether the ELS's in a given text may contain "hidden information," we write
the text in the form of two-dimensional arrays, and define the distance
between ELS's according to the ordinary two-dimensional Euclidean metric.
Then we check whether ELS's representing conceptually related words tend to
appear in "close proximity."

Suppose we are given a text, such as Genesis (G).  Define an
equidistant letter sequence (ELS) as a sequence of letters in the text whose
positions, not counting spaces, form an arithmetic progression; that is, the
letters are found at the positions
n, n+d, n+2d, ... , n+(k-1)d.
We call d the skip, n the start and k the length of the ELS.  These three
parameters uniquely identify the ELS, which is denoted (n,d,k).

Let us write the text as a two-dimensional array--that is, on a single
large page--with rows of equal length, except perhaps for the last row.
Usually, then, an ELS appears as a set of points on a straight line.  The
exceptional cases are those where the ELS "crosses" one of the vertical edges
of the array and reappears on the opposite edge.  To include these cases in
our framework, we may think of the two vertical edges of the array as pasted
together, with the end of the first line pasted to the beginning of the
second, the end of the second to the beginning of the third and so on.  We
thus get
a cylinder on which the text spirals down in one long line.

It has been noted that when Genesis is written in this way, ELS's
spelling out words with related meanings often appear in close proximity.  In
Figure 1 we see the example of 'patish' (hammer) and 'sadan' (anvil); in
Figure 2, 'Zidkiyahu' (Zedekia) and 'Matanya' (Matanya), which was the
original name of King Zedekia (Kings II, 24:17).  In Figure 3 we see yet
another example of 'hachanuka' (the Chanuka) and 'chashmonaee' (Hasmonean),
recalling that the Hasmoneans were the priestly family that led the revolt
against the Syrians whose successful conclusion the Chanuka feast celebrates.

Indeed, ELS's for short words, like those for 'patish' (hammer) and
'sadan' (anvil), may be expected on general probability grounds to appear
close to each other quite often, in any text.  In Genesis, though, the
phenomenon persists when one confines attention to the more "noteworthy"
ELS's, that is, those in which the skip |d| is _minimal_ over the whole text
or over large parts of it.  Thus for 'patish' (hammer), there is no ELS with
a smaller skip than that of Figure 1 in all of Genesis; for 'sadan' (anvil),
there is none in a section of text comprising 71% of G; the other four words
are minimal over the whole text of G.  On the face of it, it is not clear
whether or not this can be attributed to chance.  Here we develop a method
for testing the significance of the phenomenon according to accepted
statistical principles.  After making certain choices of words to compare and
ways to measure proximity, we perform a randomization test and obtain a very
small p-value, that is, we find the results highly statistically significant.

2. OUTLINE OF THE PROCEDURE

In this section we describe the test in outline.  In the Appendix,
sufficient details are provided to enable the reader to repeat the
computations precisely, and so to verify their correctness.  The authors will
provide, upon request, at cost, diskettes containing the program used and the
texts G, I, R, T, U, V and W (see Section 3).

We test the significance of the phenomenon on samples of pairs of
related words (such as hammer-anvil and Zedekia-Matanya).  To do this we must
do the following:

(i) define the notion of "distance" between any two words, so as to lend
meaning to the idea of words in "close proximity";

(ii) define statistics that express how close, "on the whole," the words
making up the sample pairs are to each other (some kind of average over the
whole sample);

(iii) choose a sample of pairs of related words on which to run the test;

(iv) determine whether the statistics defined in (ii) are "unusually small"
for the chosen sample.

Task (i) has several components.  First, we must define the notion
of "distance" between two given ELS's in a given array; for this we use a
convenient variant of the ordinary Euclidean distance.  Second, there are
many ways of writing a text as a two-dimensional array, depending on the row
length; we must select one or more of these arrays and somehow amalgamate
the results (of course, the selection and/or amalgamation must be carried out
according to clearly stated, systematic rules).  Third, a given word may occur
many times as an ELS in a text; here again, a selection and amalgamation
process is called for.  Fourth, we must correct for factors such as word
length and composition.  All this is done in detail in Sections A.1 and A.2
of the Appendix.

We stress that our definition of distance is not unique.  Although
there are certain general principles (like minimizing the skip d) some of the
details can be carried out in other ways.  We feel that varying these details
is unlikely to affect the results substantially.  Be that as it may, we chose
one particular definition, and have, throughout, used _only_ it, that is, the
function c(w,w') described in Section A.2 of the Appendix had been defined
before any sample was chosen, and it underwent no changes.  [Similar remarks
apply to choices made in carrying out task (ii).]

Next, we have task (ii), measuring the overall proximity of pairs of
words in the sample as a whole.  For this, we used two different statistics
p_1 and p_2, which are defined and motivated in the Appendix (Section A.5).
Intuitively, each measures overall proximity in a different way.  In each
case, a small value of p_i indicates that the words in the sample pairs are,
on the whole, close to each other.  No other statistics were _ever_ calculated
for the first, second or indeed any sample.

In task (iii), identifying an appropriate sample of word pairs, we
strove for uniformity and objectivity with regard to the choice of pairs and
to the relation between their elements.  Accordingly, our sample was built
from a list of personalities (p) and the dates (Hebrew day and month) (p')
of their death or birth.  The personalities were taken from the _Encyclopedia
of Great Men in Israel_ [5].

At first, the criterion for inclusion of a personality in the sample
was simply that his entry contain at least three columns of text and that a
date of birth or death be specified.  This yielded 34 personalities (the
first list--Table 1).  In order to avoid any conceivable appearance of having
fitted the tests to the data, it was later decided to use a fresh sample,
without changing anything else.  This was done by considering all
personalities whose entries contain between 1.5 and 3 columns of text in the
Encyclopedia; it yielded 32 personalities (the second list--Table 2).  The
significance test was carried out on the second sample only.

Note that personality-date pairs (p,p') are not word pairs.  The
personalities each have several appellations, there are variations in spelling
and there are different ways of designating dates.  Thus each personality-
date pair (p,p') corresponds to several word pairs (w,w').  The precise
method used to generate a sample of word pairs from a list of personalities
is explained in the Appendix (Section A.3).

The measures of proximity of word pairs (w,w') result in statistics
p_1 and p_2.  As explained in the Appendix (Section A.5), we also used a
variant of this method, which generates a smaller sample of word pairs from
the same list of personalities.  We denote the statistics p_1 and p_2, when
applied to this smaller sample, by p_3 and p_4.

Finally, we come to task (iv), the significance test itself.  It is
so simple and straightforward that we describe it in full immediately.

The second list contains 32 personalities.  For each of the 32!
permutations pi of these personalities, we define the statistic p_1^pi
obtained by permuting the personalities in accordance with pi, so that
Personality i is matched with the dates of Personality pi(i).  The 32!
numbers p_1^pi are ordered, with possible ties, according to the usual order
of the real numbers.  If the phenomenon under study were due to chance, it
would be just as likely that p_1 occupies any one of the 32! places in this
order as any other.  Similarly for p_2, p_3 and p_4.  This is our null
hypothesis.

To calculate significance levels, we chose 999,999 random permutations
pi of the 32 personalities; the precise way in which this was done is
explained in the Appendix (Section A.6).  Each of these permutations pi
determines a statistic p_1^pi; together with p_1, we have thus 1,000,000
numbers.  Define the rank order of p_1 among these 1,000,000 numbers as the
number of p_1^pi not exceeding p_1; if p_1 is tied with other p_1^pi, half
of these others are considered to "exceed" p_1.  Let rho_1 be the rank order
of p_1, divided by 1,000,000; under the null hypothesis, rho_1 is the
probability that p_1 would rank as low as it does.  Define rho_2, rho_3 and
rho_4 similarly (using the same 999,999 permutations in each case).

After calculating the probabilities rho_1 through rho_4, we must make
an overall decision to accept or reject the research hypothesis.  In doing
this, we should avoid selecting favorable evidence only.  For example,
suppose that rho_3 = 0.01, the other rho_i being higher.  There is then the
temptation to consider rho_3 only, and so to reject the null hypothesis at
the level of 0.01.  But this would be a mistake; with enough sufficiently
diverse statistics, it is quite likely that just by chance, some one of them
will be low.  The correct question is, "Under the null hypothesis, what is
the probability that at least one of the four rho_i would be less than or
equal to 0.01?"  Thus denoting the event "rho_i <= 0.01" by E_i, we must find
the probability not of E_3, but of "E_1 or E_2 or E_3 or E_4."  If the E_i
were mutually exclusive, this probability would be 0.04; overlaps only
decrease the total probability, so that it is in any case less than or equal
to 0.04.  Thus we can reject the null hypothesis at the level of 0.04, but
not 0.01.

More generally, for any given delta, the probability that at least
one of the four numbers rho_i is less than or equal to delta is at most
4 delta.  This is known as the Bonferroni inequality.  Thus the overall
significance level (or p-value), using all four statistics, is
rho_0 := 4 min rho_i.

3. RESULTS AND CONCLUSIONS

In Table 3, we list the rank order of each of the four p_i among the
1,000,000 corresponding p_i^pi.  Thus the entry 4 for p_4 means that for
precisely 3 out of the 999,999 random permutations pi, the statistic
p_4^pi was smaller than p_4 (none was equal).  It follows that min rho_i =
0.000004, so rho_0 = 4 min rho_i = 0.000016.  The same calculations, using the
same 999,999 random permutations, were performed for control texts.  Our
first control text, R, was obtained by permuting the letters of G randomly
(for details, see Section A.6 of the Appendix).  After an earlier version
of this paper was distributed, one of the readers, a prominent scientist,
suggested using Tolstoy's _War and Peace_ as a control text.  So we used
text T, consisting of the initial segment of the Hebrew translation of
Tolstoy's _War and Peace_ [6], of the same length as G.  Then we were asked
by a referee to perform a control experiment on some early Hebrew text.  He
also suggested randomizing the words in two ways: over the whole text
and within each verse.  Accordingly, we checked texts I, U and W: text I
is the Book of Isaiah [2]; W was obtained by permuting the words of G
randomly; U was obtained from G by randomly permuting the words within each
verse.  In addition, we produced text V by permuting the verses of G randomly.
(For details, see Section A.6 of the Appendix.)  Table 3 gives the results of
these calculations, too.  In the case of I, min rho_i is approximately 0.900;
in the case of R it is 0.365; in the case of T it is 0.277; in the case of U
it is 0.276; in the case of V it is 0.212; and in the case of W it is 0.516.
So in five cases rho_0 = 4 min rho_i exceeds 1, and in the remaining case
rho_0 = 0.847; that is, the result is totally nonsignificant, as one would
expect for control texts.

We conclude that the proximity of ELS's with related meanings in the
Book of Genesis is not due to chance.

------------------------------------------------------------------------------

Table 3
Rank order of p_i among one million p_i^pi
----------------------------------------------------------------
Text        p_1             p_2             p_3             p_4
----------------------------------------------------------------
G           453               5             570               4
R       619,140         681,451         364,859         573,861
T       748,183         363,481         580,307         277,103
I       899,830         932,868         929,840         946,261
W       883,770         516,098         900,642         630,269
U       321,071         275,741         488,949         491,116
V       211,777         519,115         410,746         591,503
----------------------------------------------------------------

```
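
The definition of an ELS (n,d,k) in Section 1 can be illustrated with a short sketch. It is not the authors' program (which the paper says is available on diskette); the function names `els_letters`, `find_els` and `array_coords`, the 0-based positions, and the English toy text are all assumptions made for illustration. The sketch strips spaces, enumerates skips d with 1 <= |d| <= `max_skip`, and reports every (n,d,k) whose letters spell the word. It also shows where a letter position falls when the text is written in rows of a given length.

```python
from typing import List, Tuple


def els_letters(letters: str, n: int, d: int, k: int) -> str:
    """Letters at positions n, n+d, ..., n+(k-1)d (0-based: an assumption)."""
    return "".join(letters[n + i * d] for i in range(k))


def find_els(text: str, word: str, max_skip: int) -> List[Tuple[int, int, int]]:
    """All (n, d, k) with 1 <= |d| <= max_skip whose letters spell `word`."""
    letters = text.replace(" ", "")  # positions do not count spaces
    k = len(word)
    hits = []
    for d in range(-max_skip, max_skip + 1):
        if d == 0:
            continue
        for n in range(len(letters)):
            last = n + (k - 1) * d
            # Both endpoints inside the text implies every position is inside.
            if 0 <= last < len(letters) and els_letters(letters, n, d, k) == word:
                hits.append((n, d, k))
    return hits


def array_coords(position: int, row_length: int) -> Tuple[int, int]:
    """(row, column) of a letter when the text is written in rows of row_length."""
    return divmod(position, row_length)


toy_text = "xhxa xmxm xexr"  # hypothetical text, not taken from the paper
print(find_els(toy_text, "hammer", max_skip=3))  # [(1, 2, 6)]
print(array_coords(7, row_length=5))             # (1, 2)
```

To restrict attention to the "noteworthy" ELS's discussed in Section 1, one could keep only the hits with the smallest |d|, for example `min(hits, key=lambda h: abs(h[1]))`. The corrected distance c(w,w') itself is defined in the Appendix and is not reproduced here.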
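
The rank-order rule of Section 2 can also be sketched in a few lines. This is a minimal illustration under stated assumptions: `statistic` is a hypothetical callable standing in for p_1 (or p_2, p_3, p_4) evaluated under a given matching of personalities to dates, and `random.shuffle` stands in for the permutation procedure of Section A.6, which is not reproduced here. The rank counts the observed value itself, every permuted value strictly below it, and half of the permuted values tied with it.

```python
import random
from typing import Callable, List, Sequence


def rank_order(observed: float, permuted: Sequence[float]) -> float:
    """Rank of `observed` among itself and the permuted values.

    Ties count as one half, following the rule stated in Section 2.
    """
    below = sum(1 for v in permuted if v < observed)
    ties = sum(1 for v in permuted if v == observed)
    return 1 + below + ties / 2


def randomization_rho(
    statistic: Callable[[List[int]], float],  # hypothetical stand-in for p_i
    size: int,
    n_perms: int,
    seed: int = 0,
) -> float:
    """Rank order of the unpermuted statistic, divided by n_perms + 1."""
    rng = random.Random(seed)
    identity = list(range(size))
    observed = statistic(identity)
    permuted = []
    for _ in range(n_perms):
        perm = identity[:]
        rng.shuffle(perm)  # assumption: uniform shuffle, not the A.6 procedure
        permuted.append(statistic(perm))
    return rank_order(observed, permuted) / (n_perms + 1)


# Three permuted values below the observed one give rank 4, as for p_4 on G.
print(rank_order(0.5, [0.1, 0.2, 0.3, 0.9]))
```

In the paper the call would correspond to `size=32` and `n_perms=999_999`, so that the divisor is 1,000,000.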
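
As a worked check of the Bonferroni step, the overall level rho_0 = 4 min rho_i can be recomputed from the rank orders in Table 3. The dictionary below copies the table; the names `TABLE_3` and `overall_level` are illustrative only. Values above 1 are left uncapped, matching the way the text reports them.

```python
from typing import Dict, Sequence, Tuple

TOTAL = 1_000_000  # the statistic itself plus 999,999 random permutations

# Rank orders of p_1, p_2, p_3, p_4, copied from Table 3.
TABLE_3: Dict[str, Tuple[int, int, int, int]] = {
    "G": (453, 5, 570, 4),
    "R": (619_140, 681_451, 364_859, 573_861),
    "T": (748_183, 363_481, 580_307, 277_103),
    "I": (899_830, 932_868, 929_840, 946_261),
    "W": (883_770, 516_098, 900_642, 630_269),
    "U": (321_071, 275_741, 488_949, 491_116),
    "V": (211_777, 519_115, 410_746, 591_503),
}


def overall_level(ranks: Sequence[int], total: int = TOTAL) -> float:
    """Bonferroni bound: number of statistics times the smallest rho_i."""
    rhos = [r / total for r in ranks]
    return len(rhos) * min(rhos)


for text, ranks in TABLE_3.items():
    print(text, round(overall_level(ranks), 6))
```

For G this gives 0.000016; for the control texts it gives a value above 1 in five cases and about 0.847 for V, in agreement with Section 3.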