The nice friendly way to play Twenty Questions is to select in your mind a secret something (a person, place, or thing) and to give honest answers to a bunch of true/false questions about it. A less nice way to play is to keep changing what you have in mind so that you can answer “No” to every question. That’s not a good way to keep friends, but something very much like it is a good way to generate a quasi-random sequence of bits.
First I’ll play nice. I’m thinking of an infinite binary sequence that begins 0101101010110101… How will you guess the next bit? My infinite sequence happens to repeat with period seven, but if you didn’t know that ahead of time, what sort of bit-prediction method would you use? More importantly, how would you get a computer to predict successive bits and learn from its mistakes? This kind of question is relevant to data-compression.
A simple-minded but curiously effective general procedure for predicting the next bit involves looking at the different-length suffixes of the currently-known part of the sequence, where the suffix of length k consists of the last k bits, and looking at earlier occurrences of those particular patterns earlier in the sequence. For instance, the suffix of length 3 in 0101101010110101 is the pattern 101. This pattern of bits has occurred earlier in the sequence (several times, in fact). A longer suffix is 0101, which has also occurred earlier in the sequence. The curiously effective procedure for predicting the next bit requires that you first identify the longest suffix that has occurred earlier in the sequence. That suffix happens to be 010110101:
0101101010110101 (suffix of length 9)
0101101010110101 (same pattern, seen earlier)
Call the longest previously-seen suffix S. The curiously effective procedure prescribes that, having found S, you must locate the previous (i.e., second-to-last) occurrence of S. Then you must see what bit occurred immediately after the previous occurrence of S:
In this case, that bit is a 0, so you guess that the next bit will be 0.
This guess is right, and that’s no accident: as long as I’ve picked a periodic sequence of bits (and I did), your use of this simple-minded method guarantees that you’ll guess all bits correctly from some point on, even if you don’t know what the period is.
Okay, now it’s time for me to stop playing nice. I’m thinking of an infinite sequence whose first sixteen terms are 0100110101110001. (Spoiler: it’s called the Ehrenfeucht-Mycielski sequence, and you can watch me construct the first dozen terms in a Barefoot Math video I just posted.) What’s your guess for the seventeenth term? The longest suffix that’s occurred previously is 001,
and the previous occurrence of 001 was followed by a 1,
so if you follow the procedure described above you’ll guess that the next bit is 1.
“Wrong,” I say; “the next bit is 0. This really isn’t your day; that’s the seventeenth straight time that you’ve been wrong!”
I’m being mean to you, but I’m not changing my mind about the sequence as I go; what I had in mind from the start was the infinite sequence of bits that will make all your guesses wrong. (I can do this because I know what prediction method you’re using; this enables me to front-load all my meanness and just go on autopilot after starting the game.) This sequence was invented by Andrzej Ehrenfeucht and Jan Mycielski in 1992, and is described in a nice 2003 article by Klaus Sutner. (I’ve cut some corners in my explanation; to really do it properly, we need to allow the empty string to be considered the “suffix of length 0”. The Wikipedia page gives a more rigorous treatment.)
It’s believed that the sequence exhibits “normality“: that is, it’s believed that half of the bits are 0s and half are 1s, that the four patterns 00, 01, 10, and 11 each appear a quarter of the time, that the eight patterns 000, 001, …, 110, and 111 each appear one-eighth of the time, etc. What’s more, the sequence seems to have distinct eras, where during the mth era the sequence is “trying” to make sure that each of the 2m different possible patterns of length m occurs at least once. The sequence certainly isn’t pseudorandom; there are many easily-detected patterns in it that would make it unsuitable for use as a general-purpose source of random bits. For instance, in a random sequence of bits, we expect the first n bits to contain about n/2 0’s and n/2 1’s, but we’d be surprised if the split was too close to a tie too often. After all, if you toss a coin n times, the Central Limit Theorem tells us that the discrepancy between the number of heads and the number of tails is usually on the order of the square root of n. But with the Ehrenfeucht-Mycielski sequence, the discrepancy is a lot smaller than sqrt(n).
Or at least, the discrepancy appears to be smaller. Sutner’s simulations ran for millions of steps, and it’s possible for us to go farther now, but mere calculation can never tell us what really happens out near infinity where the trains don’t run. Mathematicians believe that for all large enough n, the number of 0’s in the first n bits of the Ehrenfeucht-Mycielski sequence differs from n/2 by less than sqrt(n) divided by one million. In fact, they believe that the preceding sentence remains true if you replace “million” by “billion” or any larger number you like (though the meaning of “large enough” will need to be adjusted accordingly). But: not only have they not proved this, they don’t even know how to prove that this claim is true if sqrt(n)/1,000,000 is replaced by the much larger number n/1,000,000. They haven’t proved that the asymptotic density of 0’s (or 1’s) is 1/2. This is the notorious Ehrenfeucht-Mycielski balance problem, and it’s been an open problem for over twenty-five years.
The Ehrenfeucht-Mycielski sequence exhibits the phenomenon of quasirandomness, weaker than pseudorandomness but still quite interesting. One of the frustrations of the study of quasirandom processes is the persistent gap between what we can guess and what we can prove. As Paul Erdős said of the Collatz Conjecture, “Mathematics may not be ready for such problems.”
But that’s a defeatist attitude, and I want to end on a positive note. So let me announce here that, after much work, it’s been shown by Kieffer and Szpankowski that the asymptotic density of 0’s and 1’s in the Ehrenfeucht-Mycielski sequence, if it exists, must lie between 1/4 and 3/4.
Oh, so you think it must be easy to prove that the density exists? Guess again.
Thanks to Bill Gasarch, Cris Moore, Joel Spencer and Klaus Sutner.
“On the Ehrenfeucht-Mycielski Balance Problem”, John C. Kieffer and W. Szpankowski, https://dmtcs.episciences.org/3542/pdf
“The Ehrenfeucht-Mycielski Sequence”, Klaus Sutner; http://www.cs.cmu.edu/~sutner/papers/CIAA-Sutner-2003.pdf