Reservoir sampling is a class of algorithms for sampling from streaming data, i.e., a sequence of data that we can access only once. Presented below is Algorithm R, the first well-known reservoir sampling algorithm.
Reservoir sampling – Algorithm R. We wish to sample elements of with equal probability. To this end, we define a size- set (“reservoir”) and initialize it by setting for . For each index , we perform a simple random sampling on the set of integers between and to choose . If , then we set ; otherwise, we leave unchanged. is the desired sample, once we go through all elements in .
In Volume 2, Section 3.4.1 of The Art of Computer Programming, Knuth attributes Algorithm R to Alan G. Waterman. He, however, does not provide a reference, and there appears to be little information available on the matter. I sent a letter of inquiry to Knuth and received the following reply:
28 Dec 2017
to mark kim
from don knuth
Alan Waterman wrote me in the 70s, with detailed comments and suggestions that I incorporated into the second edition of Volume 2—which I was happy to complete in 1975 or 1976. (Bad typesetting took over, and I had to develop before that book could be published, finally, in 1981.)
The first edition had an inferior Algorithm 3.4.2R dating from 1962; Alan was responding to it.
My books contain lots of unpublished work that I happen to learn about in various ways. But if the author(s) do(es) publish, I try to give a citation.
I’m unaware that Alan ever did write a paper about this (or any of his other contributions that are cited elsewhere in the second edition). If you learn of any appropriate references, please let me know … and you will deserve a reward of 0x$1.00.
All I remember is that I was tickled pink to receive a letter from a real Harvard statistician who actually enjoyed my self-taught attempts at exposition of statistical methods…
Enclosed is a copy of the letter that survives in my files.
Here are the relevant sections of Waterman’s 1975 letter to Knuth:
Dr. Alan G. Waterman
March 24, 1975
Dear Prof. Knuth:
I have some remarks on Chapter 3 of your book, The Art of Computer Programming, on Random Numbers.
There is a better algorithm than reservoir sampling, section 3.4.2, when the length of the file is unknown. Bring in the first records, then for each , skip over the th record with probability , and replace the th item in the sample by the th record with probability , for each . If it is necessary to pass the records to a reservoir (the text is not clear on this point), the replacements may be done in an internal table of indices to the reservoir.
Fourth, your improved reservoir algorithm (why oh why didn’t I think of it?) will replace the old one in Section 3.4.2.
All in all, Algorithm R was known to Knuth and Waterman by 1975, and to a wider audience by 1981, when the second edition of The Art of Computer Programming volume 2 was published.
Before the publication of Waterman’s algorithm, [Fan–Muller–Rezucha 1962] described a similar, but not quite the same, algorithm:
Procedure of Method 4:
For each , , obtain , associate it with the item and its label , and place these pieces of information in the reservoir. For each , , place , the item, and its label in the reservoir if is equal to or less than the maximum value of of a specified subset of values of in the reservoir. The subset of values of used in the comparison consists of the smallest distinct values of in the reservoir at the time of the comparison, where if all the ’s in this comparison subset are distinct; otherwise .
When , i.e., all the items have been inspected, stage one is completed.
Stage 2. Search the reservoir and select the ’s associated with the subset of the smallest values of in the reservoir Difficulty can occur only when the values of the ’s are not distinct. In this case it is possible that there will be one or more items in the reservoir with a value of equal to the largest value, say , of the comparison subset upon completion of stage 1. Let denote the number of items in this comparison subset upon completion of stage 1 which have values of less than and let be the number of items in the reservoir which have a value of equal to . To satisfy the original requirement of obtaining a random sample of exactly distinct items it will be necessary to utilize an additional selection procedure to select distinct items out of such that . This selection can be accomplished by using, for example, Method 1.
While it might be reasonable to say that the reservoir algorithm paradigm was discovered by Fan, Muller, and Rezucha, it seems unlikely that they were aware of Algorithm R before Knuth published Waterman’s algorithm.
As far as I can tell, most citations of Algorithm R credit Waterman. Curiously, however, the Wikipedia article on reservoir sampling makes no mention of Waterman, crediting instead J. S. Vitter via [Vitter 1985]. But then, the Vitter paper cites Waterman:
Algorithm R (which is is [sic] a reservoir algorithm due to Alan Waterman) works as follows: When the st record in the file is being processed, for , the candidates form a random sample of the first records. The st record has a chance of being in a random sample of size of the first records, and so it is made a candidate with probability . The candidate it replaces is chosen randomly from the candidates. It is easy to see that the resulting set of candidates forms a random sample of the first records.
This is probably why high school history teachers tell their students not to use Wikipedia for their essay homework.