Prerequisites. Basic probability theory, including the notion of independence and conditional probability. A nodding acquaintance with the analysis of algorithms is helpful but not necessary.
Studying a large collection of data often requires random sampling, the act of choosing a subset from a larger set by chance. In the most basic case, sampling is performed so that each element of the pool of data has the same probability of being selected for the sample, a process known as simple random sampling.
While the concept of simple random sampling is intuitive, how one actually goes about choosing an element with a certain probability is less so. Coin tossing is an example: it allows us to select from two choices with (supposedly) equal probability. Other popular methods weren’t too different:
Many years ago, people who needed random numbers in their scientific work would draw balls out of a “well-stirred urn,” or they would roll dice or deal out cards. A table of over 40,000 random digits, “taken at random from census reports,” was published in 1927 … Since then, a number of devices have been built to generate random numbers mechanically. (Knuth, 1998)
A common feature of many methods of simple random sampling is that they require knowledge of the size of the population from which to sample. By tossing a coin, we acknowledge that we are choosing from a set of two elements. Rolling a die means we choose from a set of six elements.
When a dataset is large, however, it might not be feasible to know its precise size in advance. In this post, we study a method of simple random sampling that does not require such knowledge, called reservoir sampling. As an added benefit, we shall see that reservoir sampling does not require access to the whole dataset at once, allowing a computer with limited memory to perform sampling by loading a small subset at a time.
2. Algorithm R
Suppose that we have a list of $n$ elements: $L = (x_1, x_2, \ldots, x_n)$. We assume that the elements of $L$ can only be accessed sequentially, one at a time. The following algorithm, a special case of Algorithm R from Knuth's The Art of Computer Programming volume 2, selects one element from $L$ with equal probability $1/n$:
Algorithm 2.1 (One-node reservoir sampling). We begin by selecting $x_1$, labeling it selected. For each $2 \leq k \leq n$, we perform a simple random sampling on the set of integers between $0$ and $k - 1$ to choose $j_k$. If $j_k$ is 0, we designate $x_k$ as selected; otherwise, we leave selected unchanged.
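In code, Algorithm 2.1 might look like the following (a minimal Python sketch; the function name sample_one and the use of random.randrange are my choices, not part of Knuth's presentation):

```python
import random

def sample_one(stream):
    """Algorithm 2.1: select one element from an iterable uniformly at
    random, reading the elements sequentially, one at a time."""
    selected = None
    for k, x in enumerate(stream, start=1):   # k = 1, 2, 3, ...
        # Choose an integer uniformly from {0, ..., k-1}; it equals 0
        # with probability 1/k, in which case x_k becomes selected.
        if random.randrange(k) == 0:
            selected = x
    return selected
```

Note that the first element is always selected at $k = 1$, since random.randrange(1) can only return 0, matching the initialization step of the algorithm.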
Observe that the algorithm processes the input serially, accessing each element only once. This makes reservoir sampling an example of an online algorithm.
An alternative algorithm popular on the internet is the following:
Algorithm 2.2. At each node, we pick a number from the interval $[0, 1]$, uniformly at random. If the number is larger than all random numbers picked so far, then we replace selected with the current node. Repeat until we reach the end of the list.
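For concreteness, here is a direct transcription of Algorithm 2.2 (a Python sketch; the variable names are mine):

```python
import random

def sample_one_maxtag(stream):
    """A direct transcription of Algorithm 2.2: tag every node with a
    uniform random number and keep the node with the largest tag so far."""
    selected, max_tag = None, -1.0
    for x in stream:
        tag = random.random()         # uniform draw from [0, 1)
        if tag > max_tag:             # larger than all tags picked so far
            selected, max_tag = x, tag
    return selected
```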
Albeit appealing, the algorithm is wrong for all but the trivial $n = 1$ case.
To this end, we first recall that the notion of uniform randomness on $[0, 1]$ corresponds to the lengths of subintervals of $[0, 1]$. Specifically, for each $t \in [0, 1]$, the probability that a number picked from $[0, 1]$ uniformly at random is larger than $t$ is $1 - t$.
Let us assume for contradiction that Algorithm 2.2 performs simple random sampling on $L$. For each $1 \leq k \leq n$, we let $U_k$ denote the random number chosen uniformly at random from $[0, 1]$ upon reaching $x_k$ in $L$.
We fix an index $1 \leq i \leq n$ and, for each $i \leq k \leq n$, we let $E_k$ be the event that selected equals $x_i$ right after we pass through $x_k$. By definition, $E_n$ is the event that selected equals $x_i$ after all elements in $L$ have been examined.
Consider first the event $E_i$, and suppose that $\max\{U_1, \ldots, U_{i-1}\}$ takes the value $t$. There is probability $1 - t$ that $U_i > t$, and so

$$P(E_i) = 1 - t.$$

Now, we fix $k$ with $i < k \leq n$ and assume that selected equals $x_i$ upon reaching $x_k$ in $L$, with $U_i$ taking the value $u$. With this assumption, the probability that selected remains unchanged after going through $x_k$ equals the probability that $U_k \leq u$, which is $u$. Therefore,

$$P(E_k \mid E_{k-1}) = u.$$

We now observe that

$$P(E_n) = P(E_i) \prod_{k=i+1}^{n} P(E_k \mid E_{k-1}) = (1 - t)\, u^{\,n-i}.$$

It is not difficult to find values of $t$ and $u$ such that the above product does not equal $1/n$. For example, taking $n = 2$, $i = 2$, and $t = 3/4$ yields $P(E_2) = 1/4 \neq 1/2$.
The name reservoir refers to the fact that we keep track of a candidate—in a reservoir, if you will. By placing multiple candidates in the reservoir, we can generalize Algorithm 2.1 to sample several nodes at once. Below, we present the full version of Algorithm R.
Algorithm 2.3 (Reservoir sampling; Algorithm R). We wish to sample $m$ elements of $L = (x_1, \ldots, x_n)$ with equal probability. To this end, we define a size-$m$ reservoir $R = (r_0, \ldots, r_{m-1})$ and initialize it by setting $r_{k-1} = x_k$ for $1 \leq k \leq m$. For each index $m < k \leq n$, we perform a simple random sampling on the set of integers between $0$ and $k - 1$ to choose $j_k$. If $j_k \leq m - 1$, then we set $r_{j_k} = x_k$; otherwise, we leave $R$ unchanged. $R$ is the desired sample, once we go through all elements in $L$.
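Here is Algorithm 2.3 in code (again a Python sketch; the name reservoir_sample and the zero-based reservoir indices are my choices):

```python
import random

def reservoir_sample(stream, m):
    """Algorithm R: return m elements chosen uniformly at random from an
    iterable of unknown length, reading it one element at a time."""
    reservoir = []
    for k, x in enumerate(stream, start=1):   # k = 1, 2, 3, ...
        if k <= m:
            reservoir.append(x)               # fill with x_1, ..., x_m
        else:
            # Choose an integer uniformly from {0, ..., k-1}; it lands in
            # {0, ..., m-1} with probability m/k, in which case x_k
            # replaces the corresponding slot of the reservoir.
            j = random.randrange(k)
            if j < m:
                reservoir[j] = x
    return reservoir
```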
Note that Algorithm 2.1 is a special case of Algorithm 2.3 with $m = 1$. It thus suffices to establish the correctness of Algorithm 2.3, which amounts to showing that the probability of an arbitrary element of $L$ being in the reservoir is always $m/n$.
We fix an index $1 \leq i \leq n$ and, for each $i \leq k \leq n$, we let $E_k$ be the event that the reservoir contains $x_i$ right after we pass through $x_k$. By definition, $E_n$ is the event that $x_i$ is in the reservoir after all elements in $L$ have been examined. We also observe that

$$E_{k+1} \subseteq E_k$$

for all $i \leq k < n$, as $x_i$ cannot be in $R$ after we pass through $x_{k+1}$ unless it was already in $R$ when we reach $x_{k+1}$. It follows that

$$P(E_n) = P(E_i) \prod_{k=i+1}^{n} P(E_k \mid E_{k-1}),$$

where $P(A \mid B)$ refers to the conditional probability of event $A$ given another event $B$.
Now, when we pass through $x_i$, we replace an element of $R$ with $x_i$ with probability $m/i$. In other words,

$$P(E_i) = \frac{m}{i}$$

whenever $i > m$; if $i \leq m$, then $x_i$ is in $R$ from the initialization, and $P(E_i) = 1$.

We now fix $k$ with $\max(i, m) < k \leq n$ and assume that $x_i$ is in $R$ just prior to examining $x_k$ in $L$. With this assumption, the probability that $x_i$ is removed from $R$ upon passing through $x_k$ is $1/k$, as it entails obtaining the slot of $x_i$ as the random choice of an integer between $0$ and $k - 1$. Therefore, the probability of $x_i$ remaining in $R$ after we pass through $x_k$ is

$$P(E_k \mid E_{k-1}) = 1 - \frac{1}{k} = \frac{k-1}{k}.$$

It now follows that

$$P(E_n) = P(E_i) \prod_{k=i+1}^{n} P(E_k \mid E_{k-1}) = \frac{m}{i} \prod_{k=i+1}^{n} \frac{k-1}{k} = \frac{m}{i} \cdot \frac{i}{n} = \frac{m}{n}$$

for $i > m$, and similarly $P(E_n) = \prod_{k=m+1}^{n} \frac{k-1}{k} = \frac{m}{n}$ for $i \leq m$, independent of the choice of $i$, whence the probability of an arbitrary node being in the reservoir is $m/n$, as was to be shown.
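The $m/n$ claim is also easy to check empirically. The following quick Monte Carlo sketch reuses the reservoir_sample function from the sketch above:

```python
from collections import Counter

n, m, trials = 10, 3, 100_000
counts = Counter()
for _ in range(trials):
    counts.update(reservoir_sample(range(n), m))

# Every element should appear in roughly m/n = 30% of the trials.
for x in sorted(counts):
    print(x, counts[x] / trials)
```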
3. Who Discovered Algorithm R?
In Volume 2, Section 3.4.2 of The Art of Computer Programming, Knuth attributes Algorithm R to Alan G. Waterman. However, he does not provide a reference, and there appears to be little information available on the matter. I sent a letter of inquiry to Knuth and received the following reply:
28 Dec 2017
to mark kim
from don knuth
Alan Waterman wrote me in the 70s, with detailed comments and suggestions that I incorporated into the second edition of Volume 2—which I was happy to complete in 1975 or 1976. (Bad typesetting took over, and I had to develop TeX before that book could be published, finally, in 1981.)
The first edition had an inferior Algorithm 3.4.2R dating from 1962; Alan was responding to it.
My books contain lots of unpublished work that I happen to learn about in various ways. But if the author(s) do(es) publish, I try to give a citation.
I’m unaware that Alan ever did write a paper about this (or any of his other contributions that are cited elsewhere in the second edition). If you learn of any appropriate references, please let me know … and you will deserve a reward of 0x$1.00.
All I remember is that I was tickled pink to receive a letter from a real Harvard statistician who actually enjoyed my self-taught attempts at exposition of statistical methods…
Enclosed is a copy of the letter that survives in my files.
Here are the relevant sections of Waterman’s 1975 letter to Knuth:
Dr. Alan G. Waterman
March 24, 1975
Dear Prof. Knuth:
I have some remarks on Chapter 3 of your book, The Art of Computer Programming, on Random Numbers.
There is a better algorithm than reservoir sampling, section 3.4.2, when the length of the file is unknown. Bring in the first $n$ records, then for each $t > n$, skip over the $t$th record with probability $1 - n/t$, and replace the $j$th item in the sample by the $t$th record with probability $1/t$, for each $1 \leq j \leq n$. If it is necessary to pass the records to a reservoir (the text is not clear on this point), the replacements may be done in an internal table of indices to the reservoir.
Fourth, your improved reservoir algorithm (why oh why didn’t I think of it?) will replace the old one in Section 3.4.2.
All in all, Algorithm R was known to Knuth and Waterman by 1975, and to a wider audience by 1981, when the second edition of The Art of Computer Programming volume 2 was published.
Before the publication of Waterman’s algorithm, [Fan–Muller–Rezucha 1962] described a similar, but not quite the same, algorithm:
Procedure of Method 4:
For each $i$, $1 \leq i \leq n$, obtain $u_i$, associate it with the $i$th item and its label $i$, and place these pieces of information in the reservoir. For each $i$, $n < i \leq N$, place $u_i$, the item, and its label in the reservoir if $u_i$ is equal to or less than the maximum value of $u$ of a specified subset of values of $u$ in the reservoir. The subset of values of $u$ used in the comparison consists of the $n'$ smallest distinct values of $u$ in the reservoir at the time of the comparison, where $n' = n$ if all the $u$'s in this comparison subset are distinct; otherwise $n' < n$.
When $i = N$, i.e., all the items have been inspected, stage one is completed.
Stage 2. Search the reservoir and select the items associated with the subset of the $n$ smallest values of $u$ in the reservoir. Difficulty can occur only when the values of the $u$'s are not distinct. In this case it is possible that there will be one or more items in the reservoir with a value of $u$ equal to the largest value, say $u_0$, of the comparison subset upon completion of stage 1. Let $n_1$ denote the number of items in this comparison subset upon completion of stage 1 which have values of $u$ less than $u_0$ and let $n_2$ be the number of items in the reservoir which have a value of $u$ equal to $u_0$. To satisfy the original requirement of obtaining a random sample of exactly $n$ distinct items it will be necessary to utilize an additional selection procedure to select $k$ distinct items out of the $n_2$ such that $n_1 + k = n$. This selection can be accomplished by using, for example, Method 1.
While it might be reasonable to say that the reservoir algorithm paradigm was discovered by Fan, Muller, and Rezucha, it seems unlikely that they were aware of Algorithm R before Knuth published Waterman’s algorithm.
As far as I can tell, most citations of Algorithm R credit Waterman. Curiously, however, the Wikipedia article on reservoir sampling makes no mention of Waterman, crediting instead J. S. Vitter via [Vitter 1985]. But then, the Vitter paper cites Waterman:
Algorithm R (which is is [sic] a reservoir algorithm due to Alan Waterman) works as follows: When the $(t+1)$st record in the file is being processed, for $t \geq n$, the $n$ candidates form a random sample of the first $t$ records. The $(t+1)$st record has a $n/(t+1)$ chance of being in a random sample of size $n$ of the first $t+1$ records, and so it is made a candidate with probability $n/(t+1)$. The candidate it replaces is chosen randomly from the $n$ candidates. It is easy to see that the resulting set of candidates forms a random sample of the first $t+1$ records.
This is probably why high school history teachers tell their students not to use Wikipedia for their essay homework.
4. Additional remarks and further results
It is not hard to see that Algorithm R runs in $O(n)$ time, where $n$ is the size of the list $L$ from which we are to sample $m$ elements. Indeed, each element of $L$ is scanned once, and only a constant number of operations are performed at each element of $L$.
In [Vitter 1985], J. S. Vitter presents Algorithm Z, which samples $m$ elements from an $n$-element set in $O(m(1 + \log(n/m)))$ expected time, and establishes its asymptotic optimality. As far as I know, there is no algorithm with the same worst-case time complexity. Various optimizations with the same expected-time complexity have been proposed: see, for example, [Li 1994].
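To get a sense of where the savings come from, the following sketch implements the skipping idea in the simplest case $m = 1$. This is not Vitter's Algorithm Z, which handles general $m$; it only illustrates the jump technique Algorithm Z builds on, and the function name is mine. The derivation uses the fact, established in Section 2, that the candidate at position $t$ survives through position $j$ with probability $t/j$; hence the next replacement position $K$ satisfies $P(K > j) = t/j$ and can be drawn directly as $\lceil t/U \rceil$ for $U$ uniform on $(0, 1)$.

```python
import math
import random

def sample_index_by_skipping(n):
    """Return an index in {1, ..., n} uniformly at random, touching only
    O(log n) positions in expectation instead of all n."""
    candidate, t = 1, 1
    while True:
        u = random.random()
        if u == 0.0:                # guard against the degenerate draw u = 0
            continue
        t = math.ceil(t / u)        # jump straight to the next replacement position
        if t > n:
            return candidate
        candidate = t
```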
Note that $m(1 + \log(n/m))$ approaches $n$ rapidly as the ratio $n/m$ decreases, resulting in a significant drop in performance. See [Dash–Ng 2006] for an algorithm that potentially addresses this issue.
Finally, we remark that a large $m$ presents a problem regardless of the size of $n$, as Algorithm R and Algorithm Z operate entirely in memory. The standard workaround for memory problems is to distribute the task. Suppose, for example, that we have a cluster of $c + 1$ machines available. Machine 0 takes in a stream of data; for each datapoint $x$, it chooses a random integer $j$ between $1$ and $c$ and sends $x$ to Machine $j$. Machines $1$ through $c$ carry out reservoir sampling, selecting samples of size $m/c$. The samples are then sent back to Machine 0 to be merged, resulting in a sample of size $m$. Given a large enough stream of data, the machines would receive roughly equal amounts of data. With the equal-size assumption, we see that the total probability of an arbitrary datapoint being included in the final sample is

$$\frac{m/c}{n/c} = \frac{m}{n}.$$
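The scheme is easy to simulate on a single machine. In the sketch below, the "machines" are plain Python lists, the names reservoir and distributed_sample are mine, and $m$ is assumed to be divisible by $c$, with every machine receiving at least $m/c$ datapoints:

```python
import random

def reservoir(stream, size):
    # Algorithm R, as in Algorithm 2.3.
    res = []
    for k, x in enumerate(stream, start=1):
        if k <= size:
            res.append(x)
        else:
            j = random.randrange(k)
            if j < size:
                res[j] = x
    return res

def distributed_sample(stream, m, c):
    """Simulate the (c+1)-machine scheme: Machine 0 scatters the stream,
    Machines 1 through c each run Algorithm R with a reservoir of size
    m/c, and Machine 0 merges the c samples at the end."""
    shards = [[] for _ in range(c)]
    for x in stream:                      # Machine 0 routes each datapoint
        shards[random.randrange(c)].append(x)
    sample = []
    for shard in shards:                  # each machine samples m/c elements
        sample.extend(reservoir(shard, m // c))
    return sample
```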
Of course, the equal-size assumption we have made does not always hold, and so the algorithm does not always produce the correct answer. We do not pursue the details here.