# Reservoir sampling

Prerequisites. Basic probability theory, including the notions of independence and conditional probability. A nodding acquaintance with the analysis of algorithms is helpful but not necessary.

## 1. Introduction

Studying a large collection of data often requires random sampling, the act of choosing a subset from a larger set by chance. In the most basic case, sampling is performed so that each element of the pool of data has the same probability of being selected for the sample, a process known as simple random sampling.

While the concept of simple random sampling is intuitive, how one actually goes about choosing an element with a certain probability is less so. Coin tossing is an example: it allows us to select from two choices with (supposedly) equal probability. Other popular methods weren’t too different:

Many years ago, people who needed random numbers in their scientific work would draw balls out of a “well-stirred urn,” or they would roll dice or deal out cards. A table of over 40,000 random digits, “taken at random from census reports,” was published in 1927 … Since then, a number of devices have been built to generate random numbers mechanically. (Knuth, 1998)

A common feature of many methods of simple random sampling is that they require knowledge of the size of the population from which to sample. By tossing a coin, we acknowledge that we are choosing from a set of two elements. Rolling a die means we choose from a set of six elements.

Once a dataset is large, however, it might not be feasible to know the precise size of the dataset in advance. In this post, we study a method of simple random sampling that does not require such knowledge, called reservoir sampling. As an added benefit, we shall see that reservoir sampling does not require access to the whole dataset at once, allowing a computer with limited memory to perform sampling by loading a small subset at a time.

## 2. Algorithm R

Suppose that we have a list of $n$ elements: $L = \{x_{0},\ldots,x_{n-1}\}$. We assume that the elements can only be accessed sequentially, one at a time. The following algorithm, a special case of Algorithm R from Knuth’s The Art of Computer Programming volume 2, selects one element from $L$ with equal probability:

Algorithm 2.1 (One-node reservoir sampling). We begin by selecting $x_0$, labeling it selected. For each $i \geq 1$, we perform a simple random sampling on the set of integers between $0$ and $i$ to choose $k$. If $k$ is 0, we designate $x_i$ as selected; otherwise, we leave selected unchanged.

Observe that the algorithm processes the input serially, accessing each element only once. This makes reservoir sampling an example of an online algorithm.
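
To make the procedure concrete, here is a minimal Python sketch of Algorithm 2.1. The function name `sample_one` and the use of `random.randint` for the simple-random-sampling step are my own choices, not part of Knuth's presentation:

```python
import random

def sample_one(stream):
    """Algorithm 2.1: choose one element of the stream uniformly at random."""
    selected = None
    for i, x in enumerate(stream):
        # Simple random sampling on {0, 1, ..., i}: x_i becomes
        # selected with probability 1/(i + 1).
        if random.randint(0, i) == 0:
            selected = x
    return selected
```

Because the input is consumed in a single pass, `sample_one` accepts any iterable, including generators and file objects whose lengths are unknown in advance.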

An alternative algorithm popular on the internet is the following:

Algorithm 2.2. At each node, we pick a number from the interval $[0, 1]$, uniformly at random. If the number is larger than all random numbers picked so far, then we replace selected with the current node. We repeat until we reach the end of the list.

Albeit appealing, the algorithm is wrong for all but the $n=1$ case.

To see this, we first recall that the notion of uniform randomness on $[0, 1]$ corresponds to the lengths of subintervals of $[0, 1]$. Specifically, for each $x \in [0, 1]$, the probability that a number picked from $[0, 1]$ uniformly at random is larger than $x$ is $1-x$.

Let us suppose, for contradiction, that the algorithm is correct for some $n \geq 2$. For each $0 \leq i \leq n-1$, we let $p_i$ denote the number chosen uniformly at random from $[0, 1]$ upon reaching $x_i$ in $L$.

We fix an index $0 \leq i \leq n-1$ and, for each $0 \leq j \leq n-1$, we let $S_j^i$ be the event that selected equals $x_i$ right after we pass through $x_j$. By definition, $S_{n-1}^i$ is the event that selected equals $x_i$ after all elements in $L$ have been examined.

We fix an index $1 \leq k \leq n-1$ and let $p^k = \max\{p_{0},\ldots,p_{k-1}\}$. Conditioned on the values of $p_{0},\ldots,p_{k-1}$, there is probability $1-p^k$ that $p_k > p^k$, and so

$$\mathbb{P}\left[S_k^k\right] = 1 - p^k.$$

Now, we fix $l > k$ and assume that selected equals $x_k$ upon reaching $x_l$ in $L$. With this assumption, $p_k$ is the largest number picked before reaching $x_l$, and selected remains unchanged after going through $x_l$ precisely when $p_l \leq p_k$. Conditioned on the value of $p_k$, this has probability $p_k$, and therefore

$$\mathbb{P}\left[S_l^k \mid S_{l-1}^k\right] = p_k.$$

We now observe that

$$\mathbb{P}\left[S_{n-1}^k\right] = \mathbb{P}\left[S_k^k\right] \prod_{l=k+1}^{n-1} \mathbb{P}\left[S_l^k \mid S_{l-1}^k\right] = (1-p^k)\,p_k^{\,n-1-k},$$

which depends on the realized values of $p^k$ and $p_k$. It is not difficult to find values of $p^k$ and $p_k$ for which the above product does not equal $1/n$. For example,

$$p^k = \frac{1}{4}, \qquad p_k = \frac{1}{2}$$

would do, as the product is then $3/2^{\,n-k+1}$, which is never the reciprocal of an integer. $\square$

The name reservoir refers to the fact that we keep track of a candidate—in a reservoir, if you will. By placing multiple candidates in the reservoir, we can generalize Algorithm 2.1 to sample several nodes at once. Below, we present the full version of Algorithm R.

Algorithm 2.3 (Reservoir sampling; Algorithm R). We wish to sample $m$ elements of $L$ with equal probability. To this end, we define a size-$m$ reservoir $R$ and initialize it by setting $R[i] = x_i$ for $0 \leq i \leq m-1$. For each index $i \geq m$, we perform a simple random sampling on the set of integers between $0$ and $i$ to choose $k$. If $0 \leq k \leq m-1$, then we set $R[k] = x_i$; otherwise, we leave $R$ unchanged. Once we have gone through all elements in $L$, the reservoir $R$ contains the desired sample.
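
In code, Algorithm 2.3 might look like the following sketch. The function name is my own, and the handling of streams shorter than $m$, which simply leaves the reservoir partially filled, is my own convention:

```python
import random

def algorithm_r(stream, m):
    """Algorithm 2.3 (Algorithm R): sample m elements of the stream
    uniformly at random, in one pass, without knowing its length."""
    reservoir = []
    for i, x in enumerate(stream):
        if i < m:
            reservoir.append(x)       # R[i] = x_i for 0 <= i <= m-1
        else:
            k = random.randint(0, i)  # simple random sample from {0, ..., i}
            if k < m:                 # happens with probability m/(i+1)
                reservoir[k] = x
    return reservoir
```

For instance, `algorithm_r(f, 100)` with a file object `f` would select 100 lines uniformly at random without ever loading the whole file into memory.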

Note that Algorithm 2.1 is a special case of Algorithm 2.3 with $m = 1$. It thus suffices to establish the correctness of Algorithm 2.3, which amounts to showing that the probability of an arbitrary element of $L$ being in the reservoir $R$ is always $m/n$.

We fix an index $0 \leq i \leq n-1$ and, for each $0 \leq j \leq n-1$, we let $S_j^i$ be the event that the reservoir $R$ contains $x_i$ right after we pass through $x_j$. By definition, $S_{n-1}^i$ is the event that $x_i$ is in the reservoir after all elements in $L$ have been examined. We also observe that

$$S_{j+1}^i \subseteq S_j^i$$

for all $j \geq i$, as $x_i$ cannot be in $R$ after we pass through $x_{j+1}$ unless it was already in $R$ when we reached $x_{j+1}$. It follows that

$$\mathbb{P}\left[S_{n-1}^i\right] = \mathbb{P}\left[S_i^i\right] \prod_{j=i+1}^{n-1} \mathbb{P}\left[S_j^i \mid S_{j-1}^i\right],$$

where $\mathbb{P}[X \mid Y]$ refers to the conditional probability of event $X$ given another event $Y$.

Now, for $i \geq m$, when we pass through $x_i$, we replace an element of $R$ with $x_i$ with probability $\frac{m}{i+1}$, as this requires choosing one of the $m$ values $0 \leq k \leq m-1$ out of the $i+1$ equally likely choices of $k$. In other words,

$$\mathbb{P}\left[S_i^i\right] = \frac{m}{i+1}.$$

For $0 \leq i \leq m-1$, the element $x_i$ is placed in $R$ during initialization, so that $\mathbb{P}[S_j^i] = 1$ for all $j \leq m-1$.

We now fix $j > i$ with $j \geq m$ and assume that $x_i$ is in $R$ just prior to examining $x_j$ in $L$. With this assumption, the probability that $x_i$ is removed from $R$ upon passing through $x_j$ is $\frac{1}{j+1}$, as removal entails obtaining, as the random choice of an integer between $0$ and $j$, precisely the index of the reservoir slot that holds $x_i$. Therefore, the probability $\mathbb{P}[S^i_{j} \mid S^i_{j-1}]$ of $x_i$ remaining in $R$ after we pass through $x_j$ is

$$\mathbb{P}\left[S_j^i \mid S_{j-1}^i\right] = 1 - \frac{1}{j+1} = \frac{j}{j+1}.$$

It now follows, for $i \geq m$, that

$$\mathbb{P}\left[S_{n-1}^i\right] = \frac{m}{i+1} \prod_{j=i+1}^{n-1} \frac{j}{j+1} = \frac{m}{i+1} \cdot \frac{i+1}{n} = \frac{m}{n},$$

and, for $0 \leq i \leq m-1$, that

$$\mathbb{P}\left[S_{n-1}^i\right] = \prod_{j=m}^{n-1} \frac{j}{j+1} = \frac{m}{n},$$

independent of the choice of $i$, whence the probability of an arbitrary node being in the reservoir is $\frac{m}{n}$, as was to be shown. $\square$

## 3. Who Discovered Algorithm R?

In Volume 2, Section 3.4.1 of The Art of Computer Programming, Knuth attributes Algorithm R to Alan G. Waterman. He, however, does not provide a reference, and there appears to be little information available on the matter. I sent a letter of inquiry to Knuth and received the following reply:

28 Dec 2017

to mark kim

from don knuth

Alan Waterman wrote me in the 70s, with detailed comments and suggestions that I incorporated into the second edition of Volume 2—which I was happy to complete in 1975 or 1976. (Bad typesetting took over, and I had to develop $\TeX$ before that book could be published, finally, in 1981.)

The first edition had an inferior Algorithm 3.4.2R dating from 1962; Alan was responding to it.

My books contain lots of unpublished work that I happen to learn about in various ways. But if the author(s) do(es) publish, I try to give a citation.

I’m unaware that Alan ever did write a paper about this (or any of his other contributions that are cited elsewhere in the second edition). If you learn of any appropriate references, please let me know … and you will deserve a reward of 0x\$1.00.

All I remember is that I was tickled pink to receive a letter from a real Harvard statistician who actually enjoyed my self-taught attempts at exposition of statistical methods…

Enclosed is a copy of the letter that survives in my files.

-don

Here are the relevant sections of Waterman’s 1975 letter to Knuth:

Dr. Alan G. Waterman

March 24, 1975

Dear Prof. Knuth:

I have some remarks on Chapter 3 of your book, The Art of Computer Programming, on Random Numbers.

[omitted]

There is a better algorithm than reservoir sampling, section 3.4.2, when the length of the file is unknown. Bring in the first $n$ records, then for each $k > n$, skip over the $k$th record with probability $(k-n)/k$, and replace the $i$th item in the sample by the $k$th record with probability $1/k$, for each $i \leq n$. If it is necessary to pass the records to a reservoir (the text is not clear on this point), the replacements may be done in an internal table of indices to the reservoir.

[omitted]

Fourth, your improved reservoir algorithm (why oh why didn’t I think of it?) will replace the old one in Section 3.4.2.

All in all, Algorithm R was known to Knuth and Waterman by 1975, and to a wider audience by 1981, when the second edition of The Art of Computer Programming volume 2 was published.

Meanwhile, A. I. McLeod and D. R. Bellhouse appear to have discovered Algorithm R independently, as [McLeod–Bellhouse 1983] presents Algorithm R without citing Knuth or Waterman.

Before the publication of Waterman’s algorithm, [Fan–Muller–Rezucha 1962] described a similar, but not quite the same, algorithm:

Procedure of Method 4:

For each $t$, $t = 1, 2, \cdots, n$, obtain $r_t$, associate it with the item and its label $I_t$, and place these pieces of information in the reservoir. For each $t$, $t > n$, place $r_t$, the item, and its label $I_t$ in the reservoir if $r_t$ is equal to or less than the maximum value of $r$ of a specified subset of values of $r$ in the reservoir. The subset of values of $r$ used in the comparison consists of the $D$ smallest distinct values of $r$ in the reservoir at the time of the comparison, where $D = n$ if all the $r$’s in this comparison subset are distinct; otherwise $% $.

When $t = N$, i.e., all the items have been inspected, stage one is completed.

Stage 2. Search the reservoir and select the $I_t$’s associated with the subset of the $n$ smallest values of $r$ in the reservoir. Difficulty can occur only when the values of the $r$’s are not distinct. In this case it is possible that there will be one or more items in the reservoir with a value of $r$ equal to the largest value, say $r_u$, of the comparison subset upon completion of stage 1. Let $L$ denote the number of items in this comparison subset upon completion of stage 1 which have values of $r$ less than $r_u$ and let $M$ be the number of items in the reservoir which have a value of $r$ equal to $r_u$. To satisfy the original requirement of obtaining a random sample of exactly $n$ distinct items it will be necessary to utilize an additional selection procedure to select $M'$ distinct items out of $M$ such that $L+M' = n$. This selection can be accomplished by using, for example, Method 1.

While it might be reasonable to say that the reservoir algorithm paradigm was discovered by Fan, Muller, and Rezucha, it seems unlikely that they were aware of Algorithm R before Knuth published Waterman’s algorithm.

As far as I can tell, most citations of Algorithm R credit Waterman. Curiously, however, the Wikipedia article on reservoir sampling makes no mention of Waterman, crediting instead J. S. Vitter via [Vitter 1985]. But then, the Vitter paper cites Waterman:

Algorithm R (which is is [sic] a reservoir algorithm due to Alan Waterman) works as follows: When the $(t+1)$st record in the file is being processed, for $t \geq n$, the $n$ candidates form a random sample of the first $t$ records. The $(t+1)$st record has a $n/(t+1)$ chance of being in a random sample of size $n$ of the first $t+1$ records, and so it is made a candidate with probability $n/(t+1)$. The candidate it replaces is chosen randomly from the $n$ candidates. It is easy to see that the resulting set of $n$ candidates forms a random sample of the first $t+1$ records.

This is probably why high school history teachers tell their students not to use Wikipedia for their essay homework.

## 4. Additional remarks and further results

It is not hard to see that Algorithm R runs in $O(n)$ time, where $n$ is the size of the list $L$ from which we are to sample $m$ elements. Indeed, each element of $L$ is scanned once, and only a constant number of operations are performed at each element of $L$.

In [Vitter 1985], J. S. Vitter presents Algorithm Z, which samples $m$ elements from an $n$-element set in $O(m(1 + \log(n/m)))$ expected time and establishes its asymptotic optimality. As far as I know, there is no algorithm with the same worst-case complexity. Various optimizations with the same expected-time complexity have been proposed: see, for example, [Li 1994].

Note that $\log (n/m) = \log n - \log m$ approaches $\log n$ as $m$ decreases in size, resulting in a significant drop in performance. See [Dash–Ng 2006] for an algorithm that potentially addresses this issue.

Finally, we remark that a large $n$ presents a problem regardless of the size of $m$, as Algorithm R and Algorithm Z operate entirely in-memory. The standard workaround for memory problems is to distribute the task. Suppose, for example, that we have a cluster of $D+1$ machines available. Machine 0 takes in a stream of data; for each data point $x$, it chooses a random integer $k$ between $1$ and $D$ and sends $x$ to Machine $k$. Machines $1$ through $D$ carry out reservoir sampling, each selecting a sample of size $m/D$. The samples are then sent back to Machine 0 to be merged, resulting in a sample of size $m$. Given a large enough stream of data, the machines would receive roughly equal amounts of data, say $n/D$ data points each. With the equal-size assumption, we see that the total probability of an arbitrary data point $x$ being included in the final sample is

$$\sum_{k=1}^{D} \frac{1}{D} \cdot \frac{m/D}{n/D} = D \cdot \frac{1}{D} \cdot \frac{m}{n} = \frac{m}{n}.$$

Of course, the equal-size assumption we have made does not always hold, and so the algorithm does not always produce the correct answer. We do not pursue the details here.
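
For illustration, the scheme can be simulated in a single Python process. The routing step and the helper names below are illustrative assumptions, we suppose for simplicity that $D$ divides $m$, and the workers are ordinary function calls rather than separate machines:

```python
import random

def algorithm_r(stream, m):
    # One-pass reservoir sampling (Algorithm R), as in Section 2.
    reservoir = []
    for i, x in enumerate(stream):
        if i < m:
            reservoir.append(x)
        else:
            k = random.randint(0, i)
            if k < m:
                reservoir[k] = x
    return reservoir

def distributed_sample(stream, m, D):
    """Simulate Machine 0 routing each point to one of D workers, each of
    which reservoir-samples m/D points; the reservoirs are then merged."""
    shards = [[] for _ in range(D)]
    for x in stream:
        shards[random.randrange(D)].append(x)  # Machine 0: route uniformly
    sample = []
    for shard in shards:                       # Machines 1 through D
        sample.extend(algorithm_r(shard, m // D))
    return sample
```

Note that a worker whose shard holds fewer than $m/D$ points returns a smaller reservoir, which is one manifestation of the failure of the equal-size assumption.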
