# Reservoir Sampling

*Reservoir Sampling* is an online, probabilistic
algorithm to uniformly sample $k$ random elements out of a stream of values.

It’s a particularly elegant and small algorithm, only requiring $\Theta(k)$ amount of space and a single pass through the stream.

## Sampling one element

As an introduction, we’ll first focus on fairly sampling one element from the stream.

```
def sample_one[T](stream: Iterable[T]) -> T:
stream_iter = iter(stream)
# Sample the first element
res = next(stream_iter)
for i, val in enumerate(stream_iter, start=1):
j = random.randint(0, i)
# Replace the sampled element with probability 1/(i + 1)
if j == 0:
res = val
# Return the randomly sampled element
return res
```

### Proof

Let’s now prove that this algorithm leads to a fair sampling of the stream.

We’ll be doing proof by induction.

#### Hypothesis $H_N$

After iterating through the first $N$ items in the stream,
each of them has had an equal $\frac{1}{N}$ probability of being selected as
`res`

.

#### Base Case $H_1$

We can trivially observe that the first element is always assigned to `res`

,
$\frac{1}{1} = 1$, the hypothesis has been verified.

#### Inductive Case

For a given $N$, let us assume that $H_N$ holds. Let us now look at the events
of loop iteration where `i = N`

(i.e: observation of the $N + 1$-th item in the
stream).

`j = random.randint(0, i)`

uniformly selects a value in the range $[0, i]$,
a.k.a $[0, N]$. We then have two cases:

`j == 0`

, with probability $\frac{1}{N + 1}$: we select`val`

as the new reservoir element`res`

.`j != 0`

, with probability $\frac{N}{N + 1}$: we keep the previous value of`res`

. By $H_N$, any of the first $N$ elements had a $\frac{1}{N}$ probability of being`res`

before at the start of the loop, each element now has a probability $\frac{1}{N} \cdot \frac{N}{N + 1} = \frac{1}{N + 1}$ of being the element.

And thus, we have proven $H_{N + 1}$ at the end of the loop.

## Sampling $k$ element

The code for sampling $k$ elements is very similar to the one-element case.

```
def sample[T](stream: Iterable[T], k: int = 1) -> list[T]:
stream_iter = iter(stream)
# Retain the first 'k' elements in the reservoir
res = list(itertools.islice(stream_iter, k))
for i, val in enumerate(stream_iter, start=k):
j = random.randint(0, i)
# Replace one element at random with probability k/(i + 1)
if j < k:
res[j] = val
# Return 'k' randomly sampled elements
return res
```

### Proof

Let us once again do a proof by induction, assuming the stream contains at least $k$ items.

#### Hypothesis $H_N$

After iterating through the first $N$ items in the stream, each of them has had an equal $\frac{k}{N}$ probability of being sampled from the stream.

#### Base Case $H_k$

We can trivially observe that the first $k$ element are sampled at the start of the algorithm, $\frac{k}{k} = 1$, the hypothesis has been verified.

#### Inductive Case

For a given $N$, let us assume that $H_N$ holds. Let us now look at the events
of the loop iteration where `i = N`

, in order to prove $H_{N + 1}$.

`j = random.randint(0, i)`

uniformly selects a value in the range $[0, i]$,
a.k.a $[0, N]$. We then have three cases:

`j >= k`

, with probability $1 - \frac{k}{N + 1}$: we do not modify the sampled reservoir at all.`j < k`

, with probability $\frac{k}{N + 1}$: we sample the new element to replace the`j`

-th element of the reservoir. Therefore for any element $e \in [0, k[$ we can either have:- $j = e$: the element
*is*replaced, probability $\frac{1}{k}$. - $j \neq e$: the element is
*not*replaced, probability $\frac{k - 1}{k}$.

- $j = e$: the element

We can now compute the probability that a previously sampled element is kept in the reservoir: $1 - \frac{k}{N + 1} + \frac{k}{N + 1} \cdot \frac{k - 1}{k} = \frac{N}{N + 1}$.

By $H_N$, any of the first $N$ elements had a $\frac{k}{N}$ probability of being sampled before at the start of the loop, each element now has a probability $\frac{k}{N} \cdot \frac{N}{N + 1} = \frac{k}{N + 1}$ of being the element.

We have now proven that all elements have a probability $\frac{k}{N + 1}$ of being sampled at the end of the loop, therefore $H_{N + 1}$ has been verified.