# Bloom Filter

The *Bloom Filter* is a probabilistic data structure for set membership.

The filter can be used as an inexpensive first step when querying the actual data is quite costly (e.g: as a first check for expensive cache lookups or large data seeks).

## What does it do?

A *Bloom Filter* can be understood as a hash-set which can either tell you:

- An element is
*not*part of the set. - An element
*may be*part of the set.

More specifically, one can tweak the parameters of the filter to make it so that
the *false positive* rate of membership is quite low.

I won’t be going into those calculations here, but they are quite trivial to compute, or one can just look up appropriate values for their use case.

## Implementation

I’ll be using Python, which has the nifty ability of representing bitsets through its built-in big integers quite easily.

We’ll be assuming a `BIT_COUNT`

of 64 here, but the implementation can easily be
tweaked to use a different number, or even change it at construction time.

### Representation

A `BloomFilter`

is just a set of bits and a list of hash functions.

```
BIT_COUNT = 64
class BloomFilter[T]:
_bits: int
_hash_functions: list[Callable[[T], int]]
def __init__(self, hash_functions: list[Callable[[T], int]]) -> None:
# Filter is initially empty
self._bits = 0
self._hash_functions = hash_functions
```

### Inserting a key

To add an element to the filter, we take the output from each hash function and use that to set a bit in the filter. This combination of bit will identify the element, which we can use for lookup later.

```
def insert(self, val: T) -> None:
# Iterate over each hash
for f in self._hash_functions:
n = f(val) % BIT_COUNT
# Set the corresponding bit
self._bit |= 1 << n
```

### Querying a key

Because the *Bloom Filter* does not actually store its elements, but some
derived data from hashing them, it can only definitely say if an element *does
not* belong to it. Otherwise, it *may* be part of the set, and should be checked
against the actual underlying store.

```
def may_contain(self, val: T) -> bool:
for f in self._hash_functions:
n = f(val) % BIT_COUNT
# If one of the bits is unset, the value is definitely not present
if not (self._bit & (1 << n)):
return False
# All bits were matched, `val` is likely to be part of the set
return True
```