postgresql/src/test/modules/test_bloomfilter
Bruce Momjian 29275b1d17 Update copyright for 2024
Reported-by: Michael Paquier

Discussion: https://postgr.es/m/ZZKTDPxBBMt3C0J9@paquier.xyz

Backpatch-through: 12
2024-01-03 20:49:05 -05:00
..
expected Add Bloom filter implementation. 2018-03-31 17:49:41 -07:00
sql Add Bloom filter implementation. 2018-03-31 17:49:41 -07:00
.gitignore Add Bloom filter implementation. 2018-03-31 17:49:41 -07:00
Makefile Split all OBJS style lines in makefiles into one-line-per-entry style. 2019-11-05 14:41:07 -08:00
README Add Bloom filter implementation. 2018-03-31 17:49:41 -07:00
meson.build Update copyright for 2024 2024-01-03 20:49:05 -05:00
test_bloomfilter--1.0.sql Add Bloom filter implementation. 2018-03-31 17:49:41 -07:00
test_bloomfilter.c Update copyright for 2024 2024-01-03 20:49:05 -05:00
test_bloomfilter.control Add Bloom filter implementation. 2018-03-31 17:49:41 -07:00

README

test_bloomfilter overview
=========================

test_bloomfilter is a test harness module for testing Bloom filter library set
membership operations.  It consists of a single SQL-callable function,
test_bloomfilter(), plus a regression test that calls test_bloomfilter().
Membership tests are performed against a dataset that the test harness module
generates.

The test_bloomfilter() function displays instrumentation at DEBUG1 elog level
(WARNING when the false positive rate exceeds a 1% threshold).  This can be
used to get a sense of the performance characteristics of the Postgres Bloom
filter implementation under varied conditions.

Bitset size
-----------

The main bloomfilter.c criteria for sizing its bitset is that the false
positive rate should not exceed 2% when sufficient bloom_work_mem is available
(and the caller-supplied estimate of the number of elements turns out to have
been accurate).  A 1% - 2% rate is currently assumed to be suitable for all
Bloom filter callers.

With an optimal K (number of hash functions), Bloom filters should only have a
1% false positive rate with just 9.6 bits of memory per element.  The Postgres
implementation's 2% worst case guarantee exists because there is a need for
some slop due to implementation inflexibility in bitset sizing.  Since the
bitset size is always actually kept to a power of two number of bits, callers
can have their bloom_work_mem argument truncated down by almost half.
In practice, callers that make a point of passing a bloom_work_mem that is an
exact power of two bitset size (such as test_bloomfilter.c) will actually get
the "9.6 bits per element" 1% false positive rate.

Testing strategy
----------------

Our approach to regression testing is to test that a Bloom filter has only a 1%
false positive rate for a single bitset size (2 ^ 23, or 1MB).  We test a
dataset with 838,861 elements, which works out at 10 bits of memory per
element.  We round up from 9.6 bits to 10 bits to make sure that we reliably
get under 1% for regression testing.  Note that a random seed is used in the
regression tests because the exact false positive rate is inconsistent across
platforms.  Inconsistent hash function behavior is something that the
regression tests need to be tolerant of anyway.

test_bloomfilter() SQL-callable function
========================================

The SQL-callable function test_bloomfilter() provides the following arguments:

* "power" is the power of two used to size the Bloom filter's bitset.

The minimum valid argument value is 23 (2^23 bits), or 1MB of memory.  The
maximum valid argument value is 32, or 512MB of memory.

* "nelements" is the number of elements to generate for testing purposes.

* "seed" is a seed value for hashing.

A value < 0 is interpreted as "use random seed".  Varying the seed value (or
specifying -1) should result in small variations in the total number of false
positives.

* "tests" is the number of tests to run.

This may be increased when it's useful to perform many tests in an interactive
session.  It only makes sense to perform multiple tests when a random seed is
used.