NAME Data::BloomFilter::Shared - shared-memory Bloom filter for Linux SYNOPSIS use Data::BloomFilter::Shared; # sized for 1_000_000 items at a 1% false-positive rate, anonymous mapping my $bf = Data::BloomFilter::Shared->new(undef, 1_000_000, 0.01); $bf->add("alice"); $bf->add("bob"); $bf->contains("alice"); # 1 (probably present) $bf->contains("carol"); # 0 (definitely absent) # "have I seen this before?" -- add returns 1 the first time, 0 after my $first = $bf->add("event-42"); # 1: probably new my $again = $bf->add("event-42"); # 0: all its bits were already set # bulk add in a single lock acquisition my $new = $bf->add_many([ map { "user-$_" } 1 .. 1000 ]); # merge another filter of identical geometry (bitwise OR -> union) my $other = Data::BloomFilter::Shared->new(undef, 1_000_000, 0.01); $other->add_many([ map { "user-$_" } 500 .. 1500 ]); $bf->merge($other); # share across processes via a backing file my $shared = Data::BloomFilter::Shared->new("/tmp/seen.bloom", 1_000_000, 0.01); DESCRIPTION A Bloom filter in shared memory: a compact, fixed-size structure for probabilistic set membership. You add items to it, then ask whether an item is in the set. The answer is either "definitely not present" or "probably present": the filter has no false negatives (if you added it, "contains" always returns true) but a tunable rate of false positives (it may occasionally report an item as present that was never added). It never stores the items themselves, only a bit array, so memory is proportional to the configured capacity and false-positive rate, not to the size of the items. Each item is hashed once with XXH3 (128-bit); the two 64-bit halves drive "k" probe positions into a power-of-two bit array via Kirsch-Mitzenmacher double hashing. "add" sets those "k" bits; "contains" reports present only if all "k" are set. The number of hashes "k" and the bit-array size are derived from the "capacity" and "fp_rate" you request. Because the bit array lives in a shared mapping, several processes share one filter: any process that opens the same backing file, inherits the anonymous mapping across "fork", or reopens a passed memfd, sees the others' additions and contributes its own. A write-preferring futex rwlock with dead-process recovery guards mutation, so many processes may "add" and "contains" concurrently. Two filters of identical geometry can be combined with "merge" (bitwise OR), which yields a filter whose membership is the union of the two input sets. Items are added and tested by their byte content; wide-character strings (any codepoint above 255) cause a "Wide character" croak -- encode such strings to bytes first (for example with "Encode::encode_utf8"). Linux-only. Requires 64-bit Perl. METHODS Constructors my $bf = Data::BloomFilter::Shared->new($path, $capacity, $fp_rate); my $bf = Data::BloomFilter::Shared->new(undef, 1_000_000); # fp_rate 0.01 my $bf = Data::BloomFilter::Shared->new_memfd($name, $capacity, $fp_rate); my $bf = Data::BloomFilter::Shared->new_from_fd($fd); $path is the backing file ("undef" or omitted for an anonymous mapping). $capacity is the number of items you expect to add; it must be at least 1. $fp_rate is the target false-positive rate at that capacity; it is optional, defaults to 0.01 (1%), and must be strictly between 0 and 1. "new" and "new_memfd" croak if $capacity is less than 1 or $fp_rate is out of range. From $capacity and $fp_rate the filter derives its geometry: "k = round(-log2(fp_rate))" hashes (clamped to the range 1..32), and a bit array of "m = next_power_of_two(ceil(capacity * k / ln 2))" bits (with a floor of 64 bits). Rounding the bit count up to a power of two means the realised false-positive rate at capacity is typically at or below the configured target. When reopening an existing file or memfd, the stored geometry wins and the caller's $capacity/$fp_rate arguments are ignored. "new_memfd" creates a Linux memfd (transferable via its "memfd" descriptor); "new_from_fd" reopens one in another process. Adding and testing my $new = $bf->add($item); # 1 if probably new, else 0 my $added = $bf->add_many(\@items); # count of items that were probably new my $in = $bf->contains($item); # 1 if probably present, 0 if definitely absent $bf->clear; # reset to empty (all bits 0) "add" hashes $item (taken by its bytes; wide characters croak, encode first) and sets its "k" bits, returning 1 if the item was probably new -- that is, if at least one of its bits was previously unset -- and 0 if all "k" bits were already set (the item was probably added before). A return of 0 is therefore a "probably seen this already" signal, subject to the same false-positive rate as "contains". "add_many" takes an array reference and does the whole batch under a single write lock, returning how many of its elements were probably new. "contains" returns 1 if the item is probably present and 0 if it is definitely absent. The contract is asymmetric and is the whole point of a Bloom filter: a 0 is exact (the item was never added), while a 1 may be a false positive (some other items happened to set all of this item's bits). There are never false negatives: any item you have added always reports as present. Merging $bf->merge($other); Folds $other's bit array into $bf by bitwise OR, so $bf then reports as present the union of the two filters' item sets (still with no false negatives). Both filters must have identical geometry -- the same number of bits and the same number of hashes, which follows from constructing both with the same $capacity and $fp_rate ("merge" croaks on a mismatch). $other is read under its own lock into a private snapshot first, so merging is deadlock-free even if two processes merge each other concurrently; $other is not modified. Introspection and lifecycle $bf->capacity; $bf->bits; $bf->hashes; $bf->fp_rate; $bf->count; $bf->stats; $bf->path; $bf->memfd; $bf->sync; $bf->unlink; # or Class->unlink($path) "capacity" is the configured item capacity; "bits" is the bit-array size in bits (a power of two); "hashes" is the number of hash probes "k"; "fp_rate" is the configured target false-positive rate. "count" returns an estimate of the number of distinct items added, computed from the fraction of bits set (accurate while the filter is not saturated, and capped at "capacity" once it is). "sync" flushes the mapping to its backing store (a no-op for anonymous and memfd filters, which have none); "unlink" removes the backing file (also callable as "Class->unlink($path)"); "path" returns the backing path ("undef" for anonymous, memfd, or fd-reopened filters) and "memfd" the backing descriptor -- the memfd of a "new_memfd" filter or the dup'd fd of a "new_from_fd" filter, and -1 for file-backed or anonymous filters. STATS stats() returns a hashref describing the filter: * "capacity" -- the configured item capacity. * "fp_rate" -- the configured target false-positive rate. * "bits" -- the bit-array size in bits (a power of two). * "hashes" -- the number of hash probes "k" per item. * "bits_set" -- how many bits are currently set. * "count" -- the estimated number of distinct items added. * "fill_ratio" -- "bits_set / bits", between 0 and 1. As this approaches 0.5 the filter is near its designed capacity and the realised false-positive rate approaches the configured target; well above that the rate climbs. * "ops" -- running count of mutating operations ("add", "add_many", "merge", "clear"). * "mmap_size" -- bytes of the shared mapping. SHARING ACROSS PROCESSES The filter lives in a shared mapping, shared the same three ways as the rest of the family: a backing file (every process calls "new($path, ...)" on the same path with matching capacity and rate), an anonymous mapping inherited across "fork", or a memfd whose descriptor is passed to an unrelated process (over a UNIX socket via "SCM_RIGHTS", or via "/proc/$pid/fd/$n") and reopened with new_from_fd($fd). Because the mapping is shared, every process adds into and tests against the same bit array, so membership reflects the union of what all of them have added. # producer and consumer share one filter with no coordination my $bf = Data::BloomFilter::Shared->new(undef, 100_000, 0.01); # before fork unless (fork) { $bf->add_many([ map { "ev-$_" } 1 .. 1000 ]); exit } wait; print $bf->contains("ev-500") ? "seen\n" : "no\n"; # seen -- the child's add SECURITY The mmap region is writable by all processes that open it. Do not share backing files with untrusted processes. CRASH SAFETY Mutation is guarded by a futex-based write-preferring rwlock with PID-encoded ownership; if a holder dies, the next contender detects the dead owner and recovers. Each bit set is a single word store, so a crash leaves the filter consistent up to the last completed "add". Limitation: PID reuse is not detected (very unlikely in practice). SEE ALSO Data::HyperLogLog::Shared, Data::Intern::Shared, Data::SortedSet::Shared, Data::SpatialHash::Shared, and the rest of the "Data::*::Shared" family. AUTHOR vividsnow LICENSE This is free software; you can redistribute it and/or modify it under the same terms as Perl itself.