NAME Data::CuckooFilter::Shared - shared-memory Cuckoo filter for Linux SYNOPSIS use Data::CuckooFilter::Shared; # sized for 1_000_000 items, anonymous mapping my $cf = Data::CuckooFilter::Shared->new(undef, 1_000_000); $cf->add("alice"); $cf->add("bob"); $cf->contains("alice"); # 1 (probably present) $cf->contains("carol"); # 0 (definitely absent) # unlike a Bloom filter, you can delete $cf->remove("alice"); $cf->contains("alice"); # 0 # add returns 0 when the table is full (a true no-op) my $ok = $cf->add("item"); # 1 if stored, 0 if full # bulk add in a single lock acquisition my $n = $cf->add_many([ map { "user-$_" } 1 .. 1000 ]); # share across processes via a backing file my $shared = Data::CuckooFilter::Shared->new("/tmp/seen.cuckoo", 1_000_000); DESCRIPTION A Cuckoo filter in shared memory: a compact, fixed-size structure for approximate set membership that, unlike a Bloom filter, supports delete. You add items, ask whether an item is present, and remove items you previously added. The membership answer is either "definitely not present" or "probably present": for items you have added (and not removed) "contains" always returns true -- there are no false negatives -- but there is a tiny rate of false positives (it may occasionally report an item as present that was never added). It never stores the items themselves, only a small fingerprint of each, so memory is proportional to the configured capacity, not to the size of the items. Each item is hashed once with XXH3 (128-bit). The high half yields a 16-bit fingerprint; the low half yields one candidate bucket, and partial-key cuckoo hashing derives a second candidate bucket from the first and the fingerprint. Each bucket holds four fingerprint slots. "add" stores the fingerprint in either candidate bucket, evicting and rehoming existing fingerprints (the "cuckoo" kick) when both are full. The false-positive rate is governed by the fingerprint width and bucket size and is approximately "2 * slots_per_bucket / 2**16" -- about 0.012% with the fixed 16-bit fingerprint and 4-slot buckets. The filter has a bounded capacity. When both candidate buckets are full and a bounded sequence of cuckoo evictions cannot rehome a fingerprint, "add" returns false and leaves the filter byte-for-byte unchanged -- a failed insert is a true no-op, so it never drops a previously stored fingerprint and never creates a false negative, even at the full boundary. A real-world filter accepts roughly its configured capacity (typically 95% or more) before reporting full. Because the table lives in a shared mapping, several processes share one filter: any process that opens the same backing file, inherits the anonymous mapping across "fork", or reopens a passed memfd, sees the others' additions and removals and contributes its own. A write-preferring futex rwlock with dead-process recovery guards mutation, so many processes may "add", "remove", and "contains" concurrently. Removal caveat. "remove" deletes one fingerprint matching the item. Because the filter stores only a 16-bit fingerprint, removing an item that was never added -- or one whose fingerprint happens to collide with a different present item -- may delete the wrong fingerprint and corrupt the filter (causing a false negative for some other item). Only remove items you actually added. There is no de-duplication: re-adding an item stores a second copy of its fingerprint (and "count" rises by one each time), so to fully forget an item you must "remove" it as many times as you added it. Cuckoo filters do not union cleanly, so there is no merge operation. Items are added, tested, and removed by their byte content; wide-character strings (any codepoint above 255) cause a "Wide character" croak -- encode such strings to bytes first (for example with "Encode::encode_utf8"). Linux-only. Requires 64-bit Perl. METHODS Constructors my $cf = Data::CuckooFilter::Shared->new($path, $capacity); my $cf = Data::CuckooFilter::Shared->new(undef, 1_000_000); # anonymous my $cf = Data::CuckooFilter::Shared->new_memfd($name, $capacity); my $cf = Data::CuckooFilter::Shared->new_from_fd($fd); $path is the backing file ("undef" or omitted for an anonymous mapping). $capacity is the number of items you expect to add; it must be at least 1. "new" and "new_memfd" croak if $capacity is less than 1. From $capacity the filter derives its geometry: a bucket array of "num_buckets = next_power_of_two(ceil(capacity / 4 / 0.95))" buckets (floor 2), each with four 16-bit fingerprint slots, for "4 * num_buckets" slots total. The 0.95 target load factor and the rounding up to a power of two mean the realised capacity at the full boundary is typically at or above the requested $capacity. When reopening an existing file or memfd, the stored geometry wins and the caller's $capacity argument is ignored. "new_memfd" creates a Linux memfd (transferable via its "memfd" descriptor); "new_from_fd" reopens one in another process. Adding, testing, removing my $ok = $cf->add($item); # 1 if stored, 0 if the table is full my $added = $cf->add_many(\@items); # count of items stored my $in = $cf->contains($item); # 1 if probably present, 0 if definitely absent my $gone = $cf->remove($item); # 1 if a fingerprint was removed, else 0 $cf->clear; # reset to empty "add" hashes $item (taken by its bytes; wide characters croak, encode first) and stores its fingerprint in one of its two candidate buckets, returning 1 on success. It returns 0 only when the table is full -- both candidate buckets are occupied and a bounded run of cuckoo evictions could not make room. A return of 0 is a true no-op: the filter is unchanged, so nothing you previously added is lost (no false negatives for added items, even when full). "add" does not de-duplicate: adding the same item twice stores its fingerprint twice and increments "count" by two. "add_many" takes an array reference and does the whole batch under a single write lock, returning how many elements were stored (the count of "add" calls that returned 1). "contains" returns 1 if the item is probably present and 0 if it is definitely absent. A 0 means definitely absent: an item you added and have not removed will never return 0 (there are no false negatives). A 1 may be a false positive (a different item happens to share a fingerprint and bucket). There are never false negatives for items that are currently stored. "remove" deletes one stored fingerprint of $item, returning 1 if one was found and cleared or 0 if none matched. See the Removal caveat in "DESCRIPTION": only remove items you added, and remove an item as many times as it was added to forget it completely. "clear" empties the whole filter (all slots zeroed, "count" reset to 0). Introspection and lifecycle $cf->count; $cf->capacity; $cf->buckets; $cf->slots; $cf->stats; $cf->path; $cf->memfd; $cf->sync; $cf->unlink; # or Class->unlink($path) "count" is the number of fingerprints currently stored (maintained exactly on every "add", "remove", and "clear"); since "add" stores duplicates, this is the number of live fingerprints, not the number of distinct items. "capacity" is the configured item capacity; "buckets" is the bucket count (a power of two); "slots" is the total fingerprint-slot count ("4 * buckets"). "sync" flushes the mapping to its backing store (a no-op for anonymous and memfd filters); "unlink" removes the backing file (also callable as "Class->unlink($path)"); "path" returns the backing path ("undef" for anonymous, memfd, or fd-reopened filters) and "memfd" the backing descriptor -- the memfd of a "new_memfd" filter or the dup'd fd of a "new_from_fd" filter, and -1 for file-backed or anonymous filters. There is deliberately no merge method: cuckoo filters cannot be unioned by a simple element-wise operation the way Bloom filters can. STATS stats() returns a hashref describing the filter: * "capacity" -- the configured item capacity. * "buckets" -- the bucket count (a power of two). * "slots" -- the total number of fingerprint slots ("4 * buckets"). * "count" -- the number of fingerprints currently stored. * "fill_ratio" -- "count / slots", between 0 and 1. As this approaches 1 the table is near full and "add" begins to fail; in practice inserts start to fail somewhat below a full table. * "ops" -- running count of write-path calls ("add", "add_many", "remove", "clear"), whether or not any fingerprint was actually stored or removed. * "mmap_size" -- bytes of the shared mapping. SHARING ACROSS PROCESSES The filter lives in a shared mapping, shared the same three ways as the rest of the family: a backing file (every process calls "new($path, ...)" on the same path with a matching capacity), an anonymous mapping inherited across "fork", or a memfd whose descriptor is passed to an unrelated process (over a UNIX socket via "SCM_RIGHTS", or via "/proc/$pid/fd/$n") and reopened with new_from_fd($fd). Because the mapping is shared, every process adds into, tests against, and removes from the same table, so membership reflects the combined effect of what all of them have done. # producer and consumer share one filter with no coordination my $cf = Data::CuckooFilter::Shared->new(undef, 100_000); # before fork unless (fork) { $cf->add_many([ map { "ev-$_" } 1 .. 1000 ]); exit } wait; print $cf->contains("ev-500") ? "seen\n" : "no\n"; # seen -- the child's add SECURITY The mmap region is writable by all processes that open it. Do not share backing files with untrusted processes. CRASH SAFETY Mutation is guarded by a futex-based write-preferring rwlock with PID-encoded ownership; if a holder dies, the next contender detects the dead owner and recovers. Each "add" commits with a single fingerprint store (or, on the eviction path, an all-or-nothing sequence that rolls back on failure), so a crash leaves the filter consistent up to the last completed operation. Limitation: PID reuse is not detected (very unlikely in practice). SEE ALSO Data::BloomFilter::Shared (membership without delete), Data::HyperLogLog::Shared, Data::Intern::Shared, Data::SortedSet::Shared, Data::SpatialHash::Shared, and the rest of the "Data::*::Shared" family. AUTHOR vividsnow LICENSE This is free software; you can redistribute it and/or modify it under the same terms as Perl itself.