NAME Data::CountMinSketch::Shared - shared-memory Count-Min sketch for Linux SYNOPSIS use Data::CountMinSketch::Shared; # epsilon (error factor) 0.1%, delta (failure prob) 0.1%, anonymous mapping my $cms = Data::CountMinSketch::Shared->new(undef, 0.001, 0.001); $cms->add("alice"); # count "alice" once $cms->add("bob", 5); # count "bob" five times $cms->estimate("alice"); # 1 (never less than the true count) $cms->estimate("bob"); # 5 $cms->estimate("carol"); # 0 (never added) # bulk add in a single lock acquisition (each element counted once) $cms->add_many([ map { "user-$_" } 1 .. 1000 ]); # merge another sketch of identical geometry (cellwise add -> summed streams) my $other = Data::CountMinSketch::Shared->new(undef, 0.001, 0.001); $other->add_many([ map { "user-$_" } 500 .. 1500 ]); $cms->merge($other); # share across processes via a backing file my $shared = Data::CountMinSketch::Shared->new("/tmp/freq.cms", 0.001, 0.001); DESCRIPTION A Count-Min sketch in shared memory: a compact, fixed-size structure for approximate frequency estimation over a stream. You add items (optionally with a count), then ask for the estimated number of times any item has been added. Memory is proportional to the configured error parameters, not to the number of distinct items or the size of the items; the sketch never stores the items themselves, only a small matrix of counters. The estimate has a one-sided guarantee: it never underestimates the true count, and overestimates by at most "epsilon * total" with probability at least "1 - delta", where "total" is the sum of all increments. (An item never added estimates as 0 unless hash collisions with other items inflate every one of its cells.) This makes the sketch ideal for finding heavy hitters and approximate counts in a stream that is too large to count exactly. Each item is hashed once with XXH3 (128-bit); the two 64-bit halves drive one column per row ("d"-row double hashing) into a "d" x "w" matrix of 64-bit counters, with "w" a power of two. "add" increments the "d" cells of the item (one per row); "estimate" returns the minimum of those "d" cells -- since every collision only ever adds to a cell, the smallest cell is the tightest upper bound, and is exact when at least one of the item's cells suffered no collision. The matrix width "w" and depth "d" are derived from the "epsilon" and "delta" you request. Because the matrix lives in a shared mapping, several processes share one sketch: any process that opens the same backing file, inherits the anonymous mapping across "fork", or reopens a passed memfd, sees the others' additions and contributes its own. A write-preferring futex rwlock with dead-process recovery guards mutation, so many processes may "add" and "estimate" concurrently. Two sketches of identical geometry can be combined with "merge" (cellwise add), which yields a sketch whose counts are the sum of the two input streams -- the merged estimate of any item equals the sum of its estimates in the two inputs. Items are added and queried by their byte content; wide-character strings (any codepoint above 255) cause a "Wide character" croak -- encode such strings to bytes first (for example with "Encode::encode_utf8"). Linux-only. Requires 64-bit Perl. METHODS Constructors my $cms = Data::CountMinSketch::Shared->new($path, $epsilon, $delta); my $cms = Data::CountMinSketch::Shared->new(undef, 0.001, 0.001); # defaults my $cms = Data::CountMinSketch::Shared->new_memfd($name, $epsilon, $delta); my $cms = Data::CountMinSketch::Shared->new_from_fd($fd); $path is the backing file ("undef" or omitted for an anonymous mapping). $epsilon is the target error factor and $delta the target failure probability; both are optional, default to 0.001, and must be strictly between 0 and 1. "new" and "new_memfd" croak if $epsilon or $delta is out of range. From $epsilon and $delta the sketch derives its geometry: a width of "w = next_power_of_two(ceil(e / epsilon))" columns (with a floor of 2 columns) and a depth of "d = ceil(ln(1 / delta))" rows (clamped to the range 1..32). Rounding the width up to a power of two means the realised error factor at any given total is typically at or below the configured target. When reopening an existing file or memfd, the stored geometry wins and the caller's $epsilon/$delta arguments are ignored. "new_memfd" creates a Linux memfd (transferable via its "memfd" descriptor); "new_from_fd" reopens one in another process. This is the standard Count-Min sketch (plain cell increments); it does not use the conservative-update variant, which would make "merge" unsound. The plain construction is what guarantees that merging two sketches is exactly equivalent to having counted both streams into one. Adding and estimating my $total = $cms->add($item); # add 1; returns the new grand total my $total = $cms->add($item, $n); # add $n; returns the new grand total my $count = $cms->add_many(\@items); # add 1 per element; returns how many added my $est = $cms->estimate($item); # estimated count of $item (>= true count) $cms->clear; # reset every counter (and total) to 0 "add" hashes $item (taken by its bytes; wide characters croak, encode first) and increments its "d" cells by $n (default 1), returning the new grand total -- the running sum of all increments across all items. $n is an unsigned integer. "add_many" takes an array reference and adds each element once under a single write lock, returning the number of elements added (the array's length). "estimate" returns the estimated number of times $item has been added: the minimum of its "d" cells. This value never underestimates the true count, and exceeds it by at most "epsilon * total" with probability at least "1 - delta". An item that was never added estimates as 0 unless every one of its cells happens to collide with other items. Merging $cms->merge($other); Folds $other's counter matrix into $cms by cellwise addition, so $cms then estimates, for every item, the sum of that item's counts in the two sketches; $cms's "total" likewise becomes the sum of the two totals. Both sketches must have identical geometry -- the same width and depth, which follows from constructing both with the same $epsilon and $delta ("merge" croaks on a mismatch). $other is read under its own lock into a private snapshot first, so merging is deadlock-free even if two processes merge each other concurrently; $other is not modified. Cells that would overflow a 64-bit counter saturate at the maximum value. Introspection and lifecycle $cms->total; $cms->width; $cms->depth; $cms->cells; $cms->stats; $cms->path; $cms->memfd; $cms->sync; $cms->unlink; # or Class->unlink($path) "total" is the running sum of all increments; "width" is the column count "w" (a power of two); "depth" is the row count "d"; "cells" is "width * depth", the number of counters. "sync" flushes the mapping to its backing store (a no-op for anonymous and memfd sketches, which have none); "unlink" removes the backing file (also callable as "Class->unlink($path)"); "path" returns the backing path ("undef" for anonymous, memfd, or fd-reopened sketches) and "memfd" the backing descriptor -- the memfd of a "new_memfd" sketch or the dup'd fd of a "new_from_fd" sketch, and -1 for file-backed or anonymous sketches. STATS stats() returns a hashref describing the sketch: * "width" -- the column count "w" (a power of two). * "depth" -- the row count "d". * "total" -- the running sum of all increments. * "cells" -- "width * depth", the number of 64-bit counters. * "epsilon" -- the achieved error factor, "e / width". The per-item overestimate is bounded by "epsilon * total" (with probability "1 - delta"); a smaller value is a tighter bound. * "delta" -- the achieved failure probability, exp(-depth). This is the chance that the overestimate exceeds the "epsilon * total" bound for a given item; a smaller value is a stronger guarantee. * "ops" -- running count of mutating operations ("add", "add_many", "merge", "clear"). * "mmap_size" -- bytes of the shared mapping. SHARING ACROSS PROCESSES The sketch lives in a shared mapping, shared the same three ways as the rest of the family: a backing file (every process calls "new($path, ...)" on the same path with matching epsilon and delta), an anonymous mapping inherited across "fork", or a memfd whose descriptor is passed to an unrelated process (over a UNIX socket via "SCM_RIGHTS", or via "/proc/$pid/fd/$n") and reopened with new_from_fd($fd). Because the mapping is shared, every process adds into and estimates against the same counter matrix, so the counts reflect the combined stream all of them have added. # producer and consumer share one sketch with no coordination my $cms = Data::CountMinSketch::Shared->new(undef, 0.001, 0.001); # before fork unless (fork) { $cms->add("ev-500") for 1 .. 10; exit } wait; print $cms->estimate("ev-500"), "\n"; # >= 10 -- the child's adds SECURITY The mmap region is writable by all processes that open it. Do not share backing files with untrusted processes. CRASH SAFETY Mutation is guarded by a futex-based write-preferring rwlock with PID-encoded ownership; if a holder dies, the next contender detects the dead owner and recovers. Each cell increment is a single word store, so a crash leaves the sketch consistent up to the last completed "add". Limitation: PID reuse is not detected (very unlikely in practice). SEE ALSO Data::BloomFilter::Shared, Data::HyperLogLog::Shared, Data::Intern::Shared, Data::SortedSet::Shared, Data::SpatialHash::Shared, and the rest of the "Data::*::Shared" family. AUTHOR vividsnow LICENSE This is free software; you can redistribute it and/or modify it under the same terms as Perl itself.