NAME Data::RoaringBitmap::Shared - shared-memory Roaring bitmap (compressed uint32 set) for Linux SYNOPSIS use Data::RoaringBitmap::Shared; # anonymous shared mapping: a 256-slot container pool my $a = Data::RoaringBitmap::Shared->new(undef, 256); $a->add(5); # 1 (newly added) $a->add(5); # 0 (already present) $a->contains(5); # true ('test' is an alias) $a->remove(5); # 1 (removed; 'delete' is an alias) $a->add_many([1, 2, 3, 1000, 70000]); # returns count newly added $a->cardinality; # number of elements ('count'/'size' are aliases) $a->min; $a->max; # smallest / largest element (undef if empty) my $list = $a->to_array; # arrayref of every element, ascending # in-place set operations against another bitmap my $b = Data::RoaringBitmap::Shared->new(undef, 256); $b->add_many([3, 4, 5]); $a->union($b); # a |= b ('or' is an alias); returns $a $a->intersect($b); # a &= b ('and' is an alias); returns $a # share across processes via a backing file my $shared = Data::RoaringBitmap::Shared->new("/tmp/ids.bin", 65536); DESCRIPTION A Roaring bitmap in shared memory: a compressed set of 32-bit unsigned integers. The 32-bit value space is split into 65536 buckets keyed by the high 16 bits of each value; each bucket stores the low 16 bits of its members in one of two container kinds, chosen automatically by density: * an array container -- a sorted ascending "uint16" array, compact when the bucket is sparse (up to 4096 elements); * a bitmap container -- a 65536-bit bitmap, efficient when the bucket is dense. A bucket starts as an array and is promoted to a bitmap once it would exceed 4096 elements. Both kinds occupy one fixed 8192-byte slot from a container pool, so the structure stays compact for sparse, clustered, and dense sets alike. It supports membership tests, cardinality, "min"/"max", "to_array", and in-place "union" and "intersect" with another bitmap. Because the bucket table and container pool live in a shared mapping, several processes share one bitmap: any process that opens the same backing file, inherits the anonymous mapping across "fork", or reopens a passed memfd, sees the others' elements. A write-preferring futex rwlock with dead-process recovery guards every mutation, so concurrent adds, removals, and set operations serialize cleanly. Values must fit in 32 bits (0 .. 4294967295); a value outside that range croaks on "add"/"add_many" and is simply absent for "contains"/"remove". v1 scope: array + bitmap containers (no run containers); union and intersect only (no "xor" / "andnot" yet); a bitmap container is not down-converted back to an array when removals make it sparse again. Capacity is fixed at creation. The container pool holds "container_capacity" slots; "add", "add_many" and "union" croak (after releasing the lock) when the pool is exhausted. A bitmap that touches every bucket as a dense bitmap needs all 65536 slots (512 MiB of pool); a sparse set needs far fewer. Linux-only. Requires 64-bit Perl. METHODS Constructors my $a = Data::RoaringBitmap::Shared->new($path, $container_capacity); my $a = Data::RoaringBitmap::Shared->new(undef, $container_capacity); # anonymous my $a = Data::RoaringBitmap::Shared->new_memfd($name, $container_capacity); my $a = Data::RoaringBitmap::Shared->new_from_fd($fd); $path is the backing file ("undef" or omitted for an anonymous mapping). $container_capacity (default 256) is the number of 8192-byte container slots the pool holds; it must be ">= 1" and "<= 2**20". One slot is consumed per non-empty bucket (a bucket is one container regardless of whether it is an array or a bitmap), so size the pool for the number of distinct high-16 groups your values fall into, with headroom. "new" and "new_memfd" croak if the capacity is out of range. A freshly created bitmap is empty ("cardinality == 0"). When reopening an existing file or memfd, the stored geometry wins and the existing elements are preserved; the capacity you pass to "new" on a reopen is used only when the file is brand new. "new_memfd" creates a Linux memfd (transferable via its "memfd" descriptor); "new_from_fd" reopens one in another process. Adding and removing my $new = $a->add($x); # 1 if newly added, 0 if already present my $new = $a->set($x); # alias for add my $added = $a->add_many(\@ints); # count of elements newly added my $gone = $a->remove($x); # 1 if removed, 0 if absent my $gone = $a->delete($x); # alias for remove "add" (aliased "set") inserts $x, returning 1 if it was new or 0 if it was already present. $x must be an unsigned integer that fits in 32 bits; a larger value croaks before any lock is taken. "add" croaks if the container pool is exhausted (only adding to a brand-new bucket can need a slot); the croak happens after the write lock is released and the bitmap is left consistent. "add_many" adds every value of the array reference (each range-checked up front; an out-of-range value croaks before any lock). It returns the number of elements that were newly added. It is not atomic on pool exhaustion: values are added in order, and if the pool runs out partway, "add_many" croaks with the already-processed elements left as members (a set is order-independent, so the added ones are simply present). "remove" (aliased "delete") removes $x, returning 1 if it was present or 0 if it was absent (an out-of-range value is treated as absent). When the last element of a bucket is removed, the bucket's container slot is returned to the pool. As noted in "DESCRIPTION", a bitmap container is not converted back to an array on removal in v1. Membership, cardinality, ordering my $bool = $a->contains($x); # true if $x is a member my $bool = $a->test($x); # alias for contains my $n = $a->cardinality; # number of elements my $n = $a->count; # alias for cardinality my $n = $a->size; # alias for cardinality my $bool = $a->is_empty; # true when cardinality == 0 my $lo = $a->min; # smallest element, or undef if empty my $hi = $a->max; # largest element, or undef if empty my $list = $a->to_array; # arrayref of all elements, ascending "contains" (aliased "test") is a read; an out-of-range value is simply not a member. "cardinality" (aliased "count"/"size") is the total number of elements across all buckets. "min" and "max" scan for the smallest/largest member and return "undef" on an empty set. "to_array" returns a reference to a new array holding every element in ascending order; for a large dense set this can be a long list. (The array is pre-sized under a brief read lock and filled under the read lock; it is a best-effort snapshot if the set is being mutated concurrently.) Set operations (in place) $a->union($other); # a |= b (set union); returns $a $a->or($other); # alias for union $a->intersect($other); # a &= b (set intersection); returns $a $a->and($other); # alias for intersect "union" (aliased "or") adds every element of $other to the receiver; "intersect" (aliased "and") keeps only the elements present in both. Both modify the receiver in place and return it (so calls chain). $other must be a "Data::RoaringBitmap::Shared" object; combining a bitmap with itself ("$a->union($a)") is a no-op. "union" may need new container slots (one per bucket that $other occupies and the receiver does not); it pre-checks capacity and croaks before mutating (after releasing the locks) if the pool cannot satisfy the operation, leaving the receiver unchanged. "intersect" only ever shrinks or frees containers, so it never needs new slots and never croaks for capacity. Locking note: a set operation takes the receiver's write lock and $other's read lock. To stay deadlock-free even if two processes run "$a->union($b)" and "$b->union($a)" simultaneously, the two locks are acquired in a fixed order keyed on a per-bitmap identity stored in shared memory (so both processes compute the same order regardless of how the handles happen to be laid out in each process), and the receiver always takes the write lock. Calling a set op with two handles to the same underlying bitmap (the same object, or a second handle that reopened the same backing file or memfd) is detected and is a no-op. Lifecycle $a->clear; # remove every element $a->path; $a->memfd; $a->sync; $a->unlink; # or Class->unlink($path) "clear" empties the bitmap and returns every container slot to the pool. "sync" flushes the mapping to its backing store (a no-op for anonymous and memfd bitmaps); "unlink" removes the backing file (also callable as "Class->unlink($path)"); "path" returns the backing path ("undef" for anonymous, memfd, or fd-reopened bitmaps) and "memfd" the backing descriptor -- the memfd of a "new_memfd" bitmap or the dup'd fd of a "new_from_fd" bitmap, and -1 for file-backed or anonymous bitmaps. STATS stats() returns a hashref describing the bitmap: * "cardinality" -- the total number of elements (same as "cardinality"). * "containers_used" -- the 1-based high-water of container slots allocated since creation or the last "clear" (slot 0 is the reserved NULL sentinel, so an empty pool reports 1). * "containers_capacity" -- the fixed container-pool capacity. * "buckets_used" -- the number of non-empty buckets (each is one container, array or bitmap). * "ops" -- running count of mutating operations ("add", "add_many", "remove", "union", "intersect", "clear"). * "mmap_size" -- bytes of the shared mapping. SHARING ACROSS PROCESSES The bitmap lives in a shared mapping, shared the same three ways as the rest of the family: a backing file (every process calls "new($path, ...)" on the same path), an anonymous mapping inherited across "fork", or a memfd whose descriptor is passed to an unrelated process (over a UNIX socket via "SCM_RIGHTS", or via "/proc/$pid/fd/$n") and reopened with new_from_fd($fd). Because the mapping is shared, every process adds to and queries the same bitmap. All mutation is serialized by the write lock, so the final contents are independent of how the processes interleave. # parent and children share one bitmap with no coordination my $a = Data::RoaringBitmap::Shared->new(undef, 65536); # before fork unless (fork) { $a->add($_) for 1 .. 100000; exit } wait; print $a->cardinality, "\n"; # reflects the child's adds SECURITY The mmap region is writable by all processes that open it, and a reopened file is trusted: validation checks the header geometry but cannot vet every internal container offset against a maliciously crafted file. A hostile file could direct a read to an out-of-range container slot. Do not share backing files with, or reopen files from, untrusted processes. CRASH SAFETY Mutation is guarded by a futex-based write-preferring rwlock with PID-encoded ownership; if a holder dies, the next contender detects the dead owner and recovers. Because every mutation performs its container allocations under the lock after a capacity pre-check, a crash leaves the bitmap consistent up to the last completed operation. Limitation: PID reuse is not detected (very unlikely in practice). SEE ALSO Data::RadixTree::Shared, Data::DisjointSet::Shared, Data::Intern::Shared, Data::SortedSet::Shared, Data::SpatialHash::Shared, Data::Histogram::Shared, Data::CountMinSketch::Shared, Data::HyperLogLog::Shared, Data::BloomFilter::Shared, and the rest of the "Data::*::Shared" family. AUTHOR vividsnow LICENSE This is free software; you can redistribute it and/or modify it under the same terms as Perl itself.