Skip to content

RSVP the world: BitSet vs BloomFilter

October 5, 2013

Imagine if as a Java developer you wanted to record RSVP status for invitees to a world party. You’d need to track responses for over 7 billion individuals. Both BitSets and Bloom filters are used to efficiently test an element for membership in a Set. Instead of storing the entire object in memory we can keep it in some larger data storage and use these data structures to tell us if the item we’re looking for is in the larger data store.

BitSets (BitArrays)

java.util.BitSet is the implementation in Java that allows setting a single bit in an array of long integers. BitSets do provide error-free Set membership testing unlike Bloom Filters, however this implementation has a serious limitation. Notice that the constructor for this class takes a Java “int” as a representation for the number of bits. Java’s Integer.MAX_VALUE is equal to 231-1 which is 2,147,483,647 (about 2.147 billion) which doesn’t even represent 1/3 of the Earth’s current population. Knowing that the underlying representation of these bits is simply an array of long integers, we know that the size of that array is 2^31/2^6 = 2^25 = 33,554,432 (33.5 million long ints). 2^6 = 64 which is the number of bits in a Java long integer. But we in Java we can have a much large array of long ints than that. The JVM bytecode to create arrays newarray takes an integer as it’s size count. So the maximum size of an array of long integers is close to 2^31 (a few elements less because of memory alignment and overhead).

Further down in this post, I’ve written some code that allows you to RSVP for this worldwide party by splitting the responses across multiple BitSets. Works very quickly for an RSVP list size of 10 BILLION on my Macbook Air.

Bloom Filters

Bloom Filters are one-way probabilistic data structures for testing Set membership. The “one-way error” means that the filter might tell you something is in the Set when it is NOT, but they will never tell you something is not in the set when it actually is. The reason for this is simple: 1 to k hash functions are used to find the position of the bit to set as true. The logic being if you had added an element to the Set those bits calculated by the hash function would have been all set to true, however some may have been set to true by hashing other elements which is why there are chances for false positives.

Large Sets a problem in Java because of hash code

Many implementations of BitSet and BloomFilter depend on Java int and that’s no coincidence. Java only allows arrays of int max size and the way we index into those arrays is usually with some sort of hash function like Object#hashCode() which returns an int. The problem with using int as a hash code is that if we wanted to hash many “long integers” then we have 2^32 possible hash collisions for EVERY INTEGER. So if we were to use a bloom filter to check for set membership for many long integers the odds of collisions would make it useless very quickly. Even Guava’s Bloom Filter implementation does its hashing with Java ints instead of longs.

One example I found that operates correctly with long integers is Apache Lucene’s OpenBitSet.

WorldPartyRsvp source code

Advertisements

From → Algorithms, Java

Leave a Comment

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

%d bloggers like this: