Bloom Filters - Explained ~ Tech-Effigy

The Bloom filter is a space efficient, probabilistic data structure, designed to test the membership of elements to a set. The trade-off for being a space efficient data structure is it may return false positives, but always returns definite negatives: Meaning Bloom filters can accurately test an element for non-membership to a set, but can only with probability test an element for membership. Bloom filters find application in circumstances where testing for non-membership saves resources such as calls to a web server, checking a proxy cache. Google uses Bloom filters in the Chrome browser as a preliminary check for malicious URL's.

Internally Bloom filters use a binary-array, and multiple hash functions. Lets say for instance we have a bit-array of a 100 elements, and 3 hash functions. We want to insert the word "Singapore" into the filter, so we pass it through hash 1 - returns 33; hash 2 - returns 7; hash 3 - returns 22. Next we go to each of those indices of the array and set them to 1, we are done and the word has been included for testing. To test a word for membership, we pass the word through all three hash functions, get the three hashes and check the corresponding indices of the bit-array. If all 3 indices are set to 1, then its a probable true; if any one of those indices are set to zero, it's a definite false. Below is an example...

The efficiency of Bloom filter relies upon the number of bits in the array, the number of hash functions, and most importantly the quality of the hash functions. In a real world example, we would use 14 hash functions, and a binary-array of millions of elements. xxHash is one of the higher quality hash functions available, other examples are MurmurHash, FNV Hash, Jenkins Hashes, pseudo-number generators will also do. Uniform Distribution of hash functions being a key factor in quality. Here are the formulas invloved in the designing of a bloom filter:

m - number of elements in bit array
k - number of hash functions
n - number of items in collection
p - desired false positive probability // 0.0 - 1.0
^ - power

m = -((n*ln(p))/(ln(2)^2));
k = (m/n) * ln(2);
n = (m * ln(2))/k;
p = e^(-(m/n) * (ln(2)^2)); // estimated probability of false positives
// lets do a test, i know n = 1000,
// and i desire a p = 0.01 :
m = 9585.058 = 9586
k = 6.644 = 7

// so for 100 items in a collection,
// a desired probability of 1% false positives
// you need a bit array of 9586 elements and 7 hash functions */

Thank you for reading - I'll keep on editing this tutorial in the future.

Link outs :
http://en.wikipedia.org/wiki/Bloom_filter
http://www.jasondavies.com/bloomfilter/

7 comments:

Anonymous9 January 2015 at 22:21
Thank you for this. What a simple and profound idea!
Magnificient Meerkat13 January 2015 at 13:22
My gratitude as well, I just learnt something.
Luca Amore18 February 2015 at 01:46
Great blog! Thank you!
Unknown14 February 2016 at 05:56
Awesome! The way you explain concepts is just great
Unknown14 February 2016 at 05:57
Awesome! The way you explain concepts is just great
vaiybora29 November 2017 at 18:39
I'am glad to read the whole content of this blog and am very excited.Thank you.

หีฟิต
Top SEO26 November 2020 at 08:57
This is my first time i visit here. I found so many interesting stuff in your blog especially its discussion. From the tons of comments on your articles, I guess I am not the only one having all the enjoyment here keep up the good work Fox Farm Nutrients

Tech-Effigy

Concepts made Simple

Thursday 8 January 2015

Bloom Filters - Explained

7 comments:

Search

Popular Posts

Recommended Books

Categories

Blog Archive

Total Pageviews