Skip to main content

Bloom Filters - Explained

The Bloom filter is a space efficient, probabilistic data structure, designed to test the membership of elements to a set. The trade-off for being a space efficient data structure is it may return false positives, but always returns definite negatives: Meaning Bloom filters can accurately test an element for non-membership to a set, but can only with probability test an element for membership. Bloom filters find application in circumstances where testing for non-membership saves resources such as calls to a web server, checking a proxy cache. Google uses Bloom filters in the Chrome browser as a preliminary check for malicious URL's.



Internally Bloom filters use a binary-array, and multiple hash functions. Lets say for instance we have a bit-array of a 100 elements, and 3 hash functions. We want to insert the word "Singapore" into the filter, so we pass it through hash 1 - returns 33; hash 2 - returns 7; hash 3 - returns 22. Next we go to each of those indices of the array and set them to 1, we are done and the word has been included for testing. To test a word for membership, we pass the word through all three hash functions, get the three hashes and check the corresponding indices of the bit-array. If all 3 indices are set to 1, then its a probable true; if any one of those indices are set to zero, it's a definite false. Below is an example...


The efficiency of Bloom filter relies upon the number of bits in the array, the number of hash functions, and most importantly the quality of the hash functions. In a real world example, we would use 14 hash functions, and a binary-array of millions of elements. xxHash is one of the higher quality hash functions available, other examples are MurmurHash, FNV Hash, Jenkins Hashes, pseudo-number generators will also do. Uniform Distribution of hash functions being a key factor in quality. Here are the formulas invloved in the designing of a bloom filter:


m - number of elements in bit array
k - number of hash functions
n - number of items in collection
p - desired false positive probability // 0.0 - 1.0
^ - power


m = -((n*ln(p))/(ln(2)^2));
k = (m/n) * ln(2);
n = (m * ln(2))/k;
p = e^(-(m/n) * (ln(2)^2)); // estimated probability of false positives
// lets do a test, i know n = 1000,
// and i desire a p = 0.01 :
m = 9585.058 = 9586
k = 6.644 = 7


// so for 100 items in a collection,
// a desired probability of 1% false positives
// you need a bit array of 9586 elements and 7 hash functions */

Thank you for reading - I'll keep on editing this tutorial in the future.

Link outs :
http://en.wikipedia.org/wiki/Bloom_filter
http://www.jasondavies.com/bloomfilter/

Comments

  1. Thank you for this. What a simple and profound idea!

    ReplyDelete
  2. Magnificient Meerkat13 January 2015 at 13:22

    My gratitude as well, I just learnt something.

    ReplyDelete
  3. Awesome! The way you explain concepts is just great

    ReplyDelete
  4. Awesome! The way you explain concepts is just great

    ReplyDelete
  5. I'am glad to read the whole content of this blog and am very excited.Thank you.

    หีฟิต

    ReplyDelete

Post a Comment

Popular posts from this blog

Markov Chains - Explained

Markov Chains is a probabilistic process, that relies on the current state to predict the next state. For Markov chains to be effective the current state has to be dependent on the previous state in some way; For instance, from experience we know that if it looks cloudy outside, the next state we expect is rain. We can also say that when the rain starts to subside into cloudiness, the next state will most likely be sunny. Not every process has the Markov Property, such as the Lottery, this weeks winning numbers have no dependence to the previous weeks winning numbers.

The Genetic Algorithm - Explained

Paradigm of Evolution Every organism contains an array of traits - called DNA, when 2 parents mate they produce a child containing a mixture of their DNA. Depending on how well those traits work together to help the child survive so that he may reproduce, will determine if those traits will pass into future generation. Rarely a random trait enters a child's DNA that was not inherited from the parents, we call this mutation. If the mutation is beneficial, the organism will survive long enough to carry the mutation over to future generations. So the cycle continues, after many generations we continue to optimize the population by "mixing and matching", "trial and error" of  DNA.

Introduction Genetic Algorithms are used to find optimal solution by method of evolution-inspired search and optimization; Generally used in problems where linear/brute-force searches are not viable in terms of time, such as – Travelling Salesman Problem, Timetable Scheduling, Finding Ne…