Web Caching with Dynamic Content Seminar - White Paper

The Story

It never pays to be before your time.   It is also not helpful if the trend of technology is marching away from you.   This would be the actual outcome of a simple and clever algorithm which is believed to be first published by Burton H. Bloom in the Communications of the ACM in 1970 ( Volume 13 / Number 7 / July 1970 ).  
It is obvious how the Bloom filter works.   It is not obvious how well the bloom filter works.
While actually authored in 1969, at a time close to the initial conceptions of the early ARPA network, it would take almost 25 years before the modern commercial internet would bring back the practical use of Bloom's idea.   However, between these times, most of the characteristics of mainframe and desktop systems would make Bloom's efforts less relevant.   Most people are surprised at the simplicity of this algorithm.   It is obvious how the Bloom filter works.   It is not obvious how well the Bloom filter works.  

The Scenario

Imagine, if you will, the desire to store boolean information in a bit array.   A very simple premise.   You simply assign each element in the bit array to a meaning and then assign it a value.   In this scenario, it takes 1 bit in the array to store 1 bit of stored information.   The bit array faithfully represents its appropriate value with 100 percent accurately.   This, of course, works best when the stored data is array oriented like a transient over time or space.   But what if, the data you wish to store is not a linear transient oriented dataset?  

Bloom's way

In his publication, Bloom suggests using a "Hash coding with Allowable Errors" algorithm to help word processors perform capitalization and/or hyphenation on a document.   An algorithm that could use less space and be faster than a conventional one to one mapping algorithm.   In both of these example cases, a majority of words ( 90% for example ) could be checked using a simple rule, while the smaller minority set could be solved with special cases and an exception list used to catch the instances where the algorithm would report a word as simply solvable when it was not.   His motivation was to reduce the time it took to lookup data from a slow storage device to faster 'core memory'.   At the time, is was unlikely typical memory systems could hold the entire ruleset for such a task in system memory.   Blooms way could dramatically improve the performance with existing hardware.  

What we want to do

Bloom's algorithm lets you store the existence of stored input in a bit array, however you must allow the possibility of a false positive when extracting this information.   For this example, we will assume you wish to optimize the performance of your e-commerce website.   You have customers.   These customers have shopping baskets.   Most of them are empty ( the baskets, not the shoppers ).   All of your dynamically generated web pages must show the contents of the shopping basket or indicate if there is content in the basket with an icon.   This is accomplished with a fragment of HTML that indicates basket status.   Currently, you hit the SQL database for each generated web page to determine the status of a shoppers cart by querying the database using the browser's cookie as the index.   This consumes precious time while the user is waiting for the page to be generated.   Typically, makes the database the bottle neck in web applications.   We can optimize this by building a 'Bloom filter' that will hold the status of web shoppers baskets.   The Bloom filter will accurately tell us if the basket is empty and with a high degree of accuracy if it has content.   This false positive case is acceptable as long as the rate is low, because for each TRUE case we get back from the Bloom filter, we will hit the database to get the actual basket content.  

How it works

To store the status of a user's filled basket in a Bloom bit array N bits long, take the user's cookie and produce a high quality hashed number.   Then, take the first N bits from the result of the hashed number and set the index N in the bit array to TRUE.  



We repeat this about 5 to 20 times setting 5 to 20 bits, respectively, in the bit array using the next consecutive N bits from the hashed output.   The bit array starts out reset ( FALSE ), before we start to populate the bit array with data.   As the array becomes more populated, sometimes we will set a bit that has already been set.   This is the root of the false positive cases we will later examine.   When we wish to examine the basket status of a user, we proceed in almost the exact steps except we read the bits from the bit array instead of setting them.   If any of the bits are zero ( FALSE ), the basket is guaranteed to be empty.   If all of the bits are set ( TRUE ), the basket likely has content and the SQL database should be queried for a precise answer.  

Tuning the Bloom filter

The saturation is defined to be the percent of bits set to TRUE in the bit array.   The phase is defined to be the number of times we attempt to set a bit in the array ( 5 to 20 times in this example ).   Both of these variables can be modified to change the accuracy and capacity of the Bloom filter.   Generally speaking, the larger the size of the bit array (N) and the higher the phase, the less likely a
False_Positive = SaturationPhase
false positive response will occur.   Bloom asserted that the optimum performance of this algorithm occurs when saturation of the bit array is 50%.   Statistically, the chance of a false positive can be determined by taking the saturation and raising it to the power of the phase.   Other equations are available to help tune the Bloom filter algorithm also.  
Copyright © 2000 Beach Software