|
Web Caching with Dynamic Content Seminar - White Paper The Story It never pays to be before your time. It is also not helpful if the trend of technology is marching away from you. This would be the actual outcome of a simple and clever algorithm which is believed to be first published by Burton H. Bloom in the Communications of the ACM in 1970 ( Volume 13 / Number 7 / July 1970 ).
The Scenario Imagine, if you will, the desire to store boolean information in a bit array. A very simple premise. You simply assign each element in the bit array to a meaning and then assign it a value. In this scenario, it takes 1 bit in the array to store 1 bit of stored information. The bit array faithfully represents its appropriate value with 100 percent accurately. This, of course, works best when the stored data is array oriented like a transient over time or space. But what if, the data you wish to store is not a linear transient oriented dataset? Bloom's way In his publication, Bloom suggests using a "Hash coding with Allowable Errors" algorithm to help word processors perform capitalization and/or hyphenation on a document. An algorithm that could use less space and be faster than a conventional one to one mapping algorithm. In both of these example cases, a majority of words ( 90% for example ) could be checked using a simple rule, while the smaller minority set could be solved with special cases and an exception list used to catch the instances where the algorithm would report a word as simply solvable when it was not. His motivation was to reduce the time it took to lookup data from a slow storage device to faster 'core memory'. At the time, is was unlikely typical memory systems could hold the entire ruleset for such a task in system memory. Blooms way could dramatically improve the performance with existing hardware. What we want to do Bloom's algorithm lets you store the existence of stored input in a bit array, however you must allow the possibility of a false positive when extracting this information. For this example, we will assume you wish to optimize the performance of your e-commerce website. You have customers. These customers have shopping baskets. Most of them are empty ( the baskets, not the shoppers ). All of your dynamically generated web pages must show the contents of the shopping basket or indicate if there is content in the basket with an icon. This is accomplished with a fragment of HTML that indicates basket status. Currently, you hit the SQL database for each generated web page to determine the status of a shoppers cart by querying the database using the browser's cookie as the index. This consumes precious time while the user is waiting for the page to be generated. Typically, makes the database the bottle neck in web applications. We can optimize this by building a 'Bloom filter' that will hold the status of web shoppers baskets. The Bloom filter will accurately tell us if the basket is empty and with a high degree of accuracy if it has content. This false positive case is acceptable as long as the rate is low, because for each TRUE case we get back from the Bloom filter, we will hit the database to get the actual basket content. How it works To store the status of a user's filled basket in a Bloom bit array N bits long, take the user's cookie and produce a high quality hashed number. Then, take the first N bits from the result of the hashed number and set the index N in the bit array to TRUE. ![]() We repeat this about 5 to 20 times setting 5 to 20 bits, respectively, in the bit array using the next consecutive N bits from the hashed output. The bit array starts out reset ( FALSE ), before we start to populate the bit array with data. As the array becomes more populated, sometimes we will set a bit that has already been set. This is the root of the false positive cases we will later examine. When we wish to examine the basket status of a user, we proceed in almost the exact steps except we read the bits from the bit array instead of setting them. If any of the bits are zero ( FALSE ), the basket is guaranteed to be empty. If all of the bits are set ( TRUE ), the basket likely has content and the SQL database should be queried for a precise answer. Tuning the Bloom filter The saturation is defined to be the percent of bits set to TRUE in the bit array. The phase is defined to be the number of times we attempt to set a bit in the array ( 5 to 20 times in this example ). Both of these variables can be modified to change the accuracy and capacity of the Bloom filter. Generally speaking, the larger the size of the bit array (N) and the higher the phase, the less likely a
|
|||
| Copyright © 2000 Beach Software | |||