Hashing | Database Management System

Hashing | Database Management System

Hashing : A hash function is computed on Some attribute of each record. The result of the function specifies in which block of the file the record should be placed. Hashing provides rapid, non-sequential, direct access to records. A key record field is used to calculate the record address by subjecting it to some calculation; a process called hashing. For numeric ascending order a sequential key record fields this might involve simply using relative address indexes from a base storage address to access records. Most of the time, key field does not have the values in sequence that can directly be used as relative record number. It has to be transformed. Hashing involves computing the address of a data item by computing a function on the search key value. A hash function his a function from the set of all search key values K to the set of all bucket addresses B. We choose a number of buckets to correspond to the number of search key values we will have Stored in the database.

To perform a lookup on a search key value K we compute h (K) and search the bucket with that address. If two search keys i and map to the same address, because

h (K)=h(K)

then the bucket at the address obtained will contain records with both search key values.
In this case we will have to check the search key value of every record in the bucket to get the ones we want. Insertion and deletion are simple.

Hash Functions

A good hash function gives an average-case lookup that is a small constant, independent of the number of Search keys. We hope records are distributed uniformly among the buckets. The worst hash function maps all keys to the same bucket. The best hash function maps all keys to distinct addresses. Ideally, distribution of keys to addresses is uniform and random.

Hashed Access Characteristics: There are following major characteristics:

  • No indexes to search or maintain
  • Very fast direct access
  • Inefficient Sequential access
  • Use when direct access is needed, but sequential access is not.

Mapping functions

The direct address approach requires that the function h(k), is a one-to-one mapping from each k to integers in (1,m). Such a function is known as a perfect hashing function: it maps each key to a distinct integer within Some manageable range and enables us to trivially build an O(1) search time table.

Unfortunately, finding a perfect hashing function is not always possible. Let’s say that we can find a hash function h(k), which maps most of the keys onto unique integers, but maps a small number of keys onto the same integer. If the number of Collisions (cases where multiple keys map onto the same integer), is Sufficiently small, then hash tables work quite well and give O(1) search times.

Handling the Collisions

In the Small number of cases, where multiple keys map to the same integer, then elements with different keys may be stored in the same “slot” of the hash table. It is clear that when the hash function is used to locate a potential match, it will be necessary to compare the key of that element with the search key. But there may be more than one element, which should be stored in a single slot of the table. Various techniques are used to manage this problem:

  1. Chaining,
  2. Re-hashing,
  3. Using neighboring slots (linear probing),
  4. Clustering,
  5. Overflow areas,
  6. Open addressing.

(1) Chaining: One simple scheme is to chain all Collisions in lists attached to the appropriate slot. This allows an unlimited number of Collisions to be handled and doesn’t require a priori knowledge of how many elements are contained in the collection. The trade off is the same as with linked lists versus array implementations of collections: linked list overhead in space and to a lesser extent in time.

(2) Re-hashing : Re-hashing schemes use a second hashing operation when there is a Collision. If there is a further collision, we re-hash until an empty “slot” in the table is found. The re-hashing function can either be a new function or a re-application of the original one. As long as the functions are applied to a key in the same Order, then a sought key can always be located.

(3) Linear probing: One of the simplest re-hashing functions is +1 (or -1), i.e., on a collision; look in the neighboring slot in the table. It calculates the new address extremely quickly and may be extremely efficient on a modern RISC processor due to efficient cache utilization. The animation gives you a practical demonstration of the effect of linear probing: It also implements a quadratic re-hash function so that you can Compare the difference

So the next hash function h1 is used. A second Collision occurs, so h2 is used. Linear probing is subject to a clustering phenomenon. Re-hashes from One location occupy a block of slots in the table which “grows” towards slots to which other keys hash. This exacerbates the collision problem and the number of re-hashed can become large.

(4) Clustering : Liner probing is subject to a clustering phenomenon. Re-hashes from one location occupy a block of slots in the table which “grows” towards slots to which other keys hash. This exacerbates the collision problem and the number of re-hashed can become large.

 

(5) Overflow area : Another scheme will divide the pre-allocated table into two sections: the primary area to which keys are mapped and an area for collisions, normally termed the overflow area. When a collision occurs, a slot in the Overflow area is used for the new element and a link from the primary slot established as in a chained System. This is essentially the same as chaining, except that the overflow area is pre-allocated and thus possibly faster to access. As with re-hashing, the maximum number of elements must be known in advance, but in this case, two parameters must be estimated: the optimum size of the primary and overflow areas. Of course, it is possible to design Systems with multiple overflow tables, or with a mechanism for handling overflow out of the overflow area, which provide flexibility without losing the advantages of the overflow scheme.

(6) Open addressing: The hash table is an array of (key, value) pairs. The basic idea is that when a (key, value) pairs inserted into the array, and a collision occurs, the entry is simply inserted at an alternative location in the array. Linear probing, double hashing, and rehashing are all different ways of choosing an alternative location. The simplest probing method is called linear probing. In linear probing, the probe Sequence is simply the Sequence of Consecutive locations, beginning with the hash value of the key. If the end of the table is reached, the probe sequence wraps around and continues at location 0. Only if the table is Completely full will the search fail.

Hash Table Organization

 

Hashing is the technique of retrieving the data in the database. For example, we created one index for one main table, so we can retrieve the index from that main table by using one function, that function is called hash function. Hash function format is h(search key)=pointer or bucket identifier. There are two types of hashing:

(1) Static Hashing : Static hashing is the method by which items that are stored in a table, or directory, are fixed. Once these primary pages are full, an overflow bucket is necessary in order to store any additional records, but this must hash to the original bucket (the place where the original records are kept). This can be achieved by using a link to the overflow page, or by using a linked list of the overflow pages. In the search, the first item that has been saved is the key item and becomes the function value. This is stored as a table code for function calculation. When searching for items, if the key codes are the same then a successful search can be made, either in the original pages or in the overflow buckets. The original bucket is searched initially for a record, and then the overflow buckets are searched; if there are many keys that hash to the same bucket, finding what you require will take longer because many pages will be accessed before you find your record.

  • A bucket is a unit of storage containing one or more records (a bucket is typically a disk block).
  • The file blocks are divided into M equal-sized buckets, numbered bucket0, bucket1… bucketM-1. Typically, a bucket corresponds to one (or a fixed number of) disk block.
  • In a hash file organization we obtain the bucket of a record directly from its search-key value using a Hash Function, h (K).
  • The record with hash key value K is stored in bucket, where i=h(K).
  • Hash function is used to locate records for access, insertion as well as deletion.
  • Records with different search-key values may be mapped to the same bucket; thus entire bucket has to be searched sequentially to locate a record.
  • Primary pages fixed, allocated sequentially, never de-allocated; overflow pages if needed.

(2) Dynamic Hashing: The time consuming search method has been solved by using dynamic hashing. Dynamic hashing means that the directory will become bigger in conjunction with the number of Collisions, so that new records can be accommodated. This also means that long overflow page chains can be avoided. Linear hashing and extendible hashing are two examples of dynamic hashing techniques.

Good for database that grows and shrinks in size.

  • Allows the hash function to be modified dynamically.
  • Hash function generates values over a large range-typically b-bit integers, with b = 32.
  • At any time use only a prefix of the hash function to index into a table of bucket addresses.