Hashing Techniques
This is where a record's placement is determined by the value in its hash field. A hash (randomizing) function is applied to this value, which yields the address of the disk block where the record is stored. For most records, we need only a single block access to retrieve the record.
www.ittelkom.ac.id
Each page is called a bucket. Buckets are numbered from 0 to N-1. If a bucket gets full, an overflow page must be chained to it.
The search key can be made out of one or more attributes. Ex. Students(sid, name, login, age, gpa)
A hash function is used to map the search key k of a record t in R to a bucket number in [0, N-1].
The hash function should distribute records uniformly. The record is then searched for inside the bucket.
[Figure: a hash function H() applied to the Account attribute (values such as 2381, 8387, 4882, 9403, 81982) maps records like (Ned, NY, $3333) and (Tim, MIA, $4000) to buckets.]
Internal Hashing
Internal hashing is implemented as a hash table through the use of an array of records (in memory), with an array index range of 0 to M-1. A function that transforms the hash field value into an integer between 0 and M-1 is used. A common one is h(K) = K mod M.
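As a sketch, internal hashing with h(K) = K mod M might look like this (the table size and the records are illustrative, not taken from the text):

```python
M = 7   # table size; a prime, as recommended later for mod hashing

def h(K):
    # Transform the hash field value into an integer in 0..M-1
    return K % M

table = [None] * M   # in-memory array of record slots

def insert(K, record):
    slot = h(K)
    if table[slot] is not None:
        # A different record already hashed here: needs collision resolution
        raise ValueError("collision at slot %d" % slot)
    table[slot] = (K, record)

insert(10, "record A")   # h(10) = 3
insert(15, "record B")   # h(15) = 1
```

Inserting a key that hashes to an occupied slot raises an error here; the collision-resolution strategies below deal with that case.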
Chapter 5
Collisions occur when a hash field value of a record being inserted hashes to an address that already contains a different record. The process of finding another position for this record is called collision resolution.
Collision Resolution
Open addressing - places the record to be inserted in the first available position subsequent to the hash address.
Chaining - a pointer field is added to each record location; when an overflow occurs, this pointer is set to point to overflow blocks, making a linked list.
Multiple hashing - if an overflow occurs, a second hash function is used to find a new location. If that location is also filled, either another hash function is applied or open addressing is used.
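Open addressing, for example, can be sketched as a linear probe over subsequent positions (table size and keys below are illustrative):

```python
M = 7

def h(K):
    return K % M

table = [None] * M   # open-addressed table

def insert(K, record):
    # Open addressing: place the record in the first available
    # position subsequent to the hash address
    for i in range(M):
        slot = (h(K) + i) % M
        if table[slot] is None:
            table[slot] = (K, record)
            return
    raise RuntimeError("table full")

def search(K):
    # Probe in the same order as insert
    for i in range(M):
        slot = (h(K) + i) % M
        if table[slot] is None:
            return None              # reached an empty slot: key absent
        if table[slot][0] == K:
            return table[slot][1]
    return None

insert(10, "A")   # h(10) = 3 -> slot 3
insert(17, "B")   # h(17) = 3 -> collision, placed in slot 4
```

Chaining would instead keep a list per slot, and multiple hashing would try a second hash function before falling back to probing.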
The goals of a good hash function are to uniformly distribute the records over the address space while minimizing collisions, to avoid wasting space. Research has shown that a 70% to 90% fill ratio works best, and that when using a mod function, M should be a prime number.
External hashing makes use of buckets, each of which can hold multiple records. A bucket is either a block or a cluster of contiguous blocks. The hash function maps a key into a relative bucket number, rather than an absolute block address for the bucket.
Static Hashing
Dynamic Hashing
Static Hashing
Under static hashing, a fixed number of buckets (M) is allocated. Based on the hash value, a bucket number is determined in the block directory array, which yields the block address. If n records fit into each block, this method allows up to n*M records to be stored.
[Figure: a search key k is hashed by h to one of the primary buckets 0, 1, 2, ..., N-1; full primary buckets chain to overflow pages.]
The number of primary buckets is fixed at file creation. The hash function maps a key to a bucket number. A typical hash function is h(k) = a*k + b, with bucket number = h(k) mod N. For character keys, each character is mapped to its ASCII value and the values are added to get an integer. The parameters a and b are chosen to tune the distribution of values (i.e., you need to play with these values to get them right).
When a primary bucket gets full, an overflow page must be created and chained to the primary bucket.
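A sketch of such a hash function for character keys follows; the parameter values a and b, the number of buckets, and the student records are illustrative assumptions, not values from the slides:

```python
N = 4          # number of primary buckets, fixed at file creation
a, b = 31, 7   # tuning parameters (illustrative choices)

def key_to_int(key):
    # Character keys: map each character to ASCII and add the values
    return sum(ord(c) for c in key)

def bucket_number(key):
    k = key_to_int(key) if isinstance(key, str) else key
    return (a * k + b) % N

# Each list stands in for a primary page plus its overflow chain
buckets = [[] for _ in range(N)]

for sid, name in [(53666, "Jones"), (53688, "Smith"), (53650, "Gump")]:
    buckets[bucket_number(name)].append((sid, name))
```

Whether the distribution is uniform depends on a, b, and the key population, which is why the slides suggest experimenting with these parameters.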
Search: hash k to find the bucket, call this bucket B, then search the records in B to find the one(s) with key k. If the records are found in B:
clustered, the data record is there: cost 1 I/O; unclustered, need to fetch the actual data page: cost 2 I/Os.
If the records are not found in B, we need to search in the overflow pages (if there are any):
clustered: cost (1 + number of overflow pages searched) I/Os; unclustered: cost (2 + number of overflow pages searched) I/Os.
The more overflow pages you have, the worse the performance gets.
Insert: clustered, write the record to the bucket page: cost 2 I/Os (read + write); unclustered, also write the record to the actual data page: cost 4 I/Os. With overflow pages: clustered, cost (2 + number of pages searched) I/Os; unclustered, cost (4 + number of pages searched) I/Os.
Delete costs are the same, since we need to write the page back to disk. Again, overflow pages make performance bad as the number of records increases.
Extensible hashing
Allows the number of buckets to grow or shrink. The hash function hashes to slots in a directory.
Slots store the page id of the bucket. The directory can be kept in the buffer pool, and can have hundreds or thousands of slots pointing to buckets. When a bucket fills, create a new bucket and split the records between the new and the full bucket.
Extendible Hashing
In extendible hashing, a directory is maintained as an array of 2^d bucket addresses, where d is the number of bits of the hash value used to index the directory (the examples here use the low-order bits) and is referred to as the global depth of the directory. However, there does NOT have to be a DISTINCT bucket for each directory entry. A local depth d' is stored with each bucket to indicate the number of bits actually used for that bucket.
Ex. h(51) = 00110011. If d = 2, then h(51) = 00110011 yields bucket number 11 in binary, which is 3.
The number d of bits used to hash the search key is called the depth. There are two types of depth: the global depth of the directory and the local depth of each bucket.
[Figure: extendible hash index with global depth 2. Directory page slots 00, 01, 10, 11 point to buckets A (4, 12, 32, 16), B (1, 5, 21), C (10), and D (15, 7, 19), each with local depth 2. H(4) = 100; with d = 2, the low-order bits give slot 00, and the bucket is then found from the slot.]
The depth tells us the number of bits that we need to use to pick a bucket. Ex. H(4) = 100 with d = 2 tells us to use 00 (the low-order two bits) to identify the slot; this would be slot 00. The global depth is used to hash a key to the proper slot; the local depth is used when a bucket needs to be split.
Each bucket has a local depth. Let us see what happens when we need to insert the value 22 into the hash index:
H(22) = 10110, d = 2
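Picking the low-order d bits of a hash value can be checked directly; a small sketch:

```python
def slot(hash_value, d):
    # Keep only the low-order d bits of the hash value
    return hash_value & ((1 << d) - 1)

# H(4) = 100 in binary; with d = 2 the low-order bits are 00 -> slot 0
s4 = slot(4, 2)
# H(22) = 10110 in binary; with d = 2 the low-order bits are 10 -> slot 2
s22 = slot(22, 2)
```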
[Figure: the directory page and buckets A-D before inserting 22.]
[Figure: after inserting 22: H(22) = 10110, the low-order bits 10 select slot 10, so 22 joins 10 in bucket C.]
Splitting a bucket
These buckets have the same hash value at the current depth d, but at depth d + 1 they differ by 1 bit:
one has a 1 at bit position d + 1, the other has a 0 at bit position d + 1. Bucket A is split into two buckets: bucket A and bucket A2. Bucket A, at d = 2, has value 00; at d = 3 it becomes 000. Bucket A2, at d = 2, has value 00; at d = 3 it becomes 100.
Example:
The values in the original bucket A and the new value to be inserted get distributed between buckets A and A2. The split increments the local depth of buckets A and A2 to d + 1, and the keys are now hashed to buckets using d + 1 bits. Recall that bucket A had 4, 12, 32, 16, and d = 2. Showing the low-order 3 bits of each hash value:
H(4) = 100, H(12) = 100, H(32) = 000, H(16) = 000, H(20) = 100
[Figure: splitting bucket A (4, 12, 32, 16, local depth 2): bucket A keeps 32 and 16, and the new bucket A2 (local depth 3) receives 4, 12, and 20.]
[Figure: the index after the split: bucket A (32, 16) and bucket A2 (4, 12, 20) now have local depth 3, while buckets B (1, 5, 21), C (10), and D (15, 7, 19) keep local depth 2, and the directory page addresses buckets with 3 bits.]
Some Issues
Not all split operations cause the directory to be doubled. Each bucket has a local depth.
If the local depth of a bucket is less than the global depth (e.g., global depth - 1), then splitting this bucket will not cause a doubling of the directory.
The directory doubles when the bucket is full, cannot fit another insertion, and has the same local depth as the global depth.
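The splitting and doubling rules above can be sketched as a toy extendible hash index. This is an illustrative assumption-laden sketch (identity hash function, bucket capacity of 4, low-order bits as slot index, and a single split per overflow), not the slides' exact implementation; the keys mirror the running example:

```python
BUCKET_CAPACITY = 4   # illustrative capacity

class Bucket:
    def __init__(self, local_depth):
        self.local_depth = local_depth
        self.keys = []

class ExtendibleHash:
    def __init__(self):
        # Start as in the running example: global depth 2, four buckets
        self.global_depth = 2
        self.directory = [Bucket(2) for _ in range(4)]   # slot i -> bucket

    def _slot(self, key):
        # Identity hash; slot = low-order global_depth bits
        return key & ((1 << self.global_depth) - 1)

    def insert(self, key):
        bucket = self.directory[self._slot(key)]
        if len(bucket.keys) < BUCKET_CAPACITY:
            bucket.keys.append(key)
            return
        # Full bucket: double the directory only if its local depth
        # equals the global depth
        if bucket.local_depth == self.global_depth:
            self.directory = self.directory + self.directory
            self.global_depth += 1
        # Split: one more bit now tells the two halves apart
        bucket.local_depth += 1
        new_bucket = Bucket(bucket.local_depth)
        for i in range(len(self.directory)):
            # Slots whose new distinguishing bit is 1 point to the new bucket
            if self.directory[i] is bucket and (i >> (bucket.local_depth - 1)) & 1:
                self.directory[i] = new_bucket
        old_keys, bucket.keys = bucket.keys, []
        for k in old_keys + [key]:
            # Re-hash with the larger depth (assumes one split suffices)
            self.directory[self._slot(k)].keys.append(k)

eh = ExtendibleHash()
for k in [4, 12, 32, 16, 1, 5, 21, 10, 15, 7, 19]:
    eh.insert(k)
eh.insert(20)   # slot 00's bucket is full: directory doubles, bucket splits
```

After inserting 20, the directory has 8 slots, bucket A holds 32 and 16, and the new bucket A2 holds 4, 12, and 20, matching the figures.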
Advantages
Can gracefully adapt to insertions and deletions. Limits the number of overflow pages. The hash function is easy to implement.
Disadvantages
The directory can grow large when we have billions of records, and also when we have skewed data distributions:
lots of values go to the same bucket; lots of empty buckets while a few hold all the data; overflow pages due to collisions (values that hash to the same bucket).
The generally used principle for shrinking extendible hashing files is to halve the directory when d > d' for all buckets after a deletion occurs. Buckets may be combined when each of the buckets to be combined is less than half full and they have the same bit pattern except for the d'-th bit, e.g., d' = 3 and the bit patterns 110 and 111.
Linear Hashing
No need for a directory. Limits overflow pages due to collisions. Splitting of buckets is done in a lazier fashion.
We switch to hash function h_{i+1} if we need to grow the number of buckets beyond the current M (we double the number of buckets to 2M).
Linear Hashing
Linear hashing allows the hash file to expand and shrink its number of buckets dynamically without needing a directory. It starts with M buckets numbered 0 to M-1 and uses the mod hash function h(K) = K mod M as the initial hash function, called h_i. Overflow is handled by chaining individual overflow chains for each bucket. It works by methodically splitting the original buckets, starting with bucket 0: the contents of bucket 0 are redistributed between bucket 0 and bucket M (the new bucket) using a secondary hash function h_{i+1}(K) = K mod 2M.
This splitting of buckets is done in order (0, 1, ..., M-1) REGARDLESS of which bucket the collision occurred in. To keep track of the next bucket to be split we use n; after the first split, n is incremented to 1. When a record hashes to a bucket less than n, we use the secondary hash function to determine which of the two buckets it belongs in. When all of the original M buckets have been split, we have 2M buckets and n = M; we then reset M to 2M, reset n to 0, and our secondary hash function becomes our primary hash function. Shrinking of the file is done based on the load factor, using the reverse of splitting.
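The round mechanics above can be sketched as follows; the initial M, the bucket capacity, the split trigger (any overflow), and the keys are illustrative assumptions:

```python
M = 4                 # initial number of buckets (illustrative)
BUCKET_CAPACITY = 2   # records per bucket before overflow (illustrative)

n = 0                 # next bucket to be split
level_M = M           # current M for this round
buckets = [[] for _ in range(M)]

def h(key, m):
    return key % m

def bucket_for(key):
    b = h(key, level_M)
    if b < n:                       # this bucket was already split this round,
        b = h(key, 2 * level_M)     # so use the secondary hash function
    return b

def insert(key):
    global n, level_M
    b = bucket_for(key)
    buckets[b].append(key)
    if len(buckets[b]) > BUCKET_CAPACITY:     # an overflow occurred somewhere...
        buckets.append([])                    # ...so split bucket n, not bucket b
        old, buckets[n] = buckets[n], []
        for k in old:
            buckets[h(k, 2 * level_M)].append(k)
        n += 1
        if n == level_M:                      # all original buckets are split:
            level_M *= 2                      # reset M to 2M and n to 0, and the
            n = 0                             # secondary function becomes primary

for k in [0, 4, 8, 1, 5, 9]:
    insert(k)
```

Inserting 0, 4, 8 overflows bucket 0 and splits it between buckets 0 and 4 (mod 2M = 8); inserting 1, 5, 9 then overflows bucket 1, which splits next, exactly in order.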
The general form is h_i(key) = h(key) mod (2^i * N).
h(key) acts as the base function; it is the same as for extensible hashing, and looks at the bit pattern in the value.
If N is a power of 2, and d0 is the number of bits needed to represent N, then d_i gives the number of bits used by function h_i: d_i = d0 + i.
General Scheme
The round number is called Level. At round number Level we use hash functions h_Level and h_Level+1, and we keep track of Next, the next bucket to be split.
Buckets before Next were split in this round; buckets from Next onward are yet to be split; buckets beyond the original range were created by splits in this round.