Cellular Partition by Using Hash Org.

DATA STRUCTURE
FILES
UNIT IV
© Bharati Vidyapeeth’s Institute of Computer Applications and Management, New Delhi-63 1
Learning Objectives
Hashing
Indexing Techniques
File Organization
Hashing
Technique for performing insertion, deletion, and search in constant average time
Ordering of elements is not supported efficiently
Keys are mapped onto a number between 0 and TableSize-1
Mapping is done by a function called the hash function
Hash Function
Transforms a key into a cell/bucket address
Must be simple to compute
Should ensure that distinct keys get distinct cells
Not possible in all cases: as the number of keys increases, collisions occur (multiple keys map to the same hash value)
So choose a function that distributes keys evenly
Considerations
Which hash function to use
How to respond to collisions
Common hash functions
Mod
Mid Square
Folding
Digit Analysis
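A minimal sketch of three of the hash functions listed above, assuming non-negative integer keys. The function names and the `digits` and `part_len` parameters are illustrative choices, not part of the original slides. Digit analysis is left out because it depends on inspecting the full set of keys in advance.

```python
def mod_hash(key: int, table_size: int) -> int:
    """Mod (division) method: remainder of the key divided by the table size."""
    return key % table_size


def mid_square_hash(key: int, table_size: int, digits: int = 2) -> int:
    """Mid-square method: square the key and take `digits` digits from the middle."""
    squared = str(key * key)
    start = max((len(squared) - digits) // 2, 0)
    middle = squared[start:start + digits]
    return int(middle) % table_size


def folding_hash(key: int, table_size: int, part_len: int = 2) -> int:
    """Folding method: split the key's digits into parts and add the parts."""
    text = str(key)
    parts = [int(text[i:i + part_len]) for i in range(0, len(text), part_len)]
    return sum(parts) % table_size


print(mod_hash(54, 10))            # 4
print(mid_square_hash(1234, 100))  # 1234 * 1234 = 1522756, middle digits "22"
print(folding_hash(123456, 100))   # 12 + 34 + 56 = 102, 102 % 100 = 2
```

In each case the final `% table_size` keeps the result inside the range 0 to TableSize-1.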
Collision Resolution
Open address hashing
• Linear Probing
• Quadratic Probing
• Double Hashing
Separate Chaining
Rehashing
Open Addressing
In case of a collision, alternate cells are tried until an empty cell is found.
Cell $h_i(X) = (\mathrm{Hash}(X) + F(i)) \bmod \mathrm{TableSize}$, with $F(0) = 0$
For linear probing, $F(i)$ is a linear function of $i$
For quadratic probing, $F(i)$ is a function of $i^2$
For double hashing, $F(i)$ depends on a second hash function, different from the one chosen originally
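The three choices of F(i) can be compared with a short sketch. The helper names are hypothetical, and the second hash function used for double hashing, R - (X mod R) with R = 7, is one common textbook choice assumed here for illustration only. Each probe is reduced modulo the table size so that it stays inside the table.

```python
from typing import Callable, List


def probe_sequence(home: int, table_size: int,
                   offset: Callable[[int], int], attempts: int) -> List[int]:
    """Cells tried on attempts i = 0, 1, 2, ...: (home + F(i)) mod table_size."""
    return [(home + offset(i)) % table_size for i in range(attempts)]


def linear_offset(i: int) -> int:
    """Linear probing: F(i) = i."""
    return i


def quadratic_offset(i: int) -> int:
    """Quadratic probing: F(i) = i * i."""
    return i * i


def double_hash_offset(key: int, prime: int = 7) -> Callable[[int], int]:
    """Double hashing: F(i) = i * hash2(key), with the assumed
    second hash function hash2(key) = prime - (key % prime)."""
    step = prime - (key % prime)
    return lambda i: i * step


home = 32 % 10  # home cell of key 32 in a 10-cell table
print(probe_sequence(home, 10, linear_offset, 4))           # [2, 3, 4, 5]
print(probe_sequence(home, 10, quadratic_offset, 4))        # [2, 3, 6, 1]
print(probe_sequence(home, 10, double_hash_offset(32), 4))  # [2, 5, 8, 1]
```

All three sequences start at the home cell because F(0) = 0.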
Linear Probing
If the table is large enough to hold all the keys, a free cell will always be found, though finding it may take a long time
Drawback
Blocks of occupied cells may form: PRIMARY CLUSTERING
i.e. a key that hashes into a cluster requires several attempts to resolve the collision
Linear Probing
Consider a hash table with 10 slots.
Say,
The keys to be inserted are 12, 30, 11, 32, 34, 54, 50
The hash function is mod 10
This divisor is chosen just for illustration and is not a good choice
• At most 10 distinct cell addresses are generated, so collisions will be frequent.
• The divisor should preferably be a prime number
Stages of insertion are illustrated on the following slides
Linear Probing: Illustration
Add 12 at cell 12 % 10 = 2
Add 30 at cell 30 % 10 = 0
Add 11 at cell 11 % 10 = 1
Try to add 32 at cell 32 % 10 = 2; not available; try the next cell: 3
Add 34 at cell 34 % 10 = 4
Try to add 54 at cell 54 % 10 = 4; not available; try the next cell: 5
Try to add 50 at cell 50 % 10 = 0; not available; keep trying the next cell until an empty one is found: 6
Resulting table (cell: key)
0: 30
1: 11
2: 12
3: 32
4: 34
5: 54
6: 50
7: (empty)
8: (empty)
9: (empty)
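The insertion stages above can be reproduced with a small sketch. The function name `insert_linear` and the use of `None` for an empty cell are assumptions made for illustration.

```python
from typing import List, Optional


def insert_linear(table: List[Optional[int]], key: int) -> int:
    """Insert key with linear probing and return the cell that was used."""
    size = len(table)
    home = key % size
    for i in range(size):
        cell = (home + i) % size  # F(i) = i: try the next cell each time
        if table[cell] is None:
            table[cell] = key
            return cell
    raise RuntimeError("hash table is full")


table: List[Optional[int]] = [None] * 10
for key in [12, 30, 11, 32, 34, 54, 50]:
    insert_linear(table, key)

print(table)  # [30, 11, 12, 32, 34, 54, 50, None, None, None]
```

Key 50 needs seven probes (cells 0 to 6), which shows how a cluster of occupied cells slows down later insertions.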
Quadratic Probing
Collisions are treated similarly under quadratic probing.
Here, instead of choosing the next cell after the ideal cell (a cell given by a linear function of i), a new cell number given by some quadratic function of i is chosen, where i is the attempt number
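As a rough sketch, the same keys used in the linear probing example can be inserted with F(i) = i * i. The function name is hypothetical. Quadratic probing is not guaranteed to find an empty cell even when one exists, so the sketch gives up after as many attempts as there are cells.

```python
from typing import List, Optional


def insert_quadratic(table: List[Optional[int]], key: int) -> int:
    """Insert key with quadratic probing and return the cell that was used."""
    size = len(table)
    home = key % size
    for i in range(size):
        cell = (home + i * i) % size  # F(i) = i * i
        if table[cell] is None:
            table[cell] = key
            return cell
    raise RuntimeError("no empty cell found along the probe sequence")


table: List[Optional[int]] = [None] * 10
for key in [12, 30, 11, 32, 34, 54, 50]:
    insert_quadratic(table, key)

print(table)  # [30, 11, 12, 32, 34, 54, None, None, None, 50]
```

Key 50 now lands in cell 9 after probing cells 0, 1, and 4, instead of walking through the whole cluster one cell at a time.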
Separate Chaining
Maintains a list of all the keys that hash to the same value
To insert:
Calculate the hash function
Access the corresponding list
Add a link to the list
i.e. a link is added in case of a collision
The new key may be added at either end of the list
Better for large records; handles collisions and overflow efficiently.
Not as efficient when the record size is small or the domain of key values is limited to a small number of entries
Separate Chaining: Illustration
Insert sequence: 22, 42, 30, 43, 10
Resulting chains (cell: keys)
0: 30 -> 10
1: (empty)
2: 22 -> 42
3: 43
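A minimal sketch of separate chaining for the insert sequence above. It assumes a mod 10 hash function, which is consistent with the cells shown, and uses Python lists in place of linked lists. New keys are appended at the tail of a chain.

```python
from typing import List


def insert_chained(table: List[List[int]], key: int) -> None:
    """Append the key to the chain of the cell it hashes to."""
    table[key % len(table)].append(key)


def search_chained(table: List[List[int]], key: int) -> bool:
    """Only the chain of the key's own cell needs to be scanned."""
    return key in table[key % len(table)]


table: List[List[int]] = [[] for _ in range(10)]
for key in [22, 42, 30, 43, 10]:
    insert_chained(table, key)

print(table[0], table[2], table[3])  # [30, 10] [22, 42] [43]
print(search_chained(table, 43))     # True
```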
Rehashing
When the table gets too full, the number of collisions increases, degrading the performance of both insertion and search
Build another hash table with size ~ 2*OldSize
Scan the original table; for each entry:
• Compute the new hash value
• Insert it in the new hash table
Rehashing is costly, so it should not be done very frequently.
Rehashing: Illustration
Consider the hash table given below:
• The keys to be inserted are 12, 30, 11, 32, 34, 54, 50
• The hash function is mod 10
Table (cell: key)
0: 30
1: 11
2: 12
3: 32
4: 34
5: 54
6: 50
7: (empty)
8: (empty)
9: (empty)
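Rehashing
• New table size 23 (the first prime above 2 × 10)
• The hash function is mod 23
Occupied cells after rehashing (cell: key); all other cells are empty
4: 50
7: 30
8: 54
9: 32
11: 11
12: 12
13: 34 (cells 11 and 12 are occupied, so linear probing moves it to 13)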
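The rehashing step can be sketched as follows. The sketch assumes the new table has 23 cells, so that it matches the mod 23 hash function, and that collisions are resolved with linear probing; the function names are hypothetical.

```python
from typing import List, Optional


def insert_linear(table: List[Optional[int]], key: int) -> int:
    """Insert key with linear probing and return the cell that was used."""
    size = len(table)
    for i in range(size):
        cell = (key % size + i) % size
        if table[cell] is None:
            table[cell] = key
            return cell
    raise RuntimeError("hash table is full")


def rehash(old_table: List[Optional[int]], new_size: int) -> List[Optional[int]]:
    """Scan the old table and re-insert every key into a larger table."""
    new_table: List[Optional[int]] = [None] * new_size
    for key in old_table:
        if key is not None:
            insert_linear(new_table, key)  # hash value is recomputed mod new_size
    return new_table


old_table = [30, 11, 12, 32, 34, 54, 50, None, None, None]
new_table = rehash(old_table, 23)

occupied = {cell: key for cell, key in enumerate(new_table) if key is not None}
print(occupied)  # {4: 50, 7: 30, 8: 54, 9: 32, 11: 11, 12: 12, 13: 34}
```

Every key is touched once during the scan, which is why rehashing is worth doing only occasionally.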
Indexing Techniques
• Cylinder Surface Indexing
• Hashed Indexing
• Tree Indexing
Cylinder Surface Indexing
Used for the primary key index in sequential file organization
Assumes records are stored in increasing order of primary key
Index consists of a CYLINDER INDEX (CI) plus a SURFACE INDEX (SI) for each cylinder
If a data file takes up c cylinders, the CI has c entries
Each CI entry contains
{CYLINDER_NO, largest key on the cylinder}
Each entry of the SI of the ith cylinder contains:
{SURFACE_NO, largest key on this surface of the ith cylinder}
Searching a record (ISAM)
• Read the cylinder index into memory
• Locate the cylinder that may contain the record
• Read the surface index of the corresponding cylinder
• Find the surface (reduced to a track) that may contain the record
• Search the track sequentially
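The search steps above can be sketched in memory. The data layout is an assumption made for illustration: each index is a list of (number, largest key) pairs in increasing key order, and each track is a sorted list of keys. The sample values are made up.

```python
from bisect import bisect_left
from typing import Dict, List, Optional, Tuple

IndexEntry = Tuple[int, int]  # (cylinder or surface number, largest key)


def isam_search(key: int,
                cylinder_index: List[IndexEntry],
                surface_index: Dict[int, List[IndexEntry]],
                tracks: Dict[Tuple[int, int], List[int]]) -> Optional[Tuple[int, int]]:
    """Return (cylinder, surface) holding the key, or None if it is absent."""
    # Locate the first cylinder whose largest key is >= the search key.
    pos = bisect_left([largest for _, largest in cylinder_index], key)
    if pos == len(cylinder_index):
        return None
    cylinder = cylinder_index[pos][0]
    # Read that cylinder's surface index and locate the surface the same way.
    surfaces = surface_index[cylinder]
    pos = bisect_left([largest for _, largest in surfaces], key)
    if pos == len(surfaces):
        return None
    surface = surfaces[pos][0]
    # Search the track sequentially; keys are stored in increasing order.
    for record_key in tracks[(cylinder, surface)]:
        if record_key == key:
            return (cylinder, surface)
        if record_key > key:
            break
    return None


cylinder_index = [(0, 40), (1, 90)]
surface_index = {0: [(0, 20), (1, 40)], 1: [(0, 70), (1, 90)]}
tracks = {(0, 0): [5, 12, 20], (0, 1): [27, 33, 40],
          (1, 0): [51, 70], (1, 1): [82, 90]}

print(isam_search(33, cylinder_index, surface_index, tracks))  # (0, 1)
print(isam_search(60, cylinder_index, surface_index, tracks))  # None
```

Storing the largest key per cylinder and per surface is what lets each level be narrowed with a single lookup before the sequential scan of one track.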

Hashed Indexing
Maintains a hash table of key values along with the corresponding record addresses
Uses the hash functions and overflow handling techniques discussed under hashing
With linear probing, seek time is low because overflow buckets/cells are adjacent
With separate chaining, special buffer space is allocated for expansion of buckets, so little or no additional seek time is required
Seek time is highest with random or quadratic probing
Tree Indexing
Indexing using balanced trees of order $m$
Discussed before as B-trees and B+ trees
Maximum number of keys in a tree with $l$ levels: $m^l - 1$
Let the number of keys be $N$
Number of failure nodes (nodes that one could reach while looking for a key that doesn't exist in the tree) $= N + 1$
$N + 1$ = number of nodes at level $l + 1$ $\geq 2 \lceil m/2 \rceil^{l-1}$
Thus, $N \geq 2 \lceil m/2 \rceil^{l-1} - 1$
Tree indexing
Consider a B-tree of order $m = 200$
Say $N \leq 2 \times 10^6$
Using $N \geq 2 \lceil m/2 \rceil^{l-1} - 1$,
i.e. $2 \times 10^6 \geq 2 \lceil 200/2 \rceil^{l-1} - 1$
We get (since $100^{l-1}$ is an integer)
$10^6 \geq 100^{l-1}$
$6 \geq 2(l-1)$
$l \leq 4$
Thus $2 \times 10^6$ keys can be searched in a maximum of 4 passes
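The bound can also be checked numerically. The sketch below finds the largest number of levels l for which 2 * ceil(m/2)^(l-1) - 1 does not exceed N; the function name is hypothetical and N is assumed to be at least 1.

```python
def max_levels(num_keys: int, order: int) -> int:
    """Largest l satisfying 2 * ceil(order / 2) ** (l - 1) - 1 <= num_keys."""
    half = (order + 1) // 2  # ceil(order / 2) using integer arithmetic
    level = 1
    # Move to the next level while its minimum key count still fits.
    while 2 * half ** level - 1 <= num_keys:
        level += 1
    return level


print(max_levels(2_000_000, 200))  # 4
print(max_levels(1_999_998, 200))  # 3
```

With only two fewer keys, a fourth level is no longer possible, because a four-level tree of order 200 needs at least 2 * 100^3 - 1 = 1,999,999 keys.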
Tree Indexing
A higher value of m results in fewer passes,
but a large value of m requires more time to sequentially search a node.
So an optimal value of order m is chosen so that the sum of two time components:
• Reading a node from disk
• Sequentially searching a node
is minimized
FILE ORGANIZATION
File Organization
Sequential File Organization
Random File Organization
Inverted Files
Cellular Files
Sequential File Organization

ISAM is the most popular sequential file organization
• A cylinder surface index is maintained for the primary key (PK).
Makes search based on the PK efficient
Search based on other attributes requires an alternate indexing technique
Insertion and deletion are time consuming
Batch processes and range queries are executed efficiently
Random File Organization
Records are stored at random locations
Techniques used for randomization

• Direct Addressing
• Directory Lookup
• Hashed File organization
Direct Addressing
Available disk space is divided into nodes large enough to hold a record
The numeric value of the PK determines the node number where the insertion is to be made (1 disk access for a read)
Good for fixed-length records and high identifier density (Current/Domain).
For variable-length records, pointers to the actual locations on disk are maintained (2 disk accesses for a read)
Directory Lookup
As with direct addressing of variable-length records, an index maintains key values and pointers to disk addresses
Unlike direct addressing of variable-length records, available space is used efficiently because the existing keys are stored contiguously
Searching requires multiple disk accesses because the index must be searched first
Hashed File Organization
Uses same principle as hashed indexes
Available file space is divided into cells/buckets/slots
Some space is set aside for overflow in case of chaining
Inverted Files
Index contains the link information
Index structure is most important
Stores index values and related record addresses
Records may be stored using any organization
Actual records may do away with storage of key values.
Inverted Files
E# Index
E#    Record
100   A
101   D
110   E
200   B
220   C
340   F
Occupation Index
Analyst      C, E
Programmer   A, B, D
Gender Index
Male     A, E
Female   B, C, D
Inverted Files
Searching becomes efficient as the addresses associated with a key value are available as a list
Combinations of conditions can be evaluated using simple list operations like union, intersection, subtraction etc.
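A short sketch of such combined queries, using the occupation and gender indexes from the inverted file example. Python sets stand in for the address lists; the variable names are illustrative.

```python
# Each index maps a key value to the set of record addresses holding it.
occupation_index = {"Analyst": {"C", "E"}, "Programmer": {"A", "B", "D"}}
gender_index = {"Male": {"A", "E"}, "Female": {"B", "C", "D"}}

# Programmer AND Female: intersection of the two address lists.
female_programmers = occupation_index["Programmer"] & gender_index["Female"]

# Analyst OR Male: union of the two address lists.
analysts_or_males = occupation_index["Analyst"] | gender_index["Male"]

# Female AND NOT Programmer: subtraction of one list from the other.
female_non_programmers = gender_index["Female"] - occupation_index["Programmer"]

print(sorted(female_programmers))      # ['B', 'D']
print(sorted(analysts_or_males))       # ['A', 'C', 'E']
print(sorted(female_non_programmers))  # ['C']
```

None of these queries reads a data record; only the final address lists need to be fetched from the file.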
Cellular Partitions
Storage media is divided into cells
A cell could be
• A disk pack; or
• A cylinder
Lists of a given key value are divided into sub-lists such that each sub-list occupies a single cell.
The index entries now contain the starting address of each sub-list and the number of records in it.
Cellular Partition
If a cell is a cylinder, all the records placed in one cell can be accessed without moving the read/write head
If a cell is a disk pack, several cells can be searched in parallel.
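One way to picture the partitioned index entries is the sketch below. It assumes a hypothetical address layout of (cell number, position within the cell) and assumes that the records of a sub-list sit next to each other in their cell, so that a starting address and a count are enough to describe the sub-list.

```python
from itertools import groupby
from typing import List, Tuple

Address = Tuple[int, int]  # (cell number, position within the cell); assumed layout


def partition_by_cell(addresses: List[Address]) -> List[Tuple[Address, int]]:
    """Split the address list of one key value into per-cell sub-lists.

    Returns one index entry per cell: (starting address, number of records).
    """
    entries = []
    ordered = sorted(addresses)  # groups addresses by cell, then by position
    for _, group in groupby(ordered, key=lambda address: address[0]):
        sublist = list(group)
        entries.append((sublist[0], len(sublist)))
    return entries


# Addresses of all records sharing one key value, spread over cells 1 and 2.
addresses = [(2, 5), (1, 0), (1, 1), (2, 6), (1, 2)]
print(partition_by_cell(addresses))  # [((1, 0), 3), ((2, 5), 2)]
```

Each entry can then be handed to a separate cell, which is what allows the sub-lists to be read without head movement or searched in parallel.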
What we Studied
• Hashing
• Indexing Techniques
• File Organization
Review Questions
1. What are the criteria behind the design of a hash function?
2. What are the various ways to store graphs in memory?
3. Discuss the applications of hash tables. Write a short note on symbol tables.
4. Compare sequential and random file organization.
5. What are the advantages of using inverted files?
6. Would you use quadratic probing for resolving collisions in hashed index files? State reasons.
7. Write a short note on the structure of a direct file.
8. Compare sequential files, indexed sequential files, and random access files.
9. Write a short note on open address hashing and separate chaining.
10. Discuss random file organization and the various techniques used for randomization.
11. Explain the various techniques for overflow / collision resolution in hashing.
