
DATA STRUCTURE

FILES
UNIT IV

© Bharati Vidyapeeth’s Institute of Computer Applications and Management, New Delhi-63

Learning Objectives
• Hashing
• Indexing Techniques
• File Organization


Hashing

Hashing
Technique for performing insertion, deletion, and
search in constant average time

Ordering of elements is not supported efficiently

Keys are mapped onto a number between 0 and TableSize-1

The mapping is done by a function called the Hash Function


Hash Function
Transforms a key into a cell/bucket address

Must be simple to compute

Should ensure that distinct keys get distinct cells
• Not always possible, especially as the number of keys increases
• Leads to collisions (multiple keys map to the same hash value)

So choose a function that distributes the keys evenly over the cells

Considerations

Which hash function to use

How to respond to collisions

Common hash functions

Mod
Mid Square
Folding
Digit Analysis
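
The first three functions listed above might be sketched as follows. This is only an illustration: the function names, the use of decimal digits, and the default part lengths are assumptions, not definitions from these slides. Digit analysis is left out because it depends on statistics of the actual key set.

```python
def mod_hash(key: int, table_size: int) -> int:
    """Mod (division) method: remainder of the key divided by the table size."""
    return key % table_size


def mid_square_hash(key: int, table_size: int, digits: int = 2) -> int:
    """Mid-square method: square the key and keep `digits` middle decimal digits."""
    squared = str(key * key)
    start = max((len(squared) - digits) // 2, 0)
    middle = squared[start:start + digits]
    return int(middle) % table_size


def folding_hash(key: int, table_size: int, part_len: int = 2) -> int:
    """Folding method: split the decimal digits into parts and add the parts."""
    text = str(key)
    parts = [int(text[i:i + part_len]) for i in range(0, len(text), part_len)]
    return sum(parts) % table_size


print(mod_hash(54, 10))              # 54 % 10
print(mid_square_hash(1234, 100))    # middle digits of 1234 * 1234 = 1522756
print(folding_hash(123456, 100))     # (12 + 34 + 56) % 100
```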


Collision Resolution

Open address hashing
• Linear Probing
• Quadratic Probing
• Double Hashing

Separate Chaining

Rehashing


Open Addressing
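In case of a collision, alternate cells are tried until an
empty cell is found.

Cell $h_i(X) = (\mathrm{Hash}(X) + F(i)) \bmod \mathrm{TableSize}$
• Given $F(0) = 0$

For Linear probing, $F(i)$ is a linear function of $i$ (typically $F(i) = i$)

For Quadratic probing, $F(i)$ is a quadratic function of $i$ (typically $F(i) = i^2$)

For Double Hashing, $F(i) = i \cdot \mathrm{Hash}_2(X)$, where $\mathrm{Hash}_2$ is a second
hash function, other than the one chosen originally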
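
A small sketch of the three choices of F(i) may help in comparing the probe sequences they generate. The names are hypothetical, Hash(X) is assumed to be the mod method, and the second hash function R - (X mod R) with R = 7 is one common textbook choice, not one prescribed by these slides.

```python
from typing import Callable, List


def probe_sequence(key: int, table_size: int,
                   f: Callable[[int], int], attempts: int) -> List[int]:
    """Cells (Hash(X) + F(i)) mod TableSize for i = 0 .. attempts-1."""
    home = key % table_size              # Hash(X): assumed to be the mod method
    return [(home + f(i)) % table_size for i in range(attempts)]


def linear(i: int) -> int:
    return i                             # F(i) = i


def quadratic(i: int) -> int:
    return i * i                         # F(i) = i^2


def double_hash(key: int, r: int = 7) -> Callable[[int], int]:
    """F(i) = i * Hash2(X), with the assumed Hash2(X) = R - (X mod R)."""
    step = r - (key % r)                 # never 0, so every probe moves
    return lambda i: i * step


print(probe_sequence(32, 10, linear, 4))           # [2, 3, 4, 5]
print(probe_sequence(32, 10, quadratic, 4))        # [2, 3, 6, 1]
print(probe_sequence(32, 10, double_hash(32), 4))  # [2, 5, 8, 1]
```

In each case the first probe (i = 0) is the home cell, because F(0) = 0.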
Linear Probing
If the table is large enough to hold all the keys,
a free cell will always be found
• Though the time required to find it may be large

Drawback
• Blocks of occupied cells may form: PRIMARY
CLUSTERING
• i.e. a key that hashes into a cluster needs several
attempts to resolve the collision, and then adds to the cluster


Linear Probing

Consider a hash table with 10 slots.

Say,
• The keys to be inserted are 12, 30, 11, 32, 34, 54, 50
• The hash function is mod 10
• This divisor is chosen just for illustration and is not a good
choice
  ✓ as at most 10 distinct cell addresses are generated,
    collisions will be frequent.
  ✓ The divisor should preferably be a prime number

Stages of insertion are illustrated on the following slides


Linear Probing: Illustration

Add 30 on Cell 30%10 = 0
Add 11 on Cell 11%10 = 1
Add 12 on Cell 12%10 = 2
Try to Add 32 on Cell 32%10 = 2; Not available; Try Next (Cell 3)
Add 34 on Cell 34%10 = 4
Try to Add 54 on Cell 54%10 = 4; Not available; Try Next (Cell 5)
Try to Add 50 on Cell 50%10 = 0; Not available; Try Next... until an empty cell is found (Cell 6)

Cell | Key
0    | 30
1    | 11
2    | 12
3    | 32
4    | 34
5    | 54
6    | 50
7    |
8    |
9    |
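
The insertion stages above can be reproduced with a short sketch. The function name and the use of None for an empty cell are illustrative assumptions; the keys, the table size and the mod 10 hash function are those of the example.

```python
from typing import List, Optional


def insert_linear(table: List[Optional[int]], key: int) -> int:
    """Insert key with linear probing (F(i) = i); return the cell that was used."""
    size = len(table)
    for i in range(size):
        cell = (key % size + i) % size   # home cell first, then the next cells
        if table[cell] is None:          # None marks an empty cell
            table[cell] = key
            return cell
    raise OverflowError("hash table is full")


table: List[Optional[int]] = [None] * 10
for k in [12, 30, 11, 32, 34, 54, 50]:
    insert_linear(table, k)

print(table)  # [30, 11, 12, 32, 34, 54, 50, None, None, None]
```

Key 50 needs seven probes (cells 0 to 6), which shows the cost of primary clustering.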
Quadratic Probing
Collisions are handled in the same way as in linear probing,
except for the choice of the next cell to try.

Here,
• instead of choosing the cell that lies immediately after the home
cell (or, in general, a cell given by a linear function of the attempt number i),
• a new cell number given by some quadratic function of i
is chosen, e.g. $\mathrm{Hash}(X) + i^2$


Separate Chaining
Maintains a list of all the keys that hash to the same value
To insert:
• Calculate the hash function
• Access the corresponding list
• Add a link to the list
i.e. in case of a collision, a link is added to the existing list
The new key may be added at either end of the list
Better for large records; handles collisions and overflow
efficiently.
Not as efficient when the record size is small or the domain of key
values is limited to a small number of entries

© Bharati Vidyapeeth’s Institute of Computer Applications and Management, New Delhi-63. 14

Separate Chaining: Illustration

Insert Sequence: 22, 42, 30, 43, 10

Cell | Chain
0    | 30 -> 10
1    |
2    | 22 -> 42
3    | 43
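
A minimal sketch of the illustration, assuming a table of 10 cells, the mod 10 hash function, and insertion at the tail of each chain (the slide allows either end). Python lists stand in for linked lists, and the names are hypothetical.

```python
from typing import List


def insert_chained(table: List[List[int]], key: int) -> None:
    """Append the key to the chain of the cell it hashes to."""
    table[key % len(table)].append(key)


def contains(table: List[List[int]], key: int) -> bool:
    """Search only the chain of the cell the key hashes to."""
    return key in table[key % len(table)]


table: List[List[int]] = [[] for _ in range(10)]
for k in [22, 42, 30, 43, 10]:
    insert_chained(table, k)

print(table[:4])            # [[30, 10], [], [22, 42], [43]]
print(contains(table, 42))  # True
print(contains(table, 52))  # False
```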

Rehashing
When the table gets too full,
• the number of collisions increases;
• resulting in degraded performance while
inserting as well as searching

Build another hash table with size ~ 2*OldSize

Scan the original table; for each entry
• Compute the new hash value
• Insert it in the new hash table
Rehashing is costly and thus should not be done very
frequently.


Rehashing: Illustration
Consider the hash table as given in the figure:
• The keys to be inserted are 12, 30, 11, 32, 34, 54, 50
• The hash function is mod 10

Cell | Key
0    | 30
1    | 11
2    | 12
3    | 32
4    | 34
5    | 54
6    | 50
7    |
8    |
9    |


Rehashing
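• New table size 23 (a prime close to twice the old size)
• The hash function is mod 23

Cell | 4  | 7  | 8  | 9  | 11 | 12 | 13
Key  | 50 | 30 | 54 | 32 | 11 | 12 | 34

All other cells are empty. Key 34 hashes to cell 11, which is occupied, so linear probing places it in cell 13.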
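
The rehashing step can be sketched as below. It assumes the new table has 23 cells, so that its size matches the mod 23 hash function, and that collisions in the new table are resolved by linear probing; the function name is hypothetical.

```python
from typing import List, Optional


def rehash(old_table: List[Optional[int]], new_size: int) -> List[Optional[int]]:
    """Build a larger table and re-insert every key with the new hash function."""
    new_table: List[Optional[int]] = [None] * new_size
    for key in old_table:                    # scan the original table cell by cell
        if key is None:
            continue
        cell = key % new_size                # new hash value
        while new_table[cell] is not None:   # linear probing; new table is larger
            cell = (cell + 1) % new_size
        new_table[cell] = key
    return new_table


old = [30, 11, 12, 32, 34, 54, 50, None, None, None]
new = rehash(old, 23)
print({cell: key for cell, key in enumerate(new) if key is not None})
# {4: 50, 7: 30, 8: 54, 9: 32, 11: 11, 12: 12, 13: 34}
```

Only key 34 collides in the new table, compared with three colliding keys under mod 10.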

Indexing Techniques


Indexing Techniques

Cylinder Surface Indexing


Hashed Indexing
Tree Indexing


Cylinder Surface Indexing


Used for primary key index in sequential file
organization

Assumes records are stored in increasing order of Primary Key

Index consists of a CYLINDER INDEX + a SURFACE INDEX for each cylinder

Cylinder Surface Indexing
If a data file takes up c cylinders, the cylinder index (CI) has c entries

Each CI entry contains
{CYLINDER_NO, Largest key on cylinder}

Each entry of the surface index (SI) of the ith cylinder contains:
{SURFACE_NO, Largest key on the ith cylinder of this surface}


Cylinder Surface Indexing


Searching a record (ISAM)

• Read the Cylinder Index into memory
• Locate the cylinder number that possibly contains the record
• Read the surface index of the corresponding cylinder
• Find the surface (reduced to a track) that may contain the record
• Search the track sequentially
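
The search steps above might look like this in outline. The toy file (2 cylinders with 2 surfaces each), the dictionary layout and all names are hypothetical; real ISAM reads the indexes and tracks from disk rather than from memory.

```python
from bisect import bisect_left

# Hypothetical toy file: keys stored in increasing order across tracks.
tracks = {
    (0, 0): [3, 8, 12],
    (0, 1): [15, 20, 27],
    (1, 0): [31, 36, 40],
    (1, 1): [44, 52, 60],
}
cylinder_index = [(0, 27), (1, 60)]          # (CYLINDER_NO, largest key on cylinder)
surface_index = {                            # per cylinder: (SURFACE_NO, largest key)
    0: [(0, 12), (1, 27)],
    1: [(0, 40), (1, 60)],
}


def first_covering(index, key):
    """Number of the first index entry whose largest key is >= key, else None."""
    pos = bisect_left([largest for _, largest in index], key)
    return index[pos][0] if pos < len(index) else None


def isam_search(key):
    """Return (cylinder, surface) holding the key, or None if it is absent."""
    cyl = first_covering(cylinder_index, key)            # search cylinder index
    if cyl is None:
        return None
    surf = first_covering(surface_index[cyl], key)       # search surface index
    if surf is None:
        return None
    for stored in tracks[(cyl, surf)]:                   # sequential track search
        if stored == key:
            return (cyl, surf)
    return None


print(isam_search(36))  # (1, 0)
print(isam_search(13))  # None
```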



Hashed Indexing
Maintains hash table of key values along with the corresponding
record addresses

Uses the hash functions and overflow handling techniques
discussed under Hashing

With linear probing, seek time is low because overflow buckets /
cells are adjacent

With Separate Chaining, special buffer space is allocated
for expansion of buckets; thus little or no additional seek time
is required

Seek time is highest with random or quadratic probing, as
successive probes are not adjacent


Tree Indexing
Indexing using balanced trees of order m

Discussed before as B-trees and B+ trees

Maximum number of keys in a tree with $l$ levels: $m^l - 1$

Let number of keys = $N$
Number of failure nodes (nodes that one could
reach while looking for a key that doesn’t exist in the tree)
$= N + 1$
$=$ number of nodes at level $l + 1$
$\ge 2 \lceil m/2 \rceil^{\,l-1}$
Thus, $N \ge 2 \lceil m/2 \rceil^{\,l-1} - 1$


Tree indexing
Consider a B-Tree of order $m = 200$
Say $N \le 2 \times 10^6$
Using $N \ge 2 \lceil m/2 \rceil^{\,l-1} - 1$
i.e. $2 \times 10^6 \ge 2 \lceil 200/2 \rceil^{\,l-1} - 1$
We get
$10^6 \ge 100^{\,l-1}$
$6 \ge 2(l-1)$
$l \le 4$
Thus $2 \times 10^6$ keys can be searched in a maximum of 4
passes
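
The bound can be checked numerically. The sketch below simply finds the largest l satisfying N >= 2 * ceil(m/2)^(l-1) - 1; the function name is hypothetical.

```python
def max_levels(n_keys: int, order: int) -> int:
    """Largest l such that 2 * ceil(order/2)**(l-1) - 1 <= n_keys."""
    if order < 3:
        raise ValueError("order must be at least 3 for the bound to be finite")
    half = (order + 1) // 2                    # ceil(order / 2) for integers
    levels = 1
    while 2 * half ** levels - 1 <= n_keys:    # would one more level still fit?
        levels += 1
    return levels


print(max_levels(2_000_000, 200))  # 4
```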


Tree Indexing
A higher value of m would result in a still smaller number
of passes,

But a large value of m would require more time to
sequentially search a particular node.

So an optimal value of order m is chosen so that
the sum of two time components, i.e.:
• Reading a node from disk
• Sequentially searching a node
comes out to be the minimum
FILE ORGANIZATION


File Organization

Sequential File Organization

Random File Organization

Inverted Files

Cellular Files


Sequential File Organization


ISAM is the most popular sequential file organization
• Cylinder surface index is maintained for primary key.

Makes search based on PK efficient

Search based on other attributes requires the use of an alternate
indexing technique

Insertion and deletion are time consuming

Batch processes and range queries are executed efficiently

Random File Organization
Records are stored at random locations

Techniques used for randomization


• Direct Addressing
• Directory Lookup
• Hashed File organization


Direct Addressing
Available disk space is divided into nodes large enough to hold
a record

Numeric value of the PK determines the node number where
the insertion is to be made (1 disk access for read)

Good for fixed length records and high identifier density
(keys currently in use / size of the key domain).

In case of variable length records, pointers to the actual locations on
disk are maintained. (2 disk accesses for read)
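
For fixed length records, direct addressing amounts to computing an offset from the key. In the sketch below an in-memory buffer stands in for the disk, and the record layout, key range and function names are all hypothetical.

```python
import io
import struct

RECORD = struct.Struct("<i16s")   # assumed layout: 4-byte key + 16-byte name
MAX_KEY = 99                      # assumed key domain: 0 .. 99

# One node per possible key value; the buffer stands in for disk space.
disk = io.BytesIO(bytes(RECORD.size * (MAX_KEY + 1)))


def write_record(key: int, name: str) -> None:
    disk.seek(key * RECORD.size)              # node number = numeric key value
    disk.write(RECORD.pack(key, name.encode("ascii")))


def read_record(key: int) -> str:
    disk.seek(key * RECORD.size)              # a single access locates the record
    _, raw = RECORD.unpack(disk.read(RECORD.size))
    return raw.rstrip(b"\x00").decode("ascii")


write_record(42, "Asha")
print(read_record(42))   # Asha
```

Space is reserved for every key in the domain, which is why the scheme suits a high identifier density.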


Directory Lookup

As in direct addressing (DA) with variable length records, an index
maintains key values and pointers to disk addresses

Unlike DA with variable length records, available
space is utilized efficiently as only the existing keys
are stored, contiguously

Searching requires multiple disk accesses as the
index needs to be searched first

Hashed File Organization
Uses same principle as hashed indexes

Available file space is divided into cells/buckets/slots

Some space is set aside for overflow in case of chaining


Inverted Files
Index contains the link information

Index structure is most important

Stores index values and related record addresses

Records may be stored using any organization

Actual records may do away with storage of key values.

Inverted Files

E# Index
E#  | Record
100 | A
101 | D
110 | E
200 | B
220 | C
340 | F

Occupation Index
Occupation | Records
Analyst    | C, E
Programmer | A, B, D

Gender Index
Gender | Records
Male   | A, E
Female | B, C, D

Inverted Files

Searching becomes efficient as the addresses
associated with a key value are available as a
list

Queries that combine several conditions can be answered
using simple list operations like union,
intersection, subtraction etc.
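
As a sketch, the Occupation and Gender indexes of the earlier example can be held as sets of record addresses, so that combined conditions become set operations. The variable names are illustrative.

```python
occupation_index = {"Analyst": {"C", "E"}, "Programmer": {"A", "B", "D"}}
gender_index = {"Male": {"A", "E"}, "Female": {"B", "C", "D"}}

# Female AND Analyst: intersection of the two address lists
female_analysts = gender_index["Female"] & occupation_index["Analyst"]

# Male OR Programmer: union
male_or_programmer = gender_index["Male"] | occupation_index["Programmer"]

# Programmer AND NOT Female: subtraction
programmers_not_female = occupation_index["Programmer"] - gender_index["Female"]

print(sorted(female_analysts))         # ['C']
print(sorted(male_or_programmer))      # ['A', 'B', 'D', 'E']
print(sorted(programmers_not_female))  # ['A']
```

Only the index is consulted to answer these queries; the records themselves are fetched afterwards, using the resulting addresses.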


Cellular Partitions
Storage media is divided into cells

A cell could be
• A disk pack; or
• A cylinder

Lists of a given key value are divided into sub-lists
such that each sub-list occupies a single cell.

The index entries now contain the starting address of
each sub-list and the number of records in that sub-list.


Cellular Partitions
In case a cell is a cylinder, all the records placed in
one cell can be accessed without moving the
read/write head

In case a cell is a disk pack, several cells can be
searched in parallel.

What we Studied
✓ Hashing
✓ Indexing Techniques
✓ File Organization


Review Questions
1. What are the criteria behind the design of a hash function?
2. What are the various ways to store graphs in memory?
3. Discuss the applications of hash tables. Write a short note on the symbol
table.
4. Compare sequential and random file organization.
5. What are the advantages of using inverted files?
6. Would you use Quadratic Probing for resolving collisions in
hashed index files? State reasons.
7. Write a short note on the structure of a direct file.
8. Compare sequential files, indexed sequential files
and random access files.
9. Write a short note on Open Address Hashing and Separate
Chaining.
10. Discuss Random File Organization and the various techniques used
for randomization.
11. Explain the various techniques for overflow / collision resolution in
hashing.


