
File Organization in DBMS

As we know, a database consists of tables, views, indexes, procedures, functions,


etc. The tables and views are logical forms of viewing the data, but the actual
data are stored on physical storage devices. On these devices, the data cannot
be stored as-is; it is converted to binary format. Each storage device has many
data blocks, each capable of storing a certain amount of data. The data and
these blocks are mapped so the data can be stored in memory.

Any user who wants to view or modify this data simply fires an SQL query and
gets the result on the screen. How the data is stored in memory, the access
method, the query type, etc. greatly affect how the results are retrieved.

Types of File Organization

File organization defines how file records are mapped onto disk blocks. It
is similar to an index in a book or a catalogue in a library, which helps us
find the required topic or book.

Storing the files in a certain order is called file organization. The main
objectives of file organization are:

 Optimal selection of records, i.e., records should be accessed as fast as
possible.
 Any insert, update or delete transaction on records should be easy and
quick, and should not harm other records.
 No duplicate records should be induced as a result of an insert, update or
delete.
 Records should be stored efficiently so that the cost of storage is minimal.

The following methods are used to organize file records:

1. Sequential File Organization
2. Heap File Organization
3. Hash/Direct File Organization
4. Indexed Sequential Access Method
5. B+ Tree File Organization
6. Cluster File Organization

1. Sequential File Organization

It is one of the simplest methods of file organization. Here records are
stored one after the other in a sequential manner. This can be achieved in two
ways:

 Records are stored one after the other as they are inserted into the tables.
This method is called the pile file method. When a new record is inserted,
it is placed at the end of the file. In the case of modification or
deletion, the record is first searched for in the memory blocks; once
found, it is marked for deletion and the new version of the record is
written.

Inserting a new record:

In the diagram above, R1, R2, R3, etc. are the records. Each contains all the
attributes of a row; i.e., a student record will have the student's ID, name,
address, course, DOB, etc. So R1, R2, R3, etc. can each be considered one
full set of attributes.
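The pile file behaviour described above can be sketched in a few lines; `PileFile` and the record layout are hypothetical names used only for illustration:

```python
# Minimal sketch of a pile (unordered sequential) file, assuming
# records are plain dicts keyed by "id".
class PileFile:
    def __init__(self):
        self.records = []

    def insert(self, record):
        # New records always go to the end of the file.
        self.records.append(record)

    def find(self, record_id):
        # Search proceeds record by record from the start of the file.
        for rec in self.records:
            if rec["id"] == record_id:
                return rec
        return None

    def delete(self, record_id):
        # The matching record is located first, then marked as deleted.
        for rec in self.records:
            if rec["id"] == record_id:
                rec["deleted"] = True
                return True
        return False

f = PileFile()
f.insert({"id": 1, "name": "R1"})
f.insert({"id": 2, "name": "R2"})
```

Note that deletion only marks the record; the space is not reclaimed, which mirrors the behaviour described above.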

 In the second method, records are sorted (either ascending or
descending) each time they are inserted into the system. This method is
called the sorted file method. Sorting of records may be based on the
primary key or on any other column. Whenever a new record is
inserted, it is first placed at the end of the file, and then the file is
sorted (ascending or descending based on the key value) so the record
lands in the correct position. In the case of an update, the record is
updated and the file is then sorted to place the updated record in the
right position. The same applies to delete.

Inserting a new record:
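A minimal sketch of the sorted file method, using Python's `bisect` to keep the key column ordered on every insert (the keys and records are illustrative):

```python
import bisect

# Sketch of the sorted file method: the key column keeps the file
# ordered; bisect finds the correct position on each insert.
keys = [10, 20, 40]                     # sorted key values already in the file
records = {10: "R1", 20: "R2", 40: "R3"}

def insert_sorted(key, record):
    # Conceptually the record lands at the end of the file and is
    # then moved to its sorted position; bisect.insort does both.
    bisect.insort(keys, key)
    records[key] = record

insert_sorted(30, "R4")
# keys is now [10, 20, 30, 40]
```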

2. Heap File Organization
Here records are inserted at the end of the file as they arrive. There is no
sorting or ordering of the records. Once a data block is full, the next record
is stored in a new block. This new block need not be the very next block; the
method can select any block in memory to store the new record. It is the
responsibility of the DBMS to store and manage the records.

If a new record is inserted, then in the case above it will be inserted into
data block 1.

When a record has to be retrieved from the database, this method requires
traversing from the beginning of the file until the requested record is found.
Hence fetching records from very large tables is time consuming, because there
is no sorting or ordering of the records; we need to check all the data.

Similarly, if we want to delete or update a record, we first need to search for
it. Searching a record is similar to retrieving it: start from the beginning of
the file and scan until the record is fetched. A small file can be scanned
quickly, but the larger the file, the more time is spent fetching.

In addition, when a record is deleted, it is removed from the data block, but
the space is not freed and cannot be re-used. Hence as the number of records
grows, the memory consumed also grows while the efficiency drops. For the
database to perform better, the DBA has to free this unused memory
periodically.
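The heap organization above, including the unreclaimed deleted slots, can be sketched as follows; `HeapFile` and the block size are illustrative choices:

```python
# Sketch of heap file organization: fixed-capacity data blocks, a new
# block is allocated when the current one is full, and deleted slots
# are cleared but never reclaimed (the drawback described above).
BLOCK_SIZE = 2  # records per block; illustrative value

class HeapFile:
    def __init__(self):
        self.blocks = [[]]

    def insert(self, record):
        if len(self.blocks[-1]) >= BLOCK_SIZE:
            self.blocks.append([])  # any free block could be chosen
        self.blocks[-1].append(record)

    def find(self, record_id):
        # Full scan: every block is checked from the beginning.
        for block in self.blocks:
            for rec in block:
                if rec is not None and rec["id"] == record_id:
                    return rec
        return None

    def delete(self, record_id):
        # The slot is cleared but not reused, wasting space.
        for block in self.blocks:
            for i, rec in enumerate(block):
                if rec is not None and rec["id"] == record_id:
                    block[i] = None
                    return True
        return False

h = HeapFile()
for i in [1, 2, 3]:
    h.insert({"id": i})
# with BLOCK_SIZE = 2, the three records span two blocks
```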

3. Hash/Direct File Organization

In this method of file organization, a hash function is used to calculate the
address of the block in which to store a record. The hash function can be any
simple or complex mathematical function. It is applied to some
column/attribute, either a key or a non-key column, to get the block address.
Hence each record is stored at a computed location irrespective of the order in
which it arrives, which is why this method is also known as direct or random
file organization. If the hash function is applied to a key column, that column
is called the hash key; if it is applied to a non-key column, the column is
called the hash column.

When a record has to be retrieved, the address is generated from the hash key
column and the whole record is fetched directly from that address; there is no
need to traverse the whole file. Similarly, when a new record has to be
inserted, the address is generated from the hash key and the record is inserted
directly. The same holds for update and delete: no effort is spent searching
the entire file or sorting it. Each record is stored at its computed location
in memory.

These types of file organizations are useful in online transaction systems,
where retrieval, insertion and updates need to be fast.
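A minimal sketch of direct file organization, assuming a simple modulo hash over a hypothetical set of blocks:

```python
# Sketch of direct (hashed) file organization: the block address is
# computed from the hash key, so no scan or sort is needed.
NUM_BLOCKS = 5  # illustrative number of data blocks

blocks = {addr: [] for addr in range(NUM_BLOCKS)}

def block_address(hash_key):
    # Any simple or complex function works; mod is the classic choice.
    return hash_key % NUM_BLOCKS

def insert(hash_key, record):
    blocks[block_address(hash_key)].append((hash_key, record))

def fetch(hash_key):
    # Go straight to the computed block; only that block is searched.
    for key, record in blocks[block_address(hash_key)]:
        if key == hash_key:
            return record
    return None
```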

4. Indexed Sequential Access Method (ISAM)
This is an advanced sequential file organization method. Here records are
stored in the file in order of the primary key; the records are sorted using
the primary key. For each primary key, an index value is generated and mapped
to the record. This index is nothing but the address of the record in the file.

In this method, if any record has to be retrieved, based on its index value, the
data block address is fetched and the record is retrieved from memory.
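The ISAM lookup described above can be sketched as a binary search over the index followed by a direct jump to the record address; the keys and addresses are illustrative:

```python
import bisect

# Sketch of ISAM-style access: records sit in primary-key order and a
# separate index maps each key to its record address. List positions
# stand in for disk addresses here.
index_keys = [100, 102, 105, 108]        # sorted primary keys
index_addresses = [0, 1, 2, 3]           # address of each record
data_file = ["rec100", "rec102", "rec105", "rec108"]

def fetch(key):
    # Binary-search the index, then jump straight to the data block.
    i = bisect.bisect_left(index_keys, key)
    if i < len(index_keys) and index_keys[i] == key:
        return data_file[index_addresses[i]]
    return None
```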

5. B+ Tree File Organization

B+ Tree is an advanced form of ISAM file organization. It uses the same
concept of key and index, but in a tree-like structure. A B+ tree is similar
to a binary search tree, but a node can have more than two children. It stores
all the records only at the leaf nodes; intermediary nodes hold pointers to
the leaf nodes and do not contain any data/records.

Consider a student table below. The key value here is STUDENT_ID. And each
record contains the details of each student along with its key value and the
index/pointer to the next value. In a B+ tree it can be represented as below.

Please note that leaf node 100 holds the name and address of the student with
ID 100, as we saw with R1, R2, R3, etc. above.

From the above B+ tree structure, it is evident that

 There is one main node called the root of the tree; 105 is the root here.
 There is an intermediary layer of nodes. They do not store actual
records; they are all pointers to the leaf nodes. Only the leaf nodes
contain the data, in sorted order.
 The nodes to the left of the root hold values smaller than the root, and
the nodes to the right hold larger values, i.e., 102 and 108 respectively.

 All the leaf nodes are balanced: every leaf node is at the same distance
from the root node. Hence searching for any record is easier.
 Since the intermediary nodes hold only pointers to the leaf nodes, the
tree is of shorter height. The shorter the height, the faster the traversal
and hence the retrieval of records.

6. Cluster File Organization

In all the file organization methods described above, each file contains a
single table, stored in its own way in memory. In real-life situations,
retrieving records from a single table is comparatively rare; in most cases we
need to combine/join two or more related tables and retrieve the data. In such
cases, none of the above methods yields results quickly.

Another method of file organization, cluster file organization, is introduced
to handle this situation. In this method, two or more tables that are
frequently joined to get results are stored in the same file, called a cluster.
These files hold two or more tables in the same data block, and the key
columns that map these tables are stored only once. This method hence reduces
the cost of searching for related records in different files: all the records
are found in one place, making the search efficient.

For example, suppose we want to see the students who have taken a particular
course. The tables are shown in the diagram below. We can see there are two
students each who have opted for the 'Database' and 'Perl' courses. Though the
data is stored in separate tables in the logical view, in the physical view we
have combined them. This can be seen in the cluster file below: it is
effectively the result of the join, so we do not have to put any effort or
time into joining, and results come faster.

If we have to insert, update or delete any record, we can do so directly. Here
data are sorted based on the primary key or the key with which we are
searching. Clusters are formed based on the join condition; the key on which
we join the tables is known as the cluster key.

There are two types of cluster file organization:

 Indexed Clusters: Here records are grouped based on the cluster key
and stored together. Our STUDENT-COURSE example above is an indexed
cluster: the records are grouped on the cluster key COURSE_ID, and all
the related records are stored together. This method is suitable when
data is retrieved for a range of cluster key values, or when there is
huge data growth in the clusters; for example, when we have to select
the students attending courses with COURSE_ID 230-240, or when a large
number of students, say 250, attend the same course.
 Hash Clusters: This is similar to an indexed cluster, except that
instead of storing the records based on the cluster key directly, we
generate a hash value for the cluster key and store records with the
same hash value together on disk.
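An indexed cluster like the STUDENT-COURSE example can be sketched as follows, with hypothetical table data and COURSE_ID as the cluster key:

```python
# Sketch of an indexed cluster: STUDENT and COURSE rows that share a
# cluster key (COURSE_ID) are stored together in one physical group,
# so the join is already materialized. Table data is hypothetical.
courses = {230: "Database", 231: "Perl"}
students = [
    ("S1", 230), ("S2", 230),   # (STUDENT_ID, COURSE_ID)
    ("S3", 231), ("S4", 231),
]

# Build the cluster: one group per cluster key, key stored only once.
cluster = {}
for student_id, course_id in students:
    group = cluster.setdefault(course_id,
                               {"course": courses[course_id],
                                "students": []})
    group["students"].append(student_id)

# Fetching everyone on course 230 now needs no join at query time.
```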

 INDEXING
We know that data is stored in the form of records. Every record has a key
field, which helps it to be recognized uniquely.

Indexing is a data structure technique to efficiently retrieve records from
the database files based on the attributes on which the indexing has
been done. Indexing in database systems is similar to what we see in
books.

Indexing is defined based on its indexing attributes. Indexing can be of the
following types:

Primary Index: The data file is ordered on a key field. The key field is
generally the primary key of the relation.

Secondary Index: A secondary index may be generated from a field which is a
candidate key and has a unique value in every record, or from a non-key field
with duplicate values.

Clustering Index: A clustering index is defined on an ordered data file. The
data file is ordered on a non-key field.

Ordered indexing is of two types:

 Dense Index
 Sparse Index

Dense Index: In a dense index, there is an index record for every search
key value in the database. This makes searching faster but requires more
space to store the index records themselves. An index record contains the
search key value and a pointer to the actual record on the disk.

dense index

Sparse Index: In a sparse index, index records are not created for every
search key. An index record here contains a search key and a pointer to the
actual data on the disk. To search for a record, we first follow the index
record to reach the actual location of the data. If the data we are looking
for is not there, the system performs a sequential search until the desired
data is found.

sparse index
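The two index types can be contrasted in a small sketch; the data file and block boundaries are illustrative:

```python
import bisect

# Sketch contrasting dense and sparse indexes over a sorted data file.
# The dense index has one entry per search key; the sparse index keeps
# one entry per "block" (here, every second record) and falls back to
# a short sequential scan.
data = [(10, "A"), (20, "B"), (30, "C"), (40, "D")]

dense_index = {key: pos for pos, (key, _) in enumerate(data)}
sparse_keys = [10, 30]          # first key of each block
sparse_pos = [0, 2]

def dense_lookup(key):
    pos = dense_index.get(key)
    return None if pos is None else data[pos][1]

def sparse_lookup(key):
    # Find the last index entry <= key, then scan forward.
    i = bisect.bisect_right(sparse_keys, key) - 1
    if i < 0:
        return None
    for k, value in data[sparse_pos[i]:]:
        if k == key:
            return value
        if k > key:
            break
    return None
```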

Multilevel Index: Index records comprise search-key values and data pointers.
The multilevel index is stored on the disk along with the actual database
files. As the size of the database grows, so does the size of the indices.
Since index records need to be kept in main memory to speed up search
operations, a multilevel index breaks the index down into several smaller
indices, making the outermost level so small that it can be saved in a single
disk block, which can easily be accommodated anywhere in main memory.
Multilevel Index

 B+ Tree:

A B+ tree is a balanced search tree that follows a multi-level index format.
The leaf nodes of a B+ tree hold the actual data pointers. The B+ tree
ensures that all leaf nodes remain at the same height, and is thus balanced.

(See also the B+ tree figure in section 5 above.)

B+ Tree Insertion:

B+ trees are filled from the bottom, and each entry is made at a leaf node.

If a leaf node overflows:

 Split the node into two parts.
 Partition the node at i = ⌊(m+1)/2⌋.
 The first i entries are stored in one node.
 The rest of the entries (i+1 onwards) are moved to a new node.
 The i-th key is duplicated at the parent of the leaf.

If a non-leaf node overflows:

 Split the node into two parts.
 Partition the node at i = ⌈(m+1)/2⌉.
 Entries up to i are kept in one node.
 The rest of the entries are moved to a new node.
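The split rules above can be sketched as follows. One assumption to flag: the separator copied to the parent on a leaf split is taken here as the first key of the new right node, which matches the worked insertion example later in this section:

```python
import math

# Sketch of the overflow split rules for a node of order m (at most m
# entries per node); the entry lists are illustrative.
def split_leaf(entries, m):
    # Leaf overflow: partition at i = floor((m+1)/2); the first i
    # entries stay, the rest move to a new leaf, and the separator
    # (first key of the right leaf) is duplicated into the parent.
    i = (m + 1) // 2
    left, right = entries[:i], entries[i:]
    return left, right, right[0]

def split_internal(keys, m):
    # Non-leaf overflow: partition at i = ceil((m+1)/2); keys up to i
    # are kept, the rest move to a new node.
    i = math.ceil((m + 1) / 2)
    return keys[:i], keys[i:]
```

For example, an overflowing leaf (50, 55, 60, 65, 70) with m = 4 splits into (50, 55) and (60, 65, 70), with 60 going to the parent.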

B+ Tree Deletion:

 B+ tree entries are deleted at the leaf nodes.
 The target entry is searched for and deleted.
 If it appears in an internal node, delete it there and replace it with
the entry from the left position.
 After deletion, underflow is tested. If underflow occurs, distribute the
entries from the node to its left.
 If distribution is not possible from the left, distribute the entries
from the node to its right.
 If distribution is not possible from the left or from the right, merge
the node with its left and right siblings.

The following worked examples illustrate these operations.

Searching a record in a B+ Tree
Suppose we want to search for 65 in the B+ tree structure below. First we find
the intermediary node that directs us to the leaf node that can contain the
record for 65: we follow the branch between the 50 and 75 entries of the
intermediary node, and are redirected to the third leaf node at the end. Here
the DBMS performs a sequential search to find 65. Suppose, instead of 65, we
have to search for 60. What happens in this case? We will simply not find it
in the leaf node. Note that no insert/update/delete is allowed during a search
in a B+ tree.

Insertion in a B+ tree
Suppose we have to insert a record with key 60 into the structure below. It
belongs in the 3rd leaf node, after 55. Since this is a balanced tree and that
leaf node is already full, we cannot simply insert the record there; it must
be inserted without affecting the fill factor, balance and order. So the only
option is to split the leaf node. But how do we split it?

The 3rd leaf node should hold the values (50, 55, 60, 65, 70), and its current
parent entry is 50. We split the leaf node in the middle so that its balance
is not altered, grouping (50, 55) and (60, 65, 70) into two leaf nodes. If
these two are to be leaf nodes, the intermediary node cannot branch only from
50: 60 is added to it, and then we can have pointers to the new leaf node.

This is how we insert a new entry when there is overflow. In the normal
scenario, it is simple to find the node where the entry fits and place it in
that leaf node.

Deletion in a B+ tree
Suppose we have to delete 60 from the above example. What happens in this
case? We have to remove 60 from the 4th leaf node as well as from the
intermediary node. If we simply remove it from the intermediary node, the tree
will no longer satisfy the B+ tree rules, so we need to modify the tree to
keep it balanced. After deleting 60 from the B+ tree above and re-arranging
the nodes, it appears as below.

Suppose instead we have to delete 15 from the above tree. We traverse to the
1st leaf node and simply delete 15 from that node. There is no need for any
re-arrangement, as the tree remains balanced and 15 does not appear in the
intermediary nodes.

 HASHING:
Hash file organization is the method where data is stored in data blocks whose
addresses are generated by using a hash function. The memory location where
these records are stored is called a data block or data bucket. A data bucket
is capable of storing one or more records.
The hash function can use any of the column values to generate the address.
Most of the time, the hash function uses the primary key to generate the hash
index, i.e., the address of the data block.

The hash function can also be a simple mathematical function like mod, sin,
cos, exponential, etc. Imagine we use mod(5) as the hash function to determine
the address of the data block. What happens in the case above? Applying mod(5)
to the primary keys generates 3, 3, 1, 4 and 2 respectively, and the records
are stored at those data block addresses.
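The mod(5) example can be reproduced with hypothetical primary keys chosen so the addresses come out as 3, 3, 1, 4 and 2:

```python
# The mod(5) example above, with hypothetical primary keys chosen so
# that the generated bucket addresses come out as 3, 3, 1, 4 and 2.
def hash_address(primary_key):
    return primary_key % 5

keys = [103, 108, 101, 104, 102]
addresses = [hash_address(k) for k in keys]
# addresses == [3, 3, 1, 4, 2]; note 103 and 108 collide on bucket 3
```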

There are two types of hash file organizations – Static and Dynamic Hashing.

Static Hashing
In this method of hashing, the resultant data bucket address is always the
same. That means, if we generate the address for EMP_ID = 103 using the mod(5)
hash function, it always results in the same bucket address, 3. The bucket
address never changes here.

Searching a record

Using the hash function, the data bucket address is generated for the hash
key, and the record is retrieved from that location. I.e., if we want to
retrieve the whole record for ID 104, and the hash function is mod(5) on the
ID, the generated address is 4. We then go directly to address 4 and retrieve
the whole record for ID 104. Here the ID acts as the hash key.

Inserting a record

When a new record needs to be inserted into the table, we generate an address
for the new record based on its hash key. Once the address is generated, the
record is stored at that location.

Deleting a record

Using the hash function, we first fetch the record which is to be deleted,
then remove the record from that address in memory.

Updating a record

A record marked for update is searched for using the static hash function, and
the record at that address is then updated.

Note:

Suppose we have to insert some records into the file, but the data bucket
address generated by the hash function is full, or data already exists at that
address. How do we insert the new data? This situation in static hashing is
called bucket overflow, and it is one of the critical drawbacks of this
method. We cannot lose the data, so where do we store it? There are various
methods to overcome this situation; the most commonly used are listed below:

Closed hashing

In this method we introduce a new data bucket with the same address and link
it after the full data bucket. This method of overcoming bucket overflow is
called closed hashing or overflow chaining.

Suppose we have to insert a new record R2 into the table. The static hash
function generates the data bucket address 'AACDBF', but this bucket is full.
In this case a new data bucket is added at the end of the 'AACDBF' data bucket
and linked to it, and the new record R2 is inserted into the new bucket. Thus
the static hashing address is maintained; any number of new data buckets can
be linked as buckets fill up.
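Overflow chaining can be sketched as a linked list of buckets at one hash address; the bucket capacity and records are illustrative:

```python
# Sketch of closed hashing (overflow chaining): when the addressed
# bucket is full, a new bucket is linked after it at the same hash
# address.
BUCKET_CAPACITY = 2  # illustrative value

class Bucket:
    def __init__(self):
        self.records = []
        self.overflow = None   # link to the chained overflow bucket

def insert(bucket, record):
    # Walk the chain until a bucket with free space is found.
    while len(bucket.records) >= BUCKET_CAPACITY:
        if bucket.overflow is None:
            bucket.overflow = Bucket()   # link a new bucket at the end
        bucket = bucket.overflow
    bucket.records.append(record)

b = Bucket()
for rec in ["R1", "R2", "R3"]:
    insert(b, rec)
# "R3" lands in the linked overflow bucket
```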

Open Hashing

In this method, the next available data block is used to store the new record,
instead of overwriting the older one. This method is called open hashing or
linear probing.

In the example below, R2 is a new record which needs to be inserted, and the
hash function generates address 237. But that bucket is already full, so the
system searches for the next available data bucket, 238, and assigns R2 to it.

In linear probing, the difference between the older bucket and the new bucket
is fixed, and in most cases it is 1.

Quadratic probing

This is similar to linear probing, but we use a quadratic function to
determine the new bucket address: instead of incrementing the address by one
each time as in linear probing, we increment the offset by 1², 2², 3², …, n²
until we find a free memory location.

formula: address = (H + i²) % table size

Double Hashing

This is another variation of linear probing, in which the address is
calculated by applying another hash function to the already hashed key; hence
the name double hashing.

We search the hash key locations h + h', h + 2h', h + 3h', …, where

h = key % table size, h' = table size - (key % table size)
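The three probe sequences (linear, quadratic and double hashing) can be compared side by side; the table size and key here are illustrative:

```python
# Sketch of the three open-addressing probe sequences described above.
TABLE_SIZE = 7  # illustrative value

def linear_probe(key, i):
    # i-th probe: home address plus a fixed step of 1.
    return (key + i) % TABLE_SIZE

def quadratic_probe(key, i):
    # i-th probe: home address plus i squared.
    return (key + i * i) % TABLE_SIZE

def double_hash_probe(key, i):
    # h and h' as defined above; i-th probe is h + i*h'.
    h = key % TABLE_SIZE
    h2 = TABLE_SIZE - (key % TABLE_SIZE)
    return (h + i * h2) % TABLE_SIZE

probes = [double_hash_probe(10, i) for i in range(3)]
# i = 0, 1, 2 gives the addresses h, h + h', h + 2h' (mod table size)
```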

Dynamic Hashing

This hashing method is used to overcome the bucket overflow problem of static
hashing. In this method, the data buckets grow or shrink as the number of
records increases or decreases. This method of hashing is also known as
extendible hashing. Let us see an example to understand it.

Consider a table with three records R1, R2 and R4, which generate the
addresses 100100, 010110 and 110110 respectively. This method of storing
considers only part of each address, initially only the first bit, to store
the data. So it tries to load the three of them at addresses 0 and 1.

What happens when R3 arrives? There is no bucket space for R3, so the buckets
have to grow dynamically to accommodate it. The method changes the addresses
to use 2 bits rather than 1 bit, updates the existing data to the 2-bit
addresses, and then accommodates R3.

Now we can see that the addresses of R1 and R2 have changed to reflect the new
addressing, and R3 has been inserted. As the amount of data increases, the
method tries to insert into the existing buckets; if no bucket is available,
the number of address bits is increased, and hence the number of buckets. If
we delete records and the data can be stored with fewer buckets, the bucket
count shrinks.
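The growth step can be illustrated with a much-simplified sketch. Real extendible hashing keeps per-bucket local depths and splits one bucket at a time; here the whole directory is rebuilt on overflow, and R3's address is a hypothetical value chosen to trigger the doubling:

```python
# Simplified sketch of extendible (dynamic) hashing: the directory
# uses the first `depth` bits of each record's binary address and
# doubles when a bucket overflows. Capacity is an illustrative value.
BUCKET_CAPACITY = 2

def prefix(bits, depth):
    return bits[:depth]

class ExtendibleFile:
    def __init__(self):
        self.depth = 1
        self.buckets = {"0": [], "1": []}

    def insert(self, bits, record):
        key = prefix(bits, self.depth)
        if len(self.buckets[key]) >= BUCKET_CAPACITY:
            # One doubling is assumed to free space in this sketch.
            self._grow()
            key = prefix(bits, self.depth)
        self.buckets[key].append((bits, record))

    def _grow(self):
        # Consider one more address bit and redistribute all records.
        self.depth += 1
        old = [e for bucket in self.buckets.values() for e in bucket]
        self.buckets = {format(i, "0{}b".format(self.depth)): []
                        for i in range(2 ** self.depth)}
        for bits, record in old:
            self.buckets[prefix(bits, self.depth)].append((bits, record))

ef = ExtendibleFile()
ef.insert("100100", "R1")
ef.insert("010110", "R2")
ef.insert("110110", "R4")
ef.insert("111000", "R3")   # hypothetical address; triggers doubling
```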

Fixed-length and Variable-length Records

Fixed-length records:

1. All the records in the file are of the same size.
2. Can lead to memory wastage.
3. Access to the records is easier and faster.
4. The exact location of a record can be determined: the location of the
i-th record is n*(i-1), where n is the size of every record.
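The offset formula in point 4 can be checked directly; the record size n is an illustrative value:

```python
# With fixed-length records of n bytes, the i-th record (1-indexed)
# starts at byte n * (i - 1), so any record can be located without
# scanning the file.
n = 64  # size of every record in bytes; illustrative value

def record_offset(i):
    return n * (i - 1)

# record 1 starts at byte 0, record 4 starts at byte 192
```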

Variable-length records:

1. Different records in the file have different sizes.
2. Memory efficient.
3. Access to the records is slower.

