
A Project Report on

FILE INDEXING
Submitted to the Department of Information Technology

For the partial fulfilment of the degree of Dual Degree in


Information Technology

by
ISHAN MOURYA, RISHIK RAMENA, AMIT JALAN

Roll number : HX-20, HY-47, HX-12


Registration number : 510814043, 510814041, 510814034 of 2014-19
Dual Degree, 3rd year

Under the supervision of


PROF. SURAJIT KUMAR ROY

Department of Information Technology


INDIAN INSTITUTE OF ENGINEERING SCIENCE AND
TECHNOLOGY, SHIBPUR

December, 2016

CERTIFICATE
This is to certify that the work presented in this report entitled File Indexing,
submitted by Ishan Mourya, Rishik Ramena and Amit Jalan, having the
examination roll numbers 510814043, 510814041 and 510814034 respectively
has been carried out under my supervision for the partial fulfilment of the degree of
Dual Degree in Information Technology during the session 2014-19 in the
Department of Information Technology, Indian Institute of Engineering Science and
Technology, Shibpur.
Date: 06/12/2016

PROF. SURAJIT KUMAR ROY


Assistant Professor
Department of Information Technology
Indian Institute of Engineering Science
and Technology, Shibpur

DR. AMIT KUMAR DAS
Dean(Academics)
Indian Institute of Engineering Science
and Technology, Shibpur

DR. ARINDAM BISWAS


Head of the Department
Department of Information Technology
Indian Institute of Engineering Science and Technology, Shibpur

Acknowledgements

This project would not have been successful without the kind support of the
Organisation, the Department and its faculty members.
Our sincere thanks go to our project guide and mentor, Prof. Surajit Kumar Roy,
for his guidance and supervision.

Last but not least, we would like to express our gratitude to our batchmates
and friends, who helped us collect valuable information for the project and
were a constant source of motivation and encouragement.

Date: 06/12/2016


ISHAN MOURYA
Department of Information Technology,
Indian Institute of Engineering Science
and Technology, Shibpur


RISHIK RAMENA
Department of Information Technology,
Indian Institute of Engineering Science
and Technology, Shibpur


AMIT JALAN
Department of Information Technology,
Indian Institute of Engineering Science
and Technology, Shibpur

Abstract

The main objective in designing a database is fast access to any data in it,
together with quick insertion, deletion and search operations. When a database
is huge, however, these operations take a considerable amount of time. Indexes
are used to reduce that time by giving quick access to the required data. The
idea is similar to the index pages of a book or the catalogue of a library.

When records are stored in primary memory, accessing them is easy and quick.
In general, however, there are too many records to fit in primary memory, so
they have to be stored in secondary memory, which greatly increases the time
needed to access them.

Broadly, indexing can be classified into two types:

- Single-level indexing
- Multi-level indexing

These techniques can be used for quick and efficient retrieval of data from files.
Here we compare how efficient the different techniques are, i.e. how quickly
they access records in data files.
We mainly compare the access times of files with index structures against those
of non-indexed files. We also compare two search mechanisms, namely binary
search and sequential search.

Contents

1. INTRODUCTION
2. RELATED WORKS
3. PRELIMINARIES AND DEFINITIONS
   3.1 Including Figures
   3.2 Including Definitions, Theorems, etc.
4. PROBLEM DEFINITION
5. PROPOSED APPROACH
   5.1 Including Algorithms
6. WORKING
7. EXPERIMENTAL RESULTS
8. CONCLUSIONS

1. INTRODUCTION

"File organization" refers to the logical relationships among the various records that
constitute the file, particularly with respect to the means of identification and access
to any specific record. "File structure" refers to the format of the label and data
blocks and of any logical record control information.

Indexes are used to locate data quickly without having to search every row of a
database table each time the table is accessed. An index can be created on one
or more columns of a table, providing the basis for both rapid random lookups
and efficient access to ordered records.

Record management is of significant importance in any organization. It is
concerned with keeping records safely and making them available when required.
Indexing is an instrument of record management that makes it possible to find
records easily and quickly. Filing without indexing is of little use for a
large database. In this respect, the importance of indexing can be summarised
as follows:

- Easy location: Indexing points to the required record or file and so makes
  it easy to locate.
- Saves time and effort: Indexing gives a ready reference to the records and
  saves the time and effort of the office.
- Efficiency: Indexing helps to find records easily and quickly, which
  enhances the efficiency of the office.

2. RELATED WORKS
There are several database management systems that use different indexing
methods to speed up access to data, such as Oracle RDBMS, IBM DB2, Microsoft
SQL Server and many more. They all allow fast access to data, offer various
functionalities and are secure enough, but they need to be purchased, require
initial setup, and need a database expert to handle them.

We are trying to create a database management system that requires neither
initial setup nor a database expert to handle the data. It provides basic
insert and search operations and accesses the data using three methods: the
Sequential Data Retrieval Method on the main data file, the Primary Index Data
Retrieval Method and the B+ Tree Data Retrieval Method.
It is easy to use and has a very simple UI. It is good for storing medium-sized
databases.

3. PRELIMINARIES AND DEFINITIONS

3.1 Including Figures

Figure 3.1.1: Primary Indexing


Figure 3.1.2: Multi Level Indexing

3.2 Including Definitions, Theorems, etc.

Definition 3.2.1 : File Indexing


An indexed file is a computer file with an index that allows easy random access to
any record given its file key. The key must uniquely identify a record (primary
index). If more than one index is present, the others are called alternate
indexes.

Definition 3.2.2 : Primary indexing


A primary index is an index on a set of fields that includes the unique primary key
of the file and is guaranteed not to contain duplicates. If the index is built
instead on an ordering field that is not a key, so that several records may
share the same value, it is called a clustering index.

Definition 3.2.3 : Ordered indexes


An ordered index is an index created on a field (or set of fields) by which the
records of the main data file are physically ordered. This field is called the
ordering key field if it is also a key field of the main data file.

Definition 3.2.4 : B+ Tree

A B+ tree is an n-ary tree with a variable but often large number of children per node.
A B+ tree consists of a root, internal nodes and leaves. The root may be either a leaf
or a node with two or more children.
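
As a rough illustration of Definition 3.2.4, the Python sketch below searches a tiny hand-built B+ tree. It is not the project's code (the project is written in PHP); the `Node` class, its `keys`, `children` and `values` fields, and the sample roll numbers and names are all assumptions made for this example only.

```python
from bisect import bisect_right
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    # Sorted separator keys (internal node) or record keys (leaf).
    keys: List[int]
    # Child nodes; empty for a leaf. An internal node has len(keys) + 1 children.
    children: List["Node"] = field(default_factory=list)
    # Record values, parallel to keys; used only in leaves.
    values: List[str] = field(default_factory=list)

    @property
    def is_leaf(self) -> bool:
        return not self.children


def bplus_search(root: Node, key: int) -> Optional[str]:
    """Walk from the root down to a leaf, then look for the key in that leaf."""
    node = root
    while not node.is_leaf:
        # A key equal to a separator lives in the right-hand subtree.
        node = node.children[bisect_right(node.keys, key)]
    for k, v in zip(node.keys, node.values):
        if k == key:
            return v
    return None


# Hypothetical tree with three leaves; separators 20 and 40 split the key space.
leaf1 = Node(keys=[5, 12], values=["Student A", "Student B"])
leaf2 = Node(keys=[20, 31], values=["Student C", "Student D"])
leaf3 = Node(keys=[40, 47], values=["Student E", "Student F"])
root = Node(keys=[20, 40], children=[leaf1, leaf2, leaf3])
```

With this layout, a lookup touches one node per level, so the number of nodes read grows with the height of the tree rather than with the number of records.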

Definition 3.2.5 : JSON


It stands for JavaScript Object Notation. It is a lightweight data-interchange format. It
is easy for humans to read and write and also easy for machines to parse and
generate.
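
To make the definition concrete, here is a small Python sketch of how a file of student records could be held as a JSON array and converted to and from text. The field names `roll_no` and `name` and the sample values are assumptions for illustration; the report does not list the actual record layout.

```python
import json

# Hypothetical contents of a main data file: a JSON array of student records.
main_file_text = """
[
  {"roll_no": 47, "name": "Student C"},
  {"roll_no": 12, "name": "Student A"},
  {"roll_no": 20, "name": "Student B"}
]
"""

# Parse the text into a list of dictionaries (the "object array").
records = json.loads(main_file_text)

# Append a new record and turn the array back into text for storage.
records.append({"roll_no": 33, "name": "Student D"})
serialized = json.dumps(records, indent=2)
```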

4. PROBLEM DEFINITION

In real-life scenarios, databases may be huge. Running a sequential search to
access a record in such a database can therefore be very time-consuming.
Generally, databases are so large that they can't be loaded entirely into main
memory and are therefore stored in secondary memory. Accessing data in
secondary memory is much slower than accessing it in main memory, so we need
some kind of indexing to make access faster. If the index itself is very
large, as in the case of huge databases, we may use multi-level indexing or
B-tree or B+ tree based indexing.
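
The scale of the problem can be estimated with a short calculation. The Python sketch below counts comparisons rather than measuring time; it assumes a successful search with every record equally likely, and the fan-out of 100 index entries per block used in the usage note is a hypothetical figure, not one taken from the project.

```python
def sequential_avg_comparisons(n: int) -> float:
    """Average comparisons for a successful linear search over n records."""
    return (n + 1) / 2


def binary_max_comparisons(n: int) -> int:
    """Worst-case probes for binary search over n sorted records (n >= 1)."""
    # floor(log2(n)) + 1, computed exactly with integer arithmetic.
    return n.bit_length()


def multilevel_index_levels(n: int, fan_out: int) -> int:
    """Levels needed until the top index level fits in one block of fan_out entries."""
    levels = 1
    entries = n
    while entries > fan_out:
        # Each higher level holds one entry per block of the level below.
        entries = -(-entries // fan_out)  # ceiling division
        levels += 1
    return levels
```

For one million first-level index entries, `sequential_avg_comparisons` gives about 500,000 comparisons, `binary_max_comparisons` gives 20, and `multilevel_index_levels` with a fan-out of 100 gives 3 levels.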

5. PROPOSED APPROACH
Here we compare the efficiency of different approaches to accessing data in
files, ranging from the sequential non-indexed method to B+ trees. The results
may not show much difference between these mechanisms for small database
files, but for large files the difference is huge. We therefore take a range of
file sizes when comparing the access times of these mechanisms.

5.1 Including Algorithms

Algorithm 1 : Sequential Database Search

___________________________________________________________________

Input : Roll no. of a Student (search key field)
Output : Name of the Student (if found) and the time taken to access it.

Steps of the Algorithm :

1 : Get the contents of the main data file.
2 : Store the records of the file into an object array (JSON array).
3 : Get the length of the array.
4 : Start a timer (using microtime()).
5 : Set flag = 0.
6 : Loop through the array sequentially and search for the input roll no.
7 : If found set flag = 1.
8 : If flag == 0
        Print no student found.
9 : Else
        Print name of student.
        Stop timer.
        Calculate time taken for accessing = timer_stop - timer_start.
        Display the time taken.
10: End
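
The steps of Algorithm 1 can be sketched in Python as follows. This is an illustration, not the project's PHP code: `time.perf_counter()` stands in for `microtime()`, the field names `roll_no` and `name` are assumed, and the sketch differs from the listing in two small ways (it stops scanning at the first match and records the elapsed time whether or not the student is found).

```python
import time
from typing import Dict, List, Optional, Tuple


def sequential_search(records: List[Dict], roll_no: int) -> Tuple[Optional[str], float]:
    """Scan the records in order; return (name or None, elapsed seconds)."""
    start = time.perf_counter()          # stands in for PHP's microtime()
    name = None
    for record in records:
        if record["roll_no"] == roll_no:
            name = record["name"]
            break                        # stop at the first match
    elapsed = time.perf_counter() - start
    return name, elapsed


# Hypothetical unsorted records, as they might be decoded from the main data file.
students = [
    {"roll_no": 47, "name": "Student C"},
    {"roll_no": 12, "name": "Student A"},
    {"roll_no": 20, "name": "Student B"},
]
found_name, seconds = sequential_search(students, 20)
print(found_name if found_name is not None else "No student found", f"({seconds:.6f} s)")
```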

Algorithm 2 : Binary Search on Sorted Database

_______________________________________________________________________

Input : Roll no. of a Student (search key field)
Output : Name of the Student (if found) and the time taken to access it.

Steps of the Algorithm :

1 : Get the contents of the main data file.
2 : Store the records of the file into an object array (JSON array).
3 : Get the length of the array.
4 : Start a timer (using microtime()).
5 : Set flag = 0.
6 : Use binary search to search for the given roll no. in the object array.
7 : If found set flag = 1.
8 : If flag == 0
        Print no student found.
9 : Else
        Print name of student.
        Stop timer.
        Calculate time taken for accessing = timer_stop - timer_start.
        Display the time taken.
10: End
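
Step 6 of Algorithm 2 is the only part that differs from Algorithm 1. A minimal Python sketch of that step is shown below, assuming the records are already sorted by an assumed `roll_no` field; the function name and sample data are hypothetical, and the timing steps are omitted.

```python
from typing import Dict, List, Optional


def binary_search(sorted_records: List[Dict], roll_no: int) -> Optional[int]:
    """Return the position of roll_no in records sorted by roll_no, or None."""
    low, high = 0, len(sorted_records) - 1
    while low <= high:
        mid = (low + high) // 2
        current = sorted_records[mid]["roll_no"]
        if current == roll_no:
            return mid
        if current < roll_no:
            low = mid + 1        # target can only be in the upper half
        else:
            high = mid - 1       # target can only be in the lower half
    return None


# Hypothetical records, sorted by roll number as Algorithm 2 requires.
sorted_students = [
    {"roll_no": 12, "name": "Student A"},
    {"roll_no": 20, "name": "Student B"},
    {"roll_no": 33, "name": "Student D"},
    {"roll_no": 47, "name": "Student C"},
]
```

Each iteration halves the remaining range, which is why the method only works on a sorted file.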
Algorithm 3 : Searching Database through the Index File
___________________________________________________________________

Input : Roll no. of a Student (search key field)
Output : Name of the Student if found and the time taken to access it.

Steps of the Algorithm :

1 : Get the contents of the index file.
2 : Store the records (key-value pairs) of the file into an object array (JSON array).
3 : Get the length of the array.
4 : Start a timer (using microtime()).
5 : Set flag = 0.
6 : Use binary search in the index file array to search for the given roll no.
7 : If found set flag = 1 and store the index key value in a variable.
8 : If flag == 0
        Print no student found.
9 : Else
        Get the contents of the main data file into an array (JSON).
        Get the name of the student from the array using the stored key (from index).
        Print name of student.
        Stop timer.
        Calculate time taken for accessing = timer_stop - timer_start.
        Display the time taken.
10: End
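
The lookup in Algorithm 3 can be sketched in Python as below. The representation of the index as `[roll_no, position]` pairs sorted by roll number is an assumption based on the description of the index file, and both arrays are passed in already decoded, whereas the listing loads the main data file only after the index search succeeds. Timing is omitted.

```python
from bisect import bisect_left
from typing import Dict, List, Optional


def search_via_index(index: List[List[int]],
                     main_records: List[Dict],
                     roll_no: int) -> Optional[str]:
    """Binary-search the sorted index, then fetch the record it points to."""
    # [roll_no] sorts just before [roll_no, position], so bisect_left lands on
    # the first index entry whose key is >= roll_no.
    i = bisect_left(index, [roll_no])
    if i == len(index) or index[i][0] != roll_no:
        return None
    position = index[i][1]               # array index into the main data file
    return main_records[position]["name"]


# Hypothetical unsorted main file and the sorted index that points into it.
main_records = [
    {"roll_no": 47, "name": "Student C"},
    {"roll_no": 12, "name": "Student A"},
    {"roll_no": 20, "name": "Student B"},
]
index = [[12, 1], [20, 2], [47, 0]]
```

The main file stays in insertion order; only the much smaller index has to be kept sorted.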

6. WORKING
The application is designed with a client-server architecture.
- The server is an Apache web server hosted locally using XAMPP.
- The client-side user interface is an HTML file (UI.html) that contains the
  database insertion and search forms.
- The main database files are JSON files that contain the student records.
- The index file is also a JSON file; it contains only the key field (roll
  no.) and the pointer (array index) of the corresponding record in the main file.
- The server-side backend pages are search.php and Student.php.
- search.php processes the search query (input roll no.) from the UI.
  It is here that the different retrieval algorithms are used.
- Student.php processes a new student entry (input details of the student)
  from the UI and stores it in the main data file. It is here that the
  index file is created and sorted each time a new entry is made in the
  database.
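
The insertion path described above (append the record, then rebuild and sort the index) might look roughly like the following Python sketch. It works on in-memory arrays and only serializes the index to JSON text; the function names, the `roll_no` and `name` fields and the `[roll_no, position]` index layout are assumptions, and the actual Student.php may differ.

```python
import json
from typing import Dict, List


def build_index(records: List[Dict]) -> List[List[int]]:
    """Return [roll_no, array position] pairs sorted by roll_no."""
    return sorted([record["roll_no"], position]
                  for position, record in enumerate(records))


def insert_student(records: List[Dict], roll_no: int, name: str) -> List[List[int]]:
    """Append a record to the main array and rebuild the sorted index."""
    records.append({"roll_no": roll_no, "name": name})
    return build_index(records)


# Hypothetical usage: two insertions into an empty database.
db: List[Dict] = []
insert_student(db, 47, "Student C")
idx = insert_student(db, 12, "Student A")
index_file_text = json.dumps(idx)        # text that would be written to the index file
```

Rebuilding the whole index on every insertion keeps the sketch simple; its cost grows with the number of records, which is acceptable for the medium-sized databases the project targets.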

7. EXPERIMENTAL RESULTS
Time taken to search for a record in the database using different database
retrieval methods.

                        | Sequential Search  | Binary Search on  | Search through
                        | on Unsorted Data   | Sorted Data File  | Index Table
Small Sized Database    |                    |                   |
Medium Sized Database   |                    |                   |
Large Sized Database    |                    |                   |

In the case of a small database ( records), the average time required to access
a record is higher when using the file index than when searching the main
database directly. This is because indexing requires access to two files (the
main database file and the index file).

In the case of a large database ( records), the average time required to access
a record is lower when using the file index than when searching the main
database directly. This is because the search on the index file is a binary
search and the index file is smaller than the main database file.


8. CONCLUSIONS

As the above discussion shows, for large databases indexed access is far more
efficient than direct data access from files. Although most database systems
today use multilevel indexing and B-tree structures for file access, these are
just another form of indexing: they use multilevel indexes that form a tree
structure, with tree pointers and data pointers in each node.

A DBMS is an intermediate layer between programs and the data. Programs
access the DBMS, which then accesses the data. There are different types of
DBMS, ranging from small systems that run on personal computers to huge
systems that run on mainframes. With the development of better indexing
algorithms, database systems become more efficient day by day and return query
results more quickly. Some commercially available database management systems
are dBase, FoxPro, IMS, Oracle, MySQL, SQL Server and DB2.


