1. DATABASE ENVIRONMENT
Definition of a Database:
It is a shared collection of interrelated data designed to meet the varied information needs of an
organisation.
It is a structured collection of stored operational data used by all the application systems of an
organisation, independent of any individual application.
It is a central source of data to be shared by many users for a variety of related applications.
Data as a Resource:
Information, which is the analysis and synthesis of data, has become one of the most vital
corporate resources.
Database Concepts:
The two essential concepts are data models and data independence.
A data model is the logical structure of the data as it appears at a particular level of the
database system. Each application which uses a database has its own data model.
Data Models
How data appears as viewed by different applications using the same database system.
E.g. the customer accounts file contains details about customers - the stock file contains details about goods.
Data Independence
Data models are not affected by any changes in storage techniques.
The central data model and its associated data models are distinct from the arrangement of data on any
particular storage medium.
1 dbase notes.doc
Compiled by W. Sithole
Entity
An object or event about which someone chooses to collect data is an entity. An entity may be a
person or a place, for example a sales person, a city or a product. An entity can also be an event
or a unit of time, such as a machine breakdown, a sale, a month or a year.
Entity Class
It is a collection of entities with similar characteristics, also known as an entity set or entity type.
Entities are grouped into classes for convenience.
Attribute
It is a property of a real-world entity rather than a data-oriented term.
It is a property of an entity, e.g. Customer:
Customer Number
Customer Name
Address
Telephone
Credit Limit
Balance
An attribute is some characteristic of an entity. There can be many attributes for each entity; for
example, a patient can have many attributes, such as last name, first name, address, city and so on.
The term data item is also used in conjunction with an attribute. Data element is simply a
synonym for data item.
Data items can have values. These values can be of fixed or variable length. They can be
alphabetic, numeric or alphanumeric. Sometimes a data item can be referred to as a field.
A field represents something physical, not logical, therefore many data items can be packed into a
field. A field can be read and converted into a number of data items. A common example of
this is to store the date in a single field as mm/dd/yyyy. In order to sort the file in date
order, three separate data items are extracted from the field and sorted first by year, then by month,
and finally by day.
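The date-sorting idea above can be sketched as follows; the date values are invented for illustration:

```python
# A field holding dates as a single "mm/dd/yyyy" string must be split into
# three data items (year, month, day) before the file can be sorted by date.
records = ["07/15/1999", "01/03/2001", "11/30/1999"]

def date_key(field):
    # Extract the three data items packed into the one physical field.
    month, day, year = field.split("/")
    # Sort first by year, then by month, and finally by day.
    return (int(year), int(month), int(day))

sorted_records = sorted(records, key=date_key)
print(sorted_records)
```

Sorting on the raw field would order the records by month first; extracting the data items gives true date order.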
Typical values assigned to data items may be numbers, alphabetic characters, special characters,
or a combination of all three.
Identifier
This is an attribute that uniquely distinguishes an entity from the rest, e.g. an EC Number identifies an
employee.
Association
Forms a relationship between two or more entities.
Direct representation of associations between entities distinguishes the database approach from
conventional file applications.
Relationships
These are associations between entities (sometimes they are referred to as data associations). They
imply that values for the associated data items are in some way dependent on each other.
Records
A record is a collection of data items that have something in common with the entity described.
Below is a diagram to illustrate the structure of a record
Order File
Order# Description Quantity Amount
Keys
A key is one of the data items in a record. When a key uniquely identifies a record, it is called a
primary key; for example, order# can be a primary key because there is only one number assigned to
each customer order. In this way a primary key identifies the real-world entity, that is, the customer order.
A key is called a secondary key if it cannot uniquely identify a record. Secondary keys can be
used to select a group of records that belong to a set, for example orders that come from the city of
Mutare. When it is not possible to identify a record uniquely by using one data item found in a
record, a key can be constructed by choosing two or more data items and combining them.
When a data item is used as a key in a record, its description is underlined, as with order# in the order
record above.
If an attribute is a key in another file, it is underlined with a dashed line (_ _ _ _ _) and is
a foreign key in this file.
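A minimal sketch of primary- versus secondary-key access, using Python dictionaries; the order numbers, names and cities are invented for illustration:

```python
# Primary key: order# uniquely identifies each record, so it can serve as the
# dictionary key. Secondary key: City does not identify a record uniquely,
# but selects a group of records (e.g. all orders from Mutare).
orders = {
    101: {"Customer": "Moyo",   "City": "Mutare", "Amount": 250},
    102: {"Customer": "Chirwa", "City": "Harare", "Amount": 400},
    103: {"Customer": "Dube",   "City": "Mutare", "Amount": 120},
}

record = orders[102]  # primary-key lookup: exactly one record
mutare = [k for k, r in orders.items() if r["City"] == "Mutare"]  # secondary key
print(record["Customer"], mutare)
```

The primary key returns one record; the secondary key returns the set of matching records.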
Metadata
Metadata is data about data: it describes the characteristics of the data items held in the database,
such as their names, types and lengths.
Example
Data Item Data Type Length
Name Character 10
Surname Character 15
Date of Birth Date 10
Weight Numeric 2
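The metadata table above could be held as a simple data-dictionary structure and used to validate values; the sample values and the `valid` helper are illustrative assumptions:

```python
# Each data-dictionary entry records the data item's type and maximum length,
# mirroring the metadata table above.
data_dictionary = {
    "Name":          {"type": str, "length": 10},
    "Surname":       {"type": str, "length": 15},
    "Date of Birth": {"type": str, "length": 10},  # dates kept as text here
    "Weight":        {"type": int, "length": 2},
}

def valid(item, value):
    # A value conforms if it has the declared type and fits the declared length.
    meta = data_dictionary[item]
    return isinstance(value, meta["type"]) and len(str(value)) <= meta["length"]

print(valid("Name", "Tendai"), valid("Weight", 425))
```

A weight of 425 fails because it exceeds the declared length of 2 digits.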
Data item
This is a unit fact, the smallest named unit of data in a database that has meaning to a user.
It is also known as data element, field, or attribute.
Preferred usage:
Data item - a unit of data.
Field - a physical rather than logical term that refers to the column position within a record where
a data item is located.
Examples:
Employee-Name, Student#
Data Aggregate
It is a collection of data items that is named and referenced as a whole
Example:
NAME = Last-Name, First-Name, Initials
In COBOL, data aggregates are referred to as group items. In the data dictionary they should
include: the data aggregate name, a description, and the names of the included data items.
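The NAME aggregate above can be modelled as a named group of data items, a rough analogue of a COBOL group item; the sample values are invented:

```python
from collections import namedtuple

# The data aggregate NAME = Last-Name, First-Name, Initials as a named group
# of data items that can be referenced as a whole or item by item.
Name = namedtuple("Name", ["last_name", "first_name", "initials"])

n = Name(last_name="Sithole", first_name="W.", initials="W.S.")
print(n.last_name)   # an individual data item
print(tuple(n))      # the aggregate referenced as a whole
```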
In the conventional file-processing approach, each user defines and implements the files needed for a
specific application. Data records are physically organised on storage devices using either sequential
or random file organisation, so that each application has its own separate data file or files and
software programs.
Example:
Although both users are interested in data about students, each maintains separate files and programs
to manipulate these files, and each requires data not available from the other's files.
This results in redundancy in defining and storing data, wasting storage space, and in redundant
effort to keep common data up to date.
In the Database Approach, a single repository of data is maintained, defined once and accessed by
various users.
What if the information required to solve a particular problem is located in more than one file?
Often extra programming and data manipulation will be required to obtain that information, for
example:-
Suppose you want to know all of the orders outstanding for a particular
customer. Some of the information is maintained in the order file, for an order
entry application. The rest of the information is maintained in a customer master
file. Thus the required information is stored in several files, each of which is
organised in a different way. To extract the required information, there is need
to sort both files until the records are arranged in the same order. Records from
these files will have to be matched, and the data items from the merging of both
files will have to be extracted and output.
- Obtaining this information requires additional programming and creation of more files
- Most organisations have developed information systems one at a time, as the need arises,
each with its own set of programs, files and users. After some time, these applications
and files may reach a point where the organisation's information resources may be out
of control.
- Some symptoms of this crisis are:
Data redundancy (similar data in different files)
Program or Data Dependency
Data Confusion (caused by continuously opening and closing different
files)
Excessive costs
1. Data Redundancy
Refers to the presence of duplicate data in multiple data files or in several data files. The
same piece of data, such as employee name and address, will be maintained and stored in
several different files by several systems. Separate software programs must be developed
to update this information and keep it current in each file in which it appears.
2. Program/Data Dependency
Refers to the close relationships between data stored in files and specific software
programs required to update and maintain these files. Every computer program or
application must describe the location of the data it uses. In a traditional file
environment, any change to the format or structure of the data in the file necessitates a
change in all of the software programs that use the data.
These problems can be visualised through the following illustration, in which two separate
applications each maintain their own copy of the customer data:

Loan Accounting System - Loan Account file:
  Cust Name, Social Security#, Address, Loan A/C ID, Interest Rate, Loan Period, Loan Balance

Checking Account Accounting System - Checking Account file:
  Cust Name, Social Security#, Address, Checking A/C ID, Account Balance
Advantages of conventional file processing:
1. Easy to create and simple to use
2. Requires minimal overhead to access and use
In the conventional file processing, the user defines and implements files for specific applications.
In the database approach, a single repository of data is maintained and defined once and accessed
by various users.
Four characteristics most important in distinguishing a database system from a traditional file
processing system are:
In a database system, the DBMS access programs are written independently of any
specific files. The structure of data files is stored in the DBMS catalog separately from
the access programs. This is called program-data independence.
DBMS CONCEPTS
A data model is the main tool for providing abstraction. It is a set of concepts used to describe the
structure of a database. It includes a set of operations for specifying retrievals and updates.
It is important to distinguish between the description of a database and the database itself. The
description of a database is called a database schema. The data in the database at a particular
moment is called a database instance.
DBMS ARCHITECTURE
Here we are looking at an architecture for database systems, called the three-level-schema
architecture.
The goal of the three-level schema architecture is to separate the user applications from the
physical database. In this architecture, schemas can be defined at the following three levels:
The internal level has an internal schema, which describes the physical storage structure of the
database. It is the level closest to physical storage, that is, the one concerned with the way
the data is physically stored, and it is usually the view taken by systems programmers. The
systems programmer is concerned with the actual physical organisation and placement of the data
elements in the database. The internal view is the internal or hardware view of the database. The
internal schema uses a physical data model and describes the complete details of data storage and
access paths for the database. The systems programmer designs and implements this view by
allocating cylinders, tracks and sectors for the various segments of the database, so that the various
programs can run as smoothly and efficiently as possible.
The conceptual level has a conceptual schema, which describes the structure of the whole database
for a community of users. It is a logical view. It is how the Database appears to be organised to
the people who designed it. The conceptual schema is a global description of the database that
hides the details of physical storage structures and concentrates on describing entities, data types,
relationships and constraints. It is the view usually used by the Database Administrator. It includes
all the data elements in the Database and how these data elements logically relate to each other.
The external or view level includes a number of external schemas or user views. It is the one
concerned with the way the data is viewed by individual users, and is usually used by an
application programmer. Each external schema describes the database view of one group of
database users. Each view typically describes the part of the database that a particular user group
is interested in and hides the rest of the database from that user group.
END USERS
     |
EXTERNAL LEVEL         EXTERNAL SCHEMAS
     |   external/conceptual mapping
CONCEPTUAL LEVEL       CONCEPTUAL SCHEMA
     |   conceptual/internal mapping
INTERNAL LEVEL         INTERNAL SCHEMA
     |
STORED DATABASE
External View A        External View B
     |   External/Conceptual Mappings A and B
Conceptual View (DBMS)
     |   Conceptual/Internal Mapping
Internal View
     |
Stored Database
Data Independence
The three-schema architecture can be used to explain the concept of data independence, which can
be defined as the capacity to change the schema at one level of a database system without having to
change the schema at the next higher level. There are two types of data independence:
Logical data independence is the capacity to change the conceptual schema without
having to change external schemas or application programs. We may change the conceptual
schema to expand the database by adding a new record type or data item, or to reduce the
database by removing a record type or data item.
Physical data independence is the capacity to change the internal schema without having
to change the conceptual (or external) schemas. Changes to the internal schema may be
needed because some physical files are reorganised for example, by creating additional
access structures to improve the performance of retrieval or update. If the same data as
before remains in the database, we should not have to change the conceptual schema.
1. Reduced Redundancy
Improved consistency of data, while reducing the waste in storage space, results from reduced
redundancy.
2. Data Independence
A Database system keeps descriptions of data separate from the applications that can
occur without necessarily requiring changes in every application program that uses the
data.
Most DBMS offer application program development tools that help application
programmers write program code. These tools can be very powerful, and they
usually improve an application programmer's productivity substantially. Object-oriented
databases provide developers with libraries of reusable code to speed up development of
applications. Users also increase their productivity when query languages and report
generators allow them to produce reports from the database with little technical
knowledge and without any help from programmers, thus avoiding the long time
periods that MIS departments typically take to develop new applications. The result is
greater use of the corporate database for ad-hoc queries. Users also increase their
productivity when they use microcomputer software designed to work with mainframe
databases. This allows them to acquire and manipulate data with ease, without requiring
the assistance of programmers.
7. Improved Data Integrity: Because data redundancy is minimised, the threat to data
integrity is reduced. Data integrity ensures that the data in the database is accurate. Updated
values are available to all applications, ensuring data consistency across applications.
Problems/Disadvantages of Databases
DBMS provide many opportunities and advantages, but these advantages may come at a price.
DBMS also poses problem as:-
1. Resource Problems
Characterised by a high initial investment and the possible need for additional hardware. A
DBMS is a large software system, requiring a fairly large computer to support its creation and
maintenance. A database system usually requires extra computing resources: the new
database system programs must run, and much more data must be stored on-line to answer
queries, whose use we hope will increase. As a result more terminals may be needed to put
managers and other users on-line, in addition to the extra hard disk capacity needed to put
more data on-line and make it available to managers. Communications devices may be
needed to connect the extra terminals to the database. It may even be necessary to increase
the size or number of CPUs to run the extra software required by the database system.
Currently PCs are becoming more powerful and DBMS more compact, so this problem is
becoming less serious. It is also being overcome by the availability of distributed relational
databases.
2. Security Problems
A database must have sufficient controls to ensure that data is made available to
authorised personnel only and that adding, deleting and updating of data in the database is
accomplished by authorised personnel only. Access security means much more than
merely providing log in codes, account codes and passwords. Security considerations
should include some means of controlling physical access to terminals, tapes, and other
devices. Security considerations should also include the non-computerised procedures
associated with the database such as forms to control the updating or deletion of records
or files and procedures for storing source documents. In addition, access to employee,
vendor, and customer data should conform to various state regulations, such as the 1974
Privacy Act, and the 1978 Right to Financial Privacy Act. Certainly the database should
contain an archiving feature to copy all important files and programs, and there should be
procedures for regular update and storage of these archival copies.
3. Ownership Problem
In file-based systems, employees who run application programs on application-specific
files frequently feel that the data in these files are theirs and theirs alone. Users, such as the
payroll and personnel departments, develop ownership of the files in the system. When a
database of such files is created, the data is owned by the entire company: any user with
a need should be able to obtain the authority to read or otherwise access the data.
For a database to be successful, the data must be viewed and treated as a
corporate resource, not as an individual's property.
Security and integrity may be compromised if the DBA does not administer the database
properly.
The organisation incurs an overhead cost for providing security, concurrency
control, recovery and integrity functions.
The generality with which the DBMS provides for defining and processing data can also
be problematic.
A Database Management System (DBMS) is a layer of software which maintains the database and
provides an interface to the data for the application programs that use it.
It allows the creation, accessing, modification and updating of the database, the retrieval of data
and the generation of reports.
The DBA (Database Administrator) ensures that the database meets its objectives. The DBA's
functions are:
To define, implement and control the database storage, including the structure of the database.
To coordinate the data resources of the whole enterprise using user and management
cooperation.
To ensure that policies and procedures are established to guarantee effective production,
control and use of data.
To decide on the information content of the database & structure of different data models.
The data hierarchy, from smallest to largest unit:
BIT
BYTE/CHARACTER
DATA ELEMENT/FIELD
RECORD
FILE
DATABASE
- In a DBMS, applications do not obtain the data they need directly from the storage media
(database)
- They request the data from the DBMS
- The DBMS then retrieves the data from the storage media and provides them to the application
programs
- A DBMS operates between application programs and the data
The illustration below shows the relationship of Application Programs, the DBMS and the Database.
COMPONENTS OF A DBMS
DBMS system software is usually developed by commercial vendors and purchased by organisations.
Some of these components are typically used by information specialists in the system, for example,
information systems specialists typically use the Data Dictionary, Data Languages, Teleprocessing Monitor,
Applications Development Systems, Security Software and archiving and recovery system components of
DBMS.
Other components such as Report Writers and Query Languages may be used by both programmers and
other non-specialists.
DATA DICTIONARY
Contains the names and descriptions of every data element in the Database.
Through the use of its data dictionary, a DBMS stores data in a consistent manner thus reducing
redundancy. For example, the data dictionary ensures that the data element representing the number of an
inventory item named (stocknum) will be of uniform length and have other uniform characteristics
regardless of the application program that uses it.
Application developers use the data dictionary to create the records they need for the programs they are
developing
A Data Dictionary checks records that are being developed against the records that already exist in the
database and prevents redundancy in data element names.
Because of the data dictionary an application program does not have to specify the characteristics of the
data it wants from the database. It merely requests the data from the DBMS
This may permit changing the characteristics of a data element in the data dictionary without changing it in
all the application programs that use the data element
Defines Metadata
DATA LANGUAGES
To place a data element into the Data Dictionary, a special language, the Data Definition Language
(DDL), is used to describe the characteristics of the data element.
To ensure uniformity in accessing data from the database, a DBMS will require that standardised commands
be used in application programs.
These commands are part of a specialised language, the Data Manipulation Language (DML), used by
programmers to retrieve and process data from the Database.
A DML usually consists of a series of commands such as FIND, GET, APPEND, etc.
These commands are placed in an application program to instruct the DBMS to get the data the application
needs at the right time
SECURITY SOFTWARE
A security software package provides a variety of tools to shield the Database from unauthorised access.
ARCHIVING AND RECOVERY SYSTEMS
Archiving programs provide the Database Manager with the tools to make copies of the database, which
can be used in case the original database records are damaged.
Restart or recovery systems are tools used to restart the database and to recover lost data in the event of a
failure.
REPORT WRITERS
A Report Writer allows programmers, managers and other users to design output reports without writing
an application program in a programming language such as COBOL.
QUERY LANGUAGES
A Query Language is a set of commands for creating, updating and accessing data from a Database.
Query Languages allow users to ask ad-hoc questions of the database interactively, without the aid of
programmers.
SQL is a set of English-like commands that has become a standard in the database industry.
Because SQL is used in many DBMS, managers who understand SQL syntax are able to use the same set of
commands regardless of the DBMS.
This software must provide the manager with access to data in many Database Management Environments.
The basic form of an SQL command is:-

SELECT [field names] FROM [file/table name] WHERE [condition]

After SELECT you list the names of the fields you want retrieved
After FROM you list the name of the file or group of records that contain those fields
After WHERE you list any condition for the search of the records
Example:
If you wish to select all customer names from the customer database where the city in which the customer
lives is Harare
Solution:
SELECT Customer-Name FROM Customer WHERE City = 'Harare'
OR
SELECT * FROM Customer WHERE City = 'Harare'
The results would be a list of ALL fields/specified fields of customers located in Harare only
Query Languages allow users to retrieve data from the database without having detailed information about
the structure of the records and without being concerned about the processes the DBMS uses to retrieve the
data. Furthermore, managers do not have to learn COBOL, BASIC, etc.
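The SELECT ... FROM ... WHERE form described above can be tried against an in-memory SQLite database; the Customer table and its rows are invented for illustration:

```python
import sqlite3

# Build a small, invented Customer table in memory.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE Customer (CustomerName TEXT, City TEXT)")
con.executemany("INSERT INTO Customer VALUES (?, ?)",
                [("Moyo", "Harare"), ("Dube", "Mutare"), ("Chirwa", "Harare")])

# Retrieve a specified field for customers located in Harare only.
names = [row[0] for row in con.execute(
    "SELECT CustomerName FROM Customer WHERE City = 'Harare'")]
print(names)
```

The same query with SELECT * would return all fields of the matching records instead of the single specified field.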
TELEPROCESSING MONITOR
It is a communications software package that manages communication between the database and remote
terminals
Teleprocessing monitors often handle order entry systems that have terminals located at remote sales
locations.
These may be developed by DBMS software firms and offered as companion packages to their database
products.
DEVELOPMENT OF THE THREE-SCHEMA ARCHITECTURE
1. In the earliest data processing applications there was no formal data management software; all data
descriptions and input/output instructions were coded in each application program. This resulted in no
data independence: every change to a data file required modification or rewriting of the application
program.
2. Access methods were the first formal data management software. An access method is a software routine
that manages the details of accessing and retrieving records in a file, providing storage independence:
storage units can be changed (newer units replacing older ones) without altering or modifying application
programs.
3. The two-level schema (two-schema architecture) was what the earliest database management systems
employed. A logical schema corresponds to an external or user view that describes the data as seen by
each application program. A physical schema corresponds to the internal schema that describes the
representation of data in computer facilities. This provided physical data independence, that is, the
data structures or methods of representing data in secondary storage could be altered without modifying
application programs; e.g. to achieve efficiency, linked lists could be used instead of indexes without
changing application programs. The two-level schema was characteristic of structured database
management systems, such as those that use the hierarchical and network data models. It did not
provide logical data independence.
4. The three-level schema is provided by contemporary relational DBMS. The conceptual schema provides an
integrated view of the data resource for the entire organisation. This conceptual schema evolves over
time: new data definitions are added to it as the database grows and matures. It provides both logical
and physical data independence. With logical data independence, the conceptual schema can grow and
evolve over time without affecting the external schemas, so existing application programs need not be
modified as the database evolves.
A database management system that provides these three levels of data is said to follow a three-schema
architecture.
A schema is a logical model of a database. It captures the metadata that describe an organisation's data in a
language that can be understood by the computer.
Physical data independence insulates a user from changes to the internal model
Logical data independence insulates a user from changes to the conceptual model.
FILE ORGANISATION
A file contains groups of records used to provide information for operations, planning, decision
making etc.
File organisation is a technique for physically arranging the records of a file on a secondary storage device.
File Organisation:
- Sequential
- Indexed
    - Hardware independent (VSAM)
    - Hardware dependent (ISAM)
- Direct
b) Indexed sequential: an index of key values (H, P, Z) points into data records stored in key
sequence (A D H | K M P | Q Z).
c) Relative: records (Chess, Combat, Defender, Faceoff, ..., Zaxxon) occupy relative record
numbers 1, 2, 3, 4, ..., n.
d) Hashed: Key -> Hashing routine -> Relative record #
a) Sequential organisation / Sequential access:
The physical order of records in the file is the same as the order in which records were written to
the file, normally in ascending order of primary key. Accessing a record is possible only by first
accessing all records that physically precede it.

b) Indexed sequential organisation / Random or sequential access:
Records are stored in physical sequence according to the primary key. The file management system,
or access method, builds an index, separate from the data records, that contains key values and
pointers to the data records themselves. Random access of individual records is possible without
accessing other records, and the entire file can also be accessed sequentially.

c) Relative organisation / Relative access:
Also known as direct file organisation. Records are often loaded in primary-key sequence so that
the file can be processed sequentially, but records can also be in random sequence. Each record
can be retrieved by specifying its relative record number, which gives the position of the record
relative to the beginning of the file. The user or application program has to specify the relative
location of a desired record.

d) Hashed organisation / Relative access:
Also known as direct file organisation in which hash addressing is used. The primary key value for
a record is converted by an algorithm (called a hashing routine) into a relative record number.
Records are not in logical order: the hashing algorithm scatters records throughout the file,
normally not in primary-key sequence. A record is located by its relative record number, as for
relative organisation.
File organisation is rarely changed but record access mode can change each time the file is used.
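The indexed-sequential idea above can be sketched with an in-memory index; the keys and record contents are invented for illustration:

```python
# Records stored in ascending primary-key order (sequential processing works),
# plus a separate index mapping key values to record positions, so that an
# individual record can be fetched without reading the records that precede it.
records = [("A", "apple"), ("D", "date"), ("H", "hazel"), ("K", "kiwi")]
index = {key: pos for pos, (key, _) in enumerate(records)}

def fetch(key):
    # Random access via the index - no sequential scan required.
    return records[index[key]][1]

print(fetch("H"))
```

A real access method keeps the index on disk alongside the data file, but the key-to-position mapping works the same way.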
Since it is not feasible to reserve a physical address for every possible key value, a method
called Hashing is used. Hashing is the process of calculating an address from the record
key.
Suppose that there were 500 employees in an organisation and we wanted to use the
Social Security Number as a key, it would be inefficient to reserve 999 999 999
addresses, one for each social security number.
Therefore, we could take the social security number and use it to derive the address of the
record. There are many hashing techniques; a common one is to divide the key by a prime
number that approximates the number of storage locations and then use the remainder as the
address. This is known as the Division Method:

Begin with the Social Security Number 053-4689-42 (53468942). Dividing by 509
yields 105047, ignoring the remainder. Note that 105047 multiplied by 509
equals 53468923, not 53468942; the difference between the original number
53468942 and 53468923 is 19, so the record is stored at address 19.

The record for an employee whose Social Security Number is 472-3840-86 yields
the same remainder, 19.
When this occurs, the second person's record should be placed in a special
overflow area.
Example:
Qn. Given the number 472-3840-86, divide by 509 (a prime number) and find the
physical location.
Solution:
472384086 / 509 yields 928063, ignoring the remainder.
928063 x 509 = 472384067
The original number 472384086 minus 472384067 gives a difference of 19.
Therefore the physical location is 19.
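The Division Method above amounts to taking the key modulo the prime, which Python computes directly; the two Social Security Numbers are the ones from the worked examples:

```python
# Division method: divide the key by a prime near the number of storage
# locations (509 for roughly 500 employees) and use the remainder as the address.
def hash_division(key, prime=509):
    return key % prime

addr1 = hash_division(53468942)    # SSN 053-4689-42
addr2 = hash_division(472384086)   # SSN 472-3840-86
print(addr1, addr2)  # both hash to the same address - a collision
```

Both keys map to address 19, illustrating why an overflow area (or another collision-resolution scheme) is needed.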
Modular Arithmetic
Divide the key by the number of locations available for storage, and take the remainder;
for example, with 100 locations and the 4-digit key 1537:
1537 / 100 = 15 remainder 37
Therefore the storage location is 37
Alphanumeric keys need to be converted to numbers first, using base 36 or the ASCII code
for each character or digit
Folding
Divide the key into two or more parts and add the parts together. For example, 872377
splits into 872 and 377:
872 + 377 = 1249
Then apply the Modular Arithmetic method to the result
b) BURNS 10652
Hashing works like a one-way street: it cannot be worked backwards.
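Folding followed by modular arithmetic can be sketched as below; the key 872377 is the one from the example, and the table size of 100 locations is an assumption carried over from the Modular Arithmetic example:

```python
# Folding: split the key into fixed-size parts, add them, then apply modular
# arithmetic to the sum to obtain a storage location.
def fold(key, part_len=3):
    digits = str(key)
    parts = [int(digits[i:i + part_len]) for i in range(0, len(digits), part_len)]
    return sum(parts)

folded = fold(872377)      # 872 + 377
location = folded % 100    # then apply modular arithmetic (100 locations assumed)
print(folded, location)
```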
Advantages:
Supports applications demanding quick record retrieval because locating and reading
desired record into memory usually requires a single access to the disk.
Involves a single calculation for finding the record number
Permits both numeric and alphanumeric keys
Easily implemented with COBOL, C, PASCAL instructions
Disadvantages:
Hashing algorithms might result in collisions, that is, identical remainders, called
crashes or synonyms. For example, Product Numbers C-64744 and F-42742 both
yield remainder 9739 when divided by 11001.
When collisions occur an indicator is stored in the first record to warn a user of the
crash. The indicator reveals where the other record really resides.
Due to collisions, extra disk space is allotted for a record that would otherwise
collide with another
Due to random order of the file, a sorting step must occur before listing or otherwise
processing the file in sequence
When file becomes full, a programmer writes a one-time program to rebuild it with
expansion space.
NOTE:
- This is a table scheme in which updates, searches and deletions can ideally be done in constant
time
- We seek a mathematical function which produces table addresses when supplied with the key
- Since there are many more possible key values than addresses, this is a many-to-one function
in which many different key values can lead to the same address.
- Since we do not know in advance which keys will arise, it is possible that 2 keys with the same
address will arrive and a hash collision will occur
- Therefore to design a good hash table we must find a solution to the following two problems:-
a) Find a hash function that minimises the number of collisions by spreading arriving records
around the table as evenly as possible
b) Since any hash function is many-to-one collisions are inevitable and therefore a good way of
resolving them is necessary
- There are basically four methods which are used to produce hash tables which are:- (mainly for
system software programming)
1) Truncation
2) Division
3) Midsquare
4) Partition/Folding
TRUNCATION
- This method takes the last few digits of the key as the address
eg h(2467) = 467
h(12601) = 601
h(12467) = 467
Note that 2467 and 12467 produce the same address, an example of a collision.
DIVISION
- Take the key and MOD it by MAXSIZE, that is, use the function:-
key MOD MAXSIZE
MIDSQUARE
- It converts the key (for example a filename) into its decimal equivalent, finds the middle
digit and squares it to give the address
eg h(49294): middle digit 2, so 2 x 2 = 4
h(24683): middle digit 6, so 6 x 6 = 36
PARTITION/FOLDING
- This method divides the number into groups, adds the individual groups to give the address
eg 510324: 51 + 03 + 24 = 78
Therefore h(510324) = 78
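The four methods above can be sketched in Python, following the definitions used in these notes (the digit counts and the table size MAXSIZE are illustrative assumptions):

```python
# Sketch of the four hash-address methods described in the notes.

def truncation(key, digits=3):
    # Keep only the last `digits` digits of the key, e.g. h(2467) = 467.
    return key % (10 ** digits)

def division(key, maxsize=509):
    # key MOD MAXSIZE, here using a prime table size.
    return key % maxsize

def midsquare(key):
    # Find the middle digit of the key and square it, e.g. 49294 -> 2 -> 4.
    s = str(key)
    mid = int(s[len(s) // 2])
    return mid * mid

def folding(key, group=2):
    # Split the digits into groups and add them, e.g. 510324 -> 51+03+24 = 78.
    s = str(key)
    return sum(int(s[i:i + group]) for i in range(0, len(s), group))
```

With these definitions, `division(53468942)` returns the remainder 19 used in the earlier worked example.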
- The technique of searching in a systematic and repetitive fashion for an alternative location is called
PROBING
Incrementing function
- The incrementing function takes an address (i), not a key, and produces another hash address
- If the new location is occupied, we pass that address through the incrementing function again,
and so on, until we find an open location; with luck we are able to place the record in a few probes
- Therefore each position needs an indicator to tell whether it is occupied or unoccupied. Using
linear probing we first apply h(k) and then as many increments (i) as we need
Disadvantages:
- Linear probing results in clustering: synonyms end up adjacent to each other and mixed with
other records, and as the table fills these clusters inevitably grow larger and larger,
making update, search and delete operations run more slowly
Advantages:
- It is suitable for small lists
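A minimal linear-probing table can be sketched as follows; the table size of 11 and the use of a Python list as the storage area are assumptions for the demo:

```python
# Linear probing: apply h(k) = key MOD MAXSIZE, then increment the address
# repeatedly until a free slot is found.

MAXSIZE = 11

def insert(table, key):
    i = key % MAXSIZE                 # home address h(k)
    for _ in range(MAXSIZE):
        if table[i] is None:          # unoccupied: place the key here
            table[i] = key
            return i
        i = (i + 1) % MAXSIZE         # occupied: apply the increment and retry
    raise RuntimeError("table full")

def search(table, key):
    i = key % MAXSIZE
    for _ in range(MAXSIZE):
        if table[i] is None:
            return None               # empty slot reached: key not present
        if table[i] == key:
            return i
        i = (i + 1) % MAXSIZE
    return None

table = [None] * MAXSIZE
insert(table, 22)   # home address 0
insert(table, 33)   # also hashes to 0: a collision, probed into slot 1
```

The two keys 22 and 33 are synonyms (both hash to 0), so they end up in adjacent slots, which is exactly the clustering effect described above.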
The basic idea behind clustering is to try and store records that are logically related and
physically close together on disk.
Suppose the stored record most recently accessed is record R1, and suppose the next
stored record required is record R2. Suppose also that stored R1 is stored on page P1 and
R2 is stored on page P2. Then:-
1. If P1 and P2 are one and the same, then the access to R2 will not require any
physical input or output at all, because the desired page P2 will already
be in a buffer in main memory
2. If P1 and P2 are distinct but physically close together in particular if they
are physically adjacent then the access to record R2 will require a physical
input/output (unless of course Page P2 also happens to be in a main
memory buffer), but the seek time involved in that input/output will be
small, because the read/write heads will already be close to the desired
position. In particular, the seek time will be 0 if P1 and P2 are in the same
cylinder.
Indexing
This is another file organisation method, which is divided into two areas namely:-
1. The Data Area
Contains all the records with all values or entries organised
sequentially which can be in ascending order
2. Index Area
Contains the record key per given track number. This record
key must be the highest in that track number. The 2 areas are
linked or joined by pointers
A non-dense index, sometimes called a sparse index, does not contain an entry for every stored
record in the indexed file (1:m); it uses less storage space by keeping one index entry for a
group of records. For example, with two supplier records per page, the index holds only the
highest key on each page:
Index        Page of supplier records
S2      ->   S1 Smith London | S2 Jones Paris
S4      ->   S3 Blake Paris  | S4 Clarke London
S6      ->   S5 Adams Athens | S6 Brown Paris
3. Compression Techniques
This is a way of minimising amount of storage for stored data by replacing the data with some
representation.
Front Compression
Rear Compression
Hierarchical Compression
Front Compression:
Example:
The following 4 names appear in a stored table. The field length is 10 characters. Apply front
differential compression:
Farai
Farasiya
Farisai
Farikayi
Solution (b denotes a trailing blank used to pad the 10-character field):
Farai      0 - Faraibbbbb
Farasiya   4 - siyabb
Farisai    3 - isaibbb
Farikayi   4 - kayibb
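The front-compression rule above can be sketched as follows; padding blanks are dropped and each name is stored as the count of leading characters shared with the previous name, plus the remaining characters:

```python
# Front (differential) compression of a sorted list of names.

def front_compress(names):
    out, prev = [], ""
    for name in names:
        # Count the leading characters shared with the previous entry.
        n = 0
        while n < min(len(prev), len(name)) and prev[n] == name[n]:
            n += 1
        out.append((n, name[n:]))   # (shared-prefix length, remainder)
        prev = name
    return out

front_compress(["Farai", "Farasiya", "Farisai", "Farikayi"])
# -> [(0, 'Farai'), (4, 'siya'), (3, 'isai'), (4, 'kayi')]
```

The counts 0, 4, 3, 4 match the worked solution above.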
Rear Compression:
The following names appear in a stored table. The field length is 15 characters. Apply rear
compression.
Abrahams,GK
Ackermann,LZ
Ackroyd,S
Adams,T
Adams,TR
Adamson,CR
Allen,S
Ayres,ST
Bailey,TE
Baileyman,D
Solution:
Expanded form
Abrahams,GK 0-2 Ab Ab
Ackermann,LZ 1-3 cke Acke
Ackroyd,S 3-1 r Ackr
Adams,T 1-6 dams,T Adams,T
Adams,TR 7-1 R Adams,TR
Adamson,CR 5-1 o Adamso
Allen,S 1-1 l Al
Ayres,ST 1-1 y Ay
Bailey,TE 0-7 Bailey Bailey
Baileyman,D 6-1 m Baileym
Hierarchical Compression:
A supplier stored file might be clustered by values of the city field, for example all London
suppliers would be stored together, etc. The set of all supplier records for a given city might be
compressed into a single hierarchic stored record, in which the city value in question appears only
once, followed by all the other details for each supplier who happens to be in that city.
Athens:  S5 Adams 30
London:  S1 Smith 20, S4 Clark 20
Paris:   S2 Jones 10, S3 Blake 30
Intra-file
Compression applied within a single file (the example above, spread across pages P1 and P2).
Inter-file
Combines the supplier and shipment files into a single file and then applies intra-file compression
to that single file.
The hierarchical and network models use standard files and provide structures that allow them to be cross-
referenced and integrated. They have been available since early 1970s. The relational model uses tables to
store data. It provides the ability to cross-reference and manipulate the data and it provides for data
integrity. The object-oriented model uses objects.
The uppermost record in the tree structure is called the root record. From there, data are organised
into groups: a parent record can have many child records (children of the same parent are called
siblings), but each child record can have only one parent record. Parent records are higher in the
data structure than child records; however, each child can itself become a parent and have its own
child records. Because relationships between data items follow defined paths, access to the data is
fast. However, any relationship between data items must be defined when the database is being created.
Parent-Child relationship
1. One record type, called the root of the hierarchical schema does not participate as a child record
type in any Parent-Child relationship (PCR)
2. Every record type except the root participates as a child record type in exactly one PCR type
3. A record type can participate as parent record type in any number (zero or more) of PCR types
4. A record type that does not participate as parent record type in any PCR type is called a LEAF of
the hierarchical schema
5. If a record type participates as parent in more than one PCR type then its child record types are
ordered. The order is displayed, by convention, from left to right in a hierarchical diagram.
A network database is similar to a hierarchical database except that each record can have more than one
parent, thus creating a many-to-many relationship among the records. For example, a customer may be
called on by more than one salesperson in the same company, and a single salesperson may call on more
than one customer. Within this structure, any record can be related to any other data element.
The main advantage of a network database is its ability to handle sophisticated relationships among various
records. Therefore more than one path can lead to a desired data level.
The network database structure is more versatile and flexible than is the hierarchical structure because the
route to data is not necessarily downwards, it can be in any direction.
In the network structure again similar to the hierarchical structure data access is fast, because relationships
must be defined during the database design. However network complexity limits users in their ability to
access the database without the help of programming staff.
A relational database is composed of many tables in which data are stored, but a relational database
involves more than just the use of tables. Tables in a relational database must have unique rows, and the
cells (the intersections of a row and a column - equivalent to a fields) must be single-valued (that is, each
cell must contain only one item of information, such as a name, address, or identification number).
A row is called a tuple and a column is called an attribute. The data type describing the types of values
that can appear in each column is called a domain.
Domain
Is a set of atomic values. An atomic value means that each value in the domain is indivisible as far as the
relational model is concerned.
Relation
A relation schema is a set of attributes and is used to describe a relation. The degree of a relation
is the number of attributes n of its relation schema.
A relation is defined as a set of tuples, and the tuples in a relation do not have any particular
order. Values within a tuple are ordered. Values in a tuple are atomic, so composite and multivalued
attributes are not allowed; this is the First Normal Form assumption.
Tuple
All tuples in a relation must be distinct - no two tuples can have the same combination of values for
their attributes. A super key is a set of attributes whose combined values cannot be duplicated within
a relation, e.g. {studentID, Name, Age}; if no attribute can be removed without losing this property,
the super key is minimal. A relation may have more than one minimal key - each of these is called a
candidate key.
Example:
studentid and candidate_number
Example:
Relation of degree 7
A database management system that allows data to be readily created, maintained, manipulated, and
retrieved from a relational database is called Relational Database Management System (RDBMS). The
RDBMS, not the user, must ensure that all tables conform to the requirements. The RDBMS also must
contain features that address the structure, integrity and manipulation of the database.
In a relational database, data relationships do not have to be predefined. Hence users can query a relational
database and establish data relationships spontaneously by joining common fields. A database query
language is a helpful tool that acts as an interface between users and a relational DBMS. The language
helps the users of a relational database to easily manipulate, analyse and create reports from the data
contained in the database. It is composed of easy-to-use statements that allow people other than
programmers to use the database.
While the relational model is well suited to the needs of storing and manipulating business data, it is
not well suited for handling the data needs of certain complex applications, such as computer-aided
design (CAD) and computer-aided software engineering (CASE).
Business data follow a defined data structure that the relational model handles well. However,
applications such as CAD and CASE deal with a variety of complex data types that cannot be easily
expressed in relational models. Such programs also require massive amounts of persistent data (data
that outlive the program execution that created them), and a database for them must be able to
evolve without affecting the data in memory that the application uses to operate.
An object-oriented database uses objects and messages to accommodate new types of data and provide for
advanced data handling. A database management system that allows objects to be readily created,
maintained, manipulated and retrieved from an object-oriented database is called an Object-Oriented
Database Management System (OODBMS)
An object-oriented database management system must still provide features that you would expect in any
other database management system, but there is still no clear standard for the object-oriented model.
A logical database design is a detailed description of a database in terms of the ways in which the users will
use the data.
During this phase an analyst performs a detailed study of the data identifying how the data is grouped
together and how they relate to each other. An analyst must also determine which fields have multiple
occurrences of data, which fields will be keys or indexes and the size and type of each field.
A schema is a complete description of the contents and structure of a database. It defines the database
to the system, including the record layout, the names, lengths and sizes of all fields, and the data
relationships.
A subschema defines each user's view, the specific parts of the database that a user can access. A
subschema restricts each user to certain records and fields within the database. Every database has one
and only one schema, but each user has his or her own subschema.
In SQL, commands are given to define the structure of the database. Each database is identified by a name,
which is given in a CREATE DATABASE command.
The entities are defined as tables, with each attribute defined as a column in the table. A table then is given
a name, and each attribute declared by giving it a column name and stating its type. Supported data types
include:-
CHARACTER - values
SMALLINT - A restricted range of integers
DECIMAL - Which allow a fixed number of decimal places
FLOAT - For floating point values
MONEY - Currency values
DATE - For dates
Each data type allows a certain set of possible values. There is also the possibility of a column
having an unknown value, called NULL. When a column is specified, it is assumed to allow a NULL
value unless the phrase NOT NULL is specified.
NULL values should not be allowed in any column which forms part of the primary key of the table.
The name Art.db is chosen for the database, while the tables are called painting, artist and gallery.
The data type MONEY has been used, so it is assumed to be supported by the implementation. The only
column which allows a NULL value is Nationality in the artist table. A NULL value in this column of a
particular row would mean that the actual value is unknown.
UNIQUE INDEXES are defined on the tables for the primary keys to prevent the system allowing rows in
the tables with duplicate values in the key.
Instead, an INDEX is created for the key and is specified as unique, so that any attempt to add rows with
same key will be trapped as an error. For the gallery and artist tables, the key has just one component
attribute, but the key for the painting table has two attributes and the index is created for the pair (title,
artist-name).
Indexes may be created for any number of columns in a table. Usually their purpose is to speed up access
to the data using the column value. Each index must be given a name, although the name is not used again
unless the index is to be deleted. The names used for the indexes in the illustration above are
painting.idx, artist.idx and gallery.idx.
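Since the original illustration is not reproduced in these notes, the following is a hedged reconstruction of the schema and its unique indexes, run in SQLite from Python. The exact column names, the mapping of MONEY to REAL, and the use of underscores in identifiers are assumptions:

```python
import sqlite3

# Hedged reconstruction of the Art.db schema described in the text.
db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE gallery (
    gallery_name  TEXT NOT NULL,
    city          TEXT NOT NULL
);
CREATE TABLE artist (
    artist_name   TEXT NOT NULL,
    nationality   TEXT            -- the only column allowed to be NULL
);
CREATE TABLE painting (
    title         TEXT NOT NULL,
    artist_name   TEXT NOT NULL,
    cost          REAL NOT NULL,  -- stands in for the MONEY type
    gallery_name  TEXT NOT NULL
);
-- Unique indexes enforce the primary keys; note the two-part key
-- (title, artist_name) on painting.
CREATE UNIQUE INDEX gallery_idx  ON gallery (gallery_name);
CREATE UNIQUE INDEX artist_idx   ON artist (artist_name);
CREATE UNIQUE INDEX painting_idx ON painting (title, artist_name);
""")
```

An attempt to add a second row with the same (title, artist_name) pair is now trapped as an error, which is the behaviour described above.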
The SQL SELECT statement is used to retrieve data from a table. It combines elements of the relational
algebra operation via its various options
SELECTION
In its simplest form, a SELECT command will select all data from the table, as in the example:-
SELECT *
FROM Art
The asterisk (*) indicates that all the columns (fields) of the table Art are to be selected.
Using the WHERE clause will restrict the rows (records) which are selected to those satisfying the
condition for example:-
SELECT *
FROM Art
WHERE cost > 5000
In this form the SQL SELECT provides the functions of the SELECT statement of the relational algebra
A practical example: the contents of the Art table used by the two SELECT statements above are
pictured as follows:-
TABLE: Art
Title    Artist_Name    Cost    Gallery_Name
Pool     Victor         300     Chitambo
Peel     John           1000    Nyasha
Sony     Arthur         1500    Harare
Reelm    Tecla          800     Nyasha
Tito     Amon           4500    Mutare
Questions:
1. Write an SQL code to view all records from the Art table
2. Write an SQL code to view all records from the Art table where cost is less than $1500 and
Gallery_Name is equal to Nyasha
3. Write an SQL statement to list only the columns Title and Cost in the table Art
Solution 1:
SELECT *
FROM Art
Resulting Table
Title    Artist_Name    Cost    Gallery_Name
Pool     Victor         300     Chitambo
Peel     John           1000    Nyasha
Sony     Arthur         1500    Harare
Reelm    Tecla          800     Nyasha
Tito     Amon           4500    Mutare
Solution 2:
SELECT *
FROM Art
WHERE cost < 1500 AND Gallery_Name = 'Nyasha'
Resulting Table
Title    Artist_Name    Cost    Gallery_Name
Peel     John           1000    Nyasha
Reelm    Tecla          800     Nyasha
Solution 3:
SELECT Title, Cost
FROM Art
Resulting Table
Title    Cost
Pool     300
Peel     1000
Sony     1500
Reelm    800
Tito     4500
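The three solutions can be checked with SQLite from Python. The quoted string 'Nyasha' and the underscore identifiers follow SQL convention; SQLite returns costs as floats:

```python
import sqlite3

# The Art table and the three solution queries, run in SQLite.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE Art (Title TEXT, Artist_Name TEXT, Cost REAL, Gallery_Name TEXT)")
db.executemany("INSERT INTO Art VALUES (?, ?, ?, ?)", [
    ("Pool",  "Victor", 300,  "Chitambo"),
    ("Peel",  "John",   1000, "Nyasha"),
    ("Sony",  "Arthur", 1500, "Harare"),
    ("Reelm", "Tecla",  800,  "Nyasha"),
    ("Tito",  "Amon",   4500, "Mutare"),
])

# Solution 1: all rows, all columns.
all_rows = db.execute("SELECT * FROM Art").fetchall()

# Solution 2: restriction with a compound WHERE condition.
cheap_nyasha = db.execute(
    "SELECT * FROM Art WHERE Cost < 1500 AND Gallery_Name = 'Nyasha'"
).fetchall()

# Solution 3: projection onto two columns.
title_cost = db.execute("SELECT Title, Cost FROM Art").fetchall()
```

Solution 2 returns only the Peel and Reelm rows, matching the resulting table shown above.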
PROJECTIONS
There is a provision in the SQL SELECT to cover the PROJECT operation of relational algebra.
The rows selected from a table can be projected onto a list of their columns by including the column
list instead of the asterisk. The command:-
SELECT Title, Artist_Name, Gallery_Name
FROM Art
WHERE Cost > 1000
is evaluated from the Art table by first retrieving the rows which satisfy the condition (Cost > 1000),
then projecting them onto the 3 columns; the cost values are omitted from the result.
Resulting Table
Title    Artist_Name    Gallery_Name
Sony     Arthur         Harare
Tito     Amon           Mutare
If the SELECT command specifies all the components of the primary key of the table as part of the column
list the resulting rows will also be identified by the key value
In particular, there will be no duplicate rows in the result. However, if the list of columns does not
contain the primary key, there may be duplicate rows in the resulting table. An example is shown below,
which is the result of applying the command:
SELECT Gallery_Name
FROM Art
WHERE Cost > 700
Resulting Table
Gallery_Name
Nyasha
Harare
Nyasha
Mutare
A variation of the SELECT command can be used to ensure that duplicate rows are removed from the
result. It uses the DISTINCT key word within the SELECT:
SELECT DISTINCT Gallery_Name
FROM Art
WHERE Cost > 700
This will remove all duplicate rows, producing the following table:-
Gallery_Name
Nyasha
Harare
Mutare
* This is so because we are projecting on Gallery_Name only, using DISTINCT, while still
satisfying the given condition.
It is possible to specify a particular order for the rows based on the selected column values by including an
'ORDER BY' clause
For example:
SELECT DISTINCT Gallery_Name
FROM Art
WHERE Cost > 700
ORDER BY Gallery_Name
This will produce the rows in ascending order of gallery name as shown on the table below
Gallery_Name
Harare
Mutare
Nyasha
GROUPED DATA
There are additional clauses in the SELECT command, which allows it to deal with groups in data rather
than individual rows. The GROUP BY clause combines records with identical values in the specific field
list into a single record
The final result of the SELECT is formed by projecting values into the selected columns. For example,
consider the command: -
SELECT Gallery_Name
FROM Art
WHERE cost < 1000
GROUP BY Gallery_Name
It will produce a list of Gallery_Names, which hold Art whose cost is < 1000. The GROUP BY clause
causes all the selected rows with the same Gallery_Name to be grouped into a single row.
The projection onto Gallery_Name is then performed and the resulting table has no duplicate names.
In fact it is equivalent to the SELECT DISTINCT command. An added advantage of grouping data is that
there are standard functions which can be applied to each group, producing one value for the whole
group. They include: SUM, AVG, MIN, MAX and COUNT.
Example 1
Write an SQL command to calculate the SUM of ALL COST in table painting.
Solution:
SELECT SUM (cost)
FROM painting
The computer will then sum up all the cost figures in the table painting and display the total only.
Example 2
Write an SQL statement using table painting to display the following output:
Gallery-name Cost
Example 3
Write an SQL statement to find the total galleries in the table painting.
Solution:
SELECT DISTINCT(gallery-name)
FROM painting
ORDER BY gallery-name
OR
SELECT COUNT(DISTINCT gallery-name)
FROM painting
which returns the number of distinct galleries directly.
Example 4
Write an SQL command to find or to list the maximum cost value in the table painting.
Solution:
SELECT MAX(cost)
FROM painting
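The aggregate examples can be run the same way. Note that standard SQL spells the clause GROUP BY (not GROUPED BY), and that COUNT(DISTINCT ...) gives the total number of galleries in a single value:

```python
import sqlite3

# Aggregate functions from Examples 1, 3 and 4, over a small painting table.
db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE painting (title TEXT, artist_name TEXT, cost REAL, gallery_name TEXT)")
db.executemany("INSERT INTO painting VALUES (?, ?, ?, ?)", [
    ("Pool",  "Victor", 300,  "Chitambo"),
    ("Peel",  "John",   1000, "Nyasha"),
    ("Sony",  "Arthur", 1500, "Harare"),
    ("Reelm", "Tecla",  800,  "Nyasha"),
    ("Tito",  "Amon",   4500, "Mutare"),
])

# Example 1: SUM of all cost values, returned as a single total.
total = db.execute("SELECT SUM(cost) FROM painting").fetchone()[0]

# Example 3: total number of distinct galleries.
galleries = db.execute(
    "SELECT COUNT(DISTINCT gallery_name) FROM painting").fetchone()[0]

# Example 4: the maximum cost value.
dearest = db.execute("SELECT MAX(cost) FROM painting").fetchone()[0]
```

Here the sum is 8100, there are 4 distinct galleries, and the maximum cost is 4500.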
SUB QUERIES
The WHERE clause can express a complex condition. It can be used in what is called a SUBQUERY,
which makes use of another SELECT statement as part of the condition (a nested SELECT statement).
Suppose we want to find all paintings by a particular artist; the following statement is issued:
SELECT artist-name
FROM artist-table
WHERE artist-name = 'John'
This produces a table of artist names equal to John. It can be used as part of the WHERE condition in
the SELECT statement which accesses or retrieves rows from the painting table.
Such a query extracts rows from the table painting where the artist name appears in the subquery
result. The IN operator is used to perform this test on the result of the subquery. The IN operator
and its negation/complement NOT IN are not the only operators for use in subqueries:
SELECT *
FROM painting
WHERE artist-name NOT IN (SELECT artist-name
                          FROM artist
                          WHERE artist-name = 'John')
ALL and ANY operators can be used with a relational operator such as >= to test the column value against
the result of the subquery. To select the titles of the most costly paintings we could use the following
command:
SELECT title
FROM painting
WHERE cost >= ALL (SELECT cost
FROM painting)
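A sketch of these subqueries in SQLite, collapsed to a single table for brevity. SQLite does not support the ALL/ANY quantifiers, so the >= ALL query is shown in its equivalent MAX form:

```python
import sqlite3

# Subqueries with NOT IN, and ">= ALL" rewritten using MAX
# (SQLite, used for this sketch, has no ALL/ANY operators).
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE painting (title TEXT, artist_name TEXT, cost REAL)")
db.executemany("INSERT INTO painting VALUES (?, ?, ?)", [
    ("Peel", "John",   1000),
    ("Sony", "Arthur", 1500),
    ("Tito", "Amon",   4500),
])

# Paintings NOT by John, via NOT IN on a subquery.
not_john = db.execute("""
    SELECT title FROM painting
    WHERE artist_name NOT IN (SELECT artist_name FROM painting
                              WHERE artist_name = 'John')
""").fetchall()

# Title of the most costly painting(s): "cost >= ALL (...)" expressed as MAX.
dearest = db.execute("""
    SELECT title FROM painting
    WHERE cost >= (SELECT MAX(cost) FROM painting)
""").fetchall()
```

In a dialect that supports quantified comparisons, the second query can be written exactly as in the text, with `cost >= ALL (SELECT cost FROM painting)`.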
CONSTRUCTING USER ACCESS
When a central database is used for a number of different users who have different requirements, it is
essential to be able to tailor the data to the different needs. In this case, there are two SQL features which
provide these facilities:
VIEWS
A view is a virtual table (it does not physically exist) obtained from the real tables by a SELECT
statement. Its main use is to tailor the data of a table to the needs of particular users, so that it
omits details that are of no interest or that the user should not see.
In the example of the table painting, it may be desired to let most users see all the data except for
the cost. A view can be created which omits the cost column as follows:
CREATE VIEW details AS
SELECT title, artist-name, gallery-name
FROM painting
To the users it looks just like a table and can be treated as a table in most SQL commands. However,
it is not a real table; its data is obtained from the painting table by performing the SELECT
statement each time it is accessed.
The statement:
SELECT *
FROM details
WHERE gallery-name = 'Chipangali'
Uses the view as a table. It retrieves the data relating to the paintings in the Chipangali gallery, but does not
include the cost, since the virtual table is formed by ignoring the cost column and is not part of the view.
Views can be created from any SELECT statement, not just those which limit the columns of a table.
A virtual table of all paintings held at the gallery Chipangali would be created by the command:
CREATE VIEW Chipangali AS
SELECT *
FROM painting
WHERE gallery-name = 'Chipangali'
This would contain all of the 4 columns of the table painting, but only those rows relating to the gallery
Chipangali.
Once a view has been created its definition, as a SELECT statement will exist until a DROP VIEW
command is performed.
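Both kinds of view can be demonstrated in SQLite; the table contents here are made up for the demo:

```python
import sqlite3

# The "details" and "Chipangali" views from the text, created in SQLite.
db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE painting (title TEXT, artist_name TEXT, cost REAL, gallery_name TEXT)")
db.executemany("INSERT INTO painting VALUES (?, ?, ?, ?)", [
    ("Peel", "John",  1000, "Nyasha"),
    ("Mist", "Tecla", 2500, "Chipangali"),   # made-up row for the demo
])

# A view that hides the cost column from ordinary users.
db.execute("""CREATE VIEW details AS
              SELECT title, artist_name, gallery_name FROM painting""")

# A view restricted to one gallery's rows (all four columns kept).
db.execute("""CREATE VIEW Chipangali AS
              SELECT * FROM painting WHERE gallery_name = 'Chipangali'""")

hidden = db.execute(
    "SELECT * FROM details WHERE gallery_name = 'Chipangali'").fetchall()
rows = db.execute("SELECT * FROM Chipangali").fetchall()
```

Querying the details view returns rows without a cost column, while the Chipangali view returns full rows for that gallery only; each query re-runs the underlying SELECT.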
GRANTING PRIVILEGES
Users of a database are identified by a user name. Individual users can be granted privileges which
give them permission to use certain SQL commands on the database.
Permissions may also be granted to all users by using the key word PUBLIC instead of the user name.
The GRANT CONNECT command is available to define passwords for a list of users. It has the form:
GRANT CONNECT TO <user list>
IDENTIFIED BY <passwords>
It can be used to set up the password(s) for the new users or to alter the existing user passwords. Some
implementations do not use this facility, but rely on the operating system to deal with passwords for users.
Specific privileges to permit the use of SQL statements on a table or view are allocated by further GRANT
command. They have the following form:
GRANT <privilege list>
ON <table or view>
TO <user list>
Where table is the name of the table or view, user list is either a list of names or the key word PUBLIC, and
privilege list is a list of key words for the privileges.
UPDATE may be followed by a list of columns, stating those which are allowed to be updated; the
default is to allow all columns to be updated.
For example, granting SELECT and UPDATE (cost, gallery-name) on the table painting to two named
users would let both of them use the SELECT command on the table painting and UPDATE the columns
cost and gallery-name only.
Since the privileges can be granted selectively a considerable degree of control of user access to data is
available.
Class exercise:
Student
Stud-ID Student-Name Town Course-Level Fee
HND1002 Chipo Harare HND 7500
ND2001 Edmore Mutare ND1 6500
ND200100 Takura Harare ND2 3000
ND2003 Simba Kwekwe ND1 6500
ND2008 Esther Bulawayo ND1 6500
HND1004 Rachel Mutare HND 7500
NC3001 James Gweru NC 3500
NC3007 Oscar Kwekwe NC 3500
ND2009 Linda Bulawayo ND1 6500
Question:
Given the following ERD design a detailed database using SQL necessary for the illustration.
[ERD: entities STUDENT, COURSE and TEACHER; relationships ATTENDS (STUDENT-COURSE) and
TEACHES (TEACHER-COURSE); the attributes shown include dob, name, crs-id, #-of-stud and
Qualification]
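One possible answer to the exercise, sketched in SQLite. Only the attributes visible in the ERD fragment are used, and the key names stud_id and teacher_id are assumptions, since the diagram does not show the identifiers of every entity:

```python
import sqlite3

# A hedged SQL design for the STUDENT / COURSE / TEACHER ERD.
db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE student (
    stud_id  TEXT PRIMARY KEY,   -- assumed key, as in the student table above
    name     TEXT,
    dob      TEXT
);
CREATE TABLE teacher (
    teacher_id    TEXT PRIMARY KEY,   -- assumed key
    qualification TEXT
);
CREATE TABLE course (
    crs_id      TEXT PRIMARY KEY,
    num_of_stud INTEGER,                               -- "#-of-stud"
    teacher_id  TEXT REFERENCES teacher(teacher_id)    -- TEACHES relationship
);
CREATE TABLE attends (                                 -- STUDENT attends COURSE
    stud_id TEXT REFERENCES student(stud_id),
    crs_id  TEXT REFERENCES course(crs_id),
    PRIMARY KEY (stud_id, crs_id)
);
""")
```

ATTENDS is many-to-many, so it becomes its own table with a two-part key; TEACHES is modelled here as a foreign key on course, which assumes each course has one teacher.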
1. Adding Data: SQL provides an INSERT command to add a single record to a table, for example:
INSERT INTO student VALUES
('HND1006', 'James Made', 'Mutare', 'HND', 7500)
This will add a row to the student table with all column values defined. The indexes associated
with the table are updated automatically, so that re-entering the same record will be rejected.
2. Deleting Data: The DELETE command is used to remove rows or records from a table. In its
simplest form it will remove all rows as in the command:
DELETE FROM<table-name> will remove all rows
DELETE FROM <TABLE-NAME>
WHERE <condition> only rows meeting the set condition will be removed.
The WHERE clause is used in the DELETE command and in other commands. The conditions
can be quite complex, enabling the commands to be very selectively applied.
They allow:
(a) AND, OR and NOT to be used as logical connections
(b) Numerical and character data to be compared for either equality or inequality such as: >,
<, =, >=, <=
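The INSERT and the two forms of DELETE can be sketched in SQLite, with quoted string values and a selective WHERE condition:

```python
import sqlite3

# INSERT, then a selective DELETE, using the student table from the
# class exercise above.
db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE student (stud_id TEXT PRIMARY KEY, student_name TEXT,
                                    town TEXT, course_level TEXT, fee REAL)""")
db.execute(
    "INSERT INTO student VALUES ('HND1006', 'James Made', 'Mutare', 'HND', 7500)")
db.execute(
    "INSERT INTO student VALUES ('ND2003', 'Simba', 'Kwekwe', 'ND1', 6500)")

# Only rows meeting the compound condition are removed.
db.execute("DELETE FROM student WHERE town = 'Kwekwe' AND fee >= 6500")
left = db.execute("SELECT stud_id FROM student").fetchall()
```

A bare `DELETE FROM student` (no WHERE clause) would instead remove every row.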
SYNONYM USAGE
A synonym gives a table a second name so that it can be compared with itself; here spx is a
synonym for sp. The query lists part numbers supplied by more than one supplier:
SELECT UNIQUE p#
FROM sp spx
WHERE p# IN
     (SELECT p#
      FROM sp
      WHERE s# <> spx.s#)
OR equivalently:
SELECT p#
FROM sp
GROUP BY p#
HAVING COUNT (s#) > 1
DICTIONARY
A collection of relations (i.e. the catalog and columns tables) describing the database itself.
Useful to a user who does not know all the fields of some tables but only an attribute name.
CREATE SYNONYM
Specifies an alternative name for a table/view; often used to define an abbreviation or to avoid
prefixing the table name with its owner's name.
DROP SYNONYM
Destroys a synonym declaration
COMMENTS USAGE
Provides an explanatory remark for table columns
COMMENT STATEMENT
Provides an explanatory remark for table columns (stored as part of the internal definition tables).
Used in updating the catalog together with DELETE, CREATE TABLE, ALTER and INSERT
STAGES AND MAJOR FUNCTIONS
Planning
1. Develop entity charts
2. Analyse costs and benefits
3. Develop implementation plan
4. Evaluate and select software and hardware
5. Establish application priorities
6. Develop data standards (naming conventions and definitions, eg Customer: Prospective, Prior,
No Longer)
Requirements Formulation & Analysis
1. Define user requirements
2. Develop data definitions
3. Develop data dictionary
Design
1. Design conceptual model
2. Design external models (modelling the organisation's data; the DBA interacts with users and
other system specialists in data processing)
3. Design internal models (schemas)
4. Design integrity controls
Implementation
1. Specify database access policies (rights)
2. Develop standards for application programming (for consistency and correctness, and to
increase programmers' productivity)
3. Establish security techniques (passwords, access tables, encryption)
4. Load databases (special programs to load from different files)
5. Specify test procedures
6. Establish procedures for backup and recovery
7. Conduct user training
Operation & Maintenance
1. Monitor database performance
2. Tune and reorganise databases
3. Enforce standards
4. Support users
Growth & Change
1. Implement change control procedures
2. Plan growth and change
Change in size: storage space utilisation; the DBA allocates additional space or reallocates
existing space
Change in content/structure: new application requests; alter the logical and physical database
structure
Change in usage pattern: performance monitoring; assigning frequently accessed records to faster
devices; additional higher-performance hardware devices
Planning:
Its purpose is to develop a strategic plan for database development that supports the overall
organisation business plan
Design Stage
Its purpose is to develop a database architecture that will meet the information needs of the
organisation now and in the future. There are 3 stages in database design, that is, Conceptual,
Implementation & Physical design.
a) Conceptual Design: Its purpose is to synthesise the various user views and information
requirements into a global database design. The design is called the Conceptual Schema/Data
Model and may be expressed in one of several forms, that is, an entity relationship diagram, a
semantic data model or normalised relations. The Conceptual Data Model describes entities,
attributes and relationships.
b) Implementation Design: Its purpose is to map the Conceptual Data Model into a logical
schema that can be processed by a particular DBMS. The conceptual data model is mapped
into hierarchical, network or relational data model.
c) Physical Design: Last stage of Database design concerned with designing stored record
formats, selecting access methods and deciding on physical factors such as record blocking.
Also concerned with database security, integrity and backup and recovery.
Implementation Stage:
Once the database is completed, the implementation process begins. The first step is the creation
or initial load of the database. Database administration manages the loading process and resolves
any inconsistencies that arise during this process.
1. Planning:
Develop entity charts
Analyse costs and benefits
Develop implementation plan
Evaluate and select software or hardware
Establish application priorities
Develop data standards
2. Requirements Formulation & Analysis:
Define user requirements
Develop data definitions
Develop data dictionary
3. Database Design:
Design conceptual model
Design external models
Design internal models
Design integrity controls
4. Database Implementation:
Specify database access policies
Develop standards for application programming
Establish security techniques
Load database
Specify test procedures
Establish procedures for backup & recovery
Conduct user training
5. Operations & Maintenance:
Monitor database performance
Tune and reorganise databases
Enforce standards
Support users
DATABASE IMPLEMENTATION
DBMS Functions:
Data storage, retrieval & update: since a database may be shared by many users, the DBMS must provide
multiple user views and allow users to store, retrieve and update their data easily and efficiently.
Data Dictionary/Directory
The DBMS must maintain a user accessible data dictionary
Recovery Services:
The DBMS must be able to restore the database, or return it to a consistent condition, in the event of
some system failure. Sources of system failure include:
Operator error
Disk head crashes
Program error
Security mechanisms:
Data must be protected against accidental or intentional misuse or destruction. The DBMS must provide
mechanisms for controlling access to data and defining what actions (read only, update) may be taken by
each user.
NORMALISATION
Normalisation is the analysis of functional dependencies among attributes (data items). The purpose of
normalisation is to reduce complex user views to a set of small, stable data structures. Normalised data
structures are more flexible, stable and easier to maintain than unnormalised structures.
Steps in Normalisation:
USER VIEWS
   ↓ (combine into a single relation)
UNNORMALISED RELATION
   ↓ (remove repeating groups)
1NF RELATIONS
   ↓ (remove partial dependencies)
2NF RELATIONS
   ↓ (remove transitive dependencies)
3NF RELATIONS
   ↓ (make every determinant a candidate key)
BCNF RELATIONS
   ↓ (remove multivalued dependencies)
4NF RELATIONS
   ↓ (remove join dependencies)
5NF RELATIONS
Unnormalised Relation:
It is a relation that contains one or more repeating groups, for example GRADE-REPORT:

GRADE-REPORT
Stud#  Studname  Major  Course#  Crs-title  Lec-name     L-office  Grade
38214  Takura    IS     IS350    Dbase      Chamanga     6         A
                        IS465    SAD        Kamudyariwa  10        C
69173  Esther    PM     IS465    SAD        Kamudyariwa  10        A
                        PM300    Proj-Mgt   Kamudyariwa  10        B
                        QM400    OR         Magadza      11        C

From Stud# the relationships are:
Stud# → Studname, Major (1:1)
Stud# → Course# (1:M)
Stud# → Crs-title (1:M)
Stud# → Lec-name (1:M)
Stud# → L-office (1:M)
There are multiple values at the intersection of certain rows and columns. Since each student takes more
than one course, the course data in the above relation constitutes a repeating group within the student
data. In an unnormalised relation, a single attribute cannot serve as a candidate or primary key. Suppose
we take student number as the primary key: there is a one-to-one relationship from student number to
student name and major. However, the relationship is one-to-many from student number to course and the
remaining attributes. Student number is therefore not a primary key, since it does not uniquely identify
all the attributes in this relation.
Normalised Relations:
A normalised relation is one that contains only single values at the intersection of each row and
column; it contains no repeating groups. To normalise a relation that contains a single repeating
group, we remove the repeating group and form 2 relations. The 2 new relations formed from the above
example, Student(S) and Student-Course(SC), are shown below. The Student relation is already in 3NF,
whereas the Student-Course relation is in 1NF.
Update anomaly: to change SAD to ASAD in Crs-title there is a need to search the entire relation;
failure to update every occurrence results in data inconsistency.
1NF
A relation with a single repeating group forms 2 relations when the repeating group is removed.
S(student)
Stud# Studname Major
38214 Takura IS
69173 Esther PM
SC(student-course)
Stud# Course# Crs-title Lec-name L-office Grade
38214 IS350 Dbase Chamanga 6 A
38214 IS465 SAD Kamudyariwa 10 C
69173 IS465 SAD Kamudyariwa 10 A
69173 PM300 Proj-Mgt Kamudyariwa 10 B
69173 QM400 OR Magadza 11 C
1NF with primary key (stud#, course#); the nonkey attributes come from the repeating group. The
primary key uniquely identifies a student's grade.
Student-Course still has data redundancy, which results in update anomalies when INSERTING, DELETING
and UPDATING data.
INSERT:
It is impossible to insert a new course if no student is yet taking it, because that would result in a
null value for stud#, which is part of the key and is not allowed.
DELETE:
Deleting a student's record for a particular tuple results in losing the course title and lecturer
details. Leaving the course details would result in a NULL value for stud#, which is part of the key
and is not allowed.
UPDATE:
To update a course title that appears a number of times, for example SAD, there is a need to search
through every tuple. This is inefficient and might result in data inconsistencies if some occurrences
are not updated.
The above problems are a result of nonkey attributes that are dependent on only part of the key, that
is, on course# alone:

course# → Crs-title, Lec-name, L-office

Grade is fully dependent on (stud#, course#), whereas Crs-title, Lec-name and L-office partially depend
on the primary key (stud#, course#), as shown below:

(stud#, course#) → Grade
course# → Crs-title, Lec-name, L-office
2NF
Removing the partial dependencies gives two relations, each in 2NF:

R(Registration)
Stud#  Course#  Grade
38214  IS350    A
38214  IS465    C
69173  IS465    A
69173  PM300    B
69173  QM400    C

CL(Course-Lecturer)
Course#  Crs-Title    Lec-Name     L-Office
IS350    DBase        Chamanga     6
IS465    SAD          Kamudyariwa  10
PM300    Project-Mgt  Kamudyariwa  10
QM400    OR           Magadza      11

The course title appears once in the Course-Lecturer relation, which solves the update anomaly. Course
data can be inserted and deleted without reference to student data.
Course# → Crs-Title, Lec-Name, L-Office
Lec-Name → L-Office

Lec-Name → L-Office illustrates that there is a unique office for each lecturer. This is a transitive
dependency: one nonkey attribute is dependent on one or more other nonkey attributes.
INSERT:
It is impossible to insert a new lecturer, since lecturer data is dependent on course# and a new
lecturer has not yet been assigned to teach at least one course. It is not possible, for example, to
insert Ms Mvududu until one or more courses have been assigned to her.
DELETE:
Deleting course data results in lecturer data being lost; for example, deleting course# IS350 results
in the loss of Chamanga's data.
UPDATE:
Lecturer data occurs many times, therefore changing the lecturer's office for Makunga requires
searching every tuple; failure to do so will result in data inconsistency, for example one tuple
reading Rm 8 and another reading Rm 12.
3NF
Removing the attributes that participate in the transitive dependency, Lec-Name → L-Office, results
in the following relations:
C(Course)
Course# Crs-Title Lec-Name
IS350 DBase Chamanga
IS465 SAD Kamudyariwa
PM300 Project-Mgt Kamudyariwa
QM400 OR Magadza
Primary Key (Course#) and Foreign Key (Lec-Name)
L(Lecturer)
Lec-Name L-Office
Chamanga 6
Kachepa 11
Makunga 8
Primary Key (Lec-Name)
The assumption is that an office (L-Office) can have more than one occupant; therefore Lec-Name becomes
the primary key and associates the two relations, Course and Lecturer.
In this 3NF, insertions and deletions can be done without referencing other entities. Updates are also
possible because they are confined to a single tuple within a relation.
C(Course)
Course# Crs-Title Lec-Name
IS350 DBase Chamanga
IS465 SAD Kamudyariwa
PM300 Project-Mgt Kamudyariwa
QM400 OR Magadza
L(Lecturer)
Lec-Name L-Office
Chamanga 6
Kachepa 11
Makunga 8
R(Registration)
Stud# Course# Grade
38214 IS350 A
38214 IS465 C
69173 IS465 A
69173 PM300 B
69173 QM400 C
S(student)
Stud# Studname Major
38214 Takura IS
69173 Esther PM
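Rebuilding the original GRADE-REPORT view now requires joining all four 3NF relations, which is the retrieval cost that normalisation trades for update safety. A Python sketch, with relation contents taken from the tables above (note that the notes' L relation lists lecturers that do not appear in C, so missing offices come back as None):

```python
# A sketch of the joins needed to rebuild GRADE-REPORT from the 3NF relations.
S = {38214: ("Takura", "IS"), 69173: ("Esther", "PM")}
R = [(38214, "IS350", "A"), (38214, "IS465", "C"),
     (69173, "IS465", "A"), (69173, "PM300", "B"), (69173, "QM400", "C")]
C = {"IS350": ("Dbase", "Chamanga"), "IS465": ("SAD", "Kamudyariwa"),
     "PM300": ("Project-Mgt", "Kamudyariwa"), "QM400": ("OR", "Magadza")}
L = {"Chamanga": 6, "Kachepa": 11, "Makunga": 8}

report = []
for stud, course, grade in R:            # R joins S on stud#
    name, major = S[stud]
    title, lecturer = C[course]          # ... and C on course#
    office = L.get(lecturer)             # ... and L on lec-name (None if absent)
    report.append((stud, name, major, course, title, lecturer, office, grade))
```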
Relations in 3NF are sufficient for most practical database design problems. However, when a relation
has more than one candidate key, problems may arise even if it is in 3NF; hence the further normal
forms, for example BCNF, 4NF, 5NF and DKNF. Consider the relation below:
SMA(student-Major-Advisor)
Stud# Major Advisor
123 Physics Edwin
123 Music Chioniso
456 Biology Machuma
789 Physics Tawanda
999 Physics Edwin
(Stud#, Major) → Advisor
Advisor → Major
There are still anomalies in the relation above. Suppose that student# 456 changes her major from
Biology to Maths: when that student's tuple is updated, we lose the fact that Machuma advises Biology
(update anomaly).
Suppose we want to insert a tuple with the information that Gamu advises in Computers. This cannot be
done until at least one student majoring in Computers is assigned Gamu as an advisor (insertion anomaly).
In the above relation there are 2 candidate keys, (student#, major) and (student#, advisor). The type of
anomalies that exist in this relation can occur when there are 2 or more overlapping candidate keys.
BCNF definition
A relation is in BCNF if and only if every determinant is a candidate key.
Determinant is any attribute simple or composite on which some other attribute is fully functionally
dependent, for example, in the above relation, the attribute advisor is determinant, since major is fully
functionally dependent on advisor.
To put the above relation into BCNF we make Advisor a candidate key by projecting the original 3NF
relation into 2 relations that are in BCNF.
SA(Student-Advisor)
Student#  Advisor
123       Edwin
123       Chioniso
456       Machuma
789       Tawanda
999       Edwin

AM(Advisor-Major)
Advisor   Major
Edwin     Physics
Chioniso  Music
Machuma   Biology
Tawanda   Physics
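The BCNF test can be sketched by computing attribute closures. A Python sketch using the two functional dependencies of the SMA relation above (the function and variable names are illustrative):

```python
# A sketch of the BCNF test: compute attribute closures under the FDs and
# check whether each determinant is a candidate key, i.e. whether its
# closure covers all attributes of the relation.
fds = [({"Stud#", "Major"}, {"Advisor"}),
       ({"Advisor"}, {"Major"})]
attributes = {"Stud#", "Major", "Advisor"}

def closure(attrs, fds):
    """All attributes functionally determined by attrs."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return result

# Advisor is a determinant but its closure lacks Stud#,
# so SMA is not in BCNF.
violations = [lhs for lhs, _ in fds if closure(lhs, fds) != attributes]
```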
Even when a relation is in BCNF it may still contain unwanted redundancy that may result in update
anomalies, for example, consider the following unnormalised relation
O(Offering)
Course   Instructor  Textbook
Mgt      White       Drucker
         Black       Peters
         Green
Finance  Gray        Weston
                     Gilford
Assumptions:
1. Each course has one or more instructors
2. For each course, all of the textbooks indicated are used.
O(Offering)
Course Instructor Textbook
Management White Drucker
Management Green Drucker
Management Black Drucker
Management White Peters
Management Green Peters
Management Black Peters
Finance Gray Weston
Finance Gray Gilford
Normalised Relation
In the normalised relation Offering, for each course all possible combinations of instructor and
textbook appear. The primary key of this relation consists of all 3 attributes, and the relation is in
BCNF. The relation nevertheless contains redundant data, which can lead to update anomalies: suppose
you want to add a third textbook to the management course; this would require the addition of 3 new
rows to the relation, one for each instructor. From the above relation you can see that for each course
there is a well defined set of instructors (a one-to-many relationship) and a well defined set of
textbooks (a one-to-many relationship). However, the instructors and textbooks are independent of each
other. This relationship is called a multivalued dependency.
Multivalued Dependency
Exists when there are 3 attributes, for example A, B and C, such that for each value of A there is a
well defined set of values of B and a well defined set of values of C, but the set of values of B is
independent of the set of values of C and vice-versa.
To remove a multivalued dependency from a relation, we project the relation into 2 relations, each of
which contains one of the 2 independent attributes.
4NF
A relation is in 4NF if it is in BCNF and contains no multivalued dependencies.
L(Lecturer)
Course   Instructor
Mgt      White
Mgt      Black
Mgt      Green
Finance  Gray

T(Text)
Course   Textbook
Mgt      Drucker
Mgt      Peters
Finance  Weston
Finance  Gilford
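A Python sketch showing that joining L and T on Course regenerates exactly the combinations of the normalised Offering relation, so the 4NF decomposition loses nothing:

```python
# A sketch of the lossless join of the 4NF relations: for each course, every
# instructor pairs with every textbook, reproducing the Offering relation.
L = [("Mgt", "White"), ("Mgt", "Black"), ("Mgt", "Green"), ("Finance", "Gray")]
T = [("Mgt", "Drucker"), ("Mgt", "Peters"),
     ("Finance", "Weston"), ("Finance", "Gilford")]

# Natural join on Course.
offering = {(c, i, b) for c, i in L for c2, b in T if c == c2}
```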
5NF
This normal form is designed to cope with join dependencies. A relation that has a join dependency
cannot be losslessly decomposed by pairwise projection into other relations.
5NF: a relation is said to be in 5NF if it is in 4NF and all join dependencies are removed.
Limitations of Normalisation
Users may have to join several tables for retrieval, which requires additional computer time
Referential integrity is more difficult to enforce when a table is decomposed via normalisation
Objectives of Normalisation:
Reduce redundancy
Produce a stable data structure.
SECURITY
Security refers to the protection of data against unauthorised access, alterations or destruction
INTEGRITY
Refers to the accuracy or validity of data
In other words, security involves ensuring that users are allowed to do the things they are trying to
do, while integrity involves ensuring that the things they are trying to do are correct.
In both cases the system needs to be aware of certain rules that users must not violate. These rules
must be specified (typically by the DBA) using a suitable language, and must be maintained in the
system catalog or dictionary; in both cases the DBA or DBMS must monitor user operations to ensure that
the rules are enforced.
There are numerous aspects to the security problem, among them are the following:
1. The legal, social and ethical aspects: For example, does the person making a request, say for
customer credit information, have a legal right to the requested information?
2. Physical controls: Is the computer or terminal room locked or otherwise guarded?
3. Policy questions: How does the enterprise owning the system decide who should be allowed access
to what?
4. Operational problems: If a password scheme is used, how are the passwords kept secret and how often
are they changed?
5. Hardware controls: Does the processing unit provide any security features, such as storage
protection keys or a privileged operation mode?
6. Operating system security: Does the operating system erase the contents of storage and data files
when they are finished with?
Modern DBMS typically support either or both of two approaches to data security: discretionary control
and mandatory control.
Mandatory Control:
Each data element is tagged or labeled with a certain classification level, and each user is given a
certain clearance level.
A given data object can be accessed only by users with the appropriate clearance level. This is enforced by
the DBA
Regardless of whether we are dealing with a discretionary or mandatory scheme, all decisions as to
which users may perform which operations on which objects are policy decisions, not technical ones.
All the DBMS can do is enforce those decisions once they are made.
It follows that the results of those policy decisions:
Must be made known to the system (by means of statements in some appropriate definition language),
and
Must be remembered by the system (by saving them in the catalog, in the form of security rules, also
known as authorisation rules)
There must also be a means of checking a given access request against the applicable security rules (by
an access request here we mean the combination of requested operation plus requested object plus
requesting user, in general).
This checking is done by the DBMS security subsystem, also known as the authorisation subsystem.
In order to be able to decide which security rules are applicable to a given access request, the
subsystem must be able to recognise the source of that request, that is, it must be able to recognise
the requesting user. For that reason, when users sign on to the system they are typically required to
supply not only their user ID (to say who they are) but also a password (to prove they are who they say
they are). The password is supposed to be known only to the system and to the legitimate users of the
user ID concerned.
Regarding this last point, incidentally note that any number of distinct users might be able to share the same
group User ID. In this way the system can support user groups, and can thus provide a way of allowing
everyone, for instance, in accounting department to share the same privileges.
The operations of adding individual users to or removing individual users from a given group can then be
performed independent of the operation of specifying the privileges that apply to that group.
Note, however, that the obvious place to keep a record of which users belong to which groups is, again,
the catalog.
To repeat from the previous section, most DBMS support discretionary control or mandatory control or
both. In fact, it would be more accurate to say that most systems support discretionary control and
some systems support mandatory control as well. Discretionary control is thus more likely to be
encountered in practice.
As already noted, there needs to be a language that supports the definition of security rules. We
therefore begin with a hypothetical example of such a language:
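The example itself is missing from these notes; the following is a reconstruction based on the five components enumerated below. The keywords and the user list (Art, Sales) are assumptions; only the name pr3, the privileges, the painting/gallery-name scope and the REJECT response come from the text:

```
CREATE SECURITY RULE pr3
    GRANT  SELECT, UPDATE
    ON     PAINTING WHERE GALLERY-NAME <> 'Chitombo'
    TO     Art, Sales          -- user IDs assumed for illustration
    ON ATTEMPT VIOLATION REJECT ;
```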
The example is meant to illustrate the point that security rules have 5 components, as follows:
1. A name (pr3, painting rule 3, in the example). The rule will be registered in the system catalog
under the name pr3. The name will probably also appear in any message or diagnostics produced by the
system in response to an attempted violation of the rule.
2. One or more privileges (SELECT and UPDATE in the example), specified by means of the GRANT
clause.
3. The scope to which the rule applies, specified by means of the ON clause. In the example the scope
is painting tuples or records where the gallery-name is not Chitombo.
4. One or more users (more accurately, user IDs) who are to be granted the specified privileges over
the specified scope, specified by means of the TO clause.
5. A violation response, specified by the ON ATTEMPT VIOLATION clause, telling the system what to do
if a user attempts to violate the rule. In the example, the violation response is simply to REJECT the
attempt and provide suitable diagnostics. Such a response will surely be the one most often required in
practice, so it is taken to be the default response.
A security rule can be destroyed when it is no longer required, for example:
DESTROY SECURITY pr3
For simplicity we assume that destroying a given named relation will automatically destroy any security
rules that apply to that relation.
AUDIT TRAILS
It is a special file or database in which the system automatically keeps track of all operations
performed by users on the regular database. A typical entry in the audit trail might contain
information such as the user, the operation performed, the data affected, and the date and time of the
operation.
RECOVERY
Recovery is the process of restoring a database to its original status after a system, media or
transaction failure.
SYSTEM FAILURE
Shutdowns caused by hardware faults or by bugs in the operating system or other system software will be
referred to as system crashes. When the system crashes, all transactions currently executing terminate.
The contents of internal memory (which include the I/O buffers) are assumed lost. However, we assume
that external memory, including the disks on which the database resides, is not affected by a system
failure.
DATA SECURITY
The protection of data in the database against unauthorised disclosure, alteration or destruction.
Authorisation Mechanisms
a) Identification
b) Authentication
Identification - Users have to identify themselves to the system before accessing the database, by
supplying an operator/username or using machine readable cards.
Authentication - The process of proving their identity by providing passwords, PIN numbers or answers
to some questions from the system.
Access Control
For each user the system will maintain a user profile, generated from the user definition supplied by the
DBA.
The details of the appropriate identification and authentication procedures would have been given in
the access controls, together with the operations a particular user is allowed to perform. The DBMS
will go through a series of tests to determine whether to grant or deny access to the user. The tests
may be arranged in a sequence of increasing complexity, so that the program may reach its final
decision as quickly as possible.
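The sequence of tests can be sketched in Python; the profile contents and names below are assumptions for illustration, not from the notes:

```python
# A sketch of the DBMS access-control check: each user has a profile of
# granted privileges per object, tested cheapest-test-first.
profiles = {
    "takura": {"GRADE-REPORT": {"SELECT"}},
    "dba":    {"GRADE-REPORT": {"SELECT", "UPDATE", "DELETE"}},
}

def access_allowed(user, obj, operation):
    profile = profiles.get(user)        # test 1: is the user known?
    if profile is None:
        return False
    granted = profile.get(obj)          # test 2: any rights on this object?
    if granted is None:
        return False
    return operation in granted         # test 3: this specific operation?
```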
DATABASE INTEGRITY
Ensuring that the data is accurate at all times.
Constraints
Each relation in the database will have a set of integrity constraints associated with it.
These constraints will be held in the data dictionary as part of the conceptual schema
They specify, for example, that values of a particular attribute in some relation are to be within
certain bounds, or that within each tuple of some relation the value of one attribute may not exceed
that of another.
1. The Primary Key possesses the property of uniqueness: no 2 tuples in the relation may have the same
value for this attribute or attribute combination
2. No component of a primary key value may be null
Enforcement
The DBMS must reject any attempt to generate a tuple whose key value is null or is a duplicate of one
that already exists.
Bounds Entry
Values occurring in a particular attribute may be required to lie within certain bounds (eg values of
employee age: 15<age<60)
The constraints are specified by the Bounds Entry. The lower and upper limits have to be defined.
Values Entry
There may be a very small set of permitted values of some particular attribute combination eg permitted
values for primary colour are red, blue, green etc. In this case the permitted values could simply be
listed in a values entry for the relevant attribute or attribute combination
NB It might be desirable to list values or ranges of values that are not permissible for the attributes
concerned.
Format Entry
Values of a particular attribute may have to conform to a particular format. Eg the first character of a
supplier number must be the letter S.
The constraint is specified in a format entry for the relevant attribute.
Average Function
The set of values of a particular attribute in a relation may have to satisfy some statistical
constraint, eg no employee may earn a salary that is more than twice the average salary for the
department.
The predicate defining this constraint will involve the library function AVERAGE
To enforce it the DBMS will have to monitor all storage operations against the employee relation
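The four kinds of static constraint above (bounds, values, format and statistical) can be sketched as a single Python check; the field names, bounds and permitted-value set are illustrative assumptions:

```python
# A sketch of static integrity checks on an employee tuple; returns a list
# of constraint violations (empty means the tuple is acceptable).
def check_employee(emp, all_salaries):
    errors = []
    if not (15 < emp["age"] < 60):                      # bounds entry
        errors.append("age out of bounds")
    if emp["colour"] not in {"red", "blue", "green"}:   # values entry
        errors.append("colour not permitted")
    if not emp["supplier"].startswith("S"):             # format entry
        errors.append("supplier number must start with S")
    avg = sum(all_salaries) / len(all_salaries)         # statistical constraint
    if emp["salary"] > 2 * avg:
        errors.append("salary exceeds twice department average")
    return errors
```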
NB
All the examples given above are of static constraints, that is, they specify conditions that must hold
for every given state of the database.
Another important type of constraint involves the transition from one state to another, eg when an
employee's salary is updated, the new value must be greater than the old value.
To specify such constraints it is necessary to refer to both the old and new values; the keywords OLD
and NEW are reserved for this purpose.
A special case of a transition is that from non-existence to existence (ie addition of a new tuple) or
from existence to non-existence (ie deletion of an existing tuple).
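A transition constraint with OLD and NEW values can be sketched as follows (a minimal Python illustration; None stands for non-existence):

```python
# A sketch of the salary transition constraint: an update must increase the
# salary; insertion and deletion are the special non-existence transitions.
def salary_update_ok(old, new):
    if old is None:          # transition from non-existence: inserting a tuple
        return new is not None
    if new is None:          # transition to non-existence: deleting a tuple
        return True
    return new > old         # OLD -> NEW: the new value must be greater
```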
RECOVERY ROUTINES
Recovery routines are used to restore the database, or some portion of the database, to an earlier state after a
system failure (hardware or software) has caused the contents of the database in main storage to be lost.
They take as input a backup copy of the database (produced by the dump routines) together with the
system journal (which contains details of operations that have occurred since the dump was taken), and
produce as output a new copy of the database as it was just before the failure occurred.
NB Any transactions that were in progress at the time of the failure will probably have to be restarted.
BACKUP ROUTINE
Dump routines
These are used to take backup copies of selected portions of the database, also usually on tape.
It is normal practice to dump the database regularly say once a week
If the database is very large it may be more practical to dump one seventh of it every day
Each time a dump is taken, a new system journal may be started and the previous one erased or
archived
Backup is normally initiated automatically by the DBMS before the database has committed its change.
Checkpoint/Restart Routines
Backing up and rerunning a long transaction in its entirety can be a time consuming process
Some systems permit transactions to take checkpoint at suitable points in their executions
The checkpoint routines will cause all changes made since the last checkpoint to be committed.
The checkpoint facility allows a long transaction to be divided up into a sequence of short ones
The checkpoint routine may also record the values of specified program variables in a checkpoint entry
in the system journal, and, in the case of an operation involving a change to the database, the type of
change and the address of the data changed, together with its before and after values.
Encryption/scrambling
This is the protection of the database against an infiltrator who attempts to bypass the system.
An example of bypassing the system is a user who physically removes part of the database, for
example by stealing a disk pack.
Apart from normal security measures to prevent unauthorised personnel from entering the computer
centre, the most important safeguard against physical removal of part of the database is the use of
scrambling techniques.
Scrambling/encryption and privacy transformation techniques involve the following:
(a) Shuffling the characters of each tuple (or record or message) into different order
(b) Replacement of each character (or group of characters) by a different character (or group
of characters), from the same alphabet or different one
(c) Groups of characters are algebraically combined in some way with a special group of
characters (privacy key) supplied by the owner of the data.
TRANSACTIONS
A transaction is a unit of work with the property that the database is:
a) In a consistent state (a state of integrity) both before and after it, but
b) Possibly not in such a state between these 2 times
In general, any changes made to the database during a transaction should not be visible to concurrent
transactions until those changes have been committed, in order to prevent the concurrent transactions
from seeing the database in an inconsistent state.
Any data changed by a given transaction including data created or destroyed by that transaction should
remain locked until that transaction terminates
The above discipline must be enforced by the DBMS
A transaction will be backed out if, on completion, it is found that the database is not in a state of
integrity.
A transaction may also be backed out if the system detects a deadlock. A general strategy for such a
situation is to choose one of the deadlocked transactions, say the one most recently started or the
one that has made the fewest changes, and remove it from the system, thus freeing its locked resources
for use by other transactions.
The process of backout involves undoing all the changes that the transaction has made, releasing all
resources locked by the transaction and scheduling the transaction for re-execution.
Example of a Transaction
In a banking system a typical transaction might be: transfer amount X from account A to account B.
This would be viewed as a single operation by the user, who would enter a single command to perform it.
The transaction nevertheless requires several changes to be made to the underlying database;
specifically, it involves updating the balance value in 2 distinct account tuples.
Although the database is in a state of integrity before and after the sequence of changes, it may not
be throughout the entire transaction, ie some of the intermediate states (or transitions) may violate
one or more integrity constraints.
It follows that there is a need to be able to specify that certain constraints should not be checked
until the end of the transaction. These are called deferred constraints.
By contrast, constraints that are enforced continuously during the intermediate steps of the
transaction are called immediate constraints.
NB: The data sublanguage must include some means of signalling the end of the transaction, in order to
cause the DBMS to apply the deferred checks.
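The transfer transaction can be sketched with Python's built-in sqlite3 module. The table layout and amounts are assumptions for illustration, but the commit/rollback behaviour is the COMMIT-point discipline described above:

```python
# A sketch of the bank-transfer transaction: both balance updates commit
# together, or (on any error) roll back together to the previous commit point.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE account (acc TEXT PRIMARY KEY, balance INTEGER)")
conn.executemany("INSERT INTO account VALUES (?, ?)",
                 [("A", 200), ("B", 50)])
conn.commit()

def transfer(conn, src, dst, amount):
    try:
        conn.execute("UPDATE account SET balance = balance - ? WHERE acc = ?",
                     (amount, src))
        conn.execute("UPDATE account SET balance = balance + ? WHERE acc = ?",
                     (amount, dst))
        conn.commit()        # COMMIT point: both updates become permanent
    except Exception:
        conn.rollback()      # back to the previous COMMIT point
        raise

transfer(conn, "A", "B", 150)
```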
CONCURRENCY
In most systems, several users can access a database concurrently. The operating system switches execution
from one user program to another to minimise waiting for input or output operations
Within this approach transactions are often interleaved, that is, several steps are performed on transaction
A, then several steps on transaction B, followed by more steps on transaction A and so on.
The Lost Update Problem:
1. 2 users are in the process of updating the same record, which represents a savings account record
for customer A
2. At the present time customer A has a balance of $200 in her account
3. User 1 reads her record into the user work area, intending to post a customer withdrawal of $150
4. Next, user 2 reads the same record into that user's work area, intending to post a customer deposit
of $25
5. User 1 posts the withdrawal and stores the record, which now indicates a balance of $50
6. User 2 then posts the deposit (increasing the balance to $225) and stores this record on top of the
one stored by user 1
7. The record now indicates a balance of $225
8. The transaction for user 1 has been lost because of interference between transactions
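The interleaving in the steps above can be replayed as a minimal Python simulation:

```python
# A sketch of the lost update: both users read the $200 balance before either
# writes, so user 1's withdrawal is overwritten by user 2's store.
balance = 200

copy1 = balance          # step 3: user 1 reads the record
copy2 = balance          # step 4: user 2 reads the same record
balance = copy1 - 150    # step 5: user 1 posts the withdrawal
balance = copy2 + 25     # step 6: user 2 stores on top of user 1's update
# balance is now 225: the $150 withdrawal has been lost
```

With locking, user 2's read in step 4 would have to wait until user 1's transaction released the record, and no update would be lost.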
INCONSISTENT ANALYSIS
This usually occurs in the traditional file approach: when the same data are stored in multiple
locations, inconsistencies in the data are inevitable, for example when several files contain copies of
the same customer data.
TRANSACTION RECOVERY
A transaction begins with the successful execution of a BEGIN TRANSACTION statement and ends with the
successful execution of either a COMMIT or a ROLLBACK statement.
COMMIT establishes what is called a commit point. This corresponds to the end of a logical unit of work
and to a point at which the database is, or should be, in a state of consistency.
ROLLBACK rolls the database back to the state it was in at BEGIN TRANSACTION, which effectively means
back to the previous commit point.
When a commit point is established, all updates made by the program since the previous commit point are
committed, that is, they are made permanent. Once committed, an update is guaranteed never to be undone.
NOTE carefully that COMMIT and ROLLBACK terminate the transaction, not the program. In general, a
single program execution will consist of a sequence of several transactions running one after the
other, each delimited by BEGIN TRANSACTION and ended by COMMIT or, on recovery, by ROLLBACK.
SYSTEM RECOVERY
The system must be prepared to recover not only from local failures, such as the occurrence of an
overflow condition within an individual transaction, but also from global failures, such as a power
failure.
A local failure, by definition, affects only the transaction in which the failure has occurred. A
global failure affects all the transactions in progress at the time of the failure.
Global failures fall into two broad categories: system failures, which affect all transactions in
progress but do not physically damage the database, and media failures, sometimes called hard crashes,
which do damage the database or some portion of it.
A critical point regarding a system failure is that the contents of main memory are lost (in
particular, the database buffers are lost). The precise state of any transaction that was in progress
at the time of the failure is therefore no longer known; such a transaction can never be successfully
completed and so must be undone (rolled back) when the system restarts.
It is also necessary at restart time to redo certain transactions that did successfully complete prior
to the crash but did not manage to get their updates transferred from the database buffers (memory) to
the physical database.
The question arises: how does the system know at restart time which transactions to undo and which to
redo?
The answer is that at certain prescribed intervals, typically whenever some prescribed number of
entries has been written to the log, the system automatically takes a checkpoint, that is, it
physically writes the contents of the database buffers out to the database.
Checkpointing: a special marker record is periodically written to the transaction log (a special file
recording all changes made to the database). It allows long transactions to be divided into a sequence
of short ones.
DATABASE INTEGRITY
Database integrity is the maintenance of the correctness and consistency of the data.
Commercial DBMS have integrity subsystems for monitoring transactions that update the database and for
detecting integrity violations. These integrity subsystems are rather primitive, and the problems of
maintaining the correctness of the database are largely left in the hands of the database implementers
(DBA).
INTEGRITY RULES
Integrity rules are divided into broad categories namely:
1. Entity integrity constraint or rule
The entity integrity rule, also known as intra-relational integrity, is concerned with maintaining the
correctness of relations among fields within a relation, and one of the most important tasks of this rule is
key uniqueness. A primary key should never be null or partially null.
The set property of the relation guarantees that no 2 tuples or records in a relation have the same values
in all their components.
Implementation of key uniqueness requires that the system guarantee that no new tuple can be accepted
for insertion into a relation if it has the same values in all its prime attributes as some existing tuple in
the relation.
In addition we must also guarantee that no existing tuple in a relation is updated in such a way as to
change its prime field values to be the same as those of some other tuple in the same relation.
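The two guarantees above (uniqueness on insertion, uniqueness preserved under update) can be sketched as follows; the Relation class and attribute names are illustrative only, not a real DBMS interface:

```python
# A tiny relation that enforces entity integrity: the primary key may not
# be null, duplicated on insert, or made a duplicate by an update.

class Relation:
    def __init__(self, key):
        self.key = key          # name of the primary key attribute
        self.tuples = {}        # primary key value -> tuple (a dict)

    def insert(self, tup):
        k = tup.get(self.key)
        if k is None:
            raise ValueError("primary key may not be null")
        if k in self.tuples:
            raise ValueError("duplicate primary key value")
        self.tuples[k] = tup

    def update_key(self, old_k, new_k):
        if new_k is None or new_k in self.tuples:
            raise ValueError("update would violate key uniqueness")
        self.tuples[new_k] = self.tuples.pop(old_k)

customer = Relation(key='cust_no')
customer.insert({'cust_no': 1008, 'name': 'Mutema'})
customer.insert({'cust_no': 1009, 'name': 'Dahwa'})
try:
    customer.update_key(1009, 1008)     # would duplicate an existing key
except ValueError as e:
    print(e)   # update would violate key uniqueness
```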
2. Referential integrity constraint or rule
The requirement is that a foreign key must have either a null entry or an entry that matches the primary
key value in the table to which it is related.
The enforcement of the referential integrity rule makes it impossible to delete a row or tuple in a table
whose primary key has matching foreign key values in another table eg.
Table: Customer
Cust#   Cust-LName   Cust-FName   Cust-DOB   SalesP#
1008    Mutema       John         08/12/37   37
1009    Dahwa        Peter        11/10/43
10010   Musona       Amon         02/05/86   14
10011   Asindadi     Noah         12/12/32   21
Table: SalesPerson
SalesP#   Area-Code   Phone   SP-Name   SP-Sales-Amt
The definition of a primary key includes its uniqueness property: no duplicates and no NULL values. To
enforce it, the DBMS rejects any attempt to input records with NULL or duplicate primary key values.
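A minimal sketch of the referential integrity rule on the two tables above: a foreign key must be null or match an existing SalesP#, and a referenced salesperson cannot be deleted. The salesperson numbers and names here are hypothetical, since the SalesPerson table's rows are not shown:

```python
# SalesPerson relation: SalesP# -> name (illustrative data)
sales_person = {37: 'Moyo', 14: 'Ncube', 21: 'Dube'}

# Customer relation: Cust# -> (name, SalesP# foreign key or None)
customer = {}

def insert_customer(cust_no, name, salesp_no):
    # Foreign key must be null or match an existing primary key.
    if salesp_no is not None and salesp_no not in sales_person:
        raise ValueError("foreign key has no matching primary key")
    customer[cust_no] = (name, salesp_no)

def delete_sales_person(salesp_no):
    # A row whose primary key is referenced elsewhere may not be deleted.
    if any(sp == salesp_no for _, sp in customer.values()):
        raise ValueError("row is referenced by a foreign key")
    del sales_person[salesp_no]

insert_customer(1008, 'Mutema', 37)
insert_customer(1009, 'Dahwa', None)   # a null foreign key is allowed
```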
Functional dependencies represent another form of integrity constraint. Other constraints include:
Comparison expressions, e.g. the qtyout value must not exceed the qtyord value
Lower and upper limit values specified for an attribute
Valid/permitted values for a certain attribute
Attribute values conforming to a particular format
Statistical constraints, e.g. no employee may earn more than twice the average salary for the department
DEADLOCK
PRINCIPLES OF DEADLOCK
Occurs when each of the two transactions is waiting for the other to release an item
Solution
Deadlock prevention protocol
Every transaction locks all items it needs in advance
Deadlock detection
Locks are granted freely, but the system periodically checks whether it is in a state of deadlock
Wait-for graph
Abort some of the transactions if there is a deadlock
This is a permanent blocking of a set of processes or users that either compete for system resources or
communicate with each other.
Reusable resource
Is one that can be safely used by only one process or user at a time and is not depleted by that use.
Examples include processors, input/output channels, and main & secondary memory. Data structures such as
databases and files are also examples of reusable resources.
Consumable resource
Is one that can be created (produced) and destroyed (consumed). Examples include interrupts, printer
paper, signal messages, etc.
Mutual exclusion
Only one process or user may use a resource (for example, a tuple) at a time.
[Diagram: tuple T1 is held by user A1; users A2 ... An request T1 and must wait.]
Hold and wait
A process or user holds one resource while requesting another.
[Diagram: user A2 holds tuple T2 while requesting tuple T1, which is held by user A1.]
Circular wait
A closed chain of users exists in which each holds a resource requested by the next.
[Diagram: user A1 holds T1 and requests T2; user A2 holds T2 and requests T1.]
No preemption
No resource can be forcibly removed from a process or user holding it.
[Diagram: user A2 requests tuple T1 but must wait until user A1 releases it.]
A Related Problem:
Indefinite Postponement
Similar to deadlock known as indefinite blocking or starvation
In a system that keeps users waiting while it makes resource-allocation and user-scheduling decisions, it
is possible to delay the scheduling of a user indefinitely while other users receive the system's attention.
Indefinite postponement may occur because of biases in a system's resource-scheduling policies. When
resources are scheduled on a priority basis, it is possible for a given user to wait for a resource indefinitely
as users with higher priority continue to arrive.
A common remedy is to let a waiting user's priority grow the longer it waits; eventually that user's priority
will exceed the priority of incoming users and the waiting user will be serviced. This process is called
AGING.
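The aging idea can be sketched in a few lines; the priority numbers and the aging weight are illustrative choices, not part of any particular scheduler:

```python
# A waiting user's effective priority = base priority + (weight * ticks waited),
# so a low-priority user is eventually chosen despite higher-priority arrivals.

def schedule(queue, age_weight=1):
    """queue: dict user -> (base_priority, ticks_waited). Highest effective wins."""
    return max(queue, key=lambda u: queue[u][0] + age_weight * queue[u][1])

queue = {'low': (1, 0), 'high': (5, 0)}
chosen = None
for tick in range(10):
    chosen = schedule(queue)
    if chosen == 'low':              # the aged user is finally serviced
        break
    base, waited = queue['low']
    queue['low'] = (base, waited + 1)   # the passed-over user ages
```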
Deadlock Avoidance
Requires knowledge of future requests for a resource for example:
1. Do not grant authority to the user if his/her demands might lead to a deadlock
2. Do not grant incremental resource request to a user if this allocation might lead to a deadlock.
Deadlock detection
The goal of deadlock detection is to determine if a deadlock has occurred and determine precisely those
users and resources in the deadlock. Once this is determined the deadlock can be cleared from the system.
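Deadlock detection with a wait-for graph can be sketched as a cycle search: an edge from A to B means transaction A is waiting for a resource held by B, and any cycle is a deadlock. The function name and graph representation are illustrative:

```python
# Detect deadlocked transactions by finding cycles in the wait-for graph.

def find_deadlock(wait_for):
    """wait_for: dict txn -> set of txns it is waiting for.
    Returns the set of txns that lie on some cycle (empty if no deadlock)."""
    def on_cycle(start):
        stack, seen = [start], set()
        while stack:
            node = stack.pop()
            for nxt in wait_for.get(node, ()):
                if nxt == start:          # a path leads back to the start: cycle
                    return True
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        return False
    return {t for t in wait_for if on_cycle(t)}

# A1 waits for A2 (which holds T2); A2 waits for A1 (which holds T1):
# a circular wait. A3 merely waits for A1 and is not itself deadlocked.
graph = {'A1': {'A2'}, 'A2': {'A1'}, 'A3': {'A1'}}
deadlocked = find_deadlock(graph)   # A1 and A2 are deadlocked
```

Once the deadlocked set is known, the system can choose one victim from it and abort that transaction to break the cycle.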
Deadlock recovery
Methods used to clear deadlocks from a system so that it may proceed to operate free of the deadlock, and
so that the deadlocked users may complete their execution and free their resources.
CONCURRENCY
Data Sharing
There are several problems which can result from shared access to the database; one of these is the lost
update. If 2 users are allowed to hold the same tuple concurrently, the first of the 2 subsequent update
operations will be nullified by the second, since the effect of the second is to overwrite the result of the
first.
Solution
1. Grant the user issuing the first hold an exclusive lock on the data held
2. No other user will be allowed to access the data while it is locked to the first user
3. The user issuing the second hold will have to wait until the first user releases the lock
4. The second user will in turn be granted an exclusive lock on the data
5. The effect of the second hold will be to retrieve the data as updated by the first user.
However, the exclusive locking technique leads in turn to other problems that is deadlock and starvation
(discussed previously)
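The five steps of the exclusive-locking solution can be sketched as follows; the LockManager class and account names are illustrative, not a real DBMS interface:

```python
# Exclusive locking: the second user cannot touch the data until the first
# releases the lock, so the first user's update is not lost.

class LockManager:
    def __init__(self):
        self.owner = {}                 # item -> user holding the exclusive lock

    def acquire(self, user, item):
        if self.owner.get(item, user) != user:
            return False                # locked by someone else: must wait
        self.owner[item] = user
        return True

    def release(self, user, item):
        if self.owner.get(item) == user:
            del self.owner[item]

locks = LockManager()
balance = {'acct': 100}

assert locks.acquire('user1', 'acct')       # step 1: user 1 gets the lock
assert not locks.acquire('user2', 'acct')   # steps 2-3: user 2 must wait
balance['acct'] -= 50                       # user 1 posts a $50 withdrawal
locks.release('user1', 'acct')
assert locks.acquire('user2', 'acct')       # step 4: user 2 now gets the lock
balance['acct'] += 25                       # step 5: sees user 1's update -> $75
```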
DATA SECURITY
The protection of data in the database against unauthorised disclosure, alteration or destruction.
Authorisation Mechanisms
a) Identification
b) Authentication
Identification - Users have to identify themselves to the system before accessing the database, by supplying
an operator/username or using machine-readable cards.
Authentication - The process of proving their identity, by providing passwords, PIN numbers, or answering
some questions from the system.
Access Control
For each user the system will maintain a user profile, generated from the user definition supplied by the
DBA.
The details of the appropriate identification and authentication procedures would have been given in the
access controls, together with the operations a particular user is allowed to perform. The DBMS will go
through a series of tests to determine whether to grant or deny access to the user. The tests may be arranged
in a sequence of increasing complexity, so that the program may reach its final decision as quickly as
possible.
DATABASE INTEGRITY
Ensuring that the data is accurate at all times.
Constraints
Each relation in the database will have a set of integrity constraints associated with it.
These constraints will be held in the data dictionary as part of the conceptual schema
They specify for example, that values of a particular attribute in some relation are to be within certain
boundary, or that within each tuple of some relation the values of one attribute may not exceed that of
another.
1. Primary Key - possesses a property of uniqueness. No 2 tuples in the relation may have the same value
for this attribute or attribute combination
2. No component of a primary key value may be null
Enforcement
The DBMS must reject any attempt to generate a tuple whose key value is null or duplicates one that
already exists.
Bounds Entry
Values occurring in a particular attribute may be required to lie within certain bounds (eg values of
employee age: 15<age<60)
The constraints are specified by the Bounds Entry. The lower and upper limits have to be defined.
Values Entry
There may be a very small set of permitted values of some particular attribute combination eg permitted
values for primary colour are red, blue, green etc. In this case the permitted values could simply be
listed in a values entry for the relevant attribute or attribute combination
NB: It might be desirable to list values or ranges of values that are not permissible for the attributes
concerned.
Format Entry
Values of a particular attribute may have to conform to a particular format. Eg the first character of a
supplier number must be the letter S.
The constraint is specified in a format entry for the relevant attribute.
Average Function
The set of values of a particular attribute of a relation may have to satisfy some statistical constraint, eg no
employee may earn a salary that is more than twice the average salary for the department.
The predicate defining this constraint will invoke the library function AVERAGE.
To enforce it the DBMS will have to monitor all storage operations against the employee relation.
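The four kinds of static constraint above (bounds, values, format, statistical) can be sketched as checks run before a tuple is stored. The attribute names and error messages are illustrative, not a real DBMS API:

```python
# Check a tuple against bounds, values, format and statistical constraints,
# returning a list of violations (empty list = all constraints hold).

def check_tuple(tup, all_salaries):
    errors = []
    if not (15 < tup['age'] < 60):                      # bounds entry
        errors.append('age out of bounds')
    if tup['colour'] not in {'red', 'blue', 'green'}:   # values entry
        errors.append('colour not permitted')
    if not tup['supplier'].startswith('S'):             # format entry
        errors.append('supplier number must start with S')
    avg = sum(all_salaries) / len(all_salaries)         # statistical constraint
    if tup['salary'] > 2 * avg:
        errors.append('salary exceeds twice the average')
    return errors

tup = {'age': 34, 'colour': 'red', 'supplier': 'S42', 'salary': 900}
print(check_tuple(tup, [400, 500, 600]))   # [] — all constraints hold
```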
NB
All examples given above are of static constraints that is they specify conditions that must hold for
every given state of the database.
Another important type of constraint involves transition from one state to another, eg when an employee's
salary is updated, the new value must be greater than the old value.
To specify such constraints it is necessary to refer to both the old and new values.
The keywords OLD and NEW are reserved for this purpose.
A special case of transition is that from non-existence to existence (ie addition of a new tuple) or from
existence to non-existence (ie deletion of an existing tuple).
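A transition constraint using OLD and NEW values can be sketched as follows, including the two special cases; the function name is illustrative:

```python
# Transition constraint: a salary update is accepted only if NEW > OLD.
# old is None models insertion (non-existence -> existence);
# new is None models deletion (existence -> non-existence).

def check_salary_update(old, new):
    if old is not None and new is not None and new <= old:
        raise ValueError("new salary must be greater than the old value")

check_salary_update(500, 650)      # valid update: NEW > OLD
check_salary_update(None, 400)     # insertion of a new tuple
check_salary_update(650, None)     # deletion of an existing tuple
try:
    check_salary_update(650, 500)  # rejected: NEW <= OLD
except ValueError as e:
    print(e)
```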
RECOVERY ROUTINES
Recovery routines are used to restore the database, or some portion of the database, to an earlier state after a
system failure (hardware or software) has caused the contents of the database in main storage to be lost.
They take as input a backup copy of the database (produced by the dump routines) together with the system
journal (which contains details of operations that have occurred since the dump was taken) and produce as
output a new copy of the data as it was before the failure occurred.
NB: Any transactions that were in progress at the time of the failure will probably have to be restarted.
BACKUP ROUTINE
Dump routines
These are used to take backup copies of selected portions of the database, also usually on tape.
It is normal practice to dump the database regularly, say once a week
If the database is very large it may be more practical to dump one seventh of it every day
Each time a dump is taken, a new system journal may be started and the previous one erased or
archived
Backout is normally initiated automatically by the DBMS before the transaction has committed its changes.
Checkpoint/Restart Routines
Backing up and rerunning a long transaction in its entirety can be a time-consuming process
Some systems permit transactions to take checkpoints at suitable points in their execution
The checkpoint routine will cause all changes made since the last checkpoint to be committed.
The checkpoint facility allows a long transaction to be divided up into a sequence of short ones
The checkpoint routine may also record values of specified program variables in a checkpoint entry in
the system journal
In the case of an operation involving a change to the database, the journal records the type of change and
the address of the data changed, together with its before and after values.
Encryption/scrambling
Used to protect the database against an infiltrator who attempts to bypass the system
An example of bypassing the system involves a user who physically removes part of the database, for
example by stealing a disk pack
Apart from normal security measures to prevent unauthorised personnel from entering the computer
centre, the most important safeguard against physical removal of part of the database is the use of
scrambling techniques
Scrambling/encryption and privacy transformation techniques involve the following:
(a) Shuffling the characters of each tuple (or record or message) into different order
(b) Replacement of each character (or group of characters) by a different character (or group
of characters), from the same alphabet or different one
(c) Groups of characters are algebraically combined in some way with a special group of
characters (privacy key) supplied by the owner of the data.
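Technique (c), algebraically combining character groups with a privacy key, can be sketched with XOR, a common choice of combining operation. This is purely illustrative, not production cryptography:

```python
# XOR each byte of the data with the repeating privacy key. Applying the
# same operation with the same key a second time recovers the original.

def scramble(data: bytes, key: bytes) -> bytes:
    return bytes(b ^ key[i % len(key)] for i, b in enumerate(data))

record = b'Mutema 1008'
key = b'secret'                 # the owner's privacy key

cipher = scramble(record, key)
assert cipher != record                  # stored form is unreadable
assert scramble(cipher, key) == record   # owner's key recovers the data
```

A thief who steals the disk pack sees only the scrambled form; without the privacy key the stored tuples are unintelligible.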
TRANSACTIONS
A transaction is a unit of work with the property that the database is:
a) In a consistent state (state of integrity) both before and after it, but
b) Is possibly not in such a state between these 2 times
In general, any changes made to the database during a transaction should not be visible to concurrent
transactions until those changes have been committed, in order to prevent the concurrent transactions from
seeing the database in an inconsistent state.
Any data changed by a given transaction including data created or destroyed by that transaction should
remain locked until that transaction terminates
The above discipline must be enforced by the DBMS
A transaction will be backed out if on completion it is found that the database is not in a state of
integrity
A transaction may also be backed out if the system detects a deadlock. A general strategy for such a
situation is to choose one of the deadlocked transactions, say the one most recently started or the one
that has made the fewest changes, and remove it from the system, thus freeing its locked resources for use
by other transactions.
The process of back-out involves undoing all the changes that the transaction has made, releasing all
resources locked by the transaction, and scheduling the transaction for re-execution.
Example of Transaction
In a banking system a typical transaction might be:
Transfer amount X from account A to account B. This would be viewed as a single operation, and the user
would enter a single command to invoke it.
The above transaction requires several changes to be made to the underlying database.
Specifically it involves updating the balance value in 2 distinct account tuples
Although the database is in a state of integrity before and after the sequence of changes, it may not be
throughout the entire transaction, ie some of the intermediate state (or transitions) may violate one or
more integrity constraints
It follows that there is need to be able to specify that certain constraints should not be checked until the
end of the transaction. These are called deferred constraints
By contrast, constraints that are enforced continuously during the intermediate steps of the transaction
are called immediate constraints.
NB: The data sublanguages must include some means of signaling the end of the transaction, in order to
cause the DBMS to apply deferred checks
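The transfer transaction above, with a deferred constraint checked only at end of transaction and back-out on failure, can be sketched as follows (the account structure and constraint are illustrative):

```python
# Two updates inside one unit of work. The deferred constraint (no negative
# balances) is checked only at end of transaction; if it fails, the whole
# transaction is backed out by restoring the prior state.

def transfer(accounts, a, b, amount):
    before = dict(accounts)          # snapshot for back-out
    accounts[a] -= amount            # intermediate state may violate constraints
    accounts[b] += amount
    if any(bal < 0 for bal in accounts.values()):   # deferred check at end
        accounts.clear()
        accounts.update(before)      # back out: restore the consistent state
        raise ValueError("transaction backed out")

accounts = {'A': 100, 'B': 50}
transfer(accounts, 'A', 'B', 30)        # succeeds: {'A': 70, 'B': 80}
try:
    transfer(accounts, 'A', 'B', 500)   # would leave A negative
except ValueError:
    pass                                # accounts are unchanged
```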
CONCURRENCY
In most systems, several users can access a database concurrently. The operating system switches execution
from one user program to another to minimise waiting for input or output operations
Within this approach transactions are often interleaved, that is, several steps are performed on transaction
A, then several steps on transaction B, followed by more steps on transaction A and so on.
1. 2 users are in the process of updating the same record which represents a savings account record for
customer A
2. At present time customer A has a balance of $100 in her account
3. User 1 reads her record into the user work area, intending to post a customer withdrawal of $50
4. Next user 2 reads the same record into that user area, intending to post a customer deposit of $25
5. User 1 posts the withdrawal and stores the record, which now indicates a balance of $50
6. User 2 then posts the deposit (increasing the balance to $125) and stores this record on top of the one
stored by user 1
7. The record now indicates a balance of $125
8. In this case the transaction for user 1 has been lost because of interference between transactions
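The interleaving in steps 1-8 can be reproduced directly; each user's private work area is simulated by a local variable:

```python
# Two users read the same record into private work areas, then both write
# back; the second write overwrites the first, so the withdrawal is lost.

record = {'balance': 100}

work1 = record['balance']        # step 3: user 1 reads the record ($100)
work2 = record['balance']        # step 4: user 2 reads the same record ($100)

record['balance'] = work1 - 50   # step 5: user 1 posts a $50 withdrawal -> $50
record['balance'] = work2 + 25   # step 6: user 2 posts a $25 deposit -> $125

print(record['balance'])   # 125 — user 1's withdrawal has been lost
```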
Techniques are needed to keep the database in a consistent state with respect to specified constraints on
the database.
Both database security and protection rules, and database semantic integrity constraints, are stored in the
DBMS catalog.
SUPPORT ROUTINES
Journaling Routines:
Records every operation in system log/audit trail/system journal
Dump Routines:
Take back-up copies of the database; a new system journal is started after every dump.
Recovery Routines:
Used to restore the database or some portion of the database after a system failure (hardware or
software) has caused contents of the database buffers in main storage to be lost.
Backout Routines:
Initiated automatically by the DBMS before transaction changes are committed.
Checkpoint/Restart Routines:
Cause all changes made since the last checkpoint to be committed. Instead of restarting a long
transaction it only restarts from the last checkpoint.
Detection Routines:
Detect any violations and back the transaction out of the system, with information on the list of
constraints violated and the offending tuples.