A database-management system (DBMS) contains information about a particular enterprise and is designed to be convenient and efficient to use.
Database Applications: banking, airlines, universities, sales, manufacturing, human resources.
In the early days, database applications were built on top of file systems.
Drawbacks of using file systems to store data:
Data redundancy and inconsistency: multiple file formats, duplication of information in different files
Difficulty in accessing data: need to write a new program to carry out each new task
Integrity problems: integrity constraints (e.g., account balance > 0) become buried in program code rather than being stated explicitly
Hard to add new constraints or change existing ones
Atomicity of updates
Failures may leave database in an inconsistent state with partial updates carried out
Security problems
Database systems offer solutions to all the above problems
Levels of Abstraction
Physical level: describes how a record (e.g., customer) is stored.
Logical level: describes data stored in the database, and the relationships among the data.
View level: application programs hide details of data types; views can also hide information for security purposes.
View of Data
An architecture for a database system
Data Models
A collection of tools for describing
data
data relationships
data semantics
data constraints
Entity-Relationship model
Relational model
Other models:
object-oriented model
semi-structured data models
Older models: network model and hierarchical model
Entity-Relationship Model
Entities (objects)
E.g. customers, accounts, bank branch
Relational Model
Attributes
customer-id    customer-name   customer-street   customer-city   account-number
192-83-7465    Johnson         Alma              Palo Alto       A-101
019-28-3746    Smith           North             Rye             A-215
192-83-7465    Johnson         Alma              Palo Alto       A-201
321-12-3123    Jones           Main              Harrison        A-217
019-28-3746    Smith           North             Rye             A-201
E.g.:
create table account (
    account-number  char(10),
    balance         integer)
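The DDL above can be exercised with, for example, Python's sqlite3 module (an illustrative sketch, not tied to any DBMS the slides name; the hyphen in account-number is replaced with an underscore, since a bare hyphen is not legal in an unquoted SQL identifier):

```python
import sqlite3

# Create the account relation and verify it with a small insert/query.
conn = sqlite3.connect(":memory:")
conn.execute("""
    create table account (
        account_number char(10),
        balance        integer
    )
""")
conn.execute("insert into account values ('A-101', 500)")
rows = conn.execute("select * from account").fetchall()
print(rows)  # [('A-101', 500)]
```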
DDL compiler generates a set of tables stored in a data
dictionary
Data dictionary contains metadata (i.e., data about data)
database schema
Data storage and definition language
language in which the storage structure and access methods used by the database system are specified
SQL
SQL: widely used non-procedural language
Database Users
Users are differentiated by the way they expect to interact with
the system
Application programmers interact with system through DML
calls
Sophisticated users form requests in a database query
language
Specialized users write specialized database applications that do not fit into the traditional data processing framework
Database Administrator
Schema definition
Storage structure and access method definition
Schema and physical organization modification
Granting user authority to access the database
Specifying integrity constraints
Acting as liaison with users
Monitoring performance and responding to changes in
requirements
Transaction Management
A transaction is a collection of operations that performs a single logical function in a database application
Storage Management
Storage manager is a program module that provides the interface between the low-level data stored in the database and the application programs and queries submitted to the system
Application Architectures
Entity Sets
A database can be modeled as:
a collection of entities,
relationship among entities.
An entity is an object that exists and is distinguishable from other
objects.
Example: specific person, company, event, plant
Entities have attributes
Example: people have names and addresses
An entity set is a set of entities of the same type that share the same properties.
Example: set of all persons, companies, trees, holidays
Attributes
An entity is represented by a set of attributes, that is, descriptive properties possessed by all members of an entity set.
Composite Attributes
Relationship Sets
A relationship is an association among several entities
Example: customer entity Hayes is associated with account entity A-102 through the depositor relationship set.
A relationship set is a mathematical relation among n ≥ 2 entities, each taken from an entity set.
Relationship sets that involve two entity sets are binary (or degree two); most relationship sets in a database system are binary.
Mapping Cardinalities
Express the number of entities to which another entity can be associated via a relationship set.
For a binary relationship set the mapping cardinality must be one of:
One to one
One to many
Many to one
Many to many
E-R Diagrams
Roles
Entity sets of a relationship need not be distinct
The labels "manager" and "worker" are called roles; they specify how employee entities interact via the works-for relationship set.
Roles are indicated in E-R diagrams by labeling the lines that connect diamonds to rectangles.
Role labels are optional, and are used to clarify semantics of the relationship
Cardinality Constraints
We express cardinality constraints by drawing either a directed line (→), signifying "one", or an undirected line (—), signifying "many", between the relationship set and the entity set.
One-To-Many Relationship
In the one-to-many relationship a loan is associated with at most one customer via borrower; a customer is associated with several (including 0) loans via borrower
Many-To-Many Relationship
A customer is associated with several (possibly 0) loans via borrower
A loan is associated with several (possibly 0) customers via borrower
Keys
A super key of an entity set is a set of one or more attributes whose values uniquely determine each entity.
If a ternary relationship has more than one arrow, there are two ways of defining its meaning.
E.g., a ternary relationship R between A, B and C with arrows to B and C could mean
1. each A entity is associated with a unique entity from B and C, or
2. each pair of entities from (A, B) is associated with a unique C entity, and each pair (A, C) is associated with a unique B
Each alternative has been used in different formalisms
To avoid confusion we outlaw more than one arrow
To convert a non-binary relationship set R among entity sets A, B and C to binary form, create an entity set E; for each relationship (ai, bi, ci) in R:
1. create a new entity ei in E
2. add (ei, ai) to relationship set RA
3. add (ei, bi) to relationship set RB
4. add (ei, ci) to relationship set RC
Design Issues
Use of entity sets vs. attributes
The primary key of a weak entity set is formed by the primary key of the strong entity set on which the weak entity set is existence dependent, plus the weak entity set's discriminator.
We underline the discriminator of a weak entity set with a dashed line.
payment-number: discriminator of the payment entity set
Primary key for payment: (loan-number, payment-number)
Alternatively, the weak entity set could be modeled as a strong entity set with course-number as an attribute.
Then the relationship with course would be implicit in the course-number attribute
Specialization
Top-down design process; we designate subgroupings within an
entity set that are distinctive from other entities in the set.
These subgroupings become lower-level entity sets that have attributes or participate in relationships that do not apply to the higher-level entity set.
Depicted by a triangle component labeled ISA (e.g., customer "is a" person).
Attribute inheritance: a lower-level entity set inherits all the attributes and relationship participation of the higher-level entity set to which it is linked.
Specialization Example
Generalization
A bottom-up design process: combine a number of entity sets that share the same features into a higher-level entity set.
Specialization and generalization are simple inversions of each other; the terms are used interchangeably.
Can have multiple specializations of an entity set based on different features.
E.g., permanent-employee vs. temporary-employee, in addition to officer vs. secretary vs. teller
The ISA relationship is also referred to as a superclass-subclass relationship
Design Constraints on a
Specialization/Generalization
Constraint on which entities can be members of a given lower-level entity set.
Design Constraints on a
Specialization/Generalization (Cont.)
Completeness constraint: specifies whether or not an entity in the higher-level entity set must belong to at least one of the lower-level entity sets within a generalization.
Aggregation
Consider the ternary relationship works-on, which we saw earlier
Suppose we want to record managers for tasks performed by an
employee at a branch
Aggregation (Cont.)
Relationship sets works-on and manages represent overlapping
information
Every manages relationship corresponds to a works-on relationship
However, some works-on relationships may not correspond to any
manages relationships
So we can't discard the works-on relationship
E-R Design Decisions
Whether a real-world concept is best expressed by an entity set or a relationship set.
The use of a ternary relationship versus a pair of binary relationships.
The use of a strong or weak entity set.
The use of specialization/generalization contributes to modularity in the design.
The use of aggregation: an aggregate entity set can be treated as a single unit without concern for the details of its internal structure.
UML
UML: Unified Modeling Language
UML has many components to graphically model different parts of a software system
UML class diagrams correspond to E-R diagrams, but with several differences.
Binary relationship sets are represented in UML by just drawing a line connecting the entity sets. The relationship set name is written adjacent to the line.
The role played by an entity set in a relationship set may also be
specified by writing the role name on the line, adjacent to the entity
set.
The relationship set name may alternatively be written in a box, along with attributes of the relationship set.
Non-binary relationships cannot be directly represented in UML; they have to be converted to binary relationships.
A multivalued attribute M of an entity E is represented by a separate table EM
Table EM has attributes corresponding to the primary key of E and an attribute corresponding to multivalued attribute M
E.g. Multivalued attribute dependent-names of employee is represented
by a table
employee-dependent-names( employee-id, dname)
Each value of the multivalued attribute maps to a separate row of the
table EM
E.g., an employee entity with primary key John and
dependents Johnson and Johndotir maps to two rows:
(John, Johnson) and (John, Johndotir)
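The flattening described above can be sketched in Python (the entity values are the ones from the example; the dictionary shape is just an illustrative assumption):

```python
# One employee entity with a multivalued attribute dependent-names.
employee = {"employee_id": "John",
            "dependent_names": ["Johnson", "Johndotir"]}

# Each value of the multivalued attribute becomes its own row
# of the table employee-dependent-names(employee-id, dname).
employee_dependent_names = [
    (employee["employee_id"], dname)
    for dname in employee["dependent_names"]
]
print(employee_dependent_names)  # [('John', 'Johnson'), ('John', 'Johndotir')]
```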
A many-to-many relationship set is represented as a table with columns for the primary keys of the two participating entity sets, and any descriptive attributes of the relationship set.
E.g.: table for relationship set borrower
Redundancy of Tables
Many-to-one and one-to-many relationship sets that are total on the many-side can be represented by adding an extra attribute to the many side, containing the primary key of the one side.

table        attributes
person       name, street, city
customer     name, credit-rating
employee     name, salary
Relations Corresponding to
Aggregation
To represent aggregation, create a table containing
primary key of the aggregated relationship,
the primary key of the associated entity set
Any descriptive attributes
Relations Corresponding to
Aggregation (Cont.)
E.g., to represent the aggregation manages between relationship set works-on and entity set manager, create a table manages (employee-id, branch-name, title, manager-name)
End of Chapter 2
Existence Dependencies
If the existence of entity x depends on the existence of entity y, then x is said to be existence dependent on y; y is a dominant entity and x is a subordinate entity.
E.g., in the loan — loan-payment — payment diagram, payment entities are subordinate to loan entities.
Example of a Relation
Basic Structure
Formally, given sets D1, D2, ..., Dn, a relation r is a subset of
D1 × D2 × ... × Dn
Thus a relation is a set of n-tuples (a1, a2, ..., an) where each ai ∈ Di
Example: if
customer-name = {Jones, Smith, Curry, Lindsay}
customer-street = {Main, North, Park}
customer-city = {Harrison, Rye, Pittsfield}
then r = {(Jones, Main, Harrison), (Smith, North, Rye)} is a relation over customer-name × customer-street × customer-city
Attribute Types
Each attribute of a relation has a name
The set of allowed values for each attribute is called the domain
of the attribute
Attribute values are (normally) required to be atomic, that is,
indivisible
E.g. multivalued attribute values are not atomic
E.g. composite attribute values are not atomic
The special value null is a member of every domain
The null value causes complications in the definition of many
operations
we shall ignore the effect of null values in our main presentation
and consider their effect later
Relation Schema
A1, A2, ..., An are attributes
R = (A1, A2, ..., An) is a relation schema
E.g., Customer-schema =
(customer-name, customer-street, customer-city)
r(R) is a relation on the relation schema R
E.g.
customer (Customer-schema)
Relation Instance
The current values (relation instance) of a relation are
specified by a table
An element t of r is a tuple, represented by a row in a table
The columns are the attributes and the rows are the tuples. E.g., the customer relation:

customer-name   customer-street   customer-city
Jones           Main              Harrison
Smith           North             Rye
Curry           North             Rye
Lindsay         Park              Pittsfield
Database
A database consists of multiple relations
Information about an enterprise is broken up into parts, with each relation storing one part of the information.
Normalization theory deals with how to design relational schemas
Keys
Let K ⊆ R
K is a superkey of R if values for K are sufficient to identify a unique tuple of each possible relation r(R)
The primary key of a weak entity set is the union of the primary key of the strong entity set and the discriminator of the weak entity set.
Relationship set: the union of the primary keys of the related entity sets becomes a superkey of the relationship set.
Query Languages
Language in which user requests information from the database.
Categories of languages
procedural
non-procedural
Pure languages:
Relational Algebra
Tuple Relational Calculus
Domain Relational Calculus
Pure languages form underlying basis of query languages that
people use.
Relational Algebra
Procedural language
Six basic operators
select
project
union
set difference
Cartesian product
rename
The operators take one or two relations as inputs and produce a new relation as a result.
Relation r and the result of σ_{A=B ∧ D>5} (r):

A  B  C   D
α  α  1   7
α  β  5   7
β  β  12  3
β  β  23  10

σ_{A=B ∧ D>5} (r):

A  B  C   D
α  α  1   7
β  β  23  10
Select Operation
Notation: σ_p(r), where p is called the selection predicate
Relation r:

A  B   C
α  10  1
α  20  1
β  30  1
β  40  2

Π_{A,C} (r):

A  C
α  1
β  1
β  2
Project Operation
Notation: Π_{A1, A2, ..., Ak} (r), where A1, ..., Ak are attribute names and r is a relation. The result is the relation of k columns obtained by erasing the columns that are not listed; duplicate rows are removed, since relations are sets.
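Modeling a relation as a Python set of tuples, σ and Π can be sketched as follows (the relation r and column positions here are illustrative assumptions, not data from the slides):

```python
# Relation r(A, B) as a set of tuples.
r = {(1, 10), (2, 20), (3, 30)}

def select(pred, rel):
    """sigma_pred(rel): keep only the tuples satisfying the predicate."""
    return {t for t in rel if pred(t)}

def project(indices, rel):
    """pi_indices(rel): keep the listed columns; duplicates collapse
    automatically because the result is a set."""
    return {tuple(t[i] for i in indices) for t in rel}

print(select(lambda t: t[1] > 10, r))  # {(2, 20), (3, 30)}
print(project([0], r))                 # {(1,), (2,), (3,)}
```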
r ∪ s:
Union Operation
Notation: r ∪ s
Defined as:
r ∪ s = {t | t ∈ r or t ∈ s}
For r ∪ s to be valid:
1. r, s must have the same arity (same number of attributes)
2. the attribute domains must be compatible
Set-Difference Operation
Notation: r − s
Defined as:
r − s = {t | t ∈ r and t ∉ s}
Set differences must be taken between compatible relations.
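With relations as Python sets of compatible tuples, union, intersection and difference map directly onto the built-in set operators (the two small relations are illustrative assumptions):

```python
# Two compatible single-attribute relations.
r = {(1,), (2,)}
s = {(2,), (3,)}

print(r | s)  # union:          {(1,), (2,), (3,)}
print(r & s)  # intersection:   {(2,)}
print(r - s)  # set difference: {(1,)}
```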
Cartesian-Product Operation-Example
Relations r, s:

r:          s:
A  B        C  D   E
α  1        α  10  a
β  2        β  10  a
            β  20  b
            γ  10  b

r × s:

A  B  C  D   E
α  1  α  10  a
α  1  β  10  a
α  1  β  20  b
α  1  γ  10  b
β  2  α  10  a
β  2  β  10  a
β  2  β  20  b
β  2  γ  10  b
Cartesian-Product Operation
Notation r x s
Defined as:
r x s = {t q | t r and q s}
Assume that attributes of r(R) and s(S) are disjoint. (That is, R ∩ S = ∅.)
If attributes of r(R) and s(S) are not disjoint, then renaming must
be used.
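With disjoint attribute sets, the Cartesian product is just tuple concatenation over all pairs; a sketch (the sample relations are illustrative assumptions):

```python
r = {(1, "a"), (2, "b")}  # r(A, B)
s = {(10,), (20,)}        # s(C); attributes disjoint from r's

# r x s: concatenate every tuple of r with every tuple of s.
product = {t + q for t in r for q in s}
print(sorted(product))
# [(1, 'a', 10), (1, 'a', 20), (2, 'b', 10), (2, 'b', 20)]
```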
Composition of Operations
Can build expressions using multiple operations
Example: σ_{A=C} (r × s)

r × s:

A  B  C  D   E
α  1  α  10  a
α  1  β  10  a
α  1  β  20  b
α  1  γ  10  b
β  2  α  10  a
β  2  β  10  a
β  2  β  20  b
β  2  γ  10  b

σ_{A=C} (r × s):

A  B  C  D   E
α  1  α  10  a
β  2  β  10  a
β  2  β  20  b
Rename Operation
Allows us to name, and therefore to refer to, the results of
relational-algebra expressions.
Allows us to refer to a relation by more than one name.
Example:
ρ_x (E)
returns the expression E under the name x
If a relational-algebra expression E has arity n, then ρ_{x(A1, A2, ..., An)} (E) returns the result of expression E under the name x, with the attributes renamed to A1, A2, ..., An.
Banking Example
branch (branch-name, branch-city, assets)
customer (customer-name, customer-street, customer-city)
account (account-number, branch-name, balance)
loan (loan-number, branch-name, amount)
depositor (customer-name, account-number)
borrower (customer-name, loan-number)
Example Queries
Find all loans of over $1200
σ_{amount > 1200} (loan)
Example Queries
Find the names of all customers who have a loan, an account, or both, at the bank.
Π_{customer-name} (borrower) ∪ Π_{customer-name} (depositor)
Example Queries
Find the names of all customers who have a loan at the Perryridge branch.
Π_{customer-name} (σ_{branch-name = "Perryridge"} (σ_{borrower.loan-number = loan.loan-number} (borrower × loan)))
Find the names of all customers who have a loan at the Perryridge branch but no account at any branch of the bank.
Π_{customer-name} (σ_{branch-name = "Perryridge"} (σ_{borrower.loan-number = loan.loan-number} (borrower × loan))) − Π_{customer-name} (depositor)
Example Queries
Find the names of all customers who have a loan at the Perryridge
branch.
Query 1
Π_{customer-name} (σ_{branch-name = "Perryridge"} (σ_{borrower.loan-number = loan.loan-number} (borrower × loan)))
Query 2
Π_{customer-name} (σ_{loan.loan-number = borrower.loan-number} ((σ_{branch-name = "Perryridge"} (loan)) × borrower))
Example Queries
Find the largest account balance
Rename account relation as d
The query is:
Π_{balance} (account) − Π_{account.balance} (σ_{account.balance < d.balance} (account × ρ_d (account)))
Formal Definition
A basic expression in the relational algebra consists of either one
of the following:
A relation in the database
A constant relation
Let E1 and E2 be relational-algebra expressions; then E1 ∪ E2, E1 − E2, E1 × E2, σ_p (E1) where p is a predicate on attributes in E1, Π_s (E1) where s is a list of attributes in E1, and ρ_x (E1) where x is a new name for the result of E1, are all relational-algebra expressions.
Additional Operations
We define additional operations that do not add any power to the
relational algebra, but that simplify common queries.
Set intersection
Natural join
Division
Assignment
Set-Intersection Operation
Notation: r ∩ s
Defined as:
r ∩ s = { t | t ∈ r and t ∈ s }
Assume: r, s have the same arity, and compatible attribute domains
Note: r ∩ s = r − (r − s)

r:        s:
A  B      A  B
α  1      α  2
α  2      β  3
β  1

r ∩ s:

A  B
α  2
Natural-Join Operation
Notation: r ⋈ s
Let r and s be relations on schemas R and S respectively. The result is a relation on schema R ∪ S which is obtained by considering each pair of tuples tr from r and ts from s.
If tr and ts have the same value on each of the attributes in R ∩ S, a tuple t is added to the result, where
t has the same value as tr on r
t has the same value as ts on s
Example:
R = (A, B, C, D)
S = (E, B, D)
Result schema = (A, B, C, D, E)
r ⋈ s is defined as:
Π_{r.A, r.B, r.C, r.D, s.E} (σ_{r.B = s.B ∧ r.D = s.D} (r × s))
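For the schemas R = (A, B, C, D) and S = (E, B, D) above, the natural join can be sketched in Python by matching on the shared attributes B and D (the tuple values are illustrative assumptions):

```python
r = {(1, "a", "x", 7), (2, "b", "y", 8)}  # r(A, B, C, D)
s = {(9, "a", 7), (5, "b", 3)}            # s(E, B, D)

# r natural-join s: emit (A, B, C, D, E) whenever the B and D
# values agree in both tuples.
join = {
    (A, B, C, D, E)
    for (A, B, C, D) in r
    for (E, B2, D2) in s
    if B == B2 and D == D2
}
print(join)  # {(1, 'a', 'x', 7, 9)}
```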
r:                  s:
A  B  C  D          B  D  E
α  1  α  a          1  a  α
β  2  γ  a          3  a  β
γ  4  β  b          1  a  γ
α  1  γ  a          2  b  δ
δ  2  β  b          3  b  ε

r ⋈ s:

A  B  C  D  E
α  1  α  a  α
α  1  α  a  γ
α  1  γ  a  α
α  1  γ  a  γ
δ  2  β  b  δ
Division Operation
Notation: r ÷ s
Suited to queries that include the phrase "for all".
Let r and s be relations on schemas R and S respectively, where
R = (A1, ..., Am, B1, ..., Bn)
S = (B1, ..., Bn)
The result of r ÷ s is a relation on schema
R − S = (A1, ..., Am)
r ÷ s = { t | t ∈ Π_{R−S}(r) ∧ ∀ u ∈ s ( tu ∈ r ) }
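The definition translates directly into Python over sets of tuples: a candidate t survives only if pairing it with every tuple of s stays inside r (the sample relations are illustrative assumptions):

```python
# r(A, B) and s(B): which A values are paired with *every* B in s?
r = {(1, "a"), (1, "b"), (2, "a"), (3, "b")}
s = {("a",), ("b",)}

candidates = {(t[0],) for t in r}  # pi_{R-S}(r)
division = {
    t for t in candidates
    if all(t + u in r for u in s)  # t concatenated with each u must be in r
}
print(division)  # {(1,)} -- only A=1 occurs with both 'a' and 'b'
```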
r:            s:
A  B          B
α  1          1
α  2          2
α  3
β  1
γ  1
δ  1
δ  3
δ  4
ε  6
ε  1
β  2

r ÷ s:

A
α
β

(A second example, with R = (A, B, C, D, E) and S = (D, E), follows the same pattern: the result schema is (A, B, C).)
Property: let q = r ÷ s. Then q is the largest relation satisfying q × s ⊆ r.
Definition in terms of the basic algebra operations (let r(R) and s(S) with S ⊆ R):
r ÷ s = Π_{R−S}(r) − Π_{R−S}((Π_{R−S}(r) × s) − Π_{R−S,S}(r))
Assignment Operation
The assignment operation (←) provides a convenient way to express complex queries: write a query as a sequential program consisting of a series of assignments followed by an expression whose value is displayed as the result. Assignment must always be made to a temporary relation variable.
Example: write r ÷ s as
temp1 ← Π_{R−S}(r)
temp2 ← Π_{R−S}((temp1 × s) − Π_{R−S,S}(r))
result = temp1 − temp2
Example Queries
Find all customers who have an account from at least the "Downtown" and the "Uptown" branches.
Query 1
Π_{CN}(σ_{BN="Downtown"}(depositor ⋈ account)) ∩ Π_{CN}(σ_{BN="Uptown"}(depositor ⋈ account))
where CN denotes customer-name and BN denotes branch-name.
Query 2
Π_{customer-name, branch-name} (depositor ⋈ account) ÷ ρ_{temp(branch-name)} ({("Downtown"), ("Uptown")})
Example Queries
Find all customers who have an account at all branches located
in Brooklyn city.
Π_{customer-name, branch-name} (depositor ⋈ account) ÷ Π_{branch-name} (σ_{branch-city = "Brooklyn"} (branch))
Extended Relational-Algebra-Operations
Generalized Projection
Outer Join
Aggregate Functions
Generalized Projection
Extends the projection operation by allowing arithmetic functions to be used in the projection list: Π_{F1, F2, ..., Fn} (E), where E is any relational-algebra expression and each Fi is an arithmetic expression involving constants and attributes in the schema of E.
Aggregate operation example: g_{sum(C)} (r)

r:
A  B  C
α  α  7
β  β  7
β  β  3
γ  γ  10

sum(C)
27
branch-name g_{sum(balance)} (account)

account:
branch-name   account-number   balance
Perryridge    A-102            400
Perryridge    A-201            900
Brighton      A-217            750
Brighton      A-215            750
Redwood       A-222            700

result:
branch-name   sum(balance)
Perryridge    1300
Brighton      1500
Redwood       700
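Grouped aggregation of this kind can be sketched in Python with a running total per group (the account rows are the ones from the example; the representation as a list of tuples is an assumption):

```python
from collections import defaultdict

# account rows: (branch-name, account-number, balance)
account = [
    ("Perryridge", "A-102", 400),
    ("Perryridge", "A-201", 900),
    ("Brighton",   "A-217", 750),
    ("Brighton",   "A-215", 750),
    ("Redwood",    "A-222", 700),
]

# branch-name g_sum(balance) (account): sum balances per branch.
totals = defaultdict(int)
for branch, _acct, balance in account:
    totals[branch] += balance
print(dict(totals))
# {'Perryridge': 1300, 'Brighton': 1500, 'Redwood': 700}
```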
Outer Join
An extension of the join operation that avoids loss of information.
Computes the join and then adds tuples from one relation that do not match tuples in the other relation to the result of the join.
Uses null values:
Relation loan:

loan-number   branch-name   amount
L-170         Downtown      3000
L-230         Redwood       4000
L-260         Perryridge    1700

Relation borrower:

customer-name   loan-number
Jones           L-170
Smith           L-230
Hayes           L-155

loan ⋈ borrower (inner join):

loan-number   branch-name   amount   customer-name
L-170         Downtown      3000     Jones
L-230         Redwood       4000     Smith

loan ⟕ borrower (left outer join):

loan-number   branch-name   amount   customer-name
L-170         Downtown      3000     Jones
L-230         Redwood       4000     Smith
L-260         Perryridge    1700     null

loan ⟖ borrower (right outer join):

loan-number   branch-name   amount   customer-name
L-170         Downtown      3000     Jones
L-230         Redwood       4000     Smith
L-155         null          null     Hayes

loan ⟗ borrower (full outer join):

loan-number   branch-name   amount   customer-name
L-170         Downtown      3000     Jones
L-230         Redwood       4000     Smith
L-260         Perryridge    1700     null
L-155         null          null     Hayes
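The left outer join above can be sketched in Python, padding unmatched loan tuples with None in place of SQL null (the data is from the example; the list-of-tuples representation is an assumption):

```python
loan = [("L-170", "Downtown", 3000),
        ("L-230", "Redwood", 4000),
        ("L-260", "Perryridge", 1700)]
borrower = [("Jones", "L-170"), ("Smith", "L-230"), ("Hayes", "L-155")]

# loan left-outer-join borrower on loan-number: every loan tuple
# appears; unmatched ones get None for customer-name.
result = []
for ln, branch, amount in loan:
    matches = [name for name, bln in borrower if bln == ln]
    if matches:
        result.extend((ln, branch, amount, name) for name in matches)
    else:
        result.append((ln, branch, amount, None))
print(result)
```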
Null Values
It is possible for tuples to have a null value, denoted by null, for some of their attributes; null signifies an unknown value or that a value does not exist.
Comparisons with null values return the special truth value unknown
If false was used instead of unknown, then
not (A < 5)
would not be equivalent to
A >= 5
Three-valued logic using the truth value unknown is used to evaluate predicates involving null.
The content of the database may be modified using the following operations:
Deletion
Insertion
Updating
All these operations are expressed using the assignment operator.
Deletion
A delete request is expressed similarly to a query, except instead of displaying tuples to the user, the selected tuples are removed from the database.
Can delete only whole tuples; cannot delete values on only particular attributes
A deletion is expressed in relational algebra by:
r ← r − E
where r is a relation and E is a relational algebra query.
Deletion Examples
Delete all account records in the Perryridge branch.
account ← account − σ_{branch-name = "Perryridge"} (account)
Delete all accounts at branches located in Needham.
r1 ← σ_{branch-city = "Needham"} (account ⋈ branch)
r2 ← Π_{branch-name, account-number, balance} (r1)
r3 ← Π_{customer-name, account-number} (r2 ⋈ depositor)
account ← account − r2
depositor ← depositor − r3
Insertion
To insert data into a relation, we either specify a tuple to be inserted, or write a query whose result is a set of tuples to be inserted.
In relational algebra an insertion is expressed by:
r ← r ∪ E
where r is a relation and E is a relational algebra expression.
The insertion of a single tuple is expressed by letting E be a constant relation containing one tuple.
Insertion Examples
Insert information in the database specifying that Smith has $1200 in account A-973 at the Perryridge branch.
account ← account ∪ {("A-973", "Perryridge", 1200)}
depositor ← depositor ∪ {("Smith", "A-973")}
Updating
A mechanism to change a value in a tuple without changing all values in the tuple.
Use the generalized projection operator to do this task: r ← Π_{F1, F2, ..., Fn} (r)
Each Fi is either the i-th attribute of r, if the i-th attribute is not updated, or an expression involving only constants and the attributes of r, which gives the new value for the attribute.
Update Examples
Make interest payments by increasing all balances by 5 percent.
account ← Π_{AN, BN, bal * 1.05} (account)
where AN, BN and bal denote account-number, branch-name and balance, respectively.
Views
In some cases, it is not desirable for all users to see the entire
logical model (i.e., all the actual relations stored in the database.)
Consider a person who needs to know a customer's loan number but has no need to see the loan amount. This person should see
a relation described, in the relational algebra, by
Π_{customer-name, loan-number} (borrower ⋈ loan)
View Definition
A view is defined using the create view statement which has the
form
create view v as <query expression>
where <query expression> is any legal relational algebra query
expression. The view name is represented by v.
Once a view is defined, the view name can be used to refer to the virtual relation that the view generates.
View Examples
Consider the view (named all-customer) consisting of branches and their customers.
create view all-customer as
Π_{branch-name, customer-name} (depositor ⋈ account) ∪ Π_{branch-name, customer-name} (borrower ⋈ loan)
Find all customers of the Perryridge branch:
Π_{customer-name} (σ_{branch-name = "Perryridge"} (all-customer))
A view of the loan relation showing every attribute of the relation except amount. The view given to the person, branch-loan, is defined as:
create view branch-loan as
Π_{branch-name, loan-number} (loan)
Since we allow a view name to appear wherever a relation name is allowed, one view may be used in the expression defining another view.
A view relation v1 is said to depend on view relation v2 if v2 is used in the expression defining v1, or there is a path of dependencies from v1 to v2.
A view relation v is said to be recursive if it depends on itself.
View Expansion
A way to define the meaning of views defined in terms of other
views.
Let view v1 be defined by an expression e1 that may itself contain uses of view relations.
View expansion of an expression repeats the following replacement step:
repeat
Find any view relation vi in e1
Replace the view relation vi by the expression defining vi
until no more view relations are present in e1
As long as the view definitions are not recursive, this loop will
terminate
Tuple Relational Calculus
A nonprocedural query language, where each query is of the form
{ t | P(t) }
It is the set of all tuples t such that predicate P is true for t
t is a tuple variable, t[A] denotes the value of tuple t on attribute A
t r denotes that tuple t is in relation r
P is a formula similar to that of the predicate calculus
Banking Example
branch (branch-name, branch-city, assets)
customer (customer-name, customer-street, customer-city)
account (account-number, branch-name, balance)
loan (loan-number, branch-name, amount)
depositor (customer-name, account-number)
borrower (customer-name, loan-number)
Example Queries
Find the loan-number, branch-name, and amount for loans of
over $1200
{ t | t ∈ loan ∧ t[amount] > 1200 }
Find the loan number for each loan of an amount greater than
$1200
{ t | ∃ s ∈ loan (t[loan-number] = s[loan-number]
∧ s[amount] > 1200) }
Example Queries
Find the names of all customers having a loan, an account, or both at the bank
{ t | ∃ s ∈ borrower (t[customer-name] = s[customer-name])
∨ ∃ u ∈ depositor (t[customer-name] = u[customer-name]) }
Example Queries
Find the names of all customers having a loan at the Perryridge
branch
{ t | ∃ s ∈ borrower (t[customer-name] = s[customer-name]
∧ ∃ u ∈ loan (u[branch-name] = "Perryridge"
∧ u[loan-number] = s[loan-number])) }
Find the names of all customers who have a loan at the
Example Queries
Find the names of all customers having a loan from the
Example Queries
Find the names of all customers who have an account at all
Safety of Expressions
It is possible to write tuple calculus expressions that generate
infinite relations.
For example, {t | t r} results in an infinite relation if the
relational calculus
Each query is an expression of the form:
Example Queries
Find the branch-name, loan-number, and amount for loans of over
$1200
{ <l, b, a> | <l, b, a> ∈ loan ∧ a > 1200 }
Find the names of all customers who have a loan of over $1200
{ <c> | ∃ l, b, a (<c, l> ∈ borrower ∧ <l, b, a> ∈ loan ∧ a > 1200) }
Find the names of all customers who have a loan from the
Example Queries
Find the names of all customers having a loan, an account, or
Safety of Expressions
{ <x1, x2, ..., xn> | P(x1, x2, ..., xn) }
is safe if all of the following hold:
1.All values that appear in tuples of the expression are values
from dom(P) (that is, the values appear either in P or in a tuple
of a relation mentioned in P).
2. For every "there exists" subformula of the form ∃ x (P1(x)), the subformula is true if and only if there is a value of x in dom(P1) such that P1(x) is true.
3. For every "for all" subformula of the form ∀ x (P1(x)), the subformula is true if and only if P1(x) is true for all values x from dom(P1).
End of Chapter 3
Result of Π_{customer-name} (...)
Result of Π_{branch-name} (σ_{customer-city = "Harrison"} (customer ⋈ account ⋈ depositor))
Result of Π_{branch-name} (σ_{branch-city = "Brooklyn"} (branch))
Result of Π_{customer-name, branch-name} (depositor ⋈ account)
Result of employee ⋈ ft-works
Result of employee ⟕ ft-works
Result of employee ⟗ ft-works
E-R Diagram
Chapter 4: SQL
Basic Structure
Set Operations
Aggregate Functions
Null Values
Nested Subqueries
Derived Relations
Views
Modification of the Database
Joined Relations
Data Definition Language
Embedded SQL, ODBC and JDBC
Basic Structure
SQL is based on set and relational operations with certain modifications and enhancements.
A typical SQL query has the form
select A1, A2, ..., An
from r1, r2, ..., rm
where P
where each Ai represents an attribute, each ri a relation, and P is a predicate.
This query is equivalent to the relational algebra expression Π_{A1, A2, ..., An} (σ_P (r1 × r2 × ... × rm)).
SQL allows duplicates in relations as well as in query results; to force the elimination of duplicates, insert the keyword distinct after select.
Find the names of all branches in the loan relations, and remove
duplicates
select distinct branch-name
from loan
The keyword all specifies that duplicates not be removed.
The query:
select loan-number, branch-name, amount * 100
from loan
would return a relation which is the same as the loan relation,
except that the attribute amount is multiplied by 100.
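The query can be tried directly with Python's sqlite3 (an illustrative sketch; column names use underscores, since a bare hyphen is not legal in an unquoted SQL identifier, and the two loan rows are sample data):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table loan "
             "(loan_number char(10), branch_name varchar(20), amount integer)")
conn.executemany("insert into loan values (?, ?, ?)",
                 [("L-170", "Downtown", 3000), ("L-230", "Redwood", 4000)])

# amount is scaled by 100 in the select list, not in the stored table.
rows = conn.execute("select loan_number, branch_name, amount * 100 "
                    "from loan order by loan_number").fetchall()
print(rows)  # [('L-170', 'Downtown', 300000), ('L-230', 'Redwood', 400000)]
```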
Find the name, loan number and loan amount of all customers having a loan at the Perryridge branch:
select customer-name, borrower.loan-number, amount
from borrower, loan
where borrower.loan-number = loan.loan-number and
branch-name = 'Perryridge'
SQL allows renaming relations and attributes using the as clause:
old-name as new-name
Find the name, loan number and loan amount of all customers; rename the column name loan-number as loan-id:
select customer-name, borrower.loan-number as loan-id, amount
from borrower, loan
where borrower.loan-number = loan.loan-number
Tuple Variables
Tuple variables are defined in the from clause via the use of the
as clause.
Find the customer names and their loan numbers for all customers having a loan at some branch:
select customer-name, T.loan-number, S.amount
from borrower as T, loan as S
where T.loan-number = S.loan-number
String Operations
SQL includes a string-matching operator for comparisons on character strings, using the operator like and patterns described with two special characters:
percent (%): matches any substring
underscore (_): matches any character
Find the names of all customers whose street includes the substring "Main":
select customer-name
from customer
where customer-street like '%Main%'
Match the string "Main%":
like 'Main\%' escape '\'
List in alphabetic order the names of all customers having a loan in the Perryridge branch:
select distinct customer-name
from borrower, loan
where borrower.loan-number = loan.loan-number and
branch-name = 'Perryridge'
order by customer-name
We may specify desc for descending order or asc for ascending order; ascending order is the default.
Duplicates
In relations with duplicates, SQL can define how many copies of tuples appear in the result. Multiset versions of some relational algebra operators, given multiset relations r1 and r2:
1. If there are c1 copies of tuple t1 in r1, and t1 satisfies selection σθ, then there are c1 copies of t1 in σθ(r1).
2. For each copy of tuple t1 in r1, there is a copy of tuple ΠA(t1) in ΠA(r1), where ΠA(t1) denotes the projection of the single tuple t1.
3. If there are c1 copies of tuple t1 in r1 and c2 copies of tuple t2 in r2, there are c1 × c2 copies of the tuple t1.t2 in r1 × r2.
Duplicates (Cont.)
Example: Suppose multiset relations r1 (A, B) and r2 (C)
are as follows:
r1 = {(1, a), (2, a)}   r2 = {(2), (3), (3)}
Then ΠB(r1) would be {(a), (a)}, while ΠB(r1) × r2 would be
{(a,2), (a,2), (a,3), (a,3), (a,3), (a,3)}
Set Operations
The set operations union, intersect, and except operate on relations and correspond to the relational-algebra operations ∪, ∩, −. Each automatically eliminates duplicates; to retain all duplicates use the multiset versions union all, intersect all and except all.
Set Operations
Find all customers who have a loan, an account, or both:
(select customer-name from depositor)
union
(select customer-name from borrower)
Aggregate Functions
These functions operate on the multiset of values of a column of a relation, and return a value: avg, min, max, sum, count.
Null Values
It is possible for tuples to have a null value, denoted by null, for
E.g. Find all loan number which appear in the loan relation with
null values for amount.
select loan-number
from loan
where amount is null
The result of any arithmetic expression involving null is null.
Any comparison with null returns unknown; e.g., 5 < null, null <> null, or
null = null
each evaluates to unknown.
Nested Subqueries
SQL provides a mechanism for the nesting of subqueries.
A subquery is a select-from-where expression that is nested within another query. A common use of subqueries is to perform tests for set membership, set comparisons, and set cardinality.
Example Query
Find all customers who have both an account and a loan at the
bank.
select distinct customer-name
from borrower
where customer-name in (select customer-name
from depositor)
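The membership test can be tried with sqlite3 (a sketch; underscores replace hyphens in identifiers, and the rows are sample data in which only Jones appears in both relations):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table borrower "
             "(customer_name varchar(20), loan_number char(10))")
conn.execute("create table depositor "
             "(customer_name varchar(20), account_number char(10))")
conn.executemany("insert into borrower values (?, ?)",
                 [("Jones", "L-170"), ("Hayes", "L-155")])
conn.executemany("insert into depositor values (?, ?)",
                 [("Jones", "A-101"), ("Smith", "A-215")])

# Customers with both a loan and an account: borrower names that
# also occur in depositor.
rows = conn.execute("""
    select distinct customer_name from borrower
    where customer_name in (select customer_name from depositor)
""").fetchall()
print(rows)  # [('Jones',)]
```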
Find all customers who have a loan at the bank but do not have an account at the bank:
select distinct customer-name
from borrower
where customer-name not in (select customer-name
from depositor)
Example Query
Find all customers who have both an account and a loan at the
Perryridge branch
select distinct customer-name
from borrower, loan
where borrower.loan-number = loan.loan-number and
branch-name = Perryridge and
(branch-name, customer-name) in
(select branch-name, customer-name
from depositor, account
where depositor.account-number =
account.account-number)
Note: Above query can be written in a much simpler manner.
Set Comparison
Find all branches that have greater assets than some branch
located in Brooklyn.
select distinct T.branch-name
from branch as T, branch as S
where T.assets > S.assets and
S.branch-city = Brooklyn
Same query using > some clause
select branch-name
from branch
where assets > some
(select assets
from branch
where branch-city = Brooklyn)
Definition of the some clause: F <comp> some r ⇔ ∃ t ∈ r (F <comp> t), where <comp> can be <, ≤, >, =, ≠

(5 < some {0, 5, 6}) = true   (read: 5 < some tuple in the relation)
(5 < some {0, 5}) = false
(5 = some {0, 5}) = true
(5 ≠ some {0, 5}) = true   (since 0 ≠ 5)
(= some) ≡ in; however, (≠ some) ≢ not in

Definition of the all clause: F <comp> all r ⇔ ∀ t ∈ r (F <comp> t)

(5 < all {0, 5, 6}) = false
(5 < all {6, 10}) = true
(5 = all {4, 5}) = false
(5 ≠ all {4, 6}) = true   (since 5 ≠ 4 and 5 ≠ 6)
(≠ all) ≡ not in; however, (= all) ≢ in
Example Query
Find the names of all branches that have greater assets than all branches located in Brooklyn:
select branch-name
from branch
where assets > all
(select assets
from branch
where branch-city = 'Brooklyn')
Test for Empty Relations
The exists construct returns the value true if the argument subquery is nonempty.
exists r ⇔ r ≠ ∅
not exists r ⇔ r = ∅
Example Query
Find all customers who have an account at all branches located in
Brooklyn.
select distinct S.customer-name
from depositor as S
where not exists (
(select branch-name
from branch
where branch-city = Brooklyn)
except
(select R.branch-name
from depositor as T, account as R
where T.account-number = R.account-number and
S.customer-name = T.customer-name))
(Schema used in this example)
Note that X − Y = ∅ ⇔ X ⊆ Y
Note: Cannot write this query using = all and its variants
Test for Absence of Duplicate Tuples
The unique construct tests whether a subquery has any duplicate tuples in its result.
Find all customers who have at most one account at the Perryridge branch.
select T.customer-name
from depositor as T
where unique (
select R.customer-name
from account, depositor as R
where T.customer-name = R.customer-name and
R.account-number = account.account-number and
account.branch-name = Perryridge)
(Schema used in this example)
Example Query
Find all customers who have at least two accounts at the
Perryridge branch.
select distinct T.customer-name
from depositor T
where not unique (
select R.customer-name
from account, depositor as R
where T.customer-name = R.customer-name and
R.account-number = account.account-number and
account.branch-name = Perryridge)
(Schema used in this example)
Views
Provide a mechanism to hide certain data from the view of certain users. To create a view we use the command
create view v as <query expression>
Example Queries
A view consisting of branches and their customers:
create view all-customer as
(select branch-name, customer-name
from depositor, account
where depositor.account-number = account.account-number)
union
(select branch-name, customer-name
from borrower, loan
where borrower.loan-number = loan.loan-number)
Find all customers of the Perryridge branch:
select customer-name
from all-customer
where branch-name = Perryridge
Derived Relations
Find the average account balance of those branches where the average account balance is greater than $1200:
select branch-name, avg-balance
from (select branch-name, avg(balance)
from account
group by branch-name)
as branch-avg (branch-name, avg-balance)
where avg-balance > 1200
With Clause
The with clause allows views to be defined locally to a query, rather than globally.
Find all accounts with the maximum balance:
with max-balance(value) as
select max (balance)
from account
select account-number
from account, max-balance
where account.balance = max-balance.value
Example Query
Delete the record of all accounts with balances below the average at the bank:
delete from account
where balance < (select avg(balance) from account)
Example Query
Provide as a gift for all loan customers in the Perryridge branch a $200 savings account. Let the loan number serve as the account number for the new savings account:
insert into account
select loan-number, branch-name, 200
from loan
where branch-name = Perryridge
insert into depositor
select customer-name, loan-number
from loan, borrower
where branch-name = Perryridge
and loan.loan-number = borrower.loan-number
The select from where statement is fully evaluated before any of its results are inserted into the relation (otherwise queries like
insert into table1 select * from table1
would cause problems).
Update of a View
Create a view of all loan data in loan relation, hiding the amount
attribute
create view branch-loan as
select branch-name, loan-number
from loan
Add a new tuple to branch-loan:
insert into branch-loan values ('Perryridge', 'L-307')
This insertion must be represented by the insertion of the tuple ('L-307', 'Perryridge', null) into the loan relation. Updates on more complex views are difficult or impossible to translate, and hence are disallowed.
Transactions
A transaction is a sequence of queries and update statements executed
as a single unit
Transactions are started implicitly and terminated by one of:
commit work: makes all updates of the transaction permanent in the
database
rollback work: undoes all updates performed by the transaction.
Motivating example: transfer of money from one account to another involves two steps, deduct from one account and credit to the other.
If one step succeeds and the other fails, the database is in an inconsistent state.
Therefore, either both steps should succeed or neither should.
If any step of a transaction fails, all work done by the transaction can be undone by rollback work; rollback of incomplete transactions is done automatically in case of system failures.
Transactions (Cont.)
In most database systems, each SQL statement that executes successfully is automatically committed; each transaction would then consist of only a single statement.
Automatic commit can usually be turned off, allowing multi-statement transactions; how to do so depends on the database system.
Another option in SQL:1999: enclose statements within
begin atomic
...
end
Joined Relations
Join operations take two relations and return as a result another
relation.
These additional operations are typically used as subquery expressions in the from clause.
Join condition: defines which tuples in the two relations match, and what attributes are present in the result of the join.
Join type: defines how tuples in each relation that do not match any tuple in the other relation (based on the join condition) are treated.
Join Types              Join Conditions
inner join              natural
left outer join         on <predicate>
right outer join        using (A1, A2, ..., An)
full outer join
Relation loan:

loan-number   branch-name   amount
L-170         Downtown      3000
L-230         Redwood       4000
L-260         Perryridge    1700
Relation borrower
customer-name   loan-number
Jones           L-170
Smith           L-230
Hayes           L-155
loan inner join borrower on loan.loan-number = borrower.loan-number:

loan-number   branch-name   amount   customer-name   loan-number
L-170         Downtown      3000     Jones           L-170
L-230         Redwood       4000     Smith           L-230
loan left outer join borrower on loan.loan-number = borrower.loan-number:

loan-number   branch-name   amount   customer-name   loan-number
L-170         Downtown      3000     Jones           L-170
L-230         Redwood       4000     Smith           L-230
L-260         Perryridge    1700     null            null
loan natural inner join borrower:

loan-number   branch-name   amount   customer-name
L-170         Downtown      3000     Jones
L-230         Redwood       4000     Smith
loan natural right outer join borrower:

loan-number   branch-name   amount   customer-name
L-170         Downtown      3000     Jones
L-230         Redwood       4000     Smith
L-155         null          null     Hayes
loan full outer join borrower using (loan-number):

loan-number   branch-name   amount   customer-name
L-170         Downtown      3000     Jones
L-230         Redwood       4000     Smith
L-260         Perryridge    1700     null
L-155         null          null     Hayes
Find all customers who have either an account or a loan (but not
both) at the bank.
select customer-name
from (depositor natural full outer join borrower)
where account-number is null or loan-number is null
Domain types in SQL:
char(n). Fixed-length character string, with user-specified length n.
varchar(n). Variable-length character string, with user-specified maximum length n.
int. Integer (a finite subset of the integers that is machine-dependent).
smallint. Small integer (a machine-dependent subset of the integer
domain type).
numeric(p,d). Fixed point number, with user-specified precision of p digits, with d digits to the right of the decimal point.
real, double precision. Floating point and double-precision floating point numbers.
date. Dates, containing a (4 digit) year, month and day.
time. Time of day, in hours, minutes and seconds, e.g.,
time 09:00:30.75
An SQL relation is defined using the create table command:
create table r (A1 D1, A2 D2, ..., An Dn,
(integrity-constraint1),
...,
(integrity-constraintk))
r is the name of the relation
each Ai is an attribute name in the schema of relation r
Di is the data type of values in the domain of attribute Ai
Example:
create table branch
(branch-name char(15) not null,
branch-city varchar(30),
assets integer)
The alter table command is used to add attributes to, or drop attributes from, an existing relation:
alter table r add A D
where A is the name of the attribute to be added to relation r and D is the domain of A
alter table r drop A
where A is the name of an attribute of relation r
Dropping of attributes is not supported by many databases
Embedded SQL
The SQL standard defines embeddings of SQL in a variety of programming languages, such as Pascal, PL/I, Fortran, C, and Cobol.
The EXEC SQL statement is used to identify an embedded SQL request to the preprocessor:
EXEC SQL <embedded SQL statement> END-EXEC
Note: this varies by language. E.g., the Java embedding (SQLJ) uses
#sql { .... };
Example Query
From within a host language, find the names and cities of
customers with more than the variable amount dollars in some
account.
Specify the query in SQL and declare a cursor for it
EXEC SQL
declare c cursor for
select customer-name, customer-city
from depositor, customer, account
where depositor.customer-name = customer.customer-name
and depositor.account-number = account.account-number
and account.balance > :amount
END-EXEC
We can update tuples fetched by a cursor by declaring that the cursor is for update:
declare c cursor for
select *
from account
where branch-name = Perryridge
for update
To update tuple at the current location of cursor
update account
set balance = balance + 100
where current of c
Dynamic SQL
Allows programs to construct and submit SQL queries at run
time.
Example of the use of dynamic SQL from within a C program.
ODBC
Open DataBase Connectivity (ODBC) standard: an API for application programs to open a connection with a database, send queries and updates, and get back results.
ODBC (Cont.)
Each database system supporting ODBC provides a "driver" library that
communicates with the server to carry out the requested action, and
fetch results.
ODBC program first allocates an SQL environment, then a database
connection handle.
Opens database connection using SQLConnect(). Parameters for
SQLConnect:
connection handle,
the server to which to connect
the user identifier,
password
Must also specify types of arguments:
ODBC Code
int ODBCexample()
{
    RETCODE error;
    HENV env;  /* environment */
    HDBC conn; /* database connection */
    SQLAllocEnv(&env);
    SQLAllocConnect(env, &conn);
    SQLConnect(conn, "aura.bell-labs.com", SQL_NTS, "avi", SQL_NTS,
               "avipasswd", SQL_NTS);
    /* ... do actual work ... */
    SQLDisconnect(conn);
    SQLFreeConnect(conn);
    SQLFreeEnv(env);
    return 0;
}
The program sends SQL commands to the database using SQLExecDirect; result tuples are fetched using SQLFetch(). SQLBindCol() binds C language variables to attributes of the query result, so that when a tuple is fetched, its attribute values are automatically stored in corresponding C variables.
Arguments to SQLBindCol()
ODBC stmt variable, attribute position in query result
The type conversion from SQL to C.
The address of the variable.
For variable-length types like character arrays,
The maximum length of the variable
Location to store actual length when a tuple is fetched.
Note: A negative value returned for the length field indicates null
value
by the standard.
Core
Level 1 requires support for metadata querying
Level 2 requires ability to send and retrieve arrays of parameter
values and more detailed catalog information.
SQL Call Level Interface (CLI) standard similar to ODBC
JDBC
JDBC is a Java API for communicating with database systems
supporting SQL
JDBC supports a variety of features for querying and updating
Open a connection
Create a statement object
Execute queries using the Statement object to send queries and
fetch results
Exception mechanism to handle errors
JDBC Code
public static void JDBCexample(String dbid, String userid, String passwd)
{
try {
Class.forName ("oracle.jdbc.driver.OracleDriver");
Connection conn = DriverManager.getConnection(
"jdbc:oracle:thin:@aura.bell-labs.com:2000:bankdb", userid, passwd);
Statement stmt = conn.createStatement();
... Do Actual Work ...
stmt.close();
conn.close();
}
catch (SQLException sqle) {
System.out.println("SQLException : " + sqle);
}
}
Prepared Statement
Prepared statement allows queries to be compiled and executed
Databases)
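The prepared-statement idea above (compile a query once with placeholders, supply values at execution time) can be sketched with Python's sqlite3, standing in for JDBC; the account table and its rows are hypothetical:

```python
import sqlite3

# In-memory database stands in for the bank database on the slides.
conn = sqlite3.connect(":memory:")
conn.execute("create table account (account_number text, branch_name text, balance integer)")
conn.execute("insert into account values ('A-101', 'Perryridge', 500)")
conn.execute("insert into account values ('A-215', 'Mianus', 700)")

# The query is written once with ? placeholders; values are supplied
# at execution time, never spliced into the SQL string.
cur = conn.execute("select account_number from account where balance > ?", (600,))
rows = cur.fetchall()
print(rows)  # [('A-215',)]
```

As with JDBC's PreparedStatement, the same statement can be re-executed with different parameter values, and the parameters are never interpreted as SQL text.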
Transactions in JDBC
As with ODBC, each statement gets committed automatically in
JDBC
Can turn off auto commit on a connection
conn.setAutoCommit(false);
To commit or abort transactions
conn.commit() or conn.rollback()
To turn auto commit on again
conn.setAutoCommit(true);
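The commit/rollback pattern can be mirrored with Python's sqlite3 (a stand-in for JDBC); the account data and the simulated failure are hypothetical:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table account (account_number text, balance integer)")
conn.execute("insert into account values ('A-101', 500)")
conn.commit()

# A transfer that fails midway: roll back so the debit does not survive.
try:
    conn.execute("update account set balance = balance - 100 "
                 "where account_number = 'A-101'")
    raise RuntimeError("credit step failed")  # simulated failure
except RuntimeError:
    conn.rollback()  # undo the whole transaction

balance = conn.execute("select balance from account").fetchone()[0]
print(balance)  # 500: the debit was undone
```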
procedures/functions to be invoked.
CallableStatement cs1 = conn.prepareCall("{call proc (?,?)}");
CallableStatement cs2 = conn.prepareCall("{? = call func (?,?)}");
ResultSet.
Provides Functions for getting number of columns, column name,
type, precision, scale, table from which the column is derived etc.
ResultSetMetaData rsmd = rs.getMetaData ( );
for ( int i = 1; i <= rsmd.getColumnCount( ); i++ ) {
String name = rsmd.getColumnName(i);
String typeName = rsmd.getColumnTypeName(i);
}
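The same kind of result-set metadata is exposed by Python's sqlite3 through cursor.description (a rough analogue of JDBC's ResultSetMetaData); the table here is hypothetical:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table account (account_number text, branch_name text, balance integer)")
cur = conn.execute("select * from account")

# cursor.description holds one 7-tuple per result column; the first
# entry of each tuple is the column name (sqlite3 leaves the rest None).
names = [col[0] for col in cur.description]
print(names)  # ['account_number', 'branch_name', 'balance']
```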
Application Architectures
Applications can be built using one of two architectures
Two-tier Model
E.g. Java code runs at client site and uses JDBC to
[Three-tier architecture diagram: clients connect over the network to an
application/HTTP server running servlets, which uses JDBC to talk to the
database server]
Client sends request over http or application-specific protocol
Application or Web server receives request
Request handled by CGI program or servlets
Security handled by application at server
Better security
Fine granularity security
Simple client, but only packaged transactions
End of Chapter
Query-by-Example (QBE)
Basic Structure
Queries on One Relation
Queries on Several Relations
The Condition Box
The Result Relation
Ordering the Display of Tuples
Aggregate Operations
Modification of the Database
Method 1:
P._x
Method 2: Shorthand notation
P._y
P._z
Find the loan number of all loans with a loan amount of more than $700
and Jones.
Negation in QBE
values
Find all account numbers with a balance between $1,300 and $2,000
Aggregate Operations
The aggregate operators are AVG, MAX, MIN, SUM, and CNT
The above operators must be postfixed with ALL (e.g., AVG.ALL or SUM.ALL)
Query Examples
Find the average balance at each branch.
Query Example
Find all customers who have an account at all branches located
in Brooklyn.
Approach: for each customer, find the number of branches in
Brooklyn at which they have accounts, and compare with total
number of branches in Brooklyn
QBE does not provide subquery functionality, so both above tasks
have to be combined in a single query.
Can be done for this query, but there are queries that require
Perryridge.
expression.
Insert the fact that account A-9732 at the Perryridge
new $200 savings account for every loan account they have, with
the loan number serving as the account number for the new
savings account.
all values in the tuple. QBE does not allow users to update the
primary key fields.
Update the asset value of the Perryridge branch to $10,000,000.
Datalog
Basic Structure
Syntax of Datalog Rules
Semantics of Nonrecursive Datalog
Safety
Relational Operations in Datalog
Recursion in Datalog
The Power of Recursion
Basic Structure
Prolog-like logic-based language that allows recursive queries;
relation v1.
? v1("A-217", B).
To find account number and balance of all accounts in v1 that
Example Queries
Each rule defines a set of tuples that a view relation must contain.
is
for all A, B
if (A, "Perryridge", B) ∈ account and B > 700
then (A, B) ∈ v1
The set of tuples in a view relation is then defined as the union of
all the sets of tuples defined by the rules for the view relation.
Example:
Negation in Datalog
Define a view relation c that contains the names of all customers
Semantics of a Rule
A ground instantiation of a rule (or simply instantiation) is the
result of replacing each variable in the rule by some constant.
A ground instantiation R of a rule is satisfied in a set of facts
(database instance) I if
1. For each positive literal qi(vi,1, ..., vi,ni) in the body of R, I contains
the fact qi(vi,1, ..., vi,ni).
2. For each negative literal not qj(vj,1, ..., vj,nj) in the body of R, I does
not contain the fact qj(vj,1, ..., vj,nj).
Layering of Rules
Define the interest on each account in Perryridge
interest(A, I) :- perryridge-account(A, B),
interest-rate(A, R), I = B * R / 100.
perryridge-account(A, B) :- account(A, "Perryridge", B).
interest-rate(A, 0) :- account(A, N, B), B < 2000.
interest-rate(A, 5) :- account(A, N, B), B >= 2000.
Layering of the view relations
Semantics of a Program
Let the layers in a given program be 1, 2, ..., n. Let i denote the
set of all rules defining view relations in layer i.
Define I0 = set of facts stored in the database.
Recursively define Ii+1 = Ii ∪ infer(Ri+1, Ii)
The set of facts in the view relations defined by the program
Safety
It is possible to write rules that generate an infinite number of
answers.
gt(X, Y) :- X > Y
not-in-loan(B, L) :- not loan(B, L)
To avoid this possibility Datalog rules must satisfy the following
conditions.
Every variable that appears in the head of the rule also appears in a
non-arithmetic positive literal in the body of the rule.
This condition can be weakened in special cases based on the
Updates in Datalog
Some Datalog extensions support database modification using + or
Recursion in Datalog
Suppose we are given a relation
manager(X, Y)
containing pairs of names X, Y such that Y is a manager of X (or
equivalently, X is a direct employee of Y).
? empl(X, Jones).
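The recursive empl view (empl(X, Y) :- manager(X, Y) and empl(X, Y) :- manager(X, Z), empl(Z, Y)) can be evaluated bottom-up by applying the rules until no new facts appear; this is a minimal sketch of that fixpoint iteration, with hypothetical employee names:

```python
# manager(X, Y): Y is a direct manager of X (hypothetical facts).
manager = {("Alon", "Barinsky"), ("Barinsky", "Estovar"), ("Corbin", "Duarte")}

# Naive bottom-up (fixpoint) evaluation of the two Datalog rules.
empl = set(manager)                      # rule 1: empl(X,Y) :- manager(X,Y)
changed = True
while changed:
    changed = False
    for (x, z) in manager:               # rule 2: empl(X,Y) :- manager(X,Z), empl(Z,Y)
        for (z2, y) in set(empl):
            if z == z2 and (x, y) not in empl:
                empl.add((x, y))
                changed = True

print(sorted(empl))  # includes the derived fact ('Alon', 'Estovar')
```

The loop terminates because each iteration can only add pairs drawn from a finite set of names, mirroring why safe Datalog programs have finite fixpoints.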
Recursion in SQL
SQL:1999 permits recursive view definition
E.g. query to find all employee-manager pairs
Monotonicity
A view V is said to be monotonic if given any two sets of facts
Non-Monotonicity
Procedure Datalog-Fixpoint is sound provided the rules in the
Stratified Negation
A Datalog program is said to be stratified if its predicates can be
Non-Monotonicity (Cont.)
There are useful queries that cannot be expressed by a stratified
program
E.g., given information about the number of each subpart in each
part, in a part-subpart hierarchy, find the total number of subparts of
each part.
A program to compute the above query would have to mix
aggregation with recursion
However, so long as the underlying data (part-subpart) has no
cycles, it is possible to write a program that mixes aggregation with
recursion, yet has a clear meaning
There are ways to evaluate some such classes of non-stratified
programs
Report Generators
Report generators are tools to generate human-readable
A Formatted Report
End of Chapter
The v1 Relation
Result of infer(R, I)
Domain Constraints
Integrity constraints guard against accidental damage to the
constraint.
They test values inserted in the database, and test queries to
type Pounds.
However, we can convert type as below
(cast r.A as Pounds)
(Should also multiply by the dollar-to-pound conversion-rate)
Referential Integrity
Ensures that a value that appears in one relation for a given set of
E2
constraints. The relation schema for a weak entity set must
include the primary key of the entity set on which it depends.
σα = t1[K] (r2)
If this set is not empty, either the delete command is rejected
as an error, or the tuples that reference t1 must themselves
be deleted (cascading deletions are possible).
σα = t1[K] (r2)
using the old value of t1 (the value before the update is applied).
If this set is not empty, the update may be rejected as an error,
or the update may be cascaded to the tuples in the set, or the
tuples in the set may be deleted.
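The cascading-delete option described above can be demonstrated with SQLite's foreign-key support (a stand-in for the referential-integrity machinery on the slides); the branch and account data are hypothetical:

```python
import sqlite3

# Autocommit mode so the PRAGMA takes effect immediately.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute("pragma foreign_keys = on")   # SQLite enforces FKs only when asked
conn.execute("create table branch (branch_name text primary key)")
conn.execute("""create table account (
    account_number text primary key,
    branch_name text references branch(branch_name) on delete cascade)""")
conn.execute("insert into branch values ('Perryridge')")
conn.execute("insert into account values ('A-102', 'Perryridge')")

# Deleting the referenced branch cascades to the accounts referencing it.
conn.execute("delete from branch where branch_name = 'Perryridge'")
remaining = conn.execute("select count(*) from account").fetchone()[0]
print(remaining)  # 0
```

Without "on delete cascade", the same delete would instead be rejected as an error, which is the other behavior the slides describe.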
Assertions
An assertion is a predicate expressing a condition that we wish
Assertion Example
The sum of all loan amounts for each branch must be less than
Assertion Example
Every loan has at least one borrower who maintains an account with
Triggers
A trigger is a statement that is executed automatically by the
Trigger Example
Suppose that instead of allowing negative account balances, the
instead of
actions, BUT
Triggers can be used to record actions-to-be-taken in a separate table,
and we can have an external process that repeatedly scans the table
and carries out external-world actions.
E.g. Suppose a warehouse has the following tables
cases
Define methods to update fields
Carry out actions as part of the update methods instead of through a
trigger
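The idea of a trigger recording actions-to-be-taken in a separate table can be sketched with SQLite triggers via Python's sqlite3; the overdraft scenario and table names are hypothetical:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table account (account_number text, balance integer)")
conn.execute("create table orders (account_number text, amount integer)")

# Record an action-to-be-taken whenever a balance goes negative.
conn.execute("""
    create trigger overdraft after update on account
    when new.balance < 0
    begin
        insert into orders values (new.account_number, -new.balance);
    end""")

conn.execute("insert into account values ('A-102', 100)")
conn.execute("update account set balance = balance - 400 "
             "where account_number = 'A-102'")

logged = conn.execute("select * from orders").fetchone()
print(logged)  # ('A-102', 300)
```

An external process could then scan orders and carry out the real-world action, exactly as the slides suggest.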
Security
Security - protection from malicious attempts to steal or modify data.
Security (Cont.)
Physical level
Physical access to computers allows destruction of data by
Human level
Users must be screened to ensure that unauthorized users do
Authorization
Forms of authorization on parts of the database:
Read authorization - allows reading, but not modification of data.
Insert authorization - allows insertion of new data, but not
data.
Delete authorization - allows deletion of data
Authorization (Cont.)
Forms of authorization to modify the database schema:
Index authorization - allows creation and deletion of indices.
Resources authorization - allows creation of new relations.
Alteration authorization - allows addition or deletion of attributes in
a relation.
Drop authorization - allows deletion of relations.
View Example
Suppose a bank clerk needs to know the names of the
processing begins.
Authorization on Views
Creation of view does not require resources authorization since
Granting of Privileges
The passage of authorization from one user to another may be
U1
DBA
U2
U3
U4
U5
a user-id
public, which allows all valid users the privilege granted
A role (more on this later)
Granting a privilege on a view does not imply granting any
Privileges in SQL
select: allows read access to relation,or the ability to query using
the view
Example: grant users U1, U2, and U3 select authorization on the branch
relation:
Roles
Roles permit common privileges for a class of users to be
specified just once, by creating a corresponding role that such users
may hold.
If <revokee-list> includes public all users lose the privilege
different grantees, the user may retain the privilege after the
revocation.
All privileges that depend on the privilege being revoked are also
revoked.
E.g. we cannot restrict students to see only (the tuples storing) their
own grades
All end-users of an application (such as a web application) may
Encryption
Data may be encrypted when database authorization provisions
Encryption (Cont.)
Authentication
Password based authentication is widely used, but is susceptible
to sniffing on a network
Challenge-response systems avoid transmission of passwords
E.g. use private key (in reverse) to encrypt data, and anyone can
verify authenticity by using public key (in reverse) to decrypt data.
Only holder of private key could have created the encrypted data.
Digital signatures also help ensure nonrepudiation: sender
cannot later claim to have not created the data
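A challenge-response exchange of the kind described above can be sketched with an HMAC over a shared secret (one common realization, not the only one); the password value is the hypothetical one from the earlier ODBC example:

```python
import hashlib
import hmac
import os

# Shared secret known to both parties; the password never crosses the wire.
secret = b"avipasswd"            # hypothetical password

challenge = os.urandom(16)       # server sends a fresh random challenge
# Client answers with a keyed hash of the challenge.
response = hmac.new(secret, challenge, hashlib.sha256).digest()

# Server recomputes the same HMAC and compares in constant time.
expected = hmac.new(secret, challenge, hashlib.sha256).digest()
ok = hmac.compare_digest(response, expected)
print(ok)  # True
```

Because each challenge is fresh, a sniffer who replays an old response fails the next exchange.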
End of Chapter
Statistical Databases
Problem: how to ensure privacy of individuals while allowing use
Authorization-Grant Graph
Authorization Graph
security)
Protection from improper use of superuser authority.
Protection from improper use of privileged machine instructions.
Network-Level Security
Each site must ensure that it communicates with trusted sites (not
intruders).
Links must be protected from theft or modification of messages
Mechanisms:
Database-Level Security
Assume security at network, operating system, human, and
physical levels.
Database specific issues:
each user may have authority to read only part of the data and to
write only part of the data.
User authority may correspond to entire files or relations, but it may
also correspond only to parts of files or relations.
Local autonomy suggests site-level authorization control in a
distributed database.
Global control suggests centralized control.
units
Examples of non-atomic domains:
Set of names, composite attributes
Identification numbers like CS101 that can be broken up into
parts
A relational schema R is in first normal form if the domains of all
Example
Consider the relation schema:
Redundancy:
Data for branch-name, branch-city, assets are repeated for each loan that a
branch makes
Wastes space
Complicates updating, introducing possibility of inconsistency of assets value
Null values
Decomposition
Decompose the relation schema Lending-schema into:
R2 (r)
Decomposition of R = (A, B) into R1 = (A) and R2 = (B)

r:  A  B        ΠA(r):  A        ΠB(r):  B
    α  1                α                1
    α  2                β                2
    β  1

ΠA(r) ⋈ ΠB(r) contains (α, 1), (α, 2), (β, 1), (β, 2): the spurious
tuple (β, 2) is not in r, so the decomposition is lossy.
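The lossy decomposition of R = (A, B) into R1 = (A) and R2 = (B) can be checked directly by projecting and rejoining; Greek constants are replaced by the strings "alpha" and "beta":

```python
# r over schema (A, B); decompose into R1 = (A) and R2 = (B).
r = {("alpha", 1), ("beta", 2)}

proj_a = {(a,) for (a, b) in r}          # Π_A(r)
proj_b = {(b,) for (a, b) in r}          # Π_B(r)

# Natural join of the projections; with no shared attributes
# this is just the Cartesian product.
joined = {(a, b) for (a,) in proj_a for (b,) in proj_b}

print(sorted(joined))
# [('alpha', 1), ('alpha', 2), ('beta', 1), ('beta', 2)]
# 4 tuples, but r had only 2: the decomposition is lossy.
```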
functional dependencies
multivalued dependencies
Functional Dependencies
Constraints on the set of legal relations.
Require that the value for a certain set of attributes determines
key.
R and R
The functional dependency
α → β
holds on R if and only if for any legal relations r(R), whenever any
two tuples t1 and t2 of r agree on the attributes α, they also agree
on the attributes β. That is,
t1[α] = t2[α] implies t1[β] = t2[β]
Example: Consider r(A,B) with the following instance of r.
A  B
1  4
1  5
3  7
On this instance, A → B does NOT hold, but B → A does hold.
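Whether a given FD holds on a particular instance can be checked by comparing every pair of tuples, directly following the definition; a minimal sketch, using the instance of r(A, B) above:

```python
# Does alpha -> beta hold on this instance? Check every pair of tuples.
# alpha and beta are lists of column positions.
def holds(rows, alpha, beta):
    for t1 in rows:
        for t2 in rows:
            if all(t1[i] == t2[i] for i in alpha) and \
               not all(t1[i] == t2[i] for i in beta):
                return False
    return True

# The instance of r(A, B): A is column 0, B is column 1.
r = [(1, 4), (1, 5), (3, 7)]
print(holds(r, [0], [1]))  # False: (1,4) and (1,5) agree on A but not on B
print(holds(r, [1], [0]))  # True: no two tuples share a B value
```

Note this only tests one instance; a FD is a constraint on all legal instances, so an instance can satisfy a FD without the FD holding on the schema.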
K → R, and
for no α ⊂ K, α → R
Functional dependencies allow us to express constraints that
test relations to see if they are legal under a given set of functional
dependencies.
If a relation r is legal under a set F of functional dependencies, we
functional dependencies F.
Note: A specific instance of a relation schema may satisfy a
of a relation
E.g.
customer-name, loan-number → customer-name
customer-name → customer-name
In general, α → β is trivial if β ⊆ α
closure of F.
We denote the closure of F by F+.
We can find all of F+ by applying Armstrong's Axioms:
if β ⊆ α, then α → β
(reflexivity)
if α → β, then γα → γβ
(augmentation)
if α → β and β → γ, then α → γ
(transitivity)
Example
R = (A, B, C, G, H, I)
F = { A → B
A → C
CG → H
CG → I
B → H }
some members of F+
A → H
by transitivity from A → B and B → H
AG → I
by augmenting A → C with G, to get AG → CG,
and then transitivity with CG → I
CG → HI
from CG → H and CG → I : "union rule" can be inferred from
definition of functional dependencies, or
Augmentation of CG → I to infer CG → CGI, augmentation of
CG → H to infer CGI → HI, and then transitivity
F+ = F
repeat
for each functional dependency f in F+
apply reflexivity and augmentation rules on f
add the resulting functional dependencies to F+
for each pair of functional dependencies f1and f2 in F+
if f1 and f2 can be combined using transitivity
then add the resulting functional dependency to F+
until F+ does not change any further
NOTE: We will see an alternative procedure for this task later
result := α;
while (changes to result) do
for each β → γ in F do
begin
if β ⊆ result then result := result ∪ γ
end
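The attribute-closure algorithm above translates almost line for line into Python; a minimal sketch, using the running example F = {A → B, A → C, CG → H, CG → I, B → H}:

```python
# Attribute closure: result := alpha; add gamma whenever some FD
# beta -> gamma has beta already contained in result.
def closure(alpha, fds):
    result = set(alpha)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result

# F from the running example; each FD is a (lhs, rhs) pair of attribute strings.
F = [("A", "B"), ("A", "C"), ("CG", "H"), ("CG", "I"), ("B", "H")]
print(sorted(closure("AG", F)))  # ['A', 'B', 'C', 'G', 'H', 'I']
```

The steps mirror the trace that follows: AG, then ABCG, ABCGH, ABCGHI.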
F = { A → B
A → C
CG → H
CG → I
B → H }
(AG)+
1. result = AG
2. result = ABCG
3. result = ABCGH
4. result = ABCGHI
(A → C and A → B)
(CG → H and CG ⊆ AGBC)
(CG → I and CG ⊆ AGBCH)
Is AG a candidate key?
1. Is AG a super key?
1. Does AG R?
2. Is any subset of AG a superkey?
1. Does A+ R?
2. Does G+ R?
Canonical Cover
Sets of functional dependencies may have redundant
E.g. on LHS:
{A → B, B → C, AC → D} can be simplified to
{A → B, B → C, A → D}
Extraneous Attributes
Consider a set F of functional dependencies and the functional
dependency in F.
Attribute A is extraneous in α if A ∈ α
and F logically implies (F - {α → β}) ∪ {(α - A) → β}.
Attribute A is extraneous in β if A ∈ β
and the set of functional dependencies
(F - {α → β}) ∪ {α → (β - A)} logically implies F.
Note: implication in the opposite direction is trivial in each of
dependency in F.
F' = (F - {α → β}) ∪ {α → (β - A)},
2.
Canonical Cover
A canonical cover for F is a set of dependencies Fc such that
repeat
Use the union rule to replace any dependencies in F
α1 → β1 and α1 → β2 with α1 → β1β2
Find a functional dependency α → β with an
extraneous attribute either in α or in β
If an extraneous attribute is found, delete it from α → β
until F does not change
Note: Union rule may become applicable after some extraneous
F = {A → BC
B → C
A → B
AB → C}
A is extraneous in AB → C.
C is extraneous in A → BC, since A → C is implied
by A → B and B → C.
The canonical cover is:
A → B
B → C
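The extraneous-attribute test on the left-hand side reduces to an attribute-closure check: A is extraneous in AB → C exactly when (AB - A)+ still contains C. A minimal sketch, reusing the closure routine from the attribute-closure slide:

```python
def closure(attrs, fds):
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result

# Is attribute a extraneous on the left side of lhs -> rhs under F?
# Test: does (lhs - a)+ under F still determine rhs?
def lhs_extraneous(a, lhs, rhs, fds):
    reduced = set(lhs) - {a}
    return set(rhs) <= closure(reduced, fds)

F = [("A", "BC"), ("B", "C"), ("A", "B"), ("AB", "C")]
print(lhs_extraneous("A", "AB", "C", F))  # True: B -> C already holds
```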
Goals of Normalization
Decide whether a particular relation R is in good form.
In the case that a relation R is not in good form, decompose it
functional dependencies
multivalued dependencies
that is,
(F1 ∪ F2 ∪ ... ∪ Fn)+ = F+
Example
R = (A, B, C)
F = {A → B, B → C}
R1 = (A, B), R2 = (B, C)
Lossless-join decomposition:
R1 ∩ R2 = {B} and B → BC
Dependency preserving
R1 = (A, B), R2 = (A, C)
Lossless-join decomposition:
R1 ∩ R2 = {A} and A → AB
Not dependency preserving
(cannot check B → C without computing R1 ⋈ R2)
α → β is trivial (i.e., β ⊆ α)
α is a superkey for R
Example
R = (A, B, C)
F = {A → B
B → C}
Key = {A}
R is not in BCNF
Decomposition R1 = (A, B), R2 = (B, C)
R1 and R2 in BCNF
Lossless-join decomposition
Dependency preserving
decomposition of R
E.g. Consider R (A, B, C, D), with F = { A → B, B → C }
Decompose R into R1(A,B) and R2(A,C,D)
Neither of the dependencies in F contains only attributes from (A,C,D),
so we might be misled into thinking R2 satisfies BCNF.
In fact, dependency A → C in F+ shows R2 is not in BCNF.
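This trap can be avoided by testing BCNF with attribute closures over the full F, as the next slide suggests: for each subset α of Ri, the part of α+ inside Ri must be either α itself or all of Ri. A minimal sketch, checked against R2 = (A, C, D):

```python
from itertools import combinations

def closure(attrs, fds):
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result

# BCNF test for subschema ri: for every proper subset alpha of ri,
# the part of alpha+ inside ri must be alpha itself or all of ri.
def is_bcnf(ri, fds):
    ri = set(ri)
    for k in range(1, len(ri)):
        for alpha in combinations(sorted(ri), k):
            inside = closure(alpha, fds) & ri
            if set(alpha) < inside < ri:   # nontrivial FD, alpha not a key of ri
                return False
    return True

F = [("A", "B"), ("B", "C")]
print(is_bcnf("ACD", F))  # False: A -> C holds on (A,C,D), A is not a key there
print(is_bcnf("AB", F))   # True
```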
Final decomposition
R1, R3, R4
in F, the dependency
α → (α+ - α) ∩ Ri
can be shown to hold on Ri, and Ri violates BCNF.
F = {JK → L
L → K}
Two candidate keys = JK and JL
R is not in BCNF
Any decomposition of R will fail to preserve
JK → L
in F+
at least one of the following holds:
α → β is trivial (i.e., β ⊆ α)
α is a superkey for R
Each attribute A in β - α is contained in a candidate key for R.
(NOTE: each attribute may be in a different candidate key)
If a relation is in BCNF it is in 3NF (since in BCNF one of the first
3NF (Cont.)
Example
R = (J, K, L)
F = {JK → L, L → K}
Two candidate keys: JK and JL
R is in 3NF
JK → L
L → K
JK is a superkey
K is contained in a candidate key
FDs in F+.
Use attribute closure to check, for each dependency α → β, if
α is a superkey.
If α is not a superkey, we have to verify if each attribute in β is
Example
Relation schema:
Banker-info-schema = (branch-name, customer-name,
banker-name, office-number)
The functional dependencies for this relation schema are:
banker-name → branch-name office-number
customer-name branch-name → banker-name
3NF and
the decomposition is lossless
the dependencies are preserved
It is always possible to decompose a relation into relations in
BCNF and
the decomposition is lossless
it may not be possible to preserve dependencies.
R = (J, K, L)
F = {JK L, L K}
J     L   K
j1    l1  k1
j2    l1  k1
j3    l1  k1
null  l2  k2
Design Goals
Goal for a relational database design is:
BCNF.
Lossless join.
Dependency preservation.
If we cannot achieve this, we accept one of
Multivalued Dependencies
There are database schemas in BCNF that do not seem to be
sufficiently normalized
Consider a database
teachers, any one of which can be the course's instructor, and the
set of books, all of which are required for the course (no matter
who teaches it).
course              teacher     book
database            Avi         DB Concepts
database            Avi         Ullman
database            Hank        DB Concepts
database            Hank        Ullman
database            Sudarshan   DB Concepts
database            Sudarshan   Ullman
operating systems   Avi         OS Concepts
operating systems   Avi         Shaw
operating systems   Jim         OS Concepts
operating systems   Jim         Shaw
classes
Since there are no non-trivial functional dependencies, (course, teacher, book)
course              teacher
database            Avi
database            Hank
database            Sudarshan
operating systems   Avi
operating systems   Jim
teaches
course              book
database            DB Concepts
database            Ullman
operating systems   OS Concepts
operating systems   Shaw
text
MVD (Cont.)
Tabular representation of α →→ β
Example
Let R be a relation schema with a set of attributes that are
partitioned into 3 nonempty subsets Y, Z, W. We say that Y →→ Z
(Y multidetermines Z) if and only if, whenever tuples (y, z1, w1)
and (y, z2, w2) are in r, the tuples (y, z1, w2) and (y, z2, w1) are
also in r.
Example (Cont.)
In our example:
course →→ teacher
course →→ book
The above formal definition is supposed to formalize the
If Y → Z then Y →→ Z
Indeed we have (in above notation) Z1 = Z2
The claim follows.
Theory of MVDs
From the definition of multivalued dependency, we can derive the
following rule:
If α → β, then α →→ β
Example
R =(A, B, C, G, H, I)
F = { A →→ B
B →→ HI
CG →→ H }
R is not in 4NF since A →→ B and A is not a superkey for R
Decomposition
a) R1 = (A, B)
(R1 is in 4NF)
b) R2 = (A, C, G, H, I)
c) R3 = (C, G, H)
(R3 is in 4NF)
d) R4 = (A, C, G, I)
e) R5 = (A, I)
(R5 is in 4NF)
f)R6 = (A, C, G)
(R6 is in 4NF)
correctly, the tables generated from the E-R diagram should not need
further normalization.
However, in a real (imperfect) design there can be FDs from non-key
FDs from non-key attributes of a relationship set possible, but rare ---
r2
rn)
The relation r1
tuples
assumption
e.g. customer-name, branch-name
Reuse of attribute names is natural in SQL since relation names
account
depositor
normalization
Examples of bad database design, to be avoided:
case separately.
Q.E.D.
End of Chapter
Sample Relation r
customer-loan
An Instance of Banker-schema
Tabular Representation of
An Illegal bc Relation
Decomposition of loan-info
R model.
The object-oriented paradigm is based on encapsulating code
Object Structure
An object has associated with it:
A set of variables that contain the data for the object. The value of
each variable is itself an object.
A set of messages to which the object responds; each message may
have zero, one, or more parameters.
A set of methods, each of which is a body of code to implement a
message; a method returns a value as the response to the message
The physical representation of data is visible only to the
object.
The term message does not necessarily imply physical message
Object Classes
Similar objects are grouped into a class; each such object is
strict encapsulation
Methods are defined separately
Inheritance
E.g., class of bank customers is similar to class of bank
person
Similarly, customer is a specialization of person.
Create classes person, employee and customer
Inheritance (Cont.)
Place classes into a specialization/IS-A hierarchy
2. Class extent of employee includes only employee objects that are not in a
subclass such as officer, teller, or secretary
Multiple Inheritance
With multiple inheritance a class may have more than one superclass.
subclasses
A person can play the roles of student, teacher, or footballPlayer,
or any combination of the three
E.g., a student teaching assistant who also plays football
That is, allow an object to take on any one or more of a set of types
But many systems insist an object should have a most-specific
class
That is, there must be one class that an object belongs to which is
a subclass of all other classes that the object belongs to
Create subclasses such as student-teacher and
student-teacher-footballPlayer for each combination
When many combinations are possible, creating
subclasses for each combination can become cumbersome
Object Identity
An object retains its identity even if some or all of the values
Object Identifiers
Object identifiers used to uniquely identify objects
Object Containment
users.
Object-Oriented Languages
Object-oriented concepts can be used in different ways
Persistence of Objects
Approaches to make transient objects persistent include
establishing
Persistence by Class declare all objects of a class to be
persistent; simple but inflexible.
Persistence by Creation extend the syntax for creating objects to
specify that that an object is persistent.
Persistence by Marking an object that is to persist beyond
program execution is marked as persistent before program
termination.
Persistence by Reachability - declare (root) persistent objects;
objects are persistent if they are referred to (directly or indirectly)
from a root object.
Easier for programmer, but more overhead for database system
Similar to garbage collection used e.g. in Java, which
ODMG Types
Template class d_Ref<class> used to specify references
(persistent pointers)
Template class d_Set<class> used to define sets of objects.
d_string
Interpretation of these types is platform independent
Dynamically allocated data (e.g. for d_string) allocated in the
database, not in main memory
};
int find_balance();
int update_balance(int delta);
Implementing Relationships
Relationships between classes implemented by references
Special reference types enforces integrity by adding/removing
inverse links.
Type d_Rel_Ref<Class, InvRef> is a reference to Class, where
attribute InvRef of Class is the inverse reference.
Similarly, d_Rel_Set<Class, InvRef> is used for a set of references
Assignment method (=) of class d_Rel_Ref is overloaded
d_Rel_Set use type definition to find and update the inverse link
automatically
Implementing Relationships
E.g.
deletion
Only for classes for which this feature has been specified
Specification via user interface, not C++
Automatic maintenance of class extents not supported in
earlier versions of ODMG
open a database:
open(databasename)
set_object_name(object, name)
rename_object(oldname, newname)
d_Extent<Customer> customerExtent(bank_db);
d_Iterator<T> create_iterator()
to create an iterator on the class extent
d_Set<d_Ref<Account>> result;
d_OQL_Query q1("select a
from Customer c, c.accounts a
where c.name = 'Jones'
and a.find_balance() > 100");
d_oql_execute(q1, result);
Provides error handling mechanism based on C++ exceptions,
through class d_Error
Provides API for accessing the schema of a database.
Uses exactly the same pointer type for in-memory and database
objects
Persistence is transparent to applications
Except when creating objects
Same functions can be used on in-memory and persistent objects
since pointer types are the same
Implemented by a technique called pointer-swizzling which is
described in Chapter 11.
No need to call mark_modified(), modification detected
automatically.
ODMG Java
Transaction must start accessing database from one of the root
End of Chapter
Nested Relations
Motivation:
title,
a set of authors,
Publisher, and
a set of keywords
Non-1NF relation books
flat-books
author
title
keyword
title
pub-name, pub-branch
(title, author)
(title, keyword)
(title, pub-name, pub-branch)
Collection Types
Set type (not in SQL:1999)
create table books (
..
keyword-set setof(varchar(20))
)
Sets are an instance of collection types. Other instances include
book-review clob(10KB)
blob: binary large objects
image
blob(10MB)
movie
blob (2GB)
small pieces
to be represented directly.
Unnamed row types can also be used in SQL:1999 to define
composite attributes
array['Silberschatz', 'Korth', 'Sudarshan']
Set value attributes (not supported in SQL:1999)
set(v1, v2, ..., vn)
To create a tuple of the books relation
('Compilers', array['Smith', 'Jones'],
Publisher('McGraw-Hill', 'New York'),
set('parsing', 'analysis'))
To insert the preceding tuple into the relation books
insert into books
values
('Compilers', array['Smith', 'Jones'],
Publisher('McGraw-Hill', 'New York'),
set('parsing', 'analysis'))
Inheritance
Suppose that we have the following type definition for people:
Multiple Inheritance
SQL:1999 does not support multiple inheritance
If our type system supports multiple inheritance, we can define a
Table Inheritance
Table inheritance allows an object to have multiple types by
subtables of people
create table students of Student
under people
create table teachers of Teacher
under people
Each tuple in a subtable (e.g. students and teachers) is implicitly
types.
1. Store only local attributes and the primary key of the supertable in
subtable
Inherited attributes derived by means of a join with the
supertable
their subtables
Access to all attributes of a tuple is faster: no join required
If entities must have most specific type, tuple is stored only in
Reference Types
Object-oriented languages provide the ability to create and refer to
objects.
In SQL:1999
We will study how to define references first, and later see how to use
references
create the tuple with a null reference and then set the reference
separately by using the function ref(p) applied to a tuple variable
E.g. to create a department with name CS and head being the
use
instead of
select p.oid
select ref(p)
departments
Avoids need for a separate query to retrieve the identifier:
Path Expressions
Find the names and addresses of the heads of all departments:
expression
Path expressions help avoid explicit joins
Collection-Value Attributes
Collection-valued attributes can be treated much like relations, using
keywords,
select title
from books
where database in (unnest(keyword-set))
Note: Above syntax is valid in SQL:1999, but the only collection type
supported by SQL:1999 is the array type
To get a relation containing pairs of the form title, author-name for
Unnesting
The transformation of a nested relation into a form with fewer (or no)
Nesting
Nesting is the opposite of unnesting, creating a collection-valued attribute
NOTE: SQL:1999 does not support nesting
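Since SQL:1999 itself does not support nesting, the transformation is easy to illustrate outside SQL; a minimal sketch in Python, with hypothetical book data, that nests flat 1NF tuples into set-valued attributes and unnests them back:

```python
# Flat 1NF tuples: (title, author, keyword).
flat_books = [
    ("Compilers", "Smith", "parsing"),
    ("Compilers", "Smith", "analysis"),
    ("Compilers", "Jones", "parsing"),
    ("Compilers", "Jones", "analysis"),
]

# Nesting: group the flat tuples into set-valued attributes per title.
nested = {}
for title, author, keyword in flat_books:
    authors, keywords = nested.setdefault(title, (set(), set()))
    authors.add(author)
    keywords.add(keyword)

# Unnesting: flatten the nested form; we recover the original tuples.
unnested = {(t, a, k)
            for t, (authors, keywords) in nested.items()
            for a in authors for k in keywords}
print(unnested == set(flat_books))  # True
```

The round trip is exact here because every author is paired with every keyword of the title; in general, nesting two independent set-valued attributes is lossless only under that assumption.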
Nesting can be done in a manner similar to aggregation, but using the
Nesting (Cont.)
Another approach to creating nested relations is to use
collection
Can thus create arrays, unlike earlier approach
including
Loops, if-then-else, assignment
Many databases have proprietary procedural extensions to SQL
SQL Functions
Define a function that, given a book title, returns the count of the
select name
from books4
where author-count(title)> 1
SQL Methods
Methods can be viewed as functions associated with structured
types
They have an implicit first parameter called self which is set to the
structured-type value on which the method is invoked
The method code can refer to attributes of the structured-type value
using the self variable
E.g.
self.a
Drawbacks
Code to implement function may need to be loaded into
database system and executed in the database systems
address space
risk of accidental corruption of database structures
security risk, allowing users access to unauthorized data
communication
Procedural Constructs
SQL:1999 supports a rich variety of procedural constructs
Compound statement
is of the form begin end,
may contain multiple SQL statements between begin and end.
Local variables can be declared within a compound statements
.. signal out-of-stock
end
The handler here is exit -- causes enclosing begin..end to be exited
Other actions possible on exception
End of Chapter
Introduction
XML: Extensible Markup Language
Defined by the WWW Consortium (W3C)
Originally intended as a document markup language not a
database language
Documents have tags giving extra information about sections of the
document
E.g. <title> XML </title> <slide> Introduction </slide>
</bank>
XML: Motivation
Data interchange is critical in today's networked world
Examples:
Banking: funds transfer
Order processing (especially inter-company orders)
Scientific data
Chemistry: ChemML,
Genetics:
representing information
XML has become the basis for all new generation data
interchange formats
Proper nesting
<account> <balance> ... </balance> </account>
Improper nesting
<account> <balance> ... </account> </balance>
Formally: every start tag must have a unique matching end tag, that
is in the context of the same parent element.
Every document must have a single top-level element
</account>
</customer>
...
</bank-1>
Example:
<account>
This account is seldom used any more.
<account-number> A-102</account-number>
<branch-name> Perryridge</branch-name>
<balance>400 </balance>
</account>
Useful for document markup, but discouraged for data
representation
Attributes
Elements can have attributes
tag of an element
An element may have several attributes, but each attribute name
by ending the start tag with a /> and deleting the end tag
<account number="A-101" branch="Perryridge" balance="200" />
To store string data that may contain tags, without the tags being
<![CDATA[<account> </account>]]>
Here, <account> and </account> are treated as just strings
Namespaces
XML data has to be exchanged between organizations
Same tag name may have different meaning in different
Namespaces
<bank xmlns:FB="http://www.FirstBank.com">
<FB:branch>
<FB:branchname>Downtown</FB:branchname>
<FB:branchcity> Brooklyn</FB:branchcity>
</FB:branch>
</bank>
XML Schema
Newer, not yet widely used
The type of an element in a DTD can be specified as:
a list of subelement names,
#PCDATA (parsed character data), i.e., character strings,
EMPTY (no subelements) or ANY (anything can be a subelement)
Example
| - alternatives
+ - 1 or more occurrences
* - 0 or more occurrences
Bank DTD
<!DOCTYPE bank [
  <!ELEMENT bank ((account | customer | depositor)+)>
  <!ELEMENT account (account-number, branch-name, balance)>
  <!ELEMENT customer (customer-name, customer-street, customer-city)>
  <!ELEMENT depositor (customer-name, account-number)>
  <!ELEMENT account-number (#PCDATA)>
  <!ELEMENT branch-name (#PCDATA)>
  <!ELEMENT balance (#PCDATA)>
  <!ELEMENT customer-name (#PCDATA)>
  <!ELEMENT customer-street (#PCDATA)>
  <!ELEMENT customer-city (#PCDATA)>
]>
Name
Type of attribute
CDATA
ID (identifier) or IDREF (ID reference) or IDREFS (multiple IDREFs)
more on this later
Whether the attribute is
mandatory (#REQUIRED),
has a default value (value),
or neither (#IMPLIED)
Examples
An ID attribute value must be distinct across the document; thus the ID attribute value is an object identifier
An attribute of type IDREF must contain the ID value of an element in the same document
<!DOCTYPE bank-2 [
  <!ELEMENT account (branch, balance)>
  <!ATTLIST account
    account-number ID #REQUIRED
    owners IDREFS #REQUIRED>
  <!ELEMENT customer (customer-name, customer-street, customer-city)>
  <!ATTLIST customer
    customer-id ID #REQUIRED
    accounts IDREFS #REQUIRED>
  … declarations for branch, balance, customer-name, customer-street and customer-city …
]>
Limitations of DTDs
No typing of text elements and attributes
e.g., an IDREF attribute cannot be constrained to refer only to, say, customer elements
XML Schema
XML Schema is a more sophisticated schema language that addresses these limitations, but is not yet as widely used.
XPath
Simple language consisting of path expressions
XSLT
Simple language designed for translation from XML to XML and
XML to HTML
XQuery
An XML query language with a rich set of features
Wide variety of other languages have been proposed, and some
XML data
An XML document is modeled as a tree, with nodes corresponding
XPath
XPath is used to address (select) parts of documents using
path expressions
A path expression is a sequence of steps separated by /
E.g.
/bank-2/customer/name/text( )
returns the same names, but without the enclosing tags
XPath (Cont.)
The initial / denotes root of the document (above the top-level tag)
Path expressions are evaluated left to right
Each step operates on the set of instances produced by the previous step
Selection predicates may follow any step in a path, in [ ]
E.g. /bank-2/account[balance > 400] returns account elements with a balance value greater than 400, while /bank-2/account[balance] returns accounts with any
balance subelement
Attributes are accessed using @
Functions in XPath
XPath provides several functions, e.g. position() and count(); boolean connectives and and or, and the function not(), can be used
in predicates
IDREFs can be dereferenced using the function id(); e.g. /bank-2/account/@owners/id() returns the customers referred to from the owners attribute
of account elements.
The operator // skips multiple levels of nesting: e.g. /bank-2//name
finds any name element anywhere under the /bank-2 element, regardless of depth
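Python's standard-library ElementTree supports a limited subset of XPath, enough to show the // "any depth" step. A minimal sketch on assumed toy data (the element names are illustrative, not a real bank-2 document):

```python
import xml.etree.ElementTree as ET

doc = """<bank-2>
  <customer><name>Joe</name></customer>
  <customer><name>Mary</name><address><name>Main</name></address></customer>
</bank-2>"""

root = ET.fromstring(doc)
# ".//name" mirrors //name: it matches name elements at any depth,
# in document order
names = [n.text for n in root.findall(".//name")]
print(names)  # ['Joe', 'Mary', 'Main']
```

Reading .text on each matched element plays the role of text(): it returns the names without their enclosing tags.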
XSLT
A stylesheet stores formatting options for a document, usually separately from the document itself
An XSLT transformation is expressed as a series of recursive rules called templates
Templates combine selection using XPath with construction of
results
XSLT Templates
Example of XSLT template with match and select part
<xsl:template match="/bank-2/customer">
  <xsl:value-of select="customer-name"/>
</xsl:template>
<xsl:template match="*"/>
The match attribute of xsl:template specifies a pattern in XPath
Elements in the XML document matching the pattern are processed by the actions within the template; text not in the xsl
namespace is output as is
E.g. to wrap results in new XML elements.
<xsl:template match="/bank-2/customer">
  <customer>
    <xsl:value-of select="customer-name"/>
  </customer>
</xsl:template>
<xsl:template match="*"/>
Example output:
<customer> John </customer>
<customer> Mary </customer>
xsl:value-of tags cannot be placed inside the attribute of another tag
E.g. cannot create an attribute for <customer> in the previous
example by directly using xsl:value-of
XSLT provides a construct xsl:attribute to handle this situation
xsl:attribute adds attribute to the preceding element
E.g. <customer>
  <xsl:attribute name="customer-id">
    <xsl:value-of select="customer-id"/>
  </xsl:attribute>
</customer>
results in output of the form
<customer customer-id="…"> …
xsl:element can similarly be used to create elements with computed names
Structural Recursion
Action of a template can be to recursively apply templates to the contents of a matched element
Joins in XSLT
XSLT keys allow elements to be looked up (indexed) by values of
subelements or attributes
Keys must be declared (with a name); the key() function can then be used for lookup. E.g.
<xsl:key name="acctno" match="account" use="account-number"/>
<xsl:value-of select="key('acctno', 'A-101')"/>
Keys permit (some) joins to be expressed in XSLT
<xsl:key name="acctno" match="account" use="account-number"/>
<xsl:key name="custno" match="customer" use="customer-name"/>
<xsl:template match="depositor">
  <cust-acct>
    <xsl:value-of select="key('custno', customer-name)"/>
    <xsl:value-of select="key('acctno', account-number)"/>
  </cust-acct>
</xsl:template>
<xsl:template match="*"/>
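Operationally, an XSLT key is just an index from the use value to the matched elements, and key() is a probe into that index. A minimal Python sketch of the same join, on assumed sample data (the record contents are invented for illustration):

```python
# Assumed sample data standing in for account, customer, depositor elements
accounts = [{"account-number": "A-101", "balance": 500},
            {"account-number": "A-102", "balance": 400}]
customers = [{"customer-name": "Johnson"}, {"customer-name": "Smith"}]
depositors = [{"customer-name": "Johnson", "account-number": "A-101"},
              {"customer-name": "Smith", "account-number": "A-102"}]

# Declaring a key builds an index from the 'use' value to the element
acctno = {a["account-number"]: a for a in accounts}
custno = {c["customer-name"]: c for c in customers}

# The depositor template looks up both sides via key(), i.e. a dict probe
cust_accts = [(custno[d["customer-name"]], acctno[d["account-number"]])
              for d in depositors]
print(cust_accts[0][1]["balance"])  # 500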
Sorting in XSLT
Using an xsl:sort directive inside a template causes all elements selected by xsl:apply-templates to be processed in sorted order
<xsl:template match="/bank">
  <xsl:apply-templates select="customer">
    <xsl:sort select="customer-name"/>
  </xsl:apply-templates>
</xsl:template>
<xsl:template match="customer">
  <customer>
    <xsl:value-of select="customer-name"/>
    <xsl:value-of select="customer-street"/>
    <xsl:value-of select="customer-city"/>
  </customer>
</xsl:template>
<xsl:template match="*"/>
XQuery
XQuery is a general purpose query language for XML data
Currently being standardized by the World Wide Web Consortium
(W3C)
The textbook description is based on a March 2001 draft of the standard.
The final version may differ, but major features likely to stay unchanged.
Alpha version of XQuery engine available free from Microsoft
XQuery is derived from the Quilt query language, which itself borrows from earlier XML query languages such as XPath, XQL and XML-QL
find all accounts with balance > 400, with each result enclosed in
an <account-number> .. </account-number> tag
for    $x in /bank-2/account
let    $acctno := $x/@account-number
where  $x/balance > 400
return <account-number> $acctno </account-number>
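The FLWR (for-let-where-return) shape maps naturally onto a comprehension: iterate, bind, filter, construct. A minimal Python sketch of the query above, over assumed sample account data:

```python
# Assumed sample data: each dict stands in for an account element,
# with @account-number flattened into a key
accounts = [{"account-number": "A-101", "balance": 500},
            {"account-number": "A-102", "balance": 400},
            {"account-number": "A-201", "balance": 900}]

# for $x in /bank-2/account   where $x/balance > 400
# return <account-number> $acctno </account-number>
result = [f"<account-number>{x['account-number']}</account-number>"
          for x in accounts if x["balance"] > 400]
print(result)
# ['<account-number>A-101</account-number>', '<account-number>A-201</account-number>']
```

The let binding disappears here for the same reason noted below: the bound value can be inlined into the return expression.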
The let clause is not really needed in this query; the selection can be done in the path expression itself: /bank-2/account[balance > 400]
Path expressions can start from a document root, e.g. document("bank-2.xml")/bank-2/account
Aggregate functions such as sum( ) and count( ) can be applied
Joins
Joins are specified in a manner very similar to SQL
for $a in /bank/account,
    $c in /bank/customer,
    $d in /bank/depositor
where $a/account-number = $d/account-number
  and $c/customer-name = $d/customer-name
return <cust-acct> $c $a </cust-acct>
The same query can be expressed with the selections specified
as XPath selections:
for $a in /bank/account,
    $c in /bank/customer,
    $d in /bank/depositor[account-number = $a/account-number
                          and customer-name = $c/customer-name]
return <cust-acct> $c $a </cust-acct>
XQuery path expressions support the -> operator for dereferencing IDREFs
Equivalent to the id( ) function of XPath, but simpler to use
Can be applied to a set of IDREFs to get a set of results
The June 2001 version of the standard changed -> to =>
Sorting in XQuery
A sortby clause can be used at the end of any expression, e.g. sortby(account-number)
Storage of XML data:
Relational databases
Data must be translated into relational form
Advantage: mature database systems
Disadvantages: overhead of translating data and queries
E.g. store each top level element as a string field of a tuple in a database
Use a single relation to store all elements, or
Use a separate relation for each top-level element type
E.g. account, customer, depositor
Indexing:
Store values of subelements/attributes to be indexed, such as
customer-name and account-number as extra fields of the
relation, and build indices
Oracle 9 supports function indices which use the result of a
function as the key value. Here, the function should return the
value of the required subelement/attribute
Benefits:
Can store any XML data even without DTD
As long as there are many top-level elements in a document, strings are
small compared to full document, allowing faster access to individual
elements.
Drawback: Need to parse strings to access values inside the elements;
parsing is slow.
non-volatile, used primarily for backup (to recover from disk failure),
and for archival data
sequential-access much slower than disk
very high capacity (40 to 300 GB tapes available)
tape can be removed from drive; storage costs are much cheaper
than disk, but drives are expensive
Tape jukeboxes available for storing massive amounts of data
hundreds of terabytes (1 terabyte = 10^12 bytes) to even a petabyte (1 petabyte = 10^15 bytes)
Storage Hierarchy
primary storage: fastest media, but volatile (cache, main memory)
secondary storage: next level in hierarchy, non-volatile, moderately fast
access time; also called on-line storage (e.g. flash memory, magnetic disks)
tertiary storage: lowest level in hierarchy, non-volatile, slow access time; also called off-line storage
E.g. magnetic tape, optical storage
NOTE: Diagram is schematic, and simplifies the structure of actual disk drives
Magnetic Disks
Read-write head
Disk Subsystem
Data-transfer rate: the rate at which data can be retrieved from or stored to the disk.
4 to 8 MB per second is typical
Multiple disks may share a controller, so rate that controller can handle is also
important
E.g. ATA-5: 66 MB/second, SCSI-3: 40 MB/s
Fiber Channel: 256 MB/s
given 1000 relatively new disks, on average one will fail every
1200 hours
Non-volatile write buffers: speed up disk writes by writing blocks to a non-volatile RAM buffer immediately; the write call returns as soon as the block is buffered
Controller then writes to disk whenever the disk has no other requests or
request has been pending for some time
Database operations that require data to be safely stored before continuing can
continue without waiting for data to be written to disk
Writes can be reordered to minimize disk arm movement
Log disk: a disk devoted to writing a sequential log of block updates
Journaling file systems write data in safe order to NV-RAM or log disk
Reordering without journaling: risk of corruption of file system data
RAID
RAID: Redundant Arrays of Independent Disks
The chance that some disk out of a set of N disks will fail is much higher
E.g., a system with 100 disks, each with MTTF of 100,000 hours (approx. 11
years), will have a system MTTF of 1000 hours (approx. 41 days)
Techniques for using redundancy to avoid data loss are critical with large
numbers of disks
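The MTTF arithmetic above is simple division: if disk failures are independent, the expected time to the first failure in the array shrinks by the number of disks. A quick check of the figures quoted:

```python
# Mean time to failure of one disk, and the number of disks in the array
disk_mttf_hours = 100_000   # approx. 11 years per disk
n_disks = 100

# Expected time until *some* disk in the array fails
system_mttf_hours = disk_mttf_hours / n_disks
days = int(system_mttf_hours // 24)
print(system_mttf_hours)  # 1000.0
print(days)               # 41
```

This is exactly why redundancy (RAID) becomes mandatory as arrays grow: the array as a whole fails far more often than any single disk.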
Originally a cost-effective alternative to large, expensive disks
Improvement in performance via parallelism: bit-level striping splits the bits of each byte across multiple disks
In an array of eight disks, write bit i of each byte to disk i.
Each access can read data at eight times the rate of a single disk.
But seek/access time worse than for a single disk
Bit level striping is not used much any more
Block-level striping with n disks, block i of a file goes to disk
(i mod n) + 1
Requests for different blocks can run in parallel if the blocks reside
on different disks
A request for a long sequence of blocks can utilize all disks in
parallel
RAID Levels
Schemes to provide redundancy at lower cost by using disk
striping.
RAID Level 3: Bit-Interleaved Parity
a single parity bit is enough for error correction, not just detection, since
we know which disk has failed
When writing data, corresponding parity bits must also be computed
Faster data transfer than with a single disk, but fewer I/Os per
second since every disk has to participate in every I/O.
Subsumes Level 2 (provides all its benefits, at lower cost).
RAID Level 4: Block-Interleaved Parity; uses block-level striping, and keeps a parity block on a separate parity disk
Provides higher I/O rates for independent block reads than Level 3: a block read goes to a single disk, so blocks stored on different disks can be read in parallel
Provides higher transfer rates for reads of multiple blocks than no-striping
Before writing a block, parity data must be computed
Can be computed using the old parity block, the old value of the block being written, and its new value (2 block reads + 2 block writes)
RAID Level 5: Block-Interleaved Distributed Parity; partitions data and parity among all N + 1 disks, rather than storing data
in N disks and parity in 1 disk.
E.g., with 5 disks, parity block for nth set of blocks is stored on
disk (n mod 5) + 1, with the data blocks stored on the other 4
disks.
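Parity here is bytewise XOR, which is what makes single-disk recovery possible: the lost block is the XOR of the parity block with all surviving data blocks. A minimal sketch with invented block contents:

```python
from functools import reduce

# Four data disks' blocks for one block-set (contents are made up)
data_blocks = [b"\x01\x02", b"\x10\x20", b"\xff\x00", b"\x0f\xf0"]

def xor_blocks(blocks):
    # Bytewise XOR across a list of equal-length blocks
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*blocks))

parity = xor_blocks(data_blocks)

# Disk 2 fails: XOR the parity with the surviving blocks to rebuild it
survivors = data_blocks[:2] + data_blocks[3:]
recovered = xor_blocks(survivors + [parity])
print(recovered == data_blocks[2])  # True

# RAID 5 rotates parity: with 5 disks, block-set n's parity is on disk (n mod 5) + 1
print([(n % 5) + 1 for n in range(5)])  # [1, 2, 3, 4, 5]
```

Rotating the parity block across disks is what distinguishes level 5 from level 4: no single disk becomes a write bottleneck.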
Monetary cost
Performance: Number of I/O operations per second, and bandwidth during
normal operation
Performance during failure
Performance during rebuild of failed disk
Including time taken to rebuild failed disk
RAID 0 is used only when data safety is not important
Level 3 is not used since Level 5 subsumes it: bit-striping forces single block reads to access all disks, wasting disk arm movement, which block striping
(level 5) avoids
Level 6 is rarely used since levels 1 and 5 offer adequate safety for
almost all applications
So competition is between 1 and 5 only
Hardware Issues
Software RAID: RAID implementations done entirely in software, with no special hardware support
Hardware RAID: implementations with special hardware; beware that a power failure during a write can leave the copies of a block inconsistent, e.g. failure after writing one block but before writing the second in a mirrored system
Such corrupted data must be detected when power is restored
Optical Disks
Compact disk-read only memory (CD-ROM)
Magnetic Tapes
Hold large volumes of data and provide high transfer rates
Few GB for DAT (Digital Audio Tape) format, 10-40 GB with DLT (Digital
Linear Tape) format, 100 GB+ with Ultrium format, and 330 GB with Ampex
helical scan format
Transfer rates from few to 10s of MB/s
Currently the cheapest storage medium
Very slow access time in comparison to magnetic and optical disks;
limited to sequential access.
Some formats (Accelis) provide faster seek (10s of seconds) at cost of lower
capacity
Used mainly for backup, for storage of infrequently used information, and as an off-line medium for transferring information between systems
Storage Access
A database file is partitioned into fixed-length storage units called
blocks.
Buffer manager: the subsystem responsible for allocating buffer space in main memory and managing the blocks stored there
Buffer Manager
Programs call on the buffer manager when they need a block
from disk.
1. If the block is already in the buffer, the requesting program is given
the address of the block in main memory
2. If the block is not in the buffer,
1.
the buffer manager allocates space in the buffer for the block,
replacing (throwing out) some other block, if required, to make
space for the new block.
2. The block that is thrown out is written back to disk only if it was
modified since the most recent time that it was written to/fetched
from the disk.
3. Once space is allocated in the buffer, the buffer manager reads
the block from the disk to the buffer, and passes the address of
the block in main memory to requester.
Buffer-Replacement Policies
Most operating systems replace the block least recently used (LRU
strategy)
Idea behind LRU: use the past pattern of block references as a predictor of future references
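The three buffer-manager steps above, combined with LRU replacement, can be sketched in a few lines. This is a simplified model (no real disk I/O, no pinning), with invented block ids:

```python
from collections import OrderedDict

class BufferManager:
    """Minimal LRU buffer pool: a request either hits the buffer or evicts
    the least recently used block to make space (a real manager would write
    the victim back only if it was modified)."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.pool = OrderedDict()   # block-id -> dirty flag; order = recency
        self.evicted = []           # victims, recorded for illustration

    def request(self, block_id):
        if block_id in self.pool:               # 1. already in the buffer
            self.pool.move_to_end(block_id)     #    just mark it most recent
            return "hit"
        if len(self.pool) >= self.capacity:     # 2. allocate space: throw out
            victim, dirty = self.pool.popitem(last=False)   # the LRU block
            self.evicted.append(victim)
        self.pool[block_id] = False             # 3. "read" block into buffer
        return "miss"

bm = BufferManager(2)
results = [bm.request(b) for b in ["B1", "B2", "B1", "B3", "B2"]]
print(results)     # ['miss', 'miss', 'hit', 'miss', 'miss']
print(bm.evicted)  # ['B2', 'B1']
```

Note how the hit on B1 saves it from eviction when B3 arrives: B2, untouched for longer, is the victim.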
File Organization
The database is stored as a collection of files. Each file is a sequence of records, and a record is a sequence of fields.
Fixed-Length Records
Simple approach:
Deletion of record i: alternatives:
move records i + 1, …, n to i, …, n − 1
move record n to i
do not move records, but link all free records on a free list
Free Lists
Store the address of the first deleted record in the file header.
Use this first record to store the address of the second deleted record,
and so on
Can think of these stored addresses as pointers, since they point to the location of a record.
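A minimal sketch of this free-list scheme, with a Python list standing in for the file (each slot holds either a record or the "address" of the next free slot):

```python
records = ["rec0", "rec1", "rec2", "rec3", "rec4"]
free_head = None   # file header: address of the first deleted record

def delete(i):
    global free_head
    records[i] = free_head      # deleted slot stores the next free address
    free_head = i               # header now points at this slot

def insert(value):
    global free_head
    if free_head is None:       # no free slot: append at end of file
        records.append(value)
        return len(records) - 1
    slot = free_head
    free_head = records[slot]   # pop the head of the free list
    records[slot] = value
    return slot

delete(1); delete(3)            # free list: header -> 3 -> 1
slot_a = insert("new")          # reuses slot 3
slot_b = insert("newer")        # then slot 1
print(slot_a, slot_b)           # 3 1
```

No record is ever moved: deletions and insertions only splice slots in and out of the chain, which is the whole point of the scheme.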
Variable-Length Records
Variable-length records arise in database systems in
several ways:
Storage of multiple record types in a file.
Record types that allow variable lengths for one or more fields.
Record types that allow repeating fields (used in some older
data models).
Byte string representation
reserved space
pointers
Reserved space: can use fixed-length records of a known maximum length; unused space in shorter records is filled with a null or end-of-record symbol
Pointer Method
Pointer method: a variable-length record is represented by a list of fixed-length records, chained together via pointers
Disadvantage: space is wasted in all records except the first in a chain
Sequential: store records in sequential order, based on the value of the search key of each record
Hashing: a hash function is computed on some attribute of each record; the result specifies in which block of the file the record should be placed
Records of each relation may be stored in a separate file. In a clustering file organization, records of several different relations can be stored in the same file.
The data dictionary (system catalog) stores metadata, such as:
names of relations
names and types of attributes of each relation
names and definitions of views
integrity constraints
statistical data, such as the number of tuples in each relation
Hardware Swizzling
With hardware swizzling, persistent pointers in objects are swizzled transparently using virtual-memory address translation
Hardware Swizzling
Persistent pointer is conceptually split into two parts: a page identifier and an offset within the page
3. Replace each entry (pi, Pi) in the translation table by (vi, Pi), where vi is the virtual-memory page allocated to pi, then continue
Optimizations:
If all pages are allocated the same address as in the short page
identifier, no changes required in the page!
No need for deswizzling swizzled page can be saved as-is to disk
A set of pages (segment) can share one translation table. Pages can
still be swizzled as and when fetched (old copy of translation table is
needed).
A process should not access more pages than the size of its virtual-memory address space
The in-memory representation of objects may be different from the form in which they are stored on disk in the
database. Reasons are:
software swizzling: the structure of persistent and in-memory pointers
differs
database accessible from different machines, with different data
representations
Make the physical representation of objects in the database
independent of the machine and the compiler.
Can transparently convert from disk representation to form required
on the specific machine, language, and compiler, when the object
(or page) is brought into memory.
Large Objects
Large objects: binary large objects (blobs) and character large objects (clobs) may need to be stored in a way that permits access to
regions of an object:
B+-tree file organization (described later in Chapter 12) can be
modified to represent large objects
Each leaf page of the tree stores between half and 1 page worth of
data from the object
Special-purpose application programs outside the database are often used to access and manipulate large objects
End of Chapter
Basic Concepts
Indexing mechanisms used to speed up access to desired data.
Search key: an attribute or set of attributes used to look up records in a file.
An index file consists of records (called index entries) of the
form
search-key
pointer
Index files are typically much smaller than the original file
Two basic kinds of indices:
Ordered Indices
Indexing techniques are evaluated on the basis of: access types supported, access time, insertion time, deletion time, and space overhead
In an ordered index, index entries are stored sorted on the
search key value. E.g., author catalog in library.
Primary index: in a sequentially ordered file, the index whose search key specifies the sequential order of the file; also called a clustering index
Secondary index: an index whose search key specifies an order different from the sequential order of the file; also called a non-clustering index
Dense index: an index record appears for every search-key value in the file; a sparse index contains index records for only some search-key values.
Applicable when records are sequentially ordered on search-key
To locate a record with search-key value K:
find the index record with the largest search-key value ≤ K
search the file sequentially, starting at the record to which the index record points
Compared to dense indices: less space and less maintenance overhead for insertions and deletions.
Generally slower than dense index for locating records.
Good tradeoff: sparse index with an index entry for every block in the file, corresponding to the least search-key value in the block.
Multilevel Index
If primary index does not fit in memory, access becomes
expensive.
To reduce the number of disk accesses to index records, treat the primary index kept on disk as a sequential file and construct a sparse index on it; indices at all levels must be updated on insertion into or deletion from the file.
Secondary Indices
Frequently, one wants to find all the records whose values in a certain field (which is not the search key of the primary index) satisfy some condition; in such cases a secondary index on that field is used extensively.
Each node that is not a root or a leaf has between ⌈n/2⌉ and n children.
A leaf node has between ⌈(n−1)/2⌉ and n−1 values
Special cases:
Example of a B+-tree
Example of B+-tree
indices.
The B+-tree contains a relatively small number of levels
Queries on B+-Trees
Find all records with a search-key value of k.
Start at the root: examine the node for the smallest search-key value > k; if such a value Ki exists, follow pointer Pi to the child node
3. Otherwise k ≥ Km−1, where there are m pointers in the node: follow Pm to the child node
take the n (search-key value, pointer) pairs (including the one being
inserted) in sorted order. Place the first ⌈n/2⌉ in the original node,
and the rest in a new node.
let the new node be p, and let k be the least key value in p. Insert
(k,p) in the parent of the node being split. If the parent is full, split it
and propagate the split further up.
The splitting of nodes proceeds upwards till a node that is not full
is found. In the worst case the root node may be split increasing
the height of the tree by 1.
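The leaf-split step above can be sketched in isolation: sort the overfull set of (key, pointer) pairs, keep ⌈n/2⌉ in the original node, move the rest to a new node, and propagate the new node's least key upward. The keys and pointer labels below are invented for illustration:

```python
import math

def split_leaf(entries, n):
    """Split an overfull B+-tree leaf. Returns (original node, new node,
    key to insert into the parent)."""
    entries = sorted(entries)              # (key, pointer) pairs in key order
    keep = math.ceil(n / 2)                # first ceil(n/2) stay put
    left, right = entries[:keep], entries[keep:]
    return left, right, right[0][0]        # least key of new node goes up

# Leaf of a tree with n = 4 pointers, overfull after inserting key 25
left, right, up_key = split_leaf(
    [(10, "p1"), (20, "p2"), (30, "p3"), (25, "p4")], n=4)
print(left)    # [(10, 'p1'), (20, 'p2')]
print(right)   # [(25, 'p4'), (30, 'p3')]
print(up_key)  # 25
```

If inserting up_key into the parent overfills it too, the parent splits the same way, which is exactly the upward cascade described above.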
and the entries in the node and a sibling fit into a single node,
then
Redistribute the pointers between the node and a sibling such that
both have more than the minimum number of entries.
Update the corresponding search-key value in the parent of the
node.
The node deletions may cascade upwards till a node which has
⌈n/2⌉ or more pointers is found. If the root node has only one
pointer after deletion, it is deleted and the sole child becomes the
root.
Node with Perryridge becomes underfull (actually empty, in this special case)
and merged with its sibling.
As a result the Perryridge node's parent became underfull, and was merged with its
sibling (and an entry was deleted from their parent)
Root node then had only one child, and was deleted and its child became the new
root node
In a B+-tree file organization, leaf nodes store records instead of pointers.
Since records are larger than pointers, the maximum number of records that can be stored in a leaf node is less than the number of pointers in a nonleaf node.
To improve space utilization, involve more sibling nodes in redistribution
during splits and merges
Involving 2 siblings in redistribution (to avoid split/merge where possible) results in each node having at least ⌊2n/3⌋ entries
Static Hashing
A bucket is a unit of storage containing one or more records (a bucket is typically a disk block).
In a hash file organization we obtain the bucket of a record directly from its search-key value using a hash function; hashing supports efficient lookup and insertion as well as deletion.
Records with different search-key values may be mapped to the same bucket; thus the entire bucket has to be searched sequentially to locate a record.
Example: let the search key be a character string, and let the value of bucket i be the integer i.
One possible hash function returns the sum of the binary representations of the characters of the key, modulo the number of buckets.
Hash Functions
The worst hash function maps all search-key values to the same bucket; an ideal hash function distributes keys uniformly and randomly.
Bucket overflow can occur because of:
Insufficient buckets
Skew in distribution of records. This can occur due to two
reasons:
multiple records have same search-key value
chosen hash function produces non-uniform distribution of key
values
Although the probability of bucket overflow can be reduced, it cannot be eliminated; it is handled by using overflow buckets.
Hash Indices
Hashing can be used not only for file organization, but also for
index-structure creation.
A hash index organizes the search keys, with their associated record pointers, into a hash file structure.
Dynamic Hashing
Good for database that grows and shrinks in size
Allows the hash function to be modified dynamically
Extendable hashing one form of dynamic hashing
Example (Cont.)
Hash structure after insertion of one Brighton and two
Downtown records
Example (Cont.)
Hash structure after insertion of Mianus record
Example (Cont.)
Example (Cont.)
Hash structure after insertion of Redwood and Round Hill
records
Multiple-Key Access
Use multiple indices for certain types of queries.
Example:
select account-number
from account
where branch-name = "Perryridge" and balance = 1000
Possible strategies for processing query using indices on
single attributes:
1. Use index on branch-name to find accounts of the Perryridge branch; test balance = 1000.
2. Use index on balance to find accounts with balances of $1000; test branch-name = "Perryridge".
3. Use branch-name index to find pointers to all records pertaining
to the Perryridge branch. Similarly use index on balance. Take
intersection of both sets of pointers obtained.
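Strategy 3 reduces to a set intersection over record pointers. A minimal sketch with invented index contents (record ids stand in for record pointers):

```python
# Pointer sets returned by each single-attribute index (assumed sample data)
branch_index = {"Perryridge": {1, 4, 7}, "Downtown": {2, 5}}
balance_index = {1000: {3, 4, 7, 9}, 500: {1}}

# Probe both indices, then intersect the pointer sets:
# only records satisfying BOTH conditions survive
pointers = branch_index["Perryridge"] & balance_index[1000]
print(sorted(pointers))  # [4, 7]
```

Only the surviving pointers are then fetched from the file, which is the payoff over strategies 1 and 2 when each condition alone is unselective.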
Grid Files
The grid file: a structure used to speed the processing of general multiple search-key queries involving one or more comparison operators.
To look up a single search-key value, find the row and column of its cell using the linear scales and follow the pointer.
For range queries, find the corresponding candidate grid-array cells, and look up all the buckets pointed to from those cells.
Linear scales must be chosen to uniformly distribute records across cells.
Otherwise there will be too many overflow buckets.
Periodic re-organization to increase grid size will help.
Bitmap Indices
Bitmap indices are a special type of index designed for efficient querying on multiple keys.
Records in a relation are assumed to be numbered sequentially from, say, 0
Given a number n it must be easy to retrieve record n
Particularly easy if records are of fixed size
Applicable on attributes that take on a relatively small number of distinct values
E.g. gender, country, state,
E.g. income-level (income broken up into a small number of levels
such as 0-9999, 10000-19999, 20000-50000, 50000- infinity)
A bitmap is simply an array of bits
Intersection (and)
Union (or)
Complementation (not)
Each operation takes two bitmaps of the same size and applies the operation on corresponding bits
E.g. if record is 100 bytes, space for a single bitmap is 1/800 of space
used by relation.
If number of distinct attribute values is 8, bitmap is only 1% of
relation size
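Python integers work as arbitrary-length bitmaps, which makes the three operations one-liners. A sketch on invented bitmaps for 5 records (bit i set means record i has that value):

```python
n = 5                       # number of records in the relation

gender_m  = 0b10110         # records 1, 2, 4 have gender 'm'
income_l1 = 0b10100         # records 2, 4 are in income level L1

# Intersection (and): records with gender 'm' AND income level L1
both = gender_m & income_l1
print(bin(both))            # 0b10100 -> records 2 and 4

# Complementation (not): must mask to the existing n records,
# otherwise the infinite-precision complement sets spurious high bits
not_m = ~gender_m & ((1 << n) - 1)
print(bin(not_m))           # 0b1001 -> records 0 and 3
```

Union (or) is the analogous `|`. The masking step in the complement corresponds to the existence bitmap a real system keeps to handle deleted record slots.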
Deletion needs to be handled properly
End of Chapter
Partitioned Hashing
Hash values are split into segments that depend on each attribute of the search key, e.g. for the search key
(customer-street, customer-city)
search-key value        hash value
(Main, Harrison)        101 111
(Main, Brooklyn)        101 001
(Park, Palo Alto)       010 010
(Spring, Brooklyn)      001 001
(Alma, Palo Alto)       110 010
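The table above follows from hashing each attribute into its own 3-bit segment and concatenating the segments. A sketch reproducing it, with the per-attribute hash values taken directly from the table (the hash functions themselves are assumed, toy mappings):

```python
# Per-attribute 3-bit hash values, as implied by the example table
h_street = {"Main": 0b101, "Park": 0b010, "Spring": 0b001, "Alma": 0b110}
h_city   = {"Harrison": 0b111, "Brooklyn": 0b001, "Palo Alto": 0b010}

def h(street, city):
    # Concatenate the two 3-bit segments into one 6-bit hash value
    return (h_street[street] << 3) | h_city[city]

print(format(h("Main", "Harrison"), "06b"))   # 101111
print(format(h("Alma", "Palo Alto"), "06b"))  # 110010
```

A query that fixes only customer-city constrains just the low 3 bits, so only the buckets matching that segment need to be examined.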
Evaluation
The query-execution engine takes a query-evaluation plan,
executes that plan, and returns the answers to the query.
A relational algebra expression may have many equivalent expressions
E.g., σ_balance<2500(Π_balance(account)) is equivalent to
Π_balance(σ_balance<2500(account))
Each relational algebra operation can be evaluated using one of several different algorithms; an annotated expression specifying a detailed evaluation strategy is called an evaluation plan.
E.g., can use an index on balance to find accounts with balance < 2500,
or can perform complete relation scan and discard accounts with
balance ≥ 2500
Cost is generally measured as the total elapsed time for answering a query
Many factors contribute to time cost
disk accesses, CPU, or even network communication
Typically disk access is the predominant cost, estimated as:
number of seeks × average-seek-cost
+ number of blocks read × average-block-read-cost
+ number of blocks written × average-block-write-cost
For simplicity we just use the number of block transfers from disk as the cost measure
We ignore the difference in cost between sequential and random I/O for
simplicity
We also ignore CPU costs for simplicity
Costs depend on the size of the buffer in main memory: more memory reduces the need for disk accesses
Real systems take CPU cost into account, differentiate between sequential and random I/O, and take buffer size into account
We do not include the cost of writing output to disk in our cost formulae
Selection Operation
File scan: search algorithms that locate and retrieve records that fulfill a selection condition by scanning the blocks of the file
Plus number of blocks containing records that satisfy
selection condition
Will see how to estimate this cost in Chapter 14
σ_A≥v(r): use index to find the first index entry ≥ v and scan the index sequentially from there
Disjunctive selection σ_θ1∨θ2∨…∨θn(r): applicable if all conditions have available indices, otherwise use linear scan
Use corresponding index for each condition, and take union of all the
obtained sets of record pointers.
Then fetch records from file
Negation σ_¬θ(r): use linear scan on the file
Sorting
We may build an index on the relation, and then use the index to
read the relation in sorted order. May lead to one disk block
access for each tuple.
For relations that fit in memory, techniques like quicksort can be used. For relations that do not fit in memory, external sort-merge is a good choice.
External Sort-Merge
Let M denote memory size (in pages).
1. Create sorted runs. Let i be 0 initially.
Repeatedly do the following till the end of the relation:
(a) Read M blocks of relation into memory
(b) Sort the in-memory blocks
(c) Write sorted data to run Ri; increment i.
Let the final value of i be N
2. Merge the runs (N-way merge). We assume (for now) that N < M.
we ignore final write cost for all operations since the output of
an operation may be sent to the parent operation without
being written to disk
Thus total number of disk accesses for external sorting:
br ( 2 ⌈log M−1 (br / M)⌉ + 1 )
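The two phases can be sketched compactly in Python, with lists standing in for disk blocks and the relation small enough that each "run" is just a sorted slice. `heapq.merge` plays the role of the N-way merge:

```python
import heapq

def external_sort(relation, M):
    """Two-phase external sort-merge sketch.
    Phase 1: cut the input into runs of M 'blocks' and sort each run.
    Phase 2: N-way merge of the sorted runs (assumes N < M, so one
    merge pass suffices)."""
    runs = [sorted(relation[i:i + M]) for i in range(0, len(relation), M)]
    return list(heapq.merge(*runs))   # streaming N-way merge

data = [7, 3, 9, 1, 8, 2, 6, 4]
print(external_sort(data, M=3))       # [1, 2, 3, 4, 6, 7, 8, 9]
```

In a real system each run would live on disk and the merge would keep only one block per run in memory, which is where the br(2⌈log M−1(br/M)⌉ + 1) access count comes from.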
Join Operation
Several different algorithms to implement joins
Nested-loop join
Block nested-loop join
Indexed nested-loop join
Merge-join
Hash-join
Choice based on cost estimate
Examples use the following information:
Number of records: customer: 10,000; depositor: 5000
Number of blocks: customer: 400; depositor: 100
Nested-Loop Join
To compute the theta join r ⋈θ s: for each tuple tr in r and each tuple ts in s, test the pair (tr, ts) to see if it satisfies the join condition θ.
Expensive since it examines every pair of tuples in the two
relations.
nr ∗ bs + br
disk accesses.
Each block in the inner relation s is read once for each block in
the outer relation (instead of once for each tuple in the outer
relation).
Best case: br + bs block accesses.
Improvements to nested loop and block nested loop
algorithms:
In block nested-loop, use M − 2 disk blocks as blocking unit for
outer relations, where M = memory size in blocks; use remaining
two blocks to buffer inner relation and output
Cost = ⌈br / (M−2)⌉ ∗ bs + br
Worst case: the buffer has space for only one page of r, and for each tuple in r an index lookup is performed on the inner relation.
Let customer have a primary B+-tree index on the join attribute customer-name
Merge-Join
1. Sort both relations on their join attribute (if not already sorted on the
join attributes).
2. Merge the sorted relations to join them
Merge-Join (Cont.)
Can be used only for equi-joins and natural joins
Each block needs to be read only once (assuming all tuples for any given value of the join attributes fit in memory)
Thus the cost of merge join is: br + bs block accesses
Hash-Join
Applicable for equi-joins and natural joins.
A hash function h is used to partition tuples of both relations
h maps JoinAttrs values to {0, 1, ..., n}, where JoinAttrs denotes the common attributes of r and s used in the natural join
The number of partitions n is denoted nh
Hash-Join (Cont.)
Hash-Join (Cont.)
Hash-Join Algorithm
The hash-join of r and s is computed as follows.
1. Partition the relation s using hashing function h. When
partitioning a relation, one block of memory is reserved as
the output buffer for each partition.
2. Partition r similarly.
3. For each i:
(a) Load si into memory and build an in-memory hash index on it
using the join attribute. This hash index uses a different hash
function than the earlier one h.
(b) Read the tuples in ri from the disk one by one. For each tuple
tr locate each matching tuple ts in si using the in-memory hash
index. Output the concatenation of their attributes.
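The three steps above can be sketched directly in Python, with the partitions as in-memory lists and dicts standing in for both hash functions. The relations are invented sample data:

```python
def hash_join(r, s, attr, n_h=4):
    """Hash-join sketch: partition both inputs with h, then for each
    partition i build an in-memory index on s_i and probe it with r_i."""
    h = lambda t: hash(t[attr]) % n_h          # partitioning hash function
    r_parts = [[] for _ in range(n_h)]
    s_parts = [[] for _ in range(n_h)]
    for t in r: r_parts[h(t)].append(t)        # step 2: partition r
    for t in s: s_parts[h(t)].append(t)        # step 1: partition s (build)

    out = []
    for ri, si in zip(r_parts, s_parts):       # step 3: per-partition join
        index = {}                             # (a) in-memory hash index on s_i,
        for ts in si:                          #     using a different "hash" (exact key)
            index.setdefault(ts[attr], []).append(ts)
        for tr in ri:                          # (b) probe with each tuple of r_i
            for ts in index.get(tr[attr], []):
                out.append({**ts, **tr})       # concatenate attributes
    return out

depositor = [{"cname": "Johnson", "acct": "A-101"},
             {"cname": "Smith", "acct": "A-215"}]
account = [{"acct": "A-101", "balance": 500},
           {"acct": "A-215", "balance": 700}]
print(hash_join(depositor, account, "acct"))
```

Matching tuples can only land in partitions with the same index, which is why each r_i needs to be checked only against s_i.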
Handling of Overflows
Hash-table overflow occurs in partition si if si does not fit in memory; it can be handled by recursive partitioning.
Cost of Hash-Join
If recursive partitioning is not required: cost of hash join is
3(br + bs) + 2 ∗ nh
If recursive partitioning is required, the number of passes required for partitioning s is ⌈log M−1 (bs)⌉ − 1
Total cost estimate is:
2(br + bs)(⌈log M−1 (bs)⌉ − 1) + br + bs
If the entire build input can be kept in main memory, nh can be
set to 0 and the algorithm does not partition the relations into
temporary files. Cost estimate goes down to br + bs.
Example: cost of hash join customer ⋈ depositor
Hybrid Hash-Join
Useful when memory size is relatively large, and the build input is bigger than memory
Division of memory:
The first partition occupies 20 blocks of memory
1 block is used for input, and 1 block each for buffering the other 4
partitions.
the first is used right away for probing, instead of being written out
and read back.
Cost of 3(80 + 320) + 20 + 80 = 1300 block transfers for hybrid hash join, instead of 1500 with plain hash join
Hybrid hash join is most useful if M >> √bs
Complex Joins
Join with a conjunctive condition:
r ⋈_θ1∧θ2∧…∧θn s
Either use nested loops / block nested loops, or
compute the result of one of the simpler joins r ⋈_θi s; the final result consists of those tuples in the intermediate result that satisfy the remaining conditions θ1 ∧ … ∧ θi−1 ∧ θi+1 ∧ … ∧ θn
Join with a disjunctive condition:
r ⋈_θ1∨θ2∨…∨θn s
Compute as the union of the records in the individual joins r ⋈_θi s:
(r ⋈_θ1 s) ∪ (r ⋈_θ2 s) ∪ … ∪ (r ⋈_θn s)
Other Operations
Duplicate elimination can be implemented via
hashing or sorting.
On sorting duplicates will come adjacent to each other,
and all but one set of duplicates can be deleted.
Optimization: duplicates can be deleted during run
generation as well as at intermediate merge steps in
external sort-merge.
Hashing is similar: duplicates will come into the same
bucket.
Projection is implemented by performing projection on each tuple, followed by
duplicate elimination.
Aggregation: sorting or hashing can be used to bring tuples in the same group
together, and then the aggregate functions can be applied on each
group.
Optimization: combine tuples in the same group during run
generation and intermediate merges, by computing partial
aggregate values
For count, min, max, sum: keep aggregate values on tuples found so far in the group; for avg, keep sum and count, and divide sum by count at the end
Set operations (∪, ∩ and −) can use a variant of merge-join after sorting, or a variant of hash-join
Evaluation of Expressions
So far: we have seen algorithms for individual operations
Alternatives for evaluating an entire expression tree
Materialization
Materialized evaluation: evaluate one operation at a time, starting at the lowest level, and use materialized (stored) intermediate results as inputs to the next level
Materialization (Cont.)
Materialized evaluation is always applicable
Cost of writing results to disk and reading them back can be
quite high
Our cost formulas for operations ignore the cost of writing results to disk, so:
Overall cost = sum of costs of individual operations + cost of writing intermediate results to disk
Double buffering: use two output buffers for each operation; when one is full, write it to disk while the other is getting filled
Allows overlap of disk writes with computation and reduces
execution time
Pipelining
Pipelined evaluation: evaluate several operations simultaneously, passing the results of one operation on to the next, without materializing intermediate relations to disk.
Pipelining may not always be possible, e.g. for sort and hash-join.
For pipelining to be effective, use evaluation algorithms that
generate output tuples even as tuples are received for inputs to the
operation.
Pipelines can be executed in two ways: demand driven and
producer driven
Pipelining (Cont.)
In demand-driven or lazy evaluation, each operation implements an iterator with operations open(), next() and close(); open() sets up the state needed for evaluation
E.g. file scan: initialize file scan, store pointer to beginning of file as state
E.g.merge join: sort relations and store pointers to beginning of sorted
relations as state
next(): returns the next output tuple, updating the stored state
E.g. for file scan: Output next tuple, and advance and store file pointer
E.g. for merge join: continue with merge from earlier state till
next output tuple is found. Save pointers as iterator state.
close(): ends the iteration and releases the state
Pipelining (Cont.)
In producer-driven or eager pipelining, operators produce tuples eagerly and pass them up to their parents
Blocking operations cannot produce any output until they have consumed all their input tuples
E.g. merge join, or hash join
These result in intermediate results being written to disk and then read back
Algorithm variants are possible to generate (at least some) results in a pipelined fashion
Complex Joins
Join involving three relations: loan ⋈ depositor ⋈ customer
Strategy 1: compute depositor ⋈ customer, then use the result to compute loan ⋈ (depositor ⋈ customer)
Chapter 14
Query Optimization
Introduction
Alternative ways of evaluating a given query
Equivalent expressions
Different algorithms for each operation (Chapter 13)
Cost difference between a good and a bad way of evaluating a query can be enormous
Estimating the cost of a plan requires statistical information about relations: number of tuples, number of distinct values for attributes, etc.
Introduction (Cont.)
Relations generated by two equivalent expressions have the
same set of attributes and contain the same set of tuples,
although their attributes may be ordered differently.
Introduction (Cont.)
Generation of query-evaluation plans for an expression involves
several steps:
1. Generating logically equivalent expressions
Use equivalence rules to transform an expression into an
equivalent one.
Overview of chapter
Statistical information for cost estimation
Equivalence rules
Cost-based optimization algorithm
Optimizing nested subqueries
Materialized views and view maintenance
nr: number of tuples in relation r
br: number of blocks containing tuples of r
fr: blocking factor of r, i.e., the number of tuples of r that fit into one block; if tuples of r are stored together physically in a file, then br = ⌈nr / fr⌉
SC(A, r): selection cardinality of attribute A of relation r, i.e., the average number of records that satisfy equality on A
E.g. the binary search cost estimate becomes
Ea2 = ⌈log2(br)⌉ + ⌈SC(A, r) / fr⌉ − 1
σ_A≤v(r): let c denote the estimated number of tuples satisfying the condition.
If min(A,r) and max(A,r) are available in the catalog:
c = 0 if v < min(A,r)
c = nr ∗ (v − min(A,r)) / (max(A,r) − min(A,r)) otherwise
In the absence of statistical information, c is assumed to be nr / 2.
Conjunction σ_θ1∧θ2∧…∧θn(r): assuming independence, the estimated number of tuples in the result is
nr ∗ (s1 ∗ s2 ∗ … ∗ sn) / nr^n
where si is the estimated number of tuples satisfying θi
Disjunction σ_θ1∨θ2∨…∨θn(r): estimated number of tuples is
nr ∗ ( 1 − (1 − s1/nr) ∗ (1 − s2/nr) ∗ … ∗ (1 − sn/nr) )
The Cartesian product r × s contains nr ∗ ns tuples; each tuple occupies sr + ss bytes.
If R ∩ S = ∅, then r ⋈ s is the same as r × s.
If R ∩ S is a key for R, then a tuple of s will join with at most one tuple from r; therefore, the number of tuples in r ⋈ s is no greater than the number of tuples in s.
If R ∩ S is a foreign key in S referencing R, then the number of tuples in r ⋈ s is exactly the number of tuples in s; the case for R ∩ S being a foreign key referencing S is symmetric.
In the example query depositor ⋈ customer, customer-name in depositor is a foreign key of customer
hence, the result has exactly ndepositor tuples, which is 5000
If R ∩ S = {A} is not a key for R or S, the estimated size of r ⋈ s is:
min( nr ∗ ns / V(A,s), ns ∗ nr / V(A,r) )
Aggregation: the estimated size of ΑgF(r) is V(A, r), the number of distinct values of the grouping attribute A
Outer join:
Estimated size of r ⟕ s = size of r ⋈ s + size of r (the case of the right outer join is symmetric)
Estimated size of r ⟗ s = size of r ⋈ s + size of r + size of s
Estimation of distinct values: for selections σθ(r),
V(A, σθ(r)) = min( V(A,r), n_σθ(r) )
More accurate estimates can be obtained using probability theory, but this one works fine generally
For joins r ⋈ s, if all attributes in A are from r:
estimated V(A, r ⋈ s) = min( V(A,r), n_r⋈s )
If A contains attributes A1 from r and A2 from s, then estimated
V(A, r ⋈ s) = min( V(A1,r) ∗ V(A2 − A1, s), V(A1 − A2, r) ∗ V(A2,s), n_r⋈s )
More accurate estimates can be obtained using probability theory, but this one works fine generally
Transformation of Relational
Expressions
Two relational algebra expressions are said to be equivalent if on every legal database instance the two expressions generate the same set of tuples
An equivalence rule says that expressions of two forms are
equivalent
Can replace an expression of the first form by the second, or vice versa
Equivalence Rules
1. Conjunctive selection operations can be deconstructed into a
sequence of individual selections.
( E ) = ( ( E ))
1
( ( E )) = ( ( E ))
1
4. Selections can be combined with Cartesian products and
theta joins:
  a. σθ(E1 × E2) = E1 ⋈θ E2
  b. σθ1(E1 ⋈θ2 E2) = E1 ⋈θ1∧θ2 E2
5. Theta-join operations (and natural joins) are commutative:
  E1 ⋈θ E2 = E2 ⋈θ E1
6. (a) Natural join operations are associative:
  (E1 ⋈ E2) ⋈ E3 = E1 ⋈ (E2 ⋈ E3)
(b) Theta joins are associative in the following manner:
  (E1 ⋈θ1 E2) ⋈θ2∧θ3 E3 = E1 ⋈θ1∧θ3 (E2 ⋈θ2 E3)
where θ2 involves attributes from only E2 and E3.
7. The selection operation distributes over the theta-join operation
under the following conditions:
(a) When all the attributes in θ0 involve only the attributes of one
of the expressions (E1) being joined:
  σθ0(E1 ⋈θ E2) = (σθ0(E1)) ⋈θ E2
(b) When θ1 involves only the attributes of E1 and θ2 involves only
the attributes of E2:
  σθ1∧θ2(E1 ⋈θ E2) = (σθ1(E1)) ⋈θ (σθ2(E2))
8. The projection operation distributes over the theta-join
operation.
(a) If θ involves only attributes from L1 ∪ L2:
  ΠL1∪L2(E1 ⋈θ E2) = (ΠL1(E1)) ⋈θ (ΠL2(E2))
(b) Consider a join E1 ⋈θ E2 where θ involves additional attributes;
those attributes must be retained by the inner projections.
9. The set operations union and intersection are commutative:
  E1 ∪ E2 = E2 ∪ E1
  E1 ∩ E2 = E2 ∩ E1
(set difference is not commutative).
10. Set union and intersection are associative.
11. The selection operation distributes over ∪, ∩, and −:
  σθ(E1 − E2) = σθ(E1) − σθ(E2)
Also:
  σθ(E1 − E2) = σθ(E1) − E2
and similarly for ∩ in place of −, but not for ∪.
Transformation Example
Query: Find the names of all customers who have an account at
some branch located in Brooklyn:
  Πcustomer-name(σbranch-city=Brooklyn(branch ⋈ (account ⋈ depositor)))
Transformation using rule 7a:
  Πcustomer-name((σbranch-city=Brooklyn(branch))
    ⋈ (account ⋈ depositor))
Performing the selection as early as possible reduces the size of
the relation that must be joined with (account ⋈
depositor).
Join Ordering Example
For all relations r1, r2, and r3:
  (r1 ⋈ r2) ⋈ r3 = r1 ⋈ (r2 ⋈ r3)
If r2 ⋈ r3 is quite large and
r1 ⋈ r2 is small, we choose
  (r1 ⋈ r2) ⋈ r3
so that a smaller temporary relation is computed and stored.
E.g., in the Brooklyn query it is better to first compute
σbranch-city=Brooklyn(branch) ⋈ account, since account ⋈ depositor is
likely to be a large relation.
Evaluation Plan
An evaluation plan defines exactly what algorithm is used for each
operation, and how the execution of the operations is coordinated.
Cost-Based Optimization
Consider finding the best join order for r1 ⋈ r2 ⋈ ... ⋈ rn.
There are (2(n−1))!/(n−1)! different join orders for the above
expression (with n = 7 the number is 665280).
Cost of Optimization
With dynamic programming, the time complexity of optimization with
bushy trees is O(3^n): the best join order for a subset (e.g., of
{r1, r2, r3, r4, r5}) is computed only once and reused.
To find the best join order for a
set of n given relations, we must find the best join order for each
subset, for each interesting sort order.
Simple extension of earlier dynamic programming algorithms.
Usually, the number of interesting orders is quite small and doesn't
affect time/space complexity significantly.
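The subset-based dynamic program above can be sketched as follows. This is a minimal illustration, not a real optimizer: the cost model (sum of intermediate result sizes under a single assumed join selectivity) and all names are hypothetical stand-ins for catalog statistics:

```python
from itertools import combinations

# Dynamic programming over subsets of relations: the best plan for each
# subset is computed once and reused, as in the O(3^n) algorithm above.
# sizes: dict relation-name -> cardinality; selectivity: toy join factor.

def best_join_order(sizes, selectivity=0.01):
    rels = list(sizes)
    # best[subset] = (result size, plan cost, plan tree)
    best = {frozenset([r]): (sizes[r], 0.0, r) for r in rels}
    for k in range(2, len(rels) + 1):
        for subset in combinations(rels, k):
            s = frozenset(subset)
            result_size = selectivity ** (k - 1)
            for r in subset:
                result_size *= sizes[r]
            # try every split of the subset into two non-empty halves
            for m in range(1, k):
                for left in combinations(subset, m):
                    l, rgt = frozenset(left), s - frozenset(left)
                    lsize, lcost, lplan = best[l]
                    rsize, rcost, rplan = best[rgt]
                    cost = lcost + rcost + lsize + rsize
                    if s not in best or cost < best[s][1]:
                        best[s] = (result_size, cost, (lplan, rplan))
    return best[frozenset(rels)]
```

On a toy instance with one large and two small relations, the plan that joins the two small relations first wins, as the heuristic discussion suggests.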
Heuristic Optimization
Cost-based optimization is expensive, even with
dynamic programming.
Systems may use heuristics to reduce the number of
choices that must be made in a cost-based fashion. More refined cost
estimates also take into
account the probability that the page containing a tuple is in the
buffer.
Intricacies of SQL complicate query optimization
select customer-name
from borrower
where exists (select *
from depositor
where depositor.customer-name =
borrower.customer-name)
Conceptually, the nested subquery is executed once for each tuple of
the outer-level query; the query can instead be rewritten as a join:
select customer-name
from borrower, depositor
where depositor.customer-name = borrower.customer-name
Note: the above query doesn't correctly deal with duplicates; it can
be modified to do so, as we will see.
In general, it is not possible/straightforward to move the entire
nested subquery from clause into the outer level query from clause
A temporary relation is created instead, and used in body of outer
level query
In general, a query of the form
  select ...
  from L1
  where P1 and exists (select *
                       from L2
                       where P2)
is transformed to:
  create table t1 as
    select distinct V
    from L2
    where P2¹
  select ...
  from L1, t1
  where P1 and P2²
where P2¹ contains the predicates of P2 without correlation
variables, P2² reintroduces the correlation predicates (with
relations renamed), and V contains the attributes used in them.
create table t1 as
select distinct customer-name
from depositor
select customer-name
from borrower, t1
where t1.customer-name = borrower.customer-name
The process of replacing a nested query by a query with a join
(possibly with a temporary relation) is called decorrelation.
Materialized Views
A materialized view is a view whose contents are computed and
stored.
Consider a view defined by a join: recomputing the view on every
update to an underlying relation is expensive.
A better option is to use incremental view maintenance: compute only
the changes to the view.
Join Operation
Consider the materialized view v = r ⋈ s, and an update to r.
Let rold and rnew denote the old and new states of relation r.
Consider the case of an insert to r:
We can write rnew ⋈ s as (rold ∪ ir) ⋈
s = (rold ⋈ s) ∪ (ir ⋈
s)
But (rold ⋈ s) is simply the old value of the materialized view, so the
incremental change to the view is just
ir ⋈ s
Thus, for inserts: vnew = vold ∪ (ir ⋈
s)
Similarly, for deletes: vnew = vold − (dr ⋈
s)
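The insert case above can be sketched concretely. This is a toy model (relations as lists of dicts, a naive natural join), not an engine implementation:

```python
# Incremental maintenance of v = r NATURAL JOIN s on an insert into r:
# only the delta (i_r JOIN s) is computed, and v_new = v_old UNION delta.

def natural_join(r, s):
    """Naive nested-loop natural join over lists of dict tuples."""
    out = []
    for tr in r:
        for ts in s:
            common = set(tr) & set(ts)
            if all(tr[a] == ts[a] for a in common):
                out.append({**tr, **ts})
    return out

def maintain_on_insert(view, s, inserted_r):
    """v_new = v_old + (i_r JOIN s); the old view is never recomputed."""
    return view + natural_join(inserted_r, s)
```

Inserting a single tuple into r touches only that tuple's join partners in s, which is the whole point of the incremental rule.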
Aggregation Operations
count: consider v = AGcount(B)(r). On an insert, look up the matching
group and increment its count (adding a new group with count 1 if
needed); on a delete, decrement the count, removing the group when it
reaches 0.
sum: v = AGsum(B)(r) is maintained similarly, keeping a running sum
per group (a count is also kept, to detect when a group becomes
empty).
Other Operations
Set intersection: v = r ∩ s — on an insert to r, add the tuple to v
if it is present in s; on a delete from r, remove it from v if
present.
Handling Expressions
To handle an entire expression, we derive expressions for
computing the incremental change to the result of each subexpressions, starting from the smallest sub-expressions.
E.g. consider E1 ⋈ E2 as a sub-expression of a larger,
complex expression.
Suppose the set of tuples to be inserted into E1 is given by D1
(computed earlier, since smaller sub-expressions are handled
first).
Then the set of tuples to be inserted into E1 ⋈ E2 is given by
D1 ⋈ E2, by the incremental join-maintenance rule.
Query Optimization and Materialized Views
Suppose a materialized view v = r ⋈
s is available, and
a user submits a query
r ⋈ s ⋈
t:
the query can be rewritten as v ⋈ t, which may be much cheaper to
evaluate.
Materialized View Selection: which views should we
materialize?
This decision must be made on the basis of the system workload.
Indices are just like materialized views; the problem of index
selection is closely related.
End of Chapter
(Extra slides with details of selection cost estimation
follow)
V(branch-name,account) is 50
10000/50 = 200 tuples of the account relation pertain to
Perryridge branch
200/20 = 10 blocks for these tuples
A binary search to find the first record would take
⌈log2(500)⌉ = 9 block accesses.
Total cost of binary search is 9 + 10 − 1 = 18 block
accesses.
The selection is on the search-key of the index.
A3 (primary index on candidate key, equality). Retrieve a
single record; cost = HTi + 1, where HTi is the height of the index.
A secondary-index scan that retrieves c matching records costs about
  HTi + ⌈LBi · c / nr⌉ + c
where LBi is the number of leaf blocks of the index.
Example (Cont.)
Consider using algorithm A10:
Transaction Concept
A transaction is a unit of program execution that accesses and
possibly updates various data items.
During transaction execution the database may be temporarily
inconsistent.
When the transaction is committed, the database must
be consistent.
Two main issues to deal with:
failures of various kinds, such as hardware failures and system crashes
concurrent execution of multiple transactions
ACID Properties
To preserve integrity of data, the database system must ensure:
Atomicity. Either all operations of the transaction are
properly reflected in the database or none are.
Consistency. Execution of a transaction in isolation
preserves the consistency of the database.
Example: a fund-transfer transaction:
read(A)
A := A − 50
write(A)
read(B)
B := B + 50
write(B)
Transaction State
Active, the initial state; the transaction stays in this state
while it is executing
Partially committed, after the final statement has been
executed.
Failed, after the discovery that normal execution can no
longer proceed.
Aborted, after the transaction has been rolled back and the
database restored to its state prior to the start of the transaction.
Concurrent Executions
Multiple transactions are allowed to run concurrently in the
Schedules
Schedules — sequences that indicate the chronological order in
which instructions of concurrent transactions are executed.
Example Schedules
Let T1 transfer $50 from A to B, and T2 transfer 10% of
the balance of A to B.
Serializability
Basic Assumption Each transaction preserves database
consistency.
Thus serial execution of a set of transactions preserves
database consistency.
A (possibly concurrent) schedule is serializable if it is
equivalent to a serial schedule.
Conflict Serializability
Instructions li and lj of transactions Ti and Tj respectively, conflict
if and only if there exists some item Q accessed by both li and lj,
and at least one of these instructions wrote Q.
1. li = read(Q), lj = read(Q).   li and lj don't conflict.
2. li = read(Q), lj = write(Q).  They conflict.
3. li = write(Q), lj = read(Q).  They conflict.
4. li = write(Q), lj = write(Q). They conflict.
Example schedule over T3 and T4:
  T3          T4
  read(Q)
              write(Q)
  write(Q)
We are unable to swap instructions in the above schedule
to obtain either the serial schedule < T3, T4 >, or the serial
schedule < T4, T3 >.
View Serializability
Let S and S′ be two schedules with the same set of transactions.
S and S′ are view equivalent if, for each data item, the same
transactions read the initial value, read values produced by the same
writes, and perform the final write in either
schedule.
Every conflict serializable schedule is also view serializable.
Schedule 9 (from text) — a schedule which is view-serializable
but not conflict serializable.
Recoverability
Need to address the effect of transaction failures on concurrently
running transactions.
Recoverable schedule — if a transaction Tj reads a data item
previously written by a transaction Ti, the commit operation of Ti
appears before the commit operation of Tj.
The following schedule (Schedule 11) is not recoverable if T9
commits immediately after the read.
Recoverability (Cont.)
Cascading rollback — a single transaction failure leads to
a series of transaction rollbacks.
Recoverability (Cont.)
Cascadeless schedules — cascading rollbacks cannot occur;
for each pair of transactions Ti and Tj such that Tj reads a data
item previously written by Ti, the commit of Ti appears before the
read by Tj. Every cascadeless schedule is also recoverable, so it is
desirable to restrict schedules to those that are
cascadeless.
Implementation of Isolation
Schedules must be conflict or view serializable, and
recoverable — and preferably cascadeless.
Weak levels of consistency in SQL:
Serializable — the default
Repeatable read
Read committed
Read uncommitted — allows even uncommitted data to be
read.
Lower degrees of consistency useful for gathering approximate
information about the database, e.g., statistics for query optimizer.
(Figure: a schedule over transactions T1 ... T5 with interleaved
read and write operations on items U, V, W, X, Y, Z, used to
illustrate testing for serializability.)
Precedence graph — a directed graph where the vertices are the
transactions T1, ..., Tn. We draw an arc from Ti to Tj if the two
transactions conflict and Ti accessed the data item earlier.
A schedule is conflict serializable if and only if its precedence
graph is acyclic.
Cycle-detection algorithms exist which take order n² time, where
n is the number of vertices in the graph.
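The serializability test above can be sketched end to end: build the precedence graph from a schedule and check it for cycles. The schedule encoding (triples of transaction, operation, item) is an illustrative choice, not from the text:

```python
# Conflict-serializability test: build the precedence graph of a schedule
# and check acyclicity with Kahn's topological-sort algorithm.

def precedence_graph(schedule):
    """schedule: list of (txn, 'read'|'write', item) in chronological order."""
    edges = set()
    for i, (ti, opi, qi) in enumerate(schedule):
        for tj, opj, qj in schedule[i + 1:]:
            # conflicting pair: same item, different txns, at least one write
            if qi == qj and ti != tj and 'write' in (opi, opj):
                edges.add((ti, tj))
    return edges

def is_conflict_serializable(schedule):
    edges = precedence_graph(schedule)
    nodes = {t for t, _, _ in schedule}
    indeg = {n: 0 for n in nodes}
    for _, v in edges:
        indeg[v] += 1
    queue = [n for n in nodes if indeg[n] == 0]
    seen = 0
    while queue:
        u = queue.pop()
        seen += 1
        for a, b in edges:
            if a == u:
                indeg[b] -= 1
                if indeg[b] == 0:
                    queue.append(b)
    return seen == len(nodes)   # acyclic iff every node was ordered
```

On the T3/T4 schedule above, the graph contains both T3→T4 and T4→T3, so the test correctly reports it as not conflict serializable.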
End of Chapter
Schedule 7
Precedence Graph
fig. 15.21
Lock-Based Protocols
A lock is a mechanism to control concurrent access to a data item
Data items can be locked in two modes :
T2: lock-S(A);
read (A);
unlock(A);
lock-S(B);
read (B);
unlock(B);
display(A+B)
Locking as above is not sufficient to guarantee serializability: if
A and B get updated in between the reads of A and B, the displayed
sum would be wrong.
A locking protocol is a set of rules followed by all transactions
while requesting and releasing locks; locking protocols restrict the
set of possible schedules. Pitfalls: locking can lead to deadlock
and to
starvation.
Lock Conversions
Two-phase locking with lock conversions:
First Phase:
can acquire a lock-S on item
can acquire a lock-X on item
can convert a lock-S to a lock-X (upgrade)
Second Phase:
can release a lock-S
can release a lock-X
can convert a lock-X to a lock-S (downgrade)
This protocol assures serializability, but still relies on the
programmer to insert the various locking instructions.
Automatic acquisition of locks — a transaction Ti issues read(D) as:
  if Ti has a lock on D
    then read(D)
    else begin
      if necessary, wait until no other
        transaction has a lock-X on D;
      grant Ti a lock-S on D;
      read(D)
    end
write(D) is processed as:
  if Ti has a lock-X on D
    then write(D)
    else begin
      if necessary, wait until no other transaction has any lock on D;
      if Ti has a lock-S on D
        then upgrade lock on D to lock-X
        else grant Ti a lock-X on D;
      write(D)
    end
All locks are released after commit or abort
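The protocol above, with release at commit/abort, can be sketched as a tiny lock manager. This is an illustrative model only: blocking is represented by returning False rather than actually waiting, and the class name is hypothetical:

```python
# Toy lock manager for the shared/exclusive protocol above, including
# the S -> X upgrade and strict release of all locks at commit/abort.

class LockManager:
    def __init__(self):
        self.table = {}  # item -> (mode, set of holding txns)

    def lock_s(self, txn, item):
        mode, holders = self.table.get(item, ('S', set()))
        if mode == 'X' and holders and txn not in holders:
            return False                      # another txn holds lock-X: wait
        if mode != 'X':
            self.table[item] = ('S', holders | {txn})
        return True

    def lock_x(self, txn, item):
        mode, holders = self.table.get(item, ('S', set()))
        if holders - {txn}:
            return False                      # any other lock on item: wait
        self.table[item] = ('X', {txn})       # grant fresh, or upgrade S -> X
        return True

    def release_all(self, txn):
        """Strict 2PL: every lock is released at commit or abort."""
        for item in list(self.table):
            _, holders = self.table[item]
            holders.discard(txn)
            if not holders:
                del self.table[item]
```

A waiting transaction would retry (or block on a condition variable) until the requested call returns True.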
Implementation of Locking
Lock Table
Black rectangles indicate granted
Graph-Based Protocols
Graph-based protocols are an alternative to two-phase locking:
impose a partial ordering → on the set D = {d1, d2, ..., dh} of all
data items.
If di → dj, then any transaction accessing both di and dj must access
di before accessing dj.
Implies that the set D may now be viewed as a directed acyclic
graph, called a database graph.
The tree-protocol is a simple kind of graph protocol.
Tree Protocol
Timestamp-Based Protocols
Each transaction is issued a timestamp when it enters the system. If
an old transaction Ti has timestamp TS(Ti), a new transaction Tj is
assigned a timestamp TS(Tj) such that TS(Ti) < TS(Tj).
(Figure: example schedule over transactions T1 ... T5 under the
timestamp-ordering protocol — a transaction with a smaller timestamp
that issues a conflicting operation too late is aborted, while a
transaction with a larger timestamp proceeds.)
Problem with the timestamp-ordering protocol: suppose Ti aborts, but
Tj has read a data item written by Ti; then Tj must abort. If Tj had
been allowed to commit before Ti aborted, the schedule is not
recoverable.
Solution:
A transaction is structured such that its writes are all performed at
the end of its processing
All writes of a transaction form an atomic action; no transaction may
execute while a transaction is being written
A transaction that aborts is restarted with a new timestamp
Thomas' Write Rule: a modified version of the timestamp-ordering
protocol in which obsolete write operations may be ignored under
certain circumstances.
Thomas' Write Rule allows greater potential concurrency: unlike the
basic protocol, it permits some view-serializable schedules that are
not conflict serializable.
Validation-Based Protocol
Execution of transaction Ti is done in three phases:
1. Read and execution phase — Ti writes only to temporary local
variables.
2. Validation phase — Ti performs a validation test to determine
whether its local writes can be applied without violating
serializability.
3. Write phase — if validation succeeds, the updates are applied to
the database; otherwise Ti is rolled back.
Also called optimistic concurrency control, since the transaction
executes fully in the hope that all will go well during validation.
If for all Ti with TS(Ti) < TS(Tj), one of the following
conditions holds:
  finish(Ti) < start(Tj)
  start(Tj) < finish(Ti) < validation(Tj), and the set of data items
  written by Ti does not intersect with the set of data items read by Tj
then validation succeeds and Tj can be committed; otherwise,
validation fails and Tj is aborted.
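The two validation conditions above translate directly into a predicate. The dict-based transaction record is an illustrative encoding, not from the text:

```python
# Validation test of the optimistic protocol: does Tj validate against an
# earlier transaction Ti?  ti/tj are dicts with 'start', 'validation',
# 'finish' times and 'reads'/'writes' item sets.

def validates(ti, tj):
    # Condition 1: Ti finished before Tj even started.
    if ti['finish'] < tj['start']:
        return True
    # Condition 2: Ti finished before Tj validates, and Ti's writes do not
    # intersect Tj's reads.
    if (tj['start'] < ti['finish'] < tj['validation']
            and not (ti['writes'] & tj['reads'])):
        return True
    return False
```

Tj is committed only if `validates(ti, tj)` holds for every Ti with a smaller timestamp.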
Schedule produced using validation (T14 and T15):
  T14              T15
  read(B)
                   read(B)
                   B := B − 50
                   read(A)
                   A := A + 50
  read(A)
  (validate)
  display(A+B)
                   (validate)
                   write(B)
                   write(A)
Multiple Granularity
Allow data items to be of various sizes, and define a hierarchy of
data granularities where small granularities are nested within
larger ones; the hierarchy can be drawn as a tree (but don't confuse
it with the tree-locking protocol).
When a transaction locks a node in the tree explicitly, it implicitly
locks all the node's descendants in the same mode.
Intention lock modes: intention-shared (IS), intention-exclusive
(IX), and shared and intention-exclusive (SIX).
The compatibility matrix for all lock modes is:
          IS    IX    S     SIX   X
  IS      yes   yes   yes   yes   no
  IX      yes   yes   no    no    no
  S       yes   no    yes   no    no
  SIX     yes   no    no    no    no
  X       no    no    no    no    no
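The standard multiple-granularity compatibility matrix can be encoded as a lookup table, which is how a lock manager would consult it:

```python
# Multiple-granularity lock compatibility (True = compatible).
# Rows: mode already held on the node; columns: mode requested.

COMPAT = {
    'IS':  {'IS': True,  'IX': True,  'S': True,  'SIX': True,  'X': False},
    'IX':  {'IS': True,  'IX': True,  'S': False, 'SIX': False, 'X': False},
    'S':   {'IS': True,  'IX': False, 'S': True,  'SIX': False, 'X': False},
    'SIX': {'IS': True,  'IX': False, 'S': False, 'SIX': False, 'X': False},
    'X':   {'IS': False, 'IX': False, 'S': False, 'SIX': False, 'X': False},
}

def compatible(held, requested):
    """May `requested` be granted while `held` is held by another txn?"""
    return COMPAT[held][requested]
```

Note that IS is compatible with everything except X, while X is compatible with nothing — matching the intuition that intention locks only announce activity lower in the tree.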
Multiversion Schemes
Multiversion schemes keep old versions of data items to increase
concurrency.
Multiversion Timestamp Ordering
Multiversion Two-Phase Locking
Each successful write results in the creation of a new version of
the data item written; reads are given an appropriate version and so
never have to wait — a version is returned
immediately.
Versions are chosen in a manner that ensures
serializability.
Suppose that transaction Ti issues a read(Q) or write(Q) operation.
Multiversion two-phase locking differentiates between read-only
transactions and update transactions.
Update transactions acquire read and write locks, and hold all
locks up to the end of the transaction.
Deadlock Handling
Consider the following two transactions:
  T1: write(X)        T2: write(Y)
      write(Y)            write(X)
Schedule with deadlock:
  T1                  T2
  lock-X on X
  write(X)
                      lock-X on Y
                      write(Y)
                      wait for lock-X on X
  wait for lock-X on Y
Deadlock Handling
Deadlock prevention protocols ensure that the system never enters a
deadlock state.
wait-die scheme — non-preemptive:
An older transaction may wait for a younger one to release a data item.
Younger transactions never wait for older ones; they are rolled back
instead.
A transaction may die several times before acquiring the needed data
item.
wound-wait scheme — preemptive:
An older transaction wounds (forces rollback of) a younger
transaction instead of waiting for it; younger transactions may wait
for older ones.
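The two prevention schemes reduce to a decision rule on timestamps (smaller timestamp = older transaction); the return strings below are illustrative:

```python
# Deadlock-prevention decisions when `requester` wants a lock that
# `holder` currently holds.  Smaller timestamp means older transaction.

def wait_die(requester_ts, holder_ts):
    """Non-preemptive: an older requester waits; a younger one dies."""
    return 'wait' if requester_ts < holder_ts else 'rollback'

def wound_wait(requester_ts, holder_ts):
    """Preemptive: an older requester wounds the holder; a younger waits."""
    return 'wound holder' if requester_ts < holder_ts else 'wait'
```

In both schemes the rolled-back transaction is restarted with its *original* timestamp, so it eventually becomes the oldest and cannot starve.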
Deadlock Detection
Deadlocks can be described in terms of a wait-for graph, which
consists of a pair G = (V, E), where
V is a set of vertices (all the transactions in the system)
E is a set of edges; each element is an ordered pair Ti → Tj.
If Ti → Tj is in E, then there is a directed edge from Ti to Tj,
implying that Ti is waiting for Tj to release a data item. The system
is in a deadlock state if and only if the wait-for graph has a cycle.
Deadlock Recovery
When a deadlock is detected, some transaction must be rolled back
(made a victim) to break the deadlock.
Special handling is needed for concurrency control in the presence
of insertions/deletions (the phantom phenomenon).
Index locking protocols provide higher concurrency while
preventing phantoms; without them,
T1 may see some records inserted by T2, but may not see
others inserted by T2.
Concurrency protocols for B+-trees that allow even greater concurrency
are available; see Section 16.9 for one such protocol, the B-link
tree protocol.
End of Chapter
Lock Table
Schedule 3
Schedule 4
Granularity Hierarchy
Compatibility Matrix
Lock-Compatibility Matrix
Failure Classification
Transaction failure :
Recovery Algorithms
Recovery algorithms are techniques to ensure database
consistency and transaction atomicity and durability despite
failures.
Storage Structure
Volatile storage:
Stable-Storage Implementation
Maintain multiple copies of each block on separate disks; write the
copies in sequence, declaring the write complete only after the
second write
completes.
Record in-progress disk writes on non-volatile storage (non-volatile
RAM or a special area of disk).
Data Access
Physical blocks are those blocks residing on the disk.
Buffer blocks are the blocks residing temporarily in main
memory.
Block movements between disk and main memory are initiated
through the input(B) and output(B) operations.
We assume, for simplicity, that each data item fits in, and is
stored in, a single block.
(Figure: data access — buffer blocks A and B in main memory, the
private work areas of T1 and T2 holding local copies x1, x2, y1;
read(X)/write(Y) move values between work areas and buffer blocks,
while input/output move whole blocks between the buffer and disk.)
Recovery and Atomicity: consider a transaction that transfers $50
from account A to account B; several output operations may be
required (to output A and B). A failure may occur after one of these
modifications has
been made but before all of them are made.
Log-Based Recovery
A log is kept on stable storage; it is a sequence of log records
recording update activity on the database. Before a data item is
modified, a log record for the modification is
written.
We assume for now that log records are written directly to stable
storage (that is, they are not buffered)
Two approaches using logs
Deferred database modification
Immediate database modification
The deferred-database-modification scheme records all
modifications to the log, but defers all the writes to after partial
commit.
Assume that transactions execute serially
Transaction starts by writing <Ti start> record to log.
A write(X) operation results in a log record <Ti, X, V> being
written, where V is the new value for X (the old value is not needed
in this scheme); the write itself is deferred.
A transaction needs to be redone during recovery if
and only if both <Ti start> and <Ti commit> are there in the log.
Redoing a transaction Ti (redo Ti) sets the value of all data items
it updated to the new values.
Example transactions T0 and T1 (T0 executes before T1):
  T0: read(A)          T1: read(C)
      A := A − 50          C := C − 100
      write(A)             write(C)
      read(B)
      B := B + 50
      write(B)
Immediate database modification: updates of an uncommitted
transaction may be applied as the writes are issued; the update log
record must be written before the database item is
written.
We assume that the log record is output directly to stable storage.
This can be extended to postpone log record output, so long as, prior to
execution of an output(B) operation for a data block B, all log
records corresponding to items in B are flushed to stable storage.
Output of updated blocks can take place at any time before or
after transaction commit.
Immediate modification example:
  Log                  Write       Output
  <T0 start>
  <T0, A, 1000, 950>
  <T0, B, 2000, 2050>
                       A = 950
                       B = 2050
  <T0 commit>
  <T1 start>
  <T1, C, 700, 600>
                       C = 600
                                   BB, BC
  <T1 commit>
                                   BA
Note: BX denotes the block containing X.
That is, even if the operation is executed multiple times the effect is
the same as if it is executed once
Needed since operations may get re-executed during recovery
When recovering after failure: transaction Ti needs to be undone if
the log contains <Ti start> but not <Ti commit>; it needs to be
redone if the log contains both.
Checkpoints
Problems in the recovery procedure as discussed earlier: searching
the entire log is time-consuming, and we might unnecessarily redo
transactions that have already output their updates. Streamline
recovery by periodically performing
checkpointing:
1. Output all log records currently residing in main memory onto stable
storage.
2. Output all modified buffer blocks to the disk.
3. Write a log record < checkpoint> onto stable storage.
Checkpoints (Cont.)
During recovery we need to consider only the most recent
transaction Ti that started before the checkpoint, and transactions
that started after Ti.
Example of Checkpoints
(Figure: transactions T1 ... T4 relative to a checkpoint time Tc and
a system-failure time Tf — T1 completed before the checkpoint and can
be ignored, T2 and T3 committed after the checkpoint and are redone,
and T4 was incomplete at the failure and is undone.)
Shadow Paging
Shadow paging is an alternative to log-based recovery: maintain two
page tables — the current page table and the shadow page table — and
update the shadow page table only when the transaction is
committed.
No recovery is needed after a crash — new transactions can start right
away, using the shadow page table; pages not pointed to from either
table are freed
(garbage collected).
A page is copied before it is modified, so the shadow version is
never
changed.
With concurrent transactions, the checkpoint record has the form
<checkpoint L>, where L is the list of active transactions,
since several transactions may be active when a checkpoint is performed.
During the backward scan, perform undo for each log record that belongs
to a transaction in undo-list.
During the forward scan, perform redo for each log record that belongs
to a transaction on redo-list.
Example of Recovery
Go over the steps of the recovery algorithm on the following log:
<T0 start>
<T0, A, 0, 10>
<T0 commit>
<T1 start>
<T1, B, 0, 10>
<T2 start>
/* Scan in Step 4 stops here */
<T2, C, 0, 10>
<T2, C, 10, 20>
<checkpoint {T1, T2}>
<T3 start>
<T3, A, 10, 20>
<T3, D, 0, 10>
<T3 commit>
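The recovery steps over a log like the one above can be sketched compactly. This is a simplified model (tuple-encoded log records, a dict standing in for the database, and updates from before the checkpoint assumed already on disk), not the full algorithm:

```python
# Checkpointed undo/redo recovery over a log of tuples:
#   ('start', T), ('update', T, item, old, new), ('commit', T),
#   ('checkpoint', [active transactions]).

def recover(log):
    # locate the most recent checkpoint record
    cp = max((i for i, rec in enumerate(log) if rec[0] == 'checkpoint'),
             default=-1)
    undo_list = set(log[cp][1]) if cp >= 0 else set()
    redo_list = set()
    # scan forward from the checkpoint to build undo/redo lists
    for rec in log[cp + 1:]:
        if rec[0] == 'start':
            undo_list.add(rec[1])
        elif rec[0] == 'commit':
            undo_list.discard(rec[1])
            redo_list.add(rec[1])
    db = {}
    for rec in reversed(log):            # undo pass: restore old values
        if rec[0] == 'update' and rec[1] in undo_list:
            db[rec[2]] = rec[3]
    for rec in log[cp + 1:]:             # redo pass: reapply committed updates
        if rec[0] == 'update' and rec[1] in redo_list:
            db[rec[2]] = rec[4]
    return undo_list, redo_list, db
```

Running it on the example log yields undo-list {T1, T2} and redo-list {T3}, as the slide intends.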
Log records are output to stable storage in the order in which they
are created.
Transaction Ti enters the commit state only when the log record
<Ti commit> has been output to stable storage.
Before a block of data in main memory is output to the database, all
log records pertaining to data in that block must have been output to
stable storage.
This rule is called the write-ahead logging or WAL rule
Database Buffering
Database maintains an in-memory buffer of data blocks.
Before a block with uncommitted
updates is output to disk, log records with undo information for the updates
are output to the log on stable storage first.
No updates should be in progress on a block when it is output to disk. Can
be ensured as follows.
Before writing a data item, transaction acquires exclusive lock on block containing
the data item
Lock can be released once the write is completed.
Such locks held for short duration are called latches.
Before a block is output to disk, the system acquires an exclusive latch on the
block
Ensures no update can be in progress on the block
Failure with loss of non-volatile
storage:
Periodically dump the entire content of the database to stable storage
No transaction may be active during the dump procedure; a procedure
similar to checkpointing must take place
Output all log records currently residing in main memory onto stable
storage.
Output all buffer blocks onto the disk.
Copy the contents of the database to stable storage.
Output a record <dump> to log on stable storage.
To recover from disk failure
restore database from most recent dump.
Consult the log and redo all transactions that committed after the dump
Can be extended to allow transactions to be active during dump;
known as a fuzzy dump or online dump.
Advanced recovery techniques support high-concurrency locking, such
as that used for B+-tree concurrency control, where locks are
released
early.
Such operations cannot be undone by restoring old values (physical undo),
since once a lock is released, other transactions may have updated
the B+-tree.
Instead, insertions (resp. deletions) are undone by executing a
deletion (resp. insertion) operation (known as logical undo).
For such operations, undo log records should contain the undo
operation to be executed
called logical undo logging, in contrast to physical undo logging.
Redo information is logged physically (that is, new value for each
write) even for such operations; there is no need for logical redo
information.
1. If a log record <Ti, X, V1, V2> is found, perform the undo and log a
special redo-only log record <Ti, X, V1>.
2. If a <Ti, Oj, operation-end, U> record is found:
Rollback the operation logically using the undo information U.
Skip all preceding log records for Ti until the record <Ti, Oj,
operation-begin> is found.
3. If a <Ti start> record is found, write a <Ti abort> record and
remove Ti
from undo-list.
Follow WAL: all log records pertaining to a block must be output before
the block is output
ARIES
ARIES is a state of the art recovery method
ARIES Optimizations
Physiological redo: the affected page is physically identified, but
the action within the page can be logical.
Each log record also contains the LSN of the previous record of the
same
transaction, plus
RedoInfo and
UndoInfo.
Compensation log records
(CLR) used to log actions taken during recovery that never need to
be undone
Also serve the role of operation-abort log records used in advanced
recovery algorithm
Have a field UndoNextLSN to note next (earlier) record to be undone
Records in between would have already been undone
Required to avoid repeated undo of already undone actions
Contains:
DirtyPageTable and list of active transactions
For each active transaction, LastLSN, the LSN of the last log
record written by the transaction
Fixed position on disk notes LSN of last completed
checkpoint log record
Redo pass: scan forward from the earliest RecLSN of the pages in
DirtyPageTable; whenever an update log record
is found:
1. If the page is not in DirtyPageTable or the LSN of the log record is
less than the RecLSN of the page in DirtyPageTable, then skip the
log record
2. Otherwise fetch the page from disk. If the PageLSN of the page
fetched from disk is less than the LSN of the log record, redo the
log record
NOTE: if either test is negative the effects of the log record have
already appeared on the page. First test avoids even fetching the
page from disk!
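The two skip tests above amount to a small predicate; the parameter names are illustrative, with the page fetch passed as a callable so the "avoid even fetching" point is visible:

```python
# ARIES redo-pass decision: redo a log record for a page only if the
# page might not yet reflect it.  dirty_page_table maps page -> RecLSN;
# fetch_page_lsn(page) reads the PageLSN from disk (only called if needed).

def must_redo(log_lsn, page_id, dirty_page_table, fetch_page_lsn):
    rec_lsn = dirty_page_table.get(page_id)
    # Test 1: page not dirty, or flushed after this update -- skip without I/O.
    if rec_lsn is None or log_lsn < rec_lsn:
        return False
    # Test 2: fetch the page; redo only if its PageLSN predates the record.
    return fetch_page_lsn(page_id) < log_lsn
```

Only when the first test passes is the page actually read from disk, which is exactly the optimization the slide highlights.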
Undo pass: generate a CLR containing the undo action performed
(actions performed during undo are logged physically or
physiologically).
The CLR for record n is noted as n′ in the figure below.
Set UndoNextLSN of the CLR to the PrevLSN value of the update log record
Arrows indicate UndoNextLSN value
ARIES supports partial rollback
Used e.g. to handle deadlocks by rolling back just enough to release reqd. locks
(Figure: forward actions after partial rollbacks — update records
1 through 6 with CLRs 1′ ... 6′; records 3 and 4 are rolled back
initially, later 5 and 6, then a full rollback follows the
UndoNextLSN chain.)
During the undo pass, follow the UndoNextLSN chain to find the next
record to
undo; CLRs themselves are never undone.
Remote Backup Systems: provide high availability by allowing
transaction processing to continue even if the primary site has
failed.
Detecting failure of the primary is hard;
to distinguish primary site failure from link failure, maintain several
communication links between the primary and the remote backup.
Transfer of control:
To take over control backup site first perform recovery using its copy of
the database and all the log records it has received from the primary.
Thus, completed transactions are redone and incomplete transactions
are rolled back.
When the backup site takes over processing it becomes the new primary
To transfer control back to old primary when it recovers, old primary must
receive redo logs from the old backup and apply all updates locally.
An alternative to remote backup is a distributed database with
replicated data.
Remote backup is faster and cheaper, but less tolerant to failure —
more on this in Chapter 19.
One-safe: commit as soon as the transaction's commit log record is
written at primary.
Problem: updates may not arrive at backup before it takes over.
Two-very-safe: commit when the transaction's commit log record is
written at both primary and backup.
End of Chapter
Centralized Systems
Run on a single computer system and do not interact with other
computer systems.
General-purpose computer system: one to a few CPUs and a
number of device controllers connected through a common bus that
provides access to shared memory.
Single-user system (e.g., a personal workstation):
desk-top unit, single user, usually has only one CPU and one or
two hard disks; the OS may support only one user.
Multi-user system: more disks, more memory, multiple CPUs, and a
multi-user OS.
Client-Server Systems
Server systems satisfy requests generated at m client systems, whose
Transaction Servers
Also called query server systems or SQL server systems: clients
send requests to the server, transactions are executed at the
server, and results are shipped back — one request is typically one
transaction.
Open Database Connectivity (ODBC) is a C language
application-program-interface standard for connecting to a server,
sending SQL requests, and receiving results.
Monitors other processes, and takes recovery actions if any of the other
processes fail
E.g. aborting any transactions being executed by a server process
and restarting it
Buffer pool
Lock table
Log buffer
Cached query plans (reused if same query submitted again)
All database processes can access shared memory
To avoid the overhead of interprocess communication for lock
requests and grants,
each database process operates directly on the lock table data structure
(Section 16.1.4) instead of sending requests to the lock manager process.
Mutual exclusion ensured on the lock table using semaphores, or more
commonly, atomic instructions
If a lock can be obtained, the lock table is updated directly in shared memory
If a lock cannot be immediately obtained, a lock request is noted in the lock table
and the process (or thread) then waits for lock to be granted
When a lock is released, releasing process updates lock table to record release of
lock, as well as grant of lock to waiting requests (if any)
Process/thread waiting for lock may either:
Continually scan lock table to check for lock grant, or
Use operating system semaphore mechanism to wait on a semaphore.
Data Servers
Used in LANs, where there is a very high speed connection
between the clients and the server, the client machines are
comparable in processing power to the server machine, and the
tasks to be executed are compute intensive.
Ship data to client machines where processing is performed, and
clients.
Used in many object-oriented database systems
Issues:
Parallel Systems
Parallel database systems consist of multiple processors and
multiple disks connected by a fast interconnection network.
A coarse-grain parallel machine consists of a small number of
powerful processors.
A massively parallel or fine-grain parallel machine utilizes
thousands of smaller processors.
Speedup: a fixed-size problem executing on a small system is given
to a system that is N times larger.
  Speedup = small-system elapsed time / large-system elapsed time.
Scaleup: increase the size of both the problem and the system; an
N-times-larger system performs an N-times-larger job.
  Scaleup = small-system small-problem elapsed time /
            big-system big-problem elapsed time.
Interconnection Architectures
Shared Memory
Processors and disks have access to a common memory,
Shared Disk
All processors can directly access all disks via an
interconnection network, but the processors have private memories;
the memory of a failed processor is not lost, and other processors
can take over its tasks via the disk
subsystem.
Shared-disk systems can scale to a somewhat larger number of
processors, but communication between processors is slower.
Shared Nothing
Node consists of a processor, memory, and one or more disks.
Hierarchical
Combines characteristics of shared-memory, shared-disk, and
shared-nothing architectures.
Top level is a shared-nothing architecture: nodes connected by an
interconnection network, not sharing disks or memory. Each node of
the system could be a shared-memory system with
a few processors.
Alternatively, each node could be a shared-disk system, and
each of the systems sharing a set of disks could itself be a
shared-memory system.
Distributed Systems
Data spread over multiple machines (also referred to as sites or
nodes).
Network interconnects the machines
Data shared by users on multiple machines
Distributed Databases
Homogeneous distributed databases: the same software and schema run
at every
site.
Transaction cannot be committed at one site and aborted at another
The two-phase commit protocol (2PC) used to ensure atomicity
Basic idea: each site executes the transaction till just before
commit, and
then leaves the final decision to a coordinator.
Each site must follow the decision of the coordinator, even if there
is a failure
while waiting for the coordinator's decision.
To do so, updates of transaction are logged to stable storage and
transaction is recorded as waiting
More details in Section 19.4.1.
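The 2PC flow above can be sketched in miniature. This is an illustrative model only: participants are callables returning a vote, the log is a plain list, and message loss and site failures are out of scope:

```python
# Two-phase commit decision for a transaction T.  Each participant is a
# callable taking a message and, for 'prepare T', returning 'ready' or
# 'abort'.  Log records are forced (appended) before messages are sent.

def two_phase_commit(participants):
    log = []
    log.append('<prepare T>')                       # forced to stable storage
    votes = [p('prepare T') for p in participants]  # phase 1: collect votes
    decision = 'commit' if all(v == 'ready' for v in votes) else 'abort'
    log.append(f'<{decision} T>')                   # forced before phase 2
    for p in participants:
        p(f'{decision} T')                          # phase 2: inform everyone
    return decision, log
```

A single 'abort' vote forces a global abort; the logged decision record is what lets a recovering coordinator re-send the outcome.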
2PC is not always appropriate: other transaction models, based on
persistent messaging and workflows, are also used.
Network Types
Local-area networks (LANs) — composed of processors that are
distributed over small geographical areas.
Wide-area networks (WANs) — composed of processors distributed over
large geographical areas. Discontinuous-connection WANs, such as
periodic dial-up (using, e.g., UUCP), are connected only for
part of the time.
Continuous-connection WANs, such as the Internet, have
hosts connected at all times.
End of Chapter
Interconnection Networks
A Distributed System
Local-Area Network
other
Transactions may access data at one or more sites
Data Replication
A relation or fragment of a relation is replicated if it is stored
redundantly in two or more sites.
Data Fragmentation
Division of relation r into fragments r1, r2, ..., rn that contain
sufficient information to reconstruct r.
Horizontal fragmentation: each tuple of r is assigned to one or
more fragments.
Vertical fragmentation: the schema for relation r is split into
several smaller schemas.
Horizontal fragmentation of the account relation:
  branch-name   account-number   balance
  Hillside      A-305            500
  Hillside      A-226            336
  Hillside      A-155            62
account1 = σbranch-name=Hillside(account)
  branch-name   account-number   balance
  Valleyview    A-177            205
  Valleyview    A-402            10000
  Valleyview    A-408            1123
  Valleyview    A-639            750
account2 = σbranch-name=Valleyview(account)
Vertical fragmentation of the employee-info relation:
  branch-name   customer-name   tuple-id
  Hillside      Lowman          1
  Hillside      Camp            2
  Valleyview    Camp            3
  Valleyview    Kahn            4
  Hillside      Kahn            5
  Valleyview    Kahn            6
  Valleyview    Green           7
deposit1 = Πbranch-name, customer-name, tuple-id(employee-info)
  account-number   balance   tuple-id
  A-305            500       1
  A-226            336       2
  A-177            205       3
  A-402            10000     4
  A-155            62        5
  A-408            1123      6
  A-639            750       7
deposit2 = Πaccount-number, balance, tuple-id(employee-info)
Advantages of Fragmentation
Horizontal:
allows tuples to be split so that each part of the tuple is stored where
it is most frequently accessed
tuple-id attribute allows efficient joining of vertical fragments
allows parallel processing on a relation
Vertical and horizontal fragmentation can be mixed.
Data Transparency
Data transparency: Degree to which system user may remain
unaware of the details of how and where the data items are stored
in a distributed system
Consider transparency issues in relation to:
Fragmentation transparency
Replication transparency
Location transparency
Use of Aliases
Alternative to centralized scheme: each site prefixes its own site
Distributed Transactions
Transaction may access data at several sites.
Each site has a local transaction manager responsible for:
Failure of a site.
Loss of messages
Handled by network transmission control protocols such as TCP/IP
Failure of a communication link
Handled by network protocols, by routing messages via
alternative links
Network partition
A network is said to be partitioned when it has been split into
two or more subsystems that lack any connection between them
Note: a subsystem may consist of a single node
Network partitioning and site failures are generally
indistinguishable.
Commit Protocols
Commit protocols are used to ensure atomicity across sites
The commit protocol involves all the local sites at which the
transaction
executed.
Let T be a transaction initiated at site Si, and let the transaction
coordinator at Si be Ci.
Phase 1: Obtaining a Decision — the coordinator asks all
participants to prepare to commit transaction
T:
Ci adds the record <prepare T> to the log and forces the log to stable
storage, then
sends prepare T messages to all sites at which T executed.
Upon receiving the message, the transaction manager at each site determines
whether it can commit T; if so, it adds a <ready T> record
to the log and forces the record onto stable storage. Once the record is in
stable storage it is irrevocable (even if failures occur).
Phase 2: Recording the Decision —
the coordinator sends a message to each participant informing it of
the fate of T.
If T committed, redo (T)
If T aborted, undo (T)
If the log contains no control records concerning T, then Sk failed
before responding to the prepare T message; Ci aborts T, and Sk
executes undo(T).
If the coordinator fails, participating sites must either decide the
fate of T themselves or wait for the
coordinator to recover.
Sites that are not in the partition containing the coordinator think the
coordinator has failed, and execute the protocol to deal with failure
of the coordinator.
No harm results, but sites may still have to wait for decision from
coordinator.
The coordinator and the sites are in the same partition as the
coordinator think that the sites in the other partition have failed,
and follow the usual commit protocol.
Again, no harm results
Instead of <ready T>, write out <ready T, L> L = list of locks held by
T when the log is written (read locks can be omitted).
For every in-doubt transaction T, all the locks noted in the
<ready T, L> log record are reacquired.
After lock reacquisition, transaction processing can resume; the
Three-Phase Commit (3PC) — assumptions:
No network partitioning
At any point, at least one site must be up.
At most K sites (participants as well as coordinator) can fail.
Phase 1: Obtaining a Preliminary Decision — identical to 2PC Phase 1.
3PC has
higher overheads, and its
assumptions may not be satisfied in practice, so we
won't study it further.
Two phase commit would have the potential to block updates on the
accounts involved in funds transfer
Alternative solution:
Debit money from source account and send a message to other
site
Site receives message and credits destination account
Messaging has long been used for distributed transactions (even
before computers were invented!)
Atomicity issue: once the sending transaction commits, the message
must be guaranteed to be delivered, and handled exactly once.
Writing to this relation is treated as any other update, and is undone if the
transaction aborts.
Concurrency Control
Modify concurrency control schemes for use in a distributed
environment.
We assume that each site participates in the execution of a
commit protocol to ensure global transaction atomicity.
Single-Lock-Manager Approach
Simple implementation
Simple deadlock handling
Disadvantages of the scheme are: the lock-manager site becomes a
bottleneck and is vulnerable to failure.
Distributed Lock Manager: lock managers are located at each site;
the advantage is that work is distributed and can be made robust to
failures.
Disadvantage: deadlock detection is more complicated.
Primary copy
Majority protocol
Biased protocol
Quorum consensus
Primary Copy
Choose one replica of data item to be the primary copy.
Site containing the replica is called the primary site for that data
item
Different data items can have different primary sites
When a transaction needs to lock a data item Q, it requests a
lock at the primary site of Q.
Majority Protocol
Local lock manager at each site administers lock and unlock
requests for data items stored at that site; to lock an item Q
replicated at n sites, a transaction must obtain locks at more than
half of those n sites.
Biased Protocol
Local lock manager at each site as in the majority protocol;
however, shared locks need be obtained on only one replica, while
exclusive locks must be obtained on all replicas.
Quorum consensus: assign each site a weight, and choose a read
quorum Qr and a write quorum Qw over the total weight S such that
2·Qw > S and Qr + Qw > S.
Each read must lock enough replicas that the sum of the site
weights is >= Qr
Each write must lock enough replicas that the sum of the site
weights is >= Qw
For now we assume all replicas are written
Deadlock Handling
Consider the following two transactions and history, with item X and
transaction T1 at site 1, and item Y and transaction T2 at site 2:
  T1: write(X)        T2: write(Y)
      write(Y)            write(X)
Schedule leading to a distributed deadlock:
  T1 (site 1)         T2 (site 2)
  X-lock on X
  write(X)
                      X-lock on Y
                      write(Y)
                      wait for X-lock on X
  wait for X-lock on Y
Centralized Approach
A global wait-for graph is constructed and maintained in a single
(Figure: local wait-for graphs at two sites, and the global wait-for
graph obtained by taking their union.)
Example of a false cycle:
1. T2 releases resources at S1,
resulting in a remove T1 → T2 message to the coordinator;
2. T2 then requests a resource held by T3 at another site, resulting
in an add T2 → T3 message.
If the add message arrives before the
delete message
(this can happen due to network delays),
the coordinator would then find a false cycle
  T1 → T2 → T3 → T1
The false cycle above never existed in reality.
False cycles cannot occur if two-phase locking is used.
Unnecessary Rollbacks
Unnecessary rollbacks may result when a deadlock has indeed
occurred and a victim has been picked, but one of the transactions
was meanwhile aborted for reasons unrelated to the deadlock.
Timestamping
Timestamp based concurrency-control protocols can be used in
distributed systems
Each transaction must be given a unique timestamp
Main problem: how to generate a timestamp in a distributed
fashion
Each site generates a unique local timestamp using either a logical
counter or the local clock.
Global unique timestamp is obtained by concatenating the unique
local timestamp with the unique site identifier.
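The concatenation scheme above (local timestamp as the most significant part, site id as the least significant) can be sketched directly; the class name is illustrative:

```python
# Globally unique timestamps: ts = local_counter * n_sites + site_id.
# Ordering is primarily by local counter, with the site id as tiebreaker,
# so two sites can never produce the same timestamp.

class SiteClock:
    def __init__(self, site_id, n_sites):
        self.site_id, self.n_sites, self.counter = site_id, n_sites, 0

    def next_timestamp(self):
        self.counter += 1
        return self.counter * self.n_sites + self.site_id

    def observe(self, ts):
        """Advance a slow logical clock past any larger timestamp seen."""
        self.counter = max(self.counter, ts // self.n_sites)
```

The `observe` method is the fix discussed next: a site with a slow clock catches up whenever it sees a request carrying a larger timestamp.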
Timestamping (Cont.)
A site with a slow clock will assign smaller timestamps, putting
its transactions at a disadvantage; fix this by keeping a logical
clock at each site and advancing it whenever a request with a larger
timestamp is seen.
Lazy replication can serve reads from a transaction-consistent
snapshot of the
database.
That is, a state of the database reflecting all effects of all
transactions up to some point in the serialization order, and no
effects of any later transactions.
E.g. Oracle provides a create snapshot statement to create a
snapshot of a relation, or a set of relations, at a remote site.
Multimaster Replication
With multimaster replication (also called update-anywhere
replication), updates are permitted at any replica of a data item
and are propagated to all replicas.
With lazy propagation, two approaches are used:
Updates at any replica are translated into an update at the primary
site, and then propagated
back to all replicas.
Updates to an item are then ordered serially,
but transactions may read an old value of an item and use it to perform an
update, resulting in non-serializability.
Alternatively, updates are performed at any replica and propagated to all other replicas;
this causes even more serialization problems.
Availability
Availability
High availability: the time for which the system is not fully usable
should be extremely low, despite failures of individual
components.
Failures are more likely in large distributed systems
To be robust, a distributed system must
Detect failures
Reconfigure the system so computation may continue
Recovery/reintegration when a site or link is repaired
Failure detection: distinguishing link failure from site failure is
hard
(partial) solution: have multiple links, multiple link failure is likely a
site failure
Reconfiguration
Reconfiguration:
Reconfiguration (Cont.)
Since a network partition may not be distinguishable from a site
failure, the following situations must be avoided: two or more
central servers elected in distinct partitions, and more than one
partition updating a replicated data item.
Majority-Based Approach
The majority protocol for distributed concurrency control can be
lower version numbers (no need to obtain locks on all replicas for
this task)
Majority-Based Approach
Majority protocol (Cont.)
Write operations
find highest version number like reads, and set new version
number to old highest version + 1
Writes are then performed on all locked replicas and version
number on these replicas is set to new version number
Failures (network and site) cause no problems as long as
Sites at commit contain a majority of replicas of any updated data
items
During reads a majority of replicas are available to find version
numbers
Subject to above, 2 phase commit can be used to update replicas
Note: reads are guaranteed to see latest version of data item
Reintegration is trivial: nothing needs to be done
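A hedged sketch of the read/write rules above (function names and the replica representation are illustrative): each replica carries a version number; reads lock a majority and take the highest version, writes set the new version to one more than the highest seen.

```python
def majority_read(replicas):
    # replicas: list of (value, version) from a majority of sites;
    # the highest version is guaranteed to be the latest committed write
    return max(replicas, key=lambda r: r[1])

def majority_write(replicas, new_value):
    _, highest = max(replicas, key=lambda r: r[1])
    new_version = highest + 1                 # old highest + 1
    # write performed on all locked replicas with the new version
    return [(new_value, new_version) for _ in replicas]

# three of five replicas available: a majority
avail = [("old", 3), ("older", 2), ("old", 3)]
assert majority_read(avail) == ("old", 3)
assert majority_write(avail, "new") == [("new", 4)] * 3
```

Note how reintegration needs no extra work: a recovered replica with a stale version simply loses the `max` comparison until the next write refreshes it.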
Quorum consensus algorithm can be similarly extended
Allows reads to read any one replica but updates require all replicas
to be available at commit time (called read one write all)
Read one write all available (ignoring failed sites) is attractive,
but incorrect
A failed link may come back up without a disconnected site ever
being aware that it was disconnected
The site then has old values, and a read from that site would return
an incorrect value
If site was aware of failure reintegration could have been performed,
but no way to guarantee this
With network partitioning, sites in each partition may update same
item concurrently
believing sites in other partitions have all failed
Site Reintegration
When a failed site recovers, it must catch up with all updates that it missed while it was down
Solution 2: lock all replicas of all data items at the site, update to
latest version, then release locks
Other solutions with better concurrency also available
All actions performed at a single site, and only log records shipped
No need for distributed concurrency control, or 2 phase commit
Using distributed databases with replicas of data items can
Coordinator Selection
Backup coordinators
Bully Algorithm
If site Si sends a request that is not answered by the coordinator
same algorithm.
If there are no active sites with higher numbers, the recovered
site forces all processes with lower numbers to let it become the
coordinator site, even if there is a currently active coordinator
with a lower number.
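A minimal sketch of the bully election, assuming the message primitives are abstracted away into a set of reachable sites: the highest-numbered live site always ends up as coordinator.

```python
def elect(site, alive):
    """site: this site's number; alive: set of currently reachable sites."""
    higher = [s for s in alive if s > site]
    if not higher:
        return site                      # no higher site answers: this site wins
    return elect(max(higher), alive)     # defer to the highest live site

assert elect(2, {1, 2, 4, 5}) == 5       # site 5 bullies its way to coordinator
assert elect(3, {1, 3}) == 3             # 3 is highest alive, becomes coordinator
```

In a real implementation each step is a timed request/response exchange; the recursion here stands in for "the higher site now runs the same algorithm".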
Query Transformation
Translating algebraic queries on fragments.
account relation.
Final strategy is for the Hillside site to return account1 as the result
of the query.
account ⋈ depositor ⋈ branch, with the
result at site S1
Compute temp1 = account ⋈ depositor at S2. Ship temp1 from S2 to S3, and
compute temp2 = temp1 ⋈ branch at S3. Ship the result temp2 to S1.
Devise similar strategies, exchanging the roles of S1, S2, S3
Must consider following factors:
Semijoin Strategy
Let r1 be a relation with schema R1 stored at site S1
temp1 at S2
r2.
Formal Definition
The semijoin of r1 with r2 is denoted by r1 ⋉ r2; it is defined by:
r1 ⋉ r2 = Π R1 (r1 ⋈ r2)
Thus, r1 ⋉ r2 selects those tuples of r1 that contribute to r1 ⋈ r2.
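A small illustrative semijoin over lists of dicts (the relation representation is my own, not from the text): only the join-attribute values of r2 need to be shipped, and the result keeps exactly those r1 tuples that would contribute to the join.

```python
def semijoin(r1, r2, attrs):
    # ship only the join-attribute values of r2, not whole tuples
    keys = {tuple(t[a] for a in attrs) for t in r2}
    return [t for t in r1 if tuple(t[a] for a in attrs) in keys]

r1 = [{"A": 1, "B": "x"}, {"A": 2, "B": "y"}]
r2 = [{"A": 1, "C": 10}]
assert semijoin(r1, r2, ["A"]) == [{"A": 1, "B": "x"}]
```

The communication saving is visible here: `keys` is a projection on the join attributes, typically far smaller than r2 itself.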
r2
r3
shipped to S4 and r3
S2 ships tuples of (r1
Advantages
Preservation of investment in existing
hardware
system software
Applications
Local autonomy and administrative control
Allows use of special-purpose DBMSs
Step towards a unified homogeneous DBMS
Query Processing
Several issues in query processing in a heterogeneous database
Schema translation
information
Decide which sites to execute query
Global query optimization
Mediator Systems
Mediator systems are systems that integrate multiple
Directory Systems
Typical kinds of directory information
White pages
Yellow pages
(RDNs)
E.g. of a DN
c: country
their DNs
Leaf level usually represent specific objects
Internal node entries represent objects such as organizational units,
organizations or countries
Children of a node inherit the DN of the parent, and add on RDNs
E.g. internal node with DN c=USA
LDAP Queries
LDAP query must specify
A scope
Just the base, the base and its children, or the entire subtree
Attributes to be returned
Limits on number of results and on resource consumption
May also specify whether to automatically dereference aliases
LDAP URLs are one way of specifying query
LDAP API is another alternative
LDAP URLs
First part of URL specifies server and DN of base
ldap://aura.research.bell-labs.com/o=Lucent,c=USA
Optional further parts separated by ? symbol
ldap://aura.research.bell-labs.com/o=Lucent,c=USA??sub?cn=Korth
Optional parts specify
1. attributes to return (empty means all)
2. Scope (sub indicates entire subtree)
3. Search condition (cn=Korth)
trees
Suffix of a DIT gives RDN to be tagged onto all entries to get an overall DN
E.g. two DITs, one with suffix o=Lucent, c=USA
o=Lucent, c=India
E.g. Ou= Bell Labs may have a separate DIT, and DIT for o=Lucent may have a
leaf with ou=Bell Labs containing a referral to the Bell Labs DIT
Referrals are the key to integrating a distributed collection of directories
When a server gets a query reaching a referral node, it may either
Forward query to referred DIT and return answer to client, or
Give referral back to client, which transparently sends query to referred DIT
End of Chapter
Extra Slides (material not in book)
1. 3-Phase commit
2. Fully distributed deadlock detection
3. Naming transparency
4. Network topologies
Assumptions:
No network partitioning
At any point, at least one site must be up.
At most K sites (participants as well as coordinator) can fail
Phase 1: Obtaining Preliminary Decision: Identical to 2PC Phase 1.
<precommit T>) in its log and forces the record to stable storage.
Coordinator sends a message to each participant informing it of
the decision
Participant records decision in its log
If abort decision reached then participant aborts locally
If pre-commit decision reached then participant replies with
<acknowledge T>
A site that failed and recovered must ignore any precommit record in its
log when determining its status.
4. Each participating site records and sends its local status to Cnew
Site 1
T1 T2 T3
Site 2
T3 T4 T5
Site 3
T5 T1
deadlock detector
Distributed Deadlock Detection - path pushing algorithm
Path pushing initiated when a site detects a local cycle involving
have:
Site 1
EX(2) T5 T1 T2 T3 EX(2)
Site 2
EX(1) T3 T4 T5 EX(3)
Site 3
EX(2) T5 T1 EX(1)
T2, T3) to site 2 since 5 > 3 and T3 is waiting for a data item, at
site 2
The new state of the local wait-for graph:
Site 1
EX(2) T5 T1 T2 T3 EX(2)
Site 2
T5 T1 T2T3 T4
Deadlock Detected
Site 3
EX(2) T5 T1 EX(1)
Naming of Items
unique name.
Use of postscripts to determine those replicas that are replicas of
the same data item, and those fragments that are fragments of the
same data item.
fragments of same data item: .f1, .f2, ..., .fn
replicas of same data item: .r1, .r2, ..., .rn
site17.account.f3.r2
fragments:
one to be inserted into deposit1
one to be inserted into deposit2
If deposit is replicated, the tuple (Valleyview, A-733, Jones 600)
Network Topologies
Network Topologies
the failure of a single link results in a network partition, but since one
of the partitions has only a single site it can be treated as a single-site failure.
low communication cost
failure of the central site results in every site in the system becoming
disconnected
Robustness
A robust system must:
indistinguishable.
End of Chapter
Figure 19.7
Figure 19.13
Figure 19.14
Introduction
Parallel machines are becoming quite common and affordable
large volumes of transaction data are collected and stored for later
analysis.
multimedia objects like images are increasingly stored in databases
Large-scale parallel database systems increasingly used for:
Parallelism in Databases
Data can be partitioned across multiple disks for parallel I/O.
Individual relational operations (e.g., sort, join, aggregation) can
be executed in parallel
data can be partitioned and each processor can work independently
on its own partition.
Queries are expressed in high level language (SQL, translated to
relational algebra)
makes parallelization easier.
Different queries can be run in parallel with each other.
I/O Parallelism
Reduce the time required to retrieve relations from disk by partitioning
the relations on multiple disks.
Horizontal partitioning: tuples of a relation are divided among many disks such that each tuple resides on one disk.
3. Locating all tuples such that the value of a given attribute lies
within a specified range: range queries.
E.g., 10 ≤ r.A < 25.
Round robin:
Advantages
Hash partitioning:
Can lookup single disk, leaving others available for answering other
queries.
Index on partitioning attribute can be local to disk, making lookup
and update more efficient
No clustering, so difficult to answer range queries
needs to be accessed.
For range queries on partitioning attribute, one to a few disks
If a relation contains only a few tuples which will fit into a single
disks.
If a relation consists of m disk blocks and there are n disks
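The two partitioning schemes contrasted above can be sketched as follows (the tuple representation and function names are illustrative): round-robin spreads tuples evenly regardless of values, while hash partitioning sends all tuples with the same partitioning-attribute value to the same disk.

```python
def round_robin(tuples, n):
    parts = [[] for _ in range(n)]
    for i, t in enumerate(tuples):
        parts[i % n].append(t)                  # i-th tuple to disk i mod n
    return parts

def hash_partition(tuples, n, key):
    parts = [[] for _ in range(n)]
    for t in tuples:
        parts[hash(t[key]) % n].append(t)       # same key value -> same disk
    return parts

rows = [{"acct": i} for i in range(6)]
rr = round_robin(rows, 3)
assert [len(p) for p in rr] == [2, 2, 2]        # round-robin: perfectly balanced
hp = hash_partition(rows, 3, "acct")
assert sum(len(p) for p in hp) == 6             # every tuple lands on some disk
```

Hash partitioning makes point lookups on the partitioning attribute touch one disk, but offers no clustering for range queries, exactly the trade-off listed above.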
Handling of Skew
The distribution of tuples to disks may be skewed; that is,
some disks have many tuples, while others may have fewer
tuples.
Types of skew:
Attribute-value skew.
Some values appear in the partitioning attributes of many tuples;
all the tuples with the same value for the partitioning attribute
end up in the same partition.
Can occur with range-partitioning and hash-partitioning.
Partition skew.
With range-partitioning, badly chosen partition vector may assign
too many tuples to some partitions and too few to others.
Less likely with hash-partitioning if a good hash-function is
chosen.
If any normal partition would have been skewed, it is very likely the
skew is spread over a number of virtual partitions
Skewed virtual partitions get spread across a number of processors,
so work gets distributed evenly!
Interquery Parallelism
Queries/transactions execute in parallel with one another.
Increases transaction throughput; used primarily to scale up a
architectures
Locking and logging must be coordinated by passing messages
between processors.
Data in a local buffer may have been updated at another processor.
Cache-coherency has to be maintained reads and writes of data
in buffer must find latest version of data.
Intraquery Parallelism
read-only queries
shared-nothing architecture
n processors, P0, ..., Pn-1, and n disks D0, ..., Dn-1, where disk Di is
associated with processor Pi.
If a processor has multiple disks they can simply simulate a
Parallel Sort
Range-Partitioning Sort
Choose processors P0, ..., Pm, where m ≤ n - 1, to do sorting.
Create range-partition vector with m entries, on the sorting attributes
Redistribute the relation using range partitioning
all tuples that lie in the ith range are sent to processor Pi
Pi stores the tuples it received temporarily on disk Di.
This step requires I/O and communication overhead.
Each processor Pi sorts its partition of the relation locally.
Each processor executes the same operation (sort) in parallel with the other processors.
Final merge is trivial: range partitioning ensures that, for 1 ≤ i < j ≤ m, the key values in processor Pi are all less than the key values in Pj.
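A compact sketch of range-partitioning sort (partition vector and data are made up): tuples are redistributed by a partition vector, each partition is sorted locally, and concatenating the partitions in range order yields the sorted relation with no real merge step.

```python
import bisect

def range_partition_sort(tuples, vector):
    # vector has m entries -> m + 1 partitions/processors
    parts = [[] for _ in range(len(vector) + 1)]
    for t in tuples:
        # find the range the tuple falls in (the "redistribution" step)
        parts[bisect.bisect_right(vector, t)].append(t)
    for p in parts:
        p.sort()                                 # each processor sorts locally
    # final "merge" is trivial: partitions are already in range order
    return [t for p in parts for t in p]

data = [9, 1, 7, 3, 8, 2]
assert range_partition_sort(data, [3, 6]) == sorted(data)
```

A badly chosen vector (e.g. all data below the first cut) recreates the partition skew discussed earlier, which is why the vector is usually built from a sample of the data.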
Parallel Join
The join operation requires pairs of tuples to be tested to see if
they satisfy the join condition, and if they do, the pair is added to
the join output.
Parallel join algorithms attempt to split the pairs to be tested over
Partitioned Join
For equi-joins and natural joins, it is possible to partition the two input
relations across the processors, and compute the join locally at each
processor.
Let r and s be the input relations, and suppose we want to compute the join r ⋈ s with join condition r.A = s.B.
r and s each are partitioned into n partitions, denoted r0, r1, ..., rn-1 and
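The idea can be sketched in a few lines (names and the dict-based tuples are illustrative): both relations are hash-partitioned on the join attributes, so matching tuples land at the same "processor" and each processor joins only its own partitions.

```python
def partitioned_join(r, s, n, ra, sb):
    out = []
    for i in range(n):                             # simulate processor i
        ri = [t for t in r if hash(t[ra]) % n == i]
        si = [t for t in s if hash(t[sb]) % n == i]
        # local join at processor i (nested loop for simplicity)
        out += [{**tr, **ts} for tr in ri for ts in si if tr[ra] == ts[sb]]
    return out

r = [{"A": 1}, {"A": 2}]
s = [{"B": 2, "C": "z"}]
assert partitioned_join(r, s, 4, "A", "B") == [{"A": 2, "B": 2, "C": "z"}]
```

Correctness rests on using the same partitioning function for both relations: any r-tuple and s-tuple that could join hash to the same partition, so no cross-partition pairs are ever needed.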
Fragment-and-Replicate Join
Partitioning not possible for some join conditions
a. Asymmetric
Fragment and
Replicate
processor.
r is partitioned into n partitions, r0, r1, ..., rn-1; s is partitioned into m
partitions, s0, s1, ..., sm-1.
Any partitioning technique may be used.
There must be at least m * n processors.
Label the processors as
P0,0, P0,1, ..., P0,m-1, P1,0, ..., Pn-1,m-1.
Pi,j computes the join of ri with sj. In order to do so, ri is replicated to
Pi,0, Pi,1, ..., Pi,m-1, while si is replicated to P0,i, P1,i, ..., Pn-1,i
Any join technique can be used at each processor Pi,j.
relation.
A hash function h1 takes the join attribute value of each tuple in s
processors.
Can also partition the tuples (using either range- or hash-partitioning) and perform duplicate elimination locally at each
processor.
Projection
Grouping/Aggregation
Partition the relation on the grouping attributes and then compute
partitioning.
Interoperator Parallelism
Pipelined parallelism
r2
r3
r4
temp1 = r1
r2
r3
r4
results to disk
Useful with small number of processors, but does not scale up
Independent Parallelism
Independent parallelism
Query Optimization
Query optimization in parallel databases is significantly more complex
How to parallelize each operation and how many processors to use for it.
What operations to pipeline, what operations to execute independently in
parallel, and what operations to execute sequentially, one after the other.
Determining the amount of resources to allocate for each operation is a
problem.
E.g., allocating more processors than optimal can result in high
communication overhead.
Long pipelines should be avoided as the final operation may wait a lot
supported.
For example, index construction on terabyte databases can take
hours or days even on a parallel system.
Need to allow other processing (insertions/deletions/updates) to
End of Chapter
Chapter 21:
Application Development and
Administration
Overview
interface to databases
Enable large numbers of users to access databases from
anywhere
Avoid the need for downloading/installing specialized code, while
providing a good graphical user interface
E.g.: Banks, Airline/Car reservations, University course
registration/grading,
http://www.bell-labs.com/topics/book/db-book
The first part indicates how the document is to be accessed
http indicates that the document is to be accessed using the
Hyper Text Transfer Protocol.
The second part gives the unique name of a machine on the
Internet.
The rest of the URL identifies the document within the machine.
The local identification can be:
The path name of a file on the machine, or
An identifier (path name) of a program, plus arguments to be
features.
HTML also provides input features
Select from a set of options
Text boxes
</table>
<center> The <i>account</i> relation </center>
<form action=BankQuery method=get>
Select account/loan and enter number <br>
<select name=type>
<option value=account selected> Account
<option value=Loan> Loan
</select>
<input type=text size=5 name=number>
<input type=submit value=submit>
</form>
</body> </html>
Web Servers
A Web server can easily serve as a front end to a variety of
information services.
The document name in a URL may identify an executable
That is, once the server replies to a request, the server closes the
connection with the client, and forgets all about the request
In contrast, Unix logins, and JDBC/ODBC connections stay
connected until the client disconnects
retaining user authentication and other information
connections on a machine
Information services need session information
Servlets
Java Servlet specification defines an API for communication
server
Two-tier model
Each request spawns a new thread in the Web server
thread is closed once the request is serviced
Server-Side Scripting
Server-side scripting simplifies the task of connecting a database
to the Web
Define an HTML document with embedded executable code/SQL
queries.
Input values from HTML forms can be used directly in the
embedded code/SQL queries.
When the document is requested, the Web server executes the
embedded code/SQL queries to generate the actual HTML
document.
Numerous server-side scripting languages
Performance Tuning
Performance Tuning
Adjusting various parameters and design choices to improve
Bottlenecks
Performance of most systems (at least before they are tuned)
is idle), or in software
Removing one bottleneck often exposes another
De-bottlenecking consists of repeatedly finding bottlenecks, and
removing them
This is a heuristic
Identifying Bottlenecks
Transactions request a sequence of services
service
transactions repeatedly do the following
request a service, wait in queue for the service, and get serviced
Bottlenecks in a database system typically show up as very high
Tunable Parameters
Tuning of hardware
Tuning of schema
Tuning of indices
Tuning of materialized views
Tuning of transactions
Tuning of Hardware
Even well-tuned transactions typically require a few I/O
operations
Typical disk supports about 100 random I/O operations per second
Suppose each transaction requires just 2 random I/O operations.
Then to support n transactions per second, we need to stripe data
across n/50 disks (ignoring skew)
Number of I/O operations per transaction can be reduced by
Keeping a page in memory pays off if the cost of the disk accesses it saves exceeds the cost of the memory, i.e. if
n * (price-per-disk-drive / accesses-per-second-per-disk) > price-per-MB-of-memory / pages-per-MB-of-memory
where n is the number of accesses per second to the page.
Solving the above inequality with current disk and memory prices leads to a rule of thumb for which pages to keep in memory.
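As an illustration of this break-even comparison (all prices below are purely hypothetical, chosen only to make the arithmetic concrete):

```python
def keep_in_memory(n, disk_price, ios_per_sec, mem_price_mb, pages_per_mb):
    # cost of the disk capacity needed to serve n accesses/sec to the page
    disk_cost = n * disk_price / ios_per_sec
    # cost of the memory needed to cache the page
    mem_cost = mem_price_mb / pages_per_mb
    return disk_cost > mem_cost

# e.g. a $250 disk doing 100 I/O/sec vs $0.10/MB memory, 256 pages per MB
assert keep_in_memory(1.0, 250, 100, 0.10, 256) is True       # hot page: cache it
assert keep_in_memory(0.0001, 250, 100, 0.10, 256) is False   # cold page: leave on disk
```

As memory prices fall relative to disk access rates, the break-even access frequency drops, so progressively colder pages become worth caching.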
workload
RAID 5 may require more disks than RAID 1 to handle load!
Apparent saving of number of disks by RAID 5 (by using parity, as
opposed to the mirroring done by RAID 1) may be illusory!
Thumb rule: RAID 5 is fine when writes are rare and data is very
(the workload) and recommend which indices would be best for the
workload
Space
Time for view maintenance
Immediate view maintenance:done as part of update txn
time overhead paid by update transaction
Deferred view maintenance: done only when required
update transaction is not affected, but system time is spent
on view maintenance
until updated, the view may be out-of-date
Preferable to denormalized schema since view maintenance
to materialize
Materialized view selection wizards
Tuning of Transactions
Basic approaches to tuning of transactions
Performance Simulation
Performance simulation using queuing model useful to predict
level details
Model service time, but disregard details of service
E.g. approximate disk read time by using an average disk read time
Experiments can be run on model, and provide an estimate of
system
E.g. number of disks, memory, algorithms, etc
Performance Benchmarks
Performance Benchmarks
Suites of tasks used to quantify the performance of software
systems
Important in comparing database systems, especially as systems
types
E.g., suppose a system runs transaction type A at 99 tps and transaction
type B at 1 tps.
Given an equal mixture of types A and B, throughput is not (99+1)/2 =
50 tps.
Running one transaction of each type takes time 1+.01 seconds, giving a
throughput of 1.98 tps.
To compute average throughput, use the harmonic mean:
n / (1/t1 + 1/t2 + ... + 1/tn)
Interference (e.g. lock contention) makes even this incorrect if
different transaction types run concurrently
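The harmonic-mean computation above is a one-liner; the example numbers from the slide check out:

```python
def harmonic_mean_tps(rates):
    # n / (1/t1 + 1/t2 + ... + 1/tn)
    return len(rates) / sum(1.0 / t for t in rates)

# 99 tps and 1 tps: one transaction of each takes 1/99 + 1 seconds,
# so overall throughput is about 1.98 tps, not 50 tps
tps = harmonic_mean_tps([99, 1])
assert abs(tps - 1.98) < 0.01
```

The arithmetic mean (50 tps) would wildly overstate throughput, because the slow transaction type dominates elapsed time.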
classes
E.g. Teradata is tuned to decision support
Others try to balance the two requirements
Benchmarks Suites
The Transaction Processing Council (TPC) benchmark suites
increasing transactions-per-second
reflects real world applications where more customers means more
database size and more transactions-per-second
External audit of TPC performance numbers mandatory
Other Benchmarks
OODB transactions require a different set of benchmarks.
Standardization
Standardization
The complexity of contemporary database systems and the need
Standardization (Cont.)
Anticipatory standards lead the market place, defining features
standard)
Defines levels of compliance (entry, intermediate and full)
Even now few database vendors have full SQL-92 implementation
extensions
SQL/Bindings (Part 5): embedded SQL for different embedding
languages
interconnectivity
based on Call Level Interface (CLI) developed by X/Open consortium
defines application programming interface, and SQL features that
must be supported at different levels of compliance
JDBC standard used for Java
X/Open XA standards define transaction management standards
functionality
object-oriented databases
version 1 in 1993 and version 2 in 1997, version 3 in 2000
provides language independent Object Definition Language (ODL)
as well as several language specific bindings
Object Management Group (OMG) standard for distributed
XML-Based Standards
Several XML based Standards for E-commerce
E-Commerce
E-Commerce
E-commerce is the process of carrying out various activities
E-Catalogs
Product catalogs must provide searching and browsing facilities
Marketplaces
Marketplaces help in negotiating the price of a product when there
Reverse auction
Auction
Exchange
Real world marketplaces can be quite complicated due to product
differentiation
Database issues:
Authenticate bidders
Record buy/sell bids securely
Communicate bids quickly to participants
Delays can lead to financial loss to some participants
Need to handle very large volumes of trade at times
E.g. at the end of an auction
Types of Marketplace
Reverse auction system: single buyer, multiple sellers.
Order Settlement
Order settlement: payment for goods and delivery
Insecure means for electronic payment: send credit card number
check payment
Digital Cash
Credit-card payment does not provide anonymity
physical cash
E.g. DigiCash
Based on encryption techniques that make it impossible to find out
who purchased digital cash from the bank
Digital cash can be spent by purchaser in parts
much like writing a check on an account whose owner is
anonymous
Legacy Systems
Legacy Systems
Legacy systems are older-generation systems that are incompatible
is problematic
Very expensive, since legacy system may involve millions of lines of
code, written over decades
Original programmers usually no longer available
Switching over from old system to new system is a problem
more on this later
One approach: build a wrapper layer on top of legacy application to
system
Improvements are made on existing system design in this process
End of Chapter
Overview
Temporal Data
Spatial and Geographic Databases
Multimedia Databases
Mobility and Personal Databases
Time In Databases
While most databases tend to model reality at a point in time (at
normalized fashion.
Represent a line segment by the coordinates of its endpoints.
Approximate a curve by partitioning it into a sequence of segments
List of vertices in order, starting vertex is the same as the ending vertex, or
Represent boundary edges as separate tuples, with each containing
identifier of the polygon, or
Use triangulation: divide the polygon into triangles
Note the polygon identifier with each of its triangles.
Design Databases
Represent design components as objects (generally
rectangles, polygons.
Complex two-dimensional objects: formed from simple
E.g., pipes should not intersect, wires should not be too close to
each other, etc.
Geographic Data
Raster data consist of bit maps or pixel maps, in two or more
dimensions.
Example 2-D raster image: satellite image of cloud cover, where
each pixel stores the cloud visibility in a particular area.
Additional dimensions might include the temperature at different
altitudes at different regions, or measurements taken at different
points in time.
Design databases generally do not store raster data.
Spatial Queries
Nearness queries request objects that lie near a specified
location.
Nearest neighbor queries, given a point or an object, find
dimensions.
Each level of a k-d tree partitions the space into two.
choose one dimension for partitioning at the root level of the tree.
choose another dimension for partitioning in nodes at the next level,
and so on, cycling through the dimensions.
In each node, approximately half of the points stored in the subtree
fall on each side; partitioning stops when a node holds less than a
given maximum number of points.
The k-d-B tree extends the k-d tree to allow multiple child nodes
correspondingly each such node has four child nodes corresponding to the four
quadrants and so on
Leaf nodes have between zero and some fixed maximum number of points
(set to 1 in example).
Quadtrees (Cont.)
PR quadtree: stores points; space is divided based on regions,
A node is a leaf node if all the array values in the region that it
covers are the same. Otherwise, it is subdivided further into four
children of equal area, and is therefore an internal node.
Each node corresponds to a sub-array of values.
The sub-arrays corresponding to leaves either contain just a single
array element, or have multiple array elements, all of which have
the same value.
Extensions of k-d trees and PR quadtrees have been proposed
R-Trees
R-trees are an N-dimensional extension of B+-trees, useful for
R Trees (Cont.)
Example R-Tree
A set of rectangles (solid line) and the bounding boxes (dashed line) of the
nodes of an R-tree for the rectangles. The R-tree is shown on the right.
Search in R-Trees
To find data items (rectangles/polygons) intersecting
(overlaps) a given query point/region, do the following,
starting from the root node:
If the node is a leaf node, output the data items whose keys
intersect the given query point/region.
Else, for each child of the current node whose bounding box
overlaps the query point/region, recursively search the child
Can be very inefficient in worst case since multiple paths may
need to be searched
but works acceptably in practice.
Simple extensions of search procedure to handle predicates
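The search procedure above can be sketched directly (the node/box representation is my own; boxes are `(xmin, ymin, xmax, ymax)` tuples): descend only into children whose bounding boxes overlap the query region, and let leaves emit intersecting data items.

```python
def overlaps(a, b):
    # axis-aligned box intersection test
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]

def rtree_search(node, query, out):
    if node["leaf"]:
        out += [item for box, item in node["entries"] if overlaps(box, query)]
    else:
        for box, child in node["entries"]:
            if overlaps(box, query):         # multiple children may qualify,
                rtree_search(child, query, out)  # hence possible multi-path descent
    return out

leaf = {"leaf": True, "entries": [((0, 0, 1, 1), "a"), ((5, 5, 6, 6), "b")]}
root = {"leaf": False, "entries": [((0, 0, 6, 6), leaf)]}
assert rtree_search(root, (0, 0, 2, 2), []) == ["a"]
```

The worst-case inefficiency mentioned above shows up when many internal bounding boxes overlap the query, forcing the recursion down several subtrees that contribute no results.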
Insertion in R-Trees
To insert a data item:
Goal: divide entries of an overfull node into two sets such that the
bounding boxes have minimum total area
This is a heuristic. Alternatives like minimum overlap are
possible
Finding the best split is expensive, use heuristics instead
See next slide
as follows
1. Find pair of entries with maximum separation
that is, the pair such that the bounding box of the two would
have the maximum wasted space (area of the bounding box minus the sum
of the areas of the two entries)
2. Place these entries in two new nodes
3. Repeatedly find the entry with maximum preference for one of the
two new nodes, and assign the entry to that node
Preference of an entry to a node is the increase in area of
bounding box if the entry is added to the other node
4. Stop when half the entries have been added to one node
Then assign remaining entries to the other node
Cheaper linear split heuristic works in time linear in number of
entries,
Cheaper but generates slightly worse splits.
Deleting in R-Trees
Deletion of an entry in an R-tree done much like a B+-tree
deletion.
In case of underfull node, borrow entries from a sibling if possible,
else merging sibling nodes
Alternative approach removes all entries from the underfull node,
deletes the node, then reinserts all entries
Multimedia Databases
Multimedia Databases
To provide such database functions as indexing and
index structures.
Must provide guaranteed steady retrieval rates for
continuous-media data.
JPEG and GIF the most widely used formats for image data.
MPEG standard for video data use commonalties among a
sequence of frames to achieve a greater degree of
compression.
MPEG-1 quality comparable to VHS video tape.
Continuous-Media Data
Most important types are video and audio data.
Characterized by high data volumes and real-time information-
delivery requirements.
Data must be delivered sufficiently fast that there are no gaps in the
audio or video.
Data must be delivered at a rate that does not cause overflow of
system buffers.
Synchronization among distinct data streams must be maintained
video of a person speaking must show lips moving
Video Servers
Video-on-demand systems deliver video from central video
Similarity-Based Retrieval
Examples of similarity based retrieval
Pictorial data: Two pictures or images that are slightly different as
Mobility
GIS queries
Techniques to track locations of large numbers of mobile hosts
Broadcast data can enable any number of clients to receive the
User time.
Communication cost
Connection time - used to assign monetary charges in
peak periods
Broadcast Data
Mobile support stations can broadcast frequently-requested data
Allows mobile hosts to wait for needed data, rather than having to
consume energy transmitting a request
Supports mobile hosts without transmission capability
A mobile host may optimize energy costs by determining if a query
will be broadcast.
a changeable schedule.
For changeable schedule: the broadcast schedule must itself be
broadcast at a well-known radio frequency and at well-known time
intervals
Data reception may be interrupted by noise
disconnection.
Problems created if the user of the mobile host issues
Mobile Updates
Partitioning via disconnection is the normal mode of operation in
mobile computing.
For data updated by only one mobile host, simple to propagate
3. If there is a pair of hosts k and m such that Vd,i[k] < Vd,j[k] and
Vd,i[m] > Vd,j[m], then the copies are inconsistent
That is, two or more updates have been performed
on d independently.
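A minimal sketch of this version-vector comparison (the dict representation of vectors is my own): two copies are inconsistent exactly when each vector exceeds the other on some host.

```python
def compare(vi, vj):
    # vi, vj: version vectors mapping host -> update count, same key set
    if vi == vj:
        return "identical"
    if all(vi[k] <= vj[k] for k in vi):
        return "j newer"              # vj dominates: copy j is more recent
    if all(vj[k] <= vi[k] for k in vi):
        return "i newer"
    return "inconsistent"             # independent updates on both copies

assert compare({"k": 1, "m": 2}, {"k": 2, "m": 2}) == "j newer"
assert compare({"k": 1, "m": 3}, {"k": 2, "m": 2}) == "inconsistent"
```

Only the "inconsistent" case requires conflict resolution; in the dominated cases the older copy can simply be overwritten by the newer one.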
End of Chapter
Transaction-Processing Monitors
Transactional Workflows
High-Performance Transaction Systems
TP Monitor Architectures
Workflow Systems
Transactional Workflows
Workflows are activities that involve the coordinated execution of
Examples of Workflows
Transactional Workflows
Must address following issues to computerize a workflow.
constituent tasks, and the states (i.e. values) of all variables in the
execution plan.
Workflow Specification
Static specification of task coordination:
Tasks and dependencies among them are defined before the execution
of the workflow starts.
Can establish preconditions for execution of each task: tasks are
executed only when their preconditions are satisfied.
Defined preconditions through dependencies:
Execution states of other tasks.
Failure-Atomicity Requirements
Usual ACID transactional requirements are too strong; instead, ensure that every execution of the
workflow will terminate in a state that satisfies the failure-atomicity requirements defined by the designer.
Committed - objectives of a workflow have been achieved.
Aborted - valid termination state in which a workflow has failed to
achieve its objectives.
A workflow must reach an acceptable termination state even
Execution of Workflows
Workflow management systems include:
Scheduler - program that process workflows by submitting
entity.
Mechanism to query to state of the workflow system.
workflow.
Fully distributed - has no scheduler, but the task agents coordinate
Workflow Scheduler
Ideally scheduler should execute a workflow only after
Recovery of a Workflow
Ensure that if a failure occurs in any of the workflow-
High Performance
Transaction Systems
Main-Memory Database
Commercial 64-bit systems can support main memories of
tens of gigabytes.
Memory resident data allows faster processing of
transactions.
Disk-related limitations:
Group Commit
Idea: Instead of performing output of log records to stable
a design
Subtasks: support partial rollback
Recoverability: on crash state should be restored even
Long-Duration Transactions
Represent as a nested transaction
abort.
Active long-duration transactions resume once any short
possibility of aborts.
Need alternatives to waits and aborts; alternative techniques
Concurrency Control
Correctness without serializability:
transactions.
Lowest level of nesting: standard read and write operations.
Nesting can create higher-level operations that may enhance
concurrency.
Types of nested/ multilevel transactions:
Example of Nesting
Rewrite transaction T1 using subtransactions Ta and Tb
Compensating Transactions
Alternative to undo operation; compensating transactions deal
Implementation Issues
For long-duration transactions to survive system crashes, we
must log not only changes to the database, but also changes to
internal system data pertaining to these transactions.
Logging of updates is made more complex by physically large
Transaction Management in
Multidatabase Systems
Transaction management is complicated in multidatabase
Solutions:
Transaction Management
Local transactions are executed by each local DBMS, outside of
Two-Level Serializability
DBMS ensures local serializability among its local transactions,
a data item at one site depends on a value that it read for a data
item on another site.
For strong correctness: No transaction may have a value
dependency.
Global-read-write/local-read protocol; Local transactions have
Global Serializability
Even if no information is available concerning the structure of the
undirected cycles.
Ensures global transactions are serialized at each site,
global transactions.
End of Chapter
T3:  lock-S (Q)
     read (Q)
     unlock (Q)
T4:  lock-X (Q)
     read (Q)
     write (Q)
     unlock (Q)
Cursor Stability
Form of degree-two consistency designed for programs that iterate
over tuples of a relation by using cursors. Cursor stability ensures that
The tuple that is currently being processed by the iteration is
locked in shared mode.
Any modified tuples are locked in exclusive mode until the
transaction commits.
Used on heavily accessed relations as a means of increasing
concurrency and improving performance; appropriate only for
situations with simple consistency constraints.
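The locking behavior described above can be sketched as follows (a minimal sketch with a hypothetical lock-tracking class, not a real lock manager): the S lock follows the cursor, while X locks on modified tuples persist to commit.

```python
# Hypothetical cursor-stability sketch: only the tuple currently under
# the cursor is S-locked; modified tuples stay X-locked until commit.
class CursorStabilityScan:
    def __init__(self):
        self.s_locked = None     # the single tuple S-locked by the cursor
        self.x_locked = set()    # X locks held until commit

    def advance(self, tuple_id):
        self.s_locked = tuple_id        # S lock moves with the cursor
        return tuple_id                 # (previous tuple's S lock is released)

    def modify(self, tuple_id):
        self.x_locked.add(tuple_id)     # upgraded lock held to commit

    def commit(self):
        held = self.x_locked | ({self.s_locked} if self.s_locked else set())
        self.s_locked, self.x_locked = None, set()
        return held                     # all locks released at commit

scan = CursorStabilityScan()
scan.advance("t1")   # S lock on t1
scan.advance("t2")   # cursor moves; t1's S lock is gone
scan.modify("t2")    # t2 now X-locked until commit
```

Because earlier tuples are unlocked as the cursor advances, another transaction may change them mid-scan; this is exactly the weaker-than-serializable guarantee of degree-two consistency.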
Basic Concepts
Data are represented by collections of records.
end
Data-Structure Diagrams
Schema representing the design of a network database.
diagram.
General Relationships
To represent an E-R relationship of degree 3 or higher, connect
DBTG Sets
The structure consisting of two record types that are linked
has precisely one owner, and has zero or more member records.
No member record of a set can participate in more than one
Repeating Groups
Provide a mechanism for a field to have a set of values rather
customer (customer-name)
customer-address (customer-street, customer-city)
The following diagrams represent these sets without the
repeating-group construct.
DBTG Variables
Record Templates
Currency pointers
Example Schema
branch.
Six currency pointers
Three pointers for record types: one each to the most recently
accessed customer, account, and branch record
Two pointers for set types: one to the most recently accessed
record in an occurrence of the set depositor, one to the most
recently accessed record in an occurrence of the set account-branch
One run-unit pointer.
Status flags: four variables defined previously
Following diagram shows an example program work area state.
currency pointers
get copies of the record to which the current of run-unit points
Predicates
For queries in which a field value must be matched with a
database.
To create a new record of type <record type>
make the currency pointer of that type point to the record in the
database to be deleted
delete that record by executing
erase <record type>
Delete an entire set occurrence by finding the owner of the set
and executing
erase all <record type>
Deletes the owner of the set, as well as all the set's members.
If a member of the set is an owner of another set, the members of
that second set also will be deleted.
erase all is recursive.
Insert a new record into a set by executing the connect statement:
connect <record type> to <set-type>
Remove a record from a set by executing the disconnect
statement.
disconnect <record type> from <set-type>
manual:
automatic:
branch.
Set insertion is manual for set type depositor and is automatic
for set type account-branch.
Set Ordering
Set ordering is specified by a programmer when the set is defined:
order is <order-mode>
first. A new record is inserted in the first position; the set is in
reverse chronological ordering.
last. A new record is inserted in the final position; the set is in
chronological ordering.
next. Suppose that the currency pointer of <set-type> points to
record X.
If X is a member type, a new record is inserted in the next position
following X.
If X is an owner type, a new record is inserted in the first position.
it is associated.
Example data-structure diagram and corresponding database.
Figure missing
records.
Rather than using multiple pointers in the customer record, we can
Sample Database
DBTG Set
A customer Record
Basic Concepts
A hierarchical database consists of a collection of records which
data value.
A link is an association between precisely two records.
The hierarchical model differs from the network model in that the
Tree-Structure Diagrams
The schema for a hierarchical database consists of
General Structure
diagrams.
single instance of a database tree
The root of this tree is a dummy node
The children of that node are actual instances of the appropriate
record type
When transforming E-R diagrams to corresponding tree-structure
Single Relationships
customer.
In T2, create account-customer, a many-to-one link from
customer to account.
Sample Database
General Relationships
Example ternary E-R diagram and corresponding tree-structure
Several Relationships
To correctly transform an E-R diagram with several relationships,
diagrams.
several accounts.
An account may belong to only one customer, and a customer
Example Schema
Record templates
Currency pointers
Status flag
A particular program work area is associated with precisely one
application program.
Example program work area:
be searched.
State of the program work area after executing get command to
<record-type>
Set currency pointer to that record
Copy its contents into the appropriate work-area template.
If no such record exists in the tree, then the search fails, and
Example Queries
Print the address of customer Fleming:
get first customer
where customer.customer-name = Fleming;
print (customer.customer-address);
Print an account belonging to Fleming with a balance greater
than $10,000:
get first account
where customer.customer-name = Fleming
and account.balance > 10000;
if DB-status = 0 then print (account.account-number);
<condition>.
If the where clause is omitted, then the next record of type
Example Query
Print the account numbers of all the accounts that have a balance
greater than $10,000.
record that was located with either get first or get next.
Locates the next record (in preorder) that satisfies <condition> in
Example Query
Print the total balance of all accounts belonging to Boyd:
sum := 0;
get first customer
where customer.customer-name = Boyd;
get next within parent account;
while DB-status = 0 do
begin
sum := sum + account.balance;
get next within parent account;
end
print (sum);
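A Python analogue of the loop above may make the traversal clearer (the in-memory `db` mapping and record layout here are invented for illustration; in DL/I the children are reached via repeated "get next within parent" calls):

```python
# Hypothetical in-memory stand-in for the hierarchical database:
# each customer (parent) maps to its list of account child records.
db = {
    "Boyd":  [{"account-number": "A-101", "balance": 500},
              {"account-number": "A-201", "balance": 900}],
    "Perry": [{"account-number": "A-215", "balance": 700}],
}

def total_balance(customer_name):
    total = 0
    # iterating over the parent's children mimics the
    # "get next within parent account" loop until DB-status != 0
    for account in db.get(customer_name, []):
        total += account["balance"]
    return total

print(total_balance("Boyd"))  # 1400
```

As in the DL/I version, the loop terminates when the parent has no further account children, which corresponds to DB-status becoming nonzero.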
We exit from the while loop and print out the value of sum only
Update Facility
Various mechanisms are available for updating information in the
database.
Creation and deletion of records (via the insert and delete
operations).
Modification (via the replace operation) of the content of existing
records.
Example Queries
Add a new customer, Jackson, to the Seashore branch:
customer.customer-name := Jackson;
customer.customer-street := Old Road;
customer.customer-city := Queens;
insert customer
where branch.branch-name = Seashore;
Create a new account numbered A-655 that belongs to customer
Jackson;
account.account-number := A-655;
account.balance := 100;
insert account
where customer.customer-name = Jackson;
that record into the work-area template for <record type>, and
change the desired fields in that template.
Reflect the changes in the database by executing
replace
replace does not have <record type> as an argument; the record
Example Query
Change the street address of Boyd to Northview:
Deletion of a Record
To delete a record of type <record type>, set the currency pointer
Virtual Records
For many-to-many relationships, record replication is necessary
single copy of that record is kept in one of the trees and all other
records are replaced with a virtual record.
Let R be a record type that is replicated in T1, T2, . . ., Tn. Create
Sample Database
data-definition language.
Designer defines a physical hierarchy as the database schema.
Can define several subschemas (or views) by constructing a logical
hierarchy from the record types constituting the schema.
Options such as block sizes, special pointer fields, and so on, allow
the database administrator to tune the system.
Sample Database
Appendix C:
Advanced Normalization Theory
Reasoning with MVDs
Higher normal forms
DKNF
We can use the following inference rules to reason about
multivalued dependencies:
1. Reflexivity rule. If α is a set of attributes and β ⊆ α, then α → β holds.
2. Augmentation rule. If α → β holds and γ is a set of attributes,
then γα → γβ holds.
3. Transitivity rule. If α → β holds and β → γ holds, then α → γ holds.
4. Complementation rule. If α →→ β holds, then α →→ R − β − α holds.
5. Multivalued augmentation rule. If α →→ β holds, γ ⊆ R, and δ ⊆ γ,
then γα →→ δβ holds.
6. Multivalued transitivity rule. If α →→ β holds and β →→ γ holds,
then α →→ γ − β holds.
7. Replication rule. If α → β holds, then α →→ β holds.
8. Coalescence rule. If α →→ β holds, γ ⊆ β, and there is a δ such
that δ ⊆ R, δ ∩ β = ∅, and δ → γ, then α → γ holds.
The following rules can be derived from the above:
Multivalued union rule. If α →→ β holds and α →→ γ holds, then
α →→ βγ holds.
Intersection rule. If α →→ β holds and α →→ γ holds, then
α →→ β ∩ γ holds.
Difference rule. If α →→ β holds and α →→ γ holds, then
α →→ β − γ holds.
Example
R = (A, B, C, G, H, I)
D = {A →→ B, B →→ HI, CG → H}
Some members of D+:
A →→ CGHI. Since A →→ B, the complementation rule (4) implies that
A →→ R − B − A. Since R − B − A = CGHI, A →→ CGHI.
A →→ HI. Since A →→ B and B →→ HI, the multivalued transitivity
rule (6) implies that A →→ HI − B. Since HI − B = HI, A →→ HI.
Example (Cont.)
Some members of D+ (cont.):
B → H. Apply the coalescence rule (8); B →→ HI holds.
Since H ⊆ HI and CG → H and CG ∩ HI = ∅, the
coalescence rule is satisfied with α being B, β being HI, δ being CG,
and γ being H. We conclude that B → H.
A →→ CG. We have A →→ CGHI and A →→ HI.
By the difference rule, A →→ CGHI − HI.
Since CGHI − HI = CG, A →→ CG.
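Two of the derivations above reduce to set arithmetic, which can be checked mechanically (a small illustrative sketch; the rule-function names are invented here):

```python
# Check the set arithmetic behind the example derivations:
# complementation gives A ->> R - B - A, difference gives A ->> CGHI - HI.
R = set("ABCGHI")

def complementation(alpha, beta, rel):
    # If alpha ->> beta, then alpha ->> R - beta - alpha.
    return rel - beta - alpha

def difference(beta1, beta2):
    # If alpha ->> beta1 and alpha ->> beta2, then alpha ->> beta1 - beta2.
    return beta1 - beta2

cghi = complementation({"A"}, {"B"}, R)   # right side of A ->> CGHI
cg = difference(cghi, {"H", "I"})         # right side of A ->> CG
print(sorted(cghi), sorted(cg))
```

This confirms R − B − A = CGHI and CGHI − HI = CG, matching the conclusions drawn from rules (4) and the difference rule.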
A join dependency *(R1, R2) is equivalent to the multivalued
dependency R1 ∩ R2 →→ R2. Conversely, α →→ β is equivalent to
*(α ∪ (R − β), α ∪ β).
However, there are join dependencies that are not equivalent to any
multivalued dependency.
Example
Consider Loan-info-schema = (branch-name, customer-name,
loan-number, amount).
Each loan has one or more customers, is in one or more branches,
and has only one amount. Therefore, the following join dependency
holds:
*((loan-number, branch-name), (loan-number, customer-name),
(loan-number, amount))
Loan-info-schema is not in PJNF with respect to the set of
dependencies containing the above join dependency. To put
Loan-info-schema into PJNF, we must decompose it into the three
schemas specified by the join dependency:
(loan-number, branch-name)
(loan-number, customer-name)
(loan-number, amount)
Example
Accounts whose account-number begins with the digit 9 are
DKNF design:
Regular-acct-schema = (branch-name, account-number, balance)
Special-acct-schema = (branch-name, account-number, balance)
Domain constraints for {Special-acct-schema} require that for
each account:
denote the domain of attribute Ai, and let all these domains be
infinite. Then all domain constraints D are of the form Ai dom
(Ai ).
Let the general constraints be a set G of functional, multivalued,
D, K, and G.
Failure Modes:
transaction/process, system and media/device
or
insert K in B-tree B
Recovery independence
can recover some pages separately from others
acquisition/release
Lock requests
conditional, instant duration, manual duration, commit duration
Buffer Manager
Fix, unfix and fix_new (allocate and fix new pg)
ARIES uses a steal policy - uncommitted writes may be reflected
on disk
dirty pages written out in a continuous manner to disk
Some Notation
rollback
one CLR for each normal log record which is undone
Normal Processing
Transactions add log records
Checkpoints are performed periodically
contains
Active transaction list,
LSN of most recent log records of transaction, and
List of dirty pages in the buffer (and their recLSNs)
Recovery Phases
Analysis pass
forward from last checkpoint
Redo pass
forward from RedoLSN, which is determined in analysis pass
Undo pass
backwards from end of log, undoing incomplete transactions
Analysis Pass
RedoLSN = min(LSNs of dirty pages recorded
in checkpoint)
if no dirty pages, RedoLSN = LSN of checkpoint
pages dirtied later will have higher LSNs)
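The RedoLSN computation described above is small enough to sketch directly (function and parameter names here are illustrative, not ARIES terminology beyond RedoLSN/RecLSN themselves):

```python
# Sketch of the analysis-pass RedoLSN rule: redo starts at the minimum
# RecLSN in the checkpoint's dirty page table, or at the checkpoint's
# own LSN if no pages were dirty.
def compute_redo_lsn(checkpoint_lsn, dirty_pages):
    # dirty_pages: page_id -> RecLSN recorded in the checkpoint
    if not dirty_pages:
        return checkpoint_lsn
    return min(dirty_pages.values())

print(compute_redo_lsn(100, {"P1": 40, "P2": 75}))  # 40
print(compute_redo_lsn(100, {}))                    # 100
```

Starting from the minimum RecLSN is safe because any page dirtied later necessarily has a higher RecLSN, so no needed update precedes that point in the log.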
Redo Pass
Undo Pass
Single scan backwards in log, undoing actions of
"loser" transactions
for each transaction, when a log record is found, use prev_LSN
fields to find next record to be undone
can skip parts of the log with no records from loser transactions
don't perform any undo for CLRs (note: UndoNxtLSN for a CLR
indicates the next record to be undone; intermediate records of that
transaction can be skipped)
Transaction Table
During recovery:
initialized during analysis pass from most recent checkpoint
modified during analysis as log records are encountered, and
during undo
generated)
if page is not dirty, store L as RecLSN of the page in dirty pages
table
Updates
Page latch held in X mode until log record is logged
so updates on same page are logged in correct order
page latch held in S mode during reads since records may get
moved around by update
latch required even with page locking if dirty reads are allowed
Updates (Contd.)
Protocol to avoid deadlock involving latches
deadlocks involving latches and locks were a major problem in
System R and SQL/DS
transaction may hold at most two latches at a time
must never wait for lock while holding latch
if both are needed (e.g. Record found after latching page):
release latch before requesting lock and then reacquire latch (and
search again, since the record may have moved)
Savepoints
Simply notes LSN of last record written by transaction
them
deadlocks can be resolved by rollback to the appropriate savepoint
Rollback
Scan backwards from last log record of txn
(last log record of txn = transTable[TransID].UndoNxtLSN)
More on Rollback
Extra logging during rollback is bounded
make sure enough log space is available for rollback in case of
system crash, else BIG problem
Transaction Termination
Checkpoints
begin_chkpt record is written first
transaction table, dirty_pages table and some other file
Checkpoint (contd)
Pages need not be flushed during checkpoint
are flushed on a continuous basis
Restart Processing
Finds checkpoint begin using master record
Do restart_analysis
Do restart_redo
... some details of dirty page table here
Do restart_undo
reacquire locks for prepared transactions
checkpoint
Redo Pass
Scan forward from RedoLSN
If log record is an update log record, AND is in
dirty_page_table AND LogRec.LSN >= RecLSN of the page in
dirty_page_table
then if pageLSN < LogRec.LSN then perform redo; else just
update RecLSN in dirty_page_table
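The three-way test above can be written out directly (a sketch with illustrative names; the return strings are just labels for the three outcomes):

```python
# Sketch of the ARIES redo decision for one update log record:
# redo only if the page is in the dirty page table, the record is not
# older than the page's RecLSN, and the page does not already reflect it.
def redo_action(log_lsn, page_id, dirty_pages, page_lsns):
    if page_id not in dirty_pages:
        return "skip"                      # page was flushed after this update
    if log_lsn < dirty_pages[page_id]:     # LogRec.LSN < RecLSN of the page
        return "skip"
    if page_lsns[page_id] < log_lsn:
        return "redo"                      # update not yet applied to the page
    return "update_recLSN"                 # page already reflects the update

dirty_pages = {"P1": 10}                   # page -> RecLSN from analysis pass
page_lsns = {"P1": 12, "P2": 5}            # pageLSN stored on each page
print(redo_action(15, "P1", dirty_pages, page_lsns))  # redo
print(redo_action(11, "P1", dirty_pages, page_lsns))  # update_recLSN
print(redo_action(15, "P2", dirty_pages, page_lsns))  # skip
```

Comparing pageLSN with LogRec.LSN is what makes redo idempotent: repeating the pass after a crash reapplies only updates the page has genuinely not seen.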
Optimizations of redo
Dirty page table info can be used to pre-read pages during
redo
Out of order redo is also possible to reduce disk seeks
Undo Pass
Rolls back loser transaction in reverse order in single
scan of log
stops when all losers have been fully undone
processing of log records is exactly as in single transaction
rollback
Undo Optimizations
Parallel undo
each txn undone separately, in parallel with others
can even generate CLRs and apply them separately, in
parallel for a single transaction
pages
Transaction Recovery
Loser transactions can be restarted in some cases
e.g. Mini batch transactions which are part of a larger transaction
Media Recovery
For archival dump
can dump pages directly from disk (bypass buffer, no latching
needed) or via buffer, as desired
this is a fuzzy dump, not transaction consistent
corruption
e.g. Application program with direct access to buffer crashes
before writing undo log record
recovery mechanism
used also for other operations like creating a file (which can
then be used by other txns, before the creator commits)
updates of nested top action commit early and should not be
undone
during undo