
Introduction to Databases

Daniela Puiu
Applications Specialist
Center for the Study of Biological Complexity, VCU
dpuiu@vcu.edu
804-827-0952

General Concepts
Database definition
Organized collection of logically related data

Data
Known facts
Types: text, graphics, images, sound, videos

Database management system (DBMS)


Software package for defining and managing
a database

Database Examples
Class roster
Hospital patients
Literature (published articles in a certain
field)
Genomic information
Protein structure
Taxonomy
Single nucleotide polymorphism

Example: Microbial Database


Data about the protein coding regions in the microbial
genomes sequenced so far.
Organism:
Name
Accession number
Genome size
GC%
Release date
Genome center
Sequence

Gene (protein coding regions):


Name
Accession number
Organism
Location on the chromosome
(start,end)
Strand
Size
Product
Sequence
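The two record types listed above can be sketched as plain Python data classes. This is only an illustration: the field names are adapted from the slide, and the size calculation assumes 1-based, inclusive start and end coordinates, which the slides do not state.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Organism:
    name: str
    accession: str
    genome_size: int        # genome length in bases
    gc_percent: float
    release_date: date
    genome_center: str
    sequence: str


@dataclass
class Gene:
    name: str
    accession: str
    organism_accession: str  # links the gene back to its organism
    start: int
    end: int
    strand: str
    product: str
    sequence: str

    @property
    def size(self) -> int:
        # Assumption: coordinates are 1-based and inclusive.
        return self.end - self.start + 1


# Values taken from the thrL example used later in the slides.
thrL = Gene("thrL", "16127995", "NC_000913", 190, 255, "+",
            "thr operon leader peptide", "MKRI")
```

Storing Size as a derived property rather than a separate field avoids keeping two values that could drift apart.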

Database Models

Flat files (1960s)
Hierarchical (1960s)
Network (1970s)
Relational (1980s)
Object oriented (1990s)
Object relational (1990s)
Web enabled (1990s)

Database Types (cont.)


Type | Typical number of users | Typical architecture | Typical size
Personal | | Desktop/Laptop/PDA | MB
Workgroup | 5-25 | Client/server: 2 tier | MB-GB
Department | 25-100 | Client/server: 3 tier | GB
Enterprise | >100 | Client/server: distributed | GB-TB
Internet | >1000 | Web server & application servers | MB-GB

Flat Files
Characteristics:
Data is stored as records in regular files
Records usually have a simple structure and fixed
number of fields
May support indexing of fields in the records
for fast access
No mechanisms for relating data between files
One needs special programs in order to access
and manipulate the data

Flat Files Example


Microbial database:
GenBank format:
Escherichia coli K12
Streptococcus pneumoniae R6
FASTA format: multiple files
Escherichia coli K12: genome, genes, gene positions
Streptococcus pneumoniae R6: genome, genes, gene positions

Data manipulation:

Sequence extraction, search


Indexing
Format conversion
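Flat files need purpose-written programs for even simple operations. The sketch below shows one possible way to read FASTA-formatted text and build an in-memory index from record identifier to sequence. The function name `read_fasta` and the two-record sample are made up for illustration; the sequences are the short fragments shown elsewhere in the slides.

```python
import io


def read_fasta(handle):
    """Yield (header, sequence) pairs from a FASTA-formatted text stream."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


# Hypothetical sample file; the second record is wrapped over two lines.
sample = io.StringIO(">thrL\nMKRI\n>thrA\nMR\nVL\n")

# Index: first word of the header -> full sequence.
index = {header.split()[0]: seq for header, seq in read_fasta(sample)}
```

With the index in place, sequence extraction becomes a dictionary lookup such as `index["thrA"]`. Relating a gene to its organism still has to be coded by hand, which is the limitation the relational model addresses.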

Relational Database
Characteristics:
Data is organized into tables: rows & columns
Each row represents an instance of an entity
Each column represents an attribute of an entity
Metadata describes each table column
Relationships between entities are represented
by values stored in the columns of the
corresponding tables (keys)
Accessible through Structured Query Language
(SQL)

Enterprise data model


Graphical representation of the high level
entities
Example: Microbial database
each organism has multiple corresponding genes
One:Many relation

Organism

Gene

Metadata
Data that describes the properties or
characteristics of other data
Does not include sample data
Allows database designers and users to
understand the meaning of the data
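Most DBMSs let you query metadata separately from the data itself. As a minimal sketch, assuming SQLite (which the slides do not mention) and an Organism table with assumed column names, the following reads the column descriptions of a table that contains no rows at all:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Organism ("
    " Name TEXT, Size INTEGER, Gc REAL, Accession TEXT,"
    ' "Release" TEXT, Center TEXT, Sequence TEXT)'
)

# PRAGMA table_info returns one row per column:
# (cid, name, type, notnull, dflt_value, pk)
columns = [(row[1], row[2])
           for row in conn.execute("PRAGMA table_info(Organism)")]

# The table is still empty: metadata exists independently of sample data.
row_count = conn.execute("SELECT COUNT(*) FROM Organism").fetchone()[0]
```

`PRAGMA table_info` is specific to SQLite; other systems expose the same information through their own catalog tables or views.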

Metadata & Data Table


Organism (metadata)
Name | Type | Max Length | Description
Name | Alphanumeric | 100 | Organism name
Size | Integer | 10 | Genome length (bases)
Gc | Float | | Percent GC
Accession | Alphanumeric | 10 | Accession number
Release | Date | | Release date
Center | Alphanumeric | 100 | Genome center name
Sequence | Alphanumeric | Variable | Sequence

Organism (data)
Name | Size | Gc | Accession | Release | Center | Sequence
Escherichia coli K12 | 4,640,000 | 50 | NC_000913 | 09/05/1997 | Univ. Wisconsin | AGCTTTTCATT
Streptococcus pneumoniae R6 | 2,040,000 | 40 | NC_003098 | 09/07/2001 | Eli Lilly and Company | TTGAAAGAAAA

Metadata & Data Table (cont.)


Gene (metadata)
Name | Type | Max Length | Description
Name | Alphanumeric | 100 | Gene name
Accession | Alphanumeric | 10 | Gene accession number
OAccession | Alphanumeric | 10 | Organism accession number
Start | Integer | 10 | Gene start
End | Integer | 10 | Gene end
Strand | Character | | Gene strand
Product | Alphanumeric | 1000 | Gene annotation
Sequence | Alphanumeric | Variable | Gene sequence

Gene (data)
Name | Accession | OAccession | Start | End | Strand | Product | Sequence
thrL | 16127995 | NC_000913 | 190 | 255 | | thr operon leader peptide | MKRI
thrA | 16127996 | NC_000913 | 337 | 2799 | | homoserine dehydrogenase I | MRVL
transposase_A | 15902058 | NC_003098 | 20207 | 20554 | | transposase | MWYN

Relationships

Used to connect tables


Field(s) that have the same value in the related tables
Organism.Accession=Gene.OAccession
Organism.Accession
Unique
Primary key
Gene.OAccession
Not unique
Foreign key (secondary key)
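The relationship above can be tried out with a small sketch, assuming SQLite through Python's sqlite3 module (not a DBMS named in the slides). The tables are cut down to the columns needed for the join, and the accession `NC_999999` is a made-up value used only to show a rejected reference.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite enforces foreign keys only when this is switched on.
conn.execute("PRAGMA foreign_keys = ON")

conn.execute("CREATE TABLE Organism (Accession TEXT PRIMARY KEY, Name TEXT)")
conn.execute(
    "CREATE TABLE Gene ("
    " Accession TEXT PRIMARY KEY, Name TEXT,"
    " OAccession TEXT REFERENCES Organism(Accession))"
)
conn.executemany(
    "INSERT INTO Organism VALUES (?, ?)",
    [("NC_000913", "Escherichia coli K12"),
     ("NC_003098", "Streptococcus pneumoniae R6")],
)
conn.executemany(
    "INSERT INTO Gene VALUES (?, ?, ?)",
    [("16127995", "thrL", "NC_000913"),
     ("16127996", "thrA", "NC_000913"),
     ("15902058", "transposase_A", "NC_003098")],
)

# Connect the tables through Organism.Accession = Gene.OAccession.
pairs = conn.execute(
    "SELECT Gene.Name, Organism.Name FROM Gene"
    " JOIN Organism ON Organism.Accession = Gene.OAccession"
    " ORDER BY Gene.Accession"
).fetchall()

# A gene that points at a non-existent organism is rejected.
try:
    conn.execute("INSERT INTO Gene VALUES ('0', 'orphan', 'NC_999999')")
    orphan_rejected = False
except sqlite3.IntegrityError:
    orphan_rejected = True
```

Two genes share the value `NC_000913` in Gene.OAccession, which is why that column cannot be unique, while each value of Organism.Accession occurs exactly once.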

SQL
ANSI (American National Standards
Institute) standard computer language for
accessing and manipulating database
systems.
SQL statements are used to retrieve and
update data in a database.
Includes:
Data Manipulation Language (DML)
Data Definition Language (DDL)

Data Manipulation Language


Syntax for executing queries, updating,
inserting, and deleting records.

SELECT - extracts data from one or more tables


INSERT INTO - inserts new data into a table
UPDATE - updates data in a table
DELETE FROM - deletes data from a table

DML Example
Select all Escherichia coli K12 genes that lie in the 1 Mb to 2 Mb region of the chromosome:
SELECT *
FROM Organism, Gene
WHERE
Organism.Name='Escherichia coli K12' AND
Organism.Accession=Gene.OAccession AND
Gene.Start>=1000000 AND
Gene.End<=2000000;
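To try the region query, here is a minimal sketch using SQLite through Python's sqlite3 module (an assumption; any relational DBMS would do). The genes named `placeholder_eco` and `placeholder_spn` and their coordinates are invented so that something falls inside the region. `"End"` is quoted because END is an SQL keyword.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Organism (Name TEXT, Accession TEXT PRIMARY KEY)")
conn.execute(
    'CREATE TABLE Gene (Name TEXT, OAccession TEXT, Start INTEGER, "End" INTEGER)'
)
conn.executemany(
    "INSERT INTO Organism VALUES (?, ?)",
    [("Escherichia coli K12", "NC_000913"),
     ("Streptococcus pneumoniae R6", "NC_003098")],
)
conn.executemany(
    "INSERT INTO Gene VALUES (?, ?, ?, ?)",
    [("thrL", "NC_000913", 190, 255),
     # Invented rows, for illustration only:
     ("placeholder_eco", "NC_000913", 1_200_000, 1_201_000),
     ("placeholder_spn", "NC_003098", 1_300_000, 1_300_500)],
)

query = """
    SELECT Gene.Name
    FROM Organism JOIN Gene ON Organism.Accession = Gene.OAccession
    WHERE Organism.Name = ?
      AND Gene.Start >= ?
      AND Gene."End" <= ?
"""
# Values are passed as parameters instead of being pasted into the SQL text.
hits = [row[0] for row in
        conn.execute(query, ("Escherichia coli K12", 1_000_000, 2_000_000))]
```

The explicit `JOIN ... ON` form is equivalent to listing both tables in FROM and putting the key comparison in WHERE; it just keeps the join condition apart from the filters.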

DML Example (cont.)


INSERT INTO Gene
(Name, Accession, OAccession, Start, End, Strand, Product, Sequence)
VALUES
('thrL', '16127995', 'NC_000913', 190, 255, '+', 'thr operon leader peptide', 'MKRI');
UPDATE Gene SET Start=160 WHERE Accession='16127995';
DELETE FROM Gene WHERE Accession='16127995';
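The three modifying statements can be run end to end as follows. This is a sketch under two assumptions: SQLite is the DBMS, and the row is identified by the gene's own accession number (16127995 for thrL) rather than by the organism accession.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Gene ("
    " Name TEXT, Accession TEXT PRIMARY KEY, OAccession TEXT,"
    ' Start INTEGER, "End" INTEGER, Strand TEXT, Product TEXT, Sequence TEXT)'
)

# INSERT: one value per listed column, in the same order.
conn.execute(
    'INSERT INTO Gene (Name, Accession, OAccession, Start, "End",'
    " Strand, Product, Sequence) VALUES (?, ?, ?, ?, ?, ?, ?, ?)",
    ("thrL", "16127995", "NC_000913", 190, 255, "+",
     "thr operon leader peptide", "MKRI"),
)

# UPDATE: rowcount reports how many rows the WHERE clause matched.
updated = conn.execute(
    "UPDATE Gene SET Start = ? WHERE Accession = ?", (160, "16127995")
).rowcount
start_after_update = conn.execute(
    "SELECT Start FROM Gene WHERE Accession = ?", ("16127995",)
).fetchone()[0]

# DELETE: removes the matching row.
deleted = conn.execute(
    "DELETE FROM Gene WHERE Accession = ?", ("16127995",)
).rowcount
remaining = conn.execute("SELECT COUNT(*) FROM Gene").fetchone()[0]
```

Checking the affected row count is a cheap safeguard: an UPDATE or DELETE whose WHERE clause matches nothing succeeds silently.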

Data Definition Language


Syntax for creating, editing, and deleting:
Databases
Tables
Views
Indexes
Constraints
Users
Privileges

DDL Examples
CREATE DATABASE Microbial;
CREATE TABLE Organism (
Name varchar(100),
Size int(10),
Gc decimal(5,2),
Accession varchar(10),
Release date,
Center varchar(100));
ALTER TABLE Organism ADD Sequence text;
DROP TABLE Organism;
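DDL details vary between systems. As a rough sketch, the same create, alter, and drop sequence looks like this in SQLite, which has no CREATE DATABASE statement (opening a connection creates the database). The helper `column_names` is a made-up convenience function, not part of sqlite3.

```python
import sqlite3

# Opening a connection plays the role of CREATE DATABASE in SQLite.
conn = sqlite3.connect(":memory:")


def column_names(connection, table):
    """Return the column names of a table, in definition order."""
    return [row[1] for row in connection.execute(f"PRAGMA table_info({table})")]


conn.execute(
    "CREATE TABLE Organism ("
    " Name VARCHAR(100),"
    " Size INTEGER,"
    " Gc DECIMAL(5,2),"
    " Accession VARCHAR(10),"
    ' "Release" DATE,'
    " Center VARCHAR(100))"
)
before = column_names(conn, "Organism")

# ALTER TABLE ... ADD COLUMN appends the new column at the end.
conn.execute("ALTER TABLE Organism ADD COLUMN Sequence TEXT")
after = column_names(conn, "Organism")

# DROP TABLE removes both the data and the table definition.
conn.execute("DROP TABLE Organism")
tables_left = conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table'"
).fetchall()
```

`"Release"` is quoted because RELEASE is a keyword in SQLite and a reserved word in some other systems; renaming the column would be another way around this.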

DBMS
Software package for defining and
managing a database.
Examples:
Proprietary: MS Access, MS SQL Server,
DB2, Oracle, Sybase
Open source: MySQL, PostgreSQL

DBMS Advantages
Program-data independence
Minimal data redundancy
Improved data consistency & quality
Access control
Transaction control

Improved accessibility & data sharing


Increased productivity of application
development
Enforced standards
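Transaction control, one of the advantages listed above, means that a group of changes is applied completely or not at all. The following is a minimal sketch assuming SQLite through Python's sqlite3 module. The second insert deliberately reuses the accession 16127995 (an incorrect value for thrA) to force a failure.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Gene (Accession TEXT PRIMARY KEY, Name TEXT)")
conn.commit()

try:
    # Used as a context manager, the connection commits if the block
    # succeeds and rolls back if an exception is raised.
    with conn:
        conn.execute("INSERT INTO Gene VALUES ('16127995', 'thrL')")
        # Deliberate duplicate primary key: raises IntegrityError.
        conn.execute("INSERT INTO Gene VALUES ('16127995', 'thrA')")
except sqlite3.IntegrityError:
    pass

# The first insert was rolled back together with the failed one.
count_after_failure = conn.execute("SELECT COUNT(*) FROM Gene").fetchone()[0]
```

Doing the same with flat files would mean writing your own logic to undo a half-finished update.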

Web Databases
Data is accessible through the Internet
Have different underlying database
models
Example: biological databases
Molecular data: NCBI, Swissprot, PDB, GO
Protein interaction: DIP, BIND
Organism specific: Mouse, Worm, Yeast
Literature: PubMed
Disease

CSBC Resources
Database and software list
Molecular databases: GenBank, EMBL, NR, NT,
RefSeq, Swissprot
DBMS:
MS Excel, MS Access
MySQL, PostgreSQL

Computer resources
watson.vcu.edu : 8 processor Sun server
medusa.vcu.edu : 64 processor Beowulf cluster
