Daniela Puiu
Applications Specialist
Center for the Study of Biological Complexity, VCU
dpuiu@vcu.edu
804-827-0952
General Concepts
Database definition
Organized collection of logically related data
Data
Known facts
Types: text, graphics, images, sound, videos
Database Examples
Class roster
Hospital patients
Literature (published articles in a certain
field)
Genomic information
Protein structure
Taxonomy
Single nucleotide polymorphism
Database Models
Flat files (1960s)
Hierarchical (1960s)
Network (1970s)
Relational (1980s)
Object oriented (1990s)
Object relational (1990s)
Web enabled (1990s)
             Typical number of users   Typical architecture         Typical size
Personal                               Desktop/Laptop/PDA           MB
Workgroup    5-25
Department   25-100                    Client/server: 3 tier        GB
Enterprise   >100                      Client/server: distributed   GB-TB
Internet     >1000                                                  MB-GB
Flat Files
Characteristics:
Data is stored as records in regular files
Records usually have a simple structure and fixed
number of fields
For fast access may support indexing of fields in
the records
No mechanisms for relating data between files
One needs special programs in order to access
and manipulate the data
Data manipulation:
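As a minimal sketch of the kind of special-purpose program a flat file requires, the Python snippet below scans tab-delimited gene records and filters them by position. The file layout, the field order, and the function name genes_in_region are assumptions made for illustration; only the sample gene values come from these slides.

```python
import csv
import io

# Hypothetical flat file: one gene per line, tab-delimited,
# fields = name, organism accession, start, end.
FLAT_FILE = (
    "thrL\tNC_000913\t190\t255\n"
    "thrA\tNC_000913\t337\t2799\n"
    "transposase_A\tNC_003098\t20207\t20554\n"
)

def genes_in_region(handle, accession, start, end):
    """Scan every record and keep genes of one organism that lie inside [start, end]."""
    hits = []
    for name, acc, g_start, g_end in csv.reader(handle, delimiter="\t"):
        # Fields arrive as text; the program itself must know which ones are numbers.
        if acc == accession and int(g_start) >= start and int(g_end) <= end:
            hits.append(name)
    return hits

result = genes_in_region(io.StringIO(FLAT_FILE), "NC_000913", 1, 1000)
print(result)
```

Every new question about the data needs another function like this one, and nothing in the file itself links a gene record to its organism record.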
Relational Database
Characteristics:
Data is organized into tables: rows & columns
Each row represents an instance of an entity
Each column represents an attribute of an entity
Metadata describes each table column
Relationships between entities are represented
by values stored in the columns of the
corresponding tables (keys)
Accessible through Structured Query Language
(SQL)
Organism
Gene
Metadata
Data that describes the properties or
characteristics of other data
Does not include sample data
Allows database designers and users to
understand the meaning of the data
Column      Type           Max Length   Description
Name        Alphanumeric   100          Organism name
Size        Integer        10
Gc          Float                       Percent GC
Accession   Alphanumeric   10           Accession number
Release     Date                        Release date
Center      Alphanumeric   100
Sequence    Alphanumeric   Variable     Sequence
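Most DBMSs store metadata of this kind in a catalog that can itself be queried. The sketch below uses SQLite (through Python's standard sqlite3 module) as a stand-in DBMS; the choice of SQLite and the SQLite type names TEXT, INTEGER, and REAL are assumptions for illustration, not the system used in these slides.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    'CREATE TABLE Organism ('
    ' Name TEXT, Size INTEGER, Gc REAL,'
    ' Accession TEXT, "Release" TEXT, Center TEXT, Sequence TEXT)'
)

# PRAGMA table_info returns one row per column:
# (cid, name, type, notnull, dflt_value, pk)
columns = [(row[1], row[2]) for row in conn.execute("PRAGMA table_info(Organism)")]
for name, col_type in columns:
    print(name, col_type)
```

The table holds no rows at this point, which illustrates that metadata describes the data without including any sample of it.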
Name                          Size        Gc   Accession   Release      Center            Sequence
Escherichia coli K12          4,640,000   50   NC_000913   09/05/1997   Univ. Wisconsin   AGCTTTTCATT
Streptococcus pneumoniae R6   2,040,000   40   NC_003098   09/07/2001                     TTGAAAGAAAA
Column       Type           Max Length   Description
Name         Alphanumeric   100          Gene name
Accession    Alphanumeric   10
OAccession   Alphanumeric   10
Start        Integer        10           Gene start
End          Integer        10           Gene end
Strand       Character                   Gene strand
Product      Alphanumeric   1000         Gene annotation
Sequence     Alphanumeric   Variable     Gene sequence
Name            Accession   OAccession   Start   End     Strand   Product                      Sequence
thrL            16127995    NC_000913    190     255                                           MKRI
thrA            16127996    NC_000913    337     2799             homoserine dehydrogenase I   MRVL
transposase_A   15902058    NC_003098    20207   20554            transposase                  MWYN
Relationships
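The relationship between the two tables is carried by a key: Gene.OAccession holds the Accession value of the organism the gene belongs to. A minimal sketch of that link, assuming SQLite as the DBMS and a trimmed-down two-column version of each table (both assumptions for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite leaves enforcement off by default
conn.execute("CREATE TABLE Organism (Accession TEXT PRIMARY KEY, Name TEXT)")
conn.execute(
    "CREATE TABLE Gene ("
    " Accession TEXT PRIMARY KEY, Name TEXT,"
    " OAccession TEXT REFERENCES Organism(Accession))"
)
conn.execute("INSERT INTO Organism VALUES ('NC_003098', 'Streptococcus pneumoniae R6')")
conn.execute("INSERT INTO Gene VALUES ('15902058', 'transposase_A', 'NC_003098')")

# Follow the key from the gene back to its organism.
owner = conn.execute(
    "SELECT Organism.Name FROM Gene JOIN Organism"
    " ON Gene.OAccession = Organism.Accession"
    " WHERE Gene.Name = 'transposase_A'"
).fetchone()[0]
print(owner)

# A gene that points at an organism that does not exist is rejected.
try:
    conn.execute("INSERT INTO Gene VALUES ('0', 'orphan', 'NC_999999')")
    orphan_rejected = False
except sqlite3.IntegrityError:
    orphan_rejected = True
print(orphan_rejected)
```

Declaring the key lets the DBMS, rather than application code, guarantee that every gene refers to a known organism.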
SQL
ANSI (American National Standards
Institute) standard computer language for
accessing and manipulating database
systems.
SQL statements are used to retrieve and
update data in a database.
Includes:
Data Manipulation Language (DML)
Data Definition Language (DDL)
DML Example
Select all Escherichia coli K12 genes that lie in the 1 Mb to 2 Mb region of the chromosome:
SELECT *
FROM Organism, Gene
WHERE
Organism.Name = 'Escherichia coli K12' AND
Organism.Accession = Gene.OAccession AND
Gene.Start >= 1000000 AND
Gene.End <= 2000000;
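The query can be tried end to end with a small in-memory database. This is a sketch under stated assumptions: SQLite stands in for the DBMS, the tables are cut down to the columns the query needs, and the rows named geneX and geneY are hypothetical, added only so that the region filter has something to accept and reject.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Organism (Name TEXT, Accession TEXT)")
conn.execute('CREATE TABLE Gene (Name TEXT, OAccession TEXT, Start INTEGER, "End" INTEGER)')
conn.execute("INSERT INTO Organism VALUES ('Escherichia coli K12', 'NC_000913')")
conn.executemany(
    "INSERT INTO Gene VALUES (?, ?, ?, ?)",
    [
        ("thrA", "NC_000913", 337, 2799),          # from the slides, outside the region
        ("geneX", "NC_000913", 1200000, 1201500),  # hypothetical, inside the region
        ("geneY", "NC_000913", 1999000, 2000500),  # hypothetical, crosses the 2 Mb limit
    ],
)

# Same join and filters as the DML example; values are passed as parameters
# instead of being pasted into the SQL text.
query = """
    SELECT Gene.Name
    FROM Organism, Gene
    WHERE Organism.Name = ?
      AND Organism.Accession = Gene.OAccession
      AND Gene.Start >= ?
      AND Gene."End" <= ?
"""
names = [row[0] for row in conn.execute(query, ("Escherichia coli K12", 1000000, 2000000))]
print(names)
```

Only genes that lie entirely inside the region are returned, because the filter tests both Start and End.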
DDL Examples
CREATE DATABASE Microbial;
-- Release is a keyword in some DBMSs and may need quoting there
CREATE TABLE Organism (
Name varchar(100),
Size int(10),
Gc decimal(5,2),
Accession varchar(10),
Release date,
Center varchar(100));
ALTER TABLE Organism ADD Sequence text;
DROP TABLE Organism;
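The effect of each DDL statement can be observed by inspecting the schema after it runs. The sketch below assumes SQLite, where opening a connection takes the place of CREATE DATABASE and the column types are SQLite's own; the helper name column_names is hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stands in for CREATE DATABASE

def column_names(table):
    """Return the column names of a table, read from the cursor description."""
    return [d[0] for d in conn.execute(f"SELECT * FROM {table}").description]

conn.execute(
    'CREATE TABLE Organism ('
    ' Name TEXT, Size INTEGER, Gc REAL,'
    ' Accession TEXT, "Release" TEXT, Center TEXT)'
)
before = column_names("Organism")

conn.execute("ALTER TABLE Organism ADD COLUMN Sequence TEXT")
after = column_names("Organism")

conn.execute("DROP TABLE Organism")
remaining = conn.execute("SELECT name FROM sqlite_master WHERE type = 'table'").fetchall()

print(before)
print(after)
print(remaining)
```

DDL changes the structure of the database, not its contents: no row is inserted or read at any point in this example.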
DBMS
Software package for defining and
managing a database.
Examples:
Proprietary: MS Access, MS SQL Server,
DB2, Oracle, Sybase
Open source: MySQL, PostgreSQL
DBMS Advantages
Program-data independence
Minimal data redundancy
Improved data consistency & quality
Access control
Transaction control
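Transaction control means that a group of changes is applied completely or not at all. A minimal sketch, assuming SQLite as the DBMS and a two-column Organism table (both assumptions for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Organism (Accession TEXT PRIMARY KEY, Name TEXT)")
conn.commit()

try:
    with conn:  # commits on success, rolls back if an exception escapes the block
        conn.execute("INSERT INTO Organism VALUES ('NC_003098', 'Streptococcus pneumoniae R6')")
        # Same primary key again: the constraint is violated and the insert fails.
        conn.execute("INSERT INTO Organism VALUES ('NC_003098', 'duplicate entry')")
except sqlite3.IntegrityError:
    pass

count = conn.execute("SELECT COUNT(*) FROM Organism").fetchone()[0]
print(count)
```

The first insert was valid on its own, yet it is undone together with the failed one, so the table is left exactly as it was before the transaction started.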
Web Databases
Data is accessible through the Internet
Have different underlying database
models
Example: biological databases
Molecular data: NCBI, Swissprot, PDB, GO
Protein interaction: DIP, BIND
Organism specific: Mouse, Worm, Yeast
Literature: PubMed
Disease
CSBC Resources
Database and software list
Molecular databases: GenBank, EMBL, NR, NT,
RefSeq, Swissprot
DBMS:
MS Excel, MS Access
MySQL, PostgreSQL
Computer resources
watson.vcu.edu: 8-processor Sun server
medusa.vcu.edu: 64-processor Beowulf cluster