Databases

An introduction to biological databases
Database or databank ?
Initially, subtle distinctions were made between databases and databanks (in the UK, but not in the USA), such as: database = management program used to handle a databank. Nowadays, the term database (db) is usually preferred.
What is a database ?
A collection of data that is:

structured
searchable (index) -> table of contents
updated periodically (release) -> new edition
cross-referenced (hyperlinks) -> links with other db
Also includes the associated tools (software) necessary for db access, db updating, and insertion and deletion of db information.
Databases: a simple example

Introduction To Database Teacher Database (ITDTdb) (flat file, 3 entries)

Accession number: 1
First Name: Amos
Last Name: Bairoch
Course: DEA=oct-nov-dec 2000
http://expasy4.expasy.ch/people/amos.html
//
Accession number: 2
First Name: Laurent
Last Name: Falquet
Course: EMBnet=sept 2000;DEA=oct-nov-dec 2000;
//
Accession number: 3
First Name: Marie-Claude
Last Name: Blatter Garin
Course: EMBnet=sept 2000;DEA=oct-nov-dec 2000;
http://expasy4.expasy.ch/people/Marie-Claude.Blatter-Garin.html
//
Easy to manage: all the entries are visible at the same time !
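A flat file of this kind can be read with a few lines of code. The sketch below is only an illustration: the function name parse_flat_file is hypothetical, the URL lines are left out of the sample for simplicity, and it assumes every field has the form "key: value" with "//" closing each entry.

```python
# Minimal sketch of a flat-file reader for ITDTdb-style entries.
# Assumption: each field is "key: value" and "//" terminates an entry.
FLAT_FILE = """\
Accession number: 1
First Name: Amos
Last Name: Bairoch
Course: DEA=oct-nov-dec 2000
//
Accession number: 2
First Name: Laurent
Last Name: Falquet
Course: EMBnet=sept 2000;DEA=oct-nov-dec 2000;
//
Accession number: 3
First Name: Marie-Claude
Last Name: Blatter Garin
Course: EMBnet=sept 2000;DEA=oct-nov-dec 2000;
//
"""


def parse_flat_file(text):
    """Return one dict per entry, keyed by field name."""
    entries, current = [], {}
    for line in text.splitlines():
        line = line.strip()
        if line == "//":            # end-of-entry marker
            if current:
                entries.append(current)
            current = {}
        elif ":" in line:
            key, value = line.split(":", 1)
            current[key.strip()] = value.strip()
    return entries


entries = parse_flat_file(FLAT_FILE)
# A "search" is a scan over every entry: fine for 3 entries, slow for millions.
embnet_teachers = [e["First Name"] for e in entries if "EMBnet" in e["Course"]]
print(embnet_teachers)
```

Every query has to scan the whole file, which is why real flat-file databases are distributed together with index files.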
Databases: a simple example (cont.)

Relational database ("table file"):

Teacher     Accession number   Education
Amos        1                  Biochemistry
Laurent     2                  Biochemistry
M-Claude    3                  Biochemistry

Course      Date               Involved teachers
DEA         Oct-nov-dec 2000   1,3
EMBnet      Sept 2000          2,3
Easier to manage; choice of the output
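The same two tables can be sketched with Python's built-in sqlite3 module. This is an illustration only: the table and column names (teacher, course, teaches) and the helper teachers_of are assumptions, and the "Involved teachers" column is stored as a separate link table, which is the usual relational way to represent a list of accession numbers.

```python
import sqlite3

# In-memory database holding the two tables shown on the slide.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE teacher (accession INTEGER PRIMARY KEY, name TEXT, education TEXT);
CREATE TABLE course  (name TEXT PRIMARY KEY, date TEXT);
-- Link table: one row per (teacher, course) pair from "Involved teachers".
CREATE TABLE teaches (accession INTEGER REFERENCES teacher(accession),
                      course    TEXT    REFERENCES course(name));
INSERT INTO teacher VALUES (1, 'Amos', 'Biochemistry');
INSERT INTO teacher VALUES (2, 'Laurent', 'Biochemistry');
INSERT INTO teacher VALUES (3, 'M-Claude', 'Biochemistry');
INSERT INTO course VALUES ('DEA', 'Oct-nov-dec 2000');
INSERT INTO course VALUES ('EMBnet', 'Sept 2000');
INSERT INTO teaches VALUES (1, 'DEA');
INSERT INTO teaches VALUES (3, 'DEA');
INSERT INTO teaches VALUES (2, 'EMBnet');
INSERT INTO teaches VALUES (3, 'EMBnet');
""")


def teachers_of(course):
    """Names of the teachers involved in a course, ordered by accession number."""
    rows = conn.execute(
        "SELECT t.name FROM teacher t "
        "JOIN teaches x ON x.accession = t.accession "
        "WHERE x.course = ? ORDER BY t.accession",
        (course,),
    )
    return [name for (name,) in rows]


print(teachers_of("EMBnet"))
```

"Choice of the output" here means that the query decides which columns and rows come back, without changing how the data are stored.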
Why biological databases ?

Explosive growth in biological data
Data (sequences, 3D structures, 2D gel analysis, MS analysis, etc.) are no longer published in a conventional manner, but directly submitted to databases
Essential tools for biological research, as classical publications used to be !
Biological databases
Some databases in the field of molecular biology
AATDB, AceDb, ACUTS, ADB, AFDB, AGIS, AMSdb, ARR, AsDb, BBDB, BCGD, Beanref, BioImage, BioMagResBank, BIOMDB, BLOCKS, BovGBASE, BOVMAP, BSORF, BTKbase, CANSITE, CarbBank, CARBHYD, CATH, CAZY, CCDC, CD40Lbase, CGAP, ChickGBASE, Colibri, COPE, CottonDB, CSNDB, CUTG, CyanoBase, dbCFC, dbEST, dbSTS, DDBJ, DGP, DictyDb, Dicty_cDB, DIP, DOGS, DOMO, DPD, DPInteract, ECDC, ECGC, ECO2DBASE, EcoCyc, EcoGene, EMBL, EMD db, ENZYME, EPD, EpoDB, ESTHER, FlyBase, FlyView, GCRDB, GDB, GENATLAS, GenBank, GeneCards, Genline, GenLink, GENOTK, GenProtEC, GIFTS, GPCRDB, GRAP, GRBase, gRNAsdb, GRR, GSDB, HAEMB, HAMSTERS, HEART-2DPAGE, HEXAdb, HGMD, HIDB, HIDC, HIVdb, HotMolecBase, HOVERGEN, HPDB, HSC-2DPAGE, ICN, ICTVDB, IL2RGbase, IMGT, Kabat, KDNA, KEGG, Klotho, LGIC, MAD, MaizeDb, MDB, Medline, Mendel, MEROPS, MGDB, MGI, MHCPEP, Micado, MitoDat, MITOMAP, MJDB, MmtDB, Mol-R-Us, MPDB, MRR, MutBase, MycDB, NDB, NRSub, O-GlycBase, OMIA, OMIM, OPD, ORDB, OWL, PAHdb, PatBase, PDB, PDD, Pfam, PhosphoBase, PigBASE, PIR, PKR, PMD, PPDB, PRESAGE, PRINTS, ProDom, Prolysis, PROSITE, PROTOMAP, RatMAP, RDP, REBASE, RGP, SBASE, SCOP, SeqAnalRef, SGD, SGP, SheepMap, Soybase, SPAD, SRNA db, SRPDB, STACK, StyGene, Sub2D, SubtiList, SWISS-2DPAGE, SWISS-3DIMAGE, SWISS-MODEL Repository, SWISS-PROT, TelDB, TGN, tmRDB, TOPS, TRANSFAC, TRR, UniGene, URNADB, V BASE, VDRR, VectorDB, WDCM, WIT, WormPep, YEPD, YPD, YPM, etc. !
Some statistics
More than 1000 different databases
Generally accessible through the web (useful link: www.expasy.ch/alinks.html)
Variable size: <100 Kb to >10 Gb
  DNA: > 10 Gb
  Protein: 1 Gb
  3D structure: 5 Gb
  Other: smaller
Update frequency: daily to annually
Categories of databases for Life Sciences

Sequences (DNA, protein) -> Primary db
Genomics
Protein domain/family -> Secondary db
Mutation/polymorphism
Proteomics (2D gel, MS)
3D structure -> Structure db
Metabolism
Bibliography
Others
Distribution of sequence databases

Books, articles     1968 -> 1985
Computer tapes      1982 -> 1992
Floppy disks        1984 -> 1990
CD-ROM              1989 -> ?
FTP                 1989 -> ?
On-line services    1982 -> 1994
WWW                 1993 -> ?
DVD                 2001 -> ?
Sequence Databases: some technical definitions
Data storage management:
  flat file: text file
  relational (e.g., Oracle)
  object oriented (rare in biological field)

Format (flat file):
  fasta
  GCG
  NBRF/PIR
  MSF
  ...
  standardized format ?

Federated databases: different autonomous, redundant, heterogeneous db linked together by links/hyperlinks.
Ideal minimal content of a sequence db

Sequences !!
Accession number (AC)
References
Taxonomic data
ANNOTATION/CURATION
Keywords
Cross-references
Documentation
Sequence database: example

SWISS-PROT flat file
(slide callouts: taxonomy = OS/OC lines; reference = RN to RL lines; annotations = CC lines; cross-references = DR lines; keywords = KW line)

ID   EPO_HUMAN      STANDARD;      PRT;   193 AA.
AC   P01588;
DT   21-JUL-1986 (Rel. 01, Created)
DT   21-JUL-1986 (Rel. 01, Last sequence update)
DT   30-MAY-2000 (Rel. 39, Last annotation update)
DE   Erythropoietin precursor.
GN   EPO.
OS   Homo sapiens (Human).
OC   Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
OC   Mammalia; Eutheria; Primates; Catarrhini; Hominidae; Homo.
RN   [1]
RP   SEQUENCE FROM N.A.
RX   MEDLINE; 85137899.
RA   Jacobs K., Shoemaker C., Rudersdorf R., Neill S.D., Kaufman R.J.,
RA   Mufson A., Seehra J., Jones S.S., Hewick R., Fritsch E.F.,
RA   Kawakita M., Shimizu T., Miyake T.;
RT   "Isolation and characterization of genomic and cDNA clones of human
RT   erythropoietin.";
RL   Nature 313:806-810(1985).
...
CC   -!- FUNCTION: ERYTHROPOIETIN IS THE PRINCIPAL HORMONE INVOLVED IN THE
CC       REGULATION OF ERYTHROCYTE DIFFERENTIATION AND THE MAINTENANCE OF A
CC       PHYSIOLOGICAL LEVEL OF CIRCULATING ERYTHROCYTE MASS.
CC   -!- SUBCELLULAR LOCATION: SECRETED.
CC   -!- TISSUE SPECIFICITY: PRODUCED BY KIDNEY OR LIVER OF ADULT MAMMALS
CC       AND BY LIVER OF FETAL OR NEONATAL MAMMALS.
CC   -!- PHARMACEUTICAL: Available under the names Epogen (Amgen) and
CC       Procrit (Ortho Biotech).
CC   -!- DATABASE: NAME=R&D Systems' cytokine source book;
CC       WWW="http://www.rndsystems.com/cyt_cat/epo.html".
DR   EMBL; X02158; CAA26095.1; -.
DR   EMBL; X02157; CAA26094.1; -.
DR   EMBL; M11319; AAA52400.1; -.
DR   EMBL; AF053356; AAC78791.1; -.
DR   EMBL; AF202308; AAF23132.1; -.
DR   EMBL; AF202306; AAF23132.1; JOINED.
...
KW   Erythrocyte maturation; Glycoprotein; Hormone; Signal; Pharmaceutical.
FT   SIGNAL        1     27
FT   CHAIN        28    193       ERYTHROPOIETIN.
FT   PROPEP      190    193       MAY BE REMOVED IN PROCESSED PROTEIN.
FT   DISULFID     34    188
...
Sequence database: example (cont.)

(slide callout: sequence = SQ line and the lines that follow it)

FT   DISULFID     34    188
FT   DISULFID     56     60
FT   CARBOHYD     51     51       N-LINKED (GLCNAC...).
FT   CARBOHYD     65     65       N-LINKED (GLCNAC...).
FT   CARBOHYD    110    110       N-LINKED (GLCNAC...).
FT   CARBOHYD    153    153
FT   CONFLICT     40     40       E -> Q (IN CAA26095).
FT   CONFLICT     85     85       Q -> QQ (IN REF. 5).
FT   CONFLICT    140    140       G -> R (IN CAA26095).
**   Chromosomal location: 7q22
SQ   SEQUENCE   193 AA;  21306 MW;  C91F0E4C26A52033 CRC64;
     MGVHECPAWL WLLLSLLSLP LGLPVLGAPP RLICDSRVLE RYLLEAKEAE NITTGCAEHC
     SLNENITVPD TKVNFYAWKR MEVGQQAVEV WQGLALLSEA VLRGQALLVN SSQPWEPLQL
     HVDKAVSGLR SLTTLLRALG AQKEAISPPD AASAAPLRTI TADTFRKLFR VYSNFLRGKL
     KLYTGEACRT GDR
//
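Because every line of a SWISS-PROT entry starts with a two-letter line code, the entry can be read mechanically. The following is a minimal sketch, not an official parser: the function name parse_entry is hypothetical, the sample keeps only a few lines of the EPO_HUMAN entry, and the code assumes the content of each line starts at column 6.

```python
# Abbreviated copy of the EPO_HUMAN entry (only some line types kept).
ENTRY = """\
ID   EPO_HUMAN      STANDARD;      PRT;   193 AA.
AC   P01588;
DE   Erythropoietin precursor.
OS   Homo sapiens (Human).
KW   Erythrocyte maturation; Glycoprotein; Hormone; Signal; Pharmaceutical.
SQ   SEQUENCE   193 AA;  21306 MW;  C91F0E4C26A52033 CRC64;
     MGVHECPAWL WLLLSLLSLP LGLPVLGAPP RLICDSRVLE RYLLEAKEAE NITTGCAEHC
     SLNENITVPD TKVNFYAWKR MEVGQQAVEV WQGLALLSEA VLRGQALLVN SSQPWEPLQL
     HVDKAVSGLR SLTTLLRALG AQKEAISPPD AASAAPLRTI TADTFRKLFR VYSNFLRGKL
     KLYTGEACRT GDR
//
"""


def parse_entry(text):
    """Group line contents by two-letter code and collect the sequence."""
    fields, sequence, in_sequence = {}, [], False
    for line in text.splitlines():
        if line.startswith("//"):          # end of entry
            break
        if in_sequence:                    # lines after SQ hold the residues
            sequence.append(line.replace(" ", ""))
            continue
        code, content = line[:2], line[5:].strip()
        if code == "SQ":
            in_sequence = True
        fields.setdefault(code, []).append(content)
    fields["sequence"] = "".join(sequence)
    return fields


entry = parse_entry(ENTRY)
accession = entry["AC"][0].rstrip(";")
keywords = [k.strip() for k in entry["KW"][0].rstrip(".").split(";")]
print(accession, len(entry["sequence"]), keywords)
```

The sequence length recovered this way can be checked against the "193 AA" stated on the ID and SQ lines.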
Sequence database: example

A SWISS-PROT entry in fasta format:

>sp|P01588|EPO_HUMAN ERYTHROPOIETIN PRECURSOR - Homo sapiens (Human).
MGVHECPAWLWLLLSLLSLPLGLPVLGAPPRLICDSRVLERYLLEAKEAE
NITTGCAEHCSLNENITVPDTKVNFYAWKRMEVGQQAVEVWQGLALLSEA
VLRGQALLVNSSQPWEPLQLHVDKAVSGLRSLTTLLRALGAQKEAISPPD
AASAAPLRTITADTFRKLFRVYSNFLRGKLKLYTGEACRTGDR
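The fasta format keeps only a one-line header and the sequence, which makes it trivial to read. A minimal sketch follows; the function name parse_fasta is hypothetical, and splitting the identifier on "|" into database, accession, and entry name is an assumption based on the header shown above.

```python
FASTA = """\
>sp|P01588|EPO_HUMAN ERYTHROPOIETIN PRECURSOR - Homo sapiens (Human).
MGVHECPAWLWLLLSLLSLPLGLPVLGAPPRLICDSRVLERYLLEAKEAE
NITTGCAEHCSLNENITVPDTKVNFYAWKRMEVGQQAVEVWQGLALLSEA
VLRGQALLVNSSQPWEPLQLHVDKAVSGLRSLTTLLRALGAQKEAISPPD
AASAAPLRTITADTFRKLFRVYSNFLRGKLKLYTGEACRTGDR
"""


def parse_fasta(text):
    """Return a list of records with id, description and sequence."""
    records = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):           # header line starts a new record
            identifier, _, description = line[1:].partition(" ")
            records.append({"id": identifier,
                            "description": description,
                            "sequence": ""})
        elif records:                      # sequence lines are concatenated
            records[-1]["sequence"] += line
    return records


record = parse_fasta(FASTA)[0]
database, accession, entry_name = record["id"].split("|")
print(database, accession, entry_name, len(record["sequence"]))
```

Compared with the full flat file, all annotation (references, features, cross-references) is lost; only the accession number links the fasta record back to the entry.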
Databases 1: nucleotide sequence
The main DNA sequence db are EMBL (Europe) / GenBank (USA) / DDBJ (Japan)
There are also specialized databases for:
  the different types of RNAs (e.g. tRNA, rRNA, tmRNA, uRNA, etc.)
  3D structure (DNA and RNA)
  Others: aberrant splicing db; Eukaryotic Promoter db (EPD); RNA editing sites; Multimedia Telomere Resource
EMBL/GenBank/DDBJ
These 3 db contain essentially the same information within 2-3 days (with a few differences in format and syntax)
Non-confidential data are exchanged daily
Currently: 8.3 x 10^6 sequences, over 9.7 x 10^9 bp
Sequences from > 50000 different species
Serve as archives containing all sequences (single genes, ESTs, complete genomes, etc.) derived from:

  Genome projects and sequencing centers
  Individual scientists
  Patent offices (e.g. European Patent Office, EPO)
EMBL/GenBank/DDBJ

Heterogeneous sequence length: genomes, variants, fragments
Sequence sizes:
  max 300000 bp/entry (! genomic sequences, overlapping)
  min 10 bp/entry

Archive: nothing goes out -> highly redundant !
Full of errors: in sequences, in annotations, in CDS attribution
No consistency of annotations; most annotations are done by the submitters; heterogeneous quality, completeness, and updating of the information
EMBL/GenBank/DDBJ
Unexpected information you can find in these db:

FT   source          1..124
FT                   /db_xref="taxon:4097"
FT                   /organelle="plastid:chloroplast"
FT                   /organism="Nicotiana tabacum"
FT                   /isolate="Cuban cahibo cigar, gift from President Fidel
FT                   Castro"
Or:
FT   source          1..17084
FT                   /chromosome="complete mitochondrial genome"
FT                   /db_xref="taxon:9267"
FT                   /organelle="mitochondrion"
FT                   /organism="Didelphis virginiana"
FT                   /dev_stage="adult"
FT                   /isolate="fresh road killed individual"
FT                   /tissue_type="liver"
EMBL entry: example

(slide callouts: keyword = KW line; taxonomy = OS/OC lines; references = RN to RL lines; cross-references = DR lines)

ID   HSERPG     standard; DNA; HUM; 3398 BP.
XX
AC   X02158;
XX
SV   X02158.1
XX
DT   13-JUN-1985 (Rel. 06, Created)
DT   22-JUN-1993 (Rel. 36, Last updated, Version 2)
XX
DE   Human gene for erythropoietin
XX
KW   erythropoietin; glycoprotein hormone; hormone; signal peptide.
XX
OS   Homo sapiens (human)
OC   Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi; Mammalia;
OC   Eutheria; Primates; Catarrhini; Hominidae; Homo.
XX
RN   [1]
RP   1-3398
RX   MEDLINE; 85137899.
RA   Jacobs K., Shoemaker C., Rudersdorf R., Neill S.D., Kaufman R.J.,
RA   Mufson A., Seehra J., Jones S.S., Hewick R., Fritsch E.F., Kawakita M.,
RA   Shimizu T., Miyake T.;
RT   Isolation and characterization of genomic and cDNA clones of human
RT   erythropoietin;
RL   Nature 313:806-810(1985).
XX
DR   GDB; 119110; EPO.
DR   GDB; 119615; TIMP1.
DR   SWISS-PROT; P01588; EPO_HUMAN.
XX
EMBL entry (cont.)

(slide callouts: annotation = FT lines; sequence = SQ line and the lines that follow it)

CC   Data kindly reviewed (24-FEB-1986) by K. Jacobs
FH   Key             Location/Qualifiers
FH
FT   source          1..3398
FT                   /db_xref=taxon:9606
FT                   /organism=Homo sapiens
FT   mRNA            join(397..627,1194..1339,1596..1682,2294..2473,2608..3327)
FT   CDS             join(615..627,1194..1339,1596..1682,2294..2473,2608..2763)
FT                   /db_xref=SWISS-PROT:P01588
FT                   /product=erythropoietin
FT                   /protein_id=CAA26095.1
FT                   /translation=MGVHECPAWLWLLLSLLSLPLGLPVLGAPPRLICDSRVLQRYLLE
FT                   AKEAENITTGCAEHCSLNENITVPDTKVNFYAWKRMEVGQQAVEVWQGLALLSEAVLRG
FT                   QALLVNSSQPWEPLQLHVDKAVSGLRSLTTLLRALGAQKEAISPPDAASAAPLRTITAD
FT                   TFRKLFRVYSNFLRGKLKLYTGEACRTGDR
FT   mat_peptide     join(1262..1339,1596..1682,2294..2473,2608..2763)
FT                   /product=erythropoietin
FT   sig_peptide     join(615..627,1194..1261)
FT   exon            397..627
FT                   /number=1
FT   intron          628..1193
FT                   /number=1
FT   exon            1194..1339
FT                   /number=2
FT   intron          1340..1595
FT                   /number=2
FT   exon            1596..1682
FT                   /number=3
FT   intron          1683..2293
FT                   /number=3
FT   exon            2294..2473
FT                   /number=4
FT   intron          2474..2607
FT                   /number=4
FT   exon            2608..3327
FT                   /note=3' untranslated region
FT                   /number=5
XX
SQ   Sequence 3398 BP; 698 A; 1034 C; 991 G; 675 T; 0 other;
     agcttctggg cttccagacc cagctacttt gcggaactca gcaacccagg catctctgag        60
     tctccgccca agaccgggat gccccccagg aggtgtccgg gagcccagcc tttcccagat       120
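The feature-table locations in this entry can be checked with a little arithmetic. The sketch below is an illustration under stated assumptions: the helper names parse_location, spliced_length and introns_between are hypothetical, and only plain join(a..b,...) locations are handled (no complement(), no single-base positions, no < or > markers).

```python
import re

# Locations copied from the EMBL entry X02158 shown above.
CDS = "join(615..627,1194..1339,1596..1682,2294..2473,2608..2763)"
MRNA = "join(397..627,1194..1339,1596..1682,2294..2473,2608..3327)"


def parse_location(location):
    """Return the (start, end) pairs found in a simple a..b / join(...) location."""
    return [(int(a), int(b)) for a, b in re.findall(r"(\d+)\.\.(\d+)", location)]


def spliced_length(location):
    """Total number of bases covered by all segments (positions are inclusive)."""
    return sum(end - start + 1 for start, end in parse_location(location))


def introns_between(location):
    """Gaps between consecutive segments, i.e. the implied introns."""
    segments = parse_location(location)
    return [(prev_end + 1, next_start - 1)
            for (_, prev_end), (next_start, _) in zip(segments, segments[1:])]


cds_bases = spliced_length(CDS)
codons = cds_bases // 3
print(cds_bases, codons)          # coding bases and number of codons
print(introns_between(MRNA))      # should agree with the intron features
```

The CDS segments add up to 13 + 146 + 87 + 180 + 156 = 582 bases, i.e. 194 codons: 193 amino acids (the length of EPO_HUMAN in SWISS-PROT) plus the stop codon. The gaps between the mRNA segments reproduce the four intron features of the entry.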
GenBank entry: example

LOCUS       HSERPG       3398 bp    DNA             PRI       22-JUN-1993
DEFINITION  Human gene for erythropoietin.
ACCESSION   X02158
VERSION     X02158.1  GI:31224
KEYWORDS    erythropoietin; glycoprotein hormone; hormone; signal peptide.
SOURCE      human.
  ORGANISM  Homo sapiens
            Eukaryota; Metazoa; Chordata; Vertebrata; Mammalia; Eutheria;
            Primates; Catarrhini; Hominidae; Homo.
REFERENCE   1  (bases 1 to 3398)
  AUTHORS   Jacobs,K., Shoemaker,C., Rudersdorf,R., Neill,S.D., Kaufman,R.J.,
            Mufson,A., Seehra,J., Jones,S.S., Hewick,R., Fritsch,E.F.,
            Kawakita,M., Shimizu,T. and Miyake,T.
  TITLE     Isolation and characterization of genomic and cDNA clones of human
            erythropoietin
  JOURNAL   Nature 313 (6005), 806-810 (1985)
  MEDLINE   85137899
COMMENT     Data kindly reviewed (24-FEB-1986) by K. Jacobs.
FEATURES             Location/Qualifiers
     source          1..3398
                     /organism="Homo sapiens"
                     /db_xref="taxon:9606"
     mRNA            join(397..627,1194..1339,1596..1682,2294..2473,2608..3327)
     exon            397..627
                     /number=1
     sig_peptide     join(615..627,1194..1261)
     CDS             join(615..627,1194..1339,1596..1682,2294..2473,2608..2763)
                     /codon_start=1
                     /product="erythropoietin"
                     /protein_id="CAA26095.1"
                     /db_xref="GI:312304"
                     /db_xref="SWISS-PROT:P01588"
                     /translation="MGVHECPAWLWLLLSLLSLPLGLPVLGAPPRLICDSRVLQRYLL
                     EAKEAENITTGCAEHCSLNENITVPDTKVNFYAWKRMEVGQQAVEVWQGLALLSEAVL
                     RGQALLVNSSQPWEPLQLHVDKAVSGLRSLTTLLRALGAQKEAISPPDAASAAPLRTI
GenBank entry (cont.)

                     TADTFRKLFRVYSNFLRGKLKLYTGEACRTGDR"
     intron          628..1193
                     /number=1
     exon            1194..1339
                     /number=2
     mat_peptide     join(1262..1339,1596..1682,2294..2473,2608..2760)
                     /product="erythropoietin"
     intron          1340..1595
                     /number=2
     exon            1596..1682
                     /number=3
     intron          1683..2293
                     /number=3
     exon            2294..2473
                     /number=4
     intron          2474..2607
                     /number=4
     exon            2608..3327
                     /note="3' untranslated region"
                     /number=5
BASE COUNT      698 a   1034 c    991 g    675 t
ORIGIN
        1 agcttctggg cttccagacc cagctacttt gcggaactca gcaacccagg catctctgag
       61 tctccgccca agaccgggat gccccccagg aggtgtccgg gagcccagcc tttcccagat
      121 agcagctccg ccagtcccaa gggtgcgcaa ccggctgcac tcccctcccg cgacccaggg
      181 cccgggagca gcccccatga cccacacgca cgtctgcagc agccccgtca gccccggagc
      241 ctcaacccag gcgtcctgcc cctgctctga ccccgggtgg cccctacccc tggcgacccc
DDBJ entry: example

LOCUS       HSERPG       3398 bp    DNA             HUM       22-JUN-1993
DEFINITION  Human gene for erythropoietin.
ACCESSION   X02158
VERSION     X02158.1
KEYWORDS    erythropoietin; glycoprotein hormone; hormone; signal peptide.
SOURCE      human.
  ORGANISM  Homo sapiens
            Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Mammalia;
            Eutheria; Primates; Catarrhini; Hominidae; Homo.
REFERENCE   1  (bases 1 to 3398)
  AUTHORS   Jacobs,K., Shoemaker,C., Rudersdorf,R., Neill,S.D., Kaufman,R.J.,
            Mufson,A., Seehra,J., Jones,S.S., Hewick,R., Fritsch,E.F.,
            Kawakita,M., Shimizu,T. and Miyake,T.
  TITLE     Isolation and characterization of genomic and cDNA clones of human
            erythropoietin
  JOURNAL   Nature 313, 806-810(1985)
  MEDLINE   85137899
COMMENT     Data kindly reviewed (24-FEB-1986) by K. Jacobs
FEATURES             Location/Qualifiers
     source          1..3398
                     /db_xref="taxon:9606"
                     /organism="Homo sapiens"
     mRNA            join(397..627,1194..1339,1596..1682,2294..2473,2608..3327)
     CDS             join(615..627,1194..1339,1596..1682,2294..2473,2608..2763)
                     /db_xref="SWISS-PROT:P01588"
                     /product="erythropoietin"
                     /protein_id="CAA26095.1"
                     /translation="MGVHECPAWLWLLLSLLSLPLGLPVLGAPPRLICDSRVLQRYLLE
                     AKEAENITTGCAEHCSLNENITVPDTKVNFYAWKRMEVGQQAVEVWQGLALLSEAVLRG
                     QALLVNSSQPWEPLQLHVDKAVSGLRSLTTLLRALGAQKEAISPPDAASAAPLRTITAD
                     TFRKLFRVYSNFLRGKLKLYTGEACRTGDR"
DDBJ entry (cont.)
     mat_peptide     join(1262..1339,1596..1682,2294..2473,2608..2763)
                     /product="erythropoietin"
     sig_peptide     join(615..627,1194..1261)
     exon            397..627
                     /number=1
     intron          628..1193
                     /number=1
     exon            1194..1339
                     /number=2
     intron          1340..1595
                     /number=2
     exon            1596..1682
                     /number=3
     intron          1683..2293
                     /number=3
     exon            2294..2473
                     /number=4
     intron          2474..2607
                     /number=4
     exon            2608..3327
                     /note="3' untranslated region"
                     /number=5
BASE COUNT      698 a   1034 c    991 g    675 t
ORIGIN
        1 agcttctggg cttccagacc cagctacttt gcggaactca gcaacccagg catctctgag
       61 tctccgccca agaccgggat gccccccagg aggtgtccgg gagcccagcc tttcccagat
The tremendous increase in nucleotide sequences
[Figure: growth of EMBL data over time. Annotations: "first increase in data due to the development of PCR"; "1980: 80 genes fully sequenced !"]
EMBL divisions
EMBL has been divided into subdatabases to allow easier data management and searches

Organism divisions: fun, hum, inv, mam, org, phg, pln, pro, rod, syn, unc, vrl, vrt
Other divisions: est, gss, htg, sts, patent
RefSeq a SWISS-PROT clone?
The NCBI Reference Sequence project (RefSeq) will provide reference sequence standards for the naturally occurring molecules of the central dogma, from chromosomes to mRNAs to proteins. RefSeq standards provide a foundation for the functional annotation of the human genome. They provide a stable reference point for mutation analysis, gene expression studies, and polymorphism discovery.

Molecule            Accession format   Genome
Complete Genome     NC_######          Archaea, Bacterial, Organelle, Virus, Viroid
Complete Chrom.     NC_######          Eukaryote
Complete Sequence   NC_######          Plasmid
Genomic Contig      NT_######          Homo sapiens
mRNA                NM_######          Homo sapiens, Mus musculus, Rattus norvegicus
Protein             NP_######          All of the above
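Since the molecule type is encoded in the accession prefix, it can be recovered from the accession alone. A minimal sketch, assuming the two-letter prefix plus six digits shown in the table; the function name refseq_molecule and the example accession strings are purely illustrative.

```python
import re

# Prefix -> molecule type, following the RefSeq table above.
REFSEQ_PREFIXES = {
    "NC": "complete genome, chromosome or plasmid sequence",
    "NT": "genomic contig",
    "NM": "mRNA",
    "NP": "protein",
}


def refseq_molecule(accession):
    """Return the molecule type for a RefSeq-style accession, or None."""
    match = re.fullmatch(r"(N[CTMP])_\d{6}", accession)
    return REFSEQ_PREFIXES[match.group(1)] if match else None


print(refseq_molecule("NM_000799"))   # RefSeq-style accession
print(refseq_molecule("X02158"))      # EMBL/GenBank accession, not RefSeq
```

An accession that does not follow the RefSeq layout, such as the EMBL accession X02158 used in the examples above, is simply reported as None.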
RefSeq a SWISS-PROT clone?
RefSeq records are created via a process consisting of:

identifying sequences that represent distinct genes
establishing the correct gene name-to-accession number association
identifying the full extent of available sequence data
creating a new RefSeq record with a status of:
  PREDICTED
  PROVISIONAL
  REVIEWED
Provisional RefSeq records are reviewed by a biologist who confirms the initial name-to-sequence association, adds information including a summary of gene function, and, more importantly, corrects, re-annotates, or extends the sequence data using data available in other GenBank records.
Databases 2: genomics
Contain information on genes, gene location (mapping), gene nomenclature and links to sequence databases
Exist for most organisms important for life science research
Examples: MIM, GDB (human), MGD (mouse), FlyBase (Drosophila), SGD (yeast), MaizeDB (maize), SubtiList (B. subtilis), etc.
Format: generally relational (Oracle, Sybase or AceDb)
MIM

OMIM: Online Mendelian Inheritance in Man
A catalog of human genes and genetic disorders
Contains a summary of literature, pictures, and reference information
Also contains numerous links to articles and sequence information
MIM: example
*133170 ERYTHROPOIETIN; EPO
Alternative titles; symbols: EP
Table of contents: TEXT, REFERENCES, SEE ALSO, CONTRIBUTORS, CREATION DATE, EDIT HISTORY
Database Links
Gene Map Locus: 7q21
TEXT Human erythropoietin is an acidic glycoprotein hormone with molecular weight 34,000. As the prime regulator of red cell production, its major functions are to promote erythroid differentiation and to initiate hemoglobin synthesis. Sherwood and Shouval (1986) described a human renal carcinoma cell line that continuously produces erythropoietin. Eschbach et al. (1987) demonstrated the effectiveness of recombinant human erythropoietin in treating the anemia of end-stage renal disease. Lee-Huang (1984) cloned human erythropoietin cDNA in E. coli. McDonald et al. (1986) and Shoemaker and Mitsock (1986) cloned the mouse gene and the latter workers showed that coding DNA and amino acid sequence are about 80% conserved between man and mouse. This is a much higher order of conservation than for various interferons, interleukin-2, and GM-CSF.
Ensembl

Contains all the human genome DNA sequences currently available in the public domain.
Automated annotation: by using different software tools, features are identified in the DNA sequences:

  Genes (known or predicted)
  Single nucleotide polymorphisms (SNPs)
  Repeats
  Homologies
Created and maintained by the EBI and the Sanger Center (UK)
www.ensembl.org
Database 3: protein sequence

SWISS-PROT: created in 1986 (A. Bairoch)
TrEMBL: created in 1996; complement to SWISS-PROT; derived from automated EMBL CDS translations ("proteomic" version of EMBL)
GenPept: derived from automated GenBank CDS translations and journal scans ("proteomic" version of GenBank)
PIR: Protein Information Resource
MIPS: Martinsried Institute for Protein Sequences
PIR + PATCHX (supplement of unverified protein sequences from external sources)
Database 3: protein sequence
NRL-3D: produced by PIR from PDB (3D structure) sequences
Many specialized protein databases for specific families or groups of proteins.

Examples: YPD (yeast proteins), AMSDb (antibacterial peptides), GPCRDB (7 TM receptors), IMGT (immune system) etc.
SWISS-PROT

Collaboration between the SIB (CH) and EMBL/EBI (UK)
Annotated (manually), non-redundant, cross-referenced, documented protein sequence database
88 000 sequences from more than 6800 different species
70 000 references (publications)
550 000 cross-references (databases)
~200 Mb of annotations
Weekly releases; available from about 50 servers across the world, the main source being ExPASy
SWISS-PROT: example
Never changed
SWISS-PROT (cont.)
TrEMBL (Translation of EMBL)

Computer-annotated supplement to SWISS-PROT, as it is impossible to cope with the flow of data manually
Well-structured SWISS-PROT-like resource
Derived from automated EMBL CDS translation (maintained at the EBI (UK))
TrEMBL is automatically generated and annotated using software tools (not comparable with SWISS-PROT in terms of quality)
TrEMBL contains everything that is not yet in SWISS-PROT
Yerk!! But there is no choice, and these software tools are becoming quite good !
The simplified story of a Sprot entry

[Flow diagram, reconstructed from the slide labels]
cDNAs, genomes, ... -> EMBLnew -> EMBL
EMBL -> (CDS) -> TrEMBLnew -> TrEMBL
  Automatic: redundancy check (merge), InterPro (family attribution), annotation
TrEMBL -> SWISS-PROT
  Manual: redundancy (merge, conflicts), annotation, Sprot tools (macros), Sprot documentation, Medline, databases (MIM, MGD, ...), brainstorming
Once in Sprot, the entry is no longer in TrEMBL, but is still in EMBL (archive)
SWISS-PROT introduces a new arithmetical concept !
How many sequences in SWISS-PROT + TrEMBL ?
88000 + 300 000 = about 240000
SWISS-PROT and TrEMBL (SPTR): a minimum of redundancy
TrEMBL divisions
TrEMBL: SPTrEMBL + REMTrEMBL
SPTrEMBL: TrEMBL entries that will eventually be integrated into SWISS-PROT, but that have not yet been manually annotated
REMTrEMBL: sequences that are not destined to be included in SWISS-PROT:

  Immunoglobulins and T-cell receptors
  Synthetic sequences
  Patented sequences
  Small fragments (<8 aa)
  CDS not coding for real proteins
TrEMBLnew: updates to the latest release of TrEMBL
TrEMBL divisions
Subdivisions

Archae              arc
Fungus              fun
Human               hum
Invertebrate        inv
Mammals             mam
Major Hist. Comp.   mhc
Organelles          org
Phage               phg
Plant               pln
Prokaryote          pro
Rodent              rod
Uncommented         unc
Viral               vrl
Vertebrate          vrt
TrEMBL: example
GenPept (translation of GenBank)
GenPept is a protein database translated from the last release of GenBank (+ journal scans)
The current release has 484496 entries
In contrast to TrEMBL, it keeps all protein sequences, including small fragments (< 8 aa) and immunoglobulins
Redundancy: 20 entries for human EPO
GenPept: example
LOCUS       L33410_1     [HUMMLCMPL]
DEFINITION  Human c-mpl ligand (ML) mRNA, complete cds; erythropoietin
            homology domain bp 66..522.
DATE        07-JAN-1995
ACCESSION   L33410
NID
ORGANISM    Homo_SP_sapiens
            Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
            Mammalia; Eutheria; Primates; Catarrhini; Hominidae; Homo.
COMMENT
CDS         216..1277
            /gene="ML"
            /product="c-mpl ligand"
            /protein_id="AAA59857.1"
            /db_xref="GI:506827"
WEIGHT      37823
LENGTH      353
ORIGIN
        1 MELTELLLVV MLLLTARLTL SSPAPPACDL RVLSKLLRDS HVLHSRLSQC PEVHPLPTPV
       61 LLPAVDFSLG EWKTQMEETK AQDILGAVTL LLEGVMAARG QLGPTCLSSL LGQLSGQVRL
      121 LLGALQSLLG TQLPPQGRTT AHKDPNAIFL SFQHLLRGKV RFLMLVGGST LCVRRAPPTT
      181 AVPSRTSLVL TLNELPNRTS GLLETNFTAS ARTTGSGLLK WQQGFRAKIP GLLNQTSRSL
      241 DQIPGYLNRI HELLNGTRGL FPGPSRRTLG APDISSGTSD TGSLPPNLQP GYSPSPTHPP
      301 TGQYTLFPLP PTLPTPVVQL HPLLPDPSAP TPTPTSPLLN TSYTHSQNLS QEG
//
PIR

Protein Information Resource, created in 1984
Successor of the National Biomedical Research Foundation (NBRF) protein sequence database, developed in 1965 by M. O. Dayhoff (Atlas of Protein Sequence and Structure)
Maintained together with MIPS (Germany) and JIPID (Japan)
Provides some cross-referencing to EMBL/GenBank/DDBJ and PDB, GDB, FlyBase, OMIM, SGD, and MGD
In August 2000: 178050 entries
Redundancy: 3 entries for human EPO
PIR: example
>P1;ZUHU
erythropoietin precursor - human
C;Species: Homo sapiens (man)
C;Date: 27-Nov-1985 #sequence_revision 27-Nov-1985 #text_change 22-Jun-1999
C;Accession: A01855; A24744; A25384; A22210; S56178
R;Jacobs, K.; Shoemaker, C.; Rudersdorf, R.; Neill, S.D.; Kaufman, R.J.; Mufson, A.; Seehra, J.; Jones, S.S.; Hewick, R.; Fritsch, E.F.; Kawakita, M.; Shimizu, T.; Miyake, T.
Nature 313, 806-810, 1985
A;Title: Isolation and characterization of genomic and cDNA clones of human erythropoietin.
A;Reference number: A01855; MUID:85137899
A;Accession: A01855
A;Molecule type: mRNA; DNA
A;Residues: 1-193
A;Cross-references: GB:X02157; GB:X02158
R;Lin, F.K.; Suggs, S.; Lin, C.H.; Browne, J.K.; Smalling, R.; Egrie, J.C.; Chen, K.K.; Fox, G.M.; Martin, F.; Stabinsky, Z.; Badrawi, S.M.; Lai, P.H.; Goldwasser, E.
Proc. Natl. Acad. Sci. U.S.A. 82, 7580-7584, 1985
A;Title: Cloning and expression of the human erythropoietin gene.
A;Reference number: A24744; MUID:86067948
A;Accession: A24744
A;Molecule type: DNA
A;Residues: 1-193
A;Cross-references: GB:M11319; NID:g182197; PIDN:AAA52400.1; PID:g182198
R;Lai, P.H.; Everett, R.; Wang, F.F.; Arakawa, T.; Goldwasser, E.
J. Biol. Chem. 261, 3116-3121, 1986
A;Title: Structural characterization of human erythropoietin.
A;Reference number: A25384; MUID:86140080
A;Accession: A25384
A;Molecule type: protein
A;Residues: 28-86,'Q',87-193
A;Experimental source: urine
A;Note: forms without the carboxyl-terminal residue and the four carboxyl-terminal residues were observed
R;Yanagawa, S.; Hirade, K.; Ohnota, H.; Sasaki, R.; Chiba, H.; Ueda, M.; Goto, M.
J. Biol. Chem. 259, 2707-2710, 1984
A;Title: Isolation of human erythropoietin with monoclonal antibodies.
A;Reference number: A22210; MUID:84135751
PIR (cont.)
A;Accession: A22210
A;Molecule type: protein
A;Residues: 28-29,'X',31-33,'L',35-50,'X',52-53,'D',55,'G',57
R;Matsumoto, S.; Ikura, K.; Ueda, M.; Sasaki, R.
Plant Mol. Biol. 27, 1163-1172, 1995
A;Title: Characterization of a human glycoprotein (erythropoietin) produced in cultured tobacco cells.
A;Reference number: S56178; MUID:95284365
A;Accession: S56178
A;Molecule type: protein
A;Residues: 28-33,'X',35-37
C;Comment: Erythropoietin is produced by kidney or liver of adult mammals and by liver of fetal or neonatal mammals.
C;Genetics:
A;Gene: GDB:EPO
A;Cross-references: GDB:119110; OMIM:133170
A;Map position: 7q21.3-7q22.1
A;Introns: 5/1; 53/3; 82/3; 142/3
C;Function:
A;Description: the primary inducer of erythrocyte formation
C;Superfamily: erythropoietin
C;Keywords: erythropoiesis; glycoprotein; hormone; kidney; liver
F;1-27/Domain: signal sequence #status predicted
F;28-193/Product: erythropoietin #status experimental
F;34-188,56-60/Disulfide bonds: #status experimental
F;51,65,110/Binding site: carbohydrate (Asn) (covalent) #status experimental
F;153/Binding site: carbohydrate (Ser) (covalent) #status experimental
>P1;ZUHU
MGVHECPAWL WLLLSLLSLP LGLPVLGAPP RLICDSRVLE RYLLEAKEAE NITTGCAEHC
SLNENITVPD TKVNFYAWKR MEVGQQAVEV WQGLALLSEA VLRGQALLVN SSQPWEPLQL
HVDKAVSGLR SLTTLLRALG AQKEAISPPD AASAAPLRTI TADTFRKLFR VYSNFLRGKL
KLYTGEACRT GDR*
Composite protein sequence db

Different composite db use different primary sources and different redundancy criteria in their amalgamation procedures

Composite db   Primary sources (in redundancy priority order)
NRDB           PDB, SWISS-PROT, PIR, GenPept, SP update, GenPept update
OWL            SWISS-PROT, PIR, GenBank, NRL-3D
MIPSX          PIR, MIPS, NRL-3D, SWISS-PROT, EMBL translation, GenBank translation, Kabat (immuno), PseqIP
SPTrEMBL *     SWISS-PROT, SPTrEMBL, TrEMBLnew

* Also called SWall at EBI
SWIR: SPTrEMBL + Wormpep
Composite: protein family
The proteins/genes are classified by superfamily/family according to Blast/Fasta (homology) results
General:

  ProtFam: PIR
  ProtoMap: SWISS-PROT
  SYSTERS: SWISS-PROT and PIR (non redundant)
  ProClass: PIR and PROSITE
Species specific:
  HOVERGEN: vertebrates
  HOBACGEN: bacteria
  COG: complete organism genome
ProtoMap: example
ProtoMap (cont.)
Database 4: protein domain/family
Contain biologically significant patterns / profiles / HMMs, formulated in such a way that, with appropriate computational tools, one can rapidly and reliably determine to which known family of proteins (if any) a new sequence belongs
-> tools to identify the function of uncharacterized proteins translated from genomic or cDNA sequences ("functional diagnostic")
Protein domain/family

Most proteins have a modular structure
Estimation: ~ 3 domains / protein
Domains (conserved sequences or structures) are identified by multiple sequence alignments
Domains can be defined by different methods:

  Patterns (regular expressions): used for very conserved domains
  Profiles (weighted matrices): two-dimensional tables of position-specific match, gap, and insertion scores, derived from aligned sequence families; used for less conserved domains
  Hidden Markov Models (HMM): probabilistic models; another method to generate profiles
Some statistics
15 most common protein domains for H. sapiens (Incomplete)

Immunoglobulin and major histocompatibility complex domain
Eukaryotic protein kinase
Zinc finger, C2H2 type
Rhodopsin-like GPCR superfamily
Src homology 3 (SH3) domain
RNA-binding region RNP-1 (RNA recognition motif)
Fibronectin type III domain
Pleckstrin homology (PH) domain
Homeobox domain
Major histocompatibility complex protein, Class I
EF-hand family
EGF-like domain
RING finger
Cadherin domain
PDZ domain (also known as DHR or GLGF)
Serine proteases, trypsin family
http://www.ebi.ac.uk/proteome/HUMAN/interpro/top15d.html
Protein domain/family db
Secondary databases are the fruit of analyses of the sequences found in the primary db
Either manually curated (e.g. PROSITE, Pfam, etc.) or automatically generated (e.g. ProDom, DOMO)
Some depend on the method used to detect whether a protein belongs to a particular domain/family (patterns, profiles, HMM)
Protein domain/family db
Secondary db   Primary source       Information
PROSITE        SWISS-PROT           Patterns (regular expressions)
PROSITE        SWISS-PROT           Profiles (weighted matrices)
PRINTS         OWL and SWISS-PROT   Aligned motifs (fingerprints)
Pfam           SWISS-PROT           HMM (Hidden Markov Models)
BLOCKS         PROSITE/PRINTS       Aligned motifs
IDENTIFY       BLOCKS/PRINTS        Fuzzy regular expressions
Prosite
Created in 1988 (SIB)
Contains fully annotated functional domains, based on two methods: patterns and profiles
Entries are deposited in PROSITE in two distinct files:
  Patterns/profiles, with the lists of all matches in the parent version of SWISS-PROT
  Documentation
Aug 2000: contains 1064 documentation entries that describe 1424 different patterns, rules and profiles/matrices.
Prosite (pattern): example

(slide callouts: diagnostic performance = NR lines; list of matches = DR lines)

ID   EPO_TPO; PATTERN.
AC   PS00817;
DT   OCT-1993 (CREATED); NOV-1995 (DATA UPDATE); JUL-1998 (INFO UPDATE).
DE   Erythropoietin / thrombopoeitin signature.
PA   P-x(4)-C-D-x-R-[LIVM](2)-x-[KR]-x(14)-C.
NR   /RELEASE=38,80000;
NR   /TOTAL=14(14); /POSITIVE=14(14); /UNKNOWN=0(0); /FALSE_POS=0(0);
NR   /FALSE_NEG=0; /PARTIAL=1;
CC   /TAXO-RANGE=??E??; /MAX-REPEAT=1;
CC   /SITE=3,disulfide; /SITE=11,disulfide;
DR   P48617, EPO_BOVIN , T; P33707, EPO_CANFA , T; P33708, EPO_FELCA , T;
DR   P01588, EPO_HUMAN , T; P07865, EPO_MACFA , T; Q28513, EPO_MACMU , T;
DR   P07321, EPO_MOUSE , T; P49157, EPO_PIG   , T; P29676, EPO_RAT   , T;
DR   P33709, EPO_SHEEP , T; P42705, TPO_CANFA , T; P40225, TPO_HUMAN , T;
DR   P40226, TPO_MOUSE , T; P49745, TPO_RAT   , T;
DR   P42706, TPO_PIG   , P;
DO   PDOC00644;
//
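A PROSITE pattern (PA line) is close to a regular expression and can be translated into one. The converter below is a sketch, not the official PROSITE tool: the function name prosite_to_regex is hypothetical, and only the common syntax elements are handled (x, repeat counts such as (4) or (2,5), [..] allowed residues, {..} excluded residues, and the < and > anchors).

```python
import re


def prosite_to_regex(pattern):
    """Translate a PROSITE pattern (PA line) into a Python regular expression."""
    pattern = pattern.strip().rstrip(".")      # PA lines end with a period
    parts = []
    for element in pattern.split("-"):
        prefix = suffix = ""
        if element.startswith("<"):            # anchored at the N-terminus
            prefix, element = "^", element[1:]
        if element.endswith(">"):              # anchored at the C-terminus
            suffix, element = "$", element[:-1]
        # Split off an optional repeat count such as (4) or (2,5).
        match = re.fullmatch(r"(.+?)(?:\((\d+(?:,\d+)?)\))?", element)
        core, repeat = match.group(1), match.group(2)
        if core == "x":                        # any residue
            core = "."
        elif core.startswith("{") and core.endswith("}"):
            core = "[^" + core[1:-1] + "]"     # excluded residues
        parts.append(prefix + core + ("{" + repeat + "}" if repeat else "") + suffix)
    return "".join(parts)


# Pattern PS00817 and the EPO_HUMAN sequence, both taken from the entries above.
EPO_TPO = "P-x(4)-C-D-x-R-[LIVM](2)-x-[KR]-x(14)-C."
EPO_HUMAN = (
    "MGVHECPAWLWLLLSLLSLPLGLPVLGAPPRLICDSRVLERYLLEAKEAE"
    "NITTGCAEHCSLNENITVPDTKVNFYAWKRMEVGQQAVEVWQGLALLSEA"
    "VLRGQALLVNSSQPWEPLQLHVDKAVSGLRSLTTLLRALGAQKEAISPPD"
    "AASAAPLRTITADTFRKLFRVYSNFLRGKLKLYTGEACRTGDR"
)

regex = prosite_to_regex(EPO_TPO)
hit = re.search(regex, EPO_HUMAN)
print(regex)
print(hit.start() + 1, hit.end(), hit.group())   # 1-based start and end
```

On EPO_HUMAN the pattern matches residues 29 to 56. The two cysteines of the pattern (elements 3 and 11, the /SITE lines) fall on Cys-34 and Cys-56, which are the residues annotated with DISULFID features in the SWISS-PROT entry.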
Prosite (profile): example

PROSITE: PS50097

ID   BTB; MATRIX.
AC   PS50097;
DT   DEC-1999 (CREATED); DEC-1999 (DATA UPDATE); DEC-1999 (INFO UPDATE).
DE   BTB domain profile.
MA   /GENERAL_SPEC: ALPHABET='ABCDEFGHIKLMNPQRSTVWYZ'; LENGTH=67;
MA   /DISJOINT: DEFINITION=PROTECT; N1=6; N2=62;
MA   /NORMALIZATION: MODE=1; FUNCTION=LINEAR; R1=.9751; R2=.02068202; TEXT='-LogE';
MA   /CUT_OFF: LEVEL=0; SCORE=363; N_SCORE=8.5; MODE=1; TEXT='!';
MA   /CUT_OFF: LEVEL=-1; SCORE=267; N_SCORE=6.5; MODE=1; TEXT='?';
MA   /DEFAULT: D=-20; I=-20; B1=-50; E1=-50; MI=-105; MD=-105; IM=-105; DM=-105; MM=1; M0=-2;
MA   /I: B1=0; BI=-105; BD=-105;
MA   /M: SY='C'; M=-6,-10,28,-14,-9,-15,-20,-14,-19,-15,-17,-14,-8,-19,-14,-15,0,0,-9,-32,-17,-12;
MA   /M: SY='D'; M=-16,41,-28,53,15,-34,-11,-1,-33,0,-27,-25,21,-11,0,-8,2,-6,-26,-38,-19,7;
MA   /M: SY='V'; M=2,-23,-8,-28,-24,-1,-24,-25,16,-20,7,6,-20,-25,-23,-20,-10,-4,24,-23,-9,-24;
MA   /M: SY='T'; M=-2,-13,-18,-19,-13,-7,-24,-19,6,-8,-2,1,-11,-17,-11,-10,-1,10,10,-24,-6,-13;
MA   /M: SY='L'; M=-11,-30,-22,-33,-24,15,-32,-23,25,-29,35,17,-26,-27,-23,-22,-24,-9,16,-17,3,-24;
MA   /M: SY='V'; M=0,-11,-18,-13,-10,-12,-20,-13,1,-6,-4,2,-10,-19,-6,-7,-4,-2,8,-25,-9,-9;
MA   /M: SY='V'; M=1,-25,-3,-29,-25,-2,-26,-26,17,-22,10,7,-23,-25,-23,-22,-11,-3,24,-27,-10,-25;
MA   /M: SY='D'; M=-6,7,-26,8,7,-25,6,-7,-27,0,-23,-17,8,-13,0,-3,3,-6,-23,-27,-17,3;
MA   /I: I=-5; MI=0; IM=0; DM=-15; MD=-15;
MA   /M: SY='G'; M=-6,8,-27,8,-3,-27,22,-7,-30,-8,-26,-19,10,-14,-8,-9,2,-9,-24,-28,-21,-6;
MA   /M: SY='K'; M=-7,-4,-23,-4,7,-23,-13,-2,-21,10,-18,-9,-3,-12,7,9,-2,-4,-16,-25,-12,6;
MA   /M: SY='E'; M=-8,-6,-21,-8,1,-15,-21,-7,-7,-1,-10,-5,-3,-14,0,-1,-2,-2,-6,-26,-9,-1;
MA   /M: SY='F'; M=-12,-28,-22,-34,-26,31,-31,-21,18,-26,16,9,-22,-27,-27,-21,-20,-9,14,-6,13,-26;
MA   /M: SY='R'; M=-13,-9,-24,-10,-3,-11,-21,7,-17,7,-16,-4,-4,-8,2,9,-9,-9,-16,-20,-1,-2;
MA   /M: SY='A'; M=21,-15,-8,-22,-17,-10,-10,-23,0,-15,-5,-5,-14,-18,-17,-19,4,6,12,-24,-15,-17;
MA   /M: SY='H'; M=-15,5,-22,2,-1,-20,-16,65,-26,-8,-21,-5,15,-19,6,-2,-2,-11,-26,-32,7,0;
MA   /M: SY='K'; M=-12,-5,-29,-5,5,-25,-18,-8,-26,34,-24,-9,-1,-14,8,34,-8,-8,-17,-20,-10,5;
MA   /M: SY='A'; M=4,-12,-12,-16,-10,-6,-18,-14,-2,-13,-1,-2,-11,-17,-12,-13,-3,1,2,-24,-8,-11;
MA   /M: SY='V'; M=-7,-26,-19,-31,-26,7,-32,-24,27,-23,14,11,-22,-25,-23,-22,-13,0,28,-19,3,-26;
MA   /M: SY='L'; M=-10,-30,-20,-30,-21,9,-30,-20,22,-29,47,20,-29,-29,-20,-20,-29,-10,12,-20,0,-21;
MA   /M: SY='A'; M=18,-6,0,-12,-8,-18,-6,-16,-15,-10,-18,-12,-2,-14,-8,-13,18,11,-5,-32,-19,-8;
...
Prosite (profile): example (cont.)

MA   /M: SY='T'; M=-3,3,-16,1,-3,-18,-12,-9,-20,-6,-19,-15,2,-7,-6,-6,10,15,-13,-27,-12,-5;
MA   /M: SY='G'; M=-1,1,-25,2,-9,-26,31,-12,-32,-10,-26,-18,4,-17,-12,-10,1,-12,-24,-25,-22,-11;
MA   /M: SY='E'; M=-9,3,-24,4,13,-25,-16,-1,-24,13,-21,-13,3,-9,6,13,-3,-6,-20,-27,-13,8;
MA   /M: SY='I'; M=-6,-21,-18,-25,-21,-2,-29,-21,21,-21,14,10,-19,-24,-17,-19,-13,-3,19,-23,-3,-20;
MA   /M: SY='E'; M=-4,3,-23,3,4,-18,-11,-7,-17,-1,-18,-13,3,-9,-1,-5,1,-4,-14,-25,-11,1;
MA   /M: SY='I'; M=-8,-25,-23,-27,-20,1,-30,-21,21,-20,18,12,-22,-18,-18,-18,-18,-7,16,-21,-1,-20;
MA   /M: SY='P'; M=-6,0,-24,2,1,-22,-13,-8,-21,-2,-23,-15,1,14,-4,-7,3,2,-19,-31,-18,-3;
MA   /M: SY='E'; M=-7,1,-27,4,11,-24,-15,-4,-19,2,-18,-11,0,-1,6,-1,-2,-6,-19,-25,-14,7;
MA   /I: E1=0; IE=-105; DE=-105;
NR   /RELEASE=39,87397;
NR   /TOTAL=46(44); /POSITIVE=45(43); /UNKNOWN=1(1); /FALSE_POS=0(0);
NR   /FALSE_NEG=0; /PARTIAL=0;
CC   /TAXO-RANGE=??E?V; /MAX-REPEAT=2;
DR   O14867, BAC1_HUMAN, T; P97302, BAC1_MOUSE, T; P97303, BAC2_MOUSE, T;
DR   P41182, BCL6_HUMAN, T; P41183, BCL6_MOUSE, T; Q01295, BRC1_DROME, T;
DR   Q01296, BRC2_DROME, T; Q01293, BRC3_DROME, T; Q28068, CALI_BOVIN, T;
DR   Q13939, CALI_HUMAN, T; Q08605, GAGA_DROME, T; Q01820, GCL1_DROME, T;
DR   P10074, HKR3_HUMAN, T; Q04652, KELC_DROME, T; P42283, LOLL_DROME, T;
DR   P42284, LOLS_DROME, T; O14682, PI10_HUMAN, T; Q05516, PLZF_HUMAN, T;
DR   O43791, SPOP_HUMAN, T; P42282, TTKA_DROME, T; P17789, TTKB_DROME, T;
DR   P21073, VA55_VACCC, T; P24768, VA55_VACCV, T; P21037, VC02_VACCC, T;
DR   P17371, VC02_VACCV, T; P32228, VC04_SPVKA, T; P32206, VC13_SPVKA, T;
DR   P21013, VF03_VACCC, T; P24357, VF03_VACCV, T; P22611, VMT8_MYXVL, T;
DR   P08073, VMT9_MYXVL, T; O43167, Y441_HUMAN, T; Q10225, YAZ4_SCHPO, T;
DR   P40560, YIA1_YEAST, T; P34324, YKV2_CAEEL, T; P34371, YLJ8_CAEEL, T;
DR   P34568, YNV5_CAEEL, T; P41886, YPT9_CAEEL, T; Q09563, YR47_CAEEL, T;
DR   Q10017, YSW1_CAEEL, T; Q13105, Z151_HUMAN, T; Q60821, Z151_MOUSE, T;
DR   P24278, ZN46_HUMAN, T;
DR   Q13829, TNP1_HUMAN, ?;
DO   PDOC50097;
//
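Two details of the profile entry can be checked with a few lines of code: how a match position (/M line) scores a residue, and how the raw cut-off scores relate to the normalized ones. This is a sketch only. It assumes that each M= value corresponds, in order, to one letter of the ALPHABET string, and that FUNCTION=LINEAR means N_SCORE = R1 + R2 * SCORE; the second assumption reproduces both N_SCORE values given in the /CUT_OFF lines. The function names are hypothetical, and no alignment (insertions, deletions) is attempted.

```python
import re

ALPHABET = "ABCDEFGHIKLMNPQRSTVWYZ"   # from /GENERAL_SPEC
R1, R2 = 0.9751, 0.02068202           # from /NORMALIZATION

def parse_match_position(line: str) -> tuple[str, list[int]]:
    """Extract the consensus symbol and the per-residue scores from an /M line."""
    symbol = re.search(r"SY='(.)'", line).group(1)
    scores = [int(v) for v in re.search(r"M=([-\d,]+);", line).group(1).split(",")]
    if len(scores) != len(ALPHABET):
        raise ValueError("expected one score per alphabet letter")
    return symbol, scores

def residue_score(scores: list[int], residue: str) -> int:
    """Score of one residue at one match position."""
    return scores[ALPHABET.index(residue)]

def normalized_score(raw_score: float) -> float:
    """Linear normalization (assumed form: R1 + R2 * raw score)."""
    return R1 + R2 * raw_score

first_position = ("/M: SY='C'; M=-6,-10,28,-14,-9,-15,-20,-14,-19,-15,"
                  "-17,-14,-8,-19,-14,-15,0,0,-9,-32,-17,-12;")
symbol, scores = parse_match_position(first_position)
print(symbol, residue_score(scores, "C"), residue_score(scores, "W"))
print(round(normalized_score(363), 1), round(normalized_score(267), 1))
```

The consensus residue gets the highest score at its position (28 for C here), while a very different residue such as W is strongly penalized.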
PRINTS

Compendium of protein motif fingerprints
Most protein families are characterized by several conserved motifs
Fingerprint: set of motif(s) (simple or composite, such as multidomains) = signature of family membership
True family members exhibit all elements of the fingerprint, while subfamily members may possess only a part
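The last point (all motifs for true family members, only some for subfamily members) can be sketched as a simple decision rule. This is an illustration only and not the PRINTS search software; the function name and the motif labels are hypothetical.

```python
def classify_fingerprint_hit(fingerprint_motifs, motifs_found):
    """Compare the motifs found in a sequence with those of a fingerprint."""
    fingerprint = set(fingerprint_motifs)
    if not fingerprint:
        raise ValueError("a fingerprint needs at least one motif")
    found = fingerprint & set(motifs_found)
    if found == fingerprint:
        return "full match (family member)"
    if found:
        return "partial match (possible subfamily member)"
    return "no match"

# Hypothetical fingerprint made of three motifs.
fingerprint = ["motif1", "motif2", "motif3"]
print(classify_fingerprint_hit(fingerprint, ["motif1", "motif2", "motif3"]))
print(classify_fingerprint_hit(fingerprint, ["motif2"]))
```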
ProDom
Consists of an automated compilation of homologous domain alignments (procedure based on PSI-BLAST searches)
Updating problem!
Last ProDom update: February 7, 2000, built from SWISS-PROT 38 + TrEMBL + TrEMBL updates (October 22, 1999)
ProDom: example
[Figure: ProDom output, with the submitted sequence labelled "Your query"]
Protein domain/family: Composite databases

Example: InterPro
Unification of PROSITE, PRINTS, Pfam and ProDom into an integrated resource of protein families, domains and functional sites;
Single set of documents linked to the various methods;
Will be used to improve the functional annotation of SWISS-PROT (classification of unknown proteins)
This release contains 3052 entries, representing 574 domains, 2418 families, 46 repeats and 14 post-translational modification sites.
InterPro: example
IPR001323
Name: Erythropoietin/thrombopoeitin
Type: Family
Abstract: Erythropoietin, a plasma glycoprotein, is the primary physiological mediator of erythropoiesis [1]. It is involved in the regulation of the level of peripheral erythrocytes by stimulating the differentiation of erythroid progenitor cells, found in the spleen and bone marrow, into mature erythrocytes [2]. It is primarily produced in adult kidneys and foetal liver, acting by attachment to specific binding sites on erythroid progenitor cells, stimulating their differentiation [3]. Severe kidney dysfunction causes reduction in the plasma levels of erythropoietin, resulting in chronic anaemia - injection of purified erythropoietin into the blood stream can help to relieve this type of anaemia. Levels of erythropoietin in plasma fluctuate with varying oxygen tension of the blood, but androgens and prostaglandins also modulate the levels to some extent [3]. Erythropoietin glycoprotein sequences are well conserved, a consequence of which is that the hormones are cross-reactive among mammals, i.e. that from one species, say human, can stimulate erythropoiesis in other species, say mouse or rat [4]. Thrombopoeitin (TPO), a glycoprotein, is the mammalian hormone which functions as a megakaryocytic lineage specific growth and differentiation factor affecting the proliferation and maturation from their committed progenitor cells acting at a late stage of megakaryocyte development. It acts as a circulating regulator of platelet numbers.
InterPro: example (cont.)
...
Example list: P33708, P33709, P49745 (view matches for the examples)
Publications:
1. Shoemaker C.B., Mitsock L.D. 849-858 (1986)
2. Takeuchi M., Takasaki S., Miyazaki H., Kato T., Hoshi S., Kochibe N., Kobata A. J. Biol. Chem. 263: 3657-3663 (1988)
3. Lin F.K., Lin C.H., Lai P.H., Browne J.K., Egrie J.C., Smalling R., Fox G.M., Chen K.K., Castro M., Suggs S. Gene 44: 201-209 (1986)
4. Nagao M., Suga H., Okano M., Masuda S., Narita H., Ikura K., Sasaki R. Nucleotide sequence of rat erythropoietin. 1171: 99-102 (1992)
Children: IPR003013
Signatures: PROSITE PS00817 EPO_TPO; PFAM PF00758 EPO_TPO
Matches: Table, Graphical
Databases 5: mutation/polymorphism

Contain information on sequence variations, whether or not they are linked to genetic diseases;
Mainly human, but see also OMIA - Online Mendelian Inheritance in Animals

General db:
OMIM
HGMD - Human Gene Mutation db
SVD - Sequence variation db
HGBASE - Human Genic Bi-Allelic Sequences db
dbSNP - Human single nucleotide polymorphism (SNP) db

Disease-specific db: most of these databases are linked either to a single gene or to a single disease;
p53 mutation db
ADB - Albinism db (mutations in human genes causing albinism)
Asthma and Allergy gene db.
Mutation/polymorphisms: definitions

SNPs: single nucleotide polymorphisms
c-SNPs: coding single nucleotide polymorphisms (single nucleotide polymorphisms within cDNA sequences)
SAPs: single amino-acid polymorphisms
Missense mutation -> SAP
Nonsense mutation -> STOP
Insertion/deletion of nucleotides -> frameshift (unless the number of nucleotides is a multiple of 3)!
Numbering of the mutation depends on the db (aa no. 1 is not necessarily the initiator Met!)
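The definitions above can be sketched as a small classifier. It is an illustration under simplifying assumptions: the standard genetic code, a single codon considered in isolation, and no special handling of mutations that remove a stop codon. The function names are hypothetical.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order; '*' marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}

def classify_substitution(ref_codon: str, alt_codon: str) -> str:
    """Classify a nucleotide substitution by its effect on one codon."""
    ref_aa = CODON_TABLE[ref_codon.upper()]
    alt_aa = CODON_TABLE[alt_codon.upper()]
    if ref_aa == alt_aa:
        return "silent"
    if alt_aa == "*":
        return "nonsense (STOP)"
    return f"missense (SAP {ref_aa}->{alt_aa})"

def classify_indel(n_nucleotides: int) -> str:
    """An insertion or deletion shifts the reading frame unless its length is a multiple of 3."""
    return "in-frame" if n_nucleotides % 3 == 0 else "frameshift"

print(classify_substitution("GAG", "GTG"))   # Glu -> Val
print(classify_substitution("TGG", "TGA"))   # Trp -> stop
print(classify_indel(1), classify_indel(3))
```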
Mutation/polymorphisms
SNP Consortium http://snp.cshl.org/
Bayer, Roche, IBM, Pfizer, Novartis, Motorola
Mission: develop up to 300,000 SNPs distributed evenly throughout the human genome and make the information related to these SNPs available to the public without intellectual property restrictions.
The project started in April 1999 and is anticipated to continue until the end of 2001.

dbSNP at NCBI http://www.ncbi.nlm.nih.gov/SNP/
Collaboration between the National Human Genome Research Institute and the National Center for Biotechnology Information (NCBI)
Mission: central repository for both single-base nucleotide substitutions and short deletion and insertion polymorphisms
As of Aug 24, 2000, dbSNP has submissions for 803557 SNPs.

Chromosome 21 dbSNP http://csnp.isb-sib.ch/
A joint project between the Division of Medical Genetics of the University of Geneva Medical School and the SIB
Mission: comprehensive cSNP (single nucleotide polymorphisms within cDNA sequences) database and map of chromosome 21
Mutation/polymorphisms
Very heterogeneous format;
Generally modest size;

There are initiatives to standardize and to unify these databases (SVD - Sequence Variation Database project at EBI: HMutDB)
Databases 6: proteomics
Contain information obtained by 2D-PAGE: master images of the gels and description of identified proteins
Examples: SWISS-2DPAGE, ECO2DBASE, Maize2DPAGE, Sub2D, Cyano2DBase, etc.
Format: composed of image and text files
Most 2D-PAGE databases are federated and use SWISS-PROT as a master index
There is currently no protein Mass Spectrometry (MS) database (not for long)
Databases 7: 3D structure
Contain the spatial coordinates of macromolecules whose 3D structure has been obtained by X-ray or NMR studies
Proteins represent more than 90% of available structures (others are DNA, RNA, sugars, viruses, protein/DNA complexes)
PDB (Protein Data Bank), SCOP (structural classification of proteins, according to the secondary structures), BMRB (BioMagResBank; NMR results)
Future: homology-derived 3D structure db.
PDB

Protein Data Bank, managed by RCSB
Currently there are ~13000 structures for about 4000 different molecules, but far fewer protein families!
There are also databases that contain data derived from PDB. Examples: HSSP (homology-derived secondary structure of proteins), SWISS-3DIMAGE (images)
[Figure: restriction enzyme]
PDB: example
HEADER    LYASE(OXO-ACID) 01-OCT-91 12CA 12CA 2
COMPND    CARBONIC ANHYDRASE /II (CARBONATE DEHYDRATASE) (/HCA II) 12CA 3
COMPND   2 (E.C.4.2.1.1) MUTANT WITH VAL 121 REPLACED BY ALA (/V121A) 12CA 4
SOURCE    HUMAN (HOMO SAPIENS) RECOMBINANT PROTEIN 12CA 5
AUTHOR    S.K.NAIR,D.W.CHRISTIANSON 12CA 6
REVDAT   1 15-OCT-92 12CA 0 12CA 7
JRNL      AUTH S.K.NAIR,T.L.CALDERONE,D.W.CHRISTIANSON,C.A.FIERKE 12CA 8
JRNL      TITL ALTERING THE MOUTH OF A HYDROPHOBIC POCKET. 12CA 9
JRNL      TITL 2 STRUCTURE AND KINETICS OF HUMAN CARBONIC ANHYDRASE 12CA 10
JRNL      TITL 3 /II$ MUTANTS AT RESIDUE VAL-121 12CA 11
JRNL      REF J.BIOL.CHEM. V. 266 17320 1991 12CA 12
JRNL      REFN ASTM JBCHA3 US ISSN 0021-9258 071 12CA 13
REMARK   1 12CA 14
REMARK   2 12CA 15
REMARK   2 RESOLUTION. 2.4 ANGSTROMS. 12CA 16
REMARK   3 12CA 17
REMARK   3 REFINEMENT. 12CA 18
REMARK   3 PROGRAM PROLSQ 12CA 19
REMARK   3 AUTHORS HENDRICKSON,KONNERT 12CA 20
REMARK   3 R VALUE 0.170 12CA 21
REMARK   3 RMSD BOND DISTANCES 0.011 ANGSTROMS 12CA 22
REMARK   3 RMSD BOND ANGLES 1.3 DEGREES 12CA 23
REMARK   4 12CA 24
REMARK   4 N-TERMINAL RESIDUES SER 2, HIS 3, HIS 4 AND C-TERMINAL 12CA 25
REMARK   4 RESIDUE LYS 260 WERE NOT LOCATED IN THE DENSITY MAPS AND, 12CA 26
REMARK   4 THEREFORE, NO COORDINATES ARE INCLUDED FOR THESE RESIDUES. 12CA 27
PDB (cont.)
SHEET 3 S10 PHE 66 PHE 70 -1 O ASN 67 N LEU 60 12CA 68
SHEET 4 S10 TYR 88 TRP 97 -1 O PHE 93 N VAL 68 12CA 69
SHEET 5 S10 ALA 116 ASN 124 -1 O HIS 119 N HIS 94 12CA 70
SHEET 6 S10 LEU 141 VAL 150 -1 O LEU 144 N LEU 120 12CA 71
SHEET 7 S10 VAL 207 LEU 212 1 O ILE 210 N GLY 145 12CA 72
SHEET 8 S10 TYR 191 GLY 196 -1 O TRP 192 N VAL 211 12CA 73
SHEET 9 S10 LYS 257 ALA 258 -1 O LYS 257 N THR 193 12CA 74
SHEET 10 S10 LYS 39 TYR 40 1 O LYS 39 N ALA 258 12CA 75
TURN 1 T1 GLN 28 VAL 31 TYPE VIB (CIS-PRO 30) 12CA 76
TURN 2 T2 GLY 81 LEU 84 TYPE II(PRIME) (GLY 82) 12CA 77
TURN 3 T3 ALA 134 GLN 137 TYPE I (GLN 136) 12CA 78
TURN 4 T4 GLN 137 GLY 140 TYPE I (ASP 139) 12CA 79
TURN 5 T5 THR 200 LEU 203 TYPE VIA (CIS-PRO 202) 12CA 80
TURN 6 T6 GLY 233 GLU 236 TYPE II (GLY 235) 12CA 81
CRYST1 42.700 41.700 73.000 90.00 104.60 90.00 P 21 2 12CA 82
ORIGX1 1.000000 0.000000 0.000000 0.00000 12CA 83
ORIGX2 0.000000 1.000000 0.000000 0.00000 12CA 84
ORIGX3 0.000000 0.000000 1.000000 0.00000 12CA 85
SCALE1 0.023419 0.000000 0.006100 0.00000 12CA 86
SCALE2 0.000000 0.023981 0.000000 0.00000 12CA 87
SCALE3 0.000000 0.000000 0.014156 0.00000 12CA 88
ATOM 1 N TRP 5 8.519 -0.751 10.738 1.00 13.37 12CA 89
ATOM 2 CA TRP 5 7.743 -1.668 11.585 1.00 13.42 12CA 90
ATOM 3 C TRP 5 6.786 -2.502 10.667 1.00 13.47 12CA 91
ATOM 4 O TRP 5 6.422 -2.085 9.607 1.00 13.57 12CA 92
ATOM 5 CB TRP 5 6.997 -0.917 12.645 1.00 13.34 12CA 93
ATOM 6 CG TRP 5 5.784 -0.209 12.221 1.00 13.40 12CA 94
ATOM 7 CD1 TRP 5 5.681 1.084 11.797 1.00 13.29 12CA 95
ATOM 8 CD2 TRP 5 4.417 -0.667 12.221 1.00 13.34 12CA 96
ATOM 9 NE1 TRP 5 4.388 1.418 11.515 1.00 13.30 12CA 97
ATOM 10 CE2 TRP 5 3.588 0.375 11.797 1.00 13.35 12CA 98
ATOM 11 CE3 TRP 5 3.837 -1.877 12.645 1.00 13.39 12CA 99
ATOM 12 CZ2 TRP 5 2.216 0.208 11.656 1.00 13.39 12CA 100
ATOM 13 CZ3 TRP 5 2.465 -2.043 12.504 1.00 13.33 12CA 101
ATOM 14 CH2 TRP 5 1.654 -1.001 12.009 1.00 13.34 12CA 102
.
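The ATOM records carry, in order: atom serial number, atom name, residue name, residue number, x, y, z coordinates (in Angstroms), occupancy and temperature factor. Below is a minimal sketch that reads the first two records as shown on the slide and computes a bond length. It assumes whitespace-separated fields, which is enough for this excerpt; real PDB files are fixed-column and a robust parser should slice by column instead. The class and function names are hypothetical.

```python
import math
from dataclasses import dataclass

@dataclass
class Atom:
    serial: int
    name: str
    res_name: str
    res_seq: int
    x: float
    y: float
    z: float
    occupancy: float
    b_factor: float

def parse_atom_line(line: str) -> Atom:
    """Parse one whitespace-separated ATOM record (old-style, no chain identifier)."""
    fields = line.split()
    if len(fields) < 10 or fields[0] != "ATOM":
        raise ValueError(f"not an ATOM record: {line!r}")
    return Atom(int(fields[1]), fields[2], fields[3], int(fields[4]),
                float(fields[5]), float(fields[6]), float(fields[7]),
                float(fields[8]), float(fields[9]))

def distance(a: Atom, b: Atom) -> float:
    """Euclidean distance between two atoms, in Angstroms."""
    return math.dist((a.x, a.y, a.z), (b.x, b.y, b.z))

records = [
    "ATOM 1 N TRP 5 8.519 -0.751 10.738 1.00 13.37 12CA 89",
    "ATOM 2 CA TRP 5 7.743 -1.668 11.585 1.00 13.42 12CA 90",
]
n_atom, ca_atom = (parse_atom_line(r) for r in records)
print(f"N-CA distance in TRP 5: {distance(n_atom, ca_atom):.2f} A")
```

The computed distance is close to 1.47 A, the usual length of the N-CA bond in a peptide backbone, which is a quick sanity check that the coordinates were read in the right order.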
Databases 8: metabolic

Contain information that describes enzymes, biochemical reactions and metabolic pathways;
ENZYME and BRENDA: nomenclature databases that store information on enzyme names and reactions;
Examples of metabolic databases: EcoCyc (specialized on Escherichia coli), KEGG, EMP/WIT;
Usually these databases are tightly coupled with query software that allows the user to visualise reaction schemes.
Databases 9: bibliographic
Bibliographic reference databases contain citations and abstracts of published life science articles;
Example: Medline
Other, more specialized databases also exist (example: Agricola).
Medline
MEDLINE covers the fields of medicine, nursing, dentistry, veterinary medicine, the health care system, and the preclinical sciences
Indexes more than 4,000 biomedical journals published in the United States and 70 other countries
Contains over 10 million citations, from 1966 to the present
Contains links to biological db and to some journals
New records are added to PreMEDLINE daily!

Many papers not dealing with human are not in Medline!
For papers published before 1970, only the first 10 authors are kept!
Not all journals have citations going back to 1966!
Medline/Pubmed
PubMed is developed by the National Center for Biotechnology Information (NCBI)
PubMed provides access to bibliographic information such as MEDLINE, PreMEDLINE, HealthSTAR, and to integrated molecular biology databases (composite db)
PMID: 10923642 (PubMed ID), UI: 20378145 (Medline ID)
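A PMID is enough to retrieve a citation programmatically. The sketch below only builds the request URL for NCBI's E-utilities `efetch` service, which is not mentioned in the slides and is brought in here as an assumption about how one might query PubMed today; the function name is hypothetical and no network request is made.

```python
from urllib.parse import urlencode

EFETCH_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"

def pubmed_efetch_url(pmid: int) -> str:
    """Build an E-utilities efetch URL that returns one PubMed record as XML."""
    query = urlencode({"db": "pubmed", "id": str(pmid), "retmode": "xml"})
    return f"{EFETCH_BASE}?{query}"

# PMID taken from the slide above.
url = pubmed_efetch_url(10923642)
print(url)
```

The resulting URL can be opened in a browser or passed to any HTTP client; NCBI asks users of this service to respect its usage guidelines (request rate, identification of the client).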
Databases 10: others

There are many databases that cannot be classified in the categories listed previously;
Examples: REBASE (restriction enzymes), TRANSFAC (transcription factors), O-GLYCBASE (O-linked sugars), protein-protein interactions db (DIP), biotechnology patents db, etc.;
As well as many other resources concerning any aspect of macromolecules and molecular biology.
Proliferation of databases

What is the best db for sequence analysis?
Which contains the highest-quality data?
Which is the most comprehensive?
Which is the most up-to-date?
Which is the least redundant?
Which is the best indexed (allows complex queries)?
Which Web server responds most quickly?
...?
Some important practical remarks

Databases contain many errors (automated annotation)!
Not all db are available on all servers
The update frequency is not the same for all servers; creation of db_new between releases (example: EMBLnew, TrEMBLnew)
Some servers automatically add useful cross-references to an entry (implicit links) in addition to the already existing links (explicit links)
Database retrieval tools
Sequence Retrieval System (SRS, Europe): allows any flat-file db to be indexed to any other; allows queries to be formulated across a wide range of different db types via a single interface, without worrying about data structures or query languages
Entrez (USA): less flexible than SRS, but exploits the concept of "neighbouring", which allows related articles in different db to be linked together, whether or not they are cross-referenced directly
ATLAS: specific for macromolecular sequence db (i.e. NRL-3D)
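The idea behind indexing a flat-file db can be sketched in a few lines: split the file into entries at the "//" terminator and build a lookup table from accession number to entry. This is only an illustration of the principle, not how SRS itself is implemented; the function name is hypothetical, and the two abridged entries reuse the ID, AC and DE lines of the PROSITE examples shown earlier.

```python
def index_flat_file(text: str) -> dict[str, str]:
    """Map each accession number (AC line) to the full text of its entry."""
    index = {}
    for raw_entry in text.split("\n//"):
        entry = raw_entry.strip()
        if not entry:
            continue  # nothing after the last terminator
        for line in entry.splitlines():
            if line.startswith("AC"):
                accession = line[2:].strip().rstrip(";")
                index[accession] = entry
                break
    return index

FLAT_FILE = """\
ID   EPO_TPO; PATTERN.
AC   PS00817;
DE   Erythropoietin / thrombopoeitin signature.
//
ID   BTB; MATRIX.
AC   PS50097;
DE   BTB domain profile.
//
"""

index = index_flat_file(FLAT_FILE)
print(sorted(index))
print(index["PS50097"].splitlines()[0])
```

Once such an index exists, an entry is found by a single dictionary lookup instead of a scan of the whole file; cross-database queries work by building similar indexes on the cross-reference lines.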
More informations about SWISS-PROT
The golden goals of SWISS-PROT

Annotated / curated
Complete
Non-redundant
Highly cross-referenced
Available from a variety of servers and through sequence analysis software tools
Associated with a wide range of documentation
Review: Protein sequence databases R. Apweiler (2000), Adv. in protein chemistry, 54, 31-70
SWISS-PROT: species

6840 different species
20 species represent about 45% of all sequences in the database
5000 species are represented by only one to three sequences; in most cases, these sequences were obtained in the context of a phylogenetic study
SWISS-PROT: cross-references

SWISS-PROT was the first database with cross-references.
Explicitly cross-referenced to 34 databases
Cross-references to DNA (EMBL/GenBank/DDBJ), 3D structure (PDB), literature (Medline), genomic (MIM, MGD, FlyBase, SGD, SubtiList, etc.), 2D-gel (SWISS-2DPAGE) and specialized db (PROSITE, TRANSFAC)
Implicitly cross-referenced to additional db on the WWW (GeneCards, PRODOM, etc.)
Annotations

Function(s)
Post-translational modifications (PTM)
Domains
Quaternary structure
Similarities
Diseases, mutagenesis
Conflicts, variants
Cross-references
A Swiss-Prot entry
Sprot entry (cont.)
Future for human proteins

Original estimate: from 70000 to 100000 genes
Incyte recently announced an estimate of 140000 genes
More recent estimates give about 30000 to 40000 genes
C. elegans and Drosophila have ~15000 genes. There were two rounds of genome duplication in the evolutionary history leading to vertebrates. Very roughly, this means that:
Human genes = ~15000 x 2 x 2 = ~60000 genes - losses + new genes
But more than 1 million proteins! (due to PTM, alternative products, variants)
Genesweep
http://www.ensembl.org/genesweep.html
What after genomes?

Proteome projects are an essential tool for the understanding of real proteins
There will be a flood of characterization data (MS, 2D) that will be the equivalent of ESTs at the protein level
Protein databases are going to be more and more important for new biological studies
Databases in GCG
DNA: EMBL, EPD, RepBase, vectordb (NCBI)
Protein: Swiss-Prot, TrEMBL, PDB
Other: PROSITE, REBASE
How to access databases in GCG?

Fetch or typedata ? Stringsearch Name Lookup (based on SRS)
Useful to generate list files

