• Bioinformatics is an interdisciplinary field mainly involving
molecular biology and genetics, computer science, mathematics,
and statistics.
• It applies computational techniques to solve biological problems:
1. data problems: representation (graphics), storage and retrieval
(databases), analysis (statistics, artificial intelligence,
optimization, etc.)
2. biology problems: sequence analysis, structure or function
prediction, data mining, etc. (also called computational biology)
• The National Center for Biotechnology Information (NCBI 2001)
defines bioinformatics as "the field of science in which biology,
computer science, and information technology merge into a single
discipline."

• There are three important sub-disciplines within bioinformatics:
1. the development of new algorithms and statistics which assess
relationships among members of large data sets
2. the analysis and interpretation of various types of data,
including nucleotide and amino acid sequences, protein domains,
and protein structures
3. the development and implementation of tools that enable
efficient access and management of different types of information.
• Types of datasets: genome sequences, macromolecular
structures, and functional genomics experiments (e.g.
microarray data)
• Other data: phylogenetic and metabolic pathway analysis,
the text of scientific papers, and plant varietal information
and statistics.

• Analysis of biological data requires the application of a large
number of techniques, such as primary sequence alignment,
protein 3D structure alignment, phylogenetic tree
construction, prediction and classification of protein
structure, prediction of RNA structure, prediction of protein
function, and expression data clustering.
• Development of suitable algorithms is an important part of
bioinformatics

• These techniques and algorithms were developed specifically
for the analysis of biological data; for instance, the dynamic
programming algorithm for sequence alignment is among the
methods most widely used by biologists

• The sequence information generated worldwide is stored
systematically in different types of databases
• Hence, it is necessary to understand the databases
and their different types
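
The dynamic programming approach to sequence alignment mentioned above can be illustrated with a minimal sketch of a Needleman-Wunsch style global alignment score. The function name and the scoring values (match, mismatch, gap) are assumptions chosen for illustration, not values given in these slides.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    """Best global alignment score of sequences a and b.

    The scoring parameters are illustrative assumptions.
    """
    rows, cols = len(a) + 1, len(b) + 1
    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * gap          # a[:i] aligned against gaps only
    for j in range(1, cols):
        score[0][j] = j * gap          # b[:j] aligned against gaps only
    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + pair,  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,       # gap in b
                score[i][j - 1] + gap,       # gap in a
            )
    return score[-1][-1]
```

Each cell depends only on its three neighbours, so the table is filled once, in time proportional to the product of the two sequence lengths.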
Pattern recognition
• The initiation of transcription or translation is
determined by the presence of specific patterns in DNA or
RNA, called motifs.
• Research on detecting specific patterns in DNA sequences,
such as genes, protein coding regions, promoters, etc., helps
uncover functional aspects of cells.
• Patterns are used in database searching, e.g. BLOCKS in
protein databases
• Pattern searching on BLAST and FASTA for the closest
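
Motif searching of the kind described above can be sketched with a regular expression built from a consensus string. This is a minimal illustration: the helper name `find_motif`, the subset of IUPAC ambiguity codes, and the example consensus are assumptions, and database tools such as BLAST and FASTA work very differently.

```python
import re

# Subset of IUPAC nucleotide codes mapped to regex character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "W": "[AT]", "S": "[CG]", "R": "[AG]", "Y": "[CT]", "N": "[ACGT]",
}

def find_motif(seq, consensus):
    """Return 0-based start positions of every (possibly overlapping) match."""
    pattern = "".join(IUPAC[c] for c in consensus.upper())
    # A lookahead lets overlapping occurrences be reported as well.
    return [m.start() for m in re.finditer("(?=" + pattern + ")", seq.upper())]

# Example: a TATA-box-like consensus, where W stands for A or T.
positions = find_motif("GGTATAAATGG", "TATAWAW")
```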

Gene features                        DNA characteristics
Coding sequences                     ORFs, GC rich, CpG content
Translational start and stop sites   Start: ATG; Stop: TAA, TAG, TGA
Splice sites (exon/intron)           Consensus sequences
Promoter regions                     TATA, Shine-Dalgarno, Pribnow, Kozak consensus, CpG content
Poly A signals                       Consensus sequence, 10-20 bases upstream of poly A tail
Prokaryotic gene structure

[Figure: prokaryotic gene structure showing the TATA box, start codon, ORF (open reading frame), stop codon, and reading frames 1 to 3]
• Advantages
– Simple gene structure
– Small genomes (0.5 to 10 million bp)
– No introns
– Genes are called Open Reading Frames (ORFs)
– High coding density (>90%)
• Disadvantages
– Some genes overlap (nested)
– Some genes are quite short (<60 bp)
Gene finding approaches
1) Rule-based (e.g., start & stop codons)
2) Content-based (e.g., codon bias, promoter sites)
3) Similarity-based (e.g., orthologs)
4) Pattern-based (e.g., machine-learning)
5) Ab-initio methods (FFT)

Simple rule-based gene finding
• Look for a putative start codon (ATG)
• Staying in the same frame, scan in groups of three
until a stop codon is found
• If the number of codons is >= 50, assume it is a gene
• At the end of the chromosome, repeat the process for
the reverse complement
Example ORF
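
The rule-based scan described above can be sketched as follows. This is a minimal illustration under stated assumptions: the function names are hypothetical, the codon count includes the start codon but not the stop codon, and coordinates on the "-" strand refer to the reverse-complemented sequence.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Complement each base, then reverse the string."""
    return seq.translate(COMPLEMENT)[::-1]

def find_orfs(seq, min_codons=50):
    """Return (strand, start, end) for ORFs with at least min_codons codons.

    start/end are 0-based, end-exclusive, and include the stop codon.
    """
    seq = seq.upper()
    orfs = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            i = frame
            while i + 3 <= len(s):
                if s[i:i + 3] == "ATG":
                    # Stay in frame and scan codon by codon for a stop.
                    j = i + 3
                    while j + 3 <= len(s) and s[j:j + 3] not in STOP_CODONS:
                        j += 3
                    if j + 3 > len(s):
                        break          # ran off the end without a stop codon
                    if (j - i) // 3 >= min_codons:
                        orfs.append((strand, i, j + 3))
                    i = j + 3          # resume scanning after the stop codon
                else:
                    i += 3
    return orfs
```

For a toy sequence, a call such as `find_orfs("ATGGCTGCTGCTTAA", min_codons=4)` reports the single forward-strand ORF; with the default threshold of 50 codons nothing that short is reported.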
Content-based gene prediction method

• RNA polymerase promoter site (-10, -35 site
or TATA box)
• Shine-Dalgarno sequence (+10, Ribosome
Binding Site) to initiate protein translation
• Codon biases
• High GC content
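
Two of the content signals listed above, GC content and codon bias, reduce to simple counting. The sketch below is illustrative only: the function names are assumptions, and a real content-based predictor would compare these statistics against values learned from known genes of the organism.

```python
from collections import Counter

def gc_content(seq):
    """Fraction of bases that are G or C (0.0 for an empty sequence)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)

def codon_usage(seq):
    """Relative frequency of each codon, reading from position 0 in frame."""
    seq = seq.upper()
    usable = len(seq) - len(seq) % 3          # ignore a trailing partial codon
    codons = [seq[i:i + 3] for i in range(0, usable, 3)]
    counts = Counter(codons)
    return {codon: n / len(codons) for codon, n in counts.items()}
```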
Similarity-based gene finding
• Take all known genes from a related genome and compare
them to the query genome via BLAST
• Disadvantages:
– Orthologs (genes in different species that evolved from a common
ancestral gene by speciation) and paralogs (genes related by
duplication within a genome, which may evolve new functions)
sometimes lose function and become pseudogenes
– Not all genes will always be known in the comparison genome
– The best species for comparison isn’t always obvious
• Similarity comparisons are good supporting evidence for
prediction validity
Machine Learning Techniques
Hidden Markov Model
ANN based method
Bayes Networks
Ab-initio Methods
• Fast Fourier Transform (algorithm) based
• Poor performance
• Able to identify new genes
• FTG method (FTG is a web server for analyzing
nucleotide sequences to predict the genes
using Fourier transform techniques).
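
The idea behind Fourier-based gene finding can be sketched by measuring the period-3 signal that protein-coding regions often show. This is an assumption-laden illustration and not the FTG server's algorithm: the function name and the normalisation by sequence length are choices made here, and only the single frequency 1/3 is evaluated instead of a full FFT.

```python
import cmath

def period3_power(seq):
    """Spectral power at frequency 1/3, summed over the four base indicators.

    Each base gives a 0/1 indicator sequence; its Fourier coefficient at
    frequency 1/3 is the sum of exp(-2*pi*i*k/3) over positions k holding it.
    """
    seq = seq.upper()
    n = len(seq)
    if n == 0:
        return 0.0
    total = 0.0
    for base in "ACGT":
        coeff = sum(
            cmath.exp(-2j * cmath.pi * k / 3)
            for k, c in enumerate(seq)
            if c == base
        )
        total += abs(coeff) ** 2
    return total / n   # normalisation by length is an illustrative choice
```

A strictly periodic sequence such as a repeated codon scores high, while a homopolymer scores close to zero; in practice the value would be computed in a sliding window along the genome.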
Eukaryotic genes
• Complex gene structure
• Large genomes (0.1 to 3 billion bases)
• Exons and Introns (interrupted)
• Low coding density (<30%)
– 3% in humans, 25% in Fugu, 60% in yeast
• Alternative splicing (40-60% of all genes)
• Considerable number of pseudogenes
Finding Eukaryotic Genes Computationally

• Rule-based
– Not as applicable – too many false positives
• Content-based Methods
– CpG islands, GC content, hexamer repeats, composition statistics, codon
• Feature-based Methods
– donor sites, acceptor sites, promoter sites, start/stop codons, polyA signals,
feature lengths
• Similarity-based Methods
– sequence homology, EST (expressed sequence tags) searches
• Pattern-based
– HMMs, Artificial Neural Networks
• Most effective is a combination of all the above
Gene prediction programs
• Rule-based programs
– Use explicit set of rules to make decisions.
– Example: GeneFinder
• Neural Network-based programs
– Use data set to build rules.
– Examples: Grail, GrailEXP
• Hidden Markov Model-based programs
– Use probabilities of states and transitions between
these states to predict features.
– Examples: Genscan, GenomeScan
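
How an HMM uses state and transition probabilities to label a sequence can be sketched with a toy two-state model decoded by the Viterbi algorithm. Every number below is a made-up assumption for illustration; this is not the model used by Genscan or GenomeScan, which have many more states.

```python
import math

# Hypothetical two-state model: all probabilities are illustrative only.
STATES = ("coding", "noncoding")
START = {"coding": 0.5, "noncoding": 0.5}
TRANS = {
    "coding": {"coding": 0.9, "noncoding": 0.1},
    "noncoding": {"coding": 0.1, "noncoding": 0.9},
}
EMIT = {
    "coding": {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2},
    "noncoding": {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3},
}

def viterbi(seq):
    """Return the most probable state path for seq (log-space Viterbi)."""
    seq = seq.upper()
    if not seq:
        return []
    # best[s] = log-probability of the best path ending in state s
    best = {s: math.log(START[s]) + math.log(EMIT[s][seq[0]]) for s in STATES}
    back = []                          # back-pointers, one dict per position
    for ch in seq[1:]:
        scores, ptr = {}, {}
        for s in STATES:
            prev = max(STATES, key=lambda p: best[p] + math.log(TRANS[p][s]))
            scores[s] = best[prev] + math.log(TRANS[prev][s]) + math.log(EMIT[s][ch])
            ptr[s] = prev
        best = scores
        back.append(ptr)
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: best[s])
    path = [state]
    for ptr in reversed(back):
        state = ptr[state]
        path.append(state)
    return path[::-1]
```

With these toy parameters a GC-rich stretch is labelled "coding" and an AT-rich stretch "noncoding".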
Combined Methods

• GRAIL (http://compbio.ornl.gov/Grail-1.3/)
• FGENEH (http://www.bioscience.org/urllists/genefind.htm)
• HMMgene (http://www.cbs.dtu.dk/services/HMMgene/)
• GENSCAN (http://genes.mit.edu/GENSCAN.html)
• GenomeScan (http://genes.mit.edu/genomescan.html)
• Twinscan (http://ardor.wustl.edu/query.html)
Egpred: Prediction of Eukaryotic Genes

• Similarity Search
– First BLASTX against RefSeq database

– Second BLASTX against sequences from first BLAST

– Detection of significant exons from BLASTX output

– BLASTN against Introns to filter exons

• Prediction using ab-initio programs

– NNSPLICE used to compute splice sites

• Combined method
Biological databases
• Biological databases : libraries of life sciences information,
collected from scientific experiments, published literature, high-
throughput experiment technology, and computational analysis

• Information comes from research areas including genomics, proteomics,
metabolomics, microarray gene expression, phylogenetics

• There are two main functions of biological databases:
1. To make biological data available to scientists.
2. To make biological data available in computer-readable form.

• Biological databases can be broadly classified into sequence and
structure databases
• Sequence databases are applicable to both nucleic acid sequences
and protein sequences
• Structure databases are applicable only to proteins.

• The first database was created shortly after the insulin
protein sequence was made available in 1956.

• Around the mid-1960s, the first nucleic acid sequence, that of yeast tRNA with
77 bases (individual units of nucleic acids), was determined. During this
period, three-dimensional structures of proteins were studied and
the well-known Protein Data Bank was developed as the first protein
structure database, with only 10 entries in 1972
• Databases in general can be classified into primary, secondary or composite

• A primary database contains information of the sequence or structure alone.

• Experimental results are submitted directly into the database by researchers, and
the data are essentially archival in nature.

• Once given a database accession number, the data in primary databases are never
changed: they form part of the scientific record.

• Examples of these include

1. Swiss-Prot, PIR - protein sequences
2. EMBL, GenBank & DDBJ - genome sequences {International Nucleotide Sequence
Database Collaboration (INSDC)}
3. PDB, SCOP - protein structures

International nucleotide data banks


[Figure: EMBL (EBI) and NLM (NCBI) linked through the International Advisory Meeting]


GenBank (NCBI)
– NCBI was created in 1988 as a part of the
National Library of Medicine at NIH
– Open access, annotated collection of publicly available
nucleotide sequences
– Produced & maintained by NCBI
– Accessed & searched through the Entrez system at NCBI
– NCBI also develops software tools for sequence analysis
• European Molecular Biology Laboratory
• Supported by 20 European countries &
• Nucleotide sequence database
• Maintained by EBI (European Bioinformatics Institute)
• DNA Data Bank of Japan
• Collaboration with EMBL & Genbank
• Run by National Institute of Genetics
• A secondary database contains derived information
from the primary database.
• They are often referred to as curated databases, but
this is a bit of a misnomer because primary
databases are also curated to ensure that the data
in them are consistent and accurate.
Primary database vs. secondary database
• Synonyms
– Primary: archival database
– Secondary: curated database
• Source of data
– Primary: direct submission of experimentally derived data from researchers
– Secondary: results of analysis, literature research and interpretation,
often of data in primary databases
• Examples
– Primary: ENA, GenBank and DDBJ (nucleotide sequence); ArrayExpress
Archive and GEO (functional genomics data); Protein Data Bank (PDB;
coordinates of three-dimensional macromolecular structures)
– Secondary: InterPro (protein families, motifs and domains);
Knowledgebase (sequence and functional information); Ensembl
(variation, function, regulation and more layered onto the whole genome)
Composite protein Databases
• There are a number of "composite" databases of protein
sequences
• These compile their sequence data from the primary
sequence databases and filter them to retain only the non-
redundant sequences.

• The best-known are OWL, NCBI

• PIR (Protein Information Resource), SWISS-PROT, TrEMBL

• PROSITE, Pfam (motif databases)
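
The compile-and-filter step that defines a composite database can be sketched as a merge that keeps only the first copy of each distinct sequence. The function name, the record layout, and the exact-match redundancy criterion are all assumptions for illustration; real composite databases apply their own, more elaborate redundancy rules.

```python
def build_composite(*sources):
    """Merge (entry_id, sequence) records from several source databases.

    The first record seen for each distinct sequence is kept; later exact
    duplicates (compared case-insensitively) are dropped.
    """
    seen = set()
    composite = []
    for records in sources:
        for entry_id, sequence in records:
            key = sequence.upper()
            if key not in seen:
                seen.add(key)
                composite.append((entry_id, sequence))
    return composite

# Hypothetical records from two primary sources.
db_a = [("A1", "MKT"), ("A2", "MLL")]
db_b = [("B1", "mkt"), ("B2", "MVV")]
merged = build_composite(db_a, db_b)
```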

Database searching

• Databases use a system where an entry can be
identified in two different ways:
1. Identifier
2. Accession code (or number)

1. Identifier:
– String of letters & digits
– Abbreviation of the full protein or gene name
– "locus" in GenBank, "entry name" in SWISS-PROT
– Changeable
E.g.: KRAF_HUMAN is the entry name for the Raf-1 oncogene from Homo sapiens

2. Accession code (or number):
– A number (possibly with a few characters in front) that uniquely identifies an
entry in its database
– Stable
E.g.: the accession code for KRAF_HUMAN in SWISS-PROT is P04049

• Software systems are needed to perform
searches such as:
– all entries with a keyword (e.g. "GTPase")
– the entry with a given literature reference (by author or
article)
– all proteins with a keyword (e.g. "ribosomal")

• Two examples of such software systems:
– SRS - the Sequence Retrieval System
– Entrez

• SRS:
– Sequence Retrieval System
– Developed by EBI
– System for integrating heterogeneous databases
– Web-oriented system, accessed through HTML pages & Common Gateway
Interface (CGI) scripts

• Entrez:
– Developed & accessible at the NCBI Entrez site
– Provides search facilities for a large number of databases & links between them
– Provides a well-defined web interface