
Gene Prediction

Presented by:
Soneela Anjum(459-FBAS/BSBI/F15)
Maheera Amjad(478-FBAS/BSBI/F15)
Gene:
A unit of hereditary material that is transferred
from a parent to offspring and is held to determine some
characteristic of the offspring.
Or
A sequence of nucleotides coding for a protein.
The coding region of a gene encodes instructions that
allow a cell to produce a specific protein or enzyme. Early
estimates put the number of human genes between 50,000 and
100,000, with genes ranging from hundreds to hundreds of
thousands of chemical bases in length; current estimates are
closer to 20,000 protein-coding genes.
Gene Prediction:
Gene prediction or gene finding refers to the process
of identifying the regions of genomic DNA that
encode genes.
It is all about detecting coding regions and inferring gene
structure.
The process includes detection of the location of open
reading frames (ORFs) and delineation of the
structures of introns as well as exons if the genes of
interest are of eukaryotic origin.
Contd.
In the field of bioinformatics, identifying genes
in large DNA sequences is known to be a
significant challenge.
The ultimate goal is to describe all the genes
computationally with near 100% accuracy.
Gene finding is one of the first and most
important steps in understanding the genome
of a species once it has been sequenced.
Background
In the 1960s it was discovered that the sequence of
codons in a gene determines the sequence of
amino acids in a protein.
We are all familiar with the central dogma of molecular biology.
Contd.
"The central dogma of molecular biology deals with the
detailed residue-by-residue transfer of sequential
information. It states that such information cannot be
transferred from protein to either protein or nucleic acid."
(Francis Crick)
It was originally stated in 1958 and was questioned in the 1960s
because of evidence of viral RNA-to-DNA transfer.
These events and the experimental work behind them laid the
groundwork for gene prediction.
Computational Gene Prediction:
A DNA sequence is provided as input:
S = (s1, s2, ..., sn) ∈ Σ*,
where Σ = {'A', 'C', 'G', 'T'}.
The goal is to label each element of S accurately as belonging to
a coding region (exon), a non-coding region (intron), or an
intergenic region.
Gene prediction is difficult
DNA sequence signals have low information
content (degenerate and highly
unspecific).
It is difficult to discriminate real signals.
Sequencing errors.
Very short exons (3 bp), especially initial.
Many very long introns.
Gene Prediction Limits
Existing predictors are for protein coding regions
Non-coding areas are not detected (5' and 3'
UTRs)
Non-coding RNA genes are missed
Predictions are for typical genes
Partial genes are often missed
Training sets may be biased
Atypical genes use other grammars
Prokaryotic Gene Prediction
High gene density and simple gene
structure.
Short genes have little information.
Overlapping genes.
No introns.
GENE PREDICTION IN PROKARYOTES
Prokaryotes, which include bacteria and Archaea, have
small genomes with sizes ranging from 0.5 to 10 Mbp.
Promoters are DNA segments upstream of transcripts
that initiate transcription.
The promoter attracts RNA polymerase to the transcription
start site.
Cont.

Promoters lie upstream of the transcription start site (TSS;
position 0).
Gene Prediction in Eukaryotes
Low gene density and complex gene
structure.
Alternative splicing.
Pseudo-genes.
GENE PREDICTION IN EUKARYOTES
The main issue in the prediction of eukaryotic genes
is the identification of exons, introns, and splice
sites.
Gene prediction in eukaryotes is guided by signals such as:
A poly-A signal: a conserved motif slightly downstream of a
coding region, with the consensus CAATAAA(T/C).
Contd.
Alternating exons and introns

Introns usually start with GT and end with AG


Types of Exons
1. Initial exons
2. Internal exons
3. Terminal exons
4. Single-exon genes, i.e. genes without
introns.
Consensus Splice Sites: Splicing Signals
Try to recognize the location of
splicing signals at exon-intron
junctions.
Profiles for these sites are still weak,
which leads to Hidden Markov Model (HMM)
approaches that capture the
statistical dependencies
between sites.
Splice Site Detection
Sequences at exon-intron junctions closely resemble
a consensus sequence, so splice sites can be detected with
algorithms based on position-specific weight matrices.
However, this is far too simple, since it does not
use all the information encoded in a gene. Thus
more integrated approaches are sought. This
naturally leads us to Hidden Markov Models.
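As a rough illustration of the position-specific weight matrix idea, the sketch below builds a log-odds matrix from a handful of aligned donor sites and slides it along a sequence. The training sites, the uniform background of 0.25, and the pseudocount are all invented for illustration and are not real splice-site data.

```python
import math

# Hypothetical toy donor sites (3 exon bases + 6 intron bases); not real data.
DONOR_SITES = [
    "CAGGTAAGT",
    "AAGGTGAGT",
    "CAGGTAAGA",
    "TAGGTGAGT",
    "CAGGTATGT",
]
BASES = "ACGT"

def build_pwm(sites, pseudocount=1.0, background=0.25):
    """Return one dict per position mapping base -> log2-odds score."""
    pwm = []
    for column in zip(*sites):
        total = len(column) + pseudocount * len(BASES)
        pwm.append({
            b: math.log2(((column.count(b) + pseudocount) / total) / background)
            for b in BASES
        })
    return pwm

def score(pwm, window):
    """Sum the per-position log-odds scores for one window."""
    return sum(col[b] for col, b in zip(pwm, window))

def best_site(pwm, seq):
    """Return the start index of the highest-scoring window in seq."""
    width = len(pwm)
    return max(range(len(seq) - width + 1),
               key=lambda i: score(pwm, seq[i:i + width]))

pwm = build_pwm(DONOR_SITES)
seq = "TTTCCAGGTAAGTCCCTT"
pos = best_site(pwm, seq)
print(pos, seq[pos:pos + len(pwm)])
```

Each position is scored independently of its neighbours, which is exactly the limitation that motivates the move to HMMs.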
Gene finding programs:
The current gene prediction methods can be
classified into two major categories:
Homology based approaches
Ab-initio based approaches
Homology based approaches:
The homology-based method makes predictions based
on significant matches of the query sequence with
sequences of known genes.
Online tools involved are:
Procrustes.
GeneWise.
Ab-initio based approaches:
Predict genes on the basis of the given
sequence alone.
Two major features are associated with genes:
Existence of gene signals
Gene content (statistical description of
coding regions)
Gene signals:
Gene signals include:
Start and stop codons.
Intron splice signals:
Donor splice site: the splice site at the beginning of an
intron (its 5' end).
Acceptor splice site: the splice site at the end of an
intron (its 3' end).
Contd.
Transcription factor binding sites.
Ribosomal binding sites.
Polyadenylation (poly-A) sites.
Reading frames:
Every region of DNA has six possible reading frames.
Three in the forward and three in the reverse direction.
An open reading frame(ORF) starts with an atg (Met) in
most species and ends with a stop codon (taa, tag or tga).
A frame longer than thirty codons without interruption by
stop codons is suggestive of a gene coding region.
The frame is then tested further for the presence of gene signals,
and the protein it would encode can be examined.
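The reading-frame rules above can be sketched as a small ORF scanner. This is a minimal illustration, not the method of any particular gene finder: the function name `find_orfs` and its return format are made up here, it only reports ORFs that are closed by a stop codon, and it always starts at the first atg after the previous stop.

```python
STOPS = {"taa", "tag", "tga"}
COMPLEMENT = str.maketrans("acgt", "tgca")

def reverse_complement(seq):
    """Return the reverse complement of a lowercase DNA string."""
    return seq.translate(COMPLEMENT)[::-1]

def find_orfs(seq, min_codons=30):
    """Yield (strand, frame, start, end, n_codons) for each ORF found.

    start and end are 0-based offsets on the scanned strand, end is
    exclusive and includes the stop codon; n_codons excludes the stop.
    """
    seq = seq.lower()
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon == "atg":
                    start = i                      # open a candidate ORF
                elif start is not None and codon in STOPS:
                    n_codons = (i - start) // 3
                    if n_codons >= min_codons:
                        yield strand, frame + 1, start, i + 3, n_codons
                    start = None                   # look for the next atg

example = "atgcccaagctgaatagcgtagaggggttttcatcatttgaggacgatgtataa"
# The threshold is lowered because this example sequence is very short.
orfs = list(find_orfs(example, min_codons=10))
print(orfs)
```

With the default threshold of thirty codons the short example yields nothing, which reflects why short genes are hard to call from ORF length alone.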
Contd.
For example, consider the following DNA sequence:
5' atgcccaagctgaatagcgtagaggggttttcatcatttgaggacgatgtataa 3'

1 atg ccc aag ctg aat agc gta gag ggg ttt tca tca ttt gag gac gat gta taa
   M   P   K   L   N   S   V   E   G   F   S   S   F   E   D   D   V   *
2 tgc cca agc tga ata gcg tag agg ggt ttt cat cat ttg agg acg atg tat
   C   P   S   *   I   A   *   R   G   F   H   H   L   R   T   M   Y
3 gcc caa gct gaa tag cgt aga ggg gtt ttc atc att tga gga cga tgt ata
   A   Q   A   E   *   R   R   G   V   F   I   I   *   G   R   C   I
Contd.
The three reading frames in the forward direction are
shown with the translated amino acids below
each DNA sequence.
Frame 1 starts with the "a", Frame 2 with the "t" and
Frame 3 with the "g".
Stop codons are indicated by an "*" in the protein
sequence.
The longest ORF is in Frame 1.
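The three forward-frame translations can be reproduced with a few lines of code. The sketch below builds the standard genetic code from the usual T, C, A, G ordering; the helper name `translate` is chosen here for illustration.

```python
from itertools import product

# Standard genetic code, codons enumerated in t, c, a, g order.
BASES = "tcag"
AMINO = ("FFLLSSSSYY**CC*W"
         "LLLLPPPPHHQQRRRR"
         "IIIMTTTTNNKKSSRR"
         "VVVVAAAADDEEGGGG")
CODON_TABLE = {"".join(c): aa
               for c, aa in zip(product(BASES, repeat=3), AMINO)}

def translate(seq, frame):
    """Translate seq from 0-based offset `frame`, dropping trailing bases."""
    return "".join(CODON_TABLE[seq[i:i + 3]]
                   for i in range(frame, len(seq) - 2, 3))

dna = "atgcccaagctgaatagcgtagaggggttttcatcatttgaggacgatgtataa"
for frame in range(3):
    print(frame + 1, translate(dna, frame))
```

Only Frame 1 runs from the start codon to the final stop without interruption, which is why it holds the longest ORF.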
Gene prediction using MM and HMM:
A Markov model (MM) describes the probability distribution of
nucleotides in a gene.
k is the order of a Markov model.
A zero-order MM assumes that each base occurs
independently.
A first-order MM assumes that the occurrence of any
base depends on the base preceding it.
Contd.
A second-order MM assumes dependence on the previous two bases.
And so on.
Gene finding with MMs relies on oligonucleotide distributions in
coding regions being different from those in non-coding
regions.
Effective Markov models for coding regions are built on sets of three
nucleotides (codons).
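To make the idea concrete, here is a minimal sketch of a first-order Markov model used as a coding versus non-coding discriminator. The training sequences are invented toy strings, not real genes, and real gene finders use higher-order, codon-aware models trained on large sets.

```python
import math

BASES = "acgt"

def train_first_order(seqs, pseudocount=1.0):
    """Estimate P(next base | previous base) with additive smoothing."""
    counts = {p: {n: pseudocount for n in BASES} for p in BASES}
    for s in seqs:
        for prev, nxt in zip(s, s[1:]):
            counts[prev][nxt] += 1
    return {p: {n: c / sum(row.values()) for n, c in row.items()}
            for p, row in counts.items()}

def log_odds(seq, coding, noncoding):
    """Positive values favour the coding model, negative the non-coding one."""
    return sum(math.log(coding[p][n] / noncoding[p][n])
               for p, n in zip(seq, seq[1:]))

# Hypothetical toy training sets (GC-rich "coding", AT-rich "non-coding").
coding_model = train_first_order(["atggccgcgctggcc", "atgggcgccgcgctg"])
noncoding_model = train_first_order(["ataattttaatatat", "tttaaatattaatta"])

print(log_odds("gccgcg", coding_model, noncoding_model))
print(log_odds("atatta", coding_model, noncoding_model))
```

Raising the order means conditioning on more preceding bases, which needs far more training data to estimate reliably.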
Regulatory Sequences:
DNA elements that are present near the start sites of
genes.
Serve as binding sites for the gene transcription
machinery.
Should not be confused with translation start
sites (start codons, which encode methionine).
Directly regulate gene expression.
Each gene has its own set of regulatory sequences.
Contd.
CAAT box
Operator
Pribnow box
TATA box
A-box
Z-box
C-box
E-box
G-box
Promoter regions in Prokaryotes:
In prokaryotes, the promoter and
regulatory segments lie about
35 and 10 base pairs
upstream of the transcription
start site.
They are referred to as the -35 and -10 boxes.
The -35 box has the consensus sequence
TTGACA.
The -10 box has the consensus sequence
TATAAT (Pribnow box).
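A simple way to look for such promoters is to search for a near match to TTGACA followed, after a spacer, by a near match to TATAAT. The sketch below does that; the allowed spacer of 15 to 19 bp and the one-mismatch tolerance are assumptions made for illustration, and the test sequence is synthetic.

```python
def mismatches(a, b):
    """Count positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))

def find_promoters(seq, max_mismatch=1, spacer=(15, 19)):
    """Return (pos35, pos10, total_mismatches) for candidate box pairs."""
    seq = seq.upper()
    hits = []
    for i in range(len(seq) - 5):
        m35 = mismatches(seq[i:i + 6], "TTGACA")
        if m35 > max_mismatch:
            continue
        # Try every allowed spacer length between the two boxes.
        for gap in range(spacer[0], spacer[1] + 1):
            j = i + 6 + gap
            if j + 6 > len(seq):
                break
            m10 = mismatches(seq[j:j + 6], "TATAAT")
            if m10 <= max_mismatch:
                hits.append((i, j, m35 + m10))
    return hits

# Synthetic example: both boxes separated by a 17 bp spacer.
promoter_seq = "GGC" + "TTGACA" + "GCGCGCGCGCGCGCGCG" + "TATAAT" + "GCATCGA"
print(find_promoters(promoter_seq))
```

Because both boxes are short and degenerate, a scan like this produces many false positives on real genomes, so hits are normally combined with other evidence.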
Promoter regions in Eukaryotes:
The core of most eukaryotic promoters is the TATA
box, located about 30 bp upstream of the transcription start site.
Many genes also have an initiator sequence (Inr).
TATA box:
Consensus sequence TATA(A/T)A(A/T).
Inr (initiator sequence):
Pyrimidine-rich sequence (C/T)(C/T)CA(C/T)(C/T).
Contd.
CAAT box:
Also referred to as the CAT box.
Consensus sequence GGCCAATCT.
Located 60-100 bases upstream of the transcription start site.
GC box:
Consensus sequence GGGCGG.
Found upstream of the TATA box,
about 110 bases upstream of the transcription initiation site.
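The consensus sequences listed for the TATA box, Inr, CAAT box and GC box translate directly into regular expressions, with each (A/T) or (C/T) alternative written as a character class. The sketch below scans one strand of a synthetic sequence; the function name `scan` and the test sequence are illustrative only, and exact consensus matching is much stricter than what real promoter predictors use.

```python
import re

# Consensus patterns from the slides, written as regular expressions.
MOTIFS = {
    "TATA box": "TATA[AT]A[AT]",
    "Inr": "[CT][CT]CA[CT][CT]",
    "CAAT box": "GGCCAATCT",
    "GC box": "GGGCGG",
}

def scan(seq):
    """Return (motif name, 0-based start, matched text), sorted by position."""
    seq = seq.upper()
    hits = []
    for name, pattern in MOTIFS.items():
        # A lookahead lets overlapping occurrences be reported as well.
        for m in re.finditer(f"(?=({pattern}))", seq):
            hits.append((name, m.start(), m.group(1)))
    return sorted(hits, key=lambda h: h[1])

# Synthetic sequence with one copy of each element, upstream to downstream.
euk_seq = ("AGAG" + "GGGCGG" + "AGAGAG" + "GGCCAATCT" + "AGAGAG"
           + "TATAAAA" + "GAGAGAGAG" + "CTCATT" + "GAGA")
for hit in scan(euk_seq):
    print(hit)
```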
GenScan:
A bioinformatics tool used to identify
complete gene structures.
It can be used to predict the locations of genes
and their exon-intron boundaries in genomic
sequences.
The GENSCAN Web server is hosted at MIT.
GENEID
Geneid is one of the oldest gene structure prediction
programs, recently updated to a new and faster
version.
Uses a hierarchical search structure (signal → exon →
gene).
1st: finds signals (splice sites, start and stop codons).
2nd: starting from the signals found, scores regions for
exon-defining signals and protein-coding potential.
3rd: a dynamic programming algorithm searches
the space of predicted exons to assemble the
gene structure.
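The assembly step can be pictured as choosing the best-scoring set of non-overlapping candidate exons. The sketch below is a deliberately simplified stand-in (weighted interval scheduling), not geneid's actual algorithm, which also has to respect reading-frame and exon-type compatibility; the exon coordinates and scores are invented.

```python
from bisect import bisect_right

def best_exon_chain(exons):
    """Pick non-overlapping exons with maximum total score.

    exons: list of (start, end, score) with end exclusive.
    Returns (total_score, chain), the chain ordered by position.
    """
    exons = sorted(exons, key=lambda e: e[1])
    ends = [e[1] for e in exons]
    # best[k] holds the best (score, chain) using only the first k exons.
    best = [(0.0, [])]
    for k, (start, end, score) in enumerate(exons):
        # Number of earlier exons that end at or before this exon starts.
        p = bisect_right(ends, start, 0, k)
        take = (best[p][0] + score, best[p][1] + [(start, end, score)])
        skip = best[k]
        best.append(max(take, skip, key=lambda t: t[0]))
    return best[-1]

# Hypothetical candidate exons: (start, end, score).
candidates = [(0, 100, 5.0), (50, 150, 6.0), (120, 200, 4.5), (160, 260, 3.0)]
total, chain = best_exon_chain(candidates)
print(total, chain)
```

Sorting by end coordinate and using binary search keeps the work near-linear in the number of candidate exons, in the same spirit as the scaling behaviour claimed for the tool.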
Contd.
Very fast, and scales linearly with the length of the
sequence (in both time and memory), so it is well suited to
analyzing large sequences.
Trained on Drosophila and human.
Available at http://www1.imim.es/geneid.html
Species:
Homo sapiens

Command:
geneid -P human.param -G -o geneid4840.fasta

Running time:
0.13 secs

Accuracy of the different methods


Evaluation of the different programs (Rogic et al.,
2001)
Cont.
Overall performances are the best for HMMgene and
GENSCAN.
For almost all the tested programs, medium exons
(70-200 nucleotides long), are most accurately
predicted.
Internal exons are much more likely to be correctly
predicted.
Initial and terminal exons are most likely to be missed
completely.
Only HMMgene and GENSCAN have reliable scores for
exon prediction.
References:
https://www.ncbi.nlm.nih.gov/books/NBK21745/
Book:
Essential Bioinformatics by Jin Xiong.
THANK YOU
ANY QUESTIONS???????
