STRUCTURE
PREDICTION
Neeru Redhu
CCS HAU
CLASSIFICATION OF
SECONDARY STRUCTURE
Defining features
Dihedral angles
Hydrogen bonds
Geometry
Assigned manually by
experimentalists
Automatic
DSSP (Kabsch & Sander, 1983)
STRIDE (Frishman & Argos, 1995)
Continuum (Andersen et al.)
CLASSIFICATION
Sequence:  GHWIATRGQLIREAYEDYRHFSSECPFIP
Structure: CEEEEECHHHHHHHHHHHCCCHHCCCCCC
WHY SECONDARY STRUCTURE
PREDICTION?
o Statistical method
o Chou-Fasman method, GOR I-IV
o Nearest neighbors
o NNSSP, SSPAL
o Neural network
o PHD, Psi-Pred, J-Pred
o Support vector machine (SVM)
o HMM
ACCURACY MEASURE
[Figure: distribution of the percentage of correctly predicted residues per protein (30-100%) for JPred2 and PHD.]
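The per-residue accuracy behind such plots is Q3: the percentage of residues whose predicted three-state label (H/E/C) matches the observed one. A minimal sketch (the function name `q3` is ours):

```python
def q3(predicted: str, observed: str) -> float:
    """Percentage of residues whose 3-state label (H/E/C) is predicted correctly."""
    assert len(predicted) == len(observed)
    correct = sum(p == o for p, o in zip(predicted, observed))
    return 100.0 * correct / len(observed)

print(q3("HHHECC", "HHHECC"))           # 100.0
print(round(q3("HHHHCC", "HHHECC"), 1))  # one of six residues wrong: 83.3
```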
HOW FAR CAN WE GO?
Currently ~76%; for the 1/5 of proteins with
more than 100 homologs, >80%.
Assignment itself is ambiguous (5-15%):
non-unique protein structures, H-bond cutoffs, etc.
Some segments can adopt multiple
structure types.
Homologues differ in secondary structure
(~12%), putting the prediction limit at
~88%.
Non-locality.
ASSUMPTIONS
Example:
#Ala = 2,000; #residues = 20,000; #helix = 4,000; #Ala in helix = 500
p(α, aa_i) = 500/20,000; p(α) = 4,000/20,000; p(aa_i) = 2,000/20,000
P = p(α, aa_i) / [p(α) · p(aa_i)] = 0.025 / (0.2 × 0.1) = 1.25
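The worked Ala/helix example can be reproduced directly from the raw counts; `propensity` is a hypothetical helper implementing P = p(α, aa) / [p(aa) · p(α)]:

```python
def propensity(n_aa_in_state, n_state, n_aa, n_total):
    """Chou-Fasman-style propensity: p(aa, state) / (p(aa) * p(state))."""
    p_joint = n_aa_in_state / n_total
    p_state = n_state / n_total
    p_aa = n_aa / n_total
    return p_joint / (p_aa * p_state)

# Numbers from the slide: 2,000 Ala among 20,000 residues,
# 4,000 helical residues, 500 of them Ala.
print(round(propensity(500, 4000, 2000, 20000), 4))  # 1.25
```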
CHOU-FASMAN ALGORITHM
Helix, Strand
1. Scan with a 6-residue window for nucleation sites where the
average propensity > 1 (at least 4 favourable residues for helix,
3 for strand)
2. Propagate in both directions until a 4-residue (helix) or
3-residue (strand) window has mean propensity < 1
3. Move forward and repeat
Conflict solution
Any region containing overlapping alpha-helical and beta-
strand assignments is taken to be helical if the average
P(helix) > P(strand), and a beta strand if the average
P(strand) > P(helix).
Accuracy: ~50-60%
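The helix nucleation/extension steps can be sketched as below, using an illustrative subset of P(helix) propensities (×100, as in the tables on the following slides). `find_helices` and its thresholds are our simplification for teaching, not the full published algorithm:

```python
# P(helix) propensities x 100 (subset, as on the slides).
P_H = {"T": 69, "S": 77, "P": 57, "A": 142, "E": 151,
       "L": 121, "M": 145, "R": 98, "G": 57}

def avg(segment):
    return sum(P_H[aa] for aa in segment) / len(segment)

def find_helices(seq, nucleus=6, min_hits=4, window=4):
    """Return the set of positions assigned helix.
    1. Nucleate where >= min_hits of `nucleus` residues have P(H) > 100.
    2. Extend both ways until a `window`-residue window averages < 100."""
    helix = set()
    for i in range(len(seq) - nucleus + 1):
        if sum(P_H[aa] > 100 for aa in seq[i:i + nucleus]) < min_hits:
            continue
        lo, hi = i, i + nucleus
        while hi < len(seq) and avg(seq[hi - window + 1:hi + 1]) >= 100:
            hi += 1  # extend right
        while lo > 0 and avg(seq[lo - 1:lo - 1 + window]) >= 100:
            lo -= 1  # extend left
        helix.update(range(lo, hi))
    return helix

print(sorted(find_helices("TSPTAELMRSTG")))  # [2, 3, 4, 5, 6, 7, 8, 9]
```

On the slide sequence TSPTAELMRSTG this assigns helix to the high-propensity core P T A E L M R S (positions 2-9).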
GHWIATRGQLIREAYEDYRHFSSECPFIP
INITIATION
Residue   T   S   P   T   A   E   L   M   R   S   T   G
P(H)      69  77  57  69  142 151 121 145 98  77  69  57
PROPAGATION
Residue   T   S   P   T   A   E   L   M   R   S   T   G
P(H)      69  77  57  69  142 151 121 145 98  77  69  57
FIRST GENERATION METHODS:
SINGLE RESIDUE STATISTICS
Accuracy: Q3 = 50-60%
SECOND GENERATION METHODS: SEGMENT
STATISTICS
Problems:
Low accuracy: Q3 below 66%.
Q3 for β-strands (E): 28% - 48%.
Predicted segments were too short.
THE GOR METHOD
Main problems:
Prediction accuracy < 70%
Prediction accuracy for β-strands only
28-48%
Predicted segments are usually too
short, which makes the predictions
difficult to use
NEAREST NEIGHBOR METHOD
S(i, j) = log [ p(aa_i | E_j) / p(aa_i) ] = log [ p(aa_i, E_j) / (p(aa_i) · p(E_j)) ]
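The singleton score is a log-odds of observing residue aa_i in structural environment E_j; it can be computed from the same kind of counts as the earlier propensity example (the numbers reused here are hypothetical):

```python
import math

def singleton_score(p_joint, p_aa, p_env):
    """Log-odds score S(i, j) = log[ p(aa_i, E_j) / (p(aa_i) * p(E_j)) ]."""
    return math.log(p_joint / (p_aa * p_env))

# Reusing the earlier Ala counts, with E_j = "helix" as the environment:
print(round(singleton_score(500/20000, 2000/20000, 4000/20000), 3))  # log(1.25) = 0.223
```

A positive score means the residue is over-represented in that environment; zero means no preference.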
SINGLETON SCORE MATRIX
Query:     T R G Q L I R E A Y E D Y R H F S S E C P F I P
Neighbour:           . . . E C Y E Y B R H R . . . .
Position:            j-4 j-3 j-2 j-1 j j+1 j+2 j+3 j+4
Structure:           L   H   H   H   H   H   H   L   L
NEIGHBORS
1 - L H H H H H H L L - S1
2 - L L H H H H H L L - S2
3 - L E E E E E E L L - S3
4 - L E E E E E E L L - S4
n - L L L L E E E E E - Sn
n+1 - H H H L L L E E E - Sn+1
Sequence-profile alignment:
compare a sequence against a protein family profile.
More position-specific.
Predator: Nearest-neighbour
PSI-Pred: Neural networks
SSPro: Neural networks
SAM-T02: Homologs (Hidden Markov Models)
PHD: Neural networks
Due to improved protein information in
databases, i.e. better evolutionary information,
today's predictive accuracy is ~80%
The maximum reachable accuracy is believed
to be ~88%
PSSP DATA PREPARATION
Public protein data sets used in PSSP research
contain protein secondary-structure sequences. To
use classification algorithms, we must transform
these sequences into classification data tables.
Records in the classification data tables are called
(learning) instances in the PSSP literature.
The mechanism used in this transformation
is called a (sliding) window.
A window algorithm takes a sequence and its
secondary structure as input and returns a
classification table: a set of instances for the
classification algorithm.
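The window transformation can be sketched as follows; the window width and the `-` padding character for chain ends are our illustrative choices:

```python
def windows(sequence, labels, width=7):
    """Turn a (sequence, secondary-structure) pair into classification
    instances: one residue window per position, labelled with the state
    of the central residue. Chain ends are padded with '-'."""
    assert len(sequence) == len(labels) and width % 2 == 1
    half = width // 2
    padded = "-" * half + sequence + "-" * half
    return [(padded[i:i + width], labels[i]) for i in range(len(sequence))]

for instance in windows("GHWIATR", "CEEEEEC", width=5):
    print(instance)  # e.g. ('--GHW', 'C'), ('-GHWI', 'E'), ...
```

Each tuple is one learning instance: the window is the feature vector, the central label is the class.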
NEURONS
normal (resting) state
active (firing) state
NEURAL NETWORK
[Figure: schematic network with junctions J1-J4; input sequences enter on one side.]
SIMPLE NEURAL NETWORK
[Figure: a simple network with junctions J11 and J12 mapping binary input patterns (0/1) to outputs, with a plot of output against input over [-2, 2] showing the resulting error.]
SIMPLE NEURAL NETWORK WITH
HIDDEN LAYER
With a hidden layer, the output of unit i is:

out_i = f( Σ_j w_ij · f( Σ_k w_jk · in_k ) )
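The formula above, evaluated numerically for a tiny hypothetical network (logistic transfer function, no bias terms; the weights are made up purely for illustration):

```python
import math

def logistic(x):
    return 1.0 / (1.0 + math.exp(-x))

def forward(inputs, w_hidden, w_out):
    """out_i = f( sum_j w_out[i][j] * f( sum_k w_hidden[j][k] * in_k ) )"""
    hidden = [logistic(sum(w * x for w, x in zip(row, inputs))) for row in w_hidden]
    return [logistic(sum(w * h for w, h in zip(row, hidden))) for row in w_out]

# 2 inputs -> 2 hidden units -> 1 output.
out = forward([1.0, 0.0],
              w_hidden=[[0.5, -0.5], [1.0, 1.0]],
              w_out=[[1.0, -1.0]])
print(round(out[0], 3))
```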
NEURAL NETWORK FOR SECONDARY
STRUCTURE
[Figure: a window of residues with known states, D(L) R(E) Q(E) G(E) F(E) V(E) P(E) A(H) A(H) Y(H) ... V(E) K(E) K(E); each position is encoded by 20 input units (one per amino acid A, C, D, E, F, G, H, I, K, L, M, N, P, Q, R, S, T, V, W, Y) feeding three output units H, E, L that predict the state of the central residue.]
PSIPRED
Window size = 15
Two networks
First network (sequence-to-structure):
315 = (20 + 1) × 15 inputs
the extra unit per position indicates whether the window spans the N or C
terminus
data are scaled to the [0, 1] range using 1/[1 + exp(-x)]
75 hidden units
3 outputs (H, E, L)
Second network (structure-to-structure):
models structural correlation between adjacent positions
60 = (3 + 1) × 15 inputs
60 hidden units
3 outputs
Accuracy ~76%
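The layer sizes above can be checked with a shape-only sketch of the two-network data flow; the weights are random and untrained, purely to demonstrate that 315 and 60 inputs map to 3 outputs each:

```python
import random

WINDOW = 15

def layer(n_in, n_out):
    """A weight matrix of n_out rows, each with n_in random weights."""
    return [[random.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]

def apply(weights, x):
    """Linear layer: one weighted sum per output row."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]

# Network 1: sequence -> structure. 21 numbers per window position
# (20 profile values + 1 terminus flag), 75 hidden units, 3 outputs.
net1_hidden, net1_out = layer((20 + 1) * WINDOW, 75), layer(75, 3)
# Network 2: structure -> structure. 4 numbers per position
# (3 state scores + 1 terminus flag), 60 hidden units, 3 outputs.
net2_hidden, net2_out = layer((3 + 1) * WINDOW, 60), layer(60, 3)

profile_window = [0.5] * ((20 + 1) * WINDOW)    # 315 inputs
stage1 = apply(net1_out, apply(net1_hidden, profile_window))
structure_window = [0.5] * ((3 + 1) * WINDOW)   # 60 inputs
stage2 = apply(net2_out, apply(net2_hidden, structure_window))
print(len(stage1), len(stage2))  # 3 3
```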
THE PSI-PRED ALGORITHM
Given an amino-acid sequence, predict
secondary structure elements in the
protein
Three stages:
1. Generation of a sequence profile (the
multiple alignment step)
2. Prediction of an initial secondary
structure (the neural network step)
3. Filtering of the predicted structure
(another neural network step)
GENERATION OF SEQUENCE PROFILE
A BLAST-like program called PSI-BLAST is
used for this step
We saw BLAST earlier -- it is a fast way to find
high-scoring local alignments
PSI-BLAST is an iterative approach:
an initial scan of a protein database using the target
sequence T
align all matching sequences to construct a sequence
profile
scan the database again using this new profile
It can also pick out and align distantly related
protein sequences for our target sequence T
THE SEQUENCE PROFILE LOOKS LIKE
THIS
Has 20 × M numbers (M = sequence length)
The numbers are the log-likelihoods of each residue at each position
PREPARING FOR THE SECOND STEP
Feed the sequence profile to an artificial neural
network
But before feeding, apply a simple scaling to bring
the numbers into the [0, 1] range:

f(x) = 1 / (1 + e^(-x))
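The same logistic squashing, as a one-line helper (name `squash` is ours):

```python
import math

def squash(x):
    """Scale a profile log-likelihood into the (0, 1) range: 1 / (1 + e^-x)."""
    return 1.0 / (1.0 + math.exp(-x))

print(squash(0))             # 0.5: a neutral score lands in the middle
print(round(squash(5), 3))   # large positive scores approach 1
print(round(squash(-5), 3))  # large negative scores approach 0
```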
ARTIFICIAL NEURAL NETWORK
Supervised learning algorithm
Trained on labelled examples; each example's label is its
class, e.g., positive or negative -- here helix, strand, or coil
Learns to predict the class of a new example
ARTIFICIAL NEURAL NETWORK
Directed graph
Nodes or units or neurons
http://www.akri.org/cognition/images/annet2.gif
http://www.geocomputation.org/2000/GC016/GC016_01.GIF
WHAT A UNIT (NEURON) DOES
x_i = Σ_{j ∈ N−{i}} w_ij · y_j + w_i

y_i = f_i(x_i) = f_i( Σ_{j ∈ N−{i}} w_ij · y_j + w_i )

w_i is called the bias of the unit
WEIGHTS, BIAS AND TRANSFER FUNCTION
Weights
wij and bias wi of each unit are
parameters of the ANN.
Parameter values are learned from input data
Transfer function is usually the same for
every unit in the same layer
Graphical architecture (connectivity) is
decided by you.
Could use fully connected architecture: all
units in one layer connect to all units in next
layer
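A single unit's computation under the formula from the previous slide, with a logistic transfer function (the weights, inputs, and bias below are hypothetical):

```python
import math

def unit_output(weights, inputs, bias):
    """y_i = f( sum_j w_ij * y_j + w_i ), with f the logistic function."""
    x = sum(w * y for w, y in zip(weights, inputs)) + bias
    return 1.0 / (1.0 + math.exp(-x))

# A unit with two incoming connections.
print(round(unit_output(weights=[0.8, -0.4], inputs=[1.0, 0.5], bias=0.1), 3))
```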
STEP 2
Feed the sequence profile to the input layer of an
ANN
Not the whole profile, only a window of 15
consecutive positions
For each position, there are 20 numbers in the
profile (one for each amino acid)
Therefore ~ 15 x 20 = 300 numbers fed
Therefore, ~ 300 input units in ANN
3 output units, for strand, helix, coil
each number is confidence in that secondary structure
for the central position in the window of 15
e.g., helix 0.18, strand 0.09, coil 0.67
Optional reading:
Review by Burkhard Rost:
http://cubic.bioc.columbia.edu/papers/2003_rev_dekker/paper.html