
PROTEIN SECONDARY

STRUCTURE
PREDICTION

Neeru Redhu
CCS HAU
CLASSIFICATION OF
SECONDARY STRUCTURE

Defining features
Dihedral angles
Hydrogen bonds
Geometry
Assigned manually by experimentalists
Automatic assignment
DSSP (Kabsch & Sander, 1983)
STRIDE (Frishman & Argos, 1995)
Continuum (Andersen et al.)
CLASSIFICATION

Eight states from DSSP


H: α-helix
G: 3₁₀ helix
I: π-helix
E: β-strand
B: bridge
T: β-turn
S: bend
C: coil

[Excerpt of a DSSP output listing omitted.]
CASP Standard
H = (H, G, I), E = (E, B), C = (C, T, S)
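The CASP reduction above can be written as a simple lookup; a minimal Python sketch (an illustration, not any particular tool's implementation):

    # Reduce DSSP's eight states to the three CASP classes H/E/C.
    DSSP_TO_CASP = {
        "H": "H", "G": "H", "I": "H",   # helices
        "E": "E", "B": "E",             # strands / bridges
        "T": "C", "S": "C", "C": "C",   # turns, bends, coil
    }

    def reduce_to_three_states(dssp_string):
        """Map a DSSP state string (e.g. 'HHHGGTTEEE') to H/E/C."""
        return "".join(DSSP_TO_CASP.get(s, "C") for s in dssp_string)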
DIHEDRAL ANGLES
RAMACHANDRAN PLOT (ALPHA)
RAMACHANDRAN PLOT (BETA)
WHAT IS SECONDARY
STRUCTURE PREDICTION?

Given a protein sequence (primary structure):

GHWIATRGQLIREAYEDYRHFSSECPFIP

Predict its secondary structure content
(C = Coil, H = Alpha Helix, E = Beta Strand):

CEEEEECHHHHHHHHHHHCCCHHCCCCCC
WHY SECONDARY STRUCTURE
PREDICTION?

o An easier problem than 3D structure


prediction (more than 40 years of history).
o Accurate secondary structure prediction provides important information for tertiary structure prediction
o Protein function prediction
o Protein classification
o Predicting structural change
PREDICTION METHODS

o Statistical method
o Chou-Fasman method, GOR I-IV
o Nearest neighbors
o NNSSP, SSPAL
o Neural network
o PHD, Psi-Pred, J-Pred
o Support vector machine (SVM)
o HMM
ACCURACY MEASURE

Three-state prediction accuracy: Q3

Q3 = (number of correctly predicted residues) / (total number of residues)

A prediction of all loop gives Q3 ≈ 40%.
Correlation coefficients are also used.
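A minimal Python sketch of the Q3 calculation; the observed and predicted strings in the example are made up:

    def q3(observed, predicted):
        """Fraction of residues whose three-state label (H/E/C) is predicted correctly."""
        assert len(observed) == len(predicted)
        correct = sum(o == p for o, p in zip(observed, predicted))
        return correct / len(observed)

    # Example with made-up strings:
    print(q3("CEEEECHHHHCC", "CEEECCHHHHCC"))  # 11/12 ≈ 0.92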
IMPROVEMENT OF ACCURACY

1974 Chou & Fasman ~50-53%


1978 Garnier 63%
1987 Zvelebil 66%
1988 Qian & Sejnowski 64.3%
1993 Rost & Sander 70.8-72.0%
1997 Frishman & Argos <75%
1999 Cuff & Barton 72.9%
1999 Jones 76.5%
2000 Petersen et al. 77.9%
PREDICTION ACCURACY (EVA)
[Figure: EVA benchmark histogram comparing PSIPRED, SSpro, PROF, PHDpsi, JPred2 and PHD; x-axis: percentage of correctly predicted residues per protein (30-100%), y-axis: percentage of all 150 proteins.]
HOW FAR CAN WE GO?

Currently ~76%; for the 1/5 of proteins with more than 100 homologs, >80%.
Assignment itself is ambiguous (5-15%): non-unique protein structures, H-bond cutoffs, etc.
Some segments can have multiple structure types.
Secondary structure differs between homologues (~12%), suggesting a prediction limit of ~88%.
Non-locality of interactions.
ASSUMPTIONS

o The entire information for forming secondary structure is contained in the primary sequence.
o Side groups of residues will determine structure.
o Examining windows of 13-17 residues is sufficient to predict structure.
o Basis for window size selection:
α-helices: 5-40 residues long
β-strands: 5-10 residues long
PSSP ALGORITHMS
There are three generations of PSSP algorithms:
First generation: based on statistical information about single amino acids
Second generation: based on windows (segments) of amino acids; typically a window contains 11-21 amino acids
Third generation: based on the use of windows over evolutionary information
PSSP: FIRST GENERATION
First-generation PSSP systems are based on statistical information about single amino acids.
The most relevant algorithms:
Chou-Fasman, 1974
GOR, 1978
Both algorithms claimed 74-78% predictive accuracy, but when tested on better constructed datasets their accuracy proved to be ~50% (Nishikawa, 1983).
SECONDARY STRUCTURE
PROPENSITY

From the PDB database, calculate the propensity of a given amino acid to adopt a certain secondary structure type:

P_α(aa_i) = P(α | aa_i) / P(α) = p(α, aa_i) / [ p(α) p(aa_i) ]

Example:
#Ala = 2,000, #residues = 20,000, #helix = 4,000, #Ala in helix = 500
p(α, aa_i) = 500/20,000, p(α) = 4,000/20,000, p(aa_i) = 2,000/20,000
P_α(Ala) = (500/20,000) / [(4,000/20,000)(2,000/20,000)] = 500 / (4,000/10) = 1.25
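A minimal Python sketch of the same propensity calculation, reusing the counts from the example above:

    def propensity(n_aa_in_state, n_state, n_aa, n_total):
        """Propensity = p(state, aa) / [p(state) * p(aa)], estimated from raw counts."""
        p_joint = n_aa_in_state / n_total
        p_state = n_state / n_total
        p_aa = n_aa / n_total
        return p_joint / (p_state * p_aa)

    # Worked example from the slide: Ala in helices.
    print(propensity(n_aa_in_state=500, n_state=4000, n_aa=2000, n_total=20000))  # 1.25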
CHOU-FASMAN ALGORITHM

Helix, Strand
1. Scan for window of 6 residues where average score > 1 (4
residues for helix and 3 residues for strand)
2. Propagate in both directions until 4 (or 3) residue window
with mean propensity < 1
3. Move forward and repeat
Conflict resolution
Any region containing overlapping alpha-helical and beta-strand assignments is taken to be helical if the average P(helix) > P(strand), and to be a beta strand if the average P(strand) > P(helix).
Accuracy: ~50-60%
GHWIATRGQLIREAYEDYRHFSSECPFIP
INITIATION

Identify regions where 4 of 6 residues have P(H) > 1.00: an alpha-helix nucleus.

       T   S   P   T   A    E    L    M    R   S   T   G
P(H)   69  77  57  69  142  151  121  145  98  77  69  57
PROPAGATION

Extend the helix in both directions until a set of four residues has an average P(H) < 1.00.

       T   S   P   T   A    E    L    M    R   S   T   G
P(H)   69  77  57  69  142  151  121  145  98  77  69  57
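A minimal Python sketch of the nucleation-and-extension idea from the last two slides. It is a simplified illustration rather than the full published Chou-Fasman procedure; the propensity dictionary covers only the residues of the toy sequence, with values taken from the P(H) row above rescaled so that 1.00 is neutral:

    # Helix propensities (from the table above, 1.00 = neutral).
    P_HELIX = {"T": 0.69, "S": 0.77, "P": 0.57, "A": 1.42, "E": 1.51,
               "L": 1.21, "M": 1.45, "R": 0.98, "G": 0.57}

    def mean_p(seq, i, j):
        """Average helix propensity of seq[i:j] (unknown residues treated as neutral)."""
        return sum(P_HELIX.get(aa, 1.0) for aa in seq[i:j]) / (j - i)

    def find_helix_regions(seq):
        """Nucleate on 6-residue windows with >= 4 favorable residues, then extend
        in both directions until a 4-residue window averages below 1.00."""
        regions = set()
        for i in range(len(seq) - 5):
            window = seq[i:i + 6]
            if sum(P_HELIX.get(aa, 1.0) > 1.0 for aa in window) >= 4:
                start, end = i, i + 6
                while start > 0 and mean_p(seq, start - 1, start + 3) >= 1.0:
                    start -= 1
                while end < len(seq) and mean_p(seq, end - 4, end) >= 1.0:
                    end += 1
                regions.add((start, end))
        return sorted(regions)

    print(find_helix_regions("TSPTAELMRSTG"))  # -> [(2, 11)] for this toy sequence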
FIRST GENERATION METHODS:
SINGLE RESIDUE STATISTICS

Chou & Fasman (1974 & 1978) :


Some residues have particular secondary-structure preferences, based on empirical frequencies of residues in α-helices, β-sheets, and coils.

Examples: Glu favors α-helix; Val favors β-strand.
CHOU-FASMAN METHOD
Name P(H) P(E) P(turn) f(i) f(i+1) f(i+2) f(i+3)
Alanine 142 83 66 0.06 0.076 0.035 0.058
Arginine 98 93 95 0.07 0.106 0.099 0.085
Aspartic Acid 101 54 146 0.147 0.11 0.179 0.081
Asparagine 67 89 156 0.161 0.083 0.191 0.091
Cysteine 70 119 119 0.149 0.05 0.117 0.128
Glutamic Acid 151 37 74 0.056 0.06 0.077 0.064
Glutamine 111 110 98 0.074 0.098 0.037 0.098
Glycine 57 75 156 0.102 0.085 0.19 0.152
Histidine 100 87 95 0.14 0.047 0.093 0.054
Isoleucine 108 160 47 0.043 0.034 0.013 0.056
Leucine 121 130 59 0.061 0.025 0.036 0.07
Lysine 114 74 101 0.055 0.115 0.072 0.095
Methionine 145 105 60 0.068 0.082 0.014 0.055
Phenylalanine 113 138 60 0.059 0.041 0.065 0.065
Proline 57 55 152 0.102 0.301 0.034 0.068
Serine 77 75 143 0.12 0.139 0.125 0.106
Threonine 83 119 96 0.086 0.108 0.065 0.079
Tryptophan 108 137 96 0.077 0.013 0.064 0.167
Tyrosine 69 147 114 0.082 0.065 0.114 0.125
Valine 106 170 50 0.062 0.048 0.028 0.053
Amino Acid Pα Pβ Pturn
Glu 1.51 0.37 0.74
Met 1.45 1.05 0.60
Ala 1.42 0.83 0.66
Val 1.06 1.70 0.50
Ile 1.08 1.60 0.50
Tyr 0.69 1.47 1.14
Pro 0.57 0.55 1.52
Gly 0.57 0.75 1.56
CHOU-FASMAN METHOD

Accuracy: Q3 = 50-60%
SECOND GENERATION METHODS: SEGMENT
STATISTICS

Similar to single-residue methods, but


incorporating additional information
(adjacent residues, segmental
statistics).

Problems:
Low accuracy: Q3 below 66%.
Q3 of β-strands (E): 28-48%.
Predicted structures were too short.
THE GOR METHOD

developed by Garnier, Osguthorpe & Robson


builds on Chou-Fasman Pij values
evaluates each residue PLUS the adjacent 8 N-terminal and 8 C-terminal residues
sliding window of 17 residues
underpredicts β-strand regions
GOR method accuracy: Q3 = ~64%
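A minimal sketch of the GOR-style window idea in Python; the information values here are random placeholders, not the published GOR parameters:

    import random

    AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
    STATES = "HEC"

    # Hypothetical information values I(state; aa at offset d), d in -8..+8.
    random.seed(0)
    INFO = {(s, aa, d): random.uniform(-0.5, 0.5)
            for s in STATES for aa in AMINO_ACIDS for d in range(-8, 9)}

    def predict_gor_like(seq):
        """For each residue, sum contributions from a 17-residue window and pick the best state."""
        pred = []
        for i in range(len(seq)):
            scores = {s: 0.0 for s in STATES}
            for d in range(-8, 9):
                j = i + d
                if 0 <= j < len(seq):
                    for s in STATES:
                        scores[s] += INFO.get((s, seq[j], d), 0.0)
            pred.append(max(scores, key=scores.get))
        return "".join(pred)

    print(predict_gor_like("GHWIATRGQLIREAYEDYRHFSSECPFIP"))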
PSSP: SECOND GENERATION
Based on the information contained in a window of amino acids (11-21 aa).
Most systems use algorithms based on:
Statistical information
Physico-chemical properties
Sequence patterns
Multi-layered neural networks
Graph theory
Multivariate statistics
Expert rules
Nearest-neighbour algorithms
Bayesian networks
PSSP: SECOND GENERATION

Main problems:
Prediction accuracy <70%
Prediction accuracy for β-strand: 28-48%
Predicted segments are usually too short, which makes the predictions difficult to use
NEAREST NEIGHBOR METHOD

o Predict secondary structure of the central


residue of a given segment from homologous
segments (neighbors)
(i) From database, find some number of the closest
sequences to a subsequence defined by a window
around the central residue, or
(ii) Compute K best non-intersecting local
alignments of a query sequence with each sequence.

o Use max(nα, nβ, nc) for neighbor consensus, or max(sα, sβ, sc) for consensus over sequence hits
ENVIRONMENT PREFERENCE
SCORE

Each amino acid has a preference for specific structural environments.
Structural variables:
secondary structure, solvent accessibility
Non-redundant protein structure database: FSSP

S(i, j) = log [ p(aa_i | E_j) / p(aa_i) ] = log [ p(aa_i, E_j) / ( p(aa_i) p(E_j) ) ]
SINGLETON SCORE MATRIX

Helix Sheet Loop


Buried Inter Exposed Buried Inter Exposed Buried Inter Exposed
ALA -0.578 -0.119 -0.160 0.010 0.583 0.921 0.023 0.218 0.368
ARG 0.997 -0.507 -0.488 1.267 -0.345 -0.580 0.930 -0.005 -0.032
ASN 0.819 0.090 -0.007 0.844 0.221 0.046 0.030 -0.322 -0.487
ASP 1.050 0.172 -0.426 1.145 0.322 0.061 0.308 -0.224 -0.541
CYS -0.360 0.333 1.831 -0.671 0.003 1.216 -0.690 -0.225 1.216
GLN 1.047 -0.294 -0.939 1.452 0.139 -0.555 1.326 0.486 -0.244
GLU 0.670 -0.313 -0.721 0.999 0.031 -0.494 0.845 0.248 -0.144
GLY 0.414 0.932 0.969 0.177 0.565 0.989 -0.562 -0.299 -0.601
HIS 0.479 -0.223 0.136 0.306 -0.343 -0.014 0.019 -0.285 0.051
ILE -0.551 0.087 1.248 -0.875 -0.182 0.500 -0.166 0.384 1.336
LEU -0.744 -0.218 0.940 -0.411 0.179 0.900 -0.205 0.169 1.217
LYS 1.863 -0.045 -0.865 2.109 -0.017 -0.901 1.925 0.474 -0.498
MET -0.641 -0.183 0.779 -0.269 0.197 0.658 -0.228 0.113 0.714
PHE -0.491 0.057 1.364 -0.649 -0.200 0.776 -0.375 -0.001 1.251
PRO 1.090 0.705 0.236 1.249 0.695 0.145 -0.412 -0.491 -0.641
SER 0.350 0.260 -0.020 0.303 0.058 -0.075 -0.173 -0.210 -0.228
THR 0.291 0.215 0.304 0.156 -0.382 -0.584 -0.012 -0.103 -0.125
TRP -0.379 -0.363 1.178 -0.270 -0.477 0.682 -0.220 -0.099 1.267
TYR -0.111 -0.292 0.942 -0.267 -0.691 0.292 -0.015 -0.176 0.946
VAL -0.374 0.236 1.144 -0.912 -0.334 0.089 -0.030 0.309 0.998
TOTAL SCORE

The alignment score is the sum of scores in a window of length l:

Score(i, j) = Σ_{k = -l/2}^{l/2} [ M(i+k, j+k) + c·S(i+k, j+k) ]

i-4 i-3 i-2 i-1 i i+1 i+2 i+3 i+4

T R G Q L I R E A Y E D Y R H F S S E C P F I P
| | | | |
. . .E C Y E Y B R H R . . . .
j-4 j-3 j-2 j-1 j j+1 j+2 j+3 j+4

L H H H H H H L L
NEIGHBORS

1 - L H H H H H H L L - S1
2 - L L H H H H H L L - S2
3 - L E E E E E E L L - S3
4 - L E E E E E E L L - S4
n - L L L L E E E E E - Sn
n+1 - H H H L L L E E E - Sn+1

max(nα, nβ, nL) or max(Σsα, Σsβ, ΣsL)
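A minimal Python sketch of the neighbor consensus step; the neighbor states and scores in the example are hypothetical:

    from collections import Counter

    def consensus_state(neighbor_states, neighbor_scores=None):
        """Pick the state (H/E/L) for the central residue, either by simple vote
        (max n_H, n_E, n_L) or by summed alignment score per state."""
        if neighbor_scores is None:
            counts = Counter(neighbor_states)          # max(n_H, n_E, n_L)
            return counts.most_common(1)[0][0]
        totals = Counter()
        for state, score in zip(neighbor_states, neighbor_scores):
            totals[state] += score                     # max(sum of scores per state)
        return totals.most_common(1)[0][0]

    # Hypothetical central-residue states from the K best neighbor segments:
    states = ["H", "H", "E", "E", "E", "L"]
    print(consensus_state(states))                               # 'E' by simple vote
    print(consensus_state(states, [2.1, 1.9, 0.4, 0.3, 0.2, 0.5]))  # 'H' by summed score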


EVOLUTIONARY INFORMATION

All naturally evolved proteins with more than


35% pairwise identical residues over more than
100 aligned residues have similar structures.
Stability of structure w.r.t. sequence divergence
(<12% difference in secondary structure).
Position-specific sequence profile, containing
crucial information on evolution of protein family,
can help secondary structure prediction (increase
information content).
Gaps rarely occur in helix and strand.
~1.4%/year increase in Q3 due to database
growth during past ~10 years.
HOW TO USE IT

Sequence-profile alignment.
Compare a sequence against protein family.

More specific.

BLAST vs. PSI-BLAST.

Look up PSSM instead of PAM or BLOSUM.


Score(i, j) = Σ_{k = -l/2}^{l/2} [ PSSM(j+k, i+k) + c·S(i+k, j+k) ]

(The PSSM is indexed by position and amino acid type.)
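A minimal Python sketch of profile-based window scoring; the PSSM here is a hypothetical dictionary keyed by (position, amino acid), not the output of any particular tool, and the singleton term S is omitted for brevity:

    def window_score(pssm, query_pos, neighbor_seq, neighbor_pos, half_window=7):
        """Sum PSSM scores over a window: the profile column at each query position
        is looked up with the amino acid of the aligned neighbor segment."""
        total = 0.0
        for k in range(-half_window, half_window + 1):
            i, j = query_pos + k, neighbor_pos + k
            if 0 <= j < len(neighbor_seq):
                total += pssm.get((i, neighbor_seq[j]), 0.0)
        return total

    # Hypothetical 3-position profile and neighbor segment:
    pssm = {(0, "A"): 1.2, (1, "L"): 0.8, (2, "V"): -0.3}
    print(window_score(pssm, query_pos=1, neighbor_seq="ALV", neighbor_pos=1, half_window=1))  # 1.7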


THIRD GENERATION METHODS

Third generation methods reached 77% accuracy.
They combine two new ideas:
1. A biological idea
Using evolutionary information based on
conservation analysis of multiple sequence
alignments.
2. A technological idea
Using neural networks.
PSSP: THIRD GENERATION
PHD: First algorithm in this generation (1994)
Evolutionary information improves the prediction accuracy to
72%
Use of evolutionary information:
1. Scan a database of known sequences with alignment methods to find similar sequences
2. Filter this list with a threshold to identify the most significant sequences
3. Build amino acid exchange profiles from the probable homologs (the most significant sequences)
4. Use the profiles in the prediction, i.e. in building the classifier
PSSP: THIRD GENERATION
Many of the second generation algorithms have
been updated to third generation
The most important algorithms of today

Predator: Nearest-neighbour
PSI-Pred: Neural networks
SSPro: Neural networks
SAM-T02: Homologs (Hidden Markov Models)
PHD: Neural networks
Due to the growth of protein information in databases, i.e. better evolutionary information, today's predictive accuracy is ~80%.
It is believed that the maximum reachable accuracy is ~88%.
PSSP DATA PREPARATION
Public protein data sets used in PSSP research contain protein sequences with their secondary structure assignments. In order to use classification algorithms we must transform these sequences into classification data tables.
Records in the classification data tables are called (learning) instances in the PSSP literature.
The mechanism used in this transformation process is called a window.
A window algorithm takes a sequence with its secondary structure as input and returns a classification table: a set of instances for the classification algorithm (see the sketch below).
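A minimal Python sketch of such a window transformation; the window size, padding character and toy sequence are illustrative choices:

    def windows_to_instances(sequence, structure, window=13):
        """Turn one (sequence, structure) pair into classification instances:
        each instance is the window of residues around a position, labeled with
        the secondary structure state of the central residue."""
        assert len(sequence) == len(structure)
        half = window // 2
        padded = "-" * half + sequence + "-" * half   # pad the chain ends
        instances = []
        for i in range(len(sequence)):
            features = padded[i:i + window]           # residues centred on position i
            label = structure[i]                      # H, E or C
            instances.append((features, label))
        return instances

    for feat, lab in windows_to_instances("GHWIATRG", "CCEEEECC", window=5)[:3]:
        print(feat, "->", lab)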
NEURONS

normal state

addictive state
NEURAL NETWORK

[Diagram: feed-forward multilayer network with input layer, hidden layer and output layer; junctions J1-J4 label the connection weights.]

Input signals are summed by neurons and turned into zero or one.

Feed-forward multilayer network
NEURAL NETWORK TRAINING

[Diagram of the training cycle: enter sequences, compare prediction to reality, adjust weights.]
SIMPLE NEURAL NETWORK

out_0 = J11 · in_1 + J12 · in_2

out = tanh(out_0)

Error = | out_net - out_desired |

Training a neural network
[Diagram: a worked example that adjusts the junction weights J11 and J12 to reduce the error, with the tanh transfer curve plotted for inputs from -2 to 2.]
SIMPLE NEURAL NETWORK WITH
HIDDEN LAYER

Simple Neural Network with Hidden Layer

out_i = f( Σ_j J2_ij · f( Σ_k J1_jk · in_k ) )
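A minimal NumPy sketch of this forward pass; the junction weights are random placeholders and tanh is used as the transfer function f, as on the previous slide:

    import numpy as np

    rng = np.random.default_rng(0)

    n_in, n_hidden, n_out = 2, 3, 1
    J1 = rng.normal(size=(n_hidden, n_in))    # input-to-hidden junction weights
    J2 = rng.normal(size=(n_out, n_hidden))   # hidden-to-output junction weights

    def forward(inputs):
        """out_i = f( sum_j J2_ij * f( sum_k J1_jk * in_k ) ), with f = tanh."""
        hidden = np.tanh(J1 @ inputs)
        return np.tanh(J2 @ hidden)

    print(forward(np.array([1.0, 0.0])))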
NEURAL NETWORK FOR SECONDARY
STRUCTURE
[Diagram: a sliding window of residues with known states, D (L), R (E), Q (E), G (E), F (E), V (E), P (E), A (H), A (H), Y (H), ..., V (E), K (E), K (E); each residue is encoded over the 20-letter amino acid alphabet (A C D E F G H I K L M N P Q R S T V W Y) and fed to a network whose three outputs correspond to H, E and L.]
PSIPRED

D. Jones, J. Mol. Biol. 292, 195 (1999).


Method : Neural network

Input data : PSSM generated by PSI-BLAST

Bigger and better sequence database


Combining several databases and data filtering
Training and test sets preparation
Secondary structure prediction only makes sense for
proteins with no homologous structure.
No sequence & structural homologues between training
and test sets by PSI-BLAST (mimicking realistic
situation).
PSI-PRED (DETAILS)

Window size = 15
Two networks
First network (sequence-to-structure):
315 = (20 + 1) × 15 inputs
an extra unit per position indicates whether the window spans the N or C terminus
data are scaled to the [0, 1] range using 1/[1 + exp(-x)]
75 hidden units
3 outputs (H, E, L)
Second network (structure-to-structure):
captures structural correlation between adjacent positions
60 = (3 + 1) × 15 inputs
60 hidden units
3 outputs
Accuracy ~76%
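A minimal NumPy sketch of the two-stage data flow with the layer sizes listed above; the weights are random placeholders, so this illustrates the architecture rather than the trained PSIPRED model:

    import numpy as np

    rng = np.random.default_rng(0)

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    # First network: (20 profile values + 1 terminus flag) x 15 positions -> 75 -> 3
    W1, W2 = rng.normal(size=(75, 315)), rng.normal(size=(3, 75))
    # Second network: (3 states + 1 terminus flag) x 15 positions -> 60 -> 3
    V1, V2 = rng.normal(size=(60, 60)), rng.normal(size=(3, 60))

    def seq_to_struct(profile_window):
        """First network: 315 scaled profile inputs -> raw H/E/L scores."""
        return sigmoid(W2 @ sigmoid(W1 @ profile_window))

    def struct_to_struct(structure_window):
        """Second network: 60 inputs built from 15 adjacent first-stage outputs."""
        return sigmoid(V2 @ sigmoid(V1 @ structure_window))

    stage1_out = seq_to_struct(sigmoid(rng.normal(size=315)))   # one window of the PSSM
    stage2_out = struct_to_struct(rng.normal(size=60))          # 15 x (3 outputs + flag)
    print(stage1_out, stage2_out)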
THE PSI-PRED ALGORITHM
Given an amino-acid sequence, predict
secondary structure elements in the
protein

Three stages:
1. Generation of a sequence profile (the
multiple alignment step)
2. Prediction of an initial secondary
structure (the neural network step)
3. Filtering of the predicted structure
(another neural network step)
GENERATION OF SEQUENCE PROFILE
A BLAST-like program called PSI-BLAST is used for this step
We saw BLAST earlier -- it is a fast way to find
high scoring local alignments
PSI-BLAST is an iterative approach
an initial scan of a protein database using the target
sequence T
align all matching sequences to construct a sequence
profile
scan the database using this new profile
Can also pick out and align distantly related
protein sequences for our target sequence T
THE SEQUENCE PROFILE LOOKS LIKE
THIS

The profile has 20 × M numbers, where M is the sequence length.
The numbers are log-likelihoods of each residue at each position.
PREPARING FOR THE SECOND STEP
Feed the sequence profile to an artificial neural
network
But before feeding, do a simple scaling to bring the numbers to the 0-1 range:

x → 1 / (1 + e^(-x))
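For example, applying this scaling to a raw profile column in NumPy (the input values are made up):

    import numpy as np

    def scale_profile(raw_profile):
        """Squash raw log-likelihood scores into the [0, 1] range with a logistic function."""
        return 1.0 / (1.0 + np.exp(-raw_profile))

    print(scale_profile(np.array([-3.0, 0.0, 2.5])))  # ~[0.047, 0.5, 0.924]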
ARTIFICIAL NEURAL NETWORK
Supervised learning algorithm
Training examples. Each example has a label
class of the example, e.g., positive or negative
helix, strand, or coil
Learns how to predict the class of an example
ARTIFICIAL NEURAL NETWORK
Directed graph
Nodes or units or neurons

Edges between units

Each edge has a weight (not known a priori)


LAYERED ARCHITECTURE

http://www.akri.org/cognition/images/annet2.gif

Input here is a four-dimensional vector. Each dimension goes


into one input unit
LAYERED ARCHITECTURE

http://www.geocomputation.org/2000/GC016/GC016_01.GIF

(units)
WHAT A UNIT (NEURON) DOES

Unit i receives a total input xi from the units


connected to it, and produces an output yi = fi(xi)
where fi() is the transfer function of unit i

x_i = Σ_{j ∈ N-{i}} w_ij y_j + w_i

y_i = f_i(x_i) = f_i( Σ_{j ∈ N-{i}} w_ij y_j + w_i )

w_i is called the bias of the unit
WEIGHTS, BIAS AND TRANSFER FUNCTION

Unit takes n inputs


Each input edge has weight wi
Bias b
Output a

Transfer function f()


Linear, Sigmoidal, or other
WEIGHTS, BIAS AND TRANSFER
FUNCTION

Weights
wij and bias wi of each unit are
parameters of the ANN.
Parameter values are learned from input data
Transfer function is usually the same for
every unit in the same layer
Graphical architecture (connectivity) is
decided by you.
Could use fully connected architecture: all
units in one layer connect to all units in next
layer
STEP 2
Feed the sequence profile to the input layer of an
ANN
Not the whole profile, only a window of 15
consecutive positions
For each position, there are 20 numbers in the
profile (one for each amino acid)
Therefore ~ 15 x 20 = 300 numbers fed
Therefore, ~ 300 input units in ANN
3 output units, for strand, helix, coil
each number is confidence in that secondary structure
for the central position in the window of 15
e.g., helix 0.18, strand 0.09, coil 0.67

[Diagram: a window of 15 profile positions feeding the input layer, a hidden layer, and the three output units.]
STEP 3

Feed the output of 1st ANN to the 2nd ANN


Each window of 15 positions gave 3 numbers from the
1st ANN
Take the outputs of 15 successive windows and feed them to the 2nd ANN
Therefore, ~ 15 x 3 = 45 input units in ANN
3 output units, for strand, helix, coil
CROSS-VALIDATION
Partition the training data into training set
(two thirds of the examples) and test set
(remaining one third)
Train PSIPRED on training set, test predictions
and compare with known answers on test set.
What is an answer?
For each position of sequence, a prediction of what
secondary structure that position is involved in
That is, a sequence over H/S/C (helix/strand/coil)
How to compare answer with known answer?
Number of positions that match
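A minimal Python sketch of this protocol; the labelled proteins and the all-coil predictor are hypothetical stand-ins:

    import random

    def fraction_matching(observed, predicted):
        """Number of positions that match, divided by sequence length."""
        return sum(o == p for o, p in zip(observed, predicted)) / len(observed)

    def split_train_test(proteins, test_fraction=1/3, seed=0):
        """Partition labelled proteins: two thirds for training, one third for testing."""
        shuffled = proteins[:]
        random.Random(seed).shuffle(shuffled)
        n_test = int(len(shuffled) * test_fraction)
        return shuffled[n_test:], shuffled[:n_test]   # (train, test)

    def evaluate(predictor, test_set):
        """Average per-protein agreement between predicted and known H/S/C strings."""
        scores = [fraction_matching(struct, predictor(seq)) for seq, struct in test_set]
        return sum(scores) / len(scores)

    # Hypothetical usage with a trivial all-coil predictor:
    data = [("GHWIATRG", "CCEEEECC"), ("AELMRSTG", "CHHHHHCC")]
    train, test = split_train_test(data, test_fraction=0.5)
    print(evaluate(lambda seq: "C" * len(seq), test))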
Suggested reading:
Chapter 15 in Current Topics in Computational
Molecular Biology, edited by Tao Jiang, Ying Xu,
and Michael Zhang. MIT Press. 2002.

Optional reading:
Review by Burkhard Rost:
http://cubic.bioc.columbia.edu/papers/2003_rev_dekker/paper.html
