N
_
and can be estimated using the en-
ergy landscape analysis of protein folding, which
argues that folding most readily occurs when
correctly folded states are highly stabilized with
respect to all alternatively folded states. Here, E is
the energy difference between the correct folded
state and the mean of the distribution over
misfolded compact structures; δE is the standard
deviation of the distribution over misfolded states.
Several studies on the classification of structures
of proteins and domains (Orengo et al., 1994; Islam
et al., 1995; Sowdhamini et al., 1996; Murzin et al.,
1995; Fischer et al., 1995) have shown that simi-
larities in structure and topology are most evident
at the level of domains rather than entire proteins,
and that conformational motifs recur with very
high frequency. Using potentials of mean force,
Wodak and co-workers have argued that some
J. Mol. Biol. (1997) 272, 95–105
0022–2836/97/360095–11 $25.00/0/mb971205 © 1997 Academic Press Limited
short segments that have a particularly high correlation
between their sequence and structure correspond
well to the sites that form early during folding
(Rooman et al., 1991, 1992). Preliminary studies of
foldon structures show that some of them are struc-
turally similar to each other and may thus have the
same evolutionary origin. Knowledge of structural
and evolutionary relationships between foldons
would be of great use in understanding the protein
folding process and the origin of protein structural
complexity. Here we attempt to determine the
number of different kinds of foldons required to
construct all foldable proteins. To this end we first
make structural comparisons of foldons and extract
structurally similar ones from our original data set.
It is important to note that foldon boundaries,
unlike structural domains, depend on the sequence of
the protein. Therefore, domains defined by purely
structural criteria may differ from the foldons of a
given protein in size and structure.
Another characteristic of foldons, which can be
used in classification, is their relative foldability Θ.
Foldons with large Θ should, in general, exhibit
some of the properties of whole proteins such as an
ability to fold independently and to recognize their
own sequence and structure in threading
algorithms. In most energy-based alignments the
sequence of an entire protein recognizes its own
structure as being the lowest energy alignment
(Hendlich et al., 1990; Bryant & Lawrence, 1993;
Jones et al., 1992; Huang et al., 1995; Koretke et al.,
1996; DeBolt & Skolnick, 1996). This indicates that
the sequence of a protein fits its own structure
rather well. The situation in which long-range
interactions do not contradict short-range
interactions, i.e. secondary structure propensities are
consistent with the tertiary interactions, has been
called the ``consistency principle'' (Go, 1983). This,
in turn, has motivated the more general principle of
minimal frustration (Bryngelson & Wolynes, 1987;
Bryngelson et al., 1995, and references cited therein).
Conceptually, the original version of the
consistency principle would seem to require that any
division of such a protein into smaller parts produce
subunits with consistent interactions. The energy
and structure of these subunits should be the same
in isolation and within the mean field of the native
protein, provided the interaction with the solvent
does not dominate. Experimental data show that
some of the protein fragments do in fact retain their
native structure when isolated from the rest of the
protein and can fold reversibly and independently
into the native state (Wetlaufer, 1981; Griko et al.,
1992; Ikura et al., 1993; Kippen et al., 1994). On the
other hand, the structure of some protein fragments
is distorted upon isolation, and subsequent
cleavage of the polypeptide chain can produce
segments whose stable conformations have low
structural similarity to their conformations in the
native state (Wetlaufer, 1981; Gay et al., 1995). This
indicates that the local-in-sequence interactions
within these protein segments are frustrated
by the interaction with the rest of the protein.
There are different ways to overcome frustration
during folding. In many models of kinetically fold-
able proteins (Bryngelson & Wolynes, 1987, 1989;
Onuchic et al., 1995; Leopold et al., 1992) the driv-
ing force for folding largely resides in tertiary
interactions alone, which start to dominate over
local interactions as folding proceeds. In view of
this, the consistency principle required
generalization and a more quantitative formulation, which is
provided by the energy landscape theory. In this
theory, interactions that stabilize native
configurations over others lead to a folding funnel,
resulting in fast folding either by a downhill mechanism
or by a mechanism involving activation to a
transition state with a relatively low barrier. The
competition between self-organization of folding in the
main funnel and the kinetic trapping in subsidiary
funnels, occurring obligatorily at a local glass
temperature, largely determines the overall rate of the
folding process. Folding can be very fast if the
folding temperature T_f is much greater than the glass
temperature T_g. Under these conditions we have
``minimal frustration''. Minimal frustration
quantifies the extent to which all interactions work
together to produce a folded, rather than a
misfolded, structure and implies smoothness of the
energy landscape and relatively high rates of protein
folding. Uncertainty about the configurational
entropy of molten globule states makes it difficult to
give a precise value for T_f/T_g, but if entropy is
extensive in chain length, T_f/T_g is a monotonically
increasing function of Θ. Thus, the principle of
minimal frustration allows us to estimate
quantitatively how fast a protein segment with partially
consistent internal interactions can fold into its
native state. If several segments have large Θ
values, this implies that these non-contiguous
segments may appear as stable intermediates in
protein folding.
The overall smoothness of the energy-landscape
as well as the stability of proteins are probably the
result of a long evolutionary process. If these fac-
tors represented the main selection pressure of
evolution and were strong enough, the protein
molecules could have become very stable, quickly
foldable, perfectly unfrustrated systems. However,
there are other selection pressures, namely that
foldable proteins should carry out specific
functions and should interact properly with other
components of the cell.
In the first section of this paper we determine
the number of different kinds of foldons required
to construct all foldable proteins. To this end we
first make structural comparisons of many foldons
and extract structurally similar ones from our
original data set. In the second section we check
whether our foldon data set is large enough to
rebuild the backbones of proteins and evaluate the
accuracy of the prediction using foldon matching
modelling. In the last part of this paper we study
how frustrated the structures of foldons from our
representative data set really are. In order to do
this we search for foldons which recognize their
96 The Foldon Universe
own sequence and structure upon threading. This
gives us an opportunity to understand to what extent
the ``consistency principle'' of local versus non-local
interactions is the dominant mode of achieving
minimal frustration used by evolution.
Results
Criterion for foldon structural similarity
It can be seen from the distribution of Q-scores
(Figure 1) that the distributions for sequence-sequence
and sequence-structure alignments
overlap each other in the neighborhood of Q ≈ 0.2.
Alignment of randomly permuted sequences of
target foldons with template structures yields a mean
Q-score of 0.15 (±0.09). As shown in Figure 1,
starting from Q > 0.3 there is an excess of
matches for sequence-structure alignment
compared to sequence-sequence alignment.
This indicates that for Q > 0.3 sequence-structure
alignment is able to discriminate structurally
similar foldons with low sequence identity from the
random matches of non-similar foldons with low
sequence identity. A precise cut-off value for the
Q-score is difficult to determine. Visual
examination of the structures of foldon pairs with Q ≈ 0.3
shows that although many structures are very
similar, others are not. Thus, Q = 0.3 can be used
as a cut-off value only fitfully.
In order to find a more appropriate cut-off value
we compared structures of 25 homologous proteins
from the globin family. The maximum of the
Q-score distribution for this data set is positioned at
Q = 0.42, and we used this value as the cut-off for
most of our studies. This cut-off corresponds to
r.m.s. deviations of less than 5 Å and is in agreement
with the r.m.s. threshold of ≈ 5.25 Å obtained for
40-residue-long α-turn-α motifs (Wintjens et al., 1996).
Using this cut-off value, most of the homologous
proteins are classified as identical. According to our
similarity criterion, two foldons A and B are
assumed to be structurally identical if the Q-score is
greater than 0.42 for both alignments: alignment of
sequence A with the structure of foldon B and alignment of
sequence B with the structure of foldon A. Using
this criterion we are able to eliminate the
dependence of the Q-score on the number of residues of
the target sequence and avoid matches comprising
a small number of residues. Our algorithm considers
only alignments of segments of similar length as a
good match, and foldons of significantly different
length are assumed to be different. In most cases
the lowest energy alignment corresponds to the
alignment of the entire target foldon, and only in a
few cases may the best alignment leave up to three
residues uncovered at the end of the target foldon.
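The two-way similarity criterion described above can be sketched as follows. This is a minimal illustration only: the function name and arguments are hypothetical, and the two Q-scores are assumed to have been computed already by the alignment procedure.

```python
Q_CUTOFF = 0.42  # cut-off value quoted in the text


def structurally_identical(q_a_on_b: float, q_b_on_a: float,
                           cutoff: float = Q_CUTOFF) -> bool:
    """Return True only if BOTH directional alignments exceed the cut-off.

    q_a_on_b: Q-score for sequence A aligned with the structure of foldon B.
    q_b_on_a: Q-score for sequence B aligned with the structure of foldon A.
    (Hypothetical names; the scores come from a separate alignment step.)
    """
    return q_a_on_b > cutoff and q_b_on_a > cutoff


print(structurally_identical(0.55, 0.47))  # both directions pass
print(structurally_identical(0.55, 0.30))  # one direction fails
```

Requiring both directions is what removes the dependence on the length of the target sequence: a short foldon matching part of a long one may score well in one direction but not in the other.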
Estimated size of the foldon universe
We first address the question of how many
structurally different kinds of foldons are present
in our data set. In order to find structurally similar
foldons we made 35,910 (190 × 189) pair-wise
foldon comparisons based on sequence-sequence
and sequence-structure alignments with no gaps
allowed (see Table 1). That is to say, the sequence
of each foldon was aligned to both the sequence
and the structure of all other foldons from our data
set. As a result of foldon alignments, seven foldon
pairs and no triplets were found at the Q = 0.42
level using either the original energy functions or
self-consistently optimized energy functions. For
comparison, at the level of Q = 0.3 we obtain 35
foldon doublets and 28 triplets. Two examples of
structurally similar foldon pairs with different
Q-scores are presented in Figure 2(a) and (b).
In order to estimate the number of different
kinds of foldons in the underlying set of proteins,
we assume that the underlying set of foldons of
size N is randomly mixed and each foldon has the
same probability of being found in any protein.
Then, the probability of observing any given foldon
doublet in the sub-set would be 1/N. If we select from the
underlying set of foldons a sub-set of size n, then
the number of possible foldon pairs occurring
upon pair-wise foldon comparison would be
n(n − 1)/2. The number of matches representing
pairs of the same kind of foldon can be estimated
as n(n − 1)/2N; with n = 190 and the seven pairs found
at Q = 0.42, this yields about 2600 foldons in
the universe. The very generous cut-off Q = 0.3
would yield a smaller size, N ≈ 200 to 500 foldons.
The approximation we used regarding uniform
distribution of protein folds is rather rough. A
more refined argument should take into account
that different families of protein folds are not
equally populated (Orengo et al., 1994). It is worth
Figure 1. Distribution of Q-scores for two types of align-
ment: genetic alignment based on the Smith-Waterman
alignment algorithm with Dayhoff scoring matrix
(shaded boxes), and sequence-structure alignment with the
original energy functions (open boxes). Similar results
were obtained using the self-consistently optimized
energy function. Inset: right tail of the same distri-
butions in enlarged scale. The sequence of each foldon
was aligned to the sequence-structure of all foldons
from the data set except the target foldon.
noting that the size of the foldon universe obtained
in the present work is somewhat smaller but of the
same order as the size of exon universe (1000 to
7000). This argues in favor of the idea that exons
and foldons were once related, although in some
cases the same foldon may be encoded by different
exons as a result of a convergent evolution process.
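The arithmetic behind the universe-size estimate in this section can be checked with a short sketch. The function name is hypothetical; the inputs (190 foldons, seven matching pairs at Q = 0.42) are the values quoted in the text, and the uniform-mixing assumption is the one the text itself calls rough.

```python
def universe_size(n: int, matches: int) -> float:
    """Estimate N from n sampled foldons and the number of matching pairs.

    Under uniform random mixing, the expected number of matching pairs is
    n(n - 1) / (2N), so N is approximately n(n - 1) / (2 * matches).
    """
    pairs = n * (n - 1) / 2
    return pairs / matches


n = 190
print(n * (n - 1))                 # 35910 ordered comparisons, as in the text
print(round(universe_size(n, 7)))  # 2565, i.e. "about 2600"
```

Because the number of matches sits in the denominator, the estimate is very sensitive to the cut-off: a handful of extra matches at a looser Q threshold shrinks N considerably.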
Is the current foldon data set complete enough
for structure prediction?
Several attempts have been made to use infor-
mation about the structure of protein fragments in
order to guide the conformational search pro-
cedures for complete native structures (Bowie &
Eisenberg, 1994; Srinivasan & Rose, 1995). Predic-
tion by recognition of known structures of protein
fragments based on sequence-structure alignment
algorithms (Bowie et al., 1991; Jones et al., 1992)
looks for short fragments compatible with the seg-
ment of sequence to be folded. A set of fragments
with optimized sequence-structure relationships
would therefore be of great use in solving this pro-
blem.
In order to check whether our foldon data set
can be used for protein structure prediction, we
threaded the sequence of several proteins through
the structures of foldons from our data set. We
then chose those foldons which have the highest
Q-score with the native structure of the target pro-
tein. As a result about two-thirds of the sequence
of most of the target proteins were covered with
overlapping foldons having Q > 0.42, whereas
another part of the sequence remained structurally
uncovered. The final model structure of the target
sequence can be obtained by using the modelling
technique implemented in QUANTA as described
by Sali & Blundell (1993). According to this
algorithm, comparative modelling proceeds by
satisfaction of spatial restraints on the unknown
sequence. The spatial restraints are obtained by
multiple alignment of the target sequence with
overlapping foldons with Q > 0.42 and are defined
in terms of a probability density function.
Optimization of the probability density function is
implemented in the program MODELLER. We
applied MODELLER to rebuild the structure of
phage 434 Cro protein. We chose this protein since
it has very low sequence identity even with its
closest homologs. The model backbone structure for
the major part of the sequence of 434 Cro protein
(residues 27 to 65), obtained by alignment of its
sequence with the structures of seven foldons from
our data set, is presented in Figure 3. The model
structure obtained has an r.m.s. error of 4.8 Å with
respect to the native structure, and the Q-score is
equal to 0.37. The model structure is not as
accurate as an X-ray structure (resolution for 434 Cro
protein is 2.35 Å [...]

Figure 2. [...] on Cα atoms. These foldons, as well as the
corresponding exons, seem to be a result of duplication in
the course of evolution. (b) Foldons from 434 Cro
protein and myoglobin with averaged Q = 0.31 and r.m.s.
3.6 Å.
Sequences of most proteins do in fact represent the
best matching sequences for the native confor-
mations, since these sequences recognize their
native structure as the lowest energy conformation
among the large number of alternative alignments
(Bryant & Lawrence, 1993; Goldstein et al., 1992b;
Koretke et al., 1996). In the case of foldons or
protein fragments it is not clear how many of them
would find their native structure and sequence as
a lowest energy alignment in the threading
procedure.
To find foldons that recognize their own
sequence and structure, the sequence of each
foldon was translated along the scaffolds from 100
different protein structures. The sequences of these
proteins were then threaded through the structure
of the target foldon. Altogether, about 5000 alterna-
tive structures were generated in both cases. This
procedure was repeated for each of the 190 foldons
from the data set. Q-scores between target and
template structures were estimated according to
equation (3) and the lowest energy alignment was
taken to be the best one.
A protein molecule can be found in the different
conformations depending on the functional state,
temperature and pH conditions. We showed that
X-ray structures of the sperm whale myoglobin ob-
tained under different states of the heme iron have
a native energy difference not exceeding 10%. If
the energy of the native conformation of a target
foldon is lower or within 10% of the energy of the
best alignment of the target foldon sequence with
the template structures, this indicates that the fol-
don would recognize its own structure as the best
choice. In this case the corresponding Q-score
between the native structure of the foldon and the
structure of the best alignment is equal to 1.0. On
the other hand, if the best alignment of sequences
of template foldons with the structure of the target
foldon has a larger energy than the native state of
target foldon, we believe the target foldon recog-
nizes its own sequence. A foldon recognizing the
sequence or structure of a structurally similar fol-
don would have a Q-score very close but not equal
to 1.0. We use Q = 0.9 as an upper bound for
self-recognition.
As a result of the current sequence-structure
threading procedure, 41% of all foldons recognize
their own structure while 11% of foldons recognize
their own sequence. About 50% of all foldons from
the data set pick up neither their sequence nor
their structure as a lowest energy alignment. Thus,
if we limit ourselves by considering only individu-
ally non-frustrated foldons then the size of the
existing foldon database would be reduced by half.
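The 10% energy criterion for structure self-recognition described in this section can be sketched as below. This is an illustration under stated assumptions: the function name is hypothetical, energies are taken to be negative numbers with lower meaning more favourable, and "within 10%" is interpreted as 10% of the magnitude of the best alternative alignment energy.

```python
def recognizes_own_structure(e_native: float, e_best_alt: float,
                             tol: float = 0.10) -> bool:
    """True if the native energy is lower than, or within `tol` (fractional)
    of, the energy of the best alternative alignment.

    Assumption: the tolerance is measured relative to |e_best_alt|.
    """
    return e_native <= e_best_alt + tol * abs(e_best_alt)


print(recognizes_own_structure(-105.0, -100.0))  # native is lowest
print(recognizes_own_structure(-95.0, -100.0))   # within 10% of the best
print(recognizes_own_structure(-80.0, -100.0))   # outside the tolerance
```

The tolerance mirrors the observation in the text that different X-ray structures of the same protein differ in native energy by no more than about 10%.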
Figure 3. The model structure of region 27 to 65 of 434
Cro protein as a result of threading the sequence of the
protein through the structures of seven overlapping foldons
from six proteins with the following PDB codes:
2TRX, 1MBA, 3ADK, 3PTN, 5ABP and 5TLN. These
foldons have a Q-score ≥ 0.42 with the native regions of
434 Cro protein. The final model structure of the target
sequence was obtained using the MODELLER modelling
technique implemented in QUANTA. The overall
Q-score between the model structure and the same region
of the native protein is equal to 0.37 and the r.m.s. value is
4.8 Å.
Table 1

PDB file  Protein name                                        Length  Foldon junctions
2ovo      Ovomucoid third domain                              56      56c
4pti      Trypsin inhibitor                                   58      58b
2cro a    434 Cro protein                                     65      25c 65c
2ci2 a    Chymotrypsin inhibitor 2                            65      65c
1ubq      Ubiquitin                                           76      31c 76c
3fxc      Ferredoxin                                          98      37c 98c
5pcy      Plastocyanin                                        99      34c 99b
1wrp a    Trp repressor                                       102     102b
1cyc      Ferrocytochrome (C)                                 103     34 55 76 103
256b      Cytochrome B562, chain A                            106     34c 70 106c
2ssi a    Subtilisin inhibitor                                107     28 46 107c
1rei      Bence-Jones protein, chain A                        107     22 46c 73 107
2trx      Thioredoxin, chain A                                108     25c 79c 108
5cpv      Parvalbumin B                                       108     35c 59c 108
1hmq      Methemerythrin, chain A                             113     28c 55 113c
1bp2      Phospholipase A2                                    123     25 46 123
1rbb      Ribonuclease B, chain A                             124     32 54 73 100 124
2lyz      Lysozyme                                            129     28c 55c 103b 129
2aza      Azurin, chain A                                     129     31c 112c 129
1snc a    Staphylococcal nuclease                             135     22 52 88 135
3fxn      Flavodoxin                                          138     97b 138b
1mba      Myoglobin                                           146     28 64c 91 146
2sod      Superoxide dismutase, chain B                       151     28c 82c 151c
2i1b      Interleukin                                         153     43 100c 121c 153
4tnc a    Troponin C                                          160     22c 58c 160b
4gcr      γII-Crystallin                                      174     37b 82b 124c 148c 174
2ltn      Lectin, chain A                                     181     46 85 109 181c
8dfr      Dihydrofolate reductase                             186     28 121b 163 186
3adk      Adenylate kinase                                    194     37b 76c 115c 136 172c 194c
2act a    Actinidin                                           218     127b 196b 218
3ptn      Trypsin                                             223     106b 142 163c 178 223
3cna      Concanavalin A                                      237     85c 142 175 193 237
1tim      Triose-phosphate isomerase, chain A                 247     22 61 88 124 163c 202 247c
3hla      Human class I histocompatibility antigen, chain A   270     37 61 82 187b 217 247 270
2prk      Proteinase K                                        279     31 100c 133 163 199 226 259c 279
5abp      L-Arabinose binding protein                         306     52c 106c 130c 172c 199 229c 306c
5cpa      Carboxypeptidase A                                  307     22c 55c 112c 166c 193 226 271 307
8atc      Aspartate carbamoyltransferase, chain A             310     70 115 157 184 211 250 277 310
3tln      Thermolysin                                         316     55 88 127 235 265 292 316
1pfk      Phosphofructokinase, chain A                        320     22 52 91 121c 139 166c 202 235 256 319
1cms      Chymosin B                                          323     40c 88c 169b 214 247c 268 307 323
2liv      Leu/Ile/Val binding protein                         344     49c 82c 115 169c 199 259c 344c
8adh      Apo-liver alcohol dehydrogenase                     374     28 82 106 142 169 193 214 241c 265 352 374c

Right terminus boundary of each foldon is indicated as an amino acid number of the C terminus. Subsequent foldon starts at the next amino acid and proceeds till the terminus.
a Proteins whose X-ray structures lack the residues at the beginning or at the end of the polypeptide chain. Effective length of the part of the protein with the determined structure is given. Foldon junctions were calculated with respect to the beginning and end of X-ray determined protein structures.
b Boundaries of foldons from the first group which recognize their own sequence and structure.
c Boundaries of foldons which recognize only their structure.
We found that all foldons can be clustered into
three groups with respect to average Q-scores and
foldability (Figure 4). The first group contains
foldons exhibiting some characteristics of the
whole protein, such as the ability to fold rapidly and to
recognize their own sequence and structure. These
least frustrated foldons contain largely long-range
interactions and describe turns and super-secondary
structures. This group is characterized by
high Θ values and by Q-scores larger than
0.9. Boundaries of foldons from the first group are
marked b in Table 1. As can be seen
from Figure 5, the energy gap between the native
and first excited state for most foldons from this
group is rather large, indicating their low sequence
and structure degeneracy. Figure 5 shows the
energy distribution of the alternative structures for
the third lysozyme foldon. The sequence of this
foldon fits its own structure rather well since there is
a big gap between the native energy and the first
excited state. However, we found that the sequence
of the fragment 52 to 99 from α-lactalbumin fits
the structure of this foldon even better than the
structure of lysozyme. These two proteins are very
similar in structure and the corresponding Q-score
is very close to 1. Foldons from the second group
recognize either their own sequence or structure,
and some foldons from this group pick up the
sequence/structure of rather similar foldons as a
best choice. A third group comprises highly
frustrated foldons with low Θ values, which
recognize neither their sequence nor their structure
(Figure 4). It is interesting to note that the Q-score
can serve as a measure of the fidelity with which
foldons recognize themselves and is correlated with
their relative foldability Θ. In other words, sequences
which can recognize themselves in a threading
algorithm have a rather smooth energy landscape
with a deep funnel leading to the ground state.
Discussion
The key notion examined here is that proteins
can be decomposed into autonomously folding
units which can evolve independently and serve as
building blocks in construction of protein struc-
tures. In order to estimate the number of structu-
rally different foldons in the protein data base, we
performed sequence-sequence and sequence-struc-
ture alignments between foldons, and obtained
about 2600 foldons in the representative data set.
The size of the representative data set is rather
sensitive to the cut-off value for structural similarity.
Values of the Q-score needed for assigning
structural identity cannot be precisely determined, and
Q = 0.42 may be considered as a lower limit for
the cut-off value, indicating that the size of the foldon
universe would probably exceed 2600 foldons. On
the other hand, according to Chothia (1994), only a
small fraction of existing proteins is represented in
the Protein Data Bank. Assuming there is no strong
bias in the data banks and the number of proteins
with duplicated foldons is not very large, one
can obtain approximately 1000 different protein
families. Multiplying this number by the average
number of different foldons per protein (4.4 ± 2.5)
we obtain 4400 different foldon classes. These
estimates suggest that our present foldon data set,
comprising 183 types of structurally different
foldons, represents only a small fraction of the
underlying foldon universe. In this sense many families
of foldons are not yet represented in our data set.
Classifications of α-turn-α motifs also show that
even for fragments more than 25 residues long,
analysis yields very weakly populated families
(Wintjens et al., 1996).
We have shown that the backbones of the major
parts of several proteins can be reproduced with
an r.m.s. error of about 5 Å [...]

[...] $\sum_{k=1}^{2} \gamma_c(A_i, A_j, \ldots)\, u(r_c^k - r_{ij})$,
where u is a step function with cut-off distance $r_c^k$
($r_c^1 = 5$ Å, $r_c^2 = 12$ Å). Contact distances
are estimated between Cβ atoms, while for the
amino acid Gly the coordinates of Cα atoms have been
used. The parameters γ are optimized in such a
way as to maximize the ratio E/δE.
We used two different energy functions which
differed from each other by the type of optimiz-
ation. The original energy function was optimized
to maximize the stability gap in units of the stan-
dard deviation of the molten globule distribution.
Misfolded conformations were generated by
threading the target sequence through the confor-
mations of structurally different proteins
(Goldstein et al., 1992b). Self-consistently optimized
energy functions took into account the partial
order of misfolded states (Koretke et al., 1996). In
this case the stability gap represents the energy
difference between the folded conformation and
thermally occupied minima in the ensemble of
the molten globule conformations. Partially folded
structures were generated by alignment of
trial sequences against known structures. The
Hamiltonian with self-consistently optimized γ values
also included a term based on statistics of back-
bone hydrogen bonds.
Alignment procedures
To find the best match between sequences, we
applied the standard sequence homology technique
based on the Smith-Waterman alignment algorithm
with a Dayhoff scoring matrix (Waterman et al.,
1976). Sequence-structure alignments were based
on the empirical Hamiltonian and implemented in
the mean field fashion (Goldstein et al., 1993). We
do not allow for gaps in the sequences and
structures of foldons and assume that insertions and
deletions occur between independently folding
units. In order to identify the sub-sequence
compatible with a given structure, we searched for the
lowest energy alignment using the Needleman-Wunsch
algorithm (Needleman & Wunsch, 1970)
with an iterative matrix method of calculation. The
elements of the scoring matrix $H_{ii'}$ represent the
energy contribution of amino acid $A_i$ of target
sequence A embedded into location $i'$ of template
structure B, the so-called ``frozen approximation''
(Godzik et al., 1992).
Using the scoring matrix H:

$$H_{ii'} = \gamma(A_i, C^B_{i'}) + \sum_{k' \neq i'} \sum_{j=1}^{2} \gamma(A_i, B_{k'})\, u(r_c^j - r_{i'k'}) \qquad (1)$$

we obtain an initial alignment which maps the
residue k of sequence A to the residue $k'$ of sequence B
($P(k) = k'$). Then, the updated score representing
the energy contribution of residue $A_i$ in a new
environment can be written as:

$$H_{ii'} = \gamma(A_i, C^B_{i'}) + \sum_{k' \neq i'} \sum_{j=1}^{2} \gamma(A_i, A_k)\, u(r_c^j - r_{i'k'}) \qquad (2)$$

The procedure of alignment and calculation of
subsequent scoring matrices was repeated and
converged within five iterations in most cases.
Criteria of structural similarity
The Q-score, which measures contact pattern
similarity, was used to determine the degree of
similarity between the structures of target and
template foldons (Goldstein et al., 1992a). The Q-score
is calculated using a Gaussian function of the
inter-residue distance centered at zero with a standard
deviation of $|j - i|^{0.15}$ Å. Up to a normalization
factor that depends on $N_A$:

$$Q \propto \sum_{i \neq j} \exp\left[ -\frac{(r^A_{ij} - r^B_{i'j'})^2}{2\,|j - i|^{0.3}} \right] \qquad (3)$$

Here, $r^A_{ij}$ and $r^B_{i'j'}$ are the distances between Cβ atoms of
residues i and j of protein A and of the aligned residues $i'$ and $j'$
of protein B, respectively; $N_A$ is the length of the
target sequence.
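A minimal sketch of a Q-score of the form of equation (3) follows. It is illustrative only: the function name is hypothetical, coordinates are assumed to be already aligned residue by residue, and the normalization (division by the number of residue pairs, so that identical structures give Q = 1) is an assumption, not necessarily the prefactor used in the paper.

```python
import math


def q_score(coords_a, coords_b):
    """Contact-pattern similarity between two aligned coordinate lists.

    coords_a, coords_b: equal-length sequences of (x, y, z) tuples, e.g.
    C-beta positions in Angstroms, with residue i of A aligned to residue
    i of B. Each pair (i, j) contributes a Gaussian in the difference of
    inter-residue distances, with variance |j - i|**0.3.
    """
    n = len(coords_a)
    if n != len(coords_b) or n < 2:
        raise ValueError("need two aligned structures of equal length >= 2")
    total = 0.0
    pairs = 0
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            r_a = math.dist(coords_a[i], coords_a[j])
            r_b = math.dist(coords_b[i], coords_b[j])
            total += math.exp(-(r_a - r_b) ** 2 / (2 * abs(j - i) ** 0.3))
            pairs += 1
    return total / pairs  # assumed normalization: mean over all pairs


# Toy example: a straight chain compared with itself and with a stretched copy.
chain = [(1.5 * i, 0.0, 0.0) for i in range(10)]
stretched = [(3.0 * i, 0.0, 0.0) for i in range(10)]
print(q_score(chain, chain))      # identical structures
print(q_score(chain, stretched))  # distorted structure scores lower
```

Because only inter-residue distances enter, the score is unchanged by rigid translation or rotation of either structure, so no superposition step is needed.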
Determination of the foldon junctions
Foldon boundaries were determined according
to the following criterion (Panchenko et al., 1996):
the polypeptide chain was cleaved after some
residue j and the average Θ-value of the N-terminal
(from the first residue to residue j) and the
C-terminal (from residue j to the last residue) segments was
computed using $\Theta_j = (\Theta_{N,j} + \Theta_{j,C})/2$. The cleavage
point was then moved along the chain, and the
position of the first local maximum of $\Theta_j$ located
the boundary of the first foldon. The cleavage
procedure is repeated, with each subsequent foldon
beginning where the previous one was assumed to
end. We believe that the data-based energy
functions used here are close enough to actual free
energies such that the calculated foldons are good
first approximations to the physico-chemical
folding units.
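The boundary search described above can be sketched as a scan for the first local maximum of the averaged foldability. This is a hedged illustration: `first_junction` and the `theta` callable are hypothetical names, and `theta(a, b)` stands in for whatever routine evaluates the relative foldability of the segment from residue a to residue b.

```python
from typing import Callable, Optional


def first_junction(n_res: int, theta: Callable[[int, int], float],
                   start: int = 0) -> Optional[int]:
    """Return the cleavage position j at the first local maximum of
    (theta(start, j) + theta(j, n_res)) / 2, or None if there is none.

    `theta` is an assumed, user-supplied foldability function.
    """
    scores = {j: (theta(start, j) + theta(j, n_res)) / 2
              for j in range(start + 1, n_res)}
    positions = sorted(scores)
    for prev, cur, nxt in zip(positions, positions[1:], positions[2:]):
        if scores[cur] > scores[prev] and scores[cur] >= scores[nxt]:
            return cur
    return None


# Toy foldability: only segments of exactly 8 residues are "foldable".
def toy_theta(a: int, b: int) -> float:
    return 1.0 if b - a == 8 else 0.0


print(first_junction(20, toy_theta))  # 8
```

Subsequent foldons would be found by calling the function again with `start` set to the previous junction, mirroring the repeated cleavage procedure in the text.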
Acknowledgements
A.P. thanks A. Akmal for the careful reading of the
manuscript. Our work was supported by National Insti-
tutes of Health grant PHS R01 GM44557.
References
Abkevich, V. I., Gutin, A. M. & Shakhnovich, E. I.
(1994). Specific nucleus as the transition state for
protein folding: an evidence from the lattice model.
Biochemistry, 33, 10026–10036.
Bernstein, F. C., Koetzle, T. F., Williams, G. J. B., Meyer,
E. F., Jr, Brice, M. D., Rodgers, J. R., Kennard, O.,
Shimanouchi, T. & Tasumi, M. (1977). The Protein
Data Bank: a computer-based archival file for
macromolecular structures. J. Mol. Biol. 112, 535–542.
Bowie, J. U. & Eisenberg, D. (1994). An evolutionary
approach to folding small α-helical proteins that
uses sequence information and an empirical guiding
fitness function. Proc. Natl Acad. Sci. USA, 91, 4436–4440.
Bowie, J. U., Luthy, R. & Eisenberg, D. (1991). A method
to identify protein sequences that fold into a known
three-dimensional structure. Science, 253, 164–170.
Bryant, S. H. & Lawrence, C. E. (1993). An empirical
energy function for threading protein sequence
through the folding motif. Proteins: Struct. Funct.
Genet. 16, 92–112.
Bryngelson, J. D. & Wolynes, P. G. (1987). Spin glasses
and the statistical mechanics of protein folding.
Proc. Natl Acad. Sci. USA, 84, 7524–7528.
Bryngelson, J. D. & Wolynes, P. G. (1989). Intermediates
and barrier crossing in a random energy model
(with applications to protein folding). J. Phys. Chem.
93 (19), 6902–6915.
Bryngelson, J. D., Onuchic, J. N., Socci, N. D. &
Wolynes, P. G. (1995). Funnels, pathways and the
energy landscape of protein folding: a synthesis.
Proteins: Struct. Funct. Genet. 21, 167–195.
Chothia, C. (1994). Protein families in the metazoan
genome. Development, S, 27–33.
DeBolt, S. E. & Skolnick, J. (1996). Evaluation of atomic
level mean force potentials via inverse folding and
inverse refinement of protein structures: atomic burial
position and pair-wise non-bonded interactions.
Protein Eng. 9, 637–655.
Fischer, D., Tsai, C.-J., Nussinov, R. & Wolfson, H.
(1995). A 3-D sequence-independent representation
of the protein data bank. Protein Eng. 8, 981–997.
Gay, G. P., Ruiz-Sanz, J., Neira, J. L., Itzhaki, L. S. &
Fersht, A. R. (1995). Folding of a nascent polypeptide
chain in vitro: cooperative formation of structure
in a protein module. Proc. Natl Acad. Sci. USA,
92, 3683–3686.
Go, N. (1983). Theoretical studies of protein folding.
Annu. Rev. Biophys. Bioeng. 12, 183–210.
Godzik, A., Skolnick, J. & Kolinski, A. (1992). Topology
fingerprint approach to the inverse protein folding
problem. J. Mol. Biol. 227, 227–238.
Goldstein, R. A., Luthey-Schulten, Z. A. & Wolynes,
P. G. (1992a). Optimal protein-folding codes from
spin-glass theory. Proc. Natl Acad. Sci. USA, 89,
4918–4922.
Goldstein, R. A., Luthey-Schulten, Z. A. & Wolynes,
P. G. (1992b). Protein tertiary structure recognition
using optimized Hamiltonians with local
interactions. Proc. Natl Acad. Sci. USA, 89, 9029–9033.
Goldstein, R. A., Luthey-Schulten, Z. A. & Wolynes,
P. G. (1993). Protein tertiary structure recognition
using optimized associative memory Hamiltonians.
In Proc. 26th Annual Hawaii International Conference
on System Sciences (Mudge, T. N., Milutinovic, V. &
Hunter, L., eds), vol. 1, pp. 699–707, IEEE Computer
Society Press.
Griko, Y. V., Rogov, V. V. & Privalov, P. L. (1992).
Domains in λ cro repressor: a calorimetric study.
Biochemistry, 31, 12701–12705.
Hendlich, M., Lackner, P., Weitchkus, S., Floeckner, H.,
Froschauer, R., Gottsbacher, K., Casari, G. & Sippl,
M. J. (1990). Identification of native protein folds
amongst a large number of incorrect models: the
calculation of low energy conformations from
potentials of mean force. J. Mol. Biol. 216, 167–180.
Hinds, D. A. & Levitt, M. (1996). From structure to
sequence and back again. J. Mol. Biol. 258, 201–209.
Hirst, J. D. & Brooks, C. L. (1995). Molecular dynamics
simulations of isolated helices of myoglobin.
Biochemistry, 34, 7614–7621.
Holm, L. & Sander, C. (1994). Parser for protein folding
units. Proteins: Struct. Funct. Genet. 19, 256–268.
Huang, E. S., Subbiah, S. & Levitt, M. (1995). Recognizing
native folds by the arrangement of hydrophobic
and polar residues. J. Mol. Biol. 252, 709–720.
Ikura, T., Go, N., Kohda, D., Inagaki, F., Yanagawa, H.,
Kawabata, M., Kawabata, S., Iwanaga, S., Noguti,
T. & Go, M. (1993). Secondary structural features of
modules m2 and m3 of barnase in solution by NMR
experiment and distance geometry calculation.
Proteins: Struct. Funct. Genet. 16, 341–356.
Islam, S. A., Luo, J. & Sternberg, M. J. E. (1995).
Identification and analysis of domains in proteins. Protein
Eng. 8, 513–525.
Jennings, P. A. & Wright, P. E. (1993). Formation of a
molten globule intermediate early in the kinetic
folding pathway of apomyoglobin. Science, 262,
892–896.
Jones, D. T., Taylor, W. R. & Thornton, J. M. (1992). A
new approach to protein fold recognition. Nature,
358, 86–89.
Kabsch, W. & Sander, C. (1983). Dictionary of protein
secondary structure: pattern recognition of
hydrogen-bonded and geometrical features. Biopolymers,
22, 2577–2637.
Kippen, A. D., Sancho, J. & Fersht, A. R. (1994). Folding
of barnase in parts. Biochemistry, 33, 3778–3786.
Klimov, D. K. & Thirumalai, D. (1996). Criterion that
determines the foldability of proteins. Phys. Rev.
Letters, 76, 4070–4073.
Koretke, K. K., Luthey-Schulten, Z. & Wolynes, P. G.
(1996). Self-consistently optimized statistical mechanical
energy functions for sequence structure
alignment. Protein Sci. 5, 1043–1059.
Leopold, P. E., Montal, M. & Onuchic, J. N. (1992). Protein
folding funnels: a kinetic approach to the
sequence-structure relationship. Proc. Natl Acad. Sci.
USA, 89 (18), 8721–8725.
Levitt, M. (1992). Accurate modeling of protein conformation
by automatic segment matching. J. Mol. Biol.
226, 507–533.
Murphy, K. P., Bhakuni, V., Xie, D. & Freire, E. (1992).
Molecular basis of co-operativity in protein folding:
(III) structural identification of cooperative folding
units and folding intermediates. J. Mol. Biol. 227,
293–306.
Murzin, A. G., Brenner, S. E., Hubbard, T. & Chothia, C.
(1995). SCOP: a structural classification of proteins
database for the investigation of sequences and
structures. J. Mol. Biol. 247, 536–540.
Needleman, S. B. & Wunsch, C. D. (1970). A general
method applicable to the search for similarities in
the amino acid sequence of two proteins. J. Mol.
Biol. 48, 443–453.
Onuchic, J. N., Wolynes, P. G., Luthey-Schulten, Z. &
Socci, N. D. (1995). Toward an outline of the topography
of a realistic protein-folding funnel. Proc.
Natl Acad. Sci. USA, 92, 3626–3630.
Orengo, C. A., Jones, D. T. & Thornton, J. M. (1994).
Protein super-families and domain super-folds.
Nature, 372, 631–634.
Panchenko, A. R., Luthey-Schulten, Z. & Wolynes, P. G.
(1996). Foldons, protein structural modules, and
exons. Proc. Natl Acad. Sci. USA, 93, 2008–2013.
Richards, F. M. (1977). Areas, volumes, packing and protein
structure. Annu. Rev. Biophys. Bioeng. 6, 151–176.
Rooman, M. J., Kocher, J.-P. A. & Wodak, S. J. (1991).
Prediction of protein backbone conformation based
on seven structure assignments: influence of local
interactions. J. Mol. Biol. 221, 961–979.
Rooman, M. J., Kocher, J.-P. A. & Wodak, S. J. (1992).
Extracting information on folding from the amino
acid sequence: accurate predictions for protein
regions with preferred conformation in the absence
of tertiary interactions. Biochemistry, 31, 10226–10238.
Sali, A. & Blundell, T. L. (1993). Comparative protein
modelling by satisfaction of spatial restraints. J. Mol.
Biol. 234, 779–815.
Segawa, S.-I. & Richards, F. M. (1988). Identifications of
regions of potential flexibility in protein structures:
folding units and correlations with intron positions.
Biopolymers, 27, 23–40.
Simon, I., Glasser, L. & Scheraga, H. A. (1991). Calculation
of protein conformation as an assembly of
stable overlapping segments: application to bovine
pancreatic trypsin inhibitor. Proc. Natl Acad. Sci.
USA, 88, 3661–3665.
Socci, N. D. & Onuchic, J. N. (1994). Folding kinetics of
protein-like hetero-polymers. J. Chem. Phys. 101 (2),
1519–1528.
Sowdhamini, R., Rufino, S. D. & Blundell, T. L. (1996).
A database of globular protein structural domains:
clustering of representative family members into
similar folds. Folding Design, 1, 209–220.
Srinivasan, R. & Rose, G. D. (1995). LINUS: a hierarchic
procedure to predict the fold of a protein. Proteins:
Struct. Funct. Genet. 22, 81–99.
Unger, R. & Sussman, J. L. (1993). The importance of
short structural motifs in protein structure analysis.
J. Comp. Aid. Mol. Des. 7, 457–472.
Waterman, M. S., Smith, T. F. & Beyer, W. A. (1976).
Some biological sequence metrics. Advan. Maths. 20,
367–397.
Wetlaufer, D. B. (1981). Folding of protein fragments.
Advan. Protein Chem. 34, 61–92.
Wintjens, R. T., Rooman, M. J. & Wodak, S. J. (1996).
Automatic classification and analysis of αα-turn
motifs in proteins. J. Mol. Biol. 255, 235–253.
Edited by F. E. Cohen
(Received 18 February 1997; received in revised form 3 June 1997; accepted 3 June 1997)