Protein Docking
Definition:
The process of predicting the structure of complexes formed by protein-protein and protein-ligand
interactions is called docking
Introductory Details
IO
function
favorable way
complex
Importance
DEBAAP
Act as Switches
Protein Engineering
Difficulty
HT
Both molecules are flexible and may alter each other’s structure as they interact
Types of Docking
RF
Rigid Docking
Both molecules are treated as rigid bodies; neither changes conformation during docking
Software: GRAMM
Flexible Docking
Either the protein, the ligand, or both are considered partially or fully flexible
It has been observed that when two molecules dock, the movement of atoms may involve
main chains
side chains
Atom movements are also observed in the exposed non-interface residues, caused by
flexibility and disorder.
Global Optimization of Potential Energy Function
Examples
SD
S:SG
D:A
Statistical / Stochastic
Simulated Annealing
Genetic Algorithms
Deterministic
Docking Methods
Details:
Finds optimal binding site in substrate by exploring various binding sites and
ligand conformations using
Simulated Annealing,
Genetic Algorithms
Software:
AutoDock
Drawbacks:
Only considers ligand flexibility, and hence cannot predict conformational
changes in the receptor protein
The selection of flexible torsional angles in the ligand has to be done by the user
Details:
Uses greedy algorithm to add fragments and complete the ligand structure
Software
FlexX
Drawbacks:
The receptor protein is kept rigid, so conformational changes in the protein upon
binding are not modeled
Details:
Both Protein and Ligand are flexible and undergo a conformational change upon
binding to form a minimum-energy perfect fit
Software:
Drawbacks:
Allows partial protein flexibility in the side chains of the active site only
Definition
• Protein folding is the physical process by which a one-dimensional chain of amino acids folds
into its characteristic and functional three-dimensional structure.
• 1D to 3D
• Main Motivation:
• Medicine:
• Agriculture:
• Industry:
– Synthesis of enzymes.
• Drug Design:
• Yes. HD
Solvable?
• Yes, nature does it all the time (Nature Zindabad, i.e., long live nature)
• Thus
Introduction
• Comparative modeling allows us to build a 3-D model for a protein of known amino-acid (aa)
sequence but unknown structure, using another protein of known sequence and
structure as a template
Steps: IADIBBRE
• Search the target sequence against the sequences in the protein data bank
using
• FASTA
• or BLAST.
• If more than one parent is identified, then it is better to align the
parents structurally first, calculate an MSA from that, and then align the target
sequence to that multiple alignment.
Determine SCRs and SVRs
• Multiple Parents
• If multiple parents are present then those regions that are same in all
parents are termed as SCRs or core regions and other variable regions
are termed as SVRs
• Single Parent
• The SCRs are copied from the parent(s) to use in the model.
• When loop regions vary in length from the loops present in the parent
structure(s), they will be built with lower accuracy.
• Even where loop regions are conserved in length, they can adopt quite
different conformations.
Refine Model
• EM (energy minimization) is only able to move atoms such that a local minimum on the energy
surface is found.
• RMSD (Root Mean Square Deviation) is used to measure how similar one structure is to
another.
• RMSD = √( ∑ (di)² / N ), where di is the distance between the i-th pair of equivalent
atoms and N is the number of atom pairs.
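The RMSD formula can be computed directly; a minimal sketch (the coordinates below are made up for illustration):

```python
import math

def rmsd(coords_a, coords_b):
    """Root Mean Square Deviation between two equal-length lists of (x, y, z)
    points: RMSD = sqrt( sum(d_i^2) / N ), where d_i is the distance between
    the i-th pair of equivalent atoms."""
    assert len(coords_a) == len(coords_b)
    total = 0.0
    for (x1, y1, z1), (x2, y2, z2) in zip(coords_a, coords_b):
        total += (x1 - x2) ** 2 + (y1 - y2) ** 2 + (z1 - z2) ** 2
    return math.sqrt(total / len(coords_a))

# A uniform 1-unit shift along x gives RMSD 1 (toy coordinates).
a = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
b = [(1.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
print(rmsd(a, b))  # 1.0
```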
Applications:
• Text searching
• Molecular biology
• Data compression
• and so on…
Motivation
• Given a string S1( the newly isolated and sequenced string of DNA) and a known string S2 ( the
combined sources of possible contamination), find all substrings of S2 that occur in S1 and are
longer than some given length.
• These substrings are candidates for unwanted pieces of S2 that have contaminated the desired
DNA string.
Z-Algorithm
Example: S = a a b a d a a b c a a b a (positions 1–13); li and ri are the boundaries of the
rightmost z-box found up to position i:

i  :  1  2  3  4  5  6  7  8  9  10 11 12 13
S  :  a  a  b  a  d  a  a  b  c  a  a  b  a
li :  -  2  2  4  4  6  6  6  6  10 10 10 10
ri :  -  2  2  4  4  8  8  8  8  13 13 13 13
How to find a small pattern P in T
• Form S = P$T, where $ is a character that appears in neither P nor T, and compute the Z
values of S.
• The resulting values Zi hold the property that Zi ≤ m (where m = |P|). This is due to the
presence of the character $ in S.
• Every position i with Zi = m corresponds to an occurrence of P in T.
CASES
• Case 1: k > r : (k is outside every z-box)
• The algorithm computes Zk explicitly, comparing the characters starting at position k
against the prefix of S until the first mismatch
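The case analysis above can be sketched as a complete Z-value computation, together with the P$T pattern-matching trick (a minimal 0-based Python sketch of the standard algorithm):

```python
def z_values(s):
    """Z[i] = length of the longest substring of s starting at i that matches
    a prefix of s; [l, r] is the rightmost z-box found so far."""
    n = len(s)
    z = [0] * n
    z[0] = n
    l = r = 0
    for k in range(1, n):
        if k > r:
            # Case 1: k is outside every z-box -> compare explicitly,
            # starting from position k, until the first mismatch.
            m = 0
            while k + m < n and s[m] == s[k + m]:
                m += 1
            z[k] = m
            if m > 0:
                l, r = k, k + m - 1
        else:
            kp = k - l  # mirror position inside the prefix
            if z[kp] < r - k + 1:
                z[k] = z[kp]          # Case 2a: value fully determined
            else:
                # Case 2b: the match may extend past r; compare from r+1 on.
                m = r + 1
                while m < n and s[m] == s[m - k]:
                    m += 1
                z[k] = m - k
                l, r = k, m - 1
    return z

def find_pattern(p, t):
    """Occurrences of p in t via the Z values of S = p + '$' + t."""
    s = p + '$' + t
    m = len(p)
    return [i - m - 1 for i, zi in enumerate(z_values(s)) if i > m and zi == m]

print(find_pattern('aab', 'aabadaabcaaba'))  # [0, 5, 9]
```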
Definitions
• A Markov chain is a random process with the property that the next state depends only on the
current state.
• A hidden Markov model (HMM) is a statistical Markov model in which the system being
modeled is assumed to be a Markov process with unobserved (hidden) states.
• Each state has a probability distribution over the possible output tokens. Therefore the
sequence of tokens generated by an HMM gives some information about the sequence of
states.
• Note that the adjective 'hidden' refers to the state sequence through which the model passes,
not to the parameters of the model; even if the model parameters are known exactly, the model
is still 'hidden'.
Points:
• In a regular Markov model, the state is directly visible to the observer, and therefore the state
transition probabilities are the only parameters.
• In a hidden Markov model, the state is not directly visible, but output, dependent on the state, is
visible.
Note: at each cell, the backtracking direction is taken from wherever the maximum value comes from…
Viterbi Algorithm
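A minimal Viterbi sketch. The two-state fair/loaded-die HMM and all its probabilities below are illustrative assumptions, not from the notes; `back` records, for each cell, the state the maximum value came from, and the best path is recovered by following those directions backwards:

```python
def viterbi(obs, states, start_p, trans_p, emit_p):
    """Most likely state sequence for an observation sequence.
    V[t][s] = best probability of any path ending in state s at time t."""
    V = [{s: start_p[s] * emit_p[s][obs[0]] for s in states}]
    back = [{}]
    for t in range(1, len(obs)):
        V.append({})
        back.append({})
        for s in states:
            # take the maximum over predecessor states; remember which one
            prev, p = max(((q, V[t - 1][q] * trans_p[q][s]) for q in states),
                          key=lambda x: x[1])
            V[t][s] = p * emit_p[s][obs[t]]
            back[t][s] = prev
    # backtrack from the best final state
    last = max(V[-1], key=V[-1].get)
    path = [last]
    for t in range(len(obs) - 1, 0, -1):
        path.append(back[t][path[-1]])
    return list(reversed(path))

# Hypothetical fair/loaded-die HMM (all numbers are illustrative).
states = ('Fair', 'Loaded')
start_p = {'Fair': 0.5, 'Loaded': 0.5}
trans_p = {'Fair': {'Fair': 0.9, 'Loaded': 0.1},
           'Loaded': {'Fair': 0.1, 'Loaded': 0.9}}
emit_p = {'Fair': {i: 1 / 6 for i in range(1, 7)},
          'Loaded': {**{i: 0.02 for i in range(1, 6)}, 6: 0.9}}
print(viterbi([6, 6, 6, 1, 2, 3], states, start_p, trans_p, emit_p))
# ['Loaded', 'Loaded', 'Loaded', 'Fair', 'Fair', 'Fair']
```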
State HMM
Profile HMM
• A profile HMM can be obtained from a multiple alignment and can be used for searching a
database for other members of the family in the alignment very much like standard profiles
• Parts
• The bottom line of states are called the main states, because they model the columns of
the alignment.
• In these states the probability distribution is just the frequency of the amino acids or
nucleotides as in the previous model of the DNA motif.
• The second row of diamond shaped states are called insert states and are used to model
highly variable regions in the alignment.
• They function exactly like the top state in the previous example.
• The top row of circular states are the delete states: they make it possible to jump over
one or more columns in the alignment, to model the situation when just a few of the
sequences have a ‘–’ in the multiple alignment at a position.
Genetic Algorithm
Gene: A specific sequence of nucleotide bases that carries the information required for constructing proteins.
Proteins: Provide the structural components of cells and tissues, and enzymes for essential biochemical
reactions.
Overview
A class of probabilistic, optimized search algorithms based on the mechanics of natural selection
and natural genetics.
Inspired by the biological evolution process: dinosaurs are dead, cockroaches are still surviving.
Based on survival of the fittest among string structures, with a structured yet randomized
information exchange, to form a search algorithm.
The idea behind GA is extremely appealing, but has the following limitations:
It is quite unnatural to model most applications in terms of genetic operators like mutation and
crossover on bit strings.
The pseudo biology adds another level of complexity between you and your problem.
Their weakness is that the process of selection alone is too systematic and predictable, not like
creativity as we know it.
Binary representations are limited in their operations and for certain problems alternative
operators and representations must be used.
Crossover and mutation make no use of the real problem structure, so large fractions of
transitions lead to inferior solutions, and convergence is slow.
GAs take a very long time on non-trivial problems, as they generally require more
objective-function evaluations than classical optimization techniques. This is a major
practical limitation.
The analogy with evolution is appropriate, but evolution took millions of years to achieve
significant improvement. Can we afford to wait that long?
Iterative search
1. Select Current solution
2. Create new solution
3. Check whether solution met criterion desired
4. If not, repeat (1) and (2) again
5. Else stop
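The iterative-search loop above, instantiated as a minimal genetic algorithm on bit strings. The OneMax fitness function (count the 1-bits) and all parameter values are illustrative assumptions:

```python
import random

def one_max(bits):                      # toy fitness: number of 1-bits
    return sum(bits)

def genetic_algorithm(n_bits=20, pop_size=30, generations=100,
                      p_mut=0.05, seed=0):
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(n_bits)] for _ in range(pop_size)]
    for _ in range(generations):
        # 1. selection: tournament of size 2 (survival of the fitter)
        def select():
            a, b = rng.sample(pop, 2)
            return a if one_max(a) >= one_max(b) else b
        nxt = []
        while len(nxt) < pop_size:
            p1, p2 = select(), select()
            # 2. single-point crossover
            cut = rng.randrange(1, n_bits)
            child = p1[:cut] + p2[cut:]
            # 3. bitwise mutation
            child = [b ^ 1 if rng.random() < p_mut else b for b in child]
            nxt.append(child)
        pop = nxt
    return max(pop, key=one_max)

best = genetic_algorithm()
print(one_max(best))  # typically close to the optimum of 20 on this toy problem
```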
Neighborhood Methods
Simulated Annealing
Intro to Annealing
Steps:
o Heat to desired temperature
o Holding at that temperature
o Cooling to room temperature
Generating Distribution
o Generates possible valleys or states to be explored
Accepting distribution
o Decides acceptance based on the difference between the function value of the newly
generated valley to be explored and that of the last saved lowest valley
For certain NP-hard problems, where a local extremum is mostly reached and a better global
extremum is left unexplored:
o Simulated annealing outperforms such methods
o It does so by straightforward iterative improvement
o Tradeoff: longer running times
Simulated Annealing
Requirements for SA :
o A representation of possible solutions
o A generator of random changes in solutions
o A means of evaluating the problem functions
o Annealing schedule
Method :
1. Input and assess Initial solution
2. Estimate initial temperature
A suitable T0 is one that results in an average acceptance probability
p0 of about 0.8.
The value of T0 will clearly depend on f and is, hence, problem-specific.
T0 is estimated by conducting an initial search in which all increases are accepted,
and calculating the average observed increase in f, df+.
T0 = - df+ / ln(p0)
3. Generate new solution
4. Assess new solution
5. Check whether to Accept new solution
6. If yes, then update stores
7. If no, skip (6)
8. Adjust temperature
9. Check whether to Terminate search
10. If (9) yes STOP
11. Else again start from step (3)
Cooling Schedule
o T, the annealing temperature, is the parameter that controls the frequency of acceptance of
ascending steps
o We gradually reduce temperature T(k)
o At each temperature search is allowed for a certain number of steps
o The choice of parameters {T(k), L(k)} is called the cooling schedule
o EX:
Set L = n, the number of variables in the problem.
Set T(0) such that exp(-D/T(0)) ≈ 1, where D is a typical increase in f (i.e., start hot enough that almost all moves are accepted).
Set T(k+1) = a·T(k), where a is a constant smaller than but close to 1 (e.g., 0.8–0.99).
Algorithm
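The method and cooling schedule above, as a minimal Python sketch; the toy objective function and every parameter value are illustrative assumptions:

```python
import math
import random

def simulated_annealing(f, x0, t0, alpha=0.95, steps_per_temp=50,
                        t_min=1e-3, seed=0):
    """Minimize f by SA: always accept downhill moves, accept uphill moves
    with probability exp(-df/T), and cool geometrically T(k+1) = alpha*T(k)."""
    rng = random.Random(seed)
    x, fx = x0, f(x0)
    best, fbest = x, fx
    t = t0
    while t > t_min:
        for _ in range(steps_per_temp):
            # generator of random changes: small perturbation of x
            y = x + rng.uniform(-1.0, 1.0)
            fy = f(y)
            df = fy - fx
            # accepting distribution: Metropolis criterion
            if df <= 0 or rng.random() < math.exp(-df / t):
                x, fx = y, fy
                if fx < fbest:
                    best, fbest = x, fx
        t *= alpha  # cooling schedule
    return best, fbest

# Toy multimodal objective (many local minima; global minimum value near 0).
f = lambda x: x * x + 10 * math.sin(3 * x) + 10
best, fbest = simulated_annealing(f, x0=8.0, t0=10.0)
print(best, fbest)
```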
Strengths of SA
o Simulated annealing can deal with highly nonlinear models and noisy data and many
constraints.
o It is a robust and general technique.
o Its main advantages over other local search methods are its flexibility and its ability to
approach global optimality.
o The algorithm is quite versatile since it does not rely on any restrictive properties of the
model.
Heuristic Algorithm
Intro
Dynamic Programming
o Loses attraction when
Database size ~ 10^9
o It takes too much time
Alternatives:
o Hardware
Very fast
Cost expensive
o Distributed computing
Slower than the hardware version but still fast
Cost expensive
o Heuristics
Much faster than dynamic Programming
Heuristic Method
Definition:
o A heuristic method is an algorithm that gives only an approximate solution to a given
problem.
Important Points:
o Sometimes we are not able to formally prove that this solution actually solves the
problem,
o but heuristic methods are commonly used because they are much faster than exact
algorithms.
o In addition, this is a software based strategy, which is therefore relatively cheap and
available to any researcher.
Heuristic methods:
o FASTA
o BLAST
o BLAST2
FASTA
Intro
• When searching the whole database for matches to a given query, we compare the query using
the FASTA algorithm to every string in the database.
• The algorithm focuses on segments in which there is absolute identity between the two
compared strings.
• We can use the alignment Dot-Plot matrix for finding these identical regions.
Method (Steps):
1. Find Hot Spots (exact word matches of length ktup)
2. Find the Best Diagonal Runs (Init1)
3. Rescore the best runs using a substitution matrix
4. Combine Good Diagonal Runs (Init N (Max. Weight Path in the Graph))
5. Compute a banded local alignment around the best combined runs (Opt)
6. Ranking
Small Definitions:
Hot Spots:
o The first step of the algorithm is to determine all exact matches of length k (wordsize)
between the two sequences, called hot spots
Ktup:
o (short for k-tuple): an integer parameter that specifies the length of the
matching substrings
Diagonal Runs:
o A diagonal run is a set of hot-spots that lie in a consecutive sequence on the same
diagonal (not necessarily adjacent along the diagonal, i.e., spaces between these hot
spots are allowed).
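Hot spots and their grouping by diagonal can be sketched with a simple lookup table (ktup and the toy sequences are illustrative):

```python
from collections import defaultdict

def hot_spots(s1, s2, ktup=2):
    """All exact matches of length ktup between s1 and s2, returned as
    (i, j) start positions; i - j identifies the diagonal of a hot spot."""
    index = defaultdict(list)            # k-tuple -> start positions in s1
    for i in range(len(s1) - ktup + 1):
        index[s1[i:i + ktup]].append(i)
    hits = []
    for j in range(len(s2) - ktup + 1):
        for i in index.get(s2[j:j + ktup], []):
            hits.append((i, j))
    return hits

def group_by_diagonal(hits):
    """Group hot spots lying on the same diagonal (same i - j);
    each group is a candidate diagonal run."""
    diags = defaultdict(list)
    for i, j in hits:
        diags[i - j].append((i, j))
    return dict(diags)

hs = hot_spots("GATTACA", "ATTAC", ktup=3)
print(hs)                     # [(1, 0), (2, 1), (3, 2)]
print(group_by_diagonal(hs))  # all three hot spots lie on diagonal 1
```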
Comments
• A larger ktup increases speed, since fewer “hits” are found, but it also decreases sensitivity for
finding similar but not identical sequences, since exact matches of this length are required
BLAST
intro
• The motivation for the development of BLAST was the need to increase the speed of FASTA by
finding fewer and better hot spots.
• The idea was to integrate the substitution matrix in the first stage of finding the hot spots.
Method (Steps):
7. Combine HSP
Small Definitions:
Segment pair:
o Given two strings S1 and S2, a segment pair is a pair of equal length substrings of S1 and
S2, aligned without gaps.
A locally maximal segment pair is a segment pair whose alignment score (without gaps) cannot
be improved by extending or shortening it.
A maximum segment pair (MSP) in S1 and S2 is a segment pair with the maximum score over all
segment pairs in S1, S2.
When comparing all the sequences in the database against the query, BLAST attempts to find all
the database sequences that, when paired with the query, contain an MSP above some cutoff
score S. We call such pairs HSPs (high-scoring segment pairs).
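Segment-pair scoring and the idea of extending a seed without gaps can be sketched as follows; the match/mismatch scores and the drop-off rule are simplified illustrations of the idea, not BLAST's actual parameters:

```python
def segment_score(s1, s2, i, j, length, match=1, mismatch=-1):
    """Score of the gap-free segment pair s1[i:i+length] vs s2[j:j+length]."""
    return sum(match if a == b else mismatch
               for a, b in zip(s1[i:i + length], s2[j:j + length]))

def extend_right(s1, s2, i, j, w, match=1, mismatch=-1, drop=3):
    """Ungapped extension of a length-w seed at (i, j) to the right.
    Stops when the running score falls `drop` below the best score seen;
    returns the best (score, segment_length). Left extension is symmetric."""
    cur = best = segment_score(s1, s2, i, j, w)
    best_len = w
    k = w
    while i + k < len(s1) and j + k < len(s2):
        cur += match if s1[i + k] == s2[j + k] else mismatch
        k += 1
        if cur > best:
            best, best_len = cur, k
        if best - cur >= drop:
            break
    return best, best_len

# Seed "ATT" at (1, 0) extends through the shared "ATTAC" region
# before accumulating mismatches stop it.
print(extend_right("GATTACAXX", "ATTACGYY", 1, 0, 3))  # (5, 5)
```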
Types of BLAST
Two hit
o extension step typically accounts for 90% of BLAST’s execution time
o key idea: do extension only when there are two hits on the same diagonal within
distance A of each other
o to maintain sensitivity, lower T parameter
more single hits found
o but only small fraction have associated 2nd hit
Gapped
o trigger gapped alignment if two-hit extension has a sufficiently high score
o run DP process both forward & backward from seed
o prune cells when local alignment score falls a certain distance below best score yet
PSI –BLAST
o use results from BLAST query to construct a profile matrix
o search database with profile instead of query sequence
o iterate
Profile creation
o The program initially operates on a single query sequence by performing a gapped
BLAST search
o Then, the program takes significant local alignments (hits) found, constructs a multiple
alignment and abstracts a position-specific scoring matrix (PSSM) from this alignment.
o Steps:
Take significant BLAST hits
Make an alignment
Construct profile
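The “construct profile” step can be sketched as per-column log-odds scores over a gapless toy alignment; the pseudocount scheme and uniform background here are simplifying assumptions (PSI-BLAST itself uses sequence weighting and substitution-matrix priors):

```python
from collections import Counter
import math

def make_pssm(alignment, alphabet="ACGT", pseudo=1.0, background=None):
    """Position-specific scoring matrix from a gapless multiple alignment.
    Entry [pos][letter] = log2( observed_freq / background_freq )."""
    if background is None:
        background = {a: 1.0 / len(alphabet) for a in alphabet}
    n = len(alignment)
    pssm = []
    for col in zip(*alignment):           # iterate over alignment columns
        counts = Counter(col)
        row = {}
        for a in alphabet:
            # add-one style pseudocounts avoid log(0) for unseen letters
            freq = (counts[a] + pseudo) / (n + pseudo * len(alphabet))
            row[a] = math.log2(freq / background[a])
        pssm.append(row)
    return pssm

hits = ["ACGT", "ACGA", "ACCT"]           # toy significant BLAST hits
pssm = make_pssm(hits)
# Column 0 is all 'A', so 'A' gets the highest score at that position.
print(max(pssm[0], key=pssm[0].get))      # A
```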