Protein Docking
Definition:
The process of predicting the structure of complexes formed by protein-protein and protein-ligand
interactions is called docking
Introductory Details
IO
function
favorable way
complex
Importance
DEBAAP
Act as Switches
Protein Engineering
Difficulty
HT
Both molecules are flexible and may alter each other’s structure as they interact
Types of Docking
RF
Rigid Docking
Both molecules are treated as rigid bodies; neither changes conformation during docking
Software: GRAMM
Flexible Docking
Either the protein, the ligand, or both are considered partially or fully flexible
It has been observed that when two molecules dock, the movement of atoms may involve
main chains
side chains
Atom movements are also observed in the exposed non-interface residues, caused by
flexibility and disorder.
Global Optimization of Potential Energy Function
Examples
SD
S:SG
D:A
Statistical / Stochastic
Simulated Annealing
Genetic Algorithms
Deterministic
Docking Methods
Details:
Finds optimal binding site in substrate by exploring various binding sites and
ligand conformations using
Simulated Annealing,
Genetic Algorithms
Software:
AutoDock
Drawbacks:
Only considers ligand flexibility, and hence cannot predict conformational
changes in the receptor protein
The selection of flexible torsional angles in the ligand has to be done by the user
Details:
Uses greedy algorithm to add fragments and complete the ligand structure
Software
FlexX
Drawbacks:
The receptor protein is kept rigid, so conformational changes in the protein upon
binding are not modeled
Details:
Both Protein and Ligand are flexible and undergo a conformational change upon
binding to form a minimum-energy perfect fit
Software:
Drawbacks:
Allows partial protein flexibility in the side chains of the active site only
Definition
• Protein folding is the physical process by which a one-dimensional chain of amino acids folds
into its characteristic and functional three-dimensional structure.
• 1D to 3D
• Main Motivation:
• Medicine:
• Agriculture:
• Industry:
– Synthesis of enzymes.
• Drug Design:
• Yes. HD
Solvable?
• Yes, nature does it all the time (Nature Zindabad, i.e., long live nature)
• Thus
Introduction
• Comparative modeling allows us to build a 3-D model for a protein of known amino-acid (aa)
sequence but unknown structure, using another protein of known sequence and
structure as a template
Steps: IADIBBRE
• Search the target sequence against the sequences in the protein data bank
using
• FASTA
• or BLAST.
• If more than one parent is identified, then it is better to align the
parents structurally first, calculate an MSA from that, and then align the target
sequence to that multiple alignment.
Determine SCRs and SVRs
• Multiple Parents
• If multiple parents are present then those regions that are same in all
parents are termed as SCRs or core regions and other variable regions
are termed as SVRs
• Single Parent
• The SCRs are copied from the parent(s) to use in the model.
• When loop regions vary in length from the loops present in the parent
structure(s), they will be built with lower accuracy.
• Even where loop regions are conserved in length, they can adopt quite
different conformations.
Refine Model
• EM (energy minimization) is only able to move atoms such that a local minimum on the energy
surface is found.
• RMSD (Root Mean Square Deviation) is used to measure how similar one structure is to
another.
• RMSD = √( ∑ (di)² / N ), where di is the distance between the i-th pair of equivalent
atoms and N is the number of atom pairs.
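The RMSD formula can be computed directly; a minimal sketch (the coordinates below are made up for illustration):

```python
import math

def rmsd(coords_a, coords_b):
    """Root Mean Square Deviation between two equal-length lists of (x, y, z)
    points: RMSD = sqrt( sum(d_i^2) / N ), where d_i is the distance between
    the i-th pair of equivalent atoms."""
    assert len(coords_a) == len(coords_b)
    total = 0.0
    for (x1, y1, z1), (x2, y2, z2) in zip(coords_a, coords_b):
        total += (x1 - x2) ** 2 + (y1 - y2) ** 2 + (z1 - z2) ** 2
    return math.sqrt(total / len(coords_a))

# A uniform 1-unit shift along x gives RMSD 1 (toy coordinates).
a = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
b = [(1.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
print(rmsd(a, b))  # 1.0
```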
Applications:
• Text searching
• Molecular biology
• Data compression
• and so on…
Motivation
• Given a string S1( the newly isolated and sequenced string of DNA) and a known string S2 ( the
combined sources of possible contamination), find all substrings of S2 that occur in S1 and are
longer than some given length.
• These substrings are candidates for unwanted pieces of S2 that have contaminated the desired
DNA string.
Z-Algorithm
Example: S = a a b a d a a b c a a b a (positions 1–13); li and ri are the boundaries of the
rightmost z-box found up to position i:

i  :  1  2  3  4  5  6  7  8  9  10 11 12 13
S  :  a  a  b  a  d  a  a  b  c  a  a  b  a
li :  -  2  2  4  4  6  6  6  6  10 10 10 10
ri :  -  2  2  4  4  8  8  8  8  13 13 13 13
How to find a small pattern P in T
• Form S = P$T, where $ is a character that appears in neither P nor T, and compute the Z
values of S.
• The resulting values Zi hold the property that Zi ≤ m (where m = |P|). This is due to the
presence of the character $ in S.
• Every position i with Zi = m corresponds to an occurrence of P in T.
CASES
• Case 1: k > r : (k is outside every z-box)
• The algorithm computes Zk explicitly, comparing the characters starting at position k
against the prefix of S until the first mismatch
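The case analysis above can be sketched as a complete Z-value computation, together with the P$T pattern-matching trick (a minimal 0-based Python sketch of the standard algorithm):

```python
def z_values(s):
    """Z[i] = length of the longest substring of s starting at i that matches
    a prefix of s; [l, r] is the rightmost z-box found so far."""
    n = len(s)
    z = [0] * n
    z[0] = n
    l = r = 0
    for k in range(1, n):
        if k > r:
            # Case 1: k is outside every z-box -> compare explicitly,
            # starting from position k, until the first mismatch.
            m = 0
            while k + m < n and s[m] == s[k + m]:
                m += 1
            z[k] = m
            if m > 0:
                l, r = k, k + m - 1
        else:
            kp = k - l  # mirror position inside the prefix
            if z[kp] < r - k + 1:
                z[k] = z[kp]          # Case 2a: value fully determined
            else:
                # Case 2b: the match may extend past r; compare from r+1 on.
                m = r + 1
                while m < n and s[m] == s[m - k]:
                    m += 1
                z[k] = m - k
                l, r = k, m - 1
    return z

def find_pattern(p, t):
    """Occurrences of p in t via the Z values of S = p + '$' + t."""
    s = p + '$' + t
    m = len(p)
    return [i - m - 1 for i, zi in enumerate(z_values(s)) if i > m and zi == m]

print(find_pattern('aab', 'aabadaabcaaba'))  # [0, 5, 9]
```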
Definitions
• A Markov chain is a random process with the property that the next state depends only on the
current state.
• A hidden Markov model (HMM) is a statistical Markov model in which the system being
modeled is assumed to be a Markov process with unobserved (hidden) states.
• Each state has a probability distribution over the possible output tokens. Therefore the
sequence of tokens generated by an HMM gives some information about the sequence of
states.
• Note that the adjective 'hidden' refers to the state sequence through which the model passes,
not to the parameters of the model; even if the model parameters are known exactly, the model
is still 'hidden'.
Points:
• In a regular Markov model, the state is directly visible to the observer, and therefore the state
transition probabilities are the only parameters.
• In a hidden Markov model, the state is not directly visible, but output, dependent on the state, is
visible.
Note: at each cell, the backtracking direction is taken from wherever the maximum value comes from…
Viterbi Algorithm
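A minimal Viterbi sketch. The two-state fair/loaded-die HMM and all its probabilities below are illustrative assumptions, not from the notes; `back` records, for each cell, the state the maximum value came from, and the best path is recovered by following those directions backwards:

```python
def viterbi(obs, states, start_p, trans_p, emit_p):
    """Most likely state sequence for an observation sequence.
    V[t][s] = best probability of any path ending in state s at time t."""
    V = [{s: start_p[s] * emit_p[s][obs[0]] for s in states}]
    back = [{}]
    for t in range(1, len(obs)):
        V.append({})
        back.append({})
        for s in states:
            # take the maximum over predecessor states; remember which one
            prev, p = max(((q, V[t - 1][q] * trans_p[q][s]) for q in states),
                          key=lambda x: x[1])
            V[t][s] = p * emit_p[s][obs[t]]
            back[t][s] = prev
    # backtrack from the best final state
    last = max(V[-1], key=V[-1].get)
    path = [last]
    for t in range(len(obs) - 1, 0, -1):
        path.append(back[t][path[-1]])
    return list(reversed(path))

# Hypothetical fair/loaded-die HMM (all numbers are illustrative).
states = ('Fair', 'Loaded')
start_p = {'Fair': 0.5, 'Loaded': 0.5}
trans_p = {'Fair': {'Fair': 0.9, 'Loaded': 0.1},
           'Loaded': {'Fair': 0.1, 'Loaded': 0.9}}
emit_p = {'Fair': {i: 1 / 6 for i in range(1, 7)},
          'Loaded': {**{i: 0.02 for i in range(1, 6)}, 6: 0.9}}
print(viterbi([6, 6, 6, 1, 2, 3], states, start_p, trans_p, emit_p))
# ['Loaded', 'Loaded', 'Loaded', 'Fair', 'Fair', 'Fair']
```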
State HMM
Profile HMM
• A profile HMM can be obtained from a multiple alignment and can be used for searching a
database for other members of the family in the alignment very much like standard profiles
• Parts
• The bottom line of states are called the main states, because they model the columns of
the alignment.
• In these states the probability distribution is just the frequency of the amino acids or
nucleotides as in the previous model of the DNA motif.
• The second row of diamond shaped states are called insert states and are used to model
highly variable regions in the alignment.
• They function exactly like the top state in the previous example.
• The top row of circular states are the delete states: they make it possible to jump over
one or more columns in the alignment, to model the situation when just a few of the
sequences have a ‘–’ in the multiple alignment at a position.
Genetic Algorithm
Gene: A specific sequence of nucleotide bases that carries the information required for constructing proteins.
Proteins: Provide the structural components of cells and tissues, and enzymes for essential biochemical
reactions.
Overview
A class of probabilistic, optimized search algorithms based on the mechanics of natural selection
and natural genetics.
Inspired by the biological evolution process: dinosaurs are dead, cockroaches are still surviving.
Based on survival of the fittest among string structures, with a structured yet randomized
information exchange, to form a search algorithm.
The idea behind GA is extremely appealing, but has the following limitations:
It is quite unnatural to model most applications in terms of genetic operators like mutation and
crossover on bit strings.
The pseudo biology adds another level of complexity between you and your problem.
Their weakness is that the process of selection alone is too systematic and predictable, not like
creativity as we know it.
Binary representations are limited in their operations and for certain problems alternative
operators and representations must be used.
Crossover and mutation make no use of the real problem structure, so large fractions of
transitions lead to inferior solutions, and convergence is slow.
GAs take a very long time on non-trivial problems, as they generally require more
objective-function evaluations than classical optimization techniques. This is a major
practical limitation.
The analogy with evolution is appropriate, but evolution took millions of years to achieve
significant improvement. Can we afford to wait that long?
Iterative search
1. Select Current solution
2. Create new solution
3. Check whether solution met criterion desired
4. If not, repeat (1) and (2) again
5. Else stop
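The iterative-search loop above, instantiated as a minimal genetic algorithm on bit strings. The OneMax fitness function (count the 1-bits) and all parameter values are illustrative assumptions:

```python
import random

def one_max(bits):                      # toy fitness: number of 1-bits
    return sum(bits)

def genetic_algorithm(n_bits=20, pop_size=30, generations=100,
                      p_mut=0.05, seed=0):
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(n_bits)] for _ in range(pop_size)]
    for _ in range(generations):
        # 1. selection: tournament of size 2 (survival of the fitter)
        def select():
            a, b = rng.sample(pop, 2)
            return a if one_max(a) >= one_max(b) else b
        nxt = []
        while len(nxt) < pop_size:
            p1, p2 = select(), select()
            # 2. single-point crossover
            cut = rng.randrange(1, n_bits)
            child = p1[:cut] + p2[cut:]
            # 3. bitwise mutation
            child = [b ^ 1 if rng.random() < p_mut else b for b in child]
            nxt.append(child)
        pop = nxt
    return max(pop, key=one_max)

best = genetic_algorithm()
print(one_max(best))  # typically close to the optimum of 20 on this toy problem
```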
Neighborhood Methods
Simulated Annealing
Intro to Annealing
Steps:
o Heat to desired temperature
o Holding at that temperature
o Cooling to room temperature
Generating Distribution
o Generates possible valleys or states to be explored
Accepting distribution
o Decides acceptance based on the difference between the function value of the newly
generated valley to be explored and that of the last saved lowest valley
For certain NP-hard problems, where a local extremum is mostly reached and a better global
extremum is left unexplored:
o Simulated annealing outperforms such methods
o It does so by straightforward iterative improvement
o Tradeoff: longer running times
Simulated Annealing
Requirements for SA :
o A representation of possible solutions
o A generator of random changes in solutions
o A means of evaluating the problem functions
o Annealing schedule
Method :
1. Input and assess Initial solution
2. Estimate initial temperature
A suitable T0 is one that results in an average acceptance probability
p0 of about 0.8.
The value of T0 will clearly depend on f and is, hence, problem-specific.
T0 is estimated by conducting an initial search in which all increases are accepted,
and calculating the average observed increase in f, df+.
T0 = - df+ / ln(p0)
3. Generate new solution
4. Assess new solution
5. Check whether to Accept new solution
6. If yes, then update stores
7. If no, skip (6)
8. Adjust temperature
9. Check whether to Terminate search
10. If (9) yes STOP
11. Else again start from step (3)
Cooling Schedule
o T, the annealing temperature, is the parameter that controls the frequency of acceptance of
ascending steps
o We gradually reduce temperature T(k)
o At each temperature search is allowed for a certain number of steps
o The choice of parameters {T(k), L(k)} is called the cooling schedule
o EX:
Set L = n, the number of variables in the problem.
Set T(0) such that exp(-D/T(0)) ≈ 1, where D is a typical increase in f (i.e., start hot enough that almost all moves are accepted).
Set T(k+1) = a·T(k), where a is a constant smaller than but close to 1 (e.g., 0.8–0.99).
Algorithm
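The method and cooling schedule above, as a minimal Python sketch; the toy objective function and every parameter value are illustrative assumptions:

```python
import math
import random

def simulated_annealing(f, x0, t0, alpha=0.95, steps_per_temp=50,
                        t_min=1e-3, seed=0):
    """Minimize f by SA: always accept downhill moves, accept uphill moves
    with probability exp(-df/T), and cool geometrically T(k+1) = alpha*T(k)."""
    rng = random.Random(seed)
    x, fx = x0, f(x0)
    best, fbest = x, fx
    t = t0
    while t > t_min:
        for _ in range(steps_per_temp):
            # generator of random changes: small perturbation of x
            y = x + rng.uniform(-1.0, 1.0)
            fy = f(y)
            df = fy - fx
            # accepting distribution: Metropolis criterion
            if df <= 0 or rng.random() < math.exp(-df / t):
                x, fx = y, fy
                if fx < fbest:
                    best, fbest = x, fx
        t *= alpha  # cooling schedule
    return best, fbest

# Toy multimodal objective (many local minima; global minimum value near 0).
f = lambda x: x * x + 10 * math.sin(3 * x) + 10
best, fbest = simulated_annealing(f, x0=8.0, t0=10.0)
print(best, fbest)
```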
Strengths of SA
o Simulated annealing can deal with highly nonlinear models and noisy data and many
constraints.
o It is a robust and general technique.
o Its main advantages over other local search methods are its flexibility and its ability to
approach global optimality.
o The algorithm is quite versatile since it does not rely on any restrictive properties of the
model.
Heuristic Algorithm
Intro
Dynamic Programming
o Loses attraction when
Database size ~ 10^9
o It takes too much time
Alternatives:
o Hardware
Very fast
Cost expensive
o Distributed computing
Slower than the hardware version but still fast
Cost expensive
o Heuristics
Much faster than dynamic Programming
Heuristic Method
Definition:
o A heuristic method is an algorithm that gives only an approximate solution to a given
problem.
Important Points:
o Sometimes we are not able to formally prove that this solution actually solves the
problem,
o but heuristic methods are commonly used because they are much faster than exact
algorithms.
o In addition, this is a software based strategy, which is therefore relatively cheap and
available to any researcher.
Heuristic methods:
o FASTA
o BLAST
o BLAST2
FASTA
Intro
• When searching the whole database for matches to a given query, we compare the query using
the FASTA algorithm to every string in the database.
• The algorithm focuses on segments in which there is absolute identity between the two
compared strings.
• We can use the alignment Dot-Plot matrix for finding these identical regions.
Method (Steps):
1. Find Hot Spots (exact word matches of length ktup)
2. Find the Best Diagonal Runs (Init1)
3. Rescore the best runs using a substitution matrix
4. Combine Good Diagonal Runs (Init N (Max. Weight Path in the Graph))
5. Compute a banded local alignment around the best combined runs (Opt)
6. Ranking
Small Definitions:
Hot Spots:
o The first step of the algorithm is to determine all exact matches of length k (wordsize)
between the two sequences, called hot spots
Ktup:
o (short for k-tuple): an integer parameter that specifies the length of the
matching substrings
Diagonal Runs:
o A diagonal run is a set of hot-spots that lie in a consecutive sequence on the same
diagonal (not necessarily adjacent along the diagonal, i.e., spaces between these hot
spots are allowed).
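Hot spots and their grouping by diagonal can be sketched with a simple lookup table (ktup and the toy sequences are illustrative):

```python
from collections import defaultdict

def hot_spots(s1, s2, ktup=2):
    """All exact matches of length ktup between s1 and s2, returned as
    (i, j) start positions; i - j identifies the diagonal of a hot spot."""
    index = defaultdict(list)            # k-tuple -> start positions in s1
    for i in range(len(s1) - ktup + 1):
        index[s1[i:i + ktup]].append(i)
    hits = []
    for j in range(len(s2) - ktup + 1):
        for i in index.get(s2[j:j + ktup], []):
            hits.append((i, j))
    return hits

def group_by_diagonal(hits):
    """Group hot spots lying on the same diagonal (same i - j);
    each group is a candidate diagonal run."""
    diags = defaultdict(list)
    for i, j in hits:
        diags[i - j].append((i, j))
    return dict(diags)

hs = hot_spots("GATTACA", "ATTAC", ktup=3)
print(hs)                     # [(1, 0), (2, 1), (3, 2)]
print(group_by_diagonal(hs))  # all three hot spots lie on diagonal 1
```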
Comments
• A larger ktup increases speed, since fewer “hits” are found, but it also decreases sensitivity for
finding similar but not identical sequences, since exact matches of this length are required
BLAST
intro
• The motivation for the development of BLAST was the need to increase the speed of FASTA by
finding fewer and better hot spots.
• The idea was to integrate the substitution matrix in the first stage of finding the hot spots.
Method (Steps):
7. Combine HSP
Small Definitions:
Segment pair:
o Given two strings S1 and S2, a segment pair is a pair of equal length substrings of S1 and
S2, aligned without gaps.
A locally maximal segment pair is a segment pair whose alignment score (without gaps) cannot
be improved by extending or shortening it.
A maximum segment pair (MSP) in S1 and S2 is a segment pair with the maximum score over all
segment pairs in S1, S2.
When comparing all the sequences in the database against the query, BLAST attempts to find all
the database sequences that, when paired with the query, contain an MSP above some cutoff
score S. We call such pairs HSPs (high-scoring segment pairs).
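Segment-pair scoring and the idea of extending a seed without gaps can be sketched as follows; the match/mismatch scores and the drop-off rule are simplified illustrations of the idea, not BLAST's actual parameters:

```python
def segment_score(s1, s2, i, j, length, match=1, mismatch=-1):
    """Score of the gap-free segment pair s1[i:i+length] vs s2[j:j+length]."""
    return sum(match if a == b else mismatch
               for a, b in zip(s1[i:i + length], s2[j:j + length]))

def extend_right(s1, s2, i, j, w, match=1, mismatch=-1, drop=3):
    """Ungapped extension of a length-w seed at (i, j) to the right.
    Stops when the running score falls `drop` below the best score seen;
    returns the best (score, segment_length). Left extension is symmetric."""
    cur = best = segment_score(s1, s2, i, j, w)
    best_len = w
    k = w
    while i + k < len(s1) and j + k < len(s2):
        cur += match if s1[i + k] == s2[j + k] else mismatch
        k += 1
        if cur > best:
            best, best_len = cur, k
        if best - cur >= drop:
            break
    return best, best_len

# Seed "ATT" at (1, 0) extends through the shared "ATTAC" region
# before accumulating mismatches stop it.
print(extend_right("GATTACAXX", "ATTACGYY", 1, 0, 3))  # (5, 5)
```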
Types of BLAST
Two hit
o extension step typically accounts for 90% of BLAST’s execution time
o key idea: do extension only when there are two hits on the same diagonal within
distance A of each other
o to maintain sensitivity, lower T parameter
more single hits found
o but only small fraction have associated 2nd hit
Gapped
o trigger gapped alignment if two-hit extension has a sufficiently high score
o run DP process both forward & backward from seed
o prune cells when local alignment score falls a certain distance below best score yet
PSI –BLAST
o use results from BLAST query to construct a profile matrix
o search database with profile instead of query sequence
o iterate
Profile creation
o The program initially operates on a single query sequence by performing a gapped
BLAST search
o Then, the program takes significant local alignments (hits) found, constructs a multiple
alignment and abstracts a position-specific scoring matrix (PSSM) from this alignment.
o Steps:
Take significant BLAST hits
Make an alignment
Construct profile
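The “construct profile” step can be sketched as per-column log-odds scores over a gapless toy alignment; the pseudocount scheme and uniform background here are simplifying assumptions (PSI-BLAST itself uses sequence weighting and substitution-matrix priors):

```python
from collections import Counter
import math

def make_pssm(alignment, alphabet="ACGT", pseudo=1.0, background=None):
    """Position-specific scoring matrix from a gapless multiple alignment.
    Entry [pos][letter] = log2( observed_freq / background_freq )."""
    if background is None:
        background = {a: 1.0 / len(alphabet) for a in alphabet}
    n = len(alignment)
    pssm = []
    for col in zip(*alignment):           # iterate over alignment columns
        counts = Counter(col)
        row = {}
        for a in alphabet:
            # add-one style pseudocounts avoid log(0) for unseen letters
            freq = (counts[a] + pseudo) / (n + pseudo * len(alphabet))
            row[a] = math.log2(freq / background[a])
        pssm.append(row)
    return pssm

hits = ["ACGT", "ACGA", "ACCT"]           # toy significant BLAST hits
pssm = make_pssm(hits)
# Column 0 is all 'A', so 'A' gets the highest score at that position.
print(max(pssm[0], key=pssm[0].get))      # A
```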