Exploring Database and Analyzing Protein Sequence

COURSE: GEB-207
DEPARTMENT OF GENETIC ENGINEERING AND

BIOTECHNOLOGY, UNIVERSITY OF DHAKA.
AUBHISHEK ZAMAN
ROLL: 08
4/26/2009
Contents
Chapter 1: About Bioinformatics
1.1 General
1.2 Resources
1.2.1 Gateways
1.2.2 Database
1.2.3 Software or Tools
1.3 Application and Importance
1.4 Project Aim
Chapter 2: Working with Protein Sequences
2.1 General
2.2 Fetching protein sequence from Database
2.2.1 Database
2.2.2 Method
2.2.3 Result
2.3 Analyzing Protein Sequences
2.3.1 Understanding the general, physical and chemical properties of a protein sequence
2.3.1.1 Software/Tools
2.3.1.2 Method
2.3.1.3 Result
2.3.2 Searching Database for similar sequences
2.3.2.1 Software/Tools
2.3.2.2 Methods
2.3.2.3 Result
2.3.3 Sequence Alignment Study
2.3.3.1 Pair wise alignment
2.3.3.1.1 Software/Programs
2.3.3.1.2 Methods
2.3.3.1.3 Results
2.3.3.2 Multiple Sequence Alignment
2.3.3.2.1 Software/Programs
2.3.3.2.2 Methods
2.3.3.2.3 Results
2.3.4 Phylogenetic tree construction
2.3.4.2 Methods
2.3.4.3 Result
2.3.5 Secondary Structure Prediction
2.3.5.2 Methods
2.3.5.3 Result
Chapter 3: Discussion
3.1 General
3.2 Exploring Database
3.3 Analyzing Protein Sequences
3.4 Conclusion
List of abbreviations
ABBREVIATION    ELABORATION
BLAST           Basic Local Alignment Search Tool
DDBJ            DNA Data Bank of Japan
EBI             European Bioinformatics Institute
EMBL            European Molecular Biology Laboratory
ExPASy          Expert Protein Analysis System
HNN             Hierarchical Neural Network
NCBI            National Center for Biotechnology Information
NCI             National Cancer Institute
NIH             National Institutes of Health
NLM             United States National Library of Medicine
PDB             Protein Data Bank
PSI-BLAST       Position-Specific Iterated BLAST
PSIPRED         PSI-BLAST-based secondary structure prediction
SIB             Swiss Institute of Bioinformatics
URL             Uniform Resource Locator
List of figures
Figure no.   Name of Figure
1.1          Bioinformatics: an interdisciplinary subject
1.2          Submission and updates between three databases
1.3          Use of informatics in drug designing
1.4          The catalytic mechanism of chymotrypsin
1.5          The overview of the project
2.1          The flow of data from primary data sources into component databases of the Universal Protein Resource
2.2          FASTA format result of P00766
2.3          Graphical presentation of BLASTp results
2.4          Graphical presentation of PSI-BLAST search result
2.5          Graphical representation of pair-wise alignment
2.6          Algorithm of a software performing multiple sequence alignment
2.7          Multiple Sequence Alignment (MSA)
2.8          Multiple Sequence Alignment (MSA) Jalview results
2.9          Newick presentation
2.10         Phylogenetic tree (cladogram) from homologous sequences of P00766
2.11         Phylogenetic tree (phylogram) from homologous sequences of P00766
2.12         Phylogenetic tree by Jalview
2.13         The graphical presentation of HNN
2.14         Secondary structure by HNN
2.15         Secondary structure by PSIPRED
List of Tables
Table no.    Name of Table
1.1          Tools at EBI
1.2          Available tools at Bioinformatics Group - University College London
1.3          Primary Sequence Databases
1.4          Meta-bases
1.5          Software used in the project
2.1          Pair-wise alignment results for retrieved sequences to identify similarities
3.1          Different uses of BLAST programs
List of Web Addresses:
• http://www.ncbi.nlm.nih.gov
• http://www.ebi.ac.uk
• http://bioinf.cs.ucl.ac.uk/psipred/
• http://www.expasy.org
• http://www.pdb.org/
Reference Source:
• Bioinformatics: A Practical Guide to the Analysis of Genes and Proteins
A.D. Baxevanis and B.F.F. Ouellette
• Discovering Genomics, Proteomics and Bioinformatics
A.M. Campbell and L.J. Heyer
• Post-genome Informatics
Minoru Kanehisa
• Bioinformatics: Sequence and Genome Analysis
D.W. Mount
• Bioinformatics for Dummies
J.-M. Claverie and C. Notredame
• www.wikipedia.org
Chapter 1:
About Bioinformatics
Bioinformatics is an interdisciplinary subject. It may be described as a blend of biological
and computational sciences. Bioinformatics involves storing, retrieving and manipulating
biological data using computational techniques.
Figure 1.1: Bioinformatics, an interdisciplinary subject (a blend of biology, computer science,
mathematics and statistics)
1.1 General
Biological data are flooding in at an unprecedented rate. For example, as of August 2000 the
GenBank repository of nucleic acid sequences contained 8,214,000 entries and the SWISS-PROT
database of protein sequences contained 88,166. On average, the amount of information stored in
these databases is doubling every 15 months.
Bioinformatics is conceptualising biology in terms of molecules (in the sense of physical chemistry)
and applying informatics techniques (derived from disciplines such as applied maths, computer
science and statistics) to understand and organise the information associated with these molecules, on
a large scale. In short, bioinformatics is a management information system for molecular biology and
has many practical applications.
Bio stands for life, and informatics comes from the word information. So bioinformatics refers to the
science that deals with the information that comes from living systems. However, bioinformatics more
properly refers to the creation and advancement of algorithms, computational and statistical
techniques, and theory to solve formal and practical problems arising from the management and
analysis of biological data.
The National Center for Biotechnology Information (NCBI) defines bioinformatics as:
"Bioinformatics is the field of science in which biology, computer science, and information
technology merge into a single discipline. There are three important sub-disciplines within
bioinformatics: the development of new algorithms and statistics with which to assess relationships
among members of large data sets; the analysis and interpretation of various types of data including
nucleotide and amino acid sequences, protein domains, and protein structures; and the development
and implementation of tools that enable efficient access and management of different types of
information."
The terms bioinformatics and computational biology are often used interchangeably. However,
bioinformatics more properly refers to the creation and advancement of algorithms, computational and
statistical techniques, and theory to solve formal and practical problems posed by or inspired from the
management and analysis of biological data. Important sub-disciplines within bioinformatics and
computational biology include:
• the development and implementation of tools that enable efficient access to, and use
and management of, various types of information
• the development of new algorithms (mathematical formulas) and statistics with which
to assess relationships among members of large data sets, such as methods to locate a gene
within a sequence, predict protein structure and/or function, and cluster protein sequences into
families of related sequences
Storing, retrieving and manipulating biological data in a meaningful way to interpret the biological
system is the prime objective of Bioinformatics. To do so, in the initial phase the data produced by
thousands of research teams all over the world are collected and organized in databases specialized for
particular subjects. GDB (Genome Database), SWISS-PROT, GenBank and PDB (Protein Data Bank)
are some well known examples. As the information kept growing in size and complexity, the need for
specialized tools with diverse algorithmic approaches grew too. This resulted in the application of
specialized software such as BLAST, ClustalW, BioEdit, SCRATCH and Swiss-PdbViewer
for better manipulation and sorting of data.
1.2 Resources
The resources of Bioinformatics consist of gateways, databases and software.
Figure 1.2: Data flow for new submissions and updates between three databases (GenBank at NCBI,
accessed through Entrez; EMBL at EBI, accessed through SRS; DDBJ at CIB, accessed through getentry)
1.2.1 Gateway
A gateway in Information Technology (IT) can be thought of as an open door through which a user
collects specialized information. A gateway is reached at a specific Uniform Resource Locator (URL).
There are several gateways for software and databases that offer access to many of the sites in
bioinformatics. The gateways are listed below:
National Center for Biotechnology Information (NCBI)
Web site: http://www.ncbi.nlm.nih.gov
The National Center for Biotechnology Information (NCBI) is part of the United States National
Library of Medicine (NLM), a branch of the National Institutes of Health. The NCBI houses genome
sequencing data in GenBank and an index of biomedical research articles in PubMed Central and PubMed,
as well as other information relevant to biotechnology. In addition to GenBank, NCBI
provides Online Mendelian Inheritance in Man, the Molecular Modeling Database (3D
protein structures), the Unique Human Gene Sequence Collection, a Gene Map of the Human
Genome and a Taxonomy Browser, and coordinates with the National Cancer Institute to provide
the Cancer Genome Anatomy Project. All these databases are linked through a single search
and retrieval system called Entrez, which also provides cross-referenced information that integrates
these resources.
European Bioinformatics Institute (EBI)

Web Site: http://www.ebi.ac.uk
The European Bioinformatics Institute (EBI) is a non-profit academic organisation that forms part of
the European Molecular Biology Laboratory (EMBL). The EBI is a centre for research and services in
bioinformatics. The Institute manages databases of biological data
including nucleic acid, protein sequences and macromolecular structures. The mission of the
EBI is to provide freely available data and bioinformatics services to all facets of the
scientific community in ways that promote scientific progress and to contribute to the
advancement in molecular biology and genome research through basic investigator-driven
research in bioinformatics.
Table 1.1: Tools at EBI
Tool                          Description
Align                         Pairwise global and local alignment tool (EMBOSS).
ClustalW                      Multiple sequence alignments.
CpG Plot/CpGreport            CpG Island finder and plotting tool (EMBOSS).
GeneMark                      Gene prediction service.
Genetic Code Viewer           Review of genetic code differences.
Wise2                         Compares a protein sequence or a protein profile HMM to a DNA sequence.
Mutation Checker              Sequence validation.
Pepstats/Pepwindow/Pepinfo    EMBOSS programs for basic protein sequence analysis (EMBOSS).
PromoterWise                  Compares two DNA sequences allowing for inversions and translocations, ideal for promoters.
Reverse Translator            Reverse complement checker.
SAPS                          Statistics on protein sequences.
Transeq                       DNA sequence translation tool (EMBOSS).
ExPASy Molecular Biology Server - Expert Protein Analysis System, Swiss Institute of Bioinformatics
Web Site: http://www.expasy.org
ExPASy (Expert Protein Analysis System) is a proteomics server of the Swiss
Institute of Bioinformatics (SIB) which analyzes protein sequences and structures
and two-dimensional gel electrophoresis (2-D PAGE). The server
functions in collaboration with the European Bioinformatics Institute. ExPASy
also produces the protein sequence knowledgebase, UniProtKB/Swiss-Prot, and its
computer-annotated supplement, UniProtKB/TrEMBL.
Bioinformatics Group - University College London

Web Site: http://www.bioinf.cs.ucl.ac.uk
The Bioinformatics Group was originally founded as the Joint Research Council funded Bioinformatics
Unit within the Department of Computer Science at University College London. The group's main aim is
to develop and apply state-of-the-art mathematical and computer science techniques to problems now
arising in the life sciences, particularly those now appearing in the post-genomic era.
Available tools and software are:
Table 1.2: Available tools at Bioinformatics Group - University College London
Protein Structure Prediction         Threading (THREADER)
                                     Ab initio folding simulations
                                     Secondary structure prediction (PSIPRED)
                                     Protein disorder prediction (DISOPRED)
                                     Protein domain prediction (DomPred)
Protein Sequence Analysis            Amino acid substitution matrices
                                     Hidden Markov Models (collaboration with N. Goldman, Cambridge, & J. Thorne, NCSU)
Genome Analysis                      Genomic Threading Database (GTD)
                                     Genomic fold recognition (GenTHREADER)
                                     Genome annotation using software agents
Protein Structure Classification     Comparison of structure classifications (CATH/SCOP/FSSP)
                                     CATH (collaboration with J. Thornton & C. Orengo, UCL Biochemistry)
Transmembrane Protein Modelling      MEMSAT
                                     Folding In Lipid Membranes (FILM)
Biological Applications of Data-     Information extraction for biological research (BioRat)
mining and Machine Learning
Techniques
1.2.2 Databases
A database on the internet is built on a database management system (DBMS), which
has two interfaces: one for users to query and enter data, and another for management on the
host computer. A database is a compilation of entities described by their defined
attributes.
A database (or data base) is a collection of data organised so that its contents can easily be
accessed, managed, and modified by a computer. It is also called a data bank. The most prevalent type
of database is the relational database, which organizes the data in tables; multiple relations can be
mathematically defined between the rows and columns of each table to yield the desired information.
An object-oriented database stores data in the form of objects, which are organized in hierarchical
classes that may inherit properties from classes higher in the tree structure.
A biological database is a large, organized body of persistent data, usually associated with
computerized software designed to update, query, and retrieve components of the data stored within
the system. A simple database might be a single file containing many records, each of which includes
the same set of information. Most biological databases consist of long strings of nucleotides (guanine,
adenine, thymine, cytosine and uracil) and/or amino acids (threonine, serine, glycine, etc.). Each
sequence of nucleotides or amino acids represents a particular gene or protein (or section thereof),
respectively. Sequences are represented in shorthand, using single letter designations.
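The idea of a database as a set of records, each holding the same fields, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the table name `protein`, its columns and the two toy entries are invented for the example and do not reflect the schema or content of any real biological database.

```python
import sqlite3

def build_demo_db():
    """Create an in-memory relational database holding a few sequence records."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE protein (accession TEXT PRIMARY KEY, name TEXT, sequence TEXT)"
    )
    records = [
        # Illustrative entries only: toy sequences, not real database records.
        ("DEMO001", "toy protein A", "MKTAYIAKQR"),
        ("DEMO002", "toy protein B", "GSHMLEDPVA"),
    ]
    conn.executemany("INSERT INTO protein VALUES (?, ?, ?)", records)
    conn.commit()
    return conn

def fetch_sequence(conn, accession):
    """Return the sequence stored under an accession, or None if it is absent."""
    row = conn.execute(
        "SELECT sequence FROM protein WHERE accession = ?", (accession,)
    ).fetchone()
    return row[0] if row else None

if __name__ == "__main__":
    db = build_demo_db()
    print(fetch_sequence(db, "DEMO001"))
```

Each row plays the role of one record, and the accession acts as the key by which a record is queried and retrieved.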
There are two main functions of biological databases:
1. To make biological data available to scientists. As much as possible of a particular type of
information should be available in one single place (book, site, database). Published data may
be difficult to find or access, and collecting it from the literature is very time-consuming.
Moreover, not all data are published explicitly in an article (genome sequences, for example).
2. To make biological data available in computer-readable form. Since analysis of biological
data almost always involves computers, having the data in computer-readable form (rather
than printed on paper) is a necessary first step.
Databases for bioinformatics can be grouped as:
• Primary and added-value databases
• Sequence vs organism databases
• ‘Federated’ databases: global computer networks … WWW
Primary or archived databases contain information and annotation of DNA and protein sequences,
DNA and protein structures and DNA and protein expression profiles.
Secondary or derived databases are so called because they contain the results of analysis on the
primary resources including information on sequence patterns or motifs, variants and mutations and
evolutionary relationships. Information from the literature is contained in bibliographic databases,
such as Medline.
The following table lists widely used databases for analyzing DNA and protein sequences, which also
support research on DNA, protein structure and protein function.
Table 1.3: Primary Sequence Databases
Databases                Software tools                                                     Web Site
Nucleic Acid Databases   NCBI (National Center for Biotechnology Information) - GenBank     http://www.ncbi.nlm.nih.gov/
                         EBI (European Bioinformatics Institute) - EMBL                     http://www.ebi.ac.uk/
                         DISC - DNA Information and Stock Center, Japan                     http://www.dna.affrc.go.jp/
Protein Databases        NCBI - GenPept                                                     http://www.ncbi.nlm.nih.gov/
                         ExPASy - SwissProt and TrEMBL                                      http://www.expasy.ch/
                         EBI (European Bioinformatics Institute) - SwissProt, TrEMBL, PIR   http://www.ebi.ac.uk/
                         DISC - DNA Information and Stock Center, Japan                     http://www.dna.affrc.go.jp/
Meta-databases: A meta-database can be considered a database of databases, rather than any one
integration project or technology. They collect data from different sources and usually make them
available in new and more convenient form, or with an emphasis on a particular disease or
organism.
Table 1.4: Meta-bases
Name                                                          Web Site
Entrez (National Center for Biotechnology Information)        http://www.ncbi.nlm.nih.gov
euGenes (Indiana University)                                  http://eugenes.org
GeneCards (Weizmann Inst.)                                    http://www.genecards.org
SOURCE (Stanford University)                                  http://genome-www4.stanford.edu/cgi-bin/SMD/source/sourceSearch
mGen, containing four of the world's biggest databases        http://www.cyber-indian.com/bioperl/index.html
(GenBank, Refseq, EMBL and DDBJ) - easy and simple
program-friendly gene extraction
Bioinformatic Harvester (Karlsruhe Institute of               http://harvester.fzk.de
Technology) - integrating 26 major protein/gene resources
MetaBase (KOBIC) - a user-contributed database of             http://BioDatabase.Org
biological databases
1.2.3. Software/Tools
Software tools are computer programs for sequence analysis, database construction
and management, the study of evolutionary relationships, structural analysis and pathway
analysis. Many software tools are integrated into databases.
The Bioinformatics Toolbox offers computational molecular biologists and other
research scientists an open and extensible environment in which to explore ideas,
prototype new algorithms, and build applications in drug research, genetic engineering,
and other genomics and proteomics projects. These tools range from a collection of
standalone tools with a common data format under a single, slick standalone or web-
based interface, to integrative and extensible bioinformatics workflow development
environments.
The important software programs in Bioinformatics that have been used in our project
are given in the following table:
Table 1.5: Software used in the project
Name of the Software          Application and purpose                                Source
ProtParam                     Predicts physicochemical properties from sequence      http://www.expasy.org/
BLAST                         Finds regions of local similarity between sequences    http://www.ncbi.nlm.nih.gov/
ClustalW                      Multiple sequence alignment tool                       http://www.ebi.ac.uk/
PSIPRED                       Secondary structure prediction tool                    http://bioinf.cs.ucl.ac.uk/psipred/
Hierarchical Neural Network   Secondary structure prediction tool                    http://www.expasy.org/
1.3 Application of Bioinformatics
Bioinformatics is being used in the following fields:
Gene expression study
Many expression studies have so far focused on devising methods to cluster genes by
similarities in expression profiles. This is in order to determine the proteins that are
expressed together under different cellular conditions. Briefly, the most common
methods are hierarchical clustering, self-organising maps, and K-means clustering. Hierarchical
methods originally derived from algorithms to construct phylogenetic trees, and group genes in a
bottom-up fashion; genes with the most similar expression profiles are clustered first, and those
with more diverse profiles are included iteratively. In contrast, the self-organising map and K-
means methods employ a top-down approach in which the user pre-defines the number of clusters
for the dataset. The clusters are initially assigned randomly, and the genes are regrouped iteratively
until they are optimally clustered.
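The K-means procedure described above can be sketched as follows. This is a minimal illustration, not the method of any particular expression-analysis package: the gene names and expression values are invented, and the starting centroids are passed in explicitly (rather than chosen at random) so that the example is reproducible.

```python
def kmeans(profiles, centroids, iterations=10):
    """Cluster expression profiles around a user-chosen number of centroids."""
    clusters = []
    for _ in range(iterations):
        # Assignment step: put each gene in the cluster of its nearest centroid.
        clusters = [[] for _ in centroids]
        for name, values in profiles.items():
            distances = [
                sum((v - c) ** 2 for v, c in zip(values, centroid))
                for centroid in centroids
            ]
            clusters[distances.index(min(distances))].append(name)
        # Update step: move each centroid to the mean profile of its members.
        new_centroids = []
        for members, old in zip(clusters, centroids):
            if not members:
                new_centroids.append(old)  # an empty cluster keeps its centroid
                continue
            columns = zip(*(profiles[m] for m in members))
            new_centroids.append([sum(col) / len(members) for col in columns])
        if new_centroids == centroids:
            break  # assignments can no longer change
        centroids = new_centroids
    return [sorted(cluster) for cluster in clusters]

if __name__ == "__main__":
    # Hypothetical expression values measured under three conditions.
    demo = {
        "geneA": [1.0, 1.2, 0.9],
        "geneB": [1.1, 1.0, 1.0],
        "geneC": [5.0, 5.2, 4.9],
        "geneD": [4.8, 5.1, 5.0],
    }
    print(kmeans(demo, [[0.0, 0.0, 0.0], [6.0, 6.0, 6.0]]))
```

The number of clusters is fixed by the number of starting centroids, which is the sense in which the user pre-defines it.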
Drug development
One of the earliest medical applications of bioinformatics has been in aiding rational drug
design. Figure 1.3 outlines the commonly cited approach, taking the MLH1 gene product as
an example drug target. MLH1 is a human gene encoding a mismatch repair (MMR) protein,
situated on the short arm of chromosome 3. Through linkage analysis and its similarity to
MMR genes in mice, the gene has been implicated in nonpolyposis colorectal cancer (126).
Given the nucleotide sequence, the probable amino acid sequence of the encoded protein can
be determined using translation software. Sequence search techniques can then be used to
find homologues in model organisms, and based on sequence similarity, it is possible to
model the structure of the human protein on experimentally characterised structures. Finally,
docking algorithms could design molecules that could bind the model structure, leading the way for
biochemical assays to test their biological activity on the actual protein.
At present all drugs on the market target only about 500 proteins. With an improved understanding
of disease mechanisms and using computational tools to identify and validate new drug targets, more
specific medicines that act on the cause, not merely the symptoms, of the disease can be developed.
These highly specific drugs promise to have fewer side effects than many of today's medicines.
Figure 1.3: Use of informatics in drug designing.
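The translation step mentioned above (from nucleotide sequence to probable amino acid sequence) can be sketched with the standard genetic code. This is a simplified illustration: it reads only the first reading frame of a coding strand, and the short DNA string used in the example is invented, not a fragment of MLH1.

```python
BASES = "TCAG"
# Standard genetic code; amino acids listed with each codon position cycling in TCAG order.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}

def translate(dna):
    """Translate a coding DNA sequence (frame 1) up to the first stop codon."""
    dna = dna.upper()
    protein = []
    for pos in range(0, len(dna) - 2, 3):
        residue = CODON_TABLE.get(dna[pos:pos + 3], "X")  # X marks an unknown codon
        if residue == "*":
            break
        protein.append(residue)
    return "".join(protein)

if __name__ == "__main__":
    print(translate("ATGGCCTGGTAA"))  # hypothetical toy sequence
```

Real translation software also handles the other reading frames, the reverse strand and alternative genetic codes, which this sketch leaves out.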
Pharmacogenomics
Clinical medicine will become more personalized with the development of the field of
pharmacogenomics. This is the study of how an individual's genetic inheritance affects the body's
response to drugs. At present, some drugs fail to make it to the market because a small percentage of
the clinical patient population show adverse effects to a drug due to sequence variants in their DNA.
Today, doctors have to use trial and error to find the best drug to treat a particular patient, as those
with the same clinical symptoms can show a wide range of responses to the same treatment. In the
future, doctors will be able to analyse a patient's genetic profile and prescribe the best available drug
therapy and dosage from the beginning.
Gene therapy
Gene therapy is the approach used to treat, cure or even prevent disease by changing the expression of
a person’s defective genes. Currently, this field is in its infancy, with clinical trials ongoing for
many different types of cancer and other diseases.
Detection of Antibiotic-resistant pathogens
Scientists have been examining the genome of Enterococcus faecalis, a leading cause of bacterial
infection among hospital patients. They have discovered a region made up of a number of
antibiotic-resistance genes that may transform the bacterium from a harmless gut bacterium into a
menacing invader. The discovery of this region could provide a useful marker for detecting pathogenic
strains and help to control the spread of infection in wards.
Agriculture
Bioinformatics tools can be used to analyze the genome sequences of plants and animals and to
elucidate the functions of different genes. This specific genetic knowledge could then be used to
produce nutrient-rich, drought-, disease- and insect-resistant plants and to improve the quality of
livestock, making them healthier, more disease resistant and more productive.
Insect resistance
Genes from Bacillus thuringiensis that can control a number of serious pests have been successfully
transferred to cotton, maize and potatoes. This new ability of the plants to resist insect attack means
that the amount of insecticides being used can be reduced and hence the nutritional quality of the
crops is increased.
Improved nutritional quality
Scientists have recently succeeded in transferring genes into rice to increase levels of Vitamin A, iron
and other micronutrients. Scientists have also inserted a gene from yeast into tomato; the result is a
plant whose fruit stays longer on the vine and has an extended shelf life.
Biotechnology
The archaeon Archaeoglobus fulgidus and the bacterium Thermotoga maritima have potential for
practical applications in industry and government-funded environmental remediation. These
microorganisms thrive in water temperatures above the boiling point and therefore may provide the
DOE (Department of Energy, USA), the Department of Defence, and private companies with heat-stable
enzymes suitable for use in industrial processes.
Microbial genome applications
Microorganisms are ubiquitous, that is they are found everywhere. They have been found surviving
and thriving in extremes of heat, cold, radiation, salt, acidity and pressure. They are present in the
environment, our bodies, the air, food and water. Traditionally, a variety of microbial properties have
been applied in the baking, brewing and food industries. The arrival of the complete genome
sequences and their potential to provide a greater insight into the microbial world and its capacities
could have broad and far reaching implications for environment, health, energy and industrial
applications.
Waste management
Deinococcus radiodurans is known as the world's toughest bacterium and is the most
radiation-resistant organism known. Scientists are interested in this organism because of its
potential usefulness in cleaning up waste sites that contain radiation and toxic chemicals. Microbial
Genome Program (MGP) scientists are determining the DNA sequence of C. crescentus, one of the
organisms responsible for sewage treatment.
Maintenance of climatic balance
Increasing levels of carbon dioxide emission, mainly through the expanding use of fossil fuels for
energy, are thought to contribute to global climate change. Recently, the DOE (Department of Energy,
USA) launched a program to decrease atmospheric carbon dioxide levels. One method of doing so is
to study the genomes of microbes that use carbon dioxide as their sole carbon source.
Evolutionary Studies
The sequencing of genomes from all three domains of life, eukaryota, bacteria and archaea means that
evolutionary studies can be performed in a quest to determine the tree of life and the last universal
common ancestor.
Forensic studies
Bioinformatics has created a great opportunity to ease forensic analysis. It has improved the
accuracy with which the right culprit can be identified in forensic investigations.
Forensic analysis of microbes
Scientists used genomic tools to help distinguish the strain of Bacillus anthracis that
was used in the summer of 2001 terrorist attack in Florida from closely related anthrax strains.
Bioweapon creation
Scientists have recently built the poliomyelitis virus by entirely artificial means, using genomic data
available on the internet and materials from a mail-order chemical supply.
1.4 Project Aim
The aim of our project was to become acquainted with the field of Bioinformatics. More specifically,
the targets were to:
• Become familiar with biological databases and the tools available to analyze the information
in such databases.
• Find the sequence of the protein and study its physicochemical properties.
• Align similar proteins and generate phylogenetic trees to examine evolutionary relationships.
• Cluster protein sequences into families of related sequences and develop protein models.
• Develop methods to predict the structure and/or function and resolve the
secondary structure.
A well known protein, chymotrypsin (Swiss-Prot accession P00766), was studied in the project.
Chymotrypsin is a proteolytic enzyme. This enzyme catalyzes the hydrolysis of peptide bonds of
proteins in the small intestine. It is selective for peptide bonds on the carboxyl side of residues
with aromatic or large hydrophobic side chains (Tyr, Trp, Phe). Chymotrypsin also
catalyzes the hydrolysis of ester bonds. It is termed a serine protease because it has a reactive serine
residue in its active site. Three amino acid residues have been found to play the key role in catalysis:
Ser195, His57 and Asp102. Together these residues are termed the “catalytic triad”. Although far
apart in the primary structure, protein folding brings these residues close together and into the
correct orientation in the tertiary structure. Chymotrypsin was the first serine protease to be
discovered. Its crystal structure was first resolved by David Blow in 1967. This discovery provided
a key understanding of the catalytic mechanism of a great variety of enzymes.
The mechanism of chymotrypsin action is illustrated in Figure 1.4.
Figure 1.4: The mechanism of action of chymotrypsin
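The selectivity described above (cleavage on the carboxyl side of Tyr, Trp and Phe) can be turned into a rough cleavage-site predictor. This is a deliberately simplified sketch: the function names are invented for the example, the peptide is a toy sequence, and real specificity also depends on neighbouring residues, which the code ignores.

```python
def chymotrypsin_sites(sequence, residues="FWY"):
    """Return 1-based positions of residues after which cleavage is expected."""
    # There is no peptide bond after the final residue, so it is skipped.
    return [i + 1 for i, aa in enumerate(sequence[:-1]) if aa in residues]

def digest(sequence, residues="FWY"):
    """Split a sequence into the peptides produced by cutting at every predicted site."""
    peptides, start = [], 0
    for site in chymotrypsin_sites(sequence, residues):
        peptides.append(sequence[start:site])
        start = site
    peptides.append(sequence[start:])
    return peptides

if __name__ == "__main__":
    toy_peptide = "GAFKLWTY"  # hypothetical sequence for illustration
    print(chymotrypsin_sites(toy_peptide))
    print(digest(toy_peptide))
```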
Using bioinformatics tools we have performed a number of tasks concerned with chymotrypsin, such as:
• Retrieving the sequence of the protein.
• Determining the physico-chemical properties from the sequence.
• Performing a BLAST search to find similar sequences.
• Pair wise and multiple sequence alignment of chymotrypsin with various other protein
sequences.
• Constructing a phylogenetic tree to determine the evolutionary relationships among
the proteins that were chosen for multiple sequence alignment.
The overview of the project is shown in the following flow chart:
Sequence database browsing / manual input -> protein sequence file -> protein sequence analysis
(primary structure: physico-chemical properties; secondary structure prediction) and searching
databases for similar sequences -> sequence comparison -> pair wise alignment (identity, similarity)
and multiple sequence alignment -> phylogenetic tree construction
Figure 1.5: The overview of the project
Chapter 2:
Working with Protein Sequences
2.1 General
With the availability of hundreds of complete genome sequences from both prokaryotes and
eukaryotes, efforts are now focused on the identification and functional analysis of the
proteins encoded by these genomes. This urgency has resulted in a burst of fresh
information linked to proteomics, and with it came the need for protein sequence databases.
Figure 2.1: The flow of data from primary data sources into component databases of the Universal
Protein Resource (UniProt). Primary sources shown: Sub/peptide data, DDBJ/EMBL/GenBank, VEGA, PDB,
patent data, WGS, EnsEMBL, RefSeq, FlyBase and WormBase. Component databases shown: UniProt archive,
UniProt knowledgebase (Swiss-Prot + TrEMBL) and UniProt NREF 100, 90 and 50, together with proteome
sets and IPI.
2.2 Fetching protein sequence from Database
2.2.1 Database
We searched the protein database available through the NCBI gateway. It is the NIH protein sequence
database, an annotated collection of all publicly available protein sequences. The complete release
notes for the current version of the protein database are available on the NCBI ftp site. A new release
is made every two months.
2.2.2 Method
1. The search for the desired sequence was started from the NCBI home page
(http://www.ncbi.nlm.nih.gov).
2. “Protein” was chosen in the “Search” box and the database was searched for the chymotrypsin
sequence.
3. P00766 was selected from the list and clicked.
4. The information available on the page was read carefully.
5. “FASTA” was selected from Display.
6. The amino acid sequence in FASTA format was saved.
NCBI home page (http://www.ncbi.nlm.nih.gov) -> search ‘Protein’ for Chymotrypsin -> P00766 is
selected (GenPept format) -> FASTA selected from Display -> sequence saved
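The retrieval above was done by hand through the web pages. As an alternative sketch, the same record could in principle be requested through NCBI's E-utilities `efetch` service. This was not the method used in the project; the helper names are invented for the example, and it is an assumption that the service accepts the Swiss-Prot accession P00766 as the `id` for the protein database.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"

def build_fasta_url(accession):
    """Build an E-utilities efetch request for a protein record in FASTA format."""
    query = urlencode({
        "db": "protein",       # the Entrez protein database
        "id": accession,       # assumed to accept an accession such as P00766
        "rettype": "fasta",
        "retmode": "text",
    })
    return f"{EFETCH_URL}?{query}"

def fetch_fasta(accession, timeout=30):
    """Download the FASTA text for an accession (requires network access)."""
    with urlopen(build_fasta_url(accession), timeout=timeout) as response:
        return response.read().decode("utf-8")

if __name__ == "__main__":
    # Only the URL is printed here; call fetch_fasta("P00766") to download the record.
    print(build_fasta_url("P00766"))
```

Building the URL separately from the download keeps the request easy to inspect before any network call is made.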
2.2.3 Result
The sequence was retrieved and saved in Microsoft Word format for further use.
>gi|117615|sp|P00766|
CGVPAIQPVLSGLSRIVNGEEAVPGSWPWQVSLQDKTGFHFCGGSLINENWVVTAAHCGVTTSDVVVAGE
FDQGSSSEKIQKLKIAKVFKNSKYNSLTINNDITLLKLSTAASFSQTVSAVCLPSASDDFAAGTTCVTTG
WGLTRYTNANTPDRLQQASLPLLSNTNCKKYWGTKIKDAMICAGASGVSSCMGDSGGPLVCKKNGAWTLV
GIVSWGSSTCSTSTPGVYARVTALVNWVQQTLAAN
Figure 2.2: FASTA format result of P00766
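A FASTA record such as the one above consists of a header line beginning with ">" followed by one or more lines of sequence. A minimal parser, written here as an illustrative sketch (the function name is an assumption, not part of any tool used in the project), joins the sequence lines and separates the header fields:

```python
def parse_fasta(text):
    """Parse FASTA text into a list of (header, sequence) tuples."""
    records, header, chunks = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records

# The record retrieved for P00766, copied from the result above.
FASTA_TEXT = """>gi|117615|sp|P00766|
CGVPAIQPVLSGLSRIVNGEEAVPGSWPWQVSLQDKTGFHFCGGSLINENWVVTAAHCGVTTSDVVVAGE
FDQGSSSEKIQKLKIAKVFKNSKYNSLTINNDITLLKLSTAASFSQTVSAVCLPSASDDFAAGTTCVTTG
WGLTRYTNANTPDRLQQASLPLLSNTNCKKYWGTKIKDAMICAGASGVSSCMGDSGGPLVCKKNGAWTLV
GIVSWGSSTCSTSTPGVYARVTALVNWVQQTLAAN
"""

if __name__ == "__main__":
    header, sequence = parse_fasta(FASTA_TEXT)[0]
    print(header.split("|")[3], len(sequence))
```

Splitting the header on "|" gives the database tags and identifiers; the fourth field is the Swiss-Prot accession.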
2.3 Analyzing Protein Sequences
2.3.1 Understanding the general physicochemical properties of a protein sequence
Proteins are condensation polymers of amino acid residues. However, a linear arrangement of residues
by itself does not reveal much about protein structure or protein function. It is the 3D or tertiary
native structure (quaternary in the case of a multisubunit protein) that describes a protein best.
Although primary structure analysis is not a good method for functional and structural analysis of a
protein, it can provide valuable information about the protein's behaviour in solution, its molecular
weight, extinction coefficient, etc. Thus general physicochemical properties can be a good indicator
of protein activity on a broader scale.
2.3.1.1 Software
ProtParam (web link: www.expasy.org)
ProtParam is a tool which allows the computation of various physical and chemical parameters for a
given protein stored in Swiss-Prot or TrEMBL, or for a user-entered sequence. The computed parameters
include the molecular weight, theoretical pI, amino acid composition, atomic composition, extinction
coefficient, estimated half-life, instability index, aliphatic index and grand average of
hydropathicity (GRAVY). The following parameters are reported by ProtParam:
Molecular weight
Number of residues
Average residue weight
Charge
Iso-electric point
For each physico-chemical class of amino acid: number, molar percent
Probability of protein expression in E. coli inclusion bodies
Extinction coefficient at 1 mg/ml (A280)
Molar extinction coefficient (A280)
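Two of these parameters, amino acid composition and molecular weight, follow directly from residue counts. The sketch below is illustrative only and is not ProtParam's own code. It assumes the commonly tabulated average residue masses and adds one water molecule for the free N- and C-termini. The sequence is the P00766 record retrieved in Section 2.2.

```python
from collections import Counter

# Average residue masses in daltons (assumed standard values).
AVERAGE_RESIDUE_MASS = {
    "G": 57.0519, "A": 71.0788, "S": 87.0782, "P": 97.1167, "V": 99.1326,
    "T": 101.1051, "C": 103.1388, "L": 113.1594, "I": 113.1594, "N": 114.1038,
    "D": 115.0886, "Q": 128.1307, "K": 128.1741, "E": 129.1155, "M": 131.1926,
    "H": 137.1411, "F": 147.1766, "R": 156.1875, "Y": 163.1760, "W": 186.2132,
}
WATER_MASS = 18.01528

SEQUENCE = (
    "CGVPAIQPVLSGLSRIVNGEEAVPGSWPWQVSLQDKTGFHFCGGSLINENWVVTAAHCGVTTSDVVVAGE"
    "FDQGSSSEKIQKLKIAKVFKNSKYNSLTINNDITLLKLSTAASFSQTVSAVCLPSASDDFAAGTTCVTTG"
    "WGLTRYTNANTPDRLQQASLPLLSNTNCKKYWGTKIKDAMICAGASGVSSCMGDSGGPLVCKKNGAWTLV"
    "GIVSWGSSTCSTSTPGVYARVTALVNWVQQTLAAN"
)


def composition(seq):
    """Return {residue: (count, mole percent)} for a protein sequence."""
    counts = Counter(seq)
    return {aa: (n, 100.0 * n / len(seq)) for aa, n in sorted(counts.items())}


def molecular_weight(seq):
    """Sum of average residue masses plus one water for the chain termini."""
    return sum(AVERAGE_RESIDUE_MASS[aa] for aa in seq) + WATER_MASS


for aa, (count, percent) in composition(SEQUENCE).items():
    print(aa, count, round(percent, 1))
print("Molecular weight:", round(molecular_weight(SEQUENCE), 1))
```

The same counts also give the totals of negatively charged (Asp + Glu) and positively charged (Arg + Lys) residues.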
2.3.1.2 Method
1. The address of the ExPASy server (http://www.expasy.org) was written in the address bar of
Internet Explorer.
2. Then the “Toolbox” option was clicked.
3. The “Sequence Analysis” option was chosen.
4. Then, from the list, the “Protparam” option was clicked.
5. Then, in the box for the sequence, the sequence of “P00766 (Swiss-Prot accession no.)” was pasted.
6. Then the “Run” command button was clicked.
7. Then the obtained results were saved in a Microsoft Word document.
ExPASy home page (www.expasy.org) -> Toolbox -> Sequence Analysis -> Protparam was selected ->
Sequence of P00766 was pasted -> Compute parameters -> Result -> Save
2.3.1.3 Results
ProtParam
User-provided sequence:
10 20 30 40 50 60
CGVPAIQPVL SGLSRIVNGE EAVPGSWPWQ VSLQDKTGFH FCGGSLINEN WVVTAAHCGV
70 80 90 100 110 120

TTSDVVVAGE FDQGSSSEKI QKLKIAKVFK NSKYNSLTIN NDITLLKLST AASFSQTVSA
130 140 150 160 170 180

VCLPSASDDF AAGTTCVTTG WGLTRYTNAN TPDRLQQASL PLLSNTNCKK YWGTKIKDAM
190 200 210 220 230 240

ICAGASGVSS CMGDSGGPLV CKKNGAWTLV GIVSWGSSTC STSTPGVYAR VTALVNWVQQ
TLAAN
Number of amino acids: 245
Molecular weight: 25666.1
Theoretical pI: 8.52
Amino acid composition:
Ala (A) 22 9.0%
Arg (R) 4 1.6%
Asn (N) 14 5.7%
Asp (D) 9 3.7%
Cys (C) 10 4.1%
Gln (Q) 10 4.1%
Glu (E) 5 2.0%
Gly (G) 23 9.4%
His (H) 2 0.8%
Ile (I) 10 4.1%
Leu (L) 19 7.8%
Lys (K) 14 5.7%
Met (M) 2 0.8%
Phe (F) 6 2.4%
Pro (P) 9 3.7%
Ser (S) 28 11.4%
Thr (T) 23 9.4%
Trp (W) 8 3.3%
Tyr (Y) 4 1.6%
Val (V) 23 9.4%
Pyl (O) 0 0.0%
Sec (U) 0 0.0%
(B) 0 0.0%
(Z) 0 0.0%
(X) 0 0.0%
Total number of negatively charged residues (Asp + Glu): 14

Total number of positively charged residues (Arg + Lys): 18
Atomic composition:
Carbon C 1127
Hydrogen H 1783
Nitrogen N 307
Oxygen O 353
Sulfur S 12
Formula: C1127H1783N307O353S12
Total number of atoms: 3582
Extinction coefficients:
Extinction coefficients are in units of M-1 cm-1, at 280 nm measured in water.

Ext. coefficient 50585
Abs 0.1% (=1 g/l) 1.971, assuming ALL Cys residues appear as half cystines
Ext. coefficient 49960
Abs 0.1% (=1 g/l) 1.947, assuming NO Cys residues appear as half cystines
Estimated half-life:
The N-terminal of the sequence considered is C (Cys).
The estimated half-life is: 1.2 hours (mammalian reticulocytes, in vitro).
>20 hours (yeast, in vivo).
>10 hours (Escherichia coli, in vivo).
Instability index:
The instability index (II) is computed to be 15.27
This classifies the protein as stable.
Aliphatic index: 82.37
Grand average of hydropathicity (GRAVY): 0.051
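Several of the reported values can be cross-checked by hand. The sketch below is an illustration under stated assumptions, not ProtParam's implementation. It assumes the documented ProtParam constants: molar extinction values of 1490 (Tyr), 5500 (Trp) and 125 (cystine) at 280 nm, aliphatic index weights of 2.9 for Val and 3.9 for Ile and Leu, and the Kyte-Doolittle hydropathy scale for GRAVY.

```python
# Kyte-Doolittle hydropathy values (assumed standard scale).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5, "Q": -3.5, "E": -3.5,
    "G": -0.4, "H": -3.2, "I": 4.5, "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8,
    "P": -1.6, "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

SEQUENCE = (
    "CGVPAIQPVLSGLSRIVNGEEAVPGSWPWQVSLQDKTGFHFCGGSLINENWVVTAAHCGVTTSDVVVAGE"
    "FDQGSSSEKIQKLKIAKVFKNSKYNSLTINNDITLLKLSTAASFSQTVSAVCLPSASDDFAAGTTCVTTG"
    "WGLTRYTNANTPDRLQQASLPLLSNTNCKKYWGTKIKDAMICAGASGVSSCMGDSGGPLVCKKNGAWTLV"
    "GIVSWGSSTCSTSTPGVYARVTALVNWVQQTLAAN"
)


def extinction_coefficient(seq, cystines=True):
    """Molar extinction coefficient at 280 nm (M-1 cm-1)."""
    n_cystine = seq.count("C") // 2 if cystines else 0
    return 1490 * seq.count("Y") + 5500 * seq.count("W") + 125 * n_cystine


def aliphatic_index(seq):
    """Mole percent of Ala + 2.9 * Val + 3.9 * (Ile + Leu)."""
    def mole_percent(aa):
        return 100.0 * seq.count(aa) / len(seq)
    return (mole_percent("A") + 2.9 * mole_percent("V")
            + 3.9 * (mole_percent("I") + mole_percent("L")))


def gravy(seq):
    """Grand average of hydropathicity: mean hydropathy per residue."""
    return sum(KYTE_DOOLITTLE[aa] for aa in seq) / len(seq)


print(extinction_coefficient(SEQUENCE, cystines=True))
print(extinction_coefficient(SEQUENCE, cystines=False))
print(round(aliphatic_index(SEQUENCE), 2))
print(round(gravy(SEQUENCE), 3))
```

Dividing either extinction coefficient by the molecular weight (25666.1) gives the absorbance of a 1 g/l solution, for example 50585 / 25666.1 = 1.971.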
2.3.2 Searching database for similar sequences
2.3.2.1 Software tools

Standard Protein-Protein BLAST (BLASTp)
BLASTp is the NCBI BLAST program for comparing a protein query sequence to a protein database. The
original BLAST program was developed at NCBI. It takes protein sequences in FASTA format, GenBank
accession numbers or GI numbers and compares them against the NCBI protein databases. BLASTp is used
both for identifying a query amino acid sequence and for finding similar sequences in protein
databases. Like other BLAST programs, blastp is designed to find local regions of similarity. However,
when sequence similarity spans the whole sequence, blastp will also report a global alignment, which
is the preferred result for protein identification purposes. It can be used from the NCBI website.
Position Specific Iterated BLAST (PSI-BLAST)
PSI-BLAST uses an iterative search in which sequences found in one round of searching are used to
build a score model (profile) for the next round of searching. Highly conserved positions receive high
scores and weakly conserved positions receive scores near zero. The profile is used to perform a
second (etc.) BLAST search, and the results of each “iteration” are used to refine the profile. This
iterative searching strategy results in increased sensitivity. It can be used from the NCBI website.
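The idea of a position-specific score model can be sketched as follows. This is a simplified illustration and not the PSI-BLAST algorithm itself: it assumes a uniform background frequency and a flat pseudocount, whereas PSI-BLAST uses sequence weighting and matrix-derived pseudocounts. The five-residue fragments are the N-terminal residues of four sequences that appear in the search results.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_profile(aligned, pseudocount=1.0):
    """Log-odds score of each residue at each column of an ungapped alignment."""
    background = 1.0 / len(AMINO_ACIDS)    # assumption: uniform background
    profile = []
    for column in zip(*aligned):
        counts = Counter(column)
        total = len(column) + pseudocount * len(AMINO_ACIDS)
        profile.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in AMINO_ACIDS
        })
    return profile


def profile_score(profile, seq):
    """Score a sequence of the same length against the profile."""
    return sum(column[aa] for column, aa in zip(profile, seq))


# Fragments: bovine P00766, Sparus aurata, human CTRL, Tetraodon nigroviridis.
hits = ["CGVPA", "CGTPA", "CGIPA", "CGVPG"]
profile = build_profile(hits)

print(round(profile[0]["C"], 2))   # fully conserved column
print(round(profile[2]["V"], 2))   # variable column
print(round(profile_score(profile, "CGVPA"), 2))
```

A fully conserved column (Cys at the first position) scores higher than a variable column, so a later search with this profile rewards matches at conserved positions more strongly.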
2.3.2.2 Method
BLASTp
1. The home page of the NCBI was reached using the address http://www.ncbi.nih.gov/
2. The BLAST option was selected.
3. From the Protein section, the “Protein-protein BLAST (blastp)” option was selected.
4. Then, in the “Search” box, the sequence of 1GCT was given as input.
5. The “nr” option was selected from the “Choose database” option.
6. “BLOSUM62” was selected in the “Matrix” box.
7. Then the “BLAST” command button was clicked.
8. Then the obtained result was saved in a Microsoft Word document.
NCBI home page -> BLAST -> Standard protein-protein BLAST -> Sequence of 1GCT pasted in Search window
-> BLOSUM62 selected from MATRIX option -> BLAST run -> Format clicked -> Result saved
PSI-BLAST
1. Starting with NCBI, “BLAST” search was selected and the options on that page were examined.
2. PSI-BLAST was chosen from “Protein BLAST”.
3. The protein (1GCT) sequence, saved previously in FASTA format, was pasted in the “Search” window.
4. From the MATRIX options, “BLOSUM62” was selected.
5. “PSI-BLAST” was chosen.
6. Then “BLAST” was run.
7. The result was then saved.
NCBI home page -> BLAST -> PSI-BLAST -> Sequence pasted on “Search” window -> “BLOSUM62” selected from
MATRIX option -> “Format for PSI-BLAST” chosen -> “BLAST” run -> Result saved
2.3.2.3 Results
BLASTp results
Figure 2.3: Graphical presentation of BLASTp results
More such results...
> ref|XP_608091.3| PREDICTED: chymotrypsinogen B1 [Bos taurus]

Length=300
GENE ID: 529639 CTRB2 | chymotrypsinogen B2 [Bos taurus]

Score = 496 bits (1278), Expect = 5e-139, Method: Compositional matrix adjust.
Identities = 245/245 (100%), Positives = 245/245 (100%), Gaps = 0/245 (0%)
Query 1 CGVPAIQPVLSGLSRIVNGEEAVPGSWPWQVSLQDKTGFHFCGGSLINENWVVTAAHCGV 60
CGVPAIQPVLSGLSRIVNGEEAVPGSWPWQVSLQDKTGFHFCGGSLINENWVVTAAHCGV
Sbjct 56 CGVPAIQPVLSGLSRIVNGEEAVPGSWPWQVSLQDKTGFHFCGGSLINENWVVTAAHCGV 115
Query 61 TTSDVVVAGEFDQGSSSEKIQKLKIAKVFKNSKYNSLTINNDITLLKLSTAASFSQTVSA 120

TTSDVVVAGEFDQGSSSEKIQKLKIAKVFKNSKYNSLTINNDITLLKLSTAASFSQTVSA
Sbjct 116 TTSDVVVAGEFDQGSSSEKIQKLKIAKVFKNSKYNSLTINNDITLLKLSTAASFSQTVSA 175
Query 121 VCLPSASDDFAAGTTCVTTGWGLTRYTNANTPDRLQQASLPLLSNTNCKKYWGTKIKDAM 180
VCLPSASDDFAAGTTCVTTGWGLTRYTNANTPDRLQQASLPLLSNTNCKKYWGTKIKDAM
Sbjct 176 VCLPSASDDFAAGTTCVTTGWGLTRYTNANTPDRLQQASLPLLSNTNCKKYWGTKIKDAM 235
Query 181 ICAGASGVSSCMGDSGGPLVCKKNGAWTLVGIVSWGSSTCSTSTPGVYARVTALVNWVQQ 240

ICAGASGVSSCMGDSGGPLVCKKNGAWTLVGIVSWGSSTCSTSTPGVYARVTALVNWVQQ
Sbjct 236 ICAGASGVSSCMGDSGGPLVCKKNGAWTLVGIVSWGSSTCSTSTPGVYARVTALVNWVQQ 295
Query 241 TLAAN 245

TLAAN
Sbjct 296 TLAAN 300
> sp|P00766.1|CTRA_BOVIN RecName: Full=Chymotrypsinogen A; Contains: RecName:

Full=Chymotrypsin
A chain A; Contains: RecName: Full=Chymotrypsin A
chain B; Contains: RecName: Full=Chymotrypsin A chain C
pdb|2CGA|A Chain A, Bovine Chymotrypsinogen A. X-Ray Crystal Structure Analysis
And Refinement Of A New Crystal Form At 1.8 Angstroms
Resolution
pdb|2CGA|B Chain B, Bovine Chymotrypsinogen A. X-Ray Crystal Structure Analysis
And Refinement Of A New Crystal Form At 1.8 Angstroms
Resolution
30 more sequence titles
pdb|1ACB|E Chain E, Crystal And Molecular Structure Of The Bovine Alpha-

Chymotrypsin-Eglin C Complex At 2.0 Angstroms Resolution
pdb|1CGI|E Chain E, Three-Dimensional Structure Of The Complexes Between
Bovine ChymotrypsinogenA And Two Recombinant Variants Of Human
Pancreatic Secretory Trypsin Inhibitor (Kazal-Type)
pdb|1CGJ|E Chain E, Three-Dimensional Structure Of The Complexes Between
Bovine ChymotrypsinogenA And Two Recombinant Variants Of Human
Pancreatic Secretory Trypsin Inhibitor (Kazal-Type)
pdb|1EX3|A Chain A, Crystal Structure Of Bovine Chymotrypsinogen A (Tetragonal)
pdb|1GL1|A Chain A, Structure Of The Complex Between Bovine Alpha-Chymotrypsin
And Pmp-C, An Inhibitor From The Insect Locusta Migratoria
pdb|1GL1|B Chain B, Structure Of The Complex Between Bovine Alpha-Chymotrypsin
pdb|1GL1|C Chain C, Structure Of The Complex Between Bovine Alpha-Chymotrypsin
pdb|1GL0|E Chain E, Structure Of The Complex Between Bovine Alpha-Chymotrypsin
And Pmp-D2v, An Inhibitor From The Insect Locusta Migratoria
pdb|1K2I|1 Chain 1, Crystal Structure Of Gamma-Chymotrypsin In Complex With
7- Hydroxycoumarin
pdb|1P2M|A Chain A, Structural Consequences Of Accommodation Of Four Non-
Cognate Amino-Acid Residues In The S1 Pocket Of Bovine Trypsin
And Chymotrypsin
pdb|1P2M|C Chain C, Structural Consequences Of Accommodation Of Four Non-
And Chymotrypsin
pdb|1P2N|A Chain A, Structural Consequences Of Accommodation Of Four Non-
And Chymotrypsin
pdb|1P2N|C Chain C, Structural Consequences Of Accommodation Of Four Non-
And Chymotrypsin
pdb|1P2O|A Chain A, Structural Consequences Of Accommodation Of Four Non-
And Chymotrypsin
pdb|1P2O|C Chain C, Structural Consequences Of Accommodation Of Four Non-
And Chymotrypsin
pdb|1P2Q|A Chain A, Structural Consequences Of Accommodation Of Four Non-
And Chymotrypsin
pdb|1P2Q|C Chain C, Structural Consequences Of Accommodation Of Four Non-
And Chymotrypsin
pdb|1OXG|A Chain A, Crystal Structure Of A Complex Formed Between Organic
Solvent Treated Bovine Alpha-Chymotrypsin And Its Autocatalytically
Produced Highly Potent 14-Residue Peptide At 2.2 Resolution
pdb|1T7C|A Chain A, Crystal Structure Of The P1 Glu Bpti Mutant- Bovine
Chymotrypsin Complex
pdb|1T7C|C Chain C, Crystal Structure Of The P1 Glu Bpti Mutant- Bovine
pdb|1T8L|A Chain A, Crystal Structure Of The P1 Met Bpti Mutant- Bovine
pdb|1T8L|C Chain C, Crystal Structure Of The P1 Met Bpti Mutant- Bovine
pdb|1T8M|A Chain A, Crystal Structure Of The P1 His Bpti Mutant- Bovine
pdb|1T8M|C Chain C, Crystal Structure Of The P1 His Bpti Mutant- Bovine
pdb|1T8N|A Chain A, Crystal Structure Of The P1 Thr Bpti Mutant- Bovine
pdb|1T8N|C Chain C, Crystal Structure Of The P1 Thr Bpti Mutant- Bovine
pdb|1T8O|A Chain A, Crystal Structure Of The P1 Trp Bpti Mutant- Bovine
pdb|1T8O|C Chain C, Crystal Structure Of The P1 Trp Bpti Mutant- Bovine
pdb|1CHG|A Chain A, Chymotrypsinogen,2.5 Angstroms Crystal Structure,
Comparison
With Alpha-Chymotrypsin,And Implications For Zymogen
Activation
pdb|1GCD|A Chain A, Refined Crystal Structure Of "aged" And "non-Aged"
Organophosphoryl Conjugates Of Gamma-Chymotrypsin
Length=245
GENE ID: 529639 CTRB2 | chymotrypsinogen B2 [Bos taurus]

Sbjct 1 CGVPAIQPVLSGLSRIVNGEEAVPGSWPWQVSLQDKTGFHFCGGSLINENWVVTAAHCGV 60


Sbjct 121 VCLPSASDDFAAGTTCVTTGWGLTRYTNANTPDRLQQASLPLLSNTNCKKYWGTKIKDAM 180

Query 241 TLAAN 245
TLAAN
Sbjct 241 TLAAN 245
> pdb|1GCT|A Chain A, Is Gamma-Chymotrypsin A Tetrapeptide Acyl-Enzyme Adduct

Of Gamma-Chymotrypsin?
pdb|2GCT|A Chain A, Structure Of Gamma-Chymotrypsin In The Range Ph 2.0
To Ph 10.5 Suggests That Gamma-Chymotrypsin Is A Covalent Acyl-
Enzyme Adduct At Low Ph
pdb|1GHB|E Chain E, A Second Active Site In Chymotrypsin? The X-Ray Crystal
Structure Of N-Acetyl-D-Tryptophan Bound To Gamma- Chymotrypsin
pdb|2GMT|A Chain A, Three-Dimensional Structure Of Chymotrypsin Inactivated
With (2s) N-Acetyl-L-Alanyl-L-Phenylalanyl-Chloroethyl Ketone:
Implications For The Mechanism Of Inactivation Of Serine
Proteases By Chloroketones
pdb|3GCH|A Chain A, Chemistry Of Caged Enzymes. Binding Of Photoreversible
Cinnamates To Chymotrypsin
Length=245
CGVPAIQPVLSGL IVNGEEAVPGSWPWQVSLQDKTGFHFCGGSLINENWVVTAAHCGV
Sbjct 1 CGVPAIQPVLSGLXXIVNGEEAVPGSWPWQVSLQDKTGFHFCGGSLINENWVVTAAHCGV 60


VCLPSASDDFAAGTTCVTTGWGLTRY ANTPDRLQQASLPLLSNTNCKKYWGTKIKDAM
Sbjct 121 VCLPSASDDFAAGTTCVTTGWGLTRYXXANTPDRLQQASLPLLSNTNCKKYWGTKIKDAM 180

Query 241 TLAAN 245

TLAAN
Sbjct 241 TLAAN 245
More such results
2.3.2.3.2 PSI BLAST RESULT (After 3 Iterations)
Figure 2.4: Graphical presentation of PSI-BLAST search result
More similar results...

>emb|CAG00821.1| unnamed protein product [Tetraodon nigroviridis]
Length=263
Score = 437 bits (1125), Expect = 3e-121, Method: Composition-based stats.
CGVP I PV++G SRIVNGEEAVP SWPWQVSLQ+ TGFHFCGGSLINENWVVTAAHC V
Sbjct 19 CGVPGIPPVITGYSRIVNGEEAVPHSWPWQVSLQEYTGFHFCGGSLINENWVVTAAHCNV 78
TS V+ GE D+ S++E IQ +++ +VFK+ YNS TINNDITL+KL++ A + VS
Sbjct 79 RTSHRVILGEHDRSSNNENIQVMQVGQVFKHPNYNSYTINNDITLIKLASPAQLNIRVSP 138
VC+ SD F G CVT+GWGLTRY +TP RLQQ +LPLL+N C+K+WG+KI D M
Sbjct 139 VCVAETSDVFPGGMKCVTSGWGLTRYNAPDTPPRLQQVALPLLTNEECRKHWGSKITDLM 198
+CAGASG SSCMGDSGGPLVC+K GAWTLVGIVSWGS CS S+PGVYARVT L W+ Q
Sbjct 199 VCAGASGASSCMGDSGGPLVCEKAGAWTLVGIVSWGSGFCSVSSPGVYARVTMLRAWMDQ 258
Query 241 TLAAN 245
+AAN
Sbjct 259 IIAAN 263
>gb|AAT45254.1| chymotrypsinogen 2-like protein [Sparus aurata]

gb|ABE68638.1| chymotrypsinogen II precursor [Sparus aurata]
Length=264
CG PAI PV++G SRIVNGEEAVP SWPWQVSLQD TGFHFCGGSLINENWVVTAAHC V
Sbjct 20 CGTPAISPVITGYSRIVNGEEAVPHSWPWQVSLQDYTGFHFCGGSLINENWVVTAAHCNV 79
TS V+ GE D+ S++E IQ +K+ KVFK+ YN TINNDI L+KL++ A VS
Sbjct 80 RTSHRVILGEHDRSSNAEAIQVMKVGKVFKHPNYNGYTINNDILLIKLASPAQMGMRVSP 139
VC+ +D+F G CVT+GWGLTRY +TP LQQASLPLL+N C++YWG+KI + M
Sbjct 140 VCVAETADNFPGGMRCVTSGWGLTRYNAPDTPALLQQASLPLLTNEQCRQYWGSKISNLM 199
ICAGASG SSCMGDSGGPLVC+K GAWTLVGIVSWGS TC+ + PGVYARVT L W+ Q
Sbjct 200 ICAGASGASSCMGDSGGPLVCEKAGAWTLVGIVSWGSGTCTPTMPGVYARVTELRAWMDQ 259
Query 241 TLAAN 245
+AAN
Sbjct 260 IIAAN 264
>ref|XP_536782.2| PREDICTED: similar to chymotrypsinogen B1 isoform 1 [Canis

familiaris]
Length=264
GENE ID: 479650 CTRB1 | chymotrypsinogen B1 [Canis lupus familiaris]
CGVPAI+PVL+GLSRIVNGE+AVPGSWPWQVSLQD TGFHFCGGSLI+E+WVVTAAHCGV
Sbjct 20 CGVPAIEPVLNGLSRIVNGEDAVPGSWPWQVSLQDSTGFHFCGGSLISEDWVVTAAHCGV 79
TS +VVAGEFDQ SS E IQ LKIA+VFKN K+N T+ NDITLLKL+T A FS+TVS
Sbjct 80 RTSHLVVAGEFDQSSSEENIQVLKIAEVFKNPKFNMFTVRNDITLLKLATPARFSETVSP 139
VCLP A+D+F G CVTTGWG T+Y TPD+LQQA+LPLLSN CKK+WG+KI D M
Sbjct 140 VCLPQATDEFPPGLMCVTTGWGRTKYNANKTPDKLQQAALPLLSNAECKKFWGSKITDVM 199
ICAGASGVSSCMGDSGGPLVC+K+GAWTLVGIVSWGS TCSTS P VY+RVT L+ WVQ+
Sbjct 200 ICAGASGVSSCMGDSGGPLVCQKDGAWTLVGIVSWGSGTCSTSVPAVYSRVTELIPWVQE 259
Query 241 TLAAN 245
LAAN
Sbjct 260 ILAAN 264
Similar search results...
Sequences Retrieved from PSI BLAST results in FASTA format:

>gi|117615|sp|P00766|
>pdb|4CHA|
CGVPAIQPVLSGLXXIVNGEEAVPGSWPWQVSLQDKTGFHFCGGSLINENWVVTAAHCGVTTSDVVVAGE
WGLTRYXXANTPDRLQQASLPLLSNTNCKKYWGTKIKDAMICAGASGVSSCMGDSGGPLVCKKNGAWTLV
>pdb|1GCT|CHYMOTRYPSIN*A
CGVPAIQPVLSGLIVNGEEAVPGSWPWQVSLQDKTGFHFCGGSLINENWVVTAAHCGVTTSDVVVAGEFD
QGSSSEKIQKLKIAKVFKNSKYNSLTINNDITLLKLSTAASFSQTVSAVCLPSASDDFAAGTTCVTTGWG
LTRYANTPDRLQQASLPLLSNTNCKKYWGTKIKDAMICAGASGVSSCMGDSGGPLVCKKNGAWTLVGIVS
WGSSTCSTSTPGVYARVTALVNWVQQTLAAN
>|CTRB2 protein[Human]
MASLWLLSCFSLVGAAFGCGVPAIHPVLSGLSRIVNGEDAVPGSWPWQVSLQDKTGFHFCGGSLISEDWV
VTAAHCGVRTSDVVVAGEFDQGSDEENIQVLKIAKVFKNPKFSILTVNNDITLLKLATPARFSQTVSAVC
LPSADDDFPAGTLCATTGWGKTKYNANKTPDKLQQAALPLLSNAECKKSWGRRITDVMICAGASGVSSCM
GDSGGPLVCQKDGAWTLVGIVSWGSRTCSTTTPAVYARVTKLIPWVQKILAAN
>|CTRL protein[Human]
LTSATMLLLSLTLSLVLLGSSWGCGIPAIKPALSFSQRIVNGENAVLGSWPWQVSLQDSSGFHFCGGSLI
SQSWVVTAAHCNVSPGRHFVVLGEYDRSSNAEPLQVLSVSRAITHPSWNSTTMNNDVTLLKLASPAQYTT
RISPVCLASSNEALTEGLTCVTTGWGRLSGVGNVTPAHLQQVALPLVTVNQCRQYWGSSITDSMICAGGA
GASSCQGDSGGPLVCQKGNTWVLIGIVSWGTKNCNVRAPAVYTRVSKFSTWINQVIAYN
>|Ela3 protein[Mouse]
PTRPQPSHNPSSRVVNGEEAVPHSWPWQVSLQYEKDGSFHHTCGGSLITPDWVLTAGHCISTSRTYQVVL
GEHERGVEEGQEQVIPINAGDLFVHPKWNSMCVSCGNDIALVKLSRSAQLGDAVQLACLPPAGEILPNGA
PCYISGWGRLSTNGPLPDKLQQALLPVVDYEHCSRWNWWGLSVKTTMVCAGGDIQSGCNGDSGGPLNCPA
DNGTWQVHGVTSFVSSLGCNTLRKPTVFTRVSAFIDWIEETIANN
>gi|chymotrypsinogen 2-like protein [Sparus aurata]
GTRFLWILSCLAFVGAAYGCGTPAISPVITGYSRIVNGEEAVPHSWPWQVSLQDYTGFHFCGGSLINENW
VVTAAHCNVRTSHRVILGEHDRSSNAEAIQVMKVGKVFKHPNYNGYTINNDILLIKLASPAQMGMRVSPV
CVAETADNFPGGMRCVTSGWGLTRYNAPDTPALLQQASLPLLTNEQCRQYWGSKISNLMICAGASGASSC
MGDSGGPLVCEKAGAWTLVGIVSWGSGTCTPTMPGVYARVTELRAWMDQIIAAN
>gi|Zebrafish [Danio rerio]
WLLSCVAFFSAAYGCGVPAIPPVVSGYARIVNGEEAVPHSWPWQVSLQDFTGFHFCGGSLINEFWVVTAA
HCSVRTSHRVILGEHNKGKSNTQEDIQTMKVSKVFTHPQYNSNTIENDIALVKLTAPASLNAHVSPVCLA
EASDNFASGMTCVTSGWGVTRYNALFTPDELQQVALPLLSNEDCKNHWGSNIRDTMICAGAAGASSCMGD
SGGPLVCQKDNIWTLVGIVSWGSSRCDPTMPGVYGRVTELRDWVDQILASN
>pdb|1S0Q|TRYPSINOGEN
IVGGYTCGANTVPYQVSLNSGYHFCGGSLINSQWVVSAAHCYKSGIQVRLGEDNINVVEGNEQFISASKS
IVHPSYNSNTLNNDIMLIKLKSAASLNSRVASISLPTSCASAGTQCLISGWGNTKSSGTSYPDVLKCLKA
PILSDSSCKSAYPGQITSNMFCAGYLEGGKDSCQGDSGGPVVCSGKLQGIVSWGSGCAQKNKPGVYTKVC
NYVSWIKQTIASN
>pdb|3PTN|TRYPSIN
PILSDSSCKSAYPGQITSNMFCAGYLEGGKDSCQGDSGGPVVCSGKLQGIVSWGSGCAQKNKPGVYTKVC
NYVSWIKQTIASN
>gi|PRSS2 protein [Bos taurus]
MHSLLILAFVGAAVAFPSDDDDKIVGGYTCAENSVPYQVSLNAGYHFCRGSLINDQWVVSAAHCYQYHIQ
VRLGEYNIDVLEGGEQFIDASKIIRHPKYSSWTLDNDILLIKLSTPAVINARVSTLALPSACASAGTECL
ISGWGNTLSSGVNYPDLLQCLEAPLLSHADCEASYPGEITNNMICAGFLEGGKDSCQGDSGGPVACNGQL
QGIVSWGYGCAQKGKPGVYTKVCNYVDWIQETIAANS
>gi|tryptase-III [Human]
LPVLASRAYAAPAPGQALQRVGIVGGQEAPRSKWPWQVSLRVRDRYWMHFCGGSLIHPQWVLTAAHCVGP
DVKDLAALRVQLREQHLYYQDQLLPVSRIIVHPQFYTAQIGADIALLELEEPVKVSSHVHTVTLPPASET
FPPGMPCWVTGWGDVDNDERLPPPFPLKQVKVPIMENHICDAKYHLGAYTGDDVRIVRDDMLCAGNTRRD
SCQGDSGGPLVCKVNGTWLQAGVVSWGEGCAQPNRPGIYTRVTYYLDWIHHYVPKKP
>gi|beta 1 tryptase [Gorilla gorilla]
MLNLLLLALPVLASPAYAAPAPGQALQRAGIVGGQEAPRSKWPWQVSLRVRGQYWMHFCGGSLIHPQWVL
TAAHCVGPDVKDLAALRVQLREQHLYYQDQLLPVSRIIVHPQFYTAQIGADIALLELEEPVNVSSHVHTV
TLPPASETFPPGMPCWVTGWGDVDNDERLPPPFPLKQVKVPIMENHICDAKYHLGAYTGDNVRIVRDDML
CAGNTRRDSCQGDSGGPLVCKVNGTWLQAGVVSWGEGCAQPNRPGIYTRVTYYLDWIHHYVPKKP
>gi|58257847|gb|AAW69366.1| try14 [Macaca mulatta]
MNPLLIFAFVGATVAAPFDDDDKIVGGYTCEENSLPYQVSLNSGSHFCGGSLINKQWVVSAAHCYKPRIQ
VRLGEHNIKVLEGNEQFIHAAKIIRHPKYNNETLDNDIMLVKLSTPAIINARVSTISLPSALAAAGTECL
ISGWGNTLSFGADYPDELQCLDAPVLTQAKCEASYPGKITSNMFCVGFLEGGKDSCQRDSGGPVVCNGQL
QGVVSWGYGCARKNRPGVYTKVYNYVDWIRDTIAANS
>pdb|5PTP|HYDROLASE
PILSDSSCKSAYPGQITSNMFCAGYLEGGKDSCQGDXGGPVVCSGKLQGIVSWGSGCAQKNKPGVYTKVC
NYVSWIKQTIASN
>pdb|3EST|ELASTASE
VVGGTEAQRNSWPSQISLQYRSGSSWAHTCGGTLIRQNWVMTAAHCVDRELTFRVVVGEHNLNQNNGTEQ
YVGVQKIVVHPYWNTDDVAAGYDIALLRLAQSVTLNSYVQLGVLPRAGTILANNSPCYITGWGLTRTNGQ
LAQTLQQAYLPTVDYAICSSSSYWGSTVKNSMVCAGGDGVRSGCQGDSGGPLHCLVNGQYAVHGVTSFVS
RLGCNVTRKPTVFTRVSAYISWINNVIASN
>pdb|1ZHR|FactorXI
IVGGTASVRGEWPWQVTLHTTSPTQRHLCGGSIIGNQWILTAAHCFYGVESPKILRVYSGILNQAEIAED
TSFFGVQEIIIHDQYKMAESGYDIALLKLETTVNYADSQRPISLPSKGDRNVIYTDCWVTGWGYRKLRDK
IQNTLQKAKIPLVTNEECQKRYRGHKITHKMICAGYREGGKDACKGDSGGPLSCKHNEVWHLVGITSWGE
GCAQRERPGVYTNVVEYVDWILEKTQAV
>pdb|1DDJ|PLASMINOGEN
SFDCGKPQVEPKKCPGRVVGGCVAHPHSWPWQVSLRTRFGMHFCGGTLISPEWVLTAAHCLEKSPRPSSY
KVILGAHQEVNLEPHVQEIEVSRLFLEPTRKDIALLKLSSPAVITDKVIPACLPSPNYVVADRTECFITG
WGETQGTFGAGLLKEAQLPVIENKVCNRYEFLNGRVQSTELCAGHLAGGTDSCQGDAGGPLVCFEKDKYI
LQGVTSWGLGCARPNKPGVYVRVSRFVTWIEGVMRNN
>|Mast cell protease6
MLKLLLLLALSPLASLVHAAPCPVKQRVGIVGGREASESKWPWQVSLRFKFSFWMHFCGGSLIHPQWVLT
AAHCVGLHIKSPELFRVQLREQYLYYADQLLTVNRTVVHPHYYTVEDGADIALLELENPVNVSTHIHPTS
LPPASETFPSGTSCWVTGWGDIDSDEPLLPPYPLKQVKVPIVENSLCDRKYHTGLYTGDDVPIVQDGMLC
AGNTRSDSCQGDSGGPLVCKVKGTWLQAGVVSWGEGCAEANRPGIYTRVTYYLDWIHRYVPQRS
>gi|899286|Hepsin
TSGFFCVDEGRLPHTQRLLEVISVCDCPRGRFLAAICQDCGRRKLPVDRIVGGRDTSLGRWPWQVSLRYD
GAHLCGGSLLSGDWVLTAAHCFPERNRVLSRWRVFAGAVAQASPHGLQLGVQAVVYHGGYLPFRDPNSEE
NSNDIALVHLSSPLPLTEYIQPVCLPAAGQALVDGKICTVTGWGNTQYYGQQAGVLQEARVPIISNDVCN
GADFYGNQIKPKMFCAGYPEGGIDACQGDSGGPFVCEDSISRTPRWRLCGIVSWGTGCALAQKPGVYTKV
SDFREWIFQAIKTHSEASGMVTQL
>pdb|1SPJ|KALLIKREIN
IVGGWECEQHSQPWQAALYHFSTFQCGGILVHRQWVLTAAHCISDNYQLWLGRHNLFDDENTAQFVHVSE
SFPHPGFNMSLLENHTRQADEDYSHDLMLLRLTEPADTITDAVKVVELPTEEPEVGSTCLASGWGSIEPE
NFSFPDDLQCVDLKILPNDECKKAHVQKVTDFMLCVGHLEGGKDTCVGDSGGPLMCDGVLQGVTSWGYVP
CGTPNKPSVAVRVLSYVKWIEDTIAENS
>pdb|1HCG|FACTOR X
IVGGQECKDGECPWQALLINEENEGFCGGTILSEFYILTAAHCLYQAKRFKVRVGDRNTEQEEGGEAVHE
VEVVIKHNRFTKETYDFDIAVLRLKTPITFRMNVAPACLPERDWAESTLMTQKTGIVSGFGRTHEKGRQS
TRLKMLEVPYVDRNSCKLSSSFIITQNMFCAGYDTKQEDACQGDSGGPHVTRFKDTYFVTGIVSWGEGCA
RKGKYGIYTKVTAFLKWIDRSMKTRGLPKAK
>pdb|1HYL|COLLAGENASE
IINGYEAYTGLFPYQAGLDITLQDQRRVWCGGSLIDNKWILTAAHCVHDAVSVVVYLGSAVQYEGEAVVN
SERIISHSMFNPDTYLNDVALIKIPHVEYTDNIQPIRLPSGEELNNKFENIWATVSGWGQSNTDTVILQY
TYNLVIDNDRCAQEYPPGIIVESTICGDTSDGKSPCFGDSGGPFVLSDKNLLIGVVSFVSGAGCESGKPV
GFSRVTSYMDWIQQNTGIKF
>gi|Cold-Adaption Enzymes [Salmon]
IVGGYECKAYSQAHQVSLNSGYHFCGGSLVNENWVVSAAHCYKSRVEVRLGEHNIKVTEGSEQFISSSRV
IRHPNYSSYNIDNDIMLIKLSKPATLNTYVQPVALPTSCAPAGTMCTVSGWGNTMSSTADSDKLQCLNIP
ILSYSDCNDSYPGMITNAMFCAGYLEGGKDSCQGDSGGPVVCNGELQGVVSWGYGCAEPGNPGVYAKVCI
FSDWLTSTMASY
2.3.3 Sequence Alignment Study
2.3.3.1 Pair-wise alignment
Sequence alignment is the procedure of comparing two (pair-wise alignment) or more (multiple
sequence alignment) sequences by searching for a series of individual characters or character
patterns that are in the same order in the sequences. Two sequences are aligned by writing them
across a page in two rows. Identical or similar characters are placed in the same column, and
nonidentical characters can either be placed in the same column as a mismatch or opposite a gap in
the other sequence. In an optimal alignment, nonidentical characters and gaps are placed to bring as
many identical or similar characters as possible into vertical register. Sequences that can be
readily aligned in this manner are said to be similar.
There are two types of sequence alignment, global and local. In global alignment, an attempt is made
to align the entire sequence, using as many characters as possible, up to both ends of each sequence.
Sequences that are quite similar and approximately the same length are suitable candidates for global
alignment. In local alignment, stretches of sequence with the highest density of matches are aligned,
thus generating one or more islands of matches or subalignments in the aligned sequences. Local
alignments are more suitable for aligning sequences that are similar along some of their lengths but
dissimilar in others, sequences that differ in length, or sequences that share a conserved region or
domain.
Pairwise alignment is the process by which a pair of sequences is compared by a sequence alignment
technique, either global or local. It can also be performed with a dot plot.
Global alignment:
LGPSSKQTGKGS-SRIWDN
LN-ITKSAGKGAIMRLGDA
Local alignment:
-------TGKG--------
-------AGKG--------
Distinction between global and local alignments of two sequences.
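The difference between the two approaches can be sketched with dynamic programming. The code below is a minimal illustration that assumes a very simple scoring scheme (match +1, mismatch -1, gap -1) instead of a substitution matrix such as BLOSUM62 with affine gap penalties, so its scores are not comparable with BLAST scores. The function names are illustrative.

```python
def global_score(a, b, match=1, mismatch=-1, gap=-1):
    """Needleman-Wunsch score: both sequences are aligned end to end."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        cur = [i * gap]
        for j in range(1, len(b) + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            cur.append(max(prev[j - 1] + s, prev[j] + gap, cur[j - 1] + gap))
        prev = cur
    return prev[-1]


def local_score(a, b, match=1, mismatch=-1, gap=-1):
    """Smith-Waterman score: best-scoring pair of subsequences."""
    best = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        cur = [0]
        for j in range(1, len(b) + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            # A cell never drops below zero, so an alignment can restart anywhere.
            cur.append(max(0, prev[j - 1] + s, prev[j] + gap, cur[j - 1] + gap))
            best = max(best, cur[j])
        prev = cur
    return best


# The two example sequences above, with the gap characters removed.
SEQ_A = "LGPSSKQTGKGSSRIWDN"
SEQ_B = "LNITKSAGKGAIMRLGDA"

print("global:", global_score(SEQ_A, SEQ_B))
print("local: ", local_score(SEQ_A, SEQ_B))
```

Because the local score is floored at zero and taken as the best value anywhere in the matrix, it is never lower than the global score for the same pair and scoring scheme.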
2.3.3.1.1 Software/Program
BLAST2 sequence
This tool produces the alignment of two given sequences using the BLAST engine for local alignment.
While the standard BLAST program is widely used to search for homologous sequences in nucleotide and
protein databases, one often needs to compare only two sequences that are already known to be
homologous, coming from related species or, e.g., different isolates of the same virus. In such cases
searching the entire database would be unnecessarily time-consuming. 'BLAST 2 Sequences' utilizes the
BLAST algorithm for pair-wise DNA-DNA or protein-protein sequence comparison. The results of BLAST 2
Sequences give information about the similarities and identities between the subject protein and the
query protein. It also gives a graphical representation of the alignment.
2.3.3.1.2 Methods
1. Starting with NCBI, “BLAST” search was selected.
2. “Align two sequences (bl2seq)” was chosen from Special databases.
3. “blastp” was chosen from Program and, along with it, “BLOSUM62” was automatically selected in the
Matrix options.
4. The query sequence was pasted from the saved file in the first window.
5. The subject sequence was pasted from file in the second window.
6. “Align” was clicked.
7. The results were saved.
NCBI home page -> BLAST -> Align two sequences (bl2seq) -> “blastp” chosen from Program -> Query &
Subject seq. pasted in separate windows -> Align -> Result saved
2.3.3.1.3 Result
Pair-wise alignment results were obtained separately for each sequence. Among those, the result for
P00766 and the cold adaptation enzyme is given below:
Figure 2.5: Graphical representation of pair-wise alignment
Table 2.1: Pair-wise alignment results for retrieved sequences to identify similarities
Sequence 01 in every attempt: P00766 (S = Bos taurus, L = 245)
No.  Sequence 02                 S                L     Score            Expect   Identities      Positives       Gaps
01   Cold adaptation enzyme      salmon           231   164 bits (414)   1e-38    97/231 (41%)    137/231 (59%)   12/ (5%)
02   pdb|4CHA                    -                245   494 bits (1271)  5e-138   241/245 (98%)   241/245 (98%)   0/245 (0%)
03   1GCT (chymotrypsin)         -                245   485 bits (1249)  2e-135   241/245 (98%)   241/245 (98%)   4/245 (1%)
04   CTRB2 protein               human            263   417 bits (1072)  6e-115   199/245 (81%)   215/245 (87%)   0/245 (0%)
05   CTRL protein                -                269   294 bits (752)   7e-78    132/246 (53%)   174/246 (70%)   1/246 (0%)
06   Ela3 protein                mouse            255   197 bits (502)   7e-49    111/253 (43%)   153/253 (60%)   16/253 (6%)
07   chymotrypsin-like protein   Sparus aurata    -     353 bits (907)   8e-96    165/245 (67%)   192/245 (78%)   0/245 (0%)
08   -                           zebrafish        261   347 bits (890)   7e-94    166/247 (67%)   197/247 (79%)   2/247 (0%)
09   trypsinogen                 -                223   175 bits (444)   4e-42    98/232 (42%)    140/232 (60%)   11/232 (4%)
10   3PTN (trypsin)              -                223   175 bits (444)   4e-42    98/232 (42%)    140/232 (60%)   11/232 (4%)
11   PRSS2                       Bos taurus       247   179 bits (454)   3e-43    104/233 (44%)   139/233 (59%)   4/233 (2%)
12   tryptase                    human            267   166 bits (428)   2e-39    92/237 (38%)    126/237 (53%)   14/237 (5%)
13   beta tryptase               Gorilla          275   164 bits (416)   7e-39    91/237 (38%)    127/237 (53%)   14/237 (5%)
14   try14                       Macaca mulatta   247   171 bits (434)   6e-41    102/233 (43%)   134/233 (57%)   11/233 (4%)
15   Hydrolase                   -                223   173 bits (439)   1e-41    97/232 (41%)    139/232 (59%)   11/232 (4%)
16   Elastase                    -                240   162 bits (409)   4e-38    95/241 (39%)    137/241 (56%)   12/241 (4%)
17   Factor XI                   human            238   169 bits (428)   3e-40    93/238 (39%)    125/238 (52%)   10/238 (4%)
18   plasminogen                 human            247   171 bits (432)   1e-40    95/253 (37%)    137/253 (54%)   17/253 (6%)
19   protease (mast cell)        -                274   162 bits (411)   3e-38    90/239 (37%)    127/239 (53%)   14/239 (5%)
20   Hepsin                      human            304   159 bits (403)   2e-37    95/249 (38%)    130/249 (52%)   23/243 (9%)
21   Kallikrein                  human            238   132 bits (331)   5e-29    83/245 (33%)    123/245 (50%)   23/245 (9%)
22   Factor X                    human            241   120 bits (302)   1e-25    72/237 (30%)    117/237 (49%)   15/237 (6%)
23   collagenase                 human            230   98 bits (244)    6e-19    74/235 (31%)    117/235 (49%)   21/235 (8%)
Here, S = source, L = length; a dash marks a value that was left blank.
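The percentages in Table 2.1 can be reproduced from the fractions beside them. Most of the table entries appear to be truncated rather than rounded (for example 215/245 is reported as 87%, not 88%). The sketch below assumes this truncation and uses a hypothetical helper name; it is an arithmetic check, not part of BLAST.

```python
def blast_percent(count, alignment_length):
    """Percentage as a whole number, truncated toward zero."""
    return (100 * count) // alignment_length


# (identities, positives, gaps, alignment length) for attempts 01 and 04.
rows = {
    "01": (97, 137, 12, 231),
    "04": (199, 215, 0, 245),
}

for attempt, (identities, positives, gaps, length) in rows.items():
    print(attempt,
          blast_percent(identities, length),
          blast_percent(positives, length),
          blast_percent(gaps, length))
```

For attempt 01 this gives 41, 59 and 5, and for attempt 04 it gives 81, 87 and 0, in agreement with the table.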
2.3.3.2 Multiple Sequence Alignment
One of the major contributions of molecular biology to evolutionary analysis is the discovery
that the DNA sequences of different organisms are often related. Similar genes are conserved
across widely divergent species, often performing a similar or even identical function, and at
other times, mutating or rearranging to perform an altered function through the forces of natural
selection. Thus, many genes are represented in highly conserved forms in organisms. Through
simultaneous alignment of the sequences of these genes, sequence patterns that have been subject to
alteration may be analyzed.
Because the potential for learning about the structure and function of molecules by
multiple sequence alignment (msa) is so great, computational methods have received a
great deal of attention. In msa, sequences are aligned optimally by bringing the greatest
number of similar characters into register in the same column of the alignment, just as
described in Section 2.3.3.1 for the alignment of two sequences. Computationally, msa presents
several difficult challenges. First, finding an optimal alignment of more than two sequences
that includes matches, mismatches, and gaps, and that takes into account the degree of
variation in all of the sequences at the same time poses a very difficult challenge. The
dynamic programming algorithm used for optimal alignment of pairs of sequences can be
extended to three sequences, but for more than three sequences, only a small number of
relatively short sequences may be analyzed. Thus, approximate methods are used, including (1) a
progressive global alignment of the sequences starting with an alignment of the
most alike sequences and then building an alignment by adding more sequences, (2) iterative methods
that make an initial alignment of groups of sequences and then revise the
alignment to achieve a more reasonable result, (3) alignments based on locally conserved
patterns found in the same order in the sequences, and (4) use of statistical methods and
probabilistic models of the sequences. A second computational challenge is identifying a
reasonable method of obtaining a cumulative score for the substitutions in the column of
an msa. Finally, the placement and scoring of gaps in the various sequences of an msa presents an
additional challenge.
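The reason that exact dynamic programming becomes impractical can be shown with simple arithmetic. The sketch below assumes the straightforward extension of the pair-wise method, in which N sequences of length L need a scoring lattice of (L + 1)^N cells. The function name is illustrative, and 245 is the length of P00766.

```python
def dp_cells(length, n_sequences):
    """Number of lattice cells for exact alignment of n sequences of equal length."""
    return (length + 1) ** n_sequences


for n in (2, 3, 5, 10):
    print(n, "sequences:", dp_cells(245, n), "cells")
```

Two sequences need about sixty thousand cells and three need about fifteen million, while each further sequence multiplies the count by 246. This growth is why approximate methods are used for larger sets.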
The msa of a set of sequences may also be viewed as an evolutionary history of the sequences. If the
sequences in the msa align very well, they are likely to be recently derived from a common ancestor
sequence. Conversely, a group of poorly aligned sequences share a more complex and distant
evolutionary relationship. The task of aligning a set of sequences, some more closely and others less
closely related, is identical to that of discovering the evolutionary relationships among the sequences.
As with aligning a pair of sequences, the difficulty in aligning a group of sequences varies
considerably with sequence similarity. On the one hand, if the amount of sequence variation is minimal,
it is quite straightforward to align the sequences, even without the assistance of a computer program.
On the other hand, if the amount of sequence variation is
great, it may be very difficult to find an optimal alignment of the sequences because so
many combinations of substitutions, insertions, and deletions, each predicting a different
alignment, are possible.
Figure 2.6: Algorithm of a software performing multiple sequence alignment
2.3.3.2.1 Software/ Tools
CLUSTALW
CLUSTALW is a general-purpose multiple sequence alignment program for DNA or proteins.
It produces biologically meaningful multiple sequence alignments of divergent sequences. It
is a fully automated tool that returns the best match over the total length of the input
sequences, whether protein or nucleic acid. The program
follows these steps:
1. Perform pairwise alignments of all of the sequences.
2. Use the alignment scores to produce a phylogenetic tree using neighbor-joining methods.
3. Align the sequences sequentially, guided by the phylogenetic relationships indicated by
the tree.
CLUSTALW improves the sensitivity of progressive multiple sequence alignment through sequence
weighting, position-specific gap penalties and weight matrix choice. Evolutionary relationships can
also be viewed as cladograms or phylograms. The sequence alignment is performed as a global
alignment.
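The first step above, pairwise global alignment, can be sketched with the classic Needleman-Wunsch dynamic programming recurrence. This is a minimal illustration with an assumed linear gap penalty and a toy match/mismatch score; CLUSTALW itself uses substitution matrices and more elaborate gap penalties, so the function below is not its implementation.

```python
def global_alignment_score(s, t, match=1, mismatch=-1, gap=-2):
    """Needleman-Wunsch score of the best global alignment of s and t."""
    # prev[j] holds the best score for aligning the processed prefix of s
    # with the first j symbols of t.
    prev = [j * gap for j in range(len(t) + 1)]
    for i in range(1, len(s) + 1):
        curr = [i * gap]  # aligning s[:i] against an empty prefix of t
        for j in range(1, len(t) + 1):
            diag = prev[j - 1] + (match if s[i - 1] == t[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]
```

Because the alignment is global, every residue of both sequences must be accounted for, which is why a length difference always costs at least one gap penalty.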
JalView
Jalview is a multiple alignment editor written entirely in Java. It was initially intended as a
visualization tool for the Pfam CORBA server and client at the EBI but is available as a
general-purpose alignment editor. It is used widely in a variety of web pages (e.g. the EBI
ClustalW server and the Pfam protein domain database). Jalview is also a phylogenetic tree
drawing program. Phylogenetic relationships are
patterns of shared common history between biological replicators.
2.3.3.2.2 Method
1. Starting with the EBI home page (http://www.ebi.ac.uk), European Bioinformatics
Institute was selected.
2. “Toolbox” was clicked and then “Sequence Analysis” was chosen from the drop-down
menu.
3. From the tools available, “CLUSTALW” was selected for multiple sequence
alignment.
4. “Full” was chosen from the Alignment option.
5. “Blosum” was chosen from Matrix.
6. “Input” was selected from Output Order.
7. The sequences similar to our query protein sequence were pasted from a file
in FASTA format into the given window.
8. The program was run.
9. “Show Colour” was clicked.
10. The result was saved.
Flow chart: EBI home page → European Bioinformatics Institute → Toolbox → Sequence Analysis → CLUSTALW → Parameters (Alignment: Fast; Matrix: Blosum; Output order: Input) → Sequences pasted → Run → Show Colour → Result saved
2.3.3.2.3 Results
Clustal results are most informative when sequences that introduce extensive gaps are omitted,
because the multiple sequence alignment here is a global alignment. After
omitting the sequences that caused too many gaps when matched against the P00766 sequence, we had 14
meaningful sequences overall. They are:
>gi|117615|sp|P00766|
>pdb|4CHA|
CGVPAIQPVLSGLXXIVNGEEAVPGSWPWQVSLQDKTGFHFCGGSLINENWVVTAAHCGVTTSDVVVAGE
WGLTRYXXANTPDRLQQASLPLLSNTNCKKYWGTKIKDAMICAGASGVSSCMGDSGGPLVCKKNGAWTLV
>pdb|1GCT|CHYMOTRYPSIN*A
CGVPAIQPVLSGLIVNGEEAVPGSWPWQVSLQDKTGFHFCGGSLINENWVVTAAHCGVTTSDVVVAGEFD
QGSSSEKIQKLKIAKVFKNSKYNSLTINNDITLLKLSTAASFSQTVSAVCLPSASDDFAAGTTCVTTGWG
LTRYANTPDRLQQASLPLLSNTNCKKYWGTKIKDAMICAGASGVSSCMGDSGGPLVCKKNGAWTLVGIVS
WGSSTCSTSTPGVYARVTALVNWVQQTLAAN
>|CTRB2 protein[Human]
MASLWLLSCFSLVGAAFGCGVPAIHPVLSGLSRIVNGEDAVPGSWPWQVSLQDKTGFHFCGGSLISEDWV
VTAAHCGVRTSDVVVAGEFDQGSDEENIQVLKIAKVFKNPKFSILTVNNDITLLKLATPARFSQTVSAVC
LPSADDDFPAGTLCATTGWGKTKYNANKTPDKLQQAALPLLSNAECKKSWGRRITDVMICAGASGVSSCM
GDSGGPLVCQKDGAWTLVGIVSWGSRTCSTTTPAVYARVTKLIPWVQKILAAN
>|CTRL protein[Human]
LTSATMLLLSLTLSLVLLGSSWGCGIPAIKPALSFSQRIVNGENAVLGSWPWQVSLQDSSGFHFCGGSLI
SQSWVVTAAHCNVSPGRHFVVLGEYDRSSNAEPLQVLSVSRAITHPSWNSTTMNNDVTLLKLASPAQYTT
RISPVCLASSNEALTEGLTCVTTGWGRLSGVGNVTPAHLQQVALPLVTVNQCRQYWGSSITDSMICAGGA
GASSCQGDSGGPLVCQKGNTWVLIGIVSWGTKNCNVRAPAVYTRVSKFSTWINQVIAYN
>|Ela3 protein[Mouse]
PTRPQPSHNPSSRVVNGEEAVPHSWPWQVSLQYEKDGSFHHTCGGSLITPDWVLTAGHCISTSRTYQVVL
GEHERGVEEGQEQVIPINAGDLFVHPKWNSMCVSCGNDIALVKLSRSAQLGDAVQLACLPPAGEILPNGA
PCYISGWGRLSTNGPLPDKLQQALLPVVDYEHCSRWNWWGLSVKTTMVCAGGDIQSGCNGDSGGPLNCPA
DNGTWQVHGVTSFVSSLGCNTLRKPTVFTRVSAFIDWIEETIANN
>gi|chymotrypsinogen 2-like protein [Sparus aurata]
GTRFLWILSCLAFVGAAYGCGTPAISPVITGYSRIVNGEEAVPHSWPWQVSLQDYTGFHFCGGSLINENW
VVTAAHCNVRTSHRVILGEHDRSSNAEAIQVMKVGKVFKHPNYNGYTINNDILLIKLASPAQMGMRVSPV
CVAETADNFPGGMRCVTSGWGLTRYNAPDTPALLQQASLPLLTNEQCRQYWGSKISNLMICAGASGASSC
MGDSGGPLVCEKAGAWTLVGIVSWGSGTCTPTMPGVYARVTELRAWMDQIIAAN
>gi|Zebrafish [Danio rerio]
WLLSCVAFFSAAYGCGVPAIPPVVSGYARIVNGEEAVPHSWPWQVSLQDFTGFHFCGGSLINEFWVVTAA
HCSVRTSHRVILGEHNKGKSNTQEDIQTMKVSKVFTHPQYNSNTIENDIALVKLTAPASLNAHVSPVCLA
EASDNFASGMTCVTSGWGVTRYNALFTPDELQQVALPLLSNEDCKNHWGSNIRDTMICAGAAGASSCMGD
SGGPLVCQKDNIWTLVGIVSWGSSRCDPTMPGVYGRVTELRDWVDQILASN
>gi|PRSS2 protein [Bos taurus]
MHSLLILAFVGAAVAFPSDDDDKIVGGYTCAENSVPYQVSLNAGYHFCRGSLINDQWVVSAAHCYQYHIQ
VRLGEYNIDVLEGGEQFIDASKIIRHPKYSSWTLDNDILLIKLSTPAVINARVSTLALPSACASAGTECL
ISGWGNTLSSGVNYPDLLQCLEAPLLSHADCEASYPGEITNNMICAGFLEGGKDSCQGDSGGPVACNGQL
QGIVSWGYGCAQKGKPGVYTKVCNYVDWIQETIAANS
>gi|tryptase-III [Human]
LPVLASRAYAAPAPGQALQRVGIVGGQEAPRSKWPWQVSLRVRDRYWMHFCGGSLIHPQWVLTAAHCVGP
DVKDLAALRVQLREQHLYYQDQLLPVSRIIVHPQFYTAQIGADIALLELEEPVKVSSHVHTVTLPPASET
FPPGMPCWVTGWGDVDNDERLPPPFPLKQVKVPIMENHICDAKYHLGAYTGDDVRIVRDDMLCAGNTRRD
SCQGDSGGPLVCKVNGTWLQAGVVSWGEGCAQPNRPGIYTRVTYYLDWIHHYVPKKP
>gi|beta 1 tryptase [Gorilla gorilla]
MLNLLLLALPVLASPAYAAPAPGQALQRAGIVGGQEAPRSKWPWQVSLRVRGQYWMHFCGGSLIHPQWVL
TAAHCVGPDVKDLAALRVQLREQHLYYQDQLLPVSRIIVHPQFYTAQIGADIALLELEEPVNVSSHVHTV
TLPPASETFPPGMPCWVTGWGDVDNDERLPPPFPLKQVKVPIMENHICDAKYHLGAYTGDNVRIVRDDML
CAGNTRRDSCQGDSGGPLVCKVNGTWLQAGVVSWGEGCAQPNRPGIYTRVTYYLDWIHHYVPKKP
>gi|58257847|gb|AAW69366.1| try14 [Macaca mulatta]
MNPLLIFAFVGATVAAPFDDDDKIVGGYTCEENSLPYQVSLNSGSHFCGGSLINKQWVVSAAHCYKPRIQ
VRLGEHNIKVLEGNEQFIHAAKIIRHPKYNNETLDNDIMLVKLSTPAIINARVSTISLPSALAAAGTECL
ISGWGNTLSFGADYPDELQCLDAPVLTQAKCEASYPGKITSNMFCVGFLEGGKDSCQRDSGGPVVCNGQL
QGVVSWGYGCARKNRPGVYTKVYNYVDWIRDTIAANS
>pdb|1DDJ|PLASMINOGEN
SFDCGKPQVEPKKCPGRVVGGCVAHPHSWPWQVSLRTRFGMHFCGGTLISPEWVLTAAHCLEKSPRPSSY
KVILGAHQEVNLEPHVQEIEVSRLFLEPTRKDIALLKLSSPAVITDKVIPACLPSPNYVVADRTECFITG
WGETQGTFGAGLLKEAQLPVIENKVCNRYEFLNGRVQSTELCAGHLAGGTDSCQGDAGGPLVCFEKDKYI
LQGVTSWGLGCARPNKPGVYVRVSRFVTWIEGVMRNN
>|Mast cell protease6
MLKLLLLLALSPLASLVHAAPCPVKQRVGIVGGREASESKWPWQVSLRFKFSFWMHFCGGSLIHPQWVLT
AAHCVGLHIKSPELFRVQLREQYLYYADQLLTVNRTVVHPHYYTVEDGADIALLELENPVNVSTHIHPTS
LPPASETFPSGTSCWVTGWGDIDSDEPLLPPYPLKQVKVPIVENSLCDRKYHTGLYTGDDVPIVQDGMLC
AGNTRSDSCQGDSGGPLVCKVKGTWLQAGVVSWGEGCAEANRPGIYTRVTYYLDWIHRYVPQRS
CLUSTAL W results
Figure 2.7: Multiple Sequence Alignment(MSA)
JalView result
Figure 2.8: Multiple Sequence Alignment (MSA), JalView results
Similar results were obtained when the output order parameter was set to “aligned” instead of
“input”.
2.3.4 Phylogenetic tree Construction
2.3.4.1 Software/Tools
CLUSTALW
Multiple sequence comparisons help highlight weak sequence similarity and shed
light on structure, function, or origin. The most widely used programs for global
multiple sequence alignment are from the Clustal series of programs.
CLUSTALW and CLUSTALX are progressive alignment programs that follow these
steps:
1. Perform pairwise alignments of all of the sequences.
2. Use the alignment scores to produce a phylogenetic tree using neighbor-joining
methods.
3. Align the sequences sequentially, guided by the phylogenetic relationships
indicated by the tree.
ClustalW is used to align DNA or protein sequences in order to elucidate their relatedness as
well as their evolutionary origin.
CLUSTALW improves the sensitivity of progressive multiple sequence alignment through
sequence weighting, position-specific gap penalties and weight matrix choice. The initial pairwise
alignments are calculated using an enhanced dynamic programming algorithm, and the
genetic distances used to create the phylogenetic tree are calculated by dividing the total
number of mismatched positions by the total number of matched positions. The resulting
evolutionary relationships can be viewed either as cladograms or phylograms, with the option
to display branch lengths (or “tree graph distances”).
Web link: http://www.ebi.ac.uk
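The distance described above (mismatched positions divided by matched positions) can be sketched as follows. The function names are hypothetical, and treating columns that contain a gap as neither matched nor mismatched is an assumption made here for simplicity; many programs instead divide the mismatches by the total number of compared positions.

```python
def mismatch_ratio(row_a, row_b):
    """Mismatched positions divided by matched positions for two aligned rows."""
    matched = mismatched = 0
    for a, b in zip(row_a, row_b):
        if a == "-" or b == "-":
            continue  # assumption: gapped columns are not counted at all
        if a == b:
            matched += 1
        else:
            mismatched += 1
    # With no matched positions the ratio is undefined; report it as infinite.
    return mismatched / matched if matched else float("inf")


def distance_matrix(rows):
    """Square matrix of pairwise distances between the rows of an alignment."""
    return [[mismatch_ratio(a, b) for b in rows] for a in rows]
```

A tree-building method such as neighbor-joining would then take this matrix as its input.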
JalView
Jalview is a multiple alignment editor written entirely in Java. It was initially intended
as a visualization tool for the Pfam CORBA server and client at the EBI but is available
as a general-purpose alignment editor. It is used widely in a variety of web pages (e.g.
the EBI ClustalW server and the Pfam protein domain database). Jalview is also a phylogenetic
tree drawing program. Phylogenetic
relationships are patterns of shared common history between biological replicators.
Web link: http://www.ebi.ac.uk/jalview
2.3.4.2 Methods
Using CLUSTALW
1. Starting with the EBI home page (http://www.ebi.ac.uk), European Bioinformatics Institute
was selected.
2. “Toolbox” was clicked and then “Sequence Analysis” was chosen.
3. From the drop down menu “CLUSTALW” was selected.
4. The following parameters were selected from Output and Phylogenetic tree:
• TREE TYPE: nj
• CORRECTDISTANCE: on
• IGNORE GAPS: on
5. The multiple sequence alignment result previously obtained from CLUSTALW was
pasted.
6. The program was then run for phylogenetic tree construction.
7. To view the phylogenetic tree, “Show as Phylogram Tree” was clicked.
8. The resulting phylogenetic tree was saved.
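The two tree options chosen in step 4 can be illustrated with a short sketch. It assumes that “ignore gaps” means discarding every alignment column in which any sequence has a gap, and that “correct distance” refers to Kimura's empirical correction for multiple substitutions in proteins, D = -ln(1 - p - 0.2p²). Both readings are assumptions about the options rather than a reproduction of the CLUSTALW source, and all function names are hypothetical.

```python
import math


def drop_gapped_columns(rows):
    """Keep only the columns in which no sequence has a gap ('-')."""
    kept = [col for col in zip(*rows) if "-" not in col]
    if not kept:
        return ["" for _ in rows]
    return ["".join(chars) for chars in zip(*kept)]


def p_distance(row_a, row_b):
    """Fraction of aligned positions at which two gap-free rows differ."""
    if not row_a:
        raise ValueError("no columns left to compare")
    return sum(a != b for a, b in zip(row_a, row_b)) / len(row_a)


def kimura_corrected(p):
    """Kimura's correction; only defined while 1 - p - 0.2*p*p stays positive."""
    return -math.log(1.0 - p - 0.2 * p * p)
```

The corrected distance is always at least as large as the observed fraction of differences, reflecting substitutions that are hidden by later changes at the same position.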
Flow chart: EBI home page (http://www.ebi.ac.uk) → European Bioinformatics Institute → Toolbox → Sequence Analysis → CLUSTALW → Parameters (Tree type: nj; Correct distance: on; Ignore gaps: on) → Sequences of MSA pasted → Run → Show as Phylogram Tree → Result saved
Using JalView
1. Starting with the EBI home page (http://www.ebi.ac.uk), European Bioinformatics
Institute was selected.
2. “Toolbox” was clicked and then “Sequence Analysis” was chosen.
3. From the tools available, “CLUSTALW” was selected for multiple sequence
alignment.
4. The parameters chosen were: “Full” for Alignment, “Blosum” for Matrix and “Input”
for Output Order.
5. The sequences were pasted in the given window.
6. The program was run.
7. “JalView” was clicked from Results of search.
8. “Neighbour joining tree using JalView” was chosen from Calculate.
9. The phylogram tree was saved.
Flow chart: EBI home page (http://www.ebi.ac.uk) → European Bioinformatics Institute → Toolbox → Sequence Analysis → CLUSTALW → Parameters (Alignment: “Fast”; Matrix: “Blosum”; Output order: “Input”) → Sequences pasted → Run → “JalView” from Results → Calculate → Neighbour joining tree using PID → Phylogram tree saved
2.3.4.3 Results
Newick file for phylogenetic tree construction
Figure 2.9: Newick presentation
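The Newick format writes a tree as nested parentheses, with each branch length following a colon and a semicolon closing the tree. As a minimal sketch, the hypothetical helper below serialises a small tree held in nested Python tuples; the tuple layout and the leaf names are assumptions for illustration and are not the tree obtained in this project.

```python
def to_newick(root):
    """Serialise a tree to Newick text.

    A leaf is (name, branch_length); an internal node is
    ([child, child, ...], branch_length). The root's own length is ignored.
    """
    def render(node):
        content, length = node
        if isinstance(content, str):
            label = content  # leaf: just its name
        else:
            label = "(" + ",".join(render(child) for child in content) + ")"
        return f"{label}:{length}"

    children, _ = root
    return "(" + ",".join(render(child) for child in children) + ");"
```

For example, a tree in which leaves B and C share an internal branch of length 0.4 is written as (A:0.1,(B:0.2,C:0.3):0.4);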
Cladogram
Fig 2.10: Phylogenetic tree (cladogram) from homologous sequences of P00766
Phylogram
Fig 2.11: Phylogenetic tree (phylogram) from homologous sequences of P00766
Phylogenetic tree using JalView
Figure 2.12: Phylogenetic tree by JalView
2.3.5 Secondary Structure Prediction
2.3.5.1 Software/ Tools
A protein's secondary structure depends on its primary sequence. Several software tools can
be used to predict secondary structure. Some of them are described below:
2.3.5.1.1 PSI-Pred
PSIPRED is a software tool provided by University College London (UCL). It is widely used
to predict secondary structure from sequence. The PSIPRED protein structure prediction server allows
one to submit a protein sequence, perform a prediction of one's choice and receive the results of the
prediction via e-mail. PSIPRED is a simple and reliable secondary structure prediction method,
incorporating two feed-forward neural networks which perform an analysis on output obtained from
PSI-BLAST (Position Specific Iterated BLAST). It is a highly accurate method for protein secondary
structure prediction.
2.3.5.1.2 Neural Network
A Neural Network (NN) is a special type of problem-solving algorithm based on the parallel
architecture of complex animal neuronal organization. Like a Hidden Markov Model, it is a
statistical model whose parameters are learned from training data. A neural network simulates the human learning process by mimicking
the networked organization of neurons and synapses. A single neuron, in the computational scheme, is a
node in a directed graph, with one or more entering connections designated as inputs, and a single
leaving connection called the output. To form a network, several neurons are assembled and the
outputs of some are connected to the inputs of others. Some nodes contain connections that provide input
to the entire network; some deliver output information from the network to the outside world; and
others, which do not interact directly with the outside, form the “hidden” layers.
Fig 2.13: The graphical presentation of HNN
Applying this to the interpretation of genotypic information, neural networks are trained using a large
database of input (genotype and treatment) data and output (drug response) data. The model is then
tested on a testing set of input and output data to see how accurate it is.
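The node described above can be sketched in a few lines: each neuron forms a weighted sum of its inputs and passes it through an activation function, and a small network chains such nodes through one hidden layer. The sigmoid activation, the function names and the weights used here are assumptions for illustration; they are not taken from PSIPRED or HNN.

```python
import math


def neuron(inputs, weights, bias):
    """One node: weighted sum of its inputs passed through a sigmoid."""
    total = sum(x * w for x, w in zip(inputs, weights)) + bias
    return 1.0 / (1.0 + math.exp(-total))


def feed_forward(inputs, hidden_layer, output_node):
    """Input -> one hidden layer -> single output.

    hidden_layer is a list of (weights, bias) pairs, one per hidden node;
    output_node is a single (weights, bias) pair fed by the hidden outputs.
    """
    hidden = [neuron(inputs, w, b) for w, b in hidden_layer]
    return neuron(hidden, *output_node)
```

Training consists of adjusting the weights and biases until the outputs on the training set match the known answers, after which the network is evaluated on the separate testing set.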
2.3.5.1.3 Hierarchical Neural Network (HNN)
Hierarchical neural networks consist of multiple neural networks connected in the form of an acyclic
graph. Tree-structured neural architectures are a special type of hierarchical neural network. The
networks within the graph can be single neurons or complex neural architectures such as multilayer
perceptrons or radial basis function networks. Decision trees, hierarchical self-organizing maps,
hierarchies of experts, and hierarchical or tree-based classifiers are typical applications for hierarchical
neural networks.
2.3.5.2 Methods
Using PSIPRED:
1. Starting with the Bioinformatics Unit page (http://bioinf.cs.ucl.ac.uk), “Secondary
structure prediction (PSIPRED)” was selected from “Protein Structure Prediction.”
2. The sequence of 1GCT was pasted in FASTA format.
3. An email address was entered.
4. “Predict” was clicked.
5. The results were received by email and then saved.
Flow chart: Bioinformatics Unit (http://bioinf.cs.ucl.ac.uk) → Server accessed → Secondary structure prediction (PSIPRED) → Sequence of 1GCT pasted in FASTA format → Sequence submitted → Predict → Results of PSIPRED received in email → Results saved
Using HNN
1. By entering the link www.expasy.org, the ExPASy Proteomics Server was accessed.
2. From Tools and software packages, “Secondary and tertiary structure prediction” was selected.
3. HNN was chosen from Secondary structure prediction.
4. The 4CHA sequence was pasted in the window.
5. The sequence was submitted to run the program.
6. The result was saved.
Flow chart: www.expasy.org → Secondary and tertiary structure prediction → HNN → Sequence pasted → Submit → Result saved
2.3.5.3 Results
Hierarchical Neural Network result

10 20 30 40 50 60 70
| | | | | | |
ccccchchhhhchheeeccccccccccceeeecccccceeeccccccccheeeehhhcccccceeeeeec
cccccchhhhhhhhhhhhhhcccccceeeccceeeeeecccccccceeeeeecccccccccccceeeeec
ccceeecccccccchhhcccccccccccchcchhhhhhhhhhhccccccccccccccceeeecccceeee
eeeecccccccccccchhhhhhhhhhhhhhhhccc
Sequence length : 245
HNN :
Alpha helix (Hh) : 56 is 22.86%
310 helix (Gg) : 0 is 0.00%
Pi helix (Ii) : 0 is 0.00%
Beta bridge (Bb) : 0 is 0.00%
Extended strand (Ee) : 55 is 22.45%
Beta turn (Tt) : 0 is 0.00%
Bend region (Ss) : 0 is 0.00%
Random coil (Cc) : 134 is 54.69%
Ambigous states (?) : 0 is 0.00%
Other states : 0 is 0.00%
Figure 2.14: Secondary structure by HNN
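The percentages in a summary of this kind follow directly from counting the per-residue state letters. The sketch below shows the arithmetic; the function name is hypothetical, and the state string in the test is a toy example rather than the prediction shown above.

```python
from collections import Counter


def state_composition(prediction):
    """Return {state: (count, percent)} for a per-residue state string."""
    counts = Counter(prediction.lower())
    total = len(prediction)
    return {state: (n, round(100.0 * n / total, 2)) for state, n in counts.items()}
```

Applied to the counts reported above, 56 helical residues out of 245 correspond to 100 × 56 / 245 ≈ 22.86%, in agreement with the HNN summary.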
PSIPRED Result
On Tue, 20 Jan 2009 09:03:36 GMT, "Apache" <psipred@cs.ucl.ac.uk> said:
PSIPRED PREDICTION RESULTS
Key
Conf: Confidence (0=low, 9=high)
Pred: Predicted secondary structure (H=helix, E=strand, C=coil)
AA: Target sequence
PSIPRED HFORMAT (PSIPRED V2.6 by David Jones)
Conf: 998765777799997599999999987868999938997897079940998998841313
Pred: CCCCCCCCCCCCCCCEECCEECCCCCCCCEEEEEECCCCEEEEEEEECCCEEEECHHHCC
AA:
10 20 30 40 50 60
Conf: 787579999626553799748997889998999888888785199997888757697801
Pred: CCCCEEEEEEEECCCCCCCCEEEEEEEEEECCCCCCCCCCCCEEEEEECCCCCCCCCEEC
AA: TTSDVVVAGEFDQGSSSEKIQKLKIAKVFKNSKYNSLTINNDITLLKLSTAASFSQTVSA
70 80 90 100 110 120
Conf: 687998776999898999828854678999988635999787299999987088799887
Pred: CCCCCCCCCCCCCCEEEEEECCCCCCCCCCCCCEEEEEEEEECCHHHHHHHCCCCCCCCE
AA:
130 140 150 160 170 180
Conf: 974899833677888994777119989999999860688879988499887997999999
Pred: EECCCCCCCCCCCCCCCEEEECCCCEEEEEEEEEEECCCCCCCCCEEEEEHHHHHHHHHH
AA:
190 200 210 220 230 240
Conf: 98559
Pred: HHHCC
AA: TLAAN
Calculate PostScript, PDF and JPEG graphical output for this result
using: http://bioinf3.cs.ucl.ac.uk/cgi-bin/psipred/gra/nph-view2.cgi?id=0eef479d8c802aad.psi
Figure 2.15: Secondary structure by PSI-Pred.
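A “Pred:” line of the kind shown in the PSIPRED output can be summarised as a list of contiguous helix, strand and coil segments. The following is a minimal sketch with a hypothetical function name; it works on any state string and is not part of PSIPRED itself.

```python
from itertools import groupby


def segments(pred):
    """List (state, start, end) runs in a prediction string, 1-based inclusive."""
    runs, position = [], 1
    for state, group in groupby(pred):
        length = len(list(group))
        runs.append((state, position, position + length - 1))
        position += length
    return runs
```

For the final five-residue block of the output, “HHHCC”, this gives a helix over positions 1 to 3 followed by coil over positions 4 to 5 (numbered within that block).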
Chapter 3: Discussion
3.1 General
The genomic era is characterised by an enormous expansion in the
amount of biological information available in the field of
molecular biology. The greatest challenge for the molecular
biology community is to make sense of the data and to explore meaningful
ways to exploit those data in practical genomics and proteomics. The
result was obvious: using computers to store, retrieve and manipulate the
data to produce meaningful information.
From the central dogma of life we know that information passes from genome to
proteome through transcription and translation. Transcription is the
process of copying DNA into mRNA, and translation is the decoding of mRNA into
protein.
[Figure: DNA → RNA → Protein]
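The two steps of the central dogma can be illustrated with a toy example. The codon table below is deliberately partial (only the codons needed for the demonstration sequence plus the three stop codons), and the function names and the demonstration sequence are hypothetical; a real analysis would use the complete genetic code.

```python
# Deliberately partial codon table: just enough for the demo sequence.
CODON_TABLE = {
    "AUG": "M", "GCC": "A", "UGG": "W", "AAA": "K",
    "UAA": "*", "UAG": "*", "UGA": "*",  # stop codons
}


def transcribe(coding_dna):
    """mRNA with the same sequence as the coding strand (T replaced by U)."""
    return coding_dna.upper().replace("T", "U")


def translate(mrna):
    """Read codons from the first base until a stop codon or the end."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        residue = CODON_TABLE[mrna[i:i + 3]]  # KeyError for codons not listed
        if residue == "*":
            break
        protein.append(residue)
    return "".join(protein)
```

Under these assumptions, the coding sequence ATGGCCTGGAAATAA is transcribed to AUGGCCUGGAAAUAA and translated to the peptide MAWK.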
Bioinformatics tools are mainly being developed to target three
central biological processes:
1. Determination of protein sequence from DNA sequence
2. Determination of protein structure from its primary sequence
3. Determination of protein function from its 3D structure
A database management system (DBMS) is a collection of information held in separate entities,
each with corresponding attributes linked to it. Such software handles data storage, retrieval and
manipulation far better than manual data storage and management.
Our project aim was to become familiar with the basic bioinformatics tools. The protein specimen
we took was chymotrypsin, one of the most thoroughly studied enzymes. The
enzyme is responsible for breaking polypeptides into smaller fragments; by definition it
belongs to the protease group of proteins. We retrieved homologous protein sequences from the
database and performed pairwise and multiple sequence alignment to
understand the evolutionary relationship among the sequences.
An evolutionary tree was built on the basis of the sequence homology, identity and similarity
present in the protein sequences.
3.2 Exploring Database
ENTREZ is a combined database and search engine composed of:
1. PubMed: biomedical literature database
2. PubMed Central: free full-text journal articles
3. Journals: detailed information about the journals
4. MeSH: detailed information about NLM’s controlled vocabulary
5. Nucleotide sequence database (GenBank)
6. Protein sequence database
7. Genome: whole genome sequence database
8. Structure: 3D macromolecular structure
9. Taxonomy: organisms in GenBank
10. SNP: single nucleotide polymorphism
11. GENE: gene-centered information
12. Books: online books
13. OMIM: Online Mendelian Inheritance in Man
14. Site search: NCBI web and FTP sites
15. UniGene: gene oriented clusters of transcript sequences
16. CDD: conserved protein domain database
17. 3D domain: domain from ENTREZ structure
18. UniSTS: markers and mapping data
19. PopSet: population study datasets
20. GEO: expression and molecular abundance profiles
21. GEO datasets: experimental sets of GEO data.
We used the Protein database of the NCBI gateway to retrieve the chymotrypsin sequence.
We also went through other features of the database and explored the information available
about the protein sequence.
GenBank records are kept in flat-file format, which is composed of three sets of
information:
1. Header
2. Features
3. Sequence
The Header part is composed of the following information:
• Locus
• Description
• Accession no.
• GI no.
• Version
• Source (organism)
• Organism (in detail)
• Reference (title, journal, author)
The Features part contains the following information:
• source
• Gene
• RNA
Lastly, the Sequence part contains the protein sequence in flat-file format. For input
to BLAST, however, we needed the sequence in FASTA format, which starts with a “>” sign followed by a
short description, then a line break (“Enter”), and then the sequence without any spaces, shown in “Courier” font.
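The FASTA layout described above can be produced with a few lines of code. This is a minimal sketch: the function name is hypothetical, and the default line width of 70 residues is an assumption chosen to match the sequence listings in the Results section rather than a requirement of the format.

```python
def to_fasta(description, sequence, width=70):
    """Format one record: a '>' header line, then fixed-width sequence lines."""
    residues = "".join(sequence.split()).upper()  # drop spaces and line breaks
    lines = [residues[i:i + width] for i in range(0, len(residues), width)]
    return "\n".join([">" + description] + lines) + "\n"
```

The helper removes any spaces from the pasted sequence, so the record can be submitted to BLAST directly.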
Another important lesson of the project was becoming acquainted with the software of the
database. We learned to use the different BLAST programs of the NCBI gateway. Their uses are
listed below:
Length | Database | Purpose | BLAST program
15 residues or longer | Protein | Identify the query sequence or find protein sequences similar to the query | Standard Protein BLAST (blastp)
15 residues or longer | Protein | Find members of a protein family or build a custom position-specific score matrix | PSI-BLAST
15 residues or longer | Protein | Find proteins similar to the query around a given pattern | PHI-BLAST
15 residues or longer | Conserved Domains | Find conserved domains in the query | CD-search (RPS-BLAST)
15 residues or longer | Conserved Domains | Find conserved domains in the query and identify other proteins with similar domain architectures | Conserved Domain Architecture Retrieval Tool (CDART)
15 residues or longer | Nucleotide | Find similar proteins in a translated nucleotide database | Translated BLAST (tblastn)
5-15 residues | Protein | Search for peptide motifs | Search for short, nearly exact matches
Table 3.1: Different uses of BLAST programs.
Apart from this we also used other useful tools, such as CLUSTAL W at the EBI gateway for
multiple sequence alignment.
3.3 Analyzing Protein Sequence
The protein sequence was 245 residues long. It contained 28 serine residues, and it is
classed as a serine protease because a serine residue in its active site carries out catalysis.
We learned how to search databases for homologous sequences using BLAST tools
such as BLASTp and PSI-BLAST.
We learned about the general physicochemical properties of the protein by using the ProtParam
software. Only the raw sequence was pasted, not the FASTA-format record. The software reported the
total number of negatively charged residues (Asp + Glu:
14) and of positively charged residues (Arg + Lys: 18), which indicate the approximate pI, and the
molecular weight, which can be estimated roughly by
multiplying the total residue number by 110. Other information, such as the extinction coefficient,
could also have been predicted.
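The two quantities mentioned above can be reproduced by hand. The sketch below counts charged residues and applies the rule of thumb of about 110 Da per residue; the function names are hypothetical, and the estimate is only approximate, since a dedicated tool would typically sum the masses of the individual residues instead.

```python
AVERAGE_RESIDUE_MASS = 110  # daltons; the rule-of-thumb value quoted in the text


def charged_residue_counts(sequence):
    """Return (negative, positive) counts: (Asp + Glu, Arg + Lys)."""
    seq = sequence.upper()
    negative = seq.count("D") + seq.count("E")
    positive = seq.count("R") + seq.count("K")
    return negative, positive


def approximate_mass(residue_count):
    """Rough molecular weight in daltons from the number of residues alone."""
    return residue_count * AVERAGE_RESIDUE_MASS
```

For the 245-residue sequence studied here, the rule of thumb gives 245 × 110 = 26,950 Da. An excess of positively charged over negatively charged residues, as reported above (18 versus 14), suggests a pI above neutral.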
We performed pairwise sequence alignment with the homologous sequences retrieved by
PSI-BLAST. By doing so we identified conserved regions in those protein sequences. The query
sequence shared long stretches of conserved regions with the sequence of a cold-adapted enzyme
from salmon. This helped us identify the location of the enzyme’s binding or
active sites, as those regions remain more or less conserved over evolutionary time.
We learned that the catalytic triad responsible for enzymatic activity is composed of
Serine 195, Histidine 57 and Aspartate 102. A closer look at the mechanism of the
enzyme’s reaction revealed an initial covalent modification at the serine residue.
By performing multiple sequence alignment we understood the evolutionary relationships
more specifically and were able to represent them as a phylogenetic tree.
The percentage and position of alpha helices and beta sheets were predicted using different
tools for secondary structure prediction, namely PSIPRED and Hierarchical Neural Network (HNN).
This gave us an idea of the secondary structure of the protein, which included more beta-pleated
structure (around 45%) and less alpha-helical structure (around 14%).
3.4 Conclusion
As the field of molecular biology advances, thousands of new proteins are being
discovered. Sequencing unknown proteins and determining their structure therefore remain a
crucial necessity for researchers. By studying the structure of a known protein, this
elementary project has prepared us to work with unknown proteins and to infer their
functions during advanced research activities.
