
Introduction to Programming: Perl for Biologists

Timothy M. Kunau
Center for Biomedical Research Informatics
Academic Health Center
University of Minnesota
kunau@umn.edu
Bioinformatics Summer Institute 2007

Outline
Art and Programming
Getting Started
Biology and Computer Science
Bioinformatics Data
Perl basics:
Strings and Variables
Math and Logic
Looping, operators, and functions

Art and Programming


Moving from Data to Story
Systems = Beauty

Science is what we understand well enough
to explain to a computer. Art is all the rest.
-- Donald Knuth

Edit -- Run -- Revise (and Save)


As a programmer, most of your time will be spent
planning, testing, and revising your program.
Running is often incidental on today's hardware.
Carefully written programs can be productive tools for
years.
Programming is a method of communication: your
code must be readable by both the computer and
your users.

Errors and Debugging


Rarely involves actual insects.
If the task is well understood,
errors are mostly typographical.
These error messages can be
extraordinarily helpful.
If the task is not well understood
or the data is irregular, it may
produce a logical error and
require more thought.
Beware: a valid program can still
produce the wrong result.

Programming
Is an exercise in problem solving:
iterative
gradual
often a solitary activity
Social activity
You are now part of a community of tool builders.
A program does not often stand alone, but interacts with other programs
that make up its environment. Each building on the others.
Systematic and beautiful

Programming
Is an economically valuable skill.
Commercial and proprietary
systems are built to protect their
economic value.
Open Source projects are
different.
Open Source software projects
publish their source code so that
it can be shared and improved
by the community of users.

http://www.opensource.org/

Open Source Programs


Firefox
LINUX
MySQL
Apache web server
Languages:
Perl
Ruby
Python

Programming Strategies
Break down into two major approaches:
1. Find a program written by someone else.
2. Write one yourself.

The reality is usually somewhere in between.

Programming Strategies
Open Source programming communities are often large
and prolific.
If you cannot find a program that does exactly what you
need -- you can likely find one that does most of what
you need.
A little tweaking is often significantly quicker than rolling
your own.
A day in the library can save you six months in the lab.
-- ancient adage

Programming Strategies
It is important to become aware of the communities
that use and support the tools you use.
Some copyrights may apply but use is generally
free.
CPAN

What has been will be again, what


has been done will be done again;
there is nothing new under the sun.
(Ecclesiastes 1:9 NIV)

The Process
1. Identify the inputs, data, and specifications from the user.
2. Design the solution as a series of steps toward the desired result.
3. Decide on the output(s). Does the result print to the screen
   or to a file? How will this output be used? Does format matter?
4. Refine the design with increasing detail. (pseudocode)
5. Do appropriate code modules exist? (CPAN)
6. Write the program.

Pseudocode
An informal program in which
there are no details and formal
syntax is not followed.
A quick and informal way to
collect your ideas about solving
the problem at hand.

get the name of DNA file from user
read in DNA from DNA file
for each element
    if element is DNA, then add one to the count
print count
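
The pseudocode above might translate into Perl roughly as follows. This is only a sketch under stated assumptions: the subroutine name count_dna_bases is made up for illustration, "element is DNA" is read as "the character is one of A, C, G, or T", and the file name is taken from the command line rather than prompted for.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: count the characters in a string that are DNA bases.
sub count_dna_bases {
    my ($text) = @_;
    my $count = 0;
    for my $element ( split //, $text ) {           # for each element
        $count++ if $element =~ /^[ACGTacgt]$/;     # if element is DNA, add one
    }
    return $count;
}

# Only read a file when a name is given on the command line.
if (@ARGV) {
    my $file = shift @ARGV;                         # get the name of the DNA file
    open( my $in, '<', $file ) or die "Cannot open $file: $!\n";
    my $dna = join '', <$in>;                       # read in DNA from the file
    close $in;
    print count_dna_bases($dna), "\n";              # print count
}
```

Each pseudocode line maps onto one or two lines of Perl, which is the point of writing the pseudocode first.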

What is Perl?
Scripting language by Larry Wall, circa 1987
Born of AWK
Practical Extraction and Report Language
Pathologically Eclectic Rubbish Lister
Disturbingly flexible in form, format, and usage.
Swiss Army chain-saw

Why Perl?!
An easy language to use, though sometimes hard to
learn. Some choices were made to make things easier
for the programmer at the expense of the student.
Fast cross platform text processing.
Good pattern matching. (regex)
Many extensions for Life Sciences data types. (BioPerl)
Many biologists already know Perl.
Powerful

#!/usr/local/bin/perl -w
use SOAP::Lite;
print STDERR "Welcome to the SOAP demonstration\n";
my $res;
$servername = "inquiry.ccgb.umn.edu";
my $server = SOAP::Lite
-> uri("http://$servername/Backbeat")
-> proxy("http://$servername/cgi-bin/bipod/BIFX.pl");
$res = $server->

(SOAP::Data->name(USER)->value("kunau"),
SOAP::Data->name(PASSWORD)->value(
));
my $ticket;
if ($res->result()) { $ticket = $res->result(); }
print STDERR "Got ticket $ticket\n";

Login
Get a ticket

my $id = "nt:ABY13260";
= $id;
=~ s/:
;
$res = $server->
(SOAP::Data->name(TICKET)->value($ticket),
SOAP::Data->name("BLOCKING")->value(1),
SOAP::Data->name("sequence")->value("$id"),
SOAP::Data->name(
)->value("fasta"),
SOAP::Data->name("outseq")->value(
));
($res);
print STDERR "fetched file for $id\n";
$res = $server->
(SOAP::Data->name(TICKET)->value($ticket),
SOAP::Data->name("BLOCKING")->value(0),
SOAP::Data->name("blastall")->value("blastn"),
SOAP::Data->name("query")->value(
),
SOAP::Data->name(
)->value("yeast.nt"),
SOAP::Data->name(
)->value("yeast.nt"),
SOAP::Data->name(
)->value(
. ".blastx"));
($res);
my $jid = 0;
if ($res->result()) { $jid = $res->result(); }
print "Submitted BLAST for
. Got job id $jid\n";
# Client side block
my $result = "";
while ($result ne "FINISHED") {
print "Checking status for job $jid\n";
$res = $server->
(
SOAP::Data->name("TICKET")->value($ticket),
$jid));
($res);
if ($res->result()) { $result = $res->result(); }
print "Got status $result\n";
if ($result ne "FINISHED") { sleep 3; }
}

Configure a service
Submit request
Check status (rinse, repeat)

$res = $server->
(SOAP::Data->name(TICKET)->value($ticket),
SOAP::Data->name(FILENAME)->value("blastall.txt"));
($res);
if ($res->result()) {
$result = $res->result();
print "Got status $result\n";
if ($result ne "FINISHED") { sleep 3; }
}

$res = $server->
(SOAP::Data->name(TICKET)->value($ticket),
SOAP::Data->name(FILENAME)->value("blastall.txt"));
($res);
if ($res->result()) { print $res->result(); }
###################### SUBROUTINES #####################
sub
{
my $res = shift;
if (my $fault = $res->fault()) {
my %fault = %$fault;
while (my ($key, $val) = each (%fault)) {
print "$key $val\n";
}
}
}

Print result

Beginning Perl for Bioinformatics
Hardcover: 400 pages
Publisher: O'Reilly Media, Inc.; 1st edition (October 15, 2001)
Language: English
ISBN: 0596000804
Product Dimensions: 9.2 x 7.1 x 0.9 inches
Shipping Weight: 1.3 pounds.
Average Customer Review: 4.5/5 based on 25 reviews.

Mastering Perl for Bioinformatics
Hardcover: 377 pages
Publisher: O'Reilly Media, Inc.; 1st edition (June, 2003)
Language: English
ISBN: 0596003072
Product Dimensions: 9.4 x 6.8 x 0.9 inches
Shipping Weight: 1.4 pounds.
Average Customer Review: 4.5/5 based on 8 reviews.

Safari Books on-line

http://proquest.safaribooksonline.com/home

Safari: Perl

Safari: bioinformatics

Getting Started
The programming rite of passage.
Tidbits
print "string";
newline: \n
tab: \t
# comments
All about context

A simple program

#!/usr/bin/perl -w
#
# a program to do the obvious
#
print "Hello, world!\n";

A simple result

% ./hello-world.pl
Hello, world!

How does it work?


#!/usr/bin/perl -w
Every Perl program begins with this line.

#
# a program to do the obvious
#
Comments

print "Hello, world!\n";
The print function sends the quoted text to the default
output device, the screen.

Theme and variation

#!/usr/bin/perl -w
#
# assign a value to $message
my $message = "Hello, world!\n";
# print the $message
print $message;

Store the value "Hello, world!" in a
container called a variable.

Theme and variation

#!/usr/bin/perl -w
#
# assign a value to $message
my $message = qq{Hello, world!\n};
# print the $message
print $message;

Don't let a change in form throw you.

TMTOWTDI
There's More Than One Way To Do It
This can be frustrating for new users.
We'll try to focus on what we're doing. Don't worry
about all the possible ways to do it yet.

LAB: Let's try it!


Login to your workstation
launch a terminal window
mkdir bsi2007
cd bsi2007
launch a text editor: pico, vi, emacs
create and save your Hello, world! program
Run it

LAB: Let's try it!

% mkdir bsi2007
% cd bsi2007
% pico hello-world.pl
% chmod +x hello-world.pl
% ./hello-world.pl

LAB: Let's try it!

#!/usr/bin/perl -w
#
# a program to do the obvious
#
print "Hello, world!\n";

LAB: Let's try a little variation.

#!/usr/bin/perl -w
#
# assign a value to $message
my $message = "Hello, world!\n";
# print the $message
print $message;

LAB: break it.


What happens when?:
1. You remove a semicolon?
2. You remove a dollar sign?
3. You change the shebang?
4. Can you change the shebang to something else that works?

lather --> rinse --> repeat

The goal of testing is to cause your code
to fail. The goal of testing is not to cause
your code to succeed.
-- D. Conway

LAB: A simple program

#!/usr/bin/perl -w
#
# a program to do the obvious
#
print "Hello, world!\n";

LAB: A simple result

% ./hello-world.pl
Hello, world!

Biology and Computer Science


The Life Sciences and
many of the Computer
Sciences grew up
together.
Databases
Languages
Networks
the World Wide Web

It is better to use one's
head for a few minutes,
than to use a computing
machine for a few days.
-- Francis Crick

A brief history
1950s: Double helix structure of DNA
1960s: Manual alignment using edit distances
1970s: Optimal global alignment (Needleman & Wunsch)
Substitution matrices (Dayhoff)
1980s: Optimal local alignment (Smith & Waterman)
1990s: Heuristic local alignment search
FASTA: (Pearson et al.),
BLAST: (Altschul et al.)

Disconnects
Social differences
Managing expectations
Developing a common
vocabulary
Conway's Law

Conway's Law states:
Organizations which design
systems are constrained to
produce designs which are
copies of the communication
structures of their
organizations.
In other words:
Any piece of software reflects
the organizational structure
that produced it.

Social differences
Tool building versus the great discovery:
Computer scientists create new rules to engineer a solution.
(Inventing laws)
Life scientists look for the exception that breaks the rules.
(Discover laws)

Social differences
                     Biologists                  Computer Scientists
Sharing results      Sit on it until ready to    Share but do not
                     publish                     guarantee correctness
Reporting results    Peer reviewed papers        Talks at conferences
                                                 Publish Source Code
Who's who            Lab leader always last      Lab leader second, least
(on publications)                                involved last

Managing Expectations
What can we expect
from each other?
Life Sciences are
presenting the grand
challenges of our
time...
What does Computer
Science have to offer
Life Sciences
research?

Developing a common vocabulary


Words in common but
with different meanings:
Array, chip, clone,
cluster, database,
domain, insert, library,
node, partitioning,
root, sequence,
transformation, tree,
vector, virus

Isn't it odd?

Biology is the only science in which
multiplication means the same thing as
division.

Developing a common vocabulary


The importance of interpreters.
Constrained and negotiated vocabularies, Ontologies:
"gene expression" and "Gene Expression" and "gene regulation"
"putative kinase" and "possibly a kinase" and "it may be
something, but it isn't a kinase"
Metadata without guidelines will lead to entropy.
Folksonomy: in-formalisms, tagging?
You are becoming interpreters.

Developing a common vocabulary


BioBench-Bob: The information is in the file, what's
the problem?
Compu-Carla: This file is a mess! How about some
consistency and structure?

What we have here is a failure to communicate.


Compu-Carla: The information is all in the database,
why are you complaining?
BioBench-Bob: How do I read it?

Conway's Law

Organizations which design systems are
constrained to produce designs which are
copies of the communication structures
of their organizations.

Bioinformatics Data
Quantity has a quality
all its own
Russian military axiom

GBREL.TXT Genetic Sequence Data Bank


April 15 2007
NCBI-GenBank Flat File Release 159.0
Distribution Release Notes
71,802,595 loci, 75,742,041,056 bases, from 71,802,595 reported sequences

Bioinformatics Data
Often unstructured or semi-structured.
Data appears as text strings:
Protein sequences: FASTA flat-files,
et alia.
Annotation: often free-text
Feudal states (Lincoln Stein)

FASTA
>ContigId:Contig1 AssemblyProcessId:MtSC AssemblyProcessVersion:6
GCTTTAATCTTGTAGGTTTGATGAAAGAATAAGTTCGTTTGCTGAGAAGA
AGTTTACAAGAGATGGTATAGAAGTTCAAACTGGATGCCGCGTTATGAGT
GTTGATGACAAGGAAATTACAGTGAAGGTGAAATCAACGGGAGAGGTTTG
CTCGGTTCCCCATGGATTGATTATCTGGTCTACTGGCATTTCTACTCTTC
CAGTTATAAGAGATTTTATGGAAGAAATTGGTCAGACTAAAAGGCATGTA
CTGGCAACCGATGAATGGTTGAGAGTGAAGGAATGTGAAGATGTGTTTGC
CATTGGTGATTGTTCATCAATAAATCAACGTAAAATCATGGATGATATCT
TGGACATATTTAAGGCTGCAGACAAAAATAACTCCGGTACCTTAACTGTG
TAAGAATGCGAAGAAGTGATGGATGAATGTATCTTAAGATATCCTGCAGT
GGAATGC

Medicago Truncatula consensus sequence
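
A FASTA record is a header line starting with ">" followed by one or more lines of sequence. As a rough illustration of how such text might be read in Perl, the sketch below collects each sequence under its header. The subroutine name parse_fasta and the shortened sample record are assumptions made for this example; for real work a tested module such as BioPerl's Bio::SeqIO is the safer route.

```perl
use strict;
use warnings;

# Hypothetical helper: parse FASTA text into a hash of header => sequence.
sub parse_fasta {
    my ($text) = @_;
    my ( %seq_for, $header );
    for my $line ( split /\n/, $text ) {
        next if $line =~ /^\s*$/;          # skip blank lines
        if ( $line =~ /^>(.*)/ ) {         # a header line starts a new record
            $header = $1;
            $seq_for{$header} = '';
        }
        elsif ( defined $header ) {        # sequence line: append to the record
            $line =~ s/\s+//g;
            $seq_for{$header} .= $line;
        }
    }
    return %seq_for;
}

# Shortened sample: two 10-base lines under one header.
my %records = parse_fasta(">Contig1 example\nGCTTTAATCT\nTGTAGGTTTG\n");
for my $header ( sort keys %records ) {
    print "$header: ", length( $records{$header} ), " bases\n";
}
```

Joining the sequence lines matters because the line breaks in a FASTA file are only layout; the biological sequence is the concatenation.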

GenBank

LOCUS       SCU49845     5028 bp    DNA             PLN       21-JUN-1999
DEFINITION  Saccharomyces cerevisiae TCP1-beta gene, partial cds, and Axl2p
            (AXL2) and Rev7p (REV7) genes, complete cds.
ACCESSION   U49845
VERSION     U49845.1  GI:1293613
KEYWORDS    .
SOURCE      Saccharomyces cerevisiae (baker's yeast)
  ORGANISM  Saccharomyces cerevisiae
            Eukaryota; Fungi; Ascomycota; Saccharomycotina; Saccharomycetes;
            Saccharomycetales; Saccharomycetaceae; Saccharomyces.
REFERENCE   1  (bases 1 to 5028)
  AUTHORS   Torpey,L.E., Gibbs,P.E., Nelson,J. and Lawrence,C.W.
  TITLE     Cloning and sequence of REV7, a gene whose function is required for
            DNA damage-induced mutagenesis in Saccharomyces cerevisiae
  JOURNAL   Yeast 10 (11), 1503-1509 (1994)
  PUBMED    7871890
REFERENCE   2  (bases 1 to 5028)
  AUTHORS   Roemer,T., Madden,K., Chang,J. and Snyder,M.
  TITLE     Selection of axial growth sites in yeast requires Axl2p, a novel
            plasma membrane glycoprotein
  JOURNAL   Genes Dev. 10 (7), 777-793 (1996)
  PUBMED    8846915
REFERENCE   3  (bases 1 to 5028)
  AUTHORS   Roemer,T.
  TITLE     Direct Submission
  JOURNAL   Submitted (22-FEB-1996) Terry Roemer, Biology, Yale University, New
            Haven, CT, USA
FEATURES             Location/Qualifiers
     source          1..5028
                     /organism="Saccharomyces cerevisiae"
                     /db_xref="taxon:4932"
                     /chromosome="IX"
                     /map="9"
     CDS             <1..206
                     /codon_start=3
                     /product="TCP1-beta"
                     /protein_id="AAA98665.1"
                     /db_xref="GI:1293614"
                     /translation="SSIYNGISTSGLDLNNGTIADMRQLGIVESYKLKRAVVSSASEA
                     AEVLLRVDNIIRARPRTANRQHM"
     gene            687..3158
                     /gene="AXL2"
     CDS             687..3158
                     /gene="AXL2"

Approximately 71,802,595 loci, 75,742,041,056 bases, from 71,802,595
reported sequences in traditional GenBank divisions as of April 2007.

GenBank

                     /note="plasma membrane glycoprotein"
                     /codon_start=1
                     /function="required for axial budding pattern of S.
                     cerevisiae"
                     /product="Axl2p"
                     /protein_id="AAA98666.1"
                     /db_xref="GI:1293615"
                     /translation="MTQLQISLLLTATISLLHLVVATPYEAYPIGKQYPPVARVNESF
                     TFQISNDTYKSSVDKTAQITYNCFDLPSWLSFDSSSRTFSGEPSSDLLSDANTTLYFN
                     VILEGTDSADSTSLNNTYQFVVTNRPSISLSSDFNLLALLKNYGYTNGKNALKLDPNE
                     VFNVTFDRSMFTNEESIVSYYGRSQLYNAPLPNWLFFDSGELKFTGTAPVINSAIAPE
                     TSYSFVIIATDIEGFSAVEVEFELVIGAHQLTTSIQNSLIINVTDTGNVSYDLPLNYV
                     YLDDDPISSDKLGSINLLDAPDWVALDNATISGSVPDELLGKNSNPANFSVSIYDTYG
                     DVIYFNFEVVSTTDLFAISSLPNINATRGEWFSYYFLPSQFTDYVNTNVSLEFTNSSQ
                     DHDWVKFQSSNLTLAGEVPKNFDKLSLGLKANQGSQSQELYFNIIGMDSKITHSNHSA
                     NATSTRSSHHSTSTSSYTSSTYTAKISSTSAAATSSAPAALPAANKTSSHNKKAVAIA
                     CGVAIPLGVILVALICFLIFWRRRRENPDDENLPHAISGPDLNNPANKPNQENATPLN
                     NPFDDDASSYDDTSIARRLAALNTLKLDNHSATESDISSVDEKRDSLSGMNTYNDQFQ
                     SQSKEELLAKPPVQPPESPFFDPQNRSSSVYMDSEPAVNKSWRYTGNLSPVSDIVRDS
                     YGSQKTVDTEKLFDLEAPEKEKRTSRDVTMSSLDPWNSNISPSPVRKSVTPSPYNVTK
                     HRNRHLQNIQDSQSGKNGITPTTMSTSSSDDFVPVKDGENFCWVHSMEPDRRPSKKRL
                     VDFSNKSNVNVGQVKDIHGRIPEML"
     gene            complement(3300..4037)
                     /gene="REV7"
     CDS             complement(3300..4037)
                     /gene="REV7"
                     /codon_start=1
                     /product="Rev7p"
                     /protein_id="AAA98667.1"
                     /db_xref="GI:1293616"
                     /translation="MNRWVEKWLRVYLKCYINLILFYRNVYPPQSFDYTTYQSFNLPQ
                     FVPINRHPALIDYIEELILDVLSKLTHVYRFSICIINKKNDLCIEKYVLDFSELQHVD
                     KDDQIITETEVFDEFRSSLNSLIMHLEKLPKVNDDTITFEAVINAIELELGHKLDRNR
                     RVDSLEEKAEIERDSNWVKCQEDENLPDNNGFQPPKIKLTSLVGSDVGPLIIHQFSEK
                     LISGDDKILNGVYSQYEEGESIFGSLF"

ORIGIN
        1 gatcctccat atacaacggt atctccacct caggtttaga tctcaacaac ggaaccattg
       61 ccgacatgag acagttaggt atcgtcgaga gttacaagct aaaacgagca gtagtcagct
      121 ctgcatctga agccgctgaa gttctactaa gggtggataa catcatccgt gcaagaccaa
      181 gaaccgccaa tagacaacat atgtaacata tttaggatat acctcgaaaa taataaaccg
      241 ccacactgtc attattataa ttagaaacag aacgcaaaaa ttatccacta tataattcaa
      301 agacgcgaaa aaaaaagaac aacgcgtcat agaacttttg gcaattcgcg tcacaaataa
      361 attttggcaa cttatgtttc ctcttcgagc agtactcgag ccctgtctca agaatgtaat
      421 aatacccatc gtaggtatgg ttaaagatag catctccaca acctcaaagc tccttgccga

GenBank

      481 gagtcgccct cctttgtcga gtaattttca cttttcatat gagaacttat tttcttattc
      541 tttactctca catcctgtag tgattgacac tgcaacagcc accatcacta gaagaacaga
      601 acaattactt aatagaaaaa ttatatcttc ctcgaaacga tttcctgctt ccaacatcta
      661 cgtatatcaa gaagcattca cttaccatga cacagcttca gatttcatta ttgctgacag
      721 ctactatatc actactccat ctagtagtgg ccacgcccta tgaggcatat cctatcggaa
      781 aacaataccc cccagtggca agagtcaatg aatcgtttac atttcaaatt tccaatgata
      841 cctataaatc gtctgtagac aagacagctc aaataacata caattgcttc gacttaccga
      901 gctggctttc gtttgactct agttctagaa cgttctcagg tgaaccttct tctgacttac
      961 tatctgatgc gaacaccacg ttgtatttca atgtaatact cgagggtacg gactctgccg
     1021 acagcacgtc tttgaacaat acataccaat ttgttgttac aaaccgtcca tccatctcgc
     1081 tatcgtcaga tttcaatcta ttggcgttgt taaaaaacta tggttatact aacggcaaaa
     1141 acgctctgaa actagatcct aatgaagtct tcaacgtgac ttttgaccgt tcaatgttca
     1201 ctaacgaaga atccattgtg tcgtattacg gacgttctca gttgtataat gcgccgttac
     1261 ccaattggct gttcttcgat tctggcgagt tgaagtttac tgggacggca ccggtgataa
     1321 actcggcgat tgctccagaa acaagctaca gttttgtcat catcgctaca gacattgaag
     1381 gattttctgc cgttgaggta gaattcgaat tagtcatcgg ggctcaccag ttaactacct
     1441 ctattcaaaa tagtttgata atcaacgtta ctgacacagg taacgtttca tatgacttac
     1501 ctctaaacta tgtttatctc gatgacgatc ctatttcttc tgataaattg ggttctataa
     1561 acttattgga tgctccagac tgggtggcat tagataatgc taccatttcc gggtctgtcc
     1621 cagatgaatt actcggtaag aactccaatc ctgccaattt ttctgtgtcc atttatgata
     1681 cttatggtga tgtgatttat ttcaacttcg aagttgtctc cacaacggat ttgtttgcca
     1741 ttagttctct tcccaatatt aacgctacaa ggggtgaatg gttctcctac tattttttgc
     1801 cttctcagtt tacagactac gtgaatacaa acgtttcatt agagtttact aattcaagcc
     1861 aagaccatga ctgggtgaaa ttccaatcat ctaatttaac attagctgga gaagtgccca
     1921 agaatttcga caagctttca ttaggtttga aagcgaacca aggttcacaa tctcaagagc
     1981 tatattttaa catcattggc atggattcaa agataactca ctcaaaccac agtgcgaatg
     2041 caacgtccac aagaagttct caccactcca cctcaacaag ttcttacaca tcttctactt
     2101 acactgcaaa aatttcttct acctccgctg ctgctacttc ttctgctcca gcagcgctgc
     2161 cagcagccaa taaaacttca tctcacaata aaaaagcagt agcaattgcg tgcggtgttg
     2221 ctatcccatt aggcgttatc ctagtagctc tcatttgctt cctaatattc tggagacgca
     2281 gaagggaaaa tccagacgat gaaaacttac cgcatgctat tagtggacct gatttgaata
     2341 atcctgcaaa taaaccaaat caagaaaacg ctacaccttt gaacaacccc tttgatgatg
     2401 atgcttcctc gtacgatgat acttcaatag caagaagatt ggctgctttg aacactttga
     2461 aattggataa ccactctgcc actgaatctg atatttccag cgtggatgaa aagagagatt
     2521 ctctatcagg tatgaataca tacaatgatc agttccaatc ccaaagtaaa gaagaattat
     2581 tagcaaaacc cccagtacag cctccagaga gcccgttctt tgacccacag aataggtctt
     2641 cttctgtgta tatggatagt gaaccagcag taaataaatc ctggcgatat actggcaacc
     2701 tgtcaccagt ctctgatatt gtcagagaca gttacggatc acaaaaaact gttgatacag
     2761 aaaaactttt cgatttagaa gcaccagaga aggaaaaacg tacgtcaagg gatgtcacta
     2821 tgtcttcact ggacccttgg aacagcaata ttagcccttc tcccgtaaga aaatcagtaa
     2881 caccatcacc atataacgta acgaagcatc gtaaccgcca cttacaaaat attcaagact
     2941 ctcaaagcgg taaaaacgga atcactccca caacaatgtc aacttcatct tctgacgatt
     3001 ttgttccggt taaagatggt gaaaattttt gctgggtcca tagcatggaa ccagacagaa
     3061 gaccaagtaa gaaaaggtta gtagattttt caaataagag taatgtcaat gttggtcaag
GenBank

     3121 ttaaggacat tcacggacgc atcccagaaa tgctgtgatt atacgcaacg atattttgct
     3181 taattttatt ttcctgtttt attttttatt agtggtttac agatacccta tattttattt
     3241 agtttttata cttagagaca tttaatttta attccattct tcaaatttca tttttgcact
     3301 taaaacaaag atccaaaaat gctctcgccc tcttcatatt gagaatacac tccattcaaa
     3361 attttgtcgt caccgctgat taatttttca ctaaactgat gaataatcaa aggccccacg
     3421 tcagaaccga ctaaagaagt gagttttatt ttaggaggtt gaaaaccatt attgtctggt
     3481 aaattttcat cttcttgaca tttaacccag tttgaatccc tttcaatttc tgctttttcc
     3541 tccaaactat cgaccctcct gtttctgtcc aacttatgtc ctagttccaa ttcgatcgca
     3601 ttaataactg cttcaaatgt tattgtgtca tcgttgactt taggtaattt ctccaaatgc
     3661 ataatcaaac tatttaagga agatcggaat tcgtcgaaca cttcagtttc cgtaatgatc
     3721 tgatcgtctt tatccacatg ttgtaattca ctaaaatcta aaacgtattt ttcaatgcat
     3781 aaatcgttct ttttattaat aatgcagatg gaaaatctgt aaacgtgcgt taatttagaa
     3841 agaacatcca gtataagttc ttctatatag tcaattaaag caggatgcct attaatggga
     3901 acgaactgcg gcaagttgaa tgactggtaa gtagtgtagt cgaatgactg aggtgggtat
     3961 acatttctat aaaataaaat caaattaatg tagcatttta agtataccct cagccacttc
     4021 tctacccatc tattcataaa gctgacgcaa cgattactat tttttttttc ttcttggatc
     4081 tcagtcgtcg caaaaacgta taccttcttt ttccgacctt ttttttagct ttctggaaaa
     4141 gtttatatta gttaaacagg gtctagtctt agtgtgaaag ctagtggttt cgattgactg
     4201 atattaagaa agtggaaatt aaattagtag tgtagacgta tatgcatatg tatttctcgc
     4261 ctgtttatgt ttctacgtac ttttgattta tagcaagggg aaaagaaata catactattt
     4321 tttggtaaag gtgaaagcat aatgtaaaag ctagaataaa atggacgaaa taaagagagg
     4381 cttagttcat cttttttcca aaaagcaccc aatgataata actaaaatga aaaggatttg
     4441 ccatctgtca gcaacatcag ttgtgtgagc aataataaaa tcatcacctc cgttgccttt
     4501 agcgcgtttg tcgtttgtat cttccgtaat tttagtctta tcaatgggaa tcataaattt
     4561 tccaatgaat tagcaatttc gtccaattct ttttgagctt cttcatattt gctttggaat
     4621 tcttcgcact tcttttccca ttcatctctt tcttcttcca aagcaacgat ccttctaccc
     4681 atttgctcag agttcaaatc ggcctctttc agtttatcca ttgcttcctt cagtttggct
     4741 tcactgtctt ctagctgttg ttctagatcc tggtttttct tggtgtagtt ctcattatta
     4801 gatctcaagt tattggagtc ttcagccaat tgctttgtat cagacaattg actctctaac
     4861 ttctccactt cactgtcgag ttgctcgttt ttagcggaca aagatttaat ctcgttttct
     4921 ttttcagtgt tagattgctc taattctttg agctgttctc tcagctcctc atatttttct
     4981 tgccatgact cagattctaa ttttaagcta ttcaatttct ctttgatc
//

SWISS-Prot
Release 53.0 of 29-May-07 of
UniProtKB/Swiss-Prot contains
269,293 sequence entries,
comprising 98,902,758 amino acids
abstracted from 156,204
references.

ID   DMI1_MEDTR     STANDARD;      PRT;   882 AA.
AC   Q6RHR6;
DT   29-MAR-2005, integrated into UniProtKB/Swiss-Prot.
DT   05-JUL-2004, sequence version 1.
DT   04-APR-2006, entry version 13.
DE   Putative ion channel DMI-1 (Does not make infections protein 1).
GN   Name=DMI1;
OS   Medicago truncatula (Barrel medic).
OC   Eukaryota; Viridiplantae; Streptophyta; Embryophyta; Tracheophyta;
OC   Spermatophyta; Magnoliophyta; eudicotyledons; core eudicotyledons;
OC   rosids; eurosids I; Fabales; Fabaceae; Papilionoideae; Trifolieae;
OC   Medicago.
OX   NCBI_TaxID=3880;
RN   [1]
RP   NUCLEOTIDE SEQUENCE [MRNA], INDUCTION, AND TISSUE SPECIFICITY.
RC   TISSUE=Root;
RX   PubMed=14963334; DOI=10.1126/science.1092986;
RA   Ane J.-M., Kiss G.B., Riely B.K., Penmetsa R.V., Oldroyd G.E.,
RA   Ayax C., Levy J., Debelle F., Baek J.-M., Kalo P., Rosenberg C.,
RA   Roe B.A., Long S.R., Denarie J., Cook D.R.;
RT   "Medicago truncatula DMI1 required for bacterial and fungal symbioses
RT   in legumes.";
RL   Science 303:1364-1367(2004).
CC   -!- FUNCTION: Required for early signal transduction events leading to
CC       endosymbioses. Acts early in a signal transduction chain leading
CC       from the perception of Nod factor to the activation of calcium
CC       spiking. Also involved in mycorrhizal symbiosis.
CC   -!- SUBCELLULAR LOCATION: Plastid; chloroplast; chloroplast membrane;
CC       multi-pass membrane protein (Potential).
CC   -!- TISSUE SPECIFICITY: Mainly expressed in roots. Also detected in
CC       pods, flowers, leaves, and stems.
CC   -!- INDUCTION: Not induced after bacterial or Nod factor treatment.
CC   -!- SIMILARITY: Belongs to the castor/pollux family.
CC   ----------------------------------------------------------------------
CC   Copyrighted by the UniProt Consortium, see http://www.uniprot.org/terms
CC   Distributed under the Creative Commons Attribution-NoDerivs License
CC   ----------------------------------------------------------------------
DR   EMBL; AY497771; AAS49490.1; -; mRNA.
KW   Chloroplast; Coiled coil; Ion transport; Ionic channel; Membrane;
KW   Plastid; Transmembrane; Transport.
FT   CHAIN         1    882       Putative ion channel DMI-1.
FT                                /FTId=PRO_0000165855.
FT   TRANSMEM    129    149       Potential.
FT   TRANSMEM    192    212       Potential.
FT   TRANSMEM    255    275       Potential.
FT   TRANSMEM    307    327       Potential.
FT   COILED      378    403       Potential.
FT   COMPBIAS     78     96       Pro-rich.
FT   COMPBIAS    114    117       Poly-Ser.
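
In the Swiss-Prot flat-file layout every line begins with a two-letter line code (ID, AC, DE, FT and so on) followed by its content, which makes the format easy to pick apart with a regular expression. The sketch below groups an entry's lines by code. The subroutine name lines_by_code is hypothetical and the sample is a shortened copy of the entry above; it illustrates the idea and is not a full parser.

```perl
use strict;
use warnings;

# Hypothetical helper: group the lines of one entry by their two-letter line code.
sub lines_by_code {
    my ($entry) = @_;
    my %lines_for;
    for my $line ( split /\n/, $entry ) {
        # A line code is two capital letters followed by whitespace.
        next unless $line =~ /^([A-Z]{2})\s+(.*)$/;
        push @{ $lines_for{$1} }, $2;
    }
    return %lines_for;
}

my $entry = <<'END';
ID   DMI1_MEDTR     STANDARD;      PRT;   882 AA.
AC   Q6RHR6;
OS   Medicago truncatula (Barrel medic).
KW   Chloroplast; Coiled coil; Ion transport; Ionic channel; Membrane;
KW   Plastid; Transmembrane; Transport.
END

my %fields = lines_by_code($entry);
print "Accession: $fields{AC}[0]\n";
print "Keyword lines: ", scalar @{ $fields{KW} }, "\n";
```

A hash of arrays fits here because some codes, such as KW, repeat across several lines while others appear once.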

Tower of Babel, Pieter Brueghel the Elder, 1563.

XML

<?xml version="1.0" encoding="UTF-8"?>


<uniprot xmlns="http://uniprot.org/uniprot"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://uniprot.org/uniprot http://www.uniprot.org/support/docs/uniprot.xsd">
<entry dataset="Swiss-Prot" created="2005-03-29" modified="2006-04-04" version="13">
<accession>Q6RHR6</accession>
<name>DMI1_MEDTR</name>
<protein>
<name>Putative ion channel DMI-1</name>
<name>Does not make infections protein 1</name>
</protein>
<gene>
<name type="primary">DMI1</name>
</gene>
<organism key="1">
<name type="scientific">Medicago truncatula</name>
<name type="common">Barrel medic</name>
<dbReference type="NCBI Taxonomy" id="3880" key="2"/>
<lineage>
<taxon>Eukaryota</taxon>
<taxon>Viridiplantae</taxon>
<taxon>Streptophyta</taxon>
<taxon>Embryophyta</taxon>
<taxon>Tracheophyta</taxon>
<taxon>Spermatophyta</taxon>
<taxon>Magnoliophyta</taxon>
<taxon>eudicotyledons</taxon>
<taxon>core eudicotyledons</taxon>
<taxon>rosids</taxon>
<taxon>eurosids I</taxon>
<taxon>Fabales</taxon>
<taxon>Fabaceae</taxon>
<taxon>Papilionoideae</taxon>
<taxon>Trifolieae</taxon>
<taxon>Medicago</taxon>
</lineage>
</organism>
<reference key="3">
<citation type="journal article" date="2004" name="Science" volume="303" first="1364" last="1367">
<title>Medicago truncatula DMI1 required for bacterial and fungal symbioses in legumes.</title>
<authorList>
<person name="Ane J.-M."/>
<person name="Kiss G.B."/>
<person name="Riely B.K."/>
<person name="Penmetsa R.V."/>
<person name="Oldroyd G.E."/>
<person name="Ayax C."/>
<person name="Levy J."/>

Two principal problems in bioinformatics


distribution: data is created
and controlled by autonomous
groups all over the world.
biology is hard and messy:
large collections of data, many
numbers of data types and
tools; few of which talk to each
other.

Perl is often the glue that binds these systems together.

Perl basics: Strings


Primitives:
Strings
Numerics

TGACATGCTAGCTAGCTAGCTAT
1356
#@$!$!%@&&!@

Data types
Scalar: a variable quantity
that cannot be resolved
into components.
List: a collection of items,
often stored in an array.
Hash: a dish of cooked
meat cut into small pieces
and re-cooked, usually
with potatoes.

Data types
Scalar: a variable quantity
that cannot be resolved
into components.
List: a collection of items,
often stored in an array.
Hash: like an array, but
instead of indexing values
by number, values are
accessed by name. Think
of them as name-value
pairs.

Data types
Scalar: my $var = "a";
        my $num = 10;
List:   my @fruit_list = ("apple", "orange", "banana");
Hash:
my %ip2hostname = (
    "160.94.109.65"  => "leaf.cbri.umn.edu",
    "160.94.109.55"  => "blastoma.cbri.umn.edu",
    "160.94.109.211" => "kierkegaard.cbri.umn.edu"
);
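
Declaring the three data types is only half the story; the sigil changes when a single element is read back out. The following sketch, which reuses the slide's variable names with a shortened hash, shows one way to access each type. It is an illustration rather than part of the original course material.

```perl
use strict;
use warnings;

my $num         = 10;                                  # scalar
my @fruit_list  = ( 'apple', 'orange', 'banana' );     # array holding a list
my %ip2hostname = (                                    # hash of name-value pairs
    '160.94.109.65' => 'leaf.cbri.umn.edu',
    '160.94.109.55' => 'blastoma.cbri.umn.edu',
);

# A single element is a scalar, so it is written with a $ sigil.
print "First fruit: $fruit_list[0]\n";                 # arrays index from 0
print "Number of fruits: ", scalar(@fruit_list), "\n";
print "Host: $ip2hostname{'160.94.109.65'}\n";         # look up a value by name

# Walk the hash in sorted key order; hash order is otherwise not guaranteed.
for my $ip ( sort keys %ip2hostname ) {
    print "$ip => $ip2hostname{$ip}\n";
}
```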

Math
Standard arithmetic: +, -, *, /
modulus operator: %
4 % 2 = 0 and 5 % 2 = 1
Operate in place: $num += 3;
Increment and decrement variable: $i++, $a--
Power: 2**5
Square-root: sqrt(9)

Some Math Code


# Pythagorean theorem
my $a = 3; my $b = 4;
my $c = sqrt($a**2 + $b**2);
# what's left over from the division
my $x = 22; my $y = 6;
my $div = int ( $x / $y );
my $mod = $x % $y;
print "output: ", $div, " ", $mod, "\n";
output: 3 4

Logic and Equality


if / unless / elsif / else
if( TEST ) { DO SOMETHING }
elsif( TEST ) { SOMETHING ELSE }
else { DO SOMETHING ELSE IN CASE }
Equality: == (numbers) and eq (strings)
Numeric Less/Greater than: <, <=, >, >=
String (lexical) comparisons: lt, le, gt, ge
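
Choosing between the numeric and the string operators matters because the same two values can compare differently under each. A small sketch (the sample values are chosen for illustration) makes the difference visible:

```perl
use strict;
use warnings;

my $x = '1.0';
my $y = '1';

# Numeric comparison converts both strings to numbers first: 1.0 == 1.
my $numeric_equal = ( $x == $y ) ? 'yes' : 'no';
# String comparison looks at the characters: '1.0' and '1' differ.
my $string_equal = ( $x eq $y ) ? 'yes' : 'no';

print "numerically equal: $numeric_equal\n";
print "equal as strings:  $string_equal\n";

# Numeric and string ordering can disagree as well.
my $numeric_less = ( 9 < 10 )      ? 'yes' : 'no';    # 9 is smaller than 10
my $string_less  = ( '9' lt '10' ) ? 'yes' : 'no';    # but '9' sorts after '1'
print "9 < 10: $numeric_less; '9' lt '10': $string_less\n";
```

A common slip is using == on strings such as sequence names; with warnings enabled Perl complains that the argument is not numeric.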

Testing equality
my $str1 = "mumbo";
my $str2 = "jumbo";
if( $str1 eq $str2 ) {
    print "strings are equal\n";
}
if( $str1 lt $str2 ) {
    print "less\n";
} else {
    print "more\n";
}

Testing equality
my $num1 = 10;
my $num2 = 100;
if( $num1 == $num2 ) {
    print "nums are equal\n";
}
if( $num1 < $num2 ) {
    print "less\n";
} else {
    print "more\n";
}

Boolean Logic
AND: && and
OR: || or
NOT: ! not

if( $a > 10 && $a <= 20) {
    do something interesting here;
}

Loops

while( TEST ) { }
until( ! TEST ) { }
for( $i = 0 ; $i < 10; $i++ ) {}
foreach $item ( @list ) { }
for $item ( @list ) { }
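
To see the loop forms side by side, here is a short sketch that runs each one over small sample data. The variable names and data are invented for the example.

```perl
use strict;
use warnings;

my @bases = ( 'A', 'C', 'G', 'T' );
my @seen;

# while: repeat as long as the test is true.
my $i = 0;
while ( $i < @bases ) {
    push @seen, "while:$bases[$i]";
    $i++;
}

# until: repeat as long as the test is false.
my $countdown = 3;
until ( $countdown == 0 ) {
    push @seen, "until:$countdown";
    $countdown--;
}

# C-style for: initialise; test; step.
for ( my $j = 0 ; $j < 2 ; $j++ ) {
    push @seen, "for:$j";
}

# foreach: visit each item of a list in turn.
foreach my $base (@bases) {
    push @seen, "foreach:$base";
}

print join( ' ', @seen ), "\n";
```

When the goal is simply to visit every item of a list, foreach is usually the least error-prone choice because there is no counter to get wrong.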

Using logic

for( $i = 0; $i < 20; $i++ ) {
    if( $i == 0 ) {
        print "$i is 0\n";
    } elsif( $i / 2 == 0) {   # "/" is division, so this is true only for 0; "%" (modulus) tests evenness
        print "$i is even\n";
    } else {
        print "$i is odd\n";
    }
}

Using logic: subtle

for( $i = 0; $i < 20; $i++ ) {
    if( $i == 0 ) {
        print "$i is 0\n";
    } elsif( $i % 2 == 0) {
        print "$i is even\n";
    } else {
        print "$i is odd\n";
    }
}

Using logic: looping

for( $i = 0; $i < 20; $i++ ) {
    if( $i == 0 ) {
        print "$i is 0\n";
    } elsif( $i % 2 == 0) {
        print "$i is even\n";
    } else {
        print "$i is odd\n";
    }
}

Using logic: comparing

for( $i = 0; $i < 20; $i++ ) {
    if( $i == 0 ) {
        print "$i is 0\n";
    } elsif( $i % 2 == 0) {
        print "$i is even\n";
    } else {
        print "$i is odd\n";
    }
}

What is truth?
True
if( "zero" ) {}
if( 23 || -1 || ! 0) {}
$x = "0 or none"; if( $x )
False
if( 0 || undef || "" || "0" ) { }
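
Perl has no separate boolean type: the number 0, the string "0", the empty string, and undef count as false, and everything else is true. The sketch below checks a handful of values; the helper name truth_of is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: report how Perl treats a value in boolean context.
sub truth_of {
    my ($value) = @_;
    return $value ? 'true' : 'false';
}

my @tests = (
    [ 'the number 0',      0 ],
    [ 'the string "0"',    '0' ],
    [ 'the empty string',  '' ],
    [ 'undef',             undef ],
    [ 'the number 23',     23 ],
    [ 'the number -1',     -1 ],
    [ 'the string "zero"', 'zero' ],
    [ 'the string "0.0"',  '0.0' ],
);

for my $test (@tests) {
    my ( $label, $value ) = @$test;
    print "$label is ", truth_of($value), "\n";
}
```

The last case is the surprising one: the string "0.0" is true even though it is numerically zero, because only the exact strings "" and "0" are false.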

Special variables
This is why many
people dislike Perl.
Too many little silly
things to remember.
One of the trade-offs
that make it harder to
learn and ultimately
easier to use.
perldoc perlvar
for more detailed
information.

Some special variables


$! : error messages here
$, : separator when doing print @array;
$/ : record delimiter ("\n" usually)
$a,$b : used in sorting
$_ : implicit variable
perldoc perlvar for more info
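
A few of these variables in action, as a hedged sketch: the data and the file name are invented, and the file is assumed not to exist so that $! has something to report.

```perl
use strict;
use warnings;

my @codons = ( 'ATG', 'GGC', 'TAA' );

{
    local $, = '-';            # separator placed between the items print is given
    print @codons;             # the three codons joined by dashes
}
print "\n";

# $_ is the implicit loop variable when none is named.
my @lower;
for (@codons) {
    push @lower, lc;           # lc with no argument works on $_
}
print "@lower\n";

# $! holds the system error message after a failed call such as open.
my $missing = 'no-such-file.fasta';    # assumed not to exist
open( my $in, '<', $missing ) or print "Cannot open $missing: $!\n";
```

Using local inside a block keeps the change to $, from leaking into the rest of the program.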

The Implicit variable?


Implicit variable is $_
It is the last thing we were thinking about.
Examples:
for ( @list ) { print $_ };
while(<IN>) { print $_};

Some operators and built-in functions


tr///: transliteration from one group of
characters to another.
lc, lcfirst
uc, ucfirst
chomp: removes the line endings from all
elements of a list; returning the (total)
number of chars removed.
chop: chops off the last character on all
elements of a list; returns the last chopped
char.
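
A brief sketch of these functions applied to one short sequence line. The sample string is made up, and counting G and C with tr/// is just one illustration of what the operator's return value can be used for.

```perl
use strict;
use warnings;

my $line = "acgtTTgc\n";

chomp $line;                          # remove the trailing newline
my $upper = uc $line;                 # all upper case
my $first = ucfirst lc $line;         # lower case, then capitalise the first letter

# tr/// returns the number of characters it matched. With an empty
# replacement list the string is left unchanged, so this counts G and C.
my $gc_count = ( $upper =~ tr/GC// );

# tr/// can also replace characters one for one: here T becomes U.
( my $rna = $upper ) =~ tr/T/U/;

my $chopped = $upper;
my $last    = chop $chopped;          # removes and returns the last character

print "upper: $upper\n";
print "first: $first\n";
print "G+C:   $gc_count\n";
print "RNA:   $rna\n";
print "chop:  $chopped (removed $last)\n";
```

Note that chomp only removes a line ending, while chop removes whatever the last character is, so chomp is the safer choice for cleaning input lines.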

LAB: Math
% pico pi.pl
#!/usr/bin/perl -w
#
# assign values
my $num1 = 22;
my $num2 = 7;
my $result = $num1 / $num2;
# print the result
print $result;

% chmod +x pi.pl
% ./pi.pl

LAB: Math
% pico pi.pl
#!/usr/bin/perl -w
#
# assign values
my $num1 = 22;
my $num2 = 7;
my $result = int($num1 / $num2);
# print the result
print $result;

% chmod +x pi.pl
% ./pi.pl

LAB: Math
% pico pi.pl
#!/usr/bin/perl -w
#
# assign values
my $num1 = 22;
my $num2 = 7;
my $result = $num1 % $num2;
# print the result
print $result;

% chmod +x pi.pl
% ./pi.pl

LAB: break it.


What happens when?:
1. You change the operation?
2. You change the values?
3. You put the numbers in quotes?
4. Add another number and multiply the result?

LAB: Loops and logic

% pico loops-and-logic.pl
% chmod +x loops-and-logic.pl
% ./loops-and-logic.pl

#!/usr/bin/perl -w

for( $i = 0; $i < 20; $i++ ) {
    if( $i == 0 ) {
        print "$i is 0\n";
    } elsif( $i % 2 == 0) {
        print "$i is even\n";
    } else {
        print "$i is odd\n";
    }
}

LAB: Loops and logic


% ./loops-and-logic.pl
0 is 0
1 is odd
2 is even
3 is odd
4 is even
5 is odd
6 is even
7 is odd
8 is even
9 is odd
10 is even
11 is odd
12 is even
13 is odd
14 is even
15 is odd
16 is even
17 is odd
18 is even
19 is odd

LAB: Loops and logic

% pico foreach.pl
% chmod +x foreach.pl
% ./foreach.pl

#!/usr/bin/perl -w

foreach $item ( "contig", "seq", "phrap" ) {
    if( $item eq "phrap" ) {
        print "Is there a phred file for this $item file?\n";
    } elsif( $item eq "seq") {
        print "Is $item in FASTA format?\n";
    } else {
        print "$item is an unknown type.\n";
    }
}

LAB: Loops and logic

% pico foreach.pl
% chmod +x foreach.pl
% ./foreach.pl

#!/usr/bin/perl -w

my @items = ( "contig", "seq", "phrap" );

foreach $item ( @items ) {
    if( $item eq "phrap" ) {
        print "Is there a phred file for this $item file?\n";
    } elsif( $item eq "seq") {
        print "Is $item in FASTA format?\n";
    } else {
        print "$item is an unknown type.\n";
    }
}

LAB: break it.


What happens when?:
1. You change the test?
2. You change the values?
3. Test with booleans?

LAB: operators and functions

% pico transcribe.pl
% chmod +x transcribe.pl
% ./transcribe.pl

#!/usr/bin/perl -w
# Transcribing DNA into RNA
# The DNA
my $DNA = 'ACGGGAGGACGGGAAAATTACTACGGCATTAGC';
# Print the DNA onto the screen
print "Here is the starting DNA:\n\n";
print "$DNA\n\n";
# Transcribe the DNA to RNA by substituting all T's with U's.
my $RNA = $DNA;
$RNA =~ s/T/U/g;
# Print the RNA onto the screen
print "Here is the result of transcribing the DNA to RNA:\n\n";
print "$RNA\n";

LAB: break it.


What happens when?:
1. You change the case?
2. Change the case with different methods?
   (tr///, \L, \U, lc(), uc() )
3. You reverse the sequence?

If you remember nothing else


biology is hard and messy.

The key problems are social.


Together we are smarter than any
one of us.
Technology is easy by
comparison.

Parting Thoughts: an assignment.


1. Calculate the reverse complement of a DNA strand using the tr/// operation.
2. Read about file handling. (Safari on-line documentation is available.)
3. Read about Regular Expressions (regex). (Safari)
4. Find CPAN.ORG and locate a module that would be useful to you as a biologist.
5. Read about that module and email me (kunau@umn.edu) the following details:
1. Name of the module.
2. The name of the person who wrote it.
3. What it does.
4. How it would be useful to you?

Questions?

Thank You.
