HZHsu 2006

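Department of Computer Science and Information Engineering, National Chi Nan University
Master's Thesis
Encryption Methods Based upon DNA Sequences
Advisor: Professor R. C. T. Lee
Graduate Student: Hong-Zhi Hsu
June 2006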
Acknowledgements
I first thank my advisor, Professor R. C. T. Lee, for his education and training during
my graduate school years. He helped me become a more humane, compassionate, and friendly
person. In research, he also trained me to be a logical, accurate, and scrupulous student.
My most significant improvement has been in English comprehension and writing.
Secondly, I thank Professor C. H. Huang, Professor Jenn-Sheng Lin, Professor
Yue-Li Wang and Professor Chuan-Yi Tang for their suggestions on my thesis. I also
appreciate their guidance on my research.
Thirdly, I am grateful to C. C. Lin, F. L. Lin, S. C. Chen, C. Y. Yang, J. G. Chen,
Y. M. Pan, S. J. Pan, W. H. Wen, W. L. Wang, T. H. Ku, Z. H. Chen, C. S. Ou and G.
R. Zhuang for their guidance and for our discussions on research. I
am also grateful to L. C. Juan, H. J. Chen, W. H. Wen, K. H. Liao, Y. L. Chen, T. H.
Ku, Z. H. Chen, C. S. Ou, G. R. Zhuang and Z. B. Huang for sharing their knowledge of
the humanities and helping me learn more about the world.
Finally, I sincerely thank Linda, who has always supported me during these two years.
Hsu, Hong-Zhi
Computer Science and Information Engineering
National Chi-Nan University, Puli, Nantou

June, 2006
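Abstract (Chinese Version)
Title of Thesis: Encryption Methods Based upon DNA Sequences
Name of Institute: Department of Computer Science and Information Engineering, National Chi Nan University
Pages: 45
Graduation Time: June 2006 Degree Conferred: Master
Student Name: Hong-Zhi Hsu Advisor Name: Professor R. C. T. Lee
In this thesis, we propose three encryption methods that make use of some properties of
DNA sequences. We point out some interesting properties of DNA that can be used to hide
data. The three methods proposed in this thesis are the insertion method, the complementary
substring method and the substitution method. In each method, we first secretly select a
reference DNA sequence S and then incorporate the message M into S to obtain S’. We send S’
to the receiver together with other DNA sequences or DNA-like sequences. The receiver can
successfully identify which sequence hides the message M and ignore the other DNA
sequences. He can also extract the message M.
The main idea of the first method is to break M and S into many segments and insert each
segment of M between consecutive segments of S. The second method breaks M into many
segments and inserts each segment in front of a pair of complementary substrings in a DNA
sequence. We also extract a segment from S and insert it after the complementary substrings
that hide a segment of M. The third method hides the information in M by changing letters
in the sequence.
Keywords: Encryption, DNA, Complementary Substring, Complementary Rule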
Abstract
Title of Thesis: Encryption Methods Based Upon DNA Sequences
Name of Institute: Department of Computer Science and Information
Engineering, National Chi-Nan University.
Pages: 45
Graduation Time: 06/2006 Degree Conferred: Master
Student Name: Hong-Zhi Hsu Advisor Name: Richard Chia-Tung Lee
Abstract
In this thesis, we propose three encryption methods based upon some properties of DNA
sequences. We will point out that the DNA sequences possess some interesting
properties which we can utilize to hide data. The three methods proposed in this thesis
are: the insertion method, the complementary pair method and the substitution method.
For each method, we secretly select a reference DNA sequence S and incorporate the
secret message M into it such that we obtain S’. We send this S’, together with many
other DNA, or DNA-like sequences to the receiver. The receiver is able to identify the
particular sequence with M hidden in it and ignore all of the other sequences. He will
also be able to extract M.
The main idea of Method 1 is to break S and M into segments and insert each segment of M
between consecutive segments of S. Method 2 is to break M into several segments and
hide each one before a complementary pair substring in a DNA sequence.
We also extract a segment of S and insert it after each complementary pair substring that
has a segment of M hidden before it. Method 3 is to hide the secret message M by
substituting letters of S.
Keywords: Encryption, DNA, Complementary Pair Substring, Complementary Rule.
Contents
List of Figures
Chapter 1 Introduction and Related Works
1.1 A Simple Version of DNA Based Encryption Method by Using Biology Properties
1.2 The Complicated Version
1.3 Thesis Organization
Chapter 2 Method 1: The Insertion Method
2.1 Algorithms of Insertion Method
2.2 An Example of Insertion Method
2.3 Robustness of Insertion Method
2.4 Payload of Insertion Method
Chapter 3 Method 2: The Complementary Pair Approach
3.1 Algorithms of Complementary Pair Approach
3.2 Robustness of Complementary Pair Approach
3.3 Payload of Complementary Pair Approach
3.4 The Complementary Palindrome
Chapter 4 Method 3: The Substitution Approach
4.1 Algorithms of Substitution Approach
4.2 An Example of Substitution Approach
4.3 Robustness of Substitution Approach
4.4 Payload of Substitution Approach
Chapter 5 Conclusion and Future Works
5.1 Conclusions
5.2 Future Works
Bibliography
List of Figures
Figure 1-1 An Experiment of Fluorescent on DNA.
Figure 2-1 The Relations between m(s) and r’s(k’s).
Figure 3-1 An Illustration of the Complementary Pair Approach.
Chapter 1 Introduction and Related Works
In recent years, much research has been done on DNA-based encryption
schemes [C2003, CRB99, LRBR2000 and SEP2002]. Most of them use biological
properties of DNA sequences.
In this chapter, we present two such encryption methods, a simple version and a
complicated version. Both of them are based on physical DNA sequences and rely on
chemical mechanisms for decoding [C2003].
1.1 A Simple Version of DNA Based Encryption Method by Using Biology Properties
In this section, we introduce how to use biological properties to hide data. In a
DNA-based encryption scheme, we store binary data as a DNA sequence and later
hide the data in some way. There is a key, called a primer, which is used to read out the
binary data, and we will introduce how to use primers to store our data in DNA sequences.
In this method, a sender sends a DNA sequence and selected primers to the
receiver, and the receiver then uses the primers to transform the DNA sequence into
a sequence of binary data. We may use a public DNA sequence as our
reference sequence, which the receiver also knows. In that case, we only send the
selected primers to the receiver. Without the primers and the reference sequence, no one
can correctly decode the binary data.
The trick is how to store binary data in a DNA sequence. As we mentioned
before, we use selected primers to do that. Let us define a primer first. A primer
is a short string that is complementary to a substring of a DNA sequence. Suppose we
have a DNA sequence:
ATGCTTAGTTCCATCGGAGACTAATGGCCTA
and two primers:
ATCAA and GATTAC
which are complementary to the substrings TAGTT and CTAATG, respectively. We
also need to define the complementary rule. For each letter x of a DNA sequence,
c(x) denotes the complementary letter of x. A complementary rule specifies c(x) for
each of the four letters. In our case, we define the complementary rule as
follows: A-T, T-A, C-G and G-C. For instance, c(A) is T.
In biology, there is a chemical mechanism that makes primers combine with a DNA
sequence. We then use a fluorescent chemical substance to indicate where
the primers are. The fluorescent substance makes the positions where
primers combine with substrings of the DNA sequence bright. A bright section
corresponds to the binary data ‘1’ and a dark one corresponds to ‘0’. Figure 1-1 shows a
sample of this kind of chemical experiment.
Figure 1-1 An Experiment of Fluorescent on DNA.
The green sections represent the binary data ‘1’s in the sequences and the dark
sections represent the binary data ‘0’s.
In our case, the primers combine with the sequence as follows:
ATGCTTAGTTCCATCGGAGACTAATGGCCTA
     ATCAA          GATTAC
  0    1      0       1     0
Thus we finally obtain the binary sequence “01010”.
Consider another example. Suppose we have the following DNA sequence and
primers. DNA sequence:
ATGCTTAGTTCCATCGGAGACTAAT
Primers:
ATCAA, GGTAG and GATTA
The sequence will be coded as 01101 as follows:
ATGCTTAGTT CCATCGGAGACTAAT
     ATCAA GGTAG     GATTA
  0    1     1    0    1
The sender sends the DNA sequence and the selected primers to the receiver. The
receiver now uses the primers and the DNA sequence to transform it into a sequence
consisting of bright and dark parts by using the chemical mechanism as we mentioned
before. This will produce a binary stream.
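The readout described above can be imitated on strings. The following is a minimal sketch only: it models the binding of a primer as an exact match of its complement and ignores the chemistry entirely. The function names `complement` and `read_bits` are hypothetical, and the sketch assumes that the primers are listed in binding order and do not overlap.

```python
# Complementary rule from this section: A-T, T-A, C-G, G-C.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def complement(strand):
    """Return the letter-by-letter complement of a strand."""
    return "".join(COMPLEMENT[x] for x in strand)


def read_bits(sequence, primers):
    """Return 1 for each stretch covered by a primer and 0 for each gap.

    Assumption (not from the thesis): primers are given in binding order
    and each one binds at the first matching position after the previous one.
    """
    bits = []
    cursor = 0
    for primer in primers:
        site = sequence.find(complement(primer), cursor)
        if site < 0:
            raise ValueError(f"primer {primer} does not bind")
        if site > cursor:
            bits.append("0")  # dark stretch before the binding site
        bits.append("1")      # bright stretch covered by the primer
        cursor = site + len(primer)
    if cursor < len(sequence):
        bits.append("0")      # trailing dark stretch
    return "".join(bits)


print(read_bits("ATGCTTAGTTCCATCGGAGACTAATGGCCTA", ["ATCAA", "GATTAC"]))
print(read_bits("ATGCTTAGTTCCATCGGAGACTAAT", ["ATCAA", "GGTAG", "GATTA"]))
```

Under these assumptions, the two calls reproduce the binary sequences 01010 and 01101 of the two examples.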
As a real example, the following is a segment of the DNA sequence of Litmus, whose
full length is 2856 nucleotides.
ATCGAATTCGCGCTGAGTCACAATTCGCGCTGAGTCACAATTCGCGCTGAGTC
ACAATTGTGACTCAGCCGCGAATTCCTGCAGCCCCGAATTCCGCATTGCAGAG
ATAATTGTATTTAAGTGCCTAGCTCGATACAATAAACGCCATTTGACCATTCACC
ACATTGGTGTGCACCTCCAAGCTCGCGCACCGTACCGTCTCGAGGAATTCCTG
CAGGATATCTGGATCCACGAAGCTTCCCATGGTGACGTCACC
This means that a real DNA sequence can be used to store a large amount of binary data
by using short primers. Assume we use primers consisting of five nucleotides; then
there will be at most $\lceil 2856/5 \rceil = 572$ bits if we store data in the DNA
sequence of Litmus.
It should be noted that if the primers are too short, it is hard to store our data in a
DNA sequence, because a short primer is complementary to many substrings of the
sequence and may combine at an unintended position. Suppose we have to store the
binary sequence “01101” and we have the following DNA sequence:
ATCGGCTAATCGGCTAATCGGCTAATCGGCTA
Assume we choose the following short primers:
TA, CG, and CG
They may combine with the sequence as follows:
ATCGGCTAATCGGCTAATCGGCTAATCGGCTA
TA CG CG
1 0 1 0 1 0
Thus we obtain the binary code “101010”, which is not the required result.
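The ambiguity of short primers can be made concrete by listing every position at which the complement of a primer occurs. This is a small illustrative sketch; `binding_sites` is a hypothetical helper name, and exact string matching is assumed as a stand-in for the chemical binding.

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def binding_sites(sequence, primer):
    """Return every start index where the complement of the primer occurs."""
    target = "".join(COMPLEMENT[x] for x in primer)
    sites = []
    start = sequence.find(target)
    while start >= 0:
        sites.append(start)
        start = sequence.find(target, start + 1)  # allow overlapping matches
    return sites


sequence = "ATCGGCTAATCGGCTAATCGGCTAATCGGCTA"
for primer in ["TA", "CG"]:
    print(primer, binding_sites(sequence, primer))
```

Each two-letter primer has four candidate positions in this sequence, so the sender cannot control which sections become bright.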
It is essential to make sure that the selected primers correctly generate the
required binary data. In the following example, we use a segment of the DNA
sequence of Litmus to store the binary data 010111010110. We use the primers
CTTAAGCGCGA, AAGCGCGACT, CAGTGTTAAG, CGCGACT,
GGCGCTTAAGGACGT, TCTATTAACATAAATT and CACGGATCGAGCTA
The result is as follows:
ATCGAATTCGCGCTGAGTCACAATTCGCGCTGA GTCACAATTC GCGCTGA
   CTTAAGCGCGA         AAGCGCGACT CAGTGTTAAG CGCGACT
 0      1         0        1          1         1
GTCACAATTGTGACTCAGCCGCGAATTCCTGCAGCCCCGAATTCCGCATTG
                  GGCGCTTAAGGACGT
        0                1               0
CAGAGATAATTGTATTTAA GTGCCTAGCTCGATACAATAAACGCCATTTGAC
   TCTATTAACATAAATT CACGGATCGAGCTA
          1               1                0
CATTCACCACATTGGTGTGCACCTCCAAGCTCGCGCACCGTACCGTCTCGAGG
AATTCCTGCAGGATATCTGGATCCACGAAGCTTCCCATGGTGACGTCACC
To prevent an attacker from guessing the selected primers, it is better to choose longer
primers than shorter ones, since a long primer is much harder to guess.
In order to decode the message, the receiver should know (1) the complementary
rule, (2) the reference DNA sequence and (3) the selected primers. For an attacker, it is
very hard to guess the reference DNA sequence and the selected primers. There are
about 55 million publicly available DNA sequences in the world.
1.2 The Complicated Version
In the above method, a sender sends a reference DNA sequence and the selected
primers to the receiver, and the receiver decodes the binary message hidden in
the DNA sequence by using the selected primers. To make the decoding more
complicated, we could instead send only a section of each primer, together with
information that is used to recover the complete primer. This is more secure because it
would be hard for an attacker to obtain the correct primer even if he got the short section.
For instance, suppose we have a reference DNA sequence:
ATGCTTAGTTCCATCGGAGACTAAT
We select the following primers:
ATCAA, GGTAG and GATTA
In the simple version, we send the sequence and the selected primers to the
receiver. This time, we send only a partial section of each primer to the receiver. We
will now introduce how the receiver recovers the correct primers and decodes. After
getting the sections of the selected primers, the receiver can easily recover all the
correct primers and get the message.
The key point is how the receiver can recover the correct primers from
sections of the selected primers. Before we introduce the trick, we shall introduce a term
called template. A template is the complementary string of a primer. For instance, for
the primer TACCCTGATTAC, its template is ATGGGACTAATG. A template is used to
recover the original primer. There is also a chemical mechanism called PCR
(Polymerase Chain Reaction). We recover the original primer by applying PCR
to the template and a partial primer. We cut out a certain suffix of the primer, in
this case GATTAC, and put it together with the template. PCR then yields the
complete primer.
The template of the primer: ATGGGACTAATG
A partial primer: GATTAC
The primer: TACCCTGATTAC
By using the PCR strategy, we do not need to send the complete primers to the
receiver. Instead, we send the prefixes or the suffixes of the selected primers and the
template of each primer. The receiver then uses each partial primer and its
corresponding template to recover the complete primer. After he gets all the selected
primers, he can decode and get the secret message.
The following example is a complicated version of encryption by using a segment of
the DNA sequence of Litmus. Suppose we still hide the binary data 010111010110 and
we still use primers
CTTAAGCGCGA, AAGCGCGACT, CAGTGTTAAG, CGCGACT,
GGCGCTTAAGGACGT, TCTATTAACATAAATT and CACGGATCGAGCTA
The template for each primer is
GAATTCGCGCT, TTCGCGCTGA, GTCACAATTC, GCGCTGA,
CCGCGAATTCCTGCA, AGATAATTGTATTTAA and GTGCCTAGCTCGAT
We cut off a suffix of each primer and keep the remaining prefixes as the partial primers:
CTTAAGC, AAGCGCG, CAGTGTTA, CGCGAC, GGCGCTTAAG,
TCTATTAACAT and CACGGATC
We send the templates and the partial primers to the receiver. For each
corresponding template and the partial primer, the receiver uses the PCR strategy to
recover the selected primer.
For instance:
Template: GAATTCGCGCT
Partial primer: CTTAAGC
Primer: CTTAAGCGCGA
Thus the receiver will get all selected primers and thus decode the sequence as we
mentioned before.
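On strings, the effect of this recovery step is simply that the complete primer is the complement of its template, and the partial primer tells the receiver which template it belongs to. The sketch below only imitates that outcome and does not model PCR itself. `recover_primer` is a hypothetical name, and the check that the partial primer is a prefix or suffix of the result is an assumption added for illustration.

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def recover_primer(template, partial):
    """Return the complete primer for a template, checked against a partial primer."""
    primer = "".join(COMPLEMENT[x] for x in template)
    # Assumed consistency check: the partial primer must be one end of the primer.
    if not (primer.startswith(partial) or primer.endswith(partial)):
        raise ValueError("partial primer does not match this template")
    return primer


print(recover_primer("GAATTCGCGCT", "CTTAAGC"))
print(recover_primer("ATGGGACTAATG", "GATTAC"))
```

The first call uses the Litmus example above (a prefix as partial primer), and the second uses the earlier example in which the partial primer is the suffix GATTAC.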
It is essential to make sure that the templates and the partial primers are in
one-to-one correspondence. Suppose we have the following partial primers:
CAGGT and CAG
Both begin with CAG, whose complement is GTC. A template whose prefix is GTC may
therefore match either partial primer, and the receiver would not know which partial
primer he should use with it.
For this complicated version of the encryption, in order to decode the message, a receiver
should know (1) the complementary rule, (2) the partial primers and (3) the templates of
the selected primers.
1.3 Thesis Organization
The thesis is organized as follows. In Chapter 2, we propose the first method, the
Insertion Method. The main idea of Method 1 is to divide the secret message and the
reference DNA sequence into segments and then interleave the segments of the secret
message with those of the reference sequence. The robustness and the payload of Method 1
are also discussed in that chapter. We propose Method 2, the Complementary Pair
Approach, in Chapter 3. We insert segments of the secret message before pairs of
complementary substrings. The robustness and the payload of Method 2 are also discussed
in that chapter. Method 3, the Substitution Approach, is proposed in Chapter 4. We change
letters of the sequence according to our conditions to hide the secret message. The
robustness and the payload of Method 3 are discussed in Chapter 4. Finally, conclusions
and future works are given in Chapter 5.
Chapter 2 Method 1: The Insertion Method
In this chapter, we shall point out that the DNA sequences possess some interesting
properties which we can utilize to hide data. We present three methods in the following
three chapters, the insertion method, the complementary pair method and the substitution
method. For each method, we secretly select a reference DNA sequence S and
incorporate the secret message M into it such that we obtain S’. We send this S’,
together with many other DNA, or DNA-like sequences to the receiver. The receiver is
able to identify the particular sequence with M hidden in it and ignore all of the other
sequences. He will also be able to extract M.
In this chapter, we present the first method, which does not make use of the
biological properties. Instead, we use the following properties of DNA sequences.
A DNA sequence is a sequence consisting of four letters: A, C, G and T. Each
letter corresponds to a nucleotide. A DNA sequence is usually quite long. For instance,
the following are two DNA sequences.
The first one is a segment of the DNA sequence of Litmus, whose full length is 2856
nucleotides:
ATCGAATTCGCGCTGAGTCACAATTCGCGCTGAGTCACAATTCGCGCTGAGTC
ACAATTGTGACTCAGCCGCGAATTCCTGCAGCCCCGAATTCCGCATTGCAGAG
ATAATTGTATTTAAGTGCCTAGCTCGATACAATAAACGCCATTTGACCATTCACC
ACATTGGTGTGCACCTCCAAGCTCGCGCACCGTACCGTCTCGAGGAATTCCTG
CAGGATATCTGGATCCACGAAGCTTCCCATGGTGACGTCACC.
The second one is a segment of the DNA sequence of Balsaminaceae, whose full length is
2283 nucleotides:
TTTTTATTATTTTTTTTCATTTTTTTCTCAGTTTTTAGCACATATCATTACATTTTA
TTTTTTCATTACTTCTATCATTCTATCTATAAAATCGATTATTTTTATCACTTATTTT
TCTAATTTCCAATATTTCATCTAATGATTATATTACATTAAAGAAATCGGTTAAAA
GCGACTAAAAATCAATCTGGAACAAGGCTTAGTTTATTTAATATATTATTTTATG
TAATTTCTATTGAAAAATTAGTTAAAAGGCAAGTATTTGAGAT.
We can now easily see one special property of DNA sequences: There is almost no
difference between a real DNA sequence and a faked one. This is a property which we
shall exploit in our research.
There is another fact which is quite useful to us: There are a large number of DNA
sequences publicly available in various web-sites. A rough estimation would put the
number of DNA sequences publicly available to be around 55 million.
By using the above facts, we designed three DNA based encryption methods. All
of these methods would secretly select a reference sequence S from publicly available
DNA sequences. Only the sender and the receiver are aware of this reference sequence.
The sender would transform this selected DNA sequence S into a new sequence S’ by
incorporating the DNA sequence S with the secret message M. This transformed
sequence S’ is sent by a sender to the receiver together with many other DNA sequences.
The receiver would then examine all of the received sequences, identify S’ and recover
the secret message M.
We shall introduce three methods in the following chapters. For all of these
methods, we assume that the sender and the receiver share two schemes
which are kept secret. The first one is a binary coding scheme which transforms
the letters A, C, G and T into binary codes and vice versa. For instance, the following
may be a binary coding scheme: ((A 00) (C 01) (G 10) (T 11)). It should be noted that
more digits may be used. The second scheme is a complementary pair rule. That is,
we assign each letter x a complement, denoted as C(x). The following may
be such a rule: ((A C) (C G) (G T) (T A)).
We also assume that the secret message M is a binary sequence.
2.1 Algorithms of Insertion Method
To simplify the discussion, we start with the most basic version and give a simple
example. The more complicated version of our method will be presented after this basic
one is given. Suppose the secret message M is 01001100. Let S be
ACGGTTCCAATGC. Our coding steps are as follows:
Step 1. We first code S into a binary sequence by using the binary coding scheme.
Thus the sequence S will now become 00011010111101010000111001.
Step 2. Divide the binary form of S into segments whereby each segment contains k bits.
Suppose k is 3. Then we have the following segments: 000, 110, 101, 111, 010, 100,
001, 110, 01.
Step 3. Insert the bits of M, one at a time, at the beginning of the segments of S. The
result is as follows: 0000, 1110, 0101, 0111, 1010, 1100, 0001, 0110, 01. We
ignore those segments without any secret message bit inserted. Thus, we
have the following segments: 0000, 1110, 0101, 0111, 1010, 1100, 0001,
0110. Concatenating the above segments, we have the following binary
sequence: 00001110010101111010110000010110.
Step 4. We use the binary coding scheme to produce the following faked DNA sequence:
S’=AATGCCCTGGTAACCG. As the reader can see, this sequence is quite
different from S.
Step 5. We send the above sequence S’ to the receiver, amid many other irrelevant
sequences.
The above process is the encryption process. It is easy to see that the decryption
process is just to reverse the encryption process. For every received sequence T, the
receiver extracts a sequence out of it. If the extracted sequence is not a prefix of the
reference sequence S, ignore T. If it is, the receiver knows that he has also successfully
extracted the secret message M as a by-product.
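The basic version can be sketched in a few lines. This is an illustration under stated assumptions, not the author's implementation: the function names are hypothetical, the binary coding ((A 00) (C 01) (G 10) (T 11)) is the one used in the example, only full k-bit segments of S are used, and the sketch simply rejects inputs whose result has an odd number of bits.

```python
CODE = {"A": "00", "C": "01", "G": "10", "T": "11"}
DECODE = {bits: base for base, bits in CODE.items()}


def to_bits(dna):
    return "".join(CODE[x] for x in dna)


def to_dna(bits):
    return "".join(DECODE[bits[i:i + 2]] for i in range(0, len(bits), 2))


def encrypt_basic(reference, message, k):
    """Insert one message bit in front of each k-bit segment of the reference."""
    s_bits = to_bits(reference)
    if len(message) * k > len(s_bits):
        raise ValueError("reference sequence is too short for this message")
    out = "".join(bit + s_bits[i * k:(i + 1) * k] for i, bit in enumerate(message))
    if len(out) % 2:
        raise ValueError("result has an odd number of bits")  # assumption of this sketch
    return to_dna(out)


def decrypt_basic(candidate, reference, k):
    """Return the hidden message, or None if the candidate does not carry one."""
    bits = to_bits(candidate)
    if len(bits) % (k + 1):
        return None
    chunks = [bits[i:i + k + 1] for i in range(0, len(bits), k + 1)]
    prefix = "".join(chunk[1:] for chunk in chunks)
    if not to_bits(reference).startswith(prefix):
        return None  # extracted sequence is not a prefix of S: ignore candidate
    return "".join(chunk[0] for chunk in chunks)


print(encrypt_basic("ACGGTTCCAATGC", "01001100", 3))
print(decrypt_basic("AATGCCCTGGTAACCG", "ACGGTTCCAATGC", 3))
```

With the values of the example above (S = ACGGTTCCAATGC, M = 01001100, k = 3), the sketch yields S’ = AATGCCCTGGTAACCG and recovers M from it.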
The above is the basic version of our approach. In a more complicated version, we
divide S into segments by using a random number generator. That is, k is not fixed any
more. Instead, it is determined by a random number generator which is known only to
the sender and the receiver. Suppose the k’s are 6, 3, 2, 4. Then S is divided into
segments with lengths 6, 3, 2 and 4 respectively. Note that there is also a secret message
M. We may therefore also use the same random number generator to divide M into
segments. The parameter used will be denoted as r.
The following is the exact algorithm for Method 1.
Algorithm 2.1 Encryption Algorithm of Method 1 (Insertion Method)
Input: A reference DNA sequence S, a secret binary message M and a binary
coding scheme to code A, C, G and T into binary digits.
Output: An encrypted DNA sequence S’.
Step 1. Code S into a binary sequence S1 by using the binary coding scheme.
Step 2. Generate k’s by using a random number generator to divide S1 into
segments and generate r’s to divide the secret message M into
segments. Each k_i and r_i is at least 1. Denote the segments of S1
by s_1 s_2 … s_n and the segments of M by m_1 m_2 … m_p.
Step 3. Insert each m_i of M before s_i of S1 to produce a new binary sequence.
Delete s_{p+1} s_{p+2} … s_n. Denote the resulting binary sequence by S2.
Step 4. Transform sequence S2 back to a faked DNA sequence S3 by using the
same binary coding scheme used in Step 1.
Step 5. Return S3 as S’.
S3 is sent to the receiver together with many other DNA sequences. The receiver
uses the following algorithm to decrypt.
Algorithm 2.2 Decryption Algorithm for Method 1
Input: A set of DNA sequences, one of which has the secret message M
hidden in it by using Algorithm 2.1, a reference DNA sequence S and
a binary coding scheme.
Output: The secret message M.
Step 1. Generate the numbers k_1, k_2, …, k_n and r_1, r_2, …, r_p by
using the same random number generator and the same seeds as in the
encryption.
Step 2. Take the next DNA sequence S’ of the set, code S’ into a binary sequence by
using the binary coding scheme used by the sender and use r_1+k_1, r_2+k_2, … to
divide the binary sequence into binary segments.
Step 3. For each of the first p segments of S’, extract the first r_i bits,
called m_i.
Step 4. For each of the first p segments of S’, extract the last k_i bits,
called s_i.
Step 5. Concatenate all m_i’s to be M and all s_i’s to be S1.
Step 6. Transform S1 into a DNA sequence by using the same binary coding scheme.
If S1 is not a prefix of S, go to Step 2.
Step 7. Return M.
For an intruder to find out the secret message, he must know the following. (1)
He must know precisely the reference DNA sequence S. Since there are roughly 55
million DNA sequences available publicly, it is extremely hard to guess one. (2) He
has to know the random number generator and the two seeds used. (3) He has to know
the binary coding scheme.
2.2 An Example of Insertion Method
In this section, we show a complete example of the complicated version of
Method 1.
Our reference DNA sequence is the segment of the DNA sequence of Litmus
mentioned earlier in this chapter.
S=ATCGAATTCGCGCTGAGTCACAATTCGCGCTGAGTCACAATTCGCGCTGAGT
CACAATTGTGACTCAGCCGCGAATTCCTGCAGCCCCGAATTCCGCATTGCAGA
GATAATTGTATTTAAGTGCCTAGCTCGATACAATAAACGCCATTTGACCATTCAC
CACATTGGTGTGCACCTCCAAGCTCGCGCACCGTACCGTCTCGAGGAATTCCT
GCAGGATATCTGGATCCACGAAGCTTCCCATGGTGACGTCACC
Our secret message M=0110100101001100101001010101001010100110101010.
The binary coding rule is as follows:
A:00 C:01 G:10 T:11
Suppose we have two random number sequences by using a random number generator.
K=3, 4, 7, 6, 1, 5, 3, 7, 6, 4.
R=7, 9, 11, 3, 5, 8, 4, 12, 7, 10.
We transform S to be the following binary sequence:
0011011000001111011001100111100010110100010000111101100110011110001011010
0010000111101100110011110001011010001000011111011100001110100100101100110
0000111101011110010010010101011000001111010110010011111001001000100011000
0111110110011111100001011100101110010011101100011000100001100000001100101
0011111110000101001111010001010001001111101011101110010001011101010000100
1110110011001000101101100010110110111011000101000001111010111100100101000
1100110111101000110101000110000010011111010101001110101110000110110100010
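Divide the secret message M by using the sequence K. (Note that, in this example, K
divides M and R divides S, which is the reverse of the naming in Algorithm 2.1.) We get
the following segments:
011, 0100, 1010011, 001010, 0, 10101, 010, 0101010, 011010, 1010.
Divide the binary form of the reference sequence S by using the sequence R, taking only
as many segments as there are segments of M. We get the following segments: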
0011011, 000001111, 01100110011, 110, 00101, 10100010, 0001, 111011001100,
1111000, 1011010001.
Insert each segment of M before each of S, thus we will get the following segments:
0110011011, 0100000001111, 101001101100110011, 001010110, 000101,
1010110100010, 0100001, 0101010111011001100, 0110101111000, 10101011010001.
We combine the above segments and transform it to be a DNA sequence by using
the binary coding rule.
S’=CGCGTCAAACTTCATCGCGCGCCCGACCGGTCACAGACCCCCTCGCGATCC
TGAGGGTCAC
Decryption uses the sequences K and R to separate the segments of M from the
segments of the prefix of S, and thus recovers the secret message.
We first transform S’ into a binary sequence by using the binary coding rule.
The sequences are K=3, 4, 7, 6, 1, 5, 3, 7, 6, 4 and R=7, 9, 11, 3, 5, 8, 4, 12, 7, 10. We
extract the segments by using these two sequences. In our case, k1 is 3 and r1 is 7, so
we extract the first 10 bits of S’; the first 3 bits are the first segment of M and the
remaining 7 bits are the first segment of S. By continuing to extract the segments of M
in this way, we finally obtain M.
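The worked example of this section can be replayed with a short sketch. It is an illustration under assumptions rather than the author's code: the function and parameter names are hypothetical, the two length sequences are passed in directly instead of being drawn from a seeded random number generator, and only the first 53 letters of the Litmus segment are used as the reference, which is enough for this message.

```python
CODE = {"A": "00", "C": "01", "G": "10", "T": "11"}
DECODE = {bits: base for base, bits in CODE.items()}


def to_bits(dna):
    return "".join(CODE[x] for x in dna)


def to_dna(bits):
    return "".join(DECODE[bits[i:i + 2]] for i in range(0, len(bits), 2))


def encrypt(reference, message, msg_lengths, ref_lengths):
    """Interleave message segments with segments of the reference prefix."""
    s_bits = to_bits(reference)
    if sum(msg_lengths) != len(message) or sum(ref_lengths) > len(s_bits):
        raise ValueError("segment lengths do not fit the inputs")
    out, m_pos, s_pos = [], 0, 0
    for m_len, s_len in zip(msg_lengths, ref_lengths):
        out.append(message[m_pos:m_pos + m_len])  # message segment first
        out.append(s_bits[s_pos:s_pos + s_len])   # then reference segment
        m_pos, s_pos = m_pos + m_len, s_pos + s_len
    return to_dna("".join(out))


def decrypt(candidate, reference, msg_lengths, ref_lengths):
    """Return the hidden message, or None if the candidate should be ignored."""
    bits = to_bits(candidate)
    if len(bits) != sum(msg_lengths) + sum(ref_lengths):
        return None
    message, prefix, pos = [], [], 0
    for m_len, s_len in zip(msg_lengths, ref_lengths):
        message.append(bits[pos:pos + m_len])
        pos += m_len
        prefix.append(bits[pos:pos + s_len])
        pos += s_len
    if not to_bits(reference).startswith("".join(prefix)):
        return None
    return "".join(message)


S = "ATCGAATTCGCGCTGAGTCACAATTCGCGCTGAGTCACAATTCGCGCTGAGTC"  # start of the Litmus segment
M = "0110100101001100101001010101001010100110101010"
K = [3, 4, 7, 6, 1, 5, 3, 7, 6, 4]     # divides M in this example
R = [7, 9, 11, 3, 5, 8, 4, 12, 7, 10]  # divides S in this example

S_prime = encrypt(S, M, K, R)
print(S_prime)
print(decrypt(S_prime, S, K, R) == M)
```

In practice the sender and the receiver would regenerate K and R from shared seeds; passing them explicitly keeps the sketch independent of any particular random number generator.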
2.3 Robustness of Insertion Method
In this section, we show the strength of Method 1. By our definition, the
strength of an encryption method is measured by the probability that an attacker guesses
correctly without any information. We will count how many possible solutions there are
if the attacker does not know any information and then calculate this probability.
Recall that, in the complicated version, k is determined by a random number generator
which is known only to the sender and the receiver. Suppose the k’s are 6, 3, 2, 4. Then
S is divided into segments with lengths 6, 3, 2 and 4 respectively. The same random
number generator is also used to divide the secret message M into segments; the
parameter used is denoted as r.
For an intruder to find out the secret message, he must know the following. (1)
He must know precisely the reference DNA sequence S. Since there are roughly 55
million DNA sequences available publicly, it is extremely hard to guess one. (2) He
has to know the random number generator and the two seeds used. (3) He has to know
the binary coding scheme.
For (1): There are roughly 55 million DNA sequences available publicly. Thus,
the probability of an attacker’s successful guessing is $\frac{1}{55000000}$ (2a).
For (2): Suppose we have a sequence S’ after the encryption stage. The size of S’
is n’. S’ is composed of the secret message M and a prefix of the reference
sequence S. We denote the size of M by m and the size of the prefix of S by s. It is
hard for an attacker to know m and s.
An attacker should therefore guess the sizes m and s first. We know that
$m + s = n'$ with $m, s, n' \ge 1$, so there are
$\binom{2+n'-2-1}{n'-2} = \binom{n'-1}{n'-2} = n'-1$ possibilities. For
instance, assume n’=10; there will be
$\binom{2+10-2-1}{10-2} = \binom{9}{8} = 9$ possibilities as follows:
m=1, s=9
m=2, s=8
m=3, s=7
m=4, s=6
m=5, s=5
m=6, s=4
m=7, s=3
m=8, s=2
m=9, s=1
The probability of an attacker’s successful guessing of m and s is $\frac{1}{n'-1}$ (2b).
This alone is not enough for the attacker to decode.
The attacker must still determine how m and s are divided. The
problem is that the attacker does not know how many r’s and k’s are used to
break the secret message and the reference sequence S, nor their values. He only knows
that the r’s sum to m and the k’s sum to s. The following figure indicates the relations
between m(s) and r’s(k’s).
[Figure: a prefix of S with size s, divided into segments of lengths k1, k2, k3, k4; the secret message with size m, divided into segments of lengths r1, r2, r3, r4.]
Figure 2-1 The Relations between m(s) and r’s(k’s).
Notice that it is hard for an attacker to know how many k’s and r’s we use. Thus he
should try two k’s(r’s), three k’s(r’s), four k’s(r’s) and so on.
For s, there may be the following cases:
$k_1 = s$, $k_1 \ge 1$
$k_1 + k_2 = s$, $k_1, k_2 \ge 1$
$k_1 + k_2 + k_3 = s$, $k_1, k_2, k_3 \ge 1$
$k_1 + k_2 + k_3 + k_4 = s$, $k_1, k_2, k_3, k_4 \ge 1$
$\vdots$
$k_1 + k_2 + k_3 + \cdots + k_s = s$, $k_1, k_2, k_3, \ldots, k_s \ge 1$
The number of solutions of the first formula is
$\binom{1+s-1-1}{s-1} = \binom{s-1}{s-1} = 1$. The number
of solutions of the second one is
$\binom{2+s-2-1}{s-2} = \binom{s-1}{s-2} = s-1$. The number of solutions of
the third one is $\binom{3+s-3-1}{s-3} = \binom{s-1}{s-3}$. The number of solutions of
the fourth one is $\binom{4+s-4-1}{s-4} = \binom{s-1}{s-4}$. The number of solutions of
the last one is $\binom{s+s-s-1}{s-s} = \binom{s-1}{0} = 1$.
The above numbers can be summarized as the following formula:
$$\binom{s-1}{s-1} + \binom{s-1}{s-2} + \binom{s-1}{s-3} + \cdots + \binom{s-1}{0} = \sum_{i=0}^{s-1} \binom{s-1}{s-1-i}$$
The above formula is not very informative in this form, so we simplify it by using the
Binomial Theorem. The Binomial Theorem:
$$(x+y)^n = \sum_{i=0}^{n} \binom{n}{i} x^i y^{n-i}, \quad n, i \ge 0$$
Assume x and y are equal to 1; we will have the following result:
$$(x+y)^n = \sum_{i=0}^{n} \binom{n}{i} x^i y^{n-i} = 2^n$$
Consider the following equation:
$$(x+y)^{s-1} = \binom{s-1}{s-1} x^{s-1} + \binom{s-1}{s-2} x^{s-2} y + \binom{s-1}{s-3} x^{s-3} y^2 + \cdots + \binom{s-1}{0} y^{s-1} = \sum_{i=0}^{s-1} \binom{s-1}{s-1-i} x^{s-1-i} y^i$$
Assume x and y are equal to 1; we will have the following result:
$$(x+y)^{s-1} = \sum_{i=0}^{s-1} \binom{s-1}{s-1-i} x^{s-1-i} y^i = 2^{s-1}$$
Thus the probability of an attacker’s successful guessing in this stage is $\frac{1}{2^{s-1}}$ (2c).
Similarly, the probability of an attacker’s successful guessing for m in this stage is $\frac{1}{2^{m-1}}$ (2d).
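The count behind (2c) can be checked by brute force. The following is a minimal sketch, not part of the original algorithms: the helper names `compositions` and `count_by_parts` are hypothetical, and the enumeration simply places cut points in the s-1 gaps between s unit cells.

```python
from itertools import combinations


def compositions(s):
    """Yield every way to write s as an ordered sum of integers >= 1."""
    # A choice of r cut points among the s-1 inner gaps gives r+1 parts.
    for r in range(s):
        for cuts in combinations(range(1, s), r):
            bounds = (0, *cuts, s)
            yield tuple(b - a for a, b in zip(bounds, bounds[1:]))


def count_by_parts(s):
    """Map the number of parts j to the number of solutions with j parts."""
    counts = {}
    for comp in compositions(s):
        counts[len(comp)] = counts.get(len(comp), 0) + 1
    return counts


if __name__ == "__main__":
    s = 5
    print(count_by_parts(s))             # one entry per number of k's
    print(len(list(compositions(s))))    # total number of cases for s
```

For a given s, the per-part counts are the binomial coefficients listed in the derivation, and their total is the denominator of (2c).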
Let us consider the following case: suppose an attacker guesses the number of k’s. He thereby also guesses the number of r’s because, by our algorithm, the number of k’s must be equal to the number of r’s. Thus the probability in this step should be one of (2c) and (2d).
For (3): The number of binary coding rules is $4 \times 3 \times 2 \times 1 = 24$. The probability of an attacker’s successful guessing in this stage is $\frac{1}{24}$ (2e).
Finally, the probability of an attacker’s successful guessing of Method 1 is the product of (2a), (2b), (2c), (2d) and (2e). The probability is as follows:
Lemma 2-1
Suppose n’ is the length of S’, and s and m are the lengths of the prefix of S used in the encoding scheme and of the secret message, respectively. The probability of an attacker’s successful guessing of the Insertion Method is
$$\frac{1}{55000000} \times \frac{1}{n'-1} \times \left(\frac{1}{2^{s-1}} \times \frac{1}{2^{m-1}}\right) \times \frac{1}{24}.$$
Since this probability is extremely small, we conclude that Method 1 is safe.
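To get a feeling for the size of the bound in Lemma 2-1, the product can be evaluated exactly. This is only an illustrative sketch: the function name and the sample values n’=11, s=4, m=3 are hypothetical, and the code multiplies both (2c) and (2d), as the lemma does.

```python
from fractions import Fraction


def insertion_guess_probability(n_prime, s, m,
                                num_sequences=55_000_000,
                                coding_rules=24):
    """Evaluate the product (2a) x (2b) x (2c) x (2d) x (2e) exactly."""
    return (Fraction(1, num_sequences)      # (2a) guessing the reference sequence
            * Fraction(1, n_prime - 1)      # (2b)
            * Fraction(1, 2 ** (s - 1))     # (2c) ways to break up the prefix of S
            * Fraction(1, 2 ** (m - 1))     # (2d) ways to break up the message
            * Fraction(1, coding_rules))    # (2e) binary coding rule


if __name__ == "__main__":
    # Hypothetical toy sizes, far smaller than any realistic message.
    print(insertion_guess_probability(n_prime=11, s=4, m=3))
```

Even for these toy sizes the denominator is already in the hundreds of billions, and it doubles with every extra unit of s or m.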
2.4 Payload of Insertion Method
In essence, our encryption method is a data hiding scheme, and the payload should be considered in any data hiding scheme. Let us define the payload first. The payload of a data hiding scheme is the number of extra bits that the medium occupies after the data is hidden in it. For instance, suppose the length of the binary sequence of our reference sequence S is n. If the length becomes n+s after the data is hidden, the payload is s. It is important to consider the payload of a data hiding scheme because if the payload is very large, it may take too much time to transmit the encrypted data. Thus, an attacker may have enough time to decode the encrypted data.
Let us consider the payload of Method 1. Suppose the length of the secret message M is m and the length of the binary sequence of S is n. In our approach, we may omit a suffix of S after M is hidden; suppose the length of this suffix is s.
Lemma 2-2:
Suppose n is the length of S, s is the length of the suffix of S omitted by the encoding scheme and m is the length of the secret message. The payload of the Insertion Method is (n-s+m)-n = m-s.
Let us consider an example.
If m is 46, n is 564 and the length of S’ is 122, the payload is 122-564 = -442.
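As a consistency check of the example against Lemma 2-2, the arithmetic can be written out. This assumes that s denotes the length of the omitted suffix of S, so that the length of S’ is n-s+m; the value of s is inferred from the three lengths stated in the example and is not given in the text.

```latex
\begin{align*}
|S'| &= n - s + m \;\Longrightarrow\; s = n + m - |S'| = 564 + 46 - 122 = 488,\\
\text{payload} &= |S'| - n = m - s = 46 - 488 = -442.
\end{align*}
```

A negative payload means that the transmitted sequence is shorter than the reference sequence.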
Chapter 3 Method 2: The Complementary Pair Approach
In this chapter, we shall illustrate Method 2, the complementary pair approach. In RNA sequences, we often have the so-called base pairs [BG95, LNC2000 and WHRS87]. We cannot elaborate on the detailed meaning of the base pairs of RNA in this thesis. For our purposes, we may just define our own complementary pairs. That is, for each letter, we assign a unique counterpart. For instance, we may have the following base pair rule:
(A T) (C A) (G C) (T G).
Then the complementary sequence of AATGC will be TTGCA.
In the sequence “ATCTGAATGCTTGTCTATTGCATCAAT”, the complementary substrings AATGC and TTGCA occur. To find the longest complementary substrings, we may use a dynamic programming approach [A2000]. Let us assume that the secret message M has an even number of bits. Note that it is always reasonable to assume so because we may always use an even number of bits to code.
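The notion of complementary substrings can be made concrete with a short sketch. Note that this is a plain brute-force search for pairs of a fixed length k, used here only for illustration in place of the dynamic programming approach of [A2000]; the function names are hypothetical.

```python
# Base pair rule from the text: (A T) (C A) (G C) (T G).
RULE = {"A": "T", "C": "A", "G": "C", "T": "G"}


def complement(seq, rule=RULE):
    """Replace every letter by its counterpart under the rule."""
    return "".join(rule[ch] for ch in seq)


def find_complementary_pairs(seq, k, rule=RULE):
    """Return all (i, j) with i + k <= j such that seq[j:j+k] is the
    complement of seq[i:i+k] (0-based start positions)."""
    pairs = []
    for i in range(len(seq) - 2 * k + 1):
        target = complement(seq[i:i + k], rule)
        j = seq.find(target, i + k)
        while j != -1:
            pairs.append((i, j))
            j = seq.find(target, j + 1)
    return pairs


if __name__ == "__main__":
    print(complement("AATGC"))
    print(find_complementary_pairs("ATCTGAATGCTTGTCTATTGCATCAAT", 5))
```

The search is quadratic in the worst case, which is acceptable for short illustrative sequences but is not a substitute for an efficient algorithm on long ones.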
3.1 Algorithms of Complementary Pair Approach
To give an example, assume that M=0110. Again, as in Method 1, we assume that
we have a reference sequence S. Let us assume it to be S=ACGGTCGTTCCCTAGTTG.
Our Method 2 will work as follows:
Step 1. Artificially generate a sequence without any complementary substrings and consisting of A, C, G and T only. Assume that the sequence is L=ACGGTCTCATCAATGCTTCAGT.
Step 2. Divide M into segments such that each segment contains an even number of bits. Thus, in our case, we have 01 and 10. We code 01 and 10 according to some coding rule. By using the coding rule given in the previous section, 01 and 10 will be coded as C and G respectively.
Step 3. Generate two pairs of complementary strings with length k and insert them into L. Assume k=5 and we have the following two pairs of complementary strings: (AAGCT TTCAG) and (ACCTG TAAGC). The sequence L now becomes L’= ACG AAGCT GTCT TTCAG CAT ACCTG CAAT TAAGC GCTTCAGT.
Step 4. Insert the first (second) letter of the coded secret message one letter before the first (second) complementary string. The string becomes: L’= ACCG AAGCT GTCT TTCAG CAGT ACCTG CAAT TAAGC GCTTCAGT.
Step 5. Use a random number generator to select two positive integers j1 and j2. Assume j1=2 and j2=4. Insert substrings S[j1,j1+4]=CGGTC and S[j2,j2+4]=GTCGT one letter after the complements TTCAG and TAAGC, respectively. L’ becomes: L’’=ACCGAAGCTGTCTTTCAGCCGGTCAGTACCTGCAATTAAGCGGTCGTCTTCAGT.
Step 6. Return L’’.
The sender sends L’’ together with many other sequences, some with complementary pair substrings, to the receiver.
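The six steps of this example can be sketched in a few lines of code. This is a simplified illustration under stated assumptions, not the general algorithm: the insertion offsets of the complementary strings are passed in explicitly rather than generated, the function name `embed` is hypothetical, and the coding of 00 as A and 11 as T is assumed (only 01 as C and 10 as G appear in the example).

```python
# Complementary rule from the text: (A T) (C A) (G C) (T G).
RULE = {"A": "T", "C": "A", "G": "C", "T": "G"}
# 01 -> C and 10 -> G come from the example; 00 -> A and 11 -> T are assumed.
CODE = {"00": "A", "01": "C", "10": "G", "11": "T"}


def complement(seq):
    """Replace every letter by its counterpart under RULE."""
    return "".join(RULE[ch] for ch in seq)


def embed(cover, s_ref, bits, strings, offsets, js, sub_len=5):
    """Hide `bits` in `cover` following Steps 2 to 5 of the example.

    strings[g] is a_g; offsets[g] = (p, q) are 0-based offsets in `cover`
    where a_g and its complement a_g' are inserted; js[g] is the 1-based
    start in `s_ref` of the substring used for authentication.
    """
    letters = [CODE[bits[i:i + 2]] for i in range(0, len(bits), 2)]
    inserts = []
    for letter, a, (p, q), j in zip(letters, strings, offsets, js):
        inserts.append((p - 1, letter))                         # one letter before a_g
        inserts.append((p, a))                                  # a_g itself
        inserts.append((q, complement(a)))                      # a_g'
        inserts.append((q + 1, s_ref[j - 1:j - 1 + sub_len]))   # one letter after a_g'
    out, prev = [], 0
    for offset, text in sorted(inserts, key=lambda item: item[0]):
        out.append(cover[prev:offset])
        out.append(text)
        prev = offset
    out.append(cover[prev:])
    return "".join(out)


if __name__ == "__main__":
    L = "ACGGTCTCATCAATGCTTCAGT"
    S = "ACGGTCGTTCCCTAGTTG"
    print(embed(L, S, "0110", ["AAGCT", "ACCTG"], [(3, 7), (10, 14)], [2, 4]))
```

With the offsets (3, 7) and (10, 14), which correspond to the insertion points used in Step 3, the sketch reproduces the sequence obtained in Step 5.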
The receiver processes every sequence received by finding all of its complementary substrings. If they are of the correct lengths, the receiver then checks whether the prescribed substrings of S are hidden correctly. If they are, the receiver may now extract the secret message.
The above description of Method 2 is a simplified version. A more complicated version may use complementary pair substrings with varying lengths. As we indicated above, the sender sends L’’ together with many other sequences which may contain complementary pair substrings. Those complementary pair substrings will have lengths different from the specified ones and thus will be ignored.
In the following, we shall present the algorithms for Method 2. The algorithms are based upon the complementary pair concept.
Algorithm 3.1 Encryption Algorithm for Method 2
(Complementary Pair Approach)
Input: A reference DNA sequence S, a secret binary message M, a binary coding scheme to code A, C, G and T into binary digits and a complementary pair rule.
Output: An encrypted DNA sequence L’’.
Step 1. Artificially generate a DNA-like sequence L=l1l2…ls with length s and without any complementary substrings. Use a random number generator to generate k1, k2, …, ks. Generate a set A={a1a1’, a2a2’, …, anan’} of complementary strings, where n<<s and each ai has length ki.
Step 2. Divide M into segments such that each segment may be coded by the binary coding scheme. Code these segments as “A”, “C”, “G” and “T” by using the same coding rule. Suppose M=m1m2…mp after coding.
Step 3. For each ag and ag’, insert them after lg+3 and lg+k+3 respectively, 1 ≤ g ≤ p.
Step 4. For each pair of complementary strings ag and ag’ in L, insert mg one letter before ag. Thus, L becomes a new sequence, called L’.
Step 5. Randomly select positive integers j1, j2, …, jp, 0 ≤ j1, j2, …, jp ≤ n − 4. For each ag’ in L’, insert S[jg,jg+4] one letter after ag’. Thus, L’ becomes a new sequence, called L’’.
Step 6. Return L’’.
Algorithm 3.2 Decryption Algorithm of Method 2
(Complementary Pair Approach)
Input: A set of DNA sequences, one of which has the secret message M hidden in it by using Algorithm 3.1, a reference DNA sequence S, a binary coding scheme and the complementary pair rule used in Algorithm 3.1.
Output: Secret message M.
Step 1. For the next DNA sequence L’’ in the set, use the dynamic programming strategy to find all longest complementary substrings. If the substrings are not of the correct lengths or no such substrings are found, ignore L’’ and repeat Step 1 with the next sequence.
Step 2. Select positive integers j1, j2, …, jp by using the same random number generator used in Algorithm 3.1. For each pair of complementary substrings ag and ag’, check whether the substring with length 5 starting one letter after ag’ is the same as S[jg,jg+4] or not. If they are not the same, ignore L’’ and go to Step 1.
Step 3. For each pair of complementary substrings ag and ag’, extract the letter located one letter before ag, called mg.
Step 4. Concatenate all mg’s and decode them by the binary coding scheme to obtain M.
Step 5. Return M.
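The decryption steps can be sketched as follows for the fixed-length case. This is a simplified illustration, not the algorithm itself: it replaces the dynamic programming search by a brute-force scan, assumes that the receiver already knows k and the integers j1, …, jp, uses the hypothetical name `extract`, and assumes the decoding of A as 00 and T as 11.

```python
RULE = {"A": "T", "C": "A", "G": "C", "T": "G"}
# C -> 01 and G -> 10 follow the example; A -> 00 and T -> 11 are assumed.
DECODE = {"A": "00", "C": "01", "G": "10", "T": "11"}


def complement(seq):
    return "".join(RULE[ch] for ch in seq)


def extract(seq, s_ref, js, k=5, sub_len=5):
    """Return the hidden bits, or None if `seq` fails the checks."""
    bits = []
    for j in js:
        expected = s_ref[j - 1:j - 1 + sub_len]   # S[j, j+4], 1-based
        start_of_a = None
        for q in range(k + 2, len(seq) - k):      # candidate start of a_g'
            after = q + k + 1                     # one letter after a_g'
            if seq[after:after + sub_len] != expected:
                continue
            for p in range(2, q - k + 1):         # candidate start of a_g
                if complement(seq[p:p + k]) == seq[q:q + k]:
                    start_of_a = p
                    break
            if start_of_a is not None:
                break
        if start_of_a is None:
            return None                           # ignore this sequence
        # The message letter sits two positions before the start of a_g.
        bits.append(DECODE[seq[start_of_a - 2]])
    return "".join(bits)


if __name__ == "__main__":
    S = "ACGGTCGTTCCCTAGTTG"
    hidden = "ACCGAAGCTGTCTTTCAGCCGGTCAGTACCTGCAATTAAGCGGTCGTCTTCAGT"
    print(extract(hidden, S, [2, 4]))
```

Checking the substring of S first acts as the authentication of Step 2: a sequence that happens to contain complementary substrings but no correctly placed substring of S is rejected.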
An intruder should have the following information. First, he should know the reference DNA sequence. Second, he should have the complementary rule. Third, he should know the positive integers j’s in order to authenticate against the reference DNA sequence. Fourth, he should know the binary coding rule. Fifth, he must know the correct lengths of the complementary pair substrings.
3.2 Robustness of Complementary Pair Approach
Let us review the example and the encryption steps of Method 2 in Section 3.1. The main idea is that we insert a part of the secret message before a pair of complementary substrings and a section of the reference sequence S after a pair of complementary substrings. Suppose S= ACGGTCGTTCCCTAGTTG and the secret message M=0110.
Step 1. Artificially generate a sequence without any complementary substrings and consisting of A, C, G and T only. Assume that the sequence is L=ACGGTCTCATCAATGCTTCAGT.
Step 2. Divide M into segments such that each segment contains an even number of bits. Thus, in our case, we have 01 and 10. We code 01 and 10 according to some coding rule. By using the coding rule given in the previous section, 01 and 10 will be coded as C and G respectively.
Step 3. Generate two pairs of complementary strings with length k and insert them into L. Assume k=5 and we have the following two pairs of complementary strings: (AAGCT TTCAG) and (ACCTG TAAGC). The sequence L now becomes L’= ACG AAGCT GTCT TTCAG CAT ACCTG CAAT TAAGC GCTTCAGT.
Step 4. Insert the first (second) letter of the coded secret message one letter before the first (second) complementary string. The string becomes: L’= ACCG AAGCT GTCT TTCAG CAGT ACCTG CAAT TAAGC GCTTCAGT.
Step 5. Use a random number generator to select two positive integers j1 and j2. Assume j1=2 and j2=4. Insert substrings S[j1,j1+4]=CGGTC and S[j2,j2+4]=GTCGT one letter after the complements TTCAG and TAAGC, respectively. L’ becomes: L’’=ACCGAAGCTGTCTTTCAGCCGGTCAGTACCTGCAATTAAGCGGTCGTCTTCAGT.
Step 6. Return L’’.
The sender sends L’’ together with many other sequences, some with complementary pair substrings, to the receiver.
The receiver processes every sequence received by finding all of its complementary substrings. If they are of the correct lengths, the receiver then checks whether the prescribed substrings of S are hidden correctly. If they are, the receiver may now extract the secret message.
In order to decode the message, an attacker should know the following information:
(1) He should know the reference DNA sequence S.
(2) He should have the complementary rule.
(3) He should know the positive integers j’s.
(4) He should know the binary coding rule.
(5) He must know the correct lengths of the complementary pair substrings.
For (1): The probability of an attacker’s successful guessing is $\frac{1}{55000000}$ (3a).
For (2): There are 4 possibilities of the complementary letter for each letter of DNA sequences. The total number of possible complementary rules is $4 \times 3 \times 2 \times 1 = 24$. The probability is $\frac{1}{24}$ (3b).
For (3): The positive integers j’s are the starting positions of the substrings of S. The receiver uses these substrings for authentication. Thus, if the length of S is n, there will be n-5 possibilities for each j. Suppose the length of the coded secret message M is m. We insert each character of M before a pair of complementary substrings, and we insert a substring of S after each pair of complementary substrings. Thus, we will need m substrings of S. The following figure illustrates our approach.
Figure 3-1 An Illustration of the Complementary Pair Approach.
Thus, the probability of an attacker’s successful guessing of all substrings of S is $\frac{1}{(n-5)^m}$ (3c).
For (4): The number of binary coding rules is $4 \times 3 \times 2 \times 1 = 24$. The probability of an attacker’s successful guessing in this stage is $\frac{1}{24}$ (3d).
For (5): It is hard for an attacker to know the length of the complementary substrings. In our case, we fix the length of each complementary substring and set it to k. An attacker may use an efficient algorithm to find all complementary substrings, but he will not know the exact length. Suppose the length of the longest complementary substring found is x. The attacker may guess the length of the required complementary substrings from 1 to x. Thus the probability is $\frac{1}{x}$ (3e).
Finally, the probability of an attacker’s successful guessing of Method 2 is the product of (3a), (3b), (3c), (3d) and (3e). The probability is as follows:
Lemma 3-1
Suppose n is the length of S, m is the number of the complementary pair substrings used in the encoding scheme and x is the number of candidate lengths of the complementary pair substrings in S’. The probability of an attacker’s successful guessing of the Complementary Pair Approach is
$$\frac{1}{55000000} \times \frac{1}{24} \times \frac{1}{(n-5)^m} \times \frac{1}{24} \times \frac{1}{x}.$$
Since this probability is extremely small, we conclude that Method 2 is safe.
3.3 Payload of Complementary Pair Approach
We now introduce the payload of Method 2. The payload of Method 2 consists of three parts: the secret message, the complementary pair substrings and the substrings of S. Suppose that the length of the coded secret message M is m. The length of the complementary pair substrings is |a1|+|a2|+…+|am|, where ai is a complementary pair substring generated by us. Suppose |a1|+|a2|+…+|am| is equal to A. In our approach, we insert a substring of S for each complementary pair substring. In this case, we have m complementary pair substrings and we always insert a substring of S of length 5 after each complementary pair substring. Therefore, the total length of the m substrings of S is 5m.
Lemma 3-2
Suppose m is the length of the coded secret message and A is the total length of the complementary pair substrings used in the encoding scheme. The payload of the Complementary Pair Approach is m+A+5m=6m+A.
Take the example introduced in Section 3.1. The length of the coded M is 2. The total length of the complementary pair substrings is 20. The length of the substrings of S inserted by us is 10. Thus the payload is 2+20+10=32.
3.4 The Complementary Palindrome
Instead of complementary pair substrings, we may also use palindromes [MW2002, ABG92, ABG95, BG95, G76 and M75]. A palindrome is a string of the form $\alpha\alpha'$ where $\alpha'$ is the reverse of $\alpha$. For instance, AACGTTGCAA is a palindrome where $\alpha$=AACGT. Several methods have been proposed to find the longest palindrome in a string [ABG92, ABG95, BG95, G76 and M75].
Our approach using palindromes is quite similar to that using complementary pairs. Let $\alpha = \alpha_1\alpha_2\ldots\alpha_h$ be a substring. Let $\beta = \alpha_{h-1}\alpha_{h-2}\ldots\alpha_1$. Let $\alpha_i' = C(\alpha_i)$ for $1 \le i \le h$. Let $\alpha' = \alpha_h'\alpha_{h-1}'\ldots\alpha_1'$. Then $\alpha\alpha'$ is a complementary palindrome. For example, assume the complementary rule is ((A T) (C A) (G C) (T G)). Then ACCTGAAT is a complementary palindrome, with $\alpha$=ACCT and $\alpha'$=GAAT. Method 2 may use complementary palindromes because methods to find palindromes can be easily extended to find complementary palindromes.
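The definition of a complementary palindrome can be checked mechanically. The following is a minimal sketch with a hypothetical function name; it tests a whole string against the definition and does not search for palindromes inside a longer sequence.

```python
# Complementary rule of the example: (A T) (C A) (G C) (T G).
RULE = {"A": "T", "C": "A", "G": "C", "T": "G"}


def is_complementary_palindrome(seq, rule=RULE):
    """True if seq = alpha + alpha', where alpha' is the reversed alpha
    with every letter replaced by its complement."""
    if not seq or len(seq) % 2 != 0:
        return False
    half = len(seq) // 2
    alpha = seq[:half]
    alpha_prime = "".join(rule[ch] for ch in reversed(alpha))
    return seq[half:] == alpha_prime


if __name__ == "__main__":
    print(is_complementary_palindrome("ACCTGAAT"))
    # With the identity rule the same test recognizes ordinary palindromes.
    identity = {ch: ch for ch in "ACGT"}
    print(is_complementary_palindrome("AACGTTGCAA", identity))
```

Passing the identity mapping as the rule reduces the test to the ordinary palindrome of the form given at the start of this section.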
Chapter 4 Method 3: The Substitution Approach
In this chapter, we will introduce a method that uses the substitution approach to hide data in the reference DNA sequence. Before we introduce the algorithms, we shall recall the complementary rule. That is, for each letter, we assign a unique counterpart. For instance, we may have the following base pair rule:
(A T) (C A) (G C) (T G).
We shall define a legal complementary rule. We first define the complement of a letter x to be C(x). The double complement of x is C(C(x)). The triple complement of x is C(C(C(x))). A legal complementary rule is a complementary rule such that $x \ne C(x) \ne C(C(x)) \ne C(C(C(x)))$, that is, x, C(x), C(C(x)) and C(C(C(x))) are all distinct. For instance, the following complementary rule is a legal one.
(A T) (C A) (G C) (T G).
4.1 Algorithms of Substitution Approach
For this method, we will also use a reference sequence S. Let us assume that S=ACGGAATTGCTTCAG and that the secret message M=m1m2…mp is 0111010. The length of S is 15, which is larger than the length p of M, which is 7 in this case.
Our main idea is as follows:
Step 1. Suppose the length of the reference sequence S is 15. Select p distinct numbers randomly from 1 to 15, where p is equal to 7 in this case. Assume that they are 2, 3, 5, 10, 12, 13 and 15. Let A={A1, A2, …, Ap}={2,3,5,10,12,13,15}.
Step 2. Transform S into S’ by the following rule:
For all i from 1 to 15,
if i is equal to some Aj and mj is 1, 1 ≤ j ≤ p, set Si to C(Si),
if i is equal to some Aj and mj is 0, do not change Si, and
if i is not equal to any Aj, set Si to C(C(Si)).
Thus S’=GCCATGCCAACTAGG.
Step 3. Send S’ to the receiver with many other sequences.
The receiver does not need to generate the set A. After receiving a set of sequences, he checks all positions of each sequence S’ in the set. There are only three possible cases: (1) S’i is the same as Si (the secret bit mj is equal to 0). (2) S’i is C(Si) (the secret bit mj is equal to 1). (3) S’i is C(C(Si)). If there exists an i such that S’i and Si fall under none of the above three cases, the sequence should be ignored.
In the following, we present the algorithms for Method 3.
Algorithm 4-1 Encryption Algorithm for Method 3
(Substitution Approach)
Input: A DNA sequence S, the secret message M=m1m2…mp and a complementary rule.
Output: An encrypted DNA sequence S’.
Step 1. Use a random number generator to generate a sorted set A of p distinct positions in S, where p is the length of M.
Step 2. Initialize i and j to 1.
Step 3. For each element Si of S, do the following operations:
if i is equal to some Aj and mj is 1, 1 ≤ j ≤ p, change Si to C(Si),
if i is equal to some Aj and mj is 0, do not change Si,
if i is not equal to any Aj, set Si to the double complement of itself.
Step 4. Return the resulting sequence S’.
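The substitution rule can be sketched in a few lines. This is an illustrative sketch with a hypothetical function name; the set A is passed in explicitly instead of being drawn from a random number generator, so that the small example with S=ACGGAATTGCTTCAG and M=0111010 can be reproduced.

```python
# Legal complementary rule from the text: (A T) (C A) (G C) (T G).
RULE = {"A": "T", "C": "A", "G": "C", "T": "G"}


def substitute(s_ref, bits, positions, rule=RULE):
    """Hide `bits` in `s_ref`; positions are 1-based, sorted and distinct,
    with positions[j] carrying bits[j]."""
    bit_at = dict(zip(positions, bits))
    out = []
    for i, ch in enumerate(s_ref, start=1):
        if i not in bit_at:
            out.append(rule[rule[ch]])   # not selected: double complement
        elif bit_at[i] == "1":
            out.append(rule[ch])         # selected, bit 1: complement
        else:
            out.append(ch)               # selected, bit 0: unchanged
    return "".join(out)


if __name__ == "__main__":
    print(substitute("ACGGAATTGCTTCAG", "0111010", [2, 3, 5, 10, 12, 13, 15]))
```

Because every unselected position is also changed, an observer who compares S’ with S cannot tell the selected positions from the number of differences alone.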
Algorithm 4-2 Decryption Algorithm for Method 3 (Substitution Approach)
Input: A set of DNA sequences, the reference sequence S and the complementary rule.
Output: Secret message M
Step 1. Initialize i and j to 1.
Step 2. For the next sequence S’ of the sequence set, do the following operations:
if there exists an i such that S’i ≠ Si, S’i ≠ C(Si) and S’i ≠ C(C(Si)), ignore S’ and repeat Step 2 with the next sequence; otherwise,
for each Si:
if S’i is the same as Si, set mj=0 and increase j by 1,
else if S’i is C(Si), set mj=1 and increase j by 1.
Step 3. Concatenate all mj’s to be M and return M.
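The comparison performed by the receiver can be sketched as follows. The function name is hypothetical and the sketch handles a single candidate sequence; a receiver would call it on each sequence of the set and keep the one that is not rejected.

```python
# Legal complementary rule from the text: (A T) (C A) (G C) (T G).
RULE = {"A": "T", "C": "A", "G": "C", "T": "G"}


def recover(s_ref, s_prime, rule=RULE):
    """Return the hidden bits, or None if `s_prime` cannot have been
    derived from `s_ref` by the substitution approach."""
    if len(s_ref) != len(s_prime):
        return None
    bits = []
    for x, y in zip(s_ref, s_prime):
        if y == x:
            bits.append("0")             # unchanged: secret bit 0
        elif y == rule[x]:
            bits.append("1")             # complement: secret bit 1
        elif y != rule[rule[x]]:
            return None                  # none of the three cases: ignore
        # double complement: position carries no secret bit
    return "".join(bits)


if __name__ == "__main__":
    print(recover("ACGGAATTGCTTCAG", "GCCATGCCAACTAGG"))
```

The three cases can be told apart only because the rule is legal, so that x, C(x) and C(C(x)) are always different letters.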
For an intruder to find out the secret message, he should have the following
information: (1) He should know the reference DNA sequence. (2) He should know the
complementary rule.
4.2 An Example of Substitution Approach
In this section, we will run a full example to illustrate Method 3. Our reference DNA sequence is a segment of the DNA sequence of Litmus as mentioned in the previous section.
S=ATCGAATTCGCGCTGAGTCACAATTCGCGCTGAGTCACAATTCGCGCTGAGT
CACAATTGTGACTCAGCCGCGAATTCCTGCAGCCCCGAATTCCGCATTGCAGA
GATAATTGTATTTAAGTGCCTAGCTCGATACAATAAACGCCATTTGACCATTCAC
CACATTGGTGTGCACCTCCAAGCTCGCGCACCGTACCGTCTCGAGGAATTCCT
GCAGGATATCTGGATCCACGAAGCTTCCCATGGTGACGTCACC
Our secret message M=0110100101001100101001010101001010100110101010.
Suppose the complementary rule is: (A T) (C A) (G C) (T G).
Suppose A={4, 7, 8, 11, 13, 15, 16, 20, 23, 27, 36, 39, 42, 43, 44, 49, 63, 67, 68, 69, 70, 72, 74, 76, 79, 83, 86, 88, 89, 93, 94, 96, 97, 100, 101, 107, 117, 121, 126, 139, 156, 171, 178, 190, 201, 212}, which is generated by a random number generator.
We first point out the positions in S corresponding to A. The underlined positions are the elements of A.
S=ATCGAATTCGCGCTGAGTCACAATTCGCGCTGAGTCACAATTCGCGCTGAG
TCACAATTGTGACTCAGCCGCGAATTCCTGCAGCCCCGAATTCCGCATTGCAG
AGATAATTGTATTTAAGTGCCTAGCTCGATACAATAAACGCCATTTGACCATTC
ACCACATTGGTGTGCACCTCCAAGCTCGCGCACCGTACCGTCTCGAGGAATTC
CTGCAGGATATCTGGATCCACGAAGCTTCCCATGGTGACGTCACC
Recall that our encryption algorithm transforms S into S’ by the following rule: For all i from 1 to the length of S,
if i is equal to some Aj and mj is 1, 1 ≤ j ≤ p, set Si to C(Si),
if i is equal to some Aj and mj is 0, do not change Si, and
if i is not equal to any Aj, set Si to C(C(Si)).
The complementary rule is: (A T) (C A) (G C) (T G).
M=0110100101001100101001010101001010100110101010.
Thus we will get S’ as follows:
S’=GCTGGGGGTACAACGAACTTTGACCTCTATCAGACCGTAGCGAGTATCGGA
CTGTGGCCACATTCTACCCAAAAGGCTCCATTATCTAGGGCTATGAGCTCTGA
GAACGGCCACGCTCGGCCATTGGATCTAGCGTGGTGGGTATTGCCCAGTTGGC
TGTTGTGCCAACATATGTTCATGGATCTATATATTACGTTACTGTAGAAGGCCT
CCATGAAGCGCTCAAGCTTGTAGGATCCTTTGCAACAGTACTGTT.
The receiver obtains the secret message M by comparing S’ with S. The comparison rules are described in Algorithm 4-2.
4.3 Robustness of Substitution Approach
Let us review the example and the steps of Method 3. For this method, we will also use a reference sequence S. Let us assume that S=ACGGAATTGCTTCAG and that the secret message M=m1m2…mp is 0111010. The length of S is 15, which is larger than the length p of M, which is 7 in this case. For illustration, assume that the complementary rule is the same as given in the above sections. That is, the rule is as follows: ((A T) (C A) (G C) (T G)).
Step 1. Suppose the length of the reference sequence S is 15. Select p distinct numbers randomly from 1 to 15, where p is equal to 7 in this case. Assume that they are sorted as 2, 3, 5, 10, 12, 13 and 15. Let A={A1, A2, …, Ap}={2,3,5,10,12,13,15}.
Step 2. Transform S into S’ by the following rule:
For all i from 1 to 15,
if i is equal to some Aj and mj is 1, 1 ≤ j ≤ p, set Si to C(Si),
if i is equal to some Aj and mj is 0, do not change Si, and
if i is not equal to any Aj, set Si to C(C(Si)).
Thus S’=GCCATGCCAACTAGG.
Step 3. Send S’ to the receiver with many other sequences.
For an intruder to find out the secret message, he should have the following information.
(1) He should know the reference DNA sequence.
(2) He should know the complementary rule.
For (1): The probability of an attacker’s successful guessing is $\frac{1}{55000000}$ (4a).
For (2): We should consider how many legal complementary rules there are. We define a legal complementary rule as follows: for each letter x of a DNA sequence, C(x), C(C(x)) and C(C(C(x))) are all not equal to x. Note that C(x) is the complement of x. There are 6 legal complementary rules, as follows:
(A T) (T C) (C G) (G A),
(A T) (T G) (G C) (C A),
(A C) (C T) (T G) (G A),
(A C) (C G) (G T) (T A),
(A G) (G T) (T C) (C A), and
(A G) (G C) (C T) (T A)
The probability is $\frac{1}{6}$ (4b).
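The count of legal complementary rules can be confirmed by enumeration. The sketch below, with hypothetical function names, treats a complementary rule as a one-to-one assignment of counterparts (a permutation of the four letters) and keeps those for which C(x), C(C(x)) and C(C(C(x))) all differ from x.

```python
from itertools import permutations

LETTERS = "ACGT"


def is_legal(rule):
    """True if C(x), C(C(x)) and C(C(C(x))) all differ from x for every x."""
    for x in LETTERS:
        y = x
        for _ in range(3):
            y = rule[y]
            if y == x:
                return False
    return True


def legal_rules():
    """Enumerate all one-to-one rules and keep the legal ones."""
    rules = (dict(zip(LETTERS, perm)) for perm in permutations(LETTERS))
    return [rule for rule in rules if is_legal(rule)]


if __name__ == "__main__":
    for rule in legal_rules():
        print(" ".join(f"({x} {rule[x]})" for x in LETTERS))
```

The legal rules are exactly the assignments that cycle through all four letters, which is why a rule that merely swaps letters in pairs, such as (A T) (T A) (C G) (G C), is rejected.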
Finally, the probability of an attacker’s successful guessing of Method 3 is the product of (4a) and (4b).
The probability is as follows:
$$\frac{1}{55000000} \times \frac{1}{6}$$
Let us consider another case. Suppose that it is impossible for an attacker to know the reference sequence S. Then, according to our algorithm, the attacker should consider 3 possibilities for each position of S’ and S. For each position i, there may be three cases: (1) S’i is the same as Si. (2) S’i is C(Si). (3) S’i is C(C(Si)). If the length of S’ is n, there will be $3^n$ possibilities. The probability of the attacker’s successful guessing will be $\frac{1}{3^n}$.
Lemma 4-1:
Suppose n is the length of S. The probability of an attacker’s successful guessing of the Substitution Approach is
$$\frac{1}{55000000} \times \frac{1}{6} \quad \text{or} \quad \frac{1}{3^n}.$$
4.4 Payload of Substitution Approach
In this section, we will introduce the payload of Method 3. The payload of Method 3 is 0 because we only change letters of the reference sequence S and do not add any extra letters to S.
Chapter 5 Conclusion and Future Works
5.1 Conclusions
In this thesis, we have presented three encryption methods based on DNA sequences. Previous related works are almost all based upon biological properties. Our methods rely instead on two observations. First, there is almost no difference between a real DNA sequence and a faked one. Second, there are a large number of DNA sequences publicly available on various web sites. A rough estimate would put the number of publicly available DNA sequences at around 55 million. Our encryption methods are based upon the above two properties.
It is hard for an attacker to detect whether there are secret messages in a DNA sequence or not. Even if he knows that there are some secret messages in our DNA sequence, it is hard for him to know the reference sequence and correctly decode the secret.
5.2 Future Works
It is worthwhile to make the decoding stage more complicated. It is also worthwhile to find other properties of DNA sequences which can be used to hide data. Decreasing the payloads of Method 1 and Method 2 is also part of our future work.
Bibliography
[A2000] Dynamic Programming Algorithms for RNA Secondary Structure Prediction with Pseudoknots, Akutsu, T., Discrete Applied Mathematics, Vol. 104, 2000, pp. 45-62.
[ABG92] Optimal Parallel Algorithms for Periods, Palindromes and Squares, Apostolico, A., Breslauer, D. and Galil, Z., Proceedings of the International Colloquium on Automata, Languages, and Programming, 1992, pp. 296-307.
[ABG95] Parallel Detection of All Palindromes in a String, Apostolico, A., Breslauer, D. and Galil, Z., Theoretical Computer Science, Vol. 141, 1995, pp. 163-173.
[ABLRRW94] Molecular Biology of the Cell, Alberts, B., Bray, D., Lewis, J., Raff, M., Roberts, K. and Watson, J. D., New York & London: Garland Publishing, 1994.
[BG95] Finding All Periods and Initial Palindromes of a String, Breslauer, D. and Galil, Z., Algorithmica, Vol. 14, 1995, pp. 355-366.
[C2003] A DNA-Based, Biomolecular Cryptography Design, Chen, J., Proceedings of the 2003 International Symposium on Circuits and Systems, Vol. 3, 2003.
[CRB99] Hiding Messages in DNA Microdots, Clelland, C. T., Risca, V. and Bancroft, C., Nature, Vol. 399, 1999, pp. 533-534.
[G76] Real-Time Algorithms for String-Matching and Palindrome Recognition, Galil, Z., STOC, 1976, pp. 161-173.
[LNC2000] Principles of Biochemistry, Lehninger, A. L., Nelson, D. L. and Cox, M. M., New York, Worth, 2000.
[LRBR2000] Cryptography with DNA Binary Strands, Leier, A., Richter, C., Banzhaf, W. and Rauhe, H., BioSystems, Vol. 57, 2000, pp. 13-22.
[M75] A New Linear-Time On-Line Algorithm for Finding the Smallest Initial Palindrome of a String, Manacher, G., J. ACM, Vol. 22, 1975, pp. 346-351.
[MW2002] Jewels of Stringology, Crochemore, Maxime and Rytter, Wojciech, World Scientific, 2002.
[SFP2002] Hiding Data in DNA, Shimanovsky, B., Feng, J. and Potkonjak, M., Revised Papers from the 5th International Workshop on Information Hiding, Lecture Notes in Computer Science, Vol. 2578, 2002, pp. 373-386.
[WHRS87] Molecular Biology of the Gene, Watson, J., Hopkins, N., Roberts, J. and Steitz, J., Benjamin Cummings, 4th Edition, Menlo Park, CA, 1987.
