Master's Thesis
Encryption Methods Based upon DNA Sequences
Advisor: Professor R. C. T. Lee
Graduate student: Hsu, Hong-Zhi
June 2006
Acknowledgements
I first thank my advisor, Professor R. C. T. Lee, for his education and training during
my graduate school years. He made me a humanistic, merciful, and friendly person.
In research, he also made me a logical, accurate, and scrupulous student. The most
Yue-Li Wang and Professor Chuan-Yi Tang for their suggestions on my thesis. I also
R. Zhuang for their directions and discussions between them and me on researches. I
Finally, I sincerely appreciate Linda, who has always supported me during these two years.
Hsu, Hong-Zhi
Computer Science and Information Engineering
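Thesis Abstract
University and department: Department of Computer Science and Information Engineering, National Chi Nan University    Pages: 45
Graduation date: June 2006    Degree: Master
Graduate student: Hsu, Hong-Zhi    Advisor: Professor R. C. T. Lee
point out some interesting properties of DNA that we can exploit to hide data. The
three methods proposed in this thesis are the insertion method, the complementary pair method and the substitution method. For each method, we
The main idea of the first method is to break M and S into many segments and then put each of M's
after the complementary strings that hide segments of M. The third method hides the information in M by changing
letters in the sequence.
Keywords: encryption, DNA, complementary strings, complementary rule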
Abstract
In this thesis, we propose three encryption methods based upon some properties of DNA
sequences. We will point out that the DNA sequences possess some interesting
properties which we can utilize to hide data. The three methods proposed in this thesis
are: the insertion method, the complementary pair method and the substitution method.
For each method, we secretly select a reference DNA sequence S and incorporate the
secret message M into it such that we obtain S’. We send this S’, together with many
other DNA, or DNA-like sequences to the receiver. The receiver is able to identify the
particular sequence with M hidden in it and ignore all of the other sequences. He will
The main idea of Method 1 is to break S and M into segments and insert each segment of M
between each pair of segments of S. Method 2 breaks M into several segments and
hides each one before a complementary pair substring in a DNA sequence.
We also extract a segment of S and insert it after each complementary pair substring with
substituting alphabets of S.
Contents
1.1 A Simple Version of DNA Based Encryption Method by Using Biology Properties
Bibliography
List of Figures
Figure 2-1 The Relations between m(s) and r’s(k’s).
Chapter 1 Introduction and Related Works
In recent years, much research work has been done on DNA based encryption
schemes [C2003, CRB99, LRBR2000 and SEP2002]. Most of them use biological
Both of them are based on physical DNA sequences and use some chemical schemes
which are used to decode [C2003]. We will introduce a simple version and a
complicated version.
In this chapter, we will introduce how to use biological properties to hide data. In a
DNA based encryption scheme, we store binary data as a DNA sequence and later
hide the data in some way. There is a key, called a primer, which is used to recover the
binary data, and we will introduce how to use primers to store our data in DNA sequences.
The method is as follows: a sender sends a DNA sequence and selected primers to the
receiver, and the receiver uses the primers and the DNA sequence to transform it into
a sequence consisting of the binary data. We may use a public DNA sequence as our
reference sequence, which the receiver also knows. In that case, we send only the
selected primers to the receiver. Without the primers and the reference sequence, no one
before, we will use selected primers to do that. Let us define a primer first. A primer
sequence:
ATGCTTAGTTCCATCGGAGACTAATGGCCTA
which are the complementary substrings of TAGTT and CTAATG, respectively. There
relation is as follows: for each alphabet x of a DNA sequence, c(x) denotes the
each alphabet of a DNA sequence. In our case, we define the complementary rule as
follows: A-T, T-A, C-G and G-C. That means, c(A) is T for instance.
combine and then we will use a chemical substance, called fluorescent, to indicate where
the positions of the primers are. The fluorescent will make the positions of
corresponds to the binary data ‘1’ and a dark one corresponds to ‘0’. There is a sample
The green sections represent the binary data ‘1’s in the sequences and the dark
ATGCTTAGTTCCATCGGAGACTAATGGCCTA
ATCAA GATTAC
0 1 0 1 0
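The reading of the example above can be sketched in Python. This is only an illustration of the bookkeeping, not of the chemical procedure: the function name `decode_with_primers` is hypothetical, and the sketch assumes that each primer binds at the first occurrence of its complementary substring and that binding sites do not overlap.

```python
# Complementary rule from the text: A-T, T-A, C-G, G-C.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def complement(s: str) -> str:
    """Apply the complementary rule to every letter of s."""
    return "".join(COMPLEMENT[x] for x in s)


def decode_with_primers(sequence: str, primers: list[str]) -> str:
    """Read one bit per section: '1' for a bright (primer-bound) section,
    '0' for a dark section between, before, or after the bound sections.

    Assumption: a primer binds at the first occurrence of its complement.
    """
    sites = []
    for primer in primers:
        target = complement(primer)
        start = sequence.find(target)
        if start < 0:
            raise ValueError(f"primer {primer} does not bind to the sequence")
        sites.append((start, start + len(target)))
    sites.sort()

    bits = []
    cursor = 0
    for start, end in sites:
        if start < cursor:
            raise ValueError("binding sites overlap")
        if start > cursor:
            bits.append("0")  # dark section before this primer
        bits.append("1")      # bright section covered by the primer
        cursor = end
    if cursor < len(sequence):
        bits.append("0")      # trailing dark section
    return "".join(bits)


if __name__ == "__main__":
    seq = "ATGCTTAGTTCCATCGGAGACTAATGGCCTA"
    print(decode_with_primers(seq, ["ATCAA", "GATTAC"]))
```

With the sequence and the two primers ATCAA and GATTAC of the example, the sections are read as dark, bright, dark, bright, dark.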
Consider another example. Suppose we have the following DNA sequence and
ATGCTTAGTTCCATCGGAGACTAAT
Primers:
ATGCTTAGTT CCATCGGAGACTAAT
0 1 1 0 1
The sender sends the DNA sequence and the selected primers to the receiver. The
receiver now uses the primers and the DNA sequence to transform it into a sequence
consisting of bright and dark parts by using the chemical mechanism as we mentioned
In a real DNA sequence, for instance, the following one is a segment of DNA
ATCGAATTCGCGCTGAGTCACAATTCGCGCTGAGTCACAATTCGCGCTGAGTC
ACAATTGTGACTCAGCCGCGAATTCCTGCAGCCCCGAATTCCGCATTGCAGAG
ATAATTGTATTTAAGTGCCTAGCTCGATACAATAAACGCCATTTGACCATTCACC
ACATTGGTGTGCACCTCCAAGCTCGCGCACCGTACCGTCTCGAGGAATTCCTG
CAGGATATCTGGATCCACGAAGCTTCCCATGGTGACGTCACC
This means that a real DNA sequence can store a large amount of binary data
by using short primers. Assume we use primers consisting of five nucleotides; then
there will be at most $\lceil 2856/5 \rceil = 572$ bits if we store data in the DNA sequence of Litmus.
It should be noted that if the primers are too short, it is hard to store our data in the
DNA sequence. Suppose we have to store a binary sequence “01101” and we have a
ATCGGCTAATCGGCTAATCGGCTAATCGGCTA
ATCGGCTAATCGGCTAATCGGCTAATCGGCTA
TA CG CG
1 0 1 0 1 0
Thus we will have the binary code “101010” and it is not our required result.
It is a key point to make sure that the selected primers could correctly generate the
required binary data. The following example is that we use a segment of the DNA
0 1 0 1 1 1
GTCACAATTGTGACTCAGCCGCGAATTCCTGCAGCCCCGAATTCCGCATTG
GGCGCTTAAGGACGT
0 1 0
CAGAGATAATTGTATTTAA GTGCCTAGCTCGATACAATAAACGCCATTTGAC
TCTATTAACATAAATT CACGGATCGAGCTA
1 1 0
CATTCACCACATTGGTGTGCACCTCCAAGCTCGCGCACCGTACCGTCTCGAGG
AATTCCTGCAGGATATCTGGATCCACGAAGCTTCCCATGGTGACGTCACC
primers than shorter ones. It is really hard to guess what a long primer is.
In order to decode the message, the receiver should know (1) the complementary
rule, (2) the reference DNA sequence and (3) the selected primers. For an attacker, it is
really hard to guess the reference DNA sequence and the selected primers. There are
In the above method, a sender sends a reference DNA sequence and the selected
primers to the receiver, and the receiver decodes and finds the binary message hidden in
the DNA sequence by using the selected primers. To make the decoding more
complicated, we could send only a section of each primer, together with the information
needed to recover the correct primer. This is more complicated because it would be hard
for an attacker to obtain the correct primer even if he got the short section.
ATGCTTAGTTCCATCGGAGACTAAT
In the simple version, we will send the sequence and the selected primers to the
receiver. This time, we will send a partial section for each primer to the receiver. We
will introduce how the receiver recovers the correct primers and decode. After getting
the sections of the selected primers, the receiver could easily recover all the correctly
The key point is how the receiver can recover the correct primers from the
sections of the selected primers. Before we introduce the trick, we shall first introduce a term
called a template. A template is a complementary string of a primer. For instance, for
recover the original primer. There is also a chemical mechanism, called PCR
(Polymerase Chain Reaction). We will recover the original primer by using the PCR
strategy on the template and a partial primer. We cut a certain suffix of the primer; in
this case, it may be GATTAC and put it together with the template. Thus by using PCR,
By using the PCR strategy, we need not send the correct primers to the
receiver; instead we send the prefixes or the suffixes of the selected primers and the
template of each primer. Thus, the receiver will use a partial primer and its
corresponding template to recover the correct primer. After he gets all the selected
the DNA sequence of Litmus. Suppose we still hide the binary data 010111010110 and
We cut the suffix for each primer and get the following partial primers:
We send the templates and the partial primers to the receiver. For each
corresponding template and the partial primer, the receiver uses the PCR strategy to
For instance:
Template: GAATTCGCGCT
Primer: CTTAAGCGCGA
Thus the receiver will get all selected primers and thus decode the sequence as we
mentioned before.
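The recovery of a primer from its template and a partial primer can be mimicked in software. The sketch below is not a model of PCR itself: it simply takes the complement of the template and checks that the partial primer is a prefix or suffix of the result. The function names and the partial primers used in the demonstration (`CTTA`, `TTAC`) are hypothetical.

```python
# Complementary rule from the text: A-T, T-A, C-G, G-C.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def complement(s: str) -> str:
    """Apply the complementary rule to every letter of s."""
    return "".join(COMPLEMENT[x] for x in s)


def recover_primer(template: str, partial: str):
    """Return the full primer for this template if the partial primer is a
    prefix or a suffix of it; otherwise return None."""
    primer = complement(template)
    if primer.startswith(partial) or primer.endswith(partial):
        return primer
    return None


def match_partials(templates: list[str], partials: list[str]) -> list[str]:
    """Pair each partial primer with the single template that extends it.
    Raises ValueError when the correspondence is not one-to-one."""
    recovered = []
    for partial in partials:
        candidates = [recover_primer(t, partial) for t in templates]
        hits = [c for c in candidates if c is not None]
        if len(hits) != 1:
            raise ValueError(f"partial primer {partial} matches {len(hits)} templates")
        recovered.append(hits[0])
    return recovered


if __name__ == "__main__":
    print(recover_primer("GAATTCGCGCT", "GCGCGA"))
    print(match_partials(["GAATTCGCGCT", "CTAATG"], ["CTTA", "TTAC"]))
```

The uniqueness check in `match_partials` reflects the requirement that each partial primer must correspond to exactly one template.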
It is a key point to make sure the corresponding template and partial primer are one
There must be only one template whose prefix is GTC. Thus, the receiver
For this complicated version of the encryption, in order to decode the message, a receiver
should know (1) the complementary rule, (2) the partial primers and (3) the templates of
the selected primers.
Insertion Method. The main idea of Method 1 is to divide the secret message and the
reference DNA sequence into segments and then assemble the segments one by one,
alternating between the secret message and the reference sequence. The robustness and the payload of Method 1 will
substrings. The robustness and the payload of Method 2 are also proposed in this
each alphabet by our conditions to hide the secret messages. The robustness and the
payload of Method 3 are proposed in chapter 4. Finally, conclusions and future works
Chapter 2 Method 1: The Insertion Method
In this chapter, we shall point out that the DNA sequences possess some interesting
properties which we can utilize to hide data. We present three methods in the following
three chapters, the insertion method, the complementary pair method and the substitution
method. For each method, we secretly select a reference DNA sequence S and
incorporate the secret message M into it such that we obtain S’. We send this S’,
together with many other DNA, or DNA-like sequences to the receiver. The receiver is
able to identify the particular sequence with M hidden in it and ignore all of the other
In this chapter, we will present the first method which does not make use of the
alphabet is related to a nucleotide. It is usually quite long. For instance, the following
The first one is a segment of the DNA sequence of Litmus; its real length is 2856
nucleotides:
ATCGAATTCGCGCTGAGTCACAATTCGCGCTGAGTCACAATTCGCGCTGAGTC
ACAATTGTGACTCAGCCGCGAATTCCTGCAGCCCCGAATTCCGCATTGCAGAG
ATAATTGTATTTAAGTGCCTAGCTCGATACAATAAACGCCATTTGACCATTCACC
ACATTGGTGTGCACCTCCAAGCTCGCGCACCGTACCGTCTCGAGGAATTCCTG
CAGGATATCTGGATCCACGAAGCTTCCCATGGTGACGTCACC.
The second one is a segment of DNA sequence of Balsaminaceae, its real length is
TTTTTATTATTTTTTTTCATTTTTTTCTCAGTTTTTAGCACATATCATTACATTTTA
TTTTTTCATTACTTCTATCATTCTATCTATAAAATCGATTATTTTTATCACTTATTTT
TCTAATTTCCAATATTTCATCTAATGATTATATTACATTAAAGAAATCGGTTAAAA
GCGACTAAAAATCAATCTGGAACAAGGCTTAGTTTATTTAATATATTATTTTATG
TAATTTCTATTGAAAAATTAGTTAAAAGGCAAGTATTTGAGAT.
We can now easily see one special property of DNA sequences: There is almost no
difference between a real DNA sequence and a faked one. This is a property which we
There is another fact which is quite useful to us: There are a large number of DNA
sequences publicly available in various web-sites. A rough estimation would put the
By using the above facts, we designed three DNA based encryption methods. All
of these methods would secretly select a reference sequence S from publicly available
DNA sequences. Only the sender and the receiver are aware of this reference sequence.
The sender would transform this selected DNA sequence S into a new sequence S’ by
incorporating the DNA sequence S with the secret message M. This transformed
sequence S’ is sent by a sender to the receiver together with many other DNA sequences.
The receiver would then examine all of the received sequences, identify S’ and recover
We shall introduce three methods in the following chapters. For all of these
methods, we assume that there are two schemes used by the sender and the receiver
which are kept secret. The first one is a binary coding scheme which transforms
alphabets A, C, G and T into binary codes and vice versa. For instance, the following
may be a binary coding used: ((A 00) (C 01) (G 10) (T 11)). It should be noted that
more digits may be used. The second scheme is a complementary pair rule. That is,
we shall assign each alphabet x a complement, denoted as C(x). The following may
To simplify the discussion, we start with the most basic version and give a simple
example. The more complicated version of our method will be presented after this basic
Step 1. We first code S into a binary sequence by using the binary coding scheme.
Step 2. Divide S into segments whereby each segment contains k bits. Suppose k is 3.
Then we have the following segments: 000, 110, 101, 111, 010, 100, 001, 110,
01.
Step 3. Insert bits from M, one at a time, into the beginning of segments of S. The
result is as follows: 0000, 1110, 0101, 0111, 1010, 1100, 0001, 0110, 01. We
should ignore those segments without any secret message inserted. Thus, we
will have the following segments: 0000, 1110, 0101, 0111, 1010, 1100, 0001,
sequence: 00001110010101111010110000010110.
Step 4. We use the binary code scheme to produce the following faked DNA sequence:
different from S.
Step 5. We send the above sequence S’ to the receiver, amid many other irrelevant
sequences.
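The basic version can be sketched in a few lines of Python. The binary coding ((A 00) (C 01) (G 10) (T 11)) and k = 3 are taken from the text. The values S = ACGGTTCCAATGC and M = 01001100 are assumptions: they are obtained by decoding the segments listed in Steps 2 and 3, not quoted from the text. The function names are hypothetical.

```python
# Binary coding scheme from the text: ((A 00) (C 01) (G 10) (T 11)).
CODE = {"A": "00", "C": "01", "G": "10", "T": "11"}
DECODE = {bits: letter for letter, bits in CODE.items()}


def dna_to_bits(dna: str) -> str:
    return "".join(CODE[x] for x in dna)


def bits_to_dna(bits: str) -> str:
    if len(bits) % 2:
        raise ValueError("bit string must have even length")
    return "".join(DECODE[bits[i:i + 2]] for i in range(0, len(bits), 2))


def encrypt_basic(s_dna: str, m_bits: str, k: int) -> str:
    """Put one bit of M in front of each k-bit segment of S (Steps 1 to 4).
    Segments of S that receive no message bit are dropped."""
    s_bits = dna_to_bits(s_dna)
    if len(m_bits) * k > len(s_bits):
        raise ValueError("reference sequence is too short for this message")
    pieces = [bit + s_bits[i * k:(i + 1) * k] for i, bit in enumerate(m_bits)]
    return bits_to_dna("".join(pieces))


def decrypt_basic(t_dna: str, s_dna: str, k: int):
    """Return M if the bits extracted from T form a prefix of S, else None."""
    t_bits = dna_to_bits(t_dna)
    if len(t_bits) % (k + 1):
        return None
    message, extracted = [], []
    for i in range(0, len(t_bits), k + 1):
        message.append(t_bits[i])
        extracted.append(t_bits[i + 1:i + k + 1])
    if not dna_to_bits(s_dna).startswith("".join(extracted)):
        return None  # T is one of the irrelevant sequences
    return "".join(message)


if __name__ == "__main__":
    s_prime = encrypt_basic("ACGGTTCCAATGC", "01001100", 3)
    print(s_prime, decrypt_basic(s_prime, "ACGGTTCCAATGC", 3))
```

Under these assumptions the intermediate bit string is the one given in Step 3, and decryption returns the message only when the extracted bits form a prefix of S.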
The above process is the encryption process. It is easy to see that the decryption
process is just to reverse the encryption process. For every received sequence T, the
receiver extracts a sequence out of it. If the extracted sequence is not a prefix of the
reference sequence S, ignore T. If it is, the receiver knows that he has also successfully
The above is the basic version of our approach. In a more complicated version, we
divide S into segments by using a random number generator. That is, k is not fixed any
the sender and the receiver. Suppose the k’s are 6, 3, 2, 4. Then S is divided into
segments with lengths 6, 3, 2 and 4 respectively. Note that there is also a secret message
M. We may therefore also use the same random number generator to divide M into
Step 1. Code S into a binary sequence S1 by using the binary coding scheme.
S3 is sent to the receiver together with many other DNA sequences. The receiver
Input: A set of DNA sequences, one of which has the secret message M
hidden in it by using Algorithm 2.1, a reference DNA sequence S and
a binary coding scheme.
Step 1. Generate numbers k’s and r’s denoted as k1k2…kn and r1r2…rp by
using the same random number generator with the same seed of the
encoding scheme.
Step 2. For a DNA sequence S’ of the set, code S’ into a binary sequence by
using the binary coding used by the sender and use r1+k1,r2+k2,… to
divide the binary sequence into binary segments.
Step 3. For each segment of the first p segments of S’, extract the first ri bits,
called mi.
Step 4. For each segment of the first p segments of S’, extract the last ki bits,
called si.
Step 7. Return M.
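The variable-length version can be sketched as follows, using the naming of the decryption algorithm above: the r's are the lengths of the message segments and the k's are the lengths of the reference segments. The lists `r` and `k` are passed in directly instead of being drawn from a seeded random number generator, and the short message, the lists and the function names are hypothetical. The sketch also assumes that the r's split M exactly.

```python
CODE = {"A": "00", "C": "01", "G": "10", "T": "11"}
DECODE = {bits: letter for letter, bits in CODE.items()}


def dna_to_bits(dna: str) -> str:
    return "".join(CODE[x] for x in dna)


def bits_to_dna(bits: str) -> str:
    if len(bits) % 2:
        raise ValueError("bit string must have even length")
    return "".join(DECODE[bits[i:i + 2]] for i in range(0, len(bits), 2))


def encrypt(s_dna: str, m_bits: str, r: list[int], k: list[int]) -> str:
    """Interleave r_i bits of M with k_i bits of S, segment by segment."""
    s_bits = dna_to_bits(s_dna)
    pieces, mp, sp = [], 0, 0
    for ri, ki in zip(r, k):
        if mp == len(m_bits):
            break
        if mp + ri > len(m_bits) or sp + ki > len(s_bits):
            raise ValueError("segment lengths do not fit M or S")
        pieces.append(m_bits[mp:mp + ri] + s_bits[sp:sp + ki])
        mp, sp = mp + ri, sp + ki
    if mp != len(m_bits):
        raise ValueError("not enough segment lengths to cover M")
    return bits_to_dna("".join(pieces))


def decrypt(t_dna: str, s_dna: str, r: list[int], k: list[int]):
    """Split T by r_i + k_i; return M if the k-parts form a prefix of S."""
    t_bits = dna_to_bits(t_dna)
    message, extracted, pos = [], [], 0
    for ri, ki in zip(r, k):
        if pos == len(t_bits):
            break
        if pos + ri + ki > len(t_bits):
            return None
        message.append(t_bits[pos:pos + ri])           # first r_i bits: m_i
        extracted.append(t_bits[pos + ri:pos + ri + ki])  # last k_i bits: s_i
        pos += ri + ki
    if pos != len(t_bits):
        return None
    if not dna_to_bits(s_dna).startswith("".join(extracted)):
        return None  # not the sequence carrying M
    return "".join(message)


if __name__ == "__main__":
    S = "ATCGAATTCGCGCTGAGTCACAATTCGCGCTGAG"  # prefix of the Litmus segment
    T = encrypt(S, "0110100101", [3, 4, 3], [7, 2, 5])
    print(T, decrypt(T, S, [3, 4, 3], [7, 2, 5]))
```

A sequence whose extracted bits do not form a prefix of S is rejected, which is how the receiver discards the irrelevant sequences.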
For an intruder to find out the secret message, he must be equipped as follows. (1)
He must know precisely the reference DNA sequence S. Since there are roughly 55
million DNA sequences publicly available, it is extremely hard to guess one. (2) He
has to know the random number generator and the two seeds used. (3) He has to know
In this section, we will show a complete example of the complicated version of
Method 1.
S=ATCGAATTCGCGCTGAGTCACAATTCGCGCTGAGTCACAATTCGCGCTGAGT
CACAATTGTGACTCAGCCGCGAATTCCTGCAGCCCCGAATTCCGCATTGCAGA
GATAATTGTATTTAAGTGCCTAGCTCGATACAATAAACGCCATTTGACCATTCAC
CACATTGGTGTGCACCTCCAAGCTCGCGCACCGTACCGTCTCGAGGAATTCCT
GCAGGATATCTGGATCCACGAAGCTTCCCATGGTGACGTCACC
Suppose we have two random number sequences by using a random number generator.
K=3, 4, 7, 6, 1, 5, 3, 7, 6, 4.
0011011000001111011001100111100010110100010000111101100110011110001011010
0010000111101100110011110001011010001000011111011100001110100100101100110
0000111101011110010010010101011000001111010110010011111001001000100011000
0111110110011111100001011100101110010011101100011000100001100000001100101
0011111110000101001111010001010001001111101011101110010001011101010000100
1110110011001000101101100010110110111011000101000001111010111100100101000
1100110111101000110101000110000010011111010101001110101110000110110100010
Divide the secret message M by using the sequence K, thus we will get the following
divided segments:
Divide the reference sequence S by using the sequence R, thus we will get the
1111000, 1011010001.
Insert each segment of M before each of S, thus we will get the following segments:
S’=CGCGTCAAACTTCATCGCGCGCCCGACCGGTCACAGACCCCCTCGCGATCC
TGAGGGTCAC
The Decryption steps will be done by using the sequences K and R to divide the
We will first transform S’ to be a binary sequence by using the binary coding rule.
extract the segments by using the above two sequences. In our case, k1 is 3 and r1 is 7.
We extract the first 10 bits of S’; the first 3 bits are the first segment of M. By
In this section, we will show the strength of Method 1. By our definition, the
strength of an encryption method is the probability that an attacker guesses correctly without any
information. We will point out how many solutions there are if the attacker does not know
sender and the receiver. Suppose the k’s are 6, 3, 2, 4. Then S is divided into segments
with lengths 6, 3, 2 and 4 respectively. Note that there is also a secret message M. We
may therefore also use the same random number generator to divide M into segments.
For an intruder to find out the secret message, he must be equipped as follows. (1)
He must know precisely the reference DNA sequence S. Since there are roughly 55
million DNA sequences publicly available, it is extremely hard to guess one. (2) He
has to know the random number generator and the two seeds used. (3) He has to know
For (1): There are roughly 55 million DNA sequences publicly available. Thus,
the probability of an attacker's successful guess is $\frac{1}{55000000}$ (2a).
For (2): Suppose we handle a sequence S’ after the encryption stage. The size of S’
is n’. S’ is composed of the secret message M and a prefix of the reference
sequence S. We define the size of M as m and the size of the prefix of S as s. It is
It can be imagined that an attacker should first guess the sizes m and s. We know that
$m + s = n'$ with $m, s \ge 1$, so there are $\binom{2 + (n'-2) - 1}{n'-2} = \binom{n'-1}{n'-2} = n'-1$ possibilities here. For
instance, if $n' = 10$, there are $\binom{2 + 10 - 2 - 1}{10 - 2} = \binom{9}{8} = 9$ possibilities, as
follows:
m=1, s=9
m=2, s=8
m=3, s=7
m=4, s=6
m=5, s=5
m=6, s=4
m=7, s=3
m=8, s=2
m=9, s=1
The probability that an attacker successfully guesses m and s is $\frac{1}{n'-1}$ (2b). It is not
There are still two values that the attacker must know, namely m and s. The
problem is that the attacker does not know how many r’s and k’s are used to
break the secret message and the reference sequence S; the r’s sum to m and the
k’s sum to s. The following figure indicates the relations between
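[Figure 2-1 The Relations between m(s) and r’s(k’s): a prefix of S with size s is divided into segments of lengths k1, k2, k3, k4; the secret message with size m is divided into segments of lengths r1, r2, r3, r4.]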
Notice that it is hard for an attacker to know how many k’s and r’s we use. Thus he
should try two k’s (r’s), three k’s (r’s), four k’s (r’s) and so on.
$k_1 = s$, $k_1 \ge 1$
$k_1 + k_2 = s$, $k_1, k_2 \ge 1$
$k_1 + k_2 + k_3 = s$, $k_1, k_2, k_3 \ge 1$
$k_1 + k_2 + k_3 + k_4 = s$, $k_1, k_2, k_3, k_4 \ge 1$
$\vdots$
$k_1 + k_2 + k_3 + \cdots + k_s = s$, $k_1, k_2, k_3, \ldots, k_s \ge 1$
The number of solutions of the first formula is $\binom{1 + (s-1) - 1}{s-1} = \binom{s-1}{s-1} = 1$. The number
of solutions of the second one is $\binom{2 + (s-2) - 1}{s-2} = \binom{s-1}{s-2} = s-1$. The number of solutions of
the third one is $\binom{3 + (s-3) - 1}{s-3} = \binom{s-1}{s-3}$. The number of solutions of the fourth one is
$\binom{4 + (s-4) - 1}{s-4} = \binom{s-1}{s-4}$. The number of solutions of the last one is $\binom{s + (s-s) - 1}{s-s} = \binom{s-1}{0} = 1$.

$$\binom{s-1}{s-1} + \binom{s-1}{s-2} + \binom{s-1}{s-3} + \cdots + \binom{s-1}{0} = \sum_{i=0}^{s-1} \binom{s-1}{s-1-i}$$
The above formula is not very informative by itself. Let us introduce the Binomial Theorem:

$$(x + y)^n = \sum_{i=0}^{n} \binom{n}{i} x^i y^{n-i}, \quad n, i \ge 0.$$

Setting $x = y = 1$ gives

$$(x + y)^n = \sum_{i=0}^{n} \binom{n}{i} x^i y^{n-i} = 2^n.$$

With $n = s - 1$ we have

$$(x + y)^{s-1} = \binom{s-1}{s-1} x^{s-1} + \binom{s-1}{s-2} x^{s-2} y + \binom{s-1}{s-3} x^{s-3} y^2 + \cdots + \binom{s-1}{0} y^{s-1} = \sum_{i=0}^{s-1} \binom{s-1}{s-1-i} x^{s-1-i} y^i,$$

and, again for $x = y = 1$,

$$(x + y)^{s-1} = \sum_{i=0}^{s-1} \binom{s-1}{s-1-i} x^{s-1-i} y^i = 2^{s-1}.$$

Thus the probability of an attacker's successful guessing in this stage is $\frac{1}{2^{s-1}}$ (2c).
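The counting argument can be checked numerically for small s. The sketch below is only a sanity check with hypothetical function names: it enumerates by brute force the ways of writing s as an ordered sum of j positive integers, compares each count with $\binom{s-1}{s-j}$, and compares the total over all j with $2^{s-1}$.

```python
from itertools import product
from math import comb


def count_compositions(s: int, parts: int) -> int:
    """Brute-force count of tuples (k_1, ..., k_parts), each k_i >= 1,
    whose sum is s."""
    return sum(
        1
        for ks in product(range(1, s + 1), repeat=parts)
        if sum(ks) == s
    )


def total_splits(s: int) -> int:
    """Sum of C(s-1, s-j) over j = 1..s, the quantity derived in the text."""
    return sum(comb(s - 1, s - j) for j in range(1, s + 1))


if __name__ == "__main__":
    for s in range(1, 7):
        per_j = [count_compositions(s, j) for j in range(1, s + 1)]
        print(s, per_j, total_splits(s), 2 ** (s - 1))
```

The brute-force enumeration grows as $s^s$, so it is practical only for small values of s.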
Let us consider the following case: Suppose an attacker guesses the
number of k’s; he thereby also guesses the number of r’s because, by our
algorithm, the number of k’s is equal to the number of r’s. Thus the probability in
For (3): The number of binary coding rules is $4 \times 3 \times 2 \times 1 = 24$. The probability of
an attacker's successful guessing in this stage is $\frac{1}{24}$ (2e).
multiplication of (2a), (2b), (2c), (2d) and (2e). The probability is as follows:
Lemma 2-1
Suppose n’ is the length of S’, and s and m are the lengths of the prefix of S used in the
encoding scheme and of the secret message, respectively. The probability of an
attacker's successful guessing of the Insertion Method is
$$\frac{1}{55000000} \times \frac{1}{n'-1} \times \left( \frac{1}{2^{s-1}} \times \frac{1}{2^{m-1}} \right) \times \frac{1}{24}.$$
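To get a feeling for the size of this probability, the product in Lemma 2-1 can be evaluated exactly with rational arithmetic. The function name and the sample values n’ = 10, s = 6, m = 4 are hypothetical and chosen only for illustration.

```python
from fractions import Fraction


def guess_probability(n_prime: int, s: int, m: int,
                      sequences: int = 55_000_000,
                      coding_rules: int = 24) -> Fraction:
    """Evaluate the product of Lemma 2-1 for given n', s and m."""
    if s < 1 or m < 1 or s + m != n_prime:
        raise ValueError("require s, m >= 1 and s + m = n'")
    return (
        Fraction(1, sequences)        # (2a) guessing the reference sequence
        * Fraction(1, n_prime - 1)    # (2b) guessing m and s
        * Fraction(1, 2 ** (s - 1))   # (2c) guessing the k's
        * Fraction(1, 2 ** (m - 1))   # (2d) guessing the r's
        * Fraction(1, coding_rules)   # (2e) guessing the binary coding rule
    )


if __name__ == "__main__":
    print(guess_probability(10, 6, 4))
```

Even for such a short S’ the denominator is already in the trillions, and it doubles with every additional bit of s or m.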
Physically, our encryption method is a data hiding scheme, and the payload should be
considered in a data hiding scheme. Let us define the payload first. The payload of a
data hiding scheme is the number of extra bits added after the data is hidden in the medium.
For instance, suppose the length of the binary sequence of our reference sequence S is n.
After the data is hidden, the length of the sequence is n+s; thus the payload is s. It is important to
consider the payload of a data hiding scheme because if the payload is very large, it may
take too much time to transmit the encrypted data. Thus, an attacker may have enough
In our approach, we may omit a suffix of S after M is hidden, suppose the length of the
suffix is s.
Lemma 2-2:
Chapter 3 Method 2: The Complementary Pair Approach
RNA sequence, we often have the so-called base pairs [BG95, LNC2000 and WHRS87].
We cannot elaborate the detailed meaning of the base pairs of RNA in this paper. For us,
we may just define our own complementary pairs. That is, for each alphabet, we assign
a unique counterpart for it. For instance, we may have the following base pair rule:
(A T) (C A) (G C) (T G).
us assume that we have the secret message M, which has an even number of bits. Note that
it is always reasonable to assume so because we may always use an even number of bits to
code.
Step 1. Artificially generate a sequence without any complementary substrings and
L=ACGGTCTCATCAATGCTTCAGT.
Step 2. Divide M into segments such that each segment contains even number of bits.
Thus, in our case, we have 01 and 10. We code 01 and 10 according to some
coding rule. By using the coding rule given in the previous section, 01 and 10
Step 3. Generate two complementary strings with length k and insert them into L.
GCTTCAGT.
Step 4. Insert the first(second) alphabet of the secret message one alphabet before the
Step 5. Use a random number generator to select two positive integers j1 and j2.
substrings. L’ becomes:
L”=ACCGAAGCTGTCTTTCAGCCGGTCAGTACCTGCAATTAAGCGG
TCGTCTTCAGT.
The sender sends L” together with many other sequences, some with complementary
The receiver would process every sequence received by finding all complementary
substrings of it. If they are of the correct lengths, then it checks whether the prescribed
substrings of S are hidden correctly. If they are, the receiver may now extract the secret
message.
version may use complementary pair substrings with varying lengths. As we indicated
above, the sender sends L’’ together with many other sequences which may contain
complementary pair substrings. Those complementary pair substrings will have lengths
In the following, we shall present the algorithms for Method 2. The algorithms are
complementary strings.
Step 2. Divide M into segments such that each segment may be coded by the
binary coding scheme. Code these segments to be “A”, “G”, “C” and
Step 3. For each ag and ag’, insert them after lg+3, and lg+k+3, 1 ≤ g ≤ p .
Step 4. For each pair of complementary strings ag and ag’ in L, insert mg one
Step 5. Randomly select positive integers j1, j2,…,jp, $0 \le j_1, j_2, \ldots, j_p \le n - 4$. For
each ag’ in L’, insert S[jg,jg+4] one alphabet after ag’. Thus, L will be
Input: A set of DNA sequences, one of which has the secret message M hidden
in it by using Algorithm 3.1, a reference DNA sequence S, a binary
coding scheme and a complementary pair rule used in Algorithm 3.1
Step 1. For the next DNA sequence L” in the set, use the dynamic
Step 2. Select positive integers j1,j2,…,jp by using the same random number
starting from one alphabet after ag’ is the same with S[jg,jg+4] or not.
Step 3. For each pair of complementary substrings ag and ag’, extract the
Step 5. Return M.
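The search for complementary pair substrings can be sketched with a plain quadratic scan. This is a simple stand-in, not the procedure used in the decryption algorithm above; the function names are hypothetical and positions are 0-based. The rule (A T) (C A) (G C) (T G) and the sequence L” are taken from the example in this chapter.

```python
# Complementary pair rule from the text: (A T) (C A) (G C) (T G).
RULE = {"A": "T", "C": "A", "G": "C", "T": "G"}


def complement_string(s: str) -> str:
    """Replace every letter of s by its counterpart under RULE."""
    return "".join(RULE[x] for x in s)


def find_complementary_pairs(seq: str, k: int) -> list[tuple[int, int]]:
    """Return all 0-based (i, j) such that seq[j:j+k] is the complement of
    seq[i:i+k] and the second substring starts after the first one ends."""
    pairs = []
    for i in range(len(seq) - 2 * k + 1):
        target = complement_string(seq[i:i + k])
        j = seq.find(target, i + k)
        while j != -1:
            pairs.append((i, j))
            j = seq.find(target, j + 1)
    return pairs


if __name__ == "__main__":
    L2 = "ACCGAAGCTGTCTTTCAGCCGGTCAGTACCTGCAATTAAGCGGTCGTCTTCAGT"
    for i, j in find_complementary_pairs(L2, 5):
        print(i, L2[i:i + 5], j, L2[j:j + 5])
```

On L” with k = 5 the scan finds, among its hits, AAGCT paired with TTCAG and ACCTG paired with TAAGC; the letters two positions before the first string of each of these pairs are C and G, which code the message 0110. A scan of this kind may also report shifted or accidental matches, which is one reason the receiver additionally checks the inserted substrings of S.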
An intruder should have the following information. First, he should know the
reference DNA sequence. Second, he should have the complementary rule. Third, he
should know the positive integers j’s to authenticate the reference DNA
sequence. Fourth, he should know the binary coding rule. Fifth, he must know the
Let us review the example and the encryption steps of Method 2 in section 4.1.
The main idea is that we insert a part of the secret message before a pair of
message M=0110.
L=ACGGTCTCATCAATGCTTCAGT.
Step 2. Divide M into segments such that each segment contains even number of bits.
Thus, in our case, we have 01 and 10. We code 01 and 10 according to some
coding rule. By using the coding rule given in the previous section, 01 and 10
Step 3. Generate two complementary strings with length k and insert them into L.
GCTTCAGT.
Step 4. Insert the first(second) alphabet of the secret message one alphabet before the
Step 5. Use a random number generator to select two positive integers j1 and j2.
substrings. L’ becomes:
L”=ACCGAAGCTGTCTTTCAGCCGGTCAGTACCTGCAATTAAGCGG
TCGTCTTCAGT.
The sender sends L” together with many other sequences, some with complementary
The receiver would process every sequence received by finding all complementary
substrings of it. If they are of the correct lengths, then it checks whether the prescribed
substrings of S are hidden correctly. If they are, the receiver may now extract the secret
message.
In order to decode the message, an attacker should know the following information:
(5) He must know the correct lengths of the complementary pair substrings.
For (1): There are roughly 55 million DNA sequences publicly available. Thus,
the probability of an attacker's successful guess is $\frac{1}{55000000}$ (3a).
For (2): There are 4 possibilities of the complementary alphabet to each alphabet of
The probability is $\frac{1}{24}$ (3b).
For (3): The positive integers j’s are the starting positions of the substrings of S.
there will be n-5 possibilities for each j. Suppose the length of the secret message M is
For (4): The number of binary coding rules is $4 \times 3 \times 2 \times 1 = 24$. The probability of
an attacker's successful guessing in this stage is $\frac{1}{24}$ (3d).
For (5): It is hard for an attacker to know the length of the complementary substrings.
In our case, we fix the length of each complementary substring and set it to k. An
attacker may use an efficient algorithm to find all complementary substrings, but he
will not know the exact length. Suppose the longest complementary substring
found has length x. The attacker may guess the length of the required
complementary substrings from 1 to x. Thus the probability is $\frac{1}{x}$ (3e).
multiplication of (3a), (3b), (3c), (3d) and (3e). The probability is as follows:
Lemma 3-1
three parts: the secret message, the complementary pair substrings and the substrings of S.
Suppose that the length of secret message M is m. The length of complementary pair
Suppose $|a_1| + |a_2| + \cdots + |a_m|$ is equal to A. In our approach, we will insert a substring for
and we will always insert a substring of S whose length is 5 after each complementary
Lemma 3-2
Suppose m is the length of secret message and A is the total length of the
complementary pair substrings used in the encoding scheme. The payload of
The Complementary Pair Approach is m+A+5m=6m+A.
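As a small arithmetic check of Lemma 3-2, the payload formula can be compared with the lengths of L and L” from the example in this chapter, where the coded message has 2 letters and the complementary pair substrings have a total length of 20. The function name is hypothetical.

```python
def payload(m: int, total_pair_length: int, s_substring_length: int = 5) -> int:
    """Payload of Lemma 3-2: m message letters, A letters of complementary
    pair substrings, and one substring of S (length 5) per message letter."""
    return m + total_pair_length + s_substring_length * m


if __name__ == "__main__":
    L = "ACGGTCTCATCAATGCTTCAGT"
    L2 = "ACCGAAGCTGTCTTTCAGCCGGTCAGTACCTGCAATTAAGCGGTCGTCTTCAGT"
    print(payload(2, 20), len(L2) - len(L))
```

Both quantities agree: the formula gives 6m + A = 32, which is also the number of letters by which L” is longer than L.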
Take the example introduced in section 4.1. The length of M is 2. The total
length of the complementary pair substrings is 20. The length of substrings of S inserted
ABG92, ABG95, BG95, G76 and M75]. A palindrome is a string of the form $\alpha\alpha'$
$\alpha$ = AACGT. There are several methods proposed to find the longest palindrome in a
Our approach using palindromes is quite similar to that using complementary pairs.
$\alpha' = \alpha_h' \alpha_{h-1}' \ldots \alpha_1'$. Then $\alpha\alpha'$ is a complementary palindrome. For example, assume
palindromes.
Chapter 4 Method 3: The Substitution Approach
hide data in the reference DNA sequence. Before we introduce the algorithms, we shall
recall the complementary rule. That is, for each alphabet, we assign a unique
counterpart for it. For instance, we may have the following base pair rule:
(A T) (C A) (G C) (T G).
one.
(A T) (C A) (G C) (T G).
For this method, we will also use a reference sequence S. Let us assume that
Step 1. Suppose the length of the reference sequence S is 15. Select p numbers
randomly from 1 to 15, p is equal to 7 in this case. Assume that they are 2, 3, 5,
Thus S’=GCCATGCCAACTAGG.
The receiver does not need to generate the set A. After receiving a set of
sequences, he checks every position of each sequence S’ in the set. There are only three
possible cases: (1) $S'_i$ is the same as $S_i$ (the secret bit $m_j$ equals 0). (2) $S'_i$ is
$C(S_i)$ (the secret bit $m_j$ equals 1). (3) $S'_i$ is $C(C(S_i))$. If there exists an i such that
$S'_i$ and $S_i$ fit none of the above three cases, the sequence should be ignored.
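The three cases can be turned into a short sketch. It assumes, as the cases suggest, that positions in the set A carry one message bit each (unchanged for 0, C(x) for 1) and that every other position is replaced by C(C(x)). The function names are hypothetical, positions are 1-based as in the text, and the test data are the first 16 positions of the Litmus example given later in this chapter.

```python
# Complementary rule from the text: (A T) (C A) (G C) (T G).
RULE = {"A": "T", "C": "A", "G": "C", "T": "G"}


def encrypt(s: str, positions: set[int], m_bits: str) -> str:
    """Hide one bit of M at each 1-based position in `positions`."""
    ordered = sorted(positions)
    if len(ordered) != len(m_bits):
        raise ValueError("need exactly one position per message bit")
    if ordered and (ordered[0] < 1 or ordered[-1] > len(s)):
        raise ValueError("position outside the reference sequence")
    bit_at = dict(zip(ordered, m_bits))
    out = []
    for i, x in enumerate(s, start=1):
        if i in bit_at:
            # Case (1): keep the letter for bit 0. Case (2): C(x) for bit 1.
            out.append(RULE[x] if bit_at[i] == "1" else x)
        else:
            # Case (3): a position that carries no bit becomes C(C(x)).
            out.append(RULE[RULE[x]])
    return "".join(out)


def decrypt(s_prime: str, s: str):
    """Recover M by comparing S' with S; return None if some position fits
    none of the three cases, meaning the sequence should be ignored."""
    if len(s_prime) != len(s):
        return None
    bits = []
    for x, y in zip(s, s_prime):
        if y == x:
            bits.append("0")
        elif y == RULE[x]:
            bits.append("1")
        elif y != RULE[RULE[x]]:
            return None
    return "".join(bits)


if __name__ == "__main__":
    S = "ATCGAATTCGCGCTGA"
    A = {4, 7, 8, 11, 13, 15, 16}
    S2 = encrypt(S, A, "0110100")
    print(S2, decrypt(S2, S))
```

Because the rule cycles through all four letters, x, C(x) and C(C(x)) are always distinct, so the receiver can tell the three cases apart without knowing A.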
(Substitution Approach)
Step 2. For the next sequence S’ of the sequence set, do the following
operations:
if there exists a j such that S’j ≠ Sj, S’j ≠ C(Sj) and
For an intruder to find out the secret message, he must have the following
information: (1) He must know the reference DNA sequence. (2) He must know the
complementary rule.
In this section, we will work through a full example to illustrate Method 3. Our reference
section.
S=ATCGAATTCGCGCTGAGTCACAATTCGCGCTGAGTCACAATTCGCGCTGAGT
CACAATTGTGACTCAGCCGCGAATTCCTGCAGCCCCGAATTCCGCATTGCAGA
GATAATTGTATTTAAGTGCCTAGCTCGATACAATAAACGCCATTTGACCATTCAC
CACATTGGTGTGCACCTCCAAGCTCGCGCACCGTACCGTCTCGAGGAATTCCT
GCAGGATATCTGGATCCACGAAGCTTCCCATGGTGACGTCACC
Suppose A={4, 7, 8, 11, 13, 15, 16, 20, 23, 27, 36, 39, 42, 43, 44, 49, 63, 67, 68, 69,
70, 72, 74, 76, 79, 83, 86, 88, 89, 93, 94, 96, 97, 100, 101, 107, 117, 121, 126, 139, 156,
171, 178, 190, 201, 212}, which is generated by a random number generator.
the contents in A.
S=ATCGAATTCGCGCTGAGTCACAATTCGCGCTGAGTCACAATTCGCGCTGAG
TCACAATTGTGACTCAGCCGCGAATTCCTGCAGCCCCGAATTCCGCATTGCAG
AGATAATTGTATTTAAGTGCCTAGCTCGATACAATAAACGCCATTTGACCATTC
ACCACATTGGTGTGCACCTCCAAGCTCGCGCACCGTACCGTCTCGAGGAATTC
CTGCAGGATATCTGGATCCACGAAGCTTCCCATGGTGACGTCACC
Recall that our encryption algorithm transforms S into S’ by the following rule: for all
i from 1 to 15,
M=0110100101001100101001010101001010100110101010.
S’=GCTGGGGGTACAACGAACTTTGACCTCTATCAGACCGTAGCGAGTATCGGA
CTGTGGCCACATTCTACCCAAAAGGCTCCATTATCTAGGGCTATGAGCTCTGA
GAACGGCCACGCTCGGCCATTGGATCTAGCGTGGTGGGTATTGCCCAGTTGGC
TGTTGTGCCAACATATGTTCATGGATCTATATATTACGTTACTGTAGAAGGCCT
CCATGAAGCGCTCAAGCTTGTAGGATCCTTTGCAACAGTACTGTT.
The receiver will get the secret message M by comparing S’ and S. The comparing
Let us review the example and the steps of Method 3. For this method, we will also
secret message M=m1m2…mp is 0111010. The length of S is 15 and is larger than the
length of M, p, which is 7 in this case. For illustration, assume that the complementary
rule is the same as given in the above sections. That is, the rule is as follows:
(A T) (C A) (G C) (T G).
Step 1. Suppose the length of the reference sequence S is 15. Select p distinct numbers
randomly from 1 to 15, where p is equal to 7 in this case. Assume that they are sorted
if i is not equal to any Aj, set Si to C(C(Si)).
Thus S’=GCCATGCCAACTAGG.
For an intruder to find out the secret message, he should have the following
information.
For (1): There are roughly 55 million DNA sequences publicly available. Thus,
the probability that an attacker guesses the reference sequence correctly is $\frac{1}{55000000}$ (4a).
For (2): We should consider how many legal complementary rules there are. We define a
legal complementary rule as follows: for each letter x of the DNA alphabet,
none of C(x), C(C(x)) and C(C(C(x))) is equal to x. Note that C(x) is the
(A T) (T C) (C G) (G A),
(A T) (T G) (G C) (C A),
(A C) (C T) (T G) (G A),
(A C) (C G) (G T) (T A),
(A G) (G T) (T C) (C A), and
(A G) (G C) (C T) (T A)
The probability is $\frac{1}{6}$ (4b).
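The count of six legal rules can be confirmed by brute force. The sketch below treats a complementary rule as a one-to-one mapping on {A, C, G, T} (each letter has a unique counterpart) and keeps those that satisfy the definition above; the function name legal_rules is an illustrative assumption.

```python
from itertools import permutations

BASES = "ACGT"


def legal_rules():
    """Enumerate all one-to-one rules C with C(x), C(C(x)), C(C(C(x))) != x."""
    rules = []
    for image in permutations(BASES):
        C = dict(zip(BASES, image))
        if all(x not in (C[x], C[C[x]], C[C[C[x]]]) for x in BASES):
            rules.append(C)
    return rules


rules = legal_rules()
print(len(rules))  # 6
for C in rules:
    print(" ".join(f"({x} {C[x]})" for x in BASES))
```

Each surviving rule is a single cycle through all four letters, which is why exactly 3! = 6 of the 24 possible mappings remain.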
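Combining (4a) and (4b), the probability that an attacker guesses both correctly is
$\frac{1}{55000000} \times \frac{1}{6}$.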
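As a quick check of the arithmetic, the two probabilities (4a) and (4b) can be multiplied exactly. This sketch assumes the two guesses are independent and uses the figure of 55 million sequences quoted above; the variable names are illustrative.

```python
from fractions import Fraction

p_sequence = Fraction(1, 55_000_000)  # (4a): guessing the reference sequence
p_rule = Fraction(1, 6)               # (4b): guessing the complementary rule

# Assumption: the two guesses are independent, so the probabilities multiply.
p_both = p_sequence * p_rule
print(p_both)  # 1/330000000
```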
Let us consider another case. Suppose that the attacker cannot know the reference
sequence S. Then, according to our algorithm, the attacker must
consider three possibilities for each letter x of S’:
(1) S’ is x and S is x, too. (2) S’ is x and S is C(x). (3) S’
Lemma 4-1:
In this section, we will introduce the payload of Method 3. The payload of Method
3 is 0 because we only change letters of the reference sequence S but do not add
Chapter 5 Conclusion and Future Works
5.1 Conclusions
In this thesis, we have presented three encryption methods based on DNA
sequences. Previous related works are almost all based upon biological
properties. There is almost no difference between a real DNA sequence and a faked one.
There are a large number of DNA sequences publicly available on various web sites. A
rough estimate puts the number of publicly available DNA sequences at
around 55 million. Our encryption methods are based upon the above two properties.
For an attacker, it is hard to detect whether there are secret messages in a
DNA sequence or not. Even if he knows that there are some secret messages in our
DNA sequence, it is hard for him to know the reference sequence and correctly decode
the secret.
It would be worthwhile to make the decoding stage more complicated. It would also be worthwhile to find
other properties of DNA sequences which can be used to hide data. Decreasing
the payloads of Method 1 and Method 2 is also part of our future work.
Bibliography
[A2000] Dynamic Programming Algorithms for RNA Secondary Structure Prediction with
Pseudoknots, Akutsu, T., Discrete Applied Mathematics, Vol. 104, 2000, pp. 45-62.
[ABG92] Optimal Parallel Algorithms for Periods, Palindromes and squares, Apostolico,
[ABLRRW94] Molecular Biology of the Cell, Alberts, B., Bray, D., Lewis, J., Raff, M.,
Roberts, K. and Watson, J. D., New York & London: Garland Publishing, 1994.
[BG95] Finding All Periods and Initial Palindromes of a String, Breslauer, D., Galil, Z.,
[CRB99] Hiding Messages in DNA Microdots, Clelland, C. T., Risca, V. and Bancroft,
[LRBR2000] Cryptography with DNA Binary Strands, Leier, A., Richter, C., Banzhaf,
[M75] A New Linear-Time On-Line Algorithm for Finding the Smallest
Initial Palindrome of a String, Manacher, G., J. ACM, Vol. 22, 1975, pp. 346-351.
Scientific, 2002.
[SFP2002] Hiding Data in DNA, Shimanovsky, B., Feng, J. and Potkonjak, M., Revised
Papers from the 5th International Workshop on Information Hiding, Lecture Notes in
[WHRS87] Molecular Biology of the Gene, Watson, J., Hopkins, N., Roberts, J. and
Steitz, J., Benjamin Cummings, 4th Edition, Menlo Park, CA, 1987.