
Department of Computer Science and Information Engineering, National Chi Nan University

Master's Thesis

Encryption Methods Based upon DNA Sequences

Advisor: Professor R. C. T. Lee
Graduate Student: Hong-Zhi Hsu

June 2006
Acknowledgements


First, I thank my advisor, Professor R. C. T. Lee, for his education and training during my graduate school years. He made me a more humane, merciful, and friendly person. In research, he also taught me to be a logical, accurate, and scrupulous student. My most significant improvement has been in English comprehension and writing.

Secondly, I thank Professor C. H. Huang, Professor Jenn-Sheng Lin, Professor Yue-Li Wang and Professor Chuan-Yi Tang for their suggestions on my thesis. I also appreciate their guidance on my research.

Thirdly, I am grateful to C. C. Lin, F. L. Lin, S. C. Chen, C. Y. Yang, J. G. Chen, Y. M. Pan, S. J. Pan, W. H. Wen, W. L. Wang, T. H. Ku, Z. H. Chen, C. S. Ou and G. R. Zhuang for their guidance and for our discussions on research. I am also grateful to L. C. Juan, H. J. Chen, W. H. Wen, K. H. Liao, Y. L. Chen, T. H. Ku, Z. H. Chen, C. S. Ou, G. R. Zhuang and Z. B. Huang for sharing their knowledge of the humanities and helping me learn more about the world.

Finally, I sincerely thank Linda, who has always supported me during these two years.

Hsu, Hong-Zhi
Computer Science and Information Engineering

National Chi-Nan University, Puli, Nantou


June, 2006

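Thesis Abstract (Chinese Version)

Title of Thesis: Encryption Methods Based upon DNA Sequences

Institute: Department of Computer Science and Information Engineering, National Chi Nan University    Pages: 45

Graduation Time: June 2006    Degree: Master

Student: Hong-Zhi Hsu    Advisor: Professor R. C. T. Lee

Abstract

In this thesis, we propose three encryption methods that exploit properties of DNA sequences. We point out some interesting properties of DNA which can be used to hide data. The three methods proposed in this thesis are the insertion method, the complementary pair method and the substitution method. In each method, we first secretly select a reference DNA sequence S and then incorporate the message M into S to obtain S’. We send S’ to the receiver together with other DNA or DNA-like sequences. The receiver can identify which sequence hides the message M and ignore the other DNA sequences. He can also extract the message M.

The main idea of the first method is to break M and S into segments and insert each segment of M between consecutive segments of S. The second method breaks M into segments and inserts each of them before a pair of complementary substrings in the DNA sequence. We also extract a segment from S and insert it after the complementary substrings that hide a segment of M. The third method hides the information in M by changing letters in the sequence.

Keywords: Encryption, DNA, Complementary Substring, Complementary Rule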
Abstract

Title of Thesis: Encryption Methods Based Upon DNA Sequences

Name of Institute: Department of Computer Science and Information Engineering, National Chi-Nan University.

Pages: 45

Graduation Time: 06/2006 Degree Conferred: Master

Student Name: Hong-Zhi Hsu Advisor Name: Richard Chia-Tung Lee

Abstract

In this thesis, we propose three encryption methods based upon some properties of DNA

sequences. We will point out that the DNA sequences possess some interesting

properties which we can utilize to hide data. The three methods proposed in this thesis

are: the insertion method, the complementary pair method and the substitution method.

For each method, we secretly select a reference DNA sequence S and incorporate the

secret message M into it such that we obtain S’. We send this S’, together with many

other DNA, or DNA-like sequences to the receiver. The receiver is able to identify the

particular sequence with M hidden in it and ignore all of the other sequences. He will

also be able to extract M.

The main idea of Method 1 is to break S and M into segments and insert each segment of M between consecutive segments of S. Method 2 breaks M into several segments and hides each one before a complementary pair substring in a DNA sequence. We also extract a segment of S and insert it after each complementary pair substring that has a segment of M hidden before it. Method 3 hides the secret message M by substituting letters of S.

Keywords: Encryption, DNA, Complementary Pair Substring, Complementary Rule.

Contents

List of Figures

Chapter 1 Introduction and Related Works
1.1 A Simple Version of DNA Based Encryption Method by Using Biology Properties
1.2 The Complicated Version
1.3 Thesis Organization

Chapter 2 Method 1: The Insertion Method
2.1 Algorithms of Insertion Method
2.2 An Example of Insertion Method
2.3 Robustness of Insertion Method
2.4 Payload of Insertion Method

Chapter 3 Method 2: The Complementary Pair Approach
3.1 Algorithms of Complementary Pair Approach
3.2 Robustness of Complementary Pair Approach
3.3 Payload of Complementary Pair Approach
3.4 The Complementary Palindrome

Chapter 4 Method 3: The Substitution Approach
4.1 Algorithms of Substitution Approach
4.2 An Example of Substitution Approach
4.3 Robustness of Substitution Approach
4.4 Payload of Substitution Approach

Chapter 5 Conclusion and Future Works
5.1 Conclusions
5.2 Future Works

Bibliography

List of Figures

Figure 1-1 An Experiment of Fluorescent on DNA.
Figure 2-1 The Relations between m(s) and r’s(k’s).
Figure 3-1 An Illustration of the Complementary Pair Approach.
Chapter 1 Introduction and Related Works


In recent years, much research has been done on DNA-based encryption schemes [C2003, CRB99, LRBR2000 and SEP2002]. Most of them use biological properties of DNA sequences. We present one of them in this chapter.

Specifically, we present two encryption methods that use biological properties. Both of them are based on physical DNA sequences and use chemical procedures for decoding [C2003]. We will introduce a simple version and a complicated version.

1.1 A Simple Version of DNA Based Encryption Method by Using Biology Properties

In this section, we introduce how to use biological properties to hide data. In a DNA-based encryption scheme, we store binary data as a DNA sequence and then hide the data in some way. A key, called a primer, is used to recover the binary data, and we will show how primers are used to store our data in DNA sequences.

The sender sends a DNA sequence and selected primers to the receiver, and the receiver uses the primers and the DNA sequence to obtain a sequence consisting of the binary data. We may use a public DNA sequence as our reference sequence, which the receiver also knows. In that case, we only need to send the selected primers to the receiver. Without the primers and the reference sequence, no one can correctly decode the binary data.

The trick is how to store binary data in a DNA sequence. As mentioned before, we use selected primers to do that. Let us define a primer first. A primer is a short string that is complementary to a substring of a DNA sequence. Suppose we have a DNA sequence:

ATGCTTAGTTCCATCGGAGACTAATGGCCTA

and two primers:

ATCAA and GATTAC

which are the complements of the substrings TAGTT and CTAATG, respectively. We also need to define the complementary rule. A complementary relation is as follows: for each letter x of a DNA sequence, c(x) denotes the complementary letter of x. A complementary rule is the set of complementary relations for all letters of a DNA sequence. In our case, we define the complementary rule as follows: A-T, T-A, C-G and G-C. That is, c(A) is T, for instance.

In biology, there is a chemical mechanism that makes primers combine with a DNA sequence. We then use a fluorescent substance to indicate the positions of the primers. The fluorescence makes the positions where primers combine with substrings of the DNA sequence bright. A bright section corresponds to the binary data ‘1’ and a dark one corresponds to ‘0’. The following is a sample of this kind of chemical experiment.

Figure 1-1 An Experiment of Fluorescent on DNA.

The green sections represent the binary data ‘1’s in the sequences and the dark sections represent the binary data ‘0’s.

In our case, we will have the following combinations:

ATGCTTAGTTCCATCGGAGACTAATGGCCTA
     ATCAA          GATTAC
  0    1      0       1     0

Thus we will finally have the binary sequence “01010”.

Consider another example. Suppose we have the following DNA sequence and primers. DNA sequence:

ATGCTTAGTTCCATCGGAGACTAAT

Primers:

ATCAA, GGTAG and GATTA

The sequence will be coded as 01101 as follows:

ATGCTTAGTTCCATCGGAGACTAAT
     ATCAAGGTAG     GATTA
  0    1    1    0    1

The sender sends the DNA sequence and the selected primers to the receiver. The

receiver now uses the primers and the DNA sequence to transform it into a sequence

consisting of bright and dark parts by using the chemical mechanism as we mentioned

before. This will produce a binary stream.
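The readout described above can be imitated on strings. The following is only a minimal sketch, not a model of the chemistry: it assumes that the primers are listed from left to right and that each primer binds at the first matching site after the previous one. The function names `complement` and `read_bits` are hypothetical and introduced here for illustration.

```python
# Complementary rule from the text: A-T, T-A, C-G, G-C.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def complement(strand: str) -> str:
    """Return the letter-by-letter complement of a DNA string."""
    return "".join(COMPLEMENT[x] for x in strand)


def read_bits(sequence: str, primers: list[str]) -> str:
    """Imitate the bright/dark readout of the simple version.

    Assumption of this sketch: primers are given in left-to-right order and
    each one binds at the first matching site after the previous primer.
    """
    bits = []
    pos = 0
    for primer in primers:
        site = complement(primer)          # substring the primer binds to
        start = sequence.find(site, pos)
        if start < 0:
            raise ValueError(f"primer {primer} has no binding site")
        if start > pos:
            bits.append("0")               # dark stretch before the primer
        bits.append("1")                   # bright stretch under the primer
        pos = start + len(site)
    if pos < len(sequence):
        bits.append("0")                   # trailing dark stretch
    return "".join(bits)


print(read_bits("ATGCTTAGTTCCATCGGAGACTAATGGCCTA", ["ATCAA", "GATTAC"]))
print(read_bits("ATGCTTAGTTCCATCGGAGACTAAT", ["ATCAA", "GGTAG", "GATTA"]))
```

Under these assumptions the two calls reproduce the two examples of this section, 01010 and 01101.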

As a real example, the following is a segment of the DNA sequence of Litmus, whose full length is 2856 nucleotides.

ATCGAATTCGCGCTGAGTCACAATTCGCGCTGAGTCACAATTCGCGCTGAGTC

ACAATTGTGACTCAGCCGCGAATTCCTGCAGCCCCGAATTCCGCATTGCAGAG

ATAATTGTATTTAAGTGCCTAGCTCGATACAATAAACGCCATTTGACCATTCACC

ACATTGGTGTGCACCTCCAAGCTCGCGCACCGTACCGTCTCGAGGAATTCCTG

CAGGATATCTGGATCCACGAAGCTTCCCATGGTGACGTCACC

This means that a real DNA sequence can store a large amount of binary data by using short primers. Assume we use primers consisting of five nucleotides; then there will be at most $\lceil 2856/5 \rceil = 572$ bits if we store data in the DNA sequence of Litmus.

It should be noted that if the primers are too short, it is hard to store our data in a DNA sequence. Suppose we have to store the binary sequence “01101” and we have the following DNA sequence:

ATCGGCTAATCGGCTAATCGGCTAATCGGCTA

Assume we chose small size primers as the following:

TA, CG, and CG

It may cause the following combinations:

ATCGGCTAATCGGCTAATCGGCTAATCGGCTA

TA CG CG

1 0 1 0 1 0

Thus we will have the binary code “101010” and it is not our required result.
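The reason short primers fail can be checked by listing every place where a primer could bind. The sketch below is a string-level illustration only; the helper name `binding_sites` is hypothetical.

```python
# Complementary rule from the text: A-T, T-A, C-G, G-C.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def binding_sites(sequence: str, primer: str) -> list[int]:
    """Return every start index at which the primer's complement occurs."""
    site = "".join(COMPLEMENT[x] for x in primer)
    return [i for i in range(len(sequence) - len(site) + 1)
            if sequence.startswith(site, i)]


sequence = "ATCGGCTAATCGGCTAATCGGCTAATCGGCTA"
for primer in ("TA", "CG"):
    print(primer, binding_sites(sequence, primer))
```

For this sequence each of the two-letter primers has four possible binding sites, so the sender cannot control which sections turn bright. A longer primer such as ATCAA in the earlier example has a single site.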

It is essential to make sure that the selected primers correctly generate the required binary data. In the following example, we use a segment of the DNA sequence of Litmus to store the binary data 010111010110. We use the primers

CTTAAGCGCGA, AAGCGCGACT, CAGTGTTAAG, CGCGACT, GGCGCTTAAGGACGT, TCTATTAACATAAATT and CACGGATCGAGCTA.

The result is as follows:

ATC GAATTCGCGCT GAGTCACAA TTCGCGCTGA GTCACAATTC GCGCTGA
    CTTAAGCGCGA           AAGCGCGACT CAGTGTTAAG CGCGACT
 0       1          0         1          1         1

GTCACAATTGTGACTCAG CCGCGAATTCCTGCA GCCCCGAATTCCGCATTGCAG
                   GGCGCTTAAGGACGT
        0                 1                  0

AGATAATTGTATTTAA GTGCCTAGCTCGAT ACAATAAACGCCATTTGAC
TCTATTAACATAAATT CACGGATCGAGCTA
       1               1                 0

CATTCACCACATTGGTGTGCACCTCCAAGCTCGCGCACCGTACCGTCTCGAGG

AATTCCTGCAGGATATCTGGATCCACGAAGCTTCCCATGGTGACGTCACC

To prevent an attacker from guessing the selected primers, it is better to choose longer primers than shorter ones, since a long primer is much harder to guess.

In order to decode the message, the receiver should know (1) the complementary rule, (2) the reference DNA sequence and (3) the selected primers. For an attacker, it is really hard to guess the reference DNA sequence and the selected primers. There are about 55 million publicly available DNA sequences in the world.

1.2 The Complicated Version

In the above method, a sender sends a reference DNA sequence and the selected primers to the receiver, and the receiver decodes the binary message hidden in the DNA sequence by using the selected primers. To make the decoding more complicated, we could instead send only a section of each primer, together with the information needed to recover the correct primer. This makes an attack harder because it would be hard for an attacker to obtain the correct primer even if he got the short section.


For instance, suppose we have a reference DNA sequence:

ATGCTTAGTTCCATCGGAGACTAAT

We select the following primers:

ATCAA, GGTAG and GATTA

In the simple version, we send the sequence and the selected primers to the receiver. This time, we send only a partial section of each primer. We will now describe how the receiver recovers the correct primers and decodes. After getting the sections of the selected primers, the receiver can easily recover all the correct primers and get the message.

The key point is how the receiver can recover the correct primers from sections of the selected primers. Before we introduce the trick, we first introduce a term, the template. A template is the complementary string of a primer. For instance, for a primer TACCCTGATTAC, its template is ATGGGACTAATG. A template is used to recover the original primer by means of a chemical mechanism called PCR (Polymerase Chain Reaction). We recover the original primer by applying PCR to the template and a partial primer. We cut a certain suffix of the primer, in this case GATTAC, and put it together with the template. PCR then yields the correct primer.

The template of the primer: ATGGGACTAATG

A partial of the primer: GATTAC

The primer: TACCCTGATTAC


By using the PCR strategy, we do not need to send the complete primers to the receiver. Instead, we send the prefixes or the suffixes of the selected primers and the template of each primer. The receiver then uses each partial primer and its corresponding template to recover the correct primer. After he gets all the selected primers, he can decode and get the secret message.

The following example is a complicated version of encryption by using a segment of

the DNA sequence of Litmus. Suppose we still hide the binary data 010111010110 and

we still use primers

CTTAAGCGCGA, AAGCGCGACT, CAGTGTTAAG, CGCGACT,

GGCGCTTAAGGACGT, TCTATTAACATAAATT and CACGGATCGAGCTA

The template for each primer is

GAATTCGCGCT, TTCGCGCTGA, GTCACAATTC, GCGCTGA,

CCGCGAATTCCTGCA, AGATAATTGTATTTAA and GTGCCTAGCTCGAT

We cut off a suffix of each primer and keep the remaining prefixes as the partial primers:

CTTAAGC, AAGCGCG, CAGTGTTA, CGCGAC, GGCGCTTAAG,

TCTATTAACAT and CACGGATC

We send the templates and the partial primers to the receiver. For each

corresponding template and the partial primer, the receiver uses the PCR strategy to

recover the selected primer.


For instance:

Template: GAATTCGCGCT

Partial primer: CTTAAGC

Primer: CTTAAGCGCGA

Thus the receiver will get all selected primers and thus decode the sequence as we

mentioned before.

It is essential that the mapping between templates and partial primers is one to one. Suppose we have the following partial primers:

CAGGT and CAG

Both begin with CAG, whose complement is GTC. If there is only one template whose prefix is GTC, both partial primers match it, and the receiver would not know which partial primer belongs to it.
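The pairing of partial primers with templates, including the one-to-one requirement, can be sketched on strings as follows. This is not a simulation of PCR: it simply assumes that a partial primer belongs to a template when the complement of that template starts or ends with the partial primer. The function names are hypothetical.

```python
# Complementary rule from the text: A-T, T-A, C-G, G-C.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def complement(strand: str) -> str:
    return "".join(COMPLEMENT[x] for x in strand)


def recover_primer(partial: str, templates: list[str]) -> str:
    """Return the full primer whose prefix or suffix is the partial primer.

    Assumption of this sketch: exactly one template must match.
    """
    candidates = [complement(t) for t in templates]
    matches = [p for p in candidates
               if p.startswith(partial) or p.endswith(partial)]
    if len(matches) != 1:
        raise ValueError(f"{partial}: {len(matches)} matching templates")
    return matches[0]


def recover_all(partials: list[str], templates: list[str]) -> list[str]:
    """Recover every primer and reject mappings that are not one to one."""
    primers = [recover_primer(p, templates) for p in partials]
    if len(set(primers)) != len(primers):
        raise ValueError("two partial primers map to the same template")
    return primers


templates = ["GAATTCGCGCT", "TTCGCGCTGA", "GTCACAATTC", "GCGCTGA",
             "CCGCGAATTCCTGCA", "AGATAATTGTATTTAA", "GTGCCTAGCTCGAT"]
partials = ["CTTAAGC", "AAGCGCG", "CAGTGTTA", "CGCGAC", "GGCGCTTAAG",
            "TCTATTAACAT", "CACGGATC"]
print(recover_all(partials, templates))
```

With the templates and partial primers of the Litmus example, the call returns the seven selected primers. Passing the partial primers CAGGT and CAG with a single template that starts with GTC raises an error, which is the ambiguity described above.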

For this complicated version encryption, in order to decode the message, a receiver

should know (1) the complementary rule, (2) the partial primers and (3) the templates of

selected primers.

1.3 Thesis Organization

The thesis is organized as follows. In chapter 2, we propose the first method, the Insertion Method. The main idea of Method 1 is to divide the secret message and the reference DNA sequence into segments and then interleave the segments of the secret message with those of the reference sequence. The robustness and the payload of Method 1 are also discussed in that chapter. We propose Method 2, the Complementary Pair Approach, in chapter 3. We insert the secret message before a pair of complementary substrings. The robustness and the payload of Method 2 are also discussed in that chapter. Method 3, the Substitution Approach, is proposed in chapter 4. We change letters according to our conditions to hide the secret message. The robustness and the payload of Method 3 are discussed in chapter 4. Finally, conclusions and future work are given in chapter 5.

Chapter 2 Method 1: The Insertion Method

In this chapter, we shall point out that the DNA sequences possess some interesting

properties which we can utilize to hide data. We present three methods in the following

three chapters, the insertion method, the complementary pair method and the substitution

method. For each method, we secretly select a reference DNA sequence S and

incorporate the secret message M into it such that we obtain S’. We send this S’,

together with many other DNA, or DNA-like sequences to the receiver. The receiver is

able to identify the particular sequence with M hidden in it and ignore all of the other

sequences. He will also be able to extract M.

In this chapter, we will present the first method which does not make use of the

biological properties. Instead, we use the following properties of DNA sequences.

A DNA sequence is a sequence consisting of four letters: A, C, G and T. Each letter corresponds to a nucleotide. A DNA sequence is usually quite long. For instance, the following are two DNA sequences.

The first one is a segment of the DNA sequence of Litmus, whose full length is 2856 nucleotides:

ATCGAATTCGCGCTGAGTCACAATTCGCGCTGAGTCACAATTCGCGCTGAGTC

ACAATTGTGACTCAGCCGCGAATTCCTGCAGCCCCGAATTCCGCATTGCAGAG

ATAATTGTATTTAAGTGCCTAGCTCGATACAATAAACGCCATTTGACCATTCACC

ACATTGGTGTGCACCTCCAAGCTCGCGCACCGTACCGTCTCGAGGAATTCCTG

CAGGATATCTGGATCCACGAAGCTTCCCATGGTGACGTCACC.

The second one is a segment of the DNA sequence of Balsaminaceae, whose full length is 2283 nucleotides:

TTTTTATTATTTTTTTTCATTTTTTTCTCAGTTTTTAGCACATATCATTACATTTTA

TTTTTTCATTACTTCTATCATTCTATCTATAAAATCGATTATTTTTATCACTTATTTT

TCTAATTTCCAATATTTCATCTAATGATTATATTACATTAAAGAAATCGGTTAAAA

GCGACTAAAAATCAATCTGGAACAAGGCTTAGTTTATTTAATATATTATTTTATG

TAATTTCTATTGAAAAATTAGTTAAAAGGCAAGTATTTGAGAT.

We can now easily see one special property of DNA sequences: There is almost no

difference between a real DNA sequence and a faked one. This is a property which we

shall exploit in our research.

There is another fact which is quite useful to us: There are a large number of DNA

sequences publicly available in various web-sites. A rough estimation would put the

number of DNA sequences publicly available to be around 55 million.

By using the above facts, we designed three DNA based encryption methods. All

of these methods would secretly select a reference sequence S from publicly available

DNA sequences. Only the sender and the receiver are aware of this reference sequence.

The sender would transform this selected DNA sequence S into a new sequence S’ by

incorporating the DNA sequence S with the secret message M. This transformed

sequence S’ is sent by a sender to the receiver together with many other DNA sequences.

The receiver would then examine all of the received sequences, identify S’ and recover

the secret message M.

We shall introduce three methods in the following chapters. For all of these methods, we assume that there are two schemes used by the sender and the receiver which are kept secret. The first one is a binary coding scheme which transforms the letters A, C, G and T into binary codes and vice versa. For instance, the following may be a binary coding used: ((A 00) (C 01) (G 10) (T 11)). It should be noted that more digits may be used. The second scheme is a complementary pair rule. That is, we assign each letter x a complement, denoted as C(x). The following may be such a rule: ((A C) (C G) (G T) (T A)).

We also assume that the secret message M is a binary sequence.

2.1 Algorithms of Insertion Method

To simplify the discussion, we start with the most basic version and give a simple

example. The more complicated version of our method will be presented after this basic

one is given. Suppose the secret message M is 01001100. Let S be

ACGGTTCCAATGC. Our coding steps are as follows:

Step 1. We first code S into a binary sequence by using the binary coding scheme. Thus the sequence S becomes 00011010111101010000111001.

Step 2. Divide the binary form of S into segments such that each segment contains k bits. Suppose k is 3. Then we have the following segments: 000, 110, 101, 111, 010, 100, 001, 110, 01.

Step 3. Insert the bits of M, one at a time, at the beginning of the segments of S. The result is as follows: 0000, 1110, 0101, 0111, 1010, 1100, 0001, 0110, 01. We discard the segments without any secret message bit inserted. Thus, we have the following segments: 0000, 1110, 0101, 0111, 1010, 1100, 0001, 0110. Concatenating the above segments, we have the following binary sequence: 00001110010101111010110000010110.

Step 4. We use the binary coding scheme to produce the following faked DNA sequence: S’=AATGCCCTGGTAACCG. As the reader can see, this sequence is quite different from S.

Step 5. We send the above sequence S’ to the receiver, amid many other irrelevant sequences.

The above process is the encryption process. It is easy to see that the decryption

process is just to reverse the encryption process. For every received sequence T, the

receiver extracts a sequence out of it. If the extracted sequence is not a prefix of the

reference sequence S, ignore T. If it is, the receiver knows that he has also successfully

extracted the secret message M as a by-product.
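The basic version, with a fixed k and one message bit per segment, can be sketched as follows. This is a minimal illustration under stated assumptions: only full k-bit segments of S are used, and the prefix test is done on the binary form of S. The function names are hypothetical.

```python
# Binary coding scheme of the example: A=00, C=01, G=10, T=11.
CODE = {"A": "00", "C": "01", "G": "10", "T": "11"}
BASE = {bits: base for base, bits in CODE.items()}


def to_bits(dna: str) -> str:
    return "".join(CODE[x] for x in dna)


def to_dna(bits: str) -> str:
    if len(bits) % 2:
        raise ValueError("an odd number of bits cannot be coded as DNA")
    return "".join(BASE[bits[i:i + 2]] for i in range(0, len(bits), 2))


def encrypt_basic(s: str, m: str, k: int) -> str:
    """Insert one bit of M in front of each k-bit segment of S."""
    s_bits = to_bits(s)
    # Step 2: keep only full k-bit segments (assumption of this sketch).
    segments = [s_bits[i:i + k] for i in range(0, len(s_bits) - k + 1, k)]
    if len(m) > len(segments):
        raise ValueError("S is too short for this message")
    # Step 3: segments beyond the length of M are dropped by zip.
    merged = "".join(bit + seg for bit, seg in zip(m, segments))
    return to_dna(merged)  # Step 4


def decrypt_basic(t: str, s: str, k: int):
    """Return M if the bits extracted from T form a prefix of S, else None."""
    bits = to_bits(t)
    groups = [bits[i:i + k + 1] for i in range(0, len(bits), k + 1)]
    m = "".join(g[0] for g in groups)
    extracted = "".join(g[1:] for g in groups)
    return m if to_bits(s).startswith(extracted) else None


s_prime = encrypt_basic("ACGGTTCCAATGC", "01001100", 3)
print(s_prime)                                     # AATGCCCTGGTAACCG
print(decrypt_basic(s_prime, "ACGGTTCCAATGC", 3))  # 01001100
```

A sequence that does not carry the message, for example TTTTTATTATTTTTTT, fails the prefix test and is ignored.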

The above is the basic version of our approach. In a more complicated version, we

divide S into segments by using a random number generator. That is, k is not fixed any

more. Instead, it is determined by a random number generator which is known only to

the sender and the receiver. Suppose the k’s are 6, 3, 2, 4. Then S is divided into

segments with lengths 6, 3, 2 and 4 respectively. Note that there is also a secret message

M. We may therefore also use the same random number generator to divide M into

segments. The parameter used will be denoted as r.

The following is the exact algorithm for Method 1.

Algorithm 2.1 Encryption Algorithm of Method 1 (Insertion Method)

Input: A reference DNA sequence S, a secret binary message M and a binary coding scheme to code A, C, G and T into binary digits.

Output: An encrypted DNA sequence S’.

Step 1. Code S into a binary sequence S1 by using the binary coding scheme.

Step 2. Generate k’s by using a random number generator to divide S1 into segments, and generate r’s to divide the secret message M into segments. Each ki and ri is at least 1. Denote the segments of S1 by s1s2…sn and those of M by m1m2…mp.

Step 3. Insert each mi of M before si of S1 to produce a new binary sequence. Delete sp+1sp+2…sn. Denote the resulting binary sequence by S2.

Step 4. Transform sequence S2 back to a faked DNA sequence S3 by using the same binary coding scheme used in Step 1.

Step 5. Return S3.
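The encryption algorithm can be sketched as below. Several points are assumptions of this sketch rather than part of the algorithm as stated: the k’s and r’s are passed in as explicit lists of equal length p instead of being drawn from a random number generator, the r’s must add up exactly to the length of M, and an error is raised when the result has an odd number of bits. The names `cut` and `encrypt` are hypothetical.

```python
# Binary coding scheme of the example: A=00, C=01, G=10, T=11.
CODE = {"A": "00", "C": "01", "G": "10", "T": "11"}
BASE = {bits: base for base, bits in CODE.items()}


def to_bits(dna: str) -> str:
    return "".join(CODE[x] for x in dna)


def to_dna(bits: str) -> str:
    if len(bits) % 2:
        raise ValueError("an odd number of bits cannot be coded as DNA")
    return "".join(BASE[bits[i:i + 2]] for i in range(0, len(bits), 2))


def cut(bits: str, lengths: list[int]) -> list[str]:
    """Cut consecutive segments of the given lengths from the front of bits."""
    segments, pos = [], 0
    for n in lengths:
        if n < 1 or pos + n > len(bits):
            raise ValueError("segment lengths do not fit the bit string")
        segments.append(bits[pos:pos + n])
        pos += n
    return segments


def encrypt(s: str, m: str, k: list[int], r: list[int]) -> str:
    """Sketch of Algorithm 2.1 with explicit segment lengths.

    k[i] is the length of segment s_i of S1, r[i] the length of segment m_i.
    """
    if len(k) != len(r):
        raise ValueError("this sketch expects as many k's as r's")
    if sum(r) != len(m):
        raise ValueError("the r's must add up to the length of M")
    s_segments = cut(to_bits(s), k)   # Steps 1-2; later segments are dropped
    m_segments = cut(m, r)
    s2 = "".join(mi + si for mi, si in zip(m_segments, s_segments))  # Step 3
    return to_dna(s2)                 # Step 4


# The basic version is the special case k_i = 3 and r_i = 1.
print(encrypt("ACGGTTCCAATGC", "01001100", [3] * 8, [1] * 8))
```

With these arguments the call returns AATGCCCTGGTAACCG, the sequence obtained in the basic example.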

S3 is sent to the receiver together with many other DNA sequences. The receiver

uses the following algorithm to decrypt.

Algorithm 2.2 Decryption Algorithm for Method 1

Input: A set of DNA sequences, one of which has the secret message M hidden in it by using Algorithm 2.1, a reference DNA sequence S and a binary coding scheme.

Output: The secret message M.

Step 1. Generate numbers k’s and r’s, denoted as k1k2…kn and r1r2…rp, by using the same random number generator with the same seed as in the encoding scheme.

Step 2. For a DNA sequence S’ of the set, code S’ into a binary sequence by using the binary coding used by the sender and use r1+k1, r2+k2, … to divide the binary sequence into binary segments.

Step 3. For each of the first p segments of S’, extract the first ri bits, called mi.

Step 4. For each of the first p segments of S’, extract the last ki bits, called si.

Step 5. Concatenate all mi’s to be M and all si’s to be S1.

Step 6. Transform S1 into a DNA sequence by using the same binary coding. If S1 is not a prefix of S, go to Step 2 and try the next sequence of the set.

Step 7. Return M.
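The decryption algorithm can be sketched in the same style. The assumptions of this sketch are: the k’s and r’s are passed in as explicit lists of equal length p, a candidate is only considered when its bit length equals the sum of all ri+ki, and the prefix test of Step 6 is carried out on the binary form of S so that an odd number of extracted bits causes no trouble. The function name `decrypt` is hypothetical.

```python
# Binary coding scheme of the example: A=00, C=01, G=10, T=11.
CODE = {"A": "00", "C": "01", "G": "10", "T": "11"}


def to_bits(dna: str) -> str:
    return "".join(CODE[x] for x in dna)


def decrypt(candidates: list[str], s: str, k: list[int], r: list[int]):
    """Sketch of Algorithm 2.2: return M, or None if no candidate fits."""
    s_bits = to_bits(s)
    total = sum(k) + sum(r)
    for candidate in candidates:            # Step 2, one sequence at a time
        bits = to_bits(candidate)
        if len(bits) != total:
            continue                        # assumption: exact length match
        m_parts, s_parts, pos = [], [], 0
        for ri, ki in zip(r, k):
            m_parts.append(bits[pos:pos + ri])             # Step 3
            s_parts.append(bits[pos + ri:pos + ri + ki])   # Step 4
            pos += ri + ki
        # Steps 5-6: accept only if the extracted bits are a prefix of S.
        if s_bits.startswith("".join(s_parts)):
            return "".join(m_parts)         # Step 7
    return None


received = ["TTTTTATTATTTTTTT", "AATGCCCTGGTAACCG"]
print(decrypt(received, "ACGGTTCCAATGC", [3] * 8, [1] * 8))
```

In this usage the first sequence is rejected by the prefix test and the message 01001100 is recovered from the second one.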

For an intruder to find out the secret message, he must be equipped as follows. (1) He must know precisely the reference DNA sequence S. Since there are roughly 55 million DNA sequences available publicly, it is extremely hard to guess the right one. (2) He has to know the random number generator and the two seeds used. (3) He has to know the binary coding scheme.

2.2 An Example of Insertion Method

In this section, we show a complete example of the complicated version of Method 1.

Our reference DNA sequence is the segment of the DNA sequence of Litmus mentioned earlier.

S=ATCGAATTCGCGCTGAGTCACAATTCGCGCTGAGTCACAATTCGCGCTGAGT

CACAATTGTGACTCAGCCGCGAATTCCTGCAGCCCCGAATTCCGCATTGCAGA

GATAATTGTATTTAAGTGCCTAGCTCGATACAATAAACGCCATTTGACCATTCAC


CACATTGGTGTGCACCTCCAAGCTCGCGCACCGTACCGTCTCGAGGAATTCCT

GCAGGATATCTGGATCCACGAAGCTTCCCATGGTGACGTCACC

Our secret message M=0110100101001100101001010101001010100110101010.

The binary coding rule is as follows:

A:00 C:01 G:10 T:11

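Suppose we obtain the following two number sequences from a random number generator. Note that in this example K gives the segment lengths of M and R gives those of S, which is the reverse of the notation in Algorithm 2.1.

K=3, 4, 7, 6, 1, 5, 3, 7, 6, 4.

R=7, 9, 11, 3, 5, 8, 4, 12, 7, 10.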

We transform S to be the following binary sequence:

0011011000001111011001100111100010110100010000111101100110011110001011010

0010000111101100110011110001011010001000011111011100001110100100101100110

0000111101011110010010010101011000001111010110010011111001001000100011000

0111110110011111100001011100101110010011101100011000100001100000001100101

0011111110000101001111010001010001001111101011101110010001011101010000100

1110110011001000101101100010110110111011000101000001111010111100100101000

1100110111101000110101000110000010011111010101001110101110000110110100010

Divide the secret message M by using the sequence K. We get the following segments:

011, 0100, 1010011, 001010, 0, 10101, 010, 0101010, 011010, 1010.

Divide the binary form of the reference sequence S by using the sequence R. We get the following segments:

0011011, 000001111, 01100110011, 110, 00101, 10100010, 0001, 111011001100, 1111000, 1011010001.

Insert each segment of M before each of S, thus we will get the following segments:

0110011011, 0100000001111, 101001101100110011, 001010110, 000101,

1010110100010, 0100001, 0101010111011001100, 0110101111000, 10101011010001.

We combine the above segments and transform it to be a DNA sequence by using

the binary coding rule.

S’=CGCGTCAAACTTCATCGCGCGCCCGACCGGTCACAGACCCCCTCGCGATCC

TGAGGGTCAC

Decryption uses the sequences K and R to separate the segments of M from the segments of the prefix of S, and thus recovers the secret message.

We first transform S’ into a binary sequence by using the binary coding rule. Recall that K=3, 4, 7, 6, 1, 5, 3, 7, 6, 4 and R=7, 9, 11, 3, 5, 8, 4, 12, 7, 10. We extract the segments by using these two sequences. In our case, k1 is 3 and r1 is 7. We extract the first 10 bits of S’; the first 3 bits form the first segment of M and the remaining 7 bits form the first segment of S. By continuing to extract the segments of M in this way, we finally obtain M.

2.3 Robustness of Insertion Method

In this section, we analyze the strength of Method 1. By our definition, the strength of an encryption method is the probability that an attacker guesses correctly without any information. We will point out how many candidate solutions there are if the attacker does not know any information, and then we will calculate this probability.

Recall that, in the complicated version, k is determined by a random number generator which is known only to the sender and the receiver. Suppose the k’s are 6, 3, 2, 4. Then S is divided into segments with lengths 6, 3, 2 and 4 respectively. The secret message M is divided into segments by the same random number generator, and the parameter used is denoted as r.

For an intruder to find out the secret message, he must be equipped as follows. (1) He must know precisely the reference DNA sequence S. Since there are roughly 55 million DNA sequences available publicly, it is extremely hard to guess the right one. (2) He has to know the random number generator and the two seeds used. (3) He has to know the binary coding scheme.

For (1): There are roughly 55 million DNA sequences available publicly. Thus, the probability that an attacker guesses correctly is $\frac{1}{55000000}$ (2a).

For (2): Consider a sequence S’ produced by the encryption stage, and let its size be n’. S’ is composed of the secret message M and a prefix of the reference sequence S. Let m be the size of M and s the size of that prefix of S. It is hard for an attacker to know m and s.

An attacker must first guess the sizes m and s. We know that $m + s = n'$ with $m, s \ge 1$, so there are $\binom{2+(n'-2)-1}{n'-2} = \binom{n'-1}{n'-2} = n'-1$ possibilities. For instance, assume n’=10; there will be $\binom{2+(10-2)-1}{10-2} = \binom{9}{8} = 9$ possibilities, as follows:

m=1, s=9

m=2, s=8

m=3, s=7

m=4, s=6

m=5, s=5

m=6, s=4

m=7, s=3

m=8, s=2

m=9, s=1

The probability that an attacker guesses m and s correctly is $\frac{1}{n'-1}$ (2b). This is still not enough for the attacker to decode.

Even with m and s, the attacker still does not know how many r’s and k’s are used to break the secret message and the reference sequence S, nor their values; he only knows that the r’s sum to m and the k’s sum to s. The following figure indicates the relations between m(s) and r’s(k’s).

[Figure: a prefix of S with size s divided into segments of lengths k1, k2, k3, k4, and the secret message with size m divided into segments of lengths r1, r2, r3, r4.]

Figure 2-1 The Relations between m(s) and r’s(k’s).

Notice that it is hard for an attacker to know how many k’s and r’s we use. Thus he has to try two k’s (r’s), three k’s (r’s), four k’s (r’s) and so on.

For s, there may be the following cases:

$k_1 = s$, $k_1 \ge 1$

$k_1 + k_2 = s$, $k_1, k_2 \ge 1$

$k_1 + k_2 + k_3 = s$, $k_1, k_2, k_3 \ge 1$

$k_1 + k_2 + k_3 + k_4 = s$, $k_1, k_2, k_3, k_4 \ge 1$

$\vdots$

$k_1 + k_2 + k_3 + \cdots + k_s = s$, $k_1, k_2, k_3, \ldots, k_s \ge 1$

The number of solutions of the first equation is $\binom{1 + (s-1) - 1}{s-1} = \binom{s-1}{s-1} = 1$. The number of solutions of the second one is $\binom{2 + (s-2) - 1}{s-2} = \binom{s-1}{s-2} = s-1$. The number of solutions of the third one is $\binom{3 + (s-3) - 1}{s-3} = \binom{s-1}{s-3}$. The number of solutions of the fourth one is $\binom{4 + (s-4) - 1}{s-4} = \binom{s-1}{s-4}$. The number of solutions of the last one is $\binom{s + (s-s) - 1}{s-s} = \binom{s-1}{0} = 1$.

The above numbers can be summed up as follows:

$$\binom{s-1}{s-1} + \binom{s-1}{s-2} + \binom{s-1}{s-3} + \cdots + \binom{s-1}{0} = \sum_{i=0}^{s-1} \binom{s-1}{s-1-i}$$

The sum above can be evaluated in closed form with the Binomial Theorem:

$$(x + y)^n = \sum_{i=0}^{n} \binom{n}{i} x^i y^{n-i}, \quad n, i \ge 0$$

Setting x and y equal to 1, we have the following result:

$$(1 + 1)^n = \sum_{i=0}^{n} \binom{n}{i} = 2^n$$

Consider the following equation:

$$(x + y)^{s-1} = \binom{s-1}{s-1} x^{s-1} + \binom{s-1}{s-2} x^{s-2} y + \binom{s-1}{s-3} x^{s-3} y^2 + \cdots + \binom{s-1}{0} y^{s-1} = \sum_{i=0}^{s-1} \binom{s-1}{s-1-i} x^{s-1-i} y^i$$

Setting x and y equal to 1, we have the following result:

$$\sum_{i=0}^{s-1} \binom{s-1}{s-1-i} = 2^{s-1}$$

Thus the probability that an attacker guesses correctly in this stage is $\frac{1}{2^{s-1}}$ (2c).

Similarly, the probability that an attacker guesses correctly for m in this stage is $\frac{1}{2^{m-1}}$ (2d).
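The counting argument behind (2c) and (2d) can be checked by brute force. The following Python sketch is only an illustration and is not part of the original method; the function names `compositions` and `count_all_compositions` are hypothetical. It enumerates every way of writing s as an ordered sum of positive integers and confirms that the total is $2^{s-1}$.

```python
from itertools import combinations
from math import comb


def compositions(total, parts):
    """Yield every way to write `total` as an ordered sum of `parts` positive integers."""
    # Choosing parts-1 cut points among the total-1 gaps fixes one solution.
    for cuts in combinations(range(1, total), parts - 1):
        bounds = (0, *cuts, total)
        yield tuple(b - a for a, b in zip(bounds, bounds[1:]))


def count_all_compositions(total):
    """Count the solutions over every possible number of parts (1, 2, ..., total)."""
    return sum(1 for parts in range(1, total + 1) for _ in compositions(total, parts))


s = 6
per_parts = [sum(1 for _ in compositions(s, t)) for t in range(1, s + 1)]
# The number of solutions with t parts is C(s-1, t-1), as derived in the text.
assert per_parts == [comb(s - 1, t - 1) for t in range(1, s + 1)]
print(per_parts, count_all_compositions(s), 2 ** (s - 1))
# prints: [1, 5, 10, 10, 5, 1] 32 32
```

The same check applies to m and the r’s by replacing s with m.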


Let us consider the following case. Suppose an attacker guesses the number of k’s. He thereby also guesses the number of r’s, because in our algorithm the number of k’s is equal to the number of r’s. Thus the possibility in this step should be one of (2c) and (2d).

For (3): The number of binary coding rules is $4 \times 3 \times 2 \times 1 = 24$. The probability that an attacker guesses the rule correctly in this stage is $\frac{1}{24}$ (2e).

Finally, the probability that an attacker successfully breaks Method 1 is the product of (2a), (2b), (2c), (2d) and (2e). The probability is as follows:

Lemma 2-1

Suppose n’ is the length of S’, and s and m are the lengths of the prefix of S used in the encoding scheme and of the secret message, respectively. The probability that an attacker successfully guesses all parameters of the Insertion Method is

$$\frac{1}{55000000} \times \frac{1}{n'-1} \times \left( \frac{1}{2^{s-1}} \times \frac{1}{2^{m-1}} \right) \times \frac{1}{24}.$$
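To get a feeling for the size of this probability, the product in Lemma 2-1 can be evaluated exactly. The sketch below is only an illustration; the function name `insertion_guess_probability` and the sample values n’=10, s=6 and m=4 are assumptions chosen for the example, not values taken from the thesis.

```python
from fractions import Fraction

NUM_PUBLIC_SEQUENCES = 55_000_000  # rough estimate quoted in the text, factor (2a)
NUM_CODING_RULES = 24              # 4 x 3 x 2 x 1 binary coding rules, factor (2e)


def insertion_guess_probability(n_prime, s, m):
    """Return the product (2a) x (2b) x (2c) x (2d) x (2e) of Lemma 2-1 as an exact fraction."""
    if s < 1 or m < 1 or s + m != n_prime:
        raise ValueError("the text assumes m + s = n' with m, s >= 1")
    return (
        Fraction(1, NUM_PUBLIC_SEQUENCES)
        * Fraction(1, n_prime - 1)
        * Fraction(1, 2 ** (s - 1))
        * Fraction(1, 2 ** (m - 1))
        * Fraction(1, NUM_CODING_RULES)
    )


p = insertion_guess_probability(n_prime=10, s=6, m=4)
print(p)  # prints: 1/3041280000000
```

Even for this very short sequence the probability is about one in three trillion, and it halves with every additional unit of s or m.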

We will conclude that Method 1 is safe.

2.4 Payload of Insertion Method

In essence, our encryption method is a data hiding scheme, so its payload should be considered. Let us define the payload first. The payload of a data hiding scheme is the number of extra symbols or bits added to the medium when the data is hidden in it. For instance, suppose the length of the binary sequence of our reference sequence S is n. If the length becomes n+s after the data is hidden, then the payload is s. The payload matters because, if it is very large, transmitting the encrypted data may take too much time, and an attacker may then have enough time to decode the encrypted data.

Let us consider the payload of Method 1. Suppose the length of the secret message M is m and the length of the binary sequence of S is n. In our approach, we may omit a suffix of S after M is hidden; suppose the length of this omitted suffix is s.

Lemma 2-2:

Suppose n is the length of S, s is the length of the suffix of S omitted by the encoding scheme and m is the length of the secret message. The payload of the Insertion Method is (n-s+m)-n=m-s.

Let us consider an example.

If m is 46, n is 564 and the length of S’ is 122, then the payload is 122-564=-442. Here the omitted suffix has length s=564+46-122=488, so m-s=46-488=-442 as stated in Lemma 2-2.


Chapter 3 Method 2: The Complementary Pair Approach

In this chapter, we shall illustrate Method 2, the complementary pair approach. In RNA sequences, we often have the so-called base pairs [BG95, LNC2000 and WHRS87]. We cannot elaborate on the detailed meaning of the base pairs of RNA in this paper. For our purposes, we may simply define our own complementary pairs. That is, for each letter of the alphabet, we assign a unique counterpart. For instance, we may have the following base pair rule:

(A T) (C A) (G C) (T G).

Then the complementary sequence of AATGC will be TTGCA.

In the sequence “ATCTGAATGCTTGTCTATTGCATCAAT”, complementary substrings occur: AATGC at positions 6 to 10 and TTGCA at positions 18 to 22. To find the longest complementary substrings, we may use a dynamic programming approach [A2000]. Let us assume that the secret message M has an even number of bits. This is always a reasonable assumption because we may always use an even number of bits to code.
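As an illustration of how such a search could look, the following Python sketch finds a longest pair of non-overlapping complementary substrings. The text cites [A2000] for the dynamic programming approach but does not spell out a recurrence, so the recurrence used here is an assumption modelled on the classical longest-common-substring table, applied to the sequence and its letter-wise complement. The names `RULE`, `complement` and `longest_complementary_pair` are hypothetical.

```python
RULE = {"A": "T", "C": "A", "G": "C", "T": "G"}  # the example rule (A T) (C A) (G C) (T G)


def complement(seq, rule=RULE):
    """Replace every letter by its counterpart under the complementary rule."""
    return "".join(rule[ch] for ch in seq)


def longest_complementary_pair(seq, rule=RULE):
    """Return (a, a_prime, start_a, start_a_prime) with a_prime == complement(a),
    where a occurs before a_prime in seq, the two do not overlap and len(a) is maximal.
    Start positions are 0-based."""
    comp = complement(seq, rule)
    n = len(seq)
    best = (0, 0, 0)        # (length, end of a, end of a_prime), ends are exclusive
    prev = [0] * (n + 1)    # previous row of the table
    for i in range(1, n + 1):
        cur = [0] * (n + 1)
        for j in range(i + 1, n + 1):
            if comp[i - 1] == seq[j - 1]:
                # Extend the match along the diagonal; the cap j - i keeps
                # the two substrings from overlapping.
                cur[j] = min(prev[j - 1] + 1, j - i)
                if cur[j] > best[0]:
                    best = (cur[j], i, j)
        prev = cur
    length, i, j = best
    return seq[i - length:i], seq[j - length:j], i - length, j - length


print(complement("AATGC"))  # prints: TTGCA
print(longest_complementary_pair("AATGCTTGCA"))  # prints: ('AATGC', 'TTGCA', 0, 5)
```

The table needs time quadratic in the length of the sequence, which is acceptable for the short DNA-like sequences used in the examples of this chapter.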

3.1 Algorithms of Complementary Pair Approach

To give an example, assume that M=0110. Again, as in Method 1, we assume that we have a reference sequence S. Let us assume it to be S=ACGGTCGTTCCCTAGTTG. Our Method 2 works as follows:

Step 1. Artificially generate a sequence that contains no complementary substrings and consists of A, C, G and T only. Assume that the sequence is L=ACGGTCTCATCAATGCTTCAGT.

Step 2. Divide M into segments such that each segment contains an even number of bits. Thus, in our case, we have 01 and 10. We code 01 and 10 according to some coding rule. By using the coding rule given in the previous section, 01 and 10 are coded as C and G respectively.

Step 3. Generate two pairs of complementary strings of length k and insert them into L. Assume k=5 and we have the following two pairs of complementary strings: (AAGCT TTCAG) and (ACCTG TAAGC). The sequence L now becomes L’= ACG AAGCT GTCT TTCAG CAT ACCTG CAAT TAAGC GCTTCAGT.

Step 4. Insert the first (second) letter of the coded secret message one letter before the first (second) pair of complementary strings. The string becomes: L’= ACCG AAGCT GTCT TTCAG CAGT ACCTG CAAT TAAGC GCTTCAGT.

Step 5. Use a random number generator to select two positive integers j1 and j2. Assume j1=2 and j2=4. Insert the substrings S[j1,j1+4]=CGGTC and S[j2,j2+4]=GTCGT one letter after the first and the second pair of complementary substrings, respectively. L’ becomes: L”=ACCGAAGCTGTCTTTCAGCCGGTCAGTACCTGCAATTAAGCGGTCGTCTTCAGT.

Step 6. Return L”.

The sender sends L” together with many other sequences, some with complementary pair substrings, to the receiver.
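The six steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the definitive algorithm: the function name `encrypt_method2` is hypothetical, the insertion offsets of the complementary strings are passed in explicitly because the example does not say how they are chosen, and the coding table assumes 00 and 11 map to A and T (the text only states that 01 and 10 are coded as C and G).

```python
CODING = {"00": "A", "01": "C", "10": "G", "11": "T"}  # 00 and 11 are assumed
RULE = {"A": "T", "C": "A", "G": "C", "T": "G"}        # (A T) (C A) (G C) (T G)


def complement(seq):
    return "".join(RULE[ch] for ch in seq)


def encrypt_method2(cover, message_bits, strings, offsets, reference, js, seg_len=5):
    """Hide message_bits in cover.

    strings[g]  is a_g; its partner a_g' is computed with the complementary rule.
    offsets[g]  is a pair (p, q): a_g goes before cover[p], a_g' before cover[q] (0-based).
    js[g]       is the 1-based start of the authentication substring of reference.
    """
    letters = [CODING[message_bits[i:i + 2]] for i in range(0, len(message_bits), 2)]
    if not (len(letters) == len(strings) == len(offsets) == len(js)):
        raise ValueError("one complementary pair is needed per coded letter")
    inserts = {}
    for letter, a, (p, q), j in zip(letters, strings, offsets, js):
        inserts[p - 1] = letter                               # Step 4: one letter before a_g
        inserts[p] = a                                        # Step 3: a_g
        inserts[q] = complement(a)                            # Step 3: a_g'
        inserts[q + 1] = reference[j - 1:j - 1 + seg_len]     # Step 5: one letter after a_g'
    out = []
    for idx, ch in enumerate(cover):
        out.append(inserts.get(idx, ""))
        out.append(ch)
    return "".join(out)


hidden = encrypt_method2(
    cover="ACGGTCTCATCAATGCTTCAGT",
    message_bits="0110",
    strings=["AAGCT", "ACCTG"],
    offsets=[(3, 7), (10, 14)],
    reference="ACGGTCGTTCCCTAGTTG",
    js=[2, 4],
)
print(hidden)
# prints: ACCGAAGCTGTCTTTCAGCCGGTCAGTACCTGCAATTAAGCGGTCGTCTTCAGT
```

With the values of the worked example, the sketch reproduces the sequence L” given in Step 5. The sketch does not check that insertion offsets of different pairs stay apart, which a real implementation would have to guarantee.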

The receiver processes every sequence received by finding all of its complementary substrings. If they are of the correct lengths, the receiver then checks whether the prescribed substrings of S are hidden correctly. If they are, the receiver may now extract the secret message.

The above description of Method 2 is a simplified version. The more complicated

version may use complementary pair substrings with varying lengths. As we indicated

above, the sender sends L’’ together with many other sequences which may contain

complementary pair substrings. Those complementary pair substrings will have lengths

different from the specified ones and thus will be ignored.

In the following, we shall present the algorithms for Method 2. The algorithms are

based upon the complementary pair concept.

Algorithm 3.1 Encryption Algorithm for Method 2 (Complementary Pair Approach)

Input: A reference DNA sequence S, a secret binary message M, a binary coding scheme to code A, C, G and T into binary digits and a complementary pair rule.

Output: An encrypted DNA sequence L”.

Step 1. Artificially generate a DNA-like sequence L=l1l2…ls with length s and without any complementary substrings. Use a random number generator to generate k1, k2, …, ks. Generate a set A={a1a1’, a2a2’, …, anan’}, n<<s, of complementary strings, where each ai has length ki.

Step 2. Divide M into segments such that each segment may be coded by the binary coding scheme. Code these segments as “A”, “C”, “G” and “T” by using the same coding rule. Suppose the coded message is M=m1m2…mp.

Step 3. For each ag and ag’, 1 ≤ g ≤ p, insert them after lg+3 and lg+k+3, respectively.

Step 4. For each pair of complementary strings ag and ag’ in L, insert mg one letter before ag. The resulting sequence is called L’.

Step 5. Randomly select integers j1, j2, …, jp with 0 ≤ j1, j2, …, jp ≤ n − 4. For each ag’ in L’, insert S[jg,jg+4] one letter after ag’. The resulting sequence is called L”.

Step 6. Return L”.

Algorithm 3.2 Decryption Algorithm for Method 2 (Complementary Pair Approach)

Input: A set of DNA sequences, one of which has the secret message M hidden in it by Algorithm 3.1, a reference DNA sequence S, and the binary coding scheme and the complementary pair rule used in Algorithm 3.1.

Output: Secret message M.

Step 1. Take the next DNA sequence L” in the set and use the dynamic programming strategy to find all of its longest complementary substrings. If the substrings are not of the correct lengths or no such substrings are found, repeat Step 1 with the next sequence.

Step 2. Select positive integers j1, j2, …, jp by using the same random number generator used in Algorithm 3.1. For each pair of complementary substrings ag and ag’, check whether the substring of length 5 starting one letter after ag’ is the same as S[jg,jg+4]. If they are not the same, ignore L” and go to Step 1.

Step 3. For each pair of complementary substrings ag and ag’, extract the letter located one letter before ag, called mg.

Step 4. Concatenate all mg’s to be M.

Step 5. Return M.

To recover the message, an intruder needs the following information. First, he should know the reference DNA sequence. Second, he should have the complementary rule. Third, he should know the positive integers j’s, which are used to authenticate against the reference DNA sequence. Fourth, he should know the binary coding rule. Fifth, he must know the correct lengths of the complementary pair substrings.

3.2 Robustness of Complementary Pair Approach

Let us review the example and the encryption steps of Method 2 in Section 3.1. The main idea is that we insert a part of the secret message before a pair of complementary substrings and a section of the reference sequence S after the pair. Suppose S= ACGGTCGTTCCCTAGTTG and the secret message M=0110.

Step 1. Artificially generate a sequence without any complementary substrings and

consisting of A, C, G and T only. Assume that the sequence is

L=ACGGTCTCATCAATGCTTCAGT.

Step 2. Divide M into segments such that each segment contains even number of bits.

Thus, in our case, we have 01 and 10. We code 01 and 10 according to some

coding rule. By using the coding rule given in the previous section, 01 and 10


will be coded as C and G respectively.

Step 3. Generate two complementary strings with length k and insert them into L.

Assume k=5 and we have the following two complementary strings:

(AAGCT TTCAG) and (ACCTG TAAGC). The sequence L now becomes

L’= ACG AAGCT GTCT TTCAG CAT ACCTG CAAT TAAGC

GCTTCAGT.

Step 4. Insert the first (second) letter of the coded secret message one letter before the first (second) pair of complementary strings. The string becomes: L’= ACCG AAGCT GTCT TTCAG CAGT ACCTG CAAT TAAGC GCTTCAGT.

Step 5. Use a random number generator to select two positive integers j1 and j2. Assume j1=2 and j2=4. Insert the substrings S[j1,j1+4]=CGGTC and S[j2,j2+4]=GTCGT one letter after the first and the second pair of complementary substrings, respectively. L’ becomes: L”=ACCGAAGCTGTCTTTCAGCCGGTCAGTACCTGCAATTAAGCGGTCGTCTTCAGT.

Step 6. Return L’’.

The sender sends L” together with many other sequences, some with complementary

pair substrings, to the receiver.

The receiver processes every sequence received by finding all of its complementary substrings. If they are of the correct lengths, the receiver then checks whether the prescribed substrings of S are hidden correctly. If they are, the receiver may now extract the secret message.

In order to decode the message, an attacker should know the following information:

(1) He should know the reference DNA sequence S.

(2) He should have the complementary rule.

(3) He should know the positive integers j’s.

(4) He should know the binary coding rule.

(5) He must know the correct lengths of the complementary pair substrings.

For (1): There are roughly 55 million DNA sequences publicly available. Thus, the probability that an attacker guesses the reference sequence correctly is $\frac{1}{55000000}$ (3a).

For (2): There are 4 candidate complementary letters for each letter of a DNA sequence, and each letter is assigned a unique counterpart. The total number of possible complementary rules is therefore $4 \times 3 \times 2 \times 1 = 24$. The probability is $\frac{1}{24}$ (3b).

For (3): The positive integers j’s are the starting positions of the substrings of S that the receiver uses for authentication. Thus, if the length of S is n, there are n-5 possibilities for each j. Suppose the length of the coded secret message M is m. We insert each character of M before a pair of complementary substrings, and we insert a substring of S after each pair. Thus, we need m substrings of S. The following figure illustrates our approach.

Figure 3-1 An Illustration of the Complementary Pair Approach.


Thus, the probability that an attacker guesses all substrings of S correctly is $\frac{1}{(n-5)^m}$ (3c).

For (4): The number of binary coding rules is $4 \times 3 \times 2 \times 1 = 24$. The probability that an attacker guesses the rule correctly in this stage is $\frac{1}{24}$ (3d).

For (5): It is hard for an attacker to know the length of the complementary substrings. In our case, we fix the length of each complementary substring and set it to k. An attacker may use an efficient algorithm to find all complementary substrings, but he will not know the exact length. Suppose the longest complementary substring found has length x. The attacker may guess the length of the required complementary substrings from 1 to x. Thus the probability is $\frac{1}{x}$ (3e).

Finally, the probability that an attacker successfully breaks Method 2 is the product of (3a), (3b), (3c), (3d) and (3e). The probability is as follows:

Lemma 3-1

Suppose n is the length of S, m is the number of the complementary pair substrings used in the encoding scheme and x is the number of candidate lengths of the complementary pair substrings in S’. The probability that an attacker successfully guesses all parameters of the Complementary Pair Approach is

$$\frac{1}{55000000} \times \frac{1}{24} \times \frac{1}{(n-5)^m} \times \frac{1}{24} \times \frac{1}{x}.$$

We will conclude that method 2 is safe.

3.3 Payload of Complementary Pair Approach

We now introduce the payload of Method 2. The payload of Method 2 consists of three parts: the secret message, the complementary pair substrings and the substrings of S. Suppose that the length of the coded secret message M is m. The total length of the complementary pair substrings is |a1|+|a2|+…+|am|, where ai is a complementary pair substring generated by us. Let A denote |a1|+|a2|+…+|am|. In our approach, we have m complementary pair substrings, and we always insert a substring of S of length 5 after each of them. Therefore, the total length of the m substrings of S is 5m.

Lemma 3-2

Suppose m is the length of secret message and A is the total length of the
complementary pair substrings used in the encoding scheme. The payload of
The Complementary Pair Approach is m+A+5m=6m+A.

Take the example introduced in Section 3.1. The length of the coded M is 2. The total length of the complementary pair substrings is 20. The total length of the substrings of S inserted by us is 10. Thus the payload is 2+20+10=32.

3.4 The Complementary Palindrome

Instead of complementary pair substrings, we may also use palindromes [MW2002, ABG92, ABG95, BG95, G76 and M75]. A palindrome is a string of the form $\alpha\alpha'$ where $\alpha'$ is the reverse of $\alpha$. For instance, AACGTTGCAA is a palindrome where $\alpha$=AACGT. Several methods have been proposed to find the longest palindrome in a string [ABG92, ABG95, BG95, G76 and M75].

Our approach using palindromes is quite similar to that using complementary pairs. Let $\alpha = \alpha_1\alpha_2\cdots\alpha_h$ be a substring. Let $\beta = \alpha_{h-1}\alpha_{h-2}\cdots\alpha_1$. Let $\alpha_i' = C(\alpha_i)$ for $1 \le i \le h$. Let $\alpha' = \alpha_h'\alpha_{h-1}'\cdots\alpha_1'$. Then $\alpha\alpha'$ is a complementary palindrome. For example, assume the complementary rule is ((A T) (C A) (G C) (T G)). Then ACCTGAAT is a complementary palindrome, with $\alpha$=ACCT and $\alpha'$=GAAT. Method 2 may use complementary palindromes because methods for finding palindromes can easily be extended to find complementary palindromes.
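The construction of a complementary palindrome can be written down directly from the definition. The following Python sketch is only an illustration; the function names `complementary_palindrome` and `is_complementary_palindrome` are hypothetical.

```python
RULE = {"A": "T", "C": "A", "G": "C", "T": "G"}  # ((A T) (C A) (G C) (T G))


def complementary_palindrome(alpha, rule=RULE):
    """Append to alpha the complements of its letters taken in reverse order."""
    alpha_prime = "".join(rule[ch] for ch in reversed(alpha))
    return alpha + alpha_prime


def is_complementary_palindrome(s, rule=RULE):
    """Check whether s has the form alpha + alpha' for the given rule."""
    if len(s) % 2 != 0:
        return False
    half = len(s) // 2
    return s == complementary_palindrome(s[:half], rule)


print(complementary_palindrome("ACCT"))           # prints: ACCTGAAT
print(is_complementary_palindrome("AACGTTGCAA"))  # prints: False
```

The second call shows that an ordinary palindrome is in general not a complementary palindrome under this rule.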


Chapter 4 Method 3: The Substitution Approach

In this chapter, we introduce a method that uses the substitution approach to hide data in the reference DNA sequence. Before we introduce the algorithms, we recall the complementary rule. That is, for each letter of the alphabet, we assign a unique counterpart. For instance, we may have the following base pair rule:

(A T) (C A) (G C) (T G).

We now define a legal complementary rule. We denote the complement of a letter x by C(x). The double complement of x is C(C(x)). The triple complement of x is C(C(C(x))). A legal complementary rule is a complementary rule such that $x \neq C(x) \neq C(C(x)) \neq C(C(C(x)))$; that is, none of C(x), C(C(x)) and C(C(C(x))) is equal to x. For instance, the following complementary rule is a legal one.

(A T) (C A) (G C) (T G).

4.1 Algorithms of Substitution Approach

For this method, we again use a reference sequence S. Let us assume that S=ACGGAATTGCTTCAG and the secret message M=m1m2…mp is 0111010. The length of S is 15, which is larger than the length p of M, which is 7 in this case.

Our main idea is as follows:

Step 1. The length of the reference sequence S is 15. Randomly select p distinct numbers from 1 to 15; p is equal to 7 in this case. Assume that they are 2, 3, 5, 10, 12, 13 and 15. Let A={A1, A2, …, Ap}={2,3,5,10,12,13,15}.

Step 2. Transform S into S’ by the following rule:

For all i from 1 to 15,

if i is equal to some Aj and mj is 1, 1 ≤ j ≤ p, set Si to C(Si),

if i is equal to some Aj and mj is 0, do not change Si, and

if i is not equal to any Aj, set Si to C(C(Si)).

Thus S’=GCCATGCCAACTAGG.

Step 3. Send S’ to the receiver with many other sequences.

The receiver does not need to generate the set A. After receiving a set of sequences, he checks every position of each sequence S’ in the set. There are only three possible cases: (1) S’i is the same as Si (the secret bit mj is equal to 0). (2) S’i is C(Si) (the secret bit mj is equal to 1). (3) S’i is C(C(Si)) (no secret bit is hidden at position i). If there exists a position i at which S’i and Si fit none of the above three cases, the sequence should be ignored.

In the following, we present the algorithms for Method 3.

Algorithm 4-1 Encryption Algorithm for Method 3 (Substitution Approach)

Input: A DNA sequence S, the secret message M=m1m2…mp and a complementary rule.

Output: An encrypted DNA sequence S’.

Step 1. Use a random number generator to generate a set of integers, called set A, whose elements are positions in S. The number of elements of A is p, the length of M.

Step 2. Initialize i and j to 1.

Step 3. For each element Si of S, do the following operations:

if i is equal to some Aj and mj is 1, 1 ≤ j ≤ p, change Si to C(Si),

if i is equal to some Aj and mj is 0, do not change Si,

if i is not equal to any Aj, set Si to the double complement of itself.

Step 4. Return the resulting sequence S’.

Algorithm 4-2 Decryption Algorithm for Method 3 (Substitution Approach)

Input: A set of DNA sequences, the reference sequence S and the complementary rule.

Output: Secret message M

Step 1. Initialize i and j to 1.

Step 2. For the next sequence S’ of the sequence set, do the following operations:

For each position i of S’:

if S’i ≠ Si, S’i ≠ C(Si) and S’i ≠ C(C(Si)), ignore S’ and take the next sequence; otherwise,

if S’i is the same as Si, set mj=0 and increase j by 1,

else if S’i is C(Si), set mj=1 and increase j by 1.

Step 3. Concatenate all mj’s to be M and return M.
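Algorithms 4-1 and 4-2 can be sketched in a few lines of Python. This is a minimal illustration and not a definitive implementation: the function names `encrypt_substitution` and `decrypt_substitution` are hypothetical, the positions in A are passed in as a sorted list instead of being drawn from a random number generator, and the sketch assumes that a legal complementary rule is used so that the three cases of the decryption step cannot be confused.

```python
RULE = {"A": "T", "C": "A", "G": "C", "T": "G"}  # legal rule (A T) (C A) (G C) (T G)


def encrypt_substitution(reference, bits, positions, rule=RULE):
    """Algorithm 4-1: positions are 1-based and sorted, one position per message bit."""
    bit_at = dict(zip(positions, bits))
    out = []
    for i, ch in enumerate(reference, start=1):
        if i not in bit_at:
            out.append(rule[rule[ch]])   # no bit here: double complement
        elif bit_at[i] == "1":
            out.append(rule[ch])         # bit 1: complement
        else:
            out.append(ch)               # bit 0: unchanged
    return "".join(out)


def decrypt_substitution(reference, received, rule=RULE):
    """Algorithm 4-2: return the hidden bits, or None if the sequence must be ignored."""
    if len(received) != len(reference):
        return None
    bits = []
    for s, r in zip(reference, received):
        if r == s:
            bits.append("0")
        elif r == rule[s]:
            bits.append("1")
        elif r != rule[rule[s]]:
            return None                  # none of the three cases: not our sequence
    return "".join(bits)


S = "ACGGAATTGCTTCAG"
S_prime = encrypt_substitution(S, "0111010", [2, 3, 5, 10, 12, 13, 15])
print(S_prime)                            # prints: GCCATGCCAACTAGG
print(decrypt_substitution(S, S_prime))   # prints: 0111010
```

The values are those of the example in this section, and the sketch reproduces S’=GCCATGCCAACTAGG. A decoy such as a run of fifteen T’s is rejected because at some position it matches none of the three cases.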

For an intruder to find out the secret message, he should have the following

information: (1) He should know the reference DNA sequence. (2) He should know the

complementary rule.

4.2 An Example of Substitution Approach

In this section, we run a full example to illustrate Method 3. Our reference DNA sequence is a segment of the DNA sequence of Litmus, as mentioned in the previous section.

S=ATCGAATTCGCGCTGAGTCACAATTCGCGCTGAGTCACAATTCGCGCTGAGT

CACAATTGTGACTCAGCCGCGAATTCCTGCAGCCCCGAATTCCGCATTGCAGA

GATAATTGTATTTAAGTGCCTAGCTCGATACAATAAACGCCATTTGACCATTCAC

CACATTGGTGTGCACCTCCAAGCTCGCGCACCGTACCGTCTCGAGGAATTCCT

GCAGGATATCTGGATCCACGAAGCTTCCCATGGTGACGTCACC

Our secret message M=0110100101001100101001010101001010100110101010.

Suppose the complementary rule is: (A T) (C A) (G C) (T G).

Suppose A={4, 7, 8, 11, 13, 15, 16, 20, 23, 27, 36, 39, 42, 43, 44, 49, 63, 67, 68, 69,

70, 72, 74, 76, 79, 83, 86, 88, 89, 93, 94, 96, 97, 100, 101, 107, 117, 121, 126, 139, 156,

171, 178, 190, 201, 212} which is generated by random number generator.


We first locate in S the positions listed in A. These are the positions at which the bits of M are hidden; every other position of S is replaced by its double complement.

Recall that our encryption algorithm transforms S into S’ by the following rule: For all i from 1 to n, where n is the length of S,

if i is equal to some Aj and mj is 1, 1 ≤ j ≤ p, set Si to C(Si),

if i is equal to some Aj and mj is 0, do not change Si, and

if i is not equal to any Aj, set Si to C(C(Si)).

The complementary rule is: (A T) (C A) (G C) (T G).

M=0110100101001100101001010101001010100110101010.

Thus we will get S’ as follows:

S’=GCTGGGGGTACAACGAACTTTGACCTCTATCAGACCGTAGCGAGTATCGGA

CTGTGGCCACATTCTACCCAAAAGGCTCCATTATCTAGGGCTATGAGCTCTGA

GAACGGCCACGCTCGGCCATTGGATCTAGCGTGGTGGGTATTGCCCAGTTGGC

TGTTGTGCCAACATATGTTCATGGATCTATATATTACGTTACTGTAGAAGGCCT


CCATGAAGCGCTCAAGCTTGTAGGATCCTTTGCAACAGTACTGTT.

The receiver obtains the secret message M by comparing S’ and S. The comparison rules are described in Algorithm 4-2.

4.3 Robustness of Substitution Approach

Let us review the example and the steps of Method 3. For this method, we again use a reference sequence S. Let us assume that S=ACGGAATTGCTTCAG and the secret message M=m1m2…mp is 0111010. The length of S is 15, which is larger than the length p of M, which is 7 in this case. For illustration, assume that the complementary rule is the same as given in the above sections. That is, the rule is as follows: ((A T) (C A) (G C) (T G)).

Step 1. The length of the reference sequence S is 15. Randomly select p distinct numbers from 1 to 15; p is equal to 7 in this case. Assume that they are sorted as 2, 3, 5, 10, 12, 13 and 15. Let A={A1, A2, …, Ap}={2,3,5,10,12,13,15}.

Step 2. Transform S into S’ by the following rule:

For all i from 1 to 15,

if i is equal to some Aj and mj is 1, 1 ≤ j ≤ p, set Si to C(Si),

if i is equal to some Aj and mj is 0, do not change Si, and

if i is not equal to any Aj, set Si to C(C(Si)).

Thus S’=GCCATGCCAACTAGG.

Step 3. Send S’ to the receiver with many other sequences.


For an intruder to find out the secret message, he should have the following

information.

(1) He should know the reference DNA sequence.

(2) He should know the complementary rule.

For (1): There are roughly 55 million DNA sequences publicly available. Thus, the probability that an attacker guesses the reference sequence correctly is $\frac{1}{55000000}$ (4a).

For (2): We should consider how many legal complementary rules there are. We define a legal complementary rule as follows: for each letter x of a DNA sequence, none of C(x), C(C(x)) and C(C(C(x))) is equal to x. Note that C(x) is the complement of x. There are 6 legal complementary rules, as follows:

(A T) (T C) (C G) (G A),

(A T) (T G) (G C) (C A),

(A C) (C T) (T G) (G A),

(A C) (C G) (G T) (T A),

(A G) (G T) (T C) (C A), and

(A G) (G C) (C T) (T A)

The probability is $\frac{1}{6}$ (4b).
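The count of 6 legal rules can be confirmed by exhaustive enumeration. The following Python sketch is only an illustration; the names `is_legal` and `legal_rules` are hypothetical. It tests all 24 one-to-one assignments of counterparts against the definition of a legal complementary rule.

```python
from itertools import permutations

LETTERS = "ACGT"


def is_legal(rule):
    """A rule is legal if no letter equals its complement, double or triple complement."""
    for x in LETTERS:
        c1 = rule[x]
        c2 = rule[c1]
        c3 = rule[c2]
        if x in (c1, c2, c3):
            return False
    return True


# Every one-to-one assignment of counterparts is a permutation of the four letters.
all_rules = [dict(zip(LETTERS, perm)) for perm in permutations(LETTERS)]
legal_rules = [rule for rule in all_rules if is_legal(rule)]

print(len(all_rules), len(legal_rules))  # prints: 24 6
```

The legal rules are exactly the assignments in which the four letters form a single cycle, such as A to T, T to G, G to C and C back to A.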

Finally, the probability that an attacker successfully breaks Method 3 is the product of (4a) and (4b).

The probability is as follows:

$$\frac{1}{55000000} \times \frac{1}{6}$$

Let us consider another case. Suppose that it is impossible for an attacker to know the reference sequence S. Then, according to our algorithm, the attacker has to consider 3 possibilities for each letter of S’. For each position, there may be three cases: (1) the letter of S’ is x and the letter of S is x, too. (2) The letter of S’ is x and the letter of S is C(x). (3) The letter of S’ is x and the letter of S is C(C(x)). If the length of S’ is n, there are $3^n$ possibilities. The probability that the attacker guesses correctly is $\frac{1}{3^n}$.

Lemma 4-1:

Suppose n is the length of S. The probability that an attacker successfully breaks the Substitution Approach is $\frac{1}{55000000} \times \frac{1}{6}$ or $\frac{1}{3^n}$.

4.4 Payload of Substitution Approach

In this section, we introduce the payload of Method 3. The payload of Method 3 is 0 because we only change letters of the reference sequence S and do not add any extra letters to S.


Chapter 5 Conclusion and Future Works

5.1 Conclusions

In this thesis, we have presented three encryption methods based on DNA sequences. Most previous related works are based upon biological properties. Our methods instead rely on two properties. First, there is almost no difference between a real DNA sequence and a faked one. Second, a large number of DNA sequences are publicly available on various web sites; a rough estimate puts this number at around 55 million.

For an attacker, it is hard to detect whether a DNA sequence contains a secret message. Even if he knows that there is a secret message in our DNA sequence, it is hard for him to identify the reference sequence and correctly decode the secret.

5.2 Future Works

It would be worthwhile to make the decoding stage more complicated, and to find other properties of DNA sequences which can be used to hide data. Decreasing the payloads of Method 1 and Method 2 is also left as future work.


Bibliography

[A2000] Dynamic Programming Algorithms for RNA Secondary Structure Prediction with Pseudoknots, Akutsu, T., Discrete Applied Mathematics, Vol. 104, 2000, pp. 45-62.

[ABG92] Optimal Parallel Algorithms for Periods, Palindromes and Squares, Apostolico, A., Breslauer, D. and Galil, Z., Proceedings of the International Colloquium on Automata, Languages, and Programming, 1992, pp. 296-307.

[ABG95] Parallel Detection of All Palindromes in a String, Apostolico, A., Breslauer, D. and Galil, Z., Theoretical Computer Science, Vol. 141, 1995, pp. 163-173.

[ABLRRW94] Molecular Biology of the Cell, Alberts, B., Bray, D., Lewis, J., Raff, M., Roberts, K. and Watson, J. D., New York & London: Garland Publishing, 1994.

[BG95] Finding All Periods and Initial Palindromes of a String, Breslauer, D. and Galil, Z., Algorithmica, Vol. 14, 1995, pp. 355-366.

[C2003] A DNA-Based, Biomolecular Cryptography Design, Chen, J., Proceedings of the 2003 International Symposium on Circuits and Systems, Vol. 3, 2003.

[CRB99] Hiding Messages in DNA Microdots, Clelland, C. T., Risca, V. and Bancroft, C., Nature, Vol. 399, 1999, pp. 533-534.

[G76] Real-Time Algorithms for String-Matching and Palindrome Recognition, Galil, Z., STOC, 1976, pp. 161-173.

[LNC2000] Principles of Biochemistry, Lehninger, A. L., Nelson, D. L. and Cox, M. M., New York, Worth, 2000.

[LRBR2000] Cryptography with DNA Binary Strands, Leier, A., Richter, C., Banzhaf, W. and Rauhe, H., BioSystems, Vol. 57, 2000, pp. 13-22.

[M75] A New Linear-Time On-Line Algorithm for Finding the Smallest Initial Palindrome of a String, Manacher, G., J. ACM, Vol. 22, 1975, pp. 346-351.

[MW2002] Jewels of Stringology, Crochemore, M. and Rytter, W., World Scientific, 2002.

[SFP2002] Hiding Data in DNA, Shimanovsky, B., Feng, J. and Potkonjak, M., Revised Papers from the 5th International Workshop on Information Hiding, Lecture Notes in Computer Science, Vol. 2578, 2002, pp. 373-386.

[WHRS87] Molecular Biology of the Gene, Watson, J., Hopkins, N., Roberts, J. and Steitz, J., Benjamin Cummings, 4th Edition, Menlo Park, CA, 1987.
