
ABSTRACT

Rapid advances in DNA sequencing techniques have led to a tremendous growth in biological databases. Once the whole DNA of an organism has been sequenced, the next major task is to predict the protein-coding DNA (exons) present in that sequence. This task, known as gene finding, is one of the major challenges in the analysis of newly sequenced genomes. What matters most in such studies is how accurately the exact protein-coding DNA is predicted. Although state-of-the-art tools report high accuracy, their performance varies with the length of the sequence being analysed. This work began with an investigation of the relationship between prediction accuracy and the length of the DNA sequence being analysed for gene prediction. The preliminary results indicated that length and accuracy are correlated. Hence, in this work, we have developed different models specialized for different length ranges, from fewer than 500 nucleotides to more than 10000 nucleotides. From a set of features that can discriminate between exons and introns, we identified the features that are most powerful for each length range. The models were trained on these features, and a tool was developed that assigns each input sequence to the model tuned for its length range and returns that model's prediction. The proposed approach, which employs AdaBoost.M1 with random forests as the base classifier, shows a considerable improvement in prediction accuracy.
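The length-based assignment step described above can be illustrated with a minimal sketch. It is not the tool itself. Only the first range (fewer than 500 nucleotides) and the last range (more than 10000 nucleotides) are taken from the abstract; the intermediate boundaries, the names `LENGTH_BOUNDS`, `length_bin` and `predict`, and the stand-in models are hypothetical and chosen purely for illustration.

```python
from bisect import bisect_right

# Exclusive upper bounds of the length ranges, in nucleotides.
# Only "< 500" and "> 10000" come from the abstract; 2000 and 5000
# are hypothetical placeholders for the intermediate ranges.
LENGTH_BOUNDS = [500, 2000, 5000, 10001]


def length_bin(sequence: str) -> int:
    """Return the index of the length range the sequence falls into."""
    return bisect_right(LENGTH_BOUNDS, len(sequence))


def predict(sequence: str, models):
    """Send the sequence to the model trained for its length range.

    `models` is a list with one callable per length range. In the
    thesis each model would be a trained classifier; here any
    callable that accepts a sequence is assumed.
    """
    return models[length_bin(sequence)](sequence)


# Stand-in models that only report which range handled the input.
dummy_models = [
    (lambda seq, i=i: f"model_{i}") for i in range(len(LENGTH_BOUNDS) + 1)
]

short_result = predict("ATGC" * 100, dummy_models)   # 400 nucleotides
long_result = predict("ATGC" * 3000, dummy_models)   # 12000 nucleotides
```

Keeping the boundaries in a single sorted list means that changing the number of length ranges only requires editing `LENGTH_BOUNDS` and supplying a matching list of models.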

ACKNOWLEDGEMENT

This thesis would not have been possible without the assistance and support of many people. I would sincerely like to thank my supervisor, Dr. Achuthsankar S. Nair, HOD, Dept. of Computational Biology & Bioinformatics, University of Kerala, for offering me this thesis topic and for supporting and guiding me throughout my research. His teaching will have a lasting impact on my future academic and professional career. I would also like to thank my internal guide, Ms. Muneera C.R., Associate Professor, Dept. of Electronics & Communication, GEC Thrissur, and my external guide, Ms. Baharak Goli, Research Scholar, Dept. of Computational Biology & Bioinformatics, University of Kerala, for their time and support during the completion of my thesis. I take this opportunity to thank Dr. Sheeba V.S., HOD, Dept. of Electronics & Communication, GEC Thrissur, and the project coordinators, Mr. Mohammed Salih K.K., Assistant Professor, Dept. of Electronics & Communication, GEC Thrissur, and Mr. Roy Francis, Assistant Professor, Dept. of Electronics & Communication, GEC Thrissur. Last but not least, I would like to acknowledge the invaluable support and encouragement of my family and friends.


LIST OF TABLES

Table No.   Title                                         Page No.

3.1         FrameD Result                                 16
3.2         GeneMark Result                               17
3.3         Distribution                                  19
3.4         Mapping Schemes                               27
3.5         Physicochemical properties of nucleotides     28
3.6         Summary of Filters                            29
3.7         Feature Vector                                32
3.8         Attribute Selection                           34
3.9         Comparison of Various Classifier Methods      36
4.1         Self Consistency Test Results                 43
4.2         Independent Dataset Test Results              44
5.1         Comparison of Prediction Accuracy             45


LIST OF FIGURES

Figure No.  Title                                         Page No.

1.1         Accuracy of FrameD                            3
1.2         Accuracy of GeneMark                          3
2.1         DNA Structure                                 6
2.2         DNA Replication                               7
2.3         Central Dogma                                 8
2.4         Codons for Amino Acids                        9
2.5         Eukaryotic DNA                                10
2.6         Classical Approaches to Gene Finding          11
3.1         General Schematic Representation of Work      18
3.2         Spectral Content with Nucleotide Position     21
3.3         Feature Extraction using Mapping Techniques   26
3.4         Tool Developed                                38
3.5         Working of Length Specific Gene Finding Tool  41


LIST OF ABBREVIATIONS AND ACRONYMS

DNA         Deoxyribonucleic Acid
f           Feature
SC          Spectral Content
PSC         Paired Spectral Content
SR          Spectral Rotation
PFDN        Positional Frequency Distribution of Nucleotides
AMDF        Average Magnitude Difference Function
CV          Cross Validation
