
International Journal of Computer Information Systems, Vol. 3, No. 5, 2011

High Processing Genomic Signal Application with Adaptive Region Prediction


S. Jessica Saritha, Dept. of Computer Science, JNTUA College of Engineering, Pulivendula, A.P., India, sarithajntucep@gmail.com

Prof. P. Govinda Rajulu, Dept. of Computer Science, Sri Venkateswara University, Tirupathi, A.P., India, pgovindarajulu@yahoo.com

Abstract: The need for fast and accurate processing of volumetric data sets has recently attracted considerable attention. For information retrieval in online systems, the computing algorithm and the way it is implemented play an important role. Various techniques have been proposed in the recent past for fast processing of volumetric data sets. Although these approaches are designed for fast processing, they have not been evaluated across all application areas. In applications such as genomic signal processing, they are limited by the very large volume of data and by the irregular similarities in gene sequences. Additionally, the presence of coding and non-coding regions in a gene sequence produces excessive coefficients, which results in slower operation. To alleviate these limitations, this paper suggests an approach towards faster processing in genomic signal applications.

Keywords: Genomic Data Processing, Adaptive Region Filtration, Gene Mining.

I. INTRODUCTION

With the evolution of automated data processing, the need for faster and more reliable applications has emerged. Faster computing systems for large data sets have been a focus of recent work, and various approaches have been suggested to speed up system operation through either algorithmic or architectural modification. This paper focuses on developing an approach for faster computation, with data prediction and system improvement, for volumetric data processing in genomic signal processing (GSP) applications. In practical GSP applications the sequence is represented as a continuous data sequence. The processing speed of such a system depends on the data representation and the mode of processing, so the data representation and processing need to be modified to improve the speed of operation. Many applications demand higher computation speed to improve system performance, but such systems are limited by the large data sets they must process. Genomic applications share this limitation and are very slow in operation. A modification of the data representation or of the processing architecture is required to improve such systems. The problem of processing large data sets in GSP applications has been observed since the emergence of automated genomic signal processing.

Various approaches have been reported for modifying the representation and the system processing to achieve faster and more accurate classification. The authors of [1] suggested a modification of the gene representation that directly relates the identification of the period-3 component to the detection of nucleotide bias in the codon structure, and completely characterizes the DNA spectrum by a set of numerical sequences. The sequence spectra are derived for different representations of the gene sequence so as to improve the speed of computation through spectral matching. This approach is observed to improve system accuracy by a large factor, but at the cost of increased system complexity. In [2], a harmonic suppression filter and a parametric minimum variance (MV) spectrum estimation technique are presented for gene prediction. The authors show that both filtering techniques are able to detect smaller exon regions, and that the adaptive MV filter minimizes the power in introns (non-coding regions), giving more suppression to the intron regions. New classification criteria that improve upon traditional frequency-based approaches for the identification of coding regions are presented in [1-4]. The experimental studies carried out indicate superior performance compared with other algorithms that use the 3-periodicity property.

In [5] a theoretical justification for the 3-periodicity property observed in protein coding regions within genomic DNA

November Issue

Page 60 of 90

ISSN 2229 5208

sequences is presented. The choice of numerical representation of DNA sequences with respect to biological properties is presented in [7]. To improve the speed of processing, various architectural modifications have been reported. In [8], methods used in digital signal processing are applied to problems in the design of multiple-valued logic circuits and systems. In [9], a molecular digital computer that configures itself is proposed. To reduce the enhanced expression of proteins, an algorithm for reducing duplicate genes was proposed in [18]. In [19,20], an efficient incremental clustering algorithm, Leaders-Subleaders, an extension of the Leaders algorithm suitable for protein sequences in bioinformatics, is proposed for effective clustering and prototype selection in pattern classification. The method and system have been successful in predicting genomic sequences and text structures. In [22-26], high-throughput gene expression profiling techniques that produce a huge amount of gene expression data on various organisms are presented. Such a bulk of biological data provides excellent opportunities to study transcriptional regulation mechanisms using machine learning and data mining approaches. A simple and robust method for the classification of significantly expressed genes in high-throughput microarray measurements of a cell's transcriptome has also been presented. The technique originates in PCA-based fault detection and isolation (FDI) systems. PCA-FDI is a data-driven procedure that can be used to isolate gene expression profiles associated with anomalous cell function by projecting the target onto a residual subspace orthogonal to a set of PCA coordinates extracted from microarray data. The method is robust to noise and disturbances, and is insensitive to natural variation due to nominal cell functioning.
The modified clustering approaches were developed with the objective of improving speed, but conventional clustering gives no consideration to the data representation. This limitation becomes significant when processing very large data sets such as those in genomic applications. With this in mind, this paper focuses on improving data retrieval efficiency in genomic signal processing applications based on multiple-valued logic representation and computation.

II. GENOMIC SIGNAL PROCESSING

Genomic signal processing (GSP) is the engineering discipline that studies the content of the genomic signal and explains the production of mRNA and proteins, which is directed by the genome. With today's technology, GSP focuses on obtaining gene information by analyzing the gene: the genomic signal is analyzed and processed to extract information that is useful for observation. The purpose of GSP is to combine theory and various methods for processing genomic information so that it becomes useful. GSP therefore includes different techniques concerning expression profiles: detection, prediction, classification, control and statistical modeling. It studies the analysis and processing of genes on the basis of a synthesis model that involves rigorous mathematical approaches. Because this information is central to analysis and prediction, the accuracy of information retrieval is of major importance, and several problems are observed in current systems in meeting this objective. Genomics is an interdisciplinary subject that is revolutionizing medicine and agriculture. In the coming century, the processing and sequencing of the genomic information of humans and other living things will bring enormous changes to the scientific world. Genomic information is always represented as sequences.
The DNA and proteins that form these sequences are represented by characters. DNA consists of four letters, A, T, G and C, whereas proteins are written with twenty characters. Biomolecular analysis is already a major research area among scientists all over the world, but because the sequences are character strings they are difficult to analyze directly. If the sequences are converted to numerical values, they can be processed easily using DSP techniques. Genes are fundamental components of the human body that determine the behavioral and physical attributes of a person. Many core diseases have been identified as genetic and hereditary, which has led to a rise in research in the field of genetic engineering. Genes are associated with three components: DNA, RNA and proteins.

Figure 1. Deoxyribonucleic acid (DNA).

A DNA molecule has two strands that form a double helix, as shown in Fig. 1. This holds for human beings and for other organisms. The four different bases are adenine (A), thymine (T), cytosine (C) and guanine (G). In a DNA molecule, A pairs with T and C pairs with G through hydrogen bonds.
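The A-T and C-G pairing rule can be illustrated with a short sketch. The function names `complement` and `reverse_complement` are chosen here for illustration and do not come from the paper; the example only shows how one strand determines the other.

```python
# Pairing rule: A pairs with T, and C pairs with G.
PAIRING = {"A": "T", "T": "A", "C": "G", "G": "C"}


def complement(strand):
    """Return the base-by-base complement of a DNA strand."""
    return "".join(PAIRING[base] for base in strand.upper())


def reverse_complement(strand):
    """Return the complementary strand read in the opposite direction."""
    return complement(strand)[::-1]


print(complement("CGTCGTGGAA"))          # GCAGCACCTT
print(reverse_complement("CGTCGTGGAA"))  # TTCCACGACG
```

A strand containing any character other than A, T, C or G raises a `KeyError` in this sketch, which is a deliberate simplification.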


A DNA sequence is the specific order of bases along a sugar-phosphate backbone. The sequence specifies a typical genetic behavior or identity. The two strands of a DNA molecule are held together by weak bonds between the paired bases, known as base pairs. The genome is the complete DNA content of an organism, and its size is measured in base pairs; the human genome consists of approximately three billion base pairs. Another important component is RNA. RNA has a structure similar to that of DNA but is single-stranded rather than double-stranded, and can therefore fold into different structures. The third component is proteins, which are essentially the most important component in all biological aspects. Proteins are of different types and contribute to behavior and physical health; their production is known as protein synthesis. In summary, DNA is the most distinctive and defining structure, and can be analyzed conclusively to establish the identity of a person, a family tree, health status and so on. DNA is transcribed into RNA, and RNA is translated into proteins. Representing these sequences for automated data processing, and in turn for automated diagnosis with accurate results, is a prime requirement.

III. PROPOSED ADAPTIVE SAMPLING CODING APPROACH

Multiple-valued logic for exon prediction is outlined for the processed signal under variable environmental conditions. The given gene sequence is represented as a string composed of four bases, which is mapped into four binary signals. The signal $b_A(n)$ takes the value 1 if A is present in the DNA sequence at index $n$, and the value 0 if A is absent at index $n$. For instance, $b_A(n)$ for the DNA segment CGTCGTGGAA is 0000000011.
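The mapping from a DNA string to four binary signals can be sketched in a few lines. This is a minimal illustration; the helper name `indicator_sequences` is an assumption made here and is not taken from the paper.

```python
BASES = "ATGC"


def indicator_sequences(dna):
    """Map a DNA string to four binary indicator sequences, one per base.

    The entry for base X holds 1 at index n if dna[n] == X, and 0 otherwise.
    """
    dna = dna.upper()
    return {base: [1 if symbol == base else 0 for symbol in dna]
            for base in BASES}


signals = indicator_sequences("CGTCGTGGAA")
print("".join(str(bit) for bit in signals["A"]))  # 0000000011
```

At every index exactly one of the four signals is 1, so the four sequences together carry the same information as the original string.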
In the same manner, the signals $b_T(n)$, $b_G(n)$ and $b_C(n)$ can be obtained. The DFT of $b_A(n)$ over $W$ samples, $B_A(f)$, is then computed. The DFTs of $b_T(n)$, $b_G(n)$ and $b_C(n)$, termed $B_T(f)$, $B_G(f)$ and $B_C(f)$ respectively, are obtained in the same way. Period-3 behavior is observed in several genes and is very helpful in recognizing coding regions. In addition, several researchers have observed the period-3 property to be a good (preliminary) indicator of gene location. For this reason, the DFT coefficient magnitude at $f = N/3$ is frequently considerably larger than the surrounding DFT coefficient magnitudes. A measure of the total spectral content $S(f)$ of a DNA character string at frequency $f$ is defined as the sum of the magnitudes of the DFT values of the four binary nucleotide sequences:

$$S(f) = |B_A(f)| + |B_T(f)| + |B_G(f)| + |B_C(f)|$$

Observe that calculating the DFT at the single point $f = N/3$ is sufficient. The window can then be slid by one or more bases and $S(N/3)$ recalculated, which gives a picture of how $S(N/3)$ evolves along the length of the DNA sequence. It is essential that the window length $W$ be sufficiently large (typical window sizes range from a few hundred, e.g. 351, to a few thousand). On the other hand, a long window implies longer computation time and also compromises the base-domain resolution in predicting the exon location [2]. Moreover, the non-coding regions in the DNA spectrum at $2\pi/3$ are not wholly suppressed by conventional DSP, so a non-coding region may be mistakenly recognized as a coding region.

To overcome this limitation, this work proposes an approach called the Adaptive sampling-PSD approach to achieve higher accuracy in exon region estimation. The period-3 pattern in an exon is a periodicity only in the statistical sense, and it is not prominent for some exons, especially short ones. Conversely, a non-exon sequence may also exhibit a statistical period-3 pattern, or any other periodicity, just by chance. This makes it difficult to discriminate between exon and non-exon DNA sequences. To make the period-3 pattern prominent in spectral analysis, the effect of the spurious periodicities must be reduced. The period-3 pattern of an exon is mainly due to codon usage bias, and the multiple codons for the same amino acid differ mainly in the third nucleotide of the codon triplet. Based on this observation, the third nucleotide of every codon is kept fixed while the first and second nucleotides are sub-sampled. Such sub-sampling reduces the spurious periodicity while keeping the period-3 pattern of an exon unchanged. This is the Adaptive sampling-PSD approach. Once sub-sampling is carried out, the coefficient density of the non-exon sequences is reduced and no prominent peak remains at $k = N/3$ in their spectral energy function. Therefore, on average it will be more


discriminative between exon and non-exon sequences after the DNA sequence is sub-sampled. The DNA sequences are sub-sampled multiple times, and the process is then carried out on the average spectral energy. The average spectral energy function for discriminating exon from non-exon DNA sequences is then given by

$$\bar{S}(k) = \frac{1}{L} \sum_{l=1}^{L} \lVert S_l(k) \rVert$$

where $L$ is the number of sub-samples and $\lVert S_l(k) \rVert$ is the spectral energy function of the $l$-th sub-sampled sequence.
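The sliding-window period-3 measure and the averaging over sub-samples described above can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: all function names are hypothetical, the spectral content is taken as the sum of DFT magnitudes as the text states, and the `subsample` rule (keep every third base, replace a random subset of the other bases with the placeholder `N`) is only one possible reading of the sub-sampling step, which the text does not specify in detail.

```python
import cmath
import random

BASES = "ATGC"


def period3_content(window):
    """Sum over the four bases of |DFT coefficient at k = W/3| for one window.

    At k = W/3 the DFT kernel exp(-2j*pi*k*n/W) reduces to exp(-2j*pi*n/3).
    """
    total = 0.0
    for base in BASES:
        coeff = sum(cmath.exp(-2j * cmath.pi * n / 3)
                    for n, symbol in enumerate(window) if symbol == base)
        total += abs(coeff)
    return total


def sliding_period3(dna, window_len, step=1):
    """Evaluate the period-3 content on a window slid along the sequence."""
    return [period3_content(dna[start:start + window_len])
            for start in range(0, len(dna) - window_len + 1, step)]


def subsample(dna, keep_prob, rng):
    """ASSUMED sub-sampling rule (hypothetical, for illustration only).

    Positions 2, 5, 8, ... (third codon position in frame 0) are always kept.
    Every other base is kept with probability keep_prob and otherwise
    replaced by 'N', which contributes to none of the four indicators.
    """
    return "".join(
        symbol if (n % 3 == 2 or rng.random() < keep_prob) else "N"
        for n, symbol in enumerate(dna))


def average_spectral_energy(dna, window_len, num_subsamples,
                            keep_prob=0.5, seed=0):
    """Average the sliding period-3 curves of several sub-sampled copies."""
    rng = random.Random(seed)
    curves = [sliding_period3(subsample(dna, keep_prob, rng), window_len)
              for _ in range(num_subsamples)]
    return [sum(values) / num_subsamples for values in zip(*curves)]


if __name__ == "__main__":
    sequence = "ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG"  # made-up example
    plain = sliding_period3(sequence, window_len=21)
    averaged = average_spectral_energy(sequence, window_len=21,
                                       num_subsamples=8)
    for position, (p, a) in enumerate(zip(plain, averaged)):
        print(position, round(p, 3), round(a, 3))
```

Choosing a window length that is a multiple of 3 keeps $k = W/3$ an integer DFT index. Real windows would be far longer than the 21 bases used in this toy run, in line with the window sizes quoted in the text.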

IV. RESULT OBSERVATION

A case study was carried out on the C. elegans chromosome III dataset for performance evaluation. The evaluation was carried out on longer and shorter protein sequences for exon prediction and retrieval, using both the suggested and the conventional approaches. The suggested approach results in lower distortion and more exact prediction of the exon region than the method suggested by Vikrant Tomar et al. [2]. The results obtained are shown in the following spectral plots.
[Figure: spectral plot for genomic sequence, PSD versus base location, curves Subsampled-PSD and VOS-PSD]

Spectral plot of the genomic signal for sequence ACACCACATCATGACAGTACGCAGCTACGAACTAC

[Figure: spectral plot for genomic sequence, PSD versus base location, curves Subsampled-PSD and VOS-PSD]

Spectral plot of the genomic signal for sequence ATTAGCATACGCTTCGACTACGATCAGCTACGCTAC

[Figure: spectral plot for genomic sequence, PSD versus base location, curves Subsampled-PSD and VOS-PSD]

Spectral plot of the genomic signal for sequence ATCAGACTAGACGTAGCTACGATTCGCCACACTAC



[Figure: spectral plot for genomic sequence, PSD versus base location, curves Subsampled-PSD and VOS-PSD]

Spectral plot of the genomic signal for sequence ACTACACCTACATAGAGCATAGGACACTAGACATC

From these observations it is seen that the objective of accurate exon region estimation is achieved. This approach, however, computes more slowly because of the larger number of representation coefficients.

V. CONCLUSION

To achieve faster processing together with this accurate estimation, a multiple-valued approach to the processing has been developed. For the complex signal representation of GSP, applying the sub-sampling approach yields exon regions that are clearer, with better harmonic elimination, than the previous method, but the processing becomes slower. Multiple-valued logic operations are observed to be faster in computation than conventional bi-level logic. Hence, to speed up computation, in this work the sub-sampling operation is carried out and represented in multiple-valued logic (MVL) rather than bi-level logic.

VI. REFERENCES

[1] Mahmood Akhtar, Eliathamby Ambikairajah and Julien Epps, "Detection of Period-3 Behavior in Genomic Sequences Using Singular Value Decomposition," IEEE, 2005.

[2] Vikrant Tomar, Dipesh Gandhi and C. Vijaykumar, "Digital Signal Processing for Gene Prediction," TENCON 2008 Conference, IEEE, 2008, pp. 1-5.

[3] Jamal Tuqan and Ahmad Rushdi, "A DSP Approach for Finding the Codon Bias in DNA Sequences," IEEE Journal of Selected Topics in Signal Processing, Vol. 2, No. 3, 2008, pp. 343-355.

[4] Gurnmuluru and V. Su-Shing Chen, Gainesville, "An Intelligent System for Searching Genomic Sequences," Bioinformatics and Bioengineering, BIBE 2007, Proceedings of the 7th IEEE International Conference, 2007, pp. 982-986.

[5] Suprakash and Amir Asif, "A Fast DFT Based Gene Prediction Algorithm for Identification of Protein Coding Regions," IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP '05, IEEE, 2005, Vol. 5, pp. 653-656.

[6] Kevin Crosby and Paula Gabbert, "BioSPRINT: Classification of Intron and Exon Sequences Using the SPRINT Algorithm," Proceedings of the 2004 IEEE Computational Systems Bioinformatics Conference (CSB 2004), IEEE, 2004, pp. 668-669.

[7] Hon Keung Kwan and Swarna Bai Arniker, "Numerical Representation of DNA Sequences," IEEE International Conference on Electro/Information Technology (EIT '09), 2009, pp. 307-310.

[8] Jaakko Astola and Radomir S. Stankovic, "Signal Processing Algorithms and Multiple-Valued Logic Design Methods," Proceedings of the 36th International Symposium on Multiple-Valued Logic (ISMVL'06), IEEE, 2006.

[9] Elena Dubrova, "Random Multiple-Valued Networks: Theory and Applications," Proceedings of the 36th International Symposium on Multiple-Valued Logic (ISMVL'06), IEEE, 2006.

[10] Afef Elloumi Oueslati, Zied Lachiri and Noureddine Ellouze, "Spectral Analysis of DNA Sequence: The Exons Location Method," Digital Signal Processing, 2007 15th International Conference, IEEE, 2007, pp. 115-118.

About the authors:

1. S. J. Saritha is currently working as an Assistant Professor at JNTUA College of Engineering, Pulivendula, and is a part-time research scholar at JNTU Hyderabad. She received her B.Tech in ECE from JNTU Anantapur and her M.Tech in CSE from JNTU Kakinada. Her areas of interest are data mining and bioinformatics.

2. Prof. P. Govindarajulu is currently working as Principal, CM&CS, S.V. University, Tirupathi. He obtained his M.Tech from IIT Madras and his Ph.D from IIT Delhi. He has held several positions, including Dean and BOS chairman. His research interests are databases, data mining and image processing.
