Escolar Documentos
Profissional Documentos
Cultura Documentos
2008; Seto et al., 2007). Most general methods, such as BLAST 2.2 Construction of training set
and MEME, are inappropriate for piRNA prediction. For example, Constructing training set to computationally detect piRNA with Fisher
we cannot find any homologous piRNAs between fruit fly and discriminant algorithm (Fisher, 1936), we use two groups of samples: a
other species with BLAST; and not any conserved motifs are found positive group consisting of true piRNA sequences from five model species
in piRNAs with MEME. Therefore, more efficient computational and a negative group of non-piRNA sequences. The positive dataset consists
methods are urgently demanded. The k-mer scheme is widely used of known piRNA sequences of five species: rat, mouse, human, fruit fly
to characterize biosequences (Burge et al., 1992; Gutierrez, 1993; and nematode. piRNAs from the first three species are downloaded from
Karlin and Ladunga,1994), because patterns using k-mers have been NONCODE version 2.0 (Liu et al., 2005), and piRNAs from the last two
species are obtained from NCBI (nematode: gi222138841 ∼ 222138290;
found to be species or taxon special (Karlin et al., 1994; Karlin and
fruit fly: gi157362817 ∼ 157361675). In total, we obtain 173 090 positive
Mrazek, 1997; Madera et al., 2010). samples, including 32 046 human piRNAs, 72 747 mouse piRNAs, 66 758 rat
It is important to note that, the Solexa small RNA data may piRNAs, 552 Caenorhabditis elegans piRNAs and 987 Drosophila piRNAs.
also include miRNA, piRNA, as well as some short fractions of The negative samples are derived from NONCODE version 2.0
snoRNA, snRNA, long ncRNA and un-annotated mRNAs, which (Liu et al., 2005). NONCODE is a database of a wide variety of ncRNA
have similar lengths to piRNA. In fact, NONCODE version 2.0 classes (small and long ncRNAs) from 861 organisms covering all kingdoms
(Liu et al., 2005) has 9257/20 7765 = 4.46% ncRNAs shorter than of life (eukaryotes, eubacteria, archea and viruses). Data are from three
25 nt, suggesting that most ncRNA may produce a sequence fraction sources: (i) manual extracts from literature, (ii) automatically filtered and
with similar length to piRNA. Moreover, possible background noise manually confirmed GenBank sequences and (iii) experimental data from
also exists, including the background information introduced by Chen’s laboratory (Giulia et al., 2009). In detail, it includes ‘miRNA’,
‘piRNA’, ‘mlRNA’, ‘snoRNA’, ‘snRNA’, ‘tmRNA’, ‘SRP RNA’, ‘gRNA’,
Solexa, the degraded RNA fragments in sample preparation and
‘sbRNA’, ‘snlRNA’, etc. Thus, it is qualified as the source for the ncRNA
the noise caused by the random match of data with genome. study.
Studies of Wei et al. (2009) demonstrated that there are about There are 34 675 non-piRNA non-coding RNA sequences, and it should
30–40% short sequences in 603 607 possible candidates of locust be noted that most of them are much longer than positive sequences. At first,
772
ng g
where m g = n1g k=1 yk ,g = 1,2. Once the Fisher vector w and the threshold
y0 are obtained, the decision of piRNA/non-piRNA in the test set is simply
performed by the criterion of f (x) > 0/f (x) < 0, where f (x) = wτ x −y0 . To
improve the Fisher algorithm, we introduce the ‘cutoff’, which in theoretical
physics means the maximal (or minimal) value of energy, momentum or
length, so that the objects with even smaller (or larger) values than these
physical quantities are ignored. A popular method in increasing precision is
2
to set higher cutoff values (Candolfi et al., 1993; Huang et al., 2006). The m
is the mean value of the projective points [i.e. f (x) = wτ x] of non-piRNAs.
Here, we set Nstd to be the SD of the projective points of non-piRNAs. With
the two variables, we may improve the Fisher discriminant algorithm. In
detail, we change the discriminant formula into the new one shown below.
f (x) = wτ x − m
2 −t ×Nstd ,
Fig. 1. Average frequencies of 1337 strings in piRNAs and non-piRNAs.
and once the Fisher vector w is obtained, the decision of piRNA/non-piRNA The 1337 strings are used differently between piRNAs and non-piRNAs,
is performed by the criterion of f (x) > 0 and f (x) < 0, respectively. Obviously, and the difference is visualized by comparing the average frequencies of the
2 +t ×Nstd .
the cutoff value is m 1337 strings in two groups of samples. Here, the red and black lines represent
piRNAs and non-piRNAs, respectively.
3.2 Position-specific base usage of piRNA especially in the 30, 31 and the 32 position. Furthermore, we detected
The size distribution of all known piRNAs largely varied ranging position-specific base usages by using rank sum test. In the analysis,
from 18 nt to 32 nt, and mainly distributed in 28, 29, 30 and we only considered the beginning 21 base positions of piRNA
31 nt which cover 72.32% known piRNAs (Supplementary Material to cover all possible piRNA sequences. Setting significant level
Fig. S1). With the comparison of piRNAs and non-piRNAs, to be 10−100 , we found that 21 position-specific base usages are
we revisited the position-specific properties in detail. Then, we significantly different between piRNA and non-piRNA. They are
calculated the frequencies of four bases in each position, and a1, g1, c1, t1, g2, t2, g3, c3, t3, a4, t4, a5, g5, t5, a6, c6, t6, t7,
identified conserved position-specific bases at the beginning and a10, c10 and c12. The difference can be visualized by comparing
the end of piRNAs (Fig. 2A), besides G or A at +1 position, A the frequencies of these position-specific bases in piRNAs and
at +4 position and a slight underrepresentation of G at −1 position, non-piRNAs (Fig. 2B).
773
Actual positives TP FN
Actual negatives FP TN
sn sn = TP+FN
TP
sp sp = TP+FP
TP
TP, True positives; FP, False positives; TN, Ture negatives; sn, Sensitivity; sp, Precision.
B
3.3 Cross-validation tests
Prediction precision and sensitivity are widely used to evaluate the
performance of an algorithm. Sensitivity is the ratio of number
of true positive samples to that of actual positive samples, and
the precision is the ratio of number of true positive samples to
those of predicted positive samples. Their definitions are listed in
Table 1. We performed five cross-validation tests in five species:
rat, mouse, human, fruit fly and nematode. In order to evaluate the
precision and sensitivity of current algorithm in predicting piRNA Fig. 3. The relationship of t to precision and sensitivity of 5-fold cross-
for a new species, we used the piRNAs of four species as training validation tests. Here, ‘sn’ and ‘sp’ denotes the prediction sensitivity and
set and the piRNAs of another species as test set. Each time we precision, respectively. Nematode: C.elegans; fruit fly: D.melanogaster; rat:
774
Table 2. The 15 strings that are differently used in two phases with a A
significance level of 10−100
1mer C
2mer CC
3mer ACC,GCC,CCA,CCC,CCT,CTC
4mer CACC,CCTT
5mer CTGCA,CTCTG,TCCGA,TTGCT,TTGTA
775
Betel,D. et al. (2007) Computational analysis of mouse piRNA sequence and biogenesis.
PLoS Comput. Biol., 3, 2219–2227.
Brennecke,J. et al. (2007) Discrete small RNA-generating loci as master regulators of
transposon activity in Drosophila. Cell, 128, 1089–1103.
Burge,C. et al. (1992) Over- and under-representation of short oligonucleotides in DNA
sequences. Proc. Natl Acad. Sci. USA, 89, 1358–1362.
Candolfi, E. et al. (1993) Determination of a new cut-off value for the diagnosis of
congenital toxoplasmosis by detection of specific IgM in an enzyme immunoassay.
Enferm. Infecc. Microbiol. Clin., 12, 396–398.
Cox,D.N. et al. (1998) A novel class of evolutionarily conserved genes defined by piwi
are essential for stem cell self-renewal. Genes Dev., 12, 3715–3727.
Das,P.P. et al. (2008) Piwi and piRNAs Act upstream of an Endogenous siRNA Pathway
to Suppress Tc3 Transposon Mobility in the Caenorhabditis elegans Germline. Mol.
Cell, 31, 79–90.
Fisher,R.A. (1936) The use of multiple measurements in taxonomic problems. Ann.
Fig. 6. The distribution of projective points representing small RNAs A small Eugen., 7, 179–188.
RNA sequence is first characterized by a 1364 D vector, and then further Girard,A. et al. (2006) A germline specific class of small RNAs binds mammalian Piwi
mapped to a projective point by Fisher formula F(x) = wτ x. This figure proteins. Nature, 442, 199–202.
shows the distributions of projective points of 120 000 pairs of drill sequences Giulia,S. et al. (2009) An Ariadnes thread to the identification and annotation of
(piRNAs and non-piRNAs), 603 607 deep-sequenced candidate small RNA noncoding RNAs in eukaryotes. Brief Bioinform., 10, 475–489.
sequences, 87 536 predicted locust piRNAs and 4426 locust piRNAs matched Grivna,S.T. et al. (2006) A novel class of small RNAs in mouse spermatogenic cells.
Genes Dev., 20, 1709–1714.
with the locust transposons. From the top down, the framework of this article
Gutierrez,G. et al. (1993) Dinucleotides and G+C content in human genes: opposite
is presented. behavior of GpG, GpC and TpC at II-III codon positions and in introns. J. Mol.
Evol., 37, 131–136.
Houwing,S. et al. (2007) A role for Piwi and piRNAs in germ cell maintenance and
for studying phase transition via locust transcriptome and RNAi
transposon silencing in Zebrafish. Cell, 129, 69–82.
776