Thesis submitted
to
Institute of Mathematics and Statistics
of the
University of São Paulo
in accordance with the requirements
for the degree
of
Doctor in Bioinformatics
Resumo
ANDRADE, P. M. Modelo Bayesiano de Meta-Análise para dados de ChIP-Seq. 2016. 107 f. Tese (Doutorado) - Instituto de Matemática e Estatística, Universidade de São Paulo, São Paulo, 2016.
With the development of high-throughput sequencing, new technologies have emerged to support the study of nucleic acid sequences (DNA and cDNA); as a consequence, new tools became necessary to analyse the large volume of data generated. Among these new technologies, one in particular, called Chromatin Immunoprecipitation followed by high-throughput DNA sequencing, or ChIP-Seq, has received much attention in recent years. This technology has become a widely used method to map the binding sites of proteins of interest in the genome. The analysis of data resulting from ChIP-Seq experiments is challenging because the mapping of sequences to the genome is subject to different forms of bias.
Existing methods used to find peaks in ChIP-Seq data have limitations related to the number of control and treatment samples used and to the way these samples are combined. In this thesis, we show that methods based on statistical hypothesis tests tend to find a much larger number of peaks as the sample size increases, which makes them unreliable for the analysis of large volumes of data.
The present study describes a Bayesian statistical method that uses meta-analysis to find binding sites of proteins of interest in the genome from ChIP-Seq experiments. This method was named Meta-Analysis Bayesian Approach, or MABayApp. We show that our method is robust and can be used with different numbers of control and treatment samples, as well as when comparing samples from different treatments.
Keywords: ChIP-Seq, Bayesian Statistics, Meta-Analysis.
Abstract
ANDRADE, P. M. A Meta-Analysis Bayesian Model for ChIP-Seq data. 2016. 107 p. Thesis (Doctoral) - Institute of Mathematics and Statistics, University of São Paulo, São Paulo, 2016.
With the development of high-throughput sequencing, new technologies emerged for the study of nucleic acid sequences (DNA and cDNA) and, as a consequence, tools to analyse the large volume of data became necessary. Among these new technologies, one in particular, Chromatin Immunoprecipitation followed by massively parallel DNA Sequencing, or ChIP-Seq, has gained prominence in recent years. This technology has become a widely used method to map the locations of binding sites for a given protein in the genome. The analysis of data resulting from ChIP-Seq experiments is challenging, since different sources of bias can arise during the sequencing and the mapping of reads to the genome.
Current methods used to find peaks in ChIP-Seq data have limitations regarding the number of treatment and control samples used and how these samples should be used together. In this thesis we show that, since most of these methods are based on traditional statistical hypothesis tests, increasing the sample size considerably changes the number of peaks considered significant.
This study describes a Bayesian statistical method using meta-analysis to discover binding sites of a protein of interest based on peaks of reads found in ChIP-Seq data. We call it Meta-Analysis Bayesian Approach, or MABayApp. We show that our method is robust and can be used with different numbers of control and treatment samples, as well as when comparing samples under different treatments.
Keywords: ChIP-Seq peak calling, Bayesian Model, Meta-Analysis.
Contents
List of abbreviations
List of Symbols
List of Figures
List of Tables
1 Introduction
1.1 Literature review
1.2 Objective
1.3 Contribution
1.4 Document Organization
2 Method (Biology)
2.1 ChIP-Seq
2.1.1 Treatment and Control samples
2.1.2 ChIP-Seq reads alignment & RNA-Seq data analysis
2.2 UCSC genome assembly and annotation
3 Statistical background
3.1 Definitions and properties of Gamma, Beta and Dirichlet Distributions
3.2 Logistic-Normal Distribution
3.3 Examples of logistic Normal Approximation
4 Model (Statistics)
4.1 Categorical Bayesian Model
4.1.1 Meta-Analysis
7 Conclusion
7.1 Future Research
A R Code
Bibliography
List of abbreviations
ChIP-Seq Chromatin immunoprecipitation followed by sequencing
DNA Deoxyribonucleic acid
RNA Ribonucleic acid
PCR Polymerase Chain Reaction
BAC Bacterial artificial chromosome
IgG Immunoglobulin G
PolII RNA polymerase II
bp Base pair
pdf Probability density function
cdf Cumulative distribution function
TSS Transcription Start Site
3UTR Three prime untranslated region
5UTR Five prime untranslated region
List of Symbols
$\Gamma$ Gamma function
$\psi$ Digamma function
$\psi'$ Trigamma function
Gamma Gamma distribution
Beta Beta distribution
Dir Dirichlet distribution
N Normal distribution
$\perp\!\!\!\perp$ Independence
$\sim$ Distributed as
$\overset{\text{approx}}{\sim}$ Approximately distributed as
E Expectation
Var Variance
$D_{KL}$ Kullback-Leibler Divergence
Chapter 1
Introduction
With the development of high-throughput sequencing, new technologies emerged for the study of nucleic acid sequences (DNA and cDNA) and, as a consequence, tools to analyse the large volume of data became necessary (Zambelli et al., 2012). Among these new technologies, one in particular, described in Park (2009) and called Chromatin Immunoprecipitation followed by massively parallel DNA Sequencing, or ChIP-Seq, has gained prominence in recent years.
ChIP-Seq has become a widely used method to map the locations of binding sites for a given protein (e.g., a transcription factor) in the genome (Jothi et al., 2008). The analysis of data resulting from ChIP-Seq experiments is challenging, since different sources of bias can arise during the sequencing and the mapping of reads to the genome.
In this method, the protein is first linked to the DNA (during a step called cross-link), and the genetic material is fragmented. The fragments of DNA that are linked to the protein of interest are then captured using specific antibodies; fragments of a specific length are then sequenced and aligned to the genome. Piles of reads aligned to a specific region of the genome are called peaks.
Current methods used to find peaks (Hower et al., 2011; Wu et al., 2015; Zhang et al., 2008) have limitations regarding the number of treatment and control samples used and how these samples should be used together. The literature review of relevant research is presented in Section 1.1.
For one of the most used methods (the dominant method according to the reviews by Thomas, Thomas, Holloway, and Pollard 2016 and Wilbanks and Facciotti 2010), called MACS, Model-based Analysis of ChIP-Seq (Zhang et al., 2008), there is no consensus among researchers on how many treatment replicates should be used together. The software documentation recommends that different replicates be concatenated into a single file: "For the experiment with several replicates, it is recommended to concatenate several ChIP-seq treatment files into a single file." However, researchers have experienced an unexpected change in the number of peaks found when doing so. Some investigators have even suggested that this strategy should be abandoned: the software should be run on single files, and other strategies should be applied to combine the resulting files afterwards.
In this thesis we show that, since most of these methods are based on statistical hypothesis tests (Wu et al., 2015), increasing the sample size (e.g., duplicating both treatment and control samples) considerably changes the number of peaks considered significant for a given threshold; in the extreme, all the peaks in a given chromosome become significant given a large amount of data. This behaviour is known among statisticians as "increase sample size to reject" (DeGroot et al., 1986; Stern, 2008). In contrast, our probabilistic approach becomes more assertive regarding the significance of each peak as we increase the sample sizes for both treatment and control samples.
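The "increase sample size to reject" effect can be illustrated with a toy calculation. The sketch below is only an illustration under stated assumptions: it does not reproduce the test used by MACS or any other peak caller, the read counts are hypothetical, and the one-sided binomial test (reads in a window are equally likely to come from treatment or control under the null) is chosen for simplicity.

```python
from math import comb

def binomial_tail(k_treat: int, k_ctrl: int) -> float:
    """One-sided p-value P(X >= k_treat) for X ~ Binomial(k_treat + k_ctrl, 0.5)."""
    n = k_treat + k_ctrl
    return sum(comb(n, k) for k in range(k_treat, n + 1)) / 2 ** n

# Hypothetical read counts in one window: 12 treatment reads, 8 control reads.
treat, ctrl = 12, 8
p_values = []
for copies in (1, 2, 4, 8):
    # Duplicating the samples multiplies both counts by the same factor,
    # so the treatment/control ratio stays at 12:8 throughout.
    p = binomial_tail(treat * copies, ctrl * copies)
    p_values.append(p)
    print(f"copies={copies}: treatment={treat * copies}, "
          f"control={ctrl * copies}, p={p:.4f}")
```

The enrichment ratio never changes, yet the p-value shrinks with every duplication, so a window that is not significant at a fixed threshold eventually becomes significant.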
This study describes a Bayesian statistical method using meta-analysis to discover binding sites of a protein of interest based on peaks of reads found in ChIP-Seq data. We call it Meta-Analysis Bayesian Approach, or MABayApp. The model classifies peaks found in regions enriched by these read alignments as significant or non-significant binding sites of a specific protein in the genome, using a qualitative measure of probability. The task of identifying peaks in ChIP-Seq data
[...]
number of false positives. HOMER then finds peaks with a Poisson p-value less than the p-value provided by the user and reports these peaks as binding sites.
Hower et al. (2011) propose a method for the identification of statistically significant peaks in ChIP-Seq data based on topological data analysis, called T-PEAK. In this analysis, the height of each base is determined by the number of reads aligned to that base, and information on the neighbourhood of each site is incorporated to define the shape of the peaks.
T-PEAK builds an initial tree based on the number of alignments at each base and uses a topological algorithm called path excursion to find the corresponding rooted tree. It then uses a tree shape statistic to measure the "peaksness" of each tree. The genome is then divided into regions; T-PEAK identifies possible peaks in these regions and finds the p-value for each of these trees. Finally, it applies a correction for multiple hypothesis testing to remove false-positive peaks.
Spyrou et al. (2009) proposed the statistical algorithm BayesPeak (Bayesian analysis of ChIP-Seq data), which uses a hidden Markov model to find binding sites in the genome.
BayesPeak uses a hidden Markov model with four states. It also uses a sliding window to search for regions in the genome, and it assumes that the dependence between subsequent windows is the same for the whole genome. The state of each window, $S_t$, can be 1 or 0 ($S_t = 1$ if there is a binding site in region $t$ and $S_t = 0$ if there is not). The working states $Z_t$ are composed of subsequent windows ($Z_t = (S_t, S_{t+1})$); thus $Z_t$ can assume one of the four states: (0, 0); (0, 1); (1, 0); (1, 1).
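As a small illustration of how the working states pair consecutive windows, the sketch below enumerates the four possible pairs and builds them from a sequence of window indicators. The indicator sequence is hypothetical and the helper name `working_states` is introduced here only for illustration; it is not part of BayesPeak.

```python
from itertools import product

# The four possible working states (window state, next window state).
WORKING_STATES = list(product((0, 1), repeat=2))

def working_states(window_states):
    """Pair each window indicator with the indicator of the next window."""
    return list(zip(window_states, window_states[1:]))

# Hypothetical binding indicators for five consecutive windows.
s = [0, 0, 1, 1, 0]
z = working_states(s)
print(WORKING_STATES)
print(z)
```

A sequence of n windows yields n - 1 working states, each taking one of the four values listed in `WORKING_STATES`.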
BayesPeak assumes a Poisson-Gamma mixture model and uses Markov Chain Monte Carlo (MCMC) algorithms to sample from the posterior distribution and estimate the parameters of the model. The likelihood expression has no closed form; it is evaluated using a maximization technique for probabilistic functions of Markov chains.
Although BayesPeak does not use hypothesis testing, the likelihood and posterior distributions have no closed form, and sampling from the posterior distribution requires advanced statistical methods. The maximization methods used also increase the uncertainty of the results. Moreover, the method does not allow for multiple control and treatment samples. Finally, according to the authors' experiments, the use of a control sample resulted in an increased number of peaks called (instead of reducing the number of significant peaks), which is unexpected.
1.2 Objective
The main goal of this thesis is to build a robust model, with a strong statistical background, to analyse ChIP-Seq data. This model should allow researchers to use as many control and treatment samples as they have available. Moreover, an increase in the number of samples should increase the accuracy of the model, giving more confidence in the results as the sample size increases.
Together with the model description, a computational tool should be made available to investigators in the area of genetics. This tool should take as input the genome sequencing data resulting from the technique known as ChIP-Seq, and should output a list of genomic positions (initial and final) of the peaks found, ordered by significance. This list should also characterize each peak as a probable binding site of the protein of interest, discarding false-positive peaks. The tool should allow researchers to input data with several experimental replicates, for both treatment and control samples, and the analysis should take all the replicates into consideration together, thus minimizing bias in the results.
1.3 Contribution
The main contributions of this work are:
• Construction of a new robust Bayesian model for ChIP-Seq data analysis considering multiple replicates for both treatment and control samples.
Chapter 2
Method (Biology)
The datasets we analyse result from a method called ChIP-Seq (Chromatin immunoprecipitation followed by sequencing). The goal of this method is to find binding sites of a given protein (of interest) in the DNA. Using ChIP-Seq it is possible to identify a set of genes that are active in a given cell at a certain time, by analysing the regions of DNA responsible for the regulation of these genes. In order to accomplish this, we use specific antibodies that recognize the protein of interest, allowing us to know which region of the DNA this protein is bound to. This method is discussed next.
2.1 ChIP-Seq
The steps of this method, as shown in Deliard et al. (2013), are described below and represented in Figure 2.1.
In this method, the protein is first fixed to the DNA (Figure 2.1a), using a chemical process that makes use of formaldehyde and glycine, called cross-link. This step interrupts the cellular and molecular mechanisms, and this state is preserved through freezing in liquid nitrogen followed by storage at −80 °C.
The DNA is then fragmented, using a process called sonication, in which sound waves break the genetic material (Figure 2.1b). Fragments of a specific length can be selected by using either agarose gel electrophoresis or the result of Polymerase Chain Reaction (PCR).
The immunoprecipitation occurs when specific antibodies are connected to small spheres (called beads). These antibodies recognize the protein of interest and bind to it, attaching to the DNA at the protein's binding sites; this binding between the antibody and the protein is reversible. The protein-DNA complex is then precipitated through a process of centrifugation (Figure 2.1c).
After the centrifugation, the genetic material is purified. In this process, the binding between the antibodies and the proteins is reversed, and the DNA is isolated (Figure 2.1d).
The resulting DNA fragments are enriched using Polymerase Chain Reaction (PCR) before DNA sequencing.
Finally, these fragments are sequenced, and the resulting reads can be aligned to a reference genome (Figure 2.1e). A pile of reads in a given region of the genome is called a peak, and it is a candidate binding site of the protein of interest.
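The notion of a peak as a pile of aligned reads can be made concrete with a short sketch. It assumes reads are given as hypothetical (start, end) intervals on a single reference, with the end position exclusive, and it uses a plain depth threshold to flag candidate regions; this threshold rule is an illustrative assumption, not the peak-calling procedure developed in this thesis.

```python
def coverage(reads, genome_length):
    """Per-base read depth from (start, end) alignments, end exclusive."""
    diff = [0] * (genome_length + 1)
    for start, end in reads:
        diff[start] += 1   # a read begins covering at `start`
        diff[end] -= 1     # and stops covering at `end`
    depth, running = [], 0
    for delta in diff[:genome_length]:
        running += delta
        depth.append(running)
    return depth

def candidate_peaks(depth, min_depth):
    """Maximal runs of positions whose depth is at least min_depth."""
    peaks, start = [], None
    for pos, d in enumerate(depth):
        if d >= min_depth and start is None:
            start = pos
        elif d < min_depth and start is not None:
            peaks.append((start, pos))
            start = None
    if start is not None:
        peaks.append((start, len(depth)))
    return peaks

# Hypothetical alignments on a reference of 30 bases.
reads = [(2, 8), (4, 10), (5, 11), (20, 26)]
depth = coverage(reads, 30)
peaks = candidate_peaks(depth, min_depth=2)
print(peaks)
```

The three overlapping reads form one candidate region, while the isolated read at positions 20 to 25 never reaches the depth threshold.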
Figure 2.2 shows the alignment of sugarcane reads to a reference genome. The genome used as reference is a bacterial artificial chromosome (BAC) of sugarcane. Figures 2.2b, 2.2d and 2.2f show the alignment results for three runs (three replicates) using the enzyme RNA Polymerase II (treatment sample), and Figures 2.2a, 2.2c and 2.2e show the same results for three replicates using Immunoglobulin G (IgG), a commonly used control sample.
[Figure 2.1 (images not recovered). Surviving panel captions: (a) Cross-link of proteins to the DNA. (b) DNA fragmentation through sonication.]
[Figure 2.2 (plots not recovered). Six panels, one per run, each showing the number of alignments (y-axis) along the BAC reference (x-axis, 0e+00 to 1e+05): (a) Replicate 1; Control Sample (IgG). (b) Replicate 1; Treatment Sample (RNA-PolII). (c) Replicate 2; Control Sample (IgG). (d) Replicate 2; Treatment Sample (RNA-PolII). (e) Replicate 3; Control Sample (IgG). (f) Replicate 3; Treatment Sample (RNA-PolII).]
Figure 2.2: Example of ChIP-Seq reads aligned to Sugarcane BACs.
(2015) Schneider and Church (2013). The distribution of the genes and transcripts by chromosome
is shown in Fig. 2.5.
[Figure (plot not recovered): locations of 3UTR, 5UTR and Exon sequence types (legend "seqtype") along chromosomes chr1 to chr19, chrX, chrY and chrM; x-axis from 0 Mb to 200 Mb.]
A total of 94,647 ENSEMBL transcripts were filtered according to the filtering pipeline described in Fig. 2.3. 94,545 transcripts were filtered from random chromosomes; from these transcripts, 38,775 genes were identified. The largest isoform of each gene was selected to define the 5'UTR, gene body and 3'UTR regions of the gene.
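Selecting the largest isoform per gene can be sketched as follows. The record layout (dictionaries with `gene`, `id`, `start` and `end` keys), the identifiers and the use of genomic span as the measure of isoform size are all assumptions made for illustration; the annotation pipeline used in the thesis may measure transcript length differently.

```python
def longest_isoforms(transcripts):
    """Map each gene id to its longest transcript (ties keep the first seen)."""
    best = {}
    for t in transcripts:
        length = t["end"] - t["start"]
        current = best.get(t["gene"])
        if current is None or length > current["end"] - current["start"]:
            best[t["gene"]] = t
    return best

# Hypothetical transcript records; names and coordinates are made up.
transcripts = [
    {"gene": "geneA", "id": "tA1", "start": 100, "end": 900},
    {"gene": "geneA", "id": "tA2", "start": 100, "end": 1500},
    {"gene": "geneB", "id": "tB1", "start": 5000, "end": 5600},
]
selected = longest_isoforms(transcripts)
for gene, t in selected.items():
    print(gene, t["id"], t["end"] - t["start"])
```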
The genes found were classified according to their functions; Fig. 2.7 shows the distribution of gene functions found given the annotation.
The structural classification of the annotated genes is shown in Table 2.1. We compare the length of enriched regions based on their gene structure, in order to find differences in size between the 5'UTR, 3'UTR and gene body regions.
[Plot not recovered: count (y-axis, 0 to 10000) of 3UTR, 5UTR, Gene and Transcript annotations for each chromosome (chr1 to chr19, chrX, chrY, chrM).]
Figure 2.5: Distribution of genes, transcripts, 5UTR and 3UTR regions by chromosome for the mouse genome assembly mm10.
[Plot not recovered: length (log10 scale, 1e+01 to 1e+05) of Gene, Transcript, 5UTR and 3UTR annotations for each chromosome (chr1 to chr19, chrX, chrY, chrM).]
Figure 2.6: Length of genes, transcripts, 5UTR and 3UTR regions by chromosome for the mouse genome assembly mm10.
[Pie chart not recovered: gene types Long ncRNA, Protein Coding, Pseudogene and Small ncRNA. Protein Coding accounts for 73.23%; the remaining slices are 3.88%, 12.47% and 10.42%.]
Chapter 3
Statistical background
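Theorem 1 (Gamma Function). The Gamma Function, represented by the letter $\Gamma$, is defined as follows.
$$\Gamma(y) = \int_0^\infty x^{y-1} e^{-x}\,dx \tag{3.1}$$
Lemma 1 (Derivative of Gamma Function). The derivative of the Gamma Function ($\Gamma'$) is given as follows.
$$\Gamma'(y) = \int_0^\infty x^{y-1} e^{-x} \ln(x)\,dx \tag{3.2}$$
Proof.
$$
\begin{aligned}
\Gamma'(y) = \frac{d}{dy}\Gamma(y) &= \frac{d}{dy}\int_0^\infty x^{y-1} e^{-x}\,dx \\
&= \int_0^\infty \frac{d}{dy}\left[e^{y\ln(x)}\right] x^{-1} e^{-x}\,dx \\
&= \int_0^\infty \left[x^{y}\ln(x)\right] x^{-1} e^{-x}\,dx \\
&= \int_0^\infty x^{y-1} e^{-x}\ln(x)\,dx
\end{aligned}
$$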
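Theorem 2 (Digamma Function). The Digamma function, represented by the letter $\psi$, is defined as follows.
$$\psi(y) = \frac{d}{dy}\ln\Gamma(y) = \frac{\Gamma'(y)}{\Gamma(y)} \tag{3.3}$$
Theorem 3 (Trigamma Function). The Trigamma function, represented by $\psi'$, is defined as follows.
$$\psi'(y) = \frac{d^2}{dy^2}\ln\Gamma(y) = \frac{d}{dy}\psi(y) \tag{3.4}$$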
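Since the Digamma and Trigamma functions are the first and second derivatives of the log-Gamma function, they can be approximated numerically with finite differences. The sketch below is a minimal illustration using only the standard library; the step sizes are assumptions chosen for moderate arguments, and a production implementation would rather use a dedicated routine. The reference values used for comparison (Digamma at 1 equals minus the Euler-Mascheroni constant, Trigamma at 1 equals pi squared over 6) are standard identities.

```python
import math

def digamma(y, h=1e-5):
    """Central-difference estimate of d/dy ln Gamma(y)."""
    return (math.lgamma(y + h) - math.lgamma(y - h)) / (2 * h)

def trigamma(y, h=1e-3):
    """Central second-difference estimate of d^2/dy^2 ln Gamma(y)."""
    return (math.lgamma(y + h) - 2 * math.lgamma(y) + math.lgamma(y - h)) / h ** 2

EULER_GAMMA = 0.5772156649015329  # Euler-Mascheroni constant

print(digamma(1.0), -EULER_GAMMA)       # both close to -0.5772
print(trigamma(1.0), math.pi ** 2 / 6)  # both close to 1.6449
```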
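Theorem 4 (Gamma Distribution). A random variable $X$ has a Gamma distribution with parameters $\alpha$ and $\beta$ if its pdf is given as follows.
$$X \sim \mathrm{Gamma}(\alpha, \beta)$$
$$f_X(x \mid \alpha, \beta) = \frac{\beta^{\alpha}}{\Gamma(\alpha)}\, x^{\alpha-1} e^{-\beta x} \quad \text{for } x \ge 0 \text{ and } \alpha, \beta > 0 \tag{3.5}$$
Proof. If $X \sim \mathrm{Gamma}(\alpha, \beta)$ and $Y = \log(X)$, we have the following relationship between $X$ and $Y$.
$$F_Y(y) = P(\log(X) \le y) = P(X \le e^{y}) = F_X(e^{y})$$
The distribution of $Y$ can then be found using the cumulative distribution of $X$, $F_X(x)$, as follows.
$$f_Y(y) = \frac{d}{dy}F_Y(y) = f_X(e^{y})\, e^{y} = \frac{\beta^{\alpha}}{\Gamma(\alpha)} (e^{y})^{\alpha-1} e^{-\beta e^{y}} e^{y} = \frac{\beta^{\alpha}}{\Gamma(\alpha)}\, e^{\alpha y - \beta e^{y}}$$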
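$$X_1 \perp\!\!\!\perp X_2$$
If a third variable $X_{1:2}$ is equal to the sum of $X_1$ and $X_2$, then $X_{1:2}$ also has a Gamma distribution, with parameters $(\alpha_1 + \alpha_2, \beta)$.
$$f_{X_i}(x_i \mid \alpha_i, \beta) = \frac{\beta^{\alpha_i}}{\Gamma(\alpha_i)}\, x_i^{\alpha_i - 1} e^{-\beta x_i}$$
Footnote 1: Sometimes this distribution is called Exponential-Gamma or [...] Distribution.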
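The additive property can be checked empirically. The following sketch simulates two independent Gamma variables sharing a rate parameter and compares the sample mean and variance of their sum with the values implied by a Gamma distribution whose shape is the sum of the two shapes. The parameter values and sample size are arbitrary assumptions; note that Python's `random.gammavariate` takes a scale argument, which is the reciprocal of the rate used in this chapter.

```python
import random

random.seed(0)
rate = 3.0            # shared rate parameter
a1, a2 = 2.0, 5.0     # shape parameters of the two variables
n = 100_000

# random.gammavariate(shape, scale) with scale = 1 / rate.
sums = [random.gammavariate(a1, 1 / rate) + random.gammavariate(a2, 1 / rate)
        for _ in range(n)]

mean_hat = sum(sums) / n
var_hat = sum((x - mean_hat) ** 2 for x in sums) / n

# Moments of a Gamma distribution with shape a1 + a2 and the same rate.
mean_expected = (a1 + a2) / rate
var_expected = (a1 + a2) / rate ** 2
print(mean_hat, mean_expected)
print(var_hat, var_expected)
```

Matching the first two moments does not prove the distributional result, but it is a quick sanity check on the parameterization.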
$$\cdots = E\left[e^{t x_1} e^{t x_2}\right] = E\left[e^{t x_1}\right] E\left[e^{t x_2}\right]$$
$$Z = \frac{X}{X+Y} \sim \mathrm{Beta}(\alpha_X, \alpha_Y) \tag{3.8}$$
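Lemma 3 (Probability density function (pdf) of the Beta Distribution). The pdf of $Z$ (Beta Distribution) is defined as follows.
$$f_Z(z \mid \alpha_X, \alpha_Y) = \frac{\Gamma(\alpha_X + \alpha_Y)}{\Gamma(\alpha_X)\,\Gamma(\alpha_Y)}\, z^{\alpha_X - 1} (1 - z)^{\alpha_Y - 1} \tag{3.9}$$
Proof. According to Equation 3.8, the variable $Z$ with a Beta Distribution can be defined as follows.
$$Z = \frac{X}{X+Y} \quad \text{for} \quad \begin{cases} X \sim \mathrm{Gamma}(\alpha_X, \beta) \\ Y \sim \mathrm{Gamma}(\alpha_Y, \beta) \end{cases}, \quad X \perp\!\!\!\perp Y$$
The additive property of the Gamma Distribution (Theorem 5) states that the sum $U = X + Y$ has a Gamma Distribution, as follows.
$$U = X + Y \sim \mathrm{Gamma}(\alpha_X + \alpha_Y, \beta)$$
Since $X$ and $Y$ are independent ($X \perp\!\!\!\perp Y$), their joint distribution $f_{X,Y}(x, y)$ is equal to the product of their distributions, as follows.
$$
\begin{aligned}
f_{X,Y}(x, y) &= f_X(x)\, f_Y(y) \\
&= \frac{\beta^{\alpha_X}}{\Gamma(\alpha_X)}\, x^{\alpha_X - 1} e^{-\beta x}\; \frac{\beta^{\alpha_Y}}{\Gamma(\alpha_Y)}\, y^{\alpha_Y - 1} e^{-\beta y} \\
&= \frac{\beta^{(\alpha_X + \alpha_Y)}}{\Gamma(\alpha_X)\,\Gamma(\alpha_Y)}\, x^{\alpha_X - 1} y^{\alpha_Y - 1} e^{-\beta (x + y)}
\end{aligned}
$$
The transformation from $(x, y)$ to $(u, z)$, and the corresponding inverse, are the following.
$$\begin{cases} U = x + y \\ Z = x / (x + y) \end{cases} \;\Rightarrow\; \begin{cases} X = uz \\ Y = u(1 - z) \end{cases}$$
We can find the joint distribution of $U$ and $Z$, $f_{U,Z}(u, z)$, using the joint distribution of the variables
[...]
From the equation above, we can see that $U$ and $Z$ are independent, and the distribution of the variable $Z$ is the pdf of the Beta Distribution, as shown in Equation 3.9.
$$f_{U,Z}(u, z) = f_U(u)\, f_Z(z) \;\Rightarrow\; f_Z(z) = \frac{\Gamma(\alpha_X + \alpha_Y)}{\Gamma(\alpha_X)\,\Gamma(\alpha_Y)}\, z^{\alpha_X - 1} (1 - z)^{\alpha_Y - 1}$$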
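The construction of a Beta variable as a ratio of Gamma variables, and the independence between the ratio and the sum, can be checked by simulation. The sketch below is an illustrative check under assumed parameter values (shapes 2 and 3, rate 1); it compares sample moments of the ratio with the standard Beta mean and variance and verifies that the sample covariance between the sum and the ratio is close to zero, which is a necessary (not sufficient) consequence of independence.

```python
import random

random.seed(0)
a_x, a_y, rate = 2.0, 3.0, 1.0   # assumed shapes and shared rate
n = 100_000

u_vals, z_vals = [], []
for _ in range(n):
    x = random.gammavariate(a_x, 1 / rate)   # scale = 1 / rate
    y = random.gammavariate(a_y, 1 / rate)
    u_vals.append(x + y)          # the sum of the two Gamma variables
    z_vals.append(x / (x + y))    # the ratio, expected to be Beta distributed

z_mean = sum(z_vals) / n
z_var = sum((z - z_mean) ** 2 for z in z_vals) / n
u_mean = sum(u_vals) / n
cov_uz = sum((u - u_mean) * (z - z_mean) for u, z in zip(u_vals, z_vals)) / n

# Standard mean and variance of a Beta distribution with these shapes.
mean_expected = a_x / (a_x + a_y)
var_expected = a_x * a_y / ((a_x + a_y) ** 2 * (a_x + a_y + 1))
print(z_mean, mean_expected)
print(z_var, var_expected)
print(cov_uz)
```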
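$$X_1 \perp\!\!\!\perp X_2 \perp\!\!\!\perp X_3 \perp\!\!\!\perp \cdots \perp\!\!\!\perp X_k$$
The distribution of the vector $\theta$ with components $\theta_i$, where $\theta_i = X_i / \sum_{j=1}^{k} X_j$, is a Dirichlet Distribution with parameters $\alpha_i$, as follows.
$$\theta = (\theta_1, \dots, \theta_k) \sim \mathrm{Dir}(\alpha_1, \dots, \alpha_k) \quad \text{where} \quad \theta_i = \frac{X_i}{\sum_{j=1}^{k} X_j} \tag{3.11}$$
From the definition of Gamma Distribution (Theorem 4), the parameters $\alpha_i$ must satisfy the following conditions.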
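Lemma 4 (Probability density function (pdf) of the Dirichlet Distribution). The probability density function of $\theta$ (Dirichlet Distribution) is defined as follows.
$$f_\theta(\theta_1, \dots, \theta_k \mid \alpha_1, \dots, \alpha_k) = \frac{\Gamma\left(\sum_{i=1}^{k} \alpha_i\right)}{\prod_{i=1}^{k} \Gamma(\alpha_i)} \prod_{i=1}^{k} \theta_i^{\alpha_i - 1} \tag{3.13}$$
Proof. Since $\sum_{i=1}^{k} \theta_i = 1$, we can rewrite the pdf of the Dirichlet Distribution as follows.
$$f_{\theta_1, \dots, \theta_{k-1}}(\theta_1, \dots, \theta_{k-1} \mid \alpha_1, \dots, \alpha_k) = \frac{\Gamma\left(\sum_{i=1}^{k} \alpha_i\right)}{\prod_{i=1}^{k} \Gamma(\alpha_i)} \prod_{i=1}^{k-1} \theta_i^{\alpha_i - 1} \left(1 - \sum_{j=1}^{k-1} \theta_j\right)^{\alpha_k - 1}$$
Using the definition of Gamma Distribution (Theorem 4), we can define the joint distribution of
[...]
$$X_{1:k} = \sum_{i=1}^{k} X_i, \qquad \theta_j = \frac{X_j}{X_{1:k}}, \quad 1 \le j \le k-1$$
The variable $X_{1:k}$ is the sum of independent variables $X_i \sim \mathrm{Gamma}(\alpha_i, \beta)$, and according to the additive property of the Gamma Distribution (Theorem 5), $X_{1:k}$ has a Gamma Distribution with parameters $\left(\sum_{i=1}^{k} \alpha_i, \beta\right)$, and its pdf (Equation 3.5) is defined as follows.
$$f_{X_{1:k}}\left(x_{1:k} \,\middle|\, \sum_{i=1}^{k} \alpha_i, \beta\right) = \frac{\beta^{\sum_{i=1}^{k} \alpha_i}}{\Gamma\left(\sum_{i=1}^{k} \alpha_i\right)}\, x_{1:k}^{\sum_{i=1}^{k} \alpha_i - 1} e^{-\beta x_{1:k}} \tag{3.15}$$
The transformation from $(X_j, X_k)$ to $(\theta_j, X_{1:k})$, where $1 \le j \le k-1$, and the corresponding inverse, are the following.
$$\begin{cases} X_{1:k} = \sum_{i=1}^{k} X_i \\ \theta_j = X_j / X_{1:k}, & 1 \le j \le k-1 \end{cases} \;\Rightarrow\; \begin{cases} X_j = \theta_j\, X_{1:k}, & 1 \le j \le k-1 \\ X_k = X_{1:k}\left(1 - \sum_{j=1}^{k-1} \theta_j\right) \end{cases}$$
$$\det J = \begin{vmatrix} X_{1:k} & 0 & \cdots & 0 & \theta_1 \\ 0 & X_{1:k} & \cdots & 0 & \theta_2 \\ \vdots & \vdots & \ddots & \vdots & \vdots \\ 0 & 0 & \cdots & X_{1:k} & \theta_{k-1} \\ -X_{1:k} & -X_{1:k} & \cdots & -X_{1:k} & 1 - \sum_{j=1}^{k-1} \theta_j \end{vmatrix}$$
Adding the first $k-1$ rows to the last one, we end up with an upper triangular matrix, and we find the determinant by multiplying the elements of the diagonal, as follows.
$$\det J = \begin{vmatrix} X_{1:k} & 0 & \cdots & 0 & \theta_1 \\ 0 & X_{1:k} & \cdots & 0 & \theta_2 \\ \vdots & \vdots & \ddots & \vdots & \vdots \\ 0 & 0 & \cdots & X_{1:k} & \theta_{k-1} \\ 0 & 0 & \cdots & 0 & 1 \end{vmatrix} = (X_{1:k})^{k-1}$$
We can find the joint distribution of $\theta_j$, $1 \le j \le k-1$, and $X_{1:k}$, $f_{\theta_1, \dots, \theta_{k-1}, X_{1:k}}(\theta_1, \dots, \theta_{k-1}, x_{1:k})$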
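[...]
$$
\begin{aligned}
& \frac{\beta^{\sum_{i=1}^{k} \alpha_i}}{\prod_{i=1}^{k} \Gamma(\alpha_i)} \prod_{j=1}^{k-1} \left[ (\theta_j x_{1:k})^{\alpha_j - 1} \right] \left( x_{1:k} \Bigl(1 - \sum_{j=1}^{k-1} \theta_j \Bigr) \right)^{\alpha_k - 1} e^{-\beta \sum_{i=1}^{k-1} \theta_i x_{1:k}}\; e^{-\beta x_{1:k} \left(1 - \sum_{j=1}^{k-1} \theta_j\right)} (x_{1:k})^{k-1} \\
={}& \frac{\beta^{\sum_{i=1}^{k} \alpha_i}}{\prod_{i=1}^{k} \Gamma(\alpha_i)}\, x_{1:k}^{\sum_{i=1}^{k} (\alpha_i - 1) + k - 1} \prod_{j=1}^{k-1} \left[ \theta_j^{\alpha_j - 1} \right] \Bigl(1 - \sum_{j=1}^{k-1} \theta_j \Bigr)^{\alpha_k - 1} e^{-\beta x_{1:k} \sum_{i=1}^{k-1} \theta_i}\; e^{-\beta x_{1:k} \left(1 - \sum_{j=1}^{k-1} \theta_j\right)} \\
={}& \frac{\beta^{\sum_{i=1}^{k} \alpha_i}\, x_{1:k}^{\sum_{i=1}^{k} \alpha_i - 1}\, e^{-\beta x_{1:k}}}{\prod_{i=1}^{k} \Gamma(\alpha_i)} \prod_{j=1}^{k-1} \left[ \theta_j^{\alpha_j - 1} \right] \Bigl(1 - \sum_{j=1}^{k-1} \theta_j \Bigr)^{\alpha_k - 1} e^{-\beta x_{1:k} \sum_{i=1}^{k-1} \theta_i}\; e^{\beta x_{1:k} \sum_{j=1}^{k-1} \theta_j} \\
={}& \underbrace{\frac{\beta^{\sum_{i=1}^{k} \alpha_i}\, x_{1:k}^{\sum_{i=1}^{k} \alpha_i - 1}\, e^{-\beta x_{1:k}}}{\Gamma\left(\sum_{i=1}^{k} \alpha_i\right)}}_{f_{X_{1:k}}\left(x_{1:k} \mid \sum_{i=1}^{k} \alpha_i,\, \beta\right) \text{ as in Equation 3.15}} \; \frac{\Gamma\left(\sum_{i=1}^{k} \alpha_i\right)}{\prod_{i=1}^{k} \Gamma(\alpha_i)} \prod_{j=1}^{k-1} \left[ \theta_j^{\alpha_j - 1} \right] \Bigl(1 - \sum_{j=1}^{k-1} \theta_j \Bigr)^{\alpha_k - 1}
\end{aligned}
$$
From the equation above, we can see that the vector $(\theta_1, \dots, \theta_{k-1})$ and the variable $X_{1:k}$ are independent, and the distribution of the vector of variables $(\theta_1, \dots, \theta_{k-1})$ is the pdf of the Dirichlet Distribution, as shown in Equation 3.13.
$$f_{\theta_1, \dots, \theta_{k-1}}(\theta_1, \dots, \theta_{k-1}) = \frac{\Gamma\left(\sum_{i=1}^{k} \alpha_i\right)}{\prod_{i=1}^{k} \Gamma(\alpha_i)} \prod_{j=1}^{k-1} \left[ \theta_j^{\alpha_j - 1} \right] \Bigl(1 - \sum_{j=1}^{k-1} \theta_j \Bigr)^{\alpha_k - 1}$$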
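The construction used in this proof, dividing independent Gamma variables by their sum, is also a practical way to sample from a Dirichlet distribution. The sketch below illustrates it with assumed parameters (1, 2, 3) and unit rate, and compares the sample means of the components with the standard Dirichlet means (each parameter divided by the sum of the parameters). The helper name `dirichlet_sample` is introduced here for illustration only.

```python
import random

random.seed(1)
alphas = (1.0, 2.0, 3.0)   # assumed Dirichlet parameters

def dirichlet_sample(alphas):
    """Normalize independent Gamma(alpha_i, 1) draws so they sum to one."""
    gammas = [random.gammavariate(a, 1.0) for a in alphas]
    total = sum(gammas)
    return [g / total for g in gammas]

n = 50_000
samples = [dirichlet_sample(alphas) for _ in range(n)]
means = [sum(s[i] for s in samples) / n for i in range(len(alphas))]
expected = [a / sum(alphas) for a in alphas]
print(means)
print(expected)
```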
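2. Inductive step: if it holds for $X_{1:n} = X_1 + X_2 + \cdots + X_n$, then it holds for $X_{1:n+1} = X_{1:n} + X_{n+1}$.
The proof follows directly from the definition of Dirichlet Distribution given in Theorem 7 and the additive property of the Gamma Distribution given in Theorem 5.
According to Theorem 5, the sum of two independent Gamma distributed variables is another Gamma distributed variable, as follows.
$$X_1 \perp\!\!\!\perp X_2 \;\Rightarrow\; X_1 + X_2 \sim \mathrm{Gamma}(\alpha_1 + \alpha_2, 1)$$
Using these $k-1$ independent Gamma distributed variables $[X_1 + X_2], X_3, \dots, X_k$, we can define another array of variables $[\theta_1 + \theta_2], \theta_3, \dots, \theta_k$ that will have a Dirichlet Distribution, as follows.
$$\begin{cases} X_1 + X_2 \sim \mathrm{Gamma}(\alpha_1 + \alpha_2, 1) \\ X_j \sim \mathrm{Gamma}(\alpha_j, 1), & \text{for } 3 \le j \le k \end{cases}$$
$$\left( \frac{X_1 + X_2}{\sum_{i=1}^{k} X_i}, \dots, \frac{X_k}{\sum_{i=1}^{k} X_i} \right) = \bigl(\theta_1 + \theta_2, \theta_3, \dots, \theta_k\bigr) \sim \mathrm{Dir}\bigl((\alpha_1 + \alpha_2), \alpha_3, \dots, \alpha_k\bigr)$$
Suppose the sum of $n$ independent random variables has a Gamma distribution with parameters $\alpha_{1:n}$ and $\beta$ (here $\beta = 1$).
$$X_{1:n} = X_1 + X_2 + \cdots + X_n \sim \mathrm{Gamma}(\alpha_1 + \alpha_2 + \cdots + \alpha_n, 1)$$
Suppose also that another random variable is independent of the $n$ variables above and has a Gamma distribution with parameters $\alpha_{n+1}$ and the same $\beta$.
$$X_{n+1} \sim \mathrm{Gamma}(\alpha_{n+1}, 1)$$
According to Theorem 5, the sum of these $n+1$ variables will also have a Gamma distribution, with parameters $\alpha_{1:n} + \alpha_{n+1}$ and $\beta$, as follows.
$$X_{1:n+1} = X_{1:n} + X_{n+1} \sim \mathrm{Gamma}\left( \sum_{i=1}^{n} [\alpha_i] + \alpha_{n+1}, 1 \right)$$
Once again, from the definition of Dirichlet Distribution, we can divide each of the $k-n$ independent Gamma distributed variables by their sum, to define a new Dirichlet Distribution, as follows.
$$\begin{cases} X_{1:n+1} \sim \mathrm{Gamma}\left(\sum_{i=1}^{n+1} [\alpha_i], 1\right) \\ X_j \sim \mathrm{Gamma}(\alpha_j, 1), & \text{for } (n+2) \le j \le k \end{cases}$$
$$\left( \frac{\sum_{i=1}^{n+1} X_i}{\sum_{i=1}^{k} X_i}, \dots, \frac{X_k}{\sum_{i=1}^{k} X_i} \right) = \left( \sum_{i=1}^{n+1} \theta_i, \theta_{n+2}, \dots, \theta_k \right) \sim \mathrm{Dir}\left( \sum_{i=1}^{n+1} [\alpha_i], \alpha_{n+2}, \dots, \alpha_k \right)$$
One can repeat this process, starting from integers $r_1, r_2, \dots, r_n$ such that $1 \le r_1 \le \cdots \le r_n = k$, to arrive at Equation 3.16.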
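The aggregation step can be illustrated numerically. In the sketch below, a three-component vector is built from independent Gamma draws with assumed shapes (1, 2, 3), the first two components are merged, and the merged component is compared with the mean and variance of a Beta distribution with parameters 3 and 3 (a two-component Dirichlet is a Beta distribution). The parameter values and tolerances are illustrative assumptions.

```python
import random

random.seed(2)
alphas = (1.0, 2.0, 3.0)   # assumed shapes, unit rate
n = 50_000

merged = []
for _ in range(n):
    g = [random.gammavariate(a, 1.0) for a in alphas]
    merged.append((g[0] + g[1]) / sum(g))   # first two components summed

# After merging, the pair (merged, remainder) has parameters (a, b).
a, b = alphas[0] + alphas[1], alphas[2]
mean_expected = a / (a + b)
var_expected = a * b / ((a + b) ** 2 * (a + b + 1))

mean_hat = sum(merged) / n
var_hat = sum((m - mean_hat) ** 2 for m in merged) / n
print(mean_hat, mean_expected)
print(var_hat, var_expected)
```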
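and $\theta$ has a Dirichlet Distribution, then each variable $\theta_i$ has a Beta Distribution, as follows.
$$\text{If } \theta = (\theta_1, \dots, \theta_k) \sim \mathrm{Dir}(\alpha_1, \dots, \alpha_k), \text{ then } \theta_i \sim \mathrm{Beta}\left( \alpha_i, \sum_{j=1}^{k} [\alpha_j] - \alpha_i \right) \tag{3.17}$$
Proof. The proof follows in a straightforward way from the definition of Dirichlet Distribution given in Theorem 7 and the additive property of the Dirichlet Distribution given in Theorem 8.
Suppose $\theta = (\theta_1, \dots, \theta_k)$ has a Dirichlet Distribution with parameters given by the vector $\alpha = (\alpha_1, \dots, \alpha_k)$.
$$\theta = (\theta_1, \dots, \theta_k) \sim \mathrm{Dir}(\alpha_1, \dots, \alpha_k) \tag{3.18}$$
By the Additive Property of the Dirichlet Distribution (Theorem 8), for any parameter $\theta_j$ (where $1 \le j \le k$), if we sum up all the remaining parameters in $\theta$ (all $\theta_i$, where $1 \le i \ne j \le k$), we will have the following Dirichlet Distribution.
$$\left( \theta_j, \sum_{\substack{i=1 \\ i \ne j}}^{k} \theta_i \right) \sim \mathrm{Dir}\left( \alpha_j, \sum_{\substack{i=1 \\ i \ne j}}^{k} \alpha_i \right)$$
From the conditions given in Equation 3.12, we know that $\sum_{i=1}^{k} \theta_i = 1$; therefore we can write the distribution above as follows.
$$(\theta_j, 1 - \theta_j) \sim \mathrm{Dir}\left( \alpha_j, \sum_{i=1}^{k} [\alpha_i] - \alpha_j \right)$$
Using the definition of Dirichlet Distribution given in Equation 3.13, we can write this distribution as follows.
$$f\left( \theta_j, 1 - \theta_j \,\middle|\, \alpha_j, \sum_{i=1}^{k} [\alpha_i] - \alpha_j \right) = \frac{\Gamma\left( \alpha_j + \sum_{i=1}^{k} [\alpha_i] - \alpha_j \right)}{\Gamma(\alpha_j)\, \Gamma\left( \sum_{i=1}^{k} [\alpha_i] - \alpha_j \right)}\; \theta_j^{\alpha_j - 1} (1 - \theta_j)^{\sum_{i=1}^{k} [\alpha_i] - \alpha_j - 1}$$
The distribution above is a Beta Distribution (Equation 3.9) with parameters $\left( \alpha_j, \sum_{i=1}^{k} [\alpha_i] - \alpha_j \right)$.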
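$$(\theta_1, \dots, \theta_k) \sim \mathrm{Dir}(\alpha_1, \dots, \alpha_k) \;\Rightarrow\; \begin{cases} \theta_k \sim \mathrm{Beta}\left( \alpha_k, \sum_{j=1}^{k-1} \alpha_j \right) \\[4pt] \dfrac{1}{1 - \theta_k} (\theta_1, \dots, \theta_{k-1}) \sim \mathrm{Dir}(\alpha_1, \dots, \alpha_{k-1}) \\[4pt] \dfrac{1}{1 - \theta_k} (\theta_1, \dots, \theta_{k-1}) \perp\!\!\!\perp \theta_k \end{cases}$$
Proof. The proof of the first two statements above follows in a straightforward way from the definition of Dirichlet Distribution (Theorem 7) and the Beta marginal distribution of the Dirichlet Distribution (Theorem 9).
$$(\theta_1, \dots, \theta_k) \sim \mathrm{Dir}(\alpha_1, \dots, \alpha_k) \;\Rightarrow\; \theta_i = \frac{X_i}{\sum_{i=1}^{k} X_i}, \quad \text{for } \begin{cases} X_i \sim \mathrm{Gamma}(\alpha_i, 1), & 1 \le i \le k \\ X_1 \perp\!\!\!\perp X_2 \perp\!\!\!\perp \cdots \perp\!\!\!\perp X_k \end{cases}$$
From Theorem 9, we have:
$$\theta_k = \frac{X_k}{\sum_{i=1}^{k} X_i} \sim \mathrm{Beta}\left( \alpha_k, \sum_{j=1}^{k-1} \alpha_j \right)$$
For $1 \le j \le k-1$:
$$\frac{1}{1 - \theta_k}\, \theta_j = \frac{1}{\sum_{l=1}^{k-1} \theta_l}\, \theta_j = \frac{X_j \big/ \sum_{i=1}^{k} X_i}{\sum_{l=1}^{k-1} X_l \big/ \sum_{i=1}^{k} X_i} = \frac{X_j}{\sum_{l=1}^{k-1} X_l}$$
$$\frac{1}{\sum_{j=1}^{k-1} \theta_j} (\theta_1, \dots, \theta_{k-1}) = \underbrace{\left( \frac{X_1}{\sum_{j=1}^{k-1} X_j}, \dots, \frac{X_{k-1}}{\sum_{j=1}^{k-1} X_j} \right) \sim \mathrm{Dir}(\alpha_1, \dots, \alpha_{k-1})}_{\text{Theorem 7}}$$
In order to prove the last statement, $\theta_k \perp\!\!\!\perp \frac{1}{1 - \theta_k} (\theta_1, \dots, \theta_{k-1})$, we use the following transformation.
$$\begin{cases} Q_j = \theta_j / (1 - \theta_k), & 1 \le j \le k-2 \\ Q_k = \theta_k \end{cases} \;\Rightarrow\; \begin{cases} \theta_j = Q_j (1 - Q_k), & 1 \le j \le k-2 \\ \theta_k = Q_k \end{cases}$$
$$\det J = \begin{vmatrix} 1 - Q_k & 0 & \cdots & 0 & -Q_1 \\ 0 & 1 - Q_k & \cdots & 0 & -Q_2 \\ \vdots & \vdots & \ddots & \vdots & \vdots \\ 0 & 0 & \cdots & 1 - Q_k & -Q_{k-2} \\ 0 & 0 & \cdots & 0 & 1 \end{vmatrix} = (1 - Q_k)^{k-2}$$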
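We can find the joint distribution of $Q_i$, $f_Q(q_1, \dots, q_k)$, using the joint distribution of the variables $\theta_i$ (Equation 3.19), as follows.
$$
\begin{aligned}
f_Q(q_1, \dots, q_k) &= \left[ f_\theta(q_1, \dots, q_k) \right] (1 - q_k)^{k-2} \\
&= \frac{\Gamma\left(\sum_{i=1}^{k} \alpha_i\right)}{\prod_{i=1}^{k} \Gamma(\alpha_i)} \underbrace{\prod_{j=1}^{k-2} \left[ (q_j (1 - q_k))^{\alpha_j - 1} \right]}_{\theta_1, \dots, \theta_{k-2}} \underbrace{\Bigl( 1 - \sum_{j=1}^{k-2} [q_j (1 - q_k)] - q_k \Bigr)^{\alpha_{k-1} - 1}}_{\theta_{k-1}} \underbrace{(q_k)^{\alpha_k - 1}}_{\theta_k} (1 - q_k)^{k-2} \\
&= \frac{\Gamma\left(\sum_{i=1}^{k} \alpha_i\right)}{\prod_{i=1}^{k} \Gamma(\alpha_i)} \prod_{j=1}^{k-2} \left[ (q_j (1 - q_k))^{\alpha_j - 1} \right] \Bigl( (1 - q_k) \Bigl( 1 - \sum_{j=1}^{k-2} q_j \Bigr) \Bigr)^{\alpha_{k-1} - 1} (q_k)^{\alpha_k - 1} (1 - q_k)^{k-2} \\
&= \frac{\Gamma\left(\sum_{i=1}^{k} \alpha_i\right)}{\prod_{i=1}^{k} \Gamma(\alpha_i)} \prod_{j=1}^{k-2} \left[ q_j^{\alpha_j - 1} \right] \Bigl( 1 - \sum_{j=1}^{k-2} q_j \Bigr)^{\alpha_{k-1} - 1} (1 - q_k)^{\sum_{j=1}^{k-1} (\alpha_j - 1) + k - 2} (q_k)^{\alpha_k - 1} \\
&= \frac{\Gamma\left(\sum_{i=1}^{k} \alpha_i\right)}{\prod_{i=1}^{k} \Gamma(\alpha_i)} \prod_{j=1}^{k-2} \left[ q_j^{\alpha_j - 1} \right] \Bigl( 1 - \sum_{j=1}^{k-2} q_j \Bigr)^{\alpha_{k-1} - 1} (q_k)^{\alpha_k - 1} (1 - q_k)^{\sum_{j=1}^{k-1} \alpha_j - 1} \\
&= \frac{\Gamma\left(\sum_{i=1}^{k} \alpha_i\right) \Gamma\left(\sum_{j=1}^{k-1} \alpha_j\right)}{\prod_{i=1}^{k} \Gamma(\alpha_i)\; \Gamma\left(\sum_{j=1}^{k-1} \alpha_j\right)} \prod_{j=1}^{k-2} \left[ q_j^{\alpha_j - 1} \right] \Bigl( 1 - \sum_{j=1}^{k-2} q_j \Bigr)^{\alpha_{k-1} - 1} (q_k)^{\alpha_k - 1} (1 - q_k)^{\sum_{j=1}^{k-1} \alpha_j - 1} \\
&= \frac{\Gamma\left(\sum_{j=1}^{k-1} \alpha_j\right)}{\prod_{i=1}^{k-1} \Gamma(\alpha_i)} \prod_{j=1}^{k-2} \left[ q_j^{\alpha_j - 1} \right] \Bigl( 1 - \sum_{j=1}^{k-2} q_j \Bigr)^{\alpha_{k-1} - 1} \underbrace{\frac{\Gamma\left(\alpha_k + \sum_{i=1}^{k-1} \alpha_i\right)}{\Gamma(\alpha_k)\, \Gamma\left(\sum_{j=1}^{k-1} \alpha_j\right)} (q_k)^{\alpha_k - 1} (1 - q_k)^{\sum_{j=1}^{k-1} \alpha_j - 1}}_{q_k = \theta_k \sim \mathrm{Beta}\left(\alpha_k,\, \sum_{j=1}^{k-1} \alpha_j\right)}
\end{aligned}
$$
From the equation above, we can see that the vector $(Q_1, \dots, Q_{k-1}) = \frac{1}{1 - \theta_k} (\theta_1, \dots, \theta_{k-1})$ and the variable $Q_k$ are independent, and the distribution of the vector of variables $\frac{1}{\sum_{j=1}^{k-1} Q_j} (Q_1, \dots, Q_{k-1})$ is the pdf of the Dirichlet Distribution, as follows.
$$f_{Q_1, \dots, Q_{k-1}}(q_1, \dots, q_{k-1}) = \frac{\Gamma\left(\sum_{j=1}^{k-1} \alpha_j\right)}{\prod_{i=1}^{k-1} \Gamma(\alpha_i)} \prod_{j=1}^{k-2} \left[ q_j^{\alpha_j - 1} \right] \Bigl( 1 - \sum_{j=1}^{k-2} q_j \Bigr)^{\alpha_{k-1} - 1}$$
$$\theta_k \perp\!\!\!\perp \frac{1}{1 - \theta_k} (\theta_1, \dots, \theta_{k-1}) \quad \text{and} \quad \frac{1}{1 - \theta_k} (\theta_1, \dots, \theta_{k-1}) \sim \mathrm{Dir}(\alpha_1, \dots, \alpha_{k-1})$$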
$$X \sim \mathrm{Beta}(\alpha, \beta), \quad \alpha, \beta > 0$$
The distribution of the variable $\eta = \ln(X / (1 - X))$ is approximately Normal, with mean and variance equal to $\mu = \psi(\alpha) - \psi(\beta)$ and $\sigma^2 = \psi'(\alpha) + \psi'(\beta)$, respectively, where $\psi(\cdot)$ is the Digamma function defined in Theorem 2 and $\psi'(\cdot)$ is the Trigamma function defined in Theorem 3. This approximation is accurate for large values of the parameters $\alpha$ and $\beta$.
$$\eta = \ln\left( \frac{X}{1 - X} \right) \overset{\text{approx}}{\sim} N\Bigl( \psi(\alpha) - \psi(\beta),\; \psi'(\alpha) + \psi'(\beta) \Bigr) \tag{3.20}$$
We first prove that: (1) $\eta$ can be written as the difference of the logs of two Gamma distributed variables; (2) the mean of $\eta$ is equal to $\psi(\alpha) - \psi(\beta)$; (3) the variance of $\eta$ is equal to $\psi'(\alpha) + \psi'(\beta)$; (4) the distribution of the log of a variable with Gamma distribution is approximately Normal, and this approximation is accurate for large values of the shape parameter of the Gamma distribution. We then put all these results together to show that, for large values of $\alpha$ and $\beta$, the approximation to the Normal distribution given in Equation 3.20 is accurate.
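The mean and variance stated above can be checked by simulation. The sketch below draws Beta samples with assumed parameters (20 and 30), takes the logit of each draw, and compares the sample mean and variance with the difference of Digamma values and the sum of Trigamma values. The Digamma and Trigamma functions are approximated by finite differences of `math.lgamma`, which is an assumption of this sketch rather than the method used in the thesis; the helper names are hypothetical.

```python
import math
import random

def digamma(y, h=1e-5):
    """Finite-difference estimate of the first derivative of ln Gamma."""
    return (math.lgamma(y + h) - math.lgamma(y - h)) / (2 * h)

def trigamma(y, h=1e-3):
    """Finite-difference estimate of the second derivative of ln Gamma."""
    return (math.lgamma(y + h) - 2 * math.lgamma(y) + math.lgamma(y - h)) / h ** 2

random.seed(3)
a, b = 20.0, 30.0   # assumed Beta parameters
n = 100_000

logits = []
for _ in range(n):
    x = random.betavariate(a, b)
    logits.append(math.log(x / (1 - x)))

mean_hat = sum(logits) / n
var_hat = sum((v - mean_hat) ** 2 for v in logits) / n

mean_theory = digamma(a) - digamma(b)
var_theory = trigamma(a) + trigamma(b)
print(mean_hat, mean_theory)
print(var_hat, var_theory)
```

This check covers only the first two moments; how close the shape is to a Normal curve depends on the size of the two parameters, as stated above.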
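Lemma 5 (The logit of the parameter of a Beta Distribution is equal to a difference of the logs of independent Gamma distributed variables). The variable $\eta = \ln(X / (1 - X))$ is equal to the difference between the logs of two independent Gamma distributed variables, as follows.
$$\eta = \ln\left( \frac{X}{1 - X} \right) = \ln(Y) - \ln(Z), \quad \text{where } b > 0, \quad \begin{cases} Y \sim \mathrm{Gamma}(\alpha, b) \\ Z \sim \mathrm{Gamma}(\beta, b) \end{cases}, \quad Y \perp\!\!\!\perp Z$$
Proof. By our definition of Beta Distribution, given in Theorem 6, $X$ can be defined as $Y / (Y + Z)$, where $Y$ and $Z$ are independent Gamma distributed variables, as follows.
$$X = \frac{Y}{Y + Z} \quad \text{for } b > 0, \quad \begin{cases} Y \sim \mathrm{Gamma}(\alpha, b) \\ Z \sim \mathrm{Gamma}(\beta, b) \end{cases}, \quad Y \perp\!\!\!\perp Z$$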
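Lemma 6 (Mean of the logit of the parameter of a Beta Distribution is the difference of Digamma functions). The mean of $\eta$ (i.e., $E[\eta]$) is equal to $\psi(\alpha) - \psi(\beta)$, where $\psi$ is the Digamma function defined in Theorem 2, as follows.
$$E[\eta] = \psi(\alpha) - \psi(\beta) \tag{3.22}$$
Proof. As in Equation 3.21, we can define $\eta$ as the difference of the logs of two independent Gamma distributed variables $Y$ and $Z$, as follows.
$$\eta = \ln(Y) - \ln(Z) \quad \text{for } b > 0, \quad \begin{cases} Y \sim \mathrm{Gamma}(\alpha, b) \\ Z \sim \mathrm{Gamma}(\beta, b) \end{cases}, \quad Y \perp\!\!\!\perp Z$$
$$Y \perp\!\!\!\perp Z \;\Rightarrow\; E[\eta] = E[\ln(Y) - \ln(Z)] = E[\ln(Y)] - E[\ln(Z)]$$
Using the definition of Gamma distribution (Equation 3.5), we can find the expected value of $\ln(Y)$ as follows, where the third line uses the substitution $z = yb$.
$$
\begin{aligned}
E[\ln(Y)] &= \int_0^\infty \frac{b^{\alpha}}{\Gamma(\alpha)}\, y^{\alpha - 1} e^{-yb} \ln(y)\, dy \\
&= \frac{1}{\Gamma(\alpha)} \int_0^\infty b^{\alpha} y^{\alpha - 1} e^{-yb} \ln(y)\, dy \\
&= \frac{1}{\Gamma(\alpha)} \int_0^\infty z^{\alpha - 1} e^{-z} \left[ \ln(z) - \ln(b) \right] dz \\
&= -\ln b + \underbrace{\frac{\Gamma'(\alpha)}{\Gamma(\alpha)}}_{\psi(\alpha) \text{, as in Theorem 2}} = \psi(\alpha) - \ln b
\end{aligned}
\tag{3.23}
$$
Therefore, the expected value of $\ln(Y)$ is equal to $\psi(\alpha) - \ln b$, and since $Z$ has the same distribution as $Y$, except for the parameter $\beta$ instead of $\alpha$, the expected value of $\ln(Z)$ is equal to $\psi(\beta) - \ln b$. We can then find the expected value of $\eta$.
$$E[\eta] = \bigl( \psi(\alpha) - \ln b \bigr) - \bigl( \psi(\beta) - \ln b \bigr) = \psi(\alpha) - \psi(\beta)$$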
Lemma 7 (The Variance of the logit of the parameter of a Beta Distribution is the sum of Trigamma functions). The variance of $\ln(X/(1-X))$ is equal to $\psi'(\alpha) + \psi'(\beta)$, where $\psi'$ is the Trigamma function defined in Theorem 3, as follows.
$$\mathrm{Var}\left[\ln\left(\frac{X}{1-X}\right)\right] = \psi'(\alpha) + \psi'(\beta)$$
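Proof. We use again the relationship of Equation 3.21, writing $\ln(X/(1-X))$ as the difference of the logarithms of two independent Gamma distributed variables $Y$ and $Z$. Since $Y \perp Z$, the variances add, as follows.
$$\mathrm{Var}\left[\ln\left(\frac{X}{1-X}\right)\right] = \mathrm{Var}[\ln(Y) - \ln(Z)] = \mathrm{Var}[\ln(Y)] + \mathrm{Var}[\ln(Z)] \tag{3.24}$$
We use the definition of the Gamma distribution (Theorem 3.5) to find the variance of $\ln(Y)$ as follows.
$$\mathrm{Var}[\ln(Y)] = E\left[\ln(Y)^2\right] - E[\ln(Y)]^2$$
We replace the expected value of $\ln(Y)$ with the expression in Equation 3.23 and use the definition of the Gamma distribution (Theorem 3.5), as follows.
$$\begin{aligned}
\mathrm{Var}[\ln(Y)] &= E\left[\ln(Y)^2\right] - \left(\psi(\alpha) - \ln(b)\right)^2 \\
&= \int_0^\infty \frac{b^{\alpha}}{\Gamma(\alpha)}\, y^{\alpha-1} e^{-yb} \ln(y)^2\,dy - \left(\psi(\alpha) - \ln(b)\right)^2
\end{aligned} \tag{3.25}$$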
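Making $z = yb$, we have $dz = b\,dy$, and we can find $E\left[\ln(Y)^2\right]$ as follows.
$$\begin{aligned}
E\left[\ln(Y)^2\right] &= \frac{1}{\Gamma(\alpha)} \int_0^\infty b^{\alpha} \left(\frac{z}{b}\right)^{\alpha-1} e^{-z} \left[\ln\left(\frac{z}{b}\right)\right]^2 \frac{dz}{b} \\
&= \frac{1}{\Gamma(\alpha)} \int_0^\infty b^{\alpha}\, \frac{z^{\alpha-1}}{b^{\alpha-1}}\, e^{-z} \left[\ln(z) - \ln(b)\right]^2 \frac{dz}{b} \\
&= \frac{1}{\Gamma(\alpha)} \int_0^\infty z^{\alpha-1} e^{-z} \left[\ln(z) - \ln(b)\right]^2 dz \\
&= \frac{1}{\Gamma(\alpha)} \int_0^\infty \ln(z)^2 z^{\alpha-1} e^{-z}\,dz - \frac{2\ln(b)}{\Gamma(\alpha)} \underbrace{\int_0^\infty \ln(z)\, z^{\alpha-1} e^{-z}\,dz}_{\Gamma'(\alpha)\ \text{as in Equation 3.2}} + \frac{\ln(b)^2}{\Gamma(\alpha)} \underbrace{\int_0^\infty z^{\alpha-1} e^{-z}\,dz}_{\Gamma(\alpha)\ \text{as in Theorem 1}} \\
&= \frac{1}{\Gamma(\alpha)} \int_0^\infty \ln(z)^2 z^{\alpha-1} e^{-z}\,dz - 2\ln(b) \underbrace{\frac{\Gamma'(\alpha)}{\Gamma(\alpha)}}_{\psi(\alpha)\ \text{as in Theorem 2}} + \frac{\ln(b)^2}{\Gamma(\alpha)}\,\Gamma(\alpha) \\
&= \frac{1}{\Gamma(\alpha)} \int_0^\infty \ln(z)^2 z^{\alpha-1} e^{-z}\,dz - 2\psi(\alpha)\ln(b) + \ln(b)^2 \\
&= \frac{1}{\Gamma(\alpha)} \int_0^\infty \ln(z)^2 z^{\alpha-1} e^{-z}\,dz - \psi(\alpha)^2 + \left(\psi(\alpha) - \ln(b)\right)^2
\end{aligned}$$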
Taking the derivative of the term $z^{\alpha-1}$ (inside the integral) with respect to the variable $\alpha$, we have the following result.
$$z^{\alpha-1} = e^{\ln\left(z^{\alpha-1}\right)} = e^{(\alpha-1)\ln(z)}$$
$$\frac{d}{d\alpha} z^{\alpha-1} = \ln(z)\, e^{(\alpha-1)\ln(z)} = \ln(z)\, z^{\alpha-1}$$
$$\frac{d^2}{d\alpha^2} z^{\alpha-1} = \ln(z)^2\, e^{(\alpha-1)\ln(z)} = \ln(z)^2\, z^{\alpha-1}$$
We can then replace the term $\ln(z)^2 z^{\alpha-1}$ by $\frac{d^2}{d\alpha^2} z^{\alpha-1}$ in the integral, as follows.
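$$\begin{aligned}
E\left[\ln(Y)^2\right] &= \frac{1}{\Gamma(\alpha)} \int_0^\infty \frac{d^2}{d\alpha^2} z^{\alpha-1} e^{-z}\,dz - \psi(\alpha)^2 + \left(\psi(\alpha) - \ln(b)\right)^2 \\
&= \frac{1}{\Gamma(\alpha)} \frac{d^2}{d\alpha^2} \underbrace{\int_0^\infty z^{\alpha-1} e^{-z}\,dz}_{\Gamma(\alpha)\ \text{as in Theorem 1}} - \underbrace{\psi(\alpha)^2}_{\left(\Gamma'(\alpha)/\Gamma(\alpha)\right)^2\ \text{as in Theorem 2}} + \left(\psi(\alpha) - \ln(b)\right)^2 \\
&= \frac{1}{\Gamma(\alpha)} \frac{d^2}{d\alpha^2}\Gamma(\alpha) - \left(\frac{\Gamma'(\alpha)}{\Gamma(\alpha)}\right)^2 + \left(\psi(\alpha) - \ln(b)\right)^2 \\
&= \frac{1}{\Gamma(\alpha)} \frac{d}{d\alpha}\Gamma'(\alpha) - \left(\frac{\Gamma'(\alpha)}{\Gamma(\alpha)}\right)^2 + \left(\psi(\alpha) - \ln(b)\right)^2 \\
&= \underbrace{\left[\left(\frac{d}{d\alpha}\Gamma'(\alpha)\right)\Gamma(\alpha) - \Gamma'(\alpha)\left(\frac{d}{d\alpha}\Gamma(\alpha)\right)\right] \Big/ \Gamma(\alpha)^2}_{\text{quotient rule of derivatives for } \Gamma'(\alpha)/\Gamma(\alpha)} + \left(\psi(\alpha) - \ln(b)\right)^2 \\
&= \frac{d}{d\alpha}\frac{\Gamma'(\alpha)}{\Gamma(\alpha)} + \left(\psi(\alpha) - \ln(b)\right)^2 \\
&= \underbrace{\frac{d^2}{d\alpha^2}\ln\Gamma(\alpha)}_{\psi'(\alpha)\ \text{according to Theorem 3}} + \left(\psi(\alpha) - \ln(b)\right)^2 \\
&= \psi'(\alpha) + \left(\psi(\alpha) - \ln(b)\right)^2
\end{aligned}$$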
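Substituting this result for $E\left[\ln(Y)^2\right]$ back into Equation 3.25, we can find $\mathrm{Var}[\ln(Y)]$, as follows.
$$\begin{aligned}
\mathrm{Var}[\ln(Y)] &= E\left[\ln(Y)^2\right] - \left(\psi(\alpha) - \ln(b)\right)^2 \\
&= \psi'(\alpha) + \left(\psi(\alpha) - \ln(b)\right)^2 - \left(\psi(\alpha) - \ln(b)\right)^2 = \psi'(\alpha)
\end{aligned} \tag{3.26}$$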
Since the variable $Z$ in Equation 3.24 has the same distribution as $Y$, except for the parameter $\beta$ replacing $\alpha$, the calculations above show that the variance of $\ln(Z)$ is equal to $\psi'(\beta)$. We can thus find the variance of $\ln(X/(1-X))$, using Equation 3.24 and $Y \perp Z$, as follows.
$$\mathrm{Var}\left[\ln\left(\frac{X}{1-X}\right)\right] = \mathrm{Var}[\ln(Y) - \ln(Z)] = \mathrm{Var}[\ln(Y)] + \mathrm{Var}[\ln(Z)] = \psi'(\alpha) + \psi'(\beta)$$
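The mean and variance of $W = \ln(Y)$ (Equations 3.23 and 3.26) are the following.
$$\mu_W = E[\ln(Y)] = \psi(\alpha) - \ln b, \qquad \sigma_W^2 = \mathrm{Var}[\ln(Y)] = \psi'(\alpha)$$
The pdf of $W$ (the Log-Gamma Distribution) is
$$f_W(w \mid \alpha, b) = \frac{b^{\alpha}}{\Gamma(\alpha)}\, e^{\alpha w - b e^{w}}$$
The Normal Distribution with the mean and variance given above will have the following pdf.
$$f_W(w \mid \alpha, b) \approx \frac{1}{\sqrt{2\pi\psi'(\alpha)}} \exp\left\{-\frac{\left(w - (\psi(\alpha) - \ln b)\right)^2}{2\psi'(\alpha)}\right\} \tag{3.28}$$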
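The Kullback-Leibler divergence between these two distributions can be found as follows.
$$p(w) = \frac{1}{\sqrt{2\pi\psi'(\alpha)}} \exp\left\{-\frac{\left(w - (\psi(\alpha) - \ln b)\right)^2}{2\psi'(\alpha)}\right\}, \qquad q(w) = \frac{b^{\alpha}}{\Gamma(\alpha)}\, e^{\alpha w - b e^{w}}$$
$$\begin{aligned}
D_{KL} &= \int p(w) \ln\left(\frac{p(w)}{q(w)}\right) dw \\
&= \underbrace{\int p(w) \ln(p(w))\,dw}_{(-)\ \text{entropy of } p(w)} \ \underbrace{-\int p(w) \ln(q(w))\,dw}_{\text{cross-entropy of } p(w) \text{ and } q(w)}
\end{aligned} \tag{3.29}$$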
Lemma 9 (Entropy of Normal Distribution). The entropy of $p(w)$ is the entropy of a Normal Distribution, and it is defined as follows.
$$H(p(w)) = -\int p(w) \ln(p(w))\,dw = \frac{1}{2}\left(1 + \ln\left(2\pi\psi'(\alpha)\right)\right) \tag{3.30}$$
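Proof. Let $W$ be a random variable that follows a Normal Distribution with parameters $(\mu, \sigma^2)$.
$$W \sim N\left(\mu, \sigma^2\right), \qquad p(w) = f_W\left(w \mid \mu, \sigma^2\right) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left\{-\frac{(w-\mu)^2}{2\sigma^2}\right\}$$
$$E_{p(w)}\left[e^{w}\right] = \exp\left\{\psi(\alpha) - \ln b + \frac{\psi'(\alpha)}{2}\right\}$$
$$\begin{aligned}
E\left[e^{w}\right] &= \int e^{w} p(w)\,dw \\
&= \int e^{w} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left\{-\frac{(w-\mu)^2}{2\sigma^2}\right\} dw \\
&= \int \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left\{-\frac{(w-\mu)^2}{2\sigma^2} + w\right\} dw \\
&= \int \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left\{-\frac{w^2 - 2w\mu + \mu^2}{2\sigma^2} + w\right\} dw \\
&= \int \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left\{-\frac{w^2 - 2w(\mu + \sigma^2) + \mu^2}{2\sigma^2}\right\} dw \\
&= \int \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left\{-\frac{w^2 - 2w(\mu + \sigma^2) + (\mu + \sigma^2)^2}{2\sigma^2}\right\} dw \ \exp\left\{\frac{2\mu\sigma^2 + \sigma^4}{2\sigma^2}\right\} \\
&= \exp\left\{\mu + \frac{\sigma^2}{2}\right\}
\end{aligned}$$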
According to Equation 3.31, the cross-entropy of $p(w)$ and $q(w)$ will be the following.
$$H(p(w), q(w)) = \ln(\Gamma(\alpha)) - \alpha\,\psi(\alpha) + b\, e^{\psi(\alpha) - \ln b + \psi'(\alpha)/2}$$
Now, using this result for the cross-entropy, the entropy of $p(w)$ found in Equation 3.30, and the definition of the Kullback-Leibler divergence (Equation 3.29), we can find the divergence between the two distributions with respect to $b$ and $\alpha$, as follows.
$$D_{KL} = H(p(w), q(w)) - H(p(w)) = \ln(\Gamma(\alpha)) - \alpha\,\psi(\alpha) + b\, e^{\psi(\alpha) - \ln b + \psi'(\alpha)/2} - \frac{1}{2}\left(1 + \ln\left(2\pi\psi'(\alpha)\right)\right)$$
This function is evaluated for different values of the parameters $\alpha$ and $b$, and the results are shown in Figure 3.1. For a fixed $b = 1$, we increase the value of $\alpha$ and find the value of the divergence between the two distributions; this result is shown in Table 3.1.
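The divergence can be evaluated numerically by combining the cross-entropy above with the entropy of Equation 3.30. The following is a minimal sketch under stated assumptions: `digamma` and `trigamma` are hypothetical helpers based on a recurrence and an asymptotic series (not a library routine), and `kl_normal_vs_log_gamma` is an illustrative name, not part of MABayApp.

```python
import math


def digamma(x):
    """Approximate psi(x) using psi(x) = psi(x + 1) - 1/x and an asymptotic series."""
    result = 0.0
    while x < 6.0:
        result -= 1.0 / x
        x += 1.0
    inv2 = 1.0 / (x * x)
    series = inv2 * (1.0 / 12.0 - inv2 * (1.0 / 120.0 - inv2 / 252.0))
    return result + math.log(x) - 0.5 / x - series


def trigamma(x):
    """Approximate psi'(x) using psi'(x) = psi'(x + 1) + 1/x^2 and an asymptotic series."""
    result = 0.0
    while x < 6.0:
        result += 1.0 / (x * x)
        x += 1.0
    inv2 = 1.0 / (x * x)
    series = (1.0 / x) * inv2 * (1.0 / 6.0 - inv2 * (1.0 / 30.0 - inv2 / 42.0))
    return result + 1.0 / x + 0.5 * inv2 + series


def kl_normal_vs_log_gamma(alpha, b=1.0):
    """KL divergence from the Normal approximation p(w) to the Log-Gamma density q(w)."""
    mu = digamma(alpha) - math.log(b)      # mean of ln(Y), Equation 3.23
    var = trigamma(alpha)                  # variance of ln(Y), Equation 3.26
    entropy_p = 0.5 * (1.0 + math.log(2.0 * math.pi * var))   # Equation 3.30
    # Cross-entropy: -E_p[ln q(W)], using E_p[e^W] = exp(mu + var / 2).
    cross_entropy = (math.lgamma(alpha) - alpha * math.log(b)
                     - alpha * mu + b * math.exp(mu + 0.5 * var))
    return cross_entropy - entropy_p


if __name__ == "__main__":
    for a in (1, 2, 5, 10, 50, 100):
        print(a, kl_normal_vs_log_gamma(a))
```

In this expression the factor $b\,e^{-\ln b}$ equals 1, so the computed value depends on $\alpha$ only; the loop prints the divergence for a few values of $\alpha$ with $b = 1$.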
[Figure: plot of the Kullback-Leibler divergence]
Figure 3.1: The values of Kullback-Leibler Divergence between the Log-Gamma Distribution and the Normal Distribution approximation, for different values of the parameters $\alpha$ and $b$ of the Gamma Distribution.
Table 3.1: The values of Kullback-Leibler Divergence between the Log-Gamma Distribution and the Normal Distribution approximation, for different values of the parameter $\alpha$ and a fixed value of the parameter $b = 1$ of the Gamma Distribution.
We can see from Figure 3.1 and Table 3.1 that the approximation of a Log-Gamma Distribution by a Normal Distribution, as given in Equation 3.28, is accurate according to the Kullback-Leibler Divergence. The approximation becomes more accurate as $\alpha$ increases: for $\alpha \geq 5$ the divergence is lower than 0.02, and it decreases further, approaching 0.0, as the value of $\alpha$ increases.
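According to the result above (Lemma 8), the approximations of the Log-Gamma Distributions by Normal Distributions are accurate, as follows.
$$\begin{cases} Y \sim \mathrm{Gamma}(\alpha, b) \\ Z \sim \mathrm{Gamma}(\beta, b) \end{cases} \implies \begin{cases} \ln(Y) \approx N\left(\psi(\alpha) - \ln b,\ \psi'(\alpha)\right) \\ \ln(Z) \approx N\left(\psi(\beta) - \ln b,\ \psi'(\beta)\right) \end{cases}$$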
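According to Lemma 5, the logit of a Beta distributed variable $X$ is equal to the difference of the logarithms of two independent Gamma distributed variables, $\ln(X/(1-X)) = \ln(Y) - \ln(Z)$.
$$X \sim \mathrm{Beta}(\alpha, \beta) \implies \ln\left(\frac{X}{1-X}\right) = \ln(Y) - \ln(Z), \quad \begin{cases} Y \sim \mathrm{Gamma}(\alpha, b) \\ Z \sim \mathrm{Gamma}(\beta, b) \\ Y \perp Z \end{cases}$$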
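$$\begin{cases} \ln(Y) \approx N\left(\psi(\alpha) - \ln b,\ \psi'(\alpha)\right) \\ \ln(Z) \approx N\left(\psi(\beta) - \ln b,\ \psi'(\beta)\right) \end{cases} \implies \ln(Y) - \ln(Z) \approx N\left(\psi(\alpha) - \psi(\beta),\ \psi'(\alpha) + \psi'(\beta)\right)$$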
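Proof. Let $X$ and $Y$ be independent Normal distributed variables, with moment generating functions $M_X$ and $M_Y$.
$$X \sim N\left(\mu_x, \sigma_x^2\right), \qquad Y \sim N\left(\mu_y, \sigma_y^2\right)$$
$$M_X(t) = \exp\left\{\mu_X t + \frac{1}{2}\sigma_X^2 t^2\right\}, \qquad M_Y(t) = \exp\left\{\mu_Y t + \frac{1}{2}\sigma_Y^2 t^2\right\}$$
$$\begin{aligned}
M_{X-Y}(t) &= M_X(t)\, M_Y(-t) = \exp\left\{\mu_X t + \frac{1}{2}\sigma_X^2 t^2\right\} \exp\left\{-\mu_Y t + \frac{1}{2}\sigma_Y^2 t^2\right\} \\
&= \exp\left\{(\mu_X - \mu_Y)\,t + \frac{1}{2}\left(\sigma_X^2 + \sigma_Y^2\right) t^2\right\}
\end{aligned}$$
$$X - Y \sim N\left(\mu_x - \mu_y,\ \sigma_x^2 + \sigma_y^2\right)$$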
[Figure: grid of density plots; x-axis: logodds based score; y-axis: Density]
Figure 3.2: For $X \sim \mathrm{Beta}(\alpha, \beta)$, the distribution of $\ln(X/(1-X))$ is shown in red, and the Normal Distribution $N(\psi(\alpha) - \psi(\beta),\ \psi'(\alpha) + \psi'(\beta))$ is shown in blue, for different values of the parameters $\alpha$ and $\beta$ ($\alpha \in \{1; 10; 100\}$, $\beta \in \{2; 20; 200\}$).
3.3 EXAMPLES OF LOGISTIC NORMAL APPROXIMATION 33
[Figure: grid of density plots; x-axis: logodds based score; y-axis: Density]
Figure 3.3: For $X \sim \mathrm{Beta}(\alpha, \beta)$, the distribution of $\ln(X/(1-X))$ is shown in red, and the Normal Distribution $N(\psi(\alpha) - \psi(\beta),\ \psi'(\alpha) + \psi'(\beta))$ is shown in blue, for different values of the parameters $\alpha$ and $\beta$ ($\alpha \in \{100; 500\}$, $\beta \in \{1{,}000; 10{,}000\}$).
[Figure: density plots with the Normal approximation overlaid; x-axis: logodds based score; y-axis: Density]
Figure 3.4: Example of the Logistic-Normal approximation, for values of $\alpha$ and $\beta$ from a peak found in a control sample of a ChIP-Seq dataset.
[Figure: density plots with the Normal approximation overlaid; x-axis: logodds based score; y-axis: Density]
Figure 3.5: Example of the Logistic-Normal approximation, for values of $\alpha$ and $\beta$ from a peak found in a treatment sample of a ChIP-Seq dataset.
Chapter 4
Model (Statistics)
Consider that, for a given chromosomal region of the DNA with base pairs between the positions $b_1$ and $b_k$, we want to decide whether a peak found in this region is significant; in other words, whether this region of the genome truly represents a binding site of a protein of interest.
To accomplish that, we need to verify whether the probability of the peak found in this region for the treatment sample is significantly different from the probability of occurrence of the same peak (in the same genomic region) when using a control sample. Considering that the occurrence of peaks in the control sample is random, we can assume that when the probability of the peak given the treatment sample is truly different, the peak in the treatment sample is not random.
The Bayesian model described in Section 4.1 was proposed to solve this problem, and to overcome the difficulties of using multiple replicates we used the Meta-Analysis described in Section 4.1.1.
In order to model this dataset, we consider the positions in the genome (i.e., bp in the chromosome), $b_1, b_2, ..., b_k$, to be independent categories, and the numbers of alignments, $n_1, n_2, ..., n_k$, to be the number of successes of each category. It is important to mention that the independence assumption here is an approximation: although the alignment between a bp of the read and a bp of the chromosome (known as a "match") is independent of the alignment of any previous bps, the bps are grouped in sequences (reads), and these sequences are aligned to the genome as a whole, instead of each bp being aligned independently. We believe this is a reasonable approximation, and it is necessary to make the solution feasible in a reasonable time, especially for very large genomes.
The probability of obtaining the alignment counts $n$ (Equation 4.1) between bps $b_1$ and $b_k$ can be modelled under a Bayesian framework, using a Multinomial distribution as likelihood (the probability of obtaining the number of successes equal to $n$, given the probability of success for bps between $b_1$ and $b_k$), as described below.
Consider the probabilities vector $\theta = [\theta_1, \theta_2, ..., \theta_k]$, where $\theta_i$ represents the probability of having one alignment at position $b_i$ (i.e., the probability of success of category $b_i$). The probability of obtaining the vector of alignment counts $n$, given this vector $\theta$, is defined as follows.
$$p(n \mid \theta) = \frac{n!}{n_1! \cdots n_k!}\, \theta_1^{n_1} \theta_2^{n_2} \cdots \theta_k^{n_k}, \quad \text{for } n = \sum_{i=1}^{k} n_i \text{ and } \sum_{i=1}^{k} \theta_i = 1 \tag{4.2}$$
To model the prior knowledge about the probabilities vector $\theta$, we can use a Dirichlet distribution (Ferguson, 1973) as a prior distribution, which is the natural conjugate of the Multinomial distribution (de Bragança Pereira and Stern, 2008).
For a probability vector $\theta$, given a known alignment counts vector $\alpha = [\alpha_1, \alpha_2, ..., \alpha_k]$ (prior knowledge regarding alignment bias in the genome), the Dirichlet distribution is defined as follows.
$$p(\theta \mid \alpha) = \frac{\Gamma\left(\sum_{i=1}^{k} \alpha_i\right)}{\prod_{i=1}^{k} \Gamma(\alpha_i)}\, \theta_1^{\alpha_1 - 1} \theta_2^{\alpha_2 - 1} \cdots \theta_k^{\alpha_k - 1}, \quad \text{for } \sum_{i=1}^{k} \theta_i = 1 \tag{4.3}$$
For the distribution in Equation 4.3 (Dirichlet distribution), $\alpha_i - 1$ represents the number of reads previously mapped to the genome at position $i$. If we do not have any previous knowledge about these alignments, we can use $\alpha_i = 1$ for all $i$, which is equivalent to using the Uniform distribution for all categories $b_1, ..., b_k$. If, instead, we have a priori knowledge about a bias in mapping the reads to certain regions of the genome, we can readily add this information as the prior distribution in this model.
Finally, the distribution of the probabilities vector $\theta$ given the alignments vector $n$ (posterior distribution) will be a Dirichlet distribution as well, which is proportional to the product of the distributions given in Equations 4.3 (prior) and 4.2 (likelihood).
The distribution in Equation 4.5 is the joint distribution of the probabilities of the alignment counts found for bps $b_1, b_2, ..., b_k$; for a given peak, we want to compare this distribution between treatment and control samples.
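Because the Dirichlet prior is conjugate to the Multinomial likelihood, the posterior parameters are obtained by adding the observed counts to the prior parameters, $\alpha_i + n_i$. The snippet below is a small illustration of that update; the function names and the toy counts are hypothetical and are not taken from MABayApp.

```python
def dirichlet_posterior(prior_alpha, counts):
    """Posterior Dirichlet parameters alpha_i + n_i for Multinomial counts n_i."""
    if len(prior_alpha) != len(counts):
        raise ValueError("prior_alpha and counts must have the same length")
    return [a + n for a, n in zip(prior_alpha, counts)]


def dirichlet_mean(alpha):
    """Expected value of each theta_i under Dirichlet(alpha)."""
    total = sum(alpha)
    return [a / total for a in alpha]


# Toy region of k = 4 base pairs with a Uniform prior (alpha_i = 1 for all i).
prior = [1, 1, 1, 1]
alignments = [0, 3, 5, 2]
posterior = dirichlet_posterior(prior, alignments)
posterior_mean = dirichlet_mean(posterior)
```

With these toy counts the posterior parameters are `[1, 4, 6, 3]`, and the posterior mean of each $\theta_i$ is the corresponding parameter divided by their sum.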
Since we consider the alignments found for a peak in the control sample to be random alignments (resulting from alignment biases), when the region is a true binding site we expect the peak in the treatment sample to have very different probabilities. In other words, for a true binding site of the protein of interest, the alignments found for the treatment sample are non-random alignments.
In order to compare these probabilities in treatment and control samples for a given peak $p$, we use the logodds measure, which is the logarithm of the ratio between a probability $p$ and its complement.
$$\mathrm{odds}(p) = \frac{p}{1-p}$$
$$\mathrm{logodds}(p) = \log(\mathrm{odds}(p)) = \log\left(\frac{p}{1-p}\right) \tag{4.6}$$
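$$\begin{aligned}
\mu_i &= \psi(\alpha_i) - \psi(\alpha_k), & i &= 1, ..., k-1 \\
\sigma_{ii} &= \psi'(\alpha_i) + \psi'(\alpha_k), & i &= 1, ..., k-1 \\
\sigma_{ij} &= \psi'(\alpha_k), & i &\neq j
\end{aligned} \tag{4.7}$$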
4.1 CATEGORICAL BAYESIAN MODEL 39
This approximation becomes more accurate as the parameters $\alpha_i$ increase. In our case, the parameters $\alpha_i$ are the numbers of alignments at genome positions and, as we describe in the next section, we will use the sum of alignments over large portions of the genome. Therefore, the parameters $\alpha_i$ used in our model are extremely large and this approximation becomes very accurate, as shown in the Appendix.
Throughout the rest of this paper, we will simplify the notation for the Dirichlet distribution by removing the constant and replacing the equal sign with the mathematical symbol of proportionality, which is a common practice, as follows.
$$p(\theta \mid n) \propto \theta_1^{(\alpha_1 + n_1) - 1}\, \theta_2^{(\alpha_2 + n_2) - 1} \cdots \theta_k^{(\alpha_k + n_k) - 1} \tag{4.8}$$
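Models for treatment and control samples  For a given region of length $k$ of the genome DNA sequence (whole chromosomal DNA sequence) with bps between $b_1$ and $b_k$, consider the vector of alignment counts and the probability vector, for treatment and control samples, as follows.
$$n = [n_1, ..., n_k],\ \theta = [\theta_1, ..., \theta_k] \quad \text{and} \quad n' = [n'_1, ..., n'_k],\ \theta' = [\theta'_1, ..., \theta'_k], \quad \text{for } \sum_{i=1}^{k} \theta_i = \sum_{i=1}^{k} \theta'_i = 1.$$
Given the parameters vector $\alpha = [\alpha_1, \alpha_2, ..., \alpha_k]$ of the prior distribution for the treatment sample, the posterior distribution for the treatment sample will be given as follows.
$$p(\theta \mid n) \propto \theta_1^{(n_1 + \alpha_1) - 1}\, \theta_2^{(n_2 + \alpha_2) - 1} \cdots \theta_k^{(n_k + \alpha_k) - 1} \tag{4.9}$$
Given the parameters vector $\alpha' = [\alpha'_1, \alpha'_2, ..., \alpha'_k]$ of the prior distribution for the control sample, the posterior distribution of the control sample will be given as follows.
$$p(\theta' \mid n') \propto {\theta'_1}^{(n'_1 + \alpha'_1) - 1}\, {\theta'_2}^{(n'_2 + \alpha'_2) - 1} \cdots {\theta'_k}^{(n'_k + \alpha'_k) - 1} \tag{4.10}$$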
Now, consider that we have a list of candidate peaks found for the treatment sample, and we want to find out whether a specific peak $p$ is significant (i.e., whether it is a true binding site). This peak $p$ begins at the bp position $b_{init}$ and ends at the bp position $b_{end}$ of a chromosome, as shown in Fig. 4.1 and Fig. 4.2.
Since we consider the categories $b_i$ (for $b_1 \leq b_i \leq b_k$) to be independent, we can define the probability of this peak, which we will refer to as $\theta_p$, as the sum of the parameters $\theta_i$ within this region. We can also define the probability of the remaining peaks, $\theta_R$, and the probability of the regions with no candidate peaks, $\theta_S$, both as the sum of the parameters $\theta_i$ within their respective regions.
Consider the set of all base pairs that fall within regions with no candidate peaks as $S$, and the set of all base pairs that fall within the remaining candidate peaks (all peaks except $p$) as $S_R$. We define $\theta_p$, $\theta_R$ and $\theta_S$ as follows,
$$\theta_p = \sum_{i=b_{init}}^{b_{end}} \theta_i, \quad \theta_R = \sum_{i \in S_R} \theta_i \quad \text{and} \quad \theta_S = \sum_{i \in S} \theta_i \tag{4.11}$$
Given that the sum of the parameters $\theta_i$, for $i$ from 1 to $k$, is equal to 1, the sum of the probabilities $\theta_R$ (remaining candidate peaks) and $\theta_S$ (no candidate peak) will be the complement of the probability $\theta_p$.
$$\theta_S + \theta_R = 1 - \theta_p \tag{4.12}$$
[Figure: coverage tracks for Treatment Sample Replicates 1, 2 and 3, with the peak region marked between $b_{init}$ and $b_{end}$]
Figure 4.1: Example of peak region found between base pair $b_{init}$ and $b_{end}$.
To define the distributions and relationships between the probabilities $\theta_p$, $\theta_R$ and $\theta_S$, we will use three properties of the Dirichlet distribution: Dirichlet's additive property, Dirichlet's Beta marginal distribution (Ferguson, 1973), and Dirichlet's neutrality (James and Mosimann, 1980). The additive property, neutrality and marginal distribution are detailed in Appendix 1.
The additive property of the Dirichlet distribution says that, given a set of variables distributed according to a Dirichlet distribution, if we sum two or more of these variables, this sum together with the remaining variables is also Dirichlet distributed (Equation 4.13). The marginal distribution of a Dirichlet distribution is a Beta distribution (Equation 4.14).
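The additive and marginal properties can be illustrated by simulation: summing a group of Dirichlet components should give a variable whose mean matches that of a Beta distribution with parameters equal to the summed $\alpha_i$ inside and outside the group. This is a sketch only; the function names, the toy parameters and the Gamma-based Dirichlet sampler are assumptions made for illustration.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw one Dirichlet(alpha) vector by normalising independent Gamma draws."""
    gammas = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(gammas)
    return [g / total for g in gammas]


def aggregated_beta_params(alpha, index_set):
    """Beta parameters of the sum of the components of theta indexed by index_set."""
    inside = sum(a for i, a in enumerate(alpha) if i in index_set)
    outside = sum(alpha) - inside
    return inside, outside


def simulate_group_mean(alpha, index_set, n=20000, seed=1):
    """Monte Carlo mean of the summed components, for comparison with the Beta mean."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        theta = sample_dirichlet(alpha, rng)
        total += sum(theta[i] for i in index_set)
    return total / n


alpha = [2.0, 3.0, 5.0, 10.0]
group = {0, 1}
a, b = aggregated_beta_params(alpha, group)   # Beta(5, 15)
beta_mean = a / (a + b)
simulated_mean = simulate_group_mean(alpha, group)
```

Here `beta_mean` is the mean of the Beta distribution implied by the two properties, and `simulated_mean` should agree with it up to Monte Carlo error.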
[Figure: genome coverage tracks for Control Sample Replicate 1 and Treatment Sample Replicates 1, 2 and 3; y-axis: Coverage; the peak region is marked between $b_{init}$ and $b_{end}$]
Figure 4.2: Example of peak region found between base pair $b_{init}$ and $b_{end}$.
From Equation 4.5 (the posterior distribution of the probabilities of alignment for bases between $b_1$ and $b_k$), we use the additive property of the Dirichlet distribution, followed by the neutrality property and the Beta marginal distribution property, to find the distribution of the ratio $\theta_p/\theta_S$ and the distribution of $\theta_S$, as follows.
$$\theta_1, ..., \theta_k \mid n \sim \mathrm{Dir}\left(\alpha_1 + n_1, \cdots, \alpha_k + n_k\right)$$
$$\theta_S \mid n \sim \mathrm{Beta}\left(\sum_{j \in S} [\alpha_j + n_j],\ \sum_{l \notin S} [\alpha_l + n_l]\right)$$
$$\frac{\theta_p}{\theta_S} \,\Big|\, n \sim \mathrm{Beta}\left(\sum_{i=b_{init}}^{b_{end}} [\alpha_i + n_i],\ \sum_{j \in S_R} [\alpha_j + n_j]\right) \tag{4.16}$$
Therefore, the probability of the region with no candidate peak ($\theta_S$) will have a Beta distribution, with parameters equal to: (1) the sum of the number of alignments found in the region with no candidate peak (plus the parameters $\alpha_i$ of the prior distribution in this region); and (2) the sum of the alignments found in the remaining region of the chromosome (plus the parameters $\alpha_i$ of the prior distribution in this region).
The ratio $\theta_p/\theta_S$ will have a Beta distribution as well, with parameters: (1) the sum of the number of alignments found in the region of the candidate peak $p$ (plus the parameters $\alpha_i$ of the prior distribution); and (2) the sum of the alignments found in the region of the remaining peaks, set $S_R$ (plus the parameters $\alpha_i$ of the prior distribution).
The variables $\theta_S$ and $\theta_p/\theta_S$ are independent.
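Finally, we can define a score for a peak $p$, and the distribution of this score, using the distributions of $\theta_S$ and $\theta_p/\theta_S$ and the approximated distribution of their logodds, as described in Equation 4.7.
For $\underbrace{\sum_{j \in S} [n_j]}_{\text{total noise}} < \underbrace{\sum_{l \notin S} [n_l]}_{\text{all candidate peaks}}$ in the treatment samples,
$$\mathrm{score}(p) = \underbrace{\mathrm{logodds}\left(\frac{\theta_p}{\theta_S}\right)}_{\text{odds of peak } p \text{ over noise}} + \underbrace{\mathrm{logodds}\left(\theta_S\right)}_{\text{odds of noise in chromosome}} \tag{4.17}$$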
Notice that the first term of the equation above controls the odds of the peak with respect to the number of random alignments in the chromosome, and the second term controls the odds of the random alignments. In other words, a large value for the first term might be misleading: if $\theta_S$ is very low, $\theta_p/\theta_S$ will be large regardless of the value of $\theta_p$; that is the reason why the second term is so important to identify truly significant peaks. In the Results Section we show how the number of significant peaks can change before and after considering this term.
The condition for using this score is that the total number of alignments in the region $S$ is strictly lower than the total number of alignments in the remaining regions of the chromosome. In other words, the sum of alignments for all the candidate peaks is necessarily greater than the sum of alignments in regions with no candidate peaks. We consider this assumption reasonable, since the noise should not be greater than the signal or the results would not be accurate.
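The distribution of the logodds of $\theta_p/\theta_S$ will be given as follows.
$$\mathrm{logodds}\left(\frac{\theta_p}{\theta_S}\right) = \mathrm{logit}\left(\frac{\theta_p}{\theta_S}\right) \Big|\, n \approx N\left(\mu_{p/S},\ \sigma^2_{p/S}\right)$$
$$\mu_{p/S} = \psi\left(\sum_{i=b_{init}}^{b_{end}} [\alpha_i + n_i]\right) - \psi\left(\sum_{j \in S_R} [\alpha_j + n_j]\right)$$
$$\sigma^2_{p/S} = \psi'\left(\sum_{i=b_{init}}^{b_{end}} [\alpha_i + n_i]\right) + \psi'\left(\sum_{j \in S_R} [\alpha_j + n_j]\right) \tag{4.18}$$
Since $\theta_p/\theta_S$ is independent of $\theta_S$ (Equation 4.16), the distribution of the sum of their logit functions will be given as the distribution of the sum of independent Normal distributed variables, as follows.
$$\mathrm{score}(p) = \mathrm{logodds}\left(\frac{\theta_p}{\theta_S}\right) + \mathrm{logodds}\left(\theta_S\right) \approx N\left(\mu_{p/S} + \mu_S,\ \sigma^2_{p/S} + \sigma^2_S\right) \tag{4.20}$$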
We find the distributions of this logodds based score for each peak $p$ in both treatment and control samples (in the same chromosomal region), and we compare these distributions (for control and treatment samples) in order to decide whether the peak $p$ in the treatment sample is significant, as shown in the next Section, Meta-Analysis.
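The parameters of the Normal approximation of the score can be computed directly from the posterior sums that appear in Equation 4.16. The sketch below assumes those sums are already available as three numbers (peak $p$, remaining peaks $S_R$, no-peak region $S$); the function names are hypothetical, and the Digamma and Trigamma helpers are simple series approximations rather than library calls.

```python
import math


def digamma(x):
    """Approximate psi(x) using psi(x) = psi(x + 1) - 1/x and an asymptotic series."""
    result = 0.0
    while x < 6.0:
        result -= 1.0 / x
        x += 1.0
    inv2 = 1.0 / (x * x)
    series = inv2 * (1.0 / 12.0 - inv2 * (1.0 / 120.0 - inv2 / 252.0))
    return result + math.log(x) - 0.5 / x - series


def trigamma(x):
    """Approximate psi'(x) using psi'(x) = psi'(x + 1) + 1/x^2 and an asymptotic series."""
    result = 0.0
    while x < 6.0:
        result += 1.0 / (x * x)
        x += 1.0
    inv2 = 1.0 / (x * x)
    series = (1.0 / x) * inv2 * (1.0 / 6.0 - inv2 * (1.0 / 30.0 - inv2 / 42.0))
    return result + 1.0 / x + 0.5 * inv2 + series


def beta_logit_normal(a, b):
    """Mean and variance of the Normal approximation to the logit of a Beta(a, b) variable."""
    return digamma(a) - digamma(b), trigamma(a) + trigamma(b)


def peak_score_distribution(a_peak, a_remaining, a_noise):
    """Mean and variance of score(p); each argument is a sum of (alpha_i + n_i).

    a_peak: positions of peak p; a_remaining: set S_R; a_noise: set S.
    """
    mu_ps, var_ps = beta_logit_normal(a_peak, a_remaining)
    # Second Beta parameter for the noise term: everything outside S.
    mu_s, var_s = beta_logit_normal(a_noise, a_peak + a_remaining)
    return mu_ps + mu_s, var_ps + var_s


mu_score, var_score = peak_score_distribution(500.0, 20000.0, 10000.0)
```

The toy sums respect the condition that the no-peak region holds fewer alignments than all candidate peaks together.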
4.1.1 Meta-Analysis
For each peak found in the treatment sample, we find the distribution of the logodds based score of the peak, shown in Equation 4.17, for both treatment and control samples, using the same positions in the reference genome.
For each replicate, we have the distribution of the score of the peak, and we use the weighted average of these distributions to compare the scores of each peak in treatment and control samples. This average is weighted using the total number of bp alignments for each sample replicate.
Fig. 4.3 shows examples of regions where peaks have been found for treatment and control samples using an antibody to recognize human SOX3, and the respective alignment counts for each replicate. Fig. 4.4 shows the corresponding approximated Normal distribution of the score of this peak, for all the replicates.
Finally, Fig. 4.5 shows the resulting weighted average of the replicates. This distribution, averaged across all the replicates, is the distribution used to evaluate each peak. Using this distribution, we can find the probability that the score found in the treatment sample differs from the score of the peak found in the control sample, using Equation 4.21.
Consider $\mathrm{score}(p)$ the score for the region of a candidate peak $p$ in the treatment sample and $\mathrm{score}(p')$ the score for the same region in the control sample.
Candidate peaks with values of probability, given by Equation 4.21, close to 1 will be considered significant binding sites (for these peaks, the score in the treatment samples is greater than the score in the control sample with high probability).
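One way to compute the probability that the treatment score exceeds the control score is sketched below. It assumes that the two scores are independent and Normally distributed, so that their difference is also Normal; this is an illustrative assumption and not necessarily the exact form of Equation 4.21. The function name is hypothetical.

```python
import math


def prob_treatment_greater(mu_t, var_t, mu_c, var_c):
    """P(T > C) for independent T ~ N(mu_t, var_t) and C ~ N(mu_c, var_c).

    T - C is Normal with mean mu_t - mu_c and variance var_t + var_c,
    so the probability is the standard Normal CDF evaluated at z.
    """
    z = (mu_t - mu_c) / math.sqrt(var_t + var_c)
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


# Toy values on the logodds based score scale.
p_peak = prob_treatment_greater(-11.0, 0.01, -11.6, 0.02)
```

A peak would then be called when `p_peak` is at or above a chosen threshold, such as the 0.9 used in the results tables.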
[Figure: coverage tracks for SOX3 ChIP-Seq Biological Replicates 1, 2 and 3; x-axis: Position; y-axis: Coverage]
Figure 4.3: Example of peaks found for treatment and control samples of the SOX3 experiment aligned to the mm10 reference genome.
[Figure: density plots for SOX3 ChIP-Seq Biological Replicates 2 and 3; x-axis: logodds based score]
Figure 4.4: Normal distribution of the logodds based score of replicates for treatment and control samples aligned to the mm10 genome.
Figure 4.5: Meta-analysis: comparison of weighted average of distributions of the score for a peak p, given
control and treatment samples.
Chapter 5
5.1 Algorithm
if IsValid(command.line) then
    list.ctrl.path ← control samples file names
    list.treatment.path ← treatment samples file names
    list.candidate.peaks ← list of candidate peaks
else
    Issue(usage.error.msg)
    Exit()
end if
ctrl.alignments ← ReadBamByChunks(list.ctrl.path, chunk.size)
treatment.alignments ← ReadBamByChunks(list.treatment.path, chunk.size)
ctrl.coverage ← FindCoverage(ctrl.alignments)
treatment.coverage ← FindCoverage(treatment.alignments)
▷ Find the mean and variance of noise based on non-candidate peaks region
mu.noise ← digamma(alpha.total + n.total) −
    digamma(alpha.all.candidate.peaks + n.all.candidate.peaks)
sigma.noise ← trigamma(alpha.total + n.total) +
    trigamma(alpha.all.candidate.peaks + n.all.candidate.peaks)
for candidate.peak in list.candidate.peaks do
    mu.candidate.peak.over.noise ← digamma(alpha.candidate.peak + n.candidate.peak) −
        digamma(alpha.all.candidate.peaks + n.all.candidate.peaks)
    sigma.candidate.peak.over.noise ← trigamma(alpha.candidate.peak + n.candidate.peak) +
        trigamma(alpha.all.candidate.peaks + n.all.candidate.peaks)
    mu.score.candidate.peak ← mu.candidate.peak.over.noise + mu.noise
    sigma.score.candidate.peak ← sigma.candidate.peak.over.noise + sigma.noise
    ▷ Find the probability of the score in treatment sample greater than score in control sample
    p.peak ← p(score.treatment > score.control)
end for
48 WORKFLOW (COMPUTER SCIENCE) 5.2
k k x
y(x, , ) = + tanh (5.1)
2 2
After applying the smooth function to the data, we can find the first and second order derivatives, in order to find the local maximum and minimum points. The rules used to define points of local maximum and minimum are the following:
Local minimum: the first order derivative is equal to zero (stationary point) and the second order derivative is positive (concave up curve); or the first derivative changes sign at this point, from negative to positive (the function is decreasing before this point and increasing after this point).
Local maximum: the first order derivative is equal to zero (stationary point) and the second order derivative is negative (concave down curve); or the first derivative changes sign at this point, from positive to negative (the function is increasing before this point and decreasing after this point).
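For a discrete signal such as per-base coverage, the sign-change rule above can be applied to first differences. The function below is a minimal sketch with a hypothetical name; it uses only the sign-change criterion, treats a plateau as a single point (reported at the plateau's first index), and is not the MABayApp implementation.

```python
def local_extrema(values):
    """Return (maxima, minima) index lists based on sign changes of the first difference."""
    maxima, minima = [], []
    prev_sign = 0     # sign of the last non-zero first difference
    turn_index = 0    # index reached by the last non-zero step
    for i in range(1, len(values)):
        diff = values[i] - values[i - 1]
        if diff == 0:
            continue  # plateau: wait for the next non-zero slope
        sign = 1 if diff > 0 else -1
        if prev_sign > 0 and sign < 0:
            maxima.append(turn_index)   # increasing, then decreasing
        elif prev_sign < 0 and sign > 0:
            minima.append(turn_index)   # decreasing, then increasing
        prev_sign = sign
        turn_index = i
    return maxima, minima


coverage = [0, 1, 3, 2, 1, 2, 5, 5, 4]
peaks_at, valleys_at = local_extrema(coverage)
```

For the toy `coverage` list, the local maxima are at indices 2 and 6 and the local minimum is at index 4.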
Based on the points of local maximum and local minimum, we can identify the peaks in the data. Figure 5.2 shows the resulting peaks found after applying the smoothing function. Subfigure 5.2(a) shows the number of alignments at each position of the genome (i.e., each base pair); Subfigure 5.2(b) shows the smooth function of the number of alignments, after applying Equation 5.1 to the original data with parameters k = number of steps taken and both remaining parameters equal to 1. In Subfigures 5.2(c) and 5.2(d) we can see the change in the number of local maximum and local minimum points found after the application of the kernel smoother function. The original data has a local maximum and a local minimum for each step. After applying the kernel smoother based on the hyperbolic tangent function, this number is reduced significantly, reflecting the true maximum and minimum points in the data. The final Subfigure 5.2(e) shows the start position, the end position and the top of each peak found in the data.
The last step of the peak processing is the decision on whether or not to join consecutive peaks. To decide if two consecutive peaks should be joined, we first find the area under the curve of both peaks in the original data. We then join the peaks by connecting their top positions, and we calculate the area under the curve again. Finally, we split the peaks by going downhill, smoothly, from the top of each peak.
We then calculate the distance between the area of the original peaks and two other areas: (1) the area after joining the peaks; (2) the area after splitting the peaks. We choose the option with the least distance
5.2 PEAK SMOOTHING AND IDENTIFICATION 49
[Figure: six panels on unit axes (0.00 to 1.00); recovered panel titles: (c) Smooth step up scaling control, (d) Smooth step down scaling control, (e) Smooth step up shift control, (f) Smooth step down shift control]
Figure 5.1: In (a) and (b), steps up and down found on the original data; in (c) and (d), their respective smooth steps resulting from using the function described in Equation 5.1 with parameters k = 1 (a single step), the scaling parameter equal to 1 (constant scaling) and the shift parameter ranging from −5 to +5 (changing function shift). In (e) and (f), smooth steps resulting from using the same function with parameters k = 1, the shift parameter equal to 1 (constant shift) and the scaling parameter ranging from 1 to 10 (changing function scaling).
[Figure: five stacked panels; x-axis: Genome Position; legends: local maximum, local minimum, peak start, peak top, peak end]
Figure 5.2: (a) Original data (number of alignments at each position of the genome); (b) smooth data (smooth function of the number of alignments); (c) local maximum and local minimum points found on the original data (a very large number of local maximum and local minimum points are found on the original data); (d) local maximum and local minimum points found on the smooth data (a reduced number of local maximum and local minimum points are found on the smooth data); (e) resulting peaks found using the smooth data.
from the original area. If the original area is closer to the area (1), we join the peaks; otherwise, we split the consecutive peaks.
Chapter 6
54 RESULTS AND DISCUSSION 6.1
The results found are shown in Figure 6.1, and the numbers of peaks called (peaks found significant) are shown in Table 6.1. We can see from the results that the number of peaks called changes drastically. Although there are many candidate peak regions in the genome where there is a significant difference in enrichment between treatment and control samples, these peaks are not significant when compared against the remaining regions (regions with no candidate peak). Therefore these peaks are considered false positives.
6.1 MABAYAPP OVERVIEW AND MODEL COMPARISON 55
[Figure: panels for SOX3 ChIP-Seq Treatment Sample Biological Replicates 2 and 3 vs. Control; x-axis: peak (ordered); legend (bias correction): Before chromosomal bias correction, After chromosomal bias correction]
Figure 6.1: Decrease in the number of significant peaks after application of the chromosomal bias correction. Peaks on the right side of the dashed black line have probability equal to or greater than 0.9.
Table 6.1: MABayApp Results for Single Treatment Files vs. Control

Samples                                  Peaks called before bias        Peaks called by MABayApp
                                         correction (probability ≥ 0.9)  (probability ≥ 0.9)
Treatment sample 1 vs. Control sample    4,979                           1,916
Treatment sample 2 vs. Control sample    4,925                           1,283
Treatment sample 3 vs. Control sample    4,339                           1,195
[Figure: panels for SOX3 ChIP-Seq Treatment Sample Biological Replicates 2 and 3 vs. Control; x-axis: peak (ordered); y-axis: probability of logodds based score]
Figure 6.2: Single Treatment Files. Probability of the score in the Treatment sample being greater than the score in the Control sample.
show the Venn diagram with an overview of the number of peaks called for all three treatment samples separately, as well as the results after combining the samples two by two, and all together.
We can see from these figures that the number of peaks called is reduced when combining the treatment samples. The resulting numbers of peaks called were (see Figure 6.4): 819 peaks for treatment samples 1 and 2 combined, 863 peaks for treatment samples 1 and 3 combined, 812 peaks for treatment samples 2 and 3 combined, and 575 peaks for treatment samples 1, 2 and 3 all together.
This reduction in the number of peaks called is a direct consequence of the nature of the meta-analysis, which takes the average distribution of the logodds of the peaks across all the treatment replicates. Therefore, a peak found to be significant when analysing a single treatment sample versus the control sample might be discarded from the list of peaks called when the remaining treatment samples are combined, because in this case the peak loses significance when the average distribution of logodds is taken across all treatment replicates.
[Figure: panels including "Bayesian Meta-analysis Results Treatment Samples 1, 2 and 3 vs. Control"; x-axis: Peak (ordered); y-axis: Probability based score]
Figure 6.3: Multiple Treatment files - Meta-Analysis.
Figure 6.4: Overview of the number of peaks found by MABayApp for treatment samples of the SOX3 ChIP-Seq dataset - chromosome Y.
58 RESULTS AND DISCUSSION 6.1
peaks as possible. All peaks with a score greater than or equal to 10 were found, resulting in 9,639
candidate peaks. We then concatenated the treatment .bam file with itself, using the Samtools (Li et al.,
2009) command "samtools merge treatmentX2.bam treatment.bam treatment.bam", resulting in a
larger .bam file (treatmentX2.bam) with twice the number of alignments; we performed the same
operation for the control file. We used the resulting control and treatment files as input to MACS,
with parameters set to keep the duplicate reads ("-p 0.1 --keep-dup='all' -f "BAM" -s 50 --verbose 3"),
and we observed the score variation for the original 9,639 candidate peaks. We repeated the steps
described above until we had a dataset with 32 times the number of alignments. We used the
coordinates of the peaks found by MACS at each step as input to our model, and observed the
score variation for the 9,639 candidate peaks in our model as well. The results of this simulation
are shown in Figures 6.5 and 6.6 for MACS and MABayApp, respectively.
From these figures, we can see that as we increase the size of the dataset, fewer peaks are rejected
by the statistical test performed by MACS (the green line in Figure 6.5 shows that all the peaks are
considered significant under the default threshold when the dataset is duplicated 16 times), while
our model becomes more decisive about the significance of the peaks (the green line in Figure 6.6 shows
that while some of the peaks are considered more significant, with p approaching 1.0, other peaks have
their score reduced, with p approaching 0.0). As we increase the data available to our model, the scores
of the peaks approach a step function, giving some of the peaks a score of 1.0 (definitely significant)
and the others a score of 0.0 (definitely non-significant).
The numbers resulting from these experiments are shown in Table 6.2, where we can see the
behaviour of both models, as described above.
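
The sensitivity of a hypothesis test to duplicated data can be reproduced with a small R sketch. It
uses a simplified upper-tail Poisson test on a single candidate peak, which is an assumption made for
illustration and not the actual MACS implementation; the read counts are hypothetical. The score is
taken to be on a -10 * log10(p-value) scale, which is consistent with the thresholds quoted in the
text (a p-value cut-off of 0.1 corresponding to a score of 10).

```r
# Hypothetical read counts for one candidate peak (not from the thesis data).
treatment.reads <- 30   # reads observed in the treatment sample
control.reads   <- 20   # reads in the control sample, used as the Poisson rate

# Each duplication step doubles every count, as merging a .bam file with
# itself would do.
fold <- c(1, 2, 4, 8, 16, 32)

# Upper-tail Poisson p-value: P(X >= observed) with X ~ Poisson(expected).
p.values <- ppois(fold * treatment.reads - 1,
                  lambda = fold * control.reads,
                  lower.tail = FALSE)

# Score on the -10 * log10(p-value) scale (assumed, see the lead-in).
scores <- -10 * log10(p.values)

print(data.frame(fold = fold,
                 p.value = signif(p.values, 3),
                 score = round(scores, 1)))
```

The treatment-to-control ratio never changes, yet the p-value shrinks at every duplication step, so a
peak that fails a fixed score threshold on the original data eventually passes it.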
for the total number of peaks called by MABayApp): 959 peaks for treatment 1 versus control
(50.05% of the peaks called by MABayApp), 1,250 peaks for treatment 2 versus control (97.43%
of the peaks called by MABayApp), and 1,111 peaks for treatment 3 versus control (92.97% of the
peaks called by MABayApp).
In yellow, we have the peaks called by MACS but not by MABayApp (MABayApp score lower than
0.9 and MACS score greater than or equal to 50). These results show that, for two of the
treatment replicates, most of the peaks called by MACS were also called by MABayApp. The
numbers of peaks called only by MACS were (see Table 6.3 for the number of peaks called only by
MACS and Table 6.4 for the total number of peaks called by MACS): 348 peaks for treatment 1 versus
control (26.67% of the peaks called by MACS), 44 peaks for treatment 2 versus control (57.14%
of the peaks called by MACS), and 49 peaks for treatment 3 versus control (36.84% of the peaks called
by MACS).
Finally, in green we have the candidate peaks called by both MACS and MABayApp. According
to these results, the percentage of peaks called by MACS that were also called by MABayApp was
between 42% and 73%: 957 peaks for treatment 1 versus control (49.95% of the peaks called by
MABayApp and 73.33% of the peaks called by MACS), 33 peaks for treatment 2 versus control
(2.57% of the peaks called by MABayApp and 42.86% of the peaks called by MACS), and 84 peaks
for treatment 3 versus control (7.03% of the peaks called by MABayApp and 63.16% of the peaks
called by MACS).
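
The percentages quoted above follow directly from the three counts reported per replicate (called
only by MABayApp, called only by MACS, called by both). The short R sketch below recomputes them;
the variable names are illustrative and the counts are the ones given in the text.

```r
# Counts reported in the text for each treatment replicate versus control.
only.mabayapp <- c(t1 = 959, t2 = 1250, t3 = 1111)
only.macs     <- c(t1 = 348, t2 = 44,   t3 = 49)
both          <- c(t1 = 957, t2 = 33,   t3 = 84)

# Totals per method are the method-only peaks plus the shared peaks.
total.mabayapp <- only.mabayapp + both
total.macs     <- only.macs + both

pct <- function(part, total) round(100 * part / total, 2)

overlap <- data.frame(
  total.mabayapp  = total.mabayapp,
  total.macs      = total.macs,
  pct.of.mabayapp = pct(both, total.mabayapp),
  pct.of.macs     = pct(both, total.macs)
)
print(overlap)
```

The MABayApp totals obtained this way (1,916, 1,283 and 1,195) agree with Table 6.1.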
In order to find the peaks called by MACS when using all the treatment samples together,
we concatenated the different replicates into a single .bam file, as suggested by the documentation
of MACS, and used the resulting file as the treatment sample. In order to find the peaks called
by MABayApp, we first ran MACS with a threshold of 0.1 (to find as many candidate peaks as
possible) for each treatment sample against the control sample. We then used the three resulting
candidate peak files (one for each treatment sample) as input to MABayApp, together with the
three treatment sample files and the control sample file (totalling four .bam files).
[Figure 6.7: plot not reproduced. Panel recovered here: Peaks called by MACS and/or MABayApp,
Treatment sample 3 vs. Control. x-axis: MACS score (log scale); y-axis: MABayApp probability.]
Figure 6.7: Peaks called by MACS and MABayApp for SOX3 ChIP-Seq data: direct comparison. The
candidate peaks not called by either MACS or MABayApp are shown in black. Peaks called only by
MABayApp are shown in blue. In yellow are all the peaks called by MACS, but not by MABayApp. And in
green are the peaks called by both models.
The resulting number of peaks found by each model is shown in Table 6.4. As we can see, the
numbers of peaks found by the two models are very different: MACS found only 13 peaks to be significant
(score ≥ 50), while MABayApp found 575 peaks to be significant (probability-based score ≥ 0.9).
We analyse the genome annotation of these 575 regions of chromosome Y found by MABayApp in
Section 6.3, to show that these peaks are relevant to researchers studying the binding sites of the
SOX gene family.
[Figure 6.8: plot not reproduced. y-axis: MABayApp probability; colour legend: density level.]
Figure 6.8: 2D density estimation of RNA-Seq CDS p-value versus MABayApp CDS probability. The highest
densities for CDSs with p-value between 0.00 and 0.05 and log2 fold change greater than 0 are concentrated
at the highest values of MABayApp probability, confirming that CDSs considered significant in the RNA-Seq
analysis (p-values closer to 0.0) were also considered significant binding sites according to our model
(probabilities closer to 1.0).
The second test used to validate our model was performed over a list of candidate peaks extracted
from the custom annotation described in Section 2.2. We ran MABayApp on the ChIP-Seq experiments
to find the significant peaks within the genomic regions 3'UTR, 5'UTR, gene body, TSS200 and
TSS1500. We analysed the RNA-Seq experiments using this same annotation file as reference.
For genes and transcripts showing Sox3 RNA-Seq expression greater than control (log2 fold
change greater than zero), we separated the genomic features 3'UTR, 5'UTR, TSS200 and TSS1500
into two groups: significant and non-significant. Genes with p-values lower than or equal to 0.05 were
considered significant, and the remaining genes non-significant. The analysis performed using these
genomic features shows that our model is capable of detecting peaks in different regions of the genes,
and that it is not biased towards promoter regions, for example.
We then obtained, for these four features in the annotation (3'UTR, 5'UTR, TSS200 and TSS1500),
the probability-based score computed by MABayApp. Figure 6.9 shows the histogram of the MABayApp
probability score (MABayApp score on the horizontal axis and density on the vertical axis) for each
genomic feature, separating significant and non-significant genes.
As we can see from this figure, the significant genes (in blue) have more features with a MABayApp
probability score greater than or equal to 0.9 (the blue histogram is higher than the red histogram for
the highest values on the x-axis). This result shows that the genomic regions 3'UTR, 5'UTR, TSS200 and
TSS1500 of genes considered significant according to the RNA-Seq data analysis (p-value lower than or
equal to 0.05) also had more features considered significant by our model (MABayApp score greater than
or equal to 0.9).
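
The grouping used in this comparison can be sketched in a few lines of R. The data frame below is a
hypothetical toy example (feature labels, fold changes, p-values and scores are made up), and the
column names are illustrative rather than those of the actual analysis.

```r
# Hypothetical annotation features with RNA-Seq results and MABayApp scores.
features <- data.frame(
  feature = c("TSS200", "TSS1500", "3'UTR", "5'UTR",
              "TSS200", "TSS1500", "3'UTR", "5'UTR"),
  log2.fc = c(1.2, 0.8, 2.1, 0.4, 0.9, -0.5, 0.3, 1.1),
  p.value = c(0.01, 0.03, 0.20, 0.04, 0.60, 0.01, 0.049, 0.50),
  score   = c(0.95, 0.97, 0.40, 0.91, 0.30, 0.99, 0.85, 0.92)
)

# Keep only features of genes expressed more in Sox3 than in control.
up <- features[features$log2.fc > 0, ]

# RNA-Seq significance: p-value lower than or equal to 0.05.
up$significant <- up$p.value <= 0.05

# Fraction of features called as binding sites (score >= 0.9) in each group.
frac <- tapply(up$score >= 0.9, up$significant, mean)
print(round(frac, 2))
```

A higher fraction in the TRUE group than in the FALSE group corresponds to the blue histogram lying
above the red one to the right of the 0.9 threshold.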
Figure 6.9: Histogram of MABayApp probabilities found for annotation features (TSS200, TSS1500, 3'UTR
and 5'UTR), comparing significant genes (genes with p-value lower than or equal to 0.05) and
non-significant genes (remaining genes), according to the RNA-Seq analysis. The dotted line shows the
MABayApp threshold for significant binding sites (probability greater than or equal to 0.9). Significant
genes had more features found to be binding sites according to MABayApp (p ≥ 0.9).
We also obtained, for the Sox gene family, the RNA-Seq analysis results for their respective CDSs
(log2 fold change and p-value) and the MABayApp probability score for the respective features TSS200,
TSS1500, 3'UTR and 5'UTR. The results are shown in Table 6.5, with values of log2 fold change greater
than 0 in bold font, and values of MABayApp score greater than or equal to 0.9 in bold font as well.
These results show that, for Sox family CDSs with log2 fold change greater than 0 (Sox3 expression
higher than control expression), many of the respective features were found to be significant binding
sites (MABayApp score greater than or equal to 0.9).
Table 6.5: SOX transcription factor family results for RNA-Seq and MABayApp.
[Figure 6.10: plots not reproduced. Top plot: "Percentage of significant gene body, 3'UTR and 5'UTR
regions per chromosome" (panels: 3'UTR, 5'UTR, Gene body; y-axis: Percent; legend: Significant, no/yes).
Bottom plot: "Number of significant gene body, 3'UTR and 5'UTR regions per chromosome" (y-axis: Count;
legend: Annotation, 3'UTR/5'UTR/Gene body). x-axis in both: chr1 to chr19, chrX, chrY, chrM.]
Figure 6.10: Percentage of significant genomic regions, and number of significant regions found by
MABayApp (significant features are those with probability-based score ≥ 0.9).
Next, we checked whether MABayApp is biased towards larger regions. To do so, we analysed the
distribution of the length of both significant and non-significant regions. The
result is shown in Figure 6.11. As we can see from this figure, there is no discrepancy in the
distribution of the length when comparing significant and non-significant regions, which indicates
that MABayApp is not biased towards either larger or smaller regions.
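
A numerical complement to this visual comparison could look like the R sketch below. It is only one
possible check, not the procedure used in this work: the lengths are hypothetical, and the choice of a
Wilcoxon rank-sum test on the log10 scale is an assumption made for illustration.

```r
# Hypothetical region lengths in base pairs (not taken from the thesis data).
len.significant     <- c(120, 450, 980, 2300, 5100, 15000)
len.non.significant <- c(100, 300, 500, 1000, 2500, 4800, 7000, 16000)

# Summaries on the log10 scale used in the figure.
print(summary(log10(len.significant)))
print(summary(log10(len.non.significant)))

# Two-sample Wilcoxon rank-sum test (normal approximation).
test <- wilcox.test(log10(len.significant),
                    log10(len.non.significant),
                    exact = FALSE)
cat(sprintf("Wilcoxon p-value: %.3f\n", test$p.value))
```

A large p-value, as in this toy example, would be compatible with the two groups sharing the same
length distribution.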
6.3 GENOME ANNOTATION ANALYSIS 67
[Figure 6.11: plot not reproduced. Panels: 3'UTR, 5'UTR, Gene body; x-axis: Chromosome (chr1 to chr19,
chrX, chrY, chrM); y-axis: Length (log10 scale); legend: Significant, no/yes.]
Figure 6.11: Sequence length comparison between significant and non-significant genomic features: 3'UTR,
5'UTR and Gene body. Significant features are those with probability-based score ≥ 0.9.
Since we expected many genes to share significant features, we analysed the number of genes
that shared a combination of genomic features. Figure 6.12 shows the Venn diagram of the number
of genes that share combinations of the significant features 3'UTR, 5'UTR, TSS1500, TSS200 and gene
body. The results show that, as expected, many genes indeed have more than one significant feature,
and many of them have at least three genomic features considered significant by MABayApp.
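
The counts behind such a Venn diagram can be derived from one gene list per feature, as in the R
sketch below. The gene identifiers are placeholders (geneA to geneE) and do not correspond to the
genes found in this analysis.

```r
# Hypothetical lists of genes with a significant feature of each type.
significant <- list(
  "3'UTR"     = c("geneA", "geneB", "geneC"),
  "5'UTR"     = c("geneA", "geneC", "geneD"),
  "gene body" = c("geneA", "geneB", "geneC", "geneE"),
  "TSS200"    = c("geneA", "geneD"),
  "TSS1500"   = c("geneA", "geneC")
)

# Number of significant features per gene (each gene appears at most once
# per feature, so the table counts features).
n.features <- table(unlist(significant))

# Genes with all five features, and genes with at least three of them.
all.five       <- names(n.features)[n.features == length(significant)]
at.least.three <- names(n.features)[n.features >= 3]

print(n.features)
cat("all five features:", all.five, "\n")
cat("at least three features:", at.least.three, "\n")
```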
Figure 6.12: Number of genes with enriched regions. Many genes had enrichment in more than one region.
There are 32 genes that have all 5 regions enriched: 3'UTR, 5'UTR, gene body, TSS200 and TSS1500.
In order to validate our results, we looked at the enriched genomic regions for genes of the SOX
family. We expected to find many enriched regions for these genes, since the treatment samples had
antibodies specific to target genes of this family. As we can see from Table 6.6, the results
confirmed that MABayApp indeed found many features with significant enrichment for genes of this
family, which was a very satisfactory result.
We also looked at the genes that had all 5 genomic features considered significant. Those
genes are shown in Table 6.7, together with the gene type and the known role of each gene.
It is interesting to note that most of the genes have functions associated with the initial
development of the mouse, which was expected, since the samples from Mus musculus were taken
during neural lineage development. Among the roles played by these genes, the most frequent ones
are: neural development and regulation, axis/skeleton formation, cellular regulation (proliferation,
differentiation and apoptosis) and development of the immune system.
The Gene Ontology analysis of the enriched regions (Figures 6.13, 6.14, 6.15, 6.16, 6.17, and
6.18) shows similar results, with enriched biological processes related to neuronal development and
differentiation, enriched cellular components related to the development of Golgi and skeleton, and
enriched molecular functions related to neuronal development.
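
The figures that follow count Gene Ontology terms that occur only among significant regions. A minimal
R sketch of that kind of filtering is given below; the assignment of GO identifiers to regions is
hypothetical and serves only to show the set operation.

```r
# Hypothetical GO annotations, one entry per annotated region.
go.significant     <- c("GO:0003357", "GO:0006004", "GO:0003357",
                        "GO:0008594", "GO:0000001")
go.non.significant <- c("GO:0000001", "GO:0000002")

# Keep the terms that never occur among the non-significant regions.
exclusive <- go.significant[!(go.significant %in% go.non.significant)]

# Number of occurrences of each exclusive term, most frequent first.
occurrences <- sort(table(exclusive), decreasing = TRUE)
print(occurrences)
```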
Table 6.7: Genes which have all the 5 regions enriched: 3'UTR, 5'UTR, gene body, TSS200
and TSS1500.

Gene Id             Gene Name      Type            Known Role
ENSMUST00000179781  Bsg            Protein Coding  Reproduction, neural function
ENSMUST00000000127  Wnt3           Protein Coding  Primary axis formation in the mouse
ENSMUST00000026119  Gcgr           Protein Coding  Glucagon receptor
ENSMUST00000000128  Wnt9a          Protein Coding  Regulation of cell fate during embryogenesis
ENSMUST00000108375  Myo18a         Protein Coding  Golgi membrane trafficking
ENSMUST00000147875  Lyrm9          Protein Coding  Unknown
ENSMUST00000038696  Ppp1r9b        Protein Coding  Receiving signals from central nervous system
ENSMUST00000042779  Zbtb1          Protein Coding  Development of lymphocytes in mice
ENSMUST00000140770  Plekhd1        Protein Coding  Form and maintain the skeleton
ENSMUST00000021674  Fos            Protein Coding  Form and maintain the skeleton
ENSMUST00000075558  Hist1h3f       Protein Coding  Transcription regulation, DNA repair
ENSMUST00000049488  Serinc5        Protein Coding  Inhibiting an early step of viral infection
ENSMUST00000022952  Osr2           Protein Coding  Development of the palate
ENSMUST00000161785  Zfp41          Protein Coding  Meiosis in spermatogenesis
ENSMUST00000127208  Lrrc14         Protein Coding  Formation of protein-protein interaction
ENSMUST00000082439  Selo           Protein Coding  Uncharacterized protein
ENSMUST00000023150  1810013L24Rik  Protein Coding  Unknown function
ENSMUST00000056882  Olig1          Protein Coding  Formation of oligodendrocytes within the brain
ENSMUST00000063344  Lmf1           Protein Coding  Transport through the secretory pathway
ENSMUST00000013706  4833413E03Rik  Protein Coding  Unknown
ENSMUST00000038287  Dusp5          Protein Coding  Cellular proliferation and differentiation
ENSMUST00000066646  Rcor2          Protein Coding  Neurogenesis in the developing mouse brain
ENSMUST00000087215  Rqcd1          Protein Coding  Required for cell differentiation
ENSMUST00000065587  Ackr3          Protein Coding  Unknown
ENSMUST00000043760  Mvk            Protein Coding  Formation of cytoskeleton
ENSMUST00000046999  Abhd11         Protein Coding  Unknown
ENSMUST00000085591  Pdx1           Protein Coding  Necessary for pancreatic development
ENSMUST00000165164  Pcgf1          Protein Coding  Promotes cell progression and proliferation
ENSMUST00000071492  Fam136a        Protein Coding  Development of neurosensory epithelium
ENSMUST00000047621  Ppp1r13l       Protein Coding  Regulation of apoptosis and transcription
ENSMUST00000044111  Rras           Protein Coding  Organization of cytoskeleton
ENSMUST00000112588  Kdm5c          Protein Coding  Transcriptional repression of neuronal genes
[Figure 6.13: plot not reproduced. Panels: 3'UTR, 5'UTR, Gene Body; x-axis: GO term identifier;
y-axis: Occurrences; legend: names of the GO biological processes.]
Figure 6.13: Number of occurrences of biological processes found only among significant regions 3'UTR,
5'UTR and gene body (MABayApp score ≥ 0.9). These biological processes have not been found in any of
the non-significant regions (MABayApp score < 0.9).
[Figure 6.14: plot not reproduced. Panels: TSS1500, TSS200; x-axis: GO term identifier;
y-axis: Occurrences; legend: names of the GO biological processes.]
Figure 6.14: Number of occurrences of biological processes found only among significant regions TSS200
and TSS1500 (MABayApp score ≥ 0.9). These biological processes have not been found in any of the
non-significant regions TSS200 and TSS1500 (MABayApp score < 0.9).
[Figure 6.15: plot not reproduced. Panels: 3'UTR, 5'UTR, Gene Body; x-axis: GO term identifier;
y-axis: Occurrences; legend: names of the GO cellular components.]
Figure 6.15: Number of occurrences of cellular components found only among significant regions 3'UTR,
5'UTR and gene body (MABayApp score ≥ 0.9). These cellular components have not been found in any of
the non-significant regions (MABayApp score < 0.9).
[Figure 6.16: plot not reproduced. Panels: TSS1500, TSS200; x-axis: GO term identifier;
y-axis: Occurrences; legend: names of the GO cellular components.]
Figure 6.16: Number of occurrences of cellular components found only among significant regions TSS200
and TSS1500 (MABayApp score ≥ 0.9). These cellular components have not been found in any of the
non-significant regions TSS200 and TSS1500 (MABayApp score < 0.9).
[Figure 6.17: plot not reproduced. Panels: 3'UTR, 5'UTR, Gene Body; x-axis: GO term identifier;
y-axis: Occurrences; legend: names of the GO molecular functions.]
Figure 6.17: Number of occurrences of molecular functions found only among significant regions 3'UTR,
5'UTR and gene body (MABayApp score ≥ 0.9). These molecular functions have not been found in any of
the non-significant regions (MABayApp score < 0.9).
[Figure 6.18: plot not reproduced. Panels: TSS1500, TSS200; x-axis: GO term identifier;
y-axis: Occurrences; legend: names of the GO molecular functions.]
Figure 6.18: Number of occurrences of molecular functions found only among significant regions TSS200
and TSS1500 (MABayApp score ≥ 0.9). These molecular functions have not been found in any of the
non-significant regions TSS200 and TSS1500 (MABayApp score < 0.9).
Chapter 7
Conclusion
The results of applying our model showed that it is robust in detecting significant
peaks in ChIP-Seq datasets. The Bayesian method was less sensitive to changes in sample size, and
the meta-analysis corrected the number of significant peaks found by averaging the distribution of
the log-odds of each candidate peak over all the samples. In addition, the score used penalizes
candidate peaks whose enrichment is similar to that of regions of the genome with no candidate peak,
which acts as an extra filter for false-positive peaks. The model also has a very strong statistical
background, as detailed in Section 3, which gives high confidence for its application and for future
related research.
In order to make this solution available, we developed an application in the R language, which is
available upon request. This model should be used when the researcher is interested in validating
peaks called for ChIP-Seq datasets with multiple treatment and control samples, where the number
of treatment samples is not necessarily equal to the number of control samples. The model can
also be used to compare samples with different treatments; in this case, the investigator should
select one of the treatment samples as the control in order to find regions with significant
enrichment in the remaining treatment samples.
Our validation against another type of data from the same experiment, RNA-Seq, had a very satisfactory
result: the peaks considered significant by our model were also found to be significant when
using RNA-Seq, and regions of the genomic features TSS200/TSS1500, 3'UTR and 5'UTR
with significant enrichment in RNA-Seq also had high MABayApp scores. The analysis of RNA-Seq
differential expression versus MABayApp scores for the SOX gene family was also satisfactory,
since SOX genes with high log2 fold change in RNA-Seq expression and low p-values were also
found to be significant by our model. Finally, our model identified several enriched genomic regions
for the SOX family genes (Sox2 to Sox30), as expected, since this family of genes was
the focus of the investigators who provided the samples (McAninch and Thomas, 2014); and a Venn
diagram of the enriched genomic features showed that, for many genes, several genomic
features were enriched at the same time (gene body, 3'UTR, 5'UTR, TSS200, TSS1500).
1. Extrapolate the model to be used with more complex genomes, for example: the hybrid
sugarcane genome
3.1. different parameter values for the smoothing method already defined in Section 5.2
3.2. new kernel functions to define other smoothing methods
We believe this work has achieved its main goals: to develop a robust model with a strong
statistical background to find significant peaks in ChIP-Seq datasets, and to make it available to
researchers in the area of genomic studies. It also raises points that could be further
explored by other investigators in Bioinformatics and related areas.
Appendix A
R Code
#!/usr/bin/Rscript
rm(list = ls())

suppressPackageStartupMessages(library(pryr))
suppressPackageStartupMessages(library(pracma))
suppressPackageStartupMessages(library(Rcpp))
suppressPackageStartupMessages(library(inline))
suppressPackageStartupMessages(library(GenomicAlignments))
suppressPackageStartupMessages(library(GenomicRanges))
suppressPackageStartupMessages(library(parallel))

findaligmentCoverage <- function(chr.coverage, alignments.chunk){
  # [...] (lines marked this way are missing from the extracted listing)
  if (is.null(chr.coverage)) {
    # [...]
  } else {
    # [...]
  }
  rm(chunk.coverage)
  return(as.numeric(chr.coverage))
}

filteraligments <- function(alignments.chunk, peaks.gr){
  return(alignments.chunk[unique(queryHits(findOverlaps(alignments.chunk, peaks.gr)))])
}

empty.alignments <- function(alignments.chunk.list) {
  # [...]
  for (alignment.chunk.idx in 1:length(alignments.chunk.list))
    # [...]
    ]]) == 0L)
  return(is.empty)
}
# [...]
sum = sum + x[i];
"
# [...]

#############################
# output file
# [...] (lines marked this way are missing from the extracted listing)
bam.files.control.path.list <- list()
bam.files.treatment.path.list <- list()
# [...]

# NOTE: the option string "c" below probably lost a prefix character in extraction.
if ((length(args) < 8) ||
    (args[1] != "c") ||
    (length(pos.t) == 0) ||
    (length(pos.p) == 0) ||
    (length(pos.o) == 0))
{
  cat("usage: chipseq callpeaks.R c <control BAM file 1> <control BAM file 2> ...
  # [...]
} else {
  # [...]
  cat(paste(args[arg.control.idx], "\n"))
  # [...]
}
# [...]
cat(paste(args[arg.treatment.idx], "\n"))
# [...]
}
# [...]
for (arg.peak.idx in (pos.p+1):(pos.o-1))
{
  # [...]
}
cat("\n")
# [...]
cat("\n")

cat("Log file name:\n")
# [...]
cat("\n")

sink(paste(output.file.name, "peaks.log", sep=""))

##############################
# [...]
.list[[bam.file.c.idx]],
yieldSize=1000000)
}
# [...]
path.list[[bam.file.t.idx]],
yieldSize=1000000)
}
# [...]

##############################
# [...]

##############################
# [...]
treatment.chr.coverage.list <- vector("list", n.treatment.samples)

##############################
# [...]
treatment.chr.peaks.coverage.list <- vector("list", n.treatment.samples)

##############################
# [...]
file.idx]][,1],
ranges=IRanges(start=peak.files.list[[peak.file.idx]][,2],
               end=peak.files.list[[peak.file.idx]][,3]))
}

##############################
# union of peak genomic regions as parameter for reading BAM file
# [...]
if (length(peak.gr.list) > 1)
{
  # [...]
}
}

##############################
# [...]
all.peaks.gr.treatment.cov <- vector("list", n.treatment.samples)
##############################
# [...]

##############################
# [...] (lines marked this way are missing from the extracted listing)
open(bam.files.control.list[[control.sample.idx]])
}
# [...]
open(bam.files.treatment.list[[treatment.sample.idx]])
}
# [...]
repeat {
# [...]
}
# [...]
}

# stop reading BAM files if there is no chunk left to be read
# [...]
(empty.alignments(treatment.alignments.chunk.list)))
break

# filter peak alignments from total alignments
# [...]
control.alignments.chunk.list[[control.sample.idx]])
# [...]
all.peaks.gr)
# [...]
]],
control.alignments.peaks.chunk.list[[control.sample.idx]])
# [...]
} else {
# [...]
coverage(control.alignments.chunk.list[[control.sample.idx]])
}
}

treatment.alignments.peaks.chunk.list <- vector("list", n.treatment.samples)
# [...]
]],
treatment.alignments.chunk.list[[treatment.sample.idx]])
# [...]
all.peaks.gr)
# [...]
.idx]],
treatment.alignments.peaks.chunk.list[[treatment.sample.idx]])
# [...]
} else {
# [...]
coverage(treatment.alignments.chunk.list[[treatment.sample.idx]])
}
}
# [...]
rm(control.alignments.chunk.list)
rm(control.alignments.peaks.chunk.list)
rm(treatment.alignments.chunk.list)
rm(treatment.alignments.peaks.chunk.list)
}
# [...]
##############################
# [...]
close(bam.files.control.list[[control.sample.idx]])
}
rm(bam.files.control.list)
# [...]
close(bam.files.treatment.list[[treatment.sample.idx]])
}
rm(bam.files.treatment.list)
# [...]

##############################
# [...]
}
# [...]
idx]]
}
# [...]

##############################
# [...]

all.peaks.control.n <- vector("list", n.control.samples)
# [...]
}

all.peaks.treatment.n <- vector("list", n.treatment.samples)
# [...]
}

##############################
296 }
300 }
301
302 c a t ( " Find noise mean and v a r i a n c e . \ n" )
303 mu . c o n t r o l . n o i s e . l i s t <
304 lapply ( vector (" l i s t " , n . c o n t r o l . samples ) , function (x) x< v e c t o r ( " l i s t " ,
length ( chr . len ) ) )
307
308 mu . t r e a t m e n t . n o i s e . l i s t <
309 lapply ( vector (" l i s t " , n . treatment . samples ) , function (x) v e c t o r ( " l i s t " ,
x<
312
313 c a t ( " Find noise mean and variance c o n t r o l . \ n" )
317 ################################################
])
324 ################################################
330 ################################################
336 mu . c o n t r o l . n o i s e . l i s t [ [ c o n t r o l . s a m p l e . i d x ] ] [ [ c h r . i d x ] ] <
337 digamma ( a l p h a . c o n t r o l . p o s t e r i o r )
338 digamma ( b e t a . c o n t r o l . p o s t r i o r )
341 trigamma ( b e t a . c o n t r o l . p o s t r i o r )
342 }
343 }
344
345 c a t ( " Find noise mean and variance t r e a t m e n t . \ n" )
349 ################################################
355 as . numeric ( treatment . chr . peaks . c o v e r a g e . l i s t [ [ treatment . sample . idx ] ] [ chr
. idx ] )
356 ################################################
362 ################################################
368 mu . t r e a t m e n t . n o i s e . l i s t [ [ t r e a t m e n t . s a m p l e . i d x ] ] [ [ c h r . i d x ] ]
<
369 digamma ( a l p h a . t r e a t m e n t . p o s t e r i o r )
370 digamma ( b e t a . t r e a t m e n t . p o s t e r i o r )
373 trigamma ( b e t a . t r e a t m e n t . p o s t e r i o r )
374 }
375 }
376
377 ###################################################
chr.data.control[[control.sample.idx]]$mu.control.noise.list <-
  mu.control.noise.list[[control.sample.idx]]
}

cat("chr data treatment.\n")
chr.data.treatment <- vector("list", n.treatment.samples)
chr.data.treatment[[treatment.sample.idx]]$all.peaks.gr.treatment.cov <-
  all.peaks.gr.treatment.cov[[treatment.sample.idx]]
chr.data.treatment[[treatment.sample.idx]]$treatment.chr.peaks.coverage.list <-
  treatment.chr.peaks.coverage.list[[treatment.sample.idx]]
chr.data.treatment[[treatment.sample.idx]]$mu.treatment.noise.list <-
  mu.treatment.noise.list[[treatment.sample.idx]]
chr.data.treatment[[treatment.sample.idx]]$var.treatment.noise.list <-
  var.treatment.noise.list[[treatment.sample.idx]]
}

cat("Start mclapply...\n")
chr.current)]
")
# Child
}
parallel:::mcexit()
}

result <- mclapply(peaks.list, mc.cores = detectCores(), function(peak)
{
writeBin(1/n.peaks, f)
##################################################
mu.control.peak.list <-
mapply(function(x, y) {
(abs(x$mu.control.noise.list[[peak.chr.idx]]))},
x=chr.data.control, y=total.peak.control.alignments)
(trigamma(y+peak.width)+
x=chr.data.control, y=total.peak.control.alignments)
c(l.inf=(x - 5*sqrt(y)),
l.sup=(x + 5*sqrt(y)))},
##################################################
]])

mu.treatment.peak.list <-
mapply(function(x, y) {
(abs(x$mu.treatment.noise.list[[peak.chr.idx]]))},
x=chr.data.treatment, y=total.peak.treatment.alignments)
(trigamma(y+peak.width)+
x=chr.data.treatment, y=total.peak.treatment.alignments)
c(l.inf=(x - 5*sqrt(y)),
l.sup=(x + 5*sqrt(y)))},
##################################################
###################################################
###################################################
sd=sqrt(var.control.peak.list[[control.sample.idx]]))}
###################################################
idx) {
]]),
sd=sqrt(var.treatment.peak.list[[treatment.sample.idx]]))}
###################################################
x=control.chr.coverage.list,
y=mu.control.peak.list,
z=var.control.peak.list)
.chr.idx]
}
###################################################
x=treatment.chr.coverage.list,
y=mu.treatment.peak.list,
z=var.treatment.peak.list)
peak.chr.idx]
}
###################################################
values){
control.values)}
try(integral2(f=product.f.logoods,
logodds.limits$min,
logodds.limits$max,
logodds.limits$min,
logodds.limits$min,
logodds.limits$max,
logodds.limits$min,
}
###################################################
p <- prob.logoods.treatment.gr.control$Q
###################################################
c(peak.seqname, start(peak.range), end(peak.range), p)
})
result
})
close(f)
result.peaks
})

tf <- proc.time()
peaks, byrow=T))
write.table(df.significant.peaks,
cat("End... writing data frame", chr.current, "...\n")
}
sink()
}
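As a reading aid for the noise-estimation loops in the listing above, which pair `digamma` and `trigamma` calls on the posterior parameters, the block below states the standard moments of the log-odds of a Beta-distributed proportion. It is a sketch under assumptions: the symbols p, \alpha, \beta, \mu and \sigma^2 are introduced here for illustration, several intermediate lines of the listing are missing, and so the exact way the listing combines these terms (for example the `abs(...)` and `peak.width` adjustments) is not reproduced.

```latex
% Assumption: p ~ Beta(\alpha, \beta), with \alpha, \beta the posterior parameters.
% \psi is the digamma function and \psi_1 the trigamma function.
\begin{align}
  \mu      &= \mathbb{E}\!\left[\log\frac{p}{1-p}\right]
            = \psi(\alpha) - \psi(\beta), \\
  \sigma^2 &= \operatorname{Var}\!\left[\log\frac{p}{1-p}\right]
            = \psi_1(\alpha) + \psi_1(\beta).
\end{align}
% Integration limits of the form used for l.inf and l.sup in the listing,
% read here as five standard deviations on either side of the mean:
\begin{equation}
  [\,\mu - 5\sigma,\; \mu + 5\sigma\,]
\end{equation}
```

Under a normal approximation to the log-odds, an interval of five standard deviations on each side of the mean contains essentially all of the probability mass, which is presumably why the listing can integrate over finite limits.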
Bibliography
J Aitchison and Sheng M Shen. Logistic-normal distributions: Some properties and uses. Biometrika,
67(2):261-272, 1980. 24, 38
Tanya Barrett, Stephen E Wilhite, Pierre Ledoux, Carlos Evangelista, Irene F Kim, Maxim Tomashevsky,
Kimberly A Marshall, Katherine H Phillippy, Patti M Sherman, Michelle Holko, et al.
Ncbi geo: archive for functional genomics data sets-update. Nucleic acids research, 41(D1):D991-D995,
2013. 53
Maria Bergsland, Daniel Ramsköld, Cécile Zaouter, Susanne Klum, Rickard Sandberg, and Jonas
Muhr. Sequentially acting sox transcription factors in neural lineage development. Genes &
development, 25(23):2453-2464, 2011. 53
Wesley Bylsma. Approximating smooth step functions using partial fourier series sums. Technical
report, DTIC Document, 2012. 48
Asif T Chinwalla, Lisa L Cook, Kimberly D Delehaunty, Ginger A Fewell, Lucinda A Fulton,
Robert S Fulton, Tina A Graves, LaDeana W Hillier, Elaine R Mardis, John D McPherson, et al.
Initial sequencing and comparative analysis of the mouse genome. Nature, 420(6915):520-562,
2002. 2
Carlos Alberto de Bragança Pereira and Julio Michael Stern. Special characterizations of standard
discrete models. REVSTAT-Statistical Journal, 6(3):199-230, 2008. 24, 38
Morris H DeGroot et al. Probability and statistics. Number 04; QA273, D4 1986. 1986. 1
Sandra Deliard, Jianhua Zhao, Qianghua Xia, and Struan FA Grant. Generation of high quality
chromatin immunoprecipitation dna template for high-throughput sequencing (chip-seq). JoVE
(Journal of Visualized Experiments), (74):e50286-e50286, 2013. 5
Ron Edgar, Michael Domrachev, and Alex E Lash. Gene expression omnibus: Ncbi gene expression
and hybridization array data repository. Nucleic acids research, 30(1):207-210, 2002. 8, 53, 54,
56
Thomas S Ferguson. A bayesian analysis of some nonparametric problems. The annals of statistics,
pages 209-230, 1973. 13, 38, 40
BA Frigyik, A Kapila, and MR Gupta. Introduction to the dirichlet distribution and related
processes, university of washington technical report. Technical report, UWEETR-2010-0006,
2010. 13
Sven Heinz, Christopher Benner, Nathanael Spann, Eric Bertolino, Yin C Lin, Peter Laslo, Jason X
Cheng, Cornelis Murre, Harinder Singh, and Christopher K Glass. Simple combinations of
lineage-determining transcription factors prime cis-regulatory elements required for macrophage and b
cell identities. Molecular cell, 38(4):576-589, 2010. 2
Valerie Hower, Steven N Evans, and Lior Pachter. Shape-based peak identification for chip-seq.
BMC bioinformatics, 12(1):15, 2011. 1, 2, 3
Ian R James and James E Mosimann. A new characterization of the dirichlet distribution through
neutrality. The Annals of Statistics, pages 183-189, 1980. 40
Raja Jothi, Suresh Cuddapah, Artem Barski, Kairong Cui, and Keji Zhao. Genome-wide identification
of in vivo protein-dna binding sites from chip-seq data. Nucleic acids research, 36(16):
5221-5231, 2008. 1
Daehwan Kim, Geo Pertea, Cole Trapnell, Harold Pimentel, Ryan Kelley, and Steven L Salzberg.
Tophat2: accurate alignment of transcriptomes in the presence of insertions, deletions and gene
fusions. Genome biology, 14(4):1, 2013. 8
Ben Langmead and Steven L Salzberg. Fast gapped-read alignment with bowtie 2. Nature methods,
9(4):357-359, 2012. 8, 47
Heng Li, Bob Handsaker, Alec Wysoker, Tim Fennell, Jue Ruan, Nils Homer, Gabor Marth, Goncalo
Abecasis, Richard Durbin, et al. The sequence alignment/map format and samtools. Bioinformatics,
25(16):2078-2079, 2009. 8, 58
Dennis V Lindley. The bayesian analysis of contingency tables. The Annals of Mathematical
Statistics, pages 1622-1643, 1964. 24
Dale McAninch and Paul Thomas. Identification of highly conserved putative developmental enhancers
bound by sox3 in neural progenitors using chip-seq. PloS one, 9(11):e113361, 2014. 8,
53, 56, 77
Peter J Park. Chip-seq: advantages and challenges of a maturing technology. Nature Reviews
Genetics, 10(10):669-680, 2009. 1
Cátia Petri. Relação entre níveis de significância Bayesiano e freqüentista: e-value e p-value em
tabelas de contingência. PhD thesis, Universidade de São Paulo, 2007. 24
Gordon Robertson, Jacqueline Schein, Readman Chiu, Richard Corbett, Matthew Field, Shaun D
Jackman, Karen Mungall, Sam Lee, Hisanaga Mark Okada, Jenny Q Qian, et al. De novo assembly
and analysis of rna-seq data. Nature methods, 7(11):909-912, 2010. 8
Kate R Rosenbloom, Joel Armstrong, Galt P Barber, Jonathan Casper, Hiram Clawson, Mark
Diekhans, Timothy R Dreszer, Pauline A Fujita, Luvina Guruvadoo, Maximilian Haeussler, et al.
The ucsc genome browser database: 2015 update. Nucleic acids research, 43(D1):D670-D681,
2015. 8, 53
Christiana Spyrou, Rory Stark, Andy G Lynch, and Simon Tavaré. Bayespeak: Bayesian analysis
of chip-seq data. BMC bioinformatics, 10(1):1, 2009. 3
Julio Michael Stern. Cognitive constructivism and the epistemic significance of sharp statistical
hypotheses. Tutorial book for MaxEnt, pages 611, 2008. 1
Reuben Thomas, Sean Thomas, Alisha K Holloway, and Katherine S Pollard. Features that define
the best chip-seq peak calling algorithms. Briefings in bioinformatics, page bbw035, 2016. 1
Cole Trapnell, Adam Roberts, Loyal Goff, Geo Pertea, Daehwan Kim, David R Kelley, Harold
Pimentel, Steven L Salzberg, John L Rinn, and Lior Pachter. Differential gene and transcript
expression analysis of rna-seq experiments with tophat and cufflinks. Nature protocols, 7(3):
562-578, 2012. 8, 62
Elizabeth G Wilbanks and Marc T Facciotti. Evaluation of algorithm performance in chip-seq peak
detection. PloS one, 5(7):e11471, 2010. 1
Qian Wu, Kyoung-Jae Won, and Hongzhe Li. Nonparametric tests for differential histone enrichment
with chip-seq data. Cancer informatics, 14(Suppl 1):11, 2015. 1
Federico Zambelli, Graziano Pesole, and Giulio Pavesi. Motif discovery and transcription factor
binding sites before and after the next-generation sequencing era. Briefings in bioinformatics,
page bbs016, 2012. 1
Chongzhi Zang, Dustin E Schones, Chen Zeng, Kairong Cui, Keji Zhao, and Weiqun Peng. A
clustering approach for identification of enriched domains from histone modification chip-seq
data. Bioinformatics, 25(15):1952-1958, 2009. 2
Yong Zhang, Tao Liu, Clifford A Meyer, Jérôme Eeckhoute, David S Johnson, Bradley E Bernstein,
Chad Nusbaum, Richard M Myers, Myles Brown, Wei Li, et al. Model-based analysis of chip-seq
(macs). Genome Biol, 9(9):R137, 2008. 1, 2, 53, 54