
Bárbara Domingues Bitarello

"Seleção balanceadora no genoma humano: relevância biológica e consequências deletérias"

"Balancing selection in the human genome: biological relevance and deleterious consequences"

Diogo Meyer
Advisor

São Paulo
2016

Thesis presented to the Instituto de Biociências of the Universidade de São Paulo to obtain the degree of Doctor of Sciences in the area of Biology (Genetics).

Catalog Record

Domingues Bitarello, Bárbara
"Seleção balanceadora no genoma humano: relevância biológica e consequências deletérias".
299 pages.
Thesis (Doctorate) - Instituto de Biociências da Universidade de São Paulo. Departamento de Genética e Biologia Evolutiva.

1. Molecular Evolution;
2. Human Evolution;
3. Balancing Selection;
4. Adaptive Evolution;
5. Population Genetics;
6. Population Genomics;
7. Genetic Load

I. Universidade de São Paulo. Instituto de Biociências. Departamento de Genética e Biologia Evolutiva.

Examination Committee:

Prof. Dr. Prof. Dr.

Prof. Dr. Prof. Dr.

Prof. Dr. Diogo Meyer

In memory of Maria Gabriela Duarte Macêdo (Kikita).

“We speak not only to tell other people what we think, but to tell ourselves
what we think. Speech is a part of thought.”
– Oliver Sacks

Acknowledgments

To my timeless friends, whom I hardly ever see but who have cheered me on at every stage of my academic life so far: Laura Prado, Marcela Combat, Poliana Cardoso, Denise Nogueira, Luciana Matta, Ramon Vitral. I want to thank Laura Prado for listening to me and for giving me many useful tips in the final months of the doctorate.

I have had the privilege of working in a very pleasant environment (the Porão). I thank Pato (Guilherme Garcia) for giving me several tips on formatting the thesis and for sharing his LaTeX template (and Débora Brandt as well): thanks to you I was able to develop exactly what I wanted without wasting time. I thank Daniela Rossoni, Bárbara Tafinha, Ana Paula Assis, and Anna Penna for their friendship. To my colleague Gustavo Franca, for giving me many tips near the end of the doctorate, not to mention the positive reinforcement. I thank Diog(R)o Melo for helping me with Linux issues. Finally, I am grateful for the opportunity to work with everyone in the groups of Professor Gabriel Marroig and Professor Tatiana Torres, as well as the other groups that take part in the Evolução no Porão meetings.

To my colleagues in the Evolutionary Genetics group: I am very grateful to all of you. It was a pleasure to work with a group as collaborative as ours and to see how much we were able to grow together. To Vitor Aguiar, for listening to me a lot (really a lot) and for always being so kind. To Jônatas, for his generosity with his scripts and for introducing me to the Tia das Massas. I am grateful for how much he helped me become a better programmer and for his enormous contribution to the analyses in Chapter 2. I thank Limão for encouraging my musical hobbies and for helping to find errors in the data used in Chapter 2. To Maria Helena Maia, who was a sister to me during my master's and the beginning of my doctorate.

I especially thank Kelly Nunes and Débora Brandt. Kelly, for being my wise on-call postdoc and for always conveying calm and enthusiasm when I needed it. Thank you especially for reading many parts of the thesis with care and for helping me a great deal to improve it. Débora Brandt I want to thank for her friendship, her generosity, and how much she helped me with the quality of the data I analyzed, in addition to her help correcting several passages of the thesis. Some other people I would like to thank: Caroline Lima, Carolina Malcher, Rodrigo dos Santos Francisco.

To the friends/colleagues/collaborators I made in Leipzig: Cesare de Filippo, João Teixeira, Michael Dannemann, André Strauss, Sandra Oliveira, Diana Le-Duc, Fabrizio Manfezzoni, Felix Key, Petra Korlevic. Not only was I able to learn from you, but you also made my stay in Leipzig better. To Stéphane Peyregné, who revealed to me that working while listening to Disney soundtracks (thank you too, Disney) increases productivity. To Annalisa Schmidt, for being a very present friend during my year in Leipzig. It would have been much less fun without you there.

I would especially like to thank the researcher Aida Andrés. I spent a year with her group at the Max Planck Institute for Evolutionary Anthropology, where I learned more than I could have foreseen. I am especially grateful for the trust she placed in me from the start, and also for her calm. At this stage of the work I effectively had two advisors. That is a privilege not every doctoral student has, and I am very grateful.

I would like to thank some professors and/or members of my qualifying examination committee, who I believe contributed greatly to my training throughout graduate school: Tatiana T. Torres, Walter A. Neves, Gabriel Marroig, Paulo Otto. I am grateful to Professor Eduardo Tarazona, who introduced me to Diogo Meyer.

To my advisor, Diogo Meyer, thank you for helping me learn as much as possible over these seven (!) years of graduate school at USP, and for the opportunities you gave me. I arrived here not knowing much, except that I wanted to study human population genetics, and you made it possible for me to study exactly what I wanted. Completing this stage of my training is a "dream" I have nurtured since I was very young, and you were a very important person along this path.

To Ale Chris, on whom I could count absolutely every time I needed to. I am eternally grateful to have her as a friend, and for her enormous generosity. I also thank Gisele Melo for her friendship.

To Klervia Jaouen, thank you for your unshakable confidence in me, for your patience and understanding, and for your willingness to help me, always, whether by listening to presentation rehearsals or by reading what I wrote (and all of that in Portuguese, your third, and still incipient, language). Thank you for making me happy and making me want to be a better person, always.

To my grandmother, Tê, for all the support she has always given me. To my parents, Bia and Flávio, and to my second parents, Beth and Joe: thank you for being great examples to me, all of you, and for always encouraging me to pursue this career. I thank my mother for reading and commenting on the introduction (I know it was not easy) and for being understanding and always helping me deal with academic life. Finally, I thank my sister, who was very understanding of my need for seclusion in recent months and always kept up a comforting interest in scientific and nerdy things. To everyone I may have forgotten to thank, thank you.

Finally, I thank the Fundação de Amparo à Pesquisa do Estado de São Paulo (FAPESP) for funding my doctorate, including the period I spent in Leipzig.

Summary

Balancing selection is an evolutionary process that encompasses several mechanisms: heterozygote advantage, frequency-dependent selection, selective pressures that vary over time or space, and some cases of pleiotropy. The study of these mechanisms has been, and still is, a topic of great interest to evolutionary biologists, and it shaped the study of evolution throughout the last century. Before the neutral theory was proposed, balancing selection was believed to be common. The discovery that much of the observed genetic diversity could be explained by neutral evolution therefore motivated a better understanding of balancing selection as a selective regime capable of maintaining advantageous variants in populations.

In its early days, the study of balancing selection was restricted to organisms that could be manipulated in the laboratory. With the advent of methods that made it possible to quantify genetic variability (such as protein electrophoresis, small-scale sequencing, and genomic re-sequencing of thousands of individuals), human genetic variability began to be actively studied and interpreted. Several studies searched for signatures of natural selection, i.e., patterns of genomic variation left by such selective regimes, and assessed their significance by comparing them with what would be expected under a strictly neutral scenario. Most of these efforts concentrated on positive selection, regarded as the main mechanism responsible for adaptive evolution.

Few studies have searched for signatures of balancing selection in the human genome. This is partly due to the scarcity of high-powered methods for detecting such signatures. In addition, previous studies either did not analyze genome-scale data or concentrated mainly on protein-coding regions. Here, we describe a simple, high-powered method for detecting signatures of balancing selection. In humans, this method outperforms others commonly used to detect such signatures and could, in theory, be used to detect them in other species, provided that its power is evaluated case by case through neutral simulations. Our method ("Non-Central Deviation", NCD) comes in two versions: NCD2, which requires information on the polymorphisms of the species analyzed and on the substitutions between that species and an outgroup, and NCD1, which requires only information on the polymorphisms of the species analyzed. Although NCD2 outperforms NCD1 in humans, the latter can be used for species for which no outgroup information is available.

When we applied NCD2 to human data, using chimpanzee as the outgroup, we found more than 200 protein-coding genes with a strong signature of balancing selection, only 1/3 of which had prior evidence of balancing selection. We also found an enrichment for several gene ontology categories, about half of which are related to immunity. Among the genes with evidence of balancing selection, we found an excess of cases of preferential expression in tissues such as "adrenal" and "lung", as well as an excess of genes with mono-allelic expression. Overall, the selected regions of the human genome include both coding and regulatory sites. We did not find an excess of balancing selection signatures in regulatory regions, contrary to what other studies have reported. Finally, we found an excess of nonsynonymous relative to synonymous polymorphisms in the selected genes.

Having documented the occurrence of balancing selection in the human genome and identified genes that were potentially targets of this selective regime, we investigated the evolutionary consequences of this process. We started from the hypothesis that balancing selection on a site reduces the efficiency with which purifying selection eliminates deleterious variants at neighboring sites. This process is a consequence of how selection at one locus affects, through genetic linkage, the frequencies of adjacent non-neutral sites. We tested this hypothesis by examining whether genes under balancing selection show an excess of deleterious variants relative to expectations derived from the rest of the genome. Using three different metrics to determine whether and/or how deleterious a given variant is, we identified an excess of deleterious variants within genes under balancing selection, and we show that this pattern cannot be attributed to confounding effects. This finding shows that, alongside the benefits associated with adaptive variation, balancing selection increases the burden of deleterious mutations in the human genome.

Overall, our findings suggest that balancing selection probably maintains genetic variants involved in a myriad of biological processes beyond immunity and that it has been more common in the human genome than previously believed, affecting between 1% and 8% of protein-coding genes as well as several non-coding regions. In addition, balancing selection appears to be important for human evolution not only because of its effect on fitness, but also because it has been an important force shaping the genetic diversity currently observed in humans and susceptibility to disease.

Keywords: molecular evolution, human evolution, balancing selection, adaptive evolution, population genetics, population genomics, genetic load

Abstract

Balancing selection is an evolutionary process that encompasses several mechanisms: heterozygote advantage, negative frequency dependent selection, selective pressure that fluctuates in time or in space, and some cases of pleiotropy. The study of these mechanisms per se has been and still is a topic of great interest for evolutionary biologists, and has shaped the study of evolution throughout the last century. Before the proposition of the neutral theory of molecular evolution, it was believed that balancing selection was pervasive. The realization that much of the observed genetic diversity could be explained by neutral evolution thus motivated a better understanding of balancing selection as a selective regime capable of maintaining adaptive variants in populations.

The study of balancing selection, in its early stages, was restricted to organisms that could be manipulated in the laboratory. With the advent of methods that allowed quantification of genetic variation – such as protein electrophoresis, small scale sequencing and genome-wide re-sequencing of thousands of individuals – human variation started to be actively studied and interpreted. Several studies have looked for signatures of natural selection – i.e., patterns of genomic variation that selective regimes leave in the genome – and evaluated their significance by comparing them to what would be expected under a strictly neutral scenario. Most of these efforts focused on the study of positive selection, thought of as the prime mechanism responsible for adaptive evolution.

Only a few studies looked for signatures of balancing selection in the human genome. This is partially due to the paucity of powerful methods to detect its signatures. Moreover, previous studies either did not analyze data on genomic scale or focused primarily on protein-coding regions. Here, we describe a powerful and simple method to detect signatures of balancing selection. In humans, it outperforms other methods commonly used to detect such signatures and could in theory be used for other species, provided that its power is evaluated for each species through neutral simulations. Our method ("Non-Central Deviation", NCD) has two versions: NCD2, which requires polymorphism information on the ingroup species, as well as divergence information between the ingroup and an outgroup species, and NCD1, which only requires the ingroup information. Although NCD2 is more powerful for humans, NCD1 can be used for species that lack information from an outgroup.

When applying NCD2 to human data, using chimpanzee as the outgroup, we found more than 200 protein-coding regions with strong signatures of balancing selection, only 1/3 of which had prior evidence for balancing selection. There was also an enrichment for several gene ontology categories, approximately half of which are related to immunity. We also found that among genes with evidence for balancing selection there was an excess of cases of preferential expression in specific tissues, such as "adrenal" and "lung", and an excess of genes with mono-allelic expression. Overall, we found that selected regions of the genome include both coding and regulatory sites. We failed to find a marked excess of balancing selection in regulatory regions, as reported in previous studies. Finally, we found an excess of nonsynonymous versus synonymous polymorphisms within the selected genes.

Having documented the occurrence of balancing selection in the human genome and identified genes which were potential targets of this selective regime, we next investigated evolutionary consequences of this process. We hypothesized that balancing selection acting on a site reduces the efficiency with which purifying selection purges deleterious variants at nearby sites. This process is a consequence of how the dynamics of selection at one locus, mediated by linkage, can interfere with the frequencies of adjacent non-neutral sites. We tested this hypothesis by examining if the genes under balancing selection show an excess of deleterious variants with respect to expectations derived from the remainder of the genome. Using three different metrics to determine deleteriousness, we identified a significant excess of deleterious variants within balanced genes, and we show that this pattern cannot be attributed to confounding factors. This finding shows that together with the benefits associated with adaptive variation, balancing selection is increasing the burden of deleterious mutations in the human genome.

Overall, our findings suggest that balancing selection likely maintains variation in a myriad of biological processes other than immunity and that it has been more common in the human genome than previously thought, affecting between 1-8% of human protein-coding genes, as well as a number of non-protein coding regions. Moreover, balancing selection appears to be important to human evolution not only because of its influence on fitness, but also because it has been an important force shaping current human genetic diversity and susceptibility to disease.

Keywords: molecular evolution, human evolution, balancing selection, adaptive evolution, population genetics, population genomics, genetic load

Contents

Prologue

General Introduction
  Balancing Selection: concept, mechanisms, and importance
    Why study the mechanisms that maintain genetic variability?
    Neutral theory of molecular evolution
    Mechanisms for maintaining adaptive diversity
  Adaptive evolution in the human genome
    Signatures of balancing selection
    Balancing selection in the human genome
  Genetic load induced by balancing selection
    Genetic load
  Relevance, Questions & Hypotheses
    Relevance
    Questions & Hypotheses

Bibliography

1 Searching for targets of balancing selection in the human genome
  Initial Considerations
  Introduction
  Results
    NCD Method
    Power of the NCD statistics to detect LTBS
    Identifying signatures of LTBS in the human genome
    Reliability of significant and outlier windows
    Non-random distribution across the genome
    The biological pathways influenced by LTBS
    Overlap of significant windows across populations
    The putative function of balanced SNPs
    The top candidate genes
  Discussion
    NCD Method
    Pervasiveness of LTBS in the human genome
    Protein-coding and intergenic targets
    The frequency of the balanced allele(s)
    The candidate genes
  Conclusions
  Materials and Methods
    Simulations
    Power analyses
    Human population genetic data and filtering
    Identifying signatures of LTBS
    Enrichment Analyses

References
  Supplementary Text
    S1 Text: Additional analyses for significant and outlier windows and genes
    S2 Text: A set of significant genes
    S3 Text: Manual verification of reliability of SNPs contained in four of the outlier genes
  Supplementary Tables
  Supplementary Figures

2 Accumulation of deleterious mutations in genes that were targets of long-term balancing selection in humans
  Initial Considerations
  Introduction
  Methods
    Population datasets
    Targets of balancing selection
    Annotation
    Quantifying genetic load
    Re-sampling control SNPs
  Results
    The site frequency spectrum of balanced genes
    Measures of deleteriousness correlate negatively with allelic frequency
    Extreme values for HLA SNPs
    Increased nonsynonymous to synonymous SNPs in balanced genes
    Increased proportion of damaging to synonymous SNPs in balanced genes
    Increased C-scores in balanced genes
  Discussion and Conclusions
    Increased genetic load in balanced genes
    The challenges of quantifying genetic load
    Sheltered load and hitch-hiking

References

Final Considerations and Perspectives
  Balancing selection in the human genome
    Development and evaluation of a new method for detecting signatures of balancing selection
    Prevalence of LTBS in the human genome
    Sharing across continents
    Characteristics of the candidate regions
  Deleterious variation in regions and genes with signatures of LTBS
  Perspectives
    Reconciling signatures of selection and phenotypes
    Potential of the NCD statistics in future studies

Bibliography

Appendices
  Appendix A.1
  Appendix A.2
  Appendix A.3
  Appendix A.4

Prologue

There are several sources of evidence for the action of natural selection in the human genome. Natural selection can be directional, increasing or decreasing the frequency of advantageous or deleterious variants (positive or negative selection, respectively), or balancing. Positive selection has been widely investigated for at least two decades in the form of "genomic scans" and is seen as the main mechanism of adaptive evolution. It is estimated that between 2% and 14% of the human genome has been a target of this selective regime, on various time scales.

Unlike positive selection, which reduces genetic diversity, balancing selection maintains genetic diversity in populations, and, despite its relevance, few studies to date have explored its targets in the human genome.

Here, I set out to explore the importance of balancing selection throughout human evolution: its targets in the genome, the properties of those targets, and the possible deleterious consequences caused in the neighborhood of balanced polymorphisms.

First, I sought to produce as complete a survey as possible of the targets of balancing selection in the human genome. To that end, we developed a new statistical tool (NCD, Non-Central Deviation), optimized for detecting signatures of balancing selection in humans. This tool was applied to genomic data from four populations in order to map the action of this selective regime in humans: its targets, their biological and genomic properties, and the differences among populations and continents (Chapter 1).

Second, we sought to test the hypothesis that balancing selection, by maintaining polymorphisms over long time scales (millions of years), would have a deleterious effect on sites close to the selected site(s). This hypothesis stems from an extensive literature on the accumulation of deleterious mutations in regions neighboring targets of positive selection and on genetic load in humans, and also from the fact that many genes under balancing selection appear to be associated with complex diseases. In addition, there is evidence that such accumulation occurs around the HLA genes that were targets of balancing selection. To explore these questions, the first chapter of the thesis deals with detecting the targets of balancing selection in the human genome, and the second with studying the effects of balancing selection on neighboring regions of the genome.

In addition, I present a general introduction to the themes of the two chapters and a final discussion of the implications of the findings of the two studies. The thesis is thus divided into:

• General Introduction: an introduction to (a) the selective regime known as balancing selection (definition, historical importance for population genetics, mechanisms through which it acts, properties of the selective regime, and evolutionary importance) and (b) the concept of genetic load and the deleterious effects that natural selection leaves in regions genetically linked to the targets of selection.

• First chapter: in which I present a new statistical method ("NCD"), focused specifically on detecting genomic signatures left by balancing selection regimes that persist for millions of years, as well as the results of a genomic scan carried out with this method on real human population genetic data.

• Second chapter: in which I present an investigation of the accumulation of deleterious mutations in regions that were targets of balancing selection in humans, as detected in the genomic scan presented in the first chapter.

• Final Considerations and Perspectives: in which I discuss the main findings of the two main chapters of the thesis, as well as the prospects for future work that may follow from our contributions.

In the appendix I have included publications that resulted from collaborations carried out during the doctorate. Three of them (Appendices A.1, A.2, A.4) have as their main theme the HLA genes, classic examples of targets of balancing selection; two focus on the signatures of balancing selection in these genes (A.2 and A.4), and the other on methodological problems with next-generation sequencing for this region of the genome (A.1). Finally, the work in Appendix A.3 deals with adaptive evolution more broadly, with a focus on tests of positive selection.
General Introduction

Balancing Selection: concept, mechanisms, and importance

Understanding how the phenotypic variations observed in natural populations arose and were maintained, as well as their possible functional implications, is one of the central goals of population genetics (Kimura, 1983; Dobzhansky, 1937). It is therefore not surprising that elucidating the mechanisms by which genetic variability at the molecular level is maintained in populations is a "big question": perhaps the most important problem in the field of evolutionary biology and population genetics (Kimura, 1983), marked by heated debates (Figure 1).

Why study the mechanisms that maintain genetic variability?

"Hybrid vigor", or heterosis, has been observed for centuries. Mendel noticed that the hybrid peas in his studies had a greater mean height than those of the parental lines (Crow, 1987), and Darwin wrote an entire book on hybrid vigor in plants (Darwin, 1876)[1]. Although it was frequently observed by plant and animal breeders, this phenomenon could only be correctly interpreted after the rediscovery of Mendel's laws at the beginning of the 20th century (reviewed in Crow, 1987), when it was definitively established that genetic variation is one of the determinants of phenotypic variation.

Figure 1: Timeline of the study of balancing selection. Adapted from Gloss and Whiteman (2016), summarizing the main theoretical and empirical contributions to understanding the importance of balancing selection for the maintenance of genetic variation. LTBS (SBLP in Portuguese), long-term balancing selection (see text).

In 1922, Fisher noted that, although hybrid vigor was a fact, the biological reason why any given heterozygote would be fitter than its corresponding homozygote was unclear. Fisher was the first to demonstrate how an equilibrium of frequencies between two alleles can be maintained at a locus under heterozygote advantage (see page 13). He further proposed that this phenomenon should be common in nature and capable of explaining both hybrid vigor and the deleterious effects sometimes observed in domesticated animals subjected to inbreeding.
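
The stable equilibrium under heterozygote advantage can be illustrated with the standard one-locus, two-allele viability selection model. The sketch below is only an illustration and is not taken from the thesis: the fitness parameterization (AA = 1 - s, Aa = 1, aa = 1 - t, with s, t > 0), the function names, and the example values are all assumptions made here. Under these assumptions, the frequency of allele A is expected to approach t / (s + t) from any starting frequency strictly between 0 and 1.

```python
def next_freq(p, s, t):
    """Frequency of allele A after one generation of viability selection.

    Assumed (illustrative) genotype fitnesses: AA = 1 - s, Aa = 1, aa = 1 - t.
    """
    q = 1.0 - p
    mean_fitness = p * p * (1.0 - s) + 2.0 * p * q + q * q * (1.0 - t)
    # A alleles come from AA homozygotes and from half of the alleles
    # carried by heterozygotes (2pq / 2 = pq), weighted by genotype fitness.
    return (p * p * (1.0 - s) + p * q) / mean_fitness


def iterate(p0, s, t, generations):
    """Apply the selection recursion repeatedly, starting from frequency p0."""
    p = p0
    for _ in range(generations):
        p = next_freq(p, s, t)
    return p


def equilibrium(s, t):
    """Internal equilibrium frequency of A when the heterozygote is fittest."""
    return t / (s + t)


if __name__ == "__main__":
    s, t = 0.1, 0.3  # hypothetical selection coefficients
    print("predicted equilibrium:", equilibrium(s, t))
    for p0 in (0.01, 0.5, 0.99):
        print("start", p0, "-> after 2000 generations:", iterate(p0, s, t, 2000))
```

Starting the recursion from both a rare and a common allele shows why the polymorphism is described as "balanced": selection pushes the frequency back toward the same intermediate value from either side instead of fixing one allele.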

As methods capable of assessing genetic variation directly began to emerge from the first half of the 20th century onward (variation that, as was then beginning to be realized, was apparently ubiquitous in natural populations), there also arose a growing interest in explaining patterns of genetic variation (reviewed in Bamshad and Wooding, 2003; Gloss and Whiteman, 2016). Even with methods that seem quite limited to the contemporary population geneticist (because they always underestimated the true levels of genetic variability in populations), genetic variation was already being observed by the mid-20th century, and its persistence was attributed to the action of balancing selection (Dobzhansky, 1937).

The term "balancing selection" covered any and every evolutionary process that adaptively maintained genetic variation in populations. For example, Dobzhansky (1937) observed polymorphisms in the orientation of long stretches of DNA in Drosophila chromosomes (the so-called "chromosomal inversions") using chromosome staining techniques, long before DNA sequencing was possible, and attributed the maintenance of such polymorphisms in nature to balancing selection (Figure 1).

[1] In that book, Darwin carries out self-pollination and cross-pollination experiments in more than 60 plant species and concludes that (in most cases) the performance of offspring resulting from self-pollination is inferior for several traits.

At that time, in the early 20th century, interest in balancing selection had several origins. Animal and plant breeders wanted to maximize productivity and performance, and hybrid vigor was observed but not fully understood. Dobzhansky, like Muller, was concerned with the potential of the species to keep evolving. Muller, however, had another, more “urgent” concern: the impact on future generations of the increase in mutation rate caused by radiation (reviewed in Crow, 1987). These three concerns revolved around how much the genetic variability of a population depends on overdominance², i.e., on loci at which the heterozygote phenotype lies beyond the range of the phenotypes of the two corresponding homozygotes³.

In the 1960s, Lewontin and Hubby (1966) showed that levels of genetic variation in Drosophila allozymes were much higher than previous estimates of genetic variability in the populations analyzed up to that point (Figure 1). Their findings were based on a new method for detecting polymorphisms, protein electrophoresis, which made it possible to quantify even more variation than the chromosome staining techniques used by Dobzhansky a decade earlier. Balancing selection, and particularly selection that varies over time, then became established as a very popular explanation for the maintenance of polymorphisms in nature (reviewed in Bamshad and Wooding, 2003; Gloss and Whiteman, 2016).

² This term, often used as a synonym for heterozygote advantage, was originally employed to explain heterosis in plants (Hedrick, 2012).
³ This is an older definition of heterozygote advantage (Crow, 1987). Others are mentioned throughout the text.

It is worth noting that the main balancing selection mechanism invoked to explain levels of variation in Drosophila was spatial heterogeneity in selective pressures (Levene, 1953), which arises because these species occupy diverse habitats: a given variant would not be the most adaptive in every habitat, and polymorphisms would therefore be maintained in the population (Figure 1). Later, Dempster (1955) showed that selective pressures that vary over time can also maintain genetic variation (reviewed in Gloss and Whiteman, 2016).

In the following decades, as mathematical models were developed to describe these selective regimes, empirical evidence of their occurrence in nature began to emerge (e.g., Allison, 1954). During this period there was widespread belief in a strongly “selectionist” neo-Darwinian theory, i.e., one focused on natural selection as the main mechanism capable of changing allele and phenotype frequencies: the great majority of variants would be deleterious, and a smaller proportion would be advantageous (Figure 2).

The neutral theory of molecular evolution

As molecular methods became increasingly sensitive in detecting levels of variability (e.g., Lewontin and Hubby, 1966), polymorphisms⁴ were found to be abundant in natural populations, which led to extensive development of balancing selection models (Gloss and Whiteman, 2016). Figure 1 summarizes many of these contributions.

Figure 2: Selectionist, neutral, and nearly neutral models of molecular evolution. Figure adapted from Bromham and Penny (2003) and Bernardi (2007). In 1859, Darwin published “On the Origin of Species”, in which he described his ideas about natural selection. Darwin believed that neutral mutations might exist, but that most mutations are deleterious and few are advantageous. In the first half of the 20th century, natural selection was reconciled with the basis of the molecular mechanism of inheritance (neo-Darwinism). In this period, the selectionist focus addressed only deleterious and advantageous mutations. In 1968 Kimura published the first version of the neutral model of molecular evolution (with an important update in 1983), and in 1973 Ohta proposed the nearly neutral model of molecular evolution. These two authors showed that genetic drift can maintain neutral or nearly neutral variants in populations. See also Figure 1.

This enthusiasm, however, was counterbalanced by the then recent “neutral theory of molecular evolution”. In his landmark book for the field of evolutionary biology⁵, Kimura presented the idea that the main cause of evolutionary change at the molecular level (i.e., change in the genetic material per se) is not Darwinian positive selection but the random fixation of neutral or nearly neutral variants (Figure 2).

⁴ The presence of discrete phenotypic variants in a population is called a polymorphism. “Visible” polymorphisms are, however, an underestimate of the underlying genetic diversity (Charlesworth and Charlesworth, 2010). Throughout the text, polymorphisms are the different alleles present in a population at a given locus.
⁵ “The Neutral Theory of Molecular Evolution” (1983) is the version consulted here and the one widely used in population genetics, although the theory was first proposed earlier (Kimura, 1968).

From the first proposal of the neutral theory in 1968, Kimura faced considerable criticism, largely because evolutionary biology had been dominated for more than half a century by the Darwinian view that organisms become progressively adapted to their environments through the accumulation of beneficial variants (Figure 2). The theory also received technical criticism, such as the difficulty of reconciling its predictions with some important features of the data: high variance in rates of molecular evolution, codon usage bias, and evidence of selection in studies of specific genes, among others. These criticisms forced changes to the theory, which was revised over time (Kimura, 1968; Kimura, 1983; Kimura, 1991). Another interesting development was the proposal of the “nearly neutral” theory which, unlike the original neutral theory, which dealt only with neutral and strongly deleterious polymorphisms, also incorporated weakly deleterious (Ohta, 1973) and beneficial polymorphisms (Ohta, 1995; Ohta and Gillespie, 1996; Figure 2).

According to the neutral theory, neutral variants introduced by mutation can rise in frequency stochastically. Kimura (1968, 1983) pointed out that, to contribute to adaptation, mutations need to be more than beneficial: they must also escape loss by drift, especially while rare. Taking both factors into account, Kimura concluded that mutations of intermediate effect are the most likely to contribute to adaptation (reviewed in Orr, 2005). He also concluded that most intraspecific molecular variability, such as that manifested in protein polymorphisms, is essentially neutral, so that most polymorphic alleles are maintained in a species by a balance between mutation and random extinction of alleles (Kimura, 1983).

Throughout the 1960s and 1970s, under the influence of Kimura's ideas, evolutionary geneticists became increasingly convinced that much, if not most, of molecular evolution reflects the fixation of neutral or nearly neutral mutations rather than beneficial ones. The theory seemed able to explain most of the variation observed within populations through stochastic processes. During this period, theoretical work on the mechanisms of adaptive evolution (positive and balancing selection) declined considerably (Orr, 2005; Figures 1 and 2).

It is worth noting, however, that the neutral theory does not oppose the notion that the evolution of form and function may be driven by Darwinian selection; rather, it highlights another aspect of the evolutionary process by emphasizing the crucial role of mutational pressure and genetic drift at the molecular level. Kimura (1983) defines the neutral theory as:

“(...) the theory that at the molecular level evolutionary changes and polymorphisms are mainly due to mutations that are nearly enough neutral with respect to natural selection that their behavior and fate are mainly determined by mutation and random drift.” (Kimura, 1983; first chapter).

The theory regards polymorphisms at the protein and DNA levels as transient phases of molecular evolution, and rejects the notion that most of these polymorphisms are adaptive and maintained in a species by some form of balancing selection (Figure 2). The neutral theory therefore predicts that most genetic variants are neutral; it does not, on the other hand, reject the possibility that functional genetic variants (those that affect phenotypes and which, within its model, represent a minority of the variation found in nature) could be maintained by balancing selection (Kimura, 1983; Gloss and Whiteman, 2016).

With this (non-exhaustive) historical overview of the maintenance of diversity in populations, I have sought to introduce the distinction between neutral polymorphisms, which are maintained with some probability simply through the combined effects of mutation, drift, “hitch-hiking”⁶, and migration, and adaptive polymorphisms, which are maintained by one or more mechanisms of balancing selection⁷. Each of these mechanisms is detailed below.

Mechanisms for the maintenance of adaptive diversity

Classical evolutionary thinking did not anticipate a level of variation that would imply the existence of many balanced polymorphisms (Figure 1), i.e., polymorphisms maintained at intermediate frequencies. This view was challenged by the discovery of balanced polymorphisms in nature, an important contribution from the early days of population genetics that greatly improved our understanding of evolution (Charlesworth and Charlesworth, 2010).

Levels of genetic variation are now known to be influenced by demographic factors such as fluctuations in population size, population structure, admixture, and migration (Tishkoff and Williams, 2002)⁸. Patterns of diversity also vary along the genome for several reasons, including mutation rates, recombination rates, gene conversion⁹, and selective processes (Tishkoff and Williams, 2002). The class of selective processes that increase advantageous genetic diversity is known as “balancing selection” (Andrés, 2011; Key et al., 2014), and the polymorphisms it maintains are referred to as “balanced polymorphisms” (Charlesworth and Charlesworth, 2010).

⁶ Genetic hitch-hiking: the process by which neutral (or, in some cases, deleterious) mutations change in frequency in a population because of genetic linkage to a selected mutation (reviewed in Cutter and Payseur, 2013). This topic is covered in more detail in the section “Genetic load induced by balancing selection” and in Chapter 2.
⁷ Although positive selection also acts on advantageous or adaptive mutations, it tends to fix them in the population and therefore reduces, rather than maintains, genetic variability.

Balancing selection is a term that encompasses several mechanisms, which are discussed below.

Constant fitnesses

Balancing selection regimes are often modeled using genotype fitnesses without specifying the underlying causes of the phenotypic differences¹⁰. Although this is a useful strategy, and in practice many of the genomic signatures left by the different mechanisms of balancing selection are the same, it is important to try to understand which biological features of an organism determine its fitness (Charlesworth and Charlesworth, 2010).

The most basic model of natural selection assumes constant genotype fitnesses (Figure 3A). For a diploid, bi-allelic system, the fitnesses of genotypes $A_1A_1$, $A_1A_2$, and $A_2A_2$ are given by $w_{11}$, $w_{12}$, and $w_{22}$, respectively. In diploid systems it is common to use the concept of the marginal fitnesses of alleles $A_1$ and $A_2$, meaning the average fitness of each allele over all the genotypes in which it appears (i.e., homozygous and heterozygous). In the scenario considered here, although $w_{11}$, $w_{12}$, and $w_{22}$ do not change over time, the marginal fitnesses still depend on allele frequencies. It is therefore possible for the two marginal fitnesses to become equal, at which point allele frequencies stop changing and a stable frequency equilibrium is reached. The frequency of each allele at equilibrium is called the equilibrium frequency¹¹. Interestingly, the absolute values of the genotype fitnesses are irrelevant: only their relative values enter the selection equations. One genotype (e.g., the heterozygote) can therefore be defined as the “standard”, and the fitnesses of the other genotypes expressed relative to it, as shown below (Charlesworth and Charlesworth, 2010).

⁸ However, although it is subject to stochasticity, demography on average affects the genome as a whole rather than specific regions only.
⁹ A specific type of recombination that results in a non-reciprocal exchange of genetic material, in which one DNA strand is used to modify the sequence of another (Tishkoff and Williams, 2002).
¹⁰ For example, see the methods section of Chapter 1.

Heterozygote advantage. Following the model above, and using relative genotype fitnesses, a scenario of heterozygote advantage with two alleles $A_1$ and $A_2$ at frequencies $p$ and $q$ can be modeled as follows¹²:

Genotype:   $A_1A_1$   $A_1A_2$   $A_2A_2$
Fitness:    $1-s$      $1$        $1-t$

where $s$ and $t$ are the selection coefficients against the $A_1A_1$ and $A_2A_2$ homozygotes, respectively. When the marginal fitnesses of $A_1$ and $A_2$ are equal ($qt = ps$), there is a polymorphic equilibrium, and the equilibrium frequencies are given by:

$$p_{eq} = \frac{t}{t+s}; \qquad q_{eq} = 1 - p_{eq}$$
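
The equilibrium condition quoted above follows from a short calculation. The derivation below is a sketch that assumes random mating and takes the heterozygote as the reference genotype, with fitnesses $1-s$, $1$, and $1-t$ for $A_1A_1$, $A_1A_2$, and $A_2A_2$ (the assignment under which $qt = ps$ holds at equilibrium). The symbols $\bar{w}_1$, $\bar{w}_2$ (marginal fitnesses) and $\bar{w}$ (mean fitness) are notation introduced here for illustration.

```latex
\begin{align*}
\bar{w}_1 &= p\,(1-s) + q \cdot 1 = 1 - ps, \\
\bar{w}_2 &= p \cdot 1 + q\,(1-t) = 1 - qt, \\
\bar{w}   &= p\,\bar{w}_1 + q\,\bar{w}_2 = 1 - s p^2 - t q^2, \\
\Delta p  &= \frac{p\,q\,(\bar{w}_1 - \bar{w}_2)}{\bar{w}}
           = \frac{p\,q\,(qt - ps)}{\bar{w}}, \\
\Delta p = 0 \ \text{with}\ 0 < p < 1
          &\iff qt = ps \iff p_{eq} = \frac{t}{s+t}.
\end{align*}
```

For $s, t > 0$, the term $qt - ps$ is positive when $p < p_{eq}$ and negative when $p > p_{eq}$, so the allele frequency moves toward $p_{eq}$ from either side.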

¹¹ For bi-allelic loci, the equilibrium frequency can be defined as the frequency of the less frequent allele.
¹² The first demonstration and discussion of how a polymorphism can be maintained by selection, in a form very similar to the one presented here, appeared in Fisher's (1922) paper “On the dominance ratio”. See Figure 1.


Figure 3: One possible strategy for identifying instances of balancing selection is to measure fitness differences between genotypic classes. (A) Overdominance, or heterozygote advantage: genotype fitnesses are constant, but the fitness of the heterozygote is always higher than that of both homozygotes. In this example the homozygotes have different fitnesses (asymmetric overdominance). (B) Fitnesses that vary in space (for example, across different habitats occupied by a species) or in time (when a homozygous genotype has higher fitness at a given moment and lower fitness in subsequent generations). Figure adapted from Key et al. (2014).

Under heterozygote advantage (also commonly called overdominance), polymorphisms can be maintained in a population because the heterozygous genotype is fitter than both homozygous genotypes, which leads to a balance between the frequencies of the two variants (Andrés, 2011; Fijarczyk and Babik, 2015; Key et al., 2014). The equilibrium frequency is 0.5 when $s = t$, an unlikely scenario known as symmetric overdominance (Figure 4A), and differs from 0.5 when $s \neq t$ (Figure 4B). The frequency at which $qt = ps$ is reached regardless of the initial allele frequency (Figure 4), i.e., the equilibrium is stable. In Chapter 1 we propose a summary statistic that exploits many of these properties to detect signatures of balancing selection in genomic data (see, for example, Figures 1 and 6 in Chapter 1).
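
The stability of this equilibrium can be illustrated with a small deterministic simulation. The sketch below is not taken from the thesis: it assumes an infinite, randomly mating population (no drift), fitnesses $1-s$, $1$, $1-t$ for $A_1A_1$, $A_1A_2$, $A_2A_2$, and hypothetical values of $s$ and $t$; the function names are illustrative only.

```python
def overdominance_next_frequency(p, s, t):
    """Frequency of A1 after one generation of selection.

    Assumes fitnesses 1-s (A1A1), 1 (A1A2), 1-t (A2A2) and random mating.
    """
    q = 1.0 - p
    mean_w = 1.0 - s * p * p - t * q * q   # population mean fitness
    marginal_w1 = 1.0 - s * p              # marginal fitness of allele A1
    return p * marginal_w1 / mean_w


def simulate_overdominance(p0, s, t, generations=500):
    """Return the trajectory of the A1 frequency, starting from p0."""
    p = p0
    trajectory = [p]
    for _ in range(generations):
        p = overdominance_next_frequency(p, s, t)
        trajectory.append(p)
    return trajectory


if __name__ == "__main__":
    s, t = 0.7, 0.3  # hypothetical coefficients; predicted p_eq = t / (s + t)
    for p0 in (0.01, 0.5, 0.99):
        final = simulate_overdominance(p0, s, t)[-1]
        print(f"p0 = {p0:.2f} -> final p = {final:.4f}, "
              f"predicted = {t / (s + t):.4f}")
```

Every starting frequency converges to the same value, which is the qualitative behavior sketched in Figure 4; in a finite population, drift could still cause a rare allele to be lost before it gets there.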

An interesting consequence of the model above concerns an initially rare allele $A_2$ that enters (by migration or mutation) a population initially fixed for $A_1$. The proportion of $A_2A_2$ genotypes produced by random mating ($q^2$) is then very low compared with that of heterozygotes¹³. Because $A_2A_2$ genotypes are initially very rare, the only condition required for $A_2$ to increase in frequency is that $A_1A_2$ be fitter than $A_1A_1$, even if $A_2A_2$ individuals have greatly reduced fitness. $A_2$ will nevertheless only increase in frequency up to a certain point, since $A_2A_2$ is deleterious.

With constant genotype fitnesses (Figure 3A), balancing selection acting on a single locus always maximizes the mean fitness of a randomly mating population, even though, under heterozygote advantage, lower-fitness homozygotes are produced by Mendelian segregation every generation (Charlesworth and Charlesworth, 2010).

Because of this property, Sellis et al. (2011) showed theoretically that this is likely to be the trajectory of most adaptive mutations in diploids. Such a “balanced” state would, by definition, be very frequent and short-lived, and would not leave signatures detectable by methods that target long-term balancing selection. On the other hand, it would maintain adaptive diversity in populations, even if only in the short term, which would compensate for the fitness loss caused by homozygotes. This diversity could in turn be readily used by natural selection when selective pressures change suddenly¹⁴. Although this is a very interesting proposal, empirical evidence to support or reject it is still lacking.

¹³ Because $p \approx 1$ and $2pq \approx 2q$.

[Figure 4, panels A and B: allele frequency (0 to 1) over time; equilibrium frequency 0.5 in A and 0.3 in B.]

Figure 4: Regardless of the initial frequency of each allele in a bi-allelic system with heterozygote advantage, an equilibrium can be reached (provided that the allele escapes loss by drift). If the fitnesses of the two homozygous genotypes relative to the heterozygote are identical, the equilibrium is reached at a frequency of 0.5 for each allele (A). If one of the homozygous genotypes has a higher relative fitness than the other, the equilibrium frequency will differ from 0.5 (B, where the equilibrium frequency is 0.3 for one allele and therefore 0.7 for the other). A real example of asymmetric overdominance is presented in Chapter 1 (page 68).

¹⁴ This phenomenon, which is being increasingly studied, is called selection on standing variation.

Mean fitness. It would seem logical that, under heterozygote advantage, the inevitable production of homozygotes by Mendelian segregation every generation would reduce the mean fitness of the population. This is not the case, however: the mean fitness of a randomly mating population under heterozygote advantage (i.e., considering all genotypes and their frequencies) reaches its maximum at equilibrium. The equilibrium frequency under heterozygote advantage is therefore said to be the one that maximizes the mean fitness of the population (Charlesworth and Charlesworth, 2010; Andrés, 2011). Thus, although the presence of low-fitness homozygotes is very harmful to the individual, as in sickle-cell anemia, the fitness of the population as a whole is higher when malaria-resistant individuals are maintained in the population (Charlesworth and Charlesworth, 2010; Wright, 1937).
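
The statement that mean fitness peaks at the equilibrium frequency can be checked numerically. The sketch below assumes the same parameterization used for the equilibrium formula (fitnesses $1-s$, $1$, $1-t$, random mating), so that mean fitness is $1 - sp^2 - tq^2$; the values of $s$ and $t$ and the grid search are illustrative choices, not part of the original analysis.

```python
def mean_fitness(p, s, t):
    """Population mean fitness with fitnesses 1-s, 1, 1-t and random mating."""
    q = 1.0 - p
    return 1.0 - s * p * p - t * q * q


def argmax_mean_fitness(s, t, grid_size=10001):
    """Grid search for the A1 frequency that maximizes mean fitness."""
    best_p, best_w = 0.0, mean_fitness(0.0, s, t)
    for i in range(1, grid_size):
        p = i / (grid_size - 1)
        w = mean_fitness(p, s, t)
        if w > best_w:
            best_p, best_w = p, w
    return best_p, best_w


if __name__ == "__main__":
    s, t = 0.7, 0.3  # hypothetical selection coefficients
    best_p, best_w = argmax_mean_fitness(s, t)
    print(f"mean fitness is maximal at p = {best_p:.4f} (w = {best_w:.4f})")
    print(f"equilibrium predicted by t / (s + t) = {t / (s + t):.4f}")
    print(f"mean fitness if A1 were fixed: {mean_fitness(1.0, s, t):.4f}")
    print(f"mean fitness if A2 were fixed: {mean_fitness(0.0, s, t):.4f}")
```

The maximizing frequency coincides with $t/(s+t)$, and the mean fitness there exceeds the value at either fixation, even though homozygotes are still produced every generation.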

Antagonistic selection. Opposing selective pressures across different contexts (environments, sexes, fitness components, and developmental stages) can generate antagonistic selection at the population-genetic level (Connallon and Clark, 2013; Prout, 2000). Antagonistically selected alleles are those that increase fitness in one context but decrease it in another. When the contexts are individual fitness components, the scenario is known as “antagonistic pleiotropy”: a single locus affects more than one trait, and a given allele is adaptive for one trait and deleterious for the other (Connallon and Clark, 2013).

Consider two fitness components, fertility and survival. If the homozygous effects of a given mutation act in opposite directions on the two traits, and the allele that is favorable for each trait is dominant over the deleterious allele, the result is a scenario essentially indistinguishable from heterozygote advantage, in which the locus can evolve to intermediate frequencies through balancing selection. Such a scenario was first described in a paper on the evolution of aging (Williams, 1957), which proposed that if an allele causes high fertility in youth as well as early senescence, the second effect would be compensated by the first (reviewed in Charlesworth, 2000).

However, the approach to equilibrium would be slow, requiring tens of thousands of generations even when selection coefficients are not very small (Connallon and Clark, 2013). This implies that (1) alleles close to equilibrium must be relatively old (see the discussion of time scales below), and (2) populations should generally be far from equilibrium at loci under this regime (Connallon and Clark, 2013).

Examples of heterozygote advantage. Because in practice it is very difficult to establish unequivocally which mechanism (or mechanisms) of balancing selection is responsible for maintaining a balanced polymorphism over long time scales, there are few uncontroversial examples of heterozygote advantage at that time scale (Charlesworth and Charlesworth, 2010).

The MHC genes (HLA in humans¹⁵) are a possible example of heterozygote advantage. Heterozygotes for alleles of genes encoding antigen-presenting molecules would be able to respond to a larger repertoire of antigens and therefore respond better to infections (Doherty and Zinkernagel, 1975). There is, however, a vast literature showing how difficult it is to distinguish the mechanisms acting on MHC genes and their relative contributions¹⁶. The most likely scenario is that several mechanisms contributed to the evolution of this system at different points in space and time, possibly concurrently. Moreover, heterozygote advantage alone is unlikely to explain the number of alleles found at MHC genes (De Boer et al., 2004).

¹⁵ Major histocompatibility complex and human leukocyte antigen, respectively.
¹⁶ See the Introduction and Discussion of the article in Appendix A.4 for a detailed discussion of this topic.

On more recent time scales (thousands of years), the classic example is the defective β-hemoglobin that causes sickle-cell anemia in homozygous carriers of the mutation, and its relationship with protection against malaria¹⁷. Many other examples of heterozygote advantage are known from fitness measurements of different genotypes, for example in domesticated animals such as pigs, dogs, horses, cats, and sheep. In many cases, traits favored by artificial selection that produce desirable phenotypes also produce undesired phenotypes in recessive homozygotes (Hedrick, 2012). An extreme example is taillessness in the Manx cat breed, which is the (selected) heterozygous phenotype; in homozygosity the mutation is lethal (Hedrick, 2012).

Non-constant fitnesses

In practice, the assumption that fitnesses are constant over time is unrealistic; moreover, genetic variability can be actively maintained even in the absence of any form of heterozygote advantage (Figure 3B).

Negative frequency dependence. Consider again the genotypes $A_1A_1$, $A_1A_2$, and $A_2A_2$, and assume that $A_1$ is dominant, so that $A_1/-$ genotypes have a different fitness from $A_2A_2$. If $A_1/-$ has higher fitness than $A_2A_2$ when $A_1$ is rare, but this fitness decreases as $A_1$ increases in frequency, the result is negative frequency-dependent selection (also called rare-allele advantage). In other words, the marginal fitness of an allele is negatively correlated with its frequency in the population, which leads to a stable balance between the frequencies of the variants in which neither is eliminated¹⁸ (Clarke, 1962; Charlesworth and Charlesworth, 2010). Here, maintenance of the polymorphism does not require heterozygote advantage.

¹⁷ See also page 68.

Because fitnesses here depend on genotype frequencies, the argument above that the equilibrium frequency maximizes the mean fitness of the population does not hold: the stable equilibrium need not coincide with the maximum mean fitness (Charlesworth and Charlesworth, 2010).
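
A toy model can make the rare-allele advantage concrete. The sketch below is an illustration under assumptions that are not in the text: the fitness of each phenotype (dominant $A_1/-$ or recessive $A_2A_2$) is taken to decline linearly with that phenotype's own frequency, with a hypothetical slope `a`, in an infinite randomly mating population. Under these assumptions the two phenotypes have equal fitness when each makes up half of the population, i.e., when $q^2 = 0.5$.

```python
import math


def nfds_phenotype_fitnesses(p, a):
    """Fitness of the dominant (A1/-) and recessive (A2A2) phenotypes.

    Hypothetical linear model: a phenotype's fitness falls as it becomes
    more common (0 < a < 1 keeps fitnesses positive).
    """
    q = 1.0 - p
    freq_dominant = 1.0 - q * q
    w_dominant = 1.0 - a * freq_dominant
    w_recessive = 1.0 - a * (1.0 - freq_dominant)
    return w_dominant, w_recessive


def nfds_next_frequency(p, a):
    """Frequency of A1 after one generation of selection."""
    q = 1.0 - p
    w_dom, w_rec = nfds_phenotype_fitnesses(p, a)
    mean_w = (1.0 - q * q) * w_dom + q * q * w_rec
    # Every copy of A1 sits in an A1/- genotype, so its marginal fitness
    # equals the fitness of the dominant phenotype.
    return p * w_dom / mean_w


def simulate_nfds(p0, a=0.5, generations=2000):
    """Iterate the recursion and return the final frequency of A1."""
    p = p0
    for _ in range(generations):
        p = nfds_next_frequency(p, a)
    return p


if __name__ == "__main__":
    expected = 1.0 - math.sqrt(0.5)  # p at which q**2 = 0.5
    for p0 in (0.01, 0.9):
        print(f"p0 = {p0:.2f} -> final p = {simulate_nfds(p0):.4f} "
              f"(expected {expected:.4f})")
```

In this particular model the homozygote $A_1A_1$ and the heterozygote always have the same fitness, so the polymorphism is maintained without any heterozygote advantage. Other fitness functions can behave differently, including failing to settle at a stable point.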

Examples of frequency dependence. Scenarios such as those described above are probably quite common in natural systems. One example is apostatic selection, in which predators exert selective pressure on their prey by “learning” as they hunt, which favors prey with rare phenotypes (Clarke, 1962; Charlesworth and Charlesworth, 2010). Another example is Batesian mimicry, in which a dangerous or unpalatable model species is mimicked by another (harmless) species that is subject to predation. Rare variants of the mimic species that happen to resemble the model species benefit from being avoided by predators and increase in frequency; this increase in turn raises the chance that a predator eats a mimic and associates its pattern with a pleasant taste (Charlesworth and Charlesworth, 2010), which reduces the advantage.

It is important to stress that these examples do not inevitably lead to a balanced polymorphism: that depends on the relationship between fitness and genotype frequencies. Furthermore, unlike the equilibrium under heterozygote advantage, the equilibrium can be unstable: theoretical analyses indicate that under this type of selective regime the population may oscillate permanently around the equilibrium frequency (Charlesworth and Charlesworth, 2010).

¹⁸ In general, throughout the text, a locus is assumed to have two variants, i.e., to be bi-allelic.

Another important example is the self-incompatibility system: plants capable of self-pollination often carry so-called “S” loci which, together with the MHC genes of vertebrates, are among the most polymorphic systems in terms of number of alleles (Charlesworth and Charlesworth, 2010). In the simplest cases, pollen grains carrying allele S1 cannot fertilize ovules that also carry S1; all plants are therefore heterozygous by definition, and a newly arisen allele has an enormous selective advantage, since pollen carrying it can fertilize any plant in the population. In this scenario, an equilibrium could be reached if all heterozygotes had the same frequency (Charlesworth and Charlesworth, 2010).

Finally, another example is the “arms race” between hosts and parasites. In the context of acquired immunity, a common type of parasite will probably have infected a large proportion of hosts, which then become immune to new infections. Parasites with new antigen types therefore have a frequency-dependent advantage, which leads to high levels of polymorphism in the parasite population (Trachtenberg et al., 2003; Borghans et al., 2004; Slade and McCallum, 1992; Spurgin and Richardson, 2010).

Part of the diversity observed at MHC genes is probably maintained in this way, since different variants of the genes encoding antigen-presenting molecules can present a particular repertoire of epitopes, and new variants arising in the host may have a frequency-dependent advantage (Trachtenberg et al., 2003; Borghans et al., 2004). This hypothesis is supported by the particularly high diversity in the antigen-presenting groove of MHC molecules (Garrigan and Hedrick, 2003) and by a correlation between pathogen diversity and variability at the human MHC (Prugnolle et al., 2005) but, as mentioned above, the debate over the mechanisms responsible for the diversity of MHC genes remains intense.

Fitnesses that vary in time. We can imagine scenarios in which an allele performs poorly in one generation and well in the next (Figure 3B). A good example is the seasonal fluctuation observed by Dobzhansky (1937) in chromosomal inversions of D. pseudoobscura over the seasons of the year. More recently, Bergland et al. (2014) detected hundreds of loci (SNPs) in D. melanogaster whose frequencies vary dramatically between seasons¹⁹, and argued that these polymorphisms are under strong, temporally variable selective pressure, given that they are associated with phenotypes that vary between seasons and that they respond to climatic variation.

Indeed, theory predicts that “protected polymorphisms” (Charlesworth and Charlesworth, 2010; Prout, 1968) can persist in a population with fitnesses that vary over time, provided that the geometric mean fitness of the heterozygote is higher than that of both homozygotes over the period considered (Bergland et al., 2014; Charlesworth and Charlesworth, 2010; Gillespie and Langley, 1974)²⁰.
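
The geometric-mean condition can be explored with a minimal sketch. The example below assumes discrete, non-overlapping generations, an infinite randomly mating population, and two environments that alternate every generation; the fitness values are hypothetical and were chosen so that the heterozygote is never the fittest genotype within any single generation.

```python
import math


def select_one_generation(p, fitnesses):
    """Frequency of A1 after selection; fitnesses = (w11, w12, w22)."""
    w11, w12, w22 = fitnesses
    q = 1.0 - p
    mean_w = p * p * w11 + 2.0 * p * q * w12 + q * q * w22
    return p * (p * w11 + q * w12) / mean_w


def geometric_mean_fitnesses(environments):
    """Geometric mean fitness of each genotype over one cycle of environments."""
    n = len(environments)
    return tuple(
        math.exp(sum(math.log(env[i]) for env in environments) / n)
        for i in range(3)
    )


def simulate_cycles(p0, environments, cycles=1000):
    """Apply the sequence of environments repeatedly; return the final p."""
    p = p0
    for _ in range(cycles):
        for env in environments:
            p = select_one_generation(p, env)
    return p


if __name__ == "__main__":
    # Hypothetical fitnesses (w11, w12, w22) in two alternating environments.
    protected = [(1.5, 1.0, 0.5), (0.5, 1.0, 1.5)]
    # Contrast: A1A1 has the highest geometric mean, so A1 should fix.
    directional = [(1.5, 1.0, 0.5), (0.9, 1.0, 1.1)]
    for name, envs in (("protected", protected), ("directional", directional)):
        gm = geometric_mean_fitnesses(envs)
        finals = [simulate_cycles(p0, envs) for p0 in (0.01, 0.5, 0.99)]
        print(name, [round(g, 3) for g in gm], [round(p, 3) for p in finals])
```

In the first scenario both alleles increase when rare and the frequency settles into a bounded cycle, whereas in the second the polymorphism is lost. As noted in the text, more realistic models (overlapping generations, linked loci) can change these conditions.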

¹⁹ About 10 generations per summer and two per winter (Bergland et al., 2014).
²⁰ Bergland et al. (2014), however, considered more realistic models that allow for overlapping generations, multiple linked loci, and a combination of spatial and temporal variation in selective pressures (all compatible with D. melanogaster); that discussion is beyond the scope of this introduction.


Fitnesses that vary in space. A protected polymorphism can also be maintained in environments that vary in space (Figure 3B), under the conditions of Levene's (1953) model (see Figure 1), if: (a) the relative fitnesses of the genotypes vary among the different niches occupied by the species; (b) all selection takes place within each niche; and (c) mating occurs at random among the individuals emerging from the different niches.
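
A minimal sketch of a Levene-type model is given below. It assumes two niches that each contribute a fixed fraction of the adults to a single random-mating pool, heterozygote fitness set to 1 in both niches, and hypothetical homozygote fitnesses chosen so that neither niche shows heterozygote advantage. The check on harmonic means reflects the classical sufficient condition for a protected polymorphism in this model (harmonic mean fitness of each homozygote across niches below that of the heterozygote); the function names are illustrative.

```python
def levene_next_frequency(p, niches):
    """Frequency of A1 after one generation.

    niches: list of (c, (w11, w12, w22)), where c is the fixed fraction of
    the mating pool contributed by that niche (the c values sum to 1).
    """
    q = 1.0 - p
    new_p = 0.0
    for c, (w11, w12, w22) in niches:
        mean_w = p * p * w11 + 2.0 * p * q * w12 + q * q * w22
        new_p += c * p * (p * w11 + q * w12) / mean_w
    return new_p


def harmonic_mean_fitness(niches, genotype_index):
    """Harmonic mean across niches (weighted by c) of one genotype's fitness."""
    return 1.0 / sum(c / w[genotype_index] for c, w in niches)


def simulate_levene(p0, niches, generations=2000):
    """Iterate the recursion and return the final frequency of A1."""
    p = p0
    for _ in range(generations):
        p = levene_next_frequency(p, niches)
    return p


if __name__ == "__main__":
    # Hypothetical niches: A1A1 is favored in the first, A2A2 in the second.
    niches = [(0.5, (1.4, 1.0, 0.6)), (0.5, (0.6, 1.0, 1.4))]
    print("harmonic mean w11:", round(harmonic_mean_fitness(niches, 0), 3))
    print("harmonic mean w22:", round(harmonic_mean_fitness(niches, 2), 3))
    for p0 in (0.05, 0.95):
        print(f"p0 = {p0:.2f} -> final p = {simulate_levene(p0, niches):.4f}")
```

With these symmetric values both alleles invade when rare and the frequency converges to 0.5; changing the niche proportions or fitnesses can remove the polymorphism.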

With some abstraction, we can also include here cases in which a given genotype has different fitnesses in the two sexes: if alleles $A_1$ and $A_2$ have opposite effects on the fitness of males and females (sexual antagonism), polymorphisms can be maintained in the absence of heterozygote advantage²¹. In D. melanogaster, an estimated 8% of genes show patterns compatible with this type of selection (Innocenti and Morrow, 2010).

When selective pressures vary across the space occupied by a species, whether or not an equilibrium frequency exists depends not only on the relationship between the fitnesses of the alleles across environments, but also on the proportion of individuals allocated to each environment and on the amount of gene flow between environments (Gloss and Whiteman, 2016; Charlesworth and Charlesworth, 2010). Under this mechanism, therefore, the long-term maintenance of polymorphisms may or may not occur.

[21] See also Page 18, where antagonistic pleiotropy was defined together with heterozygote advantage, since both mechanisms result in indistinguishable signatures.

In many of the documented cases of short- and long-term balancing selection (i.e., those not detected through studies of the current generation), it is difficult or impossible to determine which of the mechanisms defined above is responsible for the observed pattern. In other words, detecting genomic signatures compatible with balancing selection does not reveal which mechanism produced them (see the discussion in the next item). In addition, multiple mechanisms may have acted simultaneously or at different times over evolutionary history (Hedrick, 2012), as is the case for the MHC.

Adaptive evolution in the human genome

“It is easy to invent a selectionist explanation for almost any specific observation; proving it is another story. Such facile explanatory excesses can be avoided by being more quantitative.” (Kimura, 1983)

For decades, efforts to detect positive selection by examining candidate genes one at a time yielded a few well-documented cases, such as the genes associated with lactase persistence in adults (e.g. Bersaglieri et al., 2004) and with skin pigmentation in humans (Jablonski and Chaplin, 2010). Until recently, this approach was the only practical way of finding targets of positive selection in humans (Sabeti et al., 2006).

Over the last 15 years, genomic approaches became popular, and there was an explosion of genomic scans [22] for signatures of positive selection in the human genome (Bamshad and Wooding, 2003; Bustamante et al., 2005; Enard et al., 2010; Fay et al., 2001; Nielsen, 2005; Sabeti et al., 2006; Sabeti et al., 2007). By 2009, 21 scans for positive selection in humans had already been carried out (reviewed in Akey, 2009).
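The logic of a genomic scan based on empirical outliers can be sketched as follows. This is a minimal illustration under simplifying assumptions: the statistic is just the number of SNPs per window, windows do not overlap in the example, and the positions are invented. The function names are hypothetical and do not correspond to any of the cited studies.

```python
def snp_density_windows(snp_positions, chrom_length, window, step):
    """Count SNPs in windows of fixed size placed every `step` bases.
    Returns a list of (window_start, snp_count) tuples."""
    windows = []
    start = 0
    while start + window <= chrom_length:
        count = sum(start <= pos < start + window for pos in snp_positions)
        windows.append((start, count))
        start += step
    return windows

def empirical_outliers(windows, tail=0.05):
    """Return the windows in the upper empirical tail of the statistic
    (at least one window is always returned)."""
    ranked = sorted(windows, key=lambda w: w[1], reverse=True)
    n_keep = max(1, int(len(ranked) * tail))
    return ranked[:n_keep]

# Invented SNP positions with a cluster between 5,000 and 6,000.
snps = [100, 2500, 5100, 5200, 5300, 5400, 5500, 8000]
scan = snp_density_windows(snps, chrom_length=10_000, window=1_000, step=1_000)
print(empirical_outliers(scan, tail=0.1))
```

A real scan would replace the SNP count with a statistic informative about selection and would usually complement the empirical cutoff with thresholds obtained from neutral simulations.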

[22] In a genomic scan, a statistic of interest is computed in windows along the genome, and the windows that fall in the empirical extremes of the statistic and/or beyond a threshold defined from neutral simulations are highlighted.

Genomic approaches have the advantage of making it possible to understand the impact of natural selection across the entire genome and to infer which functional categories are most subject to positive selection in humans (Sabeti et al., 2006). Thanks to the large number of genomic scans for signatures of positive selection, it is now known that many of the genes bearing such signatures are related to particular categories: “immunity and defense”, “sensory perception”, “T-cell-mediated immunity”, “gametogenesis” and “spermatogenesis and motility” (Harris and Meyer, 2006; Nielsen et al., 2005). For non-coding regions, biological processes such as “neurogenesis”, “other neuronal activities” and “muscle development” stand out (Haygood et al., 2010).

Signatures of balancing selection

Since the neutral theory was proposed, many tests for balancing selection in the recent past [23] have been developed. Because they use the neutral model as the null model, tests used to detect signatures of natural selection are also called “neutrality tests” [24].

Ongoing selection

Classically, the detection of instances of heterozygote advantage was restricted to studies of the current generation, which focus on deviations from the genotype proportions expected under certain assumptions (Hardy-Weinberg, panmixia) or on relationships between genotype/allele and fitness (Figure 5). Such studies are extremely important for assessing the fitness of phenotypes and for establishing the relationship between the selected phenotype and the underlying genotype, but we will not focus on them here.
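As an illustration of what a current-generation test for deviation from Hardy-Weinberg proportions involves, the sketch below compares observed genotype counts with those expected from the sample allele frequencies. The genotype counts are invented and the function name is hypothetical; the chi-square statistic has one degree of freedom for a biallelic locus.

```python
def hardy_weinberg_check(n_aa, n_ab, n_bb):
    """Compare observed genotype counts with Hardy-Weinberg expectations.
    Returns (p, expected_counts, chi_square, f), where f = 1 - Hobs/Hexp
    is negative when heterozygotes are in excess."""
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)   # frequency of allele A
    q = 1 - p
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_aa, n_ab, n_bb)
    chi_square = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    f = 1 - n_ab / expected[1]
    return p, expected, chi_square, f

# Invented sample with an excess of heterozygotes.
p, expected, chi_square, f = hardy_weinberg_check(20, 60, 20)
print(p, expected, chi_square, f)
```

An excess of heterozygotes is only suggestive: as noted in the text, these proportions also depend on assumptions such as panmixia, so a deviation does not by itself demonstrate heterozygote advantage.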

[23] The literature frequently uses the expression “short-term balancing selection” (last few thousand years).
[24] See page 8.


Figure 5: HW, Hardy-Weinberg, refers to the genotype proportions p² + 2pq + q² = 1. Figure adapted from Hedrick (2012). The signatures and the corresponding neutrality tests are covered in the item “Signatures of balancing selection”.

Short-term selection

For selection events that occurred in the “recent” past (up to about 1 million years ago; Fu and Akey, 2013), one of the first tests of the neutral model was the Ewens-Watterson test (e.g. Watterson, 1978), or sample homozygosity test (Figure 5). This test assumes the IAM (infinite alleles model [25]) and a population at mutation-drift equilibrium. It compares the homozygosity expected for a population at equilibrium with the homozygosity actually observed. A deficit of homozygosity can be interpreted as evidence of balancing selection (i.e. alleles segregate at intermediate frequencies, lowering the observed homozygosity). This test was very influential in pre-genomic studies, but nowadays it is passed over in favor of more powerful tests.

[25] A model postulating that every allele arising in a population is “new” or “unique”, i.e., different from all that arose before. It was proposed by Kimura and Crow (1964) in an attempt to estimate the proportion of homozygous loci in a finite diploid population.
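The comparison between observed and expected homozygosity can be sketched with a small Monte Carlo simulation. This is not Watterson's original procedure: it is a simplified illustration that draws neutral samples under the infinite alleles model with a Hoppe urn and keeps only those with the same number of alleles as the observed sample. It relies on the assumption that, conditional on the number of alleles, the neutral distribution of allele counts does not depend on the mutation parameter, so the value of `theta` used here only affects how many draws are accepted. All names and allele counts are hypothetical.

```python
import random
from collections import Counter

def homozygosity(counts):
    """Sample homozygosity F: sum of squared allele frequencies."""
    n = sum(counts)
    return sum((c / n) ** 2 for c in counts)

def neutral_sample(n, theta, rng):
    """Draw allele counts for n gene copies under the infinite alleles
    model using a Hoppe urn."""
    labels = []
    next_label = 0
    for i in range(n):
        if rng.random() < theta / (theta + i):
            labels.append(next_label)          # a new allele arises
            next_label += 1
        else:
            labels.append(rng.choice(labels))  # copy an existing gene copy
    return list(Counter(labels).values())

def ewens_watterson_p(counts, theta=1.0, n_sims=1000, seed=1):
    """Return (observed F, fraction of neutral samples with the same
    number of alleles whose F is as low as or lower than observed)."""
    rng = random.Random(seed)
    n, k = sum(counts), len(counts)
    f_obs = homozygosity(counts)
    accepted = lower = 0
    while accepted < n_sims:
        sim = neutral_sample(n, theta, rng)
        if len(sim) != k:
            continue
        accepted += 1
        if homozygosity(sim) <= f_obs + 1e-12:
            lower += 1
    return f_obs, lower / accepted

# Invented sample: five alleles at equal frequencies in 50 gene copies.
print(ewens_watterson_p([10, 10, 10, 10, 10]))
```

A small fraction indicates that the observed homozygosity is lower than expected under neutrality, the pattern the text associates with balancing selection.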

Other tests with power to detect signatures of balancing selection on this time scale compare: (a) the observed allele frequency distribution with that expected under the neutral model; (b) genetic variation and linkage disequilibrium [26] in particular genomic regions with those observed in neutrally evolving regions; (c) the geographic differentiation observed at particular loci with that found for neutral markers (Hedrick, 2006; Hedrick, 2012; Mitchell-Olds et al., 2007) (Figure 5).

The drawback of tests focused on linkage disequilibrium (Figure 5) is that this signature is essentially indistinguishable from that left by incomplete selective sweeps (positive selection). Another signature, observable at the population level, is that different subpopulations will show little differentiation at loci targeted by balancing selection, since it maintains variants at intermediate frequencies in both populations. This signature depends on the selective regimes being congruent across the environments occupied by the two populations and on the absence of local adaptation in one or more subpopulations (Figure 5).

[26] A measure of whether two alleles at two different loci co-occur non-randomly in a population. Alleles in linkage disequilibrium are found on the same haplotype more often than would be expected if recombination occurred freely between them (reviewed in Cutter and Payseur, 2013).

On the other hand, the availability of sequence data made it possible to investigate the cumulative effect of selection over many generations. The signal of selection in the recent past is generated or lost over tens to thousands of generations, depending on the influence of genetic drift, gene flow and recombination (Hedrick, 2012).

Long-term selection

The signal of selection that began in the distant past and lasted a long time [27] is determined mainly by mutation and selection (Figure 5), and generally takes thousands to millions of generations to build up. When it persists for thousands to millions of generations, balancing selection results not only in the maintenance of a larger number of alleles in populations (Andrés et al., 2009), but also in a longer persistence of allelic diversity over time relative to neutral variation (Richman, 2000). In other words, under long-term balancing selection, the alleles segregating at a given locus have a longer TMRCA [28], which implies particular genomic signatures: an excess of polymorphisms in the region around the selected site (because there is more time for mutations to occur on the haplotype) and the maintenance of these polymorphisms for longer than expected for a neutral mutation, which ends up fixed or lost after, on average, 4N generations (where N is the population size).

[27] To which we will refer as long-term balancing selection (LTBS); in humans it corresponds to events that occurred more than one million years ago, even though they may have persisted until recently (Fu and Akey, 2013).
[28] Time to most recent common ancestor, the coalescence time back to the most recent common ancestor.

Ratio of nonsynonymous to synonymous substitution rates. According to the neutral theory (Kimura, 1968; Kimura, 1983), mutations capable of altering the function of a protein (nonsynonymous mutations) are generally deleterious and therefore targets of purifying selection. Comparative analyses indicate that at least 38% of nonsynonymous mutations are deleterious (Eyre-Walker and Keightley, 1999). Synonymous mutations, which do not change the amino acid sequence of the protein, would in turn evolve neutrally [29]. Adaptive evolution can raise the nonsynonymous substitution rate (dN), making it higher than the synonymous substitution rate (dS).

In this sense, a ratio dN/dS > 1 (or ω > 1) is a genetic signature of positive selection (Gillespie, 1991; Nielsen, 2005), but also of balancing selection (e.g. Bitarello et al., 2015 [30]) (Figure 5). However, ω > 1 is a very conservative criterion for deciding that genes are under adaptive evolution. Given that most nonsynonymous mutations are deleterious (Kimura and Crow, 1963; Kimura, 1968; Eyre-Walker and Keightley, 1999), the criterion is often not met when whole genes are analyzed. This is because usually only a few codons are under positive or balancing selection, whereas most nonsynonymous mutations are deleterious and therefore under purifying selection [31]. For this reason, it has long been common practice to look for selection in subsets of codons (e.g. Hughes and Nei, 1988; Hughes and Nei, 1989; Bitarello et al., 2015) or to use models that estimate different dN/dS values for groups of codons (Yang and Swanson, 2002; Bitarello et al., 2015), making it possible to infer which of them evolved adaptively.
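The point that ω > 1 may hold for a subset of codons but not for the whole gene can be made concrete with a toy calculation. The sketch assumes that the numbers of synonymous and nonsynonymous sites and differences have already been counted (for example with a counting method in the style of Nei and Gojobori) and applies a Jukes-Cantor correction to each proportion. All counts are invented and the function names are hypothetical; this is not the codon-model approach of Yang and Swanson (2002).

```python
import math

def jukes_cantor(p):
    """Jukes-Cantor corrected distance for a proportion p of differing sites."""
    return -0.75 * math.log(1 - 4 * p / 3)

def omega(nonsyn_diffs, nonsyn_sites, syn_diffs, syn_sites):
    """dN/dS computed from counts of differences and sites of each class."""
    dn = jukes_cantor(nonsyn_diffs / nonsyn_sites)
    ds = jukes_cantor(syn_diffs / syn_sites)
    return dn / ds

# Invented counts for a gene split into two codon classes.
subset = dict(nonsyn_diffs=12, nonsyn_sites=120, syn_diffs=2, syn_sites=40)
rest = dict(nonsyn_diffs=10, nonsyn_sites=600, syn_diffs=15, syn_sites=200)
whole = {key: subset[key] + rest[key] for key in subset}

print("subset of codons:", omega(**subset))
print("remaining codons:", omega(**rest))
print("whole gene:", omega(**whole))
```

With these numbers the subset of codons has ω above 1 while the whole-gene estimate stays below 1, which is the dilution effect described in the paragraph above.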

[29] Although some synonymous mutations may be targets of selection because of codon usage bias, and a fraction of nonsynonymous mutations is neutral, the assumption holds given the proportions involved (e.g. Comeron et al., 2008).
[30] This reference is available in Appendix A.4.
[31] This idea is indirectly explored in Chapter 2.

Allele frequency spectrum. Evidence of selection in the distant past can also be based on the frequency distribution of the alleles in a sample, the allele frequency spectrum (SFS, site frequency spectrum). The SFS is a count of the number of mutations found at frequency x_i = i/n, for i = 1, 2, ..., n − 1, in a sample of size n, where x_i is the relative frequency of each count class. In other words, the SFS summarizes the allele frequencies of the mutations present in a sample (Nielsen, 2005). Many statistical tests in population genetics use information about the proportion of SNPs that are common or rare, and about the frequency of derived and ancestral alleles, to make inferences about demographic history and possible selective regimes. One test in this category is Tajima's D (Tajima, 1989; Mitchell-Olds et al., 2007) (Figure 5). In Chapter 1, we propose two new tests that exploit this signature [32].
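The definition of the SFS and its use in Tajima's D can be sketched in a few lines. The code below builds an unfolded SFS from derived allele counts and computes D from it using the standard constants of Tajima (1989). The input data are invented, and the function names are hypothetical; the two new tests proposed in Chapter 1 are not implemented here.

```python
import math

def site_frequency_spectrum(derived_counts, n):
    """Unfolded SFS: entry i-1 holds the number of sites at which the
    derived allele is carried by i of the n sampled chromosomes."""
    sfs = [0] * (n - 1)
    for c in derived_counts:
        if 0 < c < n:              # keep segregating sites only
            sfs[c - 1] += 1
    return sfs

def tajimas_d(sfs):
    """Tajima's D computed from an SFS of length n - 1."""
    n = len(sfs) + 1
    s = sum(sfs)                   # number of segregating sites
    if s == 0:
        return 0.0
    a1 = sum(1 / i for i in range(1, n))
    a2 = sum(1 / i ** 2 for i in range(1, n))
    b1 = (n + 1) / (3 * (n - 1))
    b2 = 2 * (n * n + n + 3) / (9 * n * (n - 1))
    c1 = b1 - 1 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1 ** 2
    e1 = c1 / a1
    e2 = c2 / (a1 ** 2 + a2)
    # Mean number of pairwise differences, obtained from the SFS.
    pi = sum(x * 2 * i * (n - i) / (n * (n - 1))
             for i, x in enumerate(sfs, start=1))
    return (pi - s / a1) / math.sqrt(e1 * s + e2 * s * (s - 1))

n = 10
print(site_frequency_spectrum([1, 1, 5, 9], n))
print(tajimas_d(site_frequency_spectrum([5] * 20, n)))  # intermediate frequencies
print(tajimas_d(site_frequency_spectrum([1] * 20, n)))  # singletons only
```

An excess of intermediate-frequency variants, the pattern expected near a balanced polymorphism, gives positive D, whereas an excess of rare variants gives negative D. Demographic events can produce similar shifts, so the sign alone is not conclusive.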

Relationship between levels of polymorphism and divergence. Other tests for balancing selection on this time scale are based on comparisons between levels of polymorphism (observed within a species) and levels of divergence (between two species) [33]. These include the Hudson-Kreitman-Aguadé test, or HKA (Hudson et al., 1987), used to test predictions of the neutral model of molecular evolution. An excess of polymorphism relative to divergence can be interpreted as evidence of balancing selection (Figure 5).

[32] Figures 1 and 5 of Chapter 1 show a schematic allele frequency spectrum and spectra from real data, respectively.
[33] Throughout the text, and particularly in Chapter 1, I refer to substitutions between humans and chimpanzees, although in principle these could be substitutions between any two species.

A modified version of the HKA test, the McDonald-Kreitman (MK) test, compares polymorphism and divergence between different classes of mutation, such as synonymous and nonsynonymous substitutions (McDonald and Kreitman, 1991). An excess of nonsynonymous divergence is a pattern consistent with positive selection (which fixes differences between species and removes polymorphisms). An excess of nonsynonymous polymorphisms, in turn, can be interpreted as a signature of balancing selection (Figure 5). However, as discussed by Eyre-Walker (2006), this statistic is very sensitive to changes in population size, and even modest increases in Ne can create spurious evidence of adaptive evolution. Sella et al. (2009), for their part, argue that one of the most problematic assumptions of the MK test is that the fraction of new mutations that is neutral, estimated from the polymorphism data of one of the species, has remained constant over the evolutionary history of the two species being compared. This assumption is problematic because a demographic history that leaves the current population out of equilibrium makes it false when selection is weak, resulting in incorrect estimates of the rate of adaptive evolution (for example, Eyre-Walker, 2006; Fay et al., 2001; Nielsen, 2005; Sella et al., 2009).
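The MK comparison amounts to a 2 × 2 contingency table of polymorphic and divergent sites in each mutation class. The sketch below computes the neutrality index, NI = (Pn/Ps)/(Dn/Ds), a commonly used summary of that table that is not mentioned in the text, together with a two-sided Fisher exact test implemented directly from the hypergeometric distribution. The counts are invented for illustration and the function names are hypothetical.

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact test for the table [[a, b], [c, d]]:
    sums the probabilities of all tables with the same margins that are
    no more probable than the observed one."""
    row1, row2, col1, n = a + b, c + d, a + c, a + b + c + d

    def prob(x):
        return comb(row1, x) * comb(row2, col1 - x) / comb(n, col1)

    p_obs = prob(a)
    low, high = max(0, col1 - row2), min(row1, col1)
    return sum(prob(x) for x in range(low, high + 1)
               if prob(x) <= p_obs * (1 + 1e-9))

def mk_test(pn, ps, dn, ds):
    """Return (neutrality index, Fisher p-value) for an MK table.
    pn, ps: nonsynonymous and synonymous polymorphisms;
    dn, ds: nonsynonymous and synonymous fixed differences."""
    neutrality_index = (pn / ps) / (dn / ds)
    return neutrality_index, fisher_exact_two_sided(pn, ps, dn, ds)

# Invented counts with an excess of nonsynonymous polymorphism.
ni, p_value = mk_test(pn=15, ps=20, dn=5, ds=20)
print(ni, p_value)
```

NI above 1 indicates an excess of nonsynonymous polymorphism and NI below 1 an excess of nonsynonymous divergence. The caveats about demography raised above apply equally to this summary.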

Sharing of polymorphisms between species. Finally, if a polymorphism is maintained by balancing selection for long enough, the same polymorphism may be found in two sister species: a trans-species polymorphism (Hedrick, 2012; Klein et al., 1998) (Figure 5).

If the possibility that identical mutations arose independently in the two species can be ruled out, for example by requiring that more than one polymorphism be shared within a short haplotype (e.g. Leffler et al., 2013; Teixeira et al., 2015), the other possible explanation is that the polymorphism is neutral and has been maintained “by chance” in both species. If the divergence time between the species is much greater than the mean intraspecific coalescence time, this alternative has a very low probability.

Teixeira et al. (2015), for example, showed that the probability that a polymorphism in humans is also a polymorphism in chimpanzees and bonobos is on the order of 10⁻¹⁰. Even assuming independence among all SNPs and considering samples of 20 individuals per species, the probability of observing a shared polymorphism is ∼0.00005, that is, practically zero. Therefore, although this signature is extremely convincing as evidence of long-term balancing selection, it is quite rare.

Balancing selection in the human genome

“Balancing selection is not unique to the human lineage, nor is it the dominant force in shaping the human genome. It is, however, there and its effects are without dispute.” (Vallender and Johnson, 2008)

Although several genomic scans have been carried out to locate targets of positive selection (reviewed in Akey, 2009), comparatively few studies have sought targets of balancing selection. This is partly due to the difficulty of detecting this type of selection at the genomic scale (Andrés et al., 2009). Figure 6 summarizes the studies that searched for signatures of balancing selection in humans.

Until very recently, no more than a handful of cases of balancing selection had been established in humans. Even with the advent of sequence data for many genes, few targets were proposed beyond the HLA genes, the ABO gene and the hemoglobin S gene (Allison, 1954; Asthana et al., 2005; Bubb et al., 2006; Hughes and Nei, 1988; Ségurel et al., 2013).

Figure 6: Summary of the studies that searched for signatures of long-term balancing selection in humans. SFS, allele frequency spectrum; dbSNP, SNP database, indicates that the study did not use SNP frequencies estimated for individual populations, but rather a common database; trSNPs, trans-species SNPs shared between human and chimpanzee (H-C) or human, chimpanzee and bonobo (H-C-B). a, excess of SNPs relative to divergent sites between humans and chimpanzees, a signature of balancing selection. ** although they found 16 regions with high genetic diversity besides the HLA genes and the ABO gene, the authors dismissed them as real candidates because they lack evidence of trans-species polymorphism, which is a very stringent criterion (see “Signatures of balancing selection”).

The first genome-scale study to identify new signatures of balancing selection in humans was that of Andrés et al. (2009), who scanned the human genome for genes under long-term balancing selection. The scan used an exome dataset consisting of 13,400 genes from two human populations (19 North Americans of European ancestry and 20 North Americans of African ancestry). The method contrasted patterns of polymorphism in each gene with the rest of the genome and with neutral expectations, obtained through simulations parameterized by the sequencing data themselves. It searched for genes with two signatures of long-term balancing selection: an excess of variants at intermediate frequencies and an excess of polymorphic sites in humans (relative to substitutions between humans and chimpanzees) [34].

Only genes that were extreme for both signatures, i.e., highly significant in two independent tests, were considered candidates. Despite the low statistical power of this type of approach (Fijarczyk and Babik, 2015), it yields few false positives. Sixty genes with signatures of long-term balancing selection were found, many of them involved in immunity, but not restricted to the classical examples described until then (Key et al., 2014).

Most loci under balancing selection were shared between the two populations, with few exceptions: four genes showed evidence of balancing selection only in the Americans of African ancestry and nine only in the Americans of European ancestry.

The absence of flanking non-coding, intergenic and intronic sequences from the datasets of that study has two consequences: first, the effects of balancing selection on regions linked to selected loci could not be quantified; second, targets of balancing selection in non-coding regions could not be detected.

[34] Both signatures underlie tests discussed in the previous section.

More recently, DeGiorgio et al. (2014) developed two new tests (T1 and T2) to identify the local patterns of diversity expected at positions linked to a balanced polymorphism, and applied them to genomic data from two populations (YRI, Yoruba, an African population, and CEU, a population of North Americans from Utah with northern European ancestry). They identified 200 candidate genes (Figure 6), some already known (e.g. Andrés et al., 2009) and others new (Key et al., 2014).

Three important limitations of this work are: (1) although the study shows that T1 and T2 have high power under simple models, the authors did not explore human demographic models (e.g. Gravel et al., 2011); (2) the 200 reported genes come from a list of the “100 most extreme genes” for each test and population, but no criterion was established to justify reporting 100 genes; therefore this work, while unquestionably adding much to the accumulated knowledge of regions of the human genome bearing signatures of long-term balancing selection, does not provide an approximate estimate of how frequent such selection may be in the human genome; (3) although the authors used whole-genome data, they report only protein-coding genes as targets, and do not explore or report candidate genomic regions outside gene boundaries.

Finally, two genome-scale studies searched for polymorphisms shared between humans and other species: this is the most extreme signature of long-term balancing selection, since it requires polymorphisms to be maintained for millions of years. Leffler et al. (2013) examined the complete genomes of humans and chimpanzees in search of shared polymorphisms and found six genes with strong evidence of shared balanced polymorphisms (Figure 6), and Teixeira et al. (2015) examined the exons of humans, chimpanzees and bonobos, finding strong evidence of a balanced polymorphism shared among the three species in the LAD1 gene, in addition to some HLA genes (Figure 6).

Short- and long-term balancing selection events leave genomic signatures that do not allow the mechanisms of balancing selection presented earlier to be told apart. Moreover, a signal of long-term selection does not mean that selection has lasted into the recent past or up to the current generation. Defining the time scales of selective regimes is therefore important both for determining which tools are appropriate and for the degree of resolution that can be achieved.

On their own, the population genetic analyses described above can identify genes that have evolved non-neutrally over thousands or even millions of generations. This is the type of investigation proposed in Chapter 1, in order to better understand the impact of long-term balancing selection on the human genome. Such an investigation does not, however, provide conclusive information about the phenotypic trait that was the target of selection (Mitchell-Olds et al., 2007). This is an ongoing challenge in population genomics, and will be discussed in the item Discussion and Final Considerations (Page 214).

Genetic load induced by balancing selection

“Hence there must be a far greater number of different kinds of ailments
whose characteristics are traceable to genetic changes of natural origin than
there are different kinds of infectious diseases. This general reasoning does
not in itself give us much idea, however, of the actual frequencies with which
these mutational disorders occur in populations. For this it is necessary to
turn to quantitative studies.” (Muller, 1950)

As we have seen, sequence polymorphisms may be evolving neutrally (probably the great majority of them), but they may also represent transient states of genetic variants on their way to elimination, because they are deleterious, or to fixation, because they are advantageous (Hudson et al., 1987; Mitchell-Olds et al., 2007; Sellis et al., 2011). Finally, a certain number of these polymorphisms are maintained within individual populations by balancing selection, or across the entire range of the species by local adaptation [35] (reviewed in Mitchell-Olds et al., 2007).

Genomic studies have documented large amounts of polymorphism, and much progress has been made in understanding which evolutionary processes shaped this variation (reviewed in Mitchell-Olds et al., 2007). For an evolutionary geneticist, detecting possible targets of natural selection in the genome is fundamental. First, patterns compatible with a scenario of adaptive evolution (positive or balancing selection) must be sought. One route is through the signatures of selection and the neutrality tests seen earlier. Once the regions of interest have been detected, it is necessary to investigate whether there is a biological basis to support the claim that such patterns do result from selection. Finally, one can investigate whether selection on a given locus has deleterious consequences for neighboring regions.

The existence of signatures of positive or balancing selection indicates that there may have been adaptation for some trait (the one related to the selective pressure), but the consequences of the selective pressure may also include changes that cause maladaptation, arising in other regions of the gene under selection or even in adjacent non-coding, but functional, regions.

[35] A situation in which genotypes from different populations have higher fitness in their environments of origin, owing to historical natural selection in the region.


Genetic load

Understanding the genetic basis of adaptation requires defining “genetic load”. The concept was first discussed by Haldane (1937) and later elaborated by Muller (1950). The central idea is that a population falls short of its maximum fitness for two reasons: (1) the occurrence of recurrent deleterious mutations (mutational load); and (2) the production of less fit homozygotes in situations where heterozygotes are the fittest genotype (segregational load). This decrease relative to the maximum fitness of a population is the genetic load.
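The segregational load can be made concrete with the textbook one-locus model of heterozygote advantage, in which the genotypes A1A1, A1A2 and A2A2 have fitnesses 1 − s, 1 and 1 − t. This parameterization and the selection coefficients below are assumptions introduced for illustration; they are not taken from the studies cited in the text. Under this model the equilibrium frequency of A1 is t/(s + t) and the load equals st/(s + t).

```python
def segregational_load(s, t):
    """Equilibrium allele frequency and genetic load for a locus with
    heterozygote advantage (fitnesses 1 - s, 1, 1 - t)."""
    p_eq = t / (s + t)             # equilibrium frequency of allele A1
    q_eq = 1 - p_eq
    mean_fitness = (p_eq ** 2 * (1 - s)
                    + 2 * p_eq * q_eq
                    + q_eq ** 2 * (1 - t))
    load = 1 - mean_fitness        # shortfall relative to the fittest genotype
    return p_eq, load

# Hypothetical selection coefficients against the two homozygotes.
p_eq, load = segregational_load(s=0.1, t=0.3)
print(p_eq, load)
```

The load is the fraction by which mean fitness falls below that of the heterozygote: it exists because Mendelian segregation keeps producing the less fit homozygotes even at equilibrium.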

Although deleterious mutations incur a fitness cost, they are not always removed from the population efficiently. It is even harder to remove deleterious mutations from small populations (low values of Ne [36]), and their accumulation can lead to reductions in population size and, ultimately, to extinction (Chun and Fay, 2011).

In molecular evolution, most mutations are not adaptive: they either worsen (maladaptive mutations) or do not affect (neutral mutations) the degree of adaptation of traits to the environment (Orr, 1998). How effectively deleterious variants are removed from a population depends on several factors: mutation (which constantly creates new deleterious variants), dominance (which influences how “visible” the mutation is to selection) (e.g. Sellis et al., 2011), demography and linkage (Gravel, 2016).

[36] Effective population size: reflects the size of an idealized population that would experience drift to the same extent as the actual population. Ne can be smaller than the census population size because of several factors, including variance in reproductive success, a demographic history with genetic bottlenecks (extreme reductions in population size followed by an expansion from a sample of the original population) and inbreeding (reviewed in Cutter and Payseur, 2013).

In addition, maladaptation of a population can be caused by a lack of segregating genotypic variation with which to respond to selection. Genetic drift and inbreeding, for example, move populations away from their adaptive peaks and can lead to phenotypic maladaptation (Crespi, 2000). Pleiotropy [37] can result in maladapted populations because the joint optimization of many traits is unfeasible (Charlesworth and Charlesworth, 2010; Crespi, 2000). Finally, migration of individuals that adapted in different subpopulations can also lead to maladaptation (Charlesworth and Charlesworth, 2010; Crespi, 2000).

Maladaptation can also result from selective pressures for adaptation at linked sites, which I discuss in more detail below. In this sense, whether in phenotypic or genetic terms (accumulation of deleterious mutations), the term maladaptation alludes to an “adaptive cost” that populations must bear because they are evolving under particular selective pressures.

Effects of linkage on neutral polymorphisms

The last decade saw an explosion of studies on the prevalence and effects of natural selection, particularly in the genome of our own species. One of the most striking findings was that a relatively large proportion of the changes between humans and chimpanzees result from natural selection. For example, some studies showed that up to 10% of the substitutions we carry result from positive selection (Bustamante et al., 2005; Fay et al., 2001), a much larger fraction than a neutralist would expect (reviewed in Eyre-Walker, 2006). These results were obtained with an approach that contrasts polymorphism and divergence between humans and primates [38]; it revealed more fixed nonsynonymous differences between the two species than expected under neutrality, differences that can be explained if we assume that selection fixed different advantageous mutations in the human and chimpanzee lineages (reviewed in Eyre-Walker, 2006).

[37] Phenomenon in which a gene affects multiple traits, considered the nearly universal mode of gene action.

Natural selection is expected to leave specific signatures on the patterns of neutral variation tightly linked to the site carrying the advantageous mutation. This idea underlies the molecular population genetic methods that search for adaptations in the human genome (Kreitman and Di Rienzo, 2004). This property is crucial for neutrality tests to have power to detect regions with signatures of selection, even when only a single site is selected (reviewed in Charlesworth, 2006). Genome-scale genetic information also makes it possible to infer the consequences of natural selection for regions of the genome physically close to the selected genes or genomic regions.

There is also a well-documented positive correlation between recombination rates and levels of polymorphism in Drosophila (Zhang and Parsch, 2005) and in humans (Hellmann et al., 2003). This correlation is consistent with the idea that positive selection occurs at many places in the genome and that, when it affects a gene in a region of low recombination, it “drags” part of the chromosome along (that is, it causes a genetic hitchhiking event), resulting in loss of variation in that region of the genome. In this process, neutral variation at linked sites is reduced, and the lower the recombination rate in the region, the more pronounced the loss of diversity.

[38] As discussed on page 31 and in Figure 5.


These selective sweeps [39] lead to a drop in diversity around the selected site; diversity then recovers progressively as the distance between the neutral site and the target site increases. The size of the neighboring region that loses neutral polymorphism depends on the intensity of selection, the mutation rate and the recombination rate (reviewed in Bamshad and Wooding, 2003). It is now known that positive selection in humans substantially affects neutral sites close to the selected site, reducing their level of polymorphism by 6% across the whole genome and by 11% in the protein-coding portion of the genome (Cai et al., 2009).

Analogously, long-term balancing selection also affects neighboring neutral sites (Charlesworth, 2006). By increasing coalescence time, balancing selection raises diversity around the selected site and also changes the shape of the local site frequency spectrum (SFS), which shows an excess of alleles segregating at frequencies close to that of the balanced polymorphism (Andrés et al., 2009; Andrés, 2011; Bamshad and Wooding, 2003; Charlesworth, 2006).
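
To make the change in the shape of the SFS concrete, the minimal sketch below tabulates a folded SFS (counts of sites by minor allele count) for two windows. The function name `folded_sfs` and both sets of allele counts are hypothetical, invented purely for illustration; they are not data or code from this thesis.

```python
from collections import Counter


def folded_sfs(allele_counts, n_chromosomes):
    """Return the folded SFS as a list: entry k-1 is the number of
    polymorphic sites whose minor allele is seen k times (k = 1..n//2)."""
    spectrum = Counter()
    for count in allele_counts:
        minor = min(count, n_chromosomes - count)  # fold: ignore which allele is derived
        if minor > 0:  # skip monomorphic sites
            spectrum[minor] += 1
    return [spectrum[k] for k in range(1, n_chromosomes // 2 + 1)]


# Hypothetical allele counts at 10 SNPs in a sample of 10 chromosomes.
neutral_like_window = [1, 1, 1, 1, 2, 1, 3, 9, 2, 1]   # mostly rare variants
balanced_like_window = [5, 4, 6, 5, 1, 4, 5, 3, 5, 6]  # excess near 50%

print("neutral-like :", folded_sfs(neutral_like_window, 10))
print("balanced-like:", folded_sfs(balanced_like_window, 10))
```

In the first (made-up) window most sites fall in the rare-variant classes, whereas in the second most sites fall in the intermediate-frequency classes, which is the qualitative pattern described above for regions linked to a balanced polymorphism.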

However, unlike selective sweeps, which involve high selection coefficients and reduce neutral diversity within relatively few generations, generating signatures that extend over long stretches of the chromosome (Bamshad and Wooding, 2003), long-term balancing selection, which acts over timescales of thousands to millions of years, generates signatures in short segments around the balanced polymorphism[40]. This is because, over many generations, recombination has the opportunity to "break" the linkage between the balanced polymorphism and neighboring neutral sites (Andrés, 2011; Charlesworth, 2006), thereby restricting the linkage effect to short segments of the chromosome.

[39] A phenomenon in which a newly arisen, highly adaptive mutation rises rapidly in frequency in the population.
[40] A property that is exploited when we evaluate the power of our statistics in Chapter 1.

Linkage effects on non-neutral polymorphisms

There are two ways in which selection on one trait interferes with selection on other traits. The first arises when a gene has pleiotropic functions. Positive or balancing selection presupposes adaptation for some trait (the one related to the selective pressure), but the consequence of that selective pressure, in terms of the mutations that become fixed, may not be adaptive for every function the gene performs. In that case, the "by-product" would be a maladaptation. One example is the HLA genes, which are involved in the immune response in humans and show strong evidence of positive and balancing selection. On the other hand, many inflammatory and autoimmune diseases are also associated with HLA genes (Becker et al., 1998).

The second mode occurs when the selected site affects the fate of non-neutral mutations at genetically linked sites[41]. Natural selection does not act independently on linked loci, so its efficiency is directly related to the recombination rate (Hill and Robertson, 1966; Comeron et al., 2008). Linkage between sites reduces the efficacy of natural selection in finite populations (Hill and Robertson, 1966; Comeron et al., 2008). All else being equal, the degree of interference between selected alleles is expected to vary among regions with different recombination rates, potentially leading to changes in rates of evolution[42]. Regions of low recombination would be subject to more pronounced Hill-Robertson effects, since low recombination reduces the independence among sites and increases the relative effect of drift on the region, which is equivalent to a reduction in the local Ne (Maynard-Smith and Haigh, 1974; Comeron et al., 2008).

[41] See Figure 1 of Chapter 2 on page 170.
[42] The Hill-Robertson effect is the name given to this interference (Hill and Robertson, 1966; Comeron et al., 2008).

Given that adaptive substitutions (fixed by positive selection) are relatively common, it has become important to investigate how these events influence the action of natural selection in adjacent regions of the genome. The trajectories to fixation of neutral variants linked to advantageous variants remain unchanged, since the probability that an advantageous mutation carries a neutral mutation along with it is directly proportional to the frequency of the allele in the population (Birky and Walsh, 1988; Chun and Fay, 2011; Comeron et al., 2008; Harris, 2010; Charlesworth, 2012). However, slightly deleterious mutations, when genetically linked to advantageous variants, will have a higher probability of fixation than expected under the neutral model (Lynch, 2007).
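
One way to see why a reduction in the local Ne matters for weakly selected variants is Kimura's diffusion approximation for the fixation probability of a new mutation, P = (1 - e^(-2s)) / (1 - e^(-4Ns)). The sketch below is illustrative only and rests on stated assumptions: genic (additive) selection, an initial frequency of 1/(2N), census size equal to Ne, and linkage interference treated simply as a smaller Ne rather than modeled explicitly. The function names and the value of s are hypothetical.

```python
import math


def fixation_probability(s, n_e):
    """Kimura's diffusion approximation for a new mutation with selection
    coefficient s in a diploid population of effective size n_e
    (assumes genic selection and initial frequency 1 / (2 * n_e))."""
    if s == 0:
        return 1.0 / (2 * n_e)  # neutral limit
    # expm1(x) = e^x - 1, so this equals (1 - e^(-2s)) / (1 - e^(-4*n_e*s)).
    # Note: very large values of -4*n_e*s would overflow expm1.
    return math.expm1(-2.0 * s) / math.expm1(-4.0 * n_e * s)


def relative_to_neutral(s, n_e):
    """Fixation probability divided by the neutral expectation 1 / (2 * n_e)."""
    return fixation_probability(s, n_e) * 2 * n_e


s = -1e-4  # hypothetical slightly deleterious mutation
for n_e in (10_000, 1_000):  # e.g. high- versus low-recombination region
    print(f"Ne = {n_e:>6}: relative fixation probability = "
          f"{relative_to_neutral(s, n_e):.3f}")
```

Under these assumptions, the same deleterious mutation behaves almost neutrally when Ne is ten times smaller, which is the sense in which reduced local Ne lowers the efficacy of purifying selection.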

The effect of positive selection on linked regions in Drosophila was investigated by Betancourt and Presgraves (2002), who found that genes with slow adaptive evolution (subject mainly to purifying selection) are concentrated mainly in regions of low recombination. In addition, a significant negative correlation was found between the rate of evolution (measured as dN) and the proportion of optimal codon usage[43]. The authors ruled out relaxed selective constraint; they conclude that the cause of this correlation is interference from the coding site on the weak selection coefficient for optimal codon usage at tightly linked neighboring sites.

[43] Different synonymous sites are selectively different, since certain codons (called "optimal") are used more frequently than others, possibly because of the efficiency and accuracy of translation (Betancourt and Presgraves, 2002).
The general conclusion is that, at least in Drosophila, there appears to be a "hierarchy of selective pressures": selection against deleterious mutations is stronger than selection on advantageous nonsynonymous mutations, which in turn is stronger than selection for optimal codons. This does not mean, as the authors themselves stress, that the cumulative effect of many suboptimal codons is negligible.

Furthermore, strong directional selection (positive or negative) has been shown to increase the proportion of deleterious variants segregating in regions adjacent to those targeted by directional selection in humans (Chun and Fay, 2011). Just as genomic scans for balancing selection are far less abundant than those for positive selection, the same holds for the accumulation of deleterious variants: to date, no study has specifically investigated the impact of balancing selection on the accumulation of deleterious variants at sites in the vicinity of the balanced polymorphism.

There is evidence that genes in the vicinity of the HLA genes carry an excess of potentially deleterious variants (Mendes, 2013; Lenz et al., 2016). However, given the many peculiarities of this genomic region (Meyer et al., 2006), the general effect of balancing selection on linked non-neutral variants remains an open question. I address this question in Chapter 2.


Relevance, Questions & Hypotheses

Relevance

Over the last decade, with the increasingly frequent use of genomic scans, several lists have been generated of regions and genes, in the human genome and in other species, that bear signatures of positive selection. Access to genome sequences from many species, together with bioinformatic tools, stimulated the quantification of adaptive evolution, which shapes the patterns of polymorphism observed in humans. The fact that groups of genes defined by function, in categories such as "spermatogenesis", "olfaction", "sensory perception" and "immune response", recur in lists of candidate genes for positive selection (e.g. Nielsen, 2005; Sabeti et al., 2006) makes intrinsic biological sense: many genes in these categories are directly involved in interactions with the environment. Such observations increase our confidence in genomic scans while helping us to understand the specific adaptations of our species. Because it is widely regarded as the main mechanism responsible for adaptive evolution, positive selection has been, and continues to be, intensively studied.

Genetic variability is the object of study of the population geneticist. The selective regime known as balancing selection encompasses a range of mechanisms capable of keeping adaptive polymorphisms segregating in populations for short or long periods of time. While we now have a comprehensive map of genes that have undergone positive selection in humans, information is lacking about the targets and the extent of balancing selection in human evolutionary history.


One of the reasons is historical, as explained in the introduction: although it was once a selective regime that excited giants of evolutionary biology such as Dobzhansky and Fisher, balancing selection ceased to be a research focus of evolutionary biologists for more than two decades after the advent of the neutral theory (Kimura, 1968; Kimura, 1983). In a sense, the demonstration by Hughes and Nei (1988) that high levels of diversity are maintained specifically in the antigen-presenting groove of MHC genes, a pattern compatible with the action of balancing selection, marks a revival of interest in this topic.

The MHC remains the best and least controversial example of balancing selection in humans, as well as an example of how multiple mechanisms can act on the same system: there is independent evidence of heterozygote advantage, frequency-dependent selection, and selective pressure varying over time and/or space acting on these genes, and there is a heated debate about how much each mechanism contributes to the patterns of high diversity at the MHC.

Even so, the gap between what we know about the targets of balancing selection in the human genome and what we know about the targets of positive selection, and about how the properties of these selective regimes compare, is enormous. The few studies that have searched for signatures of balancing selection have several limitations, including a lack of power to detect the signatures and insufficient data (few populations, or data restricted to exons, for example). Of those that searched for signatures at the genomic scale, one discusses only the protein-coding targets, even though the whole genome was analyzed (DeGiorgio et al., 2014), and the other (Leffler et al., 2013) reports that most targets are non-genic, i.e. possibly regulatory or related to gene expression. However, the latter used a very stringent criterion to define targets of balancing selection (the presence of polymorphisms shared with chimpanzee), so studies that support or challenge this observation are still lacking.

The detection of signatures of natural selection is possible largely because of the signal produced at neutral sites adjacent to the sites targeted by selection. The method developed in Chapter 1 (NCD, Non-Central Deviation) is an example of the use of this property: even if only one site is actually maintained polymorphic by balancing selection[44], it will affect levels of adjacent neutral polymorphism through linkage, and the extent of this effect on the neighborhood depends on the timescale, duration and intensity of selection.
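
As a rough illustration of how a statistic could exploit this signal, the sketch below scores a genomic window by how far the minor allele frequencies of its SNPs deviate from a target frequency. This is only an assumption-laden sketch: the statistic itself is defined in Chapter 1, which is authoritative, and the root-mean-square form used here, the name `ncd_sketch`, the default target frequency of 0.5, and the allele frequencies are all illustrative choices not taken from this section.

```python
import math


def ncd_sketch(minor_allele_freqs, target_freq=0.5):
    """Root-mean-square deviation of minor allele frequencies from a
    target frequency (assumed form; see Chapter 1 for the real definition).
    Lower values mean the window's SNPs sit closer to the target frequency."""
    if not minor_allele_freqs:
        raise ValueError("window contains no informative sites")
    squared_dev = sum((p - target_freq) ** 2 for p in minor_allele_freqs)
    return math.sqrt(squared_dev / len(minor_allele_freqs))


# Hypothetical minor allele frequencies for SNPs in two windows.
neutral_like_window = [0.02, 0.05, 0.01, 0.10, 0.03, 0.20]
balanced_like_window = [0.45, 0.48, 0.40, 0.50, 0.35, 0.47]

print("neutral-like :", round(ncd_sketch(neutral_like_window), 3))
print("balanced-like:", round(ncd_sketch(balanced_like_window), 3))
```

Under this assumed form, a window dominated by rare variants scores high, while a window enriched for intermediate-frequency variants, as expected near a balanced polymorphism, scores low.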

On the other hand, little is known about the effect that balanced polymorphisms have on adjacent non-neutral variants. This knowledge is scarce even for targets of positive selection and practically non-existent for targets of balancing selection, with the exception of a few studies on HLA genes (e.g. Oosterhout, 2009; Lenz et al., 2016). A better understanding of the impact of balancing selection on human evolution is important not only for understanding how we became what we are, but also for better understanding complex diseases that occur at relatively high frequency in humans (Vallender and Johnson, 2008).

[44] A possibility, but not a certainty. For the HLA genes, many sites are known to be actively maintained polymorphic.


Questions & Hypotheses

In this context, the main questions addressed in this thesis were: (1) is it possible to develop more powerful methods to find regions of the genome that evolve under balancing selection? (2) what are the targets of long-term balancing selection in humans? (3) what are the biological properties of these targets: are they mostly (protein-coding) genes, regulatory regions, or regions that affect gene expression? (4) which functional categories are most abundant among the genes targeted by balancing selection: apart from the HLA genes, which are involved in the immune response, what can we say about the targets in terms of function? (5) what do we know about the biological importance of some of these candidate genes, based on independent studies? (6) are the targets of balancing selection shared among populations or continents? (7) how prevalent are signatures of long-term balancing selection in the human genome? Can we quantify the proportion of the human genome that has been shaped by mechanisms that maintain diversity? (8) does balancing selection on one or more sites interfere with the efficacy of purifying selection on adjacent non-neutral sites?

The hypotheses explored in this context were:

• (a) balancing selection is not very frequent in the genome, but is probably more frequent than has been estimated so far, since the methods and/or data used did not allow a less conservative estimate of its frequency in the genome. To test this hypothesis, we proposed a new statistic (Chapter 1), with increased power relative to commonly used neutrality tests and optimized for scanning the human genome.


• (b) balancing selection affects both genic regions and regulatory regions or regions that control expression.

• (c) balancing selection mostly affects genes related to the defense of the organism (e.g. genes that are part of immune system pathways, membrane proteins, extracellular matrix proteins, etc.), or related to interactions with the extracellular environment or with other cells, and to reproduction[45].

• (d) there will be more sharing of balancing selection targets among populations from the same continent than among populations from different continents, and the overall level of sharing among any populations will be high; few genes/regions would show opposite signals of selection in different populations/continents.

• (e) there is an excess of nonsynonymous SNPs (or of deleterious SNPs) in regions close to targets of balancing selection. The possible explanation would be that, through linkage, weakly deleterious mutations could rise in frequency together with the sites targeted by balancing selection. On the other hand, a deficit of deleterious mutations could be related to the frequency of the variants (an artifact) or to the fact that balancing selection increases the local effective population size.

Hypotheses (a-d) were tested with the targets obtained using a method that I developed with collaborators, which scanned the whole genome for signatures of balancing selection in four populations from two continents. These hypotheses are explored in Chapter 1.

The premise of hypothesis (e) is that balancing selection is sufficiently stronger than purifying selection against the deleterious variants that the latter cannot counterbalance the former. This last hypothesis was explored in Chapter 2.

[45] Although relatively few cases have been described so far and the mechanisms are not fully understood, they seem promising. Examples include genes involved in spermatogenesis, sperm-egg recognition, and follicle-stimulating hormone (reviewed in Vallender and Johnson, 2008).

Bibliography

Akey, J. M. (2009). “Constructing genomic maps of positive selection in humans: where do we go from here?” In: Genome Research 19 (5), pp. 711–722.
Allison, A. C. (1954). “Protection Afforded by Sickle-cell Trait Against Subtertian Malarial Infection.” In: British Medical Journal 1 (4857), pp. 290–294.
Andrés, A. M. (2011). “Balancing Selection in the Human Genome”. In: eLS, pp. 1–8.
Andrés, A. M. et al. (2009). “Targets of balancing selection in the human genome.” In: Molecular Biology and Evolution 26 (12), pp. 2755–64.
Asthana, S., S. Schmidt and S. R. Sunyaev (2005). “A limited role for balancing selection”. In: Trends in genetics : TIG 21 (1), pp. 30–32.
Bamshad, M. and S. P. Wooding (2003). “Signatures of natural selection in the human genome”. In: Genetics 4 (February), pp. 99–111.
Becker, K. G., R. M. Simon, J. E. Bailey-Wilson, B. Freidlin, W. E. Biddison, H. F. McFarland and J. M. Trent (1998). “Clustering of non-major histocompatibility complex susceptibility candidate loci in human autoimmune diseases”. In: Proceedings of the National Academy of Sciences of the United States of America 95 (17), pp. 9979–9984.
Bergland, A. O., E. L. Behrman, K. R. O’Brien, P. S. Schmidt and D. A. Petrov (2014). “Genomic Evidence of Rapid and Stable Adaptive Oscillations over Seasonal Time Scales in Drosophila”. In: PLoS Genetics 10 (11), e1004775.
Bernardi, G. (2007). “The neoselectionist theory of genome evolution.” In: Proceedings of the National Academy of Sciences of the United States of America 104 (20), pp. 8385–90.
Bersaglieri, T., P. C. Sabeti, N. Patterson, T. Vanderploeg, S. F. Schaffner, J. A. Drake, M. Rhodes, D. E. Reich and J. N. Hirschhorn (2004). “Genetic signatures of strong recent positive selection at the lactase gene.” In: American Journal of Human Genetics 74 (6), pp. 1111–20.

Betancourt, A. J. and D. C. Presgraves (2002). “Linkage limits the power of natural selection in Drosophila.” In: Proceedings of the National Academy of Sciences of the United States of America 99 (21), pp. 13616–20.
Birky, W. and J. B. Walsh (1988). “Effects of linkage on rates of molecular evolution.” In: Proceedings of the National Academy of Sciences of the United States of America 85, pp. 6414–6418.
Bitarello, B. D., R. D. S. Francisco and D. Meyer (2015). “Heterogeneity of dN/dS Ratios at the Classical HLA Class I Genes over Divergence Time and Across the Allelic Phylogeny”. In: Journal of Molecular Evolution 82 (1), pp. 38–50.
Borghans, J., J. Beltman and R. Boer (2004). “MHC polymorphism under host-pathogen coevolution”. In: Immunogenetics 55 (11), pp. 732–739.
Bromham, L. and D. Penny (2003). “The modern molecular clock.” In: Nature reviews. Genetics 4 (3), pp. 216–224.
Bubb, K. L. et al. (2006). “Scan of human genome reveals no new Loci under ancient balancing selection.” In: Genetics 173 (4), pp. 2165–77.
Bustamante, C. D. et al. (2005). “Natural selection on protein-coding genes in the human genome”. In: Nature 437 (7062), pp. 1153–1157.
Cai, J. J., J. M. Macpherson, G. Sella and D. A. Petrov (2009). “Pervasive Hitchhiking at Coding and Regulatory Sites in Humans”. In: PLoS Genetics 5 (1), pp. 1–13.
Chakraborty, M. and J. D. Fry (2015). “Evidence that Environmental Heterogeneity Maintains a Detoxifying Enzyme Polymorphism in Drosophila melanogaster”. In: Current Biology 26 (2), pp. 1–5.
Charlesworth, B. (2000). “Fisher, Medawar, Hamilton and the Evolution of Aging”. In: Genetics 156 (3), pp. 927–931.
Charlesworth, B. (2012). “The effects of deleterious mutations on evolution at linked sites”. In: Genetics 190 (1), pp. 5–22.
Charlesworth, B. and D. Charlesworth (2010). Elements of Evolutionary Genetics. 1st ed. Roberts and Company Publishers, p. 768. ISBN: 0981519423.
Charlesworth, D. (2006). “Balancing selection and its effects on sequences in nearby genome regions.” In: PLoS Genetics 2 (4), pp. 379–384.
Chun, S. and J. C. Fay (2011). “Evidence for hitchhiking of deleterious mutations within the human genome.” In: PLoS genetics 7 (8), e1002240.

Clarke, B. (1962). “Balanced polymorphism and the diversity of sympatric species”. In: Taxonomy and Geography. Ed. by D. Nichols. Oxford: Systematics Association.
Comeron, J. M., A. Williford and R. M. Kliman (2008). “The Hill-Robertson effect: evolutionary consequences of weak selection and linkage in finite populations.” In: Heredity 100 (1), pp. 19–31.
Connallon, T. and A. G. Clark (2013). “Antagonistic versus nonantagonistic models of balancing selection: characterizing the relative timescales and hitchhiking effects of partial selective sweeps.” In: Evolution; international journal of organic evolution 67 (3), pp. 908–17.
Crespi, B. J. (2000). “Short Review The evolution of maladaptation”. In: Heredity 84 (March 1999), pp. 623–629.
Crow, J. F. (1987). “Muller, Dobzhansky, and overdominance”. In: Journal of the History of Biology 20 (3), pp. 351–380.
Cutter, A. D. and B. A. Payseur (2013). “Genomic signatures of selection at linked sites: unifying the disparity among species.” In: Nature reviews. Genetics 14 (4), pp. 262–74.
Darwin, C. (1859). The origin of species: complete and fully illustrated. 1979 ed. New York: Gramercy Books. ISBN: 9780517123201.
Darwin, C. (1876). The effects of cross and self fertilisation in the vegetable kingdom.
De Boer, R. J., J. A. M. Borghans, M. van Boven, C. Kesmir and F. J. Weissing (2004). “Heterozygote advantage fails to explain the high degree of polymorphism of the MHC.” In: Immunogenetics 55 (11), pp. 725–731.
DeGiorgio, M., K. E. Lohmueller and R. Nielsen (2014). “A model-based approach for identifying signatures of ancient balancing selection in genetic data.” In: PLoS genetics 10 (8), e1004561.
Dempster, E. R. (1955). “Maintenance of genetic heterogeneity.” In: Cold Spring Harbor Symposia on Quantitative Biology. Cold Spring Harbor Laboratory Press, pp. 25–32.
Dobzhansky, T. (1937). Genetics and the Origin of Species. 2nd. New York: Columbia University Press.
Doherty, P. C. and R. M. Zinkernagel (1975). “Enhanced immunological surveillance in mice heterozygous at the H-2 gene complex”. In: Nature 256 (5512), pp. 50–52.
Enard, D., F. Depaulis and H. Roest Crollius (2010). “Human and Non-Human Primate Genomes Share Hotspots of Positive Selection”. In: PLoS Genetics 6 (2), pp. 1–13.

Eyre-Walker, A. (2006). “The genomic rate of adaptive evolution.” In: Trends in ecology & evolution 21 (10), pp. 569–75.
Eyre-Walker, A. and P. D. Keightley (1999). “High genomic deleterious mutation rates in hominids”. In: Nature 397 (6717), pp. 344–347.
Fay, J. C., G. J. Wyckoff and C.-I. I. Wu (2001). “Positive and negative selection on the human genome.” In: Genetics 158 (3), pp. 1227–34.
Fijarczyk, A. and W. Babik (2015). “Detecting balancing selection in genomes: Limits and prospects”. In: Molecular Ecology, n/a–n/a.
Fisher, R. A. (1922). “On the Dominance Ratio.” In: Proc. R. Soc. 42, pp. 321–341.
Fu, W. and J. M. Akey (2013). “Selection and Adaptation in the Human Genome”. In: Annual Review of Genomics and Human Genetics 14 (1), pp. 467–489.
Garrigan, D. and P. W. Hedrick (2003). “Detecting adaptive molecular polymorphism : Lessons from the MHC”. In: Evolution 57 (8), pp. 1707–1722.
Gillespie, J. H. (1991). The causes of molecular evolution. Oxford: Oxford University Press. ISBN: 0-19-509271-6.
Gillespie, J. H. and C. Langley (1974). “A general model to account for enzyme variation in natural populations”. In: Genetics 76 (4), pp. 837–48.
Gloss, A. D. and N. K. Whiteman (2016). “Balancing Selection: Walking a Tightrope”. In: Current Biology 26 (2), R73–R76.
Gravel, S. (2016). “When Is Selection Effective?” In: Genetics 203 (1), pp. 451–462.
Gravel, S., B. M. Henn, R. N. Gutenkunst, A. R. Indap, G. T. Marth, A. G. Clark, F. Yu, R. A. Gibbs and C. D. Bustamante (2011). “Demographic history and rare allele sharing among human populations.” In: Proceedings of the National Academy of Sciences of the United States of America 108 (29), pp. 11983–8.
Haldane, J. (1937). “The Effect of Variation on Fitness”. In: The American Naturalist 71 (735), pp. 337–349.
Harris, E. E. (2010). “Nonadaptive processes in primate and human evolution.” In: American journal of physical anthropology 143 Suppl, pp. 13–45.
Harris, E. E. and D. Meyer (2006). “The Molecular Signature of Selection Underlying Human Adaptations”. In: Yearbook of Physical Anthropology 130, pp. 89–130.

Haygood, R., C. C. Babbitt, O. Fedrigo and G. A. Wray (2010). “Contrasts between adaptive coding and noncoding changes during human evolution”. In: Proceedings of the National Academy of Sciences of the United States of America 107 (17), pp. 7853–7857.
Hedrick, P. W. (2006). “Genetic Polymorphism in Heterogeneous Environments: The Age of Genomics”. In: Annual Review of Ecology, Evolution, and Systematics 37, pp. 67–93.
Hedrick, P. W. (2012). “What is the evidence for heterozygote advantage selection?” In: Trends in Ecology & Evolution 27 (12), pp. 698–704.
Hellmann, I., I. Ebersberger, S. E. Ptak, S. Pääbo and M. Przeworski (2003). “A neutral explanation for the correlation of diversity with recombination rates in humans.” In: American journal of human genetics 72 (6), pp. 1527–35.
Hill, W. G. and A. Robertson (1966). “The effect of linkage on limits to artificial selection”. In: Genetical Research 8 (03), p. 269.
Hudson, R. R., M. Kreitman and M. Aguade (1987). “A Test of Neutral Molecular Evolution Based on Nucleotide Data”. In: Genetics 116 (1), pp. 153–159.
Hughes, A. L. and M. Nei (1989). “Nucleotide substitution at major histocompatibility complex class II loci: evidence for overdominant selection”. In: Proceedings of the National Academy of Sciences of the United States of America 86 (3), pp. 958–962.
Hughes, A. L. and M. Nei (1988). “Pattern of nucleotide substitution at major histocompatibility class I loci reveals overdominant selection”. In: Letters to Nature 335 (8), pp. 167–170.
Innocenti, P. and E. H. Morrow (2010). “The sexually antagonistic genes of drosophila melanogaster”. In: PLoS Biology 8 (3), e1000335.
Jablonski, N. G. and G. Chaplin (2010). “Human skin pigmentation as an adaptation to UV radiation”. In: Proceedings of the National Academy of Sciences 107 (Supplement_2), pp. 8962–8968.
Key, F. M., J. C. Teixeira, C. de Filippo and A. M. Andrés (2014). “Advantageous diversity maintained by balancing selection in humans”. In: Current Opinion in Genetics & Development 29, pp. 45–51.
Kimura, M. (1991). “The neutral theory of molecular evolution: a review of recent evidence”. In: Japanese Journal of Genetics 66 (4), pp. 367–386.
Kimura, M. (1968). “Evolutionary rate at the molecular level”. In: Nature 217, pp. 624–626.

Kimura, M. (1983). The Neutral Theory of Molecular Evolution. Cambridge: Cambridge University Press. ISBN: 9780511623486. URL: http://ebooks.cambridge.org/ref/id/CBO9780511623486.
Kimura, M. and J. F. Crow (1963). “The Measurement of Effective Population Number”. In: Evolution 17 (3), pp. 279–288.
Kimura, M. and J. F. Crow (1964). “The Number of Alleles that Can Be Maintained in a Finite Population”. In: Genetics 49, pp. 725–738.
Klein, J., A. Sato, S. Nagl and C. O’hUigin (1998). “Molecular trans-species polymorphism”. In: Annual Review of Ecology and Systematics 29, pp. 1–21.
Kreitman, M. and A. Di Rienzo (2004). “Balancing claims for balancing selection”. In: Trends in Genetics 20 (7), pp. 300–304.
Lande, R. (1975). “The maintenance of genetic variability by mutation in a polygenic character with linked loci”. In: Genetical Research 26 (3), pp. 221–35.
Leffler, E. M. et al. (2013). “Multiple Instances of Ancient Balancing Selection Shared Between Humans and Chimpanzees”. In: Science 339 (6127), pp. 1578–1582.
Lenz, T. L., V. Spirin, D. M. Jordan and S. R. Sunyaev (2016). “Excess of Deleterious Mutations around HLA Genes Reveals Evolutionary Cost of Balancing Selection”. In: bioRxiv, pp. 1–30.
Levene, H. (1953). “Genetic Equilibrium When More Than One Ecological Niche is Available”. In: The American Naturalist 87 (836), pp. 331–333.
Lewontin, R. C. and J. L. Hubby (1966). “A Molecular Approach to the Study of Genic Heterozygosity in Natural Populations. II. Amount of Variation and Degree of Heterozygosity in Natural Populations of Drosophila pseudoobscura”. In: Genetics 54 (2), pp. 595–609.
Lynch, M. (2007). “The evolution of genetic networks by non-adaptive processes.” In: Nature reviews. Genetics 8 (10), pp. 803–13.
Maynard-Smith, J. and J. Haigh (1974). “The hitch-hiking effect of a favorable gene.” In: Genetical Research (23), pp. 23–35.
McDonald, J. H. and M. Kreitman (1991). “Adaptive protein evolution at the Adh locus in Drosophila.” In: Nature 351 (6328), pp. 652–4.
Mendes, F. (2013). Natural selection on HLA and its effects on adjacent regions of the genome. Tech. rep. Universidade de São Paulo. URL: http://www.teses.usp.br/teses/disponiveis/41/41131/tde-02082013-161104/pt-br.php.

Meyer, D., R. M. Single, S. J. Mack, H. A. Erlich and G. Thomson (2006). “Signatures of demographic history and natural selection in the human major histocompatibility complex Loci.” In: Genetics 173 (4), pp. 2121–2142.
Mitchell-Olds, T., J. H. Willis and D. B. Goldstein (2007). “Which evolutionary processes influence natural genetic variation for phenotypic traits?” In: Nature reviews. Genetics 8 (11), pp. 845–856.
Muller, H. J. (1950). “Our Load of Mutations”. In: The American Journal of Human Genetics 2 (2), pp. 111–176.
Nielsen, R. (2005). “Molecular Signatures of Natural Selection”. In: Annual Review of Genetics 39 (1), pp. 197–218.
Nielsen, R. et al. (2005). “A Scan for Positively Selected Genes in the Genomes of Humans and Chimpanzees”. In: PLoS Biology 3 (6), e170.
Ohta, T. (1973). “Slightly Deleterious Mutant Substitutions in Evolution”. In: Nature 246 (5428), pp. 96–98.
Ohta, T. (1995). “Synonymous and nonsynonymous substitutions in mammalian genes and the nearly neutral theory”. In: Journal of Molecular Evolution 40, pp. 56–63.
Ohta, T. and J. H. Gillespie (1996). “Development of Neutral and Nearly Neutral Theories”. In: Theoretical Population Biology 49 (2), pp. 128–142.
Oosterhout, C. van (2009). “A new theory of MHC evolution: beyond selection on the immune genes.” In: Proceedings of the Royal Society of London. Series B, Biological Sciences 276 (1657), pp. 657–65.
Orr, H. A. (1998). “The population genetics of adaptation: the distribution of factors fixed during adaptive evolution.” In: Evolution 52 (4), pp. 935–949.
Orr, H. A. (2005). “The genetic theory of adaptation: a brief history.” In: Nature Reviews Genetics 6 (2), pp. 119–27.
Prout, T. (1968). “Sufficient Conditions for Multiple Niche Polymorphism”. In: The American Naturalist 102 (928), pp. 493–496.
Prout, T. (2000). “How well does opposing selection maintain variation?” In: Evolutionary genetics: from molecules to morphology. Cambridge: Cambridge University Press, pp. 157–181.

Prugnolle, F., A. Manica, M. Charpentier, J. F. Guégan, V. Guernier and F. Balloux (2005). “Pathogen-driven selection and worldwide HLA class I diversity.” In: Current Biology 15 (11), pp. 1022–7.
Richman, A. (2000). “Evolution of balanced genetic polymorphism”. In: Molecular Ecology 9 (12), pp. 1953–1963.
Sabeti, P. C., S. F. Schaffner, B. Fry, J. Lohmueller, P. Varilly, O. Shamovsky, A. Palma, T. S. Mikkelsen, D. Altshuler and E. S. Lander (2006). “Positive natural selection in the human lineage.” In: Science 312 (5780), pp. 1614–20.
Sabeti, P. C. et al. (2007). “Genome-wide detection and characterization of positive selection in human populations.” In: Nature 449 (7164), pp. 913–8.
Ségurel, L., Z. Gao and M. Przeworski (2013). “Ancestry runs deeper than blood: The evolutionary history of ABO points to cryptic variation of functional importance”. In: BioEssays 35 (10), pp. 862–867.
Sella, G., D. A. Petrov, M. Przeworski and P. Andolfatto (2009). “Pervasive Natural Selection in the Drosophila Genome?” In: PLoS Genetics 5 (6).
Sellis, D., B. J. Callahan, D. A. Petrov and P. W. Messer (2011). “Heterozygote advantage as a natural consequence of adaptation in diploids”. In: Proceedings of the National Academy of Sciences 108 (51), pp. 20666–20671.
Slade, R. and H. McCallum (1992). “Overdominant vs. frequency-dependent selection at MHC loci.” In: Genetics 132, pp. 861–864.
Spurgin, L. G. and D. S. Richardson (2010). “How pathogens drive genetic diversity: MHC, mechanisms and misunderstandings.” In: Proceedings. Biological sciences / The Royal Society 277 (1684), pp. 979–88.
Tajima, F. (1989). “Statistical method for testing the neutral mutation hypothesis by DNA polymorphism.” In: Genetics 123 (3), pp. 585–595.
Teixeira, J. C. et al. (2015). “Long-Term Balancing Selection in LAD1 Maintains a Missense Trans-Species Polymorphism in Humans, Chimpanzees, and Bonobos”. In: Molecular Biology and Evolution 32 (5), pp. 1186–1196.
Tishkoff, S. A. and S. M. Williams (2002). “Genetic analysis of African populations: human evolution and complex disease.” In: Nature Reviews Genetics 3 (8), pp. 611–621.

59
Introdução Geral

Trachtenberg, E. et al. (2003). “Advantage of rare HLA supertype in HIV disease progression”.
Em: Nature Medicine 9, pp. 928–935.
Vallender, E. J. e W. E. Johnson (2008). “Balancing Selection in Human Evolution”. Em: eLS.
Watterson, G. A. (1978). “The homozygosity test of neutrality.” Em: Genetics 88 (2), pp. 405–17.
Williams, G. C. (1957). “Pleiotropy, Natural Selection, and the Evolution of Senescence”. Em:
Evolution 11 (4), p. 398.
Wright, S. (1937). “The Distribution of Gene Frequencies in Populations.” Em: Proceedings of the
National Academy of Sciences 23 (6), pp. 307–320.
Yang, Z. e W. J. Swanson (2002). “Codon-Substitution Models to Detect Adaptive Evolution that
Account for Heterogeneous Selective Pressures Among Site Classes”. Em: Molecular Biology
and Evolution 19 (1), pp. 49–57.
Zhang, Z. e J. Parsch (2005). “Positive correlation between evolutionary rate and recombination
rate in Drosophila genes with male-biased expression.” Em: Molecular Biology and Evolution
22 (10), pp. 1945–7.

Chapter 1

Searching for targets of balancing selection in the human genome

Preliminary Remarks
In this chapter I present a manuscript, currently under final review by the
co-authors, in which we develop a new statistic for detecting instances of
balancing selection in the human genome. It directly quantifies the two main
signatures of balancing selection regimes acting over long time scales: an
excess of alleles segregating at intermediate frequencies and an excess of
polymorphic sites relative to expectations under a null model.

About one third of the genes we detect with this new statistic have prior
evidence of balancing selection, based on methods and data quite different
from ours. However, we also describe more than 150 new candidate genes, as
well as candidate non-coding regions and the properties of these regions.

Our method has greater power than others described in the literature, is
extremely simple to implement and interpret, and runs quickly. Combined with
dedicated quality control of the data used and verification of the candidate
regions obtained, we believe we have provided an extremely reliable map of the
extent of the signatures of balancing selection in the human genome. With this
work we contribute to the (not very extensive) literature on balancing
selection in humans, and we propose a method with high statistical power that,
in principle, can be used in similar approaches for other species.
This work was done in collaboration with the researcher Aida M. Andrés, of the
Max Planck Institute for Evolutionary Anthropology (MPI-EVA, Leipzig), who
conceived the idea of the new method. The work began in 2013, during my
sandwich doctorate (a visiting period abroad), and was co-supervised by Diogo
Meyer and A.M.A. I also had the collaboration of Cesare de Filippo
(postdoctoral researcher, MPI-EVA) and João C. Teixeira (PhD student,
MPI-EVA). J.C.T. carried out part of the enrichment analyses for the candidate
regions, and C.F. collaborated on the simulation steps for evaluating the
statistic and on the implementation of the scan itself.
The manuscript was written by me, together with A.M.A. and D.M., and all
authors contributed comments on the text. It will be submitted to the journal
PLoS Genetics.
All supplementary material cited in the text is provided at the end of the
chapter.

Uncovering targets of balancing selection in the human genome
Bárbara Domingues Bitarello1, Cesare de Filippo2, João Teixeira2, Diogo
Meyer1*, and Aida M. Andrés2*

*, co-supervised the study

1, Universidade de São Paulo, São Paulo, Brazil

2, Max Planck Institute for Evolutionary Anthropology, Leipzig, Germany

Introduction
Balancing selection refers to a class of selective mechanisms that maintain

advantageous genetic diversity in populations. Although perhaps not a

pervasive form of natural selection, balancing selection maintains genetic di-

versity with phenotypic relevance. For example, decades of research have established

HLA genes as a prime example of balancing selection (Meyer and Thomson, 2001; Spur-

gin and Richardson, 2010) with thousands of alleles segregating in humans, extensive

support for the functional relevance of polymorphism (e.g., Hedrick et al., 1991; Prug-

nolle et al., 2005) and various well-documented cases of association between selected

alleles and disease resistance and susceptibility (e.g. Raychaudhuri et al., 2012; Howell,

2014).

The catalog of well-understood non-HLA targets of balancing selection remains

small, but the genes identified are associated with phenotypes such as auto-immune diseases

(Raychaudhuri et al., 2012), malaria resistance (Malaria Genomic Epidemiology

Network, 2015), resistance to HIV infection (Biasin et al., 2007) and polycystic ovary

syndrome (Day et al., 2015). Thus, the relevance of balanced polymorphisms is not restricted

to their historical influence on individual fitness: they also shape, today, human

phenotypic diversity and susceptibility to disease.

Balancing selection encompasses several mechanisms (Charlesworth and Charlesworth,

2010; Clarke, 1962; Fijarczyk and Babik, 2015; reviewed in Andrés, 2011;

Key et al., 2014b). These include heterozygote advantage (or overdominance)

(Andrés, 2011; Key et al., 2014b; Fijarczyk and Babik, 2015), frequency-dependent se-

lection, including rare allele advantage (Clarke, 1962; Charlesworth and Charlesworth,

2010), selective pressures that fluctuate in time (Andrés, 2011; Bergland et al., 2014;

Fijarczyk and Babik, 2015) or in space in panmictic populations (Andrés, 2011; Charlesworth

et al., 1997; Charlesworth, 2006; Fijarczyk and Babik, 2015; Key et al., 2014b) and cases

of pleiotropy (Johnston et al., 2013). For some mechanisms, including overdominance,

pleiotropy, and some instances of selection that varies in space, a stable equilibrium

can be reached (Charlesworth and Charlesworth, 2010). For other mechanisms the

frequency of the selected allele can change in time with no theoretical equilibrium fre-

quency, although the frequency of the balanced polymorphism will be strongly affected

by the selective process.

Regardless of the mechanism, balancing selection can increase genetic variation

with respect to neutral expectations and has the potential to leave identifiable signa-

tures in genomic data. These include local site-frequency spectra with an excess of

alleles close to the frequency of the balanced allele and, when selection is old enough,

an excess of polymorphisms relative to divergence (reviewed in Key et al., 2014b). In

some cases, very ancient balancing selection can maintain trans-species polymorphisms

in sister species (Leffler et al., 2013; Teixeira et al., 2015), while recent balancing selection

or selection that is transient (e.g., that predicted in the model of Sellis et al., 2011) will

result in signatures that are probably difficult to distinguish from incomplete, recent

positive selection sweeps (Key et al., 2014b).

While balancing selection has been extensively explored from a theoretical perspective,

an empirical understanding of its prevalence in the human genome lags behind

our knowledge of positive selection. This stems from technical difficulties in detect-

ing balancing selection, as well as the perception that balancing selection may be rare

(Hedrick, 2012). In fact, few methods have been developed to identify its targets, and

only a handful of studies have sought to uncover targets of balancing selection genome-

wide (Andrés et al., 2009; Alonso et al., 2008; Asthana et al., 2005; Bubb et al., 2006;

Leffler et al., 2013; DeGiorgio et al., 2014; Rasmussen et al., 2014; Teixeira et al., 2015),

with different methods and datasets. Andrés et al. (2009) and DeGiorgio et al. (2014)

identified, with different approaches, genes (Andrés et al., 2009) or genomic regions

(DeGiorgio et al., 2014) with an excess of polymorphism and with site-frequency spec-

tra showing an excess of intermediate frequency alleles.

Leffler et al. (2013) and Teixeira et al. (2015) identified trans-species polymorphisms

between humans and other primates. Overall, these studies suggested that balancing

selection may act on a relatively small portion of the genome, although the limited ex-

tent of the data available (e.g. exome data in Andrés et al., 2009 and small sample size

in DeGiorgio et al., 2014), and the stringency of the criteria - e.g., balanced polymor-

phisms that pre-date human-chimpanzee divergence in Leffler et al. (2013) and Teixeira

et al. (2015) - may underlie the paucity of targets detected.

Here, we developed two new test statistics that summarize, directly and in a sim-

ple way, the degree to which allele frequencies in a genomic region deviate from the

frequencies expected under balancing selection. Through extensive simulations, we

showed that one of our methods outperforms existing methods for realistic demo-

graphic scenarios for human populations. We applied our statistic to the genome-wide

1000 Genomes Project (Abecasis et al., 2012) data in four human populations and used

both outlier and simulation-based cut-offs to identify both known and new genomic

regions that have evolved under long-term balancing selection.

Results

NCD Method

Background Owing to linkage, the signature of long-term balancing selection (LTBS)
on a site extends to the genetic neighborhood of the selected variant(s), so the patterns

of polymorphism and divergence in a genomic region can be used to infer whether

it evolved under balancing selection (Charlesworth, 2006; Andrés, 2011). LTBS leaves

two distinctive signatures in linked variation, when compared with neutral expecta-

tions. The first is an increase in the ratio of polymorphic to divergent sites. This occurs

because, by reducing the probability of fixation, balancing selection increases the local

TMRCA (Hudson and Kaplan, 1988). One commonly used test to detect this signature

is the HKA test (Hudson et al., 1987).

The second signature is an excess of alleles segregating at intermediate frequencies.

In humans, the folded site frequency spectrum (SFS), which is the distribution of the fre-

quency of the minor alleles (MAF) regardless of whether they are ancestral or derived,

is typically L-shaped, showing an excess of low-frequency alleles when compared to

expectations under neutrality and demographic equilibrium. This is a consequence of

recent population expansions (e.g. Coventry et al., 2010), with the abundance of rare al-

leles further increased by purifying selection and recent selective sweeps. On the other

hand, regions under LTBS are expected to show a markedly different SFS, with propor-

tionally more alleles at intermediate frequency (Fig 1A-B). Such a deviation in the SFS

is the signature identified by classical neutrality tests, such as Tajima’s D (Tajima, 1989)

and newer statistics such as MWU-high (Nielsen et al., 2009).
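
The folding step described above can be illustrated with a minimal Python sketch. The function names (`fold_frequency`, `folded_sfs`) and the toy allele counts are illustrative assumptions, not part of the original analysis pipeline.

```python
from collections import Counter


def fold_frequency(freq):
    """Return the minor allele frequency for an allele frequency in [0, 1]."""
    return min(freq, 1.0 - freq)


def folded_sfs(derived_counts, n_chromosomes):
    """Folded SFS: number of SNPs observed at each minor allele count.

    derived_counts: derived allele count at each site (hypothetical input)
    n_chromosomes: number of sampled chromosomes
    """
    sfs = Counter()
    for count in derived_counts:
        # Sites with count 0 or n_chromosomes are not segregating in the sample.
        if 0 < count < n_chromosomes:
            sfs[min(count, n_chromosomes - count)] += 1
    return dict(sorted(sfs.items()))


# Toy example with 10 sampled chromosomes.
toy_counts = [1, 1, 2, 9, 5, 8, 10, 0]
print(folded_sfs(toy_counts, 10))
```

In this toy sample, derived counts of 9 and 8 fold onto minor allele counts of 1 and 2, so ancestral and derived states no longer need to be distinguished.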

Figure 1. Schema for NCD statistics definition
(A) Schematic representation of distributions of derived allele frequencies (unfolded
SFS) expected for loci under neutrality (grey), or containing one site under
balancing selection with frequency equilibrium of 0.5 (blue), 0.4 (orange) and 0.3
(pink). DAF is the derived allele frequency, ranging from 0 to 1. (B) Schematic
representations of distributions of minor allele frequencies (folded SFS), ranging
from 0 to 0.5. Colors as in A. (C) Schematic representation of density plots
of the distribution of NCD expected under neutrality (grey) and under selection
(following the f eq values given in A).

The signatures of LTBS on the SFS will depend on the selective regime and the

intensity of selection on each genotype. For example, under overdominance the frequency

equilibrium depends on the relative fitness of each genotype (Charlesworth,

2006; Charlesworth and Charlesworth, 2010; Fijarczyk and Babik, 2015). Given selection

coefficients s and t against the AA and BB homozygotes, respectively, the deterministic

frequency equilibrium ( f eq ) is given by:

$$ f_{eq_A} = \frac{t}{s + t} \tag{1} $$

With symmetric overdominance (s = t), f eq = 0.5. With asymmetric overdominance

(t ≠ s), which might be more prevalent in natural systems (Hedrick, 2012), it

follows that f eq ≠ 0.5. A classic example of asymmetric overdominance is the case of

β-globin and sickle cell anemia, where in regions of endemic malaria the fitness of the

HbA homozygote for the β-globin locus is approximately 9 times higher than that of

the HbS homozygote, with the resulting equilibrium frequency of the HbS allele being

0.13 (Allison and Clyde, 1961). Under frequency-dependent selection, f eq will depend

on the frequency of the favored allele. Under fluctuating selection the frequency of the

selected allele will depend on the temporal and spatial scales of selection (Andrés, 2011;

Clarke, 1964; Pasvol et al., 1978) and although no stable, long-term frequency equilib-

rium may be reached, the balanced polymorphism may be actively maintained (as long

as the heterozygote fitness exceeds that of homozygotes in their harmonic and geomet-

ric means, for spatial and temporal models, respectively) (reviewed in Hedrick, 2006).

In these cases, f eq can be thought of as the frequency, at the time of sampling, of the

balanced polymorphism.
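
As a rough numerical check of the deterministic equilibrium under overdominance, the sketch below iterates the standard one-locus selection recursion. It assumes genotype fitnesses of 1 - s (AA), 1 (AB) and 1 - t (BB); under that assumption the frequency of allele A settles at t/(s + t) and that of allele B at s/(s + t). The function names and parameter values are illustrative only.

```python
def next_frequency(p, s, t):
    """One generation of deterministic selection at a bi-allelic locus.

    Assumed fitnesses: AA = 1 - s, AB = 1, BB = 1 - t; p is the frequency of A.
    """
    q = 1.0 - p
    mean_fitness = p * p * (1 - s) + 2 * p * q + q * q * (1 - t)
    # Frequency of A among gametes produced after selection.
    return (p * p * (1 - s) + p * q) / mean_fitness


def equilibrium_by_iteration(p0, s, t, generations=10_000):
    """Iterate the recursion from a starting frequency p0."""
    p = p0
    for _ in range(generations):
        p = next_frequency(p, s, t)
    return p


# Symmetric overdominance converges to 0.5; asymmetric does not.
print(equilibrium_by_iteration(0.01, s=0.1, t=0.1))
print(equilibrium_by_iteration(0.01, s=0.1, t=0.3))
```

The starting frequency does not matter for the end point as long as both alleles are present, which is why the equilibrium is described as stable.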

Non-Central Deviation (NCD) In the tradition of neutrality tests that analyze di-
rectly the SFS (e.g. Nielsen et al., 2005; Nielsen et al., 2009; Williamson et al., 2007), we

propose two related test statistics that explore the abundance and frequency of poly-

morphisms in a given locus. Both tests measure a “Non-Central Deviation” (NCD),

which we define as the degree to which the local SFS deviates from a pre-specified al-

lele frequency (the target frequency, t f ). Under a model of balancing selection, t f can

be thought of as the deterministic frequency that would be attained given the selection

parameters, with the NCD statistic querying how far SNP frequencies are from it. We

propose two implementations for this statistic: NCD1 and NCD2. The NCD1 statistic

is based solely on the SFS, using information on allelic frequency, pi , of each site in a

locus:

$$ \mathrm{NCD1}_{t_f} = \sqrt{\frac{\sum_{i=1}^{n} (p_i - t_f)^2}{n}} \tag{2} $$

where i = 1,2,3,...,n is the i-th polymorphism, p_i is the MAF for the i-th polymorphism,

and t f is the target frequency with respect to which the deviations of the

observed allele frequencies are computed. Thus, NCD1 is a non-central standard deviation

that quantifies the dispersion of allelic frequencies from t f , rather than from the

mean of the distribution. Because the frequencies of alleles at bi-allelic loci are comple-

mentary, and under balancing selection there is no prior expectation on the ancestral

or the derived allele being maintained at higher frequency, we use the folded SFS (Fig

1). The minimum amount of data required for calculating NCD1 is one polymorphism,

and for simplicity we consider only bi-allelic SNPs.

The NCD2 statistic is an extension of NCD1 that includes information not only on

the frequency of polymorphisms, but also on the number of fixed differences (FDs):

$$ \mathrm{NCD2}_{t_f} = \sqrt{\frac{n_{fd}\,(0 - t_f)^2 + \sum_{i=1}^{n} (p_i - t_f)^2}{n_{fd} + n}} \tag{3} $$

where n_fd is the number of FDs in the locus. In NCD2, all informative sites (IS

= SNPs + FDs) are taken into account. FDs can be considered informative sites with a

minor allele frequency (MAF) of 0, and as such they contribute to deviation from t f :

the greater the number of fixed differences, the larger the NCD2 value and hence the

weaker the support for LTBS. The minimal data required for calculating NCD2 is one

informative site, and for simplicity only bi-allelic SNPs and single nucleotide FDs

are considered.

From equations 2 and 3 it follows that the maximum value for NCD2t f for a given

t f is the target frequency itself (i.e., no SNPs and one or more FDs in the locus, as in S1

Fig) and for NCD1t f the maximum value approaches - but never reaches - t f when all

SNPs are singletons. The minimum value for both NCD1t f and NCD2t f is 0, when all

SNPs segregate at t f and, in the case of NCD2t f , there are no FDs (S2 Fig). Thus, low

NCD1 and NCD2 values reflect a low deviation of the SFS from the pre-defined target

frequency, which is expected in windows containing sites under LTBS (Fig 1C).
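
The two statistics and their bounds can be illustrated with a short Python sketch of Equations 2 and 3. The function names `ncd1` and `ncd2` and the toy inputs are illustrative assumptions; the code takes a list of minor allele frequencies for the SNPs in a window, the number of fixed differences, and the target frequency (written `tf` in the code).

```python
import math


def ncd1(mafs, tf):
    """NCD1: root mean squared deviation of SNP MAFs from the target frequency."""
    if not mafs:
        raise ValueError("NCD1 needs at least one SNP")
    return math.sqrt(sum((p - tf) ** 2 for p in mafs) / len(mafs))


def ncd2(mafs, n_fd, tf):
    """NCD2: as NCD1, but each fixed difference counts as a site with MAF 0."""
    n_sites = len(mafs) + n_fd
    if n_sites == 0:
        raise ValueError("NCD2 needs at least one informative site")
    total = sum((p - tf) ** 2 for p in mafs) + n_fd * (0.0 - tf) ** 2
    return math.sqrt(total / n_sites)


# All SNPs at the target frequency and no FDs: both statistics are 0.
print(ncd1([0.5, 0.5], 0.5), ncd2([0.5, 0.5], 0, 0.5))
# No SNPs and only FDs: NCD2 equals the target frequency itself.
print(ncd2([], 3, 0.5))
# Adding FDs to the same SNPs increases NCD2 relative to NCD1.
print(ncd1([0.4, 0.5], 0.5), ncd2([0.4, 0.5], 4, 0.5))
```

Because both functions take `tf` as an argument, evaluating the same window at several target frequencies only requires repeated calls.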

Power of the NCD statistics to detect LTBS

We evaluated the specificity and sensitivity of NCD1 and NCD2 by benchmarking their

performance using simulations. Specifically, we considered demographic scenarios in-

ferred for African, European, and Asian human populations (Fig 2), and simulated both

neutrality and balancing selection using a model of heterozygote advantage (see Meth-

ods). We then explored the influence of the parameters that may affect the power of

the NCD statistics: time since the onset of balancing selection (Tbs), the deterministic

frequency equilibrium defined by the selection coefficients ( f eq ), the demographic his-

tory of the sampled population, the chosen target frequency in NCD calculation (both

for cases in which f eq does and does not match t f ), the length of the genomic region

analyzed (L), and the implementation of NCD (NCD1 or NCD2). Box 1 summarizes

the nomenclature used throughout the text.

BOX 1. List of Abbreviations

LTBS, long-term balancing selection.

MAF, minor allele frequency.

SFS, site-frequency spectrum.

FD, fixed difference (between two species).

IS, informative sites (number of polymorphic sites in the ingroup species plus

the number of fixed differences between ingroup and outgroup species).

f eq , deterministic equilibrium frequency achieved by site(s) under balancing

selection as defined by the selection coefficients.

t f , target frequency, i.e., the frequency used in the NCD statistics as the value

to which queried allele frequencies are compared.

NCD statistics, non-central deviation statistics, with two implementations.

NCD1, NCD statistic that measures the average distance between poly-

morphic allele frequencies and a pre-determined frequency (t f ). NCD1t f is

NCD1 for that given t f .

NCD2, NCD statistic that measures the average distance between allelic

frequencies and a pre-determined frequency (t f ) considering both polymor-

phisms and fixed differences with an outgroup. NCD2t f is NCD2 for that

given t f .

NCDt f refers to the average of NCD1t f and NCD2t f .

For simplicity we present power values (always at false positive rate of 5%) aver-

aged across NCD implementations (NCDt f being the average of NCD1t f and NCD2t f ),

demographic models and sequence lengths. These averages are helpful in that they

reflect the general changes in power when changing individual parameters. Never-

theless, because they often include conditions for which power is low, the averages

underestimate the power that the test can reach under a given parameter. The full matrix

of power results for each condition is presented in S1 Table, and some key points

are discussed below.

Figure 2. Human demographic model and parameters used in simulations
For all simulations, the human demographic model is the one inferred in Gravel
et al. (2011), including migration rates, population split times, and effective
population sizes (Ne). Divergence with chimpanzees was added to this model.
g, generations; T, time in generations since different events: the split between
human and chimpanzee lineages (Tdiv), the population growth in Africa (Tgaf),
the out-of-Africa migration (Tooa), and the European-Asian split (Teuas); N refers
to Ne of different populations: the ancestral population (Nanc), the chimpanzee
population (Nc), the ancestral human population (Nh), the African population
after Tgaf population growth (Nafr), the Eurasian ancestral population (Nooa),
the European population (Neur) and the Asian population; r are the growth rates
of Asian and European populations. Tbs is the time (in millions of years) since
onset of balancing selection, and f eq the frequency equilibrium of the balanced
polymorphism.
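
One way to read "power at a false positive rate of 5%" is sketched below in Python: a threshold is taken from the lower tail of the statistic under neutral simulations, and power is the fraction of simulations with selection that fall below it. This is a minimal sketch under that assumption; the function name `power_at_fpr` and the toy values are hypothetical, and the exact procedure used in the study is described in its Methods.

```python
def power_at_fpr(neutral_values, selected_values, fpr=0.05):
    """Fraction of selected simulations below the neutral lower-tail threshold.

    Low NCD values indicate balancing selection, so the rejection region is
    the lower tail of the neutral distribution.
    """
    ordered = sorted(neutral_values)
    k = int(fpr * len(ordered))
    # At most k neutral values lie strictly below ordered[k], so the realized
    # false positive rate does not exceed fpr.
    threshold = ordered[k]
    hits = sum(1 for value in selected_values if value < threshold)
    return hits / len(selected_values)


# Toy example: 100 neutral values and 5 values simulated under selection.
neutral = list(range(100))
selected = [1, 2, 4, 5, 50]
print(power_at_fpr(neutral, selected))
```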

Time since the onset of balancing selection and sequence length The signa-
tures of LTBS are expected to be stronger the longer the time since the onset of balancing

selection, because there will have been more time for linked mutations to accumulate

and reach intermediate frequencies. We simulated sequences with a balanced polymor-

phism with Tbs of 1, 3, and 5 million years (myr) (Fig 2). For simplicity, in this section

we consider only cases where t f = f eq although this condition is relaxed in later sec-

tions.

For both NCD10.5 and NCD20.5 ( f eq = 0.5), power to detect balancing selection with

Tbs = 1 myr is low across all scenarios and for all t f (always lower than 0.43, S1 Table).

Nevertheless, power to identify older balanced polymorphisms is high, for all t f , for

both 3 myr (e.g. average NCD0.5 is 0.70) and 5 myr (average NCD0.5 0.77) (S3-S8 Figs,

S1 Table). We thus focus exclusively on long-term balancing selection: 3 and 5 myr.

Tbs also affects the length of the region bearing the signature of balancing selection,

as in the absence of epistasis the long-term effects of recombination result in narrower

signatures with older Tbs (Leffler et al., 2013; Teixeira et al., 2015). This is indeed the

case for all t f (S3-S8 Figs, S1 Table). For example, NCD0.5 at Tbs = 5 myr resulted in

average power of 0.78, 0.76, and 0.67 for 3, 6, and 12 Kb, respectively (S3-S8 Figs, S1

Table), and a similar pattern emerges for NCD0.4 and NCD0.3 . For NCD1 the power

increment for shorter regions was less pronounced than for NCD2 (S1 Table), perhaps

due to the lower number of informative sites. Again, a similar picture emerges for

NCD0.4 – with 21% reduction in power for 12 Kb compared to 3 Kb – and NCD0.3 –

with 25% reduction in power for 12 Kb compared to 3 Kb (S1 Table; S3-S8 Figs).

In summary, the power of the NCD statistics grows with the age of the balanced

polymorphism and the narrowness of the analyzed window. These analyses suggest

that the NCD statistics are well powered to detect balancing selection that started at

least 3 myr ago in windows of 3 Kb centered on the selected site (S1 Table) and we

therefore do not include 1 myr results in the remainder of the discussion.

Demography Power is similar for samples simulated under the African (average
NCD0.5 of 0.86) and European (average NCD0.5 of 0.87) demographic scenarios for both

NCD10.5 and NCD20.5 and drastically lower for a population under the demographic

model for Asians (average NCD0.5 of 0.48; S3-S8 Figs, and S1 Table). Similar trends

are observed for NCD0.4 (75% average reduction in Asia when compared with Africa)

and NCD0.3 (92% reduction). One explanation for the lower power under the Asian

demographic model is the stronger effect of random genetic drift in this population due

to its lower Ne (Gutenkunst et al., 2009; Gravel et al., 2011), which affects both the SFS of

neutral loci (putatively increasing the proportion of alleles at intermediate frequency)

and those under balancing selection (reducing the efficacy of selection and putatively

increasing the dispersion from the balanced frequency equilibrium). We thus focused

our subsequent analyses on the African and European populations, for which power

was high and comparable (thus allowing fair comparisons between these geographic

regions).

Simulated and target frequencies So far we discussed only cases where t f = f eq ,


which is expected to favor the performance of the method. In this case the NCD statis-

tics have high power: on average 0.86, 0.79, and 0.70 for f eq = 0.5, 0.4, and 0.3, respec-

tively (S1 Table). Selection with f eq =0.2 resulted in low power across all parameters and

t f values (S3-S8 Figs), so we do not further explore this target frequency.

In practice, though, one does not have a priori knowledge about the equilibrium

frequency of balanced polymorphisms. We thus explored the power of NCD when the

simulated equilibrium and the target frequencies differ. The power to detect LTBS is

very high for NCD20.5 and NCD10.5 , even when selection is simulated with other f eq

values (average NCD0.5 of 0.79, Table 1, S3-S8 Figs, and S1 Table) and similarly high for

NCD0.4 (average 0.78) and NCD0.3 (average 0.70) (S1 Table).

Table 1. Power for simulations under the African and European demographic
models
Tbs, time since onset of balancing selection (in millions of years); f eq , frequency
equilibrium in the simulations; t f , target frequency. Power values are for a false
positive rate of 0.05, for simulations of the African and European demographic
scenarios, L = 3 Kb.

             Africa                                   Europe
             NCD2 (t f)         NCD1 (t f)            NCD2 (t f)         NCD1 (t f)
Tbs   f eq   0.5   0.4   0.3    0.5   0.4   0.3       0.5   0.4   0.3    0.5   0.4   0.3
5     0.5    0.96  0.94  0.84   0.93  0.91  0.39      0.97  0.95  0.83   0.92  0.85  0.20
5     0.4    0.94  0.94  0.89   0.89  0.89  0.67      0.95  0.94  0.91   0.85  0.82  0.59
5     0.3    0.90  0.91  0.93   0.72  0.80  0.84      0.84  0.85  0.89   0.47  0.57  0.74
3     0.5    0.91  0.88  0.68   0.86  0.80  0.24      0.93  0.89  0.68   0.81  0.69  0.14
3     0.4    0.88  0.86  0.76   0.78  0.78  0.56      0.89  0.87  0.79   0.74  0.71  0.46
3     0.3    0.75  0.77  0.81   0.56  0.64  0.71      0.73  0.76  0.79   0.39  0.48  0.63

Conversely, power to detect LTBS with f eq = 0.4 is similar with NCD0.5 or NCD0.4

(Table 1 and S1 Table), but for f eq = 0.3 power is 10% higher for NCD0.3 than for

NCD0.5 (Table 1 and S1 Table). Therefore, NCD statistics can be well powered both

when the frequency of the balanced polymorphism is the same as the target frequency,

and when it is not (as expected given correlations among these statistics; S9 Fig). Nevertheless,

the closer t f is to f eq , the higher the power to identify targets of LTBS (Table

1). Thus, information is gained by calculating NCD with different target frequencies.

NCD implementations The power for NCD2 is greater than for NCD1, for all t f :
f eq = 0.5 (average power of 0.94 for NCD20.5 and 0.88 for NCD10.5 ), f eq = 0.4 (0.93 for

NCD20.4 and 0.80 for NCD10.4 ) and for f eq = 0.3 (0.85 for NCD20.3 and 0.73 for NCD10.3 )

(Table 1, Fig 3, S1 Table). The gain in power that occurs when using information on FDs

was also explored by jointly considering NCD1 with HKA (see below).

NCD statistics compared to existing methods We compared the power of NCD20.5

to other statistics commonly used to detect balancing selection. We focused on Tajima’s

D (TajD) and HKA (Hudson et al., 1987; Tajima, 1989), a pair of composite likelihood-based

measures recently developed by DeGiorgio et al. (2014) termed T1 and T2 (T1

only looks at the SFS, T2 includes information on FDs), and NCD1 and NCD2. We

additionally explored the power of a composite statistic, where the p-value was jointly

computed as a function of NCD1 and HKA statistics (NCD1+HKA), with the goal of

quantifying the contribution of FDs to NCD power (see Methods). For simplicity we

considered Tbs = 5 myr and L = 3 Kb for all comparisons. The only exceptions are T1 and

T2, for which a larger window size (100 informative sites) was used, following DeGiorgio

et al. (2014), to compare the methods using their optimal window size.

Figure 3. ROC curves for comparison between NCD20.5 and other tests
Power to detect LTBS for simulations where the balanced polymorphism was modeled to achieve frequency
equilibrium ( f eq ) of A) 0.3, B) 0.4, and C) 0.5. Plotted values are for African demography, Tbs = 5 myr, L = 3 Kb, except for
T1 and T2 (DeGiorgio et al., 2014), which were evaluated based on 100 SNPs in 15 Kb simulated windows following
the original publication (see Methods). Target frequency for NCD1 and NCD2 matches the simulated f eq . Similar
results are observed for European demography (Fig S7).

When f eq = 0.5, NCD20.5 has the highest power: 0.96 (0.94 for T2, 0.93 for TajD, 0.91

for NCD10.5 +HKA, 0.78 for HKA, and 0.5 for T1; Fig 3). The gain in power provided by

NCD20.5 is much higher when f eq departs from 0.5, where NCD2 clearly outperforms

all other tests if t f = f eq (Fig 3). For f eq = 0.4, the power of NCD20.4 is 0.94 (0.9 for

TajD, T2, and NCD10.5 +HKA; 0.76 for HKA, and 0.58 for T1; Fig 3 and Table 1) and

for f eq = 0.3 NCD20.3 power is 0.91 (0.89 for T2, 0.85 for NCD10.5 +HKA, 0.75 for TajD,

0.73 for HKA, 0.59 for T1; Fig 3). These patterns are consistent in both African and

European simulations (Fig 3, Table 1, S10 Fig). Thus, NCD2 has greater or comparable

power to detect LTBS than TajD, HKA, T1 and T2, and a combined test of NCD1+HKA

for African and European scenarios (Fig 3, Table 1, S7 Fig). Notably, as the simulated

frequency equilibrium moves away from 0.5, its advantage over TajD increases (Fig 3).

Recommendations based on power analyses. Overall, NCD1 and NCD2 per-


form very well in regions of 3 Kb (Table 1, Fig 3). In fact, NCD2 outperforms all other

methods tested (Table 1, Fig 3) and it reaches very high power when t f = f eq (higher

than 0.9 for 5 myr alleles and higher than 0.79 for 3 myr alleles). While the f eq of a putatively

balanced allele is unknown, the simplicity of the statistic makes it trivial to run

it for several t f values. Importantly, power was very similar under the African and

European models (Table 1, Fig 3, S10 Fig). Because NCD2 outperforms NCD1 we recommend

using NCD2 in humans, although NCD1 is a good choice when outgroup

data is lacking.

Identifying signatures of LTBS in the human genome

We aimed to identify regions of the genome under LTBS. Based on the power analyses,

we used NCD20.5 , NCD20.4 and NCD20.3 , which are well powered to detect LTBS and

do not provide fully overlapping sets of candidate windows. We calculated these statistics

for 3 Kb windows (1.5 Kb step size) and tested for significance using two complementary

approaches: one based on neutral expectations, and one based on the empirical

data. We analyzed genome-wide data from two African (YRI: Yoruba in Ibadan,

Nigeria; LWK: Luhya in Webuye, Kenya) and two European populations (GBR: British

from England and Scotland; TSI: Toscani in Italy) (Abecasis et al., 2012). We filtered for

mappability, segmental duplications, and orthology with the outgroup genome (chim-

panzee, see Methods and S13 Fig).
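
The windowing scheme (3 Kb windows, 1.5 Kb step) can be sketched in Python as follows. This is only an illustration under simple assumptions: coordinates are 0-based and half-open, positions are pre-sorted, and the function names `sliding_windows` and `sites_in_window` are hypothetical rather than taken from the study's code.

```python
from bisect import bisect_left


def sliding_windows(chrom_length, size=3000, step=1500):
    """Yield (start, end) half-open coordinates of overlapping windows."""
    start = 0
    while start + size <= chrom_length:
        yield start, start + size
        start += step


def sites_in_window(sorted_positions, start, end):
    """Positions falling in [start, end), found by binary search."""
    lo = bisect_left(sorted_positions, start)
    hi = bisect_left(sorted_positions, end)
    return sorted_positions[lo:hi]


# Toy example: a 6 Kb region yields three overlapping windows.
positions = [100, 1500, 2999, 3000, 4400]
for start, end in sliding_windows(6000):
    print(start, end, sites_in_window(positions, start, end))
```

With a step of half the window size, every site (away from the region edges) is covered by two windows, so a signal centered near a window boundary is still captured by its neighbor.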

In addition, because windows with a low number of IS have high NCD2 variance

due to noisy SFS (S18 Fig), a pattern also observed in neutral simulations (S11 Fig), we

excluded windows with fewer than 19 and 15 IS in African and European populations, respectively.

This filter removed only 4% of the windows while keeping a set of windows

for which NCD2 values remain quite stable regardless of the number of IS (S11-S18

Figs). After all filters, the genomic coordinates defining the windows were identical in

all populations, allowing comparisons among them. We analyzed 1,631,372 windows

throughout the genome (Table 2, S13 Fig). These windows overlapped 18,308 protein-

coding genes (95% of all human autosomal genes). For each window we calculated a

p-value that reflects the quantile of its NCD2 value, when compared with the NCD2

distribution of 10,000 neutral simulations under the inferred demographic history of

each population, and conditioned on the same number of IS (to account for the higher

variance in sets of windows with a low number of IS, Methods).
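
The simulation-based p-value described above can be sketched as an empirical quantile. This is only an illustration under stated assumptions: the dictionary that groups neutral NCD2 values by number of informative sites is a hypothetical data layout, and ties are counted as "lower or equal".

```python
from bisect import bisect_right

def empirical_p_value(observed_ncd, neutral_ncds_sorted):
    """Fraction of neutral simulations whose NCD2 is <= the observed value.

    neutral_ncds_sorted must be sorted in ascending order.
    """
    n_lower_or_equal = bisect_right(neutral_ncds_sorted, observed_ncd)
    return n_lower_or_equal / len(neutral_ncds_sorted)

def window_p_value(observed_ncd, n_informative_sites, neutral_by_is):
    """neutral_by_is: dict mapping a number of IS to the sorted NCD2 values of
    the neutral simulations with that many IS (hypothetical layout)."""
    return empirical_p_value(observed_ncd, neutral_by_is[n_informative_sites])
```

With 10,000 simulations per bin, a window whose NCD2 is lower than every simulated value gets an empirical p-value of 0, which corresponds to the p < 0.0001 criterion used in the text.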

Over all populations, between 4,826 and 5,910 (0.30-0.36%) of the genomic windows
have a lower NCD2(0.5) value than any of the 10,000 neutral simulations (p-value <
0.0001, Table 2). The proportions were very similar for NCD2(0.4) and NCD2(0.3):
0.34-0.39% and 0.33-0.38%, respectively (Table 2). We refer to these simulation-based
sets, whose patterns we cannot explain under neutrality, as the significant windows.

Due to our criterion for defining significance, all significant windows had an identi-

cal p-value (p < 0.0001). To quantify the degree of departure from neutral expectations,

NCD2 was compared to the mean of NCD2 values for the 10,000 simulations with the

same number of IS. We defined, for each genomic window, Ztf (Equation 4) as the number
of standard deviations that its NCD2 value lies from the neutral expectation, conditioned
on the number of informative sites of that window. To identify the most extreme
signatures of LTBS, we selected the windows with the 0.05% most extreme Ztf values for
each population and t f value (816 windows in each case), which we refer to as the
outlier windows (Table 2). The empirical outlier windows, which represent a smaller and
more conservative set, are almost entirely a subset of the significant windows
(Methods). Below, we discuss properties of the union of all significant (or outlier)
windows taken over all of the target frequencies under which they reached significance
(“U” set, Table 2).
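
The standardization and the empirical cutoff described above can be sketched as follows. This is a reading of the verbal description and not a transcription of Equation 4. The use of the sample standard deviation, the rounding of the cutoff, and the function names are assumptions, and "most extreme" is taken to mean the lowest Z values, since low NCD2 is the signature of interest.

```python
from statistics import mean, stdev

def z_tf(observed_ncd, neutral_ncds):
    """Standardized distance of a window's NCD2 from the neutral simulations
    that have the same number of informative sites."""
    return (observed_ncd - mean(neutral_ncds)) / stdev(neutral_ncds)

def outlier_windows(z_by_window, fraction=0.0005):
    """Return the window ids with the lowest Z values (0.05% by default)."""
    n_outliers = round(len(z_by_window) * fraction)
    ranked = sorted(z_by_window, key=z_by_window.get)
    return ranked[:n_outliers]

# With 1,631,372 analyzed windows, a 0.05% cutoff corresponds to 816 windows.
print(round(1_631_372 * 0.0005))
```

The last line reproduces the 816 outlier windows per population and target frequency reported in Table 2.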

Reliability of significant and outlier windows

The significant windows are extremely rich both in polymorphic sites (Fig 4) and num-

ber of intermediate-frequency alleles (Fig 5), with the shape of the SFS depending on

the t f at which they reach significance. These patterns are not unexpected, since they

were used to identify these windows. Nevertheless, they show that neither SNP den-

sity nor the SFS dominate the selection process, as significant windows are unusual in

both aspects. Also, the striking differences with respect to the background empirical
distribution, combined with the fact that no neutral simulation had a lower NCD2 value
than any significant window, rule out relaxation of selective constraint as a plausible
explanation (Andrés et al., 2009).

Figure 4. Polymorphism to divergence


A) LWK population. B) GBR population. P/(FD+1) measures the proportion
of polymorphisms with respect to all informative sites. Background (grey) are
all non-significant windows. Significant windows are the union of significant
windows for all t f values.

To avoid technical artifacts among significant windows we carefully considered

mapping errors due to genomic duplicates (e.g. we removed positions with poor map-

pability, and those that fall within tandem repeats and segmental duplications; S13 Fig

and Methods). Also, we found that the significant windows have extremely similar

coverage to the rest of the genome (S14 Fig), showing that they are not enriched in

unannotated, polymorphic duplications.

We also examined whether evidence of selection could be driven by two biological

mechanisms other than balancing selection: introgression and gene conversion. The

outlier windows are significantly depleted of SNPs annotated as introgressed from Ne-

anderthals (S17 Fig, S1 Text), and significant windows do not show a different propor-

tion of introgressed SNPs from controls, showing that introgression is not a confound-

ing mechanism leading to significant or outlier regions (S7 Fig, S1 text). Finally, the

genes overlapped by significant windows are not predicted to be particularly affected


by non-homologous gene conversion with neighboring paralogs, with the exception of

olfactory receptors (S16 and S19 Figs, S1 Text). Thus, the significant and outlier win-

dows represent a catalog of strong candidate targets of balancing selection in human

populations that are not likely to be driven by introgression or gene conversion (S16,

S17, S19 Figs, S1 Text).

Non-random distribution across the genome

Significant and outlier windows were not randomly distributed across the genome.

Chromosome 6 is the most enriched for signatures of LTBS, contributing 11.2% of sig-

nificant windows genome-wide (24.5% of outlier windows) while having only 6.4% of

analyzed windows (S12 Fig). This is due to the presence of the MHC region, rich in

genes with well-supported evidence for balancing selection. In fact, several HLA genes

known to be targets of LTBS appear among our outlier windows, i.e., the strongest
candidates. For the outlier windows, 10 HLA genes are found in all four populations,
most of which have prior evidence for balancing selection (Table 3): HLA-B, HLA-C,

HLA-DPA1, HLA-DPB1, HLA-DQA1, HLA-DQB1, HLA-DQB2, HLA-DRB1, HLA-DRB5,

HLA-G (DeGiorgio et al., 2014; Liu et al., 2006; Meyer et al., 2006; Sanchez-Mazas, 2007;

Solberg et al., 2008; Tan et al., 2005).

The biological pathways influenced by LTBS

Although the union of significant windows considering all t f values spans on average
only 0.51% of the genome (Table 2), 37.8% of those windows overlap protein-coding
genes. To gain insight into the biological pathways influenced by balancing selection, we

focused on protein-coding genes that contain at least one significant or outlier window

(“U” set, Table 2), and investigated the functional categories they belong to.

We found enrichment for 30 GO (Gene Ontology) categories for the significant

genes (S2 Table), 22 of which are shared by at least two populations. Three significant


categories are driven by olfactory receptor genes (OR), which we could not rule out

as artifacts (S1 Text), although they do not appear in the more conservative outlier set

of genes (S3 Table). Among the remaining categories, at least half of them are directly

related to immune response (e.g. “type I interferon signaling pathway”, “MHC class

I protein complex”, “positive regulation of T cell mediated cytotoxicity”) and 11 are

involved in antigen presentation by MHC molecules (e.g. “MHC class I protein com-

plex”, “MHC class II protein complex”, “peptide antigen binding”, among others). For

the outlier genes, 27 enriched categories were found, at least 18 of which are immune-

related, and 10 of which are directly related to antigen presentation by MHC molecules

(S3 Table).

When classical HLA genes are removed from the sets, no categories remain enriched

for the outlier genes (S3 Table; but note that this resulted in a small set of 162-192 genes

per population, with lower power to detect GO category enrichment), as in DeGiorgio

et al. (2014). For the larger set of significant genes, the immune-related category
“peptide antigen binding” remains significant in LWK, driven by TAP1, TAP2, and HLA-G,
all previously reported candidate targets of balancing selection (Cagliani et al., 2011;
Tan et al., 2005). These results show the strong influence of the classical HLA genes
on signatures of LTBS. However, “extracellular region” and “keratin filament” are
enriched in the set of significant genes, in several populations, even after the removal
of HLA genes, in agreement with previous findings indicating that balancing selection
targets genes related to extracellular and cell-surface proteins (Key et al., 2014b).

Nevertheless, for the significant genes only about half of the immune-related en-

riched categories are directly linked to peptide presentation by MHC molecules. Other

categories (e.g. “type I interferon signaling pathway”, “cytokine mediated pathway”,

“T cell co-stimulation”, “immune response”), even if they cease to be significantly en-

riched after the removal of HLA genes (S2 Table), are not strictly composed of HLA

genes.


To gain more insight into the importance of non-HLA immune-related genes in the outlier
set, we verified that 62 of the outlier genes shared by at least two populations
(Table 3) belong to immune-related GO categories, although only 10 of them are HLA
genes (S8 Table). This shows that the enriched categories are not only HLA-related,
indicating that immune response, in a broader sense, is enriched for signatures of LTBS
(reviewed in Key et al., 2014b).

Regarding tissue of expression, among the genes overlapped by significant win-

dows, we found enrichment for genes preferentially expressed in “adrenal” (TSI, p-

value=0.003, S5 Table) and “lung” (GBR, p-value=0.004, S5 Table, S1 Text).

Overlap of significant windows across populations

Most windows were found to be significant (S20A Fig) or outliers (S20B Fig) in multiple

populations. On average 81% of significant windows in any one population are shared

between any two populations, and 69% of the windows are shared between two pop-

ulations within the same continent (66% between African and 71% between European

populations, see S20A Fig). For the more restrictive set of outlier windows, the sharing
increased to 87% between any two populations, and 78% within continent (75% of
African windows were shared, and 80% of European windows; S20B Fig). There was also similar

sharing considering t f values separately (S21-S22 Figs).

The putative function of balanced SNPs

Functional protein-coding sites. To further investigate the differences between outlier
and non-outlier (background) windows, we examined the degree to which they

overlap exons. On average, 31.2% of the windows that overlap protein-coding genes

overlap their exons, very similar to the 30.8% for the background distribution (S15

Fig). In fact, significant windows contain a higher (but non-significant) proportion of

protein-coding SNPs than background windows (Fig 6A,C).


When these sites are divided as synonymous (putatively neutral) and non-synonymous,

significant windows are enriched for non-synonymous SNPs when compared with con-

trols sampled from the background distribution (Fig 6A,C). This is also true when only

intermediate frequency alleles are considered (MAF>0.20, Fig 6B,D). Taken together,

our results indicate that balancing selection is associated with regions of increased
non-synonymous polymorphism.

Regulatory function. It has been suggested that balancing selection may have a
particularly important role in maintaining genetic diversity that affects gene expression

(Leffler et al., 2013; Savova et al., 2016). Because the identification of significant and

outlier windows is independent from functional annotation, we are in a position to

test the hypothesis that LTBS has preferentially targeted regulatory regions. Significant
windows were enriched in SNPs with regulatory functions, annotated as eQTLs
(RegulomeDB category 1; Fig 7A, p < 0.001).

Nevertheless, power to annotate a SNP as an eQTL increases with its frequency, so

allele frequency must be accounted for. When only SNPs with intermediate frequency

alleles are considered, significant windows no longer show a statistical enrichment in

eQTLs (Fig 7D); rather, in most populations there is a significant depletion of eQTLs

(Fig 7D). Accordingly, we observed a depletion of SNPs overlapping putatively reg-

ulatory regions when considering a more inclusive category that depends exclusively

on genomic context (rather than on eQTL annotation, RegulomeDB categories 1 and 2,

see Methods; Fig 7B,E). Regardless of allele frequency, SNPs in significant windows are
enriched in sites with no evidence of a role in gene regulation (RegulomeDB category
7, Fig 7C,F). Although the annotation of each of these RegulomeDB categories is not

perfect, these results suggest that balancing selection does not preferentially target, in

human populations, sites with a role in gene expression regulation.
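
The comparison of significant windows against sets of windows resampled from the background (1,000 samplings, as stated in the captions of Figs 6 and 7) can be sketched as below. This is an assumption-laden illustration: the per-window input layout, the function names, and the plain random sampling without any matching are hypothetical, and the actual control-matching procedure is the one described in Methods.

```python
import random

def proportion(windows):
    """windows: list of (n_snps_in_category, n_snps_total) pairs, one per window.

    Assumes the windows jointly contain at least one SNP.
    """
    in_category = sum(c for c, _ in windows)
    total = sum(t for _, t in windows)
    return in_category / total

def enrichment_p_value(candidate, background, n_resamples=1000, seed=1):
    """One-sided empirical p-value: share of background resamples whose
    proportion is at least as large as the one observed in the candidate set."""
    rng = random.Random(seed)
    observed = proportion(candidate)
    hits = 0
    for _ in range(n_resamples):
        control = rng.sample(background, len(candidate))
        if proportion(control) >= observed:
            hits += 1
    return hits / n_resamples
```

A depletion test would use `<=` in place of `>=`. Restricting the SNP counts to intermediate-frequency alleles before calling the function would mirror the frequency-controlled comparison discussed above.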

Finally, in agreement with Savova et al. (2016) we find a modest yet significant en-

richment for genes with mono-allelic expression (MAE) among the outlier genes shared


by at least two populations (Table 3): 26% of them are MAE genes, while only 22% of

the non-outliers are MAE (p=0.03, Fisher Exact Test, one-sided).
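
For readers unfamiliar with the one-sided Fisher exact test mentioned above, the following sketch computes it from a 2x2 table using the hypergeometric distribution. The gene counts behind the reported percentages are not given in this section, so the test below is shown with a toy table only; the function name and argument order are illustrative choices.

```python
from math import comb

def fisher_one_sided(a, b, c, d):
    """Upper-tail Fisher exact p-value for the 2x2 table [[a, b], [c, d]].

    a: outlier and MAE          b: outlier and not MAE
    c: non-outlier and MAE      d: non-outlier and not MAE
    Returns P(X >= a) under the hypergeometric null with fixed margins.
    """
    row1 = a + b          # number of outlier genes
    col1 = a + c          # number of MAE genes
    n = a + b + c + d     # all genes
    upper = min(row1, col1)
    tail = sum(comb(col1, k) * comb(n - col1, row1 - k)
               for k in range(a, upper + 1))
    return tail / comb(n, row1)

# Toy table (not data from this study)
p = fisher_one_sided(3, 1, 1, 3)
```

Because the sum runs over tables at least as extreme in the direction of enrichment, the test matches the one-sided question asked in the text (are MAE genes over-represented among outliers?).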

The top candidate genes

The signatures of long-term balancing selection may not be shared between popula-

tions due to changes in selective pressure, which may be important during fast, local

adaptation (Filippo et al., 2016). Still, loci with signatures across human populations are

more likely to represent old, stable events of balancing selection in human populations.

We considered as “African” those outlier genes resulting from the union of outlier win-

dows for all t f values (Table 2) that are shared between YRI and LWK (but neither or

only one of the European populations), and as “European” those that are shared between

GBR and TSI (but neither or only one of the African populations). Those shared by all

four populations were considered as “African and European” (Table 3). Importantly,

these designations do not imply that the genes referred to as “African” or “European”

in Table 3 are putative targets of LTBS for only one of the continents, as there are power

differences between Africa and Europe, particularly for t f = 0.3 (Fig 3, Table 1, S10 Fig),

but rather serve the purpose of quantifying the extent of sharing across populations.
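
The sharing categories defined above amount to a simple rule on the set of populations in which a gene is an outlier. A minimal sketch, with hypothetical names:

```python
AFRICAN = {"YRI", "LWK"}
EUROPEAN = {"GBR", "TSI"}

def sharing_class(populations):
    """populations: populations in which a gene overlaps an outlier window."""
    pops = set(populations)
    in_africa = AFRICAN <= pops    # outlier in both African populations
    in_europe = EUROPEAN <= pops   # outlier in both European populations
    if in_africa and in_europe:
        return "African and European"
    if in_africa:
        return "African"           # neither or only one European population
    if in_europe:
        return "European"          # neither or only one African population
    return None                    # not shared within either continent
```

Genes for which the function returns `None` (for example, outliers in one African and one European population only) fall outside the three categories of Table 3.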

The combined set of “African” (69 genes), “European” (71 genes) and “African and

European” (75 genes) contains 213 genes (1.1% of all queried genes) (Table 3). When
applying the same criteria for the significant windows, the set contains 1,470 genes (8%

of all queried genes, see S2 Text and S8 Table). We focus the following discussion on the

set of 213 outlier genes, since they constitute the most restricted set. Of these, 61 (29%)

have evidence of balancing selection in at least one previous genome-wide analysis

(Andrés et al., 2009; DeGiorgio et al., 2014; Leffler et al., 2013), and others were detected

in individual gene studies (Table 3 and Discussion). Overall, about 70% of the outlier

genes reported here have not been reported as having signatures of LTBS in previous

studies.


Obviously, a given window can be significant for more than one t f value. Because

our simulations suggest that the t f is informative about the frequency of the balanced

allele, we use the lowest Ztf to assign a t f value to each window (for a given popula-

tion), providing information on the nature of the SFS skew (S7 Table). For 50% of the

outlier windows, the assigned t f is 0.3, and 36% have 0.5 as assigned t f ; only 14%

have assigned t f = 0.4 (S6 Table).
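
The assignment rule described above (the t f with the lowest Ztf for a given window and population) can be written in a few lines. The input layout and function names are hypothetical.

```python
from collections import Counter

def assign_tf(z_by_tf):
    """z_by_tf: dict mapping a target frequency to the window's Z value."""
    return min(z_by_tf, key=z_by_tf.get)

def tally_assigned_tf(windows):
    """Count how many windows are assigned to each target frequency."""
    return Counter(assign_tf(z) for z in windows)
```

Dividing the tallies by the number of windows gives proportions of the kind reported in the text for the assigned t f values.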

Based on the p-values of the most extreme window for each of the outlier genes, we

were able to rank them. The top 10 genes are highlighted in Table 3. Among the top

ten candidates, two (HLA-DRB5 and HLA-DQA1) are related to adaptive immunity,
two (PCDH15 and NDUFA10) are related to sensory perception of mechanical stimulus,
including sound, and two (PROKR2 and CPE) are related to the neuropeptide signaling
pathway. Six of the top 10 genes (PROKR2, HLA-DQA1, CPE, HLA-DRB5, LUZP2,
and MYO3A) have been previously described as having signatures of LTBS in humans
(Table 3). The four novel ones (B4GALNT2, C1orf101, NDUFA10, and
PCDH15) are examined in more detail in the Discussion.

Table 3. Outlier genes. All reported genes are overlapped by at least one
outlier window for at least one t f value (Table 2 and Methods). Genes are outliers for both
African populations (“African”), for both European populations (“European”),
or for all four populations (“African and European”). A version of this table with
p-values and assigned t f values is provided in S7 Table. When a gene has been
previously reported as having signatures of LTBS, the reference is provided:
[A], Andrés et al. (2009); [D], DeGiorgio et al. (2014); [L], Leffler et al. (2013); [S],
reported as being under balancing selection in Savova et al. (2016); [T], Tan et al.
(2005). * top 10 most highly ranked genes (for YRI).

African European African and European

ABO[SG][S] AC121757.1 APBB1IP [D]


ADAM29 ADAM12 B4GALNT2**
AIM1 ADRA1D BICD1
ALDH1L1 AL590867.1 CAMK1D
ALK ALG8[D] CDSN[D][A][S]
ARHGAP8 ATP10D CHST11
ATXN3 B3GNTL1 CPE[D]


BCAR3 BICC1[D] CTNNA3


C1orf54 C1orf222 DMBT1 [D]
C15orf48 CCDC169[D] EDARADD
C1orf101* CCDC169-SOHLH2[D] EGLN3
C22orf34 CDH5 ERO1LB
CCHCR1 CEP112[D] FAM101A
CELSR1 CNR2 FAM19A5
CLDN16 CNTNAP2 FCER2
COG6 COL4A3[D] FMN2
COL25A1 CPNE4[D] GPR137B
COMMD10 CRHR1 GPRIN3
CUBN[D] CSMD1[D] HBE1 **
DIRC3 DOK6 HBG2 **
DTNA FRAS1[D] HIP1
EPHA6 FXN[D] HLA-B[A][D][S]
EXTL3 GABBR2 HLA-C [D]
FRMD4B[D] GNA15 HLA-DPA1[D]
GPR114 GRAMD4 HLA-DPB1 [D]
GTF2IRD1 GRID1 HLA-DQA1[D]*
HLA-DQA2 GRIP1 HLA-DQB1[D]
IGFBP7[D,L] HLA-DRA[D] HLA-DQB2
IL37[S] HUS1[L] HLA-DRB1 [D]
LGALS8[D,A,S] IDO2 HLA-DRB5 [D]
LGI2 IL18R1[S] HLA-G
LUZP222 IL1RL1[S] LRP1B [D]
MLPH ITGA1[D] MANBA
MYOM2[D] KALRN MMP26
NEDD4L KANSL1 BICD1
NFATC1 KDM4C[D] MROH2A
NR3C1 KL[D] MYO3A[D]
OR52A1 KRT83[D] MYRIP[D]
OR6J1 LAMC2 NCAM2
PACRG[D] LDLRAD4[D] NDUFA10
PADI2 MYO7A OLAH
PARP15 NRXN3 PARK2
PDE10A NSUN4 PCDH15*
PGLYRP4[D] NTN4[D] PDSS1
PRR5-ARHGAP8 OAS1[S] PHACTR2
PTPRB[D] ORC5 PLCB4[D]


PTS OVCH1-AS1 PRDM15


RNF39 PGBD5[D] PREX2[A]
RP11-96O20.4 PKHD1 PROKR2[L]*
SFTPD POLR1E[D] PSORS1C1[D]
SGCZ[D] PPAP2B RIMKLA
SLC17A5 PSMC1 RP11-257K9.8
SLC22A16 RASSF6 SH3RF3 [D]
SLC27A6 RBFOX1[D] SIRPA
SLC35F2 REG4 SLA
SPRR3 RUNX2 SNX19[D]
SPTLC3 SEPT11 SPAG16
SQRDL SGCG[D] SPATA16
STAU2 SKAP2 SUCLG2
STK32A[D] SLC1A6 SV2C
STXBP6 SLC2A9[D] [A][S] TG
SUMF1 SOHLH2 THSD7B[D]
TGM6 SVEP1 TMPRSS2
TMCO3 TCP11L1 TMTC2
TMEM132D TEC UGT2B4[S][T]
TMEM135 TNN WNT7A
WSCD1 TNS1[A] ZC3H12D
WWOX TRIM5[S] ZNF385D
ZNF331 TRPM3 ZNF83
ZNF670 VPS8
ZNF695 WDYHV1
ZZEF1 ZNF254

Figure 5. Site frequency spectra
SFS in A) LWK population and B) GBR population of background windows (all windows in chromosome 1, in grey),
significant windows for NCD2(0.5) (blue), significant windows for NCD2(0.4) (orange), and significant windows for
NCD2(0.3) (pink).

Table 2. Significant and outlier windows and protein-coding genes across populations
Significant and outliers, see main text. U, union of all windows found with all target frequencies (t f ).
Population  tf   Significant windows  Outlier windows  Significant genes  Outlier genes
LWK         0.3  5,620                816              1,037              128
LWK         0.4  5,516                816              1,003              130
LWK         0.5  4,826                816              878                147
LWK         U    7,770                1,139            1,321              202
YRI         0.3  6,137                816              1,129              124
YRI         0.4  5,919                816              1,044              120
YRI         0.5  5,213                816              928                131
YRI         U    8,436                1,142            1,400              187
GBR         0.3  5,465                816              967                107
GBR         0.4  6,312                816              1,025              114
GBR         0.5  5,904                816              971                123
GBR         U    8,526                1,131            1,321              172
TSI         0.3  5,464                816              983                116
TSI         0.4  6,183                816              1,047              121
TSI         0.5  5,801                816              1,009              137
TSI         U    8,395                1,163            1,378              189
Queried windows: 1,631,372 in each population.

Figure 6. Enrichment in protein-coding and non-synonymous SNPs


Proportion of SNPs that are protein-coding (A,C) and proportion of protein-
coding SNPs that are nonsynonymous (B,D) for all SNPs (A,B) or SNPs with
intermediate MAF in the significant and background windows (C,D). NS, non-
synonymous; S, synonymous; C, coding; I, intergenenic. In gray, distribution
obtained from 1,000 samplings of a set of windows from the background (see
Methods). In orange, significant windows.

Figure 7. RegulomeDB enrichment analysis for scores 1 and 7
Proportion of SNPs in (A,D) RegulomeDB category 1 (eQTLs), (B,E) RegulomeDB categories 1 and 2 (overlapping a
putatively regulatory site) or (C,F) RegulomeDB category 7 (no evidence for regulatory role) for (A,B,C) all SNPs or
(D,E,F) SNPs with intermediate MAF in both the significant and background windows. In gray, distribution obtained
from 1,000 samplings of a set of windows from the background (see Methods). In orange, significant windows.

Discussion

NCD Method

We present two new summary statistics, which are simple and fast to compute, and which
allow, unlike classical approaches such as Tajima’s D (Tajima, 1989) or

the Mann-Whitney U for comparing local and global SFS (Andrés et al., 2009; Nielsen et

al., 2009), explicit exploration of different target frequencies - a property also shared by

the T1 and T2 tests (DeGiorgio et al., 2014), albeit in a likelihood framework. We show

that the NCD statistics are well powered to detect balancing selection for a complex

demographic scenario, such as that of human populations.

The NCD statistics can be used to detect selected regions using null distributions

based on neutral simulations (identifying significant windows whose signatures are

not expected under neutrality) or by an empirical outlier approach, which allows the

investigation of balancing selection when there is limited knowledge on demographic

history. Furthermore, NCD1 can be used in the absence of a close outgroup species,

which extends further the set of possible species. This allows exploring the genomes of

species where balancing selection remains completely unexplored.

Many previous and well-supported targets of balancing selection are present in our

list of selected genes, but approximately 70% of the protein-coding genes we identify

are novel candidate targets of balancing selection (Table 3).

Pervasiveness of LTBS in the human genome

On average 0.51% of the windows show, per population, signatures of LTBS that are

significant under our simulation-based criteria: we never observed comparable signa-

tures of LTBS with neutral simulations. We showed that these windows are unlikely to

be affected by technical or biological artifacts.


Although the total proportion of the genome under balancing selection may be

small, our results show that many genes contain putatively selected regions. For ex-

ample, under a restrictive criterion of being significant in at least two populations from

a continent, 8% of the protein-coding genes contain a significant window, and 1% con-

tain an outlier window. Because our statistic is powerful for detecting selection in rel-

atively narrow genomic regions (3kb), it is possible that we are identifying signatures

that would not be found when analyzing properties of entire genes or larger genomic

regions.

Protein-coding and intergenic targets

Long-term balancing selection is known to maintain both coding – e.g., HLA-B, HLA-C,

ABO (Hughes and Nei, 1988; Hughes and Nei, 1989; Ségurel et al., 2012; Ségurel et al.,

2013) – and regulatory diversity – e.g. HLA-G, UGT2B (Tan et al., 2005; Sun et al., 2011)

(we confirm these targets in Table 3). We are in a particularly good position to quantify

the relative proportion of cases of selection acting on coding or regulatory regions in

humans. We found no excess of eQTLs within selected windows once the frequency of

the alleles is accounted for, and also no evidence for enrichment of regulatory function.

A recent study suggested that there is enrichment for genes with mono-allelic ex-

pression (MAE) among those with signatures of balancing selection (Savova et al.,

2016). In agreement with this observation, we found a small but significant enrichment

for MAE genes among the outlier genes reported in Table 3 (p = 0.03, Fisher Exact Test).

We note that this overlap would be even greater if HLA genes had not been excluded

by the MAE genes list provided in Savova et al. (2016). This result is consistent with the

claim for a biological link between balancing selection and MAE (Savova et al., 2016).

Nevertheless, it remains unclear whether the detection of MAE genes is correlated with
allele frequency, as is the case for eQTLs, which could explain this enrichment.

Although significant windows overlap protein-coding genes less often than expected by
chance, the proportion of gene-overlapping windows that overlap exons is the same for
significant and background windows, showing that there is a depletion of introns in the
significant windows. Finally, significant and outlier windows show an enrichment for
nonsynonymous SNPs. This result is compatible with two scenarios: (a) direct selection
on multiple coding sites or (b) an accumulation of slightly deleterious variants as a
by-product of selection (e.g. Chun and Fay, 2011).

The frequency of the balanced allele(s)

For both new and previously known targets, an advantage of our method is that it

provides an assigned target frequency for each window, and consequently information

on the shape of its SFS (Table 3, S7 Table). In some candidate genes – the HLA genes

– we know that LTBS has targeted not one, but several sites (Hughes and Nei, 1988;

Hughes and Nei, 1989). In this case, the theoretical expectation of the shape of the re-

sulting local SFS is unclear. Nevertheless, in loci with a single balanced polymorphism,

which we assume may be common outside of the MHC, our simulations suggest that

the assigned t f can be informative about the frequency of the balanced allele. Our

results indicate that a large proportion of significant windows (50%) have minor allele
frequencies which lie closer to the target frequency of 0.3 than to 0.5, as would
be expected, for instance, under asymmetric overdominance. This highlights the
importance of considering balancing selection regimes with different frequencies of the

balanced polymorphism.

The candidate genes

Whereas studies of positive selection show a remarkably low overlap with respect to

the genes they identify, with Akey (2009) reporting that only 14% of protein-coding loci

appear in more than a single study, 61 of our outlier genes (29%) have evidence of
balancing selection in at least one previous genomic analysis (Andrés et al., 2009;
DeGiorgio et al., 2014; Leffler et al., 2013) (Table 3), and a few others were detected
in individual gene studies (Tables 3 and S7). This is a reasonable overlap, as
these studies used both different approaches and datasets.

Many candidates for balancing selection from previous studies are also identified

here. For example, Leffler et al. (2013) identified 6 genes with particularly strong ev-

idence of trans-species polymorphisms, 3 of which are outliers in our study (HUS1,

PROKR2, IGFBP7; Table 3). Of the 5 genes identified by both Andrés et al. (2009) (an

exon-based approach) and DeGiorgio et al. (2014) (a genome-wide study), 4 are among

our outlier genes (HLA-B, CDSN, LGALS8, SLC2A9; Table 3), and one (RCBTB1) among

our significant genes (S8 Table). We find 2 additional genes from Andrés et al. (2009)

(PREX2 and TNS1; Table 3) and 53 genes from DeGiorgio et al. (2014) (Table 3).

Other outlier genes have prior evidence for balancing selection in candidate gene

studies. Among the oldest known cases of genetic polymorphisms in humans are the

blood-group genes (Ségurel et al., 2012; Ségurel et al., 2013), including the ABO gene,

which we also identify (Table 3). TRIM5 has prior evidence of balancing selection in

humans and Old World Monkeys (Cagliani et al., 2010) and OAS1 since the split be-

tween humans, chimpanzees and gorillas (reviewed in Fijarczyk and Babik, 2015); both

are involved in innate immune defense. Additional examples include UGT2B4, an
enzyme that metabolizes steroid hormones and bile acids and is associated with
predisposition to breast cancer (Sun et al., 2011), and HLA-G, a non-classical HLA gene that

has tightly-regulated expression patterns between fetal and adult life (Tan et al., 2005).

Among the top 10 ranked genes (which we manually checked for undetected duplications
and non-homologous gene conversion, S3 Text), we find B4GALNT2, C1orf101,
NDUFA10, and PCDH15. NDUFA10 encodes a subunit of NADH dehydrogenase (complex I),
the largest of the complexes of the electron-transport chain, and is associated with
neurodegenerative diseases such as Leigh syndrome, Huntington’s and Parkinson’s.


C1orf101 is a protein of unknown function which is highly expressed in human tes-

ticular tissues (Petit et al., 2015). PCDH15 is a protocadherin protein that is essential

for normal retinal and cochlear function. Interestingly, this gene shows strong signa-

tures of positive selection in East Asian populations (Sabeti et al., 2007). Moreover, two

other outlier genes (not among the top 10 ranked genes) are members of the beta-globin

cluster and have evidence for recent positive selection in Andean (HBE1, HBG2) and Ti-

betan populations (HBG2) (Bigham et al., 2010; Rottgardt et al., 2010; Yi et al., 2010). It

is plausible that these genes have been under LTBS in Africa and Europe, and recently

been subjected to strong positive selection in non-African populations, a pattern of shift

in selective regime recently detected for other loci (e.g. Filippo et al., 2016).

Finally, B4GALNT2 encodes a blood-group enzyme that has evidence for trans-

species polymorphism maintaining two classes of alleles with high divergence, which

are responsible for alternative tissue-specific expression patterns (Linnenbrink et al.,

2011). Moreover, variation in this gene in mice seems to be associated with the presence
of Helicobacter species in the gut (Staubach et al., 2012; Ségurel et al., 2013). In addition,
a deletion encompassing the first exon of this gene has been described and it is possible

that it became fixed in chimpanzees by positive selection (Perry et al., 2008). To date,

our study is the first to confirm evidence of LTBS on B4GALNT2 in humans.

Conclusions
We have developed a tool to identify genomic regions under long-term balancing selec-

tion that is simple, fast, and has a high degree of sensitivity for different frequencies of

the balanced polymorphism. The NCD statistics can be applied to single loci or to the

whole genome, in species with sufficient demographic information and those without

it, and both in the presence and in the absence of an appropriate outgroup.

Our analyses indicate that, in humans, balancing selection may be shaping variation


in about 0.5% of the genome including at least 1% of the human protein-coding genes.

Because these genes are numerous and, although mostly involved in immunity, also
affect other pathways and phenotypes, our results suggest that balanced polymorphisms
are relevant to many biological processes.

In addition, we provide a catalog of candidate targets of long-term balancing selection,
including many completely novel targets. These should be further investigated, for
example, to infer the selective force maintaining the balanced polymorphisms and to
determine their phenotypic consequences in present-day human populations. Although
about 80% of windows are shared across populations, the remainder show signatures
in individual populations; these will be particularly interesting for investigating the
putative influence of balanced polymorphisms on subsequent local adaptation through
shifts in selective pressure (as in Filippo et al., 2016).

Materials and Methods

Simulations

Performance of NCD2 and NCD1 was evaluated by extensive simulations with MSMS

(Ewing and Hermisson, 2010). The simulations followed a realistic demographic model

for African, European and East Asian human populations described in Gravel et al.

(2011), including the effective population sizes ($N_e$) and migration rates. A generation

time of 25 years, a mutation rate of $2.5 \times 10^{-8}$ mutations per site (Nachman and Crowell,

2000) and a recombination rate of $1 \times 10^{-8}$ were used. The human-chimpanzee split at

6.5 million years ago was added to the model. This was our null demographic model

(Fig 2), used to obtain the neutral distributions of NCD.

For simulations with selection, a balanced polymorphism was added to the center
of the simulated sequence. The frequency equilibrium ($f_{eq}$) achieved by the balanced
polymorphism was modeled following an overdominant model, as follows. Under the
overdominance model, for a bi-allelic locus with alleles A and B, the relative fitnesses
of the three genotypes are: $w_{AA} = 1 - s_1$, $w_{AB} = 1$, and $w_{BB} = 1 - s_2$, where $s_1$ and
$s_2$ are the selection coefficients of the two homozygous genotypes, and the frequency
equilibrium ($f_{eq}$) is equal to $s_1/(s_1 + s_2)$, as in Equation 4.
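The overdominance model above can be checked numerically. The following is a minimal sketch, not part of the simulation pipeline: it iterates the standard deterministic selection recursion for the frequency of allele B (the allele for which $s_1/(s_1 + s_2)$ is the equilibrium under the fitnesses given above). The function names and the example coefficients ($s_1 = 0.01$, $s_2 = 0.04$) are illustrative assumptions.

```python
def equilibrium_frequency(s1: float, s2: float) -> float:
    """Equilibrium frequency s1 / (s1 + s2) under overdominance (w_AB = 1)."""
    return s1 / (s1 + s2)


def next_freq_b(q: float, s1: float, s2: float) -> float:
    """One generation of deterministic selection on the frequency q of allele B."""
    p = 1.0 - q
    w_aa, w_ab, w_bb = 1.0 - s1, 1.0, 1.0 - s2
    mean_w = p * p * w_aa + 2 * p * q * w_ab + q * q * w_bb
    # B alleles come from BB homozygotes and from half of the heterozygotes.
    return (q * q * w_bb + p * q * w_ab) / mean_w


def iterate(q0: float, s1: float, s2: float, generations: int) -> float:
    """Frequency of allele B after a number of generations, starting at q0."""
    q = q0
    for _ in range(generations):
        q = next_freq_b(q, s1, s2)
    return q


# Illustrative coefficients (assumed): expected equilibrium 0.01 / 0.05 = 0.2.
f_eq = equilibrium_frequency(0.01, 0.04)
q_final = iterate(q0=0.5, s1=0.01, s2=0.04, generations=5000)
```

Starting from any intermediate frequency, the recursion approaches the same equilibrium, which is why the starting value matters little once selection has acted for long enough.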

In MSMS, in order to achieve the $f_{eq}$ we are interested in, we parameterized selection
in the following way: $w_{AB} = 1 + (2N_e s)$, $w_{BB} = 1 + \left[2w_{AB} - \frac{w_{AB}}{1 - f_{eq}}\right]$, and $w_{AA} = 1$,
where $N_e$ is the effective population size used to scale the coalescent simulations and $s$
is the selection coefficient for the mutant allele (B). The selection coefficient ($s$) was set
to 0.01 (the influence of $s$ is modest once the frequency equilibrium is reached, as in the
case of LTBS). We considered four frequency equilibria: $f_{eq}$ = 0.2, 0.3, 0.4, 0.5. Simulations
with and without selection were run for different sequence lengths ($L$ = 3, 6, 12 kb)
and times of onset of balancing selection ($T_{bs}$ = 1, 3, 5 myr) (Fig 2).

Power analyses

For each set of parameters, 1,000 neutral simulations were compared to 1,000 match-

ing simulations with balancing selection for evaluation of the performance of the NCD

statistics. The relationship between the true positive (TPR, the power of the statistic)

and false positive (FPR) rates is represented through receiver operating characteristic

(ROC) curves. For comparisons between statistics and across demographic scenarios,

NCD implementations (NCD1 and NCD2) and other parameters, the power at the FPR

= 0.05 threshold was considered. When comparing performance under a given condition

(e.g. L values), power values were averaged across implementations (NCD1

and NCD2), demographic scenarios (Africa, Europe, Asia), and the other parameters,

unless explicitly stated otherwise.
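The power calculation described above can be sketched as follows. This is a minimal illustration under stated assumptions: low NCD values are taken as evidence for balancing selection, the threshold is the value below which a fraction FPR of the neutral simulations fall, and ties are not treated specially. The function name `power_at_fpr` and the toy data are hypothetical.

```python
def power_at_fpr(neutral, selected, fpr=0.05):
    """Fraction of selected simulations at or below the neutral FPR threshold."""
    ranked = sorted(neutral)
    # Number of neutral values allowed at or below the threshold.
    k = int(round(fpr * len(ranked)))
    if k == 0:
        raise ValueError("too few neutral simulations for this FPR")
    threshold = ranked[k - 1]
    hits = sum(1 for value in selected if value <= threshold)
    return hits / len(selected)


# Toy data (assumed): 1,000 neutral and 1,000 selected statistic values.
neutral_values = [i / 1000 for i in range(1000)]
selected_values = [0.01] * 800 + [0.5] * 200
power = power_at_fpr(neutral_values, selected_values)
```

Repeating this calculation over a grid of FPR values gives the points of a ROC curve for one parameter combination.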

The same simulations and procedures were used to evaluate the comparative performance

of the different methods. NCD2 and NCD1 were run using 3 kb windows, with L = 3 kb

and Tbs = 5 myr. They were compared with Tajima’s D (TajD), HKA (Tajima,

1989; Hudson et al., 1987), and the combined NCD1+HKA test (a joint distribution of

the two summary statistics) also in 3kb windows. DeGiorgio et al. (2014) report the

performance of T1/T2 based on windows of 100 informative sites upstream and down-

stream of the target site (on average 13.7 Kb in YRI and 14.7 Kb in CEU). Therefore, we

divided 15kb simulations in windows of 100 informative sites and calculated T1 and

T2 using BALLET (DeGiorgio et al., 2014). We selected the highest T1 or T2 value from

each simulation to perform the power evaluation.

Human population genetic data and filtering

Data. We analyzed genome-wide data from the 1000 Genomes Project phase I (Abecasis
et al., 2012). SNPs that were detected only in the high-coverage exome sequencing
of the 1000G were not considered, because the difference in coverage between the
low-coverage and the high-coverage (exome) data inflates SNP density in coding regions,
potentially biasing our results.

The genomes of individuals from African and European populations were queried

(excluding the recently admixed ASW population), but not those from Asian populations,

due to lower performance in these populations (see “Demography” in the Results

section). We considered two African populations (YRI and LWK), and two European

populations (GBR and TSI). For comparisons between continents only two European

populations were considered (GBR and TSI).

To equalize sample size, we randomly sampled 50 unrelated individuals from each

population (Key et al., 2014a). We used the minor allele frequency (MAF) in the NCD

statistics calculations to analyze the folded SFS (Fig 1). This allows us to retain SNPs

for which the ancestral state could not be determined by the 1000G.


Filtering. Genome analyses require extensive filtering in order to avoid the inclusion
of errors that may bias the results. We dedicated extensive efforts to obtain a filtered
dataset (see Fig S13). We disregarded positions not present in the 50mer CRG Alignability
track (Derrien et al., 2012), which requires that 50 bp segments map uniquely
(to only one region of the genome, allowing up to two mismatches). We filtered out all
regions annotated as segmental duplications (Alkan et al., 2009; Cheng et al., 2005) and
positions within simple repeats detected by Tandem Repeats Finder (Benson,
1999). We also required that all scanned positions be orthologous to the PanTro2
chimpanzee reference sequence, because NCD2 includes FDs. After this filtering, NCD2
was calculated for the remaining windows (1,705,970 windows per population).

Identifying signatures of LTBS

Because L = 3 Kb yielded the best performance for NCD2 for both African and European
simulations (see Results, Figs 3, S1, S2), we queried the human population genetic
data with sliding windows of L = 3 Kb with 1.5 Kb step size. Windows are defined in
physical distance since the presence of balancing selection may affect the population-based
estimates of recombination rate. Variable positions were categorized as a SNP (if
polymorphic in the sample) or a FD (if all humans differ from the chimpanzee); the only
exceptions are polymorphic sites where both allelic states differ from the chimpanzee
reference state, which were considered both a SNP and a FD. Each population
was queried separately, and NCD2 was calculated considering three target frequencies
($tf$ = 0.3, 0.4, 0.5). For each queried window, we computed the number of SNPs, FDs and
IS, the ratio SNP/(FD+1), and NCD2 ($tf$ = 0.3, 0.4, 0.5).
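The window scan described above can be sketched in a few lines. This is an illustrative outline, not the code used in the study. It assumes that IS is the number of SNPs plus FDs, and that NCD2 is the root mean squared distance between the MAF of each informative site and $tf$, with FDs entering at frequency 0 (the definition given earlier in the chapter, restated here as an assumption). Sites counted as both SNP and FD are not handled, and all function names are hypothetical.

```python
import math


def window_starts(region_start, region_end, size=3000, step=1500):
    """Start coordinates of sliding windows fully inside [region_start, region_end)."""
    return list(range(region_start, region_end - size + 1, step))


def summarize_window(snp_mafs, n_fds, tf):
    """Per-window counts and NCD2.

    snp_mafs: minor allele frequencies of the SNPs in the window.
    n_fds: number of fixed differences (entered with frequency 0).
    """
    n_snps = len(snp_mafs)
    n_is = n_snps + n_fds
    if n_is == 0:
        return None  # no informative sites, NCD2 undefined
    squared = sum((p - tf) ** 2 for p in snp_mafs) + n_fds * (0.0 - tf) ** 2
    return {
        "SNPs": n_snps,
        "FDs": n_fds,
        "IS": n_is,
        "SNP/(FD+1)": n_snps / (n_fds + 1),
        "NCD2": math.sqrt(squared / n_is),
    }


starts = window_starts(0, 9000)
summary = summarize_window(snp_mafs=[0.5], n_fds=1, tf=0.5)
```

Windows whose SNPs sit close to the target frequency and that contain few FDs yield the lowest NCD2 values.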

Filtering and correction for number of informative sites (IS). Neutrality tests
typically place a threshold on the minimum number of informative sites necessary, e.g.
at least 10 IS in Andrés et al. (2009), and 100 informative sites in DeGiorgio et al. (2014).
We observe considerable variance in the number of IS per 3 Kb window in the real
human genomic data, and find that NCD2 has high variance when the number of IS is low
(S18 Fig). We therefore required that each window have at least 19 (African populations)
or 15 (European populations) IS, and the same sets of windows were queried in all
4 populations (Figs S9 and S10). These values were chosen because beyond them
NCD2 stabilizes (Figs S18 and S19). This final filter resulted in 1,631,372 considered
windows (4% of the queried windows were excluded) (Fig S13). Furthermore, neutral
simulations with different mutation rates were performed in order to retrieve 10,000
simulations for each bin of IS ranging from 4-229 (Africa) and 4-199 (Europe); this range
is compatible with the range seen in the actual data. Next, NCD2 ($tf$ = 0.3, 0.4, 0.5)
was calculated for all simulations. These simulations per bin of IS allowed both the
assignment of significant windows and the calculation of $Z_{tf}$ (Equation 4, see below).
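The minimum-IS filter can be illustrated with a short sketch. It assumes that "the same sets of windows" means the intersection of the windows that pass the threshold in every population; the text does not spell this out, so treat it as one possible reading. The function name and the toy window identifiers are hypothetical.

```python
# Minimum number of informative sites per population (from the text).
MIN_IS = {"YRI": 19, "LWK": 19, "GBR": 15, "TSI": 15}


def shared_windows(is_counts, min_is=MIN_IS):
    """Windows passing the minimum-IS threshold in every population.

    is_counts: {population: {window_id: number of IS}}
    """
    kept = None
    for pop, counts in is_counts.items():
        passing = {w for w, n in counts.items() if n >= min_is[pop]}
        kept = passing if kept is None else kept & passing
    return kept or set()


# Toy counts (assumed) for three windows.
toy_counts = {
    "YRI": {"w1": 25, "w2": 18, "w3": 40},
    "LWK": {"w1": 19, "w2": 30, "w3": 22},
    "GBR": {"w1": 15, "w2": 20, "w3": 14},
    "TSI": {"w1": 16, "w2": 16, "w3": 30},
}
kept_windows = shared_windows(toy_counts)

# Fraction of queried windows removed by the filter, from the reported counts.
excluded_fraction = 1 - 1631372 / 1705970
```

With the reported counts, the excluded fraction is about 4.4%, in line with the roughly 4% stated above.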

Significant windows. We defined two sets of windows with signatures of LTBS:
the significant windows (obtained based on the simulations) and the outlier windows
(obtained based on the empirical distribution). The significant windows were defined
as those for which the observed $NCD2_{tf}$ value is lower than any
of the 10,000 values obtained for simulations with the same number of IS. Based on this
criterion, all significant windows have the same p-value (p < 0.0001).
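The significance criterion amounts to comparing each observed value with the minimum of the matching neutral simulations. A minimal sketch follows; the data structures and names are illustrative assumptions, and only three or two simulated values per bin are used here instead of 10,000.

```python
def significant_windows(windows, sims_by_is):
    """Windows whose NCD2 is lower than every neutral simulation with the same IS.

    windows: {window_id: (number of IS, observed NCD2)}
    sims_by_is: {number of IS: list of neutral NCD2 values}
    """
    return {
        w for w, (n_is, ncd2) in windows.items()
        if ncd2 < min(sims_by_is[n_is])
    }


# Toy input (assumed values).
toy_windows = {"w1": (20, 0.30), "w2": (20, 0.41), "w3": (25, 0.35)}
toy_sims = {20: [0.40, 0.45, 0.38], 25: [0.33, 0.42]}
significant = significant_windows(toy_windows, toy_sims)
```

With 10,000 simulations per bin, a window that beats all of them has p < 1/10,000, which is the bound quoted in the text.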

Outlier windows. In order to rank the queried windows and apply an outlier approach,
we developed a standardized distance measure between the observed $NCD2_{tf}$
(for the queried window) and the mean of the $NCD2_{tf}$ values for the 10,000 simulations
for the matching number of IS. This distance ($Z_{tf}$) is given by:

$$Z_{tf} = \frac{NCD2_{tf} - \overline{NCD2}_{IS}}{sd_{IS}} \qquad (4)$$

where $Z_{tf}$ is the $NCD2_{tf}$ distance corrected by the number of IS, $NCD2_{tf}$ is the
NCD2 value for the n-th empirical window, $\overline{NCD2}_{IS}$ is the mean NCD2 for 10,000
neutral simulations for the corresponding value of IS, and $sd_{IS}$ is the standard deviation
of NCD2 for 10,000 simulation values with the matching number of IS.
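Equation 4 is a standard z-score against the neutral simulations with the same number of IS. The sketch below is illustrative only; it uses the sample standard deviation from Python's `statistics` module, which is an assumption, since the text does not say whether the sample or the population estimator was used (with 10,000 values the difference is negligible).

```python
from statistics import mean, stdev


def z_tf(observed_ncd2, neutral_ncd2):
    """Standardized distance of an observed NCD2 from its neutral expectation."""
    return (observed_ncd2 - mean(neutral_ncd2)) / stdev(neutral_ncd2)


# Toy neutral values (assumed) for one bin of IS.
toy_neutral = [0.40, 0.42, 0.44, 0.46, 0.48]
z_example = z_tf(0.30, toy_neutral)
```

Strongly negative values indicate windows whose NCD2 is far below what neutrality predicts for that number of IS.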

This standardized distance measure takes into account the range of possible values
within each IS value, and also the different ranges of values across different target
frequencies. Therefore, $Z_{tf}$ not only allows the ranking of all windows for a given $tf$, but
also takes into account the residual effect that the number of IS has on $NCD2_{tf}$ (even
after filtering for a minimum number of IS, see S11 and S18 Figs) and, finally, allows a
comparison between the rankings of a window considering different target frequencies.
Once the $Z_{tf}$ scores were calculated, the windows were ranked according
to $Z_{0.5}$, $Z_{0.4}$, and $Z_{0.3}$. An empirical p-value was attributed to each window based on
the $Z_{tf}$ values, and the windows corresponding to the 0.05% lower tail (816 windows)
of the genomic distribution of $Z_{tf}$ values were defined as the “outlier windows”. All
outlier windows are contained within the significant windows except four windows in
LWK and one window in TSI.
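The ranking and the lower-tail cutoff can be sketched as follows. This is an illustration under assumptions: the empirical p-value is taken to be rank divided by the number of windows, and the number of outliers is the tail fraction times the number of windows, rounded up. Neither detail is stated in the text, and the function names are hypothetical.

```python
import math


def empirical_p_values(z_scores):
    """Rank-based lower-tail p-value per window: rank / number of windows."""
    order = sorted(z_scores, key=z_scores.get)
    n = len(order)
    return {w: (rank + 1) / n for rank, w in enumerate(order)}


def outlier_windows(z_scores, tail=0.0005):
    """Windows in the lower tail of the Z distribution, most extreme first."""
    order = sorted(z_scores, key=z_scores.get)
    n_out = math.ceil(tail * len(order))
    return order[:n_out]


# Toy scores (assumed); a 25% tail is used so that one window is returned.
toy_z = {"w1": -5.2, "w2": 0.3, "w3": -1.1, "w4": 1.8}
toy_p = empirical_p_values(toy_z)
toy_outliers = outlier_windows(toy_z, tail=0.25)

# The 0.05% tail of the 1,631,372 analyzed windows.
n_outliers_genome = math.ceil(0.0005 * 1631372)
```

Applied to the 1,631,372 analyzed windows, a 0.05% tail corresponds to 816 windows, matching the number reported above.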

Assigned $tf$ values. As mentioned above, the p-values obtained from $Z_{tf}$ can be
directly compared across $tf$ values. When a window is an outlier for several $tf$ values,
this property allowed an assignment, for each window, of the $tf$ value that minimizes
the $NCD2_{tf}$. For the significant and outlier windows, we assigned a $tf$ value as the $tf$
that yields the lowest empirical p-value for the window (S6 Table). For the outlier genes
in Table 3, a $tf$ value was assigned to a gene by asking: (1) which window overlapping
the gene has the lowest p-value; and (2) which $tf$ value is associated with the p-value
in (1). Thus, the assigned $tf$ value for a gene is the assigned $tf$ for the window that has
the lowest empirical p-value. This was done for each population separately, as seen in
S7 Table.
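The two-step assignment for genes can be written compactly. The sketch below is illustrative; the data layout and function names are assumptions, and ties are resolved by taking the first minimum encountered, which the text does not specify.

```python
def assign_tf(p_by_tf):
    """Target frequency with the lowest empirical p-value. p_by_tf: {tf: p}"""
    return min(p_by_tf, key=p_by_tf.get)


def assign_gene_tf(window_p_values):
    """Assigned window and tf for a gene.

    window_p_values: {window_id: {tf: p}} for the windows overlapping the gene.
    """
    # Step 1: window with the lowest p-value across all tf values.
    best_window = min(
        window_p_values, key=lambda w: min(window_p_values[w].values())
    )
    # Step 2: tf associated with that p-value.
    return best_window, assign_tf(window_p_values[best_window])


# Toy p-values (assumed) for two windows overlapping one gene.
toy_gene = {
    "w1": {0.3: 0.02, 0.4: 0.01, 0.5: 0.03},
    "w2": {0.3: 0.004, 0.4: 0.2, 0.5: 0.0005},
}
gene_window, gene_tf = assign_gene_tf(toy_gene)
```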

Coverage. To test whether the signatures of LTBS are driven by undetected duplications,
which can produce mapping errors and false SNPs, we analyzed modern human
shotgun genome-wide data that have been sequenced to an average coverage per
individual between 20x and 30x (Meyer et al., 2012; Prüfer et al., 2013). We used an
independent dataset because read coverage data is low and cryptic in the 1000G and because
putative duplications that affect the SFS must be at appreciable frequency and should
be present in other datasets. We considered 12 genomes, two genomes per population,
and two populations per continent: Yoruba and San (Africa), French and Sardinian
(Europe), Dai and Han Chinese (Asia).

For each sample, we retrieved the positions that have coverage higher than the
97.5% quantile of the coverage distribution specific to that sample (termed “high coverage”
positions). For each window in our analysis for signatures of LTBS, we calculated the
proportion of positions having high coverage in at least two samples (pHC), and plotted
the distributions for different NCD2 empirical p-values, i.e., those based on the $Z_{tf}$
scores (S14 Fig). Our significant and outlier windows are not enriched in positions with
high coverage in the samples considered herein, but rather the opposite: the significant
windows show a significant reduction in the proportion of positions with high coverage
when compared with non-significant windows (all two-tailed Mann-Whitney U test
p-values < 0.001).
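The pHC calculation can be sketched as below. This is an illustration, not the original implementation: the quantile uses a nearest-rank rule, positions missing from a sample are treated as coverage 0, and the names and toy values are hypothetical.

```python
import math


def high_coverage_threshold(coverages, q=0.975):
    """Value at the q-quantile of a sample's per-position coverage (nearest rank)."""
    ranked = sorted(coverages)
    idx = min(len(ranked) - 1, math.ceil(q * len(ranked)) - 1)
    return ranked[idx]


def proportion_high_coverage(positions, coverage_by_sample, thresholds, min_samples=2):
    """pHC: share of window positions with high coverage in >= min_samples samples."""
    n_high = 0
    for pos in positions:
        hits = sum(
            1 for sample, cov in coverage_by_sample.items()
            if cov.get(pos, 0) > thresholds[sample]
        )
        if hits >= min_samples:
            n_high += 1
    return n_high / len(positions)


# Toy data (assumed): three samples, four positions, one shared threshold.
toy_cov = {
    "s1": {1: 50, 2: 10, 3: 60, 4: 10},
    "s2": {1: 55, 2: 12, 3: 11, 4: 9},
    "s3": {1: 9, 2: 8, 3: 70, 4: 10},
}
toy_thresholds = {"s1": 40, "s2": 40, "s3": 40}
phc = proportion_high_coverage([1, 2, 3, 4], toy_cov, toy_thresholds)
threshold_example = high_coverage_threshold(list(range(1, 201)))
```

Using a per-sample threshold keeps the definition of "high coverage" comparable across genomes sequenced to different depths.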

Enrichment Analyses

Gene and Phenotype Ontology. Whenever a candidate window overlaps a protein-coding
gene to any extent, this gene is considered a “candidate gene”. This includes
windows that fall within intronic regions. GO (gene ontology) and PO (phenotype
ontology) enrichment analyses were performed using the software GOWINDA
(Kofler and Schlötterer, 2012), which avoids common biases that result from gene length
(longer genes with more windows have by chance a higher probability of containing a
candidate window) and/or gene clustering. We ran the analysis in mode: gene and
performed 100,000 simulations for FDR estimation. Significant categories were obtained by
considering an FDR<0.05.

GOWINDA was designed for SNP-based analyses, so we used the middle position
of every scanned window as the target site. To account for the full extent of each
3 kb window, we extended gene coordinates by 1,500 bp upstream and downstream using
the option updownstream1500 in the SNP-to-gene mapping. We
used the annotation file (.gtf) and the gene set file for Gene Ontology from Ensembl
(http://www.ensembl.org/index.html), and the Phenotype Ontology file from the Human
Phenotype Ontology database (http://human-phenotype-ontology.github.io/). Separate
analyses were performed for each population, considering different sets of genes
defined by: 1) the type of candidate window (outlier vs significant
windows); 2) the $tf$ (0.5, 0.4 and 0.3); 3) the union of candidate windows for
all $tf$; 4) the exclusion of the classical HLA genes with previous evidence of balancing
selection (HLA-B, HLA-C, HLA-DRB1, HLA-DRB5, HLA-DPA1, HLA-DPA2, HLA-DPB1,
HLA-DPB2, HLA-DQB1, HLA-DQB2, HLA-DQA1, HLA-DQA2).

Enrichment for coding and non-synonymous SNPs. We used annotation from
the 1000 Genomes (Abecasis et al., 2012) to define coding, intergenic, synonymous and
non-synonymous SNPs. Every SNP used in the NCD2 calculation and overlapping
NCD2-analyzed windows was considered in this analysis. A GOWINDA re-sampling
approach as described above was used to perform the enrichment analysis. To control
for possible effects of allele frequency on the enrichment for specific features such as
eQTLs, a separate analysis only included SNPs at intermediate frequencies (MAF >=
20%) in each of the four populations.

RegulomeDB. To test for enrichment of putatively regulatory sites among targets of
balancing selection we used RegulomeDB, which is a SNP-based annotation for known
and predicted regulatory elements (Boyle et al., 2012). Specifically, we considered
RegulomeDB scores of 1 (eQTL + TF binding + matched TF motif + matched DNase Footprint
+ DNase peak), 2 (TF binding + matched TF motif + matched DNase Footprint
+ DNase peak), and 1+2 together (sites that are annotated as eQTL and those that are
not), as well as 7 (no regulatory annotation). These represent SNPs with the highest
(1, 2) and the lowest (7) evidence for regulatory function, respectively.

For each candidate window we summed the number of SNPs with each score that
overlap the window. The expectation in the absence of LTBS was obtained by randomly
sampling from the genome the same number of windows as there are with evidence for
LTBS (Table 2). This enabled the calculation of an empirical p-value for the enrichment
of RegulomeDB scores in candidate windows when compared with the empirical
background distribution, while accounting for the size of each candidate window set
(significance when p < 0.05). Because we considered the sum of scores across all windows,
counting each SNP only once even if it overlapped more than one window, our
strategy is insensitive to window length. We conducted similar analyses considering
only alleles found at intermediate frequencies (MAF >= 20%), as described above.
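The resampling scheme can be outlined as follows. This is a minimal sketch under assumptions that the text does not specify: random window sets are drawn from all scanned windows (candidates included), the p-value is one-sided for enrichment, and a +1 correction is applied to numerator and denominator. All names and the toy data are hypothetical.

```python
import random


def count_scored_snps(windows, snps_by_window):
    """Number of distinct SNPs with the score of interest across a set of windows."""
    seen = set()
    for w in windows:
        seen.update(snps_by_window.get(w, ()))
    return len(seen)  # a SNP shared by overlapping windows is counted once


def enrichment_p_value(candidates, all_windows, snps_by_window,
                       n_resamples=1000, seed=0):
    """Empirical p-value for an excess of scored SNPs in the candidate windows."""
    rng = random.Random(seed)
    observed = count_scored_snps(candidates, snps_by_window)
    as_extreme = 0
    for _ in range(n_resamples):
        sample = rng.sample(all_windows, len(candidates))
        if count_scored_snps(sample, snps_by_window) >= observed:
            as_extreme += 1
    return (as_extreme + 1) / (n_resamples + 1)


# Toy genome (assumed): 100 windows, scored SNPs only in the first five.
toy_all = list(range(100))
toy_snps = {w: {f"snp{w}_{i}" for i in range(3)} for w in range(5)}
p_enrich = enrichment_p_value([0, 1, 2, 3, 4], toy_all, toy_snps)
```

Drawing the same number of windows as in the candidate set is what makes the background distribution comparable in size to the observed set.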

Immune-related genes. To specifically test for enrichment for genes related to immunity,
we used a list of 386 immune-related keywords from the Comprehensive
List of Immune-Related Genes from ImmPort (https://immport.niaid.nih.gov/)
to query the GO categories of the outlier genes. In total, 200 out of our 212 outlier
genes have at least one associated GO category, of which 62 have at least one GO
category that matches at least one of the keywords on the list and were thus considered
to be “immune-related”.
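The keyword query can be sketched as below. This is only an illustration: it assumes case-insensitive substring matching between keywords and GO category names, which the text does not specify, and the keywords, the gene `GENE_X` and its annotation are hypothetical.

```python
def is_immune_related(go_terms, keywords):
    """True if any GO category name contains any of the keywords."""
    lowered = [term.lower() for term in go_terms]
    return any(k.lower() in term for k in keywords for term in lowered)


# Toy inputs (assumed); the real list has 386 keywords.
toy_keywords = ["immune", "antigen", "interferon"]
toy_gene_go = {
    "HLA-B": ["antigen processing and presentation"],
    "GENE_X": ["lipid metabolic process"],
}
immune_genes = {
    gene for gene, terms in toy_gene_go.items()
    if is_immune_related(terms, toy_keywords)
}

# Share of GO-annotated outlier genes classified as immune-related in the text.
immune_fraction = 62 / 200
```

From the counts reported above, 62 of the 200 GO-annotated outlier genes (31%) fall in the immune-related class.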

References

Abecasis, G. R., A. Auton, L. D. Brooks, M. A. DePristo, R. M. Durbin, R. E. Handsaker, H. M.
Kang, G. T. Marth, and G. A. McVean (2012). “An integrated map of genetic variation from
1,092 human genomes.” In: Nature 491 (7422), pp. 56–65.
Akey, J. M. (2009). “Constructing genomic maps of positive selection in humans: where do we
go from here?” In: Genome Research 19 (5), pp. 711–722.
Alkan, C. et al. (2009). “Personalized copy number and segmental duplication maps using next-
generation sequencing”. In: Nature Genetics 41 (10), pp. 1061–1067.
Allison, A. C. and D. F. Clyde (1961). “Malaria in African Children with Deficient Erythrocyte
Glucose-6-phosphate Dehydrogenase”. In: British Medical Journal 1 (5236), pp. 1346–1349.
Alonso, S., S. Lopez, N. Izagirre, and C. de la Rua (2008). “Overdominance in the Human
Genome and Olfactory Receptor Activity”. In: Molecular Biology and Evolution 25 (5), pp. 997–
1001.
Anders, S. and W. Huber (2010). “Differential expression analysis for sequence count data”. In:
Genome Biology 11 (10), R106.
Andrés, A. M. (2011). “Balancing Selection in the Human Genome”. In: eLS, pp. 1–8.
Andrés, A. M. et al. (2009). “Targets of balancing selection in the human genome.” In: Molecular
Biology and Evolution 26 (12), pp. 2755–64.
Asthana, S., S. Schmidt, and S. R. Sunyaev (2005). “A limited role for balancing selection”. In:
Trends in genetics : TIG 21 (1), pp. 30–32.
Benson, G. (1999). “Tandem repeats finder: a program to analyze DNA sequences”. In: Nucleic
acids research 27 (2), pp. 573–580.

Bergland, A. O., E. L. Behrman, K. R. O’Brien, P. S. Schmidt, and D. A. Petrov (2014). “Genomic
Evidence of Rapid and Stable Adaptive Oscillations over Seasonal Time Scales in
Drosophila”. In: PLoS Genetics 10 (11), e1004775.
Biasin, M. et al. (2007). “Apolipoprotein B mRNA-Editing Enzyme, Catalytic Polypeptide-Like
3G: A Possible Role in the Resistance to HIV of HIV-Exposed Seronegative Individuals.” In:
Journal of Infectious Diseases 195 (7), pp. 960–964.
Bigham, A. et al. (2010). “Identifying Signatures of Natural Selection in Tibetan and Andean
Populations Using Dense Genome Scan Data”. In: PLoS Genetics 6 (9). Ed. by D. J. Begun,
e1001116.
Boyle, A. P. et al. (2012). “Annotation of functional variation in personal genomes using Regu-
lomeDB”. In: Genome Research 22 (9), pp. 1790–1797.
Bubb, K. L. et al. (2006). “Scan of human genome reveals no new Loci under ancient balancing
selection.” In: Genetics 173 (4), pp. 2165–77.
Cagliani, R., M. Fumagalli, M. Biasin, L. Piacentini, S. Riva, U. Pozzoli, M. C. Bonaglia, N.
Bresolin, M. Clerici, and M. Sironi (2010). “Long-term balancing selection maintains trans-
specific polymorphisms in the human TRIM5 gene.” In: Human genetics 128 (6), pp. 577–
88.
Cagliani, R., S. Riva, U. Pozzoli, M. Fumagalli, G. P. Comi, N. Bresolin, M. Clerici, and M. Sironi
(2011). “Balancing selection is common in the extended MHC region but most alleles with
opposite risk profile for autoimmune diseases are neutrally evolving”. In: BMC Evolutionary
Biology 11 (1), p. 171.
Charlesworth, B. and D. Charlesworth (2010). Elements of Evolutionary Genetics. 1st ed. Roberts
and Company Publishers, p. 768. ISBN: 0981519423.
Charlesworth, B., M. Nordborg, and D. Charlesworth (1997). “The effects of local selection, bal-
anced polymorphism and background selection on equilibrium patterns of genetic diversity
in subdivided population”. In: Genetical Research 70, pp. 155–174.
Charlesworth, D. (2006). “Balancing selection and its effects on sequences in nearby genome
regions.” In: PLoS Genetics 2 (4), pp. 379–384.
Cheng, Z. et al. (2005). “A genome-wide comparison of recent chimpanzee and human segmen-
tal duplications”. In: Nature 437 (7055), pp. 88–93.


Chun, S. and J. C. Fay (2011). “Evidence for hitchhiking of deleterious mutations within the
human genome.” In: PLoS genetics 7 (8), e1002240.
Clarke, B. (1962). “Balanced polymorphism and the diversity of sympatric species”. In: Taxon-
omy and Geography. Ed. by D. Nichols. Oxford: Systematics Association.
— (1964). “Frequency-Dependent Selection for the Dominance of Rare Polymorphic Genes”.
In: Evolution 18 (3), pp. 364–369.
Coventry, A. et al. (2010). “Deep resequencing reveals excess rare recent variants consistent with
explosive population growth”. In: Nature Communications 1 (8), p. 131.
Day, F. R. et al. (2015). “Causal mechanisms and balancing selection inferred from genetic asso-
ciations with polycystic ovary syndrome”. In: Nature Communications 6, p. 8464.
DeGiorgio, M., K. E. Lohmueller, and R. Nielsen (2014). “A model-based approach for iden-
tifying signatures of ancient balancing selection in genetic data.” In: PLoS genetics 10 (8),
e1004561.
Derrien, T., J. Estellé, S. Marco Sola, D. G. Knowles, E. Raineri, R. Guigó, and P. Ribeca (2012).
“Fast Computation and Applications of Genome Mappability”. In: PLoS ONE 7 (1). Ed. by
C. A. Ouzounis, e30377.
Ewing, G. and J. Hermisson (2010). “MSMS: a coalescent simulation program including recom-
bination, demographic structure and selection at a single locus”. In: Bioinformatics 26 (16),
pp. 2064–2065.
Fijarczyk, A. and W. Babik (2015). “Detecting balancing selection in genomes: Limits and prospects”.
In: Molecular Ecology, n/a–n/a.
Filippo, C. de, F. M. Key, S. Ghirotto, A. Benazzo, J. R. Meneu, A. Weihmann, G. Parra, E. D.
Green, and A. M. Andrés (2016). “Recent Selection Changes in Human Genes under Long-
Term Balancing Selection”. In: Molecular Biology and Evolution, msw023.
Gravel, S., B. M. Henn, R. N. Gutenkunst, A. R. Indap, G. T. Marth, A. G. Clark, F. Yu, R. A.
Gibbs, and C. D. Bustamante (2011). “Demographic history and rare allele sharing among
human populations.” In: Proceedings of the National Academy of Sciences of the United States of
America 108 (29), pp. 11983–8.
Gutenkunst, R. N., R. D. Hernandez, S. H. Williamson, and C. D. Bustamante (2009). “Infer-
ring the Joint Demographic History of Multiple Populations from Multidimensional SNP
Frequency Data”. In: PLoS Genetics 5 (10). Ed. by G. McVean, e1000695.

Hedrick, P. W. (2006). “Genetic Polymorphism in Heterogeneous Environments: The Age of
Genomics”. In: Annual Review of Ecology, Evolution, and Systematics 37, pp. 67–93.
Hedrick, P. W. (2012). “What is the evidence for heterozygote advantage selection?” In: Trends in
Ecology & Evolution 27 (12), pp. 698–704.
Hedrick, P. W., T. S. Whittam, and P. Parham (1991). “Heterozygosity at individual amino acid
sites: extremely high levels for HLA-A and -B genes.” In: Proceedings of the National Academy
of Sciences 88 (13), pp. 5897–5901.
Howell, W. M. (2014). “HLA and disease: Guilt by association”. In: International Journal of Im-
munogenetics 41 (1), pp. 1–12.
Hudson, R. R. and N. L. Kaplan (1988). “The coalescent process in models with selection and
recombination”. In: Genetics 120 (3), pp. 831–840.
Hudson, R. R., M. Kreitman, and M. Aguade (1987). “A Test of Neutral Molecular Evolution
Based on Nucleotide Data”. In: Genetics 116 (1), pp. 153–159.
Hughes, A. L. and M. Nei (1989). “Evolution of the major histocompatibility complex: inde-
pendent origin of nonclassical class I genes in different groups of mammals.” In: Molecular
Biology and Evolution 6 (6), pp. 559–79.
Hughes, A. L. and M. Nei (1988). “Pattern of nucleotide substitution at major histocompatibility
class I loci reveals overdominant selection”. In: Nature 335 (6186), pp. 167–170.
Johnston, S. E., J. Gratten, C. Berenos, J. G. Pilkington, T. H. Clutton-Brock, J. M. Pemberton, and
J. Slate (2013). “Life history trade-offs at a single locus maintain sexually selected genetic
variation”. In: Nature 502 (7469), pp. 93–95.
Key, F. M., B. Peter, M. Y. Dennis, E. Huerta-Sánchez, W. Tang, L. Prokunina-Olsson, R. Nielsen,
and A. M. Andrés (2014a). “Selection on a Variant Associated with Improved Viral Clear-
ance Drives Local, Adaptive Pseudogenization of Interferon Lambda 4 (IFNL4).” In: PLoS
genetics 10 (10), e1004681.
Key, F. M., J. C. Teixeira, C. de Filippo, and A. M. Andrés (2014b). “Advantageous diversity
maintained by balancing selection in humans”. In: Current Opinion in Genetics & Development
29, pp. 45–51.
Kofler, R. and C. Schlötterer (2012). “Gowinda: Unbiased analysis of gene set enrichment for
genome-wide association studies”. In: Bioinformatics 28 (15), pp. 2084–2085.


Leffler, E. M. et al. (2013). “Multiple Instances of Ancient Balancing Selection Shared Between
Humans and Chimpanzees”. In: Science 339 (6127), pp. 1578–1582.
Linnenbrink, M., J. M. Johnsen, I. Montero, C. R. Brzezinski, B. Harr, and J. F. Baines (2011).
“Long-Term Balancing Selection at the Blood Group-Related Gene B4galnt2 in the Genus
Mus (Rodentia; Muridae)”. In: Molecular Biology and Evolution 28 (11), pp. 2999–3003.
Liu, X. et al. (2006). “An ancient balanced polymorphism in a regulatory region of human ma-
jor histocompatibility complex is retained in Chinese minorities but lost worldwide.” In:
American Journal of Human Genetics 78 (3), pp. 393–400.
Malaria Genomic Epidemiology Network (2015). “A novel locus of resistance to severe malaria
in a region of ancient balancing selection”. In: Nature 526 (7572), pp. 253–257.
Meyer, D., R. M. Single, S. J. Mack, H. A. Erlich, and G. Thomson (2006). “Signatures of demo-
graphic history and natural selection in the human major histocompatibility complex Loci.”
In: Genetics 173 (4), pp. 2121–2142.
Meyer, D. and G. Thomson (2001). “How selection shapes variation of the human major histo-
compatibility complex: a review.” In: Annals of Human Genetics 65 (1), pp. 1–26.
Meyer, M. et al. (2012). “A High-Coverage Genome Sequence from an Archaic Denisovan Indi-
vidual”. In: Science 338 (6104), pp. 222–226.
Nachman, M. W. and S. L. Crowell (2000). “Estimate of the Mutation Rate per Nucleotide in
Humans”. In: Genetics 156 (1), pp. 297–304.
Nielsen, R. et al. (2005). “A Scan for Positively Selected Genes in the Genomes of Humans and
Chimpanzees”. In: PLoS Biology 3 (6), e170.
Nielsen, R. et al. (2009). “Darwinian and demographic forces affecting human protein coding
genes.” In: Genome Research 19 (5), pp. 838–49.
Pasvol, G., D. J. Weatherall, and R. J. M. Wilson (1978). “Cellular mechanism for the protective
effect of haemoglobin S against P. falciparum malaria”. In: Nature 274 (5672), pp. 701–703.
Perry, G. H. et al. (2008). “Copy number variation and evolution in humans and chimpanzees”.
In: Genome Research 18 (11), pp. 1698–1710.
Petit, F. G., C. Kervarrec, S. P. Jamin, F. Smagulova, C. Hao, E. Becker, B. Jegou, F. Chalmel, and
M. Primig (2015). “Combining RNA and Protein Profiling Data with Network Interactions
Identifies Genes Associated with Spermatogenesis in Mouse and Human”. In: Biology of
Reproduction 92 (3), pp. 71–71.


Prüfer, K. et al. (2013). “The complete genome sequence of a Neanderthal from the Altai Moun-
tains”. In: Nature 505 (7481), pp. 43–49.
Prugnolle, F., A. Manica, M. Charpentier, J. F. Guégan, V. Guernier, and F. Balloux (2005). “Pathogen-
driven selection and worldwide HLA class I diversity.” In: Current Biology 15 (11), pp. 1022–
7.
Rasmussen, M. D., M. J. Hubisz, I. Gronau, and A. Siepel (2014). “Genome-Wide Inference of
Ancestral Recombination Graphs”. In: PLoS Genetics 10 (5). Ed. by G. Coop, e1004342.
Raychaudhuri, S. et al. (2012). “Five amino acids in three HLA proteins explain most of the
association between MHC and seropositive rheumatoid arthritis”. In: Nature Genetics 44 (3),
pp. 291–296.
Rottgardt, I., F. Rothhammer, and M. Dittmar (2010). “Native highland and lowland popula-
tions differ in γ-globin gene promoter polymorphisms related to altered fetal hemoglobin
levels and delayed fetal to adult globin switch after birth”. In: Anthropological Science 118 (1),
pp. 41–48.
Sabeti, P. C. et al. (2007). “Genome-wide detection and characterization of positive selection in
human populations.” In: Nature 449 (7164), pp. 913–8.
Sanchez-Mazas, A. (2007). “An apportionment of human HLA diversity”. In: Tissue Antigens 69,
pp. 198–202.
Savova, V., S. Chun, M. Sohail, R. B. McCole, R. Witwicki, L. Gai, T. L. Lenz, C.-t. Wu, S. R.
Sunyaev, and A. A. Gimelbrant (2016). “Genes with monoallelic expression contribute dis-
proportionately to genetic diversity in humans”. In: Nature Genetics 48 (3), pp. 231–237.
Ségurel, L., Z. Gao, and M. Przeworski (2013). “Ancestry runs deeper than blood: The evolu-
tionary history of ABO points to cryptic variation of functional importance”. In: BioEssays
35 (10), pp. 862–867.
Ségurel, L. et al. (2012). “The ABO blood group is a trans-species polymorphism in primates”.
In: Proceedings of the National Academy of Sciences 109 (45), pp. 18493–18498.
Sellis, D., B. J. Callahan, D. A. Petrov, and P. W. Messer (2011). “Heterozygote advantage as a
natural consequence of adaptation in diploids”. In: Proceedings of the National Academy of
Sciences 108 (51), pp. 20666–20671.
Solberg, O. D., S. J. Mack, A. K. Lancaster, R. M. Single, Y. Tsai, A. Sanchez-Mazas, and G.
Thomson (2008). “Balancing selection and heterogeneity across the classical human leukocyte
antigen loci: A meta-analytic review of 497 population studies”. In: Human Immunology
69 (7), pp. 443–464.
Spurgin, L. G. and D. S. Richardson (2010). “How pathogens drive genetic diversity: MHC,
mechanisms and misunderstandings.” In: Proceedings. Biological sciences / The Royal Society
277 (1684), pp. 979–88.
Staubach, F., S. Künzel, A. C. Baines, A. Yee, B. M. McGee, F. Bäckhed, J. F. Baines, and J. M.
Johnsen (2012). “Expression of the blood-group-related glycosyltransferase B4galnt2 influ-
ences the intestinal microbiota in mice”. In: The ISME Journal 6 (7), pp. 1345–1355.
Sun, C., D. Huo, C. Southard, B. Nemesure, A. Hennis, M. Cristina Leske, S.-Y. Wu, D. B. Witon-
sky, O. I. Olopade, and A. Di Rienzo (2011). “A signature of balancing selection in the region
upstream to the human UGT2B4 gene and implications for breast cancer risk”. In: Human
Genetics 130 (6), pp. 767–775.
Tajima, F. (1989). “Statistical method for testing the neutral mutation hypothesis by DNA poly-
morphism.” In: Genetics 123 (3), pp. 585–595.
Tan, Z., A. M. Shon, and C. Ober (2005). “Evidence of balancing selection at the HLA-G pro-
moter region”. In: Human Molecular Genetics 14 (23), pp. 3619–3628.
Teixeira, J. C. et al. (2015). “Long-Term Balancing Selection in LAD1 Maintains a Missense Trans-
Species Polymorphism in Humans, Chimpanzees, and Bonobos”. In: Molecular Biology and
Evolution 32 (5), pp. 1186–1196.
Vernot, B. and J. M. Akey (2014). “Resurrecting Surviving Neandertal Lineages from Modern
Human Genomes”. In: Science 343 (6174), pp. 1017–1021.
Williamson, S. H., M. J. Hubisz, A. G. Clark, B. A. Payseur, C. D. Bustamante, and R. Nielsen
(2007). “Localizing recent adaptive evolution in the human genome.” In: PLoS genetics 3 (6),
e90.
Yi, X. et al. (2010). “Sequencing of 50 Human Exomes Reveals Adaptation to High Altitude”. In:
Science 329 (5987), pp. 75–78.


Supplementary Text

S1 Text: Additional analyses for significant and outlier windows

and genes

Ruling out possible biological confounding factors

In all the analyses below, the set of significant or outlier windows (or genes) consists of the
union of the windows, or of the genes they overlap, across all tf values.

Neanderthal introgression

Background Genomic segments that contain introgressed haplotypes from archaic human
forms (Meyer et al., 2012; Prüfer et al., 2013) have, on average, older TMRCA and higher di-
versity than the rest of the genome. In the absence of positive (or balancing) selection, though,
introgressed segments are not expected to reach intermediate frequencies and contribute to the
significant and outlier windows defined in the main paper.

Results Accordingly, the significant and outlier windows in European populations are not
enriched in putatively introgressed SNPs, defined as those with an allele absent in Africans,
shared between Europeans and Neanderthals, and falling in previously identified introgressed
regions (Vernot and Akey, 2014) (S17 Fig and Methods in main paper). In fact, the outlier win-
dows are significantly depleted of introgressed SNPs (S17 Fig).

Methods We tested the enrichment of Neanderthal introgression among candidate windows in
TSI and GBR by using the resampling approach described for the RegulomeDB functional
enrichment analysis. Putative Neanderthal-introgressed SNPs were ascertained as SNPs that
overlap the Neanderthal-introgressed haplotypes annotated by Vernot and Akey (2014) and that,
in the 1000 Genomes data (Abecasis et al., 2012), show a derived allele shared between TSI/GBR
and Neanderthals and absent in YRI and LWK. The remaining SNPs overlapping scanned
windows were considered non-introgressed.
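The ascertainment rule above can be sketched as a simple filter. This is only a minimal illustration, not the pipeline used in this work; the record fields (`derived_count`, `neanderthal_derived`, `in_introgressed_haplotype`) are hypothetical names chosen for the example.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class Snp:
    snp_id: str
    derived_count: Dict[str, int]    # derived-allele count per population (hypothetical field)
    neanderthal_derived: bool        # True if the Neanderthal genome carries the derived allele
    in_introgressed_haplotype: bool  # True if the SNP overlaps an annotated introgressed haplotype


def is_putatively_introgressed(snp: Snp, target_pop: str) -> bool:
    """Apply the four conditions stated in the text for one European population."""
    absent_in_africa = all(snp.derived_count.get(pop, 0) == 0 for pop in ("YRI", "LWK"))
    present_in_target = snp.derived_count.get(target_pop, 0) > 0
    return (
        snp.in_introgressed_haplotype
        and snp.neanderthal_derived
        and present_in_target
        and absent_in_africa
    )
```

SNPs in scanned windows for which the function returns False would fall in the non-introgressed class.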


Non-homologous gene conversion

Results and Methods We also investigated the possibility of non-homologous gene con-
version, which is another biological phenomenon that may increase diversity. To do so, for each
significant or outlier gene (see Table 2 in main text) we analyzed the distribution of the number
of paralogs that reside on the same chromosome. Significant genes show no tendency towards
having more paralogs on the same chromosome than all autosomal genes (see S16 Fig), show-
ing that this is not a general issue. In both cases, more than 60% of the genes have no paralogs
on the same chromosome (S16 Fig). We nevertheless singled out olfactory receptor (OR) genes
(see below), which often appear in tandem and may undergo gene conversion. Unlike the other
significant and background genes, more than 80% of the OR genes present in all populations for
at least one t f value have at least one paralog on the same chromosome (S19 Fig). Thus, non-
homologous gene conversion does not appear to be a general issue among significant genes,
with the exception of the OR genes.

Olfactory receptor genes

Among the windows believed to contain fewer false-positive candidates for LTBS, only two OR
genes are present: OR52A1 and OR6J1 (Table 3). Although patterns compatible with overdominance
have been reported for human olfactory receptor activity genes (Alonso et al., 2008), we cannot
rule out that the enrichment signature detected in genes of the olfactory receptor (OR) gene
family is due to paralogous gene conversion (S19 Fig). Moreover, OR52A1 has 10 paralogs on the
same chromosome, and OR6J1 has 1 (Table 3). We therefore recommend that the results
concerning OR genes be interpreted with caution.

Phenotype Ontology Analyses

Results A phenotype ontology analysis uncovered “abnormality of the sclera” as the only
significant category in YRI, and no significant categories appear in the other three populations
analyzed (S4 Table).

Methods See Methods section in main paper (“Gene and phenotype ontology”).


Tissue-specific expression

Results Interestingly, though, when we perform an enrichment analysis for tissue-specific
expression among significant windows, we find that targets of LTBS are significantly enriched in
genes that are highly expressed in the adrenal gland in TSI and in the lung in GBR (S5 Table).
These results are mirrored in African populations when considering outlier windows, although
the enrichment there is not significant.

Methods We performed an enrichment analysis of genes showing tissue-specific expression, also
using GOWINDA, with the same parameters and strategy described above, except that here we
considered only two criteria for defining different sets in each population: 1) the
ascertainment of candidate windows (outlier vs. significant); 2) the union of windows for
different target frequencies. We used Illumina BodyMap 2.0 (Derrien et al., 2012) expression data
for 16 tissues and developed a tissue-specific expression metric that considers genes expressed
at significantly higher levels in a particular tissue than in the remaining 15 tissues, using the
DESeq package (Anders and Huber, 2010).

S2 Text: A set of significant genes


We considered the union of significant genes (between 1,321 and 1,400 genes for each popula-
tion, Table 2), and defined as “African” those shared by YRI and LWK and neither or only one
of the European populations, as “European” those shared by GBR and TSI and neither or only
one of the African populations, and as “African and European” those shared by all four popu-
lations. This resulted in 1,051 genes shared between LWK and YRI and 1,089 shared between
GBR and TSI. In total, this amounts to 1,470 genes (∼8% of all queried genes): 670 are con-
sidered as “African and European”, 381 are considered as “African” and 419 are considered as
“European”. These genes are presented in S8 Table.
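The three categories above are plain set operations, and the reported counts can be cross-checked by simple addition. The sketch below is illustrative only; the function names and the dictionary layout (population name mapped to a set of gene names) are assumptions made for the example.

```python
def classify_shared_genes(sig):
    """sig maps a population name to its set of significant gene names."""
    african = sig["YRI"] & sig["LWK"]    # shared by both African populations
    european = sig["GBR"] & sig["TSI"]   # shared by both European populations
    both = african & european            # shared by all four populations
    return {
        "African and European": both,
        "African": african - both,
        "European": european - both,
    }


def check_counts(n_both, n_african, n_european):
    """Return (shared in Africa, shared in Europe, total) implied by the three categories."""
    return n_both + n_african, n_both + n_european, n_both + n_african + n_european


print(check_counts(670, 381, 419))  # (1051, 1089, 1470)
```

The printed totals agree with the 1,051, 1,089 and 1,470 genes reported in the text.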

S3 Text: Manual verification of reliability of SNPs contained in four of the outlier genes

Description

In the main text, we mention that among the top 10 outlier genes from Table 3 (considering
the p-values for YRI), 6 have been reported previously as having signatures of LTBS: PROKR2
(Leffler et al., 2013), HLA-DQA1 (DeGiorgio et al., 2014), CPE (DeGiorgio et al., 2014), HLA-
DRB5 (DeGiorgio et al., 2014), LUZP2 (DeGiorgio et al., 2014), and MYO3A (Asthana et al.,
2005; DeGiorgio et al., 2014) (Tables 3 and S7).

Thus, the novel top candidates are: B4GALNT2 (Beta-1,4-N-Acetyl-Galactosaminyl Transferase 2),
C1orf101 (Chromosome 1 Open Reading Frame 101), NDUFA10 (NADH-Ubiquinone
Oxidoreductase 42 KDa Subunit), and PCDH15 (Protocadherin-Related 15). In the main text,
we discuss these genes in more detail, given that they have extreme signatures of LTBS shared
across populations.

To verify that the extreme signatures in these genes reflect genuine LTBS rather than (a)
erroneous SNP calls caused by collapsed reads from duplicated regions or (b) non-homologous
gene conversion between close paralogs, we performed a manual verification using
BLAT (http://www.ensembl.org/Multi/Tools/Blast?db=core).

Methods

For each gene, the corresponding FASTA sequence was taken from the hg19 reference genome
and queried in BLAT. We only considered the top 100 hits for each gene. For each of the 100
hits, positions that coincide with a SNP position for the gene in the Phase 1 1000 Genomes data
set (Abecasis et al., 2012) were manually verified. If the position is a match between the query
and the hit, i.e., both have the same variant at the SNP position, this SNP is considered a match.
If the position is a mismatch between the query and the hit, i.e., query and hit have different
variants at the SNP position, this SNP is considered a mismatch.

A mismatched SNP could either have the alternate allele in the hit (alternate mismatch*) or
an allele which is not the alternate allele in the 1000 Genomes data set (simple mismatch).
Further, we considered as more relevant and likely problematic those SNPs that not only are
classified as alternate mismatch, but also have somewhat intermediate frequencies (> 0.10).
Results are provided for each gene separately, below. The location of each gene is provided for
reference.
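Although the verification was done manually, the classification rule described above can be written down compactly. The following sketch only illustrates that rule; the function and argument names are hypothetical, and the 0.10 cutoff is the intermediate-frequency threshold mentioned in the text.

```python
def classify_position(query_base, hit_base, alt_allele):
    """Classify a SNP position by comparing the query (reference) base with a BLAT hit."""
    query_base, hit_base, alt_allele = (
        query_base.upper(), hit_base.upper(), alt_allele.upper()
    )
    if hit_base == query_base:
        return "match"
    if hit_base == alt_allele:
        # The paralogous hit carries exactly the allele called as alternate.
        return "alternate mismatch"
    return "simple mismatch"


def is_likely_problematic(category, alt_freq, min_freq=0.10):
    """Flag alternate mismatches whose alternate allele has intermediate frequency."""
    return category == "alternate mismatch" and alt_freq > min_freq
```

For example, a position where the query has A, the hit has G, and the 1000 Genomes alternate allele is G would be an alternate mismatch, and would be flagged only if its frequency exceeds 0.10.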


Results

B4GALNT2

Description Position: chr17:47203660-47211160 (7,500 bp in total). In total, 96 SNPs in our
filtered data set (see Methods in main). Roughly 36 of them have somewhat intermediate
frequencies.

BLAT After looking at all hits, we found three alternate mismatch SNPs. Only two of them
have intermediate frequency (rs78050610, rs11654406), while the other is a singleton (rs140853454).

Conclusion Roughly 94 intermediate SNPs remain for this gene, making it thus unlikely
that its signatures are dependent on problematic SNPs.

NDUFA10

Description Position: chr2:240850012-240854512 (4,500 bp in total). In total, 74 SNPs in our
filtered data set (see Methods in main). Roughly 41 are intermediate in frequency in most
African and European populations.

BLAT After looking at all hits, we found two alternate mismatch SNPs, one of which has
intermediate frequency (rs6759128), while the other has low to intermediate frequency (rs28429725).

Conclusion Roughly 29 intermediate SNPs remain for this gene, making it thus unlikely
that its signatures are dependent on problematic SNPs.

PCDH15

Description Position: chr10:56902047-56911047 (9,000 bp in total). In total, 145 SNPs in our
filtered data set (see Methods in main), roughly 63 at intermediate frequency.


BLAT After looking at all hits, we found 1 alternate mismatch SNP (rs188658080), which
has very low frequency in all populations or is absent from some populations (considering only
African and European populations).

Conclusion Although this SNP is likely unreliable, it has low frequency. Coupled with the
fact that only this position clearly appears to be problematic, we concluded that the signature
observed for this gene is reliable.

C1orf101

Description Position: chr1: 244,617,679-244,804,479 (10,500 bp in total), 143 SNPs, roughly
76 at intermediate frequency.

BLAT After looking at all hits, we found 19 alternate mismatch SNPs. This seemed to be
a potentially problematic candidate for LTBS, so we looked at it in further detail. Of the 19
alternate mismatch SNPs, between 7 and 12 have intermediate frequency. They are listed below.

Alternate mismatch SNPs for C1orf101 (the ones with * have intermediate frequency and
are thus more likely to be sources of bias for observed NCD2 values): rs3003250, rs3005972,
rs3005973, rs138545291 (singleton), rs114536159, rs142620294, rs189591539 (singleton), rs3003251*,
rs3005938*, rs3005945*, rs3005947*, rs3005957*, rs3005958*, rs3005968*, rs3005969*, rs3005971*,
rs3005975*, rs7538776*, rs9429008*

Given that several SNPs for this gene are likely problematic (alternate mismatches with inter-
mediate frequencies), we recalculated NCD2 for the windows that overlap this gene after
removing those SNPs. We removed the 19 SNPs (including the low/high frequency ones and
singletons) from our data set (Methods in main) and recalculated NCD2(0.5) for the six outlier
windows (Table 2) that overlap this gene in YRI. All six outlier windows have zero FDs. For the
first of the six windows, removing the problematic SNPs reduced it from 19 to 17 SNPs (IS=17),
so it no longer fulfills our criterion of having at least 19 IS in Africa (see Methods in main).
The second window goes from having 22 to 19 IS. It passes the "significance criterion"
(simulation-based p < 0.0001) and has a Ztf value within the observed range for outlier windows.
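The recalculation can be sketched as follows. This is a hedged illustration, not the code used in this work: it assumes that NCD is the root-mean-square deviation of minor allele frequencies from the target frequency, that NCD2 counts fixed differences (FDs) as sites with frequency 0, and that informative sites (IS) are SNPs plus FDs; these definitions are taken to be those of the main text and are not restated in this supplement. All function and argument names are hypothetical.

```python
import math


def ncd(mafs, tf=0.5):
    """Root-mean-square deviation of minor allele frequencies from tf (assumed definition)."""
    if not mafs:
        raise ValueError("at least one informative site is required")
    return math.sqrt(sum((p - tf) ** 2 for p in mafs) / len(mafs))


def ncd2_after_filter(snp_maf, n_fixed_differences, excluded, tf=0.5, min_informative=19):
    """Recompute NCD2 for one window after dropping the SNPs in `excluded`.

    snp_maf maps SNP id -> minor allele frequency in the window.
    Returns None when the window no longer has enough informative sites.
    """
    kept = [maf for snp_id, maf in snp_maf.items() if snp_id not in excluded]
    n_informative = len(kept) + n_fixed_differences
    if n_informative < min_informative:
        return None  # window fails the minimum-IS filter
    # Fixed differences enter the statistic as sites with frequency 0 (assumption).
    return ncd(kept + [0.0] * n_fixed_differences, tf)
```

Under these assumptions, a window that drops from 19 to 17 SNPs with zero FDs returns None, which corresponds to the first window described above.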


Conclusion 19 SNPs are alternate mismatches, and 7-12 also have intermediate frequencies
in most African and European populations. Even if all 19 SNPs are removed and NCD2(0.5) is
recalculated, at least one window remains significant and probably an outlier, indicating
that this gene is likely a true positive. Moreover, there is an entire range of the query
(between positions 8,000-10,500) for which there are no hits, and this range contains 37 SNPs
(around 12 at intermediate frequency), further supporting the signatures observed for this gene.

Supplementary Tables

S1-A Table. Power analyses based on simulations (Africa). Reported values
are for simulations following African demographic scenarios (see Methods).
feq, equilibrium frequency used in the simulations; target frequency is the tf
value used in NCD1 and NCD2; Tbs, time since onset of balancing selection; L,
length of the simulated sequence. Reported values are always for a false posi-
tive rate of 0.05.

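Columns 4-6: NCD2 at target frequencies 0.5, 0.4, 0.3; columns 7-9: NCD1 at target frequencies 0.5, 0.4, 0.3
Tbs feq L NCD2(0.5) NCD2(0.4) NCD2(0.3) NCD1(0.5) NCD1(0.4) NCD1(0.3)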
5 0.5 3 0.959 0.944 0.835 0.929 0.911 0.393
5 0.5 6 0.917 0.885 0.728 0.903 0.847 0.392
5 0.5 12 0.829 0.789 0.548 0.846 0.772 0.325
5 0.4 3 0.939 0.935 0.886 0.886 0.894 0.674
5 0.4 6 0.871 0.860 0.790 0.838 0.819 0.651
5 0.4 12 0.742 0.726 0.612 0.745 0.717 0.534
5 0.3 3 0.895 0.908 0.929 0.717 0.801 0.836
5 0.3 6 0.776 0.796 0.833 0.659 0.709 0.794
5 0.3 12 0.572 0.597 0.638 0.509 0.570 0.663
3 0.5 3 0.911 0.882 0.681 0.855 0.797 0.236
3 0.5 6 0.856 0.809 0.574 0.854 0.770 0.266
3 0.5 12 0.727 0.666 0.410 0.768 0.678 0.232
3 0.4 3 0.878 0.864 0.759 0.781 0.783 0.557
3 0.4 6 0.803 0.785 0.678 0.770 0.753 0.527
3 0.4 12 0.659 0.621 0.500 0.654 0.629 0.441
3 0.3 3 0.749 0.774 0.811 0.561 0.640 0.706
3 0.3 6 0.628 0.648 0.700 0.526 0.570 0.658
3 0.3 12 0.425 0.456 0.509 0.388 0.443 0.536
1 0.5 3 0.419 0.336 0.159 0.389 0.321 0.087
1 0.5 6 0.356 0.283 0.107 0.395 0.294 0.086
1 0.5 12 0.273 0.217 0.100 0.359 0.268 0.096
1 0.4 3 0.372 0.338 0.247 0.348 0.340 0.185
1 0.4 6 0.305 0.278 0.206 0.355 0.324 0.195
1 0.4 12 0.252 0.232 0.156 0.304 0.274 0.174
1 0.3 3 0.229 0.239 0.280 0.194 0.249 0.279
1 0.3 6 0.190 0.203 0.233 0.210 0.232 0.277
1 0.3 12 0.142 0.150 0.169 0.159 0.182 0.219
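Power at a fixed false positive rate, as reported in these tables, can be estimated from two sets of simulated statistics. The sketch below is a minimal illustration under stated assumptions, not the procedure used here: it assumes that lower NCD values indicate stronger evidence of balancing selection, and it sets the critical value at the empirical quantile of the neutral simulations. The function name and arguments are hypothetical.

```python
def power_at_fpr(neutral, selected, fpr=0.05):
    """Fraction of selected simulations at or below the neutral critical value.

    neutral, selected: lists of a statistic (e.g. NCD2) from neutral simulations
    and from simulations with balancing selection, respectively.
    """
    ordered = sorted(neutral)
    k = int(fpr * len(ordered) + 1e-9)  # number of neutral simulations allowed below the cutoff
    if k == 0:
        raise ValueError("too few neutral simulations for this false positive rate")
    threshold = ordered[k - 1]
    # With tied values at the threshold, the realized false positive rate can exceed fpr.
    return sum(1 for x in selected if x <= threshold) / len(selected)
```

With 100 neutral values and fpr = 0.05, the cutoff is the fifth-lowest neutral value, and power is the share of selected simulations that fall at or below it.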

S1-B Table. Power analyses based on simulations (Europe). Reported values
are for simulations following European demographic scenarios (see Methods).
feq, equilibrium frequency used in the simulations; target frequency is the tf
value used in NCD1 and NCD2; Tbs, time since onset of balancing selection;
L, length of the simulated sequence. Reported values are always for a false
positive rate of 0.05.

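Columns 4-6: NCD2 at target frequencies 0.5, 0.4, 0.3; columns 7-9: NCD1 at target frequencies 0.5, 0.4, 0.3
Tbs feq L NCD2(0.5) NCD2(0.4) NCD2(0.3) NCD1(0.5) NCD1(0.4) NCD1(0.3)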
5 0.5 3 0.968 0.951 0.835 0.921 0.846 0.197
5 0.5 6 0.941 0.907 0.747 0.916 0.871 0.234
5 0.5 12 0.849 0.800 0.573 0.847 0.767 0.201
5 0.4 3 0.948 0.944 0.907 0.849 0.826 0.596
5 0.4 6 0.901 0.892 0.832 0.849 0.850 0.633
5 0.4 12 0.779 0.757 0.687 0.750 0.744 0.536
5 0.3 3 0.836 0.855 0.892 0.471 0.569 0.740
5 0.3 6 0.726 0.758 0.810 0.497 0.606 0.722
5 0.3 12 0.551 0.595 0.670 0.387 0.493 0.644
3 0.5 3 0.928 0.892 0.678 0.814 0.693 0.145
3 0.5 6 0.875 0.833 0.607 0.841 0.755 0.195
3 0.5 12 0.761 0.704 0.451 0.759 0.670 0.187
3 0.4 3 0.888 0.875 0.794 0.738 0.709 0.461
3 0.4 6 0.828 0.809 0.723 0.782 0.765 0.517
3 0.4 12 0.678 0.662 0.567 0.682 0.676 0.462
3 0.3 3 0.733 0.757 0.795 0.389 0.480 0.634
3 0.3 6 0.609 0.643 0.703 0.425 0.512 0.632
3 0.3 12 0.433 0.473 0.558 0.332 0.402 0.548
1 0.5 3 0.472 0.388 0.161 0.430 0.305 0.054
1 0.5 6 0.404 0.325 0.139 0.467 0.347 0.064
1 0.5 12 0.337 0.264 0.120 0.430 0.306 0.058
1 0.4 3 0.425 0.386 0.286 0.371 0.332 0.218
1 0.4 6 0.362 0.330 0.260 0.398 0.398 0.249
1 0.4 12 0.296 0.269 0.216 0.380 0.392 0.208
1 0.3 3 0.229 0.249 0.287 0.145 0.170 0.259
1 0.3 6 0.190 0.206 0.263 0.172 0.233 0.299
1 0.3 12 0.151 0.161 0.220 0.140 0.164 0.256

S1-C Table. Power analyses based on simulations (Asia). Reported values
are for simulations following Asian demographic scenarios (see Methods). feq,
equilibrium frequency used in the simulations; target frequency is the tf value
used in NCD1 and NCD2; Tbs, time since onset of balancing selection; L, length
of the simulated sequence. Reported values are always for a false positive rate
of 0.05.

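Columns 4-6: NCD2 at target frequencies 0.5, 0.4, 0.3; columns 7-9: NCD1 at target frequencies 0.5, 0.4, 0.3
Tbs feq L NCD2(0.5) NCD2(0.4) NCD2(0.3) NCD1(0.5) NCD1(0.4) NCD1(0.3)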
5 0.5 3 0.666 0.687 0.705 0.448 0.476 0.365
5 0.5 6 0.584 0.605 0.614 0.438 0.465 0.378
5 0.5 12 0.469 0.476 0.450 0.398 0.401 0.332
5 0.4 3 0.343 0.372 0.430 0.136 0.167 0.225
5 0.4 6 0.262 0.291 0.356 0.135 0.167 0.224
5 0.4 12 0.187 0.206 0.241 0.116 0.133 0.189
5 0.3 3 0.113 0.135 0.186 0.015 0.022 0.055
5 0.3 6 0.062 0.071 0.113 0.012 0.024 0.046
5 0.3 12 0.030 0.041 0.068 0.011 0.015 0.037
3 0.5 3 0.611 0.627 0.616 0.393 0.404 0.344
3 0.5 6 0.532 0.545 0.529 0.411 0.422 0.374
3 0.5 12 0.412 0.418 0.389 0.371 0.370 0.298
3 0.4 3 0.245 0.269 0.332 0.111 0.141 0.173
3 0.4 6 0.189 0.208 0.252 0.101 0.128 0.166
3 0.4 12 0.128 0.144 0.178 0.085 0.111 0.142
3 0.3 3 0.073 0.087 0.126 0.011 0.022 0.050
3 0.3 6 0.036 0.052 0.072 0.012 0.017 0.046
3 0.3 12 0.019 0.029 0.047 0.010 0.018 0.037
1 0.5 3 0.287 0.274 0.222 0.245 0.225 0.145
1 0.5 6 0.222 0.212 0.159 0.235 0.215 0.167
1 0.5 12 0.181 0.173 0.132 0.205 0.188 0.135
1 0.4 3 0.092 0.098 0.116 0.048 0.061 0.092
1 0.4 6 0.068 0.074 0.098 0.044 0.049 0.084
1 0.4 12 0.055 0.065 0.076 0.041 0.051 0.075
1 0.3 3 0.028 0.028 0.042 0.016 0.018 0.028
1 0.3 6 0.015 0.020 0.030 0.008 0.014 0.026
1 0.3 12 0.016 0.018 0.026 0.010 0.012 0.018

S2 Table. Gene ontology enrichment analyses for significant windows. The
union of significant windows for at least one of the tf values is used. tf, tar-
get frequency used in NCD equation. FDR, false discovery rate. genes (sims),
expected number of genes in this category (see Methods). genes (data), actual
number of genes in the category in the analyzed set.

GO term # genes (sims) # genes (data) p-value FDR Category description

LWK
GO:0007186 35.17 61 0.00001 0.00104 G-protein_coupled_receptor_signaling_pathway
GO:0042612 0.13 4 0.00001 0.00104 MHC_class_I_protein_complex
GO:0042613 0.291 10 0.00001 0.00104 MHC_class_II_protein_complex
GO:0045095 0.961 9 0.00001 0.00104 keratin_filament
GO:0042605 0.564 8 0.00001 0.00104 peptide_antigen_binding
GO:0071556 0.689 12 0.00001 0.00104 integral_to_lumenal_side_of_endoplasmic_reticulum_membrane
GO:0032395 0.124 6 0.00001 0.00104 MHC_class_II_receptor_activity
GO:0001916 0.659 7 0.00001 0.00104 positive_regulation_of_T_cell_mediated_cytotoxicity
GO:0016021 314.878 394 0.00001 0.00104 integral_to_membrane
GO:0002504 0.216 7 0.00001 0.00104 antigen_processing_and_presentation_of_peptide_or_polysaccharide_antigen_via_MHC_class_II
GO:0006955 14.317 33 0.00001 0.00104 immune_response
GO:0060333 3.3 13 0.00001 0.00104 interferon-gamma-mediated_signaling_pathway
GO:0002480 0.223 4 0.00001 0.00104 antigen_processing_and_presentation_of_exogenous_peptide_antigen_via_MHC_class_I,_TAP-independent
GO:0019882 1.306 15 0.00001 0.00104 antigen_processing_and_presentation
GO:0030658 2.507 11 0.00001 0.00104 transport_vesicle_membrane
GO:0030669 0.972 8 0.00001 0.00104 clathrin-coated_endocytic_vesicle_membrane
GO:0004984 2.255 26 0.00001 0.00104 olfactory_receptor_activity
GO:0012507 1.324 12 0.00001 0.00104 ER_to_Golgi_transport_vesicle_membrane
GO:0004930 24.693 55 0.00001 0.00104 G-protein_coupled_receptor_activity
GO:0005887 89.892 126 0.00002 0.00194 integral_to_plasma_membrane
GO:0032588 3.136 11 0.00002 0.00194 trans-Golgi_network_membrane
GO:0007608 4.93 14 0.00004 0.00387 sensory_perception_of_smell
GO:0002479 2.565 10 0.00014 0.01231 antigen_processing_and_presentation_of_exogenous_peptide_antigen_via_MHC_class_I,_TAP-dependent
GO:0019885 0.167 3 0.00014 0.01231 antigen_processing_and_presentation_of_endogenous_peptide_antigen_via_MHC_class_I
GO:0042590 2.749 10 0.00024 0.02308 antigen_processing_and_presentation_of_exogenous_peptide_antigen_via_MHC_class_I
GO:0046967 0.039 2 0.00027 0.02518 cytosol_to_ER_transport
LWK (without HLA)
GO:0007186 34.905 61 0.00001 0.00393 G-protein_coupled_receptor_signaling_pathway
GO:0045095 0.956 9 0.00001 0.00393 keratin_filament
GO:0016021 312.414 387 0.00001 0.00393 integral_to_membrane
GO:0004984 2.239 26 0.00001 0.00393 olfactory_receptor_activity
GO:0004930 24.507 55 0.00001 0.00393 G-protein_coupled_receptor_activity
GO:0005887 89.171 122 0.00005 0.01537 integral_to_plasma_membrane
GO:0007608 4.891 14 0.00005 0.01537 sensory_perception_of_smell
GO:0042605 0.531 5 0.00013 0.0347 peptide_antigen_binding
YRI
GO:0042612 0.139 4 0.00001 0.00114 MHC_class_I_protein_complex
GO:0042613 0.311 11 0.00001 0.00114 MHC_class_II_protein_complex
GO:0045095 1.034 9 0.00001 0.00114 keratin_filament
GO:0042605 0.605 7 0.00001 0.00114 peptide_antigen_binding
GO:0071556 0.743 12 0.00001 0.00114 integral_to_lumenal_side_of_endoplasmic_reticulum_membrane
GO:0032395 0.132 7 0.00001 0.00114 MHC_class_II_receptor_activity
GO:0016021 334.989 401 0.00001 0.00114 integral_to_membrane
GO:0002504 0.231 7 0.00001 0.00114 antigen_processing_and_presentation_of_peptide_or_polysaccharide_antigen_via_MHC_class_II
GO:0006955 15.313 36 0.00001 0.00114 immune_response
GO:0060333 3.529 13 0.00001 0.00114 interferon-gamma-mediated_signaling_pathway
GO:0002480 0.24 4 0.00001 0.00114 antigen_processing_and_presentation_of_exogenous_peptide_antigen_via_MHC_class_I,_TAP-independent
GO:0019882 1.401 16 0.00001 0.00114 antigen_processing_and_presentation
GO:0030658 2.646 10 0.00001 0.00114 transport_vesicle_membrane
GO:0030669 1.043 8 0.00001 0.00114 clathrin-coated_endocytic_vesicle_membrane
GO:0004984 2.416 21 0.00001 0.00114 olfactory_receptor_activity
GO:0012507 1.424 12 0.00001 0.00114 ER_to_Golgi_transport_vesicle_membrane
GO:0004930 26.334 51 0.00001 0.00114 G-protein_coupled_receptor_activity
GO:0007186 37.515 62 0.00003 0.00336 G-protein_coupled_receptor_signaling_pathway
GO:0032588 3.336 11 0.00003 0.00336 trans-Golgi_network_membrane
GO:0050911 0.244 4 0.00009 0.00993 detection_of_chemical_stimulus_involved_in_sensory_perception_of_smell
GO:0005576 92.15 123 0.00023 0.02592 extracellular_region
GO:0001916 0.707 5 0.00033 0.03707 positive_regulation_of_T_cell_mediated_cytotoxicity
YRI (without HLA)
GO:0007186 37.187 62 0.00001 0.00492 G-protein_coupled_receptor_signaling_pathway
GO:0045095 1.023 9 0.00001 0.00492 keratin_filament
GO:0004984 2.397 21 0.00001 0.00492 olfactory_receptor_activity
GO:0004930 26.103 51 0.00001 0.00492 G-protein_coupled_receptor_activity
GO:0016021 332.421 394 0.00002 0.00802 integral_to_membrane
GO:0005576 91.502 123 0.0004 0.03721 extracellular_region
GO:0050911 0.242 4 0.00013 0.03885 detection_of_chemical_stimulus_involved_in_sensory_perception_of_smell
GBR
GO:0042612 0.128 4 0.00001 0.00129 MHC_class_I_protein_complex
GO:0042613 0.285 9 0.00001 0.00129 MHC_class_II_protein_complex
GO:0042605 0.562 6 0.00001 0.00129 peptide_antigen_binding
GO:0071556 0.683 12 0.00001 0.00129 integral_to_lumenal_side_of_endoplasmic_reticulum_membrane
GO:0032395 0.12 6 0.00001 0.00129 MHC_class_II_receptor_activity
GO:0016021 312.004 411 0.00001 0.00129 integral_to_membrane
GO:0005576 85.721 124 0.00001 0.00129 extracellular_region
GO:0002504 0.211 6 0.00001 0.00129 antigen_processing_and_presentation_of_peptide_or_polysaccharide_antigen_via_MHC_class_II
GO:0060333 3.266 16 0.00001 0.00129 interferon-gamma-mediated_signaling_pathway
GO:0002480 0.221 4 0.00001 0.00129 antigen_processing_and_presentation_of_exogenous_peptide_antigen_via_MHC_class_I,_TAP-independent
GO:0019882 1.294 15 0.00001 0.00129 antigen_processing_and_presentation
GO:0030669 0.96 9 0.00001 0.00129 clathrin-coated_endocytic_vesicle_membrane
GO:0004984 2.241 21 0.00001 0.00129 olfactory_receptor_activity
GO:0012507 1.314 12 0.00001 0.00129 ER_to_Golgi_transport_vesicle_membrane
GO:0004930 24.487 50 0.00001 0.00129 G-protein_coupled_receptor_activity
GO:0007186 34.875 58 0.00002 0.00235 G-protein_coupled_receptor_signaling_pathway
GO:0006955 14.165 31 0.00002 0.00235 immune_response
GO:0030658 2.469 10 0.00003 0.00346 transport_vesicle_membrane
GO:0045095 0.956 7 0.00005 0.00569 keratin_filament
GO:0060337 1.679 8 0.00009 0.01008 type_I_interferon_signaling_pathway
GO:0032588 3.114 10 0.00014 0.01412 trans-Golgi_network_membrane
GO:0007608 4.894 13 0.00019 0.02054 sensory_perception_of_smell
GO:0001916 0.654 5 0.0002 0.02087 positive_regulation_of_T_cell_mediated_cytotoxicity
GBR (without HLA)
GO:0007186 34.566 58 0.00001 0.0039 G-protein_coupled_receptor_signaling_pathway
GO:0016021 309.57 404 0.00001 0.0039 integral_to_membrane
GO:0005576 85.005 124 0.00001 0.0039 extracellular_region
GO:0004984 2.217 21 0.00001 0.0039 olfactory_receptor_activity
GO:0004930 24.276 50 0.00001 0.0039 G-protein_coupled_receptor_activity
GO:0045095 0.948 7 0.00003 0.01045 keratin_filament
TSI
GO:0042613 0.304 8 0.00001 0.00163 MHC_class_II_protein_complex
GO:0042605 0.588 6 0.00001 0.00163 peptide_antigen_binding
GO:0071556 0.718 10 0.00001 0.00163 integral_to_lumenal_side_of_endoplasmic_reticulum_membrane
GO:0032395 0.129 5 0.00001 0.00163 MHC_class_II_receptor_activity
GO:0016021 325.611 414 0.00001 0.00163 integral_to_membrane
GO:0005576 89.537 140 0.00001 0.00163 extracellular_region
GO:0002504 0.225 6 0.00001 0.00163 antigen_processing_and_presentation_of_peptide_or_polysaccharide_antigen_via_MHC_class_II
GO:0060333 3.413 13 0.00001 0.00163 interferon-gamma-mediated_signaling_pathway
GO:0019882 1.357 13 0.00001 0.00163 antigen_processing_and_presentation
GO:0004984 2.347 23 0.00001 0.00163 olfactory_receptor_activity
GO:0012507 1.371 10 0.00001 0.00163 ER_to_Golgi_transport_vesicle_membrane
GO:0004930 25.566 48 0.00001 0.00163 G-protein_coupled_receptor_activity
GO:0007608 5.077 15 0.00002 0.00313 sensory_perception_of_smell
GO:0006955 14.835 31 0.00005 0.00762 immune_response
GO:0030669 1.007 7 0.00006 0.00861 clathrin-coated_endocytic_vesicle_membrane
GO:0060402 2.707 8 0.00017 0.02497 calcium_ion_transport_into_cytosol
GO:0030658 2.578 9 0.00018 0.02503 transport_vesicle_membrane
GO:0032588 3.244 10 0.00022 0.02925 trans-Golgi_network_membrane
GO:0042612 0.135 3 0.00023 0.02925 MHC_class_I_protein_complex
TSI (without HLA)
GO:0016021 323.377 407 0.00001 0.003894 integral_to_membrane
GO:0005576 88.921 140 0.00001 0.003894 extracellular_region
GO:0007608 5.05 15 0.00001 0.003894 sensory_perception_of_smell
GO:0004984 2.326 23 0.00001 0.003894 olfactory_receptor_activity
GO:0004930 25.39 48 0.00001 0.003894 G-protein_coupled_receptor_activity

S3 Table. Gene ontology enrichment analyses for outlier windows. The union
of significant windows for at least one of the tf values is used. tf, target fre-
quency used in NCD equation. FDR, false discovery rate. genes (sims), ex-
pected number of genes in this category (see Methods). genes (data), actual
number of genes in the category in the analyzed set. Because no categories
remained significant after removal of HLA genes, these sets are not reported.

GO term # genes (sims) # genes (data) p-value FDR Category description

YRI
GO:0019882 0.157 5 0.00005 0.00402 antigen_processing_and_presentation
GO:0030669 0.119 4 0.00005 0.00402 clathrin-coated_endocytic_vesicle_membrane
GO:0032395 0.015 3 0.00005 0.00402 MHC_class_II_receptor_activity
GO:0042613 0.034 5 0.00005 0.00402 MHC_class_II_protein_complex
GO:0071556 0.081 4 0.00005 0.00402 integral_to_lumenal_side_of_endoplasmic_reticulum_membrane
GO:0030658 0.354 5 0.00002 0.00674 transport_vesicle_membrane
GO:0012507 0.157 4 0.00004 0.01191 ER_to_Golgi_transport_vesicle_membrane
GO:0002504 0.025 3 0.00005 0.01318 antigen_processing_and_presentation_of_peptide_or_polysaccharide_antigen_via_MHC_class_II
GO:0031295 0.444 5 0.00011 0.02661 T_cell_costimulation
LWK
GO:0002504 0.029 5 0.00005 0.00129 antigen_processing_and_presentation_of_peptide_or_polysaccharide_antigen_via_MHC_class_II
GO:0019882 0.177 10 0.00005 0.00129 antigen_processing_and_presentation
GO:0019221 1.556 10 0.00005 0.00129 cytokine-mediated_signaling_pathway
GO:0030658 0.395 7 0.00005 0.00129 transport_vesicle_membrane
GO:0030669 0.133 6 0.00005 0.00129 clathrin-coated_endocytic_vesicle_membrane
GO:0032588 0.491 6 0.00005 0.00129 trans-Golgi_network_membrane
GO:0031295 0.495 7 0.00005 0.00129 T_cell_costimulation
GO:0012507 0.177 9 0.00005 0.00129 ER_to_Golgi_transport_vesicle_membrane
GO:0032395 0.016 5 0.00005 0.00129 MHC_class_II_receptor_activity
GO:0042612 0.017 3 0.00005 0.00129 MHC_class_I_protein_complex
GO:0042613 0.039 7 0.00005 0.00129 MHC_class_II_protein_complex
GO:0006955 1.986 16 0.00005 0.00129 immune_response
GO:0042605 0.074 4 0.00005 0.00129 peptide_antigen_binding
GO:0060333 0.455 9 0.00005 0.00129 interferon-gamma-mediated_signaling_pathway
GO:0071556 0.092 9 0.00005 0.00129 integral_to_lumenal_side_of_endoplasmic_reticulum_membrane
GO:0002480 0.029 3 0.00005 0.00129 antigen_processing_and_presentation_of_exogenous_peptide_antigen_via_MHC_class_I,_TAP-independent
GO:0001916 0.089 3 0.00008 0.01018 positive_regulation_of_T_cell_mediated_cytotoxicity
GO:0019886 0.775 6 0.0001 0.01099 antigen_processing_and_presentation_of_exogenous_peptide_antigen_via_MHC_class_II
GO:0030666 0.752 6 0.0001 0.01099 endocytic_vesicle_membrane
GO:0005765 2.206 10 0.0001 0.01099 lysosomal_membrane
GO:0032689 0.099 3 0.00012 0.01257 negative_regulation_of_interferon-gamma_production
GO:0030670 0.305 4 0.00035 0.03549 phagocytic_vesicle_membrane
TSI
GO:0002504 0.027 5 0.00005 0.00131 antigen_processing_and_presentation_of_peptide_or_polysaccharide_antigen_via_MHC_class_II
GO:0019882 0.17 9 0.00005 0.00131 antigen_processing_and_presentation
GO:0019221 1.477 9 0.00005 0.00131 cytokine-mediated_signaling_pathway
GO:0030658 0.375 6 0.00005 0.00131 transport_vesicle_membrane
GO:0030669 0.126 5 0.00005 0.00131 clathrin-coated_endocytic_vesicle_membrane
GO:0031295 0.467 6 0.00005 0.00131 T_cell_costimulation
GO:0012507 0.168 8 0.00005 0.00131 ER_to_Golgi_transport_vesicle_membrane
GO:0032395 0.016 4 0.00005 0.00131 MHC_class_II_receptor_activity
GO:0042612 0.016 3 0.00005 0.00131 MHC_class_I_protein_complex
GO:0042613 0.036 6 0.00005 0.00131 MHC_class_II_protein_complex
GO:0006955 1.898 15 0.00005 0.00131 immune_response
GO:0042605 0.072 4 0.00005 0.00131 peptide_antigen_binding
GO:0060333 0.434 9 0.00005 0.00131 interferon-gamma-mediated_signaling_pathway
GO:0071556 0.088 8 0.00005 0.00131 integral_to_lumenal_side_of_endoplasmic_reticulum_membrane
GO:0002480 0.029 3 0.00005 0.00131 antigen_processing_and_presentation_of_exogenous_peptide_antigen_via_MHC_class_I,_TAP-independent
GO:0001916 0.085 3 0.00005 0.00628 positive_regulation_of_T_cell_mediated_cytotoxicity
GO:0060337 0.214 4 0.00005 0.00628 type_I_interferon_signaling_pathway
GO:0032588 0.466 5 0.00006 0.0072 trans-Golgi_network_membrane
GO:0030670 0.291 4 0.00028 0.03158 phagocytic_vesicle_membrane
GBR
GO:0002504 0.024 6 0.00005 0.00112 antigen_processing_and_presentation_of_peptide_or_polysaccharide_antigen_via_MHC_class_II
GO:0019882 0.151 11 0.00005 0.00112 antigen_processing_and_presentation
GO:0019221 1.332 10 0.00005 0.00112 cytokine-mediated_signaling_pathway
GO:0030658 0.339 7 0.00005 0.00112 transport_vesicle_membrane
GO:0030669 0.112 6 0.00005 0.00112 clathrin-coated_endocytic_vesicle_membrane
GO:0032588 0.421 6 0.00005 0.00112 trans-Golgi_network_membrane
GO:0031295 0.421 6 0.00005 0.00112 T_cell_costimulation
GO:0012507 0.151 9 0.00005 0.00112 ER_to_Golgi_transport_vesicle_membrane
GO:0032395 0.014 5 0.00005 0.00112 MHC_class_II_receptor_activity
GO:0042612 0.014 3 0.00005 0.00112 MHC_class_I_protein_complex

129
Capítulo 1

GO:0042613 0.033 7 0.00005 0.00112 MHC_class_II_protein_complex


GO:0006955 1.7 16 0.00005 0.00112 immune_response
GO:0042605 0.064 4 0.00005 0.00112 peptide_antigen_binding
interferon-gamma-mediated_
GO:0060333 0.391 10 0.00005 0.00112
signaling_pathway
GO:0005765 1.871 10 0.00005 0.00112 lysosomal_membrane
integral_to_lumenal_side_of_
GO:0071556 0.079 9 0.00005 0.00112
endoplasmic_reticulum_membrane
antigen_processing_and_presentation_of_
GO:0002480 0.025 3 0.00005 0.00112 exogenous_peptide_antigen_via_MHC_
class_I,_TAP-independent
GO:0030666 0.644 6 0.00002 0.00222 endocytic_vesicle_membrane
GO:0001916 0.074 3 0.00003 0.00311 positive_regulation_of_T_cell_mediated_cytotoxicity
GO:0060337 0.195 4 0.00003 0.00311 type_I_interferon_signaling_pathway
GO:0019886 0.662 6 0.00005 0.00497 antigen_processing_and_presentation_of_exogenous_peptide_antigen_via_MHC_class_II
GO:0050852 0.963 6 0.00017 0.01696 T_cell_receptor_signaling_pathway
GO:0030670 0.26 4 0.0002 0.01922 phagocytic_vesicle_membrane
GO:0002479 0.298 4 0.00029 0.02645 antigen_processing_and_presentation_of_exogenous_peptide_antigen_via_MHC_class_I,_TAP-dependent
GO:0042590 0.32 4 0.00032 0.02834 antigen_processing_and_presentation_of_exogenous_peptide_antigen_via_MHC_class_I
GO:0050776 0.362 4 0.00053 0.04594 regulation_of_immune_response

S4 Table. Phenotype ontology (PO) enrichment analysis. FDR, false discovery
rate (see Methods). Gene sets without any significant enrichment are not
shown. # genes (sims), expected number of genes in this category (see Methods).
# genes (data), actual number of genes in the category in the analyzed set. tf,
target frequency whose set of significant windows showed the enrichment.
YRI
tf PO term # genes (sims) # genes (data) p-value FDR Description
0.5 HP:0000591 0.023 3 0.00007 0.04093 Abnormality_of_the_sclera
YRI (without HLA)
0.5 HP:0000591 0.023 3 0.00006 0.04216 Abnormality_of_the_sclera


S5 Table. Tissue-specific (TG) expression enrichment analysis. The union of
windows that are significant for at least one tf is used. FDR, false discovery
rate (see Methods). Gene sets without any significant enrichment are not shown.
# genes (sims), expected number of genes in this category (see Methods).
# genes (data), actual number of genes in the category in the analyzed set.
TSI
tf TG term # genes (sims) # genes (data) p-value FDR Description
Union TG:02 5.417 12 0.00261 0.03046 adrenal
TSI (without HLA)
Union TG:02 5.406 12 0.00266 0.02877 adrenal
GBR
Union TG:10 14.868 25 0.00419 0.04943 lung

S6 Table. Assigned tf values. Reported values are numbers of significant (top)
and outlier (bottom) windows (see Methods). The union of windows that are
significant or outlier for at least one of the tf values was used and is shown in
the last column. Percentages refer to the proportion of windows with a given
assigned tf value (the one that minimizes NCD2, see Methods). "|" denotes
"or", i.e., a window that is assigned to more than one tf.
Significant (columns are assigned target frequencies)
POP 0.3 0.4 0.5 0.4|0.3 0.5|0.4 0.3|0.4|0.5 Union
LWK 4049(52%) 1002(13%) 2705(35%) 2 10 2 7770
YRI 4481(53%) 1083(13%) 2863(34%) 3 4 2 8436
GBR 4217(49%) 1062(13%) 3238(38%) 3 6 0 8526
TSI 4080(49%) 1172(14%) 3133(37%) 4 6 0 8395
Outlier (columns are assigned target frequencies)
POP 0.3 0.4 0.5 0.4|0.3 0.5|0.4 0.3|0.4|0.5 Union
LWK 565(50%) 142 424(37%) 1 5 2 1139
YRI 587(51%) 144 404(36%) 2 3 0 1142
GBR 584(52%) 129 417(37%) 0 1 0 1131
TSI 571(49%) 148 440(38%) 3 1 0 1163
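
The assignment rule described in the S6 Table caption (each window receives the tf that minimizes NCD2, with ties joined by "|") can be illustrated with a minimal sketch. It assumes that NCD2 is the root-mean-square distance between the minor allele frequencies of the informative sites and tf, with fixed differences (FDs) counted as frequency 0; the functions `ncd2` and `assign_tf` and the tolerance `tol` are hypothetical names introduced for illustration and are not taken from the thesis code.

```python
import math

def ncd2(tf, snp_mafs, n_fd):
    """Assumed NCD2: RMS distance of site frequencies from tf (FDs count as 0)."""
    n = len(snp_mafs) + n_fd  # number of informative sites (SNPs + FDs)
    if n == 0:
        raise ValueError("window has no informative sites")
    ss = sum((p - tf) ** 2 for p in snp_mafs) + n_fd * tf ** 2
    return math.sqrt(ss / n)

def assign_tf(snp_mafs, n_fd, tfs=(0.3, 0.4, 0.5), tol=1e-12):
    """Return the tf label(s) with the lowest NCD2, joined by '|' on ties."""
    scores = {tf: ncd2(tf, snp_mafs, n_fd) for tf in tfs}
    best = min(scores.values())
    winners = [tf for tf in tfs if scores[tf] - best <= tol]
    return "|".join(str(tf) for tf in winners), best

# A window whose SNPs sit halfway between 0.3 and 0.4 is assigned to both.
label, score = assign_tf([0.35] * 10, n_fd=0)
print(label, round(score, 3))
```

Under this assumed definition, ties can only occur when the frequency spectrum is equidistant from two target frequencies, which is consistent with the very small counts in the "|" columns.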

S7 Table. List of outlier genes. This is the same list reported in Table 3, but
including additional information. In purple, "African" genes; in orange,
"European" genes; and in green, "African and European" genes (see main text).
P, p-value of the most extreme window overlapping the gene; tf, assigned target
frequency of the window with the lowest p-value. When a gene is "African" or
"European" but one of the populations from the other continent also has an
extreme window for the gene, it is highlighted with the same color code.

Chr Gene tf(YRI) P(YRI) tf(LWK) P(LWK) tf(GBR) P(GBR) tf(TSI) P(TSI)


9 ABO 0.3 0.000422957 0.3 0.000131178 0.3 0.000674892 0.3 0.000782164
4 ADAM29 0.5 0.000491611 0.5 0.000233546 0.3 0.001198991 0.3 0.001785001
6 AIM1 0.5 0.000268486 0.5 0.000469543 0.3 0.002997477 0.4 0.000785842
2 ALK 0.3 0.0000883 0.3 0.000119531 0.3 0.00053881 0.3 0.000380661
22 ARHGAP8 0.3 0.000361046 0.4 0.000334075 0.3 0.001790517 0.3 0.000108498
14 ATXN3 0.5 0.000470769 0.5 0.000245805 0.3 0.0006847 0.3 0.000549231
1 BCAR3 0.3 0.000410697 0.5 0.000390469 0.3 0.009124835 0.3 0.004408559
12 C12orf54 0.3 0.00038863 0.4 0.000418666 0.3 0.000673666 0.4 0.000638726
1 C1orf101 0.4 0.00000368 0.3 0.0000153 0.3 0.014656988 0.3 0.000584171
22 C22orf34 0.5 0.0000791 0.5 0.000095 0.3 0.000426635 0.3 0.001254159
13 COG6 0.5 0.000468317 0.5 0.000399051 0.5 0.000622789 0.5 0.000578654
4 COL25A1 0.5 0.000366563 0.5 0.000457284 0.4 0.000483029 0.3 0.000744159
10 CUBN 0.4 0.000409471 0.4 0.000257452 0.3 0.000359207 0.3 0.000757644
2 DIRC3 0.3 0.000182055 0.3 0.000141599 0.3 0.004005218 0.3 0.005316384
18 DTNA 0.3 0.000357981 0.3 0.000205349 0.3 0.007393164 0.3 0.001966443
3 EPHA6 0.3 0.000223738 0.5 0.0000895 0.3 0.006470627 0.3 0.000641791
3 FRMD4B 0.5 0.000164892 0.3 0.000186959 0.3 0.000654051 0.3 0.003204665
16 GPR114 0.5 0.000487933 0.5 0.000225577 0.3 0.017186761 0.3 0.025924191
7 GTF2IRD1 0.3 0.000126887 0.3 0.000216382 0.3 0.082588766 0.5 0.084745846
6 HLA-DQA2 0.3 0.000209639 0.3 0.00029607 0.3 0.002912273 0.3 0.002816648
4 IGFBP7 0.3 0.0000938 0.3 0.000172861 0.3 0.000399664 0.3 0.000703702
1 LGALS8 0.5 0.000361659 0.5 0.000196767 0.3 0.000831815 0.5 0.000490998
4 LGI2 0.5 0.000375144 0.5 0.000463414 0.5 0.000827524 0.5 0.000653438
11 LUZP2 0.3 0.0000276 0.3 0.0000227 0.3 0.002680566 0.3 0.002470926
8 MYOM2 0.5 0.000084 0.5 0.0000821 0.3 0.000918858 0.3 0.000866755
18 NFATC1 0.5 0.000205349 0.5 0.000217608 0.3 0.003760638 0.3 0.002065746
11 OR52A1 0.4 0.0000398 0.3 0.0000184 0.3 0.000447476 0.3 0.001662404
6 PACRG 0.5 0.000130565 0.5 0.000123209 0.5 0.000765 0.5 0.000776034
1 PADI2 0.5 0.000345721 0.4 0.000433378 0.5 0.000478738 0.4 0.000561491
3 PARP15 0.4 0.00021393 0.5 0.000123822 0.5 0.000364724 0.4 0.000834267
6 PDE10A 0.5 0.00023232 0.5 0.000338365 0.3 0.002663402 0.3 0.002801936
22 PRR5-ARHGAP8 0.3 0.000361046 0.4 0.000334075 0.3 0.001790517 0.3 0.000108498
12 PTPRB 0.5 0.000307103 0.5 0.000177764 0.3 0.002825229 0.3 0.00122351
11 PTS 0.5 0.0000828 0.5 0.000101755 0.3 0.000357368 0.3 0.001069039
6 RNF39 0.4 0.000328558 0.3 0.0000944 0.3 0.003772898 0.3 0.000408858

15 RP11-96O20.4 0.5 0.000389243 0.5 0.000419892 0.3 0.001878787 0.3 0.001026743
10 SFTPD 0.3 0.000194315 0.3 0.0000147 0.3 0.023205621 0.3 0.004575903
8 SGCZ 0.5 0.000256226 0.5 0.00048119 0.3 0.000722705 0.4 0.000495289
6 SLC17A5 0.5 0.000250709 0.5 0.00033101 0.5 0.008776049 0.3 0.004632297
11 SLC35F2 0.5 0.000266034 0.5 0.00031875 0.5 0.009484655 0.5 0.011214487
1 SPRR3 0.5 0.000357368 0.5 0.000496515 0.5 0.000538197 0.5 0.000263582
20 SPTLC3 0.5 0.000460962 0.4 0.000395373 0.4 0.000560878 0.5 0.000449315
15 SQRDL 0.5 0.000389243 0.5 0.000419892 0.3 0.001878787 0.3 0.001026743
5 STK32A 0.5 0.000274615 0.5 0.000300361 0.5 0.000370241 0.3 0.000521034
14 STXBP6 0.5 0.000163053 0.5 0.000146502 0.3 0.001623787 0.3 0.00021393
20 TGM6 0.3 0.0000638 0.3 0.00013363 0.3 0.000148341 0.3 0.000667536
13 TMCO3 0.5 0.000337753 0.5 0.00046464 0.3 0.001407404 0.3 0.001530613
16 WWOX 0.3 0.000407019 0.5 0.000427248 0.3 0.000239676 0.3 0.00133875
19 ZNF331 0.5 0.000378209 0.5 0.000473834 0.3 0.00091089 0.3 0.000663245
3 ALDH1L1 0.3 0.000354303 0.3 0.000328558 0.4 0.001094171 0.4 0.001080072
22 CELSR1 0.3 0.000129952 0.3 0.000253774 0.3 0.007483272 0.3 0.010092732
5 COMMD10 0.3 0.000278906 0.3 0.00023232 0.3 0.000710445 0.3 0.001828522
2 MLPH 0.3 0.000350625 0.4 0.000460962 0.3 0.00040518 0.3 0.002850975
18 NEDD4L 0.5 0.000389856 0.3 0.000369015 0.3 0.000840397 0.5 0.001739027
14 OR6J1 0.3 0.000134856 0.3 0.000158762 0.3 0.000568233 0.3 0.000383726
6 SLC22A16 0.3 0.000498354 0.3 0.000435829 0.3 0.010716746 0.3 0.00173351
3 SUMF1 0.3 0.000151406 0.3 0.000326719 0.3 0.00174577 0.3 0.000546166
17 ZZEF1 0.3 0.000253161 0.3 0.000253161 0.3 0.000133017 0.3 0.000502644
15 C15orf48 0.5 0.000165505 0.3 0.00036595 0.3 0.517718828 0.3 0.471580976
6 CCHCR1 0.3 0.000426022 0.3 0.000300361 0.3 0.000782164 0.3 0.000416827
3 CLDN16 0.3 0.000385565 0.3 0.000255613 0.3 0.014823106 0.3 0.010374703
8 EXTL3 0.3 0.000269712 0.3 0.000457897 0.5 0.064383231 0.3 0.00184446
2 IL37 0.3 0.0000368 0.3 0.000430313 0.3 0.007752983 0.3 0.001856106
5 NR3C1 0.3 0.000401503 0.3 0.000410084 0.3 0.000630757 0.3 0.000290553
1 PGLYRP4 0.4 0.0000975 0.3 0.000274615 0.3 0.005533992 0.3 0.006012117
5 SLC27A6 0.3 0.000466479 0.3 0.000497741 0.5 0.00057988 0.4 0.000313233
8 STAU2 0.3 0.000329171 0.3 0.000444411 0.3 0.00096238 0.3 0.000500805
12 TMEM132D 0.5 0.000429087 0.3 0.000454832 0.3 0.00082875 0.5 0.001126659
11 TMEM135 0.3 0.000266647 0.3 0.000291166 0.3 0.000403954 0.3 0.000597043
17 WSCD1 0.5 0.000492837 0.3 0.000438894 0.3 0.001060457 0.3 0.000837945
1 ZNF670 0.3 0.00026726 0.3 0.000476899 0.3 0.000495289 0.3 0.001072717
1 ZNF695 0.3 0.00026726 0.3 0.000476899 0.3 0.000495289 0.3 0.001072717
12 AC121757.1 0.5 0.000594592 0.5 0.000751515 0.5 0.000419279 0.5 0.000359207


10 ADAM12 0.3 0.000508161 0.3 0.000504483 0.5 0.000144664 0.3 0.000123822


20 ADRA1D 0.3 0.000798714 0.3 0.000883919 0.3 0.000235998 0.3 0.000304652
6 AL590867.1 0.3 0.000575589 0.3 0.001535517 0.5 0.000351238 0.5 0.00036595
17 B3GNTL1 0.3 0.002595974 0.4 0.001552068 0.3 0.000171635 0.3 0.00041744
10 BICC1 0.3 0.001851816 0.3 0.002502801 0.5 0.000407632 0.5 0.000411923
1 C1orf222 0.3 0.00446434 0.3 0.000803618 0.3 0.00000797 0.3 0.000116466
13 CCDC169 0.3 0.000228642 0.3 0.000780325 0.5 0.00015631 0.5 0.000118918
13 CCDC169-SOHLH2 0.3 0.000228642 0.3 0.000780325 0.5 0.00015631 0.5 0.000118918
17 CEP112 0.3 0.000579267 0.3 0.000451154 0.5 0.000258678 0.4 0.000275841
1 CNR2 0.3 0.000635661 0.5 0.000484868 0.5 0.000120144 0.5 0.000181442
7 CNTNAP2 0.3 0.000609303 0.3 0.000894339 0.4 0.000355529 0.3 0.000498967
3 CPNE4 0.5 0.000777873 0.3 0.000324267 0.5 0.000391082 0.5 0.000399051
8 CSMD1 0.3 0.000632596 0.3 0.000731899 0.3 0.0000374 0.3 2.57452E-05
4 FRAS1 0.5 0.000617885 0.4 0.000409471 0.5 0.000406406 0.3 0.000336527
9 FXN 0.5 0.000689604 0.5 0.000559652 0.5 0.000375144 0.5 0.000418053
9 GABBR2 0.3 0.000416827 0.3 0.000905986 0.3 0.000141599 0.3 0.000375144
22 GRAMD4 0.5 0.006669846 0.5 0.0068188 0.5 0.000457284 0.5 0.000486707
10 GRID1 0.3 0.013559139 0.3 0.00317708 0.5 0.0000215 0.3 4.78125E-05
12 GRIP1 0.4 0.002874268 0.3 0.000784003 0.3 0.000318137 0.4 0.000397212
7 HUS1 0.5 0.000526551 0.5 0.000349399 0.5 0.000489159 0.5 0.000335914
8 IDO2 0.5 0.000663245 0.5 0.000345721 0.4 0.000384339 0.5 0.000200445
2 IL18R1 0.3 0.001869592 0.3 0.000394147 0.5 0.000265421 0.4 0.000315685
2 IL1RL1 0.3 0.001869592 0.3 0.000394147 0.5 0.000265421 0.4 0.000315685
13 KL 0.5 0.000530842 0.5 0.000665084 0.3 0.000302813 0.3 0.000478738
12 KRT83 0.5 0.000594592 0.5 0.000751515 0.5 0.000419279 0.5 0.000359207
1 LAMC2 0.3 0.000373305 0.5 0.001037164 0.5 0.000218221 0.5 0.000234159
18 LDLRAD4 0.4 0.001146274 0.5 0.001184279 0.5 0.000386791 0.5 0.000328558
11 MYO7A 0.5 0.000855108 0.5 0.00084714 0.5 0.000494063 0.5 0.000434604
14 NRXN3 0.3 0.001166503 0.3 0.001046972 0.3 0.000264808 0.5 0.000215156
1 NSUN4 0.3 0.000539423 0.3 0.001890433 0.3 0.000136082 0.3 0.000272164
12 NTN4 0.3 0.000913955 0.5 0.002137465 0.5 0.000389856 0.4 0.000362272
12 OVCH1-AS1 0.3 0.000649147 0.3 0.000624628 0.5 0.000339591 0.5 0.000262356
6 PKHD1 0.5 0.001134628 0.5 0.001141371 0.5 0.000257452 0.5 0.000239676
1 PPAP2B 0.3 0.000578041 0.3 0.000499579 0.5 0.000363498 0.5 0.000346334
14 PSMC1 0.5 0.000457897 0.5 0.000562717 0.5 0.00048119 0.5 0.000406406
16 RBFOX1 0.5 0.000643017 0.3 0.000335914 0.5 0.000492837 0.4 0.00045851
1 REG4 0.3 0.000530842 0.3 0.000849592 0.5 0.000409471 0.5 0.000314459


6 RUNX2 0.5 0.000683474 0.5 0.000625854 0.5 0.00011524 0.3 0.000229255


4 SEPT11 0.3 0.000912116 0.4 0.000619724 0.5 0.000256226 0.3 0.000395986
13 SGCG 0.3 0.00030649 0.3 0.00086369 0.5 0.0000944 0.4 0.000102981
19 SLC1A6 0.5 0.000507548 0.5 0.000115853 0.5 0.000169183 0.5 0.000129952
4 SLC2A9 0.3 0.000534519 0.3 0.000558426 0.5 0.00034143 0.5 0.000240289
13 SOHLH2 0.3 0.000228642 0.3 0.000780325 0.5 0.00015631 0.5 0.000118918
9 SVEP1 0.3 0.000692669 0.3 0.000551683 0.3 0.000147115 0.3 0.000450541
11 TCP11L1 0.5 0.000770517 0.4 0.000732512 0.5 0.000476899 0.4 0.000497128
4 TEC 0.3 0.000610529 0.3 0.000298522 0.3 0.000319363 0.5 0.00022006
1 TNN 0.3 0.000291166 0.3 0.000523486 0.5 0.000209027 0.5 0.000171635
2 TNS1 0.4 0.000453606 0.5 0.000617885 0.5 0.000299135 0.5 0.000346947
11 TRIM5 0.3 0.000370853 0.5 0.000619111 0.5 0.000204123 0.5 0.000217608
9 TRPM3 0.5 0.000587236 0.5 0.000676731 0.5 0.000410697 0.4 0.000348786
3 VPS8 0.5 0.000796875 0.5 0.000364111 0.5 0.0003825 0.5 0.000335301
8 WDYHV1 0.3 0.000932344 0.3 0.001286034 0.3 0.000329784 0.3 0.000297909
16 CDH5 0.4 0.001592525 0.3 0.000656503 0.3 0.000305878 0.3 0.000150793
5 ITGA1 0.3 0.002080457 0.3 0.001984832 0.3 0.000424183 0.3 0.000488546
9 KDM4C 0.5 0.004607165 0.3 0.002454989 0.3 0.000401503 0.3 0.000391082
11 ALG8 0.4 0.001277452 0.5 0.00042357 0.3 0.000361659 0.3 0.000312007
4 ATP10D 0.3 0.003962309 0.3 0.001068426 0.3 0.000413762 0.3 0.000486707
2 COL4A3 0.3 0.000563942 0.3 0.001126659 0.3 0.000434604 0.3 0.00045238
17 CRHR1 0.5 0.017642818 0.5 0.019007314 0.3 0.000390469 0.5 0.000369628
18 DOK6 0.3 0.001796647 0.3 0.001343654 0.3 0.00045851 0.3 0.000374531
19 GNA15 0.3 0.000422344 0.3 0.000527777 0.3 0.000274615 0.3 0.000152632
6 HLA-DRA 0.3 0.001896563 0.4 0.000440733 0.3 0.000456671 0.3 0.000398438
3 KALRN 0.5 0.001564327 0.3 0.000631983 0.3 0.000498354 0.5 0.000235385
17 KANSL1 0.3 0.0643912 0.3 0.095343674 0.3 0.000445637 0.4 0.000327945
12 OAS1 0.3 0.122187337 0.3 0.122196532 0.3 0.000478738 0.5 0.000413149
7 ORC5 0.3 0.000741707 0.3 0.000407019 0.3 0.000367176 0.3 0.000295457
1 PGBD5 0.3 0.000567007 0.3 0.000665084 0.3 0.000422957 0.3 0.000425409
9 POLR1E 0.3 0.000843462 0.3 0.000389856 0.3 0.000304039 0.3 0.000355529
4 RASSF6 0.3 0.000520421 0.3 0.000393534 0.3 0.000259904 0.5 0.000121983
7 SKAP2 0.3 0.001242512 0.5 0.002187729 0.3 0.000438894 0.3 0.000449315
19 ZNF254 0.3 0.015283455 0.3 0.010679968 0.3 0.000269099 0.3 0.000334688
10 APBB1IP 0.5 0.00024274 0.5 0.000327945 0.5 0.000223125 0.5 0.000201671
17 B4GALNT2 0.5 0.0000343 0.5 0.0000343 0.5 0.000118305 0.5 0.000148341
12 BICD1 0.3 0.0000778 0.3 0.000112788 0.3 0.000377596 0.3 0.000362885
1 C1orf68 0.5 0.000326719 0.5 0.000231707 0.5 0.000437055 0.5 0.000378209
10 CAMK1D 0.5 0.0000533 0.5 0.000038 0.5 0.00000306 0.5 4.90385E-06


4 CPE 0.3 0.0000227 0.3 0.0000288 0.3 0.0000386 0.3 3.00361E-05


10 DMBT1 0.5 0.000106046 0.5 0.000145889 0.5 0.000224964 0.3 0.000498354
1 EDARADD 0.5 0.000117079 0.5 0.000102368 0.3 0.000348173 0.3 0.000222512
14 EGLN3 0.3 0.0000944 0.3 0.0000729 0.5 0.000136082 0.3 0.000152019
1 ERO1LB 0.5 0.00031262 0.5 0.000270938 0.5 0.000335914 0.5 0.000264808
22 FAM19A5 0.3 0.000288714 0.3 0.000169796 0.3 0.000432152 0.3 0.000275228
19 FCER2 0.5 0.000155697 0.5 0.000147115 0.5 0.000274002 0.3 0.000250709
1 GPR137B 0.5 0.00031262 0.5 0.000270938 0.5 0.000335914 0.5 0.000264808
4 GPRIN3 0.5 0.000144664 0.5 0.000175313 0.5 0.000263582 0.5 0.000310168
11 HBE1 0.5 0.000262969 0.5 0.00022619 0.5 0.000220673 0.5 0.000459123
11 HBG2 0.5 0.000262969 0.5 0.00022619 0.5 0.000220673 0.5 0.000459123
7 HIP1 0.5 0.000225577 0.5 0.0000313 0.5 0.00003 0.5 3.67789E-06
6 HLA-B 0.3 0.00012137 0.3 0.0000846 0.3 0.000106659 0.3 0.000117079
6 HLA-C 0.3 0.0000405 0.3 0.0000417 0.3 0.0001275 0.3 0.000138534
6 HLA-DPA1 0.4 0.0000797 0.5 0.0000251 0.3 0.000215156 0.3 0.000221899
6 HLA-DPB1 0.5 0.0000932 0.5 0.0000509 0.4 0.000142825 0.3 8.82693E-05
6 HLA-DQA1 0.3 0.00000245 0.5 0.00000184 0.3 0.0000552 0.3 1.53245E-05
6 HLA-DQB1 0.3 0.000135469 0.3 0.00018512 0.3 0.000108498 0.3 0.00012137
6 HLA-DQB2 0.3 0.0000717 0.4 0.0000564 0.5 0.000365337 0.3 0.000346334
6 HLA-DRB1 0.3 0.000109111 0.3 0.000191863 0.3 0.000087 0.3 0.000137921
6 HLA-DRB5 0.3 0.0000251 0.5 0.0000264 0.4 0.0000233 0.3 3.49399E-05
6 HLA-G 0.4 0.000186959 0.5 0.00023845 0.5 0.00041744 0.5 0.000258678
2 LRP1B 0.5 0.000197993 0.5 0.000200445 0.5 0.000291779 0.3 0.000266034
4 MANBA 0.3 0.000459123 0.5 0.000216382 0.5 0.00033101 0.5 0.000383113
11 MMP26 0.3 0.0000901 0.3 0.0000828 0.3 0.00000368 0.3 0.000161214
2 MROH2A 0.4 0.000180829 0.4 0.000182668 0.5 0.000412536 0.5 0.000192476
10 MYO3A 0.3 0.0000343 0.3 0.0000362 0.5 0.0000153 0.5 1.40986E-05
3 MYRIP 0.5 0.000134856 0.4 0.000175926 0.3 0.000321815 0.3 0.000300974
21 NCAM2 0.5 0.000151406 0.5 0.000110337 0.4 0.0000221 0.3 6.74279E-05
2 NDUFA10 0.5 0.0000282 0.3 0.0000558 0.3 0.000226803 0.3 0.000253774
10 OLAH 0.3 0.000451154 0.3 0.000200445 0.3 0.0000797 0.5 4.41346E-05
6 PARK2 0.5 0.000416214 0.5 0.00036595 0.5 0.000272164 0.5 0.000375144
10 PCDH15 0.3 0.0000184 0.4 0.0000196 0.5 0.000129952 0.3 0.000128113
10 PDSS1 0.3 0.000353077 0.4 0.000221286 0.4 0.000465253 0.5 0.000452993
6 PHACTR2 0.5 0.000231707 0.5 0.000369015 0.3 0.000451767 0.4 0.000219447
20 PLCB4 0.5 0.000153858 0.5 0.000408858 0.5 0.000183894 0.5 0.000188798
21 PRDM15 0.3 0.000149567 0.5 0.000145276 0.3 0.0000429 0.4 7.66226E-05
8 PREX2 0.4 0.000124435 0.5 0.000196154 0.5 0.000308329 0.5 0.000190637
20 PROKR2 0.5 0.00000184 0.4 0.00000306 0.5 0.000000613 0.5 1.22596E-06


6 RP11-257K9.8 0.5 0.0000172 0.5 0.0000319 0.5 0.000152019 0.5 8.58173E-05


2 SH3RF3 0.3 0.000102368 0.5 0.000076 0.3 0.000270938 0.3 0.000134856
20 SIRPA 0.5 0.000207188 0.5 0.000202897 0.3 0.00046893 0.3 0.000419892
8 SLA 0.3 0.000304652 0.3 0.000122596 0.3 0.000258065 0.3 8.03005E-05
11 SNX19 0.5 0.000120757 0.5 0.000125661 0.5 0.000164892 0.5 0.000166731
2 SPAG16 0.3 0.000342043 0.3 0.000318137 0.5 0.000189411 0.3 0.000199832
3 SUCLG2 0.3 0.000270325 0.4 0.000270938 0.5 0.000122596 0.5 0.000326106
5 SV2C 0.5 0.000125048 0.5 0.0000932 0.4 0.0000276 0.3 1.47115E-05
8 TG 0.3 0.000304652 0.3 0.000122596 0.3 0.000258065 0.3 8.03005E-05
21 TMPRSS2 0.4 0.000100529 0.5 0.000136695 0.5 0.0000484 0.5 2.51322E-05
12 TMTC2 0.5 0.000132404 0.5 0.0000423 0.3 0.000374531 0.3 0.000255
4 UGT2B4 0.3 0.000212091 0.5 0.000314459 0.3 0.000131178 0.3 0.000142212
3 WNT7A 0.3 0.00022619 0.3 0.000118918 0.5 0.000241515 0.5 0.000165505
6 ZC3H12D 0.5 0.000185733 0.5 0.000275841 0.5 0.0000552 0.3 9.8077E-05
3 ZNF385D 0.3 0.000438281 0.3 0.000299748 0.5 0.000304039 0.5 0.00026726
19 ZNF83 0.5 0.0000766 0.5 0.000174087 0.5 0.000135469 0.5 0.000117692
6 CDSN 0.3 0.000153858 0.3 0.000184507 0.4 0.000143438 0.4 0.00011524
12 CHST11 0.5 0.000250096 0.3 0.000182055 0.3 0.000292392 0.3 0.000245805
1 FMN2 0.3 0.00020351 0.3 0.000103594 0.3 0.000228029 0.3 2.69712E-05
6 PSORS1C1 0.3 0.000153858 0.3 0.000184507 0.4 0.000143438 0.4 0.00011524
10 CTNNA3 0.5 0.000488546 0.3 0.000342043 0.4 0.0000405 0.3 4.96515E-05
12 FAM101A 0.4 0.000371466 0.3 0.000301587 0.5 0.000249483 0.5 0.000126887
1 RIMKLA 0.3 0.000084 0.3 0.000321815 0.3 0.000114014 0.3 0.000163666
3 SPATA16 0.5 0.000394147 0.3 0.000436442 0.3 0.0000778 0.3 0.000181442
2 THSD7B 0.3 0.000376983 0.3 0.000467704 0.5 0.000251935 0.5 0.000281358

S8 Table. Significant genes. Significant genes pass the significance criteria in at
least two populations from the same continent. See main text and Supplementary
Text 2.

ABAT CCDC38 ETFB KCNH5 PRR5L SORCS2 WDR27


ABCA12 CCDC50 ETV1 KCNIP1 PRRC1 SORCS3 WDR64
ABCA13 CCDC57 EVC2 KCNIP4 PRSS38 SP100 WDR72
ABCA4 CCDC85C EXO1 KCNJ6 PRSS45 SP110 WDR75
ABCB11 CCDC91 EXOC2 KCNK2 PRSS50 SP140L WDR93
ABCC2 CCHCR1 EXOC3L2 KCNMA1 PSD3 SPACA3 WDYHV1
ABCC4 CCNG2 EXOC4 KCNMB2 PSMC1 SPAG16 WFDC3
ABCC8 CCSER1 EXOC7 KCNQ1 PSMG4 SPARCL1 WFDC8


ABCD4 CD70 EXTL3 KCNQ3 PSORS1C1 SPATA13 WFIKKN2


ABCG1 CD84 EYA4 KCNQ5 PSTPIP2 SPATA16 WNT7A
ABI1 CD96 EYS KCNS3 PTCHD3 SPATA22 WNT9B
ABTB2 CDA F13A1 KCTD8 PTCHD4 SPATA3 WSCD1
AC004466.1 CDC42BPA F5 KDM4C PTGFRN SPATC1L WSCD2
AC004824.2 CDH12 FABP2 KHNYN PTH2R SPEF2 WWC1
AC023469.1 CDH13 FAHD1 KIAA0196 PTP4A3 SPHKAP WWOX
AC073528.1 CDH18 FAM101A KIAA0319 PTPRB SPINK2 WWTR1
AC087645.1 CDH22 FAM114A1 KIAA1024 PTPRD SPINK5 XKR3
AC091801.1 CDH23 FAM129B KIAA1199 PTPRK SPINT4 XRCC4
AC092687.4 CDH4 FAM135B KIAA1211 PTPRM SPNS3 XXYLT1
AC121757.1 CDH5 FAM13A KIAA1217 PTPRN2 SPRR1B YIPF1
ACBD5 CDH7 FAM154A KIAA1324 PTPRT SPRR3 YIPF7
ACPP CDH9 FAM155A KIAA1324L PTS SPTLC2 ZBTB16
ACTA2 CDKAL1 FAM173B KIF13A PUS7 SPTLC3 ZC3H12C
ACTL8 CDSN FAM179A KIF16B PXDN SQRDL ZC3H12D
ADAM12 CDYL2 FAM184B KIF26A PXDNL SRD5A1 ZDHHC14
ADAM29 CEACAM7 FAM189A1 KIF26B PYROXD1 SRL ZFAT
ADAMTS1 CELA3B FAM19A2 KIRREL3 PYROXD2 SSR1 ZFP57
ADAMTS12 CELF5 FAM19A5 KL RAB36 ST6GAL1 ZFYVE28
ADAMTS16 CELSR1 FAM43A KLF12 RAB3C ST6GALNAC3 ZNF114
ADAMTS17 CEP112 FAM47E KLHDC8A RAB8A ST8SIA1 ZNF155
ADAMTS18 CEP128 FAM47E-STBD1 KLHL1 RABEP1 ST8SIA6 ZNF254
ADAMTS2 CERS6 FAM65B KLHL14 RAMP3 STAP1 ZNF28
ADAMTS3 CFHR1 FAM81B KLHL23 RASGEF1B STAU2 ZNF280A
ADAMTS5 CFHR2 FANK1 KLHL24 RASSF2 STIM1 ZNF283
ADAMTSL1 CHD5 FARS2 KLHL5 RASSF6 STK32A ZNF331
ADAMTSL2 CHKB FAS KLHL7 RBFOX1 STK32B ZNF345
ADAP1 CHL1 FAT2 KLK13 RBFOX3 STK32C ZNF354A
ADARB2 CHN2 FBXL17 KLRB1 RBM11 STK39 ZNF354B
ADAT2 CHRM2 FCER2 KNG1 RBMS3 STON1-GTF2A1L ZNF365
ADCY2 CHST11 FCGBP KRT14 RBP4 STON2 ZNF366
ADCY3 CLCA2 FER1L6 KRT40 RCAN1 STPG1 ZNF385D
ADCY5 CLCNKB FFAR4 KRT6A RCAN2 STPG2 ZNF391
ADD2 CLDN10 FGF12 KRT8 RCBTB1 STT3A ZNF423
ADH4 CLDN11 FGF14 KRT83 RECK STX2 ZNF44
ADHFE1 CLDN16 FHAD1 KRT84 REG4 STX8 ZNF441


ADRA1A CLEC1A FHIT KRTAP10-7 RELN STXBP5L ZNF443


ADRA1D CLEC1B FIP1L1 KRTAP12-2 RERGL STXBP6 ZNF468
AEBP2 CLEC3A FKRP KRTAP3-2 RFTN1 SUCLG2 ZNF568
AGAP1 CLEC4C FMN1 KRTAP5-5 RFX2 SULT2B1 ZNF577
AGBL1 CLEC6A FMN2 KSR2 RFX8 SUMF1 ZNF670
AGMAT CLIC5 FMO2 KY RGL1 SUN3 ZNF677
AGPAT9 CLMP FNDC1 L3MBTL2 RGS1 SV2C ZNF695
AHRR CLOCK FNDC3B L3MBTL4 RGS6 SVEP1 ZNF697
AIM1 CLSTN2 FOXK2 LAMA1 RGSL1 SVIL ZNF738
AIPL1 CMBL FRAS1 LAMA2 RHBDL3 SWAP70 ZNF74
AKNAD1 CMC2 FRMD4A LAMA4 RIMBP2 SYCE3 ZNF773
AL160286.1 CMKLR1 FRMD4B LAMB4 RIMKLA SYCP2L ZNF804B
AL355531.2 CNBD1 FSIP1 LAMC1 RIMS1 SYK ZNF83
AL590867.1 CNN2 FTO LAMC2 RIN2 SYN3 ZNF85
ALDH18A1 CNR2 FUT9 LAMC3 RNASE11 SYNE1 ZNF879
ALDH1L1 CNTLN FXN LAPTM4B RNF144B SYNE3 ZNF98
ALDH4A1 CNTN4 FYB LBH RNF150 SYNJ2 ZPLD1
ALG8 CNTN5 GABBR2 LDB3 RNF175 SYNPR ZSCAN18
ALK CNTNAP2 GABRG3 LDLRAD3 RNF19A SYT16 ZSWIM2
ALPL CNTNAP4 GABRR1 LDLRAD4 RNF212 SYT9 ZZEF1
AMTN CNTNAP5 GADL1 LGALS8 RNF39 TAB2
ANK1 COBLL1 GALC LGI2 RNPEP TACC3
ANK2 COG6 GALNT10 LGR5 ROBO2 TANC1
ANK3 COL21A1 GALNT13 LHFPL2 ROR1 TAP2
ANKH COL24A1 GALNT14 LHPP ROR2 TAS2R14
ANKRD24 COL25A1 GALNT18 LIMCH1 RORA TAS2R20
ANKS1B COL28A1 GALNT8 LINC00908 RP1-139D8.6 TAS2R42
ANO2 COL4A2 GALNT9 LINC00923 RP11-156P1.2 TBC1D22A
ANO3 COL4A3 GALNTL6 LINGO2 RP11-192H23.4 TBC1D7
ANXA4 COL4A4 GANC LIPC RP11-210M15.2 TBX20
AOAH COMMD10 GAS2 LITAF RP11-215A19.2 TCP11L1
AOC1 CORIN GATM LMX1B RP11-257K9.8 TCTN2
AP3B1 COX19 GCNT3 LNX1 RP11-295P9.3 TDRD10
APBB1IP CPA3 GFRA2 LOXL2 RP11-297M9.1 TEAD2
APBB2 CPA5 GIPC2 LPHN2 RP11-302M6.4 TEC
APIP CPB2 GLDN LPHN3 RP11-307N16.6 TEK
APPBP2 CPE GLIPR1L2 LPIN1 RP11-321F6.1 TEKT1
ARHGAP22 CPLX1 GLIS1 LPIN2 RP11-383H13.1 TENM2
ARHGAP24 CPNE4 GMNC LPPR1 RP11-389E17.1 TENM3
ARHGAP28 CPNE8 GNA15 LRCH1 RP11-433C9.2 TENM4
ARHGAP44 CPXM2 GNG2 LRP1B RP11-45H22.3 TES
ARHGAP8 CREB5 GNLY LRRC16A RP11-463C8.4 TESC
ARHGEF10L CRELD2 GOLM1 LRRC7 RP11-697E2.6 TESPA1
ARHGEF18 CRHR1 GOPC LRRFIP1 RP11-96O20.4 TEX2
ARHGEF37 CRTAC1 GOSR2 LRRK2 RP13-279N23.2 TFB2M
ARL14EPL CRTC3 GPC5 LRRTM4 RP5-1052I5.2 TG
ARL15 CRX GPC6 LSAMP RP5-966M1.6 TGM6
ARSB CRYL1 GPD1L LTBP1 RPA3-AS1 THBS2
ARSJ CSGALNACT1 GPLD1 LUZP2 RPAIN THBS4
ART1 CSMD1 GPR111 LYAR RPGRIP1 THSD4
ART3 CSMD2 GPR114 LYPD6B RPS6KA2 THSD7A
ASAH2 CSMD3 GPR115 MACC1 RPSA THSD7B
ASAP1 CSN3 GPR133 MACROD2 RPTOR TIAM1
ASB18 CSRP1 GPR137B MAGI1 RRM1 TIAM2
ASGR2 CTB-129P6.11 GPR158 MAGI2 RRP12 TIFA
ASIC2 CTD-2207O23.3 GPR78 MAMDC2 RUNX1 TJP2
ASPA CTD-2260A17.2 GPRIN3 MAML3 RUNX2 TLDC1
ASTN2 CTD-2287O16.3 GRAMD3 MANBA RXFP1 TLK1
ATF7IP2 CTD-2616J11.11 GRAMD4 MAP3K13 RYR1 TLR10
ATP10A CTD-3088G3.8 GRB10 MAPT RYR2 TMCC3
ATP10D CTD-3105H18.16 GREB1 MARCH1 RYR3 TMCO3
ATP2C2 CTD-3105H18.18 GRHL2 MARCH4 SAMD12 TMED3
ATP6V0A4 CTIF GRID1 MARCH7 SAMD3 TMEM104
ATP6V0E2 CTNNA2 GRID2 MARK4 SAMD5 TMEM106B
ATP6V1E1 CTNNA3 GRIK1 MAST4 SBF2 TMEM117
ATP8A1 CTNND2 GRIK2 MATN1 SCARB2 TMEM128
ATP8A2 CUBN GRIK4 MB21D2 SCD5 TMEM129
ATP9A CUX1 GRIN2A MCF2L SCLY TMEM132B
ATRNL1 CWF19L2 GRIN3A MCF2L2 SCML4 TMEM132C
ATXN3 CXCL11 GRIN3B MCM9 SCN1A TMEM132D
AVEN CYB5A GRIP1 MDGA2 SCN3A TMEM135
AXDND1 CYB5R2 GRM4 MECOM SCNN1G TMEM156
B3GNTL1 CYBRD1 GRM7 MEGF11 SCP2 TMEM179
B4GALNT2 CYP24A1 GRM8 MEIOB SCUBE1 TMEM220
BAI3 CYP4F12 GSTO1 MEOX2 SDC2 TMEM229B
BARD1 CYP4F3 GTF2H4 MFAP3 SDK2 TMEM232
BBS9 DAAM1 GTF2IRD1 MFSD6L SDR39U1 TMEM244
BCAR3 DAAM2 GTF3C6 MGAT5 SEMA3A TMEM259
BCAS1 DAB1 GUCA1A MGAT5B SEMA3E TMEM44
BCAS3 DAB2 HAAO MGMT SEMA6D TMEM51
BCKDHB DAD1 HABP2 MGST2 SEPT11 TMEM63C
BCL2L14 DAPK1 HAGH MGST3 SEPT9 TMEM71
BCR DCBLD1 HBE1 MICAL3 SERINC5 TMEM88B
BDH1 DCC HBG2 MICALCL SERPINA5 TMPRSS11E
BEST3 DCDC2C HDAC4 MICB SERPINB5 TMPRSS2
BFSP2 DCHS2 HDAC7 MIS12 SFTPD TMTC1
BICC1 DCTD HEATR1 MITF SGCG TMTC2


BICD1 DEFB128 HECW1 MLF1IP SGCZ TMTC4


BIN2 DEPDC7 HECW2 MLK4 SH2D4B TNFAIP8
BIRC5 DEPTOR HEG1 MLPH SH3RF2 TNFSF10
BLNK DGKH HHAT MMP2 SH3RF3 TNIK
BLOC1S5 DHRS4 HHLA1 MMP20 SHISA6 TNN
BLOC1S5-TXNDC5 DHX37 HHLA2 MMP26 SHROOM3 TNS1
BMPR1B DIEXF HIP1 MOB3B SIRPA TNS3
BNC2 DIP2A HIST1H2AA MOBP SIRT3 TONSL
BNIP2 DIRC3 HIST1H2BA MORF4L1 SKA1 TOP1MT
BRE DKKL1 HIVEP3 MOV10L1 SKAP2 TPCN2
BSPRY DLC1 HJURP MPHOSPH6 SLA TPD52
BTNL2 DLG2 HLA-B MPND SLC12A3 TPK1
C10orf112 DLGAP1 HLA-C MPP7 SLC15A2 TPO
C12orf36 DMBT1 HLA-DPA1 MROH2A SLC16A7 TPRX2P
C12orf54 DMGDH HLA-DPB1 MROH2B SLC17A5 TRAT1
C12orf55 DMRT1 HLA-DQA1 MRPS22 SLC1A2 TRDN
C13orf45 DNAAF1 HLA-DQA2 MRS2 SLC1A6 TRIB3
C15orf41 DNAH11 HLA-DQB1 MS4A12 SLC22A16 TRIM22
C15orf48 DNAH8 HLA-DQB2 MSH3 SLC22A9 TRIM5
C16orf95 DNAJC16 HLA-DRA MSMO1 SLC24A4 TRIM9
C1orf101 DNER HLA-DRB1 MSR1 SLC25A21 TRPA1
C1orf177 DNHD1 HLA-DRB5 MSRA SLC25A24 TRPC6
C1orf198 DNM1L HLA-F MTHFD1L SLC25A37 TRPM3
C1orf222 DOCK1 HLA-G MTSS1 SLC26A3 TRPM5
C1orf68 DOCK5 HLCS MTUS1 SLC27A6 TRPS1
C1orf94 DOK6 HMCN1 MTUS2 SLC2A8 TSHR
C20orf166 DOK7 HMGCLL1 MUC16 SLC2A9 TSHZ2
C20orf196 DPF3 HNF4A MUC22 SLC30A10 TSNARE1
C22orf34 DPP10 HPCAL4 MUC4 SLC35F2 TSNAX-DISC1
C2orf54 DPP6 HPSE MYLK4 SLC35F3 TSPAN15
C2orf83 DPY19L1 HPSE2 MYO15B SLC35F4 TSPAN18
C4orf19 DPY19L4 HS3ST4 MYO16 SLC37A1 TSPAN5
C4orf50 DRAM1 HS6ST3 MYO1B SLC38A8 TSPAN8
C5orf17 DSC1 HSPA12A MYO1D SLC38A9 TSPAN9
C6orf10 DSCAM HSPB3 MYO1H SLC39A11 TSPEAR
C6orf15 DSG1 HTRA4 MYO3A SLC39A14 TSSC1
C6orf165 DTNA HUS1 MYO3B SLC48A1 TTC37


C6ORF165 DYNC2H1 IDO2 MYO5B SLC5A12 TTC6


C6orf58 DYTN IGFBP7 MYO7A SLC6A1 TTC9
C7orf31 EDARADD IGSF5 MYOF SLC6A5 TTLL11
C8orf34 EEF1E1-BLOC1S5 IL18R1 MYOM1 SLC7A11 TTLL13
C8orf46 EFCAB11 IL1RL1 MYOM2 SLC7A5 TULP3
C9orf91 EGFR IL36RN MYOZ3 SLC8A3 TXN2
CABP5 EGLN1 IL37 MYPN SLC9A4 TXNDC5
CACNA1A EGLN3 IL7R MYRFL SLC9C1 TYW1
CACNA1C EIF2B5 IMPA2 MYRIP SLCO2A1 UBASH3B
CACNA2D3 EIF4E2 INPP1 MYSM1 SLCO2B1 UBE2F-SCLY
CACNG2 ELAC2 INPP4B NAAA SLCO6A1 UGT2A1
CADM2 ELFN2 INPP5D NAALADL2 SMC6 UGT2A2
CADPS ELSPBP1 IP6K3 NADSYN1 SMCO2 UGT2B4
CALD1 EMCN IQGAP2 NANOG SMG7 ULK4
CALN1 EMILIN2 IQSEC1 NAT1 SMIM12 UNC13C
CAMK1D EMR1 IRF1 NAT2 SMOC2 UNC5B
CAND2 ENPP1 ITGA1 NAV2 SMOX UPP2
CAPN14 ENTPD4 ITGA2 NAV3 SMR3B USH2A
CAPN9 EPHA10 ITGAE NBAS SMYD3 USP20
CASQ2 EPHA6 ITIH4 NBEA SNTB1 UTS2B
CCBE1 EPHA7 ITPR3 NCALD SNTG1 VARS2
CCDC102B EPHB1 IZUMO1 NCAM2 SNTG2 VAV2
CCDC113 EPRS JAKMIP3 NCF2 SNX18 VAV3
CCDC129 ERAP1 KALRN NCF4 SNX19 VEGFC
CCDC146 ERAP2 KANK1 NCK2 SNX29 VNN1
CCDC149 ERBB4 KANK4 NCKAP5 SNX31 VPS8
CCDC158 ERC1 KANSL1 NCMAP SNX7 VRK3
CCDC169 ERG KAT2B NDFIP1 SOAT1 VSTM5
CCDC169-SOHLH2 ERICH1 KCNA6 NDST4 SOHLH2 VWA5B1
CCDC171 ERO1LB KCNAB1 NDUFA10 SORBS1 VWF
ESRRG ESPNL KCNB2 NDUFAF6 SORBS2 WBSCR17
ESYT2 ESR1 KCNC2 NEBL SORCS1 WDR17


Supplementary Figures

S1 Fig. Analytical properties of NCD2 as a function of the number of SNPs

x-axis, number of SNPs in the window for which NCD2(0.5) is being calculated;
y-axis, NCD2(0.5) values. Each color corresponds to a fixed number of FDs per
window (20, 40, 100). Top, x-axis reaches 3,000 SNPs. After ∼1,500 SNPs (for
any of the numbers of FDs), NCD2(0.5) stabilizes and asymptotically
approaches 0. Bottom, a zoom-in of the upper plot, with the x-axis reaching only
100 SNPs. In this representation, all SNPs have a frequency of 0.5.


S2 Fig. Analytical properties of NCD2 as a function of the number of FDs

x-axis, number of FDs in the window for which NCD2(0.5) is being calculated;
y-axis, NCD2(0.5) values. Each color corresponds to the frequency of the 20
SNPs in the window (0.5, 0.4, 0.3). Top, x-axis reaches 3,000 FDs. After ∼500
FDs, NCD2(0.5) stabilizes and asymptotically approaches 0.5. Bottom, a zoom-in
of the upper plot, with the x-axis reaching only 100 FDs. In this representation,
all 20 SNPs have the same frequency (0.5 in blue, 0.4 in red, 0.3 in gray). Note
that the minimum NCD2(0.5) value differs among colors, since they
represent different SNP frequencies.
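
The limiting behavior described in S1 Fig and S2 Fig can be checked numerically with a short sketch. It assumes, as an illustration rather than a restatement of the Methods, that NCD2 is the root-mean-square distance of the minor allele frequencies from tf over all informative sites, with each FD contributing a frequency of 0; the function name `ncd2` is hypothetical.

```python
import math

def ncd2(tf, snp_mafs, n_fd):
    """Assumed NCD2: RMS distance of site frequencies from tf (FDs count as 0)."""
    n = len(snp_mafs) + n_fd  # informative sites = SNPs + FDs
    if n == 0:
        raise ValueError("window has no informative sites")
    ss = sum((p - tf) ** 2 for p in snp_mafs) + n_fd * tf ** 2
    return math.sqrt(ss / n)

# S2 Fig setting: 20 SNPs at a fixed frequency, increasing number of FDs.
for freq in (0.5, 0.4, 0.3):
    curve = [round(ncd2(0.5, [freq] * 20, n_fd), 3) for n_fd in (0, 20, 100, 3000)]
    print(freq, curve)

# S1 Fig setting: 20 FDs, increasing number of SNPs at frequency 0.5.
print([round(ncd2(0.5, [0.5] * n_snp, 20), 3) for n_snp in (20, 100, 3000)])
```

Under this assumption, a window without FDs has a minimum of |freq − 0.5|, which is why the curves for SNP frequencies of 0.4 and 0.3 start above zero, and adding FDs pulls every curve toward tf = 0.5.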

S3 Fig. Effect of sequence length on NCD2(0.5) power (Africa).
ROC curves for sequence lengths (L) of (A) 3 Kb, (B) 6 Kb, and (C) 12 Kb. Each plot shows NCD2(0.5) performance for
simulations where the balanced polymorphism is modeled to achieve an equilibrium frequency (feq) of 0.5 (blue), 0.4
(orange), 0.3 (pink), or 0.2 (green), based on simulations under the African demographic scenario and Tbs = 5 myr.
FPR, false positive rate (1 - specificity); TPR, true positive rate (sensitivity, or power). Note that the x-axis ranges
from 0 to 0.05, while the y-axis ranges from 0 to 1.
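
As an illustration of how ROC points of this kind can be obtained from simulations, the sketch below computes FPR and TPR from two lists of statistic values, one from neutral simulations and one from simulations with balancing selection. It is not the code used for the figures: the function name `roc_points`, the `max_fpr` cutoff, and the toy values are assumptions, and low values of the statistic are treated as evidence of selection.

```python
def roc_points(neutral, selected, max_fpr=0.05):
    """Return (FPR, TPR) pairs for thresholds t, calling a window positive if value <= t."""
    thresholds = sorted(set(neutral) | set(selected))
    points = [(0.0, 0.0)]
    for t in thresholds:
        fpr = sum(x <= t for x in neutral) / len(neutral)    # neutral windows called positive
        tpr = sum(x <= t for x in selected) / len(selected)  # selected windows called positive
        if fpr > max_fpr:
            break  # restrict the curve to the plotted x-axis range
        points.append((fpr, tpr))
    return points

# Toy values only: 100 neutral statistics and 4 statistics under selection.
neutral = [i / 100 for i in range(1, 101)]
selected = [0.005, 0.02, 0.03, 0.5]
for fpr, tpr in roc_points(neutral, selected):
    print(fpr, tpr)
```

Truncating at a small FPR mirrors the x-axis range of the plots, where only the region of low false positive rates is relevant for a genome scan.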
S4 Fig. Effect of sequence length on NCD2(0.5) power (Africa).
ROC curves for sequence lengths (L) of (A) 3 Kb, (B) 6 Kb, and (C) 12 Kb. Each plot shows NCD2(0.5) performance for
simulations where the balanced polymorphism is modeled to achieve an equilibrium frequency (feq) of 0.5 (blue), 0.4
(orange), 0.3 (pink), or 0.2 (green), based on simulations under the African demographic scenario and Tbs = 3 myr.
FPR, false positive rate (1 - specificity); TPR, true positive rate (sensitivity, or power). Note that the x-axis ranges
from 0 to 0.05, while the y-axis ranges from 0 to 1.
S5 Fig. Effect of sequence length on NCD2(0.5) power (Europe).
ROC curves for sequence lengths (L) of (A) 3 Kb, (B) 6 Kb, and (C) 12 Kb. Each plot shows NCD2(0.5) performance for
simulations where the balanced polymorphism is modeled to achieve an equilibrium frequency (feq) of 0.5 (blue), 0.4
(orange), 0.3 (pink), or 0.2 (green), based on simulations under the European demographic scenario and Tbs = 5 myr.
FPR, false positive rate (1 - specificity); TPR, true positive rate (sensitivity, or power). Note that the x-axis ranges
from 0 to 0.05, while the y-axis ranges from 0 to 1.
S6 Fig. Effect of sequence length on NCD2(0.5) power (Europe).
ROC curves for sequence lengths (L) of (A) 3 Kb, (B) 6 Kb, and (C) 12 Kb. Each plot shows NCD2(0.5) performance for
simulations where the balanced polymorphism is modeled to achieve an equilibrium frequency (feq) of 0.5 (blue), 0.4
(orange), 0.3 (pink), or 0.2 (green), based on simulations under the European demographic scenario and Tbs = 3 myr.
FPR, false positive rate (1 - specificity); TPR, true positive rate (sensitivity, or power). Note that the x-axis ranges
from 0 to 0.05, while the y-axis ranges from 0 to 1.
S7 Fig. Effect of sequence length on NCD2(0.5) power (Asia).
ROC curves for sequence lengths (L) of (A) 3 Kb, (B) 6 Kb, and (C) 12 Kb. Each plot shows NCD2(0.5) performance for
simulations where the balanced polymorphism is modeled to achieve an equilibrium frequency (feq) of 0.5 (blue), 0.4
(orange), 0.3 (pink), or 0.2 (green), based on simulations under the Asian demographic scenario and Tbs = 5 myr.
FPR, false positive rate (1 - specificity); TPR, true positive rate (sensitivity, or power). Note that the x-axis ranges
from 0 to 0.05, while the y-axis ranges from 0 to 1.
S8 Fig. Effect of sequence length on NCD2(0.5) power (Asia).
ROC curves for sequence lengths (L) of (A) 3 Kb, (B) 6 Kb, and (C) 12 Kb. Each plot shows NCD2(0.5) performance for
simulations where the balanced polymorphism is modeled to achieve an equilibrium frequency (feq) of 0.5 (blue), 0.4
(orange), 0.3 (pink), or 0.2 (green), based on simulations under the Asian demographic scenario and Tbs = 3 myr.
FPR, false positive rate (1 - specificity); TPR, true positive rate (sensitivity, or power). Note that the x-axis ranges
from 0 to 0.05, while the y-axis ranges from 0 to 1.

Fig S9. Correlations for NCD2(tf) calculated with different tf values.
In each plot, NCD2 values calculated with two different target frequencies are
plotted against each other. NCD2 was calculated for 1,000 neutral simulations
following demographic parameters for the African continent. L = 3 Kb.

Fig S10. ROC curves for comparison between NCD2(0.5) and other tests (Europe).
Power to detect LTBS for simulations where the balanced polymorphism was modeled to achieve an equilibrium
frequency (feq) of (left) 0.3, (center) 0.4, and (right) 0.5. Plotted values are for European demography, Tbs = 5 myr,
and L = 3 kb. Target frequency for NCD1 and NCD2 matches the simulated feq.

Fig S11. Relationship between NCD2(tf) and the number of informative sites.
NCD2(tf) was calculated for neutral simulations (10,000 for each bin of IS) under
the African demographic scenario, and the 0.01 quantile value for each bin is
plotted. Blue (tf = 0.5), orange (tf = 0.4), pink (tf = 0.3), green (tf = 0.2).


Fig S12. Proportion of windows per chromosome. Sets of significant and outlier
windows are derived from the union of three target frequencies (0.3, 0.4,
0.5). Grey, all genomic windows; green, significant windows; blue, outlier windows.

Fig S13. Proportion of positions in the genome retained after each filter.
Proportion of the hg19 human reference genome (total base-pairs = 2,684,573,005)
retained for each individual filtering criterion described in the Methods,
and for all filters applied jointly. Proportion of sequences retained:
Map50=0.843; TRF=0.976; SD=0.961; pantro2=0.961; all=0.819. Map50: mappability
50-mer (see Methods); TRF: tandem repeats; SD: segmental duplications;
pantro2: reference chimp genome.


Fig S14. Distribution of proportion of high coverage (pHC) positions per
bin of empirical NCD2 p-value. pHC in percentage (y-axis) binned by the
NCD2 Ztf empirical p-values represented in -log10 scale on the x-axis. pHC
is the proportion of the sequence of a given window in this study having high
coverage values in at least two samples of modern human shotgun data (see
Methods).


Fig S15. Proportion of sequences pertaining to each functional category.
y-axis, the proportion of sequences over the total that belong to each category.
x-axis, sets of significant windows for YRI, LWK, GBR and TSI (see Methods).
all, all queried windows. Dark blue, exon; light blue, intron; light green, 3'UTR;
dark green, 5'UTR.

Fig S16. Number of paralog genes per gene on the same chromosome. For each protein-coding gene from human
autosomes (19,349), the number of paralogs present in the same chromosome is plotted (left, gray). All autosomal
genes come from Ensembl for hg19, regardless of whether they were queried for NCD2. Significant genes (middle, blue)
come from the union of significant genes for YRI considering all tf values (see Table 2 in main paper). Significant genes
without olfactory receptor genes (21 ORs in total) are shown on the right (green). y-axis, relative frequency of the
genes that contain a given number of paralogs on the same chromosome. Note that the distributions are very similar
for the background and significant genes.

Fig S17. Proportion of Neanderthal SNPs in the candidate windows. Left,
significant windows; right, outlier windows. In gray, distribution obtained
from 1,000 samplings from the background. In orange, % of Neanderthal SNPs
within all significant (or outlier) windows. TSI, Toscani; GBR, Great Britain.

Fig S18. NCD2(0.5) empirical values for each bin of informative sites, IS. A) NCD2(0.5) for windows with IS between 1
and 100 for YRI (>99% of all windows); B) NCD2(0.5) for all windows with IS between 1 and 100 for GBR (>99% of all
windows). In blue, median value for all the windows within a given bin. Note that the medians stabilize around 20
IS for YRI and around 15 IS for GBR.

Fig S19. Number of paralog genes per gene on the same chromosome. For
each OR gene contained within any of the significant sets of windows (all pop-
ulations and t f values, 53 ORs in total), the number of paralogs present in the
same chromosome is plotted. y-axis, relative frequency of the genes that contain
a given number of paralogs on the same chromosome. Compare with distribu-
tions in S16 Fig.


Fig S20. Venn diagrams of candidate windows for four populations. A, left,
significant windows; B, right, outlier windows; YRI, Yoruba; LWK, Luhya; GBR,
Great Britain; TSI, Toscani. The set of significant windows for each population
comes from the union of significant and outlier windows for tf=0.3, 0.4, 0.5 (see
Results and Methods). African populations are shown in tones of purple, and
European in tones of green.

Fig S21. Venn diagrams of significant windows for four populations, for each tf value. A, left, significant windows;
B, right, outlier windows; YRI, Yoruba; LWK, Luhya; GBR, Great Britain; TSI, Toscani. The set of significant windows
for each population comes from those detected with each tf value. African populations are shown in tones of
purple, and European in tones of green.

Fig S22. Venn diagrams of outlier windows for four populations, for each tf value. A, left, significant windows;
B, right, outlier windows; YRI, Yoruba; LWK, Luhya; GBR, Great Britain; TSI, Toscani. The set of significant windows
for each population comes from those detected with each tf value. African populations are shown in tones of
purple, and European in tones of green.
Chapter 2

Accumulation of deleterious mutations in genes that were targets of
long-term balancing selection in humans

Preliminary Remarks

Over the last decade, and particularly in the last 5 years, several studies
have documented the existence of a relatively high fraction of deleterious
mutations in human populations. Some studies found a difference between the
genetic load of African and non-African populations (Hodgkinson et al., 2013;
Lohmueller et al., 2008; Lohmueller, 2014; Henn et al., 2016), whereas other
studies have disputed these findings (Do et al., 2015; Simons et al., 2014).
In parallel, a series of studies evaluated the influence that "selective
sweeps" have on nearby neutral variants (reducing diversity) and, less
frequently, on non-neutral variants (limiting the efficacy of selection at
linked sites) (Betancourt and Presgraves, 2002; Chun and Fay, 2011). To date,
no study has sought to evaluate the impact that selection for the maintenance
of a balanced polymorphism has on linked deleterious variants (except for HLA,
Lenz et al., 2016). We therefore set out to test the hypothesis that balancing
selection on a site increases the abundance of linked deleterious alleles
that, in the absence of balancing selection, could have been eliminated by
purifying selection.
To address this question, we made use of the genes with signatures of
balancing selection identified in Chapter 1¹. In this work I had the
collaboration of Débora Y.C. Brandt (PhD student, University of California,
Berkeley) and Jônatas E. César (postdoctoral researcher, Universidade de São
Paulo, IB), in addition to Diogo Meyer, who supervised the work. D.Y.C.B.
organized the 1000 Genomes Project data for our analyses, calculated allele
frequencies per population, and contributed functional annotations for the
SNPs. J.E.C. developed efficient scripts for the re-sampling approaches
described in the manuscript, and also pre-processed the data for our analyses.
I participated in all the steps described, planned the analyses to be carried
out (together with D.M. and J.E.C.), and wrote the manuscript together with
D.M., with the collaboration and approval of the other co-authors. All authors
contributed to the discussion of the results. We intend to submit it to the
journal Genetics.

¹ In this Chapter we refer to the (unpublished) manuscript of Chapter 1 as
Bitarello et al. (n.d.).


Balancing selection drives the accumulation
of linked deleterious variation in humans
Bárbara Domingues Bitarello¹, Jônatas Eduardo César¹, Débora Yoshihara
Caldeira Brandt², Diogo Meyer¹

¹ Departamento de Genética e Biologia Evolutiva, Universidade de São Paulo, São Paulo,
Brazil
² University of California, Berkeley, USA

Introduction
Understanding the dynamics and factors that interfere with the efficacy of
natural selection is crucial to understanding phenotypes,
complex diseases and genome structure in humans (Brandvain and
Wright, 2016). Mutations per se can be neutral, advantageous, slightly
deleterious or strongly deleterious, and genomic studies have shown that human
populations harbour a large number of deleterious mutations (e.g. Eyre-Walker and
Keightley, 1999; Kiezun et al., 2013, among several others).
Using comparative methods, Eyre-Walker and Keightley (1999) estimated
that about 1.6 new deleterious mutations arise per human individual per
generation. Estimates of the load carried per individual range from three to five
(Morton et al., 1956) to as many as 100 lethal equivalents (Kondrashov, 1995),
a lethal equivalent being an allele or combination of alleles that, if made
homozygous, would be lethal (Lohmueller et al., 2008). More recent estimates
vary from 300 to 1,200 deleterious mutations per diploid (human) genome (Fay,
2011; Lohmueller et al., 2008; Sunyaev et al., 2001). All of these estimates
mostly reflect the assumed mutation rate, but also rely on effective population
size, dominance, and the assumption that populations are in equilibrium
(Brandvain and Wright, 2016). Moreover, the methods used to determine what
makes a mutation "deleterious" vary considerably (reviewed in Henn et al.,
2015). Therefore, it is possible that many analyses of the load of deleterious
mutations carry inaccuracies (reviewed in Brandvain and Wright, 2016).

Three processes have a central role in accounting for the abundance and dis-
tribution of deleterious mutations in the genome: mutation, drift, and selection
(Brandvain and Wright, 2016). Firstly, the balance between influx via muta-
tion and removal via purifying selection results in a dynamic process, where a
large number of weakly selected variants can be maintained at low frequencies.
Exome and genome-wide studies reporting an enrichment of recent deleterious
mutations are strong evidence for this process (Casals et al., 2013; Fu et al., 2012;
Tennessen et al., 2012; Kiezun et al., 2013).

Secondly, features of a population's demographic history can influence the
load of deleterious mutations that it carries. The last decade has seen an
explosion of studies comparing the genetic load between human populations
(Tishkoff and Williams, 2002; Lohmueller et al., 2008; Lohmueller, 2014; Simons
et al., 2014; Henn et al., 2015; Henn et al., 2016). Lohmueller et al. (2008)
quantified the number of deleterious mutations per diploid genome in African
American (AA) and European American (EA) individuals, finding that EA
individuals have lower levels of nucleotide heterozygosity for all functional
categories analysed, and a higher number of homozygous genotypes for derived
alleles at synonymous and nonsynonymous sites and for "possibly damaging"
(Adzhubei et al., 2010) SNPs.

Although the former result is compatible with a rich body of literature
documenting decreasing levels of heterozygosity with increasing distance from
Africa (e.g. Tishkoff and Williams, 2002; Henn et al., 2015), the second
observation is not, a priori, expected. Moreover, among the SNPs segregating in only
one of the two populations, the proportion of nonsynonymous SNPs was found
to be significantly higher in EA (Lohmueller et al., 2008).

This excess of deleterious mutations in European populations was
interpreted as a consequence of a recent out-of-Africa bottleneck (~50,000 years
ago) followed by explosive population growth until the present (Lohmueller
et al., 2008). Although the African population also experienced growth, it
happened further in the past and thus this population would have had enough time
to move closer to equilibrium conditions (Lohmueller et al., 2008). Later
findings supported this hypothesis (Alkan et al., 2009; Subramanian, 2012;
Subramanian, 2016; Hodgkinson et al., 2013; Peischl et al., 2013; Peischl and Excoffier,
2015), while others disputed it (Do et al., 2015; Simons et al., 2014).

A third factor that can account for the load in our genomes is pleiotropy,
which is widespread in the human genome. For example, several studies show
that disease alleles are often also positively selected, indicating that a deleteri-
ous variant has been pushed to a high frequency due to some other contribution
to fitness it displays (e.g. Corona et al., 2010).

Here, we explore an additional process that can play a role in shaping the
load of mutations: the effect of selection on closely linked loci. It is plausible
that at least part of the mutational load in humans is due not to demographic
factors, but to indirect consequences of selection in adjacent loci (Figure 1).

In this study we take up the task of understanding how balancing selection
in humans has shaped the level of load in adjacent loci. It is well understood
how selection, both directional and balancing, has shaped neutral variation in
regions that lie close to sites under either directional or balancing selection (e.g.
Charlesworth, 2006; Charlesworth, 2012; Cutter and Payseur, 2013; Nielsen,
2005, to name a few). Recombination rates and neutral genetic diversity are
correlated in several organisms (reviewed in Charlesworth, 2012; Cutter and
Payseur, 2013) and, when selective sweeps occur, a depletion of neutral
diversity is observed around the selected site (Charlesworth, 2012; Cutter and Payseur,
2013; Nielsen, 2005). The extent of this effect is a consequence of both
recombination rates and the intensity of selection (Roux et al., 2013; Schierup et al.,
2000; Charlesworth et al., 1997).

Figure 1: Effect of balanced polymorphism on neighboring sites. When an
advantageous variant appears in a given haplotype, but the site itself is
under long-term balancing selection, the advantageous variant increases the
frequency of neutral and deleterious variants in linkage. Because two or more
haplotypes are maintained, the linked variants are also kept polymorphic. Adapted
from Charlesworth (2006).

The effects of directional selection on the accumulation of deleterious
mutations are less understood, but have also been addressed (e.g. Betancourt and
Presgraves, 2002; Chun and Fay, 2011). Interestingly, studies in Drosophila have
highlighted that strong purifying (background) selection hampers the
effectiveness of natural selection targeting neighboring sites: near strongly selected sites,
there is increased accumulation of deleterious mutations and the effectiveness
of selection targeting optimal codon usage is lower (Betancourt and Presgraves,
2002). There is also evidence that directional selection limits the efficacy of
purifying selection in neighboring sites in humans (Chun and Fay, 2011).

All these findings suggest that there is a complex interaction of different se-
lective forces targeting linked sites and possibly that linkage limits the efficiency
of purifying selection in purging deleterious mutations from the genome. In this
context, we propose to examine the following question: does balancing selec-
tion targeting certain sites in the human genome interfere with the effectiveness
of natural selection in nearby sites, as has been observed for strong directional
selection?

From a theoretical point of view, a first expectation is that balancing
selection would increase the rate at which deleterious mutations are purged from the
genome. This expectation arises because balancing selection increases the
effective population size (Ne) of the genomic region under selection (Charlesworth
et al., 1997; Roux et al., 2013; Schierup et al., 2000), and increased Ne leads to
an increase in the efficacy of natural selection. We nevertheless expect the
opposite outcome, namely an accumulation of deleterious variants. The key to
understanding this apparent paradox is to consider that balancing selection
often involves the occurrence of partial selective sweeps (i.e., there is an
increase in frequency of the favored allele, but not to the point of fixation,
followed by other such events, favoring other variants) (Connallon and Clark,
2013; Albrechtsen et al., 2010). Such a process enhances diversity, but
variation is structured among haplotypes. This process is analogous to the
increase in Ne for a structured population, where each deme has a small Ne, but
the overall meta-population has a large Ne (Charlesworth et al., 1997; Roux et
al., 2013; Schierup et al., 2000). Because purifying selection then acts within
each haplotype class, where Ne is small, balancing selection should increase
the genetic load in the vicinity of the balanced polymorphism.

To our knowledge, an increase in genetic load in the vicinity of targets of
balancing selection has only very recently been reported for HLA genes (Mendes,
2013; Lenz et al., 2016) and has also been suggested for the "S" loci in
Arabidopsis and Solanum. In "S" loci it was interpreted in the context of "sheltered load",
which relies on the assumption that deleterious variants are recessive and are
less "seen" by purifying selection in regions of high heterozygosity (Stone, 2004;
Roux et al., 2013).

Therefore, whether an increased proportion of deleterious variants occurs near
targets of balancing selection throughout the entire genome has
not yet been examined, except in the context of HLA genes. We address this
question by investigating the levels of accumulation of deleterious mutations in
regions surrounding sites previously detected as targets of balancing selection
in a powerful genome-wide approach (Bitarello et al., n.d.). Given the broad
range of methods that can be used to identify deleterious variants, and the fact
that they are frequently not in agreement (reviewed in Henn et al., 2015), we
opt to use three complementary approaches. Moreover, because most deleteriousness
measures are negatively correlated with allele frequencies, we explicitly
consider the effects of allele frequencies in our analyses.

Our expectation, given the theoretical background outlined above, was that
regions with evidence for balancing selection would show an enrichment of
deleterious variants. In accord with this expectation we found strong evidence
for an increased proportion of nonsynonymous variants within genes with
signatures of long-term balancing selection (LTBS), as well as evidence for an
increased proportion of deleterious variants.

Methods

Population datasets

In order to test the hypothesis that sites within genes with evidence for long-
term balancing selection (LTBS) show an excess of deleterious variants, we con-
sidered all protein-coding SNP positions (nonsynonymous, N, and synonymous,
S) from the 1000 Genomes Phase 3 data (Auton et al., 2015). We selected SNPs
that fall within the coordinates of genes with signatures of LTBS (“balanced
genes”, see below) or within the target windows per se (“balanced windows”)
(Tables 1 and 2).

We used the integrated call sets in VCF format for each chromosome, and
calculated reference and alternative allele frequencies per population using VCFtools
(Danecek et al., 2011). We only considered populations from Africa and
Europe, and excluded the admixed ones, resulting in 10 populations. Africa:
Yoruba in Ibadan, Nigeria (YRI); Luhya in Webuye, Kenya (LWK); Mende in
Sierra Leone (MSL); Gambian in Western Division, The Gambia (GWD); Esan
in Nigeria (ESN). Europe: Toscani in Italy (TSI); British in England and
Scotland (GBR); Iberian populations in Spain (IBS); Finnish in Finland (FIN); Utah
residents with Northern and Western European ancestry (CEU). We did not
include Asian populations because the targets of balancing selection defined by
Bitarello et al. (n.d.) were only documented for African and European
populations.
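
As an illustration of the per-population frequency calculation described above, the following is a minimal Python sketch and not the VCFtools procedure actually used. It assumes a biallelic site, genotypes given as VCF GT strings, and a sample-to-population mapping; the function names and sample identifiers are hypothetical.

```python
def alt_allele_frequency(genotypes):
    """Alternative-allele frequency at one biallelic site.

    `genotypes` is an iterable of VCF GT strings such as '0|1' or '1/1';
    missing alleles ('.') are skipped.
    """
    alt = 0
    called = 0
    for gt in genotypes:
        for allele in gt.replace("/", "|").split("|"):
            if allele == ".":
                continue  # missing call: not counted in the denominator
            called += 1
            if allele != "0":
                alt += 1  # biallelic assumption: any non-reference allele
    return alt / called if called else None


def frequencies_by_population(genotype_of, population_of):
    """Group samples by population and compute the frequency in each group."""
    groups = {}
    for sample, gt in genotype_of.items():
        groups.setdefault(population_of[sample], []).append(gt)
    return {pop: alt_allele_frequency(gts) for pop, gts in groups.items()}


# Hypothetical samples and genotypes, for illustration only
genotype_of = {"s1": "0|1", "s2": "1|1", "s3": "0|0", "s4": "0|1"}
population_of = {"s1": "YRI", "s2": "YRI", "s3": "GBR", "s4": "GBR"}
freqs = frequencies_by_population(genotype_of, population_of)
print(freqs)
```

The reference-allele frequency at a biallelic site is then one minus the value returned for each population.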

Targets of balancing selection

The "balanced genes" are those reported by Bitarello et al. (n.d.) (see their Table
3) as having the strongest statistically significant signatures of LTBS in humans
(213 genes in total). The list of balanced genes was generated by intersecting 3
Kb windows with strong signatures of LTBS with the protein-coding gene anno-
tation from Encode/Ensembl (Bitarello et al., n.d.). Here we test the hypothesis
of an enrichment in the proportion of deleterious variants in the balanced genes
per se (Table 1).

A more specific definition of the regions under balancing selection would
involve the analysis of "balanced windows" (i.e., the queried sub-region of a gene
with evidence for balancing selection, according to the method of Bitarello et al.,
n.d.). However, this approach generates a dataset which is too restrictive (Table
2), with a number of SNPs that is too small to provide reliable contrasts between
regions under balancing selection and the rest of the genome (with
on average 209 protein-coding SNPs per population, after the HLA genes are
removed). We therefore chose to restrict our analyses to the balanced genes, for
which we have a larger number of SNPs documenting the influence of selection
on nearby sites.

Our approach consists of comparing the number of deleterious SNPs
within the genes under balancing selection with that in the remainder of the genome
(which we refer to as providing a set of "control SNPs"). We define control
SNPs as those protein-coding SNPs outside all balanced genes. We did not
consider SNPs on the sex chromosomes or in the mitochondrial
DNA, since there is no information regarding balancing selection signatures for
those genes (Bitarello et al., n.d.). More details about the controls are provided
below in the "Re-sampling control SNPs" section.

Annotation

One of the summary statistics used to quantify the genetic load was the CADD
(Combined Annotation Dependent Depletion, or simply “C Score”) described
in Kircher et al. (2014). Thus, we used the annotation provided in the study of
Kircher et al. (2014) (available at: http://cadd.gs.washington.edu/download,
accessed in March 2016).

From this annotation file we retrieved the following information: "CHR"
(chromosome where the SNP is situated), "POS" (position of the SNP in the
chromosome), "Consequence" (nonsynonymous, synonymous, 3-prime-UTR,
5-prime-UTR, intronic, non-coding change, canonical splice, stop-gained, stop-lost),
"GeneName" (gene name associated with the SNP position), "AnnoType"
(NonCodingTranscript, Transcript, CodingTranscript), "PolyPhenCat" (benign,
possibly damaging, probably damaging), and the scaled and raw C scores (Kircher et al.,
2014).

Because two of our measurements of genetic load (see below) are only ap-
plicable to protein-coding sites, we restricted our analyses to these categories.
Thus, all quantification of deleterious load was restricted to sites which are
protein-coding (i.e., N or S) (Tables 1 and 2). For the step where we iden-
tified the specific SNPs with highest heterozygosity for each gene (and which
is/are the putative target(s) of balancing selection, see below) we used the com-
plete set of sites within the gene, since it is plausible that sites which are not
protein-coding are under balancing selection.


Given that the effect of balancing selection on the load of nearby regions is
also expected to affect sites which are functional but not protein-coding, our ap-
proach could theoretically be extended to this class of sites (for example, using
a measure of deleteriousness such as the C score, which is applicable to these
sites as well). However, in the present study we opted to restrict our analyses to
variants that affect protein-coding sequences. We justify this based on the fact
that assignment of deleteriousness at these sites can be performed by estimating
the ratio of nonsynonymous to synonymous polymorphisms, a measure which
is based exclusively on the nature of the variants, with no direct influence of
allele frequencies and phylogenetic conservation.

Quantifying genetic load

All protein-coding SNPs from the set of balanced SNPs (Table 1) were jointly
considered when calculating the statistics below, i.e, a single estimate of load
was made for the entire set of genes with evidence for balancing selection. This
avoids the difficulty in obtaining reliable estimates when computing load for
individual genes, since these often have a small number of SNPs. For controls,
the same approach was adopted for each re-sampled set of SNPs (details below).

Ratio of nonsynonymous to synonymous polymorphisms

We calculated the ratio of the number of nonsynonymous (PN) to synonymous
polymorphisms (PS) for each set of SNPs (from balanced genes and controls):

\[ P_N/P_S = \frac{P_N}{P_S} \tag{1} \]

This ratio provides a measure of the proportion of deleterious mutations,
assuming that a large proportion (Eyre-Walker and Keightley, 1999;
Subramanian, 2012) of nonsynonymous mutations are either strongly or mildly
deleterious. However, because nonsynonymous variants include those that are
adaptive or neutral, we also considered alternative statistics which quantify genetic
load.

Balanced genes
POP  # sites       N             S             Pdel1     Pdel2     Benign        Cscore
YRI  3,423(2,961)  1,871(1,564)  1,552(1,397)  197(182)  262(238)  1,300(1,034)  10.72(11.36)
LWK  3,587(3,126)  1,959(1,654)  1,628(1,472)  213(194)  281(260)  1,357(1,093)  10.85(11.52)
MSL  3,387(2,937)  1,837(1,543)  1,550(1,394)  198(184)  241(222)  1,273(1,013)  10.63(11.26)
GWD  3,550(3,098)  1,967(1,668)  1,583(1,430)  198(181)  292(272)  1,353(1,091)  10.75(11.35)
ESN  3,261(2,821)  1,804(1,517)  1,457(1,304)  202(187)  265(248)  1,238(984)    10.91(11.63)
TSI  2,596(2,146)  1,479(1,181)  1,117(965)    175(161)  230(211)  997(733)      10.88(11.82)
GBR  2,299(1,866)  1,328(1,043)  971(823)      155(140)  188(173)  919(664)      10.69(11.70)
FIN  2,149(1,715)  1,230(948)    910(767)      138(124)  173(159)  864(610)      10.33(11.36)
CEU  2,353(1,925)  1,334(1,054)  1,019(871)    145(132)  198(183)  924(672)      10.66(11.66)
IBS  2,612(2,153)  1,507(1,200)  1,105(953)    171(155)  237(216)  1,034(764)    10.82(11.71)

Table 1: Statistics for protein-coding SNPs within the balanced genes. Genes under balancing selection were defined
by Bitarello et al. (n.d.). Numbers in parentheses refer to the datasets after removal of HLA genes (see Methods).
Cscore, average scaled C score for all SNPs in the set (see Methods). N and S, numbers of nonsynonymous (PN) and
synonymous (PS) sites, respectively. Pdel1, Pdel2, and Benign, numbers of possibly damaging, probably damaging and
benign variants (Adzhubei et al., 2010), respectively.

Balanced windows
POP  # sites   N         S         Pdel1   Pdel2   Benign   Cscore
YRI  635(225)  401(124)  234(102)  23(9)   30(9)   333(93)  7.50(9.26)
LWK  651(236)  400(122)  251(114)  29(12)  28(8)   332(9)   7.34(9.07)
MSL  636(234)  391(124)  245(110)  20(7)   29(10)  327(93)  7.52(9.29)
GWD  652(248)  406(136)  246(112)  29(13)  35(16)  328(93)  7.88(9.91)
ESN  630(235)  387(125)  243(110)  28(13)  24(8)   321(91)  7.46(9.43)
TSI  612(205)  388(113)  224(92)   27(13)  33(15)  317(75)  7.62(9.93)
GBR  561(171)  361(100)  200(71)   24(9)   28(14)  302(70)  7.25(9.31)
FIN  553(166)  351(91)   202(75)   22(8)   24(10)  297(65)  7.02(8.80)
CEU  559(172)  353(96)   206(76)   19(6)   28(14)  297(67)  7.17(9.29)
IBS  607(196)  384(107)  223(89)   26(11)  33(15)  314(70)  7.46(9.58)

Table 2: Statistics for protein-coding SNPs within balanced windows.
Windows are defined in Bitarello et al. (n.d.). Numbers in parentheses as in
Table 1. Cscore, average scaled C score for all SNPs in the set (see Methods).
N and S, numbers of nonsynonymous (PN) and synonymous (PS) sites, respectively.
Pdel1, Pdel2, and Benign, numbers of possibly damaging, probably damaging and
benign variants, respectively (Adzhubei et al., 2010).

Ratio of damaging to synonymous polymorphisms

To estimate the number of damaging alleles in the "balanced" and control sets
of SNPs, we used the PolyPhen-2 (Adzhubei et al., 2010) annotation provided
in Kircher et al. (2014). PolyPhen-2 classifies nonsynonymous variants as
benign, possibly damaging (Pdel1), or probably damaging (Pdel2). We thus
defined the ratio of damaging to synonymous SNPs (Lohmueller et al., 2008) as:

\[ P_{del}/P_S = \frac{P_{del1} + P_{del2}}{P_S} \tag{2} \]

This estimate quantifies the proportion of SNPs most likely to be
deleterious. PolyPhen-2 is a protein-level metric which is, by definition, restricted to
nonsynonymous sites. Moreover, several nonsynonymous SNPs (~20% in the
1000 Genomes dataset) lack PolyPhen-2 annotation, so the Pdel/PS statistic was
calculated based on a smaller set of SNPs than PN/PS (Tables 1 and 2) and
has higher variance.
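
The two ratios defined in Equations 1 and 2 are simple functions of SNP counts. The sketch below, written in Python purely for illustration (the function name is hypothetical and this is not the code used in the study), evaluates them for the YRI balanced-gene counts reported in Table 1.

```python
def load_ratios(n_nonsyn, n_syn, n_possibly, n_probably):
    """Return (PN/PS, Pdel/PS) for one set of protein-coding SNPs."""
    if n_syn <= 0:
        raise ValueError("at least one synonymous SNP is required")
    pn_ps = n_nonsyn / n_syn                      # Equation 1
    pdel_ps = (n_possibly + n_probably) / n_syn   # Equation 2
    return pn_ps, pdel_ps


# YRI counts for balanced genes (Table 1): N, S, Pdel1, Pdel2
pn_ps, pdel_ps = load_ratios(1871, 1552, 197, 262)
print(f"PN/PS = {pn_ps:.3f}; Pdel/PS = {pdel_ps:.3f}")
```

Both ratios share the synonymous count as denominator, so they are directly comparable within a set of SNPs; only the numerator changes from all nonsynonymous SNPs to those annotated as damaging.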

CADD (C score)

The C score was provided by the CADD tool (Kircher et al., 2014). It is
a composite measure that integrates information from more than 60 different
annotations to quantify the effects of a mutation, and has been shown to
differentiate levels of deleteriousness among groups of SNPs (Kircher et al.,
2014). As argued by Kircher et al. (2014), protein-level metrics such as
PolyPhen-2 (Adzhubei et al., 2010) are the best performing individual
annotations, but are restricted to nonsynonymous variants, whereas conservation
scores such as GERP++ (Davydov et al., 2010) cannot distinguish between
nonsynonymous and stop-loss variants at a given position.

We used both the scaled and the raw C scores provided for the 1000G phase
3 SNPs. The scaled C scores range from 1 to 99 (higher values indicating higher
deleteriousness potential). Although counting the number of SNPs above a
certain threshold could be used as a strategy, we compared
distributions of C scores between groups in order to increase power (Kircher
et al., 2014). Throughout the discussion, when C scores are presented they
refer to the average C score of all N + S SNPs contained in a given set of SNPs
(balanced or control). We restricted the analyses to these sites so as to make the
results comparable to those of the two other metrics for measuring
deleteriousness.

\[ C_{score} = \frac{\sum_{i=1}^{N} C_{N_i} + \sum_{i=1}^{S} C_{S_i}}{N + S} \tag{3} \]

where N and S are the numbers of nonsynonymous and synonymous SNPs in the set
(so that the set contains n = N + S SNPs in total), and C_{N_i} and C_{S_i} are
the C scores of the i-th N and S SNPs in the set, respectively. The overall C
score used in our analyses is thus an average of the C scores of all N + S SNPs
contained within the "balanced" and control sets of SNPs.
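
A minimal Python sketch of the averaging in Equation 3 follows. It assumes the scaled C scores of the nonsynonymous and synonymous SNPs of a set are available as two lists; the function name and the example scores are hypothetical and serve only to illustrate the calculation.

```python
def mean_c_score(nonsyn_scores, syn_scores):
    """Average C score over all N + S protein-coding SNPs in a set."""
    total = len(nonsyn_scores) + len(syn_scores)
    if total == 0:
        raise ValueError("the set of SNPs is empty")
    return (sum(nonsyn_scores) + sum(syn_scores)) / total


# Hypothetical scaled C scores, for illustration only
example_mean = mean_c_score([20.0, 10.0], [5.0, 1.0])
print(example_mean)
```

Because the two sums share a single denominator, this is the plain mean over the pooled SNPs rather than a mean of the two class means.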

Scaled C scores are very useful for identifying a top ranked SNP and easier
to interpret, but raw C scores offer superior resolution for comparison of
distributions of scores between groups of variants (Kircher et al., 2014). Thus, we
also compared the distribution of raw C scores for SNPs within balanced genes
to those of the re-sampled sets of controls, and performed a one-tailed
Mann-Whitney U-test (the alternative hypothesis being that SNPs from balanced genes
have higher raw C scores) to compare the balanced SNPs' distribution to each
control replicate (significance threshold 5%).
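
For readers unfamiliar with the test, the following Python sketch shows one way to compute a one-tailed Mann-Whitney U-test by hand. It is an assumption-laden illustration, not the implementation used in the study: it uses the normal approximation with a tie correction and no continuity correction, and the function name and input scores are hypothetical.

```python
import math


def mann_whitney_u_greater(x, y):
    """One-tailed Mann-Whitney U-test; H1: values in x tend to exceed those in y.

    Returns (U statistic for x, approximate p-value).
    """
    n1, n2 = len(x), len(y)
    n = n1 + n2
    pooled = sorted([(v, 0) for v in x] + [(v, 1) for v in y])
    rank_sum_x = 0.0
    tie_term = 0.0
    i = 0
    while i < n:
        j = i
        while j < n and pooled[j][0] == pooled[i][0]:
            j += 1                       # pooled[i:j] is one group of ties
        mid_rank = (i + 1 + j) / 2.0     # mean of ranks i+1 .. j
        rank_sum_x += mid_rank * sum(1 for k in range(i, j) if pooled[k][1] == 0)
        t = j - i
        tie_term += t ** 3 - t
        i = j
    u = rank_sum_x - n1 * (n1 + 1) / 2.0
    mean_u = n1 * n2 / 2.0
    var_u = n1 * n2 / 12.0 * ((n + 1) - tie_term / (n * (n - 1)))
    if var_u == 0:
        return u, 1.0                    # all values identical: no evidence
    z = (u - mean_u) / math.sqrt(var_u)
    p_value = 0.5 * math.erfc(z / math.sqrt(2))  # upper tail of the normal
    return u, p_value


# Hypothetical raw C scores for a "balanced" set and one control replicate
u_stat, p_val = mann_whitney_u_greater([3, 4, 5], [1, 2])
print(u_stat, p_val)
```

With sets of thousands of SNPs, as in Table 1, the normal approximation is expected to be adequate; for very small samples an exact test would be preferable.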

Re-sampling control SNPs

To test the hypothesis that balanced genes are enriched for deleterious SNPs, we
compared the three statistics that measure deleteriousness between the SNPs
contained within the genes under balancing selection and a random sample
with the same number of SNPs, but chosen from genes with no evidence for
balancing selection (controls) (Table 1). We used the distributions of control SNPs
to obtain an empirical p-value for the SNPs from balanced genes, defined as the
fraction of re-sampled sets whose deleteriousness statistics are
more extreme (i.e., higher) than those of the SNPs from the balanced genes.
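The empirical p-value defined above reduces to a simple proportion. The sketch below is illustrative only; the function name and the example values are hypothetical, and "more extreme" is taken to mean strictly higher, as in the text.

```python
def empirical_p_value(observed, control_values):
    """Fraction of re-sampled control statistics that are higher than the observed one."""
    if not control_values:
        raise ValueError("at least one control replicate is required")
    return sum(value > observed for value in control_values) / len(control_values)


# Hypothetical statistic for balanced genes and four control replicates
print(empirical_p_value(1.0, [0.5, 0.9, 1.1, 1.2]))  # 0.5
```

With 1,000 replicates, the smallest non-zero p-value this estimator can return is 0.001.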

Previous studies have shown, and we confirm here (see Results), that there
is a strong negative correlation between allele frequency and the probability of a variant
being annotated as deleterious. Because genes/regions under balancing
selection are enriched for SNPs at intermediate frequencies (i.e., higher
heterozygosities), this effect will itself result in a marked difference between
measures of load for balanced genes and the genomic background. In order
to control for this effect and guarantee that differences in load are attributable
specifically to the effects of linked selection, we compared the proportion of
deleterious variants in balanced genes/windows to those of the control sets of
SNPs after matching the control SNPs to the frequencies of those in the “balanced”
set. Next we describe the procedure used to re-sample a set of SNPs
while controlling for frequency.

Once the protein-coding SNPs from balanced genes had been selected, we
followed an approach similar to the one adopted by Subramanian (2016): (1) we
took the MAF (minor allele frequency) of each protein-coding SNP; (2) we
calculated the log (base 10) of the MAF (logMAF), because in humans the MAF
follows an approximately exponential distribution, i.e., a large proportion of alleles have very
low MAF (e.g. Abecasis et al., 2012; Subramanian, 2016); (3) we divided the
SNPs into bins according to the logMAF (in our case, 9 bins of width 0.25 in
logMAF, starting at logMAF = -2.4 and encompassing a MAF range of
0.00398-0.5). Given that we did not expect balancing selection to favour derived
or ancestral variants preferentially (Bitarello et al., n.d.), using the MAF is
appropriate and does not require further filtering of data in order to infer ancestral
and derived states. Importantly, once the set of SNPs from balanced genes was
divided into bins of logMAF, we were able to quantify the relative contribution
of each bin to the total, thus allowing the re-sampled sets of SNPs to match the
site-frequency spectrum (SFS) of the SNPs observed in balanced genes. We
re-sampled from the control SNPs a set following the proportions of each logMAF
bin and the total number of SNPs within the set of target genes (Table 1).
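The binning and frequency-matched re-sampling described above can be sketched as follows. This is an illustration under stated assumptions rather than the authors' R implementation: the bin edges are inferred from the MAF range and bin width given in the text, sampling is assumed to be without replacement within a replicate, and all identifiers and example MAFs are hypothetical.

```python
import math
import random
from collections import Counter, defaultdict

# Assumed bin layout: 9 bins of width 0.25 in log10(MAF), starting at
# log10(0.00398), which is approximately -2.4.
LOG_MAF_START = -2.4
BIN_WIDTH = 0.25
N_BINS = 9


def log_maf_bin(maf):
    """Return the 0-based logMAF bin of a SNP, or None if it falls outside the bins."""
    if not 0.0 < maf <= 0.5:
        return None
    index = math.floor((math.log10(maf) - LOG_MAF_START) / BIN_WIDTH)
    return index if 0 <= index < N_BINS else None


def resample_matched_controls(target_mafs, control_mafs, rng):
    """Draw control SNP ids so that each logMAF bin holds as many SNPs as the target set.

    target_mafs: MAFs of the SNPs in balanced genes.
    control_mafs: dict mapping (hypothetical) control SNP ids to their MAFs.
    rng: a random.Random instance, so that replicates are reproducible.
    """
    target_counts = Counter(log_maf_bin(maf) for maf in target_mafs)
    target_counts.pop(None, None)  # ignore target SNPs outside the binned range
    pool = defaultdict(list)
    for snp_id, maf in control_mafs.items():
        bin_index = log_maf_bin(maf)
        if bin_index is not None:
            pool[bin_index].append(snp_id)
    sample = []
    for bin_index, count in sorted(target_counts.items()):
        if len(pool[bin_index]) < count:
            raise ValueError(f"not enough control SNPs in bin {bin_index}")
        sample.extend(rng.sample(pool[bin_index], count))
    return sample


rng = random.Random(1)
targets = [0.005, 0.005, 0.3]
controls = {
    f"ctrl{i}": maf
    for i, maf in enumerate([0.006] * 10 + [0.1] * 10 + [0.25] * 10)
}
replicate = resample_matched_controls(targets, controls, rng)
print(Counter(log_maf_bin(controls[snp_id]) for snp_id in replicate))
```

Repeating the call 1,000 times with the same `rng` would produce the set of control replicates used for the empirical p-values.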

This re-sampling scheme was designed to account for the fact that all of the
genetic load measurements adopted here correlate negatively with allele
frequency (see Results) and that there is an enrichment of intermediate-frequency
alleles among the balanced genes (Bitarello et al., n.d.). Each SNP was sampled
independently of its location, provided that it was protein-coding, autosomal
and matched the logMAF proportions calculated based on the balanced genes.
This means that each control set had the same number of protein-coding SNPs
as the balanced genes’ set and a similar SFS (Table 1), but those SNPs were not
necessarily distributed across the same number of genes as the SNPs in the
balanced set.

Excluding adaptively maintained SNPs from load estimates

Our goal is to test the hypothesis that SNPs within genes under balancing selec-
tion have a higher proportion of deleterious variants than expected for a set of
control SNPs. However, an excess of deleterious or functional variation could
be an outcome of the direct effects of balancing selection. For example, the un-
usually high proportion of nonsynonymous polymorphism in HLA genes is a
consequence of balancing selection directly on functional sites, and not of dele-
terious variants accumulating as a byproduct of selection on a specific site (e.g.
Hughes and Nei, 1988; Bitarello et al., 2015).

In order to separate the direct effects of balancing selection from those due to
hitch-hiking, we also calculated the genetic load measurements after excluding
the sites which are the strongest candidates for balancing selection (thus justi-
fying the assumption that the remainder of the highly polymorphic variants are
present due to linkage with this selected variant). This approach relies on the
assumption that one or at most a few sites are the targets of balancing
selection within each gene.

For each balanced gene, the putatively selected SNPs were identified by lo-
cating within the outlier window with evidence for balancing selection (as re-
ported in Bitarello et al., n.d.) the site with the highest heterozygosity. This SNP
was then excluded from the set of SNPs for the balanced genes. When a bal-
anced gene had more than one balanced window, we chose the one with the
most extreme signature of LTBS (Bitarello et al., n.d.).


Heterozygosity was calculated as follows:

\[
H_i^{*} = 2 \cdot [\mathrm{MAF} \cdot (1 - \mathrm{MAF})] \qquad (4)
\]

where i is each SNP position and MAF is the minor allele frequency for that
position. For this exclusion step, all coding SNPs were considered, not only N
and S, and most of the excluded SNPs were intronic (∼ 90% per population,
Table 3). When the most extreme heterozygosity in a gene was shared by multiple
SNPs, all were removed. The average number of SNPs removed per gene,
across all populations, was 3.8. Also, few genes had one or more N or
S SNPs removed by this filter (average 14 genes out of 213, per population).
Overall, 71 unique SNPs (29 N and 42 S) were removed across all populations
(average 11.5 N and 14.6 S per population, Table 3).
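The exclusion step can be illustrated with a short sketch of Equation 4 and the tie rule described above. The function names and SNP identifiers are hypothetical, and ties are detected by exact equality of the computed heterozygosities, which holds when the tied SNPs have identical MAF values.

```python
def heterozygosity(maf):
    """Expected heterozygosity of a biallelic SNP (Equation 4)."""
    return 2.0 * maf * (1.0 - maf)


def drop_highest_heterozygosity(window_mafs):
    """Split the SNPs of one outlier window into (kept, removed).

    window_mafs: dict mapping (hypothetical) SNP ids to their MAFs.
    All SNPs sharing the maximum heterozygosity are removed.
    """
    if not window_mafs:
        return {}, set()
    het = {snp_id: heterozygosity(maf) for snp_id, maf in window_mafs.items()}
    highest = max(het.values())
    removed = {snp_id for snp_id, value in het.items() if value == highest}
    kept = {s: m for s, m in window_mafs.items() if s not in removed}
    return kept, removed


kept, removed = drop_highest_heterozygosity({"snpA": 0.5, "snpB": 0.5, "snpC": 0.1})
print(sorted(removed))  # ['snpA', 'snpB']
```

Because heterozygosity increases monotonically with MAF up to 0.5, selecting the SNP with the highest heterozygosity is equivalent to selecting the one with the highest MAF.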

POP All Intron N S 3’UTR 5’UTR Splice


YRI 762 681 18 19 39 4 1
LWK 663 599 8 9 41 5 1
MSL 664 599 15 13 33 4 0
GWD 693 633 12 14 28 5 1
ESN 692 621 16 19 32 3 1
TSI 763 704 8 16 33 2 0
GBR 841 761 13 15 46 4 2
FIN 740 669 8 14 43 6 0
CEU 754 700 5 15 33 1 0
IBS 715 670 12 12 14 7 0

Table 3: Classes of SNPs with the highest heterozygosity(ies) per gene. For
each gene, the SNP(s) with the highest heterozygosity(ies) were removed (All).
Only SNPs contained within the outlier windows (Bitarello et al., n.d.) of those
genes were considered. Splice, splice-site positions; 3’UTR and 5’UTR, 3 prime and
5 prime UTR regions; N, S and Intron, nonsynonymous, synonymous and intronic sites.

Given that the assumption that a balanced gene/window has only one or a few
sites that are the actual targets of selection is not reasonable for the HLA
genes, where several sites are targets of balancing selection (Hughes and Nei,
1988; Bitarello et al., 2015), we performed our analyses under two scenarios:
either keeping the HLA genes or removing them. For these analyses we removed
the following HLA genes, which have prior strong evidence for long-term
balancing selection and are included among the outlier genes in Bitarello et al.
(n.d.): HLA-A, HLA-B, HLA-C, HLA-DRB1, HLA-DRB5, HLA-DPA1, HLA-DPA2,
HLA-DPB1, HLA-DPB2, HLA-DQB1, HLA-DQB2, HLA-DQA1, HLA-DQA2. Their
removal changes the proportion of target SNPs that fall into each bin of logMAF,
with the SFS becoming less enriched for intermediate frequency variants.

All analyses and figures were generated in R (R Development Core Team,
2009) and scripts are available: calculation of allele frequencies per population
for 1000 Genomes Phase 3 data (https://github.com/deboraycb/1000Gstats_inR/);
load analyses, re-sampling and all figures
(https://github.com/bbitarello/deleterious_mutations, access can be provided upon request).

Results

The site frequency spectrum of balanced genes

Here, we consider as "balanced genes" the set described as having the strongest
signatures of LTBS in Bitarello et al. (n.d.). Balancing selection shifts the SFS
towards intermediate frequencies (Andrés et al., 2009; Bitarello et al., n.d.). Al-
though the selected sites may only comprise a subset of the entire locus, bal-
ancing selection changes levels of polymorphism at adjacent sites (neutral and
non-neutral), thus generating a signature that allows selected genes to be
detected (Bitarello et al., n.d.). It is entirely plausible, and likely, that only portions
of those genes are the targets of balancing selection, and this provided us with
an appropriate dataset that has the putative site(s) that were selected and their
immediate vicinities, which show signatures of LTBS (as seen in Figure 5 of
Bitarello et al., n.d.).

The SFS of the balanced genes is shifted towards intermediate frequencies
when compared to the genomic background distribution (Figure 2). Specifically,
balanced genes have about 10% fewer variants in the lower bins of frequency
(MAF ≤ 0.0025). In other words, the balanced genes have a different SFS from
that of the background, and an appropriate re-sampling of control SNPs needs
to account for this property of the balanced genes.

The SNPs in the balanced genes were binned according to their MAFs (Ta-
ble 4), and their distribution into the bins was used for the re-sampling proce-
dure for the controls. Because signatures of LTBS are expected to be restricted
to narrow windows (Andrés et al., 2009; Andrés, 2011; Bitarello et al., n.d.;
Charlesworth, 2006) and here we consider the entire gene, this shift towards
intermediate frequencies is modest.

Measures of deleteriousness correlate negatively with allelic frequency

Previously, Lohmueller et al. (2008) reported that SNPs classified as "damaging"
according to PolyPhen had significantly lower mean derived allele frequencies
(DAF) than "benign" SNPs, with the "probably damaging" category having the
lowest mean DAF.

Pop  Bin 1          Bin 2          Bin 3          Bin 4          Bin 5          Bin 6          Bin 7          Bin 8          Bin 9
       n MAF          n MAF          n MAF          n MAF          n MAF          n MAF          n MAF          n MAF          n MAF
YRI  905 0.5-0.5    277 0.9-0.9    317 1.4-1.8    348 2.3-3.7    292 4.2-6.9    320 7.4-12.5   361 13-22.2    416 22.7-39.3  187 39.8-50
LWK  977 0.5-0.5    386 1-1        355 1.5-2      283 2.5-3.5    335 4-7        282 7.6-12.1   390 12.6-22.2  383 22.7-39.4  196 40-50
MSL  894 0.6-0.6    304 1.12-1.2   208 1.8-1.8    333 2.3-3.5    364 4.1-7      309 7.6-12.3   390 12.9-22.3  390 22.9-39.4  195 40-50
GWD  974 0.4-0.4    304 0.9-0.9    395 1.3-2.2    269 2.6-3.5    322 4-6.6      311 7.1-12.4   368 12.8-22.1  404 22.6-39.4  203 39.8-50
ESN  790 0.5-0.5    303 1-1        306 1.5-2      268 2.5-3.5    334 4-7.1      300 7.6-12.1   372 12.6-22.2  388 22.7-39.4  200 39.9-50
TSI  820 0.5-0.5    173 0.9-0.9    167 1.4-1.9    175 2.3-3.7    171 4.2-7      177 7.5-12.1   329 12.6-22    390 22.4-39.7  194 40.2-50
GBR  586 0.55-0.55  162 1.1-1.1    150 1.6-2.2    145 2.7-3.8    157 4.4-6.6    194 7.1-12.1   313 12.6-22    363 22.5-39.6  229 40.1-50
FIN  409 0.5-0.5    143 1-1        171 1.5-2      153 2.5-3.5    196 4-7        203 7.6-12.1   304 12.6-22.2  370 22.7-39.4  304 39.9-50
CEU  644 0.5-0.5    147 1-1        138 1.5-2      160 2.5-3.5    194 4-7.1      165 7.6-12.1   325 12.6-22.2  376 22.7-39.4  204 39.9-50
IBS  825 0.5-0.5    173 0.9-0.9    176 1.4-1.8    158 2.3-3.7    191 4.2-7      176 7.5-12.1   311 12.6-22    392 22.4-39.7  210 40.2-50

Table 4: Binning of SNPs according to their minor allele frequencies. SNPs in
balanced genes were binned according to their logMAF values (see Methods).
Values correspond to the set of balanced genes with HLA SNPs included. When
HLA SNPs were excluded, the bin proportions changed (not shown). n, number
of SNPs (N + S) in the bin; MAF, minimum and maximum MAF values observed
within the bin. MAFs are given in %.

Figure 2: Site frequency spectrum for protein-coding (N + S) SNPs. SNPs come
from balanced (blue) and control genes (gray). Rectangles identify the sections
that are zoomed in in (B) and (C).

More generally, nonsynonymous variants are expected to have lower
frequencies (Brandvain and Wright, 2016), because purifying selection will have
had enough time to purge deleterious variants from the population (assuming
most nonsynonymous variants are deleterious). This is in fact the pattern seen
for human populations, where there is a vast excess of low frequency
nonsynonymous variants (Casals et al., 2013; Fu et al., 2012; Tennessen et al., 2012).
Moreover, C scores also tend to be higher for lower frequency variants (Kircher
et al., 2014), although it has been shown that C score distributions have power
to differentiate lead-SNPs and tag-SNPs from GWAS, which by definition have
similar frequencies (Kircher et al., 2014).

We confirmed these patterns with the 1000 Genomes data we analyzed (Fig-
ure 3). When dividing all protein-coding SNPs (whether they fall into balanced
genes or not) into bins of minor allele frequencies (logMAF, see Methods), a
clear negative correlation is observed between MAF and the three statistics:
PN /PS , Pdel /PS and Cscore (Figure 3).
All of the aforementioned observations indicate the importance of control-
ling for allele frequencies when analyzing the load of deleterious mutations
among balanced genes. Lack of a control would cause higher load among con-
trol SNPs than for the SNPs from balanced genes, as a consequence of an en-
richment in intermediate frequency variants in balanced genes (Bitarello et al.,
n.d.) and the fact that deleterious variants are more abundant in the lower bins
of MAF (Adzhubei et al., 2010; Kircher et al., 2014; Lohmueller et al., 2008; Sub-
ramanian, 2016).


Pop    PN/PS - Cscore    Pdel/PS - Cscore    PN/PS - Pdel/PS
YRI    0.23              0.55                0.61
LWK    0.27              0.61                0.62
MSL    0.28              0.59                0.65
GWD    0.28              0.61                0.65
ESN    0.26              0.58                0.63
TSI    0.28              0.60                0.67
GBR    0.27              0.60                0.65
FIN    0.27              0.56                0.65
CEU    0.28              0.59                0.68
IBS    0.36              0.67                0.70

Table 5: Pearson’s correlations between load statistics. Each value corresponds
to 1,000 re-samplings (controls) for the balanced SNPs (gene-based). All
correlations are highly significant (p-value < 2.6e−13).

Although our three measures of deleteriousness differ in the criteria used to
define/quantify how damaging variants are, we find an overall high correlation
among these measures (Figure 4). PN/PS and Pdel/PS are highly correlated in
all populations (cor ≥ 0.61 and p < 2.6e−13, Table 5), and Cscore and Pdel/PS as
well (cor ≥ 0.55, Table 5). PN/PS and Cscore have a weaker correlation (cor
between 0.23 and 0.36), albeit also highly significant (Figure 4A and Table 5).
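For reference, the Pearson correlation between two load statistics across re-sampled sets can be computed as sketched below. The function name and the example values are hypothetical; in practice each list would hold one value per control replicate.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equally long sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return sxy / math.sqrt(sxx * syy)


# Hypothetical per-replicate statistics
pn_ps = [0.90, 1.00, 1.10, 1.20]
pdel_ps = [0.30, 0.35, 0.33, 0.42]
print(round(pearson_r(pn_ps, pdel_ps), 3))
```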

These results indicate the importance of using a re-sampling approach that
controls for differences in frequencies of SNPs in balanced genes and genome-wide
(Table 4). Re-sampling sets of genes which are not under balancing selection,
without controlling for the SFS, would lead the control to be relatively enriched
for low frequency variants, a factor which would obscure the identification of
possible differences between the balanced and control SNPs.

Figure 3: Boxplot of load statistics by each bin of MAF. All autosomal N + S SNPs were included here. Bins were
defined based on the log (base 10) of the MAF of each variant (see Methods). y-axis: (A) PN/PS, (B) Pdel/PS, (C) Cscore.

Figure 4: Correlations between load summary statistics. (A) Cscore and PN/PS (cor=0.80, for all populations
combined, Pearson, p-value < 2.2e−16), (B) Pdel/PS and Cscore (cor=0.91, p < 2.2e−16) and (C) Pdel/PS and PN/PS (cor=0.91,
p < 2.2e−16). Each color corresponds to one population and each point refers to the metric estimated for a
re-sampled set of SNPs. Lines represent linear regressions for each population. For correlation values per population,
see Table 5.

Extreme values for HLA SNPs

In the scan for balancing selection of Bitarello et al. (n.d.), HLA genes were
over-represented among the category of selected genes, and showed extremely
strong evidence for selection, with 12 classical HLA genes present among the
213 genes with strongest signatures of balancing selection. This observation,
together with the fact that HLA genes are likely to carry several sites under the
direct effects of balancing selection, led us to single them out for an exploratory
analysis.

We initially compared the load statistics for the set of SNPs from balanced
genes to a group comprising all control SNPs in the genome. Note that here
there was no re-sampling involved; we simply compared the statistics between
the different groups. We evaluated the influence of SNPs from HLA genes on
the load summary statistics by excluding all HLA SNPs from the set of SNPs
contained in balanced genes.

The set of SNPs from the balanced genes have different values for the three
statistics when compared to the control set of SNPs: PN /PS values are higher
(Figure 5), while Pdel /PS and C score values are lower for the SNPs from all
balanced genes (Figure 5). Interestingly, the HLA set of SNPs follows the same
pattern as the SNPs from the set of all balanced genes (which include the HLA
SNPs), although in a much stronger way. When we examine HLA genes alone,
we find that the average PN /PS for these loci is almost 2-fold greater than that
of control SNPs (Figure 5). Similarly, the reduction in Pdel /PS and C score in
HLA compared to control SNPs is also about two-fold (Figure 5). Moreover,
when HLA SNPs are removed from the set of SNPs from balanced genes, the
remaining set tends to have values closer to controls, albeit still different (Figure 5).

The extreme patterns of HLA SNPs for these three statistics could be driving
the patterns seen in the SNPs from balanced genes, of which they are part.
The PN/PS result is conservative in the sense that, although one could expect
lower values for HLA genes (fewer low frequency and thus fewer nonsynonymous
variants), it is actually almost two-fold higher (Figure 5). This observation
likely results from the high number of sites that are actively maintained by
balancing selection in these genes. It is well established that balancing selection
has targeted several sites in HLA genes (e.g. Hughes and Nei, 1988; Yang and
Swanson, 2002; Bitarello et al., 2015), which could at least partially explain the
patterns observed for PN/PS. The mechanisms driving diversity in the other
balanced genes, however, remain largely unknown, and it is reasonable to
assume that only one (or a few) site(s) has been targeted by balancing selection in
a given gene (e.g. in Leffler et al., 2013), i.e., that the HLA represents the
exception rather than the rule.

However, there is no obvious biological explanation as to why Pdel/PS and
C scores should be reduced in HLA compared to control SNPs. This suggests
that the reason might be related to allelic frequencies. Moreover, the HLA genes
are enriched not only in intermediate frequency alleles (which are less likely to
be deleterious) but also in number of polymorphic sites (Robinson et al., 2013).
Thus, although HLA genes represent only 12 out of 213 balanced genes
considered here, given the high SNP density of the MHC region as a whole, they
account for a considerable proportion of the SNPs from balanced genes (Table
1). For this reason, in the remaining analyses we estimated load for the set of SNPs from
balanced genes with and without the inclusion of HLA SNPs. Because the HLA
SNPs change the shape of the SFS, we also re-sampled the control SNPs
accordingly.

Increased nonsynonymous to synonymous SNPs in balanced genes

Firstly, we note that PN /PS values are on average higher for European than
for African populations (Figure 6). This confirms the finding that European
populations have a higher proportion of nonsynonymous variants than African
populations (Lohmueller et al., 2008). Since our re-sampling was done by popu-
lation, we intrinsically take this into account as seen in the control distributions
(Figure 6).

The PN/PS values of SNPs from balanced genes are significantly higher than
controls (p < 0.01; Figure 6A). These results are not explained by the HLA
genes: although their removal reduces the balanced PN/PS (as expected), while
slightly increasing the control values (because the target SFS changes, thus
resulting in fewer intermediate frequency variants in the controls), the increase in
PN/PS of balanced genes with respect to the controls remains significant, albeit
less extreme (Figure 6B, P < 0.01 for all African populations and GBR and FIN).
One European population has marginally significant values (P = 0.06, TSI) and for
two others (CEU and IBS) PN/PS falls within the control distribution after the
removal of HLA SNPs (P > 0.24, Figure 6B).

The increased PN/PS for balanced genes is also not likely driven by the
putative target(s) of balancing selection in these genes: when PN/PS for the balanced
genes was estimated after removing the SNP(s) with the highest
heterozygosity(ies) for each gene (see Methods), results remain qualitatively similar (Figure
6). In fact, most of the SNPs excluded from the balanced genes were intronic
(∼90% for all populations, Table 3), and on average 11.5 N and 14.6 S SNPs
were removed from each population (Table 3), which makes the PN/PS
estimates increase slightly in most cases (Figure 6).

Figure 5: Boxplots of load statistics for sets of SNPs. Balanced, SNPs contained in the balanced genes; balanced.no.HLA,
same as the previous category, but excluding 12 HLA genes (see Methods); control, all SNPs not contained in balanced
genes; HLA, a set of 12 HLA genes (see Methods). Each boxplot is composed of 10 data points, each one corresponding
to a population (see Methods). y-axis: (A) PN/PS, (B) Pdel/PS, (C) Cscore.

These results suggest there is an excess of nonsynonymous variants within
the set of balanced genes, and that this excess, at least for African populations,
is not entirely explained by the presence of HLA genes (Figure 6B) nor by the
presence of one or a few SNPs per gene that have very high heterozygosities
(and are presumably the actual targets of selection). For the European
populations the removal of HLA SNPs decreased the estimates in relation to controls
in a more pronounced way.

Because nonsynonymous variants are not necessarily deleterious (they might
also be neutral), we also investigated two other measures of load that directly
quantify deleteriousness.

Increased proportion of damaging to synonymous SNPs in balanced genes

Again, we note that European populations have higher balanced and control
values than African populations, as seen previously (Lohmueller et al., 2008).
When comparing Pdel /PS estimates for SNPs from balanced genes and control
sets of SNPs, a similar pattern emerges, although less extreme than the one
seen for PN /PS : balanced genes tend to have higher load compared to controls
(p < 0.05) for all populations, except CEU and IBS (p > 0.14; Figure 7A).

The removal of HLA SNPs only slightly changes the Pdel /PS , and the quali-
tative relationship between them does not change, with all populations except
CEU and IBS having p < 0.05 (Figure 7). This differs from what was observed
for PN/PS, where the removal of HLA SNPs made the estimates of load for
balanced genes less different from controls, although still highly significant.


Moreover, removing the SNPs with the highest heterozygosity per gene only
slightly increases the Pdel/PS estimates, compatible with the observation that
few of the SNPs removed by this filter are nonsynonymous, and that in most
populations they are fewer than the synonymous SNPs removed (Table
3).

The results for Pdel /PS are in agreement with what was observed for PN /PS ,
suggesting that the patterns observed for PN /PS are driven by deleterious, and
not adaptive or neutral nonsynonymous variants.

Increased C-scores in balanced genes

Average scaled C scores yield qualitatively different results with respect to anal-
yses based on PN /PS and Pdel /PS . Firstly, for African populations and for TSI
the load estimates for balanced genes are very elevated (p < 0.01) compared
to controls, similarly to what was seen for PN /PS (Figure 8A). However, this
pattern is not observed for the other European populations, with p-values ap-
proaching one for CEU and IBS (Figure 8A). Interestingly, in this case, the re-
moval of HLA SNPs enhances the signal: African values become even more
extreme and all the European populations acquire extreme values when com-
pared to controls as well (p < 0.01 for all populations, Figure 8B).

Given that the most appropriate set of SNPs for testing our hypothesis of
load is the set without HLA genes and without the SNPs with the highest
heterozygosities per gene (Figure 8B, pink triangle), it is plausible that the
reduction or loss of significance of the load in the set of all balanced genes (in Africa
and Europe, respectively) is due to the excess of adaptive variants (from HLA
or other genes) present in the complete set, which tend to have lower C scores.
Note that the control distributions in Figures 8A and 8B are very similar, and
what changes dramatically is the load estimate for the balanced genes. Also,
the scaled C scores are PHRED-scaled, ranging from 1 to 99, with the top 10%
most deleterious variants having scores above 10, the top 1% above 20, and
so on (Kircher et al., 2014). Thus this difference is likely to be even greater than
what is conveyed by this analysis.

Figure 6: PN/PS for balanced genes. A) Including HLA SNPs; B) removing HLA SNPs. Blue circle, estimate for all
protein-coding SNPs in the set; pink triangle, estimate after removal of SNP(s) with highest heterozygosity in each
gene (see Methods). **, p < 0.01; *, p < 0.05. Reported p-values are for the estimates with all SNPs.

Figure 7: Pdel/PS for balanced genes. A) Including HLA SNPs; B) removing HLA SNPs. Blue circle, estimate for all
protein-coding SNPs in the set; pink triangle, estimate after removal of SNP(s) with highest heterozygosity in each
gene (see Methods). **, p < 0.01; *, p < 0.05. Reported p-values are for the estimates with all SNPs.

Figure 8: Average scaled Cscore for balanced genes. A) Including HLA SNPs; B) removing HLA SNPs. Blue circle,
estimate for all protein-coding SNPs in the set; pink triangle, estimate after removal of SNP(s) with highest
heterozygosity in each gene (see Methods). **, p < 0.01; *, p < 0.05. Reported p-values are for the estimates with all SNPs.

We also looked at the raw C scores, which provide more power in tests
comparing sets of SNPs (Kircher et al., 2014). For African populations, balanced
genes have raw C score distributions with significantly higher values than the
controls (Mann-Whitney U test, one-tailed, P < 0.05) for more than 70% of the
control re-samplings (Table 6), except for GWD, for which only 251 out of 1,000
controls have significantly lower C scores than the balanced genes. For Europe,
TSI has 13% of the controls with significantly lower C scores than the balanced
genes, and all other populations have at most 5 such cases (Table 6).

Pop    HLA included    HLA excluded
       (P < 0.05)      (P < 0.05)
YRI    959             1,000
LWK    805             1,000
MSL    727             1,000
GWD    251             1,000
ESN    995             1,000
TSI    130             999
GBR    5               995
FIN    1               999
CEU    0               844
IBS    0               1,000

Table 6: Raw C score comparison between balanced genes and controls. For
each comparison, the alternative hypothesis was that balanced genes had
higher raw C score values than the control distribution (Mann-Whitney U test,
one-tailed). Values refer to the number of comparisons (out of 1,000 control
distributions) for which the null hypothesis (distributions are not different) was
rejected (P < 0.05).


However, when we perform the same analyses for the balanced genes after
the removal of HLA SNPs, balanced genes have higher raw C scores for all
comparisons in African populations, and for at least 995 comparisons in all
European populations except CEU, for which 844 comparisons are significant
(Table 6).

Discussion and Conclusions

Increased genetic load in balanced genes

The study of slightly deleterious mutations is one of the pillars of population
genetics (Kondrashov, 1995). The fate of mutations is highly dependent on the
effective population size (Ne) and its relationship to the selection coefficients. As
a consequence, weakly deleterious mutations might reach moderate frequencies
in small, but not in large populations, where selection is more effective.
Moreover, linkage to selected variants is also a major determinant of the fate of a
deleterious mutation (Hill and Robertson, 1966; Cutter and Payseur, 2013).

The fates of strongly deleterious mutations are mostly deterministic in terms
of mutation rates and selection coefficients, i.e., when s >> 1/(2Ne) (where s is
the selection coefficient). The fate of very slightly deleterious mutations (i.e.,
almost neutral ones) is, however, mostly stochastic, driven by genetic drift
(Kondrashov, 1995). But what happens when considerably strong selection (positive,
negative, balancing) on a site impacts the sites in its vicinity? Here, we
examined how balancing selection shapes the accumulation of deleterious mutations
in the vicinity of its targets.

We showed that genes with strong signatures of long-term balancing
selection have increased levels of nonsynonymous to synonymous polymorphisms,
damaging to synonymous polymorphisms, and also elevated deleteriousness
scores (Kircher et al., 2014), when compared to controls. We took special care
in controlling for the fact that balanced genes have a site-frequency spectrum
which is different from the genomic background, with proportionally more
intermediate frequency variants, and we also accounted for the fact that within
the balanced genes there are sites directly under balancing selection, which
could be incorrectly assigned to deleterious variants according to some
classification methods.

Because HLA genes are known as an example of multi-locus balancing
selection, i.e., several positions within the HLA genes have been targets of
selection (Hughes and Nei, 1988; Yang and Swanson, 2002; Bitarello et al., 2015),
it seemed plausible that their considerable contribution to the set of balanced
genes could be responsible for the overall patterns we observed. Therefore,
in all analyses we compared results for balanced genes including and
excluding HLA genes. This approach is conservative, given that not all SNPs in HLA
genes are expected to be direct targets of selection. A less drastic solution would
be to single out the exons of HLA genes which harbour most, if not all, of the
balanced polymorphisms in those genes and exclude only the SNPs contained
in those exons (Klein, 1986).

Additionally, we removed from each balanced gene the SNP(s) likely to
be the targets of balancing selection, in order to filter from our datasets the
potentially conflicting patterns generated by advantageous and deleterious
variants within the balanced genes. Importantly, in this approach we only excluded
the SNP(s) with the highest heterozygosity among those contained in a window
with a very strong signature of LTBS as reported in Bitarello et al. (n.d.), thus
increasing the chance that the actual selected site was filtered out.

The challenges of quantifying genetic load

Establishing the damaging potential of a variant is a formidable task in itself
(Grimm et al., 2015). Quantifying the genetic load and comparing it between
groups (populations, SNPs, genes, etc.) is also challenging, as demonstrated by
the great number of published contrasting results regarding genetic load in
humans (reviewed in Henn et al., 2015). Therefore, it is important to justify the
methodology used here.

We chose to use statistics based on counts of deleterious variants (PN/PS
and Pdel/PS) and on deleteriousness scores (C score). With PN/PS and Pdel/PS we
quantified the proportions of nonsynonymous and potentially damaging variants,
respectively, for balanced genes and control groups. PN simply records whether
the polymorphism changes the encoded amino acid, and is thus unbiased with
respect to the frequency at which the polymorphism is segregating. Nevertheless,
PN counts comprise neutral, deleterious and advantageous variants and are
therefore not straightforward to interpret. Pdel is more accurate than PN as a
measure of deleteriousness, but it is restricted to nonsynonymous sites and is
only available for a subset of the nonsynonymous variants (∼80% for the genome,
but ∼70% for balanced genes), which reduces its power, particularly in small
sets of SNPs. Moreover, PolyPhen-2 has been shown to overfit its training data
and not to generalize well to other datasets (Grimm et al., 2015). Neither of
these approaches incorporates the frequency of the deleterious mutations when
classifying SNPs.
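
The counting logic behind PN/PS and Pdel/PS can be illustrated with a minimal
sketch. The record layout (the 'effect' and 'damaging' keys) and the toy data
are hypothetical and serve only to show how the two ratios are formed; this is
not the pipeline used in this chapter.

```python
from collections import Counter

def count_ratios(snps):
    """Return (PN/PS, Pdel/PS) for an iterable of SNP annotations.

    Each SNP is a dict with a hypothetical 'effect' key ('syn' or 'nonsyn')
    and, for nonsynonymous SNPs, an optional 'damaging' key
    (True, False, or None when no prediction is available).
    """
    counts = Counter()
    for snp in snps:
        if snp["effect"] == "syn":
            counts["S"] += 1
        elif snp["effect"] == "nonsyn":
            counts["N"] += 1
            # Only variants explicitly predicted as damaging enter Pdel.
            if snp.get("damaging") is True:
                counts["del"] += 1
    if counts["S"] == 0:
        raise ValueError("no synonymous SNPs: ratios are undefined")
    return counts["N"] / counts["S"], counts["del"] / counts["S"]

# Toy gene set (illustrative values, not real data).
toy_gene_set = [
    {"effect": "syn"},
    {"effect": "syn"},
    {"effect": "nonsyn", "damaging": True},
    {"effect": "nonsyn", "damaging": False},
    {"effect": "nonsyn", "damaging": None},  # no prediction available
]
pn_ps, pdel_ps = count_ratios(toy_gene_set)
```

Note that allele frequency appears nowhere in the calculation, and that the
variant without a prediction contributes to PN but can never contribute to Pdel.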

One possible frequency-based measure would be the ratio of heterozygosities
at nonsynonymous and synonymous sites (πN/πS), but this ratio is particularly
sensitive to recent bottlenecks (reviewed in Brandvain and Wright, 2016). This
is because, after a bottleneck, nonsynonymous variation recovers more quickly
than synonymous variation (because there are more nonsynonymous sites), and so
an elevated πN/πS following a bottleneck could be (wrongly) interpreted as
relaxed selection. Do et al. (2015) and Simons et al. (2014) use and recommend a
direct estimate of the number of deleterious (or nonsynonymous) mutations (e.g.
PN/PS and Pdel/PS), which is robust to violations of demographic equilibrium
(reviewed in Brandvain and Wright, 2016; Henn et al., 2015).
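
For contrast with the count-based statistics, the following sketch shows how a
heterozygosity ratio such as πN/πS could be computed from allele frequencies. It
assumes the standard per-site estimator with a finite-sample correction; the
function name, its inputs and the toy numbers are illustrative assumptions, and
this statistic was not used in our analyses.

```python
def pi_per_site(freqs, n_chrom, n_sites):
    """Nucleotide diversity per site from per-SNP allele frequencies.

    freqs    -- sample allele frequencies of the SNPs in one site class
    n_chrom  -- number of sampled chromosomes (assumed equal for all SNPs)
    n_sites  -- number of surveyed sites of that class (hypothetical input)
    """
    correction = n_chrom / (n_chrom - 1)  # finite-sample correction
    return correction * sum(2 * p * (1 - p) for p in freqs) / n_sites

# Toy numbers, not real data.
pi_n = pi_per_site([0.5], n_chrom=10, n_sites=100)
pi_s = pi_per_site([0.5, 0.5], n_chrom=10, n_sites=50)
ratio = pi_n / pi_s
```

Because every SNP is weighted by 2p(1 - p), the ratio depends directly on the
frequency spectrum, which is what makes it sensitive to recent demography.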

Our approach here is thus conservative. Previous studies have shown that in
comparisons between African and out-of-Africa populations, little or no
difference is observed when the number of putatively deleterious mutations is
counted (Henn et al., 2016; Tennessen et al., 2012; Do et al., 2015; Lohmueller,
2014), whereas on average the out-of-Africa populations are more homozygous for
the putatively deleterious mutations (Lohmueller et al., 2008), a difference
that count-based statistics such as PN/PS and Pdel/PS cannot detect.

In addition to this “SNP counting” approach, we also compared the distribution
of deleteriousness among all SNPs within the balanced genes and controls using
the C score (Kircher et al., 2014). The C score is defined for all N + S sites
and combines desirable features from other annotations, but is also negatively
correlated with allele frequency (as are the other two statistics used here).
The C score does not suffer from the poor generalization properties of
PolyPhen-2, because the support vector machine was trained on an independent
dataset (Grimm et al., 2015; Kircher et al., 2014). However, CADD (C score) was
trained on high-frequency variants, and although C scores are available for all
1000 Genomes Phase 3 variants (Kircher et al., 2014), its accuracy in
differentiating the deleteriousness of low-MAF variants is likely to be lower.

Sheltered load and hitch-hiking

Our observations of increased load in genes with signatures of long-term
balancing selection can be explained by two possible mechanisms: 1) as a
manifestation of a "sheltered load" (Oosterhout, 2009) and 2) as an effect of
linkage of deleterious variants to the balanced polymorphisms, i.e., a
hitch-hiking effect (Mendes, 2013; Lenz et al., 2016).

According to the sheltered load model, regions with an excess of heterozygosity
would "protect" rare recessive variants from being "seen" by purifying
selection, thus contributing to their permanence in the population and at higher
frequencies than expected if they were not linked to balanced polymorphisms.
This model has been invoked to explain the dynamics of deleterious mutations
near the S loci of Arabidopsis and Solanum (Stone, 2004; Roux et al., 2013) and
the excess of disease associations in the MHC region (e.g. Oosterhout, 2009).

On the other hand, recent work (Lenz et al., 2016) showed through simula-
tions that deleterious mutations are expected to accumulate in the vicinity of a
locus under balancing selection. The simulation framework assumed that sev-
eral sites were under balancing selection in an HLA-like gene – as is the case
for classic HLA genes (e.g. Hughes and Nei, 1988; Yang and Swanson, 2002;
Bitarello et al., 2015).

Moreover, the simulations assumed symmetrical overdominance and used realistic
parameters from the actual HLA genes and/or human demography, such as effective
population size, mutation and recombination rates, and even average selection
coefficients for these loci. Finally, loci around the selected HLA-like locus
were modelled to be either evolving neutrally or under purifying selection. With
these simulations, the authors demonstrated that such a scenario leads to an
overall reduction of diversity around the HLA-like locus, but the variants that
"survive" tend to segregate at higher frequencies, demonstrating the potential
for balancing selection in HLA genes to increase the frequency of deleterious
variants around the HLA loci. Lenz et al. (2016) confirm this prediction with
empirical data, reporting an excess of damaging (Adzhubei et al., 2010) variants
in non-HLA loci of the MHC region.

Importantly, the simulations of Lenz et al. (2016) assume an additive model,
not a recessive one. Thus, their observations suggest that a mechanism other
than the "sheltered load" is responsible for the increased load in the vicinity
of HLA genes, and this is likely to be the hitch-hiking effect mentioned above.

References

Abecasis, G. R., A. Auton, L. D. Brooks, M. A. DePristo, R. M. Durbin, R. E. Handsaker, H. M.
Kang, G. T. Marth, and G. A. McVean (2012). “An integrated map of genetic variation from
1,092 human genomes.” In: Nature 491 (7422), pp. 56–65.
Adzhubei, I. A., S. Schmidt, L. Peshkin, V. E. Ramensky, A. Gerasimova, P. Bork, A. S. Kon-
drashov, and S. R. Sunyaev (2010). “A method and server for predicting damaging missense
mutations.” In: Nature methods 7 (4), pp. 248–9.
Albrechtsen, A., I. Moltke, and R. Nielsen (2010). “Natural selection and the distribution of
identity-by-descent in the human genome.” In: Genetics 186 (1), pp. 295–308.
Alkan, C. et al. (2009). “Personalized copy number and segmental duplication maps using next-
generation sequencing”. In: Nature Genetics 41 (10), pp. 1061–1067.
Andrés, A. M. (2011). “Balancing Selection in the Human Genome”. In: eLS, pp. 1–8.
Andrés, A. M. et al. (2009). “Targets of balancing selection in the human genome.” In: Molecular
Biology and Evolution 26 (12), pp. 2755–64.
Auton, A. et al. (2015). “A global reference for human genetic variation”. In: Nature 526 (7571),
pp. 68–74.
Betancourt, A. J. and D. C. Presgraves (2002). “Linkage limits the power of natural selection in
Drosophila.” In: Proceedings of the National Academy of Sciences of the United States of America
99 (21), pp. 13616–20.
Bitarello, B. D., C. de Filippo, J. C. Teixeira, D. Meyer, and A. M. Andrés. “Uncovering targets
of balancing selection in the human genome”. In: in prep.
Bitarello, B. D., R. D. S. Francisco, and D. Meyer (2015). “Heterogeneity of dN/dS Ratios at the
Classical HLA Class I Genes over Divergence Time and Across the Allelic Phylogeny”. In:
Journal of Molecular Evolution 82 (1), pp. 38–50.
Brandvain, Y. and S. I. Wright (2016). “The Limits of Natural Selection in a Nonequilibrium
World”. In: Trends in Genetics 32 (4), pp. 1–10.
Casals, F. et al. (2013). “Whole-Exome Sequencing Reveals a Rapid Change in the Frequency
of Rare Functional Variants in a Founding Population of Humans”. In: PLoS Genetics 9 (9).
Ed. by S. M. Williams, e1003815.
Charlesworth, B. (2012). “The effects of deleterious mutations on evolution at linked sites”. In:
Genetics 190 (1), pp. 5–22.
Charlesworth, B., M. Nordborg, and D. Charlesworth (1997). “The effects of local selection, bal-
anced polymorphism and background selection on equilibrium patterns of genetic diversity
in subdivided population”. In: Genetical Research 70, pp. 155–174.
Charlesworth, D. (2006). “Balancing selection and its effects on sequences in nearby genome
regions.” In: PLoS Genetics 2 (4), pp. 379–384.
Chun, S. and J. C. Fay (2011). “Evidence for hitchhiking of deleterious mutations within the
human genome.” In: PLoS genetics 7 (8), e1002240.
Connallon, T. and A. G. Clark (2013). “Antagonistic versus nonantagonistic models of balancing
selection: characterizing the relative timescales and hitchhiking effects of partial selective
sweeps.” In: Evolution; international journal of organic evolution 67 (3), pp. 908–17.
Corona, E., J. T. Dudley, and A. J. Butte (2010). “Extreme evolutionary disparities seen in positive
selection across seven complex diseases”. In: PLoS ONE 5 (8), pp. 1–10.
Cutter, A. D. and B. A. Payseur (2013). “Genomic signatures of selection at linked sites: unifying
the disparity among species.” In: Nature reviews. Genetics 14 (4), pp. 262–74.
Danecek, P. et al. (2011). “The variant call format and VCFtools”. In: Bioinformatics 27 (15),
pp. 2156–2158.
Davydov, E. V., D. L. Goode, M. Sirota, G. M. Cooper, A. Sidow, and S. Batzoglou (2010).
“Identifying a High Fraction of the Human Genome to be under Selective Constraint Using
GERP++”. In: PLoS Computational Biology 6 (12). Ed. by W. W. Wasserman, e1001025.
R Development Core Team (2009). R: A language and environment for statistical computing.
Vienna, Austria: R Foundation for Statistical Computing. ISBN: 3-900051-07-0. URL:
http://www.r-project.org.
Do, R., D. Balick, H. Li, I. Adzhubei, S. Sunyaev, and D. Reich (2015). “No evidence that selection
has been less effective at removing deleterious mutations in Europeans than in Africans”.
In: Nature Genetics 47 (2), pp. 126–131.
Eyre-Walker, A. and P. D. Keightley (1999). “High genomic deleterious mutation rates in ho-
minids”. In: Nature 397 (6717), pp. 344–347.
Fay, J. C. (2011). “Weighing the evidence for adaptation at the molecular level.” In: Trends in
genetics : TIG 27 (9), pp. 343–9.
Fu, W. et al. (2012). “Analysis of 6,515 exomes reveals the recent origin of most human protein-
coding variants”. In: Nature 493 (7431), pp. 216–220.
Grimm, D. G. et al. (2015). “The evaluation of tools used to predict the impact of missense
variants is hindered by two types of circularity”. In: Human Mutation 36 (5), pp. 513–523.
Henn, B. M., L. R. Botigué, C. D. Bustamante, A. G. Clark, and S. Gravel (2015). “Estimating the
mutation load in human genomes”. In: Nature Reviews Genetics 16 (6), pp. 333–343.
Henn, B. M. et al. (2016). “Distance from sub-Saharan Africa predicts mutational load in diverse
human genomes”. In: Proceedings of the National Academy of Sciences 113 (4), E440–E449.
Hill, W. G. and A. Robertson (1966). “The effect of linkage on limits to artificial selection”. In:
Genetics research 8 (2), pp. 269–294.
Hodgkinson, A., F. Casals, Y. Idaghdour, J.-C. Grenier, R. D. Hernandez, and P. Awadalla (2013).
“Selective constraint, background selection, and mutation accumulation variability within
and between human populations.” In: BMC genomics 14, p. 495.
Hughes, A. L. and M. Nei (1988). “Pattern of nucleotide substitution at major histocompatibility
class I loci reveals overdominant selection”. In: Letters to Nature 335 (8), pp. 167–170.
Kiezun, A. et al. (2013). “Deleterious Alleles in the Human Genome Are on Average Younger
Than Neutral Alleles of the Same Frequency”. In: PLoS Genetics 9 (2), pp. 1–12.
Kircher, M., D. M. Witten, P. Jain, B. J. O’Roak, G. M. Cooper, and J. Shendure (2014). “A general
framework for estimating the relative pathogenicity of human genetic variants”. In: Nature
genetics 46 (3), pp. 310–315.
Klein, J. (1986). Natural History of the Major Histocompatibility Complex. New York: John Wiley &
Sons, Ltd.
Kondrashov, A. S. (1995). “Contamination of the genome by very slightly deleterious mutations:
why have we not died 100 times over?” In: Journal of Theoretical Biology 175 (4), pp. 583–594.
Leffler, E. M. et al. (2013). “Multiple Instances of Ancient Balancing Selection Shared Between
Humans and Chimpanzees”. In: Science 339 (6127), pp. 1578–1582.
Lenz, T. L., V. Spirin, D. M. Jordan, and S. R. Sunyaev (2016). “Excess of Deleterious Mutations
around HLA Genes Reveals Evolutionary Cost of Balancing Selection”. In: bioRxiv, pp. 1–30.
Lohmueller, K. E. (2014). “The distribution of deleterious genetic variation in human popula-
tions.” In: Current opinion in genetics & development 29C, pp. 139–146.
Lohmueller, K. E. et al. (2008). “Proportionally more deleterious genetic variation in European
than in African populations.” In: Nature 451 (7181), pp. 994–997.
Mendes, F. (2013). Natural selection on HLA and its effects on adjacent regions of the genome. Tech.
rep. Universidade de São Paulo. URL: http://www.teses.usp.br/teses/disponiveis/
41/41131/tde-02082013-161104/pt-br.php.
Morton, N. E., J. F. Crow, and H. J. Muller (1956). “An Estimate of the Mutational Damage in
Man From Data on Consanguineous Marriages”. In: Proceedings of the National Academy of
Sciences of the United States of America 42 (11), pp. 855–863.
Nielsen, R. (2005). “Molecular Signatures of Natural Selection”. In: Annual Review of Genetics
39 (1), pp. 197–218.
Oosterhout, C. van (2009). “A new theory of MHC evolution: beyond selection on the immune
genes.” In: Proceedings of the Royal Society of London. Series B, Biological Sciences 276 (1657),
pp. 657–65.
Peischl, S., I. Dupanloup, M. Kirkpatrick, and L. Excoffier (2013). “On the accumulation of dele-
terious mutations during range expansions.” In: Molecular ecology 22 (24), pp. 5972–82.
Peischl, S. and L. Excoffier (2015). “Expansion load: recessive mutations and the role of standing
genetic variation”. In: Molecular Ecology 24 (9), pp. 2084–2094.
Robinson, J., J. A. Halliwell, H. McWilliam, R. Lopez, P. Parham, and S. G. E. Marsh (2013). “The
IMGT/HLA database.” In: Nucleic Acids Research 41, pp. 1222–7.
Roux, C., M. Pauwels, M. V. Ruggiero, D. Charlesworth, V. Castric, and X. Vekemans (2013).
“Recent and ancient signature of balancing selection around the S-Locus in Arabidopsis
halleri and A. lyrata”. In: Molecular Biology and Evolution 30 (2), pp. 435–447.
Schierup, M. H., D. Charlesworth, and X. Vekemans (2000). “The effect of hitch-hiking on genes
linked to a balanced polymorphism in a subdivided population”. In: Genetical research 76 (01),
pp. 63–73.
Simons, Y. B., M. C. Turchin, J. K. Pritchard, and G. Sella (2014). “The deleterious mutation load
is insensitive to recent population history”. In: Nature Genetics 46 (3), pp. 220–224.
Stone, J. L. (2004). “Sheltered load associated with S-alleles in Solanum carolinense.” In: Heredity
92 (4), pp. 335–42.
Subramanian, S. (2012). “The abundance of deleterious polymorphisms in humans.” In: Genetics
190 (4), pp. 1579–83.
— (2016). “Europeans have a higher proportion of high-frequency deleterious variants than
Africans”. In: Human Genetics 135 (1), pp. 1–7.
Sunyaev, S., V. Ramensky, I. Koch, W. Lathe 3rd, A. S. Kondrashov, and P. Bork (2001). “Predic-
tion of deleterious human alleles”. In: Hum Mol Genet 10 (6), pp. 591–597.
Tennessen, J. A. et al. (2012). “Evolution and Functional Impact of Rare Coding Variation from
Deep Sequencing of Human Exomes”. In: Science 337 (6090), pp. 64–69.
Tishkoff, S. A. and S. M. Williams (2002). “Genetic analysis of African populations: human evo-
lution and complex disease.” In: Nature Reviews Genetics 3 (8), pp. 611–621.
Yang, Z. and W. J. Swanson (2002). “Codon-Substitution Models to Detect Adaptive Evolution
that Account for Heterogeneous Selective Pressures Among Site Classes”. In: Molecular Bi-
ology and Evolution 19 (1), pp. 49–57.

Final Considerations and Perspectives

Here, I recapitulate the questions I set out to address in the Introduction
(page 49), summarize the conclusions reached in the investigations of Chapters 1
and 2, and discuss perspectives arising from this work.

Balancing selection in the human genome

Development and evaluation of a new method for detecting signatures of
balancing selection

In Chapter 1, we describe a new method for detecting signatures of long-term
balancing selection (LTBS) in humans: Non-Central Deviation (NCD). The method
comprises two statistics: NCD1 uses only the allele frequency spectrum, whereas
NCD2 also uses information from the divergence between humans and chimpanzees.
Combining two signatures of LTBS in NCD2 gives it greater power than NCD1.
Nevertheless, the performance of NCD1 is comparable to that of other methods
commonly used to detect genes or regions under natural selection, and we
recommend its use in species for which divergence data from closely related
species are not available.

Through neutral simulations and simulations with selection, based on a detailed
model of human demography, we demonstrate that both statistics have high power
to detect signatures of LTBS in African and European populations for sequences
that are not very long (<= 6,000 base pairs). We evaluated which combination of
possible implementations maximizes the power of the statistics and found that,
for selection that arose at least 3 million years ago, NCD2 has the greatest
power for sequences of 3,000 base pairs. For more recent selection (that is,
selection that began less than 1 million years ago), NCD1 has higher power than
NCD2, but because the values are generally low, on the order of 30-40% (for a
false positive rate of 5%), we emphasize that both statistics are suited to
detecting balancing selection that has persisted for at least 3 million years
(in humans). In addition, we show that NCD2 outperforms other existing methods:
Tajima's D (Tajima, 1989), the HKA test (Hudson et al., 1987), the T1 and T2
tests (DeGiorgio et al., 2014), and a combination of NCD1+HKA.

A distinguishing feature of our statistics relative to existing ones is that a
“target frequency” can be defined, from which the deviation of allele
frequencies is calculated. That is, the statistic can be calculated assuming
that balanced polymorphisms are segregating at frequencies other than 0.5. In
the power evaluation, we considered the frequencies 0.3, 0.4 and 0.5. The
simulations showed that the gain in power of NCD1 and NCD2 over the other
statistics is greatest when equilibrium frequencies other than 0.5 are
simulated. We thus show that NCD2 has greater power than the other statistics
used to detect signatures of balancing selection.
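
The role of the target frequency can be made concrete with a small sketch. The
formula is not restated in this chapter, so the code below assumes that NCD is
the root-mean-square deviation of minor allele frequencies from the target
frequency, and that NCD2 additionally counts each human-chimpanzee fixed
difference as an informative site with frequency 0. The function name and its
arguments are hypothetical; Chapter 1 remains the reference for the actual
definition.

```python
import math

def ncd(mafs, target_freq=0.5, n_fixed_differences=0):
    """Root-mean-square deviation of minor allele frequencies from target_freq.

    With n_fixed_differences == 0 this mimics NCD1 (SNPs only). Passing the
    number of fixed differences mimics NCD2 under the assumption stated above:
    each fixed difference is an informative site with frequency 0.
    """
    values = list(mafs) + [0.0] * n_fixed_differences
    if not values:
        raise ValueError("window has no informative sites")
    return math.sqrt(sum((p - target_freq) ** 2 for p in values) / len(values))

# Toy window: two SNPs at frequency 0.5 and two fixed differences.
ncd1_value = ncd([0.5, 0.5], target_freq=0.5)
ncd2_value = ncd([0.5, 0.5], target_freq=0.5, n_fixed_differences=2)
```

Under this reading, lower values indicate a closer match to the target
frequency, and fixed differences pull the value upwards, which is why a deficit
of divergence reinforces the signal in NCD2.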

It is important to stress that power was evaluated under a demographic model
for human populations that is quite complex and realistic (Gravel et al., 2011).
This suggests that the observation that our method outperforms the others is not
restricted to an unrealistic scenario, but is based on the polymorphism patterns
expected in humans.

Prevalence of LTBS in the human genome

There is still considerable controversy about the importance of balancing
selection as a microevolutionary process shaping human genetic diversity (Andrés
et al., 2009; Bubb et al., 2006; Leffler et al., 2013). Is it rare, involving
few genomic regions? Does it act more often on protein-coding or on regulatory
regions? What functions do genes under balancing selection perform? Is the
regime shared among distinct populations? The method we developed is motivated
by the aim of helping to resolve these questions.

The analyses in Chapter 1 show that, for humans, NCD2 performs better than
NCD1. We therefore calculated NCD2 for windows of 3,000 base pairs across the
whole genome, using genomic data from 4 populations (two African, two European)
from the 1000 Genomes Project. Because in practice the frequency at which
balanced polymorphisms are segregating is unknown, we calculated NCD2 for three
target frequencies (0.3, 0.4, 0.5) and combined the results. We took special
care to filter the data a priori, removing regions that could show signatures
similar to those expected under balancing selection but for other reasons.
These include regions with tandem repeats, large chromosomal duplications,
regions without orthology to chimpanzee, and regions that are not unique.

To determine the likely targets of balancing selection, we combined two
strategies: (1) a simulation-based significance criterion, in which a window is
considered significant if its NCD2 value for a given target frequency is lower
than that observed in 10,000 neutral simulations with the same number of
informative sites (resulting in about 0.50% of the windows per population,
considering the union of all target frequencies); and (2) a ranking criterion in
the genomic distribution, after applying a correction that accounts for the
number of informative sites in the window. With the second criterion, we define
as outliers the windows in the tail of the empirical distribution (0.05%), which
is essentially a subset of the windows obtained with the first criterion.
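
The two criteria can be sketched as follows. This is a simplified illustration
under stated assumptions: the first criterion is read as "lower than every
matched neutral simulation", the scores passed to the second function are
assumed to have already been corrected for the number of informative sites (the
correction itself is described in Chapter 1 and is not reproduced here), and
all function and variable names are hypothetical.

```python
def is_significant(observed, neutral_values):
    """Criterion 1: the observed NCD2 is lower than every neutral simulation.

    neutral_values -- NCD2 values from simulations with the same number of
                      informative sites as the observed window (assumption).
    """
    return observed < min(neutral_values)

def tail_outliers(corrected_scores, tail=0.0005):
    """Criterion 2: window ids in the lower `tail` fraction of the genome.

    corrected_scores -- dict mapping window id to an already-corrected score.
    At least one window is always returned (a choice made for this sketch).
    """
    n_keep = max(1, round(len(corrected_scores) * tail))
    ranked = sorted(corrected_scores, key=corrected_scores.get)
    return set(ranked[:n_keep])

# Toy genome of 10,000 windows with increasing scores (not real data).
toy_scores = {f"w{i}": float(i) for i in range(10000)}
toy_outliers = tail_outliers(toy_scores)
```

With 10,000 windows and a tail of 0.05%, five windows are retained, which
illustrates how restrictive the outlier set is relative to the significant set.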

Finally, we report as outlier genes those that have at least one outlier window
(regardless of target frequency) in at least two populations of the same
continent. In this way, we expect to reduce false positives that could arise
from some property of the data from a particular population, given that, on the
timescale we investigate, populations from the same continent are expected to
have shared selective pressures as well as demographic history. Our results show
that at least 1% of the genes in the genome have extreme (outlier) signatures of
balancing selection, and possibly more, up to 8% (Table S8, Chapter 1). Even the
most conservative estimate of 1% is considerably higher than anything reported
to date. For example, only 0.4% of the 13,500 genes analysed by Andrés et al.
(2009) showed strong signatures of balancing selection. Our higher estimate
probably results from multiple factors: the high power of NCD2, the genomic data
used, the fact that even with all our filters we retained more than 18,000
autosomal genes in the analyses, and the small size of the windows analysed,
which increases the probability of detecting a signature of LTBS (Andrés, 2011;
Charlesworth, 2009).
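
The population-sharing rule used to report outlier genes can be sketched as
below. Population labels, gene names and the data layout are hypothetical
placeholders chosen for illustration only.

```python
from collections import defaultdict

def shared_outlier_genes(outliers_by_pop, continent_of, min_pops=2):
    """Genes with an outlier window in >= min_pops populations of one continent."""
    n_pops = defaultdict(lambda: defaultdict(int))  # gene -> continent -> count
    for pop, genes in outliers_by_pop.items():
        for gene in set(genes):
            n_pops[gene][continent_of[pop]] += 1
    return {gene for gene, per_continent in n_pops.items()
            if max(per_continent.values()) >= min_pops}

# Hypothetical labels: two African and two European populations.
continent_of = {"AFR1": "Africa", "AFR2": "Africa",
                "EUR1": "Europe", "EUR2": "Europe"}
outliers_by_pop = {
    "AFR1": {"geneA", "geneB"},
    "AFR2": {"geneA"},
    "EUR1": {"geneB", "geneC"},
    "EUR2": {"geneC"},
}
reported = shared_outlier_genes(outliers_by_pop, continent_of)
```

In this toy example geneB is an outlier in two populations but on different
continents, so it is not reported, whereas geneA and geneC are.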

Our study identified genes with extreme signatures and reported how prevalent
balancing selection may have been in human evolutionary history. Of the 213
genes with extreme signatures of LTBS, 30% had already been detected in a
previous scan, and others (at least four) in candidate gene studies. In other
words, about 70% of the genes we present are new to the balancing selection
literature. In addition, our more inclusive (i.e., less conservative) list of
1,470 genes with less extreme signatures indicates that LTBS may have been even
more common.

Sharing between continents

Most of the candidate windows are shared between at least two of the
populations analysed (87%), particularly between populations of the same
continent (78%). Even in cases where a gene does not meet the criterion of
belonging to both continents, the great majority have signatures in both
continents (that is, in at least 3 of the four populations analysed), with rare
exceptions. Finally, about 32% of the outlier genes (69 genes, Table 3, Chapter
1) are shared among the four populations.

Our findings confirm that the degree of sharing between populations of the same
continent is greater than between populations of different continents. This
observation can be interpreted as reflecting shared historical selective
pressures, as well as common demographic factors, which influence the genetic
variability available for balancing selection to act upon.

The fact that many of the targets (genes and windows) are shared between
continents is compatible with the timescale of the selective regime we
investigate (>= 3 million years). Even though, in recent human history, Africa
and Europe have diverged in several respects, in terms of both demographic
history and selective pressures, it is plausible that many targets of long-term
balancing selection have been maintained in both, and/or that they ceased to be
selected in one of the continents only recently, thus preserving the signatures
of LTBS to the present.

Characteristics of the candidate regions

Immune response

We observed an enrichment for certain functional categories among the
significant and outlier genes. About half of the enriched categories are related
to the immune response, broadly defined, and of these about half are directly
linked to antigen presentation by HLA molecules.

Evidence of balancing selection at several classical HLA class I and II genes
is abundant in the literature. Indeed, these genes are contained in the
significant windows and also in the outlier windows, which have the most extreme
signatures. We therefore investigated whether the HLA genes were driving the
enrichment of categories related to the immune system. After removing these
genes, no category remained enriched among the outlier genes. This demonstrates,
first, the strong influence of HLA genes on the most restrictive set of
candidate genes and, second, that the remaining dataset is small (on average 177
genes per population), which can reduce the power of tests designed to detect
enrichment of a functional class among the selected genes (even categories that
are not composed exclusively of HLA genes cease to be significant once HLA genes
are removed).

On the other hand, it is interesting that even after removal of the classical
HLA genes, some functional categories remained enriched among the significant
genes, some of them related to the immune system but involving other genes,
including non-classical HLA genes. In fact, 1/3 of the significant genes are
related to immune functions, even if they do not make up enriched categories.
Among the other categories we find, for example, “extracellular region”, which
confirms the observation that there tends to be an excess of genes related to
the extracellular matrix among the targets of LTBS in humans (reviewed in Key et
al., 2014).

We thus corroborate that the immune response is an important selective pressure
responsible for instances of balancing selection, and we also detect strong
signatures in some candidate genes related to reproduction. Five significant
genes are related to spermatogenesis, although there is no enrichment for the
category, and one of the 10 most extreme genes (C1orf101) is highly expressed in
testis; although its function is still unknown, there are indications that it
could be related to the CATSPER complex of Ca2+ channels, which are crucial for
the cell-surface signalling that leads to fertilization. In short, although
these two selective pressures of undeniable importance (defence of the organism
and reproduction) do not appear to underlie the majority of LTBS targets, they
are involved in more than 1/3 of the genes with the strongest signatures of
LTBS.

Reliability of the LTBS targets

Another gene category enriched among the candidates is that of olfactory
receptors. This gene family is difficult to analyse because it is the result of
several duplications. Our analyses do not allow us to exclude the hypotheses
that: a) the signatures of LTBS in these genes are caused by gene conversion
between paralogs located close to one another (a biological phenomenon, but one
distinct from balancing selection, that can generate a similar signature); b)
the excess of SNPs at intermediate frequencies in these genes results from reads
from distinct but highly similar genes having been mapped to a single gene in
the reference genome, thus artificially inflating the proportion of alleles at
intermediate frequencies.

It would be plausible to suppose that both artefacts (one caused by a
biological phenomenon, the other by bioinformatic problems with sequencing data)
could be occurring more generally among our candidate genes. To assess the
credibility of the candidate regions with respect to gene conversion between
paralogs located close to one another, we compared the distribution of the
number of paralogs located on the same chromosome as each candidate gene (which
makes non-homologous conversion possible) with the corresponding distribution
for all other genes. The distributions are essentially identical, and therefore
our candidate genes do not tend to have more paralogs on the same chromosome in
general. Because non-homologous gene conversion occurs between homologous genes
(paralogs), physical proximity is necessary.

Regarding undetected duplications, we took four of the 10 genes with the most
extreme signatures (among those that had never appeared in other studies of
balancing selection) and verified that few SNPs in these genes could be
artefacts generated by undetected duplications, and that even after excluding
such SNPs the genes in question still have extreme signatures of LTBS. Finally,
of the 213 genes with the most extreme signatures, only two are olfactory
receptors (Table 3, Chapter 1), which implies that: (1) it is plausible that
they are not false positives, given all the precautions we took, although we
cannot rule out this possibility; (2) our checks leave us confident that biases
of this kind are not a general feature of the candidate genes.

Regulatory regions versus protein-coding regions

In a scan for balanced polymorphisms shared between humans and chimpanzees,
Leffler et al. (2013) reported that, of 125 haplotypes shared between humans and
chimpanzees (interpreted as a signature of LTBS), 123 occur in non-genic genomic
regions. Combining this observation with the fact that few cases of gene targets
of LTBS have been described, it would be plausible to suppose that most target
sites of balancing selection are regulatory. In our study, we found that
although the significant windows represent only about 0.5% of the windows
analysed, they correspond to about 8% of protein-coding genes. On the other
hand, we did not detect proportionally more windows that include genes among the
significant windows than among the non-significant ones.

To explore whether LTBS tends to act on regulatory sites, we investigated whether there was an excess of SNPs with regulatory function in the significant windows. At first, we found that this excess, which is highly significant (Figure 7, Chapter 1), exists for SNPs with several regulatory functions, including that of eQTL. However, SNPs without an eQTL annotation but with other regulatory functions show no enrichment. Finally, we determined that, when only intermediate-frequency SNPs are considered, there is no enrichment for eQTLs, which shows that a positive eQTL annotation is positively correlated with the frequency of the SNP. Our finding shows that, because there is an excess of variants segregating at intermediate frequency in regions under balancing selection, enrichment analyses of genomic features whose detection is sensitive to allele frequency (as is the case for eQTLs) will be biased. Finally, we detected an excess of SNPs without any annotation of regulatory function in the candidate windows.
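
The frequency confound described in the paragraph above can be illustrated with a minimal sketch. It assumes a toy SNP table with made-up values and uses a one-sided hypergeometric test, which is not necessarily the test used in Chapter 1; the function names `hypergeom_sf` and `enrichment_p` are hypothetical.

```python
from math import comb


def hypergeom_sf(k, M, n, N):
    """P(X >= k) for N draws without replacement from M items, n of them marked."""
    upper = min(n, N)
    total = comb(M, N)
    return sum(comb(n, i) * comb(M - n, N - i) for i in range(k, upper + 1)) / total


def enrichment_p(snps, min_maf=0.0):
    """One-sided p-value for an excess of eQTL SNPs inside candidate windows.

    snps: list of (maf, is_eqtl, in_candidate_window) tuples.
    Only SNPs with maf >= min_maf enter the comparison.
    """
    kept = [s for s in snps if s[0] >= min_maf]
    n_total = len(kept)
    n_eqtl = sum(1 for _, eqtl, _ in kept if eqtl)
    n_cand = sum(1 for _, _, cand in kept if cand)
    n_both = sum(1 for _, eqtl, cand in kept if eqtl and cand)
    return hypergeom_sf(n_both, n_total, n_eqtl, n_cand)


# Toy data (hypothetical): eQTL status depends only on allele frequency,
# and candidate windows hold mostly intermediate-frequency SNPs.
snps = (
    [(0.05, i < 10, False) for i in range(200)]   # rare, background: 5% eQTL
    + [(0.05, False, True) for _ in range(2)]     # rare, candidate windows
    + [(0.40, i < 10, False) for i in range(20)]  # intermediate, background: 50% eQTL
    + [(0.40, i < 10, True) for i in range(20)]   # intermediate, candidate: 50% eQTL
)

p_naive = enrichment_p(snps)                 # all SNPs pooled
p_matched = enrichment_p(snps, min_maf=0.3)  # intermediate-frequency SNPs only
```

In this toy table the pooled comparison suggests a strong eQTL enrichment in candidate windows, whereas the comparison restricted to intermediate-frequency SNPs does not, mirroring the pattern reported above.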

Despite this lack of evidence for an enrichment of SNPs with regulatory functions among the most extreme windows, we detected a subtle but significant enrichment for monoallelic expression (MAE; Savova et al., 2016) among the 213 genes with the most extreme signatures of LTBS. A recent study (Savova et al., 2016) reported that a considerable proportion of human genes (~25%) show monoallelic expression.² The authors also report that, among genes with signatures of LTBS, there is an enrichment of genes with an MAE signature. We confirm this relationship with our finding of an excess of MAE genes among the most extreme genes.

As argued by Savova et al. (2016), this finding may indicate an evolutionary link between MAE and heterozygote advantage: many MAE genes encode proteins expressed on the cell surface, which modulate interactions between the cell and its surroundings, including other cells. Heterozygosity at an MAE site could lead to different alleles being inactivated in cells of the same tissue, decreasing the chance of a “monoculture” and thereby reducing the susceptibility of the tissue as a whole to infectious agents (Savova et al., 2016). On the other hand, Savova et al. (2016) note that it is entirely possible that monoallelic expression and the maintenance of diversity by selection are independent phenomena that target the same molecular components.

² For most genes in diploid organisms, both alleles are thought to be expressed simultaneously. For others, only one of the alleles, maternal or paternal, is expressed, while the other is inactivated. This pattern is achieved through epigenetic modifications, leading to monoallelic expression that is maintained across mitotic divisions.

Finally, for some targets of LTBS, cases have been reported in which a variant changes the tissue in which the gene is expressed. One example is the gene B4galnt2: in mice, a variant shifts expression from its usual site (intestinal epithelium) to another (vascular endothelium). The human ortholog of this gene (B4GALNT2) is one of our candidate genes, discussed in Chapter 1. Another example is HLA-G, also among our candidates, for which an extensive literature describes complex expression patterns (e.g., Tan et al., 2005). We therefore tested whether such patterns are recurrent among our genes and detected a significant excess of genes expressed in a single human tissue: 12 expressed in the adrenal gland and 25 expressed in the lung.

In summary, many of the targets of LTBS are protein-coding genes, most of which had never been reported in a study of balancing selection, and we found no evidence of an excess of regulatory functions among the windows that do not include genes. On the other hand, we found enrichment for genes with MAE and with tissue-specific expression, suggesting that there may indeed be an excess of LTBS targets with regulatory functions.


Deleterious variation in regions and genes with signatures of LTBS

In addition to balancing selection and positive selection, selection against deleterious mutations is a fundamental evolutionary process, capable of influencing quantitative variation in traits of ecological and medical importance. Given the constant influx of new deleterious mutations arising in populations, some will segregate transiently, resulting in a balance between mutation and selection that is influenced by the mutation rate, the effective population size, and the intensity of selection on the mutation. However, the contribution of such deleterious variants to traits shaped by quantitative genetic variation remains poorly understood (Mitchell-Olds et al., 2007). Several association studies in humans have identified polymorphisms segregating at intermediate frequencies that influence variation in complex traits (Mitchell-Olds et al., 2007).³
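
As an illustration of the mutation-selection balance mentioned above, the sketch below uses the classical deterministic approximations for an infinite population. It is added here only as a worked example: it ignores the effect of effective population size (drift) noted in the text, and the function and parameter names are hypothetical.

```python
from math import sqrt


def equilibrium_frequency(mu, s, h):
    """Approximate equilibrium frequency of a deleterious allele.

    mu: mutation rate toward the deleterious allele, per generation
    s:  selection coefficient against the deleterious homozygote
    h:  dominance coefficient (0 = fully recessive)

    Uses sqrt(mu / s) for a fully recessive allele and mu / (h * s)
    otherwise; the second form is only reasonable when h * s >> mu.
    """
    if h == 0:
        return sqrt(mu / s)
    return mu / (h * s)


# Same mutation rate and selection coefficient, different dominance.
q_recessive = equilibrium_frequency(mu=1e-8, s=0.01, h=0.0)
q_additive = equilibrium_frequency(mu=1e-8, s=0.01, h=0.5)
```

Under these assumptions a fully recessive allele is expected to segregate at a much higher frequency than a partially dominant one with the same selection coefficient, which is the intuition behind the sheltering of recessive variants.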

³ Here, I refer to traits thought to result from genetic variation in multiple genes and their interactions with environmental and behavioral factors (Mitchell-Olds et al., 2007).

In Chapter 2, we showed that genes with extreme signatures of balancing selection carry a higher genetic load than regions presumed to be evolving neutrally. The controls took into account the fact that the allele frequency spectrum of the balanced genes has proportionally fewer rare variants than the genomic control. We used three different metrics to quantify this excess: two of them directly count the number of potentially deleterious variants divided by the number of neutral variants, and the third assigns each variant a score that quantifies how deleterious it is. The distributions of these measures for balanced genes and controls could then be compared.
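
The two kinds of metric described above can be sketched as follows. This is a minimal illustration with made-up variants, not the procedure of Chapter 2: how a variant is classified as potentially deleterious or neutral, and which per-variant score is used, are assumptions left to the caller, and the names `deleterious_ratio` and `mean_score` are hypothetical.

```python
from statistics import mean


def deleterious_ratio(variants):
    """Potentially deleterious variants per neutral variant.

    Assumes the set contains at least one neutral variant.
    """
    n_del = sum(1 for v in variants if v["deleterious"])
    n_neutral = len(variants) - n_del
    return n_del / n_neutral


def mean_score(variants):
    """Average per-variant deleteriousness score."""
    return mean(v["score"] for v in variants)


# Hypothetical variants for one balanced gene and one control region.
balanced = [
    {"deleterious": True, "score": 20.0},
    {"deleterious": True, "score": 15.0},
    {"deleterious": False, "score": 5.0},
    {"deleterious": False, "score": 4.0},
]
control = [
    {"deleterious": True, "score": 12.0},
    {"deleterious": False, "score": 3.0},
    {"deleterious": False, "score": 2.0},
    {"deleterious": False, "score": 3.0},
]

ratio_balanced = deleterious_ratio(balanced)
ratio_control = deleterious_ratio(control)
score_balanced = mean_score(balanced)
score_control = mean_score(control)
```

In a real analysis these quantities would be computed per gene, and the distribution for balanced genes compared with that of frequency-matched controls.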

All three estimates are higher for the balanced genes than for the controls, with few exceptions. Moreover, when we removed the HLA genes, which have many adaptively maintained sites and could confound the interpretation of the estimates, the results were qualitatively similar. Finally, we assessed the impact of the potentially selected sites in the balanced genes on these estimates, and found that the observations hold even when those sites are removed.

In summary, three different metrics provide evidence of an excess of genetic load in the vicinity of regions with signatures of balancing selection. This result can be interpreted in two ways: (1) as evidence of sheltered load⁴; or (2) as evidence of hitchhiking of deleterious variants with the balanced polymorphisms, as explained in Figure 1 of Chapter 2.

Our results do not allow us to choose between these two explanations. However, Lenz et al. (2013) showed, through simulations of HLA genes with multiple selected sites and their adjacent regions, that even under an additive (non-recessive) model an increase in genetic load is expected in regions adjacent to the HLA genes, and that this effect decreases with distance from the genes. If we extrapolate these observations to other genes under balancing selection, it is plausible that the same occurs in the vicinity of other targets of balancing selection.

⁴ The idea that recessive deleterious variants will rarely be homozygous when they lie in the HLA genes, because the region has high heterozygosity. Such deleterious variants would thus be sheltered from purifying selection (Oosterhout, 2009).

To distinguish between these two scenarios, one could: (1) use simulations to verify whether the same patterns are observed under a model of balancing selection with a single selected site rather than multiple ones; (2) test whether there is an excess of disease associations in the genomic regions of genes under balancing selection; and (3) test whether the excess of genetic load is smaller (but still significant) for genes neighboring the balanced genes, and/or define genomic windows around the genes and verify whether genetic load decreases with distance from the target gene.
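
Option (3) above could be explored along the lines of the following sketch, which bins variants by distance from a target gene and reports the deleterious-to-neutral ratio per bin. It is only an outline of a possible analysis: the coordinates, bin size, variant classification, and the name `load_by_distance` are all hypothetical.

```python
def load_by_distance(variants, gene_start, gene_end, bin_size, n_bins):
    """Deleterious/neutral ratio in successive distance bins around a gene.

    variants: iterable of (position, is_deleterious) pairs.
    Bin 0 covers distances 1..bin_size from the gene, bin 1 the next
    bin_size positions, and so on, on both sides of the gene.
    Returns None for bins without neutral variants.
    """
    counts = [[0, 0] for _ in range(n_bins)]  # [deleterious, neutral]
    for pos, is_del in variants:
        if gene_start <= pos <= gene_end:
            continue  # skip variants inside the target gene itself
        dist = gene_start - pos if pos < gene_start else pos - gene_end
        b = (dist - 1) // bin_size
        if b < n_bins:
            counts[b][0 if is_del else 1] += 1
    return [d / n if n else None for d, n in counts]


# Hypothetical variants around a gene spanning positions 1000-2000.
variants = [
    (900, True), (800, False), (2100, True), (2300, False),   # within 500 bp
    (400, False), (2600, True), (2700, False), (300, False),  # 501-1000 bp
    (1500, True),   # inside the gene, ignored
    (5000, True),   # beyond the last bin, ignored
]
ratios = load_by_distance(variants, 1000, 2000, bin_size=500, n_bins=2)
```

A ratio that declines across bins would be compatible with the distance effect described for the HLA region, although by itself it would not distinguish between the two scenarios.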

Although some questions remain open, our work contributes to two stimulating fields of evolutionary biology: the study of the accumulation of deleterious mutations in the human genome and the study of the importance of balancing selection in human evolution.


Perspectives

Reconciling signatures of selection and phenotypes

“(...) genome-wide scans are a hatchet, whereas what we need now is a scalpel. In-depth follow-up studies of individual outlier loci can be one such scalpel, more precisely defining important population genetic parameters such as the timing and magnitude of selection, the geographic distribution of selected variation, the interaction of population demographic history, recombination, and selection in shaping patterns of variation, and the functional form of selection acting on individual outlier loci” (Akey, 2009)

Strictly speaking, evidence of adaptive evolution does not demonstrate that a given substitution or polymorphism is adaptive at the phenotypic level; rather, it indicates the region where such a variant is likely to be found. Population genetic studies can identify genes targeted by selection, i.e., genes that evolved non-neutrally over the course of human evolutionary history (Chapter 1), but they cannot, by themselves, provide information about the phenotypic traits that are the true targets of selection (Mitchell-Olds et al., 2007).

To date, a causal relationship between a polymorphism and a phenotype of interest has been established in very few cases because, in both research and clinical practice, the capacity to detect genetic variants far exceeds the ability to systematically assess their potential effects (Kircher et al., 2014). Despite this large gap, functional assays have become more common as new catalogs of genes and genomic regions that are candidate targets of balancing selection are published.


For example, in an elegant study, Chakraborty and Fry (2015) showed how a polymorphism in a pleiotropic gene, which encodes the enzyme aldehyde dehydrogenase, is actively maintained by differences in the alcohol concentration of fruit across the diverse environments occupied by Drosophila. The enzyme has two functions: metabolism of ethanol and of other aldehydes resulting from oxidative phosphorylation, the latter being the probable ancestral function and the former the derived one. The two variants have different fitness in different habitats, depending on the fly's diet. The authors identified an amino acid substitution responsible for the two enzyme variants and measured the efficiency of both variants on different substrates, thereby revealing the fitness of each variant in two types of environment.

Another example is the gene ERAP2, which encodes a protein involved in the antigen presentation pathway of MHC class I molecules. This gene shows signatures of LTBS in our study (Table S8, Chapter 1) and had already been identified as a candidate by Andrés et al. (2009). A later study (Andrés et al., 2010) demonstrated that balancing selection maintains two haplotypes, A and B, segregating at intermediate frequencies, and that one of them results in a truncated protein. The study also shows that homozygotes for this haplotype have reduced expression of MHC class I molecules on the surface of T lymphocytes. Although the selective pressure maintaining this variant is still unknown, the study presented bioinformatic, molecular, cellular, and immunological evidence that the gene may have been under balancing selection, and documented the impact of the likely selected site on the protein and a downstream consequence of this variation for antigen presentation.


Although elucidating the causal relationship between genotype and phenotype, as in the examples above, is beyond the scope of the present work, we took important steps in that direction by exploring properties of the candidate regions. In Chapter 1, within these limitations, we explored the biological basis of the targets of balancing selection by examining the functional categories to which they belong, the proportion of coding sites and, among these, of nonsynonymous sites. In Chapter 2, we analyzed in greater detail the properties of the sites within the regions targeted by balancing selection. We were thus able to test hypotheses about the accumulation of deleterious mutations in regions under balancing selection and to deepen our understanding of the potential targets of balancing selection in the human genome.

We believe that, with the renewed interest in targets of balancing selection in humans in the literature, many of the candidate genes identified in our work will be investigated in more detail, both for genomic patterns and, in functional studies, for possible phenotypic effects and causal mutations.

Potential of the NCD statistics in future studies

In Chapter 1, I showed that the two new statistics we proposed, NCD1 and NCD2, have high power relative to other statistics commonly used to detect signatures of balancing selection.

One limitation in extrapolating our observations about the power of the NCD statistics to other species is that power analyses require simulations, both neutral and with selection, whose parameters can vary widely among species. On the other hand, the work in Chapter 1 leaves open the possibility that NCD1 and NCD2 be used in other species, given how easy they are to implement and interpret. Simulations for other species are needed to determine the power of the statistics for the scenario in question and to define appropriate filters, such as those we proposed in the extensive methods section of that work.
As an example, Teixeira and collaborators (in prep.)⁵ have been working on a study that discusses the potential biological implications of targets of balancing selection in the great apes.⁶ That study uses the NCD statistics, relying on detailed, species-specific demographic models to assess power in these scenarios and to define appropriate filters. This application to other species shows the potential of our statistics for evolutionary geneticists interested in signatures of selection that affect the allele frequency spectrum.
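
To give a sense of why the statistics are simple to implement, here is a minimal sketch. It assumes the definition given in Chapter 1, which is not reproduced in this section: NCD is taken to be the root-mean-square deviation of minor allele frequencies from a target frequency, with NCD2 additionally counting fixed differences relative to an outgroup as sites with frequency zero. The function name and parameters are hypothetical, and the filters discussed above are not included.

```python
from math import sqrt


def ncd(mafs, target_freq=0.5, n_fixed_differences=0):
    """Root-mean-square deviation of allele frequencies from target_freq.

    mafs: minor allele frequencies (0 < maf <= 0.5) of the SNPs in a window.
    n_fixed_differences: fixed differences relative to an outgroup, each
        counted as an informative site with frequency 0 (NCD2-like);
        leave at 0 to use SNPs only (NCD1-like).
    """
    values = list(mafs) + [0.0] * n_fixed_differences
    if not values:
        raise ValueError("window has no informative sites")
    return sqrt(sum((p - target_freq) ** 2 for p in values) / len(values))


# Hypothetical windows: frequencies near the target give low values.
ncd1_balanced = ncd([0.5, 0.5])
ncd1_neutral_like = ncd([0.1, 0.1])
ncd2_example = ncd([0.5], n_fixed_differences=1)
```

Lower values indicate frequencies closer to the target; in practice significance would be assessed against simulations or an empirical genomic distribution, as discussed above.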

⁵ I am a co-author of this work.

⁶ Great apes, including chimpanzee, bonobo, gorilla, and orangutan.

Bibliography

Akey, J. M. (2009). “Constructing genomic maps of positive selection in humans: where do we go from here?” In: Genome Research 19 (5), pp. 711–722.
Andrés, A. M. (2011). “Balancing Selection in the Human Genome”. In: eLS, pp. 1–8.
Andrés, A. M. et al. (2009). “Targets of balancing selection in the human genome.” In: Molecular Biology and Evolution 26 (12), pp. 2755–64.
Andrés, A. M. et al. (2010). “Balancing Selection Maintains a Form of ERAP2 that Undergoes Nonsense-Mediated Decay and Affects Antigen Presentation”. In: PLoS Genetics 6 (10), e1001157.
Bubb, K. L. et al. (2006). “Scan of human genome reveals no new Loci under ancient balancing selection.” In: Genetics 173 (4), pp. 2165–77.
Chakraborty, M. and J. D. Fry (2015). “Evidence that Environmental Heterogeneity Maintains a Detoxifying Enzyme Polymorphism in Drosophila melanogaster”. In: Current Biology 26 (2), pp. 1–5.
Charlesworth, B. (2009). “Effective population size and patterns of molecular evolution and variation.” In: Nature reviews. Genetics 10 (3), pp. 195–205.
DeGiorgio, M., K. E. Lohmueller and R. Nielsen (2014). “A model-based approach for identifying signatures of ancient balancing selection in genetic data.” In: PLoS genetics 10 (8), e1004561.
Gravel, S., B. M. Henn, R. N. Gutenkunst, A. R. Indap, G. T. Marth, A. G. Clark, F. Yu, R. A. Gibbs and C. D. Bustamante (2011). “Demographic history and rare allele sharing among human populations.” In: Proceedings of the National Academy of Sciences of the United States of America 108 (29), pp. 11983–8.
Hudson, R. R., M. Kreitman and M. Aguade (1987). “A Test of Neutral Molecular Evolution Based on Nucleotide Data”. In: Genetics 116 (1), pp. 153–159.


Key, F. M., J. C. Teixeira, C. de Filippo and A. M. Andrés (2014). “Advantageous diversity maintained by balancing selection in humans”. In: Current Opinion in Genetics & Development 29, pp. 45–51.
Kircher, M., D. M. Witten, P. Jain, B. J. O’Roak, G. M. Cooper and J. Shendure (2014). “A general framework for estimating the relative pathogenicity of human genetic variants”. In: Nature genetics 46 (3), pp. 310–315.
Leffler, E. M. et al. (2013). “Multiple Instances of Ancient Balancing Selection Shared Between Humans and Chimpanzees”. In: Science 339 (6127), pp. 1578–1582.
Lenz, T. L., B. Mueller, F. Trillmich and J. B. W. Wolf (2013). “Divergent allele advantage at MHC-DRB through direct and maternal genotypic effects and its consequences for allele pool composition and mating”. In: Proceedings of the Royal Society B: Biological Sciences 280 (1762), p. 20130714.
Mitchell-Olds, T., J. H. Willis and D. B. Goldstein (2007). “Which evolutionary processes influence natural genetic variation for phenotypic traits?” In: Nature reviews. Genetics 8 (11), pp. 845–856.
Oosterhout, C. van (2009). “A new theory of MHC evolution: beyond selection on the immune genes.” In: Proceedings of the Royal Society of London. Series B, Biological Sciences 276 (1657), pp. 657–65.
Savova, V., S. Chun, M. Sohail, R. B. McCole, R. Witwicki, L. Gai, T. L. Lenz, C.-t. Wu, S. R. Sunyaev and A. A. Gimelbrant (2016). “Genes with monoallelic expression contribute disproportionately to genetic diversity in humans”. In: Nature Genetics 48 (3), pp. 231–237.
Tajima, F. (1989). “Statistical method for testing the neutral mutation hypothesis by DNA polymorphism.” In: Genetics 123 (3), pp. 585–595.
Tan, Z., A. M. Shon and C. Ober (2005). “Evidence of balancing selection at the HLA-G promoter region”. In: Human Molecular Genetics 14 (23), pp. 3619–3628.

Appendices

Appendix A.1.
Copy of the article “Mapping bias overestimates reference allele frequencies at the HLA genes in the 1000 Genomes Project phase I data”: G3: Genes|Genomes|Genetics (2015), 5(5): 931-941.
In this article, I contributed to the planning of the analyses and to understanding the organization of the 1000 Genomes Project data. In addition, I performed some of the statistical tests and proposed the use of frequency deviation measures. Finally, I contributed comments on the writing of the text.

INVESTIGATION

Mapping Bias Overestimates Reference Allele Frequencies at the HLA Genes in the 1000 Genomes Project Phase I Data
Débora Y. C. Brandt,* Vitor R. C. Aguiar,* Bárbara D. Bitarello,* Kelly Nunes,* Jérôme Goudet,†
and Diogo Meyer*,1
*Department of Genetics and Evolutionary Biology, University of São Paulo, 05508-090 São Paulo, SP, Brazil, and
†Department of Ecology and Evolution, Biophore, University of Lausanne, CH-1015 Lausanne, Switzerland

ORCID IDs: 0000-0001-7676-9367 (B.D.B.); 0000-0002-7155-5674 (D.M.)

ABSTRACT Next-generation sequencing (NGS) technologies have become the standard for data generation in studies of population genomics, such as the 1000 Genomes Project (1000G). However, these techniques are known to be problematic when applied to highly polymorphic genomic regions, such as the human leukocyte antigen (HLA) genes. Because accurate genotype calls and allele frequency estimations are crucial to population genomics analyses, it is important to assess the reliability of NGS data. Here, we evaluate the reliability of genotype calls and allele frequency estimates of the single-nucleotide polymorphisms (SNPs) reported by 1000G (phase I) at five HLA genes (HLA-A, -B, -C, -DRB1, and -DQB1). We take advantage of the availability of HLA Sanger sequencing of 930 of the 1092 1000G samples and use this as a gold standard to benchmark the 1000G data. We document that 18.6% of SNP genotype calls in HLA genes are incorrect and that allele frequencies are estimated with an error greater than ±0.1 at approximately 25% of the SNPs in HLA genes. We found a bias toward overestimation of reference allele frequency for the 1000G data, indicating mapping bias is an important cause of error in frequency estimation in this dataset. We provide a list of sites that have poor allele frequency estimates and discuss the outcomes of including those sites in different kinds of analyses. Because the HLA region is the most polymorphic in the human genome, our results provide insights into the challenges of using NGS data at other genomic regions of high diversity.

KEYWORDS: NGS; mapping bias; 1000 Genomes; HLA

Copyright © 2015 Brandt et al.
doi: 10.1534/g3.114.015784
Manuscript received December 22, 2014; accepted for publication March 13, 2015; published Early Online March 17, 2015.
This is an open-access article distributed under the terms of the Creative Commons Attribution Unported License (http://creativecommons.org/licenses/by/3.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
Supporting information is available online at http://www.g3journal.org/lookup/suppl/doi:10.1534/g3.114.015784/-/DC1
Data available in public repositories: https://github.com/deboraycb/reliability_hla_1000g
1 Corresponding author: Departamento de Genética e Biologia Evolutiva, Rua do Matão, 277, São Paulo, SP 05508-090, Brazil. E-mail: diogo@ib.usp.br
Volume 5 | May 2015 | 931

Whole-genome resequencing data for large numbers of human individuals, as generated by the 1000 Genomes Project (www.1000genomes.org), provide unprecedented amounts of information about microevolutionary processes and demographic histories. Such inferences rely on either genotypic or allelic frequency information for each variable position, which constitute the data for downstream analyses and hypothesis testing.

The calling of single-nucleotide polymorphisms (SNPs) and genotypes and the estimation of allele frequencies from next-generation sequencing (NGS) has undergone rapid development, along with likelihood-based and Bayesian methods created to deal with challenges associated to heterogeneity in read quality and coverage (Nielsen et al. 2011). In Phase I of the 1000 Genomes Project, genotypes were called using a combination of different approaches: first, primary call sets were independently generated by different centers with different sequencing platforms, alignment, and variant calling methods; then, a consensus SNP call set was generated and made publicly available (The 1000 Genomes Project Consortium 2012).

The data generated by the 1000 Genomes Project frequently have been used to make inferences about evolutionary processes affecting our species, including the detection of targets of natural selection (Hernandez et al. 2011; Ward and Kellis 2012; Andersen et al. 2012) and understanding the genetic basis of complex phenotypes


(Lappalainen et al. 2013). In addition, the detailed catalog of genetic variation it provides across multiple human populations has been used to understand the processes affecting specific genes.

Among the well-documented targets of selection is the major histocompatibility complex region of the human genome, which harbors the highly polymorphic classical human leukocyte antigen (HLA) class I and II loci. The interest in these loci stems from their strong association to various autoimmune disorders (Sollid et al. 2014), susceptibility and resistance to infection (Chapman and Hill 2012), and striking signatures of genetic variation indicating strong balancing selection (Meyer and Thomson 2001). Such types of investigations can naturally be extended to the analysis of the 1000 Genomes data, which provide a rich resource of population genetic variation within and around HLA genes.

Despite this interest, the use of NGS data for HLA loci is hampered by a major technical hurdle, which is the mapping of short sequence reads to genes that are both highly polymorphic and which constitute a multigene family. The high polymorphism may decrease the probability that short reads will be successfully mapped to the reference genome, in the event that the sequenced individual carries a variant that is highly diverged from that used in the index (Nielsen et al. 2011). In addition, the fact that many HLA genes have close paralogues increases the chance that a read will map to two or more genomic regions, leading it to be discarded from most sequencing analyses pipelines, and thus decreasing the amount of usable information for genotype calling (Treangen and Salzberg 2012).

In previous studies authors explored the applicability of NGS to genotype the HLA alleles of an individual, where an allele typically is defined as the haplotype determined by a combination of SNPs within a given HLA gene [e.g., Erlich et al. (2011); Major et al. (2013)]. To this end, Erlich et al. (2011) proposed NGS methodologies in which different steps, from sample preparation to haplotype level allele calling, were adapted to deal with the issues of high polymorphism and paralogy of HLA genes. In this way, they were able to successfully validate their methodology in a study of 270 samples that had been typed previously by sequence-specific oligonucleotide hybridization, which they treated as a gold standard dataset. The same gold standard dataset was used by Major et al. (2013), who also examined the reliability of calling HLA alleles using NGS, but using the 1000 Genomes alignment data, and showed that this publicly available dataset can be used for this purpose, after appropriate filters (e.g., coverage) are applied.

Both Erlich et al. (2011) and Major et al. (2013) were interested in using NGS data to determine HLA alleles. Information regarding HLA alleles is of biomedical relevance because HLA genotypes often are an important covariate to account for in association studies, and HLA typing is critical to hematopoietic transplantation. In this study, however, we evaluate the quality of SNP level genotype calls from the 1000 Genomes at the HLA genes.

The analysis of genotype and allele frequencies for SNPs contained within HLA genes has proven of great value in biomedical and evolutionary studies, and the 1000 Genomes dataset is a resource used recurrently in this context. Examples of the use of HLA SNP data from the 1000 Genomes Project include: (1) In genome-wide association studies (GWAS), SNPs in HLA genes often are associated with phenotypes of interest, and it is useful to understand the prevalence of these variants in additional populations; (2) GWAS studies benefit from knowledge of the haplotype structure surrounding HLA genes, which can be inferred from the dense SNP data of the 1000 Genomes for multiple populations (e.g., Hill-Burns et al. 2011); and (3) when testing for selection, many studies have found strong evidence associated to the HLA region, using the 1000 Genomes as a source of polymorphism data (e.g., Leffler et al. 2013).

All the aforementioned applications of the 1000 Genomes Project SNP data in HLA genes are dependent on the reliability of genotype calls at each SNP. However, no study to date has provided a detailed survey of the reliability of individual genotype calls and allele frequency estimates at the SNPs in HLA genes, despite their frequent usage. We address this issue, discuss likely causes for cases of incorrect genotype calls, and provide a list of reliable sites for the HLA loci in the 1000 Genomes data. As in previous studies (Erlich et al. 2011; Major et al. 2013), we used a dataset in which individuals had their HLA genes genotyped using Sanger sequencing as a gold standard to benchmark the genotypes called at the 1000 Genomes Project. However, differently from these other studies, which were interested in reconstructing the HLA haplotypes using NGS, here we have deconstructed the haplotypes determined from Sanger sequencing data into SNPs, and compared genotypes at the SNP level to the 1000 Genomes data. We took advantage of the recent availability of a dataset of Sanger sequencing based HLA genotyping of HLA-A, -B, -C, -DQB1, and -DRB1 for 930 of the samples from the 1000 Genomes Project (Gourraud et al. 2014). Our results have implications for other studies that use SNP data from the 1000 Genomes in order to estimate allele frequencies. Because HLA loci are the most polymorphic in the human genome, they most likely represent the worst case scenario for mapping bias and, consequently, allele frequency estimation error.

METHODS
In this study, we compare NGS genotype calls and allele frequency estimates reported by the 1000 Genomes Project with those obtained in a study which used Sanger sequencing to genotype HLA genes. For the purpose of our analysis we assembled a dataset comprising the intersection of the 1000 Genomes and Sanger sequencing samples, resulting in 930 individuals from 12 populations. Supporting Information, Figure S1 summarizes the preprocessing of both datasets, which preceded genotype and allele frequency comparisons.

1000 Genomes dataset (1000G)
SNP genotypes were acquired from the chromosome 6 integrated Variant Call Format (VCF) file from version 3 of the 1000 Genomes Project Phase I data, which is available at ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20110521/ (The 1000 Genomes Project Consortium 2012). We selected only SNPs in exons encoding the antigen recognition sites (ARS), which are exons 2 and 3 for HLA-A, -B, and -C (Bjorkman et al. 1987) and exon 2 for HLA-DQB1 and -DRB1 (Brown et al. 1993). Sites were selected based on the most inclusive coordinates of the RefSeq database in July 22, 2014 (see File S1). Both SNP and sample selection were carried out using VCFtools v0.1.12b (Danecek et al. 2011).

HLA reference panel by Gourraud et al. (2014) (PAG2014)
Gourraud et al. (2014) typed class I HLA-A, -B and -C, and class II HLA-DRB1 and -DQB1 genes of 1266 individuals from 14 different populations in Africa, Europe, Asia, and America. The HLA sequence-based typing was performed with specific polymerase chain reaction amplification of ARS exons followed by Sanger sequencing. Data are available at the dbMHC Web site (http://www.ncbi.nlm.nih.gov/gv/mhc/xslcgi.fcgi?cmd=cellsearch; Helmberg et al. 2014).

Data from Gourraud et al. (2014) are available in the form of HLA allele names per individual. Allele naming for HLA genes follows

specific rules (Marsh et al. 2010). To summarize, allele names are composed of a letter indicating the locus, followed by 2–4 numeric fields separated by colons. Each numeric field indicates specific forms of variation: the 1st field distinguishes groups of alleles by serological type, and the following fields distinguish nonsynonymous polymorphisms, synonymous polymorphisms, and noncoding differences, respectively. To obtain SNP genotypes and frequencies from the Sanger sequencing data, we converted all allele names to their associated sequences for ARS encoding exons. Sequences were acquired from the IMGT (i.e., international ImMunoGeneTics information system) database (Robinson et al. 2013), which keeps a well-curated repository of all known HLA allele sequences.

Our analysis was restricted to ARS exons because the HLA typing method used by Gourraud et al. (2014) only probed genetic variation in these specific exons. As a consequence, multiple HLA alleles are compatible with the sequencing results, because the sites that

Figure 1 Genotype mismatches between the 1000G and PAG2014 datasets. Results per polymorphic site (“Position”) and per individual (930 in
total). Individuals are ordered by number of mismatches (individuals with fewer mismatches on top). Sites are numbered according to their position
in the coding sequence of the ARS exons. Dark squares indicate mismatches between genotypes in the two datasets. ARS, antigen recognition sites; HLA,
human leukocyte antigen.
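
The genotype comparison summarized in Figure 1 can be sketched as follows. This is a minimal illustration, not the authors' code: the function names and the data representation (a pair of bases for a 1000G call, a pair of candidate-base sets for a PAG2014 call) are assumptions, as is the reading that a match requires identical unordered genotypes once ambiguous PAG2014 sites are resolved in favor of the 1000G call.

```python
from itertools import product


def genotypes_match(g1000, pag2014):
    """Return True if the 1000G genotype equals some resolution of the PAG2014 call.

    g1000   -- tuple of two bases, e.g. ("A", "T")
    pag2014 -- tuple of two sets of candidate bases; an ambiguous site such as
               T/A is written {"T", "A"} (hypothetical representation)
    """
    target = sorted(g1000)
    # Try every way of resolving the ambiguous alleles in PAG2014.
    return any(sorted(choice) == target for choice in product(*pag2014))


def mismatch_fraction(calls_1000g, calls_pag2014):
    """Fraction of (individual, site) genotypes that do not match."""
    mismatches = sum(
        not genotypes_match(calls_1000g[key], calls_pag2014[key])
        for key in calls_1000g
    )
    return mismatches / len(calls_1000g)
```

Both dictionaries are keyed by (individual, site), so the same function can be aggregated per site or per individual to reproduce the two axes of the figure.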

Volume 5 May 2015 | Mapping Bias at HLA in 1000 Genomes | 933


Figure 3 (A) Distribution of coverage (x-axis) at matched and mismatched genotypes; y-axis is the square root of the relative frequency (Mann-
Whitney U one-tailed test, P < 10^-16); (B) Relationship between mean coverage (x-axis) and absolute frequency difference (|FE|, y-axis) between
1000G and PAG2014 (r = -0.11, P = 0.09). All polymorphic sites from HLA-A, -B, -C, -DRB1, and -DQB1 genes are included in both A and B. HLA,
human leukocyte antigen.

differentiate them are in other exons. This results in what we refer to as an “ambiguous allele call” for an HLA allele (e.g., the allele is identified as B*35:03, but we cannot establish whether it is B*35:03:01 or B*35:03:02, or a group of alleles is attributed to an individual, such as B*35:02/B*35:03/B*35:04). Ambiguous allele calls may also happen when sequencing has low quality at bases that differentiate two alleles. In addition, there are also genotypic ambiguities, which occur when different pairs of alleles are compatible with the sequencing results. For individuals that bear ambiguous alleles, we created a consensus sequence in which ambiguous sites were reported with both possible alleles (e.g., A/T, see Figure S1). In this way, we incorporate the uncertainty associated with the sequence-based typing into downstream analyses.
Although we cannot rule out technical errors in the Sanger sequencing that generated the PAG2014 data (Gourraud et al. 2014), we assume that this method provides the most reliable estimate of HLA alleles (and hence SNP genotypes), and will serve as a standard to estimate the reliability of genotype calls and allele frequencies for the 1000 Genomes data (De Santis et al. 2013).

Genotype comparisons
We initially quantified how well the 1000G and PAG2014 data agreed with respect to genotype calls. Genotypes at each site in each individual were compared between the 1000G data and the PAG2014 data, here considered as a gold standard. In the case of sites with ambiguity (e.g., T/A) in the PAG2014 data, if one of the two possible alleles matched an allele present in the 1000G, we considered this an allele match and PAG2014 was corrected, by attributing the allele present in the 1000G data to the ambiguous site. After correcting the ambiguous sites in PAG2014, we only considered genotypes to be a match if both alleles in 1000G were present in the PAG2014 data, at that site. Throughout this article, sites are numbered according to their position in the coding sequences of the ARS exons (1–546 at the class I loci and 1–270 at the class II loci).

Allele frequency comparisons
After correcting all possible ambiguities in PAG2014 (as described previously), we calculated allele frequencies for SNPs in both datasets. By comparing the frequency of the reference allele in 1000G to its value in PAG2014, we assessed the accuracy of allele frequency estimation. The reference allele was defined as the allele present in the hg19 build of the reference sequence of the human genome. RefSeq IDs of the reference sequences used for each HLA gene are reported on File S1.
We computed the error in 1000G frequency estimates per site i (FE_i) as follows:

$$FE_i = f_{i,1000G} - f_{i,PAG2014}$$

where $f_{i,1000G}$ and $f_{i,PAG2014}$ are the frequency of the reference allele at site i in 1000G and PAG2014, respectively. We also computed the mean absolute error in frequency estimates per gene as a mean of absolute FE_i for all sites within a gene (MAE):

$$MAE = \frac{1}{n} \sum_{i=1}^{n} \left| f_{i,1000G} - f_{i,PAG2014} \right|$$

where n is the number of SNPs in the gene.

Coverage in 1000G
Sequencing coverage per individual per site was calculated from the 1000 Genomes Project phase I BAM files for the low coverage

Figure 2 REF allele frequency per site in each HLA gene in the 1000 Genomes (1000G) and Sanger sequencing (PAG2014) datasets. Continuous
line indicates the expected relationship (i.e., no difference) between 1000G and PAG2014. Dashed lines indicate a ±0.1 deviation from the
expected frequency (as estimated from PAG2014 dataset). MAE (mean absolute error) defined in the section Materials and Methods. Numbers
indicate site position in ARS exons sequence. REF, reference; ARS, antigen recognition sites; HLA, human leukocyte antigen.
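
The two quantities plotted in Figure 2, the per-site frequency error (FE, the reference allele frequency in 1000G minus that in PAG2014) and the per-gene mean absolute error (MAE), can be computed as in the sketch below. The function names and the example frequencies are hypothetical and serve only to illustrate the arithmetic; they are not values from the study.

```python
def frequency_errors(freq_1000g, freq_pag2014):
    """Per-site error FE_i = f_i(1000G) - f_i(PAG2014) for the reference allele."""
    return {site: freq_1000g[site] - freq_pag2014[site] for site in freq_pag2014}


def mean_absolute_error(errors):
    """MAE: mean of |FE_i| over all SNPs of a gene."""
    return sum(abs(fe) for fe in errors.values()) / len(errors)


# Hypothetical reference allele frequencies at three sites of one gene.
f_1000g = {101: 0.80, 132: 0.90, 244: 0.50}
f_pag = {101: 0.75, 132: 0.40, 244: 0.55}

fe = frequency_errors(f_1000g, f_pag)
mae = mean_absolute_error(fe)
# Sites outside the dashed lines of the figure (|FE| > 0.1).
large = [site for site, e in fe.items() if abs(e) > 0.1]
```

A positive FE means that the reference allele frequency is overestimated in 1000G, and a negative FE means that it is underestimated.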

experiments using the genomeCoverageBed program from BEDTools (Quinlan and Hall 2010). BAM files are available on ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/phase1/data/[sampleID]/alignment/.
Only low-coverage BAM files were used to estimate coverage because genotype likelihoods for the data we analyzed (1000 Genomes Project Phase I integrated VCF files) were estimated from this source. Genotype likelihoods were estimated from high coverage exome BAM files only for a minority of sites that were exclusively discovered on the exome experiments, and were not used in the coverage analysis (see Table S1).

Testing for mapping bias
After demonstrating that there is an overestimation of reference allele frequency in the 1000G SNPs (see the section Results), we hypothesized that mapping bias was the underlying cause. To test this hypothesis, we examined whether reads carrying the alternative allele at a SNP are less likely to map to the reference genome than reads carrying the reference allele. First, for each HLA allele present in the PAG2014 dataset, we defined windows of 51 base pairs that were centered on each SNP (25 base pairs upstream and 25 base pairs downstream of the SNP, including non-polymorphic sites). The set

Figure 4 Difference in reference allele frequency between 1000G and PAG2014, measured by FE (see the section Materials and Methods), at
each polymorphic site, in each population. Shades of red indicate overestimation of reference allele frequency and shades of blue indicate
underestimation of reference allele frequency in 1000G. Full population names are given in Table S3.

of windows centered on a specific SNP was then separated in two groups: (i) those that carry the reference allele at the central site and (ii) those that carry the alternative allele at the central site. Next, all windows were compared with the reference genome (hg19) sequence (the same sequence that was used as an index in the 1000 Genomes Project), and the number of mismatches was counted, excluding the mismatch at the central SNP. If mapping bias was influencing allele frequency estimates, we expected that, for SNP positions with overestimation of the reference allele frequency in the 1000G, the alternative alleles would be flanked by additional alternative alleles (and thus have a greater mismatch count against the reference sequence).

RESULTS

Genotypic mismatch frequency
We found that, on average, 18.6% of genotypes were mismatched between 1000G and PAG2014 when individual genotypes for each site in the five classical HLA genes were compared, and exons with greater nucleotide diversity tend to have a greater proportion of genotype mismatches (Figure S2). We also observed that mismatches are especially concentrated on a few sites (Figure 1), with 18.7% of sites concentrating 50% of the mismatches over the five loci we analyzed.

Reference allele frequency accuracy
Accuracy of estimation of allele frequencies in 1000G was assessed by comparing the observed frequency of the reference allele in the 1000G data with that of PAG2014, for both the global dataset (consisting of a pooled set of all individuals) and for each population separately (see Figure S3, Figure S4, Figure S5, Figure S6, and Figure S7). We chose a difference of 0.1 between the frequencies on both datasets as a threshold that determines a “large frequency difference.”
For the global dataset (Figure 2) we found that for HLA-A and -C most SNPs have similar frequency estimates for 1000G and PAG2014, with few large deviations (only 9/66 and 8/44 SNPs with absolute difference in frequencies (|FE|) larger than 0.1, respectively). The HLA-DQB1 locus shows an intermediate proportion of SNPs with large deviations (10/42 SNPs with |FE| > 0.1), and HLA-B and HLA-DRB1 show the greatest proportion of sites with large frequency differences between 1000G and PAG2014 (23/64 and 15/35 sites with |FE| > 0.1). Overall, the mean absolute difference in frequency between SNPs in the 1000G and PAG2014 data is 0.08, and it is greater at the HLA genes with the greatest levels of nucleotide diversity (HLA-B, -DQB1 and -DRB1 all deviate by ±0.1).
The proportion of genotype mismatches and allele frequency deviations per site are highly correlated (Pearson correlation = 0.86, P < 10^-16; Figure S8). However, some SNPs with a high proportion of genotype mismatches have well-estimated allele frequencies. One example is site 465 at HLA-B, in which 44% of genotypes are mismatched, but |FE| is only 0.007. Overall, 15 sites have more than 25% mismatched genotypes while showing |FE| < 0.1 (see Figure S8). This is possible when the frequency of genotype errors in which the reference allele is overrepresented is similar to the frequency of errors in which the alternative allele is overrepresented.

Allele frequency at the axiom exome genotyping array – Affymetrix:
Because genotyping arrays constitute an additional frequently used resource to genotype SNPs within HLA genes, playing an important role in GWAS studies, we also have investigated the accuracy of allele frequency estimation from this genotyping technology. We estimated allele frequencies from Axiom Exome data, and we found that those allele frequency estimates are as reliable as the ones from the 1000 Genomes NGS data, at the same SNPs (see Figure S9).

Relationship between sequencing coverage and genotypic mismatches
To investigate whether low sequencing coverage could explain genotype mismatches and deviations from expected allele frequencies, we compared sequencing coverage between mismatched and matched genotypes (Figure 3A) and assessed the relationship between coverage and frequency deviation (Figure 3B).
Sites with mismatched genotypes have on average lower sequencing coverage than sites with matched genotypes (Figure 3A; Mann-Whitney U one-tailed test P < 10^-16). This is the expected relationship if low sequencing coverage explains genotype mismatches between datasets. However, the difference in sequencing coverage between sites with matched and mismatched genotypes is small (mean coverage in matching genotypes is 1.95, and 1.75 in nonmatching genotypes, a difference of 6.2%) and has likely achieved very high significance only due to the large number of observations. Similarly, correlation between allele frequency deviation and sequencing coverage is weak and not significant (Figure 3B; r = -0.11, P = 0.09), although the direction of correlation is in agreement with what would be expected if lower coverage explained larger deviations in frequency estimation. We have also investigated the possible effect of the position of the SNPs relative to exon edges on the allele frequency deviations and found no correlation between those factors (Figure S10). We therefore investigated other factors that may account for errors in genotype calling.

Direction of frequency deviation
We found that most of the genotype mismatches are caused by miscalling an alternative allele as a reference allele (Table S2). Furthermore, most deviations in allele frequency estimates are in the direction of an overestimation of reference allele frequencies in the 1000 Genomes data (Figure 2). This information is summarized in Figure 4, which shows the location and magnitude of frequency deviations between the 1000G and PAG2014 data.
The overall shift in the direction of overestimating reference alleles is summarized in Table 1, which shows the number of SNPs with more than 0.1 frequency difference in at least two populations, for each locus. For HLA-A, -B, and -DQB1 most sites with large frequency differences between 1000G and PAG2014 are skewed in the direction of overestimating the reference allele [P = 0.057 for HLA-A and P < 10^-4 for HLA-B and -DQB1, binomial test for null hypothesis of equal numbers of deviations in direction of reference (REF) or alternative (ALT)], whereas HLA-C and HLA-DRB1 show no evidence for an excess of large deviations in the direction of reference alleles.

Table 1 Number of sites with overestimation of REF or ALT allele frequency in each HLA locus (|FE| > 0.1 in 2 or more populations)

       A     B     C     DQB1   DRB1
REF    11    30    6     22     11
ALT    3     2     3     2      11

Genomic coordinates of those sites are given in Table S4. HLA, human leukocyte antigen; REF, reference; ALT, alternative.
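
The binomial test quoted for Table 1 can be checked from the counts alone. The sketch below assumes an exact two-sided test against a probability of 0.5; the source does not state the sidedness, so that choice and the function name are assumptions. Under this assumption the HLA-A counts (11 REF, 3 ALT) give P ≈ 0.057, in line with the value reported in the text.

```python
from math import comb


def binomial_two_sided(k, n):
    """Exact two-sided binomial test for k successes in n trials, null p = 0.5.

    The P value is the total probability of all outcomes that are no more
    likely than the observed one.
    """
    probs = [comb(n, i) / 2 ** n for i in range(n + 1)]
    observed = probs[k]
    # Small relative tolerance so that symmetric outcomes are both counted.
    return min(1.0, sum(p for p in probs if p <= observed * (1 + 1e-9)))


# Counts of sites from Table 1: (REF overestimated, ALT overestimated).
table1 = {"A": (11, 3), "B": (30, 2), "C": (6, 3), "DQB1": (22, 2), "DRB1": (11, 11)}
p_values = {
    locus: binomial_two_sided(ref, ref + alt) for locus, (ref, alt) in table1.items()
}
```

With these counts, HLA-B and HLA-DQB1 fall below 10^-4, whereas HLA-C and HLA-DRB1 are far from significance, matching the pattern described in the text.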

Figure 5 Number of differences to the reference genome at 1860 51-bp windows centered at sites HLA-B 132 and HLA-DQB1 244 with reference
(REF) or alternative (ALT) allele at those sites. Windows were defined from all HLA alleles present in the 930 samples from the PAG2014 dataset.
HLA, human leukocyte antigen.
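
The window-based count shown in Figure 5 can be sketched as follows. This is an illustration under stated assumptions rather than the authors' implementation: allele and reference sequences are taken to be pre-aligned strings of equal length with no indels, positions are 0-based, and the function names are hypothetical.

```python
def window_differences(allele_seq, reference_seq, snp_index, flank=25):
    """Count differences to the reference in a window centered on a SNP,
    excluding the central site itself (51 bp when flank is 25)."""
    start = max(0, snp_index - flank)
    end = min(len(reference_seq), snp_index + flank + 1)
    return sum(
        1
        for pos in range(start, end)
        if pos != snp_index and allele_seq[pos] != reference_seq[pos]
    )


def split_by_central_allele(allele_seqs, reference_seq, snp_index, flank=25):
    """Group the per-window difference counts into REF and ALT windows."""
    counts = {"REF": [], "ALT": []}
    for seq in allele_seqs:
        group = "REF" if seq[snp_index] == reference_seq[snp_index] else "ALT"
        counts[group].append(
            window_differences(seq, reference_seq, snp_index, flank)
        )
    return counts
```

Comparing the two lists returned by `split_by_central_allele` for a given SNP corresponds to the REF versus ALT comparison in the figure.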

Testing for mapping bias
We hypothesized that the observed reference allele bias was caused by a lower efficiency in the mapping of reads containing the alternative allele. This is expected under the assumption that the reads carrying the alternative allele on average have more differences with respect to the reference genome (used by the 1000 Genomes Consortium as the index to align NGS reads) than reads carrying the reference allele. In this scenario, some sites would have a stronger bias than others if the alternative alleles in those sites are flanked by additional alternative alleles.
To test this hypothesis, we aligned sequences of all alleles present in PAG2014 to the HLA sequences present in the hg19 build of the reference human genome (the same sequences used for the alignment of reads in the 1000 Genomes Project) and defined windows of 51 base pairs around each SNP. We then quantified the number of differences with respect to the reference genome for windows surrounding

Figure 6 Number of differences to the reference genome at 51-bp
windows centered at each SNP in the HLA-A, -B, and -DQB1 genes.
Windows around each SNP were defined from the set of 1860 alleles
present in the 930 samples from the PAG2014 dataset. Next, the set of
windows was divided in three groups: those centered on SNPs with
overestimated, well estimated and underestimated reference allele fre-
quencies (red, yellow and blue boxplots, respectively). Then, each
group was divided in two: windows in which the central site contains
the reference allele (REF, dark boxplots) and windows centered on an
alternative allele (ALT, light colored boxplots). Lower and upper hinges
correspond to the 25th and 75th percentiles, horizontal lines represent
the median, whiskers extend to 1.5 times the interquartile range, and outliers
are represented by dots. HLA, human leukocyte antigen; SNP, single-
nucleotide polymorphism.
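
The three groups of SNPs compared in Figure 6 follow the thresholds given in the text: FE > 0.1 (overestimated), FE < -0.1 (underestimated) and |FE| < 0.01 (well estimated). A minimal sketch of that classification is shown below; the function name and the "intermediate" label for sites that fall in none of the three groups are assumptions added for illustration.

```python
def classify_site(fe, large=0.1, small=0.01):
    """Assign a SNP to a frequency-error category from its FE value."""
    if fe > large:
        return "overestimated"
    if fe < -large:
        return "underestimated"
    if abs(fe) < small:
        return "well estimated"
    # Sites with 0.01 <= |FE| <= 0.1 belong to none of the three groups.
    return "intermediate"


# Hypothetical FE values keyed by site position.
example = {132: 0.52, 465: 0.007, 17: -0.2, 88: 0.05}
groups = {site: classify_site(fe) for site, fe in example.items()}
```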

Figure 7 Heterozygosity of SNPs at HLA genes estimated from the PAG2014 dataset. Orange bars show distribution of heterozygosity at sites with a high error rate in frequency estimation (|FE| > 0.1 in two or more populations). Blue bars show the distribution of heterozygosity after exclusion of SNPs with high error rate. SNP, single-nucleotide polymorphism; HLA, human leukocyte antigen.

(i) REF and (ii) ALT alleles. If REF allele mapping bias is driving errors in frequency estimation, it is expected that sites with an overestimation of REF allele frequency would present the following pattern: windows carrying the REF with fewer differences to the reference genome than sequences centered on the ALT alleles. For sites with well-estimated frequencies, on the other hand, we did not expect such a difference between REF and ALT windows.
To illustrate this effect, Figure 5 shows the results for the two most extreme cases of frequency deviation shown in Figure 4: site 244 of HLA-DQB1 and site 132 of HLA-B (0.56 and 0.52 absolute increase in REF allele frequency in the 1000 Genomes data with respect to PAG2014). In both cases, ALT windows bear more differences to the reference sequence than REF windows.
These results support the hypothesis that these sites with poorly estimated allele frequencies have their ALT alleles residing in haplotypes with substantially more differences with respect to the reference genome than haplotypes centered on the REF allele, thus accounting for the observed bias.
To gain a broader perspective of this issue, we classified SNPs from the HLA loci with REF allele bias (HLA-A, -B, and -DQB1) into three categories: (i) sites at which the REF allele frequency was overestimated, i.e., FE > 0.1 (“overestimated”); (ii) sites where the REF allele frequency was underestimated, i.e., FE < -0.1 (“underestimated”); and (iii) sites at which allele frequencies were well estimated (|FE| < 0.01, here referred to as “well estimated”). We compared these three categories of sites with respect to the number of differences relative to the reference genome in REF and ALT windows (Figure 6). We found that the overestimated group has a significant excess of differences at alternative allele bearing haplotypes. In this group of SNPs, ALT windows have on average 4.4 other differences relative to the reference genome, whereas those centered on the REF allele have 1.9 differences (excess of differences on windows centered on the ALT allele was tested with a one tailed Mann-Whitney U test; P < 10^-16). Sites with well estimated or underestimated REF allele frequency, on the other hand, do not show a similar excess of differences in the haplotypes bearing the ALT allele, although the difference between REF and ALT windows is statistically significant because of the large sample size (well estimated: ALT mean = 1.7; REF mean = 1.8; one tailed Mann-Whitney U test P < 10^-16; underestimated: ALT mean = 1.9; REF mean = 1.2; one tailed Mann-Whitney U, P < 10^-16).

Impact of biases in frequency estimation on population genetic statistics
Our analysis was able to identify a subset of SNPs in the HLA genes for which genotype calls and allele frequency estimates from the 1000G showed a high error rate with respect to the PAG2014 dataset. To evaluate the impact of the errors introduced by including these sites in population genetic analyses, we compared the distribution of sample heterozygosity between the sites with low and high error rates. Heterozygosity is defined as H = 2p(1 - p) for biallelic loci, as is the

Figure 8 Relationship between SNP heterozygosity (H) and (A) absolute value of deviation (|FE|; Pearson’s correlation = 0.32; P = 1.938 × 10^-7) or
(B) magnitude and direction of deviation (FE; Pearson’s correlation = 0.59; P < 10^-16). SNP, single-nucleotide polymorphism.
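
The quantities related in Figure 8 can be computed as in the sketch below, which uses the heterozygosity of a biallelic SNP, H = 2p(1 - p), where p is the frequency of one allele, and a plain Pearson correlation coefficient. The function names are hypothetical and the code is illustrative only; it does not reproduce the P values in the caption, which require a significance test not shown here.

```python
from math import sqrt


def heterozygosity(p):
    """Expected heterozygosity of a biallelic SNP with allele frequency p."""
    return 2 * p * (1 - p)


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)
```

Panel A would correspond to `pearson(h_values, [abs(fe) for fe in fe_values])` and panel B to `pearson(h_values, fe_values)`, given per-SNP lists of H and FE.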

case for the 1000 Genomes Phase I SNPs, because tri- or quad-allelic SNPs were not reported on Phase I.
The removal of sites with poor frequency estimates (|FE| > 0.1 in at least two populations) results in a marked change in the distribution of H, with a significant drop in the frequency of sites with large H and a shift in the distribution toward lower values (Figure 7). Note that the H values in Figure 7 are estimated from the PAG2014 data, implying that the high values of H among “excluded” sites are not due to the deviations in allele frequencies generated by NGS errors, but are the true heterozygosities at those sites. These results therefore document that because sites with high heterozygosity tend to have greater deviations from the “true” frequency (i.e., based on the PAG2014 dataset), the removal of poorly estimated sites results in a reduction in H values.

The effect of heterozygosity on allele frequency estimation bias
We found an overall positive correlation between SNP heterozygosity and the magnitude of error in allele frequency estimates (Figure 8A; Pearson’s correlation = 0.32; P = 1.938 × 10^-7). This result provides further evidence that sites with greater heterozygosity tend to have poorer estimates for allele frequencies in the 1000G. Also, heterozygosity is even more strongly correlated to the deviation in frequency, considering the direction of the deviation (Figure 8B; Pearson’s correlation = 0.59; P < 10^-16). Together, these results show that HLA SNPs with greater heterozygosities not only have more errors in frequency estimation but also a stronger bias toward overestimation of REF allele frequency.

DISCUSSION
The 1000 Genomes Project data were generated by various sequencing centers, which relied on different sequencing platforms, read lengths, aligners and variant and genotype calling algorithms (The 1000 Genomes Project Consortium 2012), creating challenges to an overall assessment of data reliability. In this study, we specifically examine the performance of NGS-based genotype calls and allele frequency estimates for the highly polymorphic and intensely studied classical HLA genes. We took advantage of the possibility of comparing downstream genotype calls from the 1000 Genomes and HLA typing based on Sanger sequencing for the same set of samples to assess data quality and test hypotheses about possible biases.
We show that the 1000 Genomes SNPs called in the HLA genes have many differences at the genotype level, when compared to results obtained using Sanger sequencing. However, considerably high genotype mismatching is possible with only modest deviations in allele frequencies, and we conclude that for the 1000 Genomes data allele frequency estimates for SNPs at HLA genes are considerably more reliable than the individual genotype calls.
Low coverage did not explain the errors in genotypes and allele frequencies in the 1000 Genomes dataset. Instead, we found evidence that read mapping bias was responsible for those errors. Mapping bias is well known for NGS, and highly polymorphic regions such as HLA genes are especially susceptible to its effects (Nielsen et al. 2011), particularly when a single reference genome is used as an index for the alignment of NGS reads. In this situation, many true variants fail to be identified because they are present in haplotypes that differ from the genome used as index, and thus reads generated from these regions are not aligned and are lost. Together, these results suggest that increasing coverage would not improve allele frequency estimates at those sites if a single reference sequence is still used as index. By mapping to multiple genomes [e.g., using strategies similar to Boegel et al. (2012) or Dilthey et al. (2014)], it would be possible to improve genotype calling and allele frequency estimates.
In our study, HLA-A, -B, and -DQB1 show evidence of REF allele mapping bias. The HLA-DRB1 locus, on the other hand, did not present REF allele frequency overestimation, a finding that can be explained by the existence of multiple copies of this gene (both pseudogenes and functional copies), which may result in biases/errors that make REF allele bias comparatively less visible (Degner et al. 2009). The HLA-C locus also shows a weaker REF allele bias, a pattern that may be explained by its lower degree of polymorphism, which leads to a decrease in the number of mismatches of reads with respect to the reference genome, thus decreasing the mapping bias.
We provide a list of unreliable SNPs within the HLA genes, defined by us as those with an absolute difference in frequency larger than 0.1 (|FE| > 0.1) in two or more populations (Table S4). We show that these unreliable SNPs on average have greater heterozygosities in our gold standard dataset. As a consequence, although filtering out those unreliable sites improves the overall accuracy in allele frequency estimation, it leads to an underestimation of the mean heterozygosity of SNPs in HLA genes, a bias that should be taken into account in downstream analyses. Analyses that require genotype calls at the individual level, including haplotype-based analyses, should be performed with caution when using the data from the 1000 Genomes at HLA genes.
Our results have implications for studies that use SNP data from the 1000 Genomes in other genomic regions with high variability, such as KIR and olfactory receptors. Because HLA loci are the most polymorphic in the human genome, they represent a worst case scenario for mapping bias and subsequent allele frequency estimation errors. We found a significant correlation between SNP heterozygosity and the absolute difference in frequency between 1000 Genomes data and our gold standard. This suggests that in genome-wide studies, SNPs with high heterozygosities, and contained within regions with additional SNPs, have an increased chance of presenting poor frequency estimates.

ACKNOWLEDGMENTS
This research was financially supported by grants from São Paulo Research Foundation (FAPESP) and The Brazilian National Council for Scientific and Technological Development (CNPq). D.Y.C.B. was funded by FAPESP scholarships #2012/22796-9 and #2013/12162-5; V.R.C.A. has a FAPESP grant #2014/12123-2; B.D.B. was funded by #2011/12500-2 (FAPESP) and #152676/2011-2 (CNPq); K.N. has a FAPESP grant #2012/09950-9; and D.M. has a FAPESP research grant #12/18010-0 and a CNPq productivity grant #308167/2012-0.

LITERATURE CITED
Andersen, K. G., I. Shylakhter, S. Tabrizi, S. R. Grossman, C. T. Happi et al., 2012 Genome-wide scans provide evidence for positive selection of genes implicated in Lassa fever. Philos. Trans. R. Soc. Lond. B Biol. Sci. 367: 868–877.
Bjorkman, P. J., M. A. Saper, B. Samraoui, W. S. Bennett, J. L. Strominger et al., 1987 Structure of the human class I histocompatibility antigen, HLA-A2. Nature 329: 506–512.
Boegel, S., M. Löwer, M. Schäfer, T. Bukur, J. de Graaf et al., 2012 HLA typing from RNA-Seq sequence reads. Genome Med. 4: 102.
Brown, J. H., T. S. Jardetzky, J. C. Gorga, L. J. Stern, R. G. Urban et al., 1993 Three-dimensional structure of the human class II histocompatibility antigen HLA-DR1. Nature 364: 33–39.
Chapman, S. J., and A. V. S. Hill, 2012 Human genetic susceptibility to infectious disease. Nat. Rev. Genet. 13: 175–188.
Danecek, P., A. Auton, G. R. Abecasis, C. A. Albers, E. Banks et al., 2011 The variant call format and VCFtools. Bioinformatics 27: 2156–2158.

De Santis, D., D. Dinauer, J. Duke, H. A. Erlich, C. L. Holcomb et al., 2013 16(th) IHIW: review of HLA typing by NGS. Int. J. Immunogenet. 40: 72–76.
Degner, J. F., J. C. Marioni, A. A. Pai, J. K. Pickrell, E. Nkadori et al., 2009 Effect of read-mapping biases on detecting allele-specific expression from RNA-sequencing data. Bioinformatics 25: 3207–3212.
Dilthey, A., C. Cox, Z. Iqbal, M. R. Nelson, and G. McVean, 2014 Improved genome inference in the MHC using a population reference graph. bioRxiv. Available from: http://biorxiv.org/content/early/2014/07/08/006973. Accessed March 20, 2015.
Erlich, R. L., X. Jia, S. Anderson, E. Banks, X. Gao et al., 2011 Next-generation sequencing for HLA typing of class I loci. BMC Genomics 12: 42.
Gourraud, P.-A., P. Khankhanian, N. Cereb, S. Y. Yang, M. Feolo et al., 2014 HLA Diversity in the 1000 Genomes Dataset. PLoS One 9: e97282.
Helmberg, W., M. Feolo, R. Dunivin, and D. Hoffman, 2014 dbMHC.
Hernandez, R. D., J. L. Kelley, E. Elyashiv, S. C. Melton, A. Auton et al., 2011 Classic selective sweeps were rare in recent human evolution. Science 331: 920–924.
Hill-Burns, E. M., S. A. Factor, C. P. Zabetian, G. Thomson, and H. Payami, 2011 Evidence for more than one Parkinson’s disease-associated variant within the HLA region. PLoS One 6: e27109.
Kitts, A., M. Feolo, and W. Helmberg, 2003 The major histocompatibility complex database, dbMHC. In: National Center for Biotechnology Information NIH, ed. The NCBI Handbook. Bethesda: National Center for Biotechnology Information NIH, p. 1–29.
Lappalainen, T., M. Sammeth, M. R. Friedländer, P. C. ’t Hoen, J. Monlong et al., 2013 Transcriptome and genome sequencing uncovers functional variation in humans. Nature 501: 506–511.
Leffler, E. M., Z. Gao, S. Pfeifer, L. Ségurel, A. Auton et al., 2013 Multiple instances of ancient balancing selection shared between humans and chimpanzees. Science 339: 1578–1582.
Major, E., K. Rigo, T. Hague, A. Bérces, and S. Juhos, 2013 HLA typing from 1000 genomes whole genome and whole exome Illumina data. PLoS One 8: e78410.
Marsh, S. G. E., E. D. Albert, W. F. Bodmer, R. E. Bontrop, B. Dupont et al., 2010 Nomenclature for factors of the HLA system, 2010. Tissue Antigens 75: 291–455.
Meyer, D., and G. Thomson, 2001 How selection shapes variation of the human major histocompatibility complex: a review. Ann. Hum. Genet. 65: 1–26.
Nielsen, R., J. S. Paul, A. Albrechtsen, and Y. S. Song, 2011 Genotype and SNP calling from next-generation sequencing data. Nat. Rev. Genet. 12: 443–451.
Quinlan, A. R., and I. M. Hall, 2010 BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics 26: 841–842.
Robinson, J., J. A. Halliwell, H. McWilliam, R. Lopez, P. Parham et al., 2013 The IMGT/HLA database. Nucleic Acids Res. 41: D1222–D1227.
Sollid, L. M., W. Pos, and K. W. Wucherpfennig, 2014 Molecular mechanisms for contribution of MHC molecules to autoimmune diseases. Curr. Opin. Immunol. 31C: 24–30.
The 1000 Genomes Project Consortium, 2012 An integrated map of genetic variation from 1,092 human genomes. Nature 491: 56–65.
Treangen, T. J., and S. L. Salzberg, 2012 Repetitive DNA and next-generation sequencing: computational challenges and solutions. Nat. Rev. Genet. 13: 36–46.
Ward, L. D., and M. Kellis, 2012 Evidence of abundant purifying selection in humans for recently acquired regulatory functions. Science 337: 1675–1678.

Communicating editor: C. R. Marshall

Appendix A.2.
Copy of the article “HLA supertype variation across populations: new insights
into the role of natural selection in the evolution of HLA-A and HLA-B poly-
morphisms”: Immunogenetics (2015), 67(11):651-663.
For this work, I contributed Perl scripts to carry out the permutations
described in the article. In addition, this work has much in common with the
manuscript presented in Appendix A.4: in both, we investigate the units of
selection in HLA genes, albeit with quite different approaches. Here, we work
with the HLA-A and HLA-B genes in a population context and investigate the
role of supertypes as units of selection, whereas in the other (A.4) we use
phylogenetic comparative approaches to investigate the role of HLA allelic
lineages as units of selection in class I HLA genes.

Immunogenetics
DOI 10.1007/s00251-015-0875-9

ORIGINAL PAPER

HLA supertype variation across populations: new insights
into the role of natural selection in the evolution of HLA-A
and HLA-B polymorphisms
Rodrigo dos Santos Francisco 1,2,3 & Stéphane Buhler 2,4 & José Manuel Nunes 2,5 &
Bárbara Domingues Bitarello 1 & Gustavo Starvaggi França 6,7 & Diogo Meyer 1 &
Alicia Sanchez-Mazas 2,5

Received: 28 June 2015 / Accepted: 29 September 2015
© The Author(s) 2015. This article is published with open access at Springerlink.com

Abstract Supertypes are groups of human leukocyte antigen (HLA) alleles which bind overlapping sets of peptides due to sharing specific residues at the anchor positions (the B and F pockets) of the peptide-binding region (PBR). HLA alleles within the same supertype are expected to be functionally similar, while those from different supertypes are expected to be functionally distinct, presenting different sets of peptides. In this study, we applied the supertype classification to the HLA-A and HLA-B data of 55 worldwide populations in order to investigate the effect of natural selection on supertype rather than allelic variation at these loci. We compared the nucleotide diversity of the B and F pockets with that of the other PBR regions through a resampling procedure and compared the patterns of within-population heterozygosity (He) and between-population differentiation (GST) observed when using the supertype definition to those estimated when using randomized groups of alleles. At HLA-A, low levels of variation are observed at B and F pockets and randomized He and GST do not differ from the observed data. By contrast, HLA-B concentrates most of the differences between supertypes, the B pocket showing a particularly high level of variation. Moreover, at HLA-B, the reassignment of alleles into random groups does not reproduce the patterns of population differentiation observed with supertypes. We thus conclude that differently from HLA-A, for which supertype and allelic variation show similar patterns of nucleotide diversity within and between populations, HLA-B has likely evolved through specific adaptations of its B pocket to local pathogens.

Keywords HLA · Supertypes · Human populations · Natural selection · Pathogens · Adaptation

Diogo Meyer and Alicia Sanchez-Mazas co-supervised the study.


Electronic supplementary material The online version of this article
(doi:10.1007/s00251-015-0875-9) contains supplementary material,
which is available to authorized users.

* Rodrigo dos Santos Francisco
biorodrigo2001@yahoo.com.br

Diogo Meyer
diogo@ib.usp.br

Alicia Sanchez-Mazas
alicia.sanchez-mazas@unige.ch

1 Department of Genetics and Evolutionary Biology, University of São Paulo, São Paulo, Brazil
2 Laboratory of Anthropology, Genetics and Peopling History, Department of Genetics and Evolution–Anthropology Unit, University of Geneva, Geneva, Switzerland
3 Hospital Israelita Albert Einstein, São Paulo, Brazil
4 Transplantation Immunology Unit and National Reference Laboratory for Histocompatibility, Department of Genetic and Laboratory Medicine, Geneva University Hospital, Geneva, Switzerland
5 Institute of Genetics and Genomics in Geneva (IGE3), Geneva, Switzerland
6 Department of Biochemistry, Chemistry Institute, University of São Paulo, São Paulo, Brazil
7 Molecular Oncology Center, Sírio-Libanês Hospital, São Paulo, Brazil


Introduction

The three classical human leukocyte antigen (HLA) class I genes, HLA-A, HLA-B, and HLA-C, are extremely polymorphic and exhibit thousands of alleles, most of them coding for different proteins (2112, 2789, and 1799 HLA-A, HLA-B, and HLA-C proteins currently defined, respectively) (Robinson et al. 2015). These molecules play a central role in the immune response by presenting processed peptides derived from proteins of the intracellular environment (including foreign ones derived from intracellular parasites such as viruses and some bacteria) to cytotoxic T lymphocytes and also functioning as ligands for the killer immunoglobulin-like receptor (KIR) of natural killer cells (Parham 2005).

Almost all of the HLA class I polymorphisms are clustered in exons 2 and 3, which code for the α1 and α2 extracellular domains of the HLA molecule. These domains form a groove-like structure known as the peptide-binding region (PBR) which engages the peptides (Saper et al. 1991). At the DNA level, the PBR codons exhibit striking features regarding their diversity, including a high heterozygosity (Parham et al. 1989; Lawlor et al. 1990; Hedrick et al. 1991) and high rates of non-synonymous substitutions (Hughes and Nei 1988; Takahata et al. 1992). These characteristics contrast with neutral expectations and support the hypothesis that balancing selection has maintained variation at these codons. The high levels of variation observed at the sites involved in peptide binding support a model of host-pathogen coevolution (Apanius et al. 1997), which states that the pathogenic microorganisms are the main evolutionary force shaping HLA variation (Borghans et al. 2004; Slade and McCallum 1992; Takahata and Nei 1990). Further supporting this hypothesis, several studies have demonstrated a positive correlation between the diversity level of some HLA genes and the richness of environmental pathogens (Prugnolle et al. 2005; Qutob et al. 2011; Sanchez-Mazas et al. 2012). These results corroborate the idea that the codons making up the PBR constitute the main targets of balancing selection within HLA genes. However, the analyses performed to date generally treat the PBR region as a homogeneous block, whereas it is in fact composed of six different pocket-like structures (A, B, C, D, E, and F). Each pocket accommodates one of the nine amino acid residues of the bound peptide (the first, second, third, sixth, seventh, and ninth, respectively) (Saper et al. 1991). Moreover, the binding affinity between a given HLA molecule and a specific peptide depends on the chemical properties of each PBR pocket (Saper et al. 1991).

The strongest interaction between HLA molecules and the bound peptides is accounted for by the B and F pockets, which accommodate the second and ninth amino acid residues of the peptide, respectively (Saper et al. 1991). As the amino acids composing the B and F pockets play a central role in peptide recognition by the HLA molecules, Sidney et al. (1996) classified HLA alleles into supertypes, defined as groups of alleles sharing chemical properties at the B and F pockets. The logic behind the classification is that alleles within supertypes are expected to exhibit widely overlapping peptide repertoires, whereas alleles from different supertypes would more frequently bind non-overlapping sets of peptides. Supertypes were originally defined by sequencing endogenously bound ligands and searching for motifs shared by alleles that bind similar peptides and by analyzing the three-dimensional structure of the HLA molecules (Sette and Sidney 1999; Sidney et al. 1996, 2008). As a result, four supertypes were described for HLA-A (A1, A2, A3, and A24) and five for HLA-B (B7, B27, B44, B58, and B62), and they were originally assigned, respectively, to 31 HLA-A and 57 HLA-B alleles whose peptide-binding specificities were experimentally defined. These alleles were used to construct a reference panel for the B and F amino acid sequences. A set of 945 HLA-A and HLA-B alleles with unknown binding specificities were then checked for matches to the sequences of this panel (Sidney et al. 2008). Among these 945 previously unclassified alleles, 57 % presented a full match in both B and F pockets to alleles with known supertype status. Another 23.8 % presented partial matches with residues found in these pockets.

In line with the expectation that supertypes constitute a functionally relevant definition of HLA variation, several researchers have found that grouping alleles into supertypes is useful in disease association studies involving HLA loci (Alencar et al. 2013; Chakraborty et al. 2013; Cordery et al. 2012; Gilchuk et al. 2013; Karlsson et al. 2012, 2013; Kuniholm et al. 2013; Trachtenberg et al. 2003), allowing large numbers of rare alleles to be grouped according to a functional criterion, thus increasing the power of the studies. From an evolutionary point of view, natural selection is expected to leave a detectable signature on B and F pockets and, consequently, on the genotypes defined by examining HLA variation from the perspective of supertypes. For example, under the assumption that pathogen-driven selection shapes supertype frequencies, we expect genetic variation defined at the supertype level to show patterns of polymorphism and differentiation indicative of balancing selection to a greater degree than variation that is not related to supertype definition. The prediction that balancing selection on supertype variation would result in detectable genetic signatures was raised by Sette and Sidney (1999), who found that "supertype frequencies were high and fairly conserved among different ethnicities." In addition, Naugler and Liwski (2008) argued that "natural selection should favor maximization of the heterozygosity of allele supertypes instead of the heterozygosity of individual alleles," making explicit the hypothesis that supertypes, as defined by B and F pocket variations, constitute the level of variation that is the primary target of natural selection in HLA genes.


Both conservation of supertype frequencies between populations and increased heterozygosity at the supertype level are expected to generate a pattern of low population differentiation when compared with those observed at the allelic level. Balancing selection at the supertype level would also enhance genetic variation at the B and F pockets compared with other regions of the PBR, increasing the chances of antigen recognition by the immune system. However, testing these hypotheses, i.e., comparing population differentiation and variability defined at the levels of HLA alleles and supertypes, respectively, represents a methodological challenge due to the difficulties in comparing measures of differentiation and heterozygosity for genetic variants that are defined by different attributes (alleles being defined by all variation in the coding region, by contrast with supertypes which are defined by a subset of codons). Indeed, because supertypes are sets of alleles, genetic variation defined at the allele level is nested within that defined at the supertype level. Therefore, heterozygosity at the supertype level is constrained to be lower than or equal to that estimated at the allele level. Furthermore, because population genetic differentiation measured by statistics related to Wright’s FST is strongly determined by intrapopulation variability (Jost 2008), we expect higher levels of population differentiation at the supertype level simply because of the decreased number of supertype variants in comparison to alleles.

In the present study, our aim is to investigate whether the use of supertype instead of allele definitions at HLA-A and HLA-B loci reduces population differentiation and increases heterozygosity, as expected under a model of balancing selection acting on supertypes. For the reasons explained above, we control our analyses for the inherent differences in polymorphism between these two kinds of classification. Our approach consists in producing null distributions for population differentiation and heterozygosity by generating randomized sets of alleles (herein referred to as "random supertypes") that match true supertype sampling properties (i.e., number of supertypes and number of alleles per supertype) without any biological criteria for pooling them together. We also analyze supertype variation at the nucleotide level by partitioning DNA sequences into segments corresponding to the different pockets within the PBR. Our hypothesis is that the B and F pockets, which are the major determinants of the peptide-binding specificities and used to define supertypes, constitute the main targets of balancing selection and thus retain higher levels of diversity compared to other PBR pockets.

Materials and methods

Population data

We used a database generated for the 13th International Histocompatibility Workshop (IHWS) (Mack et al. 2006) from which we excluded populations presenting (a) an allelic resolution lower than the first two sets of digits (now referred to as second field level of resolution), so as to only keep alleles differing at the protein level; (b) genotypic ambiguities; and (c) deviation from Hardy-Weinberg expectations. This filtering resulted in a dataset of 6435 and 6409 individuals typed for HLA-A and HLA-B, respectively, belonging to 55 different populations: seven sub-Saharan African (SSA), two North African (NAF), eight Southwest Asian (SWA), four European (EUR), 22 Southeast Asian (SEA), four Pacific islanders (PAC), four Australian aborigine (AUS), two North Asian (NEA), and two Native American (AME) populations (Supplementary Material Table 1-S). Almost half of these populations (24 out of 55) had demographic histories indicating that they were likely to have experienced severe founder effects (these populations were from Oceania, Taiwan, and the Americas). Because such reductions in diversity due to demographic effects can potentially mask signals of balancing selection, we carried out all the analyses with both the complete set of 55 populations and a reduced set of 30 populations (obtained by excluding those from Oceania, Taiwan, and the Americas).

Supertype definition

We assigned all HLA-A and HLA-B alleles to their specific supertype as defined by the classification given in figures 1 (http://www.biomedcentral.com/1471-2172/9/1/figure/F1) and 2 (http://www.biomedcentral.com/1471-2172/9/1/figure/F2) from Sidney et al. (2008). The alleles not assigned to any supertype were treated in our analyses of population differentiation and molecular variation in two ways: (a) their allele-level definition was used and (b) they were pooled into groups of "non-classified alleles" (named NCA and NCB for HLA-A and HLA-B, respectively). We included A*29:01, A*29:02, A*29:03, A*30:01, A*30:08, and A*68:06 in the NCA group because of their ambiguous supertype allocation (Sidney et al. 2008), and all B*08 alleles were assigned to the NCB group because of their unique PBR structures, which make the peptide-binding profile unpredictable (Sidney et al. 2008).

Population genetic analyses

We tested the population samples for deviation from Hardy-Weinberg (HW) equilibrium using the Gene[rate] program which tests the null hypothesis of equilibrium on the basis of a log-likelihood ratio test on frequency estimates (both under HW and under a generalized non-HW model) (Nunes et al. 2014; Nunes 2014).

We wrote R scripts to estimate supertype frequencies by direct counting of alleles, generate summary statistics (number of alleles (k) and expected sample heterozygosity (He)), and


estimate genetic differentiation between pairs of populations by using GST (Nei and Chesser 1983). Mantel tests (Mantel 1967) for assessing Pearson’s correlations between genetic distances obtained either from supertype or from allelic data were carried out using the ade4 R package (Dray and Dufour 2007), and all graphs and other statistical tests (e.g., Wilcoxon rank sum test) were also generated using R version 3.0.2 (Development Core Team 2011). In box plots, the boxes correspond to the interquartile range, the median is the thick line inside the box, and whiskers extend up to observations that are outside the box for less than 1.5 times the interquartile range. Dots are outliers to these limits. By using Arlequin 3.5 program (Excoffier and Lischer 2010), we performed a hierarchical analysis of molecular variance (AMOVA) for each supertype taken individually by pooling all others into a unique group of "non-classified alleles" for the calculations. In this way, we estimated the diversity among populations (FST), among populations within geographic regions (FSC), and among geographic regions (FCT) for each supertype.

Testing the molecular variation of the PBR pockets

We analyzed the molecular variation at each PBR pocket using the coding sequences of the six pockets which make up the HLA class I peptide-binding region (A to F). The definition of these codons (Table 1) was taken from Saper et al. (1991). The residues retained for the analysis of pocket B variability are the ones surrounding the rim and constituting the inner wall of the pocket. As the main-chain atoms of pocket B residues 24, 25, and 34 are part of the protein backbone, and their side chains are not turned to the pocket area, they are not expected to contribute to the chemical properties of the pocket, and were not included in the analysis (Saper et al. 1991; see also Table 1).

The B and F pockets were analyzed individually because of their central role in engaging peptides and in defining supertypes. As the C, D, and E pockets jointly make up the central region of the PBR and are shorter compared to other pockets, we pooled them for the present analysis. The A pocket was analyzed individually because of its position at one end of the PBR.

We estimated the nucleotide diversity (π) (Nei 1987) per pocket (i.e., A, B, pooled CDE, and F) for each population (referred to as πtotal). For these four pockets, we also computed within- and between-supertype nucleotide diversity (referred to as πwithin and πst, respectively), and thus estimated a measure of among-supertype variation for each pocket, obtained using the following formula:

$$\pi_{\mathrm{st}} = \frac{\pi_{\mathrm{total}} - \pi_{\mathrm{within}}}{\pi_{\mathrm{total}}} \tag{1}$$

Total, within- and between-supertype π values were calculated in two ways: (a) by excluding the non-classified alleles and (b) by including the non-classified alleles as a single group. As the dataset is limited to alleles defined at second field level of resolution, no information about synonymous polymorphism is available. We addressed this problem by applying the same strategy as described by Buhler and Sanchez-Mazas (2011), which consisted in treating as missing data the nucleotide positions which were described as synonymous (Robinson et al. 2015). We excluded sites having more than 5 % missing data.

Testing genetic differentiation between populations based on supertypes

To test whether the levels of genetic differentiation between populations differed from those expected under the null hypothesis that supertypes are equivalent to random sets of alleles, we randomized the assignment of alleles into supertypes and calculated corresponding He and GST values. The randomized assignment of alleles to supertypes was performed using two different approaches (for both the complete and the reduced datasets):

1. By fixing the number of alleles per supertype to that observed in the original dataset
2. Without any constraint on the number of alleles associated to a specific supertype

The randomizations were repeated 10,000 times, and p-values were estimated empirically by determining the number

Table 1 Codon composition of the PBR pockets

Pockets       Codons                                                        Total size in base pairs (bp)
A             5, 7, 59, 63, 66, 99, 159, 163, 167, and 171                  30
B             7, 9, 24, 25, 34, 45, 63, 66, 67, 70, and 99                  33
C, D, and E   9, 70, 73, 74, 97, 99, 114, 147, 152, 155, 156, 159, and 160  39
F             77, 80, 81, 84, 116, 123, 143, 146, and 147                   27

From: Saper et al. (1991)
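
As an illustration of how the codon definitions in Table 1 and the among-supertype measure of Eq. 1 could be put to work, the following is a minimal Python sketch. It is not the code used in the study (the authors report using R scripts). The function names, and the convention that the input coding sequence starts at codon 1 of the numbering used in Table 1, are assumptions made here for illustration only.

```python
# Codon positions per pocket, copied from Table 1 (Saper et al. 1991).
POCKET_CODONS = {
    "A": [5, 7, 59, 63, 66, 99, 159, 163, 167, 171],
    "B": [7, 9, 24, 25, 34, 45, 63, 66, 67, 70, 99],
    "CDE": [9, 70, 73, 74, 97, 99, 114, 147, 152, 155, 156, 159, 160],
    "F": [77, 80, 81, 84, 116, 123, 143, 146, 147],
}


def pocket_size_bp(pocket):
    """Total pocket size in base pairs: three nucleotides per codon."""
    return 3 * len(POCKET_CODONS[pocket])


def pocket_sequence(mature_cds, pocket):
    """Concatenate the nucleotides of the codons that define one pocket.

    Assumption (hypothetical input convention): the first three characters
    of mature_cds correspond to codon 1 of the numbering used in Table 1.
    """
    return "".join(mature_cds[3 * (c - 1):3 * c] for c in POCKET_CODONS[pocket])


def pi_st(pi_total, pi_within):
    """Among-supertype share of nucleotide diversity, following Eq. 1."""
    if pi_total <= 0:
        raise ValueError("pi_total must be positive")
    return (pi_total - pi_within) / pi_total


if __name__ == "__main__":
    # Reproduces the size column of Table 1: 30, 33, 39, and 27 bp.
    for name in POCKET_CODONS:
        print(name, pocket_size_bp(name))
```

Some codons (for example 99 and 147) belong to more than one pocket, so in this sketch the pocket sequences are extracted independently rather than by partitioning the coding sequence.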


of randomized datasets with GST values lower or He values higher than those observed for the true data.

Results and discussion

HLA-A and HLA-B supertype frequencies and their geographic distributions

In a previous study (the only one, to our knowledge, except our own study on HLA-DRB1 (Gibert and Sanchez-Mazas 2003)) addressing population differentiation at the supertype level, Sidney et al. (1996) used five population samples and reported that all supertypes were present in all world regions. This current study with 55 populations greatly extends those original observations, allowing us to show that some supertypes are not observed in all populations while reaching a frequency of more than 50 % in others (Figs. 1a, c, 2, and 3). Among the HLA-A supertypes, A1 is the rarest, showing frequencies smaller than 9 % in more than half of the populations (Fig. 1a) and being virtually absent in five of them (Fig. 1c). A1 alleles are found with high frequencies (22 % on average) in Africa, Southwest Asia, and Europe (Fig. 2), resulting in a significant geographic structure, i.e., with most of the variation being found among populations of different geographic regions (FCT > FSC; Table 2). The A1 supertype is represented by a small number of alleles, with one or two alleles in more than half of the populations (Fig. 1b) and only one in 14 of them (Fig. 1c). The A2 and A3 supertypes exhibit more even distributions, half of the populations having frequencies ranging from 14 to 29 % for A2 and 14 to 32 % for A3 (Figs. 1a and 2). As a consequence, among the HLA-A supertypes, A2 and A3 present either the lowest or no geographic structure at all (FCT < FSC for A2 and FCT not significantly different from 0 for A3; Table 2). All populations present at least one allele of supertype A2 (eight of them showing just one), while the A3 supertype is represented by a large number of alleles (Fig. 1b, c). The A24 supertype is observed in all populations (Fig. 1c), with frequencies ranging from 13 to 40 % in half of them (Fig. 1a). Despite its broad distribution, A24 is often represented by only two alleles, A*23:01 and A*24:02, with 26 and 10 populations showing just one or both of these alleles, respectively (Fig. 1b, c). This supertype is found at higher frequencies (40 % on average) in SEA, PAC, AUS, NEA, and AME (Fig. 2). Although A24 exhibits the highest level of population differentiation among the four HLA-A supertypes (FST = 11 %, p<0.0001), most of the variation is found within geographic regions (FCT < FSC).

Fig. 1 Supertype variation, a boxes represent the frequency distributions of the four HLA-A and the five HLA-B supertypes and the "non-classified alleles" NCA and NCB, respectively; b each box represents the distribution of the number of distinct alleles of each supertype per population; and c the dark gray section of the bars represents the number of populations showing only one allele for the referred supertype (referred to as "monomorphic populations"). The light gray section of the bars represents the number of populations where the referred supertype was not detected
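
The frequency summaries reported above (supertype frequencies obtained by direct counting, and ranges covering "half of the populations", i.e., the box of each box plot in Fig. 1a) can be sketched in Python as follows. This is only an illustration under stated assumptions: the study used R scripts, the function names are hypothetical, the allele counts in the usage note are invented, and the quartile convention (`method="inclusive"`) is a choice made here, not one documented in the article.

```python
from collections import Counter
from statistics import quantiles


def supertype_frequencies(allele_counts, allele_to_supertype, unclassified="NC"):
    """Pool allele counts into supertypes and convert them to frequencies.

    Alleles missing from allele_to_supertype are pooled into a single
    non-classified group, mirroring the NCA/NCB groups of the article.
    """
    counts = Counter()
    for allele, n in allele_counts.items():
        counts[allele_to_supertype.get(allele, unclassified)] += n
    total = sum(counts.values())
    return {supertype: n / total for supertype, n in counts.items()}


def interquartile_range(values):
    """Return (Q1, Q3), the interval containing the central half of the values."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    return q1, q3
```

For instance, with hypothetical counts `{"A*23:01": 2, "A*24:02": 6, "A*29:02": 2}` and a mapping that assigns the first two alleles to A24 (A*29:02 being one of the alleles placed in the NCA group), the function returns a frequency of 0.8 for A24 and 0.2 for the non-classified group. Applying `interquartile_range` to the per-population frequencies of one supertype gives the kind of range quoted in the text.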


Fig. 2 HLA-A supertype frequencies. Heat map summarizing the frequencies of the four HLA-A supertypes and the non-classified alleles (NCAs). Population names are shown on the right

The frequencies of the HLA-A non-classified alleles (NCAs) vary greatly between populations, ranging from 2 to 14 % in half of them (Fig. 1a). The NCA group presents a strong geographic structure (FCT being twice as much as FSC) and a very high FST value (almost 16 %) (Table 2). The highest NCA frequencies are found in African and Australian populations (averages of 16 and 43 %, respectively) (Fig. 2).

The HLA-B supertypes fall into two main categories regarding their frequency distributions. On the one hand, B7 and B44 exhibit a pattern resembling A2 and A3, with high average frequencies (Figs. 1a and 3) and relatively low levels of geographic structure (Table 2). Half of the populations present frequencies ranging from 18 to 31 % for B7 and from 21 to 32 % for B44, respectively (Figs. 1a and 3). Both B7 and B44 are observed in all populations (except B7 in the Yami; Figs. 1c and 3), with large numbers of alleles per population (Fig. 1b, c). By contrast, B58 and B62 exhibit very low frequencies, ranging from 0 to 5.8 % and from 2.9 to 18 % in half of the populations, respectively (Fig. 1a). Among the five HLA-B supertypes, B62 presents the highest level of population differentiation (FST = 11.38 %, p<0.0001; Table 2), although with no clear geographic structure (FCT < FSC; Table 2). Such a geographic structure is only found for B58 (FCT of 5.6 %, almost twice as great as FSC; Table 2), which is observed in SSA populations at an average frequency of 33 % (from 23 to 60 %; Fig. 3), against 4.2 % in the other regions (Fig. 3) and no observation at all in many populations (18 out of 55; Fig. 3). The B27 supertype presents an intermediate pattern between B7/B44 and B58/B62. It exhibits relatively lower frequencies (from 7 to 19 % in half of the populations; Fig. 1a) and a higher level of population differentiation than B7 and B44 (FST = 7.5 %, p<0.0001; Table 2) but no geographic structure (FCT very close to zero; Table 2). Contrasting with what is observed for the NCA, the non-classified alleles for HLA-B (NCB) are


Fig. 3 HLA-B supertype frequencies. Heat map summarizing the frequencies of the five HLA-B supertypes and the non-classified alleles (NCBs). Population names are shown on the right

quite frequent, with frequencies ranging from 10 to 17 % in half of the populations (Fig. 1a). More than 75 % of populations present at least two different NCBs (Fig. 1b), and only two populations lack one of these alleles (Fig. 1c). The NCBs also exhibit a significant geographic structure, although not as strong as for NCA (Table 2).

In summary, based on the observed data, supertypes can be allocated into two main categories: on the one hand, A2, A3, B7, B27, and B44 fit the classical view that supertypes are evenly distributed (Figs. 1a, 2, and 3), poorly structured geographically (Table 2), and represented by a large number of alleles (Fig. 1b, c). On the other hand, A1, A24, B58, and B62 present a greater frequency variation among populations (Figs. 2 and 3 and Table 2), and in some cases significant geographic structure (i.e., for A1 and B58, both being very common in Africa), and are represented by a smaller number of alleles. Although the unclassified alleles have brought noise to the analysis, they should not be ignored. They are a consequence of the functional supertype classification, and they were kept to understand exactly how they influence the variations in HLA-A and HLA-B. As discussed above, the NCA consists of a small group of alleles, which reach high frequencies in island populations. On the other hand, NCB is a more heterogeneous group appearing in almost all populations.

Heterozygosity and interpopulation differentiation

Using both complete and reduced datasets (see "Materials and methods" section), the heterozygosity estimated for the data treated at the allelic level is always larger than that estimated for the data treated at the supertype level (Table 3). This result is expected because alleles are nested within supertypes, and the heterozygosity of the latter is thus constrained to be equal to or smaller than that of the former.
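
The nesting constraint just described, and the randomization test outlined in the "Materials and methods" section, can be illustrated with a short Python sketch. It is a toy version under explicit assumptions, not the authors' implementation (which relied on R and Perl scripts): heterozygosity is computed as 1 − Σp² without any sample-size correction, the randomization keeps the number of alleles per supertype fixed by shuffling labels (the first of the two approaches listed in the methods), ties are counted as "higher" when computing the empirical p-value, and all names and frequencies are hypothetical.

```python
import random


def expected_heterozygosity(freqs):
    """He = 1 - sum of squared frequencies (no sample-size correction)."""
    return 1.0 - sum(p * p for p in freqs.values())


def pool_by_supertype(allele_freqs, assignment):
    """Sum allele frequencies within each supertype label."""
    pooled = {}
    for allele, p in allele_freqs.items():
        label = assignment[allele]
        pooled[label] = pooled.get(label, 0.0) + p
    return pooled


def permutation_p_value(allele_freqs, assignment, n_perm=10_000, seed=0):
    """Share of randomized assignments whose He is at least the observed He.

    Labels are shuffled among alleles, so the number of alleles per
    supertype stays equal to that of the real assignment.
    """
    rng = random.Random(seed)
    alleles = sorted(assignment)
    labels = [assignment[a] for a in alleles]
    he_obs = expected_heterozygosity(pool_by_supertype(allele_freqs, assignment))
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(labels)
        shuffled = dict(zip(alleles, labels))
        he_rand = expected_heterozygosity(pool_by_supertype(allele_freqs, shuffled))
        if he_rand >= he_obs:  # assumption: ties count as "higher"
            hits += 1
    return hits / n_perm


# Hypothetical population with four alleles grouped into two supertypes.
toy_freqs = {"a1": 0.4, "a2": 0.3, "a3": 0.2, "a4": 0.1}
toy_assignment = {"a1": "S1", "a2": "S1", "a3": "S2", "a4": "S2"}
he_allele = expected_heterozygosity(toy_freqs)
he_supertype = expected_heterozygosity(pool_by_supertype(toy_freqs, toy_assignment))
```

In this toy population the allele-level He is 0.70 and the supertype-level He is 0.42, which illustrates why the observed supertype heterozygosity has to be compared with randomized groupings of the same alleles rather than with the allelic value directly.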


Table 2 Supertype differentiation indexes among populations (FST), among populations within geographic regions (FSC), and among geographic regions (FCT)

Supertypes   FST           FSC          FCT(a)
A1           9.95 %***     2.67 %***    7.48 %***
A2           4.85 %***     3.40 %***    1.51 %*
A3           6.48 %***     6.48 %***    0.000(b)
A24          11.14 %***    6.66 %***    4.80 %***
NCA          15.90 %***    4.90 %***    11.56 %***
B7           5.11 %***     3.21 %***    1.97 %*
B27          7.54 %***     7.10 %***    0.47 %(b)
B44          3.21 %***     1.72 %***    1.51 %**
B58          8.34 %***     2.91 %***    5.59 %**
B62          11.38 %***    7.35 %***    4.35 %*
NCB          7.02 %***     2.90 %***    4.24 %***

*p<0.01; **p<0.001; ***p<0.0001, where p values refer to the probability of observing a statistic as extreme under the null hypothesis of no structure
(a) In italics: Values of FCT > FSC, an indication that most of the variation was found among populations of different geographic regions
(b) Not significant value

In order to define the degree to which genetic differentiation, measured by GST between populations, was concordant at the supertype and allelic levels, we estimated the correlation between these measures and tested their significance using Mantel tests. The results suggest that when using the complete population dataset, the patterns of population differentiation observed at the supertype and allelic levels are very similar, especially for HLA-A (r=0.956, p<0.0005; Fig. 4a) but also for HLA-B (r=0.75, p<0.0005; Fig. 4b). The removal of the Pacific, Australian, Taiwanese, and Native American populations provokes an overall drop of both the GST values and their correlations. Despite this decrease, a high correlation coefficient is still observed for HLA-A (r=0.62, p<0.0005; Fig. 4c), whereas the value is much lower for HLA-B (r=0.3, p<0.0005; Fig. 4d). Because Pacific, Australian, Taiwanese, and Native American populations contribute to large differentiation values, lower correlation coefficients were expected after removing them. Furthermore, these populations also exhibit a reduced set of alleles per supertype, which may explain the higher correlations between alleles and supertypes when they are taken into account. The difference between alleles and supertypes is less pronounced for HLA-A which presents a smaller number of alleles per supertype in all populations (Fig. 1b, c).

Table 3 Expected heterozygosity (He) of alleles and supertypes

Loci     Dataset(a)   Average allelic He   Average supertype He
HLA-A    Complete     0.7761               0.6774
HLA-A    Reduced      0.8974               0.7504
HLA-B    Complete     0.8948               0.7577
HLA-B    Reduced      0.9429               0.7766

(a) Complete dataset, all populations; reduced dataset, excluding Pacific, Australian, Taiwanese, and Native American populations

Patterns of molecular variability for different PBR pockets of HLA-A and HLA-B

Our goal in this part of the study was to test the prediction that the B and F pockets of the PBR exhibit the highest levels of variation as a consequence of their crucial role in peptide binding, which is expected to result in a stronger effect of balancing selection.

We first estimated the global levels of variation at the PBR and observed significantly higher levels of nucleotide diversity (πtotal) at HLA-B, compared to HLA-A (p<0.0000005; Wilcoxon rank sum test). Moreover, these two genes differ in the way molecular variation is distributed among the A, B, CDE, and F pockets within the PBR (Fig. 5). The rank order of πtotal is pCDE≫pB≫pA>pF, at HLA-A, and pB≫pF>pCDE≫pA, at HLA-B (where p is an abbreviation for "pocket" and ≫ and > indicate greater than and significant, at the 0.00001 level, and greater than but non-significant differences, respectively, according to a Wilcoxon rank sum test; Fig. 5). Among the HLA-A pockets, most of the variation is found in the CDE pockets, which make up the central region of the PBR, and significantly less in pB (πtotal values ranging from 0.14 to 0.15 and from 0.11 to 0.12 in half of the populations, respectively; Fig. 5). The pA and pF pockets exhibit the smallest levels of variation (πtotal values ranging from 0.07 to 0.09 in half of the populations; Fig. 5). Among the HLA-B pockets, pB exhibits by far the highest variation, with πtotal values ranging from 0.18 to 0.21 in half of the populations, whereas the other pockets exhibit a relatively narrow πtotal distribution (ranging from 0.10 to 0.12 in half of the populations; Fig. 5).

The hypothesis that the pockets B and F are the main targets of balancing selection is thus partially supported for HLA-B, since pB presents by far the highest level of nucleotide diversity. Interestingly, van Deutekom and Kesmir (2015) recently showed that changes involving several of the B pocket’s amino acids had a profound impact on peptide-binding properties, which corroborates our interpretation. On the other hand, pF, which is not significantly different from pA at HLA-A, and from pCDE at HLA-B, does not present an increased value of πtotal, which would be evidence against balancing selection. It is important to note that these results were obtained independently from the classification of alleles into supertypes, since the determination of the pockets’ codons was taken from the classical study of Saper et al. (1991).

We also analyzed how the nucleotide diversity was distributed between supertypes. Since the supertype categorization is


Fig. 4 Plots of GST values between populations based on allele (Y axis) and supertype (X axis) frequencies. The correlation (Rxy) and significance were obtained using a Mantel test. Complete dataset, all populations; reduced dataset, excluding Pacific, Australian, Taiwanese, and Native American populations

based on variations of pB and pF, these pockets were expected to present more differences between supertypes than the others. This prediction was confirmed for pF at HLA-A and pB at HLA-B (Fig. 6).
As pB presents the highest levels of variation at HLA-B and also accounts for most of the differences between HLA-B supertypes, we conclude that the variation between HLA-B supertypes accounts for most of the differences observed between HLA-B alleles. In other words, alleles classified within the same HLA-B supertype share more similarities than alleles assigned to different HLA-B supertypes. By contrast, most of the differences between HLA-A supertypes lie within pF, the pocket presenting the lowest πtotal values for this gene. Therefore, at this locus, the supertypes do not account for most of the variation between alleles (Fig. 6). In other words, HLA-A presents more variation within than between supertypes.

Simulation approach to test selection on supertypes

According to the definition of Sidney et al. (1996), alleles included within the same supertype have overlapping peptide-binding specificities. To test the effects of the supertype classification on expected heterozygosities (He) and pairwise differentiation (GST), we generated null distributions for these two statistics under the hypothesis that alleles within supertypes are a random collection, with no shared functional attributes. To this end, the assignment of alleles to supertypes was randomized by permuting the supertype labels attributed to each allele motif, as described in the “Materials and methods” section. As the same patterns were obtained using the two different simulation approaches (see “Materials and methods” section), we only present the results for the case without any constraint on the number of alleles associated with a specific supertype.
For HLA-A, we do not observe any population with a significant difference in He between the real and random supertype assignments. For HLA-B, 6 out of 55 populations exhibit significantly lower He (permutation-based p<0.05) than those acquired via simulations. These six populations belong to the reduced dataset. Because the number of populations with individually significant p values in either direction (i.e., with significantly lower or greater He compared to the simulated value) is small, we investigated whether the distribution of the p values itself was informative regarding selective effects. To do this, we used an exact binomial test to assess whether the observed distribution of p values deviated from one composed of equal numbers of values on either side of 0.5 (the expected proportion of deviation in either direction under the null hypothesis; Fig. 7). For HLA-A, no significant deviation is found (p value>0.05 for both complete and reduced datasets). For HLA-B, however, a significant skew towards p values greater than 0.5 is observed, indicating an overall significant excess of populations with lower He than

Fig. 5 Total nucleotide diversity (πtotal) at HLA-A and HLA-B PBR pockets. Each box represents the distribution of the total nucleotide diversity per
pocket for the populations of the complete dataset
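
As an illustration of the quantity summarized in Fig. 5, the sketch below computes a per-pocket nucleotide diversity for one population. It is an assumption-laden illustration, not the procedure used in the study: π is taken here as the frequency-weighted mean proportion of pairwise differences, restricted to the sites belonging to a pocket, and the small-sample correction n/(n-1) is omitted. The sequences, frequencies, and site indices are hypothetical.

```python
from itertools import combinations


def pairwise_diff(seq_a, seq_b, sites):
    """Proportion of the selected sites at which two aligned sequences differ."""
    return sum(seq_a[s] != seq_b[s] for s in sites) / len(sites)


def nucleotide_diversity(seqs, freqs, sites):
    """pi = sum over allele pairs i<j of 2 * x_i * x_j * d_ij, for the given sites.

    seqs  : dict allele name -> aligned nucleotide sequence
    freqs : dict allele name -> allele frequency in the population
    sites : 0-based alignment positions that make up the pocket (hypothetical)
    """
    total = 0.0
    for a, b in combinations(seqs, 2):
        total += 2 * freqs[a] * freqs[b] * pairwise_diff(seqs[a], seqs[b], sites)
    return total
```

Calling the function once per pocket, each time with that pocket's site list, yields one value per population and pocket, which is the kind of distribution each box in Fig. 5 represents.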


Fig. 6 Nucleotide diversity between supertypes (πst) at HLA-A and HLA-B PBR pockets. Each box represents the distribution of the nucleotide
diversity between supertypes per pocket for the populations of the complete dataset

those obtained through simulations (p value<0.05 and p value<0.005 for complete and reduced datasets, respectively).
For both HLA-A and HLA-B, GST values were not significantly different from those of the randomized data, when using the complete dataset. This is also true when using the reduced dataset for HLA-A but not for HLA-B. Indeed, after removing the Pacific, Australian, Taiwanese, and Native American populations, the observed GST is higher than 98 % of the simulations for HLA-B (Fig. 8). This finding differs from the expectations of Sidney et al. (1996), who predicted an overall decrease of differentiation at the supertype level. However, it is in agreement with our description of the observed data. Indeed, in our simulations, alleles were randomly assigned to supertypes, creating randomized supertypes with similar contents of common and rare alleles. The common alleles are expected to be assigned to different randomized supertypes in most of the simulations because they are less numerous than the rare alleles. Such a pattern is similar to that described for real HLA-A supertypes, which present a low number of common alleles per population (Fig. 1b, c). As discussed above, this pattern also explains the high correlation found between GST values measured at the allelic and supertype levels for this locus (Fig. 4). Finally, as also discussed above for the PBR pockets, less variation is found between than within HLA-A supertypes. This indicates that HLA-A supertypes are composed of heterogeneous sets of alleles with few sequence similarities at pF (Figs. 5 and 6), which explains the similarity between the results based on the observed and randomized data. On the other hand, HLA-B supertypes appear to be composed of alleles sharing more sequence similarities, as shown by the molecular analysis of the PBR pockets (Figs. 5 and 6).
In summary, HLA-B supertypes are sets of alleles with B pocket resemblances, and these similarities can be interpreted directly in terms of peptide presentation profiles because HLA-B supertypes exhibit major differences regarding the chemical properties of pB. Thus, our results showing an increased differentiation at the level of HLA-B supertypes are consistent with an effect of natural selection resulting in local adaptation of populations to different pathogen environments. Through our simulations, the functional grouping of alleles reflected by the HLA-B supertypes is disrupted, creating randomized groups in the same way as described for HLA-A. The frequent allocation of common alleles into different randomized supertypes in the simulations thus provokes both an increase of He and a decrease of population differentiations (GST), when compared with the observed data (Figs. 7 and 8). In agreement with this interpretation, the inclusion of the

Fig. 7 P value distributions obtained through simulations for the expected heterozygosity (He). The p value is defined as the proportion of simulated datasets with He larger than the observed He. The results obtained with the complete (top) and reduced (bottom) dataset are shown
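
The p value defined in the caption of Fig. 7, and the exact binomial test applied to the resulting set of p values, can be sketched as follows. This is a minimal illustration under stated assumptions rather than the simulation code of the study: supertype labels are shuffled among alleles (which keeps the number of alleles per supertype fixed and may therefore differ from the unconstrained scheme reported in the text), He is computed as one minus the sum of squared frequencies without sample-size correction, and all names, inputs, and defaults are hypothetical.

```python
import math
import random


def expected_heterozygosity(freqs):
    """He = 1 - sum of squared frequencies (no sample-size correction)."""
    return 1.0 - sum(f * f for f in freqs)


def supertype_he(allele_freqs, labels):
    """He at the supertype level: allele frequencies are summed per supertype."""
    totals = {}
    for allele, f in allele_freqs.items():
        totals[labels[allele]] = totals.get(labels[allele], 0.0) + f
    return expected_heterozygosity(totals.values())


def permutation_p_value(allele_freqs, labels, n_sim=1000, seed=0):
    """Proportion of label permutations whose He is larger than the observed He."""
    rng = random.Random(seed)
    observed = supertype_he(allele_freqs, labels)
    alleles = list(labels)
    values = [labels[a] for a in alleles]
    larger = 0
    for _ in range(n_sim):
        rng.shuffle(values)  # random reassignment of supertype labels to alleles
        if supertype_he(allele_freqs, dict(zip(alleles, values))) > observed:
            larger += 1
    return larger / n_sim


def binomial_two_sided(k, n):
    """Exact two-sided binomial p value for k successes in n trials, p0 = 0.5."""
    probs = [math.comb(n, i) * 0.5 ** n for i in range(n + 1)]
    return min(1.0, sum(p for p in probs if p <= probs[k] * (1 + 1e-9)))
```

With one permutation p value per population, `binomial_two_sided(k, n)` would be called with `k` equal to the number of populations whose p value exceeds 0.5 and `n` equal to the number of populations, mirroring the comparison against equal numbers of values on either side of 0.5.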


Fig. 8 Simulation results for GST. The red line represents the average observed GST. We calculated the average GST value for each simulated step and then determined the significance as the proportion of simulated values smaller than the observed one. The results with the complete (top) and reduced (bottom) datasets are shown
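
The significance measure described in the caption of Fig. 8 can be illustrated with a short sketch. It assumes the basic form GST = (HT - HS) / HT for a pair of populations, where HS is the mean within-population heterozygosity and HT the heterozygosity of the averaged frequencies, and it leaves out the sample-size corrections of Nei and Chesser (1983). The function names and inputs are hypothetical and do not reproduce the software used in the study.

```python
def heterozygosity(freqs):
    """He = 1 - sum of squared frequencies."""
    return 1.0 - sum(f * f for f in freqs)


def collapse_to_supertypes(allele_freqs, labels):
    """Sum allele frequencies within each supertype label."""
    totals = {}
    for allele, f in allele_freqs.items():
        totals[labels[allele]] = totals.get(labels[allele], 0.0) + f
    return totals


def gst(pop_a, pop_b):
    """Uncorrected pairwise GST between two populations (dict category -> frequency)."""
    keys = set(pop_a) | set(pop_b)
    hs = (heterozygosity(pop_a.values()) + heterozygosity(pop_b.values())) / 2
    ht = heterozygosity([(pop_a.get(k, 0.0) + pop_b.get(k, 0.0)) / 2 for k in keys])
    return (ht - hs) / ht if ht > 0 else 0.0


def proportion_smaller(simulated, observed):
    """Share of simulated values that fall below the observed value."""
    return sum(v < observed for v in simulated) / len(simulated)
```

In this reading, each simulated step would reassign alleles to supertypes, collapse every population with `collapse_to_supertypes`, average `gst` over population pairs, and the observed average would then be located within the simulated distribution using `proportion_smaller`.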

Pacific, Australian, Taiwanese, and Native American populations reduces this effect because the patterns of variation at HLA-B for these populations resemble those observed at HLA-A, with a relatively low number of alleles belonging to different supertypes.

Conclusions

The supertype classification of HLA-A and HLA-B alleles has been widely used in medical research, with reports suggesting that supertype-level variation explains susceptibility or resistance to a series of pathogenic diseases (Alencar et al. 2013; Chakraborty et al. 2013; Cordery et al. 2012; Gilchuk et al. 2013; Karlsson et al. 2012, 2013; Kuniholm et al. 2013; Trachtenberg et al. 2003). This classification was proposed in the 1990s as an attempt to find, as described by Sette and Sidney (1999), “the common denominators and similarities hidden within this very large degree of polymorphism.” The same authors also stated that “the overall frequency of each of these supertypes is remarkably high and fairly conserved among very different ethnicities. Thus, there might be some advantage for human populations to present approximately five to ten main binding specificities and that each one of these is maintained at relatively high frequency.” According to our results, the variation among HLA-B supertypes does reflect the functional diversity at this locus and is thus in agreement with the above-mentioned hypothesis. Our results strongly indicate that the B pocket is likely to be the main target of natural selection at HLA-B, as it presents the highest levels of molecular variation and accounts for the main differences in the peptide presentation profiles for this gene. However, in contrast with classical expectations for loci evolving under balancing selection, our simulation results reveal that HLA-B supertype frequencies do not show a signature of balancing selection (i.e., we find lower He compared to those of randomly assigned groups of alleles), implying that each supertype is not maintained at relatively high frequencies in all populations. This result is supported by the geographically heterogeneous distributions of B58 and B62 (and, to a lesser extent, B27) frequencies among populations. Moreover, populations are more differentiated than expected for HLA-B supertypes (higher observed GST values than those obtained from randomly assigned groups of alleles). As most of the differences between HLA-B supertypes lie in the B pocket, this means that the differences in HLA-B supertype composition among populations can be interpreted in terms of peptide recognition. Thus, for HLA-B, our results support the idea that populations present more differences in peptide presentation profiles than expected, possibly due to local adaptations to pathogens.
By contrast, most of the differences between HLA-A alleles are not related to differences at the supertype level. This is supported by our simulation results showing that the randomly assigned groups of alleles often reproduce the observed patterns of variation and differentiation of HLA-A supertypes. Moreover, HLA-A alleles are more conserved at the sites involved in peptide binding, suggesting that they present a more conserved profile of peptides across populations, differing from what is observed for HLA-B. Of note, one possible caveat of inferring peptide binding through the supertype classification is that some peptides presented by HLA class I molecules are known to assume a looping conformation outside the peptide-binding groove. However, no matter how many different conformations a peptide can adopt, the anchor amino acids located at the peptide ends remain the same, limited by the B and F pockets. In this way, this


conformational variability exhibited by the peptides is also a consequence of the interaction between the peptide anchors and the B and F pockets and thus is not expected to change the results obtained here.
Our results suggest that the B pocket of the HLA-B molecules is the main target of natural selection, whereas no such signals could be retrieved for the other HLA-B pockets nor for the pockets of the HLA-A molecules in relation to the supertype classification. This conclusion matches the expectations that supertypes are the primary targets of selection for HLA-B but not for HLA-A. Following this idea, we could state that HLA-A supertypes are composed of alleles whose resemblances are not the consequence of a shared phylogenetic origin. A future extension of this work could be to explore whether the central pockets C, D, and E that have been shown to contain most of the variation at HLA-A could be used as an alternate functional classification for these alleles.

Acknowledgments This work was supported by the Swiss National Science Foundation (SNSF) grant no. 31003A_144180 to ASM and São Paulo Research Foundation (FAPESP) 12/18010-0 and a CNPq productivity grant no. 308167/2012-0 to DM. RSF was supported by CNPq (grant no. 142130/2009-5) and CAPES (grant no. 12447/12-9). We also thank two anonymous reviewers for their useful comments.

Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.

References

Alencar LXE, Braga-Neto UM, Nascimento EJM, Cordeiro MT, Silva AM, Brito CAA, Silva PM, Gil LH, Montenegro SM, Marques Júnior ET Jr (2013) HLA-B*44 is associated with dengue severity caused by DENV-3 in a Brazilian population. J Trop Med 2013:648475
Apanius V, Penn D, Slev PR, Ruff LR, Potts WK (1997) The nature of selection on the major histocompatibility complex. Crit Rev Immunol 17(2):179–224
Borghans JA, Beltman JB, De Boer RJ (2004) MHC polymorphism under host-pathogen coevolution. Immunogenetics 55(11):732–739
Buhler S, Sanchez-Mazas A (2011) HLA DNA sequence variation among human populations: molecular signatures of demographic and selective events. PLoS One 6(2):e14643
Chakraborty S, Rahman T, Chakravorty R, Kuchta A, Rabby A, Sahiuzzaman M (2013) HLA supertypes contribute in HIV type 1 cytotoxic T lymphocyte epitope clustering in Nef and Gag proteins. AIDS Res Hum Retroviruses 29(2):270–278
Cordery DV, Martin A, Amin J, Kelleher AD, Emery S, Cooper DA, STEAL study group (2012) The influence of HLA supertype on thymidine analogue associated with low peripheral fat in HIV. AIDS 26(18):2337–2344
R Development Core Team (2011) R: a language and environment for statistical computing. Vienna, Austria: the R Foundation for Statistical Computing. ISBN: 3-900051-07-0. Available online at http://www.R-project.org/
Dray S, Dufour AB (2007) The ade4 package: implementing the duality diagram for ecologists. J Stat Softw 22(4):1–20
Excoffier L, Lischer HE (2010) Arlequin suite ver 3.5: a new series of programs to perform population genetics analyses under Linux and Windows. Mol Ecol Resour 10(3):564–567
Gibert M, Sanchez-Mazas A (2003) Geographic patterns of functional categories of HLA-DRB1 alleles: a new approach to analyse associations between HLA-DRB1 and disease. Eur J Immunogenet 30(5):361–374
Gilchuk P, Spencer CT, Conant SB, Hill T, Gray JJ, Niu X, Zheng M, Erickson JJ, Boyd KL, McAfee KJ, Oseroff C, Hadrup SR, Bennink JR, Hildebrand W, Edwards KM, Crowe JE, Williams JV, Buus S, Sette A, Schumacher TN, Link AJ, Joyce S (2013) Discovering naturally processed antigenic determinants that confer protective T cell immunity. J Clin Invest 123(5):1976–1987
Hedrick PW, Whittam TS, Parham P (1991) Heterozygosity at individual amino acid sites: extremely high levels for HLA-A and -B genes. Proc Natl Acad Sci U S A 88(13):5897–5901
Hughes AL, Nei M (1988) Pattern of nucleotide substitution at major histocompatibility complex class I loci reveals overdominant selection. Nature 335(6186):167–170
Jost L (2008) G(ST) and its relatives do not measure differentiation. Mol Ecol 17(18):4015–4026
Karlsson I, Kløverpris H, Jensen KJ, Stryhn A, Buus S, Karlsson A, Vinner L, Goulder P, Fomsgaard A (2012) Identification of conserved subdominant HIV type 1 CD8(+) T cell epitopes restricted within common HLA supertypes for therapeutic HIV type 1 vaccines. AIDS Res Hum Retroviruses 28(11):1434–1443
Karlsson I, Brandt L, Vinner L, Kromann I, Andreasen LV, Andersen P, Gerstoft J, Kronborg G, Fomsgaard A (2013) Adjuvanted HLA-supertype restricted subdominant peptides induce new T-cell immunity during untreated HIV-1-infection. Clin Immunol 146(2):120–130
Kuniholm MH, Anastos K, Kovacs A, Gao X, Marti D, Sette A, Greenblatt RM, Peters M, Cohen MH, Minkoff H, Gange SJ, Thio CL, Young MA, Xue X, Carrington M, Strickler HD (2013) Relation of HLA class I and II supertypes with spontaneous clearance of hepatitis C virus. Genes Immun 14(5):330–335
Lawlor DA, Zemmour J, Ennis PD, Parham P (1990) Evolution of class-I MHC genes and proteins: from natural selection to thymic selection. Annu Rev Immunol 8:23–63
Mack S, Sanchez-Mazas A, Meyer D, Single R, Tsai Y et al (2006) 13th International Histocompatibility Workshop Anthropology/Human Genetic Diversity Joint Report—Chapter 2: methods used in the generation and preparation of data for analysis in the 13th International Histocompatibility Workshop. In: Hansen J (ed) Immunobiology of the human MHC: Proceedings of the 13th International Histocompatibility Workshop and Conference. IHWG Press, Seattle, pp 564–579
Mantel N (1967) The detection of disease clustering and a generalized regression approach. Cancer Res 27(2):209–220
Naugler C, Liwski R (2008) An evolutionary approach to major histocompatibility diversity based on allele supertypes. Med Hypotheses 70(5):933–937
Nei M (1987) Molecular evolutionary genetics. Columbia University Press, New York
Nei M, Chesser RK (1983) Estimation of fixation indices and gene diversities. Ann Hum Genet 47(Pt 3):253–259
Nunes JM (2014) Using Uniformat and Gene[rate] to analyse data with ambiguities in population genetics. http://dx.doi.org/10.6084/m9.figshare.984299


Nunes JM, Buhler S, Roessli D, Sanchez-Mazas A, HLA-net 2013 collaboration (2014) The HLA-net Gene[rate] pipeline for effective HLA data analysis and its application to 145 populations from Europe and neighbouring areas. Tissue Antigens 83(5):307–323
Parham P (2005) MHC class I molecules and KIRs in human history, health and survival. Nat Rev Immunol 5(3):201–214
Parham P, Benjamin RJ, Chen BP, Clayberger C, Ennis PD, Krensky AM, Lawlor DA, Littman DR, Norment AM, Orr HT et al (1989) Diversity of class I HLA molecules: functional and evolutionary interactions with T cells. Cold Spring Harb Symp Quant Biol 54(Pt 1):529–543
Prugnolle F, Manica A, Charpentier M, Guégan JF, Guernier V, Balloux F (2005) Pathogen-driven selection and worldwide HLA class I diversity. Curr Biol 15(11):1022–1027
Qutob N, Balloux F, Raj T, Liu H, Marion de Procé S, Trowsdale J, Manica A (2011) Signatures of historical demography and pathogen richness on MHC class I genes. Immunogenetics 64(3):165–175
Robinson J, Halliwell JA, Hayhurst JD, Flicek P, Parham P, Marsh SG (2015) The IPD and IMGT/HLA database: allele variant databases. Nucleic Acids Res 43(Database issue):D423–D431
Sanchez-Mazas A, Lemaître JF, Currat M (2012) Distinct evolutionary strategies of human leucocyte antigen loci in pathogen-rich environments. Philos Trans R Soc Lond B Biol Sci 367(1590):830–839
Saper MA, Bjorkman PJ, Wiley DC (1991) Refined structure of the human histocompatibility antigen HLA-A2 at 2.6 Å resolution. J Mol Biol 219(2):277–319
Sette A, Sidney J (1999) Nine major HLA class I supertypes account for the vast preponderance of HLA-A and -B polymorphism. Immunogenetics 50(3–4):201–212
Sidney J, Grey HM, Kubo RT, Sette A (1996) Practical, biochemical and evolutionary implications of the discovery of HLA class I supermotifs. Immunol Today 17(6):261–266
Sidney J, Peters B, Frahm N, Brander C, Sette A (2008) HLA class I supertypes: a revised and updated classification. BMC Immunol 9:1
Slade RW, McCallum HI (1992) Overdominant vs. frequency-dependent selection at MHC loci. Genetics 132(3):861–864
Takahata N, Nei M (1990) Allelic genealogy under overdominant and frequency-dependent selection and polymorphism of major histocompatibility complex loci. Genetics 124(4):967–978
Takahata N, Satta Y, Klein J (1992) Polymorphism and balancing selection at major histocompatibility complex loci. Genetics 130(4):925–938
Trachtenberg E, Korber B, Sollars C, Kepler TB, Hraber PT, Hayes E, Funkhouser R, Fugate M, Theiler J, Hsu YS, Kunstman K, Wu S, Phair J, Erlich H, Wolinsky S (2003) Advantage of rare HLA supertype in HIV disease progression. Nat Med 9(7):928–935
van Deutekom HW, Kesmir C (2015) Zooming into the binding groove of HLA molecules: which positions and which substitutions changes peptide binding most? Immunogenetics 67(8):425–436

260
Appendix A.3.
Copy of the article “Kiwi genome provides insights into evolution of a nocturnal lifestyle”: Genome Biology (2015), 16(1): 1-15.
In this work, I carried out the dN/dS-based tests of selection using the PAML package (and/or supervised their execution and interpretation), and I was responsible for the discussion of the results of these analyses in the article. I also took part in the analyses of the ultra-conserved regions (ultra-conserved non-coding elements) that show more variation than expected in kiwi, pointing to developmental pathways that may be altered in this species. Finally, I contributed corrections to the manuscript and discussions related to the evolutionary aspects of the work.

261
Le Duc et al. Genome Biology (2015) 16:147
DOI 10.1186/s13059-015-0711-4

RESEARCH Open Access

Kiwi genome provides insights into evolution of a nocturnal lifestyle
Diana Le Duc1,2*, Gabriel Renaud2, Arunkumar Krishnan3, Markus Sällman Almén3, Leon Huynen4, Sonja J. Prohaska5,
Matthias Ongyerth2, Bárbara D. Bitarello6, Helgi B. Schiöth3, Michael Hofreiter7, Peter F. Stadler5, Kay Prüfer2,
David Lambert4, Janet Kelso2 and Torsten Schöneberg1*

Abstract
Background: Kiwi, comprising five species from the genus Apteryx, are endangered, ground-dwelling bird species
endemic to New Zealand. They are the smallest and only nocturnal representatives of the ratites. The timing of kiwi
adaptation to a nocturnal niche and the genomic innovations, which shaped sensory systems and morphology to
allow this adaptation, are not yet fully understood.
Results: We sequenced and assembled the brown kiwi genome to 150-fold coverage and annotated the genome
using kiwi transcript data and non-redundant protein information from multiple bird species. We identified
evolutionary sequence changes that underlie adaptation to nocturnality and estimated the onset time of these
adaptations. Several opsin genes involved in color vision are inactivated in the kiwi. We date this inactivation to
the Oligocene epoch, likely after the arrival of the ancestor of modern kiwi in New Zealand. Genome comparisons
between kiwi and representatives of ratites, Galloanserae, and Neoaves, including nocturnal and song birds, show
diversification of kiwi’s odorant receptors repertoire, which may reflect an increased reliance on olfaction rather than
sight during foraging. Further, there is an enrichment of genes influencing mitochondrial function and energy
expenditure among genes that are rapidly evolving specifically on the kiwi branch, which may also be linked to its
nocturnal lifestyle.
Conclusions: The genomic changes in kiwi vision and olfaction are consistent with changes that are hypothesized to
occur during adaptation to nocturnal lifestyle in mammals. The kiwi genome provides a valuable genomic resource for
future genome-wide comparative analyses to other extinct and extant diurnal ratites.

Background
New Zealand’s geographic isolation, after the separation from Gondwana around 80 million years ago, provides an unequaled opportunity to study the results of evolutionary processes following geographic isolation. In New Zealand, the ecological niches typically occupied by mammals in most other parts of the world are dominated by birds. Kiwi (genus Apteryx), the national symbol of New Zealand, belong to a group of flightless birds, the ratites. This group is geographically broadly distributed including both extant members, which are the ostrich in Africa, the emu in Australia, the cassowary in New Guinea, and the rhea in South America, and, as extinct members, the moa from New Zealand and the elephant birds from Madagascar. New Zealand is thus the only landmass to have been inhabited by two ratite lineages. Strikingly, the two lineages are highly divergent in size with moa having a body size of up to 3 m [1] while kiwi, the smallest of the ratites, reaches only the size of a chicken. Moreover, while moa occupied the diurnal niche, kiwi are the only ratites, and one of only a few bird lineages (less than 3 % of the bird species [2]), that are nocturnal. Although the kiwi eye is unusually small for a nocturnal bird, it has a nocturnal-type retina [3]. This may indicate that the nocturnal adaptation of kiwi is recent, or alternatively, that changes in eye size are not a prerequisite for nocturnality.
We have sequenced and assembled the genome of Apteryx mantelli, the North Island brown kiwi, to improve

* Correspondence: diana_leduc@eva.mpg.de; schoberg@medizin.uni-leipzig.de
1 Institute of Biochemistry, Medical Faculty, University of Leipzig, Johannisallee 30, Leipzig 04103, Germany
Full list of author information is available at the end of the article

© 2015 Le Duc et al. This is an Open Access article distributed under the terms of the Creative Commons Attribution License
(http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium,
provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver (http://
creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.


our understanding of how genomic features evolve during adaptation to nocturnality and the ground-dwelling niche. We have also sequenced the transcriptome from embryonic tissue to provide support for the genome annotation. We identified genomic changes in kiwi that affect physiological functions, including vision and olfaction, which have been predicted to characterize nocturnal adaptation in the early history of mammals [4].

Results
Genome sequencing, assembly, and annotation
We prepared 11 libraries with several insert sizes from Apteryx mantelli genomic DNA and sequenced 83 billion base pairs (Gb) from small insert-size libraries and 120 Gb from large-insert mate-pair Illumina libraries (Additional file 1: Table S1). After read correction [5] we assembled contigs and scaffolds using SOAPdenovo [6] (Additional file 1: Note: Filtering and read correction; Genome assembly) to generate a draft assembly, which spanned 1.595 Gb (Additional file 1: Tables S2 and S3). The N50s of contigs and scaffolds were 16.48 kb and 3.95 Mb, respectively (Additional file 1: Table S3). Since the size of the kiwi genome is unknown, we estimated average coverage using a 19-mer frequency distribution (Additional file 1: Figure S1), which yielded a genome size estimate of 1.65 Gb, placing the kiwi among the largest bird genomes sequenced to date [7] (Table 1; Additional file 1: Table S4). The assembled contigs and scaffolds cover approximately 96 % of the complete genome with an average sequence coverage of 35.85-fold after correction (Additional file 1: Note: Filtering and read correction). Assembly quality was assessed by chaining the kiwi scaffolds to two Sanger-sequenced bird genomes: chicken [8] and zebra finch [9]. A total of 50.09 % (0.8 Gb) of the kiwi genome is alignable in syntenic chains to 79.67 % of the much smaller chicken genome (1.07 Gb). A similar fraction, 57.61 % (0.9 Gb), of the kiwi sequence was alignable to 76.92 % of the zebra finch genome (1.2 Gb) (Additional file 1: Table S5). For comparison, 69.86 % (0.84 Gb) of the zebra finch genome is syntenically alignable to 83.51 % of the chicken genome. However, 91.96 % of the zebra finch sequences that are syntenic-chain-alignable to chicken showed conserved synteny in kiwi, suggesting that the kiwi genome assembly includes the majority of conserved regions between birds.
We identified a set of 27,876 genes following de novo gene prediction on the assembled genome (Additional file 1: Note: De novo gene prediction and gene annotation). To refine these gene annotations we used 47.5 Gb of transcript sequence data from kiwi embryonic tissue together with the de novo gene predictions and protein evidence from three well-annotated bird species (G. gallus, T. guttata, M. gallopavo) as input to the MAKER genome annotation pipeline [10]. A validated set of 18,033 genes was selected based on their alignment to orthologous genes in other birds and on supporting evidence provided by kiwi transcript sequences. In total, the gene models spanned 306.62 Mb of the assembly, with exons accounting for 23.96 Mb (approximately 1.6 %) of the total kiwi genome.

Evolution of gene families
Gene family expansion and/or contraction have been proposed as important mechanisms underlying adaptation [11]. We explored patterns of protein family expansions and contractions in kiwi and used TreeFam [12] to define gene families in the kiwi and all bird and reptile genomes in Ensembl 73, as well as two nocturnal birds (barn owl, chuck-will’s-widow), two other ratites (ostrich, tinamou) [7] (GigaDB [13]), two mammals (human, mouse), and one fish (stickleback) (Ensembl 73 [14]). In total we identified 10,096 gene families shared between the inferred ancestral state and the 16 species considered, of which 623 represent single-gene families. For these single-gene families we constructed a maximum-likelihood phylogeny [15] (Fig. 1) and tested for changes in ortholog cluster sizes. In accordance with previous estimates, our results indicate a net gene loss on the avian branch [16].
Changes of gene-family sizes have been inferred for multiple de novo assembled genomes [17, 18]. However, many of these genomes have rather fragmented assemblies (Table 1); thus, results should be interpreted cautiously, only after manual inspection and ideally independent experimental confirmation.
We therefore manually examined the 130 gene families that had either significant expansion or contraction specific to the kiwi branch. After excluding expansions that were caused by fragmentation of the assembly [19] only 85 gene families remained significant (Additional file 1: Table S6). Of these, 63 gene families are expanded in the kiwi. A functional analysis [20] of the gene families showing expansion in kiwi identified enrichment in categories including signal transduction, calcium homeostasis,

Table 1 Kiwi genome assembly characteristics and genomic features compared with other avian genomes (see Additional file 1: Table S4)

Species                    Size of assembly (Gb)   N50 scaffolds (Mb)   Heterozygous SNP rate per kb
Apteryx mantelli           1.59                    4                    1.5
Falco cherrug [17]         1.18                    4.2                  0.8
Falco peregrinus [17]      1.17                    3.9                  0.7
Taeniopygia guttata [9]    1.2                     10.4                 1.4
Ficedula albicollis [90]   1.13                    7.3                  3.03
Anas platyrhynchos [18]    1.1                     1.2                  2.61
Gallus gallus [8]          1.07                    15.5                 4.5
Meleagris gallopavo [91]   0.93                    1.5                  ~1.36


Fig. 1 Phylogenetic tree of 16 species built on 623 TreeFam [12] single-gene families. Branch lengths are scaled to estimated divergence times. All branches are supported by 100 bootstraps. The song bird clade is depicted in blue, Galliformes in purple, Anseriformes in green, and nocturnal birds in red. Ratites (Struthio camelus and Apteryx mantelli) and Tinamus guttatus are highlighted in light green. The number of genes gained (+ red) and lost (− blue) is given underneath each branch. The rate of gene gain and loss for the clades derived from the most recent common ancestor was estimated [77] to be 0.0007 per gene per million years

and motor activity (FDR <0.0001, Additional file 1: Figure S2A). Among the gene families that show
contraction on the kiwi branch we found an enrichment of development-related Gene Ontology (GO)
categories (FDR <0.0001, Additional file 1: Figure S2B).

Diversification of tetrapods and the colonization of terrestrial habitats are often accompanied by
changes of physiological systems specifically in cellular signal transduction [21]. Membrane proteins
are involved in cellular signaling, hence we aimed to determine more specifically which classes of
membrane-expressed proteins have undergone changes in the number of coding genes. To this end we
annotated the membrane proteome in kiwi, human, all birds, and reptiles present in Ensembl 74, two
additional ratites (ostrich and tinamou) and two nocturnal birds (chuck-will’s-widow and barn owl)
(Additional file 1: Note: Detection and classification of the membrane proteome; Additional file 1:
Table S7). We manually inspected the classes which showed expansion in kiwi, to ensure that the higher
number of predicted genes is not a result of assembly fragmentation. We found a significant expansion
in kiwi of genes coding for adhesion and immune-related proteins (Additional file 1: Table S7).
Additionally, we found a significant expansion of the Ephrin kinases class, which are functionally
involved in the development of the sensory-motor innervation of the limb [22] and later on in tendons
condensation and developing feather buds [23].

Patterns of natural selection
To determine whether any branch-specific selection is present in kiwi we estimated branch ω-values
(Ka/Ks substitution ratios) for 4,152 orthologous genes in eight bird species: kiwi, ostrich, tinamou,
chuck-will’s-widow, barn owl, chicken, zebra finch, and turkey using CODEML [24]. Ortholog assignment
was based on the orthology relation among chicken, zebra finch, and turkey defined in Ensembl 73
(Additional file 1: Note: Orthologs and Ka/Ks calculation). The kiwi average ω across all the orthologs
is comparable to that in ostrich, and higher than in tinamou and night birds (0.291, 0.313, 0.145,
0.202, and 0.200 for kiwi, ostrich, tinamou, chuck-will’s-widow, and barn owl, respectively). This
implies a relatively faster overall rate of functional evolution in kiwi and ostrich.

In addition to gene-family expansions/contractions, we used evidence of branch-specific selection to
identify genes and functional pathways that may underlie kiwi-specific adaptations. For the 4,152
orthologous genes in the eight bird species we used the branch models from CODEML to perform likelihood
ratio tests [24], comparing a simple model of one ω for all sites and branches versus a model where
kiwi is defined as the foreground branch and the other birds as background. We first considered genes
with a significantly higher ω on the kiwi branch than that in all other birds (LRT >3.84, significance
at 5 %, 1 degree of freedom). Functional enrichment
using GO [20] categories was tested using a hypergeometric test (Additional file 1: Note: Gene ontology
and rapidly evolving genes). The same test was performed on genes evolving significantly slower in kiwi.
To assign functional categories as either kiwi-specific, or shared with other ratites or nocturnal
birds, a similar procedure was performed for each species of Palaeognathae (ostrich, tinamou) and night
birds (chuck-will’s-widow, barn owl) by assigning each in turn as the foreground branch in CODEML.

After multiple testing correction using family-wise error rate none of the categories remained
significant. For further analysis we considered only GO categories that had (1) a P value <0.05; (2) at
least three significantly changed genes; and (3) the number of significant genes was at least 5 % of
the total genes annotated in the GO category. GO categories that were over-represented (P value <0.05)
on the kiwi branch, but not present in any of the other considered species, were identified as
potentially kiwi-specific changes (Additional file 1: Note: Gene ontology and rapidly evolving genes).
Notably, faster-evolving categories present in kiwi, but absent in any of the other species, are
related to mitochondrion, feeding behavior and energy reserve metabolic process, visual perception, and
eye photoreceptor cell differentiation (Additional file 1: Table S8A). Sensory perception of light
stimulus is a faster evolving category shared, surprisingly, with the ostrich (Additional file 1: Table
S8B). Among slower evolving categories, the mitochondrial outer membrane was one of the kiwi-specific
categories (Additional file 1: Table S9A), while anion channel activity was a shared category with
chuck-will’s-widow (Additional file 1: Table S9B). For the potentially biologically meaningful
categories which could explain kiwi-specific physiology we extracted the genes clustering in the node.
GO categories have a high potential to deliver false-positive enrichment, which could be considered
biologically meaningful a posteriori [25]. Therefore, future studies need to verify the adaptive
functionality of genes belonging to the respective category (Additional file 1: Tables S8C and S9C).

It has been proposed that, in a nocturnal environment, genes involved in circadian rhythm have been
under selective pressure [4]. Our species-specific selection screens did not identify circadian
rhythm-related categories to be enriched for changed genes in either kiwi or the other nocturnal birds.
However, since mutations in even a single gene may be relevant, we analyzed more closely biorhythm
regulators from the neuropsin gene family. Encephalopsin (OPN3), melanopsin (OPN4-1), and neuropsin
(OPN5) showed a similar ω in kiwi and the other branches and no obvious alterations could be detected
in the sequence (Table 2). Similar to chicken [26], kiwi and the other tested birds have a duplication
of the melanopsin gene (OPN4-2), which displayed significant signals of

Table 2 Annotated opsins in the Apteryx mantelli genome

AptMant0 annotation ID | External gene ID | Description | ω background | ω Apt. mantelli | LRT
augustus_masked-scaffold541-abinit-gene-7.0-mRNA-1 | RHO | No obvious alteration | 0.044 | 0.14913 | 6.128*
augustus_masked-scaffold1311-abinit-gene-0.1-mRNA-1 | OPN1LW | Partial sequence TM7 | 0.15601 | 0.59702 | 1.503
maker-scaffold728-augustus-gene-1.2-mRNA-1 | OPN1MW | Deleterious mutation Glu3.49Lys | 0.02093 | 0.26785 | 44.951*
augustus_masked-scaffold1068-abinit-gene-0.2-mRNA-1 | OPN1SW† | Partial sequence, deleterious mutation Glu6.30Gly | 0.03815 | 0.19244 | 5.162*
augustus_masked-scaffold9587-abinit-gene-0.0-mRNA-1 | SWS2†† | Partial sequence | 0.02045 | 0.0001 | 0.514
maker-scaffold19-augustus-gene-28.1-mRNA-1 | OPN3 | No obvious alteration | 0.10965 | 0.54221 | 3.211
augustus_masked-scaffold39-abinit-gene-55.0-mRNA-1 | OPN4-1 | No obvious alteration | 0.14205 | 0.23127 | 2.733
augustus_masked-scaffold122-abinit-gene-6.0-mRNA-1 | OPN4-2 | No obvious alteration | 0.18597 | 2.57434 | 8.194*
maker-scaffold597-augustus-gene-1.2-mRNA-1 | OPN5 | No obvious alteration | 0.07114 | 0.0001 | 1.733
augustus_masked-scaffold1987-abinit-gene-3.0-mRNA-1 | opsin-VA-like | No obvious alteration | 0.31735 | 0.26196 | 0.035
LRT = likelihood ratio test with one degree of freedom, between the null model (model = 0) and a model where the kiwi branch differs from other birds
(chicken, turkey, zebra finch, chuck-will’s-widow, barn owl, tinamou, and ostrich; model = 2), implemented in CODEML from the PAML package [24]. Extended
selection analyses, in which nocturnal birds, ostrich, and tinamou are sequentially appointed as the foreground branch, are presented in Additional file 1: Table S10.
*P value <0.05
†Tested on orthologs in Tinamus guttatus, Antrostomus carolinensis, Taeniopygia guttata, Gallus gallus, and Apteryx mantelli (not present in Struthio camelus and
Tyto alba assemblies)
††Tested on orthologs in Chlamydera nuchalis, Chlamydera maculata, Sericulus chrysocephalus, Ptilonorhynchus violaceus, Scenopoeetes dentirostris, Ailuroedus
crassirostris, Falco cherrug, Columba livia, and Apteryx mantelli
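
The LRT column of Table 2 can be converted into P values under the stated chi-square approximation with one degree of freedom. The following is a minimal sketch, not the CODEML workflow itself: the function name `lrt_p_value` and the dictionary `table2_lrt` are illustrative, and the statistics are copied from Table 2. For one degree of freedom the chi-square survival function reduces to erfc(sqrt(x/2)), so only the standard library is needed.

```python
import math


def lrt_p_value(lrt_statistic: float) -> float:
    """P value of a likelihood ratio statistic under chi-square with 1 df.

    For 1 degree of freedom, P(X > x) = erfc(sqrt(x / 2)).
    """
    if lrt_statistic < 0:
        raise ValueError("LRT statistic must be non-negative")
    return math.erfc(math.sqrt(lrt_statistic / 2.0))


# LRT statistics as listed in Table 2 (illustrative container, not from the paper).
table2_lrt = {
    "RHO": 6.128,
    "OPN1LW": 1.503,
    "OPN1MW": 44.951,
    "OPN1SW": 5.162,
    "SWS2": 0.514,
    "OPN3": 3.211,
    "OPN4-1": 2.733,
    "OPN4-2": 8.194,
    "OPN5": 1.733,
    "opsin-VA-like": 0.035,
}

# Genes whose kiwi-branch model fits significantly better at the 5 % level.
significant = sorted(
    gene for gene, stat in table2_lrt.items() if lrt_p_value(stat) < 0.05
)

if __name__ == "__main__":
    for gene, stat in table2_lrt.items():
        print(f"{gene}: LRT = {stat}, P = {lrt_p_value(stat):.4g}")
    print("Significant at 5 %:", significant)
```

With these inputs the sketch flags the same four genes that carry an asterisk in Table 2, and a statistic of 3.84 maps to a P value of about 0.05, which is the threshold quoted in the text.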

positive selection in kiwi but not in the other nocturnal birds. However, a branch-site selection
analysis of this gene did not show any significant positively selected sites (Additional file 1: Note:
Vision analysis).

Kiwi sensory adaptations – vision
Nocturnality is accompanied by a number of specific changes, including adaptations in visual processing
[4]. In contrast to most nocturnal animals, which have large eyes relative to their body size, kiwi have
small eyes and reduced optic lobes in the brain [27]. However, the kiwi retina has a higher proportion
of rods than cones which is consistent with adaptation to nocturnality [3]. Besides black/white vision
mediated via rhodopsin (RHO), most birds have trichromatic or tetrachromatic vision, for which various
additional opsins are responsible: OPN1LW (red), OPN1MW (green, RH2), OPN1SW (blue, subtypes SWS1, SWS2)
[28]. We identified these genes in the kiwi assembly. The RHO gene in kiwi shows no interruption and no
obvious function-impairing amino acid changes compared to other vertebrates. We were able to assemble
only a partial sequence of the red opsin OPN1LW (transmembrane (TM) helix 7) and found no previously
described deleterious amino acid changes within this region [29].

In the green opsin, OPN1MW, we identified a Glu134 to Lys substitution (relative position 3.49 in the
Ballesteros and Weinstein nomenclature) in the highly conserved D/ERY motif of this rhodopsin-like GPCR.
We confirmed this mutation in a second Apteryx mantelli individual, as well as in other kiwi species
(Fig. 2). To determine whether the change is kiwi-specific we sequenced this domain of OPN1MW in other
ratites, including the extinct moa. We found that Glu3.49 is 100 % conserved in all birds for which
sequence was available and also in over 250 other vertebrate orthologs. Previous experimental analysis
showed that mutation of Glu3.49 to Arg – another basic amino acid – results in a non-functional receptor
protein [30]. Furthermore, the Asp or Glu in the D/ERY motif is also highly conserved in most other
rhodopsin-like GPCRs and the identical mutation of Glu3.49 to Lys in the thromboxane A2 receptor, for
example, prevents the receptor from being functionally expressed on the plasma membrane [31].

Similarly, at the N-terminal end of TM6 in OPN1SW we identified a highly conserved Glu6.30 which is
present in all bird orthologs sequenced so far, except for kiwi OPN1SW where Glu6.30 is substituted by
Gly. Previous functional characterization has shown that mutation of Glu6.30 destabilizes the H-bond
network resulting in constitutively active opsins and other rhodopsin-like GPCRs [32, 33]. A
constitutively active opsin is functionally incapable of light signal transmission [34] and is
therefore non-functional.

Besides these two functionally well-characterized positions, we identified several other amino acid
substitutions in kiwi OPN1MW and OPN1SW. Further, tests for branch and branch-site specific ω values for
OPN1MW and OPN1SW on the kiwi branch showed no evidence for positively selected sites in kiwi
(Additional file 1: Note: Vision analysis), suggesting that the greater ω values for kiwi are likely
due to loss of constraint on these genes. Hence these genes are likely to be drifting and, considering
the fact that only 8 % of all inactivating mutations in GPCRs are stop codons while almost 65 % are
missense mutations [35–37], the described loss-of-function mutations in OPN1MW and OPN1SW render color
vision of kiwi, unlike for other sequenced ratites (Fig. 2), absent – at least for the green and blue
spectral ranges.

We tentatively dated the opsin-loss-of-function event as an indicator of the timing of adaptation to
the nocturnal niche. Assuming that the loss of constraint happened on the kiwi branch in a short period
of time and changed the rate of selection, measured by the ω value, from the average over bird lineages
(0.021 for OPN1MW and 0.014 for OPN1SW, Table 2) to the neutral ω value of 1, the loss of function was
dated to 30–38 million years ago (Additional file 1: Note: Vision analysis), which places the event
shortly after the arrival of kiwi in New Zealand [38].

Kiwi sensory adaptations – olfaction
Kiwi are unique among birds in having nostrils present at the end of their prominent beaks and have
been reported to depend largely on tactile and olfactory senses for foraging [39]. To investigate
whether the genome shows signs of olfactory adaptation in kiwi we assessed the numbers of olfactory
receptor (OR) genes [40] and the diversity in the OR sequence [41].

The only previous approach to molecular characterization of the olfactory system in kiwi was based on
PCR amplification of ORs with degenerate primers [42]. This allowed only a rough estimation of the
number of ORs of 478 genes (95 % confidence interval 156–1,708 genes). PCR with degenerate primers only
produces incomplete fragments of the genes and hence the accurate quantification of gene families with
highly similar sequences, as in the case of ORs, is prone to over-estimation [43]. In contrast, de novo
genome assembly facilitates a global assessment of the gene repertoire [44] and can therefore be used
to provide a more accurate estimate of the OR repertoire. We thus annotated the OR genes in kiwi, as
part of the entire membrane proteome, on the basis of putative functionality and seven transmembrane
helices (7TM) (Additional file 1: Note: Olfactory receptor genes identification and annotation). The
number of non-OR receptor

Fig. 2 Protein sequence comparison revealed substitutions of Glu3.49 to Lys (E/DRY motif) and Glu6.30 to Gly in kiwi OPN1MW (RH2) and
kiwi OPN1SW, respectively. Both residues are 100 % conserved in all birds sequenced so far and over 100 publicly available sequences of
other vertebrate OPN1MW and OPN1SW orthologs. To ensure that the OPN1MW change is kiwi-specific, additional ratites were sequenced,
including different kiwi species and the extinct moa. Glu3.49 of the E/DRY motif and Glu6.30 at the N-terminal end of helix 6 are parts of
an ‘ionic lock’ interhelical hydrogen-bond network which is highly conserved in many rhodopsin-like GPCRs. Nb – North Island brown
kiwi, Ob – Okarito brown kiwi, Gs – Great spotted kiwi, Ec – Emeus crassus (Eastern moa), Pg – Pachyornis geranoides (Mappin’s moa),
Chuck-will – Chuck-will’s-widow

families was comparable to other avian species, suggesting that the membrane proteome is well annotated
in kiwi (Additional file 1: Table S7). This analysis revealed an initial set of 82 OR genes in the kiwi
genome. However, ORs are highly duplicated across the genome and such regions could be prone to being
overcollapsed during the assembly process. We therefore estimated the copy number of each annotated OR
using a correction based on coverage. To obtain the correction factor for each OR, read-coverage in the
OR region was divided by the genome-wide average coverage corresponding to its GC bin. Following this
correction we estimated that up to 141 OR genes are present in the kiwi genome, of which 86 encode
full-length receptors while the rest are most likely pseudogenes due to frameshifts, premature stop
codons, or truncations (Additional file 1: Note: Olfactory receptor genes identification and
annotation). The estimated proportion of intact ORs among all OR genes in kiwi (61 %) is lower than
previously reported for Apteryx australis [42] (78.6 %), but much higher than in zebra finch (38 %)
[45].

Comparative analysis of the OR repertoire shows that the kiwi genome has both the α and the γ subgroups
of type 1 OR genes, as reported for other bird genomes
sequenced so far [45]. Unlike the majority of other birds analyzed so far, kiwi has a higher number of
γ subgroup ORs. Gene family size estimates are highly dependent on genome quality [46] and continuous
curation is ongoing even for well-annotated genomes: for example, in the chicken olfactory repertoire
the number of annotated ORs changed by a factor of eight in two consecutive Ensembl releases (release
73 – 251 ORs and release 74 – 30 ORs). Further improvements in genome quality, including for kiwi, are
therefore required for the identification of a complete set of ORs. Thus, a correlation between
olfactory acuity and the number of ORs in different birds could be subject to error.

Phylogenetic comparison of OR repertoires suggests that γ ORs within bird and reptile genomes exhibit
contrasting evolutionary rates. Tree topology suggests that γ ORs in a few birds and reptiles show a
species-specific clustering pattern (Fig. 3). This pattern was previously described in birds and it was
suggested that these receptors have undergone adaptive evolution with respect to the occupied
environmental niche [45]. However, a few γ ORs belonging to kiwi cluster with their reptilian
counterparts, while some cluster basal to the clade containing most bird γ ORs (Fig. 3).

Phenotypic diversity in olfaction is, in part, attributable to genetic variation with a wider range of
odors thought

Fig. 3 Maximum likelihood (ML) tree constructed using full-length intact α and γ group olfactory receptors from 10 birds (chicken, zebra
finch, flycatcher, duck, turkey, chuck-will’s-widow, barn owl, ostrich, tinamou, and kiwi) and two reptile genomes (anole lizard and Chinese soft-shell
turtle). The ML topology shown above was cross-verified using the neighbor joining (NJ) method. Three Class A (Rhodopsin) family GPCRs from
chicken genome, dopamine receptor D1 (DRD1), dopamine receptor D2 (DRD2), and histamine receptor H1 (HRH1) were used as the out-group
(shown as non-olfactory receptors). The red dot indicates confidence estimates (% bootstrap from 500 resamplings, >90 % bootstrap support from
both ML and NJ methods) for the nodes that distinguish α and γ ORs. The scale bar represents the number of amino-acid substitutions per site. The
topology supports lineage specific expansions of γ group olfactory genes in the bird and the reptile species. Note, a few of the γ group ORs
in kiwi cluster with reptilian ORs (highlighted by orange arrowhead), while some cluster basal to the clade containing bird ORs (highlighted
by green arrowhead). The topology supports contrasting evolutionary rates within the analyzed γ ORs, as indicated by short (blue arc with
arrowheads) and long branch lengths (pale orange arc with arrowheads). The inset shows the number of intact olfactory receptors in each
species that are analyzed using the ML tree topology
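
The per-position Shannon entropy (H) used in this section to compare γ-c clade OR diversity between species can be sketched as follows. This is an illustrative sketch rather than the pipeline described in Additional file 1: the function names and the toy alignment are hypothetical, and it assumes an ungapped protein alignment with base-2 logarithms, which gives the stated maximum of log2(20) ≈ 4.322 when all 20 residues are equally frequent.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def column_entropy(column):
    """Shannon entropy (base 2) of the residues in one alignment column."""
    counts = Counter(column)
    total = len(column)
    # H = -sum(p * log2(p)) over the residues observed in the column.
    return sum(-(n / total) * math.log2(n / total) for n in counts.values())


def mean_alignment_entropy(aligned_sequences):
    """Average per-column entropy over an ungapped multiple alignment."""
    if not aligned_sequences:
        raise ValueError("alignment is empty")
    length = len(aligned_sequences[0])
    if length == 0 or any(len(seq) != length for seq in aligned_sequences):
        raise ValueError("sequences must be non-empty and of equal length")
    entropies = [column_entropy(col) for col in zip(*aligned_sequences)]
    return sum(entropies) / len(entropies)


# Hypothetical toy alignment: three columns are fully conserved, one varies.
toy_alignment = ["ACDA", "ACEA", "ACFA"]

if __name__ == "__main__":
    print("conserved column:", column_entropy("AAAA"))
    print("maximum (20 residues):", round(column_entropy(AMINO_ACIDS), 3))
    print("toy alignment mean H:", round(mean_alignment_entropy(toy_alignment), 3))
```

A fully conserved column gives H = 0 and a column with all 20 residues equally represented gives the maximum, so averaging over columns yields one diversity value per species that can then be compared across species.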

to be detectable given more genetic variation [41]. Since the absolute number of ORs might be a poor
predictor of olfactory abilities, we investigated the variation in the γ ORs sequence as a measure of
the range of possible detectable odors. The average protein sequence entropy was calculated to check
for variation within the γ-c clade in each species (Additional file 1: Note: γ-c clade OR
within-species protein sequence entropy).

Previous studies have shown that Shannon entropy (H) analysis is a sensitive tool for estimating the
diversity of a system [47, 48]. For protein sequence, H ranges from 0 (only one residue is present at
that position in the multiple sequence alignment) to 4.322 (all 20 residues are equally represented in
that position). Typically H ≤2 is attributed to high conservation [49]. H values in birds were in the
range of 0.34±0.05 (zebra finch) to 1.11±0.12 (chicken). The average entropy in kiwi sequences was
1.23±0.15, significantly higher than all other bird species investigated (P value = 0.003 Wilcoxon
Signed-Rank test, Additional file 1: Note: γ-c clade OR within-species protein sequence entropy). We
conclude that overall the γ-c clade of ORs are highly similar in sequence, in accordance with
previously published data [45]. However, since detection of a wider range of odors is correlated to
genetic variation of ORs [41], the significantly higher H in kiwi ORs suggests a broad odor acuity in
this species in comparison to other birds.

Kiwi morphology
The most prominent phenotype of kiwi, lack of wings, has been linked to energy conservation [50] and to
the limited resources in New Zealand in the late Oligocene [51]. Like most ratites, kiwi are
flightless, but the phylogenetic tree of Palaeognathae implies that this phenotype evolved several
times independently in this order [38]. Unlike ostriches and rheas, which possess prominent wings, kiwi
show only vestigial invisible wings, while moa lack even vestiges [52].

To determine whether we can identify the genetic basis for the extremely regressed wings in kiwi we
annotated genes in the highly conserved signaling pathways related to limb development (Additional file
1: Note: Kiwi morphology analysis; Additional file 1: Figure S3). These include genes belonging to the
FGFs, TBX cluster, HOX cluster (Additional file 1: Figure S4; Additional file 1: Table S11), WNT, SALL,
and FIBIN genes, known to be responsible for limb and wing development [53] (Additional file 1: Table
S12). Growth and transcription factors typically influence the development of both upper and lower
limbs, while FIBIN is currently the only gene described to be exclusively involved in the development
of the upper limb [53].

For these clusters of genes, we aligned corresponding orthologs and translated multiple alignments,
which were then manually inspected. No insertions, deletions, and/or stop codons that would clearly
disrupt the open reading frame could be identified in the inspected genes. Additionally, we found all
39 HOX genes expected for the Sauropsid ancestor [54] and investigation of regulatory sequences within
the HOX clusters by phylogenetic footprinting showed no preferential loss of conserved DNA elements in
Apteryx mantelli compared to Galliformes (Additional file 1: Figure S4; Additional file 1: Table S11).

To detect signs of different evolution in kiwi wing and tail developmental genes we performed a
selective constraint analysis using the CODEML branch test (Additional file 1: Note: Selection analysis
on limb development genes; Additional file 1: Table S12). Of these genes FIBIN was the only gene that
showed signals of positive selection on the avian tree including chicken, turkey, and zebra finch
(Additional file 1: Figure S5). Three sites with signs of positive selection that were 100 % conserved
in the other species show a different amino acid in kiwi: exchanges of Ser136Ala, Gln148Arg, and
Phe162Cys (positions are relative to the mouse Fibin coding sequence). The functional relevance of
these substitutions is unclear and needs to be studied when experimental tests of FIBIN function become
available.

Since no obvious alterations could be found in the coding sequences of genes involved in developmental
processes, which could explain the regressed-wing morphology of kiwi, we further analyzed
ultra-conserved non-coding elements (UCNEs) (Additional file 1: Note: Ultra-conserved non-coding
elements analysis). UCNEs are defined as DNA non-coding regions of ≥95 % sequence identity between
human and chicken, longer than 200 bp [55]. The majority of UCNEs cluster in genomic regions containing
genes coding for transcription factors and developmental regulators [56] and experimental studies in
transgenic animals have shown that some of these sequences can act as tissue-specific enhancers during
developmental processes [57]. Of the 4,351 UCNEs annotated in UCNEbase [55], 19 showed more than the
expected 5 % sequence variation as defined in the database [55] (Additional file 1: Table S13). Among
these, four were related to HOXA, TBX2, Sp8, and TFAP2A genes which have been previously described in
limb development pathways [53, 58, 59], suggesting that changes in non-coding elements could be
involved in kiwi’s loss of wings.

Discussion
With their small body size, extremely large egg size, nocturnal life style, and prominent nostrils at
the end of their beaks, among several other traits, kiwi represent probably the most unusual member of
the ratites [60]. A
recent mitochondrial DNA phylogeny placed kiwi as the closest relatives of the extinct Madagascan
elephant birds [38]. Whether dispersal or vicariance best describes ratite distribution has been
debated for over a century [61]. A phylogeny including 169 bird species, built on 32 kb from 19
independent loci, showed ostrich as basal in the Palaeognathae clade [62]. In contrast, our phylogeny,
based on 623 1:1 orthologs in 16 species, totaling approximately 700 kb, places the tinamou as basal to
Palaeognathae with 100 % bootstrap confidence (Fig. 1; Additional file 1: Figure S6). However, when the
phylogeny was constructed for 10 bird species using just UCNEs (totaling >1 Mb) the topology of the
tree matches that obtained from fewer loci from a larger number of species which agrees with a previous
publication [62] (Additional file 1: Figure S7). Including more ratites and a larger number of
(hand-curated) loci should provide better resolution of the tree topology, and indeed the topology we
obtain here is well-supported. However, we note that the topology changes depending on the gene sets
that are included (Additional file 1: Figs. S6 and S7) and that when using ultra-conserved sequences
the phylogeny differs from that obtained from a larger, more representative set of genes. Hence, future
availability of additional genomes and ortholog sets from multiple ratites will allow a better
understanding of their origin.

Nevertheless, a previous study has estimated that kiwi diverged from the Madagascan elephant birds
about 50 million years ago [38] (Additional file 1: Figure S8). This estimate post-dates the split of
Madagascar and New Zealand from Gondwana, which took place around 100 and 80 million years ago,
respectively, and implies that ratites must have dispersed by flight and also that kiwi arrived in New
Zealand less than 50 million years ago. This conclusion is supported by the fossil record in New
Zealand, which includes a flighted kiwi ancestor [63]. At the time kiwi arrived, moa already inhabited
New Zealand and it has been hypothesized that moa were monopolizing the diurnal ground niche, which
forced kiwi to adapt to an alternative nocturnal lifestyle [38]. This would suggest that kiwi adapted
to the nocturnal niche soon after arriving on the island. The loss of function that we observe in
OPN1SW is indicative of adaptation to nocturnality [64]. We dated the loss of function in several color
vision opsins to 30–38 million years ago, which is consistent with the arrival of the kiwi in New
Zealand less than 50 million years ago, and their subsequent adaptation to a nocturnal niche.

In contrast to birds, which almost certainly have a diurnal origin, the nocturnal bottleneck hypothesis
suggests that mammals were nocturnal for about 160 million years in their evolution as they were
restricted to nighttime activity to avoid dinosaurs which were the dominant diurnal taxon at this time
[4]. According to this hypothesis, several traits typical for mammals, including a well-developed sense
of smell, limited color vision, increased eye size, and an energetic metabolism optimized for sun
radiation-independent body temperature regulation, have been shaped by the nocturnal environment [65,
66]. Nocturnally adapted Mesozoic mammals also tended to have a small body size, an insectivorous diet,
and low energy metabolism [67]. Interestingly, kiwi has the smallest body size among flightless
ratites, the lowest metabolic rate among birds [68, 69], and an insectivorous diet, suggesting a
pattern of evolution that is similar to the evolution of mammals under nocturnality. Consistent with
this hypothesis, our genome-wide scans for patterns of positive selection showed enrichment in GO
categories like mitochondrion functions and energy reserve metabolic process (Additional file 1: Table
S8A), both related to metabolic rate. Moreover, we found strong evidence for a loss of color vision in
kiwi and their retinal structure also clearly supports adaptation to vision under low light levels [3].
Although the small eye size of kiwi [27] is unusual for a nocturnal species, based on the retinal
anatomy Corfield et al. rejected a regressive evolution model for kiwi vision and suggested that kiwi
have an acuity in detecting low light levels similar to other nocturnal species [3]. This suggests that
molecular mutations and retinal structure changed faster than eye size. In birds, eye size was
described to scale to body mass with an exponent similar to brain mass and metabolic rate [70]. Thus,
the low metabolic rate of kiwi [68] could be the constraint for their relatively small eyes.
Alternatively, kiwi might serve as an example that adaptations in the retinal structure could be
sufficient, and changes in eye size are not absolutely necessary. This conclusion may be supported by
the absence of variation in eye shape according to activity pattern observed in lizards and non-primate
mammals [71].

It has long been hypothesized that, unlike most bird species, kiwi is more similar to mammals in its
reliance on olfactory and mechanical cues for foraging, perceived by the nostrils and mechanoreceptors
located at the end of its bill [72]. We found that the kiwi, unlike other ratites, has an increased
diversity in the bird-specific γ-c clade ORs. Since OR diversity is hypothesized to correlate
positively with olfactory acuity in vertebrates [42, 73], the significantly higher diversity in kiwi
ORs compared to other birds (Additional file 1: Figure S9) suggests that kiwi may be able to
distinguish a larger range of odors than other birds.

Steiger et al. formulated two possible scenarios that could explain γ ORs evolution in birds: the first
hypothesizes that species-specific γ ORs arose from independent
expansion events in each species, while the second as- sequencing; Additional file 1: Table S1). Paired-end
sumes that the ancient γ OR clade was more diverse and sequencing was performed on HiScanSQ and HiSeq
became homogenized by concerted evolution within spe- platforms with read lengths of 101 bp and 96 bp,
cies [45]. Some γ ORs of kiwi, ostrich, tinamou, and respectively.
nocturnal birds clustered with their reptilian counterparts, while others clustered basal to the clade containing most bird γ ORs (Fig. 3). This supports a two-fold conclusion: (1) γ ORs in kiwi are more diverse in sequence than in other birds investigated, which was verified by the significantly higher sequence entropy; and (2) since kiwi is basal to the Neognathae (Fig. 1), the ancestral state of the γ OR clade is probably diversified compared to other modern birds.

Conclusions
Since its arrival in New Zealand sometime after 50 million years ago, the kiwi adapted to a nocturnal, ground-dwelling niche. The onset of adaptation to nocturnality appears to have been approximately 30–38 million years ago, about one-fifth of the time proposed for the evolution of mammals in a nocturnal environment. The molecular changes present in the kiwi genome are in accordance with the adaptations that are hypothesized to have occurred during early mammalian adaptation to nocturnality. This suggests similar patterns of adaptation to the nocturnal niche both in kiwi and mammals. Further comparative analyses, including other diurnal Palaeognathae, as well as additional nocturnal bird groups and their diurnal sister species, should shed further light on the genomic imprints of adaptation to a nocturnal life style.

Methods and materials
Genome sequence assembly and annotation
We sequenced Apteryx mantelli female individuals, which originate from the far North (kiwi code 73) and central part – Lake Waikaremoana (kiwi code AT5 and kiwi code 16–12) of North Island (Additional file 1: Figure S10). They were sampled in 1986 (kiwi code 73) and 1997 (kiwi code AT5 and 16–12) in ‘operation nest egg’ carried out by Rainbow and Fairy Springs, Rotorua. No animals were killed or captured as a result of this study and genome assembly was performed with iwi approval from the Te Parawhau and Waikaremoana Māori Elders Trust.
We extracted genomic DNA from Apteryx mantelli embryos. Libraries with insert sizes of 240 bp, 420 bp, 800 bp, 2 kb, 3 kb, and 4 kb were obtained from individual kiwi code 73, and mate-paired-end libraries 7 kb, 9 kb, 11 kb, and 13 kb, from individual kiwi code 16–12. DNA from individual AT5 was used to build a 350 bp insert-size library with the purpose of confirming kiwi-specific sequence polymorphisms and was not included in the genome assembly (Additional file 1: Note: Sampling, DNA library preparation and sequencing). Sequencing errors were corrected using Quake [5] (Additional file 1: Note: Filtering and read correction; Additional file 1: Figure S1). A total of 52.53 Gb of high-quality sequence was used for de novo assembly with SOAPdenovo [6]. The short-insert-size libraries (240 bp, 420 bp, 800 bp) were used to build contigs. Based on paired-end information, scaffolds were generated using all libraries (2 kb, 3 kb, 4 kb, 7 kb, 9 kb, 11 kb, 13 kb). Remaining gaps in the scaffolds were closed using the paired-end information (Additional file 1: Note: Genome assembly). This final assembly (AptMant0) was used for all subsequent analyses.
Gene annotation was performed with the MAKER pipeline [10], using several sources of evidence: de novo gene predictions, RNA-Seq data, and protein evidence from three species (G. gallus, T. guttata, and M. gallopavo) (Ensembl version 72). Briefly, after repeat masking, gene models were predicted by Augustus version 2.7 [74] using the training dataset for chicken. Apteryx mantelli RNA-Seq data were then aligned to AptMant0 using NCBI BLASTN version 2.2.27+ [75] and BLASTX was used to align protein sequences to identify regions of homology. Finally, using both the ab initio and evidence-informed gene predictions, MAKER updated features such as 5’ and 3’ UTRs based on RNA-Seq evidence and a consensus gene set was retrieved (Additional file 1: Note: De novo gene prediction and gene annotation).

Comparative genome analysis
Triplet orthologs between chicken, zebra finch, and turkey were downloaded from Ensembl 73. Kiwi genes were considered orthologs to a triplet if the ortholog assignment from MAKER agreed with the orthologous gene assigned in each of the three considered species. The ostrich, tinamou, chuck-will’s-widow, and barn owl orthologs were assigned by orthology to the chicken proteins. After assigning orthology in the eight avian species, coding sequences were aligned and two different sets of alignments were compiled for further analysis:
Set 1: alignments of all eight species that do not contain a single frameshift indel.
Set 2: the longest uninterrupted run of at least 200 aligned bases in each multiple sequence alignment, for which we first ensured that gaps in the alignment were not introduced by unresolved bases in our assembly.
The CODEML program from the package PAML [24] was run first on four avian lineages: G. gallus, T. guttata, M. gallopavo, and A. mantelli to compare the kiwi genome to high-quality annotated ones. Six pairwise

Le Duc et al. Genome Biology (2015) 16:147 Page 11 of 15

combinations were run to obtain estimates of non-synonymous (Ka) and synonymous (Ks) changes in the four avian lineages. Ka and Ks distributions were compared pairwise between all four avian species on a set of 3,754 orthologous genes which presented no frameshifts or indels (Additional file 1: Figure S11).
We next scanned for differently evolving genes with the CODEML program under a branch model (model = 2, two ωs for foreground and background branches, respectively, vs. model = 0, one ω for all branches, compared via likelihood ratio test) [24] using the set of orthologs as defined above in the eight bird species (Additional file 1: Note: Orthologs and Ka/Ks calculation).
Branch-specific ω values were used to identify GO categories that are evolving significantly differently in each of the following bird species: kiwi, ostrich, tinamou, barn owl, and chuck-will’s-widow. GO category enrichment was tested using the FUNC [76] package. A hypergeometric test was run for each species separately on genes having a significantly higher ω. Multiple testing correction was done using family-wise error rate. Categories with P value <0.05 were considered for further analysis if at least three significantly changed genes were present in the GO category, and the number of significant genes was greater than or equal to 5 % of the total genes annotated in the respective GO category. The same test was applied to genes with a significantly smaller ω in each of the species. Kiwi-specific categories were considered those which showed no enrichment in any of the other ratites or night birds (Additional file 1: Note: Gene Ontology and rapidly evolving genes).
We used the TreeFam methodology to define gene families [12] across 16 genomes: Gallus gallus, Anas platyrhynchos, Ficedula albicollis, Meleagris gallopavo, Taeniopygia guttata, Pelodiscus sinensis, Anolis carolinensis, Homo sapiens, Mus musculus, Gasterosteus aculeatus, Ornithorhynchus anatinus, downloaded from Ensembl 73 [14], Tinamus guttatus, Struthio camelus, Antrostomus carolinensis, Tyto alba, downloaded from GigaDB [13], and Apteryx mantelli. The longest transcript was chosen for further analysis. For the single-copy orthologous families, genes were aligned against each other. To build a consensus phylogenetic tree (Fig. 1) the resulting alignments were loaded in PAUP* [15] version 4.0d105 and trees were inferred using maximum likelihood, with default parameters. To measure the confidence for certain subtrees, a series of 100 bootstrap replicates were performed (Additional file 1: Note: Nuclear loci phylogeny).
We determined the branch-specific expansion and contraction of the orthologous protein families among the 16 species using CAFE (computational analysis of gene family evolution) version 3.0 [77] with lambda option of 0.0007 (Additional file 1: Note: Gene families evolution using CAFE). Pfam IDs corresponding to the TreeFam families were assigned to GO categories. We tested whether significant (P <0.05) contraction/expansion events cluster in different GO categories using ClueGO with a hypergeometric test [78] (Additional file 1: Figure S2).

Membrane proteome annotation
Complete protein sequence sets for the following bird and reptile species were downloaded from Ensembl 74 [14]: Taeniopygia guttata, Meleagris gallopavo, Ficedula albicollis, Anas platyrhynchos, Pelodiscus sinensis, Gallus gallus, and Anolis carolinensis. Homo sapiens from the same Ensembl version was used as outgroup. Protein sequences of ratites (Tinamus guttatus, Struthio camelus) and nocturnal birds (Antrostomus carolinensis, Tyto alba) were downloaded from GigaDB [13]; although these genomes are more fragmented than the ones from Ensembl, annotation of the membrane proteome in birds adapted, like kiwi, to the nocturnal niche and the ones belonging to the same clade as kiwi, makes it possible to differentiate between events that are clade-specific or shaped by nocturnality. Only the longest protein sequence for each gene was considered for analysis. Membrane proteins and signal peptides were predicted for all species with Phobius [79]. These proteins were classified based on a manually curated human membrane proteome dataset, which describes family relationship and molecular function. The predicted membrane proteins were aligned to the human membrane proteome dataset with the BLASTP program of the BLAST package using default settings (v. 2.2.27+) [75]. Each predicted membrane protein was classified according to its best human hit with an e-value <10^−6. Predicted membrane proteins with no hit were deemed unclassified, along with those proteins that hit an unclassified human protein (Additional file 1: Note: Detection and classification of the membrane proteome; Additional file 1: Table S7).

Vision evolutionary analysis
Opsins are G protein-coupled receptors known to play a role in light signal transduction and night-day cycle (Table 2). For these genes ω was estimated by appointing sequentially kiwi, ostrich, tinamou, chuck-will’s-widow, and barn owl as the foreground branch under the CODEML branch model (model = 2) [24] as described for comparative genome analysis. Inactivating mutations were verified by checking that they were present in reads from both sequenced individuals and in other kiwi species, by Sanger sequencing (OPN1MW) (Fig. 2; Additional file 1: Note: Vision analysis).
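
The branch-model comparisons used in this section (model = 2 versus model = 0, compared via likelihood ratio test) come down to a short calculation once CODEML has reported a log-likelihood for each model. The following is a minimal sketch of that calculation, not the authors' script: the function name `lrt_one_df` and the log-likelihood values in the usage lines are hypothetical, and it assumes the two models differ by exactly one free parameter (one extra ω), so that the test statistic is compared to a chi-square distribution with 1 degree of freedom.

```python
import math
from typing import Tuple


def lrt_one_df(lnl_null: float, lnl_alt: float) -> Tuple[float, float]:
    """Likelihood ratio test for two nested models differing by one parameter.

    lnl_null: log-likelihood of the simpler model (e.g. one ω for all branches).
    lnl_alt:  log-likelihood of the richer model (e.g. separate foreground ω).
    Returns the test statistic and its chi-square (1 df) P value.
    """
    # Test statistic: twice the log-likelihood difference, floored at zero
    # because numerical optimisation can yield tiny negative differences.
    stat = max(2.0 * (lnl_alt - lnl_null), 0.0)
    # For 1 degree of freedom the chi-square survival function has the
    # closed form P(X > x) = erfc(sqrt(x / 2)), so no extra library is needed.
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value


if __name__ == "__main__":
    # Hypothetical log-likelihoods, for illustration only.
    stat, p_value = lrt_one_df(lnl_null=-1000.0, lnl_alt=-998.0)
    print(f"2*delta_lnL = {stat:.3f}, P = {p_value:.4f}")
```

The closed form is used here only to keep the sketch free of dependencies; with more than one extra parameter a general chi-square routine would be needed instead.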


Olfaction evolutionary analysis
Olfactory receptors (ORs) in kiwi were annotated using both the Augustus de novo gene prediction and the MAKER information after scaffold positions were checked and redundant sequences were removed.
We then performed four steps (Additional file 1: Figure S12):

i. Functional ORs from chicken [45] were downloaded and aligned against the kiwi transcriptome using TblastN with default parameters. After collecting overall hits for each query (every chicken OR served as query), identical hits from each run were removed to obtain a non-redundant dataset.
ii. A Pfam search against the kiwi proteome with a default e-value cutoff of 1.0 was used to identify sequences that contained the 7tm_4 domain (olfactory domain).
iii. The 7tm_4 domain was searched against the kiwi proteome by a CDD search (conserved domain database search).
iv. Separate HMM profiles were built from conserved 7tm regions of functional ORs of chicken, turkey, and zebra finch obtained from previous studies [45]. Using the three HMM profiles, HMM searches were performed against the kiwi proteome and non-redundant hits were retrieved from combined results of all three searches.

A CD-HIT (Cluster Database at High Identity with Tolerance) run was performed to remove identical sequences with a cutoff of 100 %. Preliminary phylogenetic analysis was performed using a maximum likelihood approach (Additional file 1: Note: Olfactory receptor genes identification and annotation). Non-ORs were removed if they clustered separately from ORs. We excluded pseudogene candidates if at least one premature stop codon and/or frameshifts could be identified in the kiwi sequence.
OR repertoire estimates were curated based on genomic coverage calculated using samtools mpileup version 0.1.18 [80] on the alignment of the 240 bp, 420 bp, 800 bp insert-size libraries to AptMant0 (Additional file 1: Note: Olfactory receptor genes identification and annotation). The correction factor for each annotated OR was obtained by dividing the read coverage in that region by the genome-wide average coverage for the corresponding GC content. For example, if an OR sequence had a GC content of 50 %, we calculated the average genome-wide coverage corresponding to the GC bin of 50 % to be 35-fold (Additional file 1: Note: Genome coverage and estimation of genome size; Additional file 1: Figure S13). Given a coverage in the respective OR region of 105-fold, we obtained a correction factor of 3 after dividing the OR sequence coverage (that is, 105-fold) by the GC-bin corresponding coverage (that is, 35-fold). The final number of estimated ORs was obtained by multiplying the number of initially annotated genes by their corresponding correction factors.
Using the same annotation procedure, the OR gene repertoire was estimated in all bird and reptile genomes from Ensembl 74, two nocturnal birds (chuck-will’s-widow and barn owl) and two Palaeognathae (ostrich and tinamou) for comparative phylogenetic analysis with the kiwi OR dataset. All obtained OR genes were then aligned using MAFFT [81] v7, with BLOSUM62 as the scoring matrix and default settings of option E-INS-I. Phylogenetic analyses were run using both maximum likelihood (ML) and neighbor joining (NJ) methods (Additional file 1: Note: Comparative phylogenetic analysis on ORs from kiwi and other bird and reptile genomes). The reliability of the phylogenetic trees was evaluated with 500 bootstrap replicates.
We calculated Shannon entropy (H) using within-species multiple sequence alignments of γ ORs for all bird and reptile genomes separately with a built-in function from BioEdit [82] (Additional file 1: Note: γ-c clade OR within-species protein sequence entropy).

Kiwi morphology
Previously characterized wing development genes [53] were assigned orthologs in kiwi, chicken, zebra finch, and turkey (Additional file 1: Figure S3; Additional file 1: Table S12). We aligned the sequences and multiple alignments were translated and manually inspected for sequence differences as well as insertions/deletions and rearrangements. We examined selective pressures under the branch models implemented in CODEML [24]. The one-ratio model (model = 0, NSsites = 0) was used to estimate the same ω ratio for all branches in the phylogeny. Then, the two-ratio model (model = 2, NSsites = 0), with a background ω ratio and a different ω on the kiwi branch, was used to detect selective pressure acting specifically on the kiwi branch. These two models were compared via a LRT (1 degree of freedom), as mentioned above [83].
Scaffolds and isolated contigs harboring (putative) HOX genes were identified by BLAST and mapped to all 673 sauropsid HOX protein sequences from GenBank. Translated HOX sequences of Apteryx were aligned to the HOX proteins extracted from GenBank and differences were identified by manual inspection. Potential regulatory sequences in the HOX cluster region were identified by phylogenetic footprinting using tracker2 [84] (Additional file 1: Figure S4).
To retrieve the entire coding region of the FIBIN gene in kiwi, we designed primers based on the chicken and ostrich sequence (Additional file 1: Table S14). Using the 276-bp fragment amplified by Sanger sequencing, we blasted transcriptome sequences from kiwi and iteratively


assembled the entire coding sequence. Since FIBIN showed signs of positive selection in the preliminary analysis as described above, extended selection analysis was performed using 15 species: human, mouse, bat, whale, dolphin, turtle, lizard, python, flycatcher, chicken, zebra finch, frog, zebrafish, and pufferfish (Additional file 1: Note: Fibin identification and selection analysis; Additional file 1: Figure S5). The branch-site tests were used to detect signals of selective pressure on each branch (NSsites = 2, model = 2, compared to the same model but with omega fixed to 1, via LRT). Amino acid changes with signs of selection and specific for the kiwi were visualized in both sequenced individuals.
Chicken UCNE annotations were downloaded from the ultra-conserved non-coding element database UCNEbase [55]. Orthologous regions in Apteryx mantelli and Struthio camelus, Tinamus guttatus, Tyto alba, Antrostomus carolinensis genomes, downloaded from GigaDB [13], and birds from Ensembl 74 [14] Ficedula albicollis, Taeniopygia guttata, Anas platyrhynchos, and Meleagris gallopavo were established using Blast 2.2.25 [85] with ‘blastn’ and default parameters. Gallus gallus genome Ensembl 74 was used as control in the orthology assignment. Orthologous regions from each of the species were aligned [86] to the reference UCNE and the number of mismatches between the UCNE and the target genomes were determined (Additional file 1: Note: Ultra-conserved non-coding elements analysis).

Data availability
Assembly, raw DNA, and RNA sequencing reads have been deposited in the European Nucleotide Archive under the BioProject with accession number: PRJEB6383.
HOX Cluster annotation files were deposited at [87] and [88].
UCNEs multiple fasta files and analysis have been deposited at [89].
The kiwi FIBIN sequence was deposited in GenBank under BankIt 1821198 FIBIN KR364000.

Additional file
Additional file 1: Supplementary Material contains Supplementary Figs. S1–S15, Supplementary Tables S1–S17, Supplementary Note, and Supplementary References.

Abbreviations
bp: base pair; CDD: Conserved domain database; CD-HIT: Cluster database at high identity with tolerance; Gb: Giga base pairs; GO: Gene ontology; GPCR: G protein-coupled receptor; H: Shannon entropy; HMM: Hidden Markov model; kb: kilo base pairs; LRT: Likelihood ratio test; Mb: Mega base pairs; ML: Maximum likelihood; NJ: Neighbor joining; OR: Olfactory receptor; PCR: Polymerase chain reaction; TM: Transmembrane; UCNE: Ultra-conserved non-coding element.

Competing interests
The authors declare no competing financial interests.

Authors’ contributions
DLD, LH, and TS performed the experiments. DLD, GR, KP, MO, AK, MSA, HBS, SJP, PFS, and BDB analyzed the data. DLD, MH, JK, and TS designed the study and wrote the paper with contributions from all authors. DL provided biological samples. All authors read and approved the final manuscript.

Acknowledgments
This work was supported by grants of the Deutsche Forschungsgemeinschaft and intramural support (Medical Faculty, University of Leipzig), as well as the Australian Research Council, the Swedish Research Council, NSERC (postgraduate fellowship to GR), and the Max Planck Society. BDB was funded by grant no. 2011/12500-2, São Paulo Research Foundation (FAPESP). This research was endorsed by Māori Elders from the Te Parawhau Trust and from Waikaremoana iwi. We are very thankful for technical and methodical support provided by Knut Finstermeier, Anne Butthof, Knut Krohn, Michael Dannemann, Udo Stenzel, Mathias Stiller, and Rigo Schulz. We thank Andreas Reichenbach for helpful discussions on kiwi vision and Petra Korlević for the drawings in Fig. 1 and Additional file 1: Figure S10.

Author details
1 Institute of Biochemistry, Medical Faculty, University of Leipzig, Johannisallee 30, Leipzig 04103, Germany. 2 Department of Evolutionary Genetics, Max Planck Institute for Evolutionary Anthropology, Leipzig 04103, Germany. 3 Department of Neuroscience, Unit of Functional Pharmacology, Uppsala University, Box 593, Husargatan 3, Uppsala 751 24, Sweden. 4 Griffith School of Environment and School of Biomolecular and Physical Sciences, Griffith University, Nathan, Queensland 4111, Australia. 5 Department of Computer Science, and Interdisciplinary Center for Bioinformatics, University of Leipzig, Leipzig 04103, Germany. 6 Department of Genetics and Evolutionary Biology, University of São Paulo, São Paulo, SP 05508-090, Brazil. 7 Adaptive Evolutionary Genomics, Institute for Biochemistry and Biology, University Potsdam, Potsdam 14469, Germany.

Received: 13 February 2015 Accepted: 1 July 2015

References
1. Bunce M, Worthy TH, Phillips MJ, Holdaway RN, Willerslev E, Haile J, et al. The evolutionary history of the extinct ratite moa and New Zealand Neogene paleogeography. Proc Natl Acad Sci U S A. 2009;106:20646–51.
2. Martin GR. Sensory capacities and the nocturnal habit of owls (Strigiformes). IBIS. 1986;128:266–77.
3. Corfield JR, Parsons S, Harimoto Y, Acosta ML. Retinal anatomy of the New Zealand kiwi: structural traits consistent with their nocturnal behavior. Anat Rec (Hoboken). 2015;298:771–9.
4. Gerkema MP, Davies WI, Foster RG, Menaker M, Hut RA. The nocturnal bottleneck and the evolution of activity patterns in mammals. Proc Biol Sci. 2013;280:20130508.
5. Kelley DR, Schatz MC, Salzberg SL. Quake: quality-aware detection and correction of sequencing errors. Genome Biol. 2010;11:R116.
6. Luo R, Liu B, Xie Y, Li Z, Huang W, Yuan J, et al. SOAPdenovo2: an empirically improved memory-efficient short-read de novo assembler. GigaScience. 2012;1:18.
7. Zhang G, Li C, Li Q, Li B, Larkin DM, Lee C, et al. Comparative genomics reveals insights into avian genome evolution and adaptation. Science. 2014;346:1311–20.
8. International Chicken Genome Sequencing C. Sequence and comparative analysis of the chicken genome provide unique perspectives on vertebrate evolution. Nature. 2004;432:695–716.
9. Warren WC, Clayton DF, Ellegren H, Arnold AP, Hillier LW, Kunstner A, et al. The genome of a songbird. Nature. 2010;464:757–62.
10. Cantarel BL, Korf I, Robb SM, Parra G, Ross E, Moore B, et al. MAKER: an easy-to-use annotation pipeline designed for emerging model organism genomes. Genome Res. 2008;18:188–96.
11. Kondrashov FA. Gene duplication as a mechanism of genomic adaptation to a changing environment. Proc Biol Sci. 2012;279:5048–57.
12. Li H, Coghlan A, Ruan J, Coin LJ, Heriche JK, Osmotherly L, et al. TreeFam: a curated database of phylogenetic trees of animal gene families. Nucleic Acids Res. 2006;34:D572–80.


13. Sneddon TP, Zhe XS, Edmunds SC, Li P, Goodman L, Hunter CI. GigaDB: promoting data dissemination and reproducibility. Database (Oxford). 2014;2014:bau018.
14. Flicek P, Ahmed I, Amode MR, Barrell D, Beal K, Brent S, et al. Ensembl 2013. Nucleic Acids Res. 2013;41:D48–55.
15. Wilgenbusch JC, Swofford D. Inferring evolutionary trees with PAUP*. Curr Protoc Bioinformatics. 2003;Chapter 6:Unit 6.4.
16. Hughes AL, Friedman R. Genome size reduction in the chicken has involved massive loss of ancestral protein-coding genes. Mol Biol Evol. 2008;25:2681–8.
17. Zhan X, Pan S, Wang J, Dixon A, He J, Muller MG, et al. Peregrine and saker falcon genome sequences provide insights into evolution of a predatory lifestyle. Nat Genet. 2013;45:563–6.
18. Huang Y, Li Y, Burt DW, Chen H, Zhang Y, Qian W, et al. The duck genome and transcriptome provide insight into an avian influenza virus reservoir species. Nat Genet. 2013;45:776–83.
19. Denton JF, Lugo-Martinez J, Tucker AE, Schrider DR, Warren WC, Hahn MW. Extensive error in the number of genes inferred from draft genome assemblies. PLoS Comput Biol. 2014;10, e1003998.
20. Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, et al. Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat Genet. 2000;25:25–9.
21. Zakon HH, Jost MC, Lu Y. Expansion of voltage-dependent Na+ channel gene family in early tetrapods coincided with the emergence of terrestriality and increased brain complexity. Mol Biol Evol. 2011;28:1415–24.
22. Luxey M, Jungas T, Laussu J, Audouard C, Garces A, Davy A. Eph:ephrin-B1 forward signaling controls fasciculation of sensory and motor axons. Dev Biol. 2013;383:264–74.
23. Patel K, Nittenberg R, D’Souza D, Irving C, Burt D, Wilkinson DG, et al. Expression and regulation of Cek-8, a cell to cell signalling receptor in developing chick limb buds. Development. 1996;122:1147–55.
24. Yang Z. PAML 4: phylogenetic analysis by maximum likelihood. Mol Biol Evol. 2007;24:1586–91.
25. Pavlidis P, Jensen JD, Stephan W, Stamatakis A. A critical assessment of storytelling: gene ontology categories and the importance of validating genomic scans. Mol Biol Evol. 2012;29:3237–48.
26. Torii M, Kojima D, Okano T, Nakamura A, Terakita A, Shichida Y, et al. Two isoforms of chicken melanopsins show blue light sensitivity. FEBS Lett. 2007;581:5327–31.
27. Martin GR, Wilson KJ, Martin Wild J, Parsons S, Fabiana Kubke M, Corfield J. Kiwi forego vision in the guidance of their nocturnal activities. PLoS One. 2007;2, e198.
28. Osorio D, Vorobyev M. A review of the evolution of animal colour vision and visual communication signals. Vision research. 2008;48:2042–51.
29. Beukers MW, Kristiansen I, IJzerman AP, Edvardsen I. TinyGRAP database: a bioinformatics tool to mine G-protein-coupled receptor mutant data. Trends Pharmacol Sci. 1999;20:475–7.
30. Jansen JJ, Mulder WR, De Caluwe GL, Vlak JM, De Grip WJ. In vitro expression of bovine opsin using recombinant baculovirus: the role of glutamic acid (134) in opsin biosynthesis and glycosylation. Biochim Biophys Acta. 1991;1089:68–76.
31. Capra V, Veltri A, Foglia C, Crimaldi L, Habib A, Parenti M, et al. Mutational analysis of the highly conserved ERY motif of the thromboxane A2 receptor: alternative role in G protein-coupled receptor signaling. Mol Pharmacol. 2004;66:880–9.
32. Schulz A, Schoneberg T, Paschke R, Schultz G, Gudermann T. Role of the third intracellular loop for the activation of gonadotropin receptors. Mol Endocrinol. 1999;13:181–90.
33. Vogel R, Mahalingam M, Ludeke S, Huber T, Siebert F, Sakmar TP. Functional role of the “ionic lock”–an interhelical hydrogen-bond network in family A heptahelical receptors. J Mol Biol. 2008;380:648–55.
34. Ebrey T, Koutalos Y. Vertebrate photoreceptors. Prog Retin Eye Res. 2001;20:49–94.
35. Schoneberg T, Schulz A, Biebermann H, Hermsdorf T, Rompler H, Sangkuhl K. Mutant G-protein-coupled receptors as a cause of human diseases. Pharmacol Ther. 2004;104:173–206.
36. Tao YX. Inactivating mutations of G protein-coupled receptors and diseases: structure-function insights and therapeutic implications. Pharmacol Ther. 2006;111:949–73.
37. Vassart G, Costagliola S. G protein-coupled receptors: mutations and endocrine diseases. Nat Rev Endocrinol. 2011;7:362–72.
38. Mitchell KJ, Llamas B, Soubrier J, Rawlence NJ, Worthy TH, Wood J, et al. Ancient DNA reveals elephant birds and kiwi are sister taxa and clarifies ratite bird evolution. Science. 2014;344:898–900.
39. Corfield JR, Eisthen HL, Iwaniuk AN, Parsons S. Anatomical specializations for enhanced olfactory sensitivity in kiwi, Apteryx mantelli. Brain Behav Evol. 2014;84:214–26.
40. Niimura Y, Nei M. Extensive gains and losses of olfactory receptor genes in mammalian evolution. PLoS One. 2007;2, e708.
41. Hasin-Brumshtein Y, Lancet D, Olender T. Human olfaction: from genomic variation to phenotypic diversity. Trends Genet. 2009;25:178–84.
42. Steiger SS, Fidler AE, Kempenaers B. Evidence for increased olfactory receptor gene repertoire size in two nocturnal bird species with well-developed olfactory ability. BMC Evol Biol. 2009;9:117.
43. Preston GM. Cloning gene family members using PCR with degenerate oligonucleotide primers. In: White BA (ed.) PCR cloning protocols: from molecular cloning to genetic engineering; In series: Methods in molecular biology (Clifton, N.J.) 67; Humana Press: 1997 pg 433-49. ISBN 0896034436
44. Liu S, Wei W, Chu Y, Zhang L, Shen J, An C. De novo transcriptome analysis of wing development-related signaling pathways in Locusta migratoria manilensis and Ostrinia furnacalis (Guenee). PLoS One. 2014;9, e106770.
45. Steiger SS, Kuryshev VY, Stensmyr MC, Kempenaers B, Mueller JC. A comparison of reptilian and avian olfactory receptor gene repertoires: species-specific expansion of group gamma genes in birds. BMC Genomics. 2009;10:446.
46. Morrison SS, Pyzh R, Jeon MS, Amaro C, Roig FJ, Baker-Austin C, et al. Impact of analytic provenance in genome analysis. BMC Genomics. 2014;15:S1.
47. Margulies DH, Natarajan K, Rossjohn J, McCluskey J. Fundamental Immunology. 7th ed. Philadelphia, PA: Wolters Kluwer Health/Lippincott Williams & Wilkins; 2012. p. 511.
48. Shannon CE. The mathematical theory of communication. Bell System Tech J. 1948;27:379–423, 623–56.
49. Litwin S, Jores R. Shannon information as a measure of amino acid diversity. In: Perelson AS, Weisbuch G, editors. Theoretical and experimental insights into immunology, vol. 66. NATO ASI Series. Berlin: Springer Berlin Heidelberg; 1992. p. 279–87.
50. McNab BK. Resource use and the survival of land and freshwater vertebrates on oceanic islands. American Naturalist. 1994;144:643–60.
51. Cooper A, Cooper RA. The Oligocene bottleneck and New Zealand biota: genetic record of a past environmental crisis. Proc Biol Sci. 1995;261:293–302.
52. Grzimek B, Schlager N, Olendorf D, McDade MC. Grzimek’s animal life encyclopedia. Gale: Gale, MI; 2004.
53. Tanaka M. Molecular and evolutionary basis of limb field specification and limb initiation. Dev Growth Differ. 2013;55:149–63.
54. Pascual-Anaya J, D’Aniello S, Kuratani S, Garcia-Fernandez J. Evolution of Hox gene clusters in deuterostomes. BMC Dev Biol. 2013;13:26.
55. Dimitrieva S, Bucher P. UCNEbase–a database of ultraconserved non-coding elements and genomic regulatory blocks. Nucleic Acids Res. 2013;41:D101–9.
56. Woolfe A, Elgar G. Organization of conserved elements near key developmental regulators in vertebrate genomes. Adv Genet. 2008;61:307–38.
57. Pennacchio LA, Ahituv N, Moses AM, Prabhakar S, Nobrega MA, Shoukry M, et al. In vivo enhancer analysis of human conserved non-coding sequences. Nature. 2006;444:499–502.
58. Bell SM, Schreiner CM, Waclaw RR, Campbell K, Potter SS, Scott WJ. Sp8 is crucial for limb outgrowth and neuropore closure. Proc Natl Acad Sci U S A. 2003;100:12195–200.
59. Gestri G, Osborne RJ, Wyatt AW, Gerrelli D, Gribble S, Stewart H, et al. Reduced TFAP2A function causes variable optic fissure closure and retinal defects and sensitizes eye development to mutations in other morphogenetic regulators. Hum Genet. 2009;126:791–803.
60. Reid B, Williams GR. The kiwi. In: Kuschel G, editor. Biogeography and Ecology in New Zealand, vol. 27. The Hague: Springer Netherlands; 1975. p. 301–30.
61. van Tuinen M, Sibley CG, Hedges SB. Phylogeny and biogeography of ratite birds inferred from DNA sequences of the mitochondrial ribosomal genes. Mol Biol Evol. 1998;15:370–6.
62. Hackett SJ, Kimball RT, Reddy S, Bowie RC, Braun EL, Braun MJ, et al. A phylogenomic study of birds reveals their evolutionary history. Science. 2008;320:1763–8.


63. Worthy TH, Worthy JP, Tennyson AJD, Salisbury SW, Hand SJ, Scofield RP. Miocene fossils show that kiwi (Apteryx, Apterygidae) are probably not phyletic dwarves. In: Göhlich UB, Kroh A, editors. Proceedings of the 8th International Meeting Society of Avian Paleontology and Evolution. Vienna, 2012, Verlag des Naturhistorischen Museums in Wien, Vienna; 2013. p. 63–80.
64. Jacobs GH. Losses of functional opsin genes, short-wavelength cone photopigments, and color vision–a significant trend in the evolution of mammalian vision. Vis Neurosci. 2013;30:39–53.
65. Striedter GF. Principles of brain evolution. Sinauer Associates Inc.,U.S. ISBN: 978-0-87893-820-9. 2004/2005
66. Walls GL. The vertebrate eye and its adaptive radiation. Oxford: Cranbook Institute of Science; 1942.
67. Crompton AW, Taylor CR, Jagger JA. Evolution of homeothermy in mammals. Nature. 1978;272:333–6.
68. McNab BK. Metabolism and temperature regulation of kiwis (Apterygidae). The Auk. 1996;113:687–92.
69. Sales J. The endangered kiwi: a review. Folia Zoologica Praha. 2005;54:1.
70. Brooke ML, Hanley S, Laughlin SB. The scaling of eye size with body mass in birds. Proc Biol Sci. 1999;266:405–12.
71. Hall MI, Kamilar JM, Kirk EC. Eye shape and the nocturnal bottleneck of mammals. Proc Biol Sci. 2012;279:4962–8.
72. Cunningham S, Castro I, Alley M. A new prey-detection mechanism for kiwi (Apteryx spp.) suggests convergent evolution between paleognathous and neognathous birds. J Anat. 2007;211:493–502.
73. Gilad Y, Przeworski M, Lancet D. Loss of olfactory receptor genes coincides with the acquisition of full trichromatic vision in primates. PLoS Biol. 2004;2, E5.
74. Stanke M, Keller O, Gunduz I, Hayes A, Waack S, Morgenstern B. AUGUSTUS: ab initio prediction of alternative transcripts. Nucleic Acids Res. 2006;34:W435–9.
75. Gertz EM, Yu YK, Agarwala R, Schaffer AA, Altschul SF. Composition-based statistics and translated nucleotide searches: improving the TBLASTN module of BLAST. BMC Biol. 2006;4:41.
76. Prüfer K, Muetzel B, Do HH, Weiss G, Khaitovich P, Rahm E, et al. FUNC: a package for detecting significant associations between gene sets and ontological annotations. BMC Bioinform. 2007;8:41.
77. De Bie T, Cristianini N, Demuth JP, Hahn MW. CAFE: a computational tool for the study of gene family evolution. Bioinformatics. 2006;22:1269–71.
78. Bindea G, Mlecnik B, Hackl H, Charoentong P, Tosolini M, Kirilovsky A, et al. ClueGO: a Cytoscape plug-in to decipher functionally grouped gene ontology and pathway annotation networks. Bioinformatics. 2009;25:1091–3.
79. Kall L, Krogh A, Sonnhammer EL. An HMM posterior decoder for sequence feature prediction that includes homology information. Bioinformatics. 2005;21:i251–7.
80. Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, et al. The Sequence Alignment/Map format and SAMtools. Bioinformatics. 2009;25:2078–9.
81. Katoh K, Standley DM. MAFFT multiple sequence alignment software version 7: improvements in performance and usability. Mol Biol Evol. 2013;30:772–80.
82. Hall TA. BioEdit: a user-friendly biological sequence alignment editor and analysis program for Windows 95/98/NT. Nucleic Acids Symp Ser. 1999;41:95–8.
83. Yang Z. Computational Molecular Evolution. Oxford: Oxford University Press; 2006.
84. Prohaska SJ, Fried C, Flamm C, Wagner GP, Stadler PF. Surveying phylogenetic footprints in large gene clusters: applications to Hox cluster duplications. Mol Phylogenet Evol. 2004;31:581–604.
85. Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ. Basic local alignment search tool. J Mol Biol. 1990;215:403–10.
86. Edgar RC. MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res. 2004;32:1792–7.
87. Kiwi Genome. Available at: http://www.bioinf.uni-leipzig.de/~studla/KIWI-HOX/.
88. Kiwi Annotated HOX Cluster. Available at: https://bioinf.eva.mpg.de/KIWI-HOX/
89. Kiwi Annotated UCNEs. Available at: https://bioinf.eva.mpg.de/KIWI-UCNEs/
90. Ellegren H, Smeds L, Burri R, Olason PI, Backstrom N, Kawakami T, et al. The genomic landscape of species divergence in Ficedula flycatchers. Nature. 2012;491:756–60.
91. Dalloul RA, Long JA, Zimin AV, Aslam L, Beal K, Le Blomberg A, et al. Multi-platform next-generation sequencing of the domestic turkey (Meleagris gallopavo): genome assembly and analysis. PLoS Biol. 2010;8:1–21.

Appendix A.4.
Personal copy of the manuscript “Heterogeneity of dN/dS Ratios at the Classical
HLA Class I Genes over Divergence Time and Across the Allelic Phylogeny”:
Journal of Molecular Evolution (2015) 82(1): 38-50. The final version of the
article (after editorial processing) may not be redistributed from this
document, since it is not Open Access. I therefore provide the version accepted
for publication, without the journal's formatting; the formatted version is
available at DOI 10.1007/s00239-015-9713-9.
This article is the result of my master's work, which was improved over the
course of my doctorate. It has elements in common with the article presented in
Appendix A.2, since in both we investigate the units of selection in the HLA
genes: here, allelic lineages; in the other article (A.2), supertypes.
In this work I was supervised by Diogo Meyer, who conceived the original ideas
of the project. We both developed the methodologies adopted throughout the
project. I carried out all the analyses and wrote the manuscript together with
DM. RSF and I performed the detection of recombinant sequences, and all
co-authors took part in discussing and verifying the results.

bioRxiv preprint first posted online Aug. 22, 2014; doi: http://dx.doi.org/10.1101/008342. The copyright holder for this preprint (which was
not peer-reviewed) is the author/funder. It is made available under a CC-BY-NC-ND 4.0 International license.


Heterogeneity of dN/dS ratios at the classical HLA class I genes over divergence time and across the allelic phylogeny

Bárbara Domingues Bitarello, Rodrigo dos Santos Francisco, Diogo Meyer

Abstract The classical class I HLA loci of humans show an excess of nonsynonymous with respect to

synonymous substitutions at codons of the antigen recognition site (ARS), a hallmark of adaptive evolution.

Additionally, high polymorphism, linkage disequilibrium and disease associations suggest that one or more

balancing selection regimes have acted upon these genes. However, several questions about these selective

regimes remain open. First, it is unclear if stronger evidence for selection on deep timescales is due to changes

in the intensity of selection over time or to a lack of power of most methods to detect selection on recent

timescales. Another question concerns the functional entities which define the selected phenotype. While most

analyses focus on selection acting on individual alleles, it is also plausible that phylogenetically defined groups

of alleles ("lineages") are targets of selection. To address these questions we analyzed how dN/dS (ω) varies

with respect to divergence times between alleles and phylogenetic placement (position of branches). We find

that ω for ARS codons of class I HLA genes increases with divergence time and is higher for inter-lineage

branches. Throughout our analyses, we used non-selected codons to control for possible effects of inflation of ω

associated with intra-specific analyses, and showed that our results are not artifactual. Our findings indicate the

importance of considering the timescale effect when analysing ω over a wide spectrum of divergences. Finally,

our results support the divergent allele advantage model, whereby heterozygotes with more divergent alleles

have higher fitness than those carrying similar alleles.

Keywords balancing selection, HLA, MHC, dN/dS, allelic lineages, antigen recognition site, divergent allele

advantage

Address: Department of Genetics and Evolutionary Biology, University of São Paulo, Rua do Matão, 277, São Paulo. Tel.:
+55(11)3091-8092 E-mail: bdbitarello@gmail.com


1 Introduction

MHC class I and II classical molecules are cell-surface glycoproteins which mediate presentation of peptides

to T-cell receptors, and play a key role in triggering adaptive immune responses when the bound peptide is

recognized as foreign (Klein and Sato 2000). In humans, they are coded by HLA class I (HLA-A, -B, and -C )

and II (HLA-DR, -DQ, and -DP) classical genes. The class I and class II HLA classical genes are the most

polymorphic in the human genome (Meyer and Thomson 2001), and knowledge about their function in the

immune response supports a role for balancing selection in driving the diversity patterns at these loci.

A number of findings suggest MHC genes have experienced balancing selection: unusually high level of

heterozygosity with respect to neutral expectations (Hedrick and Thomson 1983); existence of trans-species

polymorphisms (Takahata and Nei 1990); high levels of linkage disequilibrium (Huttley et al 1999); site

frequency spectra with excess of common variants (Garrigan and Hedrick 2003); high levels of identity-by-descent

compared to genomic averages (Albrechtsen et al 2010); positive correlation between HLA polymorphism and

pathogen diversity (Prugnolle et al 2005), and significant associations of HLA alleles with the course of infectious

diseases (e.g. Apps et al 2013). Information on the crystal structure of MHC molecules (Bjorkman et al 1987)

allowed the identification of a specific set of amino acids that make up the antigen recognition site (ARS),

which determines the peptides that the molecule is able to bind (Bjorkman et al 1987; Chelvanayagam 1996).

The codons of the ARS were shown to have increased nonsynonymous substitution rates (Hughes and Nei

1988, 1989), consistent with the hypothesis that adaptive evolution at HLA loci is driven by peptide binding

properties.

Several models of selection are compatible with balancing selection at MHC genes. Heterozygote advantage

assumes that heterozygotes have higher fitness values because they are able to mount an immune response

to a greater array of pathogens, an idea originally proposed by Doherty and Zinkernagel (1975), who showed

that mice which were heterozygous for the MHC had increased immunological surveillance. Heterozygote

advantage has received support from experiments in semi-natural populations of mice (Penn et al 2002), which

show increased resistance of heterozygotes to multiple-strain infection, and through the finding that among

humans infected with HIV, those who are heterozygous for HLA genes have slower progression to AIDS

(reviewed in Dean et al 2002). Heterozygote advantage has also received support from substitution rate studies

(Hughes and Nei 1988, 1989) as well as simulation-based studies (e.g. Takahata and Nei 1990). A second

model for balancing selection at MHC genes is negative frequency dependent selection (or apostatic selection),

according to which rare variants have a selective advantage over common ones, because pathogens are more

likely to evade presentation by common molecules (Slade and McCallum 1992). Although both are biologically

compelling, decades of research have shown that most ways of summarizing genetic variation are incapable

of differentiating these two modes of selection (Hughes and Nei 1989; Meyer and Thomson 2001; Spurgin and

Richardson 2010), and the functional insights for the action of heterozygote advantage at least partially explain

why it is usually favored over negative frequency dependence (Richman 2000).


A third model involves selective pressures that are heterogeneous over space and/or time, favoring different

alleles in different temporal or geographic compartments, and thus resulting in an overall increase in diversity

at MHC loci. This model has been shown to be capable of accounting for features of HLA variation (Hedrick

2002). Many studies have investigated this model by comparing the degree of population differentiation at MHC

and putatively neutral loci, with the expectation being that selection that is geographically heterogeneous will

result in increased differentiation at HLA genes. As reviewed in Spurgin and Richardson (2010), the results are

mixed, and interpretation is hampered due to differences in the mutational models underlying the evolution

of HLA genes and loci used as neutral controls. Although the specific form of selection acting on MHC genes

remains an open question, the fact that these genes have evolved in a non-neutral way and are under balancing

selection is an undisputed finding, which is robust to complications introduced by demographic history (Harris

and Meyer 2006; Hughes and Yeager 1998; Garrigan and Hedrick 2003).

While studies of MHC have documented convincingly a role of selection, certain questions remain unresolved

in the context of variation of the human MHC genes (termed HLA loci). The first of these concerns the

"timescale" of selection: while most tests for selection have provided strong evidence for selection at classical

HLA class I genes at deep timescales, there is comparatively less support for selection at recent timescales

(Garrigan and Hedrick 2003). It has proved difficult to tease apart the possibility that selection differs across

timescales from reduced statistical power of tests for recent selection, and thus the question of the timescale

of selection on HLA genes remains open.

The second question concerns targets of selection, i.e., which biological entity is targeted by selection in

HLA class I genes: individual alleles or groups of similar alleles? Classical MHC genes have many alleles,

which can be hierarchically classified into groups of alleles which reflect the phylogenetic relatedness and

shared functional attributes of these alleles. Wakeland et al (1990) proposed a mechanism coined "divergent

allele advantage", which is a specific case of heterozygote advantage, according to which the fitness values

of heterozygotes are proportional to the degree of divergence between the alleles they carry. This model was

motivated by the observation that, in MHC class II murine genes, alleles from a given allelic lineage often differ

by only minor structural variations in the ARS, while alleles in different lineages have functionally different

ARS. The open question is whether individual alleles or allelic lineages are the main targets of selection for

HLA genes. Although intra-lineage nucleotide diversity exceeds genome-wide averages, inter-lineage diversity

is substantially higher than intra-lineage diversity (Takahata and Satta 1998). This raises the question of whether intra-lineage

variation is under a different mode and intensity of selection with respect to differences between lineages.

We address these questions by analysing the temporal and phylogenetic dynamics of dN/dS (or ω) for ARS

codons at the classical class I loci (HLA-A, -B and -C), using both pairwise and phylogenetic approaches.

These loci are all highly polymorphic and there is an abundance of data available for most exons of their coding

sequence, which makes our analyses of non-ARS codons (as a control) possible. Our pairwise comparisons of

alleles show that more divergent pairs show higher ω for ARS codons than closely related pairs of alleles.

The phylogenetic analyses support the hypothesis that selection is stronger for inter-lineage branches (i.e.,

those connecting clades from different lineages, as opposed to branches within a single lineage), and also for branches that are


internal to the phylogeny (when compared to terminal branches), provided that a bias toward overestimating

ω for recent divergence is taken into account (Rocha et al 2006). Although evidence for balancing selection

on the intra-lineage scale is weaker than on the inter-lineage scale, our findings show that there is statistical

support for deviation from a regime of neutrality for intra-lineage branches of the allelic tree. We conclude

that intra-lineage divergence has also evolved under a regime of balancing selection, and that inter-lineage

divergence bears an even stronger signature of selection.

2 Materials and Methods

2.1 Data

Alignments for HLA-A, HLA-B and HLA-C were obtained from the IMGT/HLA Database (Robinson et al

2013). All dN/dS estimates and related analyses were implemented in CODEML (PAML package, Yang 2007).

First codon position was considered to be the first codon of exon 2, as indicated by annotation on IMGT

alignments. Our initial data sets were comprised of complete coding sequences, i.e., exons 2-7 (for HLA-A and

HLA-C ) and 2-6 (HLA-B). These data sets were used for the site models (SM) approach. For the pairwise and

branch model (BM) approaches, we used two datasets: one with 48 ARS codons (Chelvanayagam 1996) and

the other, referred to as "non-ARS", consisting of the remaining codons (Table 1).

In order to be able to use the methods available in CODEML we restricted our analysis to HLA alleles

which had complete coding sequences, no stop codons, were expressed on the cell surface and only differed with

respect to others by base changes (i.e. no insertions or deletions). Alleles with mutations putatively linked to

low or absent cell surface expression were also removed from the analyses. The non-ARS data sets were used for

estimation of dS, used in the pairwise approach as a proxy for allelic divergence, as an internal control for ARS

analyses. For the branch models, further pruning of the phylogenetic trees was done, as described below.
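
The inclusion criteria above can be sketched as a simple filter. This is a minimal illustration rather than the procedure actually used: the function names, the gap characters ("-" and ".") and the `expected_length` argument are assumptions, and the expression-status suffixes are those of the standard HLA nomenclature.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
# Expression-status suffixes of the HLA nomenclature (e.g. N = null, L = low
# cell-surface expression, Q = questionable expression).
EXPRESSION_SUFFIXES = set("NLSCAQ")


def has_internal_stop(cds: str) -> bool:
    """Return True if an in-frame stop codon occurs before the last codon."""
    internal = (cds[i:i + 3].upper() for i in range(0, len(cds) - 3, 3))
    return any(codon in STOP_CODONS for codon in internal)


def keep_allele(name: str, cds: str, expected_length: int) -> bool:
    """Apply the inclusion criteria to one aligned coding sequence."""
    if name[-1] in EXPRESSION_SUFFIXES:   # altered or absent expression
        return False
    if "-" in cds or "." in cds:          # assumed gap characters: indel or missing data
        return False
    if len(cds) != expected_length or len(cds) % 3 != 0:
        return False                      # incomplete or out-of-frame sequence
    return not has_internal_stop(cds)
```

For example, `keep_allele("A*01:11N", seq, len(seq))` is rejected on the name alone, whatever the sequence.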

2.2 Trees and intragenic recombination detection

Phylogenetic trees. Complete alignments, described above, were used to generate NJ trees for each gene (Saitou

and Nei 1987). The program NEIGHBOR, from the PHYLIP package (Felsenstein 1989) was used with the

F84 method, k (transition/transversion ratio) = 2 and empirical base frequencies for the distance matrices

obtained in DNADIST (Felsenstein 1989).

Recombination detection. Intragenic recombinants were detected by applying RDP3 (Martin et al 2010) to

the complete alignments, followed by manual inspection. The RDP3 program combines several non-parametric

recombination detection methods in sequence data, and we used six independent tests:

RDP, Chimaera, Maxchi, GENECONV, BootScan and SiScan (see Martin et al.

2010 and references therein). Window size was adjusted to 100 for BootScan and SiScan, and to 15 for RDP.

The number of variable sites per window was adjusted to 35 and 30 for Maxchi and Chimaera, respectively.


These sizes were chosen based on a test alignment we provided to the software, in which parental and daughter

HLA-B sequences were known a priori. Based on this training set, we adjusted the parameters as described, and

for other parameters default values were used. Since these six tests are mostly independent, and have different

strengths, we considered a recombination event to be significant when p < 0.05 in at least 3 of the above

methods, which means we were somewhat conservative in the removal of recombinants from the datasets. "Trace

evidence" cases, i.e., those that bear a signal of recombination but are technically not statistically significant,

were kept in the data sets. Following this initial procedure, we visually inspected the filtered alignments for

the detection of additional recombinant sequences. This procedure generated two data sets for each locus, one

with recombinants and one without ("recombinant" (R), and "non-recombinant" (NR), respectively, Table 1).
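
The consensus rule described above (p < 0.05 in at least three of the six methods) can be written down as follows. This sketch does not run RDP3; it only applies the decision rule to p-values assumed to have been exported per sequence, and the dictionary layout is hypothetical.

```python
METHODS = ("RDP", "Chimaera", "Maxchi", "GENECONV", "BootScan", "SiScan")


def is_recombinant(p_values: dict, alpha: float = 0.05, min_methods: int = 3) -> bool:
    """Flag a sequence when enough methods report a significant event.

    p_values maps a method name to its p-value; a method with no reported
    event is treated as non-significant.
    """
    supported = sum(1 for m in METHODS if p_values.get(m, 1.0) < alpha)
    return supported >= min_methods


# Hypothetical p-values for two sequences.
clear_case = {"RDP": 0.001, "Maxchi": 0.01, "GENECONV": 0.03, "SiScan": 0.2}
trace_case = {"RDP": 0.04, "Chimaera": 0.06, "BootScan": 0.3}
```

Here `clear_case` would be removed from the NR data set, whereas `trace_case` (one significant method only) would be kept as "trace evidence".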

Clade filter. For the branch models, we used t (expected number of nucleotide substitutions per codon) matrices

obtained in pairwise analyses of the non-recombinant non-ARS data sets as input for NEIGHBOR. The trees

were visualized for manual pruning and labeling in Mesquite (v2.75, http://mesquiteproject.org/). We imposed

that alleles from a given HLA lineage (as defined by the standard HLA nomenclature, which identifies lineage

membership by the first field of an allele’s name) had to group together in a clade, and alleles which did not

group in such manner were manually pruned from trees in order to fulfill this "clade membership criterion".

The effect of this filtering on inclusion of alleles is presented in Figure S1 in the Online Resource 1. After

pruning of the trees, the corresponding pruned alleles were removed from the NR data sets and these reduced

data sets were used for the branch model analyses. Table 1 shows the number of alleles used for each analysis.
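
The clade membership criterion can be illustrated with a small check. The pruning itself was done manually in Mesquite; the sketch below only flags lineages that fail to form a clade, and it assumes a rooted tree encoded as nested tuples of allele names, which is a simplification of the NJ trees used here.

```python
def lineage(allele: str) -> str:
    """Lineage = first field of the allele name, e.g. 'A*02:01' -> 'A*02'."""
    return allele.split(":")[0]


def leaves(tree):
    """List the allele names below a node (a leaf is a plain string)."""
    if isinstance(tree, str):
        return [tree]
    return [leaf for child in tree for leaf in leaves(child)]


def smallest_subtree(tree, targets):
    """Return the smallest subtree that contains every name in targets."""
    children = () if isinstance(tree, str) else tree
    for child in children:
        if targets <= set(leaves(child)):
            return smallest_subtree(child, targets)
    return tree


def lineages_failing_clade_criterion(tree):
    """Lineages whose alleles do not group together as a single clade."""
    members = {}
    for name in leaves(tree):
        members.setdefault(lineage(name), set()).add(name)
    return sorted(
        lin for lin, names in members.items()
        if set(leaves(smallest_subtree(tree, names))) != names
    )


# Hypothetical toy tree: A*03:01 is nested inside the A*01 alleles.
toy_tree = (("A*02:01", "A*02:05"), (("A*01:01", "A*03:01"), "A*01:02"))
```

In this toy tree only A*01 fails the criterion, so one would consider pruning the allele that disrupts it.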

2.3 CODEML analyses

Branch models (BM). With the pruned data sets we compared branch models 0 (one ω for all branches) and

2 (two or more categories of branches with independent ω) from CODEML. We provided CODEML with a

topology based on the non-ARS pruned data set, using branch lengths as starting points for ML estimation

(fix_blength=1). For all CODEML analyses (BM, site models and pairwise), the Goldman and Yang (1994)

model was used for estimation of substitution rates. Other parameters defined in the control file were as

follows: option F3x4 for codon frequency estimation, κ = 2 and ω = 0.4 as initial values. Tables S14-S16

(Online Resource 1) show likelihood convergence for the branch models, assuming different initial parameter

values and codon frequency estimation methods. BM analyses were performed solely for the NR (and pruned) data

sets (see Tables 2 and 3). Branches were labeled either as "intra" or "inter" lineage, or as "terminal" or

"internal" (see Figure 3 for a schematic of the labels applied to the trees). Model 0 and model 2 were

compared via a likelihood ratio test (LRT) with one degree of freedom (see below).
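
As an illustration of the settings listed above, the following sketch writes a CODEML control file for either branch model. The file names are hypothetical and only the options mentioned in the text (plus those needed to select a codon analysis on a user tree) are set; for model 2 the tree file is assumed to carry PAML branch labels (such as "#1") marking the second branch category.

```python
def codeml_ctl(seqfile: str, treefile: str, outfile: str, branch_model: int) -> str:
    """Return control-file text for branch model 0 (one ω) or 2 (labeled branches)."""
    if branch_model not in (0, 2):
        raise ValueError("branch_model must be 0 or 2")
    settings = {
        "seqfile": seqfile,
        "treefile": treefile,
        "outfile": outfile,
        "runmode": 0,        # use the user-supplied tree topology
        "seqtype": 1,        # codon sequences
        "CodonFreq": 2,      # F3x4 codon frequencies
        "model": branch_model,
        "NSsites": 0,        # no variation in ω among sites
        "fix_kappa": 0,
        "kappa": 2,          # initial transition/transversion ratio
        "fix_omega": 0,
        "omega": 0.4,        # initial ω
        "fix_blength": 1,    # branch lengths in the tree are starting values
    }
    return "\n".join(f"{key:>12} = {value}" for key, value in settings.items()) + "\n"


# Hypothetical file names for the two runs that are later compared by LRT.
ctl_null = codeml_ctl("ars_pruned.phy", "pruned.tre", "bm0.out", 0)
ctl_alt = codeml_ctl("ars_pruned.phy", "pruned_labeled.tre", "bm2.out", 2)
```

Each string would be saved as a control file and passed to `codeml`; the two resulting log likelihoods feed the LRT.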

Site models (SM). For the SM approach, the clade filter was not applied, which resulted in minor differences

between this data set and the other two (pairwise and branch models approach, see Table 1). We used the


site models from CODEML to identify codons with ω > 1 and thus to test if ARS codons bear evidence for

adaptive evolution. M0 (one ratio) assumes a single ω for all codons. M1 (neutral) assumes two categories

of sites, one with ω1 = 1 (sites evolving neutrally) and the other with ω0 < 1 (sites evolving under

purifying selection). M2 (selection) adds an extra category to M1, with ω2 > 1, corresponding to sites with

evidence for adaptive evolution. M7 (beta) is a flexible null model in which ω is sampled from a beta

distribution, so that 0 < ω < 1, while M8 adds an extra category to M7, ω2, which is estimated from the

data (Yang 2006). Codons with posterior

probabilities P > 0.95 of ω > 1 in the Bayes Empirical Bayes (BEB) (Yang et al 2005) approach implemented

in CODEML were considered to have significant evidence for adaptive evolution, following criteria described

elsewhere (Yang and Swanson 2002; Yang et al 2005). The ARS codon classification proposed by Bjorkman

et al. (1987) is referred to as BJOR, while the "peptide binding environments", i.e., the amino acid residues in

a fixed neighborhood of the peptide binding residues known from crystal structure complexes (which provide

a less restrictive description of the antigen binding sites), are referred to as CHEV (Chelvanayagam 1996).

Finally, the list of codons in HLA genes with evidence of ω > 1 from Yang and Swanson (2002) is referred to as

YANG (Figure 1 and Online Resource 1, Table S9). M1 vs M2 and M7 vs M8 models were compared through

a LRT with two degrees of freedom. Tables S3-S8 (Online Resource 1) show likelihoods obtained when altering

initial CODEML conditions for the SM analyses. SM analyses were performed for R and NR data sets.

Codons with P > 0.95 for ω > 1 in M8 (34 in total) were combined for the three loci, and the R and NR

data sets, and compared to CHEV, BJOR and YANG. Of these 34 codons, only one was outside of the exons

2 and 3 range (codon 305), which is where all ARS codons are located. Figure 1 shows the overlap between

the codons defined as making up the ARS in the BJOR and CHEV classifications, as well as those identified as

under selection in the YANG set of codons and our analyses.

In order to evaluate if our site model analyses were robust to features of the estimation method, the analyses

were repeated with DATAMONKEY, from the HYPHY package (Pond et al, 2005). The substitution model

used for construction of the NJ tree was HKY85 (very closely related to F84, used for CODEML analyses).

Two criteria for detecting significant dN/dS > 1 were considered: SLAC and FEL (both with significance level

of 0.1), with the former being the most conservative criterion available in the package. Tables S10-S12 report

the overlap of sites with evidence for dN/dS > 1 for BEB (CODEML), SLAC and FEL.

LRT. When comparing two nested models the LRT test statistic is given by doubling the log likelihood difference

between the more parameter rich model and the less parameter rich model. The difference in parameter number

yields the degrees of freedom. It is expected that the use of a chi-square distribution for significance evaluation

of this test is a conservative approach (Yang 2006). Both site models and branch models comparisons were

performed through LRTs.
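
The test statistic described above is 2(lnL1 − lnL0), compared against a chi-square distribution. A minimal sketch follows, covering only the two cases used here (one degree of freedom for the branch models, two for M1 vs M2 and M7 vs M8); the log likelihood values in the example are hypothetical.

```python
import math


def chi2_sf(x: float, df: int) -> float:
    """Upper-tail probability of a chi-square variable (df = 1 or 2 only)."""
    if x <= 0:
        return 1.0
    if df == 1:
        return math.erfc(math.sqrt(x / 2.0))
    if df == 2:
        return math.exp(-x / 2.0)
    raise ValueError("only df = 1 or 2 are handled in this sketch")


def lrt(lnl_null: float, lnl_alt: float, df: int):
    """Return (test statistic, p-value) for two nested models."""
    stat = 2.0 * (lnl_alt - lnl_null)
    return stat, chi2_sf(stat, df)


# Hypothetical log likelihoods for a site-model comparison (df = 2).
stat, p = lrt(lnl_null=-1000.0, lnl_alt=-997.0, df=2)
```

With these made-up values the statistic is 6 and the p-value is exp(−3), just under 0.05.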

Breslow-Day test. In order to compare ARS and non-ARS codons with respect to the distribution of synonymous

and nonsynonymous changes within and between lineages (or for internal or terminal branches), we used a

contingency table approach similar to the one described in Templeton (1996). We estimated the synonymous


(S) and nonsynonymous (N) changes on each branch in CODEML, using the branch models. Next we

counted N (nonsynonymous changes) and S (synonymous changes) for intra/inter or terminal/internal branches

for each locus, and for ARS and non-ARS codons (Table 5).

We defined the odds ratio (OR) as

$$\mathrm{OR} = \frac{N_{\mathrm{intra}} \cdot S_{\mathrm{inter}}}{N_{\mathrm{inter}} \cdot S_{\mathrm{intra}}},$$

and used a Breslow-Day test for homogeneity of OR to test the hypothesis that contingency tables from

ARS and non-ARS codons have the same OR. We applied the same test to internal/terminal branches. Data

from the three loci were combined into the same analysis to increase power.
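
To make the test concrete, the sketch below computes the odds ratio defined above and a textbook Breslow-Day statistic for two 2x2 tables (ARS and non-ARS). It is an illustration under stated assumptions: each table is ordered as (N_intra, S_intra, N_inter, S_inter), the common odds ratio is the Mantel-Haenszel estimate, no Tarone correction is applied, and the counts shown are invented, not those of Table 5.

```python
import math


def odds_ratio(table):
    """OR = (N_intra * S_inter) / (N_inter * S_intra)."""
    n_intra, s_intra, n_inter, s_inter = table
    return (n_intra * s_inter) / (n_inter * s_intra)


def mantel_haenszel_or(tables):
    """Common odds ratio pooled over the tables."""
    num = sum(a * d / (a + b + c + d) for a, b, c, d in tables)
    den = sum(b * c / (a + b + c + d) for a, b, c, d in tables)
    return num / den


def expected_first_cell(table, psi):
    """Expected N_intra given the table margins and a common odds ratio psi."""
    a, b, c, d = table
    n1, n2, m1 = a + b, c + d, a + c
    if abs(psi - 1.0) < 1e-12:
        return n1 * m1 / (n1 + n2)
    qa, qb, qc = 1.0 - psi, n2 - m1 + psi * (n1 + m1), -psi * n1 * m1
    root = math.sqrt(qb * qb - 4.0 * qa * qc)
    low, high = max(0.0, m1 - n2), min(n1, m1)
    for cand in ((-qb + root) / (2.0 * qa), (-qb - root) / (2.0 * qa)):
        if low - 1e-9 <= cand <= high + 1e-9:
            return cand
    raise ValueError("no admissible root for the expected count")


def breslow_day(tables):
    """Return (common OR, statistic, p-value) for exactly two 2x2 tables."""
    if len(tables) != 2:
        raise ValueError("this sketch handles two tables (one degree of freedom)")
    psi = mantel_haenszel_or(tables)
    stat = 0.0
    for a, b, c, d in tables:
        n1, n2, m1 = a + b, c + d, a + c
        e = expected_first_cell((a, b, c, d), psi)
        var = 1.0 / (1.0 / e + 1.0 / (n1 - e) + 1.0 / (m1 - e) + 1.0 / (n2 - m1 + e))
        stat += (a - e) ** 2 / var
    return psi, stat, math.erfc(math.sqrt(stat / 2.0))


# Invented counts: (N_intra, S_intra, N_inter, S_inter) for ARS and non-ARS codons.
ars_table, non_ars_table = (20, 10, 10, 20), (10, 20, 20, 10)
common_or, bd_stat, bd_p = breslow_day([ars_table, non_ars_table])
```

A small p-value indicates that the two odds ratios differ, i.e. that the distribution of N and S changes between branch classes is not the same for ARS and non-ARS codons.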

Pairwise approach. We also performed analyses where statistics were estimated in comparisons between all pairs

of alleles (pairwise analyses, see Table 1) using runmode=-2 in CODEML. This approach does not require a

phylogenetic tree. Because IMGT/HLA nomenclature allows information about allelic lineages to be known

without a tree, pairs were also classified as intra or inter-lineage. Correlations between allelic divergence and

omega values were tested with a Mantel Test using Pearson’s correlation index (Online Resource 1, Table S13).

We obtained quantiles of the dSnon−ARS distribution and divided pairwise values according to these quantiles

(Online Resource 1, Table S1 for non-ARS data set and Table 4 in main text for ARS data set). Differences

in mean ω values for "intra" and "inter" comparisons were tested for significance by a Wilcoxon rank sum test

(Figure 2).
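
The two bookkeeping steps of the pairwise approach, classifying allele pairs from their names and binning ω by quantiles of non-ARS dS, can be sketched as below. The number of bins, the function names and the example values are assumptions for illustration; the Mantel and Wilcoxon tests themselves are not reproduced.

```python
from statistics import mean, quantiles


def lineage(allele: str) -> str:
    """Lineage = first field of the allele name, e.g. 'A*02:01' -> 'A*02'."""
    return allele.split(":")[0]


def pair_class(allele_1: str, allele_2: str) -> str:
    """Classify a pair as 'intra' or 'inter' lineage from the names alone."""
    return "intra" if lineage(allele_1) == lineage(allele_2) else "inter"


def mean_omega_by_ds_bin(pairs, n_bins: int = 4):
    """pairs: (dS_nonARS, omega_ARS) tuples. Returns the mean omega per dS bin."""
    cuts = quantiles([ds for ds, _ in pairs], n=n_bins)
    bins = [[] for _ in range(n_bins)]
    for ds, omega in pairs:
        bins[sum(ds > cut for cut in cuts)].append(omega)
    return [mean(values) if values else None for values in bins]


# Hypothetical pairwise estimates, ordered by increasing divergence.
example_pairs = [(0.01, 0.5), (0.02, 0.7), (0.03, 1.0), (0.04, 1.2),
                 (0.05, 1.5), (0.06, 1.7), (0.07, 2.0), (0.08, 2.2)]
binned = mean_omega_by_ds_bin(example_pairs)
```

In this invented example the mean ω rises across the four dS bins, which is the kind of pattern the quantile tables are designed to expose.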

2.4 Allele frequencies of HLA SNPs in the 1000 Genomes

The IMGT/HLA database contains all HLA alleles described to date, regardless of their population frequencies.

Therefore, it is possible that rare variants can contribute disproportionately to patterns identified in the dN/dS

analyses. To address this concern, we investigated patterns of variation at the HLA loci in a population (Yoruba,

YRI) from the 1000 Genomes Project (1000G), for which frequency of alleles at specific SNP positions is

available (N = 88 individuals).

To test for a possible enrichment of rare variants in the IMGT data we compared patterns of variation

seen in the IMGT and 1000G phase I data (The 1000 Genomes Project Consortium, 2012). To this end, we

defined a set of sites, for each locus, which were variable in our IMGT-derived data sets (referred to as the

"OVERALL" set of sites). Next, we classified these sites as variable only within a single lineage ("INTRA"), or

variable in more than one lineage ("INTER"). For each site, we converted the positions within the HLA locus

into a genomic coordinate for H. sapiens (hg19).

Next, we verified if these positions are polymorphic in the 1000G Phase I low-coverage dataset (ftp://ftp.1000genomes.ebi.ac.uk/v

and recorded the minor allele frequency in the YRI population.
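
For reference, the minor allele frequency at a site can be computed from allele counts as sketched below. The counts are invented; with N = 88 diploid individuals a site has at most 176 sampled chromosomes. For sites with more than two alleles this sketch takes the second most common allele, which is an assumption rather than a choice stated in the text.

```python
def minor_allele_frequency(allele_counts: dict) -> float:
    """allele_counts maps each allele to its count across sampled chromosomes."""
    total = sum(allele_counts.values())
    if total == 0:
        raise ValueError("no chromosomes sampled at this site")
    freqs = sorted((n / total for n in allele_counts.values()), reverse=True)
    # A monomorphic site has no minor allele.
    return freqs[1] if len(freqs) > 1 else 0.0


# Invented counts for one SNP in 88 individuals (176 chromosomes).
maf = minor_allele_frequency({"A": 150, "G": 26})
```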


3 Results

3.1 Evidence for selection and assessment of recombination

Before investigating how ω varies over time and phylogenetic context, we tested (a) whether selection is

detectable in our data set with pairwise comparisons and phylogenetic dN/dS approaches; (b) if the presence

of HLA alleles resulting from intragenic recombination influences our inferences; and (c) if there is agreement

between the ARS codons defined by crystal structure (Bjorkman et al 1987; Chelvanayagam 1996) and the

codons inferred to have ω > 1 in our data set. The results of these tests are pre-requisites for subsequent

analyses addressing the more specific hypotheses about heterogeneity in dN/dS estimated across the allelic

phylogeny and divergence time.

We quantified the mean pairwise dN/dS (ω), and found ωARS > 1 for all loci (Table 4). We used the

non-ARS codons from the same sequences as an internal control, and found that ωARS is 3.9 (HLA-A), 4.0

(HLA-B) and 3.2-fold (HLA-C) greater than ωnon−ARS (Table 4). This effect is not driven by a subset of the

pairwise comparisons, since dN > dS for the majority (between 67 and 84%) of ARS pairwise comparisons, in

contrast to the non-ARS comparisons, where fewer than 7% show dN > dS (Table 4). Importantly, we find

that the result ωARS > ωnon−ARS is due to increased dN (3.5 to 14-fold higher for ARS), and not to decreased dS

(0.5 to 2.8-fold higher for ARS, Table 4). Qualitatively similar results were obtained when we computed the

ratio of mean substitution rates, mean dN / mean dS (Table 4). These findings are robust to the presence of recombinants

(Online Resource 1, Table S1). Overall, our results document that pairwise comparison of alleles provides

strong support for adaptive evolution on ARS codons, as expected.

Evidence for adaptive evolution in ARS codons was also strongly supported by phylogenetic methods

from CODEML (see Methods), where models allowing for selection (M2 and M8) in a subset of codons were

significantly favored over the neutral models M1 and M7 (Online Resource 1, Table S2; p < 0.01, LRT). Results

were robust to starting conditions for HLA-A and HLA-B (Online Resource 1, Tables S3-S6), and less so for

HLA-C (Online Resource 1, Tables S7 and S8).

We next quantified the overlap between codons we inferred to be under selection (using site models

from CODEML, "SM") and those defined as ARS based on structural analyses of HLA (Chelvanayagam

1996; Bjorkman et al 1987). Within exons 2 and 3 (which contain all ARS codons) we identified 33 codons

with significant ω > 1 for the M8 site model (see Methods and Table S2, Online Resource 1) in at least

one locus, of which 27 (82%) are contained within the set that forms the ARS according to the crystal

structure-based classification (Bjorkman et al 1987), 25 (76%) are contained within the peptide binding

environments (Chelvanayagam 1996), and 25 (76%) overlap with Yang and Swanson’s (2002) site models

approach to detect codons with ω > 1 in the three classical class I HLA loci (Figure 1 and Online Resource 1,

Table S9). The association between ARS and selected sites for all loci is highly significant (p < 10^{-11}, chi-square

test). There is extensive overlap between the two ARS classifications (Bjorkman et al 1987; Chelvanayagam


1996) (Figure 1) and we also find a high overlap of selected sites between the R and NR data sets for each

locus (27 out of 33) (Online Resource 1, Tables S10-S12).

Overall, our results show that: (a) the pairwise and phylogenetic site models methods implemented in

CODEML strongly support adaptive evolution on the ARS codons of HLA loci - as also described by Yang

and Swanson (2002) through site models; (b) there is an enrichment of codons with ω > 1 in the CHEV set

of codons (see Online Resource 1, Table S9, for the names given to the sets of codons), supporting the use of

this classification for our study; (c) although the evidence for selection was robust to the presence of recombinants, a finding

consistent with simulation studies (Anisimova et al, 2003), the estimated values for ω appear to be sensitive

to the inclusion of recombinants. Therefore, where appropriate, in subsequent pairwise analyses, we contrast

results of non-recombinant (NR) and recombinant (R) datasets, while for the branch models we use the NR

data set exclusively.

In addition, the results of HLA-C, although following the same trend observed for HLA-A and HLA-B,

show that absolute divergence values for ARS codons are on average half of those observed for the other two

loci, both for dN and dS (Table 4). This result might reflect the fact that HLA-C not only has an

antigen presentation function, but also plays a major role in interactions with NK cell receptors (KIR) and that, unlike

HLA-A and HLA-B, all HLA-C allotypes form ligands for KIR receptors (Hilton et al 2015; Single et al 2007).

Because the KIR loci have been shown to evolve quite rapidly across primate species, plausibly faster than

their MHC class I ligands (Single et al, 2007), it is possible that this important selective pressure is responsible

for the lower substitution rates seen for the ARS of HLA-C, as well as for the lack of consistency observed in

ML estimates.

3.2 The time-dependence of ω at HLA class I loci

Having confirmed that selection at ARS sites is detectable with pairwise comparisons and phylogenetic approaches,

we investigated whether recent evolutionary change (accounting for differences among recently diverged alleles) shows

different signatures of selection from changes that occurred over longer timescales. Our first approach

consisted of examining the distribution of ωARS as a function of the time since divergence between allele pairs.

Our estimate of divergence time between allele pairs was based on the values of dS (estimated from non-ARS

codons) for each allele pair, thus avoiding statistical non-independence with ωARS. Because very recently diverged

alleles have low synonymous divergence (dSnon-ARS), the corresponding ωARS values were often undefined or

extremely large. We therefore followed a strategy adopted by Wolf et al (2009) to filter out the allele pairs with

ωARS > 5 (resulting in the removal of 1.1%, 1.4% and 3.9% of ω values for pairwise comparisons at HLA-A, -B,

and -C, respectively).
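
The filtering step described above can be sketched as follows. This is a minimal illustration in Python, not the authors' pipeline: the function names and the toy values are hypothetical, and only the cutoff of 5 comes from the text. It assumes that ωARS is the ratio of dN to dS at ARS codons, while dS at non-ARS codons is kept as the proxy for divergence time.

```python
from typing import List, Optional, Tuple

OMEGA_CUTOFF = 5.0  # cutoff stated in the text (strategy of Wolf et al 2009)


def pairwise_omega(dn: float, ds: float) -> Optional[float]:
    """Return dN/dS for one allele pair, or None when dS is zero (undefined ratio)."""
    if ds == 0:
        return None
    return dn / ds


def filter_pairs(pairs: List[Tuple[float, float, float]]) -> List[Tuple[float, float]]:
    """Keep (dS_nonARS, omega_ARS) for pairs whose omega_ARS is defined and <= cutoff.

    Each input tuple is (dN_ARS, dS_ARS, dS_nonARS).
    """
    kept = []
    for dn_ars, ds_ars, ds_non_ars in pairs:
        omega = pairwise_omega(dn_ars, ds_ars)
        if omega is not None and omega <= OMEGA_CUTOFF:
            kept.append((ds_non_ars, omega))
    return kept


# Hypothetical values for illustration only (not data from the study).
toy_pairs = [
    (0.12, 0.06, 0.05),  # omega = 2, kept
    (0.03, 0.00, 0.01),  # dS = 0, omega undefined, dropped
    (0.12, 0.02, 0.01),  # omega = 6 > 5, dropped
]
kept = filter_pairs(toy_pairs)
```

Pairs dropped at this step are effectively treated as missing data, which is the limitation that motivates the branch-model analysis later in this section.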

Pairwise estimates show that ωARS increases as a function of divergence time (Table 4). Indeed, ωARS and

dSnon-ARS are positively correlated (Online Resource 1, Table S13; rHLA-A = 0.17, p < 0.001; rHLA-B = 0.20,

p < 0.001; rHLA-C = 0.20, p < 0.001; Pearson correlation, significance obtained by Mantel test). Qualitatively similar

results were found for NR data sets and were robust to different correlation measures (Online Resource 1,

Table S13). We also compared ω between allele pairs classified as intra- and inter-lineage (Figure 2). For

all loci, the median value of ωARS is > 1 for the inter-lineage contrasts and < 1 for the intra-lineage contrasts,

and the distribution of ωARS is significantly shifted toward higher values for inter-lineage contrasts (p < 0.001,

Wilcoxon rank sum test; Figure 2).
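
Because pairwise values share alleles and are therefore not independent, the significance of the correlation reported above was assessed with a Mantel test. A generic permutation version of that test can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: the function names, the one-sided alternative, the 999 permutations and the toy distance matrices are all hypothetical choices.

```python
import random
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equally long lists of numbers."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def upper_triangle(matrix, order):
    """Upper-triangle entries of a square matrix after reordering rows and columns."""
    n = len(order)
    return [matrix[order[i]][order[j]] for i in range(n) for j in range(i + 1, n)]


def mantel(a, b, n_perm=999, seed=0):
    """One-sided Mantel test: permute the allele labels of `a`, keep `b` fixed."""
    rng = random.Random(seed)
    identity = list(range(len(a)))
    b_vec = upper_triangle(b, identity)
    r_obs = pearson(upper_triangle(a, identity), b_vec)
    hits = 0
    for _ in range(n_perm):
        perm = identity[:]
        rng.shuffle(perm)
        if pearson(upper_triangle(a, perm), b_vec) >= r_obs:
            hits += 1
    # The observed arrangement counts as one of the permutations.
    return r_obs, (hits + 1) / (n_perm + 1)


# Toy symmetric matrices for five hypothetical alleles (not data from the study).
points = [0.0, 1.0, 2.0, 4.0, 7.0]
dist_a = [[abs(p - q) for q in points] for p in points]
dist_b = [[2.0 * d + 1.0 if i != j else 0.0 for j, d in enumerate(row)]
          for i, row in enumerate(dist_a)]
r_obs, p_value = mantel(dist_a, dist_b)
```

Permuting allele labels, rather than individual matrix entries, preserves the dependence structure among pairs that share an allele.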

The above pairwise comparison approach suffers from the limitation that allele pairs with ω > 5 were

treated as missing data, possibly underestimating ω for recently diverged alleles. This prompted us to use a

phylogenetic model to contrast alleles at different levels of differentiation, which is more robust to the effects

of low differentiation between specific allele pairs. We compared a branch model that estimates a single ω for

all branches to one that estimates two values of ω (inter versus intra-lineage; terminal versus internal; see

Figure 3). For all loci we found higher ωARS for inter-lineage branches than for intra-lineage branches, although

significance was not attained for these tests (Table 2). For the contrast between internal and terminal branches,

we found higher ωARS for internal branches at all loci and this result was statistically significant for HLA-C

(Table 2).
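
The comparison between the one-ratio and two-ratio branch models is a likelihood ratio test (LRT) between nested models. As a rough sketch, assuming that the 2Δl values in Table 2 are the test statistics and that the two-ratio model adds a single free parameter (so the statistic is compared to a chi-square distribution with 1 degree of freedom), approximate p-values can be obtained as follows. The helper name is hypothetical, and the degrees of freedom are an assumption that is consistent with the 5% significance marks in Tables 2 and 3.

```python
from math import erfc, sqrt


def lrt_pvalue_1df(stat: float) -> float:
    """Upper-tail probability of a chi-square variable with 1 degree of freedom.

    For 1 df, P(X > x) = P(|Z| > sqrt(x)) = erfc(sqrt(x / 2)).
    """
    return erfc(sqrt(stat / 2.0))


# 2Δl values for the internal versus terminal contrast, ARS codons (Table 2).
lrt_internal_vs_terminal = {"HLA-A": 0.49, "HLA-B": 0.97, "HLA-C": 7.36}

for locus, stat in lrt_internal_vs_terminal.items():
    p = lrt_pvalue_1df(stat)
    flag = "significant at 5%" if p < 0.05 else "not significant"
    print(f"{locus}: 2Δl = {stat:.2f}, p ≈ {p:.4f} ({flag})")
```

Under this assumption only the HLA-C contrast exceeds the 5% critical value (about 3.84), in agreement with the asterisk in Table 2.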

Our results show that both pairwise comparisons and branch models indicate a heterogeneity of ω throughout

the diversification of HLA alleles, with higher ω values associated with contrasts between more divergent alleles

(pairwise approach) or to branches connecting different lineages or that are internal to the phylogeny (BM

approach), although the difference was not significant for the "intra-inter" contrasts.

3.3 Significantly more inter-lineage nonsynonymous changes at ARS codons

In this study we estimate ω for allele pairs or branches sampled within a single species, and over varying

timescales. Both of these features can bias the estimation of ω, as we now discuss.

Kryazhimskiy and Plotkin (2008) used analytical and simulation approaches to show that under positive

selection the behavior of ω within a single population is not a monotonic function of the intensity of selection, so

that ω within a population can be low, even under positive selection. This occurs because, when an advantageous

nonsynonymous variant is fixed in a population, nonsynonymous variation can be decreased due to the

homogeneity generated by the selective sweep. However, this scenario clearly does not apply to HLA genes,

where balancing selection maintains multiple nonsynonymous polymorphisms simultaneously segregating within

a population, contributing to ω > 1.

Another challenge to the interpretation of ω arises from the fact that many studies have shown that

genes under purifying selection show surprisingly high ω (often close to 1) when samples with short divergence

times are analyzed (e.g., those from a single population or species). For example, Rocha et al (2006) showed

that dN/dS between two samples is negatively correlated with their divergence times, and exemplified these

predictions with bacterial genomes. Likewise, a decrease of dN/dS with divergence time has been described in

Wolf et al (2009), but considering a much deeper timescale. Kryazhimskiy and Plotkin (2008) demonstrated

that this pattern is expected even under a regime of purifying selection that is constant over time. Thus, it

is plausible that the recent divergence times among alleles within HLA allelic lineages could result in inflated

intra-lineage ω values, explaining the modest differences between intra and inter-lineage ω values seen in the

phylogenetic analyses (Tables 2 and 3). To explore this issue further, we used non-ARS codons as an internal

control for this putative build-up of dN/dS at recent timescales, and to do so we compared their patterns

of variation to those of ARS codons. We found that non-ARS codons have larger intra-lineage ω values than

inter-lineage values, and also higher ω for terminal than internal branches (p < 0.05 for HLA-A in the intra

versus inter-lineage contrast, and for HLA-A and HLA-C in the tips versus internal contrast; LRT; Table

3). This distribution of ω values is in the exact opposite direction to that observed for the ARS (Table 2),

consistent with an effect of short divergence times inflating the estimates of ω (Kryazhimskiy and Plotkin

2008).

In order to formally test whether ARS and non-ARS codons have a different distribution of synonymous

and nonsynonymous changes intra and inter-lineages (or for internal and terminal branches) we employed a

contingency table approach similar to that of Templeton (1996). We used the inferred number of synonymous

(S) and nonsynonymous (N) changes on each branch of the allelic phylogeny for each locus to estimate the

total number of each type of change in a specific class of branches (see Figure 3 for a schematic representation

of the branch labeling). The odds ratio was defined as presented in the Methods. For all loci, we find that

OR > 1 for non-ARS codons (proportionally more nonsynonymous changes on the intra-lineage branches) and OR < 1

for ARS codons (proportionally more nonsynonymous changes on the inter-lineage branches), as shown in

Table 5. This finding is consistent with the maximum likelihood estimates of ω for branches (Tables 2 and 3),

and the increased pairwise ω inter-lineage, relative to intra-lineage (Figure 2). To test for differences between

ARS and non-ARS codons, we pooled the contingency tables of all loci (because several cells

for individual loci had low counts) and rejected the null hypothesis that contingency tables from ARS and

non-ARS codons have the same OR (p-value = 0.0069; Breslow-Day test). Our analysis comparing internal

and terminal branches showed the same pattern, with proportionally more nonsynonymous changes on internal

branches for ARS codons (p-value = 0.00013; Breslow-Day test; Table 5).
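
The contingency-table logic can be illustrated with the pooled intra- versus inter-lineage counts from Table 5. The sketch below makes two assumptions that should be read with care. First, the odds ratio is taken to be (N intra / S intra) / (N inter / S inter), which matches the direction described in the text but is not copied from the Methods. Second, homogeneity of the two odds ratios is checked with Woolf's test on the log odds ratios, a simpler stand-in for the Breslow-Day test used in the study, so the resulting p-value is only expected to be of similar magnitude. Function names are hypothetical.

```python
from math import erfc, log, sqrt


def odds_ratio(n_a, s_a, n_b, s_b):
    """Odds ratio (N_a / S_a) / (N_b / S_b) for two branch classes a and b."""
    return (n_a / s_a) / (n_b / s_b)


def log_or_variance(n_a, s_a, n_b, s_b):
    """Large-sample variance of the log odds ratio (sum of reciprocal counts)."""
    return 1.0 / n_a + 1.0 / s_a + 1.0 / n_b + 1.0 / s_b


def woolf_two_strata(table_1, table_2):
    """Woolf's homogeneity test for two 2x2 tables (chi-square with 1 df)."""
    diff = log(odds_ratio(*table_1)) - log(odds_ratio(*table_2))
    stat = diff ** 2 / (log_or_variance(*table_1) + log_or_variance(*table_2))
    return stat, erfc(sqrt(stat / 2.0))


# Pooled counts from Table 5: (N intra, S intra, N inter, S inter).
non_ars = (118.8, 39.4, 148.0, 106.0)
ars = (172.7, 18.5, 230.7, 17.5)

or_non_ars = odds_ratio(*non_ars)  # close to the reported OR = 2.15
or_ars = odds_ratio(*ars)          # close to the reported OR = 0.71
stat, p_value = woolf_two_strata(non_ars, ars)
```

With these counts the two odds ratios fall on opposite sides of 1, and the homogeneity test rejects equality at the 1% level, in line with the Breslow-Day result reported above.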

In summary, although there is evidence for an excess of nonsynonymous changes on inter-lineage (or

internal) branches for ARS codons, there is also an enrichment for intra-lineage nonsynonymous changes at

ARS codons when compared to non-ARS codons (P < 0.001; Fisher’s exact test). Next, we discuss possible

biases in the data set which could lead to these results.

3.4 Comparing dN/dS results with 1000 Genomes variation

Our analyses are based on allele sequences available in the IMGT/HLA database, a curated resource

to which newly discovered alleles are contributed. This data set is likely to be biased with respect to population

frequencies: because researchers are encouraged to submit every newly discovered allele to IMGT, very rare HLA

alleles probably represent a disproportionately larger fraction of it than of true population samples. We

therefore investigated whether this bias influenced our findings. Specifically, we were concerned that the enrichment

for rare variants could result in an inflation of weakly deleterious nonsynonymous variants for recent divergence,

a well documented population genetic signature (Henn et al 2015). This signature could create an artificially

inflated value of ω for intra-lineage variability.

We found that only a subset of variable positions present in our IMGT-derived datasets are present in the

1000 Genomes Phase I low coverage data (Tables 6, 7, and 8 for HLA-A, HLA-B, and HLA-C, respectively).

This is in accordance with the greater degree of sampling of rare variants in the IMGT data set.

We next divided positions into two groups: those which are only variable within a single lineage (’INTRA’),

and those variable in more than one lineage (’INTER’). For comparison, a third group, which consists of all

variable sites (’OVERALL’), was also defined. We found that, considering all INTRA and INTER positions

present in the 1000G data, there is no significant difference in minor allele frequency (MAF) between the two

categories (Tables 6, 7, and 8). Furthermore, when we classify the 1000G HLA SNPs into low (MAF ≤ 0.1)

and high frequency (MAF > 0.1), we do not see an enrichment for low frequency variants within the ’INTRA’

set of SNPs when compared to the ’INTER’ set (Wilcoxon test, not shown).
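
As a complementary check on the tabulated counts, the low- and high-frequency classes in Tables 6, 7 and 8 can be compared between the INTRA and INTER sets with a 2×2 test. The sketch below uses a two-sided Fisher's exact test, which is a swapped-in alternative and not the Wilcoxon test the study reports on the MAF values themselves; the function name is hypothetical and the counts are copied from the tables.

```python
from math import comb


def fisher_exact_two_sided(a: int, b: int, c: int, d: int) -> float:
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    Sums the hypergeometric probabilities of all tables that are no more
    probable than the observed one, with row and column totals fixed.
    """
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(k: int) -> float:
        return comb(row1, k) * comb(row2, col1 - k) / denom

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Small relative tolerance guards against floating-point ties.
    return sum(prob(k) for k in range(lo, hi + 1) if prob(k) <= p_obs * (1 + 1e-9))


# Counts from Tables 6-8: (INTRA low, INTRA high, INTER low, INTER high),
# where low is MAF <= 0.1 and high is MAF > 0.1.
counts = {
    "HLA-A": (5, 24, 12, 43),
    "HLA-B": (6, 18, 8, 30),
    "HLA-C": (8, 19, 19, 36),
}
p_values = {locus: fisher_exact_two_sided(*table) for locus, table in counts.items()}
```

On these counts none of the three loci shows a significant excess of low-frequency SNPs in the INTRA set at the 5% level, which agrees qualitatively with the conclusion stated in the text.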

These results reassure us that the intra-lineage variation we observe is not biased in the direction of

extremely rare variants, and that our observation that there is evidence for stronger intra-lineage balancing

selection for ARS codons than for neutrally evolving regions (non-ARS) is not a spurious result driven by an

enrichment for low-frequency SNPs.

4 Discussion

Our study documents a positive correlation between dN/dS values and the degree of divergence between

allele pairs. This result is supported by phylogenetic analyses, which show higher ω values for branches

connecting different lineages, or branches which are internal to the phylogeny. A heterogeneous nonsynonymous

substitution rate (dN ) for HLA genes was also reported in a study which found that dN for ARS codons is

not linearly correlated with divergence time in classical HLA loci (Yasukochi and Satta 2014). By further

investigating the temporal dynamics in the DRB1 gene, these authors showed that this rate heterogeneity

is likely the consequence of a reduction in the substitution rates in specific allelic lineages, possibly as a

consequence of continuous selective pressure by a specific pathogen. In the present study our goal was to

explicitly test for heterogeneity in the ω ratios over a priori defined groups of alleles (the HLA allelic lineages)

and for timescales of divergence (low and high divergence). As was the case with the study of Yasukochi

and Satta (2014), we find heterogeneity in the intensity of selection, in our case with evidence of increased

selection at deeper timescales than at more recent ones, and for greater selection on inter-lineage branches of

the allelic phylogeny, with respect to intra-lineage branches. Our findings indicate that long-term balancing

selection has resulted in an enrichment for adaptive changes between allelic lineages for HLA class I genes,

with proportionally weaker signatures of molecular adaptation for recent branches (terminal and intra-lineage)

than for the inter-lineage and internal branches.

Although previous studies have shown that low divergence is often associated with inflated ω estimates (Rocha

et al, 2006), the phylogenetic analyses carried out in the present work relied on non-ARS codons as a control

to show that low divergence times of intra-lineage contrasts do not explain the ω > 1 values within lineages,

at ARS codons. Thus, while we show that inter-lineage selection is stronger than intra-lineage selection, our

results also demonstrate that intra-lineage variation bears a signature of balancing selection.

Recently, several papers have drawn attention to the effects of divergence times on dN/dS estimation (e.g.

Wolf et al 2009; Stoletzki and Eyre-Walker 2011), and the complexities of interpreting these values when data

are drawn from a single population (Rocha et al 2006; Kryazhimskiy and Plotkin 2008). Our finding of increased

ωARS among more divergent alleles (or for inter-lineage branches) is conservative in light of these findings, which

predict decreased ω for more divergent alleles. We accounted for this effect by using non-ARS codons, which

have a similar phylogenetic structure to that of ARS codons (after removal of recombinants) to control for

the background inflation of ω in recently diverged alleles, and found that ARS codons have a very different

distribution of ω, with increased inter-lineage evidence for selection, exactly the opposite of what is seen for

non-ARS codons.

An important caveat to this interpretation is that the temporal dynamics of dN/dS appears to be sensitive

to the selective regime which is assumed to be operating. Thus, while several authors have shown that, under

purifying selection, increased dN/dS at low divergence is expected, positive selection can produce a positive

correlation with divergence times (Dos Reis and Yang 2013; Mugal et al 2014), which could account for part of

the results we describe in this study. However, the case of directional positive selection, involving the sequential

substitution of adaptive mutations, is markedly different from the dynamics of a balanced polymorphism, as

is the case for HLA genes.

Assuming that balancing selection has been the main selective regime shaping the molecular evolution

of HLA genes, and that heterozygote advantage is one of the mechanisms (even if not the only one) through

which selection has acted upon this system, our finding that inter-lineage ωARS is greater than intra-lineage

is consistent with the divergent allele advantage model, according to which heterozygotes for more divergent

alleles have higher fitness than those carrying similar alleles (Wakeland et al 1990). Under this model, an excess

of inter-lineage nonsynonymous changes in HLA genes would be expected, which is the result we have shown for

the ARS data set. This model has been shown to explain patterns of variation in the DRB locus in Galapagos

sea lions, where fitness is directly and positively influenced by local allelic divergence at this locus (Lenz et al 2013),

rather than by mere heterozygosity or the number of alleles at the MHC locus. Most likely, several selective regimes have

shaped the evolutionary history of MHC genes, as suggested by previous observations, and our contribution

suggests that these selective regimes could be operating alongside divergent allele advantage.

Our results suggest that groups of functionally related alleles (in our analysis, the allelic lineages) should

be regarded as important targets of selection, rather than individual alleles. In line with our observations,

it has been proposed that HLA supertypes - groups of alleles sharing chemical properties at the B and F

pockets of the ARS region (Sidney et al 1996) - constitute the level of variation that is the primary target of

natural selection in HLA-B genes (Francisco et al 2015). Since there is a high overlap between allelic lineage

and supertype classifications (Sidney et al 1996), our results indicate that attempts to understand how natural

selection acts on HLA variation would benefit from comparing the effects of selection at the allelic, allelic lineage or

supertype levels of variation.

Electronic Supplementary Material

Supporting tables are available as an additional file.

Competing Interests

The authors declare that they have no competing interests.

Authors’ Contributions

BDB participated in the design of the study, performed analyses, discussed results and drafted the

manuscript. RSF performed analyses and participated in discussions. DM conceived of the study, participated in its design

and discussion and in the drafting of the manuscript. All authors read and approved the final manuscript.

Acknowledgements The authors thank Kelly Nunes for thoughtful comments on the manuscript, Richard Single for
comments on the statistical aspects of this work, Aida M. Andrés for general comments and Débora Y. C. Brandt for help
with the 1000 Genomes data sets. This work was supported by the São Paulo Research Foundation (grants #2008/09127-8
and #2011/12500-2 to BDB; #08/56502-6 to DM) and Conselho Nacional de Desenvolvimento Científico e Tecnológico
(#152676/2011-2 to BDB, #142130/2009-5 to RSF and #308960/2009-2 to DM). The final publication is available at
Springer via http://dx.doi.org/10.1007/s00239-015-9713-9

Data available in public repositories

https://github.com/bbitarello/dNdS-hla-allelic-lineages

References

Albrechtsen A, Moltke I, Nielsen R (2010) Natural selection and the distribution of identity-by-descent in the

human genome. Genetics 186(1):295–308

Anisimova M, Nielsen R, Yang Z (2003) Effect of recombination on the accuracy of the likelihood method for

detecting positive selection at amino acid sites. Genetics 164(3):1229–36

Apps R, Qi Y, Carlson JM, Chen H, Gao X, Thomas R, Yuki Y, Del Prete GQ, Goulder P, Brumme ZL,

Brumme CJ, John M, Mallal S, Nelson G, Bosch R, Heckerman D, Stein JL, Soderberg KA, Moody MA,

Denny TN, Zeng X, Fang J, Moffett A, Lifson JD, Goedert JJ, Buchbinder S, Kirk GD, Fellay J, McLaren

P, Deeks SG, Pereyra F, Walker B, Michael NL, Weintrob A, Wolinsky S, Liao W, Carrington M (2013)

Influence of HLA-C expression level on HIV control. Science 340(6128):87–91

Bjorkman PJ, Saper MA, Samraoui B, Bennett WS, Strominger JL, Wiley DC (1987) Structure of the human

class I histocompatibility antigen, HLA-A2. Nature 329(6139):506–12

Chelvanayagam G (1996) A roadmap for HLA-A, HLA-B, and HLA-C peptide binding specificities.

Immunogenetics 45(1):15–26

Dean M, Carrington M, O’Brien SJ (2002) Balanced polymorphism selected by genetic versus infectious human

disease. Annu Rev Genomics Hum Genet 3:263–92

Doherty PC, Zinkernagel RM (1975) Enhanced immunological surveillance in mice heterozygous at the H-2

gene complex. Nature 256(5512):50–52

Dos Reis M, Yang Z (2013) Why do more divergent sequences produce smaller nonsynonymous/synonymous

rate ratios in pairwise sequence comparisons? Genetics 195(1):195–204

Felsenstein J (1989) PHYLIP - Phylogeny Inference Package (Version 3.2). Cladistics 5:164–166

Francisco RS, Buhler S, Nunes JM, Bitarello BD, França GS, Meyer D, Sanchez-Mazas A (2015) HLA supertype

variation in human populations: new insights about the role of natural selection on the evolution of HLA-A

and HLA-B polymorphisms. Immunogenetics, DOI 10.1007/s00251-015-0875-9.

Garrigan D, Hedrick PW (2003) Detecting adaptive molecular polymorphism: Lessons from the MHC.

Evolution (N Y) 57(8):1707–1722

Goldman N, Yang Z (1994) A codon-based model of nucleotide substitution for protein-coding DNA sequences.

Mol Biol Evol 11(5):725–736

Harris E, Meyer F (2006) The Molecular Signature of Selection Underlying Human Adaptations. Yearb Phys

Anthropol 130:89-130

Hedrick PW (2002) Pathogen resistance and genetic variation at MHC loci. Evolution (N Y) 56(10):1902–1908

Hedrick PW, Thomson G (1983) Evidence for balancing selection at HLA. Genetics 104(3):449–56

Henn B, Botigué LR, Bustamante C, Clark AG, Gravel S (2015) Estimating the mutation load in human

genomes. Nat Rev Genet 16:333–343

Hilton HG, Guethlein LA, Goyos A, Nemat-Gorgani N, Bushnell DA, Norman PJ, Parham P (2015)

Polymorphic HLA-C Receptors Balance the Functional Characteristics of KIR Haplotypes. J Immunol

195:3160–3170

Hughes AL, Nei M (1988) Pattern of nucleotide substitution at major histocompatibility complex class I loci

reveals overdominant selection. Nature 335(6186):167–170

Hughes AL, Nei M (1989) Nucleotide substitution at major histocompatibility complex class II loci: evidence

for overdominant selection. Proc Natl Acad Sci U S A 86(3):958–962

Hughes AL, Yeager M (1998) Natural selection at major histocompatibility complex of vertebrates. Annu Rev

Genet pp 415–435

Huttley G, Smith MW, Carrington M, O’Brien S (1999) A scan for linkage disequilibrium across the human

genome. Genetics 152(4):1711–1722

Klein J, Sato A (2000) The HLA system. First of two parts. N Engl J Med 343(10):702–709

Kryazhimskiy S, Plotkin JB (2008) The Population Genetics of dN/dS. PLoS Genet 4(12):10

Lenz T, Mueller B, Trillmich F, Wolf JBW (2013) Divergent allele advantage at MHC-DRB through direct

and maternal genotypic effects and its consequences for allele pool composition and mating. Proc R Soc B

280: 20130714

Martin DP, Lemey P, Lott M, Moulton V, Posada D, Lefeuvre P (2010) RDP3: a flexible and fast computer

program for analyzing recombination. Bioinformatics 26(19):2462–3

Meyer D, Thomson G (2001) How selection shapes variation of the human major histocompatibility complex:

a review. Ann Hum Genet 65(1):1–26

Mugal CF, Wolf JBW, Kaj I (2014) Why time matters: codon evolution and the temporal dynamics of dN/dS.

Mol Biol Evol 31(1):212–31

Penn DJ, Damjanovich K, Potts WK (2002) MHC heterozygosity confers a selective advantage against

multiple-strain infections. Proc Natl Acad Sci U S A 99(17):11260–4

Pond SLK, Frost SDW, Muse SV (2005) HyPhy: hypothesis testing using phylogenies. Bioinformatics

21(5):676–679

Prugnolle F, Manica A, Charpentier M, Guégan JF, Guernier V, Balloux F (2005) Pathogen-driven selection

and worldwide HLA class I diversity. Curr Biol 15(11):1022–7

Richman A (2000) Evolution of balanced genetic polymorphism. Mol Ecol 9(12):1953–63

Robinson J, Halliwell JA, McWilliam H, Lopez R, Parham P, Marsh SGE (2013) The IMGT/HLA database.

Nucleic Acids Res 41(Database issue):D1222–7

Rocha EPC, Smith JM, Hurst LD, Holden MTG, Cooper JE, Smith NH, Feil EJ (2006) Comparisons of dN/dS

are time dependent for closely related bacterial genomes. J Theor Biol 239(2):226–235

Saitou N, Nei M (1987) The neighbor-joining method: A new method for reconstructing phylogenetic trees.

Mol Biol Evol 4:406–425

Sidney J, Grey HM, Kubo RT, Sette A (1996) Practical, biochemical and evolutionary implications of the

discovery of HLA class I supermotifs. Immunol Today 17(6):261–6

Single RM, Martin MP, Gao X, Meyer D, Yeager M, Kidd JR, Kidd K, Carrington M (2007) Global diversity

and evidence for coevolution of KIR and HLA. Nat Genet 39(9):1114–1119

Slade R, McCallum H (1992) Overdominant vs. frequency-dependent selection at MHC loci. Genetics

132:861–864

Spurgin LG, Richardson DS (2010) How pathogens drive genetic diversity: MHC, mechanisms and

misunderstandings. Proc Biol Sci 277(1684):979–88

Stoletzki N, Eyre-Walker A (2011) The positive correlation between dN/dS and dS in mammals is due to runs

of adjacent substitutions. Mol Biol Evol 28(4):1371–1380

Takahata N, Nei M (1990) Allelic Genealogy Under Overdominant and Frequency-Dependent Selection and

Polymorphism of Major Histocompatibility Complex Loci. Genetics 124(4):967–978

Takahata N, Satta Y (1998) Footprints of intragenic recombination at HLA loci. Immunogenetics 47(6):430–441

Templeton AR (1996) Contingency tests of neutrality using intra/interspecific gene trees: the rejection of

neutrality for the evolution of the mitochondrial Cytochrome Oxidase II gene in the hominoid primates.

Genetics 144(3):1263–1270

The 1000 Genomes Project Consortium (2012) An integrated map of genetic variation from 1,092 human

genomes. Nature 491:56–65

Wakeland EK, Boehme S, She JX, Lu CC, McIndoe RA, Cheng I, Ye Y, Potts WK (1990) Ancestral

Polymorphisms of MHC Class II Genes: Divergent Allele Advantage. Immunol Res 9:115–122

Wolf JBW, Künstner A, Nam K, Jakobsson M, Ellegren H (2009) Nonlinear dynamics of nonsynonymous (dN)

and synonymous (dS) substitution rates affects inference of selection. Genome Biol Evol 1:308–319

Yang Z (2006) Computational molecular evolution. Oxford University Press, Oxford

Yang Z (2007) PAML 4: Phylogenetic Analysis by Maximum Likelihood. Mol Biol Evol 24(8):1586–1591

Yang Z, Swanson WJ (2002) Codon-Substitution Models to Detect Adaptive Evolution that Account for

Heterogeneous Selective Pressures Among Site Classes. Mol Biol Evol 19(1):49–57

Yang Z, Wong WSW, Nielsen R (2005) Bayes empirical bayes inference of amino acid sites under positive

selection. Mol Biol Evol 22(4):1107–1118

Yasukochi Y, Satta Y (2014) Nonsynonymous Substitution Rate Heterogeneity in the Peptide-Binding Region

Among Different HLA-DRB1 Lineages in Humans. G3 (Bethesda)

Locus   All alleles (a)   SM (R/NR) (b)   Pairwise (R/NR) (c)   BM pruned data set (d)   Codons: Total   Non-ARS   ARS
HLA-A   1193              144/107         138/104               93                       340             292       48
HLA-B   1799              233/78          173/71                63                       324             276       48
HLA-C   829               133/109         125/110               105                      341             293       48
Table 1 Number of alleles and codons for different data sets. a, all available alleles in release 3.1.0 (2010-07-15),
including possible recombinants; b, SM, data set used for site models, i.e., after selection of alleles with complete coding
sequences; c, R/NR, data sets with and without recombinants; d, BM (branch models) pruned data set is the NR data set
after pruning of alleles which do not cluster within their respective allelic lineages (see Methods)

Locus   ω (a)   ω inter (b)   ω intra (c)   2Δl (d)   ω internal (e)   ω terminal (f)   2Δl
HLA-A   1.84    2.03          1.68          0.06      2.35             1.39             0.49
HLA-B   0.99    1.16          0.73          0.71      1.2              0.69             0.97
HLA-C   1.89    4.14          1.19          2.61      4.91             0.95             7.36*
Table 2 Branch model dN/dS estimations and LRT results (ARS data sets). * significance at 5%; Data sets after removal
of recombinants (NR); a, ω estimate under model 0 (one for all branches); b, ω inter lineages; c, ω intra lineages; d, negative
log-likelihood difference between two nested models; e, ω for internal branches; f, ω for terminal branches

Locus   ω (a)   ω inter (b)   ω intra (c)   2Δl (d)   ω int (e)   ω ter (f)   2Δl
HLA-A   0.53    0.40          0.77          2.8       0.39        0.95        4.57*
HLA-B   0.42    0.40          0.55          0.34      0.39        0.66        0.86
HLA-C   0.50    0.39          0.79          3.97*     0.38        0.92        5.27*
Table 3 Branch model dN/dS estimations and LRT results (non-ARS data set). * significance at 5%; Data sets after
removal of recombinants (NR); a, ω estimate under model 0 (one for all branches); b, ω inter lineages; c, ω intra lineages;
d, negative log-likelihood difference between two nested models; e, ω for internal branches; f, ω for terminal branches

                      non-ARS                                        ARS
Locus  Quantile (a)   dN        dS    ω (b)  dN/dS  dN > dS (d)     dN    dS    ω     dN/dS  dN > dS
HLA-A                 0.02 (c)  0.05  0.35   0.35   628 (6.64%)     0.12  0.07  1.36  1.74   7364 (77.90%)
       1              0.00      0.01  0.35   0.42   628             0.05  0.04  1.08  1.41   2132
       2              0.02      0.05  0.398  0.397  0               0.12  0.06  1.47  1.94   2347
       3              0.02      0.06  0.37   0.37   0               0.14  0.09  1.34  1.55   2316
       4              0.02      0.08  0.29   0.29   0               0.15  0.08  1.50  1.97   2339
HLA-B                 0.01      0.04  0.33   0.30   470 (3.16%)     0.14  0.11  1.33  1.26   9908 (66.59%)
       1              0.01      0.02  0.46   0.46   470             0.10  0.09  1.17  1.08   2405
       2              0.01      0.03  0.35   0.35   0               0.15  0.12  1.25  1.21   2460
       3              0.01      0.05  0.27   0.27   0               0.15  0.13  1.28  1.18   2229
       4              0.02      0.06  0.25   0.25   0               0.17  0.11  1.58  1.59   2814
HLA-C                 0.02      0.05  0.38   0.37   474 (6.12%)     0.07  0.02  1.22  3.04   6514 (84.05%)
       1              0.00      0.01  0.44   0.46   474             0.04  0.02  0.99  1.71   1303
       2              0.01      0.04  0.31   0.31   0               0.07  0.02  1.04  3.52   1791
       3              0.02      0.06  0.41   0.41   0               0.08  0.02  1.63  3.95   1810
       4              0.03      0.08  0.37   0.37   0               0.09  0.03  1.55  3.35   1682

Table 4 Pairwise estimations for substitution rates (data sets prior to the removal of recombinants). a, quantiles of
divergence (dSnon-ARS); b, average pairwise dN/dS; c, the locus rows (bold in the original) give the average pairwise values for each locus; d, percentages
correspond to the proportion of pairs for which dN > dS in relation to the total number of pairwise comparisons

Data set Substitution Branch category


intra inter terminal internal
N 118.8 148 89.3 158.5
non-ARS S 39.4 106 24.9 115.1
OR = 2.15 OR = 2.96

N 172.7 230.7 144.3 291.2


S 18.5 17.5 17.4 17.7
ARS
OR = 0.71 OR = 0.21

p = 6.9 × 10−3 ∗b p = 1.3 × 10−4 ∗


Table 5 Distribution of changes for ARS and non-ARS codons. Counts correspond to the total (combined) values for
HLA-A, -B and -C; *significant at 1%; N, nonsynonymous change; S, synonymous change; intra, intra lineage; inter, inter
lineage; terminal, terminal branches; internal, internal branches
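
For readers unfamiliar with odds ratios on N/S counts, the following is a minimal sketch of the conventional cross-product form, (N1/S1)/(N2/S2), applied to two branch categories. It is an assumption that this textbook definition is a useful reading aid; the procedure actually used to obtain the OR values and p-values in Table 5 is the one described in Methods, and values computed this way need not coincide with every entry in the table. The function name is hypothetical.

```python
def odds_ratio(n1: float, s1: float, n2: float, s2: float) -> float:
    """Cross-product odds ratio of nonsynonymous (N) to synonymous (S)
    changes in category 1 relative to category 2.

    Counts may be non-integer, since inferred changes are combined
    across loci.
    """
    if min(s1, n2, s2) <= 0:
        raise ValueError("S1, N2 and S2 must be positive")
    return (n1 / s1) / (n2 / s2)


if __name__ == "__main__":
    # Intra vs inter lineage counts as listed in Table 5.
    print(round(odds_ratio(118.8, 39.4, 148.0, 106.0), 2))   # non-ARS
    print(round(odds_ratio(172.7, 18.5, 230.7, 17.5), 2))    # ARS
```

An odds ratio above 1 indicates a relative excess of nonsynonymous changes in the first category (here, intra-lineage branches), and a value below 1 indicates a relative excess in the second.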

Set of SNPs  Var.Pos  Var.Pos.1000g  MAF <= 0.1  MAF > 0.1  Mean MAF
Intra        68       29             5           24         0.15
Inter        88       55             12          43         0.14
Overall      156      84             17          67
Table 6 HLA-A: MAFs for SNPs in the 1000 Genomes dataset. Overall, set of variable positions considering all sequences
in the site models dataset after removal of recombinants. Intra, subset of the ’Overall’ set which is variable only within one
allelic lineage for the locus. Inter, subset of the ’Overall’ set which is variable within more than one allelic lineage. Var.Pos,
set of all variable positions in the site models dataset. Var.Pos.1000g, subset of Var.Pos which is a SNP in the 1000G low
coverage Phase I data. MAF, minor allele frequency. For details, see Methods.
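
The binning used in the MAF columns can be illustrated with a minimal sketch. It assumes that each SNP is biallelic and is described by the frequency of one of its alleles, so that the minor allele frequency is the smaller of that frequency and its complement. The function names and the example frequencies are hypothetical and are not taken from the 1000 Genomes data.

```python
from statistics import mean


def minor_allele_frequency(allele_freq: float) -> float:
    """MAF of a biallelic SNP given the frequency of either allele."""
    if not 0.0 <= allele_freq <= 1.0:
        raise ValueError("allele frequency must lie in [0, 1]")
    return min(allele_freq, 1.0 - allele_freq)


def summarize_mafs(allele_freqs, threshold=0.1):
    """Count SNPs at or below / above the MAF threshold and average the MAFs."""
    mafs = [minor_allele_frequency(f) for f in allele_freqs]
    n_low = sum(1 for m in mafs if m <= threshold)
    return {
        "n_snps": len(mafs),
        "maf_le_threshold": n_low,
        "maf_gt_threshold": len(mafs) - n_low,
        "mean_maf": mean(mafs),
    }


if __name__ == "__main__":
    # Hypothetical allele frequencies for four SNPs.
    print(summarize_mafs([0.05, 0.95, 0.30, 0.60]))
```

Applying such a summary separately to the Intra and Inter subsets of variable positions gives one row of the table per subset.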

Set of SNPs  Var.Pos  Var.Pos.1000g  MAF <= 0.1  MAF > 0.1  Mean MAF
Intra        44       24             6           18         0.30
Inter        59       38             8           30         0.39
Overall      103      62             14          48
Table 7 HLA-B: MAFs for SNPs in the 1000 Genomes dataset. Overall, set
of variable positions considering all sequences in the site models dataset after removal of recombinants. Intra, subset of the
’Overall’ set which is variable only within one allelic lineage for the locus. Inter, subset of the ’Overall’ set which is variable
within more than one allelic lineage. Var.Pos, set of all variable positions in the site models dataset. Var.Pos.1000g, subset
of Var.Pos which is a SNP in the 1000G low coverage Phase I data. MAF, minor allele frequency. For details, see Methods.

Set of SNPs  Var.Pos  Var.Pos.1000g  MAF <= 0.1  MAF > 0.1  Mean MAF
Intra        78       27             8           19         0.26
Inter        68       55             19          36         0.24
Overall      146      82             27          55
Table 8 HLA-C: MAFs for SNPs in the 1000 Genomes dataset. Overall, set
of variable positions considering all sequences in the site models dataset after removal of recombinants. Intra, subset of the
’Overall’ set which is variable only within one allelic lineage for the locus. Inter, subset of the ’Overall’ set which is variable
within more than one allelic lineage. Var.Pos, set of all variable positions in the site models dataset. Var.Pos.1000g, subset
of Var.Pos which is a SNP in the 1000G low coverage Phase I data. MAF, minor allele frequency. For details, see Methods.


Fig. 1 Overlap between two ARS classifications and two site models studies. BJOR and CHEV are ARS classifications
(Bjorkman et al 1987; Chelvanayagam 1996); YANG is a list of codons with significant ω > 1 in HLA genes; BIT is the set of
codons with ω > 1 from our SM (site models) approach (see Materials and Methods for details)


Fig. 2 Pairwise estimates for intra-lineage and inter-lineage pairs of alleles. These results refer to ARS data sets prior to
the removal of recombinants, for pairwise analyses; green, inter-lineage; purple, intra-lineage; gray, non-ARS; * significant
difference between ω̄ (intra) and ω̄ (inter) (p < 0.001, Wilcoxon rank sum test)


Fig. 3 Schematic representation of the allelic phylogenies used in the branch models approach. Left: terminal vs internal
branches; right: intra-lineage vs inter-lineage. For the branch models approach, we labeled branches of each tree (HLA-A,
-B and -C) as “intra/inter” or “terminal/internal” and ran model 2 (CODEML), which allows for two independent ω values
to be estimated, according to these labels
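
As an illustration of the labeling step, the sketch below marks every terminal branch of a Newick tree with the `#1` branch label that PAML tree files use to assign branches to a separate ω class (unlabeled branches remain in the background class). It is only a sketch under stated assumptions: the tree has no branch lengths and no quoted names, the allele names are hypothetical, and nothing here is claimed to reproduce the scripts actually used to prepare the HLA trees.

```python
import re

# A leaf name is a token that directly follows "(" or ","; internal node
# labels follow ")" and are therefore left untouched.
_LEAF = re.compile(r"(?<=[(,])\s*([^\s(),:;]+)")


def label_terminal_branches(newick: str, label: str = "#1") -> str:
    """Append a branch label to every leaf of a Newick tree string.

    Assumes a topology-only tree (no branch lengths, no quoted names).
    """
    return _LEAF.sub(lambda m: f"{m.group(1)} {label}", newick)


if __name__ == "__main__":
    # Hypothetical four-allele tree with two allelic lineages.
    tree = "((a1,a2),(b1,b2));"
    print(label_terminal_branches(tree))
```

An intra/inter labeling would follow the same idea, except that the label would be attached to the branches within each allelic lineage rather than to the leaves only.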
