Diogo Meyer
Advisor
São Paulo
2016
Bárbara Domingues Bitarello
São Paulo
2016
Catalog Card
Domingues Bitarello, Bárbara
"Seleção balanceadora no genoma humano: relevância biológica e consequências deletérias".
299 pages.
Thesis (Doctorate) - Institute of Biosciences, University of São Paulo. Department of Genetics and Evolutionary Biology.
1. Molecular Evolution;
2. Human Evolution;
3. Balancing Selection;
4. Adaptive Evolution;
5. Population Genetics;
6. Population Genomics;
7. Genetic Load
Examination Committee:
To the memory of Maria Gabriela Duarte Macêdo (Kikita).
“We speak not only to tell other people what we think, but to tell ourselves
what we think. Speech is a part of thought.”
– Oliver Sacks
Acknowledgments
To my timeless friends, whom I hardly ever see but who have cheered me on at every stage of my academic life so far: Laura Prado, Marcela Combat, Poliana Cardoso, Denise Nogueira, Luciana Matta, Ramon Vitral. I want to thank Laura Prado for listening to me and for giving me many useful tips in the final months of the doctorate.
I have had the privilege of working in a very pleasant environment (the Porão). I thank Pato (Guilherme Garcia) for giving me several tips on formatting the thesis and for sharing his LaTeX template (and Débora Brandt as well): thanks to you I was able to develop exactly what I wanted without wasting time. I thank Daniela Rossoni, Bárbara Tafinha, Ana Paula Assis and Anna Penna for their friendship. To my colleague Gustavo Franca, for giving me many tips near the end of the doctorate, not to mention the positive reinforcement. I thank Diog(R)o Melo for helping me with Linux issues. Finally, I am grateful for the opportunity to work with everyone in the groups of Professor Gabriel Marroig and Professor Tatiana Torres, as well as the other groups that take part in the Evolução no Porão meetings.
To my colleagues in the Evolutionary Genetics group: I am very grateful to all of you. It was a pleasure to work with a group as collaborative as ours and to see how much we were able to grow together. To Vitor Aguiar, for listening to me a lot (really a lot) and for always being so kind. To Jônatas, for his generosity with his scripts and for introducing me to Tia das Massas. I am grateful for how much he helped me become a better programmer and for his enormous contribution to the analyses in Chapter 2. I thank Limão for encouraging my musical hobbies and for helping to find errors in the data used in Chapter 2. To Maria Helena Maia, who was like a sister during my master's and the beginning of my doctorate.
I especially thank Kelly Nunes and Débora Brandt. Kelly, for being my wise on-call postdoc and for always conveying calm and enthusiasm when I needed it. Thank you especially for reading many parts of the thesis with care and for helping me greatly to improve it. To Débora Brandt, I am thankful for her friendship, her generosity and how much she helped me with the quality of the data I analyzed, as well as for her help correcting several passages of the thesis. Some other people I would like to thank: Caroline Lima, Carolina Malcher, Rodrigo dos Santos Francisco.
To the friends/colleagues/collaborators I made in Leipzig: Cesare de Filippo, João Teixeira, Michael Dannemann, André Strauss, Sandra Oliveira, Diana Le-Duc, Fabrizio Manfezzoni, Felix Key, Petra Korlevic. Not only did I learn from you, you also made my stay in Leipzig better. To Stéphane Peyregné, who revealed to me that working while listening to Disney soundtracks (thank you too, Disney) increases productivity. To Annalisa Schmidt, for being a very present friend during my year in Leipzig. It would have been much less fun without you there.
I would like to especially thank the researcher Aida Andrés. I spent a year with her group at the Max Planck Institute for Evolutionary Anthropology, where I learned more than I could have foreseen. I am especially grateful for the trust she placed in me from the start, and also for her calm. During this stage of the work I effectively had two advisors. That is a privilege not every doctoral student has, and I am very grateful.
I would like to thank some professors and/or members of my qualifying exam committee, who I consider to have contributed greatly to my training throughout graduate school: Tatiana T. Torres, Walter A. Neves, Gabriel Marroig, Paulo Otto. I am grateful to Professor Eduardo Tarazona, who introduced me to Diogo Meyer.
To my advisor, Diogo Meyer, thank you for helping me learn as much as possible over these seven (!) years of graduate school at USP, and for the opportunities you gave me. I arrived here without knowing much, except that I wanted to study human population genetics, and you made it possible for me to study exactly what I wanted. Completing this stage of my training is a "dream" I have nurtured since I was very young, and you were a very important person along this path.
To Ale Chris, whom I could count on absolutely every time I needed to. I am eternally grateful to have her as a friend, and for her enormous generosity. I also thank Gisele Melo for her friendship.
To Klervia Jaouen, thank you for your unwavering trust in me, for your patience and understanding, and for your willingness to help me, always, whether by listening to rehearsals of presentations or by reading what I wrote (and all of that in Portuguese, your third, and still incipient, language). Thank you for making me happy and for making me want to be a better person, always.
To my grandmother, Tê, for all the support she has always given me. To my parents, Bia and Flávio, and to my second parents, Beth and Joe: thank you for being great examples to me, all of you, and for always encouraging me to pursue this career. I thank my mother for reading and commenting on the introduction (I know it was not easy), and for being understanding and always helping me deal with academic life. Finally, I thank my sister, who was very understanding of my need for seclusion in the last few months and always kept up a comforting interest in scientific and nerdy things. To everyone I may have forgotten to thank, thank you.
Finally, I thank the Fundação de Amparo à Pesquisa do Estado de São Paulo (FAPESP) for funding my doctorate, including the period I spent in Leipzig.
Resumo
Quando aplicamos NCD2 a dados humanos, usando chimpanzé como grupo ex-
terno, encontramos mais de 200 genes codificadores de proteínas com forte assinatura
de seleção balanceadora, dos quais apenas 1/3 tinha evidência prévia de seleção ba-
lanceadora. Encontramos também um enriquecimento para diversas categorias de on-
tologia gênica, das quais cerca da metade é relacionada à imunidade. Verificamos que
dentre os genes com evidências de seleção balanceadora há um excesso de casos de ex-
pressão preferencial em tecidos tais como “adrenal” e “pulmão”, e também um excesso
de genes com expressão mono-alélica. No geral, vimos que as regiões selecionadas no
genoma humano incluem tanto sítios codificadores quanto regulatórios. Não encon-
tramos um excesso de assinaturas de seleção balanceadora em regiões regulatórias, ao
contrário do que reportaram outros estudos. Finalmente, encontramos um excesso de
polimorfismos não-sinônimos em relação aos sinônimos nos genes selecionados.
Tendo documentado a ocorrência de seleção balanceadora no genoma humano e
identificado genes que foram potencialmente alvos deste regime seletivo, nós investi-
gamos as consequências evolutivas desse processo. Nós partimos da hipótese que a
seleção balanceadora sobre um sítio reduz a eficiência com a qual a seleção purifica-
dora elimina variantes deletérias em sítios vizinhos. Esse processo é uma consequência
do quanto a seleção sobre um loco afeta, através de ligação genética, as frequências
de sítios não-neutros adjacentes. Testamos essa hipótese examinando se os genes sob
seleção balanceadora apresentam um excesso de variantes deletérias em relação a ex-
pectativas derivadas a partir do restante do genoma. Usando três diferentes métricas
para determinar se e/ou o quão deletéria é uma dada variante, identificamos um excesso
de variantes deletérias dentro dos genes sob seleção balanceadora, e mostramos
que tal padrão não pode ser atribuído a efeitos confundidores. Esse achado mostra que,
juntamente com os benefícios associados à variação adaptativa, a seleção balanceadora
aumenta o fardo de mutações deletérias no genoma humano.
De forma geral, nossos achados sugerem que a seleção balanceadora provavelmente
mantém variantes genéticas envolvidas em uma miríade de processos biológicos além
da imunidade e que ela foi mais comum no genoma humano do que se acreditava
anteriormente, afetando entre 1-8% dos genes codificadores de proteínas, bem como
diversas regiões não-codificadoras. Adicionalmente, a seleção balanceadora parece ser
importante para a evolução humana não apenas por seu efeito sobre a aptidão, mas
também por ter sido uma importante força capaz de moldar a diversidade genética
observada atualmente em humanos e a susceptibilidade a doenças.
Abstract
When applying NCD2 to human data, using chimpanzee as the outgroup, we found
more than 200 protein-coding regions with strong signatures of balancing selection,
only 1/3 of which had prior evidence for balancing selection. There was also an enrich-
ment for several gene ontology categories, approximately half of which are related to
immunity. We also found that among genes with evidence for balancing selection there
was an excess of cases of preferential expression in specific tissues, such as "adrenal"
and "lung", and an excess of genes with mono-allelic expression. Overall, we found
that selected regions of the genome include both coding and regulatory sites. We failed
to find a marked excess of balancing selection in regulatory regions, as reported in
previous studies. Finally, we found an excess of nonsynonymous versus synonymous
polymorphisms within the selected genes.
Having documented the occurrence of balancing selection in the human genome
and identified genes which were potential targets of this selective regime, we next in-
vestigated evolutionary consequences of this process. We hypothesized that balancing
selection acting on a site reduces the efficiency with which purifying selection purges
deleterious variants at nearby sites. This process is a consequence of how the dynam-
ics of selection at one locus, mediated by linkage, can interfere with the frequencies of
adjacent non-neutral sites. We tested this hypothesis by examining if the genes under
balancing selection show an excess of deleterious variants with respect to expectations
derived from the remainder of the genome. Using three different metrics to determine
deleteriousness, we identified a significant excess of deleterious variants within
balanced genes, and we show that this pattern cannot be attributed to confounding
factors. This finding shows that, together with the benefits associated with adaptive
variation, balancing selection is increasing the burden of deleterious mutations in the human
genome.
Overall, our findings suggest that balancing selection likely maintains variation in
a myriad of biological processes other than immunity and that it has been more com-
mon in the human genome than previously thought, affecting between 1-8% of human
protein-coding genes, as well as a number of non-protein coding regions. Moreover,
balancing selection appears to be important to human evolution not only because of
its influence on fitness, but also because it has been an important force shaping current
human genetic diversity and susceptibility to disease.
Contents
Prologue
General Introduction
  Balancing selection: concept, mechanisms and importance
    Why study the mechanisms that maintain genetic variability?
    Neutral theory of molecular evolution
    Mechanisms that maintain adaptive diversity
  Adaptive evolution in the human genome
    Signatures of balancing selection
    Balancing selection in the human genome
  Genetic load induced by balancing selection
    Genetic load
  Relevance, Questions & Hypotheses
    Relevance
    Questions & Hypotheses
Bibliography
References
Supplementary Text
  S1 Text: Additional analyses for significant and outlier windows and genes
  S2 Text: A set of significant genes
  S3 Text: Manual verification of reliability of SNPs contained in four of the outlier genes
Supplementary Tables
Supplementary Figures
References
Bibliography
Appendices
  Appendix A.1.
Prologue
There are several sources of evidence for the action of natural selection on the human genome. Natural selection can be directional, increasing or decreasing the frequency of advantageous or deleterious variants (positive or negative selection, respectively), or balancing. Positive selection has been widely investigated for at least two decades through "genomic scans" and is seen as the main mechanism of adaptive evolution. It is estimated that 2-14% of the human genome has been the target of this selective regime, over various time scales.
Second, we sought to test the hypothesis that balancing selection, by maintaining polymorphisms over long time scales (millions of years), would have a deleterious effect on sites close to the selected site(s). This hypothesis stems from a broad literature on the accumulation of deleterious mutations in regions neighboring targets of positive selection and on genetic load in humans, and also from the fact that many genes under balancing selection appear to be associated with complex diseases. In addition, there is evidence that such accumulation occurs around the HLA genes that were targets of balancing selection. To explore these questions, the first chapter of the thesis deals with the detection of targets of balancing selection in the human genome, and the second with the effects of balancing selection on neighboring regions of the genome. I also present a general introduction to the topics of the two chapters, and a final discussion of the implications of the findings of the two studies. The thesis is thus divided into:
• Second chapter: in which I present an investigation of the accumulation of deleterious mutations in regions that were targets of balancing selection in humans, as detected in the genomic scan presented in the first chapter.
General Introduction
Figure 1: Timeline of the study of balancing selection. Adapted from Gloss and Whiteman (2016), summarizing the main theoretical and empirical contributions to understanding the importance of balancing selection for the maintenance of genetic variation. SBLP, long-term balancing selection (seleção balanceadora de longo prazo; see text).
[...] that adaptively maintained genetic variation in populations. For example, Dobzhansky (1937) observed polymorphisms in the orientation of long stretches of DNA in Drosophila chromosomes (the so-called "chromosomal inversions"), using chromosome staining techniques long before DNA sequencing was possible, and attributed the maintenance of such polymorphisms in nature to balancing selection (Figure 1).
[2] This term, often used as a synonym for heterozygote advantage, was originally employed to explain heterosis in plants (Hedrick, 2012).
[3] This is an older definition of heterozygote advantage (Crow, 1987). Others will be [...]
With the ever-increasing sensitivity of molecular methods for detecting levels of variability (e.g. Lewontin and Hubby, 1966), an abundance of polymorphisms[4] was found in natural populations, which led to a great [...]
[4] The presence of discrete phenotypic variants in a population is called a polymorphism. "Visible" polymorphisms are, however, an underestimate of the underlying genetic diversity (Charlesworth and Charlesworth, 2010). Throughout the text, polymorphisms are [...]
[...] used in population genetics, although the theory was first proposed earlier (Kimura, 1968).
[...] part of the polymorphic alleles are maintained in the species by a balance between mutation and random extinction of alleles (Kimura, 1983).
Throughout the 1960s and 1970s, under the influence of Kimura's ideas, evolutionary geneticists became increasingly convinced that much, if not most, of molecular evolution reflects the fixation of neutral or nearly neutral mutations rather than beneficial ones. The theory seemed able to explain, through stochastic processes, most of the variation observed within populations. During this period, theoretical work on the mechanisms of adaptive evolution (positive and balancing selection) declined considerably (Orr, 2005) (Figures 1 and 2).
However, it is worth noting that the neutral theory does not oppose the notion that the evolution of form and function may be guided by Darwinian selection; rather, it highlights another aspect of the evolutionary process by emphasizing the crucial role that mutational pressures and genetic drift play at the molecular level. Kimura (1983) defines the neutral theory as:
“(...) the theory that at the molecular level evolutionary changes and polymorphisms are mainly due to mutations that are nearly enough neutral with respect to natural selection that their behavior and fate are mainly determined by mutation and random drift.” (Kimura 1983; first chapter).
Levels of genetic variation, it is now known, are influenced by demographic factors, such as fluctuations in population size, structure, [...]
[6] Genetic hitch-hiking, a process through which neutral (or, in some cases, deleterious) mutations change in frequency in a population because of genetic linkage to a selected mutation (reviewed in Cutter and Payseur, 2013). This topic is addressed in more detail in the section "Genetic load induced by balancing selection" and in Chapter 2.
[7] Although positive selection also acts on advantageous or adaptive mutations, it tends to fix such advantageous variants in the population, and therefore reduces, rather than maintains, genetic variability.
Constant fitnesses
[...] genetic, in which one DNA strand is used to modify the sequence of another (Tishkoff and Williams, 2002).
[10] For example, see the methods section of Chapter 1.
[...] alleles A1 and A2 to refer to an average of the fitness of each allele over all the genotypes in which it appears (i.e., homozygous and heterozygous). In the scenario considered here, although w11, w12 and w22 do not change over time, the marginal fitnesses still depend on the allele frequencies. That is, it is possible for the two marginal fitnesses to become equal and for the allele frequencies to stop changing, reaching a stable equilibrium of frequencies. We call the frequency of each allele at equilibrium the equilibrium frequency[11]. Interestingly, the absolute values of the genotype fitnesses are irrelevant: only the relative values of the genotypes enter the selection equations. One can therefore define one genotype (e.g. the heterozygote) as the "standard" and express the fitnesses of the other genotypes relative to it, as shown below (Charlesworth and Charlesworth, 2010).
Genotype   A1A1    A1A2   A2A2
Fitness    1 − t   1      1 − s
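The fitness scheme above (A1A1 = 1 − t, A1A2 = 1, A2A2 = 1 − s) can be turned into a short numerical sketch. It assumes random mating, discrete generations and an infinite population; the function names and the values of s and t are illustrative and do not come from the thesis. Setting the two marginal fitnesses equal gives p t = q s, that is, an equilibrium frequency of s / (s + t) for A1.

```python
def next_frequency(p, s, t):
    """One generation of selection at a biallelic locus under random mating.

    p is the frequency of A1; genotype fitnesses are
    A1A1 = 1 - t, A1A2 = 1, A2A2 = 1 - s.
    """
    q = 1.0 - p
    w1 = p * (1.0 - t) + q        # marginal fitness of A1
    w2 = p + q * (1.0 - s)        # marginal fitness of A2
    w_bar = p * w1 + q * w2       # population mean fitness
    return p * w1 / w_bar


def equilibrium_frequency(s, t):
    """Frequency of A1 at which the two marginal fitnesses are equal."""
    return s / (s + t)


def iterate(p0, s, t, generations=2000):
    """Apply the recursion repeatedly, starting from frequency p0."""
    p = p0
    for _ in range(generations):
        p = next_frequency(p, s, t)
    return p
```

With s = 0.3 and t = 0.1, for instance, `iterate` approaches `equilibrium_frequency(0.3, 0.1)` whether A1 starts rare or common, which is what makes the equilibrium stable.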
[11] For biallelic loci, the equilibrium frequency can be defined as the frequency of the less frequent allele.
[12] The first demonstration and discussion of how a polymorphism can be maintained by selection, in a form very similar to the one presented here, appeared in the paper entitled "On the dominance ratio", by Fisher (1922). See Figure 1.
[Figure: allele frequency (vertical axis, 0 to 1) over time (horizontal axis). Panel A: equilibrium frequency 0.5. Panel B: equilibrium frequency 0.3.]
[...] this is not the case: the mean fitness of a population (i.e., taking into account all genotypes and their frequencies) with random mating and heterozygote advantage reaches its maximum at equilibrium. For this reason, the equilibrium frequency under heterozygote advantage is said to be the one that maximizes the mean fitness of the population (Charlesworth and Charlesworth, 2010; Andrés, 2011). Therefore, although the presence of low-fitness homozygotes in the case of sickle-cell anemia, for example, is very harmful to the individual, the fitness of the population as a whole is higher when malaria-resistant individuals are maintained in the population (Charlesworth and Charlesworth, 2010; Wright, 1937).
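The claim can be checked with the relative fitnesses A1A1 = 1 − t, A1A2 = 1 and A2A2 = 1 − s, writing p for the frequency of A1 and q = 1 − p. The following short derivation is added as an illustration and assumes random mating:

```latex
\begin{align*}
\bar{w}(p) &= p^2(1-t) + 2pq + q^2(1-s) = 1 - t\,p^2 - s\,(1-p)^2,\\
\frac{d\bar{w}}{dp} &= -2tp + 2s(1-p) = 0 \;\Longrightarrow\; \hat{p} = \frac{s}{s+t},\\
\frac{d^2\bar{w}}{dp^2} &= -2(s+t) < 0 \quad \text{for } s, t > 0.
\end{align*}
```

The stationary point is therefore a maximum, and it coincides with the frequency at which the marginal fitnesses of the two alleles are equal (p t = q s).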
Since in this case fitnesses depend on genotype frequencies, the argument above, that the equilibrium frequency is the one that maximizes the mean fitness of the population, does not hold here: the stable equilibrium need not, by definition, coincide with the maximum mean fitness (Charlesworth and Charlesworth, 2010).
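A toy example makes the contrast concrete. The sketch below is a deliberately simplified haploid scheme, not a model from the thesis: allele A1 has fitness 1 − a·p, which falls as it becomes common, while A2 has a constant fitness c. The names and the values of a and c are assumptions chosen only for illustration. The stable equilibrium lies where the two fitnesses are equal, p = (1 − c)/a, whereas mean fitness peaks at half that value.

```python
def marginal_fitnesses(p, a, c):
    """Hypothetical haploid scheme: A1 loses fitness as it becomes common,
    A2 has a constant fitness c."""
    return 1.0 - a * p, c


def mean_fitness(p, a, c):
    """Population mean fitness when A1 has frequency p."""
    w1, w2 = marginal_fitnesses(p, a, c)
    return p * w1 + (1.0 - p) * w2


def iterate(p0, a, c, generations=5000):
    """Apply the selection recursion repeatedly, starting from frequency p0."""
    p = p0
    for _ in range(generations):
        w1, _ = marginal_fitnesses(p, a, c)
        p = p * w1 / mean_fitness(p, a, c)
    return p


def frequency_maximizing_mean_fitness(a, c, steps=10000):
    """Grid search for the frequency of A1 with the highest mean fitness."""
    grid = [i / steps for i in range(steps + 1)]
    return max(grid, key=lambda p: mean_fitness(p, a, c))
```

With a = 0.5 and c = 0.9 the recursion settles at a frequency of 0.2, while the grid search places the maximum of mean fitness near 0.1, so the two do not coincide.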
It is likely that part of the diversity observed in the MHC genes is maintained in this way, since different variants of the genes encoding antigen-presenting molecules are able to present a particular repertoire of epitopes, and new variants that arise in the host may have a frequency-dependent advantage (Trachtenberg et al., 2003; [...]
[19] About 10 generations per summer, and two per winter (Bergland et al., 2014).
[20] However, the study by Bergland et al. (2014) considered more realistic models that take into account the possibility of overlapping generations, multiple linked loci and a combination of spatial and temporal variation in selective pressures (all compatible with D. melanogaster), a discussion beyond the scope of this introduction.
With some abstraction, we can also include here cases in which a given genotype has different fitnesses in the two sexes: if alleles A1 and A2 have opposite effects on the fitness of males and females (sexual antagonism), polymorphisms can be maintained in the absence of heterozygote advantage[21]. In D. melanogaster, an estimated 8% of genes show patterns compatible with this type of selection (Innocenti and Morrow, 2010).
Current selection
[25] A model postulating that each new allele arising in a population is "new" or "unique", i.e., different from all those that arose before. This model was proposed by Kimura and Crow (1964) in an attempt to estimate the proportion of homozygous loci in a finite diploid population.
Other tests with power to detect signatures of balancing selection events compatible with this time scale compare: (a) the observed distribution of allele frequencies with that expected under the neutral model; (b) genetic variation and linkage disequilibrium[26] in certain genomic regions with those observed in neutrally evolving regions; (c) the geographic differentiation observed at certain loci with that found for neutral markers (Hedrick, 2006; Hedrick, 2012; Mitchell-Olds et al., 2007) (Figure 5).
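Comparison (a) is often summarized with a single statistic. The sketch below computes Tajima's D, one widely used summary of the allele frequency distribution; it is offered only as an illustration and is not necessarily the statistic used in the studies cited above. The function name and the input format (one derived-allele count per segregating site) are assumptions. An excess of intermediate-frequency variants, as expected under balancing selection, pushes D above zero, while an excess of rare variants pushes it below zero.

```python
import math


def tajimas_d(derived_counts, n):
    """Tajima's D from per-site derived-allele counts in a sample of n chromosomes.

    derived_counts holds one integer k (0 < k < n) per segregating site.
    """
    s = len(derived_counts)
    if s == 0:
        raise ValueError("no segregating sites")
    a1 = sum(1.0 / i for i in range(1, n))
    a2 = sum(1.0 / i**2 for i in range(1, n))
    b1 = (n + 1) / (3.0 * (n - 1))
    b2 = 2.0 * (n**2 + n + 3) / (9.0 * n * (n - 1))
    c1 = b1 - 1.0 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1**2
    e1 = c1 / a1
    e2 = c2 / (a1**2 + a2)
    # Mean number of pairwise differences, summed over sites.
    pi = sum(2.0 * k * (n - k) / (n * (n - 1)) for k in derived_counts)
    # Difference between the two diversity estimators, scaled by its standard deviation.
    return (pi - s / a1) / math.sqrt(e1 * s + e2 * s * (s - 1))
```

For a sample of 10 chromosomes, four sites with the derived allele at count 5 give a positive value, and four singletons give a negative one. In practice the observed value would be compared with neutral expectations under a suitable demographic model.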
[26] A measure reflecting whether two alleles at two different loci co-occur non-randomly [...]
In this sense, a ratio dN/dS > 1 (or ω > 1) is a genetic signature of positive selection (Gillespie, 1991; Nielsen, 2005), but also of balancing selection (e.g. Bitarello et al., 2015[30]) (Figure 5). However, the criterion ω > 1 for considering genes to be under adaptive evolution is very conservative. Given the premise that most nonsynonymous mutations are deleterious (Kimura and Crow, 1963; Kimura, 1968; Eyre-Walker and Keightley, 1999), the criterion is often not met when whole genes are analyzed. This happens because usually only a few codons are under positive or balancing selection, while most nonsynonymous mutations are deleterious and are therefore under purifying selection[31]. For this reason, it has long been standard practice to search for selection in subsets of codons (e.g. Hughes and Nei, 1988; Hughes and Nei, 1989; Bitarello et al., 2015) or through models that estimate different dN/dS values for groups of codons (Yang and Swanson, 2002; Bitarello et al., 2015), making it possible to infer which of them evolved adaptively.
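The dilution effect described above can be illustrated with made-up numbers. The sketch below uses an uncorrected ratio of per-site differences, with no correction for multiple substitutions and no codon model of the kind cited in the text; all counts and names are hypothetical.

```python
def omega(nonsyn_diffs, nonsyn_sites, syn_diffs, syn_sites):
    """Uncorrected dN/dS: per-site nonsynonymous over per-site synonymous differences."""
    if syn_diffs == 0:
        raise ValueError("dS is zero; the ratio is undefined")
    dn = nonsyn_diffs / nonsyn_sites
    ds = syn_diffs / syn_sites
    return dn / ds


# Hypothetical counts for a 300-codon gene (900 sites in total).
whole_gene = omega(nonsyn_diffs=12, nonsyn_sites=660, syn_diffs=9, syn_sites=240)

# Hypothetical counts for 10 of those codons (30 sites), e.g. a binding region.
codon_subset = omega(nonsyn_diffs=8, nonsyn_sites=22, syn_diffs=1, syn_sites=8)
```

With these counts the whole gene has ω below 1 while the codon subset has ω above 1, which is why a gene-wide threshold of ω > 1 can miss selection that acts on only a few codons.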
[29] Although some synonymous mutations may be targets of selection owing to codon usage bias, and a fraction of nonsynonymous mutations is neutral, the premise holds because of the proportions involved (e.g. Comeron et al., 2008).
[30] This reference is available in Appendix A.4.
[31] This idea is indirectly explored in Chapter 2.
[...] humans and chimpanzees, although in principle the substitutions could be between any two species.
Although several genomic scans have been carried out to locate targets of positive selection (reviewed in Akey, 2009), comparatively few studies have sought to locate targets of balancing selection. This is partly due to the difficulty of detecting this type of selection at the genomic scale (Andrés et al., 2009). Figure 6 summarizes the studies that searched for signatures of balancing selection in humans.
Until very recently, no more than a handful of cases of balancing selection had been established in humans. Even with the advent of sequence data for many genes, few targets were proposed beyond the genes [...]
Most loci under balancing selection were found to be shared between the two populations, with few exceptions: four genes with evidence of balancing selection only in Americans of African ancestry and nine only in Americans of European ancestry.
[...] some of them previously known (e.g. Andrés et al., 2009), and others new (Key et al., 2014).
Three important limitations of these studies are: (1) although the study shows that the T1 and T2 methods have high power under simple models, the authors do not explore human demographic models (e.g. Gravel et al., 2011); (2) the 200 genes reported come from a list of the "100 most extreme genes" for each test and population, but no criterion was established for reporting 100 genes; therefore, although this work unquestionably contributes a great deal to the accumulated knowledge of regions of the human genome bearing signatures of SBLP (long-term balancing selection), it does not provide an approximate estimate of how frequent it may be in the human genome; (3) although the authors used whole-genome data, they report only protein-coding genes as targets, and neither explore nor report candidate genomic regions lying outside gene boundaries.
[...] term does not mean that selection has persisted until the recent past or until the current generation. It is clear, therefore, that defining the time scales of selective regimes matters both for determining which tools are appropriate and for the degree of resolution that can be achieved.
[...] deleterious, or heading toward fixation because they are advantageous (Hudson et al., 1987; Mitchell-Olds et al., 2007; Sellis et al., 2011). Finally, a certain number of these polymorphisms are maintained within individual populations by balancing selection, or across the entire distribution of the species by local adaptation[35] (reviewed in Mitchell-Olds et al., 2007).
[35] A situation in which genotypes from different populations have higher fitness in their environments of origin, owing to historical natural selection in the region.
Genetic load
In molecular evolution, most mutations are not adaptive: they either worsen (maladaptive mutations) or do not affect (neutral mutations) the degree of adaptation of traits to the environment (Orr, 1998). The efficacy with which deleterious variants are removed from a population depends on several factors: mutation (which constantly creates new deleterious variants), dominance (which influences how "visible" the mutation is to selection) (e.g. Sellis et al., 2011), demography and linkage (Gravel, 2016).
In addition, maladaptation of a population may be caused by a lack of segregating genotypic variation with which to respond to selection. Genetic drift and inbreeding, for example, move populations away from their adaptive peaks and can lead to phenotypic maladaptation (Crespi, 2000). Pleiotropy[37] can result in maladapted populations because the joint optimization of many traits is unfeasible (Charlesworth and Charlesworth, 2010; Crespi, 2000). Finally, migration between populations of individuals that adapted in different subpopulations can also lead to maladaptation (Charlesworth and Charlesworth, 2010; Crespi, 2000).
[36] Effective population size: reflects the size of an idealized population that would be subject to drift to the same extent as the actual population. Ne can be smaller than the actual population size owing to several factors, including variance in reproductive success, a demographic history with genetic bottlenecks (extreme reductions in population size, followed by an expansion from a sample of the original population) and inbreeding (reviewed in Cutter and Payseur, 2013).
[37] A phenomenon in which one gene affects multiple traits, considered the almost universal mode of gene action.
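The effect of bottlenecks on Ne mentioned in the footnote can be illustrated with a standard approximation that is not spelled out in the thesis: when population size fluctuates across generations, the long-term effective size is approximately the harmonic mean of the per-generation sizes. The function name and the demographic history below are hypothetical.

```python
def harmonic_mean_ne(sizes):
    """Long-term effective size approximated as the harmonic mean of per-generation sizes."""
    if not sizes or any(n <= 0 for n in sizes):
        raise ValueError("sizes must be a non-empty sequence of positive numbers")
    return len(sizes) / sum(1.0 / n for n in sizes)


# Hypothetical history: 95 generations at 10,000 and a 5-generation bottleneck at 100.
history = [10_000] * 95 + [100] * 5
ne = harmonic_mean_ne(history)
arithmetic_mean = sum(history) / len(history)
```

Because the harmonic mean is dominated by the smallest values, the few bottleneck generations pull `ne` far below `arithmetic_mean`, which is the sense in which Ne can be much smaller than the census size.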
[...] Chapter 1.
There are two ways in which selection on one trait interferes with selection on other traits. The first occurs when a gene has pleiotropic functions. Positive or balancing selection presupposes adaptation for some trait (the one related to the selective pressure), but the consequence of that selective pressure, in terms of the fixation of mutations, may not be adaptive for all the functions the gene performs. In that case, the "by-product" would be a maladaptation. As an example, consider the HLA genes, which are involved in the immune response in humans and show strong evidence of positive and balancing selection. On the other hand, many inflammatory and autoimmune diseases are also related to the HLA genes (Becker et al., 1998).
[43] Different synonymous sites are selectively different, since certain codons (called "optimal") are used more frequently than others, possibly for reasons of translational efficiency and accuracy (Betancourt and Presgraves, 2002).
The general conclusion is that, at least in Drosophila, there appears to be a
"hierarchy of selective pressures": selection against deleterious mutations is
stronger than selection on advantageous nonsynonymous mutations, which in turn is
stronger than selection for optimal codons. This does not mean, as the authors
themselves stress, that the cumulative effect of many suboptimal codons is negligible.
In addition, strong directional selection (positive or negative) has been shown to
increase the proportion of deleterious variants segregating in regions adjacent to
those targeted by directional selection in humans (Chun and Fay, 2011). Just as
genomic scans for balancing selection are far less common than those for positive
selection, the same holds for the accumulation of deleterious variants: to date, no
study has specifically investigated the impact of balancing selection on the
accumulation of deleterious variants at sites neighboring the balanced polymorphism.
There is evidence that genes neighboring the HLA genes have an excess of potentially
deleterious variants (Mendes, 2013; Lenz et al., 2016). However, given the many
peculiarities of this genomic region (Meyer et al., 2006), the general effect of
balancing selection on linked non-neutral variants remains an open question. I
address this question in Chapter 2.
Relevance
Over the last decade, the increasingly frequent use of genomic scans has produced
several lists of regions and genes, in the human genome and in those of other
species, that bear signatures of positive selection. Access to genome sequences from
many species, together with bioinformatic tools, has stimulated the quantification
of adaptive evolution, which results in the polymorphism patterns observed in
humans. The fact that groups of genes defined by function, in categories such as
"spermatogenesis", "olfaction", "sensory perception" and "immune response", recur in
the lists of candidate targets of positive selection (e.g. Nielsen, 2005; Sabeti et
al., 2006) makes intrinsic biological sense: many genes in these categories are
directly involved in interactions with the environment. Such observations increase
our confidence in genomic scans while helping us understand the specific adaptations
of our species. Because it is widely regarded as the main mechanism responsible for
adaptive evolution, positive selection has been, and still is, intensively studied.
On the other hand, little is known about the effect that balanced polymorphisms have
on adjacent non-neutral variants. This knowledge is scarce even for targets of
positive selection, and practically nonexistent for targets of balancing selection,
with the exception of a few studies of HLA genes (e.g. Oosterhout, 2009; Lenz et
al., 2016). A better understanding of the impact of balancing selection on human
evolution is important not only for understanding how we became what we are, but
also for better understanding complex diseases that occur at relatively high
frequency in humans (Vallender and Johnson, 2008).
[44] A possibility, but not a certainty. For the HLA genes, many sites are known to be
actively maintained as polymorphic.
In this context, the main questions addressed in this thesis were: (1) is it possible
to develop more powerful methods for finding regions of the genome that evolve under
balancing selection? (2) what are the targets of long-term balancing selection in
humans? (3) what are the biological properties of these targets: are they mostly
(protein-coding) genes, regulatory regions, or regions that affect gene expression?
(4) which functional categories are most abundant among the target genes of
balancing selection: apart from the HLA genes, which are involved in the immune
response, what can we say about the targets in terms of function? (5) what do we
know about the biological importance of some of these candidate genes, based on
independent studies? (6) are the targets of balancing selection shared among
populations or continents? (7) how prevalent are signatures of long-term balancing
selection in the human genome? Can we quantify the proportion of the human genome
that has been shaped by mechanisms that maintain diversity? (8) does balancing
selection on one or more sites interfere with the efficacy of purifying selection on
adjacent non-neutral sites?
• (b) balancing selection affects both genic regions and regulatory regions or
regions that control expression.
Hypotheses (a-d) were tested with the targets obtained using a method that I
developed with collaborators, which scanned the whole genome for signatures of
balancing selection in four populations from two continents.
[45] Although relatively few cases have been described so far and the mechanisms are
not fully understood, they appear promising. Examples include genes involved in
spermatogenesis, sperm-egg recognition, and follicle-stimulating hormone (reviewed
in Vallender and Johnson, 2008).
Bibliography
Clarke, B. (1962). “Balanced polymorphism and the diversity of sympatric species”. In:
Taxonomy and Geography. Ed. by D. Nichols. Oxford: Systematics Association.
Comeron, J. M., A. Williford and R. M. Kliman (2008). “The Hill-Robertson effect:
evolutionary consequences of weak selection and linkage in finite populations”. In:
Heredity 100 (1), pp. 19–31.
Connallon, T. and A. G. Clark (2013). “Antagonistic versus nonantagonistic models of
balancing selection: characterizing the relative timescales and hitchhiking effects of
partial selective sweeps”. In: Evolution 67 (3), pp. 908–917.
Crespi, B. J. (2000). “The evolution of maladaptation”. In: Heredity 84, pp. 623–629.
Crow, J. F. (1987). “Muller, Dobzhansky, and overdominance”. In: Journal of the History of
Biology 20 (3), pp. 351–380.
Cutter, A. D. and B. A. Payseur (2013). “Genomic signatures of selection at linked sites:
unifying the disparity among species”. In: Nature Reviews Genetics 14 (4), pp. 262–274.
Darwin, C. (1859). The origin of species: complete and fully illustrated. 1979 ed. New York:
Gramercy Books. ISBN: 9780517123201.
Darwin, C. (1876). The effects of cross and self fertilisation in the vegetable kingdom.
De Boer, R. J., J. A. M. Borghans, M. van Boven, C. Kesmir and F. J. Weissing (2004).
“Heterozygote advantage fails to explain the high degree of polymorphism of the MHC”.
In: Immunogenetics 55 (11), pp. 725–731.
DeGiorgio, M., K. E. Lohmueller and R. Nielsen (2014). “A model-based approach for
identifying signatures of ancient balancing selection in genetic data”. In: PLoS Genetics
10 (8), e1004561.
Dempster, E. R. (1955). “Maintenance of genetic heterogeneity”. In: Cold Spring Harbor
Symposia on Quantitative Biology. Cold Spring Harbor Laboratory Press, pp. 25–32.
Dobzhansky, T. (1937). Genetics and the Origin of Species. 2nd ed. New York: Columbia
University Press.
Doherty, P. C. and R. M. Zinkernagel (1975). “Enhanced immunological surveillance in mice
heterozygous at the H-2 gene complex”. In: Nature 256 (5512), pp. 50–52.
Enard, D., F. Depaulis and H. Roest Crollius (2010). “Human and Non-Human Primate
Genomes Share Hotspots of Positive Selection”. In: PLoS Genetics 6 (2), pp. 1–13.
Eyre-Walker, A. (2006). “The genomic rate of adaptive evolution”. In: Trends in Ecology &
Evolution 21 (10), pp. 569–575.
Eyre-Walker, A. and P. D. Keightley (1999). “High genomic deleterious mutation rates in
hominids”. In: Nature 397 (6717), pp. 344–347.
Fay, J. C., G. J. Wyckoff and C.-I. Wu (2001). “Positive and negative selection on the human
genome”. In: Genetics 158 (3), pp. 1227–1234.
Fijarczyk, A. and W. Babik (2015). “Detecting balancing selection in genomes: Limits and
prospects”. In: Molecular Ecology.
Fisher, R. A. (1922). “On the Dominance Ratio”. In: Proc. R. Soc. 42, pp. 321–341.
Fu, W. and J. M. Akey (2013). “Selection and Adaptation in the Human Genome”. In: Annual
Review of Genomics and Human Genetics 14 (1), pp. 467–489.
Garrigan, D. and P. W. Hedrick (2003). “Detecting adaptive molecular polymorphism:
Lessons from the MHC”. In: Evolution 57 (8), pp. 1707–1722.
Gillespie, J. H. (1991). The causes of molecular evolution. Oxford: Oxford University Press.
ISBN: 0-19-509271-6.
Gillespie, J. H. and C. Langley (1974). “A general model to account for enzyme variation in
natural populations”. In: Genetics 76 (4), pp. 837–848.
Gloss, A. D. and N. K. Whiteman (2016). “Balancing Selection: Walking a Tightrope”. In:
Current Biology 26 (2), R73–R76.
Gravel, S. (2016). “When Is Selection Effective?” In: Genetics 203 (1), pp. 451–462.
Gravel, S., B. M. Henn, R. N. Gutenkunst, A. R. Indap, G. T. Marth, A. G. Clark, F. Yu, R. A.
Gibbs and C. D. Bustamante (2011). “Demographic history and rare allele sharing among
human populations”. In: Proceedings of the National Academy of Sciences of the United
States of America 108 (29), pp. 11983–11988.
Haldane, J. (1937). “The Effect of Variation on Fitness”. In: The American Naturalist 71 (735),
pp. 337–349.
Harris, E. E. (2010). “Nonadaptive processes in primate and human evolution”. In: American
Journal of Physical Anthropology 143 Suppl, pp. 13–45.
Harris, E. E. and D. Meyer (2006). “The Molecular Signature of Selection Underlying Human
Adaptations”. In: Yearbook of Physical Anthropology 130, pp. 89–130.
Haygood, R., C. C. Babbitt, O. Fedrigo and G. A. Wray (2010). “Contrasts between adaptive
coding and noncoding changes during human evolution”. In: Proceedings of the National
Academy of Sciences of the United States of America 107 (17), pp. 7853–7857.
Hedrick, P. W. (2006). “Genetic Polymorphism in Heterogeneous Environments: The Age of
Genomics”. In: Annual Review of Ecology, Evolution, and Systematics 37, pp. 67–93.
Hedrick, P. W. (2012). “What is the evidence for heterozygote advantage selection?” In:
Trends in Ecology & Evolution 27 (12), pp. 698–704.
Hellmann, I., I. Ebersberger, S. E. Ptak, S. Pääbo and M. Przeworski (2003). “A neutral
explanation for the correlation of diversity with recombination rates in humans”. In:
American Journal of Human Genetics 72 (6), pp. 1527–1535.
Hill, W. G. and A. Robertson (1966). “The effect of linkage on limits to artificial selection”.
In: Genetical Research 8 (3), p. 269.
Hudson, R. R., M. Kreitman and M. Aguade (1987). “A Test of Neutral Molecular Evolution
Based on Nucleotide Data”. In: Genetics 116 (1), pp. 153–159.
Hughes, A. L. and M. Nei (1989). “Nucleotide substitution at major histocompatibility
complex class II loci: evidence for overdominant selection”. In: Proceedings of the National
Academy of Sciences of the United States of America 86 (3), pp. 958–962.
Hughes, A. L. and M. Nei (1988). “Pattern of nucleotide substitution at major
histocompatibility class I loci reveals overdominant selection”. In: Nature 335, pp. 167–170.
Innocenti, P. and E. H. Morrow (2010). “The sexually antagonistic genes of Drosophila
melanogaster”. In: PLoS Biology 8 (3), e1000335.
Jablonski, N. G. and G. Chaplin (2010). “Human skin pigmentation as an adaptation to UV
radiation”. In: Proceedings of the National Academy of Sciences 107 (Supplement_2),
pp. 8962–8968.
Key, F. M., J. C. Teixeira, C. de Filippo and A. M. Andrés (2014). “Advantageous diversity
maintained by balancing selection in humans”. In: Current Opinion in Genetics &
Development 29, pp. 45–51.
Kimura, M. (1991). “The neutral theory of molecular evolution: a review of recent evidence”.
In: Japanese Journal of Genetics 66 (4), pp. 367–386.
Kimura, M. (1968). “Evolutionary rate at the molecular level”. In: Nature 217, pp. 624–626.
Kimura, M. (1983). The Neutral Theory of Molecular Evolution. Cambridge: Cambridge
University Press. ISBN: 9780511623486. URL: http://ebooks.cambridge.org/ref/id/CBO9780511623486.
Kimura, M. and J. F. Crow (1963). “The Measurement of Effective Population Number”. In:
Evolution 17 (3), pp. 279–288.
Kimura, M. and J. F. Crow (1964). “The Number of Alleles that Can Be Maintained in a Finite
Population”. In: Genetics 49, pp. 725–738.
Klein, J., A. Sato, S. Nagl and C. O’hUigin (1998). “Molecular trans-species polymorphism”.
In: Annual Review of Ecology and Systematics 29, pp. 1–21.
Kreitman, M. and A. Di Rienzo (2004). “Balancing claims for balancing selection”. In: Trends
in Genetics 20 (7), pp. 300–304.
Lande, R. (1975). “The maintenance of genetic variability by mutation in a polygenic
character with linked loci”. In: Genetical Research 26 (3), pp. 221–235.
Leffler, E. M. et al. (2013). “Multiple Instances of Ancient Balancing Selection Shared
Between Humans and Chimpanzees”. In: Science 339 (6127), pp. 1578–1582.
Lenz, T. L., V. Spirin, D. M. Jordan and S. R. Sunyaev (2016). “Excess of Deleterious
Mutations around HLA Genes Reveals Evolutionary Cost of Balancing Selection”. In:
bioRxiv, pp. 1–30.
Levene, H. (1953). “Genetic Equilibrium When More Than One Ecological Niche is
Available”. In: The American Naturalist 87 (836), pp. 331–333.
Lewontin, R. C. and J. L. Hubby (1966). “A Molecular Approach to the Study of Genic
Heterozygosity in Natural Populations. II. Amount of Variation and Degree of
Heterozygosity in Natural Populations of Drosophila pseudoobscura”. In: Genetics 54 (2),
pp. 595–609.
Lynch, M. (2007). “The evolution of genetic networks by non-adaptive processes”. In: Nature
Reviews Genetics 8 (10), pp. 803–813.
Maynard-Smith, J. and J. Haigh (1974). “The hitch-hiking effect of a favorable gene”. In:
Genetical Research 23, pp. 23–35.
McDonald, J. H. and M. Kreitman (1991). “Adaptive protein evolution at the Adh locus in
Drosophila”. In: Nature 351 (6328), pp. 652–654.
Mendes, F. (2013). Natural selection on HLA and its effects on adjacent regions of the genome.
Tech. rep. Universidade de São Paulo. URL:
http://www.teses.usp.br/teses/disponiveis/41/41131/tde-02082013-161104/pt-br.php.
Trachtenberg, E. et al. (2003). “Advantage of rare HLA supertype in HIV disease
progression”. In: Nature Medicine 9, pp. 928–935.
Vallender, E. J. and W. E. Johnson (2008). “Balancing Selection in Human Evolution”. In: eLS.
Watterson, G. A. (1978). “The homozygosity test of neutrality”. In: Genetics 88 (2),
pp. 405–417.
Williams, G. C. (1957). “Pleiotropy, Natural Selection, and the Evolution of Senescence”. In:
Evolution 11 (4), p. 398.
Wright, S. (1937). “The Distribution of Gene Frequencies in Populations”. In: Proceedings of
the National Academy of Sciences 23 (6), pp. 307–320.
Yang, Z. and W. J. Swanson (2002). “Codon-Substitution Models to Detect Adaptive Evolution
that Account for Heterogeneous Selective Pressures Among Site Classes”. In: Molecular
Biology and Evolution 19 (1), pp. 49–57.
Zhang, Z. and J. Parsch (2005). “Positive correlation between evolutionary rate and
recombination rate in Drosophila genes with male-biased expression”. In: Molecular Biology
and Evolution 22 (10), pp. 1945–1947.
Chapter 1
Preliminary Remarks
In this chapter I present a manuscript, currently under final review by the
co-authors, in which we develop a new statistic for detecting instances of balancing
selection in the human genome. It directly quantifies the two main signatures of
balancing selection regimes acting over long timescales: an excess of alleles
segregating at intermediate frequencies and an excess of polymorphic sites relative
to expectations under a null model.
About one third of the genes we detect with this new statistic have prior evidence
of balancing selection, based on methods and data quite different from ours.
However, we also describe more than 150 new candidate genes, as well as candidate
non-coding regions and the properties of these regions.
Our method has greater power than others described in the literature, is extremely
simple to implement and interpret, and runs quickly. Combined with careful quality
control of the data used and verification of the candidate regions obtained, we
believe we have provided an extremely reliable map of the extent of signatures of
balancing selection in the human genome. With this work we contribute to the (not
very extensive) literature on balancing selection in humans and propose a method
with high statistical power that, in principle, can be used in similar approaches
for other species.
This work was done in collaboration with the researcher Aida M. Andrés, of the Max
Planck Institute for Evolutionary Anthropology (MPI-EVA, Leipzig), who conceived
the idea for the new method. The work began in 2013, during my research stay abroad
("doutorado sanduíche"), and was co-supervised by Diogo Meyer and A.M.A. I also had
the collaboration of Cesare de Filippo (postdoctoral researcher, MPI-EVA) and João
C. Teixeira (PhD student, MPI-EVA). J.C.T. performed part of the enrichment analyses
for the candidate regions, and C.F. collaborated on the simulations used to evaluate
the statistic and on the implementation of the scan itself.
The manuscript was written by me, together with A.M.A. and D.M., and all authors
contributed comments on the text. It will be submitted to the journal PLoS Genetics.
All supplementary material cited in the text is provided at the end of the chapter.
Introduction
Balancing selection refers to a class of selective mechanisms that main-
versity with phenotypic relevance. For example, decades of research have established
HLA genes as a prime example of balancing selection (Meyer and Thomson, 2001; Spur-
gin and Richardson, 2010) with thousands of alleles segregating in humans, extensive
support for the functional relevance of polymorphism (e.g., Hedrick et al., 1991; Prug-
nolle et al., 2005) and various well-documented cases of association between selected
alleles and disease resistance and susceptibility (e.g. Raychaudhuri et al., 2012; Howell,
2014).
small, but genes identified are associated to phenotypes such as auto-immune dis-
Network, 2015), resistance to HIV infection (Biasin et al., 2007) and polycystic ovary
syndrome (Day et al., 2015). Thus, the relevance of balanced polymorphisms is not re-
stricted to their historical influence on individual fitness: they also shape, today, human
and Charlesworth, 2010; Clarke, 1962; Fijarczyk and Babik, 2015; reviewed in Andrés,
2011; Key et al., 2014b). These include heterozygote advantage (or overdominance)
(Andrés, 2011; Key et al., 2014b; Fijarczyk and Babik, 2015), frequency-dependent se-
lection, including rare allele advantage (Clarke, 1962; Charlesworth and Charlesworth,
2010), selective pressures that fluctuate in time (Andrés, 2011; Bergland et al., 2014;
Fijarczyk and Babik, 2015) or in space in panmictic populations (Andrés, 2011; Charlesworth
et al., 1997; Charlesworth, 2006; Fijarczyk and Babik, 2015; Key et al., 2014b) and cases
pleiotropy, and some instances of selection that varies in space, a stable equilibrium
can be reached (Charlesworth and Charlesworth, 2010). For other mechanisms the
frequency of the selected allele can change in time with no theoretical equilibrium fre-
quency, although the frequency of the balanced polymorphism will be strongly affected
with respect to neutral expectations and has the potential to leave identifiable signa-
tures in genomic data. These include local site-frequency spectra with an excess of
alleles close to the frequency of the balanced allele and, when selection is old enough,
some cases, very ancient balancing selection can maintain trans-species polymorphisms
in sister species (Leffler et al., 2013; Teixeira et al., 2015), while recent balancing selection
or selection that is transient (e.g., that predicted in the model of Sellis et al., 2011) will
result in signatures that are probably difficult to distinguish from incomplete, recent
While balancing selection has been extensively explored from a theoretical perspec-
tive, an empirical understanding of its prevalence in the human genome lags behind
our knowledge of positive selection. This stems from technical difficulties in detect-
ing balancing selection, as well as the perception that balancing selection may be rare
(Hedrick, 2012). In fact, few methods have been developed to identify its targets, and
only a handful of studies have sought to uncover targets of balancing selection genome-
wide (Andrés et al., 2009; Alonso et al., 2008; Asthana et al., 2005; Bubb et al., 2006;
Leffler et al., 2013; DeGiorgio et al., 2014; Rasmussen et al., 2014; Teixeira et al., 2015),
with different methods and datasets. Andrés et al. (2009) and DeGiorgio et al. (2014)
identified, with different approaches, genes (Andrés et al., 2009) or genomic regions
(DeGiorgio et al., 2014) with an excess of polymorphism and with site-frequency spec-
Leffler et al. (2013) and Teixeira et al. (2015) identified trans-species polymorphisms
between humans and other primates. Overall, these studies suggested that balancing
selection may act on a relatively small portion of the genome, although the limited ex-
tent of the data available (e.g. exome data in Andrés et al., 2009 and small sample size
in DeGiorgio et al., 2014), and the stringency of the criteria - e.g., balanced polymor-
phisms that pre-date human-chimpanzee divergence in Leffler et al. (2013) and Teixeira
Here, we developed two new test statistics that summarize, directly and in a sim-
ple way, the degree to which allele frequencies in a genomic region deviate from the
showed that one of our methods outperforms existing methods for realistic demo-
graphic scenarios for human populations. We applied our statistic to the genome-wide
1000 Genomes Project (Abecasis et al., 2012) data in four human populations and used
both outlier and simulation-based cut-offs to identify both known and new genomic
Results
NCD Method
Background Owing to linkage, the signature of long term balancing selection (LTBS)
on a site extends to the genetic neighborhood of the selected variant(s), so the patterns
it evolved under balancing selection (Charlesworth, 2006; Andrés, 2011). LTBS leaves
two distinctive signatures in linked variation, when compared with neutral expecta-
tions. The first is an increase in the ratio of polymorphic to divergent sites. This occurs
because, by reducing the probability of fixation, balancing selection increases the local
TMRCA (Hudson and Kaplan, 1988). One commonly used test to detect this signature
In humans, the folded site frequency spectrum (SFS), which is the distribution of the fre-
quency of the minor alleles (MAF) regardless of whether they are ancestral or derived,
recent population expansions (e.g. Coventry et al., 2010), with the abundance of rare al-
leles further increased by purifying selection and recent selective sweeps. On the other
hand, regions under LTBS are expected to show a markedly different SFS, with propor-
tionally more alleles at intermediate frequency (Fig 1A-B). Such a deviation in the SFS
is the signature identified by classical neutrality tests, such as Tajima’s D (Tajima, 1989)
The signatures of LTBS on the SFS will depend on the selective regime and the
intensity of selection on each genotype. For example, under overdominance the fre-
2006; Charlesworth and Charlesworth, 2010; Fijarczyk and Babik, 2015). Given selec-
tion coefficients s and t against the AA and BB homozygotes, respectively, the deter-
$$f_{eqA} = \frac{s}{s+t} \qquad (1)$$
nance (t ≠ s), which might be more prevalent in natural systems (Hedrick, 2012), it
β-globin and sickle cell anemia, where in regions of endemic malaria the fitness of the
HbA homozygote for the β-globin locus is approximately 9 times higher than that of
the HbS homozygote, with the resulting equilibrium frequency of the HbS allele being
0.13 (Allison and Clyde, 1961). Under frequency-dependent selection, feq will depend
on the frequency of the favored allele. Under fluctuating selection the frequency of the
selected allele will depend on the temporal and spatial scales of selection (Andrés, 2011;
Clarke, 1964; Pasvol et al., 1978) and although no stable, long-term frequency equilib-
rium may be reached, the balanced polymorphism may be actively maintained (as long
as the heterozygote fitness exceeds that of homozygotes in their harmonic and geomet-
ric means, for spatial and temporal models, respectively) (reviewed in Hedrick, 2006).
In these cases, feq can be thought of as the frequency, at the time of sampling, of the
balanced polymorphism.
Non-Central Deviation (NCD) In the tradition of neutrality tests that analyze di-
rectly the SFS (e.g. Nielsen et al., 2005; Nielsen et al., 2009; Williamson et al., 2007), we
propose two related test statistics that explore the abundance and frequency of poly-
which we define as the degree to which the local SFS deviates from a pre-specified
allele frequency (the target frequency, tf). Under a model of balancing selection, tf can
be thought of as the deterministic frequency that would be attained given the selection
parameters, with the NCD statistic querying how far SNP frequencies are from it. We
propose two implementations for this statistic: NCD1 and NCD2. The NCD1 statistic
is based solely on the SFS, using information on the allelic frequency, p_i, of each site
in a locus:
$$\mathrm{NCD1}_{tf} = \sqrt{\frac{\sum_{i=1}^{n} (p_i - tf)^2}{n}} \qquad (2)$$
where i = 1, 2, 3, ..., n indexes the polymorphisms in the locus, p_i is the MAF of the
i-th polymorphism, and tf is the target frequency with respect to which the deviations
of the observed allele frequencies are computed. Thus, NCD1 is a non-central standard
deviation that quantifies the dispersion of allelic frequencies from tf, rather than from
the mean of the distribution. Because the frequencies of alleles at bi-allelic loci are
complementary, and under balancing selection there is no prior expectation on the
ancestral or the derived allele being maintained at higher frequency, we use the folded
SFS (Fig 1). The minimum amount of data required for calculating NCD1 is one polymorphism,
The NCD2 statistic is an extension of NCD1 that includes information not only on
the frequency of polymorphisms, but also on the number of fixed differences (FDs):
$$\mathrm{NCD2}_{tf} = \sqrt{\frac{n_{fd} \cdot (0 - tf)^2 + \sum_{i=1}^{n} (p_i - tf)^2}{n_{fd} + n}} \qquad (3)$$
where n_fd is the number of FDs in the locus. In NCD2, all informative sites (IS
= SNPs + FDs) are taken into account. FDs can be considered informative sites with a
minor allele frequency (MAF) of 0, and as such they contribute to the deviation from tf:
the greater the number of fixed differences, the larger the NCD2 value and hence the
weaker the support for LTBS. The minimal data required for calculating NCD2 is one
informative site, and for simplicity only bi-allelic SNPs and single-nucleotide FDs
are considered.
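Equations 2 and 3 can be sketched in a few lines of Python. This is a minimal illustration written for this text, not the authors' implementation: the function names `fold`, `ncd1` and `ncd2`, and the example window, are assumptions made here.

```python
import math

def fold(freq):
    """Fold an allele frequency into a minor allele frequency (MAF) in [0, 0.5]."""
    return min(freq, 1.0 - freq)

def ncd1(mafs, tf=0.5):
    """Equation 2: non-central deviation of SNP MAFs from the target frequency tf."""
    n = len(mafs)
    if n == 0:
        raise ValueError("NCD1 needs at least one polymorphism")
    return math.sqrt(sum((p - tf) ** 2 for p in mafs) / n)

def ncd2(mafs, n_fd, tf=0.5):
    """Equation 3: as NCD1, but each fixed difference counts as a site with MAF 0."""
    n = len(mafs)
    if n + n_fd == 0:
        raise ValueError("NCD2 needs at least one informative site")
    squared_dev = n_fd * (0.0 - tf) ** 2 + sum((p - tf) ** 2 for p in mafs)
    return math.sqrt(squared_dev / (n_fd + n))

# Hypothetical window: allele frequencies of four SNPs, plus two fixed differences.
window = [fold(f) for f in (0.45, 0.52, 0.60, 0.05)]
print(round(ncd1(window, tf=0.5), 3))
print(round(ncd2(window, n_fd=2, tf=0.5), 3))
```

Adding fixed differences to the same set of SNPs can only raise the value, since each one contributes the largest possible squared deviation (tf squared) to the numerator.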
From equations 2 and 3 it follows that the maximum value of NCD2_tf for a given
tf is the target frequency itself (i.e., no SNPs and one or more FDs in the locus, as in S1
Fig), and for NCD1_tf the maximum value approaches, but never reaches, tf when all
SNPs are singletons. The minimum value for both NCD1_tf and NCD2_tf is 0, when all
SNPs segregate at tf and, in the case of NCD2_tf, there are no FDs (S2 Fig). Thus, low
NCD1 and NCD2 values reflect a low deviation of the SFS from the pre-defined target
frequency, which is expected in windows containing sites under LTBS (Fig 1C).
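The two upper bounds can be checked directly from equations 2 and 3. The short derivation below is a worked sketch; the symbol k (the number of sampled chromosomes, so that a singleton has MAF 1/k) is introduced here for illustration and is not part of the original notation.

```latex
% NCD2 with no SNPs (n = 0) and n_{fd} >= 1 fixed differences:
\mathrm{NCD2}_{tf} = \sqrt{\frac{n_{fd}\,(0 - tf)^2}{n_{fd}}} = tf

% NCD1 when all n SNPs are singletons in a sample of k chromosomes
% (p_i = 1/k for every i, assuming 1/k <= tf):
\mathrm{NCD1}_{tf} = \sqrt{\frac{\sum_{i=1}^{n} (1/k - tf)^2}{n}} = tf - \frac{1}{k} < tf
```

For example, with tf = 0.5 and k = 100 chromosomes, a window containing only singletons gives NCD1 = 0.49, which approaches 0.5 as the sample size grows.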
We evaluated the specificity and sensitivity of NCD1 and NCD2 by benchmarking their
ferred for African, European, and Asian human populations (Fig 2), and simulated both
neutrality and balancing selection using a model of heterozygote advantage (see Meth-
ods). We then explored the influence of the parameters that may affect the power of
the NCD statistics: time since the onset of balancing selection (Tbs), the deterministic
tory of the sampled population, the chosen target frequency in NCD calculation (both
for cases in which feq does and does not match tf), the length of the genomic region
analyzed (L), and the implementation of NCD (NCD1 or NCD2). Box 1 summarizes
IS, informative sites (number of polymorphic sites in the ingroup species plus
tf, target frequency, i.e., the frequency used in the NCD statistics as the value
NCD1, NCD statistic that measures the average distance between poly-
NCD2, NCD statistic that measures the average distance between allelic
phisms and fixed differences with an outgroup. NCD2_tf is NCD2 for that
given tf.
For simplicity we present power values (always at a false positive rate of 5%) averaged
across NCD implementations (NCD_tf being the average of NCD1_tf and NCD2_tf),
demographic models and sequence lengths. These averages are helpful in that they
reflect the general changes in power when changing individual parameters.
Nevertheless, because they often include conditions for which power is low, the averages
underestimate the power that the test can reach under a given parameter. The full
matrix of power results for each condition is presented in S1 Table, and some key points
Time since the onset of balancing selection and sequence length The signa-
tures of LTBS are expected to be stronger the longer the time since the onset of balancing
selection, because there will have been more time for linked mutations to accumulate
phism with Tbs of 1, 3, and 5 million years (myr) (Fig 2). For simplicity, in this section
we consider only cases where tf = feq, although this condition is relaxed in later
sections.
For both NCD1_0.5 and NCD2_0.5 (feq = 0.5), power to detect balancing selection with
Tbs = 1 myr is low across all scenarios and for all tf (always lower than 0.43, S1 Table).
Nevertheless, power to identify older balanced polymorphisms is high, for all tf, for
both 3 myr (e.g. average power of NCD_0.5 is 0.70) and 5 myr (average power of NCD_0.5 is 0.77) (S3-S8 Figs,
Tbs also affects the length of the region bearing the signature of balancing selection,
signatures with older Tbs (Leffler et al., 2013; Teixeira et al., 2015). This is indeed the
case for all tf (S3-S8 Figs, S1 Table). For example, NCD_0.5 at Tbs = 5 myr resulted in
average power of 0.78, 0.76, and 0.67 for 3, 6, and 12 Kb, respectively (S3-S8 Figs, S1
Table), and a similar pattern emerges for NCD_0.4 and NCD_0.3. For NCD1 the power
increment for shorter regions was less pronounced than for NCD2 (S1 Table), perhaps
due to the lower number of informative sites. Again, a similar picture emerges for
with 25% reduction in power for 12 Kb compared to 3 Kb (S1 Table; S3-S8 Figs).
In summary, the power of the NCD statistics grows with the age of the balanced
polymorphism and the narrowness of the analyzed window. These analyses suggest
that the NCD statistics are well powered to detect balancing selection that started at
least 3 myr ago in windows of 3 Kb centered on the selected site (S1 Table) and we
Demography Power is similar for samples simulated under the African (average
power of NCD_0.5 of 0.86) and European (average power of NCD_0.5 of 0.87) demographic
scenarios for both NCD1_0.5 and NCD2_0.5, and drastically lower for a population under the
demographic model for Asians (average power of NCD_0.5 of 0.48; S3-S8 Figs, and S1 Table).
Similar trends are observed for NCD_0.4 (75% average reduction in Asia when compared with
Africa) and NCD_0.3 (92% reduction). One explanation for the lower power under the Asian
demographic model is the stronger effect of random genetic drift in this population due
to its lower Ne (Gutenkunst et al., 2009; Gravel et al., 2011), which affects both the SFS of
and those under balancing selection (reducing the efficacy of selection and putatively
increasing the dispersion from the balanced frequency equilibrium). We thus focused
our subsequent analyses on the African and European populations, for which power
was high and comparable (thus allowing fair comparisons between these geographic
regions).
tics have high power: on average 0.86, 0.79, and 0.70 for feq = 0.5, 0.4, and 0.3, respec-
tively (S1 Table). Selection with feq = 0.2 resulted in low power across all parameters and
In practice, though, one does not have a priori knowledge about the equilibrium
frequency of balanced polymorphisms. We thus explored the power of NCD when the
simulated equilibrium and the target frequencies differ. The power to detect LTBS is
very high for NCD2(0.5) and NCD1(0.5), even when selection is simulated with other feq
values (average NCD(0.5) of 0.79, Table 1, S3-S8 Figs, and S1 Table) and similarly high for
Table 1. Power for simulations under the African and European demographic
models
Tbs, time since onset of balancing selection (in millions of years); feq, frequency
equilibrium in the simulations; tf, target frequency. Power values are for a false
positive rate of 0.05, for simulations of the African and European demographic
scenarios, L = 3 Kb.

           Africa                                 Europe
           NCD2 (tf)          NCD1 (tf)           NCD2 (tf)          NCD1 (tf)
Tbs  feq   0.5   0.4   0.3    0.5   0.4   0.3     0.5   0.4   0.3    0.5   0.4   0.3
5    0.5   0.96  0.94  0.84   0.93  0.91  0.39    0.97  0.95  0.83   0.92  0.85  0.20
5    0.4   0.94  0.94  0.89   0.89  0.89  0.67    0.95  0.94  0.91   0.85  0.82  0.59
5    0.3   0.90  0.91  0.93   0.72  0.80  0.84    0.84  0.85  0.89   0.47  0.57  0.74
3    0.5   0.91  0.88  0.68   0.86  0.80  0.24    0.93  0.89  0.68   0.81  0.69  0.14
3    0.4   0.88  0.86  0.76   0.78  0.78  0.56    0.89  0.87  0.79   0.74  0.71  0.46
3    0.3   0.75  0.77  0.81   0.56  0.64  0.71    0.73  0.76  0.79   0.39  0.48  0.63

Conversely, power to detect LTBS with feq = 0.4 is similar with NCD(0.5) or NCD(0.4)
(Table 1 and S1 Table), but for feq = 0.3 power is 10% higher for NCD(0.3) than for
NCD(0.5) (Table 1 and S1 Table). Therefore, NCD statistics can be well powered both
when the frequency of the balanced polymorphism is the same as the target frequency,
and when it is not (as expected given correlations among these statistics; S9 Fig). Nev-
ertheless, the closer tf is to feq, the higher the power to identify targets of LTBS (Table
1). Thus, information is gained by calculating NCD with different target frequencies.
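
To make the role of the target frequency concrete, the sketch below computes an NCD-style value for one window at several tf values. It assumes that NCD is the root-mean-square deviation of the minor allele frequencies of the informative sites from tf, and that fixed differences enter NCD2 as sites with frequency 0; the function name, its arguments and the example frequencies are illustrative and are not taken from the implementation used in this chapter.

```python
import math

def ncd(mafs, tf, n_fixed_diffs=0):
    """Root-mean-square distance of site frequencies from the target frequency tf.

    mafs: minor allele frequencies of the SNPs in one window.
    n_fixed_diffs: number of fixed differences to the outgroup, each treated as a
        site with frequency 0 (assumption). With 0 the value is NCD1-like,
        with a positive count it is NCD2-like.
    """
    freqs = list(mafs) + [0.0] * n_fixed_diffs
    if not freqs:
        raise ValueError("window has no informative sites")
    return math.sqrt(sum((p - tf) ** 2 for p in freqs) / len(freqs))

# Hypothetical window whose SNPs sit close to a frequency of 0.5.
window_mafs = [0.48, 0.45, 0.50, 0.42]
for tf in (0.5, 0.4, 0.3):
    ncd1_like = ncd(window_mafs, tf)
    ncd2_like = ncd(window_mafs, tf, n_fixed_diffs=2)
    print(tf, round(ncd1_like, 3), round(ncd2_like, 3))
```

For this hypothetical window the value is smallest at the tf closest to the SNP frequencies, which is the behaviour that makes running the statistic at several target frequencies informative.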
NCD implementations. The power for NCD2 is greater than for NCD1, for all tf:
feq = 0.5 (average power of 0.94 for NCD2(0.5) and 0.88 for NCD1(0.5)), feq = 0.4 (0.93 for
NCD2(0.4) and 0.80 for NCD1(0.4)) and feq = 0.3 (0.85 for NCD2(0.3) and 0.73 for NCD1(0.3))
(Table 1, Fig 3, S1 Table). The gain in power that occurs when using information on FDs
was also explored by jointly considering NCD1 with HKA (see below).
D (TajD) and HKA (Hudson et al., 1987; Tajima, 1989), a pair of composite likelihood-
Figure 3. ROC curves for comparison between NCD2(0.5) and other tests
Power to detect LTBS for simulations where the balanced polymorphism was modeled to achieve frequency equilib-
rium (feq) of A) 0.3, B) 0.4, and C) 0.5. Plotted values are for African demography, Tbs = 5 myr, L = 3 Kb, except for
T1 and T2 (DeGiorgio et al., 2014), which were evaluated based on 100 SNPs in 15 Kb simulated windows following
the original publication (see Methods). Target frequency for NCD1 and NCD2 matches the simulated feq. Similar
results are observed for European demography (Fig S7).
based measures recently developed by DeGiorgio et al. (2014) termed T1 and T2 (T1
only looks at the SFS, T2 includes information on FDs), and NCD1 and NCD2. We
additionally explored the power of a composite statistic, where the p-value was jointly
computed as a function of NCD1 and HKA statistics (NCD1+HKA), with the goal of
quantifying the contribution of FDs to NCD power (see Methods). For simplicity we
considered Tbs = 5 myr and 3 Kb for all comparisons. The only exceptions are T1 and
T2, for which a larger window size (100 informative sites) was used, following DeGior-
gio et al. (2014), to compare the methods using their optimal window size.
When feq = 0.5, NCD2(0.5) has the highest power: 0.96 (0.94 for T2, 0.93 for TajD, 0.91
for NCD1(0.5)+HKA, 0.78 for HKA, and 0.5 for T1; Fig 3). The gain in power provided by
NCD2(0.5) is much higher when feq departs from 0.5, where NCD2 clearly outperforms
all other tests if tf = feq (Fig 3). For feq = 0.4, the power of NCD2(0.4) is 0.94 (0.9 for
TajD, T2, and NCD1(0.5)+HKA; 0.76 for HKA, and 0.58 for T1; Fig 3 and Table 1) and
for feq = 0.3 NCD2(0.3) power is 0.91 (0.89 for T2, 0.85 for NCD1(0.5)+HKA, 0.75 for TajD,
0.73 for HKA, 0.59 for T1; Fig 3). These patterns are consistent in both African and
European simulations (Fig 3, Table 1, S10 Fig). Thus, the power of NCD2 to detect LTBS
is greater than or comparable to that of TajD, HKA, T1 and T2, and a combined test of
NCD1+HKA, for African and European scenarios (Fig 3, Table 1, S7 Fig). Notably, as the
simulated frequency equilibrium moves away from 0.5, its advantage over TajD increases (Fig 3).
methods tested (Table 1, Fig 3) and it reaches very high power when tf = feq (higher
than 0.9 for 5 myr alleles and higher than 0.79 for 3 myr alleles). While the feq of a
putatively balanced allele is unknown, the simplicity of the statistic makes it trivial to
run it for several tf values. Importantly, power was very similar under the African and
European models (Table 1, Fig 3, S10 Fig). Because NCD2 outperforms NCD1 we
recommend using NCD2 in humans, although NCD1 is a good choice when outgroup
data is lacking.
We aimed to identify regions of the genome under LTBS. Based on the power analyses,
we used NCD2(0.5), NCD2(0.4) and NCD2(0.3), which are well powered to detect LTBS and
do not provide fully overlapping sets of candidate windows. We calculated these statistics
for 3 Kb windows (1.5 Kb step size) and tested for significance using two complementary
approaches: one based on neutral expectations, and one based on the empirical
data. We analyzed genome-wide data from two African (YRI: Yoruba in Ibadan,
Nigeria; LWK: Luhya in Webuye, Kenya) and two European populations (GBR: British
from England and Scotland; TSI: Toscani in Italy) (Abecasis et al., 2012). We filtered for
mappability, segmental duplications, and orthology with the outgroup genome (chim-
In addition, because windows with a low number of IS have high NCD2 variance
due to noisy SFS (S18 Fig), a pattern also observed in neutral simulations (S11 Fig), we
excluded windows with fewer than 19 and 15 IS in African and European populations,
respectively. This filter removed only 4% of the windows while keeping a set of windows
for which NCD2 values remain quite stable regardless of the number of IS (S11-S18
Figs). After all filters, the genomic coordinates defining the windows were identical in
throughout the genome (Table 2, S13 Fig). These windows overlapped 18,308 protein-
coding genes (95% of all human autosomal genes). For each window we calculated a
p-value that reflects the quantile of its NCD2 value, when compared with the NCD2
each population, and conditioned on the same number of IS (to account for the higher
Over all populations, between 4,826 and 5,910 (0.30-0.36%) of the genomic windows
have a lower NCD2(0.5) value than any of the 10,000 neutral simulations (p-value <
0.0001, Table 2). The proportions were very similar for NCD2(0.4) and NCD2(0.3): between
sets, whose patterns we cannot explain under neutrality, as the significant windows.
Due to our criterion for defining significance, all significant windows had an identi-
cal p-value (p < 0.0001). To quantify the degree of departure from neutral expectations,
NCD2 was compared to the mean of NCD2 values for the 10,000 simulations with the
same number of IS. We defined, for each genomic window, Ztf (Equation 4) as the number
of standard deviations that its NCD value lies from the neutral
tify the most extreme signatures of LTBS, we selected the windows with the 0.05% most
extreme Ztf values for each population and tf value (resulting in 816 outlier windows),
which we refer to as the outlier windows (Table 2). The empirical outlier windows,
which represent a smaller and more conservative set of genes, are almost entirely a sub-
set of the significant windows (Methods). Below, we discuss properties of the union of
all significant (or outlier) windows (Table 2) taken over all of the target frequencies
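
The two significance measures described above can be sketched as follows, under stated assumptions: the p-value is taken as the fraction of neutral simulations (matched for number of IS) whose NCD2 value is at or below the observed one, and Ztf as the observed value minus the neutral mean, divided by the neutral standard deviation. The function name, the use of the sample standard deviation and the toy values are illustrative choices, not the code used for the analyses reported here.

```python
from statistics import mean, stdev

def empirical_p_and_z(observed_ncd, neutral_ncds):
    """Compare one window with neutral simulations that have the same number of IS.

    Returns (p, z):
      p: fraction of neutral values <= observed (0.0 means the window is lower
         than every simulation, i.e. p < 1 / number of simulations).
      z: standardized distance from the neutral mean (negative = lower NCD).
    """
    if len(neutral_ncds) < 2:
        raise ValueError("need at least two neutral simulations")
    p = sum(v <= observed_ncd for v in neutral_ncds) / len(neutral_ncds)
    z = (observed_ncd - mean(neutral_ncds)) / stdev(neutral_ncds)
    return p, z

# Toy neutral distribution for windows with a given number of IS (hypothetical).
neutral = [0.40, 0.42, 0.44, 0.46, 0.48]
p, z = empirical_p_and_z(0.30, neutral)
print(p, round(z, 2))
```

Because every significant window has the same p-value under this scheme, only the standardized distance can rank windows, which is why it is used to define outliers.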
The significant windows are extremely rich both in polymorphic sites (Fig 4) and num-
ber of intermediate-frequency alleles (Fig 5), with the shape of the SFS depending on
the tf at which they reach significance. These patterns are not unexpected, since they
were used to identify these windows. Nevertheless, they show that neither SNP den-
sity nor the SFS dominate the selection process, as significant windows are unusual in
both aspects. Also, the striking differences with respect to the background empirical
distribution, combined with the fact that no neutral simulation had lower NCD value
mapping errors due to genomic duplicates (e.g. we removed positions with poor map-
pability, and those that fall within tandem repeats and segmental duplications; S13 Fig
and Methods). Also, we found that the significant windows have extremely similar
coverage to the rest of the genome (S14 Fig), showing that they are not enriched in
mechanisms other than balancing selection: introgression and gene conversion. The
outlier windows are significantly depleted of SNPs annotated as introgressed from
Neanderthals (S17 Fig, S1 Text), and significant windows do not differ from controls in
the proportion of introgressed SNPs, showing that introgression is not a confounding
mechanism leading to significant or outlier regions (S7 Fig, S1 Text). Finally, the
olfactory receptors (S16 and S19 Figs, S1 Text). Thus, the significant and outlier win-
populations that are not likely to be driven by introgression or gene conversion (S16,
Significant and outlier windows were not randomly distributed across the genome.
Chromosome 6 is the most enriched for signatures of LTBS, contributing 11.2% of sig-
nificant windows genome-wide (24.5% of outlier windows) while having only 6.4% of
analyzed windows (S12 Fig). This is due to the presence of the MHC region, rich in
genes with well-supported evidence for balancing selection. In fact, several HLA genes
known to be targets of LTBS appear among our outlier windows, i.e., the strongest
candidates. For the outlier windows, 10 HLA genes are found in all four populations,
most of which have prior evidence for balancing selection (Table 3): HLA-B, HLA-C,
HLA-G (DeGiorgio et al., 2014; Liu et al., 2006; Meyer et al., 2006; Sanchez-Mazas, 2007;
Although the union of significant windows considering all tf values spans on average
only 0.51% of the genome (Table 2), 37.8% of those windows overlap protein-coding
focused on protein-coding genes that contain at least one significant or outlier window
(“U” set, Table 2), and investigated the functional categories they belong to.
genes (S2 Table), 22 of which are shared by at least two populations. Three significant
categories are driven by olfactory receptor genes (OR), which we could not rule out
as artifacts (S1 Text), although they do not appear in the more conservative outlier set
of genes (S3 Table). Among the remaining categories, at least half of them are directly
related to immune response (e.g. “type I interferon signaling pathway”, “MHC class
involved in antigen presentation by MHC molecules (e.g. “MHC class I protein com-
plex”, “MHC class II protein complex”, “peptide antigen binding”, among others). For
the outlier genes, 27 enriched categories were found, at least 18 of which are immune-
related, and 10 of which are directly related to antigen presentation by MHC molecules
(S3 Table).
When classical HLA genes are removed from the sets, no categories remain enriched
for the outlier genes (S3 Table; but note that this resulted in a small set of 162-192 genes
et al. (2014). For the larger set of significant genes, the immune-related category
“peptide antigen binding” remains significant in LWK, driven by TAP1, TAP2, and HLA-G,
all previously reported candidate targets of balancing selection (Cagliani et al., 2011;
Tan et al., 2005). These results show the strong influence of the classical HLA genes
on signatures of LTBS. However, “extracellular region” and “keratin filament” are
enriched in the set of significant genes, in several populations, even after the removal
of HLA genes, in agreement with previous findings indicating that balancing selection
targets genes related to extracellular and cell-surface proteins (Key et al., 2014b).
Nevertheless, for the significant genes only about half of the immune-related en-
riched categories are directly linked to peptide presentation by MHC molecules. Other
riched after the removal of HLA genes (S2 Table), are not strictly composed of HLA
genes.
To gain more insight into the importance of non-HLA immune-related genes
to the outlier set of genes, we verified that the GO categories of 62 outlier genes shared
by at least two populations (Table 3) are immune-related, although only 10 HLA genes
compose that set (S8 Table). This shows that not only HLA-related categories are
enriched among the significant genes, indicating that immune response, in a broader sense,
Most windows were found to be significant (S20A Fig) or outliers (S20B Fig) in multiple
populations. On average 81% of significant windows in any one population are shared
between any two populations, and 69% of the windows are shared between two pop-
ulations within the same continent (66% between African and 71% between European
populations, see S20A Fig). For the more restrictive set of outlier windows, the
sharing increased to 87% between any two populations, and 78% within continent (75% of
African windows were shared, and 80% of European windows; S20B Fig). There was also similar
overlap exons. On average, 31.2% of the windows that overlap protein-coding genes
overlap their exons, very similar to the 30.8% for the background distribution (S15
When these sites are divided as synonymous (putatively neutral) and non-synonymous,
significant windows are enriched for non-synonymous SNPs when compared with con-
trols sampled from the background distribution (Fig 6A,C). This is also true when only
intermediate frequency alleles are considered (MAF>0.20, Fig 6B,D). Taken together,
our results indicate that balancing selection is associated with regions of increased
non-synonymous polymorphism.
Regulatory function It has been suggested that balancing selection may have a par-
ticularly important role in maintaining genetic diversity that affects gene expression
(Leffler et al., 2013; Savova et al., 2016). Because the identification of significant and
test the hypothesis that LTBS has preferentially targeted regulatory regions. Signifi-
cant windows were enriched in SNPs that have regulatory functions (Fig 7A, p<0.001),
allele frequency must be accounted for. When only SNPs with intermediate frequency
eQTLs (Fig 7D); rather, in most populations there is a significant depletion of eQTLs
ulatory regions when considering a more inclusive category that depends exclusively
see Methods; Fig 7B-E). Regardless of allele frequency, SNPs in significant windows are
7, Fig 7C-F). Although the annotation of each of these RegulomeDB categories is not
perfect, these results suggest that balancing selection does not preferentially target, in
Finally, in agreement with Savova et al. (2016) we find a modest yet significant en-
richment for genes with mono-allelic expression (MAE) among the outlier genes shared
by at least two populations (Table 3): 26% of them are MAE genes, while only 22% of
The signatures of long-term balancing selection may not be shared between popula-
tions due to changes in selective pressure, which may be important during fast, local
adaptation (Filippo et al., 2016). Still, loci with signatures across human populations are
more likely to represent old, stable events of balancing selection in human populations.
We considered as “African” those outlier genes resulting from the union of outlier
windows for all tf values (Table 2) that are shared between YRI and LWK (but neither or
only one of the European populations), and as “European” those shared between
GBR and TSI (but neither or only one of the African populations). Those shared by all
four populations were considered as “African and European” (Table 3). Importantly,
these designations do not imply that the genes referred to as “African” or “European”
in Table 3 are putative targets of LTBS for only one of the continents, as there are power
differences between Africa and Europe, particularly for tf = 0.3 (Fig 3, Table 1, S10 Fig),
but rather serve the purpose of quantifying the extent of sharing across populations.
The combined set of “African” (69 genes), “European” (71 genes) and “African and
European” (75 genes) contains 213 genes (~1.1% of all queried genes) (Table 3). When
applying the same criteria for the significant windows, the set contains 1,470 genes (~8%
of all queried genes, see S2 Text and S8 Table). We focus the following discussion on the
set of 213 outlier genes, since they constitute the most restricted set. Of these, 61 (29%)
(Andrés et al., 2009; DeGiorgio et al., 2014; Leffler et al., 2013), and others were detected
in individual gene studies (Table 3 and Discussion). Overall, about 70% of the outlier
genes reported here have not been reported as having signatures of LTBS in previous
studies.
A given window can be significant for more than one tf value. Because
our simulations suggest that the tf is informative about the frequency of the balanced
allele, we use the lowest Ztf to assign a tf value to each window (for a given
population), providing information on the nature of the SFS skew (S7 Table). For 50% of the
outlier windows, the assigned tf is 0.3, and 36% have 0.5 as assigned tf; only 14%
Based on the p-values of the most extreme window for each of the outlier genes, we
were able to rank them. The top 10 genes are highlighted in Table 3. Among the top
ten candidates, two (HLA-DRB5 and HLA-DQA1) are related to adaptive immunity,
two (PCDH15 and NDUFA10) are related to sensory perception of mechanical stimulus,
including sound, and two (PROKR2 and CPE) are related to the neuropeptide signaling
pathway. Six of the top 10 genes (PROKR2, HLA-DQA1, CPE, HLA-DRB5, LUZP2,
and MYO3A) have been previously described as having signatures of LTBS in humans
(Table 3). The four among them that are novel (B4GALNT2, C1orf101, NDUFA10, and
Table 3. Outlier genes. All reported genes are overlapped by at least one
outlier window for at least one tf value (Table 2 and Methods). Outliers for both
African populations (“African”), for both European populations (“European”)
or for all four populations (“African & European”). A version of this table with
p-values and assigned tf values is provided in S7 Table. When a gene has been
previously reported as having signatures of LTBS, the reference is provided:
[A], Andrés et al. (2009); [D], DeGiorgio et al. (2014); [L], Leffler et al. (2013); [S],
reported as being under balancing selection in Savova et al. (2016); [T], Tan et al.
(2005). * top 10 most highly ranked genes (for YRI).
Figure 5. Site frequency spectra
SFS in A) LWK population and B) GBR population of background windows (all windows in chromosome 1, in grey),
significant windows for NCD2(0.5) (blue), significant windows for NCD2(0.4) (orange), and significant windows for
NCD2(0.3) (pink).
Table 2. Significant and outlier windows and protein-coding genes across populations
Significant and outliers, see main text. U, union of all windows found with all target frequencies (tf).
A total of 1,631,372 windows were queried in each population.

Population  tf    Significant windows  Outlier windows  Significant genes  Outlier genes
LWK         0.3   5,620                816              1,037              128
LWK         0.4   5,516                816              1,003              130
LWK         0.5   4,826                816              878                147
LWK         U     7,770                1,139            1,321              202
YRI         0.3   6,137                816              1,129              124
YRI         0.4   5,919                816              1,044              120
YRI         0.5   5,213                816              928                131
YRI         U     8,436                1,142            1,400              187
GBR         0.3   5,465                816              967                107
GBR         0.4   6,312                816              1,025              114
GBR         0.5   5,904                816              971                123
GBR         U     8,526                1,131            1,321              172
TSI         0.3   5,464                816              983                116
TSI         0.4   6,183                816              1,047              121
TSI         0.5   5,801                816              1,009              137
TSI         U     8,395                1,163            1,378              189

Figure 7. RegulomeDB enrichment analysis for scores 1 and 7
Proportion of SNPs in (A,D) RegulomeDB category 1 (eQTLs), (B,E) RegulomeDB categories 1 and 2 (overlapping a
putatively regulatory site) or (C,F) RegulomeDB category 7 (no evidence for regulatory role) for (A,B,C) all SNPs or
(D,E,F) SNPs with intermediate MAF in both the significant and background windows. In gray, distribution obtained
from 1,000 samplings of a set of windows from the background (see Methods). In orange, significant windows.
Discussion
NCD Method
We present two new summary statistics, which are simple and fast to compute and to
run, and which allow, unlike classical approaches such as Tajima’s D (Tajima, 1989) or
the Mann-Whitney U for comparing local and global SFS (Andrés et al., 2009; Nielsen et
al., 2009), explicit exploration of different target frequencies - a property also shared by
the T1 and T2 tests (DeGiorgio et al., 2014), albeit in a likelihood framework. We show
that the NCD statistics are well powered to detect balancing selection for a complex
The NCD statistics can be used to detect selected regions using null distributions
not expected under neutrality) or by an empirical outlier approach, which allows the
history. Furthermore, NCD1 can be used in the absence of a close outgroup species,
which extends further the set of possible species. This allows exploring the genomes of
Many previous and well-supported targets of balancing selection are present in our
list of selected genes, but approximately 70% of the protein-coding genes we identify
On average 0.51% of the windows show, per population, signatures of LTBS that are
tures of LTBS with neutral simulations. We showed that these windows are unlikely to
Although the total proportion of the genome under balancing selection may be
small, our results show that many genes contain putatively selected regions. For ex-
ample, under a restrictive criterion of being significant in at least two populations from
tain an outlier window. Because our statistic is powerful for detecting selection in rel-
atively narrow genomic regions (3kb), it is possible that we are identifying signatures
that would not be found when analyzing properties of entire genes or larger genomic
regions.
Long-term balancing selection is known to maintain both coding – e.g., HLA-B, HLA-C,
ABO (Hughes and Nei, 1988; Hughes and Nei, 1989; Segurel et al., 2012; Ségurel et al.,
2013) – and regulatory diversity – e.g. HLA-G, UGT2B (Tan et al., 2005; Sun et al., 2011)
(we confirm these targets in Table 3). We are in a particularly good position to quantify
humans. We found no excess of eQTLs within selected windows once the frequency of
the alleles is accounted for, and also no evidence for enrichment of regulatory function.
A recent study suggested that there is enrichment for genes with mono-allelic ex-
pression (MAE) among those with signatures of balancing selection (Savova et al.,
2016). In agreement with this observation, we found a small but significant enrichment
for MAE genes among the outlier genes reported in Table 3 (p = 0.03, Fisher Exact Test).
We note that this overlap would be even greater if HLA genes had not been excluded
from the MAE gene list provided in Savova et al. (2016). This result is consistent with the
claim for a biological link between balancing selection and MAE (Savova et al., 2016).
Nevertheless, it remains unclear whether the detection of MAE genes is correlated with
allele frequency, as is the case for eQTLs, which could explain this enrichment.
genes more often than expected by chance, the proportion of the windows overlap-
ping genes that overlap exons is the same for significant and background windows,
showing that there is a depletion of introns in the significant windows. Finally, signif-
icant and outlier windows show an enrichment for nonsynonymous SNPs. This result
is compatible with two scenarios: (a) direct selection on multiple coding sites or (b) an
Fay, 2011).
For both new and previously known targets, an advantage of our method is that it
provides an assigned target frequency for each window, and consequently information
on the shape of its SFS (Table 3, S7 Table). In some candidate genes – the HLA genes
– we know that LTBS has targeted not one, but several sites (Hughes and Nei, 1988;
Hughes and Nei, 1989). In this case, the theoretical expectation of the shape of the re-
sulting local SFS is unclear. Nevertheless, in loci with a single balanced polymorphism,
which we assume may be common outside of the MHC, our simulations suggest that
the assigned tf can be informative about the frequency of the balanced allele. Our
results indicate that a large proportion of significant windows (50%) have minor allele
frequencies which lie closer to the target allele frequency of 0.3 than to 0.5, as would
be expected, for instance, under asymmetric overdominance. This highlights the im-
balanced polymorphism.
Whereas studies of positive selection show a remarkably low overlap with respect to
the genes they identify, with Akey (2009) reporting that only 14% of protein-coding loci
appear in more than a single study, we identified 61 of the outlier genes (29%) with
al., 2009; DeGiorgio et al., 2014; Leffler et al., 2013) (Table 3), and a few other genes
detected in individual gene studies (Tables 3 and S7). This is a reasonable overlap as
Many candidates for balancing selection from previous studies are also identified
here. For example, Leffler et al. (2013) identified 6 genes with particularly strong ev-
PROKR2, IGFBP7; Table 3). Of the 5 genes identified by both Andrés et al. (2009) (an
exon-based approach) and DeGiorgio et al. (2014) (a genome-wide study), 4 are among
our outlier genes (HLA-B, CDSN, LGALS8, SLC2A9; Table 3), and one (RCBTB1) among
our significant genes (S8 Table). We find 2 additional genes from Andrés et al. (2009)
(PREX2 and TNS1; Table 3) and 53 genes from DeGiorgio et al. (2014) (Table 3).
Other outlier genes have prior evidence for balancing selection in candidate gene
studies. Among the oldest known cases of genetic polymorphisms in humans are the
blood-group genes (Segurel et al., 2012; Ségurel et al., 2013), including the ABO gene,
which we also identify (Table 3). TRIM5 has prior evidence of balancing selection in
humans and Old World Monkeys (Cagliani et al., 2010) and OAS1 since the split be-
tween humans, chimpanzees and gorillas (reviewed in Fijarczyk and Babik, 2015); both
enzyme that metabolizes steroid hormones and bile acids and is associated with
predisposition to breast cancer (Sun et al., 2011) – and HLA-G – a non-classical HLA gene that
has tightly-regulated expression patterns between fetal and adult life (Tan et al., 2005).
Among the top 10 ranked genes (which we manually checked for undetected dupli-
NDUFA10, and PCDH15. NDUFA10 encodes a subunit of NADH dehydrogenase (complex I), the
largest among the complexes of the electron-transport chain, and is associated with neu-
ticular tissues (Petit et al., 2015). PCDH15 is a protocadherin protein that is essential
for normal retinal and cochlear function. Interestingly, this gene shows strong signa-
tures of positive selection in East Asian populations (Sabeti et al., 2007). Moreover, two
other outlier genes (not among the top 10 ranked genes) are members of the beta-globin
cluster and have evidence for recent positive selection in Andean (HBE1, HBG2) and Ti-
betan populations (HBG2) (Bigham et al., 2010; Rottgardt et al., 2010; Yi et al., 2010). It
is plausible that these genes have been under LTBS in Africa and Europe, and recently
in selective regime recently detected for other loci (e.g. Filippo et al., 2016).
Finally, B4GALNT2 encodes a blood-group enzyme that has evidence for trans-
species polymorphism maintaining two classes of alleles with high divergence, which
2011). Moreover, variation in this gene in mice seems to be associated with the pres-
ence of Helicobacter species in the gut (Staubach et al., 2012; Ségurel et al., 2013). Finally,
a deletion encompassing the first exon of this gene has been described and it is possible
that it became fixed in chimpanzees by positive selection (Perry et al., 2008). To date,
Conclusions
We have developed a tool to identify genomic regions under long-term balancing selec-
tion that is simple, fast, and has a high degree of sensitivity for different frequencies of
the balanced polymorphism. The NCD statistics can be applied to single loci or to the
whole genome, in species with sufficient demographic information and those without
it, and both in the presence and in the absence of an appropriate outgroup.
Our analyses indicate that, in humans, balancing selection may be shaping variation
in about 0.5% of the genome including at least 1% of the human protein-coding genes.
Because so many genes are involved and, although they mostly affect immunity, they
also affect other pathways and phenotypes, we provide evidence that balanced poly-
tion, including many completely novel targets. These shall be further investigated, for
example, to infer the selective force maintaining the balanced polymorphisms, to de-
about 80% of windows are shared across populations, the remaining show signatures
tative influence in subsequent local adaptations through shifts in selective pressure (as
Simulations
Performance of NCD2 and NCD1 was evaluated by extensive simulations with MSMS
(Ewing and Hermisson, 2010). The simulations followed a realistic demographic model
for African, European and East Asian human populations described in Gravel et al.
(2011), including the effective population sizes (Ne) and migration rates. A generation
time of 25 years, a mutation rate of 2.5 × 10^-8 mutations per site (Nachman and Crowell,
2000) and a recombination rate of 1 × 10^-8 were used. The human-chimpanzee split at
6.5 million years ago was added to the model. This was our null demographic model
For simulations with selection, a balanced polymorphism was added to the center
overdominance model, for a bi-allelic locus with alleles A and B, the relative fitnesses
of the three genotypes are: w_AA = 1 − s1, w_AB = 1, and w_BB = 1 − s2, where s1 and
s2 are the selection coefficients in the two homozygous genotypes, and the frequency
where Ne is the effective population size used to scale the coalescent simulations and s
is the selection coefficient for the mutant allele (B). The selection coefficient (s) was set
to 0.01 (the influence of s is modest once the frequency equilibrium is reached, as in the
case of LTBS). We considered four frequency equilibria: feq = 0.2, 0.3, 0.4, 0.5. Simulations
with and without selection were run for different sequence lengths (L), such that
L = 3, 6, 12 Kb, and times of onset of balancing selection (Tbs), such that Tbs = 1, 3, 5 myr
(Fig 2).
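
For the overdominance fitness scheme given above (w_AA = 1 − s1, w_AB = 1, w_BB = 1 − s2), standard deterministic theory places the equilibrium frequency of allele B at s1 / (s1 + s2). The sketch below checks this numerically by iterating the one-locus selection recursion. It is a deterministic illustration only: it ignores drift, it is not the coalescent simulation run with MSMS, and the function names and the selection coefficients in the example are chosen for illustration rather than taken from the simulations.

```python
def equilibrium_freq_b(s1, s2):
    """Equilibrium frequency of allele B when w_AA = 1 - s1, w_AB = 1, w_BB = 1 - s2."""
    return s1 / (s1 + s2)

def next_freq_b(q, s1, s2):
    """Frequency of allele B after one generation of viability selection."""
    p = 1.0 - q
    mean_fitness = p * p * (1 - s1) + 2 * p * q + q * q * (1 - s2)
    # B alleles are carried by AB heterozygotes (fitness 1) and BB homozygotes.
    return q * (p + q * (1 - s2)) / mean_fitness

def iterate(q0, s1, s2, generations):
    """Apply the recursion repeatedly, starting from frequency q0."""
    q = q0
    for _ in range(generations):
        q = next_freq_b(q, s1, s2)
    return q

# Illustrative coefficients chosen so that the expected equilibrium is 0.3.
s1, s2 = 0.03, 0.07
print(equilibrium_freq_b(s1, s2))
print(iterate(0.01, s1, s2, 5000))   # starting rare
print(iterate(0.90, s1, s2, 5000))   # starting common
```

Both starting points approach the same frequency, which illustrates why the equilibrium depends on the ratio of the two homozygote disadvantages rather than on the starting frequency.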
Power analyses
For each set of parameters, 1,000 neutral simulations were compared to 1,000 match-
ing simulations with balancing selection for evaluation of the performance of the NCD
statistics. The relationship between the true positive (TPR, the power of the statistic)
and false positive (FPR) rates is represented through receiver operating characteristic
(ROC) curves. For comparisons between statistics and across demographic scenarios,
NCD implementations (NCD1 and NCD2) and other parameters, the power at the FPR
= 0.05 threshold was considered. When comparing performance under a given
condition (e.g. L values), power values were averaged across implementations (NCD1
and NCD2), demographic scenarios (Africa, Europe, Asia), and the other parameters,
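The power calculation at a fixed false positive rate can be sketched as follows. The example assumes that lower values of the statistic indicate balancing selection (as is the case for NCD), takes the cutoff from the neutral simulations, and reports the fraction of selection simulations falling below it. The function name and the tie-handling rule are assumptions made for illustration.

```python
def power_at_fpr(neutral, selected, fpr=0.05):
    """Fraction of selected simulations below the neutral cutoff at the given FPR."""
    ordered = sorted(neutral)
    # Number of neutral values allowed to fall below the cutoff.
    k = int(fpr * len(ordered))
    if k >= len(ordered):
        return 1.0
    # Using a strict comparison against the (k+1)-th smallest neutral value
    # keeps the realized false positive rate at or below fpr, even with ties.
    cutoff = ordered[k]
    return sum(value < cutoff for value in selected) / len(selected)
```

With 1,000 neutral and 1,000 selection simulations per parameter set, this would give one power value per combination of statistic, demographic scenario and parameters, which can then be averaged as described.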
The same simulations and procedures were used to evaluate the comparative performance
of the different methods. NCD2 and NCD1 were run on 3 kb windows, with
L = 3 kb and Tbs = 5 myr. They were compared with Tajima’s D (TajD), HKA (Tajima,
1989; Hudson et al., 1987), and the combined NCD1+HKA test (a joint distribution of
the two summary statistics) also in 3kb windows. DeGiorgio et al. (2014) report the
performance of T1/T2 based on windows of 100 informative sites upstream and down-
stream of the target site (on average 13.7 Kb in YRI and 14.7 Kb in CEU). Therefore, we
divided 15kb simulations in windows of 100 informative sites and calculated T1 and
T2 using BALLET (DeGiorgio et al., 2014). We selected the highest T1 or T2 value from
Data We analyzed genome-wide data from the 1000 Genomes Project phase I (Abecasis
et al., 2012). SNPs detected only in the high-coverage exome sequencing of the 1000G
were not considered: because of the difference in coverage between the low-coverage
and the exome data, including them would give coding regions a higher SNP density,
potentially biasing our results.
The genomes of individuals from African and European populations were queried
(excluding the recently admixed ASW population), but not those from Asian populations,
because the NCD statistics perform worse for those populations (see “Demography” in the
Results section). We considered two African populations (YRI and LWK), and two European
populations (GBR and TSI). For comparisons between continents only two European
population (Key et al., 2014a). We used the minor allele frequency (MAF) in the NCD
statistics calculations to analyze the folded SFS (Fig 1). This allows us to retain SNPs
Filtering Genome-wide analyses require extensive filtering to avoid including errors
that may bias the results, and we applied several filters to obtain our final dataset
(see Fig S13). We disregarded positions not present in the 50mer CRG Alignability track
(Derrien et al., 2012), which requires that 50 bp segments map uniquely (to only one
region of the genome, allowing up to two mismatches). We filtered out all regions
annotated as segmental duplications (Alkan et al., 2009; Cheng et al., 2005) and
positions in simple repeats detected by Tandem Repeats Finder (Benson, 1999). We also
required that all scanned positions be orthologous to the PanTro2 chimpanzee reference
sequence, because NCD2 includes FDs. After this filtering, NCD2 was calculated for
the remaining windows (1,705,970 windows per population).
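NCD2 is not redefined in this section, so the sketch below rests on an assumption: that the statistic is the root mean squared deviation of minor allele frequencies from the target frequency tf across the informative sites of a window, with FDs entering as sites of frequency 0. The function name and input layout are hypothetical.

```python
import math


def ncd2(snp_mafs, n_fds, tf):
    """Assumed form of NCD2 for one window.

    snp_mafs: minor allele frequencies of the SNPs in the window
    n_fds:    number of fixed differences to the outgroup (frequency 0)
    tf:       target frequency (e.g. 0.3, 0.4 or 0.5)
    """
    n_is = len(snp_mafs) + n_fds  # informative sites = SNPs + FDs
    if n_is == 0:
        raise ValueError("window has no informative sites")
    # Each FD contributes (0 - tf)^2 to the sum of squared deviations.
    total = sum((maf - tf) ** 2 for maf in snp_mafs) + n_fds * tf ** 2
    return math.sqrt(total / n_is)
```

Under this form, a window whose SNPs all sit at the target frequency and that has no FDs scores 0, and the value grows as SNPs move away from tf or as FDs accumulate.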
Because L = 3 kb yielded the best performance for NCD2 in both African and European
simulations (see Results, Figs 3, S1, S2), we queried the human population genetic
data with sliding windows of L = 3 kb and a 1.5 kb step size. Windows were defined in
physical distance because the presence of balancing selection may affect population-based
estimates of recombination rate. Variable positions were categorized as a SNP (if
polymorphic in the sample) or a FD (if all humans differ from the chimpanzee); the only
exception is polymorphic sites where both alleles differ from the chimpanzee
reference, which were counted as both a SNP and a FD. Each population
was queried separately, and NCD2 was calculated considering three target frequencies:
0.3, 0.4, 0.5. For each queried window, the number of SNPs, FDs, IS, SNP/(FD+1) and
Filtering and correction for number of informative sites (IS) Neutrality tests
typically place a threshold on the minimum number of informative sites necessary – e.g.
at least 10 IS in Andrés et al. (2009), and 100 informative sites in DeGiorgio et al. (2014).
human genomic data, and find NCD2 has high variance when the number of IS is low
(S18 Fig). We therefore required that each window have at least 19 (African populations)
or 15 (European populations) IS, and the same sets of windows were queried in all
4 populations (Figs S9 and S10). These values were chosen because beyond them
NCD2 stabilizes (Figs S18 and S19). This final filter resulted in 1,631,372 considered
windows (4% of the queried windows were excluded) (Fig S13). Furthermore, neutral
simulations with different mutation rates were performed in order to retrieve 10,000
simulations for each bin of IS ranging from 4-229 (Africa) and 4-199 (Europe); this range
is compatible with the range seen in the actual data. Next, NCD2 (tf = 0.3, 0.4, 0.5)
was calculated for all simulations. These simulations per bin of IS allowed both the
(obtained based on the empirical distribution). The significant windows were defined
as those for which the observed NCD2_tf value is lower than any
of the 10,000 values obtained for simulations with the same number of IS. Based on this
criterion, all significant windows have the same p-value (p < 0.0001).
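The significance criterion can be sketched as a lookup of neutral simulations by number of IS followed by a comparison with their minimum. The data layout (a dictionary keyed by IS count, and windows as tuples) is a hypothetical choice for illustration only.

```python
def significant_windows(windows, sims_by_is):
    """Return ids of windows whose NCD2 is below every matching neutral value.

    windows:    iterable of (window_id, n_is, observed_ncd2)
    sims_by_is: dict mapping a number of IS to the list of neutral NCD2 values
                simulated for that number of IS (10,000 values per bin in the text)
    """
    significant = []
    for window_id, n_is, observed in windows:
        neutral = sims_by_is.get(n_is)
        if neutral is None:
            continue  # no simulations for this IS bin; window cannot be tested
        if observed < min(neutral):
            significant.append(window_id)
    return significant
```

With 10,000 simulations per bin, a window passing this test has a simulation-based p-value below 1/10,000, which is why all significant windows share the same bound.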
Outlier windows In order to rank the queried windows and apply an outlier approach,
we developed a standardized distance measure between the observed NCD2_tf
(for the queried window) and the mean of the NCD2_tf values for the 10,000 simulations
with the matching number of IS. This distance (Z_tf) is given by:

$$Z_{tf} = \frac{\mathrm{NCD2}_{tf} - \overline{\mathrm{NCD2}}_{IS}}{sd_{IS}} \qquad (4)$$

where Z_tf is the NCD2_tf distance corrected for the number of IS, NCD2_tf is the
NCD2 value for the n-th empirical window, the overlined NCD2_IS is the mean NCD2 for 10,000
neutral simulations with the corresponding value of IS, and sd_IS is the standard deviation
of NCD2 for the 10,000 simulations with the matching number of IS.
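Equation 4 amounts to a z-score against the neutral simulations that share the window's number of IS. A minimal sketch, assuming the sample standard deviation is used (the text does not say whether the sample or population form was taken):

```python
import statistics


def z_tf(observed_ncd2, neutral_ncd2_values):
    """Standardized distance of an observed NCD2 from matching neutral simulations."""
    mean_is = statistics.fmean(neutral_ncd2_values)
    # Assumption: sample standard deviation (n - 1 in the denominator).
    sd_is = statistics.stdev(neutral_ncd2_values)
    return (observed_ncd2 - mean_is) / sd_is
```

Strongly negative values correspond to windows whose NCD2 is far below the neutral expectation for their number of IS.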
This standardized distance measure takes into account the range of possible values
within each IS value, and also the different ranges of values across different target
frequencies. Therefore, Z_tf not only allows the ranking of all windows for a given tf, but
also takes into account the residual effect that the number of IS has on NCD2_tf (even
after filtering for a minimum number of IS, see S11 and S18 Figs) and, finally, allows a
Once the Z_tf scores were calculated, the outlier sets of windows were ranked according
to Z_0.5, Z_0.4, and Z_0.3. An empirical p-value was attributed to each window based on
the Z_tf values, and the windows corresponding to the 0.05% lower tail (816 windows)
of the genomic distribution of Z_tf values were defined as the “outlier windows”. All
outlier windows are contained within the significant windows except four windows in
this property allowed an assignment, for each window, of the tf value that minimizes
NCD2_tf. For the significant and outlier windows, we assigned a tf value as the tf
that yields the lowest empirical p-value for the window (S6 Table). For the outlier genes
in Table 3, a tf value was assigned to a gene by asking: (1) which window overlapping
the gene has the lowest p-value; and (2) which tf value is associated with the p-value
in (1). Thus, the assigned tf value for a gene is the assigned tf for the window that has
the lowest empirical p-value. This was done for each population separately, as seen in
S7 Table.
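The outlier definition and the tf assignment can be sketched as below. The dictionary layouts and function names are hypothetical, and ties are broken by input order, which the text does not specify.

```python
def outlier_windows(z_by_window, tail=0.0005):
    """Ids of the windows in the lower tail of the Z_tf distribution."""
    ranked = sorted(z_by_window, key=z_by_window.get)  # most negative first
    n_outliers = round(tail * len(ranked))
    return ranked[:n_outliers]


def assign_tf(p_by_tf):
    """Target frequency with the lowest empirical p-value for one window."""
    return min(p_by_tf, key=p_by_tf.get)


def assign_gene_tf(windows_p_by_tf):
    """Assigned tf of the overlapping window with the lowest empirical p-value."""
    best_window = min(windows_p_by_tf, key=lambda p_by_tf: min(p_by_tf.values()))
    return assign_tf(best_window)
```

As a consistency check on the numbers above, a 0.05% tail of 1,631,372 windows rounds to 816 windows.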
Coverage To test whether the signatures of LTBS are driven by undetected duplica-
tions, which can produce mapping error and false SNPs, we analyzed modern human
shotgun genome-wide data sequenced to an average coverage per individual
between 20x and 30x (Meyer et al., 2012; Prüfer et al., 2013). We used an independent
dataset because read coverage in the 1000G is low and difficult to interpret, and because
putative duplications that affect the SFS must be at appreciable frequency and should
and two populations per continent: Yoruba and San (Africa), French and Sardinian
For each sample, we retrieved the positions that have coverage higher than the
97.5th percentile of the coverage distribution specific to that sample (termed “high coverage”
positions). For each window in our analysis for signatures of LTBS, we calculated the
proportion of positions having high coverage in at least two samples (pHC), and plotted
the distributions for different NCD2 empirical p-values, i.e., those based on the Z_tf
scores (S14 Fig). Our significant and outlier windows are not enriched in positions with
high coverage in the samples considered herein, but rather the opposite: the significant
windows show a significant reduction in the proportion of positions with high coverage
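The pHC calculation for a single window can be sketched as follows. The per-sample cutoffs are assumed to have been computed beforehand as the 97.5th percentile of each sample's own coverage distribution; the function name and input layout are hypothetical.

```python
def p_hc(sample_depths, cutoffs, min_samples=2):
    """Proportion of window positions with high coverage in >= min_samples samples.

    sample_depths: one list of per-position read depths per sample (equal lengths)
    cutoffs:       one precomputed high-coverage cutoff per sample
    """
    n_positions = len(sample_depths[0])
    n_high_positions = 0
    for pos in range(n_positions):
        # Count the samples whose depth at this position exceeds their own cutoff.
        n_high = sum(
            depths[pos] > cutoff for depths, cutoff in zip(sample_depths, cutoffs)
        )
        if n_high >= min_samples:
            n_high_positions += 1
    return n_high_positions / n_positions
```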
Enrichment Analyses
cludes windows that fall within intronic regions. GO (gene ontology) and PO (phe-
notype ontology) enrichment analyses were performed using the software GOWINDA
(Kofler and Schlötterer, 2012), which avoids common biases that result from gene length
(longer genes with more windows have by chance a higher probability of containing a
candidate window) and/or gene clustering. We ran the analysis in mode: gene and per-
formed 100,000 simulations for FDR estimation. Significant categories were obtained by
considering an FDR<0.05.
GOWINDA was designed for SNP-based analysis so we considered the middle po-
sition of every scanned window as the target site. To correct for this, we extended
used the annotation file (.gtf) and the gene set file for Gene Ontology from Ensembl
arate analyses were performed for each population and considering a combination of
windows); 2) different tf (0.5, 0.4 and 0.3); 3) the union of candidate windows for
all tf; 4) excluding the classical HLA genes with previous evidence of balancing se-
non-synonymous SNPs. Every SNP used in NCD2 calculation and overlapping NCD2-
proach as described above was used to perform the enrichment analysis. To control
for possible effects of allele frequency on the enrichment for specific features such as
eQTLs, a separate analysis only included SNPs at intermediate frequencies (MAF >=
and predicted regulatory elements (Boyle et al., 2012). Specifically, we considered Regu-
+ DNase peak), score 2 (TF binding + matched TF motif + matched DNase Footprint
+ DNase peak), and 1+2 together (sites that are annotated as eQTL and those that are
not), as well as 7 (no regulatory annotation). These represent SNPs with the highest
and the lowest evidence for regulatory function, respectively. We also considered score
2 alone (TF binding + matched TF motif + matched DNase Footprint + DNase peak).
For each candidate window we summed the number of SNPs with each score that overlap
the window. The expectation in the absence of LTBS was obtained by randomly sampling
from the genome the same number of windows as there are with evidence for
LTBS (Table 2). This enabled the calculation of an empirical p-value for the enrichment
of RegulomeDB scores in candidate windows relative to the empirical background
distribution, while accounting for the size of each set of candidate windows (significance
when p < 0.05). Because we considered the sum of scores across all windows,
considering each SNP only once even if it overlapped more than one window, our strat-
only alleles found at intermediate frequencies (MAF >= 20%) as described above.
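The resampling scheme behind the empirical enrichment p-value can be sketched as below. It is a simplified illustration: it works on per-window SNP counts and therefore does not reproduce the rule of counting each SNP only once across overlapping windows, and the add-one correction in the p-value is an assumption rather than something stated in the text. All names are hypothetical.

```python
import random


def empirical_enrichment_p(candidate_total, background_counts, n_windows,
                           n_resamples=1000, seed=0):
    """Empirical p-value for enrichment of a regulatory score in candidate windows.

    candidate_total:   summed SNP count with the score across candidate windows
    background_counts: per-window SNP counts with the score, for all scanned windows
    n_windows:         size of the candidate set (same size is drawn each resample)
    """
    rng = random.Random(seed)
    n_as_extreme = 0
    for _ in range(n_resamples):
        resample = rng.sample(background_counts, n_windows)
        if sum(resample) >= candidate_total:
            n_as_extreme += 1
    # Assumed add-one correction so the p-value is never exactly zero.
    return (n_as_extreme + 1) / (n_resamples + 1)
```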
Immune-related genes To specifically test for enrichment for significant genes re-
lated to immunity, we used a list of 386 immune-related keywords from the Compre-
gov/) to query the GO categories of the outlier genes. In total, 200 out of our 212 outlier
genes have at least one associated GO category, of which 62 have at least one GO
category that matches at least one of the keywords on the list and were thus considered
to be “immune-related”.
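The keyword query can be sketched as a case-insensitive substring match between keywords and GO category names. Substring matching is an assumption (the exact matching rule is not given), and the gene names and keywords in any usage of this sketch would be placeholders, not the actual list of 386 keywords.

```python
def is_immune_related(go_terms, keywords):
    """True if any keyword occurs in any of the gene's GO category names."""
    lowered_terms = [term.lower() for term in go_terms]
    return any(
        keyword.lower() in term for keyword in keywords for term in lowered_terms
    )


def count_immune_genes(gene_to_go, keywords):
    """Return (genes with at least one GO category, immune-related genes among them)."""
    annotated = {gene: terms for gene, terms in gene_to_go.items() if terms}
    immune = [
        gene for gene, terms in annotated.items()
        if is_immune_related(terms, keywords)
    ]
    return len(annotated), len(immune)
```

Applied to the outlier genes, the two returned counts would correspond to the 200 annotated genes and the 62 immune-related genes reported above.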
References
Chun, S. and J. C. Fay (2011). “Evidence for hitchhiking of deleterious mutations within the
human genome.” In: PLoS genetics 7 (8), e1002240.
Clarke, B. (1962). “Balanced polymorphism and the diversity of sympatric species”. In: Taxon-
omy and Geography. Ed. by D. Nichols. Oxford: Systematics Association.
— (1964). “Frequency-Dependent Selection for the Dominance of Rare Polymorphic Genes”.
In: Evolution 18 (3), pp. 364–369.
Coventry, A. et al. (2010). “Deep resequencing reveals excess rare recent variants consistent with
explosive population growth”. In: Nature Communications 1 (8), p. 131.
Day, F. R. et al. (2015). “Causal mechanisms and balancing selection inferred from genetic asso-
ciations with polycystic ovary syndrome”. In: Nature Communications 6, p. 8464.
DeGiorgio, M., K. E. Lohmueller, and R. Nielsen (2014). “A model-based approach for iden-
tifying signatures of ancient balancing selection in genetic data.” In: PLoS genetics 10 (8),
e1004561.
Derrien, T., J. Estellé, S. Marco Sola, D. G. Knowles, E. Raineri, R. Guigó, and P. Ribeca (2012).
“Fast Computation and Applications of Genome Mappability”. In: PLoS ONE 7 (1). Ed. by
C. A. Ouzounis, e30377.
Ewing, G. and J. Hermisson (2010). “MSMS: a coalescent simulation program including recom-
bination, demographic structure and selection at a single locus”. In: Bioinformatics 26 (16),
pp. 2064–2065.
Fijarczyk, A. and W. Babik (2015). “Detecting balancing selection in genomes: Limits and prospects”.
In: Molecular Ecology, n/a–n/a.
Filippo, C. de, F. M. Key, S. Ghirotto, A. Benazzo, J. R. Meneu, A. Weihmann, G. Parra, E. D.
Green, and A. M. Andrés (2016). “Recent Selection Changes in Human Genes under Long-
Term Balancing Selection”. In: Molecular Biology and Evolution, msw023.
Gravel, S., B. M. Henn, R. N. Gutenkunst, A. R. Indap, G. T. Marth, A. G. Clark, F. Yu, R. A.
Gibbs, and C. D. Bustamante (2011). “Demographic history and rare allele sharing among
human populations.” In: Proceedings of the National Academy of Sciences of the United States of
America 108 (29), pp. 11983–8.
Gutenkunst, R. N., R. D. Hernandez, S. H. Williamson, and C. D. Bustamante (2009). “Infer-
ring the Joint Demographic History of Multiple Populations from Multidimensional SNP
Frequency Data”. In: PLoS Genetics 5 (10). Ed. by G. McVean, e1000695.
Leffler, E. M. et al. (2013). “Multiple Instances of Ancient Balancing Selection Shared Between
Humans and Chimpanzees”. In: Science 339 (6127), pp. 1578–1582.
Linnenbrink, M., J. M. Johnsen, I. Montero, C. R. Brzezinski, B. Harr, and J. F. Baines (2011).
“Long-Term Balancing Selection at the Blood Group-Related Gene B4galnt2 in the Genus
Mus (Rodentia; Muridae)”. In: Molecular Biology and Evolution 28 (11), pp. 2999–3003.
Liu, X. et al. (2006). “An ancient balanced polymorphism in a regulatory region of human ma-
jor histocompatibility complex is retained in Chinese minorities but lost worldwide.” In:
American Journal of Human Genetics 78 (3), pp. 393–400.
Malaria Genomic Epidemiology Network (2015). “A novel locus of resistance to severe malaria
in a region of ancient balancing selection”. In: Nature 526 (7572), pp. 253–257.
Meyer, D., R. M. Single, S. J. Mack, H. A. Erlich, and G. Thomson (2006). “Signatures of demo-
graphic history and natural selection in the human major histocompatibility complex Loci.”
In: Genetics 173 (4), pp. 2121–2142.
Meyer, D. and G. Thomson (2001). “How selection shapes variation of the human major histo-
compatibility complex: a review.” In: Annals of Human Genetics 65 (1), pp. 1–26.
Meyer, M. et al. (2012). “A High-Coverage Genome Sequence from an Archaic Denisovan Indi-
vidual”. In: Science 338 (6104), pp. 222–226.
Nachman, M. W. and S. L. Crowell (2000). “Estimate of the Mutation Rate per Nucleotide in
Humans”. In: Genetics 156 (1), pp. 297–304.
Nielsen, R. et al. (2005). “A Scan for Positively Selected Genes in the Genomes of Humans and
Chimpanzees”. In: PLoS Biology 3 (6), e170.
Nielsen, R. et al. (2009). “Darwinian and demographic forces affecting human protein coding
genes.” In: Genome Research 19 (5), pp. 838–49.
Pasvol, G., D. J. Weatherall, and R. J. M. Wilson (1978). “Cellular mechanism for the protective
effect of haemoglobin S against P. falciparum malaria”. In: Nature 274 (5672), pp. 701–703.
Perry, G. H. et al. (2008). “Copy number variation and evolution in humans and chimpanzees”.
In: Genome Research 18 (11), pp. 1698–1710.
Petit, F. G., C. Kervarrec, S. P. Jamin, F. Smagulova, C. Hao, E. Becker, B. Jegou, F. Chalmel, and
M. Primig (2015). “Combining RNA and Protein Profiling Data with Network Interactions
Identifies Genes Associated with Spermatogenesis in Mouse and Human”. In: Biology of
Reproduction 92 (3), pp. 71–71.
Prüfer, K. et al. (2013). “The complete genome sequence of a Neanderthal from the Altai Moun-
tains”. In: Nature 505 (7481), pp. 43–49.
Prugnolle, F., A. Manica, M. Charpentier, J. F. Guégan, V. Guernier, and F. Balloux (2005). “Pathogen-
driven selection and worldwide HLA class I diversity.” In: Current Biology 15 (11), pp. 1022–
7.
Rasmussen, M. D., M. J. Hubisz, I. Gronau, and A. Siepel (2014). “Genome-Wide Inference of
Ancestral Recombination Graphs”. In: PLoS Genetics 10 (5). Ed. by G. Coop, e1004342.
Raychaudhuri, S. et al. (2012). “Five amino acids in three HLA proteins explain most of the
association between MHC and seropositive rheumatoid arthritis”. In: Nature Genetics 44 (3),
pp. 291–296.
Rottgardt, I., F. Rothhammer, and M. Dittmar (2010). “Native highland and lowland popula-
tions differ in γ-globin gene promoter polymorphisms related to altered fetal hemoglobin
levels and delayed fetal to adult globin switch after birth”. In: Anthropological Science 118 (1),
pp. 41–48.
Sabeti, P. C. et al. (2007). “Genome-wide detection and characterization of positive selection in
human populations.” In: Nature 449 (7164), pp. 913–8.
Sanchez-Mazas, A. (2007). “An apportionment of human HLA diversity”. In: Tissue Antigens 69,
pp. 198–202.
Savova, V., S. Chun, M. Sohail, R. B. McCole, R. Witwicki, L. Gai, T. L. Lenz, C.-t. Wu, S. R.
Sunyaev, and A. A. Gimelbrant (2016). “Genes with monoallelic expression contribute dis-
proportionately to genetic diversity in humans”. In: Nature Genetics 48 (3), pp. 231–237.
Ségurel, L., Z. Gao, and M. Przeworski (2013). “Ancestry runs deeper than blood: The evolu-
tionary history of ABO points to cryptic variation of functional importance”. In: BioEssays
35 (10), pp. 862–867.
Ségurel, L. et al. (2012). “The ABO blood group is a trans-species polymorphism in primates”.
In: Proceedings of the National Academy of Sciences 109 (45), pp. 18493–18498.
Sellis, D., B. J. Callahan, D. A. Petrov, and P. W. Messer (2011). “Heterozygote advantage as a
natural consequence of adaptation in diploids”. In: Proceedings of the National Academy of
Sciences 108 (51), pp. 20666–20671.
Solberg, O. D., S. J. Mack, A. K. Lancaster, R. M. Single, Y. Tsai, A. Sanchez-Mazas, and G.
Thomson (2008). “Balancing selection and heterogeneity across the classical human leuko-
cyte antigen loci: A meta-analytic review of 497 population studies”. In: Human Immunology
69 (7), pp. 443–464.
Spurgin, L. G. and D. S. Richardson (2010). “How pathogens drive genetic diversity: MHC,
mechanisms and misunderstandings.” In: Proceedings. Biological sciences / The Royal Society
277 (1684), pp. 979–88.
Staubach, F., S. Künzel, A. C. Baines, A. Yee, B. M. McGee, F. Bäckhed, J. F. Baines, and J. M.
Johnsen (2012). “Expression of the blood-group-related glycosyltransferase B4galnt2 influ-
ences the intestinal microbiota in mice”. In: The ISME Journal 6 (7), pp. 1345–1355.
Sun, C., D. Huo, C. Southard, B. Nemesure, A. Hennis, M. Cristina Leske, S.-Y. Wu, D. B. Witon-
sky, O. I. Olopade, and A. Di Rienzo (2011). “A signature of balancing selection in the region
upstream to the human UGT2B4 gene and implications for breast cancer risk”. In: Human
Genetics 130 (6), pp. 767–775.
Tajima, F. (1989). “Statistical method for testing the neutral mutation hypothesis by DNA poly-
morphism.” In: Genetics 123 (3), pp. 585–595.
Tan, Z., A. M. Shon, and C. Ober (2005). “Evidence of balancing selection at the HLA-G pro-
moter region”. In: Human Molecular Genetics 14 (23), pp. 3619–3628.
Teixeira, J. C. et al. (2015). “Long-Term Balancing Selection in LAD1 Maintains a Missense Trans-
Species Polymorphism in Humans, Chimpanzees, and Bonobos”. In: Molecular Biology and
Evolution 32 (5), pp. 1186–1196.
Vernot, B. and J. M. Akey (2014). “Resurrecting Surviving Neandertal Lineages from Modern
Human Genomes”. In: Science 343 (6174), pp. 1017–1021.
Williamson, S. H., M. J. Hubisz, A. G. Clark, B. A. Payseur, C. D. Bustamante, and R. Nielsen
(2007). “Localizing recent adaptive evolution in the human genome.” In: PLoS genetics 3 (6),
e90.
Yi, X. et al. (2010). “Sequencing of 50 Human Exomes Reveals Adaptation to High Altitude”. In:
Science 329 (5987), pp. 75–78.
Supplementary Text
and genes
In all the analyses below, the set of significant or outlier windows (or genes) consists of the
union of windows, or of genes overlapped by them, considering all tf values.
Neanderthal introgression
Background Genomic segments that contain introgressed haplotypes from archaic human
forms (Meyer et al., 2012; Prüfer et al., 2013) have, on average, older TMRCA and higher di-
versity than the rest of the genome. In the absence of positive (or balancing) selection, though,
introgressed segments are not expected to reach intermediate frequencies and contribute to the
significant and outlier windows defined in the main paper.
Results Accordingly, the significant and outlier windows in European populations are not
enriched in putatively introgressed SNPs (defined as those with an allele absent in Africans,
shared between Europeans and Neanderthals, and that fall in previously identified introgressed
regions (Vernot and Akey, 2014); S17 Fig and Methods in main paper). In fact, the outlier
windows are significantly depleted of introgressed SNPs (S17 Fig).
Results and Methods We also investigated the possibility of non-homologous gene con-
version, which is another biological phenomenon that may increase diversity. To do so, for each
significant or outlier gene (see Table 2 in main text) we analyzed the distribution of the number
of paralogs that reside on the same chromosome. Significant genes show no tendency towards
having more paralogs on the same chromosome than all autosomal genes (see S16 Fig), show-
ing that this is not a general issue. In both cases, more than 60% of the genes have no paralogs
on the same chromosome (S16 Fig). We nevertheless singled out olfactory receptor (OR) genes
(see below), which often appear in tandem and may undergo gene conversion. Unlike the other
significant and background genes, more than 80% of the OR genes present in all populations for
at least one tf value have at least one paralog on the same chromosome (S19 Fig). Thus,
non-homologous gene conversion does not appear to be a general issue among significant genes,
with the exception of the OR genes.
Among the windows believed to contain fewer false-positive candidates for LTBS, only two OR genes
are present: OR52A1 and OR6J1 (Table 3). Although patterns compatible with overdominance
have been reported for human olfactory receptor activity genes (Alonso et al., 2008), we cannot
rule out that the enrichment signature detected in genes of the olfactory receptor
(OR) gene family is due to paralogous gene conversion (S19 Fig). Moreover, OR52A1 has 10
paralogs on the same chromosome, and OR6J1 has 1 (Table 3). We therefore recommend that
the results concerning OR genes be interpreted with caution.
Results A phenotype ontology analysis uncovered “abnormality of the sclera” as the only
significant category in YRI, and no significant categories appear in the other three populations
analyzed (S4 Table).
Methods See Methods section in main paper (“Gene and phenotype ontology”).
Tissue-specific expression
Results Interestingly, though, when we perform enrichment analysis among significant win-
dows for tissue-specific expression, we find that targets of LTBS are significantly enriched in
genes that are highly expressed in adrenal gland in both TSI and in the lung in GBR European
populations (S5 Table). These results are mirrored in African populations when considering
outlier windows, although there the results are not significant.
Description
In the main text, we mention that among the top 10 outlier genes from Table 3 (considering
the p-values for YRI), 6 have been reported previously as having signatures of LTBS: PROKR2
(Leffler et al., 2013), HLA-DQA1 (DeGiorgio et al., 2014), CPE (DeGiorgio et al., 2014), HLA-
DRB5 (DeGiorgio et al., 2014), LUZP2 (DeGiorgio et al., 2014), and MYO3A (Asthana et al.,
2005; DeGiorgio et al., 2014) (Tables 3 and S7).
To verify that the extreme signatures of LTBS in these genes are due to balancing
selection, and not to (a) spurious SNP calls caused by collapsed reads from duplicated
regions, or (b) non-homologous gene conversion between close paralogs, we performed a
manual verification using BLAT (http://www.ensembl.org/Multi/Tools/Blast?db=core).
Methods
For each gene, the corresponding FASTA sequence was taken from the hg19 reference genome
and queried in BLAT. We only considered the top 100 hits for each gene. For each of the 100
hits, positions that coincide with a SNP position for the gene in the Phase 1 1000 Genomes data
set (Abecasis et al., 2012) were manually verified. If the position is a match between the query
and the hit, i.e., both have the same variant at the SNP position, this SNP is considered a match.
If the position is a mismatch between the query and the hit, i.e., query and hit have different
variants at the SNP position, this SNP is considered a mismatch.
A mismatched SNP could either have the alternate allele in the hit (alternate mismatch*) or
an allele that is not the alternate allele in the 1000G data set (simple mismatch). Further, we
considered as more relevant and likely problematic those SNPs that are not only classified as
alternate mismatches but also have somewhat intermediate frequencies (> 0.10). Results are
provided for each gene separately, below. The location of each gene is provided for reference.
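The classification rules above can be written down compactly. This is only a sketch of the decision logic for a single SNP position in a single BLAT hit; the verification itself was manual, and the function names and the use of a single frequency value are assumptions.

```python
def classify_snp(query_base, hit_base, alt_allele):
    """Classify one SNP position by comparing the query with a BLAT hit."""
    if hit_base == query_base:
        return "match"
    if hit_base == alt_allele:
        # The paralogous hit carries the allele called as alternate in the SNP data.
        return "alternate mismatch"
    return "simple mismatch"


def is_likely_problematic(category, frequency, min_frequency=0.10):
    """Alternate mismatches at intermediate frequency are the main concern."""
    return category == "alternate mismatch" and frequency > min_frequency
```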
Results
B4GALNT2
BLAT After looking at all hits, we found three alternate mismatch SNPs. Only two of them
have intermediate frequency (rs78050610, rs11654406), while the other is a singleton (rs140853454).
Conclusion Roughly 94 intermediate-frequency SNPs remain for this gene, making it unlikely
that its signatures depend on problematic SNPs.
NDUFA10
BLAT After looking at all hits, we found two alternate mismatch SNPs, one of which has
intermediate frequency (rs6759128) while the other has low to intermediate frequency (rs28429725).
Conclusion Roughly 29 intermediate-frequency SNPs remain for this gene, making it unlikely
that its signatures depend on problematic SNPs.
PCDH15
BLAT After looking at all hits, we found 1 alternate mismatch SNP (rs188658080), which
has very low frequency in all populations or is absent from some populations (considering only
African and European populations).
Conclusion Although this SNP is likely unreliable, it has low frequency. Coupled with the
fact that only this position clearly appears to be problematic, we concluded that the signature
observed for this gene is reliable.
C1orf101
BLAT After looking at all hits, we found 19 alternate mismatch SNPs. This seemed to be
a potentially problematic candidate for LTBS, so we looked at it in further detail. Of the 19
alternate mismatch SNPs, between 7 and 12 have intermediate frequency. They are listed below.
Alternate mismatch SNPs for C1orf101 (the ones with * have intermediate frequency and
are thus more likely to be sources of bias for observed NCD2 values): rs3003250, rs3005972,
rs3005973, rs138545291 (singleton), rs114536159, rs142620294, rs189591539 (singleton), rs3003251*,
rs3005938*, rs3005945*, rs3005947*, rs3005957*, rs3005958*, rs3005968*, rs3005969*, rs3005971*,
rs3005975*, rs7538776*, rs9429008*
Given that several SNPs for this gene are likely problematic (alternate mismatch with
intermediate frequencies), we recalculated NCD2 for the windows that overlap this gene after
removing those SNPs. We removed the 19 SNPs (including the low/high frequency ones and
singletons) from our data set (Methods in main paper), and recalculated NCD2_0.5 for the six
outlier windows (Table 2) that overlap this gene for YRI. The six outlier windows that overlap
this gene have zero FDs. For the first of the six windows, the removal of the problematic SNPs
reduced it from 19 to 17 SNPs (IS = 17), so it no longer fulfills our criterion of having at least
19 IS in Africa (see Methods in main paper). The second window goes from having 22 to 19 IS. It
passes the "significance criterion" (simulation-based p < 0.0001) and has a Z_tf value within the
observed range for outlier windows.
Conclusion 19 SNPs are alternate mismatches, and 7-12 also have intermediate frequencies
in most African and European populations. Even if all 19 SNPs are removed and NCD2_0.5 is
recalculated, at least one window remains significant and probably an outlier, suggesting
that this gene is likely a true positive. Moreover, there is an entire range of the query
(between positions 8,000-10,500) for which there are no hits, and this range contains 37 SNPs
(around 12 with intermediate frequency), further supporting the signatures observed for this gene.
Supplementary Tables
Tbs(myr) feq L(kb) NCD2(tf=0.5) NCD2(tf=0.4) NCD2(tf=0.3) NCD1(tf=0.5) NCD1(tf=0.4) NCD1(tf=0.3)
5 0.5 3 0.959 0.944 0.835 0.929 0.911 0.393
5 0.5 6 0.917 0.885 0.728 0.903 0.847 0.392
5 0.5 12 0.829 0.789 0.548 0.846 0.772 0.325
5 0.4 3 0.939 0.935 0.886 0.886 0.894 0.674
5 0.4 6 0.871 0.860 0.790 0.838 0.819 0.651
5 0.4 12 0.742 0.726 0.612 0.745 0.717 0.534
5 0.3 3 0.895 0.908 0.929 0.717 0.801 0.836
5 0.3 6 0.776 0.796 0.833 0.659 0.709 0.794
5 0.3 12 0.572 0.597 0.638 0.509 0.570 0.663
3 0.5 3 0.911 0.882 0.681 0.855 0.797 0.236
3 0.5 6 0.856 0.809 0.574 0.854 0.770 0.266
3 0.5 12 0.727 0.666 0.410 0.768 0.678 0.232
3 0.4 3 0.878 0.864 0.759 0.781 0.783 0.557
3 0.4 6 0.803 0.785 0.678 0.770 0.753 0.527
3 0.4 12 0.659 0.621 0.500 0.654 0.629 0.441
3 0.3 3 0.749 0.774 0.811 0.561 0.640 0.706
3 0.3 6 0.628 0.648 0.700 0.526 0.570 0.658
Tbs(myr) feq L(kb) NCD2(tf=0.5) NCD2(tf=0.4) NCD2(tf=0.3) NCD1(tf=0.5) NCD1(tf=0.4) NCD1(tf=0.3)
5 0.5 3 0.968 0.951 0.835 0.921 0.846 0.197
5 0.5 6 0.941 0.907 0.747 0.916 0.871 0.234
5 0.5 12 0.849 0.80 0.573 0.847 0.767 0.201
5 0.4 3 0.948 0.944 0.907 0.849 0.826 0.596
5 0.4 6 0.901 0.892 0.832 0.849 0.850 0.633
5 0.4 12 0.779 0.757 0.687 0.750 0.744 0.536
5 0.3 3 0.836 0.855 0.892 0.471 0.569 0.740
5 0.3 6 0.726 0.758 0.810 0.497 0.606 0.722
5 0.3 12 0.551 0.595 0.670 0.387 0.493 0.644
3 0.5 3 0.928 0.892 0.678 0.814 0.693 0.145
3 0.5 6 0.875 0.833 0.607 0.841 0.755 0.195
3 0.5 12 0.761 0.704 0.451 0.759 0.670 0.187
3 0.4 3 0.888 0.875 0.794 0.738 0.709 0.461
3 0.4 6 0.828 0.809 0.723 0.782 0.765 0.517
3 0.4 12 0.678 0.662 0.567 0.682 0.676 0.462
3 0.3 3 0.733 0.757 0.795 0.389 0.480 0.634
3 0.3 6 0.609 0.643 0.703 0.425 0.512 0.632
3 0.3 12 0.433 0.473 0.558 0.332 0.402 0.548
1 0.5 3 0.472 0.388 0.161 0.430 0.305 0.054
Tbs(myr) feq L(kb) NCD2(tf=0.5) NCD2(tf=0.4) NCD2(tf=0.3) NCD1(tf=0.5) NCD1(tf=0.4) NCD1(tf=0.3)
5 0.5 3 0.666 0.687 0.705 0.448 0.476 0.365
5 0.5 6 0.584 0.605 0.614 0.438 0.465 0.378
5 0.5 12 0.469 0.476 0.450 0.398 0.401 0.332
5 0.4 3 0.343 0.372 0.430 0.136 0.167 0.225
5 0.4 6 0.262 0.291 0.356 0.135 0.167 0.224
5 0.4 12 0.187 0.206 0.241 0.116 0.133 0.189
5 0.3 3 0.113 0.135 0.186 0.015 0.022 0.055
5 0.3 6 0.062 0.071 0.113 0.012 0.024 0.046
5 0.3 12 0.030 0.041 0.068 0.011 0.015 0.037
3 0.5 3 0.611 0.627 0.616 0.393 0.404 0.344
3 0.5 6 0.532 0.545 0.529 0.411 0.422 0.374
3 0.5 12 0.412 0.418 0.389 0.371 0.370 0.298
3 0.4 3 0.245 0.269 0.332 0.111 0.141 0.173
3 0.4 6 0.189 0.208 0.252 0.101 0.128 0.166
3 0.4 12 0.128 0.144 0.178 0.085 0.111 0.142
3 0.3 3 0.073 0.087 0.126 0.011 0.022 0.050
3 0.3 6 0.036 0.052 0.072 0.012 0.017 0.046
3 0.3 12 0.019 0.029 0.047 0.010 0.018 0.037
1 0.5 3 0.287 0.274 0.222 0.245 0.225 0.145
1 0.5 6 0.222 0.212 0.159 0.235 0.215 0.167
1 0.5 12 0.181 0.173 0.132 0.205 0.188 0.135
LWK
GO:0007186 35.17 61 0.00001 0.00104 G-protein_coupled_receptor_signaling_pathway
GO:0042612 0.13 4 0.00001 0.00104 MHC_class_I_protein_complex
GO:0042613 0.291 10 0.00001 0.00104 MHC_class_II_protein_complex
GO:0045095 0.961 9 0.00001 0.00104 keratin_filament
GO:0042605 0.564 8 0.00001 0.00104 peptide_antigen_binding
GO:0071556 0.689 12 0.00001 0.00104 integral_to_lumenal_side_of_endoplasmic_reticulum_membrane
GO:0032395 0.124 6 0.00001 0.00104 MHC_class_II_receptor_activity
GO:0001916 0.659 7 0.00001 0.00104 positive_regulation_of_T_cell_mediated_cytotoxicity
GO:0016021 314.878 394 0.00001 0.00104 integral_to_membrane
GO:0002504 0.216 7 0.00001 0.00104 antigen_processing_and_presentation_of_peptide_or_polysaccharide_antigen_via_MHC_class_II
GO:0006955 14.317 33 0.00001 0.00104 immune_response
GO:0060333 3.3 13 0.00001 0.00104 interferon-gamma-mediated_signaling_pathway
GO:0002480 0.223 4 0.00001 0.00104 antigen_processing_and_presentation_of_exogenous_peptide_antigen_via_MHC_class_I,_TAP-independent
GO:0019882 1.306 15 0.00001 0.00104 antigen_processing_and_presentation
GO:0030658 2.507 11 0.00001 0.00104 transport_vesicle_membrane
GO:0030669 0.972 8 0.00001 0.00104 clathrin-coated_endocytic_vesicle_membrane
GO:0060333 3.529 13 0.00001 0.00114 interferon-gamma-mediated_signaling_pathway
GO:0002480 0.24 4 0.00001 0.00114 antigen_processing_and_presentation_of_exogenous_peptide_antigen_via_MHC_class_I,_TAP-independent
GO:0019882 1.401 16 0.00001 0.00114 antigen_processing_and_presentation
GO:0030658 2.646 10 0.00001 0.00114 transport_vesicle_membrane
GO:0030669 1.043 8 0.00001 0.00114 clathrin-coated_endocytic_vesicle_membrane
GO:0004984 2.416 21 0.00001 0.00114 olfactory_receptor_activity
GO:0012507 1.424 12 0.00001 0.00114 ER_to_Golgi_transport_vesicle_membrane
GO:0004930 26.334 51 0.00001 0.00114 G-protein_coupled_receptor_activity
GO:0007186 37.515 62 0.00003 0.00336 G-protein_coupled_receptor_signaling_pathway
GO:0032588 3.336 11 0.00003 0.00336 trans-Golgi_network_membrane
GO:0050911 0.244 4 0.00009 0.00993 detection_of_chemical_stimulus_involved_in_sensory_perception_of_smell
GO:0005576 92.15 123 0.00023 0.02592 extracellular_region
GO:0001916 0.707 5 0.00033 0.03707 positive_regulation_of_T_cell_mediated_cytotoxicity
YRI (without HLA)
GO:0007186 37.187 62 0.00001 0.00492 G-protein_coupled_receptor_signaling_pathway
GO:0045095 1.023 9 0.00001 0.00492 keratin_filament
GO:0004984 2.397 21 0.00001 0.00492 olfactory_receptor_activity
GO:0004930 26.103 51 0.00001 0.00492 G-protein_coupled_receptor_activity
GO:0016021 332.421 394 0.00002 0.00802 integral_to_membrane
GO:0005576 91.502 123 0.0004 0.03721 extracellular_region
GO:0050911 0.242 4 0.00013 0.03885 detection_of_chemical_stimulus_involved_in_sensory_perception_of_smell
GBR
GO:0042612 0.128 4 0.00001 0.00129 MHC_class_I_protein_complex
GO:0042613 0.285 9 0.00001 0.00129 MHC_class_II_protein_complex
GO:0042605 0.562 6 0.00001 0.00129 peptide_antigen_binding
GO:0071556 0.683 12 0.00001 0.00129 integral_to_lumenal_side_of_endoplasmic_reticulum_membrane
GO:0032395 0.12 6 0.00001 0.00129 MHC_class_II_receptor_activity
GO:0016021 312.004 411 0.00001 0.00129 integral_to_membrane
GO:0005576 85.721 124 0.00001 0.00129 extracellular_region
GO:0002504 0.211 6 0.00001 0.00129 antigen_processing_and_presentation_of_peptide_or_polysaccharide_antigen_via_MHC_class_II
GO:0060333 3.266 16 0.00001 0.00129 interferon-gamma-mediated_signaling_pathway
GO:0002480 0.221 4 0.00001 0.00129 antigen_processing_and_presentation_of_exogenous_peptide_antigen_via_MHC_class_I,_TAP-independent
GO:0019882 1.294 15 0.00001 0.00129 antigen_processing_and_presentation
GO:0030669 0.96 9 0.00001 0.00129 clathrin-coated_endocytic_vesicle_membrane
GO:0004984 2.241 21 0.00001 0.00129 olfactory_receptor_activity
GO:0012507 1.314 12 0.00001 0.00129 ER_to_Golgi_transport_vesicle_membrane
GO:0004930 24.487 50 0.00001 0.00129 G-protein_coupled_receptor_activity
GO:0007186 34.875 58 0.00002 0.00235 G-protein_coupled_receptor_signaling_pathway
GO:0006955 14.165 31 0.00002 0.00235 immune_response
GO:0030658 2.469 10 0.00003 0.00346 transport_vesicle_membrane
GO:0045095 0.956 7 0.00005 0.00569 keratin_filament
GO:0060337 1.679 8 0.00009 0.01008 type_I_interferon_signaling_pathway
GO:0032588 3.114 10 0.00014 0.01412 trans-Golgi_network_membrane
GO:0007608 4.894 13 0.00019 0.02054 sensory_perception_of_smell
GO:0001916 0.654 5 0.0002 0.02087 positive_regulation_of_T_cell_mediated_cytotoxicity
GBR (without HLA)
GO:0007186 34.566 58 0.00001 0.0039 G-protein_coupled_receptor_signaling_pathway
GO:0016021 309.57 404 0.00001 0.0039 integral_to_membrane
GO:0005576 85.005 124 0.00001 0.0039 extracellular_region
GO:0004984 2.217 21 0.00001 0.0039 olfactory_receptor_activity
GO:0004930 24.276 50 0.00001 0.0039 G-protein_coupled_receptor_activity
GO:0045095 0.948 7 0.00003 0.01045 keratin_filament
TSI
GO:0042613 0.304 8 0.00001 0.00163 MHC_class_II_protein_complex
GO:0042605 0.588 6 0.00001 0.00163 peptide_antigen_binding
GO:0071556 0.718 10 0.00001 0.00163 integral_to_lumenal_side_of_endoplasmic_reticulum_membrane
GO:0032395 0.129 5 0.00001 0.00163 MHC_class_II_receptor_activity
GO:0016021 325.611 414 0.00001 0.00163 integral_to_membrane
GO:0005576 89.537 140 0.00001 0.00163 extracellular_region
GO:0002504 0.225 6 0.00001 0.00163 antigen_processing_and_presentation_of_peptide_or_polysaccharide_antigen_via_MHC_class_II
GO:0060333 3.413 13 0.00001 0.00163 interferon-gamma-mediated_signaling_pathway
GO:0019882 1.357 13 0.00001 0.00163 antigen_processing_and_presentation
GO:0004984 2.347 23 0.00001 0.00163 olfactory_receptor_activity
GO:0012507 1.371 10 0.00001 0.00163 ER_to_Golgi_transport_vesicle_membrane
GO:0004930 25.566 48 0.00001 0.00163 G-protein_coupled_receptor_activity
GO:0007608 5.077 15 0.00002 0.00313 sensory_perception_of_smell
GO:0006955 14.835 31 0.00005 0.00762 immune_response
GO:0030669 1.007 7 0.00006 0.00861 clathrin-coated_endocytic_vesicle_membrane
GO:0060402 2.707 8 0.00017 0.02497 calcium_ion_transport_into_cytosol
GO:0030658 2.578 9 0.00018 0.02503 transport_vesicle_membrane
GO:0032588 3.244 10 0.00022 0.02925 trans-Golgi_network_membrane
GO:0042612 0.135 3 0.00023 0.02925 MHC_class_I_protein_complex
TSI (without HLA)
GO:0016021 323.377 407 0.00001 0.003894 integral_to_membrane
GO:0005576 88.921 140 0.00001 0.003894 extracellular_region
GO:0007608 5.05 15 0.00001 0.003894 sensory_perception_of_smell
GO:0004984 2.326 23 0.00001 0.003894 olfactory_receptor_activity
GO:0004930 25.39 48 0.00001 0.003894 G-protein_coupled_receptor_activity
S3 Table. Gene ontology enrichment analyses for outlier windows. The union
of significant windows for at least one of the tf values is used. tf, target
frequency used in the NCD equation. FDR, false discovery rate. genes (sims),
expected number of genes in this category (see Methods). genes (data), actual
number of genes in the category in the analyzed set. Because no categories
remained significant after removal of HLA genes, these sets are not reported.
YRI
GO:0019882 0.157 5 0.00005 0.00402 antigen_processing_and_presentation
GO:0030669 0.119 4 0.00005 0.00402 clathrin-coated_endocytic_vesicle_membrane
GO:0032395 0.015 3 0.00005 0.00402 MHC_class_II_receptor_activity
GO:0042613 0.034 5 0.00005 0.00402 MHC_class_II_protein_complex
GO:0071556 0.081 4 0.00005 0.00402 integral_to_lumenal_side_of_endoplasmic_reticulum_membrane
GO:0030658 0.354 5 0.00002 0.00674 transport_vesicle_membrane
GO:0012507 0.157 4 0.00004 0.01191 ER_to_Golgi_transport_vesicle_membrane
GO:0002504 0.025 3 0.00005 0.01318 antigen_processing_and_presentation_of_peptide_or_polysaccharide_antigen_via_MHC_class_II
GO:0031295 0.444 5 0.00011 0.02661 T_cell_costimulation
LWK
GO:0002504 0.029 5 0.00005 0.00129 antigen_processing_and_presentation_of_peptide_or_polysaccharide_antigen_via_MHC_class_II
GO:0019882 0.177 10 0.00005 0.00129 antigen_processing_and_presentation
GO:0019221 1.556 10 0.00005 0.00129 cytokine-mediated_signaling_pathway
GO:0030658 0.395 7 0.00005 0.00129 transport_vesicle_membrane
GO:0030669 0.133 6 0.00005 0.00129 clathrin-coated_endocytic_vesicle_membrane
GO:0032588 0.491 6 0.00005 0.00129 trans-Golgi_network_membrane
GO:0031295 0.495 7 0.00005 0.00129 T_cell_costimulation
GO:0012507 0.177 9 0.00005 0.00129 ER_to_Golgi_transport_vesicle_membrane
GO:0032395 0.016 5 0.00005 0.00129 MHC_class_II_receptor_activity
GO:0042612 0.017 3 0.00005 0.00129 MHC_class_I_protein_complex
GO:0042613 0.039 7 0.00005 0.00129 MHC_class_II_protein_complex
GO:0006955 1.986 16 0.00005 0.00129 immune_response
GO:0042605 0.074 4 0.00005 0.00129 peptide_antigen_binding
GO:0060333 0.455 9 0.00005 0.00129 interferon-gamma-mediated_signaling_pathway
GO:0071556 0.092 9 0.00005 0.00129 integral_to_lumenal_side_of_endoplasmic_reticulum_membrane
GO:0002480 0.029 3 0.00005 0.00129 antigen_processing_and_presentation_of_exogenous_peptide_antigen_via_MHC_class_I,_TAP-independent
GO:0001916 0.089 3 0.00008 0.01018 positive_regulation_of_T_cell_mediated_cytotoxicity
GO:0019886 0.775 6 0.0001 0.01099 antigen_processing_and_presentation_of_exogenous_peptide_antigen_via_MHC_class_II
GO:0030666 0.752 6 0.0001 0.01099 endocytic_vesicle_membrane
GO:0005765 2.206 10 0.0001 0.01099 lysosomal_membrane
GO:0032689 0.099 3 0.00012 0.01257 negative_regulation_of_interferon-gamma_production
GO:0030670 0.305 4 0.00035 0.03549 phagocytic_vesicle_membrane
TSI
GO:0002504 0.027 5 0.00005 0.00131 antigen_processing_and_presentation_of_peptide_or_polysaccharide_antigen_via_MHC_class_II
GO:0019882 0.17 9 0.00005 0.00131 antigen_processing_and_presentation
GO:0019221 1.477 9 0.00005 0.00131 cytokine-mediated_signaling_pathway
GO:0030658 0.375 6 0.00005 0.00131 transport_vesicle_membrane
GO:0030669 0.126 5 0.00005 0.00131 clathrin-coated_endocytic_vesicle_membrane
GO:0031295 0.467 6 0.00005 0.00131 T_cell_costimulation
GO:0012507 0.168 8 0.00005 0.00131 ER_to_Golgi_transport_vesicle_membrane
GO:0032395 0.016 4 0.00005 0.00131 MHC_class_II_receptor_activity
GO:0042612 0.016 3 0.00005 0.00131 MHC_class_I_protein_complex
GO:0042613 0.036 6 0.00005 0.00131 MHC_class_II_protein_complex
GO:0006955 1.898 15 0.00005 0.00131 immune_response
GO:0042605 0.072 4 0.00005 0.00131 peptide_antigen_binding
GO:0060333 0.434 9 0.00005 0.00131 interferon-gamma-mediated_signaling_pathway
GO:0071556 0.088 8 0.00005 0.00131 integral_to_lumenal_side_of_endoplasmic_reticulum_membrane
GO:0002480 0.029 3 0.00005 0.00131 antigen_processing_and_presentation_of_exogenous_peptide_antigen_via_MHC_class_I,_TAP-independent
GO:0001916 0.085 3 0.00005 0.00628 positive_regulation_of_T_cell_mediated_cytotoxicity
GO:0060337 0.214 4 0.00005 0.00628 type_I_interferon_signaling_pathway
GO:0032588 0.466 5 0.00006 0.0072 trans-Golgi_network_membrane
GO:0030670 0.291 4 0.00028 0.03158 phagocytic_vesicle_membrane
GBR
GO:0002504 0.024 6 0.00005 0.00112 antigen_processing_and_presentation_of_peptide_or_polysaccharide_antigen_via_MHC_class_II
GO:0019882 0.151 11 0.00005 0.00112 antigen_processing_and_presentation
GO:0019221 1.332 10 0.00005 0.00112 cytokine-mediated_signaling_pathway
GO:0030658 0.339 7 0.00005 0.00112 transport_vesicle_membrane
GO:0030669 0.112 6 0.00005 0.00112 clathrin-coated_endocytic_vesicle_membrane
GO:0032588 0.421 6 0.00005 0.00112 trans-Golgi_network_membrane
GO:0031295 0.421 6 0.00005 0.00112 T_cell_costimulation
GO:0012507 0.151 9 0.00005 0.00112 ER_to_Golgi_transport_vesicle_membrane
GO:0032395 0.014 5 0.00005 0.00112 MHC_class_II_receptor_activity
GO:0042612 0.014 3 0.00005 0.00112 MHC_class_I_protein_complex
S7 Table. List of outlier genes. This is the same list reported in Table 3, but
it includes additional information. In purple, "African" genes; in orange,
"European" genes; and in green, "African and European" genes (see main text).
P, p-value of the most extreme window overlapping the gene; tf, assigned target
frequency of the window with the lowest p-value. When a gene is "African" or
"European" but one of the populations from the other continent also has an
extreme window for the gene, it is highlighted with the same color code.
15 RP11-96O20.4 0.5 0.000389243 0.5 0.000419892 0.3 0.001878787 0.3 0.001026743
10 SFTPD 0.3 0.000194315 0.3 0.0000147 0.3 0.023205621 0.3 0.004575903
8 SGCZ 0.5 0.000256226 0.5 0.00048119 0.3 0.000722705 0.4 0.000495289
6 SLC17A5 0.5 0.000250709 0.5 0.00033101 0.5 0.008776049 0.3 0.004632297
11 SLC35F2 0.5 0.000266034 0.5 0.00031875 0.5 0.009484655 0.5 0.011214487
1 SPRR3 0.5 0.000357368 0.5 0.000496515 0.5 0.000538197 0.5 0.000263582
20 SPTLC3 0.5 0.000460962 0.4 0.000395373 0.4 0.000560878 0.5 0.000449315
15 SQRDL 0.5 0.000389243 0.5 0.000419892 0.3 0.001878787 0.3 0.001026743
5 STK32A 0.5 0.000274615 0.5 0.000300361 0.5 0.000370241 0.3 0.000521034
14 STXBP6 0.5 0.000163053 0.5 0.000146502 0.3 0.001623787 0.3 0.00021393
20 TGM6 0.3 0.0000638 0.3 0.00013363 0.3 0.000148341 0.3 0.000667536
13 TMCO3 0.5 0.000337753 0.5 0.00046464 0.3 0.001407404 0.3 0.001530613
16 WWOX 0.3 0.000407019 0.5 0.000427248 0.3 0.000239676 0.3 0.00133875
19 ZNF331 0.5 0.000378209 0.5 0.000473834 0.3 0.00091089 0.3 0.000663245
3 ALDH1L1 0.3 0.000354303 0.3 0.000328558 0.4 0.001094171 0.4 0.001080072
22 CELSR1 0.3 0.000129952 0.3 0.000253774 0.3 0.007483272 0.3 0.010092732
5 COMMD10 0.3 0.000278906 0.3 0.00023232 0.3 0.000710445 0.3 0.001828522
2 MLPH 0.3 0.000350625 0.4 0.000460962 0.3 0.00040518 0.3 0.002850975
18 NEDD4L 0.5 0.000389856 0.3 0.000369015 0.3 0.000840397 0.5 0.001739027
14 OR6J1 0.3 0.000134856 0.3 0.000158762 0.3 0.000568233 0.3 0.000383726
6 SLC22A16 0.3 0.000498354 0.3 0.000435829 0.3 0.010716746 0.3 0.00173351
3 SUMF1 0.3 0.000151406 0.3 0.000326719 0.3 0.00174577 0.3 0.000546166
17 ZZEF1 0.3 0.000253161 0.3 0.000253161 0.3 0.000133017 0.3 0.000502644
15 C15orf48 0.5 0.000165505 0.3 0.00036595 0.3 0.517718828 0.3 0.471580976
6 CCHCR1 0.3 0.000426022 0.3 0.000300361 0.3 0.000782164 0.3 0.000416827
3 CLDN16 0.3 0.000385565 0.3 0.000255613 0.3 0.014823106 0.3 0.010374703
8 EXTL3 0.3 0.000269712 0.3 0.000457897 0.5 0.064383231 0.3 0.00184446
2 IL37 0.3 0.0000368 0.3 0.000430313 0.3 0.007752983 0.3 0.001856106
5 NR3C1 0.3 0.000401503 0.3 0.000410084 0.3 0.000630757 0.3 0.000290553
1 PGLYRP4 0.4 0.0000975 0.3 0.000274615 0.3 0.005533992 0.3 0.006012117
5 SLC27A6 0.3 0.000466479 0.3 0.000497741 0.5 0.00057988 0.4 0.000313233
8 STAU2 0.3 0.000329171 0.3 0.000444411 0.3 0.00096238 0.3 0.000500805
12 TMEM132D 0.5 0.000429087 0.3 0.000454832 0.3 0.00082875 0.5 0.001126659
11 TMEM135 0.3 0.000266647 0.3 0.000291166 0.3 0.000403954 0.3 0.000597043
17 WSCD1 0.5 0.000492837 0.3 0.000438894 0.3 0.001060457 0.3 0.000837945
1 ZNF670 0.3 0.00026726 0.3 0.000476899 0.3 0.000495289 0.3 0.001072717
1 ZNF695 0.3 0.00026726 0.3 0.000476899 0.3 0.000495289 0.3 0.001072717
12 AC121757.1 0.5 0.000594592 0.5 0.000751515 0.5 0.000419279 0.5 0.000359207
AP3B1 COX19 GCNT3 LNX1 RP11-295P9.3 TDRD10
APBB1IP CPA3 GFRA2 LOXL2 RP11-297M9.1 TEAD2
APBB2 CPA5 GIPC2 LPHN2 RP11-302M6.4 TEC
APIP CPB2 GLDN LPHN3 RP11-307N16.6 TEK
APPBP2 CPE GLIPR1L2 LPIN1 RP11-321F6.1 TEKT1
ARHGAP22 CPLX1 GLIS1 LPIN2 RP11-383H13.1 TENM2
ARHGAP24 CPNE4 GMNC LPPR1 RP11-389E17.1 TENM3
ARHGAP28 CPNE8 GNA15 LRCH1 RP11-433C9.2 TENM4
ARHGAP44 CPXM2 GNG2 LRP1B RP11-45H22.3 TES
ARHGAP8 CREB5 GNLY LRRC16A RP11-463C8.4 TESC
ARHGEF10L CRELD2 GOLM1 LRRC7 RP11-697E2.6 TESPA1
ARHGEF18 CRHR1 GOPC LRRFIP1 RP11-96O20.4 TEX2
ARHGEF37 CRTAC1 GOSR2 LRRK2 RP13-279N23.2 TFB2M
ARL14EPL CRTC3 GPC5 LRRTM4 RP5-1052I5.2 TG
ARL15 CRX GPC6 LSAMP RP5-966M1.6 TGM6
ARSB CRYL1 GPD1L LTBP1 RPA3-AS1 THBS2
ARSJ CSGALNACT1 GPLD1 LUZP2 RPAIN THBS4
ART1 CSMD1 GPR111 LYAR RPGRIP1 THSD4
ART3 CSMD2 GPR114 LYPD6B RPS6KA2 THSD7A
ASAH2 CSMD3 GPR115 MACC1 RPSA THSD7B
ASAP1 CSN3 GPR133 MACROD2 RPTOR TIAM1
ASB18 CSRP1 GPR137B MAGI1 RRM1 TIAM2
ASGR2 CTB-129P6.11 GPR158 MAGI2 RRP12 TIFA
ASIC2 CTD-2207O23.3 GPR78 MAMDC2 RUNX1 TJP2
ASPA CTD-2260A17.2 GPRIN3 MAML3 RUNX2 TLDC1
ASTN2 CTD-2287O16.3 GRAMD3 MANBA RXFP1 TLK1
ATF7IP2 CTD-2616J11.11 GRAMD4 MAP3K13 RYR1 TLR10
ATP10A CTD-3088G3.8 GRB10 MAPT RYR2 TMCC3
ATP10D CTD-3105H18.16 GREB1 MARCH1 RYR3 TMCO3
ATP2C2 CTD-3105H18.18 GRHL2 MARCH4 SAMD12 TMED3
ATP6V0A4 CTIF GRID1 MARCH7 SAMD3 TMEM104
ATP6V0E2 CTNNA2 GRID2 MARK4 SAMD5 TMEM106B
ATP6V1E1 CTNNA3 GRIK1 MAST4 SBF2 TMEM117
ATP8A1 CTNND2 GRIK2 MATN1 SCARB2 TMEM128
ATP8A2 CUBN GRIK4 MB21D2 SCD5 TMEM129
ATP9A CUX1 GRIN2A MCF2L SCLY TMEM132B
ATRNL1 CWF19L2 GRIN3A MCF2L2 SCML4 TMEM132C
ATXN3 CXCL11 GRIN3B MCM9 SCN1A TMEM132D
AVEN CYB5A GRIP1 MDGA2 SCN3A TMEM135
AXDND1 CYB5R2 GRM4 MECOM SCNN1G TMEM156
B3GNTL1 CYBRD1 GRM7 MEGF11 SCP2 TMEM179
B4GALNT2 CYP24A1 GRM8 MEIOB SCUBE1 TMEM220
BAI3 CYP4F12 GSTO1 MEOX2 SDC2 TMEM229B
BARD1 CYP4F3 GTF2H4 MFAP3 SDK2 TMEM232
BBS9 DAAM1 GTF2IRD1 MFSD6L SDR39U1 TMEM244
BCAR3 DAAM2 GTF3C6 MGAT5 SEMA3A TMEM259
BCAS1 DAB1 GUCA1A MGAT5B SEMA3E TMEM44
BCAS3 DAB2 HAAO MGMT SEMA6D TMEM51
BCKDHB DAD1 HABP2 MGST2 SEPT11 TMEM63C
BCL2L14 DAPK1 HAGH MGST3 SEPT9 TMEM71
BCR DCBLD1 HBE1 MICAL3 SERINC5 TMEM88B
BDH1 DCC HBG2 MICALCL SERPINA5 TMPRSS11E
BEST3 DCDC2C HDAC4 MICB SERPINB5 TMPRSS2
BFSP2 DCHS2 HDAC7 MIS12 SFTPD TMTC1
BICC1 DCTD HEATR1 MITF SGCG TMTC2
Supplementary Figures
S3 Fig. Effect of sequence length on NCD2(0.5) power (Africa).
ROC curves for sequence lengths (L) of (A) 3 Kb, (B) 6 Kb, and (C) 12 Kb. Each plot shows NCD2(0.5) performance for
simulations where the balanced polymorphism is modeled to achieve an equilibrium frequency (feq) of 0.5 (blue), 0.4
(orange), 0.3 (pink), or 0.2 (green), based on simulations under the African demographic scenario and Tbs = 5 myr.
FPR, false positive rate (100-Specificity); TPR, true positive rate (sensitivity, or power). Note that the x-axis ranges
from 0 to 0.05, while the y-axis ranges from 0 to 1.
S4 Fig. Effect of sequence length on NCD2(0.5) power (Africa).
ROC curves for sequence lengths (L) of (A) 3 Kb, (B) 6 Kb, and (C) 12 Kb. Each plot shows NCD2(0.5) performance for
simulations where the balanced polymorphism is modeled to achieve an equilibrium frequency (feq) of 0.5 (blue), 0.4
(orange), 0.3 (pink), or 0.2 (green), based on simulations under the African demographic scenario and Tbs = 3 myr.
FPR, false positive rate (100-Specificity); TPR, true positive rate (sensitivity, or power). Note that the x-axis ranges
from 0 to 0.05, while the y-axis ranges from 0 to 1.
S5 Fig. Effect of sequence length on NCD2(0.5) power (Europe).
ROC curves for sequence lengths (L) of (A) 3 Kb, (B) 6 Kb, and (C) 12 Kb. Each plot shows NCD2(0.5) performance for
simulations where the balanced polymorphism is modeled to achieve an equilibrium frequency (feq) of 0.5 (blue), 0.4
(orange), 0.3 (pink), or 0.2 (green), based on simulations under the European demographic scenario and Tbs = 5 myr.
FPR, false positive rate (100-Specificity); TPR, true positive rate (sensitivity, or power). Note that the x-axis ranges
from 0 to 0.05, while the y-axis ranges from 0 to 1.
S6 Fig. Effect of sequence length on NCD2(0.5) power (Europe).
ROC curves for sequence lengths (L) of (A) 3 Kb, (B) 6 Kb, and (C) 12 Kb. Each plot shows NCD2(0.5) performance for
simulations where the balanced polymorphism is modeled to achieve an equilibrium frequency (feq) of 0.5 (blue), 0.4
(orange), 0.3 (pink), or 0.2 (green), based on simulations under the European demographic scenario and Tbs = 3 myr.
FPR, false positive rate (100-Specificity); TPR, true positive rate (sensitivity, or power). Note that the x-axis ranges
from 0 to 0.05, while the y-axis ranges from 0 to 1.
S7 Fig. Effect of sequence length on NCD2(0.5) power (Asia).
ROC curves for sequence lengths (L) of (A) 3 Kb, (B) 6 Kb, and (C) 12 Kb. Each plot shows NCD2(0.5) performance for
simulations where the balanced polymorphism is modeled to achieve an equilibrium frequency (feq) of 0.5 (blue), 0.4
(orange), 0.3 (pink), or 0.2 (green), based on simulations under the Asian demographic scenario and Tbs = 5 myr.
FPR, false positive rate (100-Specificity); TPR, true positive rate (sensitivity, or power). Note that the x-axis ranges
from 0 to 0.05, while the y-axis ranges from 0 to 1.
S8 Fig. Effect of sequence length on NCD2(0.5) power (Asia).
ROC curves for sequence lengths (L) of (A) 3 Kb, (B) 6 Kb, and (C) 12 Kb. Each plot shows NCD2(0.5) performance for
simulations where the balanced polymorphism is modeled to achieve an equilibrium frequency (feq) of 0.5 (blue), 0.4
(orange), 0.3 (pink), or 0.2 (green), based on simulations under the Asian demographic scenario and Tbs = 3 myr.
FPR, false positive rate (100-Specificity); TPR, true positive rate (sensitivity, or power). Note that the x-axis ranges
from 0 to 0.05, while the y-axis ranges from 0 to 1.
Fig S10. ROC curves for comparison between NCD2(0.5) and other tests (Europe).
Power to detect LTBS for simulations where the balanced polymorphism was modeled to achieve an equilibrium
frequency (feq) of (left) 0.3, (center) 0.4, and (right) 0.5. Plotted values are for European demography, Tbs = 5 myr,
and L = 3 kb. Target frequency for NCD1 and NCD2 matches the simulated feq.
Fig S11. Relationship between NCD2(tf) and the number of informative sites.
NCD2(tf) was calculated for neutral simulations (10,000 for each bin of IS) for
the African demographic scenario, and the 0.01 quantile value for each bin is
plotted. Blue (tf = 0.5), orange (tf = 0.4), pink (tf = 0.3), green (tf = 0.2).
Fig S12. Proportion of windows per chromosome. Sets of significant and outlier
windows are derived from the union of three target frequencies (0.3, 0.4,
0.5). Grey, all genomic windows; green, significant windows; blue, outlier windows.
Fig S13. Proportion of positions in the genome retained after each filter.
Proportion of the hg19 human reference genome (total base-pairs = 2,684,573,005)
retained for each individual filtering criterion described in the Methods,
and for all filters applied jointly. Proportion of sequences retained:
Map50=0.843; TRF=0.976; SD=0.961; pantro2=0.961; all=0.819. Map50: mappability
50-mer (see Methods); TRF: tandem repeats; SD: segmental duplications;
pantro2: reference chimp genome.
Fig S16. Number of paralog genes per gene on the same chromosome. For each protein-coding gene from human
autosomes (19,349), the number of paralogs present in the same chromosome is plotted (left, gray). All autosomal
genes come from Ensembl for hg19, regardless of whether they were queried for NCD2. Significant genes (middle, blue) come
from the union of significant genes for YRI considering all tf values (see Table 2 in main paper). Significant genes
without olfactory receptor genes (21 ORs in total) are shown on the right (green). y-axis, relative frequency of the
genes that contain a given number of paralogs on the same chromosome. Note that the distributions are very similar
for the background and significant genes.
Fig S18. NCD2(0.5) empirical values for each bin of informative sites, IS. A) NCD2(0.5) for windows with IS between 1
and 100 for YRI (>99% of all windows); B) NCD2(0.5) for all windows with IS between 1 and 100 for GBR (>99% of all
windows). In blue, median value for all the windows within a given bin. Note that the medians stabilize around 20
IS for YRI and around 15 IS for GBR.
Fig S19. Number of paralog genes per gene on the same chromosome. For
each OR gene contained within any of the significant sets of windows (all
populations and tf values, 53 ORs in total), the number of paralogs present in the
same chromosome is plotted. y-axis, relative frequency of the genes that contain
a given number of paralogs on the same chromosome. Compare with the
distributions in S16 Fig.
Fig S20. Venn diagrams of candidate windows for four populations. A, left,
significant windows; B, right, outlier windows; YRI, Yoruba; LWK, Luhya; GBR,
Great Britain; TSI, Toscani. The set of significant windows for each population
comes from the union of significant and outlier windows for tf=0.3, 0.4, 0.5 (see
Results and Methods). African populations are shown in tones of purple, and
European in tones of green.
Fig S21. Venn diagrams of significant windows for four populations, for each tf value. A, left, significant windows;
B, right, outlier windows; YRI, Yoruba; LWK, Luhya; GBR, Great Britain; TSI, Toscani. The set of significant windows
for each population comes from those detected with each tf value. African populations are shown in tones of
purple, and European in tones of green.
Fig S22. Venn diagrams of outlier windows for four populations, for each tf value. A, left, significant windows;
B, right, outlier windows; YRI, Yoruba; LWK, Luhya; GBR, Great Britain; TSI, Toscani. The set of significant windows
for each population comes from those detected with each tf value. African populations are shown in tones of
purple, and European in tones of green.
Chapter 2
Initial Considerations
...the influence that “selective sweeps” have on nearby neutral variants, reducing
diversity, and, less frequently, on non-neutral variants, limiting the efficacy
of selection at linked sites (Betancourt and Presgraves, 2002; Chun and Fay,
2011). To date, no study has sought to assess the impact that selection for the
maintenance of a balanced polymorphism has on linked deleterious variants
(except for HLA, Lenz et al., 2016). We therefore set out to test the hypothesis
that balancing selection on a site increases the abundance of linked deleterious
alleles which, in the absence of balancing selection, could have been eliminated
by purifying selection.
To address this question, we made use of the genes with signatures of balancing
selection identified in Chapter 1¹. In this work I had the collaboration of
Débora Y.C. Brandt (PhD student, University of California, Berkeley) and
Jônatas E. César (postdoctoral researcher, Universidade de São Paulo, IB), in
addition to Diogo Meyer, who supervised the work. D.Y.C.B. organized the 1000
Genomes Project data for our analyses, calculated allele frequencies per
population, and contributed functional annotations for the SNPs. J.E.C. developed
efficient scripts for the resampling approaches described in the manuscript and
pre-processed the data for our analyses. I took part in all the steps described,
planned the analyses to be carried out (together with D.M. and J.E.C.), and wrote
the manuscript together with D.M., with the collaboration and approval of the
other co-authors. All contributed to the discussion of the results. We intend to
submit it to the journal Genetics.
al. (n.d).
Introduction
Understanding the dynamics and factors that interfere with the ef-
mutation rate, but also rely on effective population size, dominance, and the
assumption that populations are in equilibrium (Brandvain and Wright, 2016).
Moreover, the methods used in the determination of what makes a mutation
“deleterious” vary considerably (reviewed in Henn et al., 2015). Therefore, it is
possible that many analyses on the load of deleterious mutations carry inaccu-
racies (reviewed in Brandvain and Wright, 2016).
Three processes have a central role in accounting for the abundance and dis-
tribution of deleterious mutations in the genome: mutation, drift, and selection
(Brandvain and Wright, 2016). Firstly, the balance between influx via muta-
tion and removal via purifying selection results in a dynamic process, where a
large number of weakly selected variants can be maintained at low frequencies.
Exome and genome-wide studies reporting an enrichment of recent deleterious
mutations are strong evidence for this process (Casals et al., 2013; Fu et al., 2012;
Tennessen et al., 2012; Kiezun et al., 2013).
Although the former result is compatible with a rich body of literature doc-
A third factor that can account for the load in our genomes is pleiotropy,
which is widespread in the human genome. For example, several studies show
that disease alleles are often also positively selected, indicating that a deleteri-
ous variant has been pushed to a high frequency due to some other contribution
to fitness it displays (e.g. Corona et al., 2010).
Here, we explore an additional process that can play a role in shaping the
load of mutations: the effect of selection on closely linked loci. It is plausible
that at least part of the mutational load in humans is due not to demographic
factors, but to indirect consequences of selection in adjacent loci (Figure 1).
gions that lie close to sites under either directional or balancing selection (e.g.
Charlesworth, 2006; Charlesworth, 2012; Cutter and Payseur, 2013; Nielsen,
2005, to name a few). Recombination rates and neutral genetic diversity are
correlated in several organisms (reviewed in Charlesworth, 2012; Cutter and
Payseur, 2013) and, when selective sweeps occur, a depletion of neutral diversity
is observed around the selected site (Charlesworth, 2012; Cutter and Payseur,
2013; Nielsen, 2005). The extent of this effect is a consequence of both recom-
bination rates and the intensity of selection (Roux et al., 2013; Schierup et al.,
2000; Charlesworth et al., 1997).
All these findings suggest that there is a complex interaction of different se-
lective forces targeting linked sites and possibly that linkage limits the efficiency
of purifying selection in purging deleterious mutations from the genome. In this
context, we propose to examine the following question: does balancing selec-
tion targeting certain sites in the human genome interfere with the effectiveness
of natural selection in nearby sites, as has been observed for strong directional
selection?
Our expectation, given the theoretical background outlined above, was that
regions with evidence for balancing selection would show an enrichment of
Methods
Population datasets
In order to test the hypothesis that sites within genes with evidence for long-
term balancing selection (LTBS) show an excess of deleterious variants, we con-
sidered all protein-coding SNP positions (nonsynonymous, N, and synonymous,
S) from the 1000 Genomes Phase 3 data (Auton et al., 2015). We selected SNPs
that fall within the coordinates of genes with signatures of LTBS (“balanced
genes”, see below) or within the target windows per se (“balanced windows”)
(Tables 1 and 2).
We used the integrated call sets in VCF format for each chromosome, and
calculated reference and alternative allele frequencies per population using VCFtools
(Danecek et al., 2011). We only considered populations from Africa and Eu-
rope, and excluded the admixed ones, thus resulting in 10 populations: [Africa:
Yoruba in Ibadan, Nigeria (YRI), Luhya in Webuye, Kenya (LWK), Mende in
Sierra Leone (MSL), Gambian in Western Division, The Gambia (GWD), Esan
in Nigeria (ESN)]; [Europe: Toscani in Italy (TSI), British in England and Scot-
land (GBR), Iberian populations in Spain (IBS), Finnish in Finland (FIN), Utah
residents with Northern and Western European ancestry (CEU)]. We did not in-
clude Asian populations because the targets of balancing selection defined by
Bitarello et al. (n.d.) were only documented for African and European popula-
tions.
The "balanced genes" are those reported by Bitarello et al. (n.d.) (see their Table
3) as having the strongest statistically significant signatures of LTBS in humans
(213 genes in total). The list of balanced genes was generated by intersecting 3
Kb windows with strong signatures of LTBS with the protein-coding gene anno-
tation from Encode/Ensembl (Bitarello et al., n.d.). Here we test the hypothesis
of an enrichment in the proportion of deleterious variants in the balanced genes
per se (Table 1).
A more specific definition of the regions under balancing selection would involve
the analysis of "balanced windows" (i.e., the queried sub-region of a gene
with evidence for balancing selection, according to the method of Bitarello et al.,
n.d.). However, this approach generates a dataset which is too restrictive (Table
2), with a number of SNPs that is too small to provide reliable contrasts among
regions under balancing selection with respect to the rest of the genome (with
on average 209 protein-coding SNPs per population, after the HLA genes are
removed). We therefore chose to restrict our analyses to the balanced genes, for
which we have a larger number of SNPs documenting the influence of selection
on nearby sites.
Annotation
One of the summary statistics used to quantify the genetic load was the CADD
score (Combined Annotation Dependent Depletion, or simply “C score”) described
in Kircher et al. (2014). We used the annotation provided by Kircher et al. (2014)
(available at: http://cadd.gs.washington.edu/download, accessed in March 2016).
Because two of our measurements of genetic load (see below) are only ap-
plicable to protein-coding sites, we restricted our analyses to these categories.
Thus, all quantification of deleterious load was restricted to sites which are
protein-coding (i.e., N or S) (Tables 1 and 2). For the step where we iden-
tified the specific SNPs with highest heterozygosity for each gene (and which
is/are the putative target(s) of balancing selection, see below) we used the com-
plete set of sites within the gene, since it is plausible that sites which are not
protein-coding are under balancing selection.
Given that the effect of balancing selection on the load of nearby regions is
also expected to affect sites which are functional but not protein-coding, our ap-
proach could theoretically be extended to this class of sites (for example, using
a measure of deleteriousness such as the C score, which is applicable to these
sites as well). However, in the present study we opted to restrict our analyses to
variants that affect protein-coding sequences. We justify this based on the fact
that assignment of deleteriousness at these sites can be performed by estimating
the ratio of nonsynonymous to synonymous polymorphisms, a measure which
is based exclusively on the nature of the variants, with no direct influence of
allele frequencies and phylogenetic conservation.
All protein-coding SNPs from the set of balanced genes (Table 1) were jointly
considered when calculating the statistics below, i.e., a single estimate of load
was made for the entire set of genes with evidence for balancing selection. This
avoids the difficulty of obtaining reliable estimates when computing load for
individual genes, since these often have a small number of SNPs. For controls,
the same approach was adopted for each re-sampled set of SNPs (details below).
$$ P_N/P_S = \frac{P_N}{P_S} \tag{1} $$
Balanced genes
POP # sites N S Pdel1 Pdel2 Benign Cscore
YRI 3,423(2,961) 1,871(1,564) 1,552(1,397) 197(182) 262(238) 1,300(1,034) 10.72(11.36)
LWK 3,587(3,126) 1,959(1,654) 1,628(1,472) 213(194) 281(260) 1,357(1,093) 10.85(11.52)
MSL 3,387(2,937) 1,837(1,543) 1,550(1,394) 198(184) 241(222) 1,273(1,013) 10.63(11.26)
GWD 3,550(3,098) 1,967(1,668) 1,583(1,430) 198(181) 292(272) 1,353(1,091) 10.75(11.35)
ESN 3,261(2,821) 1,804(1,517) 1,457(1,304) 202(187) 265(248) 1,238(984) 10.91(11.63)
TSI 2,596(2,146) 1,479(1,181) 1,117(965) 175(161) 230(211) 997(733) 10.88(11.82)
GBR 2,299(1,866) 1,328(1,043) 971(823) 155(140) 188(173) 919(664) 10.69(11.70)
FIN 2,149(1,715) 1,230(948) 910(767) 138(124) 173(159) 864(610) 10.33(11.36)
CEU 2,353(1,925) 1,334(1,054) 1,019(871) 145(132) 198(183) 924(672) 10.66(11.66)
IBS 2,612(2,153) 1,507(1,200) 1,105(953) 171(155) 237(216) 1,034(764) 10.82(11.71)
Table 1: Statistics for protein-coding SNPs within the balanced genes. Genes under balancing selection were defined
by Bitarello et al. (n.d.). Numbers in parentheses refer to the datasets after removal of HLA genes (see Methods).
Cscore, average scaled C score for all SNPs in the set (see Methods). N and S, numbers of nonsynonymous and
synonymous SNPs (PN and PS), respectively. Pdel1, Pdel2, and Benign, numbers of possibly damaging, probably
damaging and benign variants (Adzhubei et al., 2010), respectively.
Balanced windows
POP # sites N S Pdel1 Pdel2 Benign Cscore
YRI 635(225) 401(124) 234(102) 23(9) 30(9) 333(93) 7.50(9.26)
LWK 651(236) 400(122) 251(114) 29(12) 28(8) 332(9) 7.34(9.07)
MSL 636(234) 391(124) 245(110) 20(7) 29(10) 327(93) 7.52(9.29)
GWD 652(248) 406(136) 246(112) 29(13) 35(16) 328(93) 7.88(9.91)
ESN 630(235) 387(125) 243(110) 28(13) 24(8) 321(91) 7.46(9.43)
TSI 612(205) 388(113) 224(92) 27(13) 33(15) 317(75) 7.62(9.93)
GBR 561(171) 361(100) 200(71) 24(9) 28(14) 302(70) 7.25(9.31)
FIN 553(166) 351(91) 202(75) 22(8) 24(10) 297(65) 7.02(8.80)
CEU 559(172) 353(96) 206(76) 19(6) 28(14) 297(67) 7.17(9.29)
IBS 607(196) 384(107) 223(89) 26(11) 33(15) 314(70) 7.46(9.58)
To estimate the number of damaging alleles in the “balanced” and control sets
of SNPs, we used the PolyPhen-2 (Adzhubei et al., 2010) annotation provided
in Kircher et al. (2014). PolyPhen-2 classifies nonsynonymous variants as
benign, possibly damaging (Pdel1) or probably damaging (Pdel2). We thus
defined the ratio of damaging to synonymous SNPs (Lohmueller et al., 2008) as:
$$ P_{del}/P_S = \frac{P_{del1} + P_{del2}}{P_S} \tag{2} $$
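As an illustration of equations (1) and (2), the short sketch below computes both ratios from SNP counts. The function name `load_ratios` and its argument names are hypothetical and not part of the original analysis; the counts used in the example are the YRI values from Table 1 (all balanced genes, HLA included).

```python
def load_ratios(n_nonsyn, n_syn, n_possibly, n_probably):
    """Return (PN/PS, Pdel/PS) for one set of SNPs.

    n_nonsyn   -- number of nonsynonymous SNPs (PN)
    n_syn      -- number of synonymous SNPs (PS)
    n_possibly -- PolyPhen-2 "possibly damaging" SNPs (Pdel1)
    n_probably -- PolyPhen-2 "probably damaging" SNPs (Pdel2)
    """
    if n_syn <= 0:
        raise ValueError("the ratios are undefined without synonymous SNPs")
    pn_ps = n_nonsyn / n_syn                      # equation (1)
    pdel_ps = (n_possibly + n_probably) / n_syn   # equation (2)
    return pn_ps, pdel_ps


# YRI counts taken from Table 1 (balanced genes, HLA included).
yri_pn_ps, yri_pdel_ps = load_ratios(1871, 1552, 197, 262)
print(f"PN/PS = {yri_pn_ps:.3f}, Pdel/PS = {yri_pdel_ps:.3f}")
```

Because both ratios share the synonymous count as denominator, they are computed once per pooled set of SNPs rather than per gene, in line with the pooling described above.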
CADD (C score)
The C score was provided by the CADD tool (Kircher et al., 2014). The C score is
a composite measure that integrates information from more than 60 different
annotation methods to quantify the effects of a mutation and has been shown to differentiate lev-
We used both the scaled and the raw C scores provided for the 1000G phase
3 SNPs. The scaled C scores range from 1 to 99 (higher values indicating higher
deleteriousness potential). Although counting the number of SNPs above a certain
threshold could be used as a strategy, we used the approach of comparing
distributions of C scores between groups in order to increase power (Kircher
et al., 2014). Throughout the discussion, when C scores are presented they re-
fer to the average C score of all N + S SNPs contained in a given set of SNPs
(balanced or control). We restricted the analyses to these sites so as to make the
results comparable to those of the two other metrics for measuring deleterious-
ness.
$$ C_{score} = \frac{\sum_{i=1}^{n} C_{N_i} + \sum_{i=1}^{n} C_{S_i}}{N + S} \tag{3} $$
where n is the total number of SNPs in the set, and CNi and CSi are the C
scores of the N and S SNPs contained in the set, respectively. The overall C score used
in our analyses is thus an average of the C scores of all N + S SNPs contained
within the “balanced” and control sets of SNPs.
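The average in equation (3) can be sketched as follows. This is a minimal illustration, not the original pipeline: the function name `mean_c_score` and the example score values are hypothetical.

```python
def mean_c_score(c_nonsyn, c_syn):
    """Average C score over all N + S SNPs of a set (equation 3).

    c_nonsyn -- C scores of the nonsynonymous (N) SNPs
    c_syn    -- C scores of the synonymous (S) SNPs
    """
    total = len(c_nonsyn) + len(c_syn)
    if total == 0:
        raise ValueError("the set contains no protein-coding SNPs")
    return (sum(c_nonsyn) + sum(c_syn)) / total


# Made-up scores for two N and two S SNPs.
example_mean = mean_c_score([12.0, 20.0], [2.0, 6.0])
print(example_mean)
```

The same function would be applied to the "balanced" set and to each re-sampled control set, so that all sets are summarized by a single comparable number.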
Scaled C scores are very useful for identifying a top ranked SNP and easier
to interpret, but raw C scores offer superior resolution for comparison of dis-
tributions of scores between groups of variants (Kircher et al., 2014). Thus, we
also compared the distribution of raw C scores for SNPs within balanced genes
to those of the re-sampled sets of controls, and performed a one-tailed Mann-
Whitney U-test (the alternative hypothesis being that SNPs from balanced genes
have higher raw C scores) to compare the balanced SNPs’ distribution to each
control replicate (significance threshold 5%).
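The comparison described above can be sketched with a one-tailed Mann-Whitney U-test written with the standard library only. This is an illustrative sketch under stated assumptions, not the implementation used in the study: it uses the normal approximation without continuity or tie correction of the variance (ties only receive average ranks), and the names `average_ranks`, `mann_whitney_greater` and `count_significant` are hypothetical.

```python
import math


def average_ranks(values):
    """1-based ranks of `values`, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=values.__getitem__)
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def mann_whitney_greater(x, y):
    """Return (U, p) for the alternative 'x tends to be larger than y'."""
    n1, n2 = len(x), len(y)
    if n1 == 0 or n2 == 0:
        raise ValueError("both samples must be non-empty")
    ranks = average_ranks(list(x) + list(y))
    u = sum(ranks[:n1]) - n1 * (n1 + 1) / 2
    mean_u = n1 * n2 / 2
    sd_u = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u - mean_u) / sd_u
    p = 0.5 * math.erfc(z / math.sqrt(2))  # upper-tail normal probability
    return u, p


def count_significant(balanced_scores, control_replicates, alpha=0.05):
    """Number of control replicates with significantly lower scores."""
    return sum(
        1 for control in control_replicates
        if mann_whitney_greater(balanced_scores, control)[1] <= alpha
    )
```

With the data of this study, `balanced_scores` would hold the raw C scores of SNPs in balanced genes and `control_replicates` one list of raw C scores per re-sampled control set.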
To test the hypothesis that balanced genes are enriched for deleterious SNPs, we
compared the three statistics that measure deleteriousness between the SNPs
contained within the genes under balancing selection and a random sample
with the same number of SNPs, but chosen from genes with no evidence for
balancing selection (controls) (Table 1). We used the distributions of control SNPs
to obtain an empirical p-value for the SNPs from balanced genes, defined as the
fraction of re-sampled sets with deleteriousness statistics that are
more extreme (i.e., higher) than those of the SNPs from the balanced genes.
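The empirical p-value defined above amounts to counting control replicates. The sketch below follows the wording of the text (strictly higher values count as more extreme); the function name `empirical_p_value` and the example numbers are hypothetical.

```python
def empirical_p_value(observed, control_values):
    """Fraction of control replicates whose statistic exceeds the observed one.

    observed       -- statistic for the SNPs from balanced genes
    control_values -- the same statistic for each re-sampled control set
    """
    if not control_values:
        raise ValueError("at least one control replicate is required")
    n_higher = sum(1 for value in control_values if value > observed)
    return n_higher / len(control_values)


# Made-up example: 2 of 4 control replicates exceed the observed value.
example_p = empirical_p_value(1.20, [1.00, 1.10, 1.30, 1.25])
print(example_p)
```

With this definition the smallest non-zero p-value that can be reported is one over the number of replicates, so the resolution of the test depends directly on how many control sets are drawn.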
Previous studies have shown, and we confirm here (see Results), that there
is a strong negative correlation between allele frequency and the probability of a variant
being annotated as deleterious. Because genes/regions under balancing
selection are enriched for SNPs at intermediate frequencies (i.e., higher
heterozygosities), this effect will itself result in a marked difference between
measures of load for balanced genes and the genomic background. In order
to control for this effect and guarantee that differences in load are attributable
specifically to the effects of linked selection, we compared the proportion of
deleterious variants in balanced genes/windows to those of the control sets of
SNPs after matching the control SNPs to the frequencies of those in the “bal-
anced” set. Next we describe the procedure used to re-sample a set of SNPs
Once the protein-coding SNPs from balanced genes had been selected, we
followed an approach similar to the one adopted by Subramanian (2016): (1) we
took the MAF (minor allele frequency) of each protein-coding SNP; (2) we
calculated the log (base 10) of the MAF (logMAF), because in humans the MAF
follows an exponential distribution, i.e., a very large proportion of alleles have very
low MAF (e.g. Abecasis et al., 2012; Subramanian, 2016); (3) we divided the
SNPs into bins according to the logMAF (in our case, 9 bins of width 0.25
spanning logMAF = -2.4 to -0.15, encompassing a MAF range of
0.00398-0.5). Given that we did not expect balancing selection to favour derived
or ancestral variants preferentially (Bitarello et al., n.d.), using the MAF is
appropriate and does not require further filtering of data in order to infer ancestral
and derived states. Importantly, once the set of SNPs from balanced genes was
divided into bins of logMAF, we were able to quantify the relative contribution
of SNPs to each bin, thus allowing the re-sampled sets of SNPs to match the
site-frequency spectrum (SFS) of the SNPs observed in balanced genes. We
re-sampled from the control SNPs a set following the proportions of each logMAF
bin and the total number of SNPs within the set of target genes (Table 1).
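The binning and frequency-matched re-sampling described above can be sketched as follows. This is a simplified illustration under explicit assumptions: the bin edges start at logMAF = -2.4 with width 0.25 (9 bins), SNPs are drawn without replacement within one replicate, and the names `logmaf_bin` and `matched_resample` are hypothetical rather than taken from the original scripts.

```python
import math
import random
from collections import Counter, defaultdict

BIN_START = -2.4   # assumed lower edge of the first logMAF bin
BIN_WIDTH = 0.25   # assumed bin width
N_BINS = 9


def logmaf_bin(maf):
    """0-based index of the logMAF bin that a SNP falls into."""
    index = math.floor((math.log10(maf) - BIN_START) / BIN_WIDTH)
    return min(max(index, 0), N_BINS - 1)  # clamp values outside the range


def matched_resample(target_mafs, control_mafs, rng):
    """Draw control SNP indices matching the per-bin counts of the target set."""
    target_counts = Counter(logmaf_bin(maf) for maf in target_mafs)
    pool = defaultdict(list)
    for index, maf in enumerate(control_mafs):
        pool[logmaf_bin(maf)].append(index)
    sample = []
    for bin_index, count in sorted(target_counts.items()):
        if len(pool[bin_index]) < count:
            raise ValueError(f"not enough control SNPs in bin {bin_index}")
        sample.extend(rng.sample(pool[bin_index], count))
    return sample


# Made-up MAFs: 4 target SNPs and 7 candidate control SNPs.
target = [0.005, 0.10, 0.45, 0.45]
control = [0.005, 0.006, 0.10, 0.11, 0.45, 0.42, 0.48]
drawn = matched_resample(target, control, random.Random(0))
```

Each call yields one control replicate with the same number of SNPs and the same per-bin counts as the target set; repeating the call with different random states gives the distribution of control replicates.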
This re-sampling scheme was designed to account for the fact that all of the
genetic load measurements adopted here correlate negatively with allelic fre-
quency (see Results) and that there is an enrichment of intermediate-frequency
alleles among the balanced genes (Bitarello et al., n.d.). Each SNP was sampled
independently of its location, provided that it was protein-coding, autosomal
and matched the logMAF proportions calculated based on the balanced genes.
This means that each control set had the same number of protein-coding SNPs
as the balanced genes’ set and a similar SFS (Table 1), but those SNPs were not
necessarily attributed to the same number of genes as those for the balanced
genes.
Our goal is to test the hypothesis that SNPs within genes under balancing selec-
tion have a higher proportion of deleterious variants than expected for a set of
control SNPs. However, an excess of deleterious or functional variation could
be an outcome of the direct effects of balancing selection. For example, the un-
usually high proportion of nonsynonymous polymorphism in HLA genes is a
consequence of balancing selection directly on functional sites, and not of dele-
terious variants accumulating as a byproduct of selection on a specific site (e.g.
Hughes and Nei, 1988; Bitarello et al., 2015).
In order to separate the direct effects of balancing selection from those due to
hitch-hiking, we also calculated the genetic load measurements after excluding
the sites which are the strongest candidates for balancing selection (thus justi-
fying the assumption that the remainder of the highly polymorphic variants are
present due to linkage with this selected variant). This approach relies on the
assumption that one, or at most a few, sites are the targets of balancing
selection within each gene.
For each balanced gene, the putatively selected SNPs were identified by lo-
cating within the outlier window with evidence for balancing selection (as re-
ported in Bitarello et al., n.d.) the site with the highest heterozygosity. This SNP
was then excluded from the set of SNPs for the balanced genes. When a bal-
anced gene had more than one balanced window, we chose the one with the
most extreme signature of LTBS (Bitarello et al., n.d.).
where i is each SNP position and MAF is the minor allele frequency for that
position. For this exclusion step, all SNPs were considered, not only the
protein-coding (N and S) ones, and most of the excluded SNPs were intronic (∼90% per population,
Table 3). When the most extreme heterozygosity in a gene was shared by multiple
SNPs, all were removed. The average number of SNPs removed per gene,
across all populations, was 3.8 SNPs. Also, few genes had one or more N or
S SNPs removed by this filter (average 14 genes out of 213, per population).
Overall 71 unique (29 N and 42 S) SNPs were removed across all populations
(average 11 N and 34 S per population, Table 3).
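The exclusion step above can be sketched as follows. The sketch assumes the standard expected heterozygosity of a biallelic site, 2 × MAF × (1 − MAF); this formula is an assumption made for the illustration, as are the function names and the SNP identifiers.

```python
import math


def heterozygosity(maf):
    """Expected heterozygosity of a biallelic SNP (assumed definition)."""
    return 2.0 * maf * (1.0 - maf)


def top_heterozygosity_snps(window_mafs):
    """SNP ids sharing the highest heterozygosity within one outlier window.

    window_mafs -- dict mapping SNP id to its minor allele frequency
    """
    if not window_mafs:
        return []
    best = max(heterozygosity(maf) for maf in window_mafs.values())
    return sorted(
        snp for snp, maf in window_mafs.items()
        if math.isclose(heterozygosity(maf), best)
    )


# Made-up window with a tie for the highest heterozygosity.
window = {"snp_a": 0.10, "snp_b": 0.48, "snp_c": 0.48, "snp_d": 0.30}
excluded = top_heterozygosity_snps(window)
print(excluded)
```

Returning every tied SNP mirrors the rule that all SNPs sharing the most extreme heterozygosity are removed. Because heterozygosity increases monotonically with MAF up to 0.5, ranking by MAF would select the same SNPs.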
Table 3: Classes of SNPs with the highest heterozygosity(ies) per gene. For
each gene, the SNP(s) with the highest heterozygosity(ies) were removed (All).
Only SNPs contained within the outlier windows (Bitarello et al., n.d.) of those
genes were considered. Splice, splice-site position; 3’ and 5’ UTR, 3 and 5 prime
UTR regions; N, S, Intron, nonsynonymous, synonymous and intronic sites.
Given that the assumption that a balanced gene/window has one or a few
sites which is/are the actual target(s) of selection is not reasonable for the HLA
genes – where several sites are targets of balancing selection (Hughes and Nei,
1988; Bitarello et al., 2015) – we performed our analyses under two scenarios: ei-
ther keeping the HLA genes or removing them. For these analyses we removed
the following HLA genes, which have prior strong evidence for long-term bal-
ancing selection and are included among the outlier genes in Bitarello et al.
(n.d.): HLA-A, HLA-B, HLA-C, HLA-DRB1, HLA-DRB5, HLA-DPA1, HLA-DPA2,
HLA-DPB1, HLA-DPB2, HLA-DQB1, HLA-DQB2, HLA-DQA1, HLA-DQA2. Their
removal changes the proportion of target SNPs that fall into each bin of logMAF,
with the SFS becoming less enriched for intermediate frequency variants.
Results
Here, we consider as "balanced genes" the set described as having the strongest
signatures of LTBS in Bitarello et al. (n.d.). Balancing selection shifts the SFS
towards intermediate frequencies (Andrés et al., 2009; Bitarello et al., n.d.). Al-
though the selected sites may only comprise a subset of the entire locus, bal-
ancing selection changes levels of polymorphism at adjacent sites (neutral and
non-neutral), thus generating a signature that allows selected genes to be de-
tected (Bitarello et al., n.d.). It is entirely plausible, and likely, that only portions
of those genes are the targets of balancing selection, and this provided us with
an appropriate dataset that has the putative site(s) that were selected and their
immediate vicinities, which show signatures of LTBS (as seen in Figure 5 of
Bitarello et al., n.d.).
The SNPs in the balanced genes were binned according to their MAFs (Ta-
ble 4), and their distribution into the bins was used for the re-sampling proce-
dure for the controls. Because signatures of LTBS are expected to be restricted
to narrow windows (Andrés et al., 2009; Andrés, 2011; Bitarello et al., n.d.;
Charlesworth, 2006) and here we consider the entire gene, this shift towards
intermediate frequencies is modest.
Bins (each cell: n, number of SNPs, followed by the MAF range in parentheses)
Pop  Bin 1            Bin 2           Bin 3          Bin 4          Bin 5          Bin 6           Bin 7            Bin 8            Bin 9
YRI  905 (0.5-0.5)    277 (0.9-0.9)   317 (1.4-1.8)  348 (2.3-3.7)  292 (4.2-6.9)  320 (7.4-12.5)  361 (13-22.2)    416 (22.7-39.3)  187 (39.8-50)
LWK  977 (0.5-0.5)    386 (1-1)       355 (1.5-2)    283 (2.5-3.5)  335 (4-7)      282 (7.6-12.1)  390 (12.6-22.2)  383 (22.7-39.4)  196 (40-50)
MSL  894 (0.6-0.6)    304 (1.12-1.2)  208 (1.8-1.8)  333 (2.3-3.5)  364 (4.1-7)    309 (7.6-12.3)  390 (12.9-22.3)  390 (22.9-39.4)  195 (40-50)
GWD  974 (0.4-0.4)    304 (0.9-0.9)   395 (1.3-2.2)  269 (2.6-3.5)  322 (4-6.6)    311 (7.1-12.4)  368 (12.8-22.1)  404 (22.6-39.4)  203 (39.8-50)
ESN  790 (0.5-0.5)    303 (1-1)       306 (1.5-2)    268 (2.5-3.5)  334 (4-7.1)    300 (7.6-12.1)  372 (12.6-22.2)  388 (22.7-39.4)  200 (39.9-50)
TSI  820 (0.5-0.5)    173 (0.9-0.9)   167 (1.4-1.9)  175 (2.3-3.7)  171 (4.2-7)    177 (7.5-12.1)  329 (12.6-22)    390 (22.4-39.7)  194 (40.2-50)
GBR  586 (0.55-0.55)  162 (1.1-1.1)   150 (1.6-2.2)  145 (2.7-3.8)  157 (4.4-6.6)  194 (7.1-12.1)  313 (12.6-22)    363 (22.5-39.6)  229 (40.1-50)
FIN  409 (0.5-0.5)    143 (1-1)       171 (1.5-2)    153 (2.5-3.5)  196 (4-7)      203 (7.6-12.1)  304 (12.6-22.2)  370 (22.7-39.4)  304 (39.9-50)
CEU  644 (0.5-0.5)    147 (1-1)       138 (1.5-2)    160 (2.5-3.5)  194 (4-7.1)    165 (7.6-12.1)  325 (12.6-22.2)  376 (22.7-39.4)  204 (39.9-50)
IBS  825 (0.5-0.5)    173 (0.9-0.9)   176 (1.4-1.8)  158 (2.3-3.7)  191 (4.2-7)    176 (7.5-12.1)  311 (12.6-22)    392 (22.4-39.7)  210 (40.2-50)
Figure 2: Site frequency spectrum for protein-coding (N + S) SNPs. SNPs come from balanced (blue) and control
genes (gray). Rectangles identify the sections that are zoomed in on in (B) and (C).
had enough time to purge deleterious variants from the population (assuming
most nonsynonymous variants are deleterious). This is in fact the pattern seen
for human populations, where there is a vast excess of low frequency nonsyn-
onymous variants (Casals et al., 2013; Fu et al., 2012; Tennessen et al., 2012).
Moreover, C scores also tend to be higher for lower frequency variants (Kircher
et al., 2014), although it has been shown that C score distributions have power
to differentiate lead-SNPs and tag-SNPs from GWAS, which by definition have
similar frequencies (Kircher et al., 2014).
We confirmed these patterns with the 1000 Genomes data we analyzed (Fig-
ure 3). When dividing all protein-coding SNPs (whether they fall into balanced
genes or not) into bins of minor allele frequencies (logMAF, see Methods), a
clear negative correlation is observed between MAF and the three statistics:
PN /PS , Pdel /PS and Cscore (Figure 3).
All of the aforementioned observations indicate the importance of control-
ling for allele frequencies when analyzing the load of deleterious mutations
among balanced genes. Lack of a control would cause higher load among con-
trol SNPs than for the SNPs from balanced genes, as a consequence of an en-
richment in intermediate frequency variants in balanced genes (Bitarello et al.,
n.d.) and the fact that deleterious variants are more abundant in the lower bins
of MAF (Adzhubei et al., 2010; Kircher et al., 2014; Lohmueller et al., 2008; Sub-
ramanian, 2016).
Figure 3: Boxplot of load statistics by each bin of MAF. All autosomal N + S SNPs were included here. Bins were
defined based on the log (base 10) of the MAF of each variant (see Methods). y-axis, (A) PN /PS , (B) Pdel /PS , (C) Cscore.
Figure 4: Correlations between load summary statistics. (A) C score and PN /PS (cor = 0.80 for all populations
combined, Pearson, p < 2.2e−16), (B) Pdel /PS and C score (cor = 0.91, p < 2.2e−16) and (C) Pdel /PS and PN /PS (cor = 0.91,
p < 2.2e−16). Each color corresponds to one population and each point refers to the metric estimated for a
re-sampled set of SNPs. Lines represent linear regressions for each population. For correlation values per population,
see Table 5.
In the scan for balancing selection of Bitarello et al. (n.d.), HLA genes were
over-represented among the category of selected genes, and showed extremely
strong evidence for selection, with 12 classical HLA genes present among the
213 genes with strongest signatures of balancing selection. This observation,
together with the fact that HLA genes are likely to carry several sites under the
direct effects of balancing selection, led us to single them out for an exploratory
analysis.
We initially compared the load statistics for the set of SNPs from balanced
genes to a group comprising all control SNPs in the genome. Note that here
there was no re-sampling involved; we simply compared the statistics between
the different groups. We evaluated the influence of SNPs from HLA genes on
the load summary statistics by excluding all HLA SNPs from the set of SNPs
contained in balanced genes.
The set of SNPs from the balanced genes have different values for the three
statistics when compared to the control set of SNPs: PN /PS values are higher
(Figure 5), while Pdel /PS and C score values are lower for the SNPs from all
balanced genes (Figure 5). Interestingly, the HLA set of SNPs follows the same
pattern as the SNPs from the set of all balanced genes (which include the HLA
SNPs), although in a much stronger way. When we examine HLA genes alone,
we find that the average PN /PS for these loci is almost 2-fold greater than that
of control SNPs (Figure 5). Similarly, the reduction in Pdel /PS and C score in
HLA compared to control SNPs is also about two-fold (Figure 5). Moreover,
when HLA SNPs are removed from the set of SNPs from balanced genes, the
remaining set tends to have values closer to controls, albeit still different (Figure
5).
The extreme patterns of HLA SNPs for these three statistics could be driving
the patterns seen in the SNPs from balanced genes, of which they are a part.
The PN /PS result is conservative in the sense that, although one could expect
lower values for HLA genes (fewer low-frequency and thus fewer nonsynonymous
variants), it is actually almost two-fold higher (Figure 5). This observation
likely results from the high number of sites that are actively maintained by bal-
ancing selection in these genes. It is a well-known fact that balancing selection
has targeted several sites in HLA genes (e.g. Hughes and Nei, 1988; Yang and
Swanson, 2002; Bitarello et al., 2015), which could at least partially explain the
patterns observed for PN /PS . The mechanisms driving diversity in the other
balanced genes, however, remain largely unknown, and it is reasonable to assume
that only one (or a few) site(s) has been targeted by balancing selection in
a given gene (e.g. in Leffler et al., 2013), i.e., that the HLA represents the
exception rather than the rule.
Firstly, we note that PN /PS values are on average higher for European than
for African populations (Figure 6). This confirms the finding that European
populations have a higher proportion of nonsynonymous variants than African
populations (Lohmueller et al., 2008). Since our re-sampling was done by popu-
lation, we intrinsically take this into account as seen in the control distributions
(Figure 6).
The PN /PS values of SNPs from balanced genes are significantly higher than
controls (p < 0.01; Figure 6A). These results are not explained by the HLA
genes: although their removal reduces the balanced PN /PS (as expected), while
slightly increasing the control values (because the target SFS changes, thus
resulting in fewer intermediate-frequency variants in the controls), the increase in PN /PS of
balanced genes with respect to the controls remains significant, albeit less extreme
(Figure 6B, P < 0.01 for all African populations and for GBR and FIN). One
European population has marginally significant values (P = 0.06, TSI) and for
two others (CEU and IBS) PN /PS falls within the control distribution after the
removal of HLA SNPs (P > 0.24, Figure 6B).
The increased PN /PS for balanced genes is also not likely driven by the puta-
tive target(s) of balancing selection in these genes: when PN /PS for the balanced
genes was estimated after removing the SNP(s) with the highest heterozygos-
ity(ies) for each gene (see Methods), results remain qualitatively similar (Figure
6). In fact, most of the SNPs excluded from the balanced genes were intronic
(∼ 90% for all populations, Table 3), and on average 11.5 N and 14.6 S SNPs
were removed from each population (Table 3) which makes the PN /PS esti-
Figure 5: Boxplot of load statistics for sets of SNPs. Balanced, SNPs contained in the balanced genes; balanced.no.HLA,
same as the previous category, but excluding 12 HLA genes (see Methods); control, all SNPs not contained in balanced
genes; HLA, a set of 12 HLA genes (see Methods). Each boxplot is composed of 10 data points, each one corresponding
to a population (see Methods). y-axis, (A) PN /PS , (B) Pdel /PS , (C) Cscore.
Again, we note that European populations have higher balanced and control
values than African populations, as seen previously (Lohmueller et al., 2008).
When comparing Pdel /PS estimates for SNPs from balanced genes and control
sets of SNPs, a similar pattern emerges, although less extreme than the one
seen for PN /PS : balanced genes tend to have higher load compared to controls
(p < 0.05) for all populations, except CEU and IBS (p > 0.14; Figure 7A).
The removal of HLA SNPs only slightly changes the Pdel /PS , and the quali-
tative relationship between them does not change, with all populations except
CEU and IBS having p < 0.05 (Figure 7). This differs from what was observed
for PN/PS, where the removal of HLA SNPs made the estimates of load for
balanced genes less different from controls, although still highly significant.
Moreover, removing the SNPs with the highest heterozygosity per gene only
slightly increases the Pdel /PS estimates, consistent with the observation that
few of the SNPs removed by this filter are nonsynonymous, and that these are
always fewer than the synonymous SNPs removed (Table 3).
The results for Pdel /PS are in agreement with what was observed for PN /PS ,
suggesting that the patterns observed for PN /PS are driven by deleterious, and
not adaptive or neutral nonsynonymous variants.
Average scaled C scores yield qualitatively different results with respect to anal-
yses based on PN /PS and Pdel /PS . Firstly, for African populations and for TSI
the load estimates for balanced genes are very elevated (p < 0.01) compared
to controls, similarly to what was seen for PN /PS (Figure 8A). However, this
pattern is not observed for the other European populations, with p-values ap-
proaching one for CEU and IBS (Figure 8A). Interestingly, in this case, the re-
moval of HLA SNPs enhances the signal: African values become even more
extreme and all the European populations acquire extreme values when com-
pared to controls as well (p < 0.01 for all populations, Figure 8B).
Given that the most appropriate set of SNPs for testing our hypothesis of
load is the set without HLA genes and without the SNPs with the highest
heterozygosities per gene (Figure 8B, pink triangle), it is plausible that the
reduction or loss of significance of the load in the set of all balanced genes (in Africa
and Europe, respectively) is due to the excess of adaptive variants (from HLA
or other genes) present in the complete set, which tend to have lower C scores.
Note that the control distributions in Figures 8A and 8B are very similar, and
Figure 6: PN /PS for balanced genes. A) Including HLA SNPs; B) removing HLA SNPs. Blue circle, estimate for all
protein-coding SNPs in the set; pink triangle, estimate after removal of SNP(s) with highest heterozygosity in each
gene (see Methods). **, p < 0.01, *p < 0.05. Reported p-values are for the estimates with all SNPs.
Figure 7: Pdel /PS for balanced genes A) Including HLA SNPs; B) removing HLA SNPs. Blue circle, estimate for all
protein-coding SNPs in the set; pink triangle, estimate after removal of SNP(s) with highest heterozygosity in each
gene (see Methods). **, p < 0.01, *p < 0.05. Reported p-values are for the estimates with all SNPs.
Figure 8: Average scaled Cscore for balanced genes. A) Including HLA SNPs; B) removing HLA SNPs. Blue circle,
estimate for all protein-coding SNPs in the set; pink triangle, estimate after removal of SNP(s) with highest heterozy-
gosity in each gene (see Methods). **, p < 0.01, *p < 0.05. Reported p-values are for the estimates with all SNPs.
what changes dramatically is the load estimate for the balanced genes. Also,
the scaled C scores are PHRED-scaled, ranging from 1 to 99, with the top 10%
most deleterious variants having scores above 10, the top 1% above 20, and
so on (Kircher et al., 2014). Thus, this difference is likely to be even greater than
what is conveyed by this analysis.
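The logarithmic nature of the scaled C score can be made concrete with a small sketch. It assumes the usual PHRED-like transformation of a variant's rank among all scored variants, −10 × log10(rank / total), which reproduces the thresholds quoted above; the function names are hypothetical.

```python
import math


def phred_scaled(rank, total):
    """PHRED-like score of the variant ranked `rank` (1 = most deleterious)."""
    return -10.0 * math.log10(rank / total)


def top_fraction(scaled_score):
    """Fraction of variants expected at or above a given scaled score."""
    return 10.0 ** (-scaled_score / 10.0)


# A variant at the boundary of the top 10% scores 10, of the top 1% scores 20.
score_top_10_percent = phred_scaled(10, 100)
score_top_1_percent = phred_scaled(1, 100)
print(score_top_10_percent, score_top_1_percent)
```

Because each additional 10 units corresponds to a ten-fold smaller fraction of variants, a modest difference between average scaled scores can reflect a large difference in rank, which is why the averages reported here may understate the contrast between sets.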
We also looked at the raw C scores, which provide more power in tests
comparing sets of SNPs (Kircher et al., 2014). For African populations, balanced
genes have raw C score distributions with significantly higher values than the
controls (Mann-Whitney U test, one-tailed, P ≤ 0.05) for more than 70% of the
control re-samplings (Table 6), except for GWD, for which only 251 out of 1,000
controls have significantly lower C scores than the balanced genes. For Europe,
TSI has only 13% of the controls with lower C scores than the balanced genes,
and all other populations have fewer than 5 such cases (Table 6).
However, when we perform the same analyses for the balanced genes after
the removal of HLA SNPs, balanced genes have higher raw C scores for all
comparisons in African populations, and for more than 995 comparisons for all
European populations, except CEU, for which 844 comparisons are significant
(Table 6).
Additionally, we also removed from each balanced gene the SNP(s) likely to
be the targets of balancing selection, in order to filter from our datasets the
potentially conflicting patterns generated by advantageous and deleterious
variants within the balanced genes. Importantly, in this approach we only excluded
the SNP(s) with the highest heterozygosity among those contained in a window
with a very strong signature of LTBS as reported in Bitarello et al. (n.d.), thus
increasing the chance that the actual selected site was filtered out.
We chose to use statistics based on the counts of deleterious variants (PN /PS
and Pdel /PS ) and deleteriousness scores (C score). With PN /PS and Pdel /PS we
quantified the proportions of nonsynonymous and potentially damaging variants,
respectively, for balanced genes and control groups. PN simply documents
whether the polymorphism changes the encoded amino acid, and thus is unbiased
with respect to knowledge of the frequency at which the polymorphism is
segregating. Nevertheless, PN counts are composed of neutral, deleterious and
advantageous variants and thus are not straightforward to interpret. Pdel is
more accurate than PN as a measure of deleteriousness, but it is restricted to
nonsynonymous sites and is only available for a subset of the nonsynonymous
variants (∼80% for the genome, but ∼70% for balanced genes), thus reducing
its power, particularly in small sets of SNPs. Moreover, PolyPhen-2 has been
shown to overfit its training data and not to generalize well to other datasets
(Grimm et al., 2015). Neither of these approaches incorporates the frequency of
the deleterious mutations when classifying SNPs.
Our approach here is thus conservative. Previous studies have shown that,
in comparisons between African and out-of-Africa populations, little or no
difference is observed when the number of putatively deleterious mutations is
counted (Henn et al., 2016; Tennessen et al., 2012; Do et al., 2015; Lohmueller,
2014), whereas on average out-of-Africa populations are more homozygous for
putatively deleterious mutations (Lohmueller et al., 2008), a difference not
detectable by these two methods.
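The distinction between counting deleterious alleles and counting homozygous deleterious genotypes can be illustrated with a toy example. The sketch below assumes genotypes coded as the number of copies (0, 1 or 2) of the putatively deleterious allele at each site; the function names and the two individuals are hypothetical.

```python
def deleterious_allele_count(genotypes):
    """Total number of deleterious allele copies carried by one individual."""
    return sum(genotypes)


def homozygous_deleterious_sites(genotypes):
    """Number of sites where the individual carries two deleterious copies."""
    return sum(1 for g in genotypes if g == 2)


# Two hypothetical individuals carrying the same number of deleterious alleles.
individual_a = [1, 1, 1, 1, 0, 0]   # all copies in heterozygous state
individual_b = [2, 2, 0, 0, 0, 0]   # same number of copies, all homozygous

for name, genotypes in (("A", individual_a), ("B", individual_b)):
    print(name,
          deleterious_allele_count(genotypes),
          homozygous_deleterious_sites(genotypes))
```

Both individuals carry four deleterious copies, so a count-based statistic cannot tell them apart, whereas the number of homozygous sites differs.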
On the other hand, recent work (Lenz et al., 2016) showed through simula-
tions that deleterious mutations are expected to accumulate in the vicinity of a
locus under balancing selection. The simulation framework assumed that sev-
eral sites were under balancing selection in an HLA-like gene – as is the case
for classic HLA genes (e.g. Hughes and Nei, 1988; Yang and Swanson, 2002;
Bitarello et al., 2015).
ing selection. With these simulations, the authors demonstrated that such a sce-
nario leads to an overall reduction of diversity around the HLA-like locus, but
the variants that "survive" tend to segregate at higher frequencies, demonstrat-
ing the potential for balancing selection in HLA genes to increase the frequency
of deleterious variants around the HLA loci. Lenz et al. (2016) confirm this
prediction with empirical data showing an excess of damaging (Adzhubei et al., 2010)
variants in non-HLA loci of the MHC region.
Importantly, the simulations of Lenz et al. (2016) assume an additive model,
not a recessive one. Thus, their observations suggest that a mechanism
other than the "sheltered load" is responsible for the increased load in the
vicinity of HLA genes, and this is likely to be the hitch-hiking effect mentioned
above.
References
Do, R., D. Balick, H. Li, I. Adzhubei, S. Sunyaev, and D. Reich (2015). “No evidence that selection
has been less effective at removing deleterious mutations in Europeans than in Africans”.
In: Nature Genetics 47 (2), pp. 126–131.
Eyre-Walker, A. and P. D. Keightley (1999). “High genomic deleterious mutation rates in ho-
minids”. In: Nature 397 (6717), pp. 344–347.
Fay, J. C. (2011). “Weighing the evidence for adaptation at the molecular level.” In: Trends in
genetics : TIG 27 (9), pp. 343–9.
Fu, W. et al. (2012). “Analysis of 6,515 exomes reveals the recent origin of most human protein-
coding variants”. In: Nature 493 (7431), pp. 216–220.
Grimm, D. G. et al. (2015). “The evaluation of tools used to predict the impact of missense
variants is hindered by two types of circularity”. In: Human Mutation 36 (5), pp. 513–523.
Henn, B. M., L. R. Botigué, C. D. Bustamante, A. G. Clark, and S. Gravel (2015). “Estimating the
mutation load in human genomes”. In: Nature Reviews Genetics 16 (6), pp. 333–343.
Henn, B. M. et al. (2016). “Distance from sub-Saharan Africa predicts mutational load in diverse
human genomes”. In: Proceedings of the National Academy of Sciences 113 (4), E440–E449.
Hill, W. G. and A. Robertson (1966). “The effect of linkage on limits to artificial selection”. In:
Genetics research 8 (2), pp. 269–294.
Hodgkinson, A., F. Casals, Y. Idaghdour, J.-C. Grenier, R. D. Hernandez, and P. Awadalla (2013).
“Selective constraint, background selection, and mutation accumulation variability within
and between human populations.” In: BMC genomics 14, p. 495.
Hughes, A. L. and M. Nei (1988). “Pattern of nucleotide substitution at major histocompatibility
class I loci reveals overdominant selection”. In: Nature 335 (6186), pp. 167–170.
Kiezun, A. et al. (2013). “Deleterious Alleles in the Human Genome Are on Average Younger
Than Neutral Alleles of the Same Frequency”. In: PLoS Genetics 9 (2), pp. 1–12.
Kircher, M., D. M. Witten, P. Jain, B. J. O’Roak, G. M. Cooper, and J. Shendure (2014). “A general
framework for estimating the relative pathogenicity of human genetic variants”. In: Nature
genetics 46 (3), pp. 310–315.
Klein, J. (1986). Natural History of the Major Histocompatibility Complex. New York: John Wiley &
Sons, Ltd.
Kondrashov, A. S. (1995). “Contamination of the genome by very slightly deleterious mutations:
why have we not died 100 times over?” In: Journal of Theoretical Biology 175 (4), pp. 583–594.
Leffler, E. M. et al. (2013). “Multiple Instances of Ancient Balancing Selection Shared Between
Humans and Chimpanzees”. In: Science 339 (6127), pp. 1578–1582.
Lenz, T. L., V. Spirin, D. M. Jordan, and S. R. Sunyaev (2016). “Excess of Deleterious Mutations
around HLA Genes Reveals Evolutionary Cost of Balancing Selection”. In: bioRxiv, pp. 1–30.
Lohmueller, K. E. (2014). “The distribution of deleterious genetic variation in human popula-
tions.” In: Current opinion in genetics & development 29C, pp. 139–146.
Lohmueller, K. E. et al. (2008). “Proportionally more deleterious genetic variation in European
than in African populations.” In: Nature 451 (7181), pp. 994–997.
Mendes, F. (2013). Natural selection on HLA and its effects on adjacent regions of the genome. Tech.
rep. Universidade de São Paulo. URL: http://www.teses.usp.br/teses/disponiveis/
41/41131/tde-02082013-161104/pt-br.php.
Morton, N. E., J. F. Crow, and H. J. Muller (1956). “An Estimate of the Mutational Damage in
Man From Data on Consanguineous Marriages”. In: Proceedings of the National Academy of
Sciences of the United States of America 42 (11), pp. 855–863.
Nielsen, R. (2005). “Molecular Signatures of Natural Selection”. In: Annual Review of Genetics
39 (1), pp. 197–218.
Oosterhout, C. van (2009). “A new theory of MHC evolution: beyond selection on the immune
genes.” In: Proceedings of the Royal Society of London. Series B, Biological Sciences 276 (1657),
pp. 657–65.
Peischl, S., I. Dupanloup, M. Kirkpatrick, and L. Excoffier (2013). “On the accumulation of dele-
terious mutations during range expansions.” In: Molecular ecology 22 (24), pp. 5972–82.
Peischl, S. and L. Excoffier (2015). “Expansion load: recessive mutations and the role of standing
genetic variation”. In: Molecular Ecology 24 (9), pp. 2084–2094.
Robinson, J., J. A. Halliwell, H. McWilliam, R. Lopez, P. Parham, and S. G. E. Marsh (2013). “The
IMGT/HLA database.” In: Nucleic Acids Research 41, pp. 1222–7.
Roux, C., M. Pauwels, M. V. Ruggiero, D. Charlesworth, V. Castric, and X. Vekemans (2013). “Recent
and ancient signature of balancing selection around the S-Locus in Arabidopsis halleri
and A. lyrata”. In: Molecular Biology and Evolution 30 (2), pp. 435–447.
Schierup, M. H., D. Charlesworth, and X. Vekemans (2000). “The effect of hitch-hiking on genes
linked to a balanced polymorphism in a subdivided population”. In: Genetical research 76 (01),
pp. 63–73.
Simons, Y. B., M. C. Turchin, J. K. Pritchard, and G. Sella (2014). “The deleterious mutation load
is insensitive to recent population history”. In: Nature Genetics 46 (3), pp. 220–224.
Stone, J. L. (2004). “Sheltered load associated with S-alleles in Solanum carolinense.” In: Heredity
92 (4), pp. 335–42.
Subramanian, S. (2012). “The abundance of deleterious polymorphisms in humans.” In: Genetics
190 (4), pp. 1579–83.
— (2016). “Europeans have a higher proportion of high-frequency deleterious variants than
Africans”. In: Human Genetics 135 (1), pp. 1–7.
Sunyaev, S., V. Ramensky, I. Koch, W. Lathe 3rd, A. S. Kondrashov, and P. Bork (2001). “Predic-
tion of deleterious human alleles”. In: Hum Mol Genet 10 (6), pp. 591–597.
Tennessen, J. A. et al. (2012). “Evolution and Functional Impact of Rare Coding Variation from
Deep Sequencing of Human Exomes”. In: Science 337 (6090), pp. 64–69.
Tishkoff, S. A. and S. M. Williams (2002). “Genetic analysis of African populations: human evo-
lution and complex disease.” In: Nature Reviews Genetics 3 (8), pp. 611–621.
Yang, Z. and W. J. Swanson (2002). “Codon-Substitution Models to Detect Adaptive Evolution
that Account for Heterogeneous Selective Pressures Among Site Classes”. In: Molecular Bi-
ology and Evolution 19 (1), pp. 49–57.
Final Considerations and Perspectives
in which a window is considered significant if its NCD2 value for a given target frequency is lower than that of 10,000 neutral simulations with the same number of informative sites (yielding about 0.50% of the windows per population, considering the union of all target frequencies); and (2) a ranking criterion on the genomic distribution, applied after a correction that accounts for the number of informative sites in the window. With the second criterion, we defined as outliers the windows in the tail of the empirical distribution (0.05%), which is essentially a subset of the windows obtained with the first criterion.
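The two criteria can be sketched in a few lines of code. This is only an illustration under stated assumptions: the function names are hypothetical, the simulated values are assumed to come from neutral simulations already matched for the number of informative sites, and the scores passed to the ranking function are assumed to be already corrected for that number, since the correction itself is not reproduced here.

```python
def is_significant(observed_ncd2, neutral_ncd2):
    """Criterion 1: the observed value is lower than every neutral simulation."""
    if not neutral_ncd2:
        raise ValueError("at least one neutral simulation is required")
    return observed_ncd2 < min(neutral_ncd2)


def empirical_tail(corrected_scores, tail=0.0005):
    """Criterion 2: windows in the lowest `tail` fraction of the genomic ranking.

    `corrected_scores` maps a window identifier to its corrected score;
    lower values indicate a stronger signature.
    """
    n_keep = int(len(corrected_scores) * tail)
    ranked = sorted(corrected_scores, key=corrected_scores.get)
    return set(ranked[:n_keep])


# Toy values, not real data.
neutral = [0.42, 0.40, 0.45]
print(is_significant(0.35, neutral))

scores = {"w1": 0.50, "w2": 0.10, "w3": 0.30, "w4": 0.45}
print(empirical_tail(scores, tail=0.25))
```

With the 0.05% tail used in the text, the second function would be called with the default `tail=0.0005` on the full set of genomic windows.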
Finally, we report as outlier genes those that have at least one outlier window (regardless of target frequency) in at least two populations of the same continent. In doing so, we expect to reduce false positives that could have arisen from some property of the data of a particular population, given that, on the timescale we investigate, we expect populations from the same continent to have shared selective pressures as well as demographic history. Our results showed that at least 1% of the genes in the genome have extreme (outlier) signatures of balancing selection, and possibly more, up to 8% (Table S8, Chapter 1). Even the most conservative estimate of 1% is considerably higher than what had been observed to date. For example, only 0.4% of the 13,500 genes analyzed by Andrés et al. (2009) showed strong signatures of balancing selection. Our higher estimate probably results from multiple factors: the high power of NCD2, the genomic data used, the fact that even with all our filters we retained more than 18,000 autosomal genes in the analyses, and the fact that the windows analyzed are small, which increases the probability of detecting a signature of LTBS (Andrés, 2011; Charlesworth, 2009).
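The rule used to report outlier genes (at least one outlier window in at least two populations of the same continent) could be expressed as follows. The population labels, gene names and function name are hypothetical placeholders and do not correspond to the actual samples or results.

```python
from collections import defaultdict


def outlier_genes(genes_by_population, continent_of, min_populations=2):
    """Genes with an outlier window in >= min_populations of one continent.

    genes_by_population: population -> set of genes with >= 1 outlier window
    continent_of: population -> continent label
    """
    counts = defaultdict(int)  # (continent, gene) -> number of populations
    for population, genes in genes_by_population.items():
        for gene in genes:
            counts[(continent_of[population], gene)] += 1
    return {gene for (_, gene), n in counts.items() if n >= min_populations}


# Hypothetical input: two populations per continent.
genes_by_population = {
    "AFR1": {"geneA", "geneB"},
    "AFR2": {"geneA"},
    "EUR1": {"geneB", "geneC"},
    "EUR2": {"geneC"},
}
continent_of = {"AFR1": "Africa", "AFR2": "Africa",
                "EUR1": "Europe", "EUR2": "Europe"}

print(sorted(outlier_genes(genes_by_population, continent_of)))
```

In this toy input, `geneB` appears in one population of each continent and is therefore not reported, which is the intended effect of requiring agreement within a continent.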
A large proportion of the candidate windows (87%) is shared between at least two of the populations analyzed, particularly between populations of the same continent (78%). Even when a gene does not meet the criterion of belonging to both continents, the great majority have signatures on both continents (that is, in at least three of the four populations analyzed), with rare exceptions. Finally, about 32% of the outlier genes (69 genes, Table 3, Chapter 1) are shared among the four populations.
[...] (>= 3 million years). Even though, in recent human history, Africa and Europe have diverged in several respects, in terms of both demographic history and selective pressures, it is plausible that many targets of long-term balancing selection were maintained in both, and/or that they ceased to be selected in one of the continents only recently, thereby preserving the signatures of LTBS to the present.
Immune response
On the other hand, it is interesting to note that even after removing the classical HLA genes, some functional categories remained enriched among the significant genes, some of them related to the immune system but involving other genes, including non-classical HLA genes. In fact, one third of the significant genes are related to immune functions, even if they do not make up enriched categories. Among the other categories we find, for example, “extracellular region”, which confirms the observation that there tends to be an excess of genes related to the extracellular matrix among the targets of LTBS in humans (reviewed in Key et al., 2014).
are olfactory receptors (Table 3, Chapter 1), which implies that: (1) it is plausible that they are not false positives, given all the precautions we took, although we cannot rule out this possibility; (2) our checks make us confident that biases of this kind are not a feature of the candidate genes in general.
2 For most genes in diploid organisms, gene expression is believed to occur simultaneously for both alleles. For others, only one of the alleles, the maternal or the paternal, is expressed, while the other is inactivated. This pattern is achieved through epigenetic modifications, leading to mono-allelic expression that is maintained across mitotic divisions.
3 Here I refer to traits that are believed to result from genetic variation in multiple genes and their interactions with environmental and behavioral factors (Mitchell-Olds et al., 2007).
The three estimates are higher for the balanced genes than for the controls, with few exceptions. Moreover, when we removed the HLA genes, which have many adaptively maintained sites and could confound the interpretation of the estimates, the results were qualitatively similar. Finally, we assessed the impact that the potentially selected sites in the balanced genes have on these estimates, and found that the observations hold even when these sites are removed.
In order to distinguish between these two possible scenarios, one option would be to verify: (1) with simulations, whether the same patterns are observed under a model of balancing selection with a single selected site rather than multiple ones; (2) whether there is an excess of disease associations in the genomic regions of the genes under balancing selection; and (3) whether the excess genetic load is smaller (but still significant) for genes neighboring the balanced genes and/or, by defining genomic windows around the genes, whether the genetic load decreases with distance from the target gene.
4 The idea that recessive deleterious variants will rarely be homozygous when they lie in the HLA genes, because the region has high heterozygosity. Such deleterious variants would thus be sheltered from purifying selection (Oosterhout, 2009).
Although some questions remain open, our work contributes to two stimulating fields of evolutionary biology: the study of the accumulation of deleterious mutations in the human genome and the study of the evolutionary importance of balancing selection in human evolution.
Perspectives
“(...) genome-wide scans are a hatchet, whereas what we need now is a scal-
pel. In-depth follow-up studies of individual outlier loci can be one such
scalpel, more precisely defining important population genetic parameters
such as the timing and magnitude of selection, the geographic distribution
of selected variation, the interaction of population demographic history,
recombination, and selection in shaping patterns of variation, and the func-
tional form of selection acting on individual outlier loci” (Akey, 2009)
Although elucidating the causal relationship between genotype and phenotype, as in the examples above, is beyond the scope of the present work, we took important steps in that direction by exploring properties of the candidate regions. In Chapter 1, within these limitations, we sought to explore the biological basis of the targets of balancing selection by looking at the functional categories to which they belong, at the proportion of coding sites and, among these, the nonsynonymous sites. In Chapter 2, we analyzed in greater detail the properties of the sites contained in the regions targeted by balancing selection. We were thus able to test hypotheses about the accumulation of deleterious mutations in regions under balancing selection and deepened our understanding of the potential targets of balancing selection in the human genome.
Bibliography
Appendices
Appendix A.1.
Copy of the article “Mapping bias overestimates reference allele frequencies at the HLA genes in the 1000 Genomes Project phase I data”: G3: Genes|Genomes|Genetics (2015), 5(3): 931-941.
In this article, I contributed to the planning of the analyses and to understanding the organization of the 1000 Genomes Project data. In addition, I performed some of the statistical tests and proposed the use of frequency deviation measures. Finally, I contributed comments on the writing of the text.
INVESTIGATION
ABSTRACT Next-generation sequencing (NGS) technologies have become the standard for data generation in studies of population genomics, as the 1000 Genomes Project (1000G). However, these techniques are known to be problematic when applied to highly polymorphic genomic regions, such as the human leukocyte antigen (HLA) genes. Because accurate genotype calls and allele frequency estimations are crucial to population genomics analyses, it is important to assess the reliability of NGS data. Here, we evaluate the reliability of genotype calls and allele frequency estimates of the single-nucleotide polymorphisms (SNPs) reported by 1000G (phase I) at five HLA genes (HLA-A, -B, -C, -DRB1, and -DQB1). We take advantage of the availability of HLA Sanger sequencing of 930 of the 1092 1000G samples and use this as a gold standard to benchmark the 1000G data. We document that 18.6% of SNP genotype calls in HLA genes are incorrect and that allele frequencies are estimated with an error greater than ±0.1 at approximately 25% of the SNPs in HLA genes. We found a bias toward overestimation of reference allele frequency for the 1000G data, indicating mapping bias is an important cause of error in frequency estimation in this dataset. We provide a list of sites that have poor allele frequency estimates and discuss the outcomes of including those sites in different kinds of analyses. Because the HLA region is the most polymorphic in the human genome, our results provide insights into the challenges of using NGS data at other genomic regions of high diversity.
KEYWORDS NGS; mapping bias; 1000 Genomes; HLA
Copyright © 2015 Brandt et al.
doi: 10.1534/g3.114.015784
Manuscript received December 22, 2014; accepted for publication March 13, 2015; published Early Online March 17, 2015.
This is an open-access article distributed under the terms of the Creative Commons Attribution Unported License (http://creativecommons.org/licenses/by/3.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
Supporting information is available online at http://www.g3journal.org/lookup/suppl/doi:10.1534/g3.114.015784/-/DC1
Data available in public repositories: https://github.com/deboraycb/reliability_hla_1000g
1 Corresponding author: Departamento de Genética e Biologia Evolutiva, Rua do Matão, 277, São Paulo, SP 05508-090, Brazil. E-mail: diogo@ib.usp.br
Whole-genome resequencing data for large numbers of human individuals, as generated by the 1000 Genomes Project (www.1000genomes.org), provide unprecedented amounts of information about microevolutionary processes and demographic histories. Such inferences rely on either genotypic or allelic frequency information for each variable position, which constitute the data for downstream analyses and hypothesis testing.
The calling of single-nucleotide polymorphisms (SNPs) and genotypes and the estimation of allele frequencies from next-generation sequencing (NGS) has undergone rapid development, along with likelihood-based and Bayesian methods created to deal with challenges associated to heterogeneity in read quality and coverage (Nielsen et al. 2011). In Phase I of the 1000 Genomes Project, genotypes were called using a combination of different approaches: first, primary call sets were independently generated by different centers with different sequencing platforms, alignment, and variant calling methods; then, a consensus SNP call set was generated and made publicly available (The 1000 Genomes Project Consortium 2012).
The data generated by the 1000 Genomes Project frequently have been used to make inferences about evolutionary processes affecting our species, including the detection of targets of natural selection (Hernandez et al. 2011; Ward and Kellis 2012; Andersen et al. 2012) and understanding the genetic basis of complex phenotypes
Figure 1 Genotype mismatches between the 1000G and PAG2014 datasets. Results per polymorphic site (“Position”) and per individual (930 in
total). Individuals are ordered by number of mismatches (individuals with fewer mismatches on top). Sites are numbered according to their position
in ARS exons coding sequence. Dark squares indicate mismatches between genotypes in the two datasets. ARS, antigen recognition sites; HLA,
human leukocyte antigen.
differentiate them are in other exons. This results in what we refer to as an “ambiguous allele call” for an HLA allele (e.g., the allele is identified as B*35:03, but we cannot establish whether it is B*35:03:01 or B*35:03:02, or a group of alleles is attributed to an individual, such as B*35:02/B*35:03/B*35:04). Ambiguous allele calls also may happen when sequencing has low quality at bases that differentiate two alleles. In addition, there are also genotypic ambiguities, which occur when different pairs of alleles are compatible with the sequencing results. For individuals that bear ambiguous alleles, we created a consensus sequence in which ambiguous sites were reported with both possible alleles (e.g., A/T, see Figure S1). In this way, we incorporate the uncertainty associated to the sequence-based typing into downstream analyses.
Although we cannot rule out technical errors in the Sanger sequencing that generated the PAG2014 data (Gourraud et al. 2014), we assume that this method provides the most reliable estimate of HLA alleles (and hence SNP genotypes), and will serve as a standard to estimate the reliability of genotype calls and allele frequencies for the 1000 Genomes data (De Santis et al. 2013).
Genotype comparisons
We initially quantified how well the 1000G and PAG2014 data agreed with respect to genotype calls. Genotypes at each site in each individual were compared between the 1000G data and the PAG2014 data, here considered as a gold standard. In the case of sites with ambiguity (e.g., T/A) in the PAG2014 data, if one of the two possible alleles matched an allele present in the 1000G, we considered this an allele match and PAG2014 was corrected, by attributing the allele present in the 1000G data to the ambiguous site. After correcting the ambiguous sites in PAG2014, we only considered genotypes to be a match if both alleles in 1000G were present in the PAG2014 data, at that site. Throughout this article, sites are numbered according to their position in the ARS exons coding sequences (1–546 at the class I loci and 1–270 at the class II loci).
Allele frequency comparisons
After correcting all possible ambiguities in PAG2014 (as described previously), we calculated allele frequencies for SNPs in both datasets. By comparing the frequency of the reference allele in 1000G to its value in PAG2014, we assessed the accuracy of allele frequency estimation. The reference allele was defined as the allele present in the hg19 build of the reference sequence of the human genome. RefSeq IDs of the reference sequences used for each HLA gene are reported on File S1.
We computed the error in 1000G frequency estimates per site i (FE_i) as follows:
$$FE_i = f_{i,1000G} - f_{i,PAG2014}$$
where f_{i,1000G} and f_{i,PAG2014} are the frequency of the reference allele at site i in 1000G and PAG2014, respectively. We also computed the mean absolute error in frequency estimates per gene as a mean of absolute FE_i for all sites within a gene (MAE):
$$MAE = \frac{1}{n} \sum_{i=1}^{n} \left| f_{i,1000G} - f_{i,PAG2014} \right|$$
where n is the number of SNPs in the gene.
Coverage in 1000G
Sequencing coverage per individual per site was calculated from the 1000 Genomes Project phase I BAM files for the low coverage
Figure 2 REF allele frequency per site in each HLA gene in the 1000 Genomes (1000G) and Sanger sequencing (PAG2014) datasets. Continuous
line indicates the expected relationship (i.e., no difference) between 1000G and PAG2014. Dashed lines indicate a ±0.1 deviation from the
expected frequency (as estimated from PAG2014 dataset). MAE (mean absolute error) defined in the section Materials and Methods. Numbers
indicate site position in ARS exons sequence. REF, reference; ARS, antigen recognition sites; HLA, human leukocyte antigen.
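The per-site frequency error (FE) and the per-gene mean absolute error (MAE) defined in Materials and Methods can be sketched as below. The function names and the frequencies are hypothetical; the sketch only assumes that the two input lists hold reference allele frequencies for the same sites, in the same order.

```python
def frequency_errors(f_1000g, f_pag2014):
    """FE_i = f_i(1000G) - f_i(PAG2014) for each site i."""
    if len(f_1000g) != len(f_pag2014):
        raise ValueError("both datasets must cover the same sites")
    return [a - b for a, b in zip(f_1000g, f_pag2014)]


def mean_absolute_error(f_1000g, f_pag2014):
    """MAE: mean of |FE_i| over the n SNPs of a gene."""
    errors = frequency_errors(f_1000g, f_pag2014)
    if not errors:
        raise ValueError("at least one site is required")
    return sum(abs(e) for e in errors) / len(errors)


# Hypothetical reference allele frequencies at three sites of one gene.
ref_freq_1000g = [0.75, 0.50, 0.25]
ref_freq_pag2014 = [0.50, 0.50, 0.50]

print(frequency_errors(ref_freq_1000g, ref_freq_pag2014))
print(mean_absolute_error(ref_freq_1000g, ref_freq_pag2014))
```

A positive FE corresponds to overestimation of the reference allele frequency in 1000G and a negative FE to underestimation; MAE discards the sign and summarizes only the magnitude.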
Figure 4 Difference in reference allele frequency between 1000G and PAG2014, measured by FE (see the section Materials and Methods), at
each polymorphic site, in each population. Shades of red indicate overestimation of reference allele frequency and shades of blue indicate
underestimation of reference allele frequency in 1000G. Full population names are given in Table S3.
Testing for mapping bias
We hypothesized that the observed reference allele bias was caused by a lower efficiency in the mapping of reads containing the alternative allele. This is expected under the assumption that the reads carrying the alternative allele on average have more differences with respect to the reference genome (used by the 1000 Genomes Consortium as the index to align NGS reads) than reads carrying the reference allele. In this scenario, some sites would have a stronger bias than others if the alternative alleles in those sites are flanked by additional alternative alleles.
To test this hypothesis, we aligned sequences of all alleles present in PAG2014 to the HLA sequences present in the hg19 build of the reference human genome (the same sequences used for the alignment of reads in the 1000 Genomes Project) and defined windows of 51 base pairs around each SNP. We then quantified the number of differences with respect to the reference genome for windows surrounding
Figure 8 Relationship between SNP heterozygosity (H) and (A) absolute value of deviation (|FE|; Pearson’s correlation = 0.32; P = 1.938 × 10^-7) or
(B) magnitude and direction of deviation (FE; Pearson’s correlation = 0.59; P < 10^-16). SNP, single-nucleotide polymorphism.
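The window-based count used in the test for mapping bias could be sketched as follows. This is an illustration under assumptions that are not stated in the text: the allele sequence is taken to be already aligned to the reference and of the same length, the 51-bp window is centered on the SNP (25 bp on each side), and the SNP position itself is included in the count. The function name is hypothetical.

```python
def window_differences(reference, allele, snp_index, window=51):
    """Count mismatches to the reference in a window centered on a SNP.

    `reference` and `allele` are pre-aligned sequences of equal length;
    `snp_index` is the 0-based position of the SNP in that alignment.
    """
    if len(reference) != len(allele):
        raise ValueError("sequences must be aligned to the same length")
    half = window // 2
    start = max(0, snp_index - half)
    end = min(len(reference), snp_index + half + 1)
    return sum(1 for r, a in zip(reference[start:end], allele[start:end])
               if r != a)


# Toy alignment: the allele differs from the reference at positions 5, 30, 56.
reference = "A" * 60
allele = list(reference)
allele[5], allele[30], allele[56] = "C", "G", "T"
allele = "".join(allele)

print(window_differences(reference, allele, snp_index=30))
```

For the SNP at position 30, the window spans positions 5 to 55, so the difference at position 56 falls outside it and is not counted.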
Immunogenetics
DOI 10.1007/s00251-015-0875-9
ORIGINAL PAPER
Abstract Supertypes are groups of human leukocyte antigen (HLA) alleles which bind overlapping sets of peptides due to sharing specific residues at the anchor positions (the B and F pockets) of the peptide-binding region (PBR). HLA alleles within the same supertype are expected to be functionally similar, while those from different supertypes are expected to be functionally distinct, presenting different sets of peptides. In this study, we applied the supertype classification to the HLA-A and HLA-B data of 55 worldwide populations in order to investigate the effect of natural selection on supertype rather than allelic variation at these loci. We compared the nucleotide diversity of the B and F pockets with that of the other PBR regions through a resampling procedure and compared the patterns of within-population heterozygosity (He) and between-population differentiation (GST) observed when using the supertype definition to those estimated when using randomized groups of alleles. At HLA-A, low levels of variation are observed at B and F pockets and randomized He and GST do not differ from the observed data. By contrast, HLA-B concentrates most of the differences between supertypes, the B pocket showing a particularly high level of variation. Moreover, at HLA-B, the reassignment of alleles into random groups does not reproduce the patterns of population differentiation observed with supertypes. We thus conclude that differently from HLA-A, for which supertype and allelic variation show similar patterns of nucleotide diversity within and between populations, HLA-B has likely evolved through specific adaptations of its B pocket to local pathogens.
Keywords HLA . Supertypes . Human populations . Natural selection . Pathogens . Adaptation
Both conservation of supertype frequencies between populations and increased heterozygosity at the supertype level are expected to generate a pattern of low population differentiation when compared with those observed at the allelic level. Balancing selection at the supertype level would also enhance genetic variation at the B and F pockets compared with other regions of the PBR, increasing the chances of antigen recognition by the immune system. However, testing these hypotheses, i.e., comparing population differentiation and variability defined at the levels of HLA alleles and supertypes, respectively, represents a methodological challenge due to the difficulties in comparing measures of differentiation and heterozygosity for genetic variants that are defined by different attributes (alleles being defined by all variation in the coding region, by contrast with supertypes which are defined by a subset of codons). Indeed, because supertypes are sets of alleles, genetic variation defined at the allele level is nested within that defined at the supertype level. Therefore, heterozygosity at the supertype level is constrained to be lower or equal to that estimated at the allele level. Furthermore, because population genetic differentiation measured by statistics related to Wright’s FST is strongly determined by intrapopulation variability (Jost 2008), we expect higher levels of population differentiation at the supertype level simply because of the decreased number of supertype variants in comparison to alleles.
In the present study, our aim is to investigate whether the use of supertype instead of allele definitions at HLA-A and HLA-B loci reduces population differentiation and increases heterozygosity, as expected under a model of balancing selection acting on supertypes. For the reasons explained above, we control our analyses for the inherent differences in polymorphism between these two kinds of classification. Our approach consists in producing null distributions for population differentiation and heterozygosity by generating randomized sets of alleles (herein referred to as “random supertypes”) that match true supertype sampling properties (i.e., number of supertypes and number of alleles per supertype) without any biological criteria for pooling them together. We also analyze supertype variation at the nucleotide level by partitioning DNA sequences into segments corresponding to the different pockets within the PBR. Our hypothesis is that the B and F pockets, which are the major determinants of the peptide-binding specificities and used to define supertypes, constitute the main targets of balancing selection and thus retain higher levels of diversity compared to other PBR pockets.
Materials and methods
Population data
We used a database generated for the 13th International Histocompatibility Workshop (IHWS) (Mack et al. 2006) from which we excluded populations presenting (a) an allelic resolution lower than the first two sets of digits (now referred to as second field level of resolution), so as to only keep alleles differing at the protein level; (b) genotypic ambiguities; and (c) deviation from Hardy-Weinberg expectations. This filtering resulted in a dataset of 6435 and 6409 individuals typed for HLA-A and HLA-B, respectively, belonging to 55 different populations: seven sub-Saharan African (SSA), two North African (NAF), eight Southwest Asian (SWA), four European (EUR), 22 Southeast Asian (SEA), four Pacific islanders (PAC), four Australian aborigine (AUS), two North Asian (NEA), and two Native American (AME) populations (Supplementary Material Table 1-S). Almost half of these populations (24 out of 55) had demographic histories indicating that they were likely to have experienced severe founder effects (these populations were from Oceania, Taiwan, and the Americas). Because such reductions in diversity due to demographic effects can potentially mask signals of balancing selection, we carried out all the analyses with both the complete set of 55 populations and a reduced set of 30 populations (obtained by excluding those from Oceania, Taiwan, and the Americas).
Supertype definition
We assigned all HLA-A and HLA-B alleles to their specific supertype as defined by the classification given in figures 1 (http://www.biomedcentral.com/1471-2172/9/1/figure/F1) and 2 (http://www.biomedcentral.com/1471-2172/9/1/figure/F2) from Sidney et al. (2008). The alleles not assigned to any supertype were treated in our analyses of population differentiation and molecular variation in two ways: (a) their allele-level definition was used and (b) they were pooled into groups of “non-classified alleles” (named NCA and NCB for HLA-A and HLA-B, respectively). We included A*29:01, A*29:02, A*29:03, A*30:01, A*30:08, and A*68:06 in the NCA group because of their ambiguous supertype allocation (Sidney et al. 2008), and all B*08 alleles were assigned to the NCB group because of their unique PBR structures, which make the peptide-binding profile unpredictable (Sidney et al. 2008).
Population genetic analyses
We tested the population samples for deviation from Hardy-Weinberg (HW) equilibrium using the Gene[rate] program which tests the null hypothesis of equilibrium on the basis of a log-likelihood ratio test on frequency estimates (both under HW and under a generalized non-HW model) (Nunes et al. 2014; Nunes 2014).
We wrote R scripts to estimate supertype frequencies by direct counting of alleles, generate summary statistics (number of alleles (k) and expected sample heterozygosity (He)), and
Immunogenetics
estimate genetic differentiation between pairs of populations by using GST (Nei and Chesser 1983).
Mantel tests (Mantel 1967) for assessing Pearson’s correlations between genetic distances obtained
either from supertype or from allelic data were carried out using the ade4 R package (Dray and Dufour
2007), and all graphs and other statistical tests (e.g., Wilcoxon rank sum test) were also generated
using R version 3.0.2 (R Development Core Team 2011). In box plots, the boxes correspond to the
interquartile range, the median is the thick line inside the box, and whiskers extend to the most
extreme observations lying less than 1.5 times the interquartile range outside the box. Dots are
outliers beyond these limits. Using the Arlequin 3.5 program (Excoffier and Lischer 2010), we
performed a hierarchical analysis of molecular variance (AMOVA) for each supertype taken
individually, pooling all others into a single group of “non-classified alleles” for the
calculations. In this way, we estimated the diversity among populations (FST), among populations
within geographic regions (FSC), and among geographic regions (FCT) for each supertype.

Testing the molecular variation of the PBR pockets

We estimated the nucleotide diversity (π) (Nei 1987) per pocket (i.e., A, B, pooled CDE, and F) for
each population (referred to as πtotal). For these four pockets, we also computed within- and
between-supertype nucleotide diversity (referred to as πwithin and πst, respectively), and thus
estimated a measure of among-supertype variation for each pocket, obtained using the following
formula:

\pi_{st} = \frac{\pi_{total} - \pi_{within}}{\pi_{total}} \qquad (1)

Total, within- and between-supertype π values were calculated in two ways: (a) by excluding the
non-classified alleles and (b) by including the non-classified alleles as a single group. As the
dataset is limited to alleles defined at the second-field level of resolution, no information about
synonymous polymorphism is available. We addressed this problem by applying the same strategy as
described by Buhler and Sanchez-Mazas (2011), which consists in treating as missing data the
nucleotide positions described as synonymous (Robinson et al. 2015). We excluded sites having more
than 5 % missing data.
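Equation 1 can be illustrated with a small numerical sketch. The Python snippet below is not the
authors' code (their analyses were run in R). It assumes that π is the frequency-weighted mean
proportion of pairwise differences without any sample-size correction, and that πwithin is the
average of the per-supertype π values weighted by supertype frequency; the paper does not spell out
either detail. The sequences, frequencies, supertype labels (S1, S2), and all function names are
hypothetical.

```python
from itertools import combinations


def site_differences(seq_a, seq_b):
    """Proportion of sites that differ between two aligned sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(a != b for a, b in zip(seq_a, seq_b)) / len(seq_a)


def nucleotide_diversity(seqs, freqs):
    """pi = 2 * sum_{i<j} x_i * x_j * d_ij, with frequencies rescaled to sum to 1."""
    total = sum(freqs)
    x = [f / total for f in freqs]
    return 2 * sum(
        x[i] * x[j] * site_differences(seqs[i], seqs[j])
        for i, j in combinations(range(len(seqs)), 2)
    )


def within_supertype_diversity(seqs, freqs, supertypes):
    """Assumed definition: per-supertype pi, weighted by supertype frequency."""
    total = sum(freqs)
    pi_within = 0.0
    for s in set(supertypes):
        idx = [i for i, label in enumerate(supertypes) if label == s]
        weight = sum(freqs[i] for i in idx) / total
        pi_within += weight * nucleotide_diversity(
            [seqs[i] for i in idx], [freqs[i] for i in idx]
        )
    return pi_within


def pi_st(pi_total, pi_within):
    """Equation 1: share of the pocket diversity found between supertypes."""
    return (pi_total - pi_within) / pi_total


# Hypothetical pocket alignment: four alleles, two made-up supertypes.
pocket_seqs = ["AAAA", "AAAT", "TTTT", "TTTA"]
allele_freqs = [0.25, 0.25, 0.25, 0.25]
supertype_of = ["S1", "S1", "S2", "S2"]

pi_total = nucleotide_diversity(pocket_seqs, allele_freqs)
pi_within = within_supertype_diversity(pocket_seqs, allele_freqs, supertype_of)
print(pi_total, pi_within, pi_st(pi_total, pi_within))
```

In this toy case the alleles of a supertype differ at a single site while alleles of different
supertypes differ at most sites, so most of the diversity lies between supertypes and πst is close
to 1. A pocket that is monomorphic in a population has πtotal = 0, in which case πst is undefined and
would need separate handling.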
of randomized datasets with GST values lower or He values higher than those observed for the true
data.

Results and discussion

HLA-A and HLA-B supertype frequencies and their geographic distributions

In a previous study (the only one, to our knowledge, except our own study on HLA-DRB1 (Gibert and
Sanchez-Mazas 2003)) addressing population differentiation at the supertype level, Sidney et al.
(1996) used five population samples and reported that all supertypes were present in all world
regions. The current study, with 55 populations, greatly extends those original observations,
allowing us to show that some supertypes are not observed in all populations while reaching a
frequency of more than 50 % in others (Figs. 1a, c, 2, and 3). Among the HLA-A supertypes, A1 is the
rarest, showing frequencies smaller than 9 % in more than half of the populations (Fig. 1a) and being
virtually absent in five of them (Fig. 1c). A1 alleles are found at high frequencies (22 % on
average) in Africa, Southwest Asia, and Europe (Fig. 2), resulting in a significant geographic
structure, i.e., with most of the variation being found among populations of different geographic
regions (FCT > FSC; Table 2). The A1 supertype is represented by a small number of alleles, with one
or two alleles in more than half of the populations (Fig. 1b) and only one in 14 of them (Fig. 1c).
The A2 and A3 supertypes exhibit more even distributions, half of the populations having frequencies
ranging from 14 to 29 % for A2 and 14 to 32 % for A3 (Figs. 1a and 2). As a consequence, among the
HLA-A supertypes, A2 and A3 present either the lowest or no geographic structure at all (FCT < FSC
for A2 and FCT not significantly different from 0 for A3; Table 2). All populations present at least
one allele of supertype A2 (eight of them showing just one), while the A3 supertype is represented by
a large number of alleles (Fig. 1b, c). The A24 supertype is observed in all populations (Fig. 1c),
with frequencies ranging from 13 to 40 % in half of them (Fig. 1a). Despite its broad distribution,
A24 is often represented by only two alleles, A*23:01 and A*24:02, with 26 and 10 populations showing
just one or both of these alleles, respectively (Fig. 1b, c). This supertype is found at higher
frequencies (40 % on average) in SEA, PAC, AUS, NEA, and AME (Fig. 2). Although A24 exhibits the
highest level of population differentiation among the four HLA-A supertypes (FST = 11 %, p<0.0001),
most of the variation is found within geographic regions (FCT < FSC).

Fig. 1 Supertype variation. (a) Boxes represent the frequency distributions of the four HLA-A and the
five HLA-B supertypes and the “non-classified alleles” NCA and NCB, respectively; (b) each box
represents the distribution of the number of distinct alleles of each supertype per population; and
(c) the dark gray section of the bars represents the number of populations showing only one allele
for the referred supertype (referred to as “monomorphic populations”). The light gray section of the
bars represents the number of populations where the referred supertype was not detected
The frequencies of the HLA-A non-classified alleles (NCAs) vary greatly between populations, ranging
from 2 to 14 % in half of them (Fig. 1a). The NCA group presents a strong geographic structure (FCT
being twice as much as FSC) and a very high FST value (almost 16 %) (Table 2). The highest NCA
frequencies are found in African and Australian populations (averages of 16 and 43 %, respectively)
(Fig. 2).
The HLA-B supertypes fall into two main categories regarding their frequency distributions. On the
one hand, B7 and B44 exhibit a pattern resembling A2 and A3, with high average frequencies (Figs. 1a
and 3) and relatively low levels of geographic structure (Table 2). Half of the populations present
frequencies ranging from 18 to 31 % for B7 and from 21 to 32 % for B44, respectively (Figs. 1a and
3). Both B7 and B44 are observed in all populations (except B7 in the Yami; Figs. 1c and 3), with
large numbers of alleles per population (Fig. 1b, c). By contrast, B58 and B62 exhibit very low
frequencies, ranging from 0 to 5.8 % and from 2.9 to 18 % in half of the populations, respectively
(Fig. 1a). Among the five HLA-B supertypes, B62 presents the highest level of population
differentiation (FST = 11.38 %, p<0.0001; Table 2), although with no clear geographic structure
(FCT < FSC; Table 2). Such a geographic structure is only found for B58 (FCT of 5.6 %, almost twice
as great as FSC; Table 2), which is observed in SSA populations at an average frequency of 33 % (from
23 to 60 %; Fig. 3), against 4.2 % in the other regions (Fig. 3) and no observation at all in many
populations (18 out of 55; Fig. 3). The B27 supertype presents an intermediate pattern between B7/B44
and B58/B62. It exhibits relatively lower frequencies (from 7 to 19 % in half of the populations;
Fig. 1a) and a higher level of population differentiation than B7 and B44 (FST = 7.5 %, p<0.0001;
Table 2) but no geographic structure (FCT very close to zero; Table 2). Contrasting with what is
observed for the NCA, the non-classified alleles for HLA-B (NCB) are
quite frequent, with frequencies ranging from 10 to 17 % in half of the populations (Fig. 1a). More
than 75 % of populations present at least two different NCBs (Fig. 1b), and only two populations lack
one of these alleles (Fig. 1c). The NCBs also exhibit a significant geographic structure, although
not as strong as for NCA (Table 2).
In summary, based on the observed data, supertypes can be allocated into two main categories: on the
one hand, A2, A3, B7, B27, and B44 fit the classical view that supertypes are evenly distributed
(Figs. 1a, 2, and 3), poorly structured geographically (Table 2), and represented by a large number
of alleles (Fig. 1b, c). On the other hand, A1, A24, B58, and B62 present a greater frequency
variation among populations (Figs. 2 and 3 and Table 2), and in some cases significant geographic
structure (i.e., for A1 and B58, both being very common in Africa), and are represented by a smaller
number of alleles. Although the unclassified alleles have brought noise to the analysis, they should
not be ignored. They are a consequence of the functional supertype classification, and they were kept
to understand exactly how they influence the variations in HLA-A and HLA-B. As discussed above, the
NCA consists of a small group of alleles, which reach high frequencies in island populations. On the
other hand, NCB is a more heterogeneous group appearing in almost all populations.

Heterozygosity and interpopulation differentiation

Using both complete and reduced datasets (see “Materials and methods” section), the heterozygosity
estimated for the data treated at the allelic level is always larger than that estimated for the data
treated at the supertype level (Table 3). This result is expected because alleles are nested within
supertypes, and the heterozygosity of the latter is thus constrained to be equal to or smaller than
that of the former.
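The nesting constraint can be checked on a toy example. The Python sketch below is only an
illustration and not the authors' R code: it assumes the plain estimator He = 1 - sum(p_i^2) without
sample-size correction, and the allele names (a1 to a5), supertype labels (S1 to S3), and frequencies
are hypothetical.

```python
from collections import defaultdict


def expected_heterozygosity(freqs):
    """He = 1 - sum(p_i^2), with frequencies rescaled to sum to 1."""
    freqs = list(freqs)
    total = sum(freqs)
    return 1.0 - sum((f / total) ** 2 for f in freqs)


def pool_by_supertype(allele_freqs, supertype_of):
    """Add up the frequencies of all alleles assigned to the same supertype."""
    pooled = defaultdict(float)
    for allele, freq in allele_freqs.items():
        pooled[supertype_of[allele]] += freq
    return dict(pooled)


# Hypothetical population sample: five alleles grouped into three supertypes.
allele_freqs = {"a1": 0.4, "a2": 0.1, "a3": 0.2, "a4": 0.2, "a5": 0.1}
supertype_of = {"a1": "S1", "a2": "S1", "a3": "S2", "a4": "S2", "a5": "S3"}

supertype_freqs = pool_by_supertype(allele_freqs, supertype_of)
he_allele = expected_heterozygosity(allele_freqs.values())
he_supertype = expected_heterozygosity(supertype_freqs.values())
print(he_allele, he_supertype)
```

Pooling two alleles with frequencies p1 and p2 replaces p1^2 + p2^2 by (p1 + p2)^2, which is never
smaller, so the sum of squared frequencies can only grow and He can only stay equal or decrease. This
is why a comparison against randomized groupings of the same size, rather than against the allelic
values, is needed to interpret supertype-level heterozygosity.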
Table 2 Supertype differentiation indexes among populations (FST), among populations within
geographic regions (FSC), and among geographic regions (FCT)

Supertypes   FST           FSC          FCT (a)
A1           9.95 %***     2.67 %***    7.48 %***
A2           4.85 %***     3.40 %***    1.51 %*
A3           6.48 %***     6.48 %***    0.000 (b)
A24          11.14 %***    6.66 %***    4.80 %***
NCA          15.90 %***    4.90 %***    11.56 %***
B7           5.11 %***     3.21 %***    1.97 %*
B27          7.54 %***     7.10 %***    0.47 % (b)
B44          3.21 %***     1.72 %***    1.51 %**
B58          8.34 %***     2.91 %***    5.59 %**
B62          11.38 %***    7.35 %***    4.35 %*
NCB          7.02 %***     2.90 %***    4.24 %***

*p<0.01; **p<0.001; ***p<0.0001, where p values refer to the probability of observing a statistic as
extreme under the null hypothesis of no structure
(a) In italics: values of FCT > FSC, an indication that most of the variation was found among
populations of different geographic regions
(b) Not significant value

In order to define the degree to which genetic differentiation, measured by GST between populations,
was concordant at the supertype and allelic levels, we estimated the correlation between these
measures and tested their significance using Mantel tests. The results suggest that when using the
complete population dataset, the patterns of population differentiation observed at the supertype and
allelic levels are very similar, especially for HLA-A (r=0.956, p<0.0005; Fig. 4a) but also for HLA-B
(r=0.75, p<0.0005; Fig. 4b). The removal of the Pacific, Australian, Taiwanese, and Native American
populations provokes an overall drop of both the GST values and their correlations. Despite this
decrease, a high correlation coefficient is still observed for HLA-A (r=0.62, p<0.0005; Fig. 4c),
whereas the value is much lower for HLA-B (r=0.3, p<0.0005; Fig. 4d). Because Pacific, Australian,
Taiwanese, and Native American populations contribute to large differentiation values, lower
correlation coefficients were expected after removing them. Furthermore, these populations also
exhibit a reduced set of alleles per supertype, which may explain the higher correlations between
alleles and supertypes when they are taken into account. The difference between alleles and
supertypes is less pronounced for HLA-A, which presents a smaller number of alleles per supertype in
all populations (Fig. 1b, c).

Table 3 Expected heterozygosity (He) of alleles and supertypes

Loci     Dataset (a)   Average allelic He   Average supertype He
HLA-A    Complete      0.7761               0.6774
HLA-A    Reduced       0.8974               0.7504
HLA-B    Complete      0.8948               0.7577
HLA-B    Reduced       0.9429               0.7766

(a) Complete dataset, all populations; reduced dataset, excluding Pacific, Australian, Taiwanese, and
Native American populations

Patterns of molecular variability for different PBR pockets of HLA-A and HLA-B

Our goal in this part of the study was to test the prediction that the B and F pockets of the PBR
exhibit the highest levels of variation as a consequence of their crucial role in peptide binding,
which is expected to result in a stronger effect of balancing selection.
We first estimated the global levels of variation at the PBR and observed significantly higher levels
of nucleotide diversity (πtotal) at HLA-B, compared to HLA-A (p<0.0000005; Wilcoxon rank sum test).
Moreover, these two genes differ in the way molecular variation is distributed among the A, B, CDE,
and F pockets within the PBR (Fig. 5). The rank order of πtotal is pCDE≫pB≫pA>pF at HLA-A, and
pB≫pF>pCDE≫pA at HLA-B (where p is an abbreviation for “pocket”, ≫ indicates a greater value with a
difference significant at the 0.00001 level, and > indicates a greater value with a non-significant
difference, according to a Wilcoxon rank sum test; Fig. 5). Among the HLA-A pockets, most of the
variation is found in the CDE pockets, which make up the central region of the PBR, and significantly
less in pB (πtotal values ranging from 0.14 to 0.15 and from 0.11 to 0.12 in half of the populations,
respectively; Fig. 5). The pA and pF pockets exhibit the smallest levels of variation (πtotal values
ranging from 0.07 to 0.09 in half of the populations; Fig. 5). Among the HLA-B pockets, pB exhibits by
far the highest variation, with πtotal values ranging from 0.18 to 0.21 in half of the populations,
whereas the other pockets exhibit a relatively narrow πtotal distribution (ranging from 0.10 to 0.12
in half of the populations; Fig. 5).
The hypothesis that the pockets B and F are the main targets of balancing selection is thus partially
supported for HLA-B, since pB presents by far the highest level of nucleotide diversity.
Interestingly, van Deutekom and Kesmir (2015) recently showed that changes involving several of the B
pocket’s amino acids had a profound impact on peptide-binding properties, which corroborates our
interpretation. On the other hand, pF, which is not significantly different from pA at HLA-A, and
from pCDE at HLA-B, does not present an increased value of πtotal, which would be evidence against
balancing selection. It is important to note that these results were obtained independently from the
classification of alleles into supertypes, since the determination of the pockets’ codons was taken
from the classical study of Saper et al. (1991).
We also analyzed how the nucleotide diversity was distributed between supertypes. Since the supertype
categorization is
based on variations of pB and pF, these pockets were expected to present more differences between
supertypes than the others. This prediction was confirmed for pF at HLA-A and pB at HLA-B (Fig. 6).
As pB presents the highest levels of variation at HLA-B and also accounts for most of the differences
between HLA-B supertypes, we conclude that the variation between HLA-B supertypes accounts for most
of the differences observed between HLA-B alleles. In other words, alleles classified within the same
HLA-B supertype share more similarities than alleles assigned to different HLA-B supertypes. By
contrast, most of the differences between HLA-A supertypes lie within pF, the pocket presenting the
lowest πtotal values for this gene. Therefore, at this locus, the supertypes do not account for most
of the variation between alleles (Fig. 6). In other words, HLA-A presents more variation within than
between supertypes.

Fig. 5 Total nucleotide diversity (πtotal) at HLA-A and HLA-B PBR pockets. Each box represents the
distribution of the total nucleotide diversity per pocket for the populations of the complete dataset

Fig. 6 Nucleotide diversity between supertypes (πst) at HLA-A and HLA-B PBR pockets. Each box
represents the distribution of the nucleotide diversity between supertypes per pocket for the
populations of the complete dataset

Simulation approach to test selection on supertypes

According to the definition of Sidney et al. (1996), alleles included within the same supertype have
overlapping peptide-binding specificities. To test the effects of the supertype classification on
expected heterozygosities (He) and pairwise differentiation (GST), we generated null distributions
for these two statistics under the hypothesis that alleles within supertypes are a random collection,
with no shared functional attributes. To this end, the assignment of alleles to supertypes was
randomized by permuting the supertype labels attributed to each allele motif, as described in the
“Materials and methods” section. As the same patterns were obtained using the two different
simulation approaches (see “Materials and methods” section), we only present the results for the case
without any constraint on the number of alleles associated with a specific supertype.
For HLA-A, we do not observe any population with a significant difference in He between the real and
random supertype assignments. For HLA-B, 6 out of 55 populations exhibit significantly lower He
(permutation-based p<0.05) than those acquired via simulations. These six populations belong to the
reduced dataset. Because the number of populations with individually significant p values in either
direction (i.e., with significantly lower or greater He compared to the simulated value) is small, we
investigated whether the distribution of the p values itself was informative regarding selective
effects. To do this, we used an exact binomial test to assess whether the observed distribution of p
values deviated from one composed of equal numbers of values on either side of 0.5 (the expected
proportion of deviation in either direction under the null hypothesis; Fig. 7). For HLA-A, no
significant deviation is found (p value>0.05 for both complete and reduced datasets). For HLA-B,
however, a significant skew towards p values greater than 0.5 is observed, indicating an overall
significant excess of populations with lower He than those obtained through simulations (p value<0.05
and p value<0.005 for complete and reduced datasets, respectively).
For both HLA-A and HLA-B, GST values were not significantly different from those of the randomized
data when using the complete dataset. This is also true when using the reduced dataset for HLA-A but
not for HLA-B. Indeed, after removing the Pacific, Australian, Taiwanese, and Native American
populations, the observed GST is higher than 98 % of the simulations for HLA-B (Fig. 8). This finding
differs from the expectations of Sidney et al. (1996), who predicted an overall decrease of
differentiation at the supertype level. However, it is in agreement with our description of the
observed data. Indeed, in our simulations, alleles were randomly assigned to supertypes, creating
randomized supertypes with similar contents of common and rare alleles. The common alleles are
expected to be assigned to different randomized supertypes in most of the simulations because they
are less numerous than the rare alleles. Such a pattern is similar to that described for real HLA-A
supertypes, which present a low number of common alleles per population (Fig. 1b, c). As discussed
above, this pattern also explains the high correlation found between GST values measured at the
allelic and supertype levels for this locus (Fig. 4). Finally, as also discussed above for the PBR
pockets, less variation is found between than within HLA-A supertypes. This indicates that HLA-A
supertypes are composed of heterogeneous sets of alleles with few sequence similarities at pF (Figs.
5 and 6), which explains the similarity between the results based on the observed and randomized
data. On the other hand, HLA-B supertypes appear to be composed of alleles sharing more sequence
similarities, as shown by the molecular analysis of the PBR pockets (Figs. 5 and 6).
In summary, HLA-B supertypes are sets of alleles with B pocket resemblances, and these similarities
can be interpreted directly in terms of peptide presentation profiles because HLA-B supertypes
exhibit major differences regarding the chemical properties of pB. Thus, our results showing an
increased differentiation at the level of HLA-B supertypes are consistent with an effect of natural
selection resulting in local adaptation of populations to different pathogen environments. Through
our simulations, the functional grouping of alleles reflected by the HLA-B supertypes is disrupted,
creating randomized groups in the same way as described for HLA-A. The frequent allocation of common
alleles into different randomized supertypes in the simulations thus provokes both an increase of He
and a decrease of population differentiation (GST), when compared with the observed data (Figs. 7 and
8). In agreement with this interpretation, the inclusion of the
Pacific, Australian, Taiwanese, and Native American populations reduces this effect because the
patterns of variation at HLA-B for these populations resemble those observed at HLA-A, with a
relatively low number of alleles belonging to different supertypes.

Conclusions

The supertype classification of HLA-A and HLA-B alleles has been widely used in medical research,
with reports suggesting that supertype-level variation explains susceptibility or resistance to a
series of pathogenic diseases (Alencar et al. 2013; Chakraborty et al. 2013; Cordery et al. 2012;
Gilchuk et al. 2013; Karlsson et al. 2012, 2013; Kuniholm et al. 2013; Trachtenberg et al. 2003).
This classification was proposed in the 1990s as an attempt to find, as described by Sette and Sidney
(1999), “the common denominators and similarities hidden within this very large degree of
polymorphism.” The same authors also stated that “the overall frequency of each of these supertypes
is remarkably high and fairly conserved among very different ethnicities. Thus, there might be some
advantage for human populations to present approximately five to ten main binding specificities and
that each one of these is maintained at relatively high frequency.” According to our results, the
variation among HLA-B supertypes does reflect the functional diversity at this locus and is thus in
agreement with the above-mentioned hypothesis. Our results strongly indicate that the B pocket is
likely to be the main target of natural selection at HLA-B, as it presents the highest levels of
molecular variation and accounts for the main differences in the peptide presentation profiles for
this gene. However, in contrast with classical expectations for loci evolving under balancing
selection, our simulation results reveal that HLA-B supertype frequencies do not show a signature of
balancing selection (i.e., we find lower He compared to those of randomly assigned groups of
alleles), implying that each supertype is not maintained at relatively high frequencies in all
populations. This result is supported by the geographically heterogeneous distributions of B58 and
B62 (and, to a lesser extent, B27) frequencies among populations. Moreover, populations are more
differentiated than expected for HLA-B supertypes (higher observed GST values than those obtained
from randomly assigned groups of alleles). As most of the differences between HLA-B supertypes lie in
the B pocket, this means that the differences in HLA-B supertype composition among populations can be
interpreted in terms of peptide recognition. Thus, for HLA-B, our results support the idea that
populations present more differences in peptide presentation profiles than expected, possibly due to
local adaptations to pathogens.
By contrast, most of the differences between HLA-A alleles are not related to differences at the
supertype level. This is supported by our simulation results showing that the randomly assigned
groups of alleles often reproduce the observed patterns of variation and differentiation of HLA-A
supertypes. Moreover, HLA-A alleles are more conserved at the sites involved in peptide binding,
suggesting that they present a more conserved profile of peptides across populations, differing from
what is observed for HLA-B. Of note, one possible caveat of inferring peptide binding through the
supertype classification is that some peptides presented by HLA class I molecules are known to assume
a looping conformation outside the peptide-binding groove. However, no matter how many different
conformations a peptide can adopt, the anchor amino acids located at the peptide ends remain the
same, limited by the B and F pockets. In this way, this
conformational variability exhibited by the peptides is also a consequence of the interaction between
the peptide anchors and the B and F pockets and thus is not expected to change the results obtained
here.
Our results suggest that the B pocket of the HLA-B molecules is the main target of natural selection,
whereas no such signals could be retrieved for the other HLA-B pockets nor for the pockets of the
HLA-A molecules in relation to the supertype classification. This conclusion matches the expectations
that supertypes are the primary targets of selection for HLA-B but not for HLA-A. Following this
idea, we could state that HLA-A supertypes are composed of alleles whose resemblances are not the
consequence of a shared phylogenetic origin. A future extension of this work could be to explore
whether the central pockets C, D, and E, which have been shown to contain most of the variation at
HLA-A, could be used as an alternate functional classification for these alleles.

Acknowledgments This work was supported by the Swiss National Science Foundation (SNSF) grant no.
31003A_144180 to ASM and São Paulo Research Foundation (FAPESP) 12/18010-0 and a CNPq productivity
grant no. 308167/2012-0 to DM. RSF was supported by CNPq (grant no. 142130/2009-5) and CAPES (grant
no. 12447/12-9). We also thank two anonymous reviewers for their useful comments.

Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0
International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use,
distribution, and reproduction in any medium, provided you give appropriate credit to the original
author(s) and the source, provide a link to the Creative Commons license, and indicate if changes
were made.

References

Alencar LXE, Braga-Neto UM, Nascimento EJM, Cordeiro MT, Silva AM, Brito CAA, Silva PM, Gil LH, Montenegro SM, Marques Júnior ET Jr (2013) HLA-B*44 is associated with dengue severity caused by DENV-3 in a Brazilian population. J Trop Med 2013:648475
Apanius V, Penn D, Slev PR, Ruff LR, Potts WK (1997) The nature of selection on the major histocompatibility complex. Crit Rev Immunol 17(2):179–224
Borghans JA, Beltman JB, De Boer RJ (2004) MHC polymorphism under host-pathogen coevolution. Immunogenetics 55(11):732–739
Buhler S, Sanchez-Mazas A (2011) HLA DNA sequence variation among human populations: molecular signatures of demographic and selective events. PLoS One 6(2):e14643
Chakraborty S, Rahman T, Chakravorty R, Kuchta A, Rabby A, Sahiuzzaman M (2013) HLA supertypes contribute in HIV type 1 cytotoxic T lymphocyte epitope clustering in Nef and Gag proteins. AIDS Res Hum Retroviruses 29(2):270–278
Cordery DV, Martin A, Amin J, Kelleher AD, Emery S, Cooper DA, STEAL study group (2012) The influence of HLA supertype on thymidine analogue associated with low peripheral fat in HIV. AIDS 26(18):2337–2344
R Development Core Team (2011) R: a language and environment for statistical computing. Vienna, Austria: the R Foundation for Statistical Computing. ISBN: 3-900051-07-0. Available online at http://www.R-project.org/
Dray S, Dufour AB (2007) The ade4 package: implementing the duality diagram for ecologists. J Stat Softw 22(4):1–20
Excoffier L, Lischer HE (2010) Arlequin suite ver 3.5: a new series of programs to perform population genetics analyses under Linux and Windows. Mol Ecol Resour 10(3):564–567
Gibert M, Sanchez-Mazas A (2003) Geographic patterns of functional categories of HLA-DRB1 alleles: a new approach to analyse associations between HLA-DRB1 and disease. Eur J Immunogenet 30(5):361–374
Gilchuk P, Spencer CT, Conant SB, Hill T, Gray JJ, Niu X, Zheng M, Erickson JJ, Boyd KL, McAfee KJ, Oseroff C, Hadrup SR, Bennink JR, Hildebrand W, Edwards KM, Crowe JE, Williams JV, Buus S, Sette A, Schumacher TN, Link AJ, Joyce S (2013) Discovering naturally processed antigenic determinants that confer protective T cell immunity. J Clin Invest 123(5):1976–1987
Hedrick PW, Whittam TS, Parham P (1991) Heterozygosity at individual amino acid sites: extremely high levels for HLA-A and -B genes. Proc Natl Acad Sci U S A 88(13):5897–5901
Hughes AL, Nei M (1988) Pattern of nucleotide substitution at major histocompatibility complex class I loci reveals overdominant selection. Nature 335(6186):167–170
Jost L (2008) G(ST) and its relatives do not measure differentiation. Mol Ecol 17(18):4015–4026
Karlsson I, Kløverpris H, Jensen KJ, Stryhn A, Buus S, Karlsson A, Vinner L, Goulder P, Fomsgaard A (2012) Identification of conserved subdominant HIV type 1 CD8(+) T cell epitopes restricted within common HLA supertypes for therapeutic HIV type 1 vaccines. AIDS Res Hum Retroviruses 28(11):1434–1443
Karlsson I, Brandt L, Vinner L, Kromann I, Andreasen LV, Andersen P, Gerstoft J, Kronborg G, Fomsgaard A (2013) Adjuvanted HLA-supertype restricted subdominant peptides induce new T-cell immunity during untreated HIV-1-infection. Clin Immunol 146(2):120–130
Kuniholm MH, Anastos K, Kovacs A, Gao X, Marti D, Sette A, Greenblatt RM, Peters M, Cohen MH, Minkoff H, Gange SJ, Thio CL, Young MA, Xue X, Carrington M, Strickler HD (2013) Relation of HLA class I and II supertypes with spontaneous clearance of hepatitis C virus. Genes Immun 14(5):330–335
Lawlor DA, Zemmour J, Ennis PD, Parham P (1990) Evolution of class-I MHC genes and proteins: from natural selection to thymic selection. Annu Rev Immunol 8:23–63
Mack S, Sanchez-Mazas A, Meyer D, Single R, Tsai Y et al (2006) 13th International Histocompatibility Workshop Anthropology/Human Genetic Diversity Joint Report - Chapter 2: methods used in the generation and preparation of data for analysis in the 13th International Histocompatibility Workshop. In: Hansen J (ed) Immunobiology of the human MHC: Proceedings of the 13th International Histocompatibility Workshop and Conference. IHWG Press, Seattle, pp 564–579
Mantel N (1967) The detection of disease clustering and a generalized regression approach. Cancer Res 27(2):209–220
Naugler C, Liwski R (2008) An evolutionary approach to major histocompatibility diversity based on allele supertypes. Med Hypotheses 70(5):933–937
Nei M (1987) Molecular evolutionary genetics. Columbia University Press, New York
Nei M, Chesser RK (1983) Estimation of fixation indices and gene diversities. Ann Hum Genet 47(Pt 3):253–259
Nunes JM (2014) Using Uniformat and Gene[rate] to analyse data with ambiguities in population genetics. http://dx.doi.org/10.6084/m9.figshare.984299
259
Immunogenetics
Nunes JM, Buhler S, Roessli D, Sanchez-Mazas A, HLA-net 2013 col- Saper MA, Bjorkman PJ, Wiley DC (1991) Refined structure of the
laboration (2014) The HLA-net Gene[rate] pipeline for effective human histocompatibility antigen HLA-A2 at 2.6 A resolution. J
HLA data analysis and its application to 145 populations from Mol Biol 219(2):277–319
Europe and neighbouring areas. Tissue Antigens 83(5):307–323 Sette A, Sidney J (1999) Nine major HLA class I supertypes account for
Parham P (2005) MHC class I molecules and KIRs in human history, the vast preponderance of HLA-A and -B polymorphism.
health and survival. Nat Rev Immunol 5(3):201–214 Immunogenetics 50(3–4):201–212
Parham P, Benjamin RJ, Chen BP, Clayberger C, Ennis PD, Krensky AM, Sidney J, Grey HM, Kubo RT, Sette A (1996) Practical, biochemical and
Lawlor DA, Littman DR, Norment AM, Orr HT et al (1989) evolutionary implications of the discovery of HLA class I
Diversity of class I HLA molecules: functional and evolutionary supermotifs. Immunol Today 17(6):261–266
interactions with T cells. Cold Spring Harb Symp Quant Biol Sidney J, Peters B, Frahm N, Brander C, Sette A (2008) HLA class I
54(Pt 1):529–543 supertypes: a revised and updated classification. BMC Immunol 9:1
Prugnolle F, Manica A, Charpentier M, Guégan JF, Guernier V, Balloux F Slade RW, McCallum HI (1992) Overdominant vs. frequency-dependent
(2005) Pathogen-driven selection and worldwide HLA class I diver- selection at MHC loci. Genetics 132(3):861–864
sity. Curr Biol 15(11):1022–1027 Takahata N, Nei M (1990) Allelic genealogy under overdominant and
Qutob N, Balloux F, Raj T, Liu H, Marion de Procé S, Trowsdale frequency-dependent selection and polymorphism of major histo-
J, Manica A (2011) Signatures of historical demography and compatibility complex loci. Genetics 124(4):967–978
pathogen richness on MHC class I genes. Immunogenetics Takahata N, Satta Y, Klein J (1992) Polymorphism and balancing selection
64(3):165–175 at major histocompatibility complex loci. Genetics 130(4):925–938
Trachtenberg E, Korber B, Sollars C, Kepler TB, Hraber PT, Hayes E,
Robinson J, Halliwell JA, Hayhurst JD, Flicek P, Parham P, Marsh SG
Funkhouser R, Fugate M, Theiler J, Hsu YS, Kunstman K, Wu S,
(2015) The IPD and IMGT/HLA database: allele variant databases.
Phair J, Erlich H, Wolinsky S (2003) Advantage of rare HLA
Nucleic Acids Res 43(Database issue):D423–D431
supertype in HIV disease progression. Nat Med 9(7):928–935
Sanchez-Mazas A, Lemaître JF, Currat M (2012) Distinct evolutionary van Deutekom HW, Kesmir C (2015) Zooming into the binding groove of
strategies of human leucocyte antigen loci in pathogen-rich environ- HLA molecules: which positions and which substitutions changes
ments. Philos Trans R Soc Lond B Biol Sci 367(1590):830–839 peptide binding most? Immunogenetics 67(8):425–436
260
Appendix A.3.
Copy of the article “Kiwi genome provides insights into evolution of a nocturnal
lifestyle”: Genome Biology (2015), 16(1): 1-15.
In this work, I performed the dN/dS-based selection tests using the PAML package
(and/or supervised their execution and interpretation) and was responsible for the
discussion of the results of these analyses in the article. I also took part in the
analyses of the ultra-conserved regions (ultra-conserved non-coding elements) that
show more variation than expected in kiwi, which points to developmental pathways
that may be altered in this species. Finally, I contributed corrections to the
manuscript and took part in discussions of the evolutionary aspects of the work.
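The dN/dS-based tests mentioned above compare nested CODEML models through a likelihood-ratio test (LRT). The following is a minimal sketch of that comparison, assuming the two log-likelihoods have already been read from the CODEML output and that the models differ by one free parameter (one degree of freedom). The function name and the log-likelihood values are hypothetical and serve only as an illustration.

```python
import math

def lrt_one_df(lnl_null, lnl_alt):
    """Likelihood-ratio test for two nested models that differ by one parameter.

    lnl_null: log-likelihood of the simpler model (e.g. one omega for all branches)
    lnl_alt:  log-likelihood of the richer model (e.g. separate omega on one branch)
    Returns the test statistic and its chi-square (1 df) p-value.
    """
    stat = max(2.0 * (lnl_alt - lnl_null), 0.0)
    # Survival function of a chi-square with 1 df: P(X > x) = erfc(sqrt(x / 2))
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value

# Hypothetical log-likelihoods, not taken from the article
stat, p_value = lrt_one_df(-2501.30, -2498.10)
print(f"LRT = {stat:.2f}, p = {p_value:.4f}")
```

A statistic above 3.84 corresponds to significance at the 5 % level for one degree of freedom, which is the threshold quoted in the article.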
Le Duc et al. Genome Biology (2015) 16:147
DOI 10.1186/s13059-015-0711-4
Abstract
Background: Kiwi, comprising five species from the genus Apteryx, are endangered, ground-dwelling bird species
endemic to New Zealand. They are the smallest and only nocturnal representatives of the ratites. The timing of kiwi
adaptation to a nocturnal niche and the genomic innovations, which shaped sensory systems and morphology to
allow this adaptation, are not yet fully understood.
Results: We sequenced and assembled the brown kiwi genome to 150-fold coverage and annotated the genome
using kiwi transcript data and non-redundant protein information from multiple bird species. We identified
evolutionary sequence changes that underlie adaptation to nocturnality and estimated the onset time of these
adaptations. Several opsin genes involved in color vision are inactivated in the kiwi. We date this inactivation to
the Oligocene epoch, likely after the arrival of the ancestor of modern kiwi in New Zealand. Genome comparisons
between kiwi and representatives of ratites, Galloanserae, and Neoaves, including nocturnal and song birds, show
diversification of kiwi’s odorant receptors repertoire, which may reflect an increased reliance on olfaction rather than
sight during foraging. Further, there is an enrichment of genes influencing mitochondrial function and energy
expenditure among genes that are rapidly evolving specifically on the kiwi branch, which may also be linked to its
nocturnal lifestyle.
Conclusions: The genomic changes in kiwi vision and olfaction are consistent with changes that are hypothesized to
occur during adaptation to nocturnal lifestyle in mammals. The kiwi genome provides a valuable genomic resource for
future genome-wide comparative analyses to other extinct and extant diurnal ratites.
© 2015 Le Duc et al. This is an Open Access article distributed under the terms of the Creative Commons Attribution License
(http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium,
provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver
(http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
our understanding of how genomic features evolve during adaptation to nocturnality and the ground-dwelling niche.
We have also sequenced the transcriptome from embryonic tissue to provide support for the genome annotation.
We identified genomic changes in kiwi that affect physiological functions, including vision and olfaction, which
have been predicted to characterize nocturnal adaptation in the early history of mammals [4].

Results
Genome sequencing, assembly, and annotation
We prepared 11 libraries with several insert sizes from Apteryx mantelli genomic DNA and sequenced 83 billion
base pairs (Gb) from small insert-size libraries and 120 Gb from large-insert mate-pair Illumina libraries (Additional
file 1: Table S1). After read correction [5] we assembled contigs and scaffolds using SOAPdenovo [6] (Additional
file 1: Note: Filtering and read correction; Genome assembly) to generate a draft assembly, which spanned 1.595 Gb
(Additional file 1: Tables S2 and S3). The N50s of contigs and scaffolds were 16.48 kb and 3.95 Mb, respectively
(Additional file 1: Table S3). Since the size of the kiwi genome is unknown, we estimated average coverage using a
19-mer frequency distribution (Additional file 1: Figure S1) which yielded a genome size estimate of 1.65 Gb, placing
the kiwi among the largest bird genomes sequenced to date [7] (Table 1; Additional file 1: Table S4). The assembled
contigs and scaffolds cover approximately 96 % of the complete genome with an average sequence coverage of
35.85-fold after correction (Additional file 1: Note: Filtering and read correction). Assembly quality was assessed by
chaining the kiwi scaffolds to two Sanger-sequenced bird genomes: chicken [8] and zebra finch [9]. A total of 50.09 %
(0.8 Gb) of the kiwi genome is alignable in syntenic chains to 79.67 % of the much smaller chicken genome (1.07 Gb).
A similar fraction, 57.61 % (0.9 Gb), of the kiwi sequence was alignable to 76.92 % of the zebra finch genome (1.2 Gb)
(Additional file 1: Table S5). For comparison, 69.86 % (0.84 Gb) of the zebra finch genome is syntenically alignable to
83.51 % of the chicken genome. However, 91.96 % of the zebra finch sequences that are syntenic-chain-alignable to
chicken showed conserved synteny in kiwi, suggesting that the kiwi genome assembly includes the majority of
conserved regions between birds.
We identified a set of 27,876 genes following de novo gene prediction on the assembled genome (Additional file 1:
Note: De novo gene prediction and gene annotation). To refine these gene annotations we used 47.5 Gb of transcript
sequence data from kiwi embryonic tissue together with the de novo gene predictions and protein evidence from three
well-annotated bird species (G. gallus, T. guttata, M. gallopavo) as input to the MAKER genome annotation pipeline
[10]. A validated set of 18,033 genes was selected based on their alignment to orthologous genes in other birds and on
supporting evidence provided by kiwi transcript sequences. In total, the gene models spanned 306.62 Mb of the
assembly, with exons accounting for 23.96 Mb (approximately 1.6 %) of the total kiwi genome.

Table 1 Kiwi genome assembly characteristics and genomic features compared with other avian genomes (see Additional
file 1: Table S4)
Species                    Size of assembly (Gb)   N50 scaffolds (Mb)   Heterozygous SNP rate per kb
Apteryx mantelli           1.59                    4                    1.5
Falco cherrug [17]         1.18                    4.2                  0.8
Falco peregrinus [17]      1.17                    3.9                  0.7
Taeniopygia guttata [9]    1.2                     10.4                 1.4
Ficedula albicollis [90]   1.13                    7.3                  3.03
Anas platyrhynchos [18]    1.1                     1.2                  2.61
Gallus gallus [8]          1.07                    15.5                 4.5
Meleagris gallopavo [91]   0.93                    1.5                  ~1.36

Evolution of gene families
Gene family expansion and/or contraction have been proposed as important mechanisms underlying adaptation [11].
We explored patterns of protein family expansions and contractions in kiwi and used TreeFam [12] to define gene
families in the kiwi and all bird and reptile genomes in Ensembl 73, as well as two nocturnal birds (barn owl,
chuck-will’s-widow), two other ratites (ostrich, tinamou) [7] (GigaDB [13]), two mammals (human, mouse), and one
fish (stickleback) (Ensembl 73 [14]). In total we identified 10,096 gene families shared between the inferred ancestral
state and the 16 species considered, of which 623 represent single-gene families. For these single-gene families we
constructed a maximum-likelihood phylogeny [15] (Fig. 1) and tested for changes in ortholog cluster sizes. In
accordance with previous estimates, our results indicate a net gene loss on the avian branch [16].
Changes of gene-family sizes have been inferred for multiple de novo assembled genomes [17, 18]. However, many of
these genomes have rather fragmented assemblies (Table 1); thus, results should be interpreted cautiously, only after
manual inspection and ideally independent experimental confirmation.
We therefore manually examined the 130 gene families that showed significant expansion or contraction specific to
the kiwi branch. After excluding expansions that were caused by fragmentation of the assembly [19] only 85 gene
families remained significant (Additional file 1: Table S6). Of these, 63 gene families are expanded in the kiwi. An
analysis of gene family functions [20] showing expansion in kiwi identified enrichment in categories including
signal transduction, calcium homeostasis,
Fig. 1 Phylogenetic tree of 16 species built on 623 TreeFam [12] single-gene families. Branch lengths are scaled to estimate divergence
times. All branches are supported by 100 bootstraps. The song bird clade is depicted in blue, Galliformes in purple, Anseriformes in green,
and nocturnal birds in red. Ratites (Struthio camelus and Apteryx mantelli) and Tinamus guttatus are highlighted in light green. The number
of genes gained (+ red) and lost (− blue) is given underneath each branch. The rate of gene gain and loss for the clades derived from
the most recent common ancestor was estimated [77] at 0.0007 per gene per million years
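The contig and scaffold N50 values reported for the assembly (16.48 kb and 3.95 Mb) follow the usual definition of N50: the sequence length at which sequences of that length or longer contain at least half of all assembled bases. A minimal sketch of that calculation is given below. The function name and the toy lengths are hypothetical and are not kiwi data.

```python
def n50(lengths):
    """Return the N50 of a collection of contig or scaffold lengths.

    The N50 is the length L such that sequences of length >= L together
    contain at least half of the total assembled bases.
    """
    if not lengths:
        raise ValueError("need at least one sequence length")
    total = sum(lengths)
    running = 0
    # Walk from the longest sequence down until half of the bases are covered
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length

# Hypothetical toy scaffold lengths (arbitrary units)
toy_scaffolds = [80, 70, 50, 40, 30, 20, 10]
print(n50(toy_scaffolds))
```

Because a few long scaffolds dominate the statistic, N50 is a summary of contiguity rather than of completeness, which is why the article reports assembly coverage and synteny separately.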
and motor activity (FDR <0.0001, Additional file 1: Figure S2A). Among the gene families that show contraction on
the kiwi branch we found an enrichment of development-related Gene Ontology (GO) categories (FDR <0.0001,
Additional file 1: Figure S2B).
Diversification of tetrapods and the colonization of terrestrial habitats are often accompanied by changes of
physiological systems specifically in cellular signal transduction [21]. Membrane proteins are involved in cellular
signaling, hence we aimed to determine more specifically which classes of membrane-expressed proteins have
undergone changes in the number of coding genes. To this end we annotated the membrane proteome in kiwi,
human, all birds, and reptiles present in Ensembl 74, two additional ratites (ostrich and tinamou) and two nocturnal
birds (chuck-will’s-widow and barn owl) (Additional file 1: Note: Detection and classification of the membrane
proteome; Additional file 1: Table S7). We manually inspected the classes which showed expansion in kiwi, to ensure
that the higher number of predicted genes is not a result of assembly fragmentation. We found a significant expansion
in kiwi of genes coding for adhesion and immune-related proteins (Additional file 1: Table S7). Additionally, we
found a significant expansion of the Ephrin kinases class, which are functionally involved in the development of the
sensory-motor innervation of the limb [22] and later on in tendons condensation and developing feather buds [23].

Patterns of natural selection
To determine whether any branch-specific selection is present in kiwi we estimated branch ω-values (Ka/Ks
substitution ratios) for 4,152 orthologous genes in eight bird species: kiwi, ostrich, tinamou, chuck-will’s-widow, barn
owl, chicken, zebra finch, and turkey using CODEML [24]. Ortholog assignment was based on the orthology relation
among chicken, zebra finch, and turkey defined in Ensembl 73 (Additional file 1: Note: Orthologs and Ka/Ks
calculation). The kiwi average ω across all the orthologs is comparable to that in ostrich, and higher than in tinamou
and night birds (0.291, 0.313, 0.145, 0.202, and 0.200 for kiwi, ostrich, tinamou, chuck-will’s-widow, and barn owl,
respectively). This implies a relatively faster overall rate of functional evolution in kiwi and ostrich.
In addition to gene-family expansions/contractions, we used evidence of branch-specific selection to identify genes
and functional pathways that may underlie kiwi-specific adaptations. For the 4,152 orthologous genes in the eight bird
species we used the branch models from CODEML to perform likelihood ratio tests [24], comparing a simple model of
one ω for all sites and branches versus a model where kiwi is defined as the foreground branch and the other birds as
background. We first considered genes with a significantly higher ω on the kiwi branch than that in all other birds
(LRT >3.84, significance at 5 %, 1 degree of freedom). Functional enrichment
using GO [20] categories was tested using a hypergeometric test (Additional file 1: Note: Gene ontology and
rapidly evolving genes). The same test was performed on genes evolving significantly slower in kiwi. To assign
functional categories as either kiwi-specific, or shared with other ratites or nocturnal birds, a similar procedure was
performed for each species of Palaeognathae (ostrich, tinamou) and night birds (chuck-will’s-widow, barn owl)
by assigning each in turn as the foreground branch in CODEML.
After multiple testing correction using family-wise error rate none of the categories remained significant. For further
analysis we considered only GO categories that had (1) a P value <0.05; (2) at least three significantly changed
genes; and (3) the number of significant genes was at least 5 % of the total genes annotated in the GO category. GO
categories that were over-represented (P value <0.05) on the kiwi branch, but not present in any of the other
considered species, were identified as potentially kiwi-specific changes (Additional file 1: Note: Gene ontology and
rapidly evolving genes). Notably, faster-evolving categories present in kiwi, but absent in any of the other species, are
related to mitochondrion, feeding behavior and energy reserve metabolic process, visual perception, and eye
photoreceptor cell differentiation (Additional file 1: Table S8A). Sensory perception of light stimulus is a faster
evolving category shared, surprisingly, with the ostrich (Additional file 1: Table S8B). Among slower evolving
categories, the mitochondrial outer membrane was one of the kiwi-specific categories (Additional file 1: Table S9A),
while anion channel activity was a shared category with chuck-will’s-widow (Additional file 1: Table S9B). For the
potentially biologically meaningful categories which could explain kiwi-specific physiology we extracted the genes
clustering in the node. GO categories have a high potential to deliver false-positive enrichment, which could be
considered biologically meaningful a posteriori [25]. Therefore, future studies need to verify the adaptive
functionality of genes belonging to the respective category (Additional file 1: Tables S8C and S9C).
It has been proposed that, in a nocturnal environment, genes involved in circadian rhythm have been under selective
pressure [4]. Our species-specific selection screens did not identify circadian rhythm-related categories to be
enriched for changed genes in either kiwi or the other nocturnal birds. However, since mutations in even a single
gene may be relevant, we analyzed more closely biorhythm regulators from the neuropsin gene family.
Encephalopsin (OPN3), melanopsin (OPN4-1), and neuropsin (OPN5) showed a similar ω in kiwi and the other branches
and no obvious alterations could be detected in the sequence (Table 2). Similar to chicken [26], kiwi and the
other tested birds have a duplication of the melanopsin gene (OPN4-2), which displayed significant signals of
positive selection in kiwi but not in the other nocturnal birds. However, a branch-site selection analysis of this
gene did not show any significant positively selected sites (Additional file 1: Note: Vision analysis).

Kiwi sensory adaptations – vision
Nocturnality is accompanied by a number of specific changes, including adaptations in visual processing [4].
In contrast to most nocturnal animals, that have large eyes relative to their body size, kiwi have small eyes and
reduced optic lobes in the brain [27]. However, the kiwi retina has a higher proportion of rods than cones which
is consistent with adaptation to nocturnality [3]. Besides black/white vision mediated via rhodopsin (RHO), most
birds have trichromatic or tetrachromatic vision, for which various additional opsins are responsible: OPN1LW (red),
OPN1MW (green, RH2), OPN1SW (blue, subtypes SWS1, SWS2) [28]. We identified these genes in the kiwi assembly.
The RHO gene in kiwi shows no interruption and no obvious function-impairing amino acid changes compared
to other vertebrates. We were able to assemble only a partial sequence of the red opsin OPN1LW (transmembrane
(TM) helix 7) and found no previously described deleterious amino acid changes within this region [29].
In the green opsin, OPN1MW, we identified a Glu134 to Lys substitution (relative position 3.49 in the
Ballesteros and Weinstein nomenclature) in the highly conserved D/ERY motif of this rhodopsin-like GPCR.
We confirmed this mutation in a second Apteryx mantelli individual, as well as in other kiwi species (Fig. 2).
To determine whether the change is kiwi-specific we sequenced this domain of OPN1MW in other ratites,
including the extinct moa. We found that Glu3.49 is 100 % conserved in all birds for which sequence was available
and also in over 250 other vertebrate orthologs. Previous experimental analysis showed that mutation of Glu3.49 to
Arg – another basic amino acid – results in a non-functional receptor protein [30]. Furthermore, the Asp
or Glu in the D/ERY motif is also highly conserved in most other rhodopsin-like GPCRs and the identical
mutation of Glu3.49 to Lys in the thromboxane A2 receptor, for example, prevents the receptor from being
functionally expressed on the plasma membrane [31].
Similarly, at the N-terminal end of TM6 in OPN1SW we identified a highly conserved Glu6.30 which is present
in all bird orthologs sequenced so far, except for kiwi OPN1SW where Glu6.30 is substituted by Gly. Previous
functional characterization has shown that mutation of Glu6.30 destabilizes the H-bond network resulting in
constitutively active opsins and other rhodopsin-like GPCRs [32, 33]. A constitutively active opsin is
functionally incapable of light signal transmission [34] and is therefore non-functional.
Besides these two functionally well-characterized positions, we identified several other amino acid
substitutions in kiwi OPN1MW and OPN1SW. Further, tests for branch and branch-site specific ω values for OPN1MW
and OPN1SW on the kiwi branch showed no evidence for positively selected sites in kiwi (Additional file 1:
Note: Vision analysis), suggesting that the greater ω values for kiwi are likely due to loss of constraint on
these genes. Hence these genes are likely to be drifting and, considering the fact that only 8 % of all inactivating
mutations in GPCRs are stop codons while almost 65 % are missense mutations [35–37], the described
loss-of-function mutations in OPN1MW and OPN1SW render color vision of kiwi, unlike for other sequenced ratites
(Fig. 2), absent – at least for the green and blue spectral ranges.
We tentatively dated the opsin-loss-of-function event as an indicator of the timing of adaptation to the nocturnal
niche. Assuming that the loss of constraint happened on the kiwi branch in a short period of time and changed
the rate of selection, measured by the ω value, from the average over bird lineages (0.021 for OPN1MW and
0.014 for OPN1SW, Table 2) to the neutral ω value of 1, the loss of function was dated to 30–38 million years
ago (Additional file 1: Note: Vision analysis), which places the event shortly after the arrival of kiwi in New
Zealand [38].

Kiwi sensory adaptations – olfaction
Kiwi are unique among birds in having nostrils present at the end of their prominent beaks and have
been reported to depend largely on tactile and olfactory senses for foraging [39]. To investigate whether
the genome shows signs of olfactory adaptation in kiwi we assessed the numbers of olfactory receptor
(OR) genes [40] and the diversity in the OR sequence [41].
The only previous approach to molecular characterization of the olfactory system in kiwi was based on PCR
amplification of ORs with degenerate primers [42]. This allowed only a rough estimation of the number of ORs of 478
genes (95 % confidence interval 156–1,708 genes). PCR with degenerate primers only produces incomplete
fragments of the genes and hence the accurate quantification of gene families with highly similar sequences, as in the
case of ORs, is prone to over-estimation [43]. In contrast, de novo genome assembly facilitates a global assessment
of the gene repertoire [44] and can therefore be used to provide a more accurate estimate of the OR repertoire.
We thus annotated the OR genes in kiwi, as part of the entire membrane proteome, on the basis of putative
functionality and seven transmembrane helices (7TM) (Additional file 1: Note: Olfactory receptor genes
identification and annotation). The number of non-OR receptor
Fig. 2 Protein sequence comparison revealed substitutions of Glu3.49 to Lys (E/DRY motif) and Glu6.30 to Gly in kiwi OPN1MW (RH2) and
kiwi OPN1SW, respectively. Both residues are 100 % conserved in all birds sequenced so far and over 100 publicly available sequences of
other vertebrate OPN1MW and OPN1SW orthologs. To ensure that the OPN1MW change is kiwi-specific, additional ratites were sequenced,
including different kiwi species and the extinct moa. Glu3.49 of the E/DRY motif and Glu6.30 at the N-terminal end of helix 6 are parts of
an ‘ionic lock’ interhelical hydrogen-bond network which is highly conserved in many rhodopsin-like GPCRs. Nb – North Island brown
kiwi, Ob – Okarito brown kiwi, Gs – Great spotted kiwi, Ec – Emeus crassus (Eastern moa), Pg – Pachyornis geranoides (Mappin’s moa),
Chuck-will – Chuck-will’s-widow
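The dating of the opsin loss of function described in the vision section rests on a change in ω from a constrained value to the neutral value of 1 somewhere along the kiwi branch. One simple way to picture this is a two-phase model in which the ω observed for the whole branch is a time-weighted average of the constrained ω and 1. The sketch below implements that assumption only; it is not necessarily the calculation used in Additional file 1. The function name, the branch length, and the kiwi-branch ω are hypothetical, and only the constrained ω of 0.021 for OPN1MW comes from the text.

```python
def time_since_constraint_loss(omega_branch, omega_constrained, branch_length_my):
    """Estimate how long ago selective constraint was lost on a branch.

    Assumes (illustrative two-phase model) that the branch evolved at
    omega_constrained until the loss and at omega = 1 afterwards, so that
        omega_branch = ((T - t) * omega_constrained + t * 1) / T
    where T is the branch length and t the time since the loss.
    """
    if not 0.0 <= omega_constrained < 1.0:
        raise ValueError("omega_constrained must be in [0, 1)")
    if not omega_constrained <= omega_branch <= 1.0:
        raise ValueError("omega_branch must lie between omega_constrained and 1")
    neutral_fraction = (omega_branch - omega_constrained) / (1.0 - omega_constrained)
    return neutral_fraction * branch_length_my

# omega_constrained = 0.021 is the bird-lineage average quoted for OPN1MW;
# the branch omega (0.4) and branch length (60 My) are hypothetical placeholders.
t_loss = time_since_constraint_loss(0.4, 0.021, 60.0)
print(f"constraint lost about {t_loss:.1f} million years ago under these assumptions")
```

The estimate scales linearly with the assumed branch length, so any uncertainty in the divergence time carries over directly to the inferred date.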
families was comparable to other avian species, suggesting that the membrane proteome is well annotated in kiwi
(Additional file 1: Table S7). This analysis revealed an initial set of 82 OR genes in the kiwi genome. However, ORs
are highly duplicated across the genome and such regions could be prone to being overcollapsed during the
assembly process. We therefore estimated the copy number of each annotated OR using a correction based on
coverage. To obtain the correction factor for each OR, read-coverage in the OR region was divided by the
genome-wide average coverage corresponding to its GC bin. Following this correction we estimated that
up to 141 OR genes are present in the kiwi genome, of which 86 encode for full-length receptors while the
rest are most likely pseudogenes due to frameshifts, premature stop codons, or truncations (Additional file
1: Note: Olfactory receptor genes identification and annotation). The estimated proportion of intact ORs
among all OR genes in kiwi (61 %) is lower than previously reported for Apteryx australis [42] (78.6 %), but
much higher than in zebra finch (38 %) [45].
Comparative analysis of the OR repertoire shows that the kiwi genome has both the α and the γ subgroups of
type 1 OR genes, as reported for other bird genomes
sequenced so far [45]. Unlike the majority of other birds analyzed so far, kiwi has a higher number of γ subgroup
ORs. Gene family size estimates are highly dependent on genome quality [46] and continuous curation is ongoing
even for well-annotated genomes: for example, in the chicken olfactory repertoire the number of annotated
ORs changed by a factor of eight in two consecutive Ensembl releases (release 73 – 251 ORs and release 74 –
30 ORs). Further improvement of genome quality, including that of kiwi, is therefore required for the identification
of a complete set of ORs. Thus, a correlation between olfactory acuity and the number of ORs in different
birds could be subject to error.
Phylogenetic comparison of OR repertoires suggests that γ ORs within bird and reptile genomes exhibit
contrasting evolutionary rates. Tree topology suggests that γ ORs in a few birds and reptiles show a species-specific
clustering pattern (Fig. 3). This pattern was previously described in birds and it was suggested that these
receptors have undergone adaptive evolution with respect to the occupied environmental niche [45]. However, a few
γ ORs belonging to kiwi cluster with their reptilian counterparts, while some cluster basal to the clade
containing most bird γ ORs (Fig. 3).
Phenotypic diversity in olfaction is, in part, attributable to genetic variation with a wider range of odors thought
Fig. 3 Maximum likelihood (ML) tree constructed using full-length intact α and γ group olfactory receptors from 10 birds (chicken, zebra
finch, flycatcher, duck, turkey, chuck-will’s-widow, barn owl, ostrich, tinamou, and kiwi) and two reptile genomes (anole lizard and Chinese soft-shell
turtle). The ML topology shown above was cross-verified using the neighbor joining (NJ) method. Three Class A (Rhodopsin) family GPCRs from
chicken genome, dopamine receptor D1 (DRD1), dopamine receptor D2 (DRD2), and histamine receptor H1 (HRH1) were used as the out-group
(shown as non-olfactory receptors). The red dot indicates confidence estimates (% bootstrap from 500 resamplings, >90 % bootstrap support from
both ML and NJ methods) for the nodes that distinguish α and γ ORs. The scale bar represents the number of amino-acid substitutions per site. The
topology supports lineage specific expansions of γ group olfactory genes in the bird and the reptile species. Note, a few of the γ group ORs
in kiwi cluster with reptilian ORs (highlighted by orange arrowhead), while some cluster basal to the clade containing bird ORs (highlighted
by green arrowhead). The topology supports contrasting evolutionary rates within the analyzed γ ORs, as indicated by short (blue arc with
arrowheads) and long branch lengths (pale orange arc with arrowheads). The inset shows the number of intact olfactory receptors in each
species that are analyzed using the ML tree topology
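The coverage-based correction described for the OR copy-number estimate (read coverage over each annotated OR divided by the genome-wide average coverage of its GC bin) can be sketched as follows. The article does not state how the resulting factors were turned into integer counts, so the rounding rule here, nearest integer with a minimum of one copy per annotated gene, is an assumption. All names and numbers are hypothetical.

```python
from typing import Dict, List, Tuple

def estimate_or_copies(
    annotated_ors: List[Tuple[str, float, int]],
    gc_bin_mean_coverage: Dict[int, float],
) -> Dict[str, int]:
    """Estimate copy number per annotated OR from read coverage.

    annotated_ors: (gene name, mean read coverage over the gene, GC bin) triples
    gc_bin_mean_coverage: genome-wide mean coverage for each GC bin
    """
    copies = {}
    for name, coverage, gc_bin in annotated_ors:
        # Correction factor: coverage over the OR relative to comparable-GC regions
        factor = coverage / gc_bin_mean_coverage[gc_bin]
        # Assumed rule: round to the nearest integer, at least one copy per gene
        copies[name] = max(1, round(factor))
    return copies

# Hypothetical inputs, not taken from the kiwi assembly
gc_coverage = {40: 36.0, 50: 30.0}
ors = [("OR_a", 35.0, 40), ("OR_b", 95.0, 50), ("OR_c", 70.0, 40)]
copies = estimate_or_copies(ors, gc_coverage)
print(copies, "total:", sum(copies.values()))
```

Normalising within GC bins rather than by a single genome-wide mean matters because sequencing coverage varies with GC content, which would otherwise bias the factor for GC-rich or GC-poor genes.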
to be detectable given more genetic variation [41]. Since the absolute number of ORs might be a poor predictor
of olfactory abilities, we investigated the variation in the γ ORs sequence as a measure of the range of possible
detectable odors. The average protein sequence entropy was calculated to check for variation within the γ-c clade
in each species (Additional file 1: Note: γ-c clade OR within-species protein sequence entropy).
Previous studies have shown that Shannon entropy (H) analysis is a sensitive tool for estimating the diversity
of a system [47, 48]. For protein sequence, H ranges from 0 (only one residue is present at that position in
the multiple sequence alignment) to 4.322 (all 20 residues are equally represented in that position). Typically
H ≤2 is attributed to high conservation [49]. H values in birds were in the range of 0.34±0.05 (zebra finch) to
1.11±0.12 (chicken). The average entropy in kiwi sequences was 1.23±0.15, significantly higher than all other
bird species investigated (P value = 0.003 Wilcoxon Signed-Rank test, Additional file 1: Note: γ-c clade OR
within-species protein sequence entropy). We conclude that overall the γ-c clade of ORs are highly similar in
sequence, in accordance with previously published data [45]. However, since detection of a wider range of odors
is correlated to genetic variation of ORs [41], the significantly higher H in kiwi ORs is suggestive for a broad
odor acuity in this species in comparison to other birds.

Kiwi morphology
The most prominent phenotype of kiwi, lack of wings, has been linked to energy conservation [50] and to the
limited resources in New Zealand in late Oligocene [51]. Like most ratites, kiwi are flightless, but the phylogenetic
tree of Palaeognathae implies that this phenotype evolved several times independently in this order [38].
Unlike ostriches and rheas, that possess prominent wings, kiwi show only vestigial invisible wings, while
moa lack even vestiges [52].
To determine whether we can identify the genetic basis for the extremely regressed wings in kiwi we
annotated genes in the highly conserved signaling pathways

were then manually inspected. No insertions, deletions, and/or stop codons that would clearly disrupt the open
reading frame could be identified in the inspected genes. Additionally, we found all 39 HOX genes expected for
the Sauropsid ancestor [54] and investigation of regulatory sequences within the HOX clusters by phylogenetic
footprinting showed no preferential loss of conserved DNA elements in Apteryx mantelli compared to
Galliformes (Additional file 1: Figure S4; Additional file 1: Table S11).
To detect signs of different evolution in kiwi wing and tail developmental genes we performed a selective
constraint analysis using the CODEML branch test (Additional file 1: Note: Selection analysis on limb
development genes; Additional file 1: Table S12). Of these genes FIBIN was the only gene that showed signals
of positive selection on the avian tree including chicken, turkey, and zebra finch (Additional file 1:
Figure S5). Three sites with signs of positive selection that were 100 % conserved in the other species show
a different amino acid in kiwi: exchanges of Ser136Ala, Gln148Arg, and Phe162Cys (positions are relative to
the mouse Fibin coding sequence). The functional relevance of these substitutions is unclear and needs
to be studied when experimental tests of FIBIN function become available.
Since no obvious alterations could be found in the coding sequences of genes involved in developmental
processes, which could explain the regressed-wing morphology of kiwi, we further analyzed ultra-conserved
non-coding elements (UCNEs) (Additional file 1: Note: Ultra-conserved non-coding elements analysis). UCNEs
are defined as DNA non-coding regions of ≥95 % sequence identity between human and chicken, longer
than 200 bp [55]. The majority of UCNEs cluster in genomic regions containing genes coding for transcription
factors and developmental regulators [56] and experimental studies in transgenic animals have shown that
some of these sequences can act as tissue-specific enhancers during developmental processes [57]. Of the
4,351 UCNEs annotated in UCNEbase [55], 19 showed
related to limb development (Additional file 1: Note: more than the expected 5 % sequence variation as de-
Kiwi morphology analysis; Additional file 1: Figure S3). fined in the database [55] (Additional file 1: Table S13).
These include genes belonging to the FGFs, TBX cluster, Among these, four were related to HOXA, TBX2, Sp8,
HOX cluster (Additional file 1: Figure S4; Additional file and TFAP2A genes which have been previously de-
1: Table S11), WNT, SALL, and FIBIN genes, known to scribed in limb development pathways [53, 58, 59], sug-
be responsible for limb and wing development [53] gesting that changes in non-coding elements could be
(Additional file 1: Table S12). Growth and transcription involved in kiwi’s loss of wings.
factors typically influence the development of both
upper and lower limbs, while FIBIN is currently the only Discussion
gene described to be exclusively involved in the develop- With their small body size, extremely large egg size, noc-
ment of the upper limb [53]. turnal life style, and prominent nostrils at the end of
For these clusters of genes, we aligned corresponding their beaks, among several other traits, kiwi represent
orthologs and translated multiple alignments, which probably the most unusual member of the ratites [60]. A
Le Duc et al. Genome Biology (2015) 16:147 Page 9 of 15
recent mitochondrial DNA phylogeny placed kiwi as the closest relatives of the extinct Madagascan elephant birds [38]. Whether dispersal or vicariance best describe ratite distribution has been debated for over a century [61]. A phylogeny including 169 bird species, built on 32 kb from 19 independent loci, showed ostrich as basal in the Palaeognathae clade [62]. In contrast, our phylogeny, based on 623 1:1 orthologs in 16 species, totaling approximately 700 kb, places the tinamou as basal to Palaeognathae with 100 % bootstrap confidence (Fig. 1; Additional file 1: Figure S6). However, when the phylogeny was constructed for 10 bird species using just UCNEs (totaling >1 Mb) the topology of the tree matches that obtained from fewer loci from a larger number of species which agrees with a previous publication [62] (Additional file 1: Figure S7). Including more ratites and a larger number of (hand-curated) loci should provide better resolution of the tree topology, and indeed the topology we obtain here is well-supported. However, we note that the topology changes depending on the gene sets that are included (Additional file 1: Figs. S6 and S7) and that when using ultra-conserved sequences the phylogeny differs from that obtained from a larger, more representative set of genes. Hence, future availability of additional genomes and ortholog sets from multiple ratites will allow a better understanding of their origin.

Nevertheless, a previous study has estimated that kiwi diverged from the Madagascan elephant birds about 50 million years ago [38] (Additional file 1: Figure S8). This estimate post-dates the split of Madagascar and New Zealand from Gondwana, which took place around 100 and 80 million years ago, respectively, and implies that ratites must have dispersed by flight and also that kiwi arrived on New Zealand less than 50 million years ago. This conclusion is supported by the fossil record in New Zealand, which includes a flighted kiwi ancestor [63]. At the time kiwi arrived, moa already inhabited New Zealand and it has been hypothesized that moa were monopolizing the diurnal ground niche, which forced kiwi to adapt to an alternative nocturnal lifestyle [38]. This would suggest that kiwi adapted to the nocturnal niche soon after arriving on the island. The loss of function that we observe in OPN1SW is indicative of adaptation to nocturnality [64]. We dated the loss of function in several color vision opsins to 30–38 million years ago, which is consistent with the arrival of the kiwi in New Zealand less than 50 million years ago, and their subsequent adaptation to a nocturnal niche.

In contrast to birds, which almost certainly have a diurnal origin, the nocturnal bottleneck hypothesis suggests that mammals were nocturnal for about 160 million years in their evolution as they were restricted to nighttime activity to avoid dinosaurs which were the dominant diurnal taxon at this time [4]. According to this hypothesis, several traits typical for mammals, including a well-developed sense of smell, limited color vision, increased eye size, and an energetic metabolism optimized for sun radiation-independent body temperature regulation, have been shaped by the nocturnal environment [65, 66]. Nocturnally adapted Mesozoic mammals also tended to have a small body size, an insectivorous diet, and low energy metabolism [67]. Interestingly, kiwi has the smallest body size among flightless ratites, the lowest metabolic rate among birds [68, 69], and an insectivorous diet, suggesting a pattern of evolution that is similar to the evolution of mammals under nocturnality. Consistent with this hypothesis, our genome-wide scans for patterns of positive selection showed enrichment in GO categories like mitochondrion functions and energy reserve metabolic process (Additional file 1: Table S8A), both related to metabolic rate. Moreover, we found strong evidence for a loss of color vision in kiwi and their retinal structure also clearly supports adaptation to vision under low light levels [3]. Although the small eye size of kiwi [27] is unusual for a nocturnal species, based on the retinal anatomy Corfield et al. rejected a regressive evolution model for kiwi vision and suggested that kiwi have an acuity in detecting low light levels similar to other nocturnal species [3]. This suggests that molecular mutations and retinal structure changed faster than eye size. In birds, eye size was described to scale to body mass with an exponent similar to brain mass and metabolic rate [70]. Thus, the low metabolic rate of kiwi [68] could be the constraint for their relatively small eyes. Alternatively, kiwi might serve as an example that adaptations in the retinal structure could be sufficient, and changes in eye size are not absolutely necessary. This conclusion may be supported by the absence of variation in eye shape according to activity pattern observed in lizards and non-primate mammals [71].

It has long been hypothesized that unlike most bird species kiwi is more similar to mammals in their reliance on olfactory and mechanical cues for foraging, perceived by the nostrils and mechanoreceptors located at the end of its bill [72]. We found that the kiwi, unlike other ratites, has an increased diversity in the bird-specific γ-c clade ORs. Since OR diversity is hypothesized to correlate positively with olfactory acuity in vertebrates [42, 73], the significantly higher diversity in kiwi ORs compared to other birds (Additional file 1: Figure S9) suggests that kiwi may be able to distinguish a larger range of odors than other birds.

Steiger et al. formulated two possible scenarios that could explain γ ORs evolution in birds: the first hypothesizes that species-specific γ ORs arose from independent
expansion events in each species, while the second assumes that the ancient γ OR clade was more diverse and became homogenized by concerted evolution within species [45]. Some γ ORs of kiwi, ostrich, tinamou, and nocturnal birds clustered with their reptilian counterparts, while others clustered basal to the clade containing most bird γ ORs (Fig. 3). This supports a two-fold conclusion: (1) γ ORs in kiwi are more diverse in sequence than in other birds investigated, which was verified by the significantly higher sequence entropy; and (2) since kiwi is basal to the Neognathae (Fig. 1), the ancestral state of γ OR clade is probably diversified compared to other modern birds.

Conclusions
Since its arrival in New Zealand sometime after 50 million years ago, the kiwi adapted to a nocturnal, ground-dwelling niche. The onset of adaptation to nocturnality appears to have been approximately 30–38 million years ago, about one-fifth of the time proposed for the evolution of mammals in a nocturnal environment. The molecular changes present in the kiwi genome are in accordance with the adaptations that are hypothesized to have occurred during early mammalian adaptation to nocturnality. This suggests similar patterns of adaptation to the nocturnal niche both in kiwi and mammals. Further comparative analyses, including other diurnal Palaeognathae, as well as additional nocturnal bird groups and their diurnal sister species, should shed further light on the genomic imprints of adaptation to a nocturnal life style.

Methods and materials
Genome sequence assembly and annotation
We sequenced Apteryx mantelli female individuals, which originate from the far North (kiwi code 73) and central part – Lake Waikaremoana (kiwi code AT5 and kiwi code 16–12) of North Island (Additional file 1: Figure S10). They were sampled in 1986 (kiwi code 73) and 1997 (kiwi code AT5 and 16–12) in ‘operation nest egg’ carried out by Rainbow and Fairy Springs, Rotorua. No animals were killed or captured as a result of this study and genome assembly was performed with iwi approval from the Te Parawhau and Waikaremoana Māori Elders Trust.

We extracted genomic DNA from Apteryx mantelli embryos. Libraries with insert sizes of 240 bp, 420 bp, 800 bp, 2 kb, 3 kb, and 4 kb were obtained from individual kiwi code 73, and mate-paired-end libraries 7 kb, 9 kb, 11 kb, and 13 kb, from individual kiwi code 16–12. DNA from individual AT5 was used to build a 350 bp insert-size library with the purpose of confirming kiwi-specific sequence polymorphisms and was not included in the genome assembly (Additional file 1: Note: Sampling, DNA library preparation and sequencing; Additional file 1: Table S1). Paired-end sequencing was performed on HiScanSQ and HiSeq platforms with read lengths of 101 bp and 96 bp, respectively.

Sequencing errors were corrected using Quake [5] (Additional file 1: Note: Filtering and read correction; Additional file 1: Figure S1). A total of 52.53 Gb of high-quality sequence was used for de novo assembly with SOAPdenovo [6]. The short-insert-size libraries (240 bp, 420 bp, 800 bp) were used to build contigs. Based on paired-end information scaffolds were generated using all libraries (2 kb, 3 kb, 4 kb, 7 kb, 9 kb, 11 kb, 13 kb). Remaining gaps in the scaffolds were closed using the paired-end information (Additional file 1: Note: Genome assembly). This final assembly (AptMant0) was used for all subsequent analyses.

Gene annotation was performed with the MAKER pipeline [10], using several sources of evidence: de novo gene predictions, RNA-Seq data, and protein evidence from three species (G. gallus, T. guttata, and M. gallopavo) (Ensembl version 72). Briefly, after repeat masking, gene models were predicted by Augustus version 2.7 [74] using the training dataset for chicken. Apteryx mantelli RNA-Seq data were then aligned to AptMant0 using NCBI BLASTN version 2.2.27+ [75] and BLASTX was used to align protein sequences to identify regions of homology. Finally, using both the ab initio and evidence-informed gene predictions, Maker updated features such as 5’ and 3’ UTRs based on RNA-Seq evidence and a consensus gene set was retrieved (Additional file 1: Note: De novo gene prediction and gene annotation).

Comparative genome analysis
Triplet orthologs between chicken, zebra finch, and turkey were downloaded from Ensembl 73. Kiwi genes were considered orthologs to a triplet if the ortholog assignment from Maker agreed with the orthologous gene assigned in each of the three considered species. The ostrich, tinamou, chuck-will’s-widow, and barn owl orthologs were assigned by orthology to the chicken proteins. After assigning orthology in the eight avian species, coding sequences were aligned and two different sets of alignments were compiled for further analysis:
Set 1: alignments of all eight species that do not contain a single frameshift indel.
Set 2: the longest uninterrupted run of at least 200 aligned bases in each multiple sequence alignment, for which we first ensured that gaps in the alignment were not introduced by unresolved bases in our assembly.

The CODEML program from the package PAML [24] was run first on four avian lineages: G. gallus, T. guttata, M. gallopavo, and A. mantelli to compare the kiwi genome to high-quality annotated ones. Six pairwise
combinations were run to obtain estimates of non-synonymous (Ka) and synonymous (Ks) changes in the four avian lineages. Ka and Ks distributions were compared pairwise between all four avian species on a set of 3,754 orthologous genes which presented no frameshifts or indels (Additional file 1: Figure S11).

We next scanned for differently evolving genes with the CODEML program under a branch model (model = 2, two ωs for foreground and background branches, respectively, vs. model = 0, one ω for all branches, compared via likelihood ratio test) [24] using the set of orthologs as defined above in the eight bird species (Additional file 1: Note: Orthologs and Ka/Ks calculation).

Branch specific ω values were used to identify GO categories that are evolving significantly differently in each of the following bird species: kiwi, ostrich, tinamou, barn owl, and chuck-will’s-widow. GO categories enrichment was tested using the FUNC [76] package. A hypergeometric test was run for each species separately on genes having a significantly higher ω. Multiple testing correction was done using family-wise error rate. Categories with P value <0.05 were considered for further analysis if at least three significantly changed genes were present in the GO category, and the number of significant genes was greater than or equal to 5 % of the total genes annotated in the respective GO category. The same test was applied on genes with a significantly smaller ω in each of the species. Kiwi-specific categories were considered those which showed no enrichment in any of the other ratites or night birds (Additional file 1: Note: Gene Ontology and rapidly evolving genes).

We used the TreeFam methodology to define gene families [12] across 16 genomes: Gallus gallus, Anas platyrhynchos, Ficedula albicollis, Meleagris gallopavo, Taeniopygia guttata, Pelodiscus sinensis, Anolis carolinensis, Homo sapiens, Mus musculus, Gasterosteus aculeatus, Ornithorhynchus anatinus, downloaded from Ensembl 73 [14], Tinamus guttatus, Struthio camelus, Antrostomus carolinensis, Tyto alba, downloaded from GigaDB [13], and Apteryx mantelli. The longest transcript was chosen for further analysis. For the single-copy orthologous families, genes were aligned against each other. To build a consensus phylogenetic tree (Fig. 1) the resulting alignments were loaded in PAUP* [15] version 4.0d105 and trees were inferred using maximum likelihood, with default parameters. To measure the confidence for certain subtrees, a series of 100 bootstrap replicates were performed (Additional file 1: Note: Nuclear loci phylogeny).

We determined the branch-specific expansion and contraction of the orthologous protein families among the 16 species using CAFE (computational analysis of gene family evolution) version 3.0 [77] with lambda option of 0.0007 (Additional file 1: Note: Gene families evolution using CAFE). Pfam IDs corresponding to the TreeFam families were assigned to GO categories. We tested whether significant (P <0.05) contraction/expansion events cluster in different GO categories using ClueGO with a hypergeometric test [78] (Additional file 1: Figure S2).

Membrane proteome annotation
Complete protein sequence sets for the following bird and reptile species were downloaded from Ensembl 74 [14]: Taeniopygia guttata, Meleagris gallopavo, Ficedula albicollis, Anas platyrhynchos, Pelodiscus sinensis, Gallus gallus, and Anolis carolinensis. Homo sapiens from the same Ensembl version was used as outgroup. Protein sequences of ratites (Tinamus guttatus, Struthio camelus) and nocturnal birds (Antrostomus carolinensis, Tyto alba) were downloaded from GigaDB [13]; although these genomes are more fragmented than the ones from Ensembl, annotation of the membrane proteome in birds adapted, like kiwi, to the nocturnal niche and the ones belonging to the same clade as kiwi, allows us to differentiate between events that are clade-specific or shaped by nocturnality. Only the longest protein sequence for each gene was considered for analysis. Membrane proteins and signal peptides were predicted for all species with Phobius [79]. These proteins were classified based on a manually curated human membrane proteome dataset, which describes family relationship and molecular function. The predicted membrane proteins were aligned to the human membrane proteome dataset with the BLASTP program of the BLAST package using default settings (v. 2.2.27+) [75]. Each predicted membrane protein was classified according to its best human hit with an e-value <10⁻⁶. Predicted membrane proteins with no hit were deemed unclassified, along with those proteins that hit an unclassified human protein (Additional file 1: Note: Detection and classification of the membrane proteome; Additional file 1: Table S7).

Vision evolutionary analysis
Opsins are G protein-coupled receptors known to play a role in light signal transduction and night-day cycle (Table 2). For these genes ω was estimated by appointing sequentially kiwi, ostrich, tinamou, chuck-will’s-widow, and barn owl as the foreground branch under the CODEML branch model (model = 2) [24] as described for comparative genome analysis. Inactivating mutations were verified by checking that they were present in reads from both sequenced individuals and in other kiwi species, by Sanger sequencing (OPN1MW) (Fig. 2; Additional file 1: Note: Vision analysis).
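The branch-model comparison used for the opsins (model = 2 against model = 0, as in the comparative genome analysis) comes down to a likelihood ratio test between two nested models that differ by one ω parameter. The sketch below shows only that final step in plain Python, not CODEML itself. The function name and the log-likelihood values are hypothetical, and the use of a chi-square distribution with 1 degree of freedom is an assumption that follows from the single extra parameter.

```python
import math


def lrt_branch_model(lnl_one_ratio, lnl_two_ratio, alpha=0.05):
    """Likelihood ratio test between a one-ratio and a two-ratio model.

    lnl_one_ratio: log-likelihood of the null model (one omega for all branches).
    lnl_two_ratio: log-likelihood of the alternative (separate foreground omega).
    Returns the test statistic, the p-value, and whether p < alpha.
    """
    # Twice the log-likelihood difference; clipped at 0 to absorb
    # small negative values caused by optimizer noise.
    stat = max(0.0, 2.0 * (lnl_two_ratio - lnl_one_ratio))
    # Survival function of a chi-square distribution with 1 degree of
    # freedom: P(X > x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value, p_value < alpha


# Hypothetical log-likelihoods for one gene (illustrative values only).
stat, p_value, significant = lrt_branch_model(-2500.0, -2497.0)
print(f"2*dlnL = {stat:.2f}, p = {p_value:.4f}, significant = {significant}")
```

In practice the two log-likelihoods would be read from the CODEML output of each run, and the resulting p-values would still need correction for multiple testing across genes.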
Olfaction evolutionary analysis
Olfactory receptors (ORs) in kiwi were annotated using both the Augustus de novo gene prediction and the Maker information after scaffold positions were checked and redundant sequences were removed.
We then performed four steps (Additional file 1: Figure S12):

i. Functional ORs from chicken [45] were downloaded and aligned against the kiwi transcriptome using TblastN with default parameters. After collecting overall hits for each query (every chicken OR served as query), identical (same) hits from each run were removed to obtain a non-redundant dataset.
ii. A Pfam search against the kiwi proteome with a default e-value cutoff of 1.0 was used to identify sequences that contained 7tm_4 domain (olfactory domain).
iii. The 7tm_4 domain was searched against the kiwi proteome by a CDD search (conserved domain database search).
iv. Separate HMM profiles were built from conserved 7tm regions of functional ORs of chicken, turkey, and zebra finch obtained from previous studies [45]. Using the three HMM profiles, HMM searches were performed against the kiwi proteome and non-redundant hits were retrieved from combined results of all three searches.

A CD-HIT (Cluster Database at High Identity with Tolerance) was performed to remove identical sequences with a cutoff of 100 %. Preliminary phylogenetic analysis was performed using a maximum likelihood approach (Additional file 1: Note: Olfactory receptor genes identification and annotation). Non-ORs were removed if they clustered separately from ORs. We excluded pseudogene candidates if at least one premature stop codon and/or frameshifts could be identified in the kiwi sequence.

OR repertoire estimates were curated based on genomic coverage calculated using samtools mpileup version 0.1.18 [80] on the alignment of the 240 bp, 420 bp, 800 bp insert-size libraries to AptMant0 (Additional file 1: Note: Olfactory receptor genes identification and annotation). The correction factor for each annotated OR was obtained by dividing the read coverage in that region by the average genome-wide coverage for the corresponding GC-content bin. For example, if an OR sequence had a GC content of 50 %, we calculated the average genome-wide coverage corresponding to the GC bin of 50 % to be 35-fold (Additional file 1: Note: Genome coverage and estimation of genome size; Additional file 1: Figure S13). Given a coverage in the respective OR region of 105-fold, we obtained a correction factor of 3 after dividing the OR sequence coverage (that is, 105-fold) by the GC-bin corresponding coverage (that is, 35-fold). The final number of estimated ORs was obtained by multiplying the number of initially annotated genes by their corresponding correction factors.

Using the same annotation procedure, the OR gene repertoire was estimated in all bird and reptile genomes from Ensembl 74, two nocturnal birds (chuck-will’s-widow and barn owl) and two Palaeognathae (ostrich and tinamou) for comparative phylogenetic analysis with the kiwi OR dataset. All obtained OR genes were then aligned using MAFFT [81] v7, with BLOSUM62 as the scoring matrix and default settings of option E-INS-I. Phylogenetic analyses were run using both maximum likelihood (ML) and neighbor joining (NJ) methods (Additional file 1: Note: Comparative phylogenetic analysis on ORs from kiwi and other bird and reptile genomes). The reliability of the phylogenetic trees was evaluated with 500 bootstrap replicates.

We calculated Shannon entropy (H) using within species multiple sequence alignments of γ ORs for all bird and reptile genomes separately with a built-in function from BioEdit [82] (Additional file 1: Note: γ-c clade OR within-species protein sequence entropy).

Kiwi morphology
Previously characterized wing development genes [53] were assigned orthologs in kiwi, chicken, zebra finch, and turkey (Additional file 1: Figure S3; Additional file 1: Table S12). We aligned the sequences and multiple alignments were translated and manually inspected for sequence differences as well as insertions/deletions and rearrangements. We examined selective pressures under the branch models implemented in CODEML [24]. The one-ratio model (model = 0, NSsites = 0) was used to estimate the same ω ratio for all branches in the phylogeny. Then, the two-ratio model (model = 2, NSsites = 0), with a background ω ratio and a different ω on the kiwi branch, was used to detect selective pressure acting specifically on the kiwi branch. These two models were compared via a LRT (1 degree of freedom), as mentioned above [83].

Scaffolds and isolated contigs harboring (putative) HOX genes were identified by BLAST and mapped to all 673 sauropsid HOX protein sequences from GenBank. Translated HOX sequences of Apteryx were aligned to the HOX proteins extracted from GenBank and differences were identified by manual inspection. Potential regulatory sequences in the HOX cluster region were identified by phylogenetic footprinting using tracker2 [84] (Additional file 1: Figure S4).

To retrieve the entire coding region of the FIBIN gene in kiwi, we designed primers based on the chicken and ostrich sequence (Additional file 1: Table S14). Using the 276-bp fragment amplified by Sanger sequencing, we blasted transcriptome sequences from kiwi and iteratively
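The coverage-based correction of the OR repertoire described in the olfaction analysis can be illustrated with a short sketch. It reproduces the worked example from the text (105-fold coverage over the OR against 35-fold for the matching GC bin gives a factor of 3). The function names and the input layout are hypothetical, and whether the authors rounded individual factors before summing is not stated, so no rounding is applied here.

```python
def correction_factor(or_coverage, gc_bin_coverage):
    """Read coverage over an OR divided by the genome-wide average
    coverage of regions with the same GC content."""
    if gc_bin_coverage <= 0:
        raise ValueError("GC-bin coverage must be positive")
    return or_coverage / gc_bin_coverage


def estimated_or_count(annotated_ors):
    """Sum of correction factors over all annotated ORs.

    annotated_ors: iterable of (or_coverage, gc_bin_coverage) pairs, one
    per annotated gene. Each gene counts once, scaled by its factor.
    """
    return sum(correction_factor(cov, gc_cov) for cov, gc_cov in annotated_ors)


# Worked example from the text: 105-fold over the OR, 35-fold for its GC bin.
print(correction_factor(105, 35))  # 3.0

# Two hypothetical annotated ORs: one likely collapsed copy, one at expected depth.
print(estimated_or_count([(105, 35), (35, 35)]))  # 4.0
```

A factor well above 1 suggests that several near-identical OR copies were collapsed into one assembled sequence, which is why the annotated count is scaled upward.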
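The Shannon entropy (H) used for the γ OR alignments can be illustrated with a minimal sketch. It is not BioEdit's implementation. It assumes base-2 logarithms, which is consistent with the stated maximum of 4.322 for 20 equally represented residues, and it assumes that gaps and ambiguous characters are ignored and that H is averaged over alignment columns. All function names are hypothetical.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def column_entropy(column):
    """Shannon entropy (bits) of one alignment column.

    Gaps and non-standard characters are skipped (an assumption).
    """
    residues = [r for r in column.upper() if r in AMINO_ACIDS]
    if not residues:
        return 0.0
    total = len(residues)
    counts = Counter(residues)
    # H = sum over residues of p * log2(1 / p)
    return sum((c / total) * math.log2(total / c) for c in counts.values())


def mean_alignment_entropy(sequences):
    """Average per-column entropy of equal-length aligned sequences."""
    columns = ["".join(col) for col in zip(*sequences)]
    if not columns:
        raise ValueError("alignment has no columns")
    return sum(column_entropy(col) for col in columns) / len(columns)


# Bounds quoted in the text: 0 for a fully conserved position and
# log2(20), about 4.322, when all 20 residues are equally represented.
print(column_entropy("AAAA"))
print(round(column_entropy(AMINO_ACIDS), 3))

# Toy alignment: first column conserved, second column split two ways.
print(mean_alignment_entropy(["AC", "AD"]))
```

Per-species averages computed this way would correspond to the kind of H values compared between kiwi and the other birds, although the exact treatment of gaps in the original analysis is not described here.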
13. Sneddon TP, Zhe XS, Edmunds SC, Li P, Goodman L, Hunter CI. GigaDB: promoting data dissemination and reproducibility. Database (Oxford). 2014;2014:bau018.
14. Flicek P, Ahmed I, Amode MR, Barrell D, Beal K, Brent S, et al. Ensembl 2013. Nucleic Acids Res. 2013;41:D48–55.
15. Wilgenbusch JC, Swofford D. Inferring evolutionary trees with PAUP*. Curr Protoc Bioinformatics. 2003;Chapter 6:Unit 6 4.
16. Hughes AL, Friedman R. Genome size reduction in the chicken has involved massive loss of ancestral protein-coding genes. Mol Biol Evol. 2008;25:2681–8.
17. Zhan X, Pan S, Wang J, Dixon A, He J, Muller MG, et al. Peregrine and saker falcon genome sequences provide insights into evolution of a predatory lifestyle. Nat Genet. 2013;45:563–6.
18. Huang Y, Li Y, Burt DW, Chen H, Zhang Y, Qian W, et al. The duck genome and transcriptome provide insight into an avian influenza virus reservoir species. Nat Genet. 2013;45:776–83.
19. Denton JF, Lugo-Martinez J, Tucker AE, Schrider DR, Warren WC, Hahn MW. Extensive error in the number of genes inferred from draft genome assemblies. PLoS Comput Biol. 2014;10, e1003998.
20. Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, et al. Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat Genet. 2000;25:25–9.
21. Zakon HH, Jost MC, Lu Y. Expansion of voltage-dependent Na+ channel gene family in early tetrapods coincided with the emergence of terrestriality and increased brain complexity. Mol Biol Evol. 2011;28:1415–24.
22. Luxey M, Jungas T, Laussu J, Audouard C, Garces A, Davy A. Eph:ephrin-B1 forward signaling controls fasciculation of sensory and motor axons. Dev Biol. 2013;383:264–74.
23. Patel K, Nittenberg R, D’Souza D, Irving C, Burt D, Wilkinson DG, et al. Expression and regulation of Cek-8, a cell to cell signalling receptor in developing chick limb buds. Development. 1996;122:1147–55.
24. Yang Z. PAML 4: phylogenetic analysis by maximum likelihood. Mol Biol Evol. 2007;24:1586–91.
25. Pavlidis P, Jensen JD, Stephan W, Stamatakis A. A critical assessment of storytelling: gene ontology categories and the importance of validating genomic scans. Mol Biol Evol. 2012;29:3237–48.
26. Torii M, Kojima D, Okano T, Nakamura A, Terakita A, Shichida Y, et al. Two isoforms of chicken melanopsins show blue light sensitivity. FEBS Lett. 2007;581:5327–31.
27. Martin GR, Wilson KJ, Martin Wild J, Parsons S, Fabiana Kubke M, Corfield J. Kiwi forego vision in the guidance of their nocturnal activities. PLoS One. 2007;2, e198.
28. Osorio D, Vorobyev M. A review of the evolution of animal colour vision and visual communication signals. Vision research. 2008;48:2042–51.
29. Beukers MW, Kristiansen I, IJzerman AP, Edvardsen I. TinyGRAP database: a bioinformatics tool to mine G-protein-coupled receptor mutant data. Trends Pharmacol Sci. 1999;20:475–7.
30. Jansen JJ, Mulder WR, De Caluwe GL, Vlak JM, De Grip WJ. In vitro expression of bovine opsin using recombinant baculovirus: the role of glutamic acid (134) in opsin biosynthesis and glycosylation. Biochim Biophys Acta. 1991;1089:68–76.
31. Capra V, Veltri A, Foglia C, Crimaldi L, Habib A, Parenti M, et al. Mutational analysis of the highly conserved ERY motif of the thromboxane A2 receptor: alternative role in G protein-coupled receptor signaling. Mol Pharmacol. 2004;66:880–9.
32. Schulz A, Schoneberg T, Paschke R, Schultz G, Gudermann T. Role of the third intracellular loop for the activation of gonadotropin receptors. Mol Endocrinol. 1999;13:181–90.
33. Vogel R, Mahalingam M, Ludeke S, Huber T, Siebert F, Sakmar TP. Functional role of the “ionic lock”–an interhelical hydrogen-bond network in family A heptahelical receptors. J Mol Biol. 2008;380:648–55.
34. Ebrey T, Koutalos Y. Vertebrate photoreceptors. Prog Retin Eye Res. 2001;20:49–94.
35. Schoneberg T, Schulz A, Biebermann H, Hermsdorf T, Rompler H, Sangkuhl K. Mutant G-protein-coupled receptors as a cause of human diseases. Pharmacol Ther. 2004;104:173–206.
36. Tao YX. Inactivating mutations of G protein-coupled receptors and diseases: structure-function insights and therapeutic implications. Pharmacol Ther. 2006;111:949–73.
38. Mitchell KJ, Llamas B, Soubrier J, Rawlence NJ, Worthy TH, Wood J, et al. Ancient DNA reveals elephant birds and kiwi are sister taxa and clarifies ratite bird evolution. Science. 2014;344:898–900.
39. Corfield JR, Eisthen HL, Iwaniuk AN, Parsons S. Anatomical specializations for enhanced olfactory sensitivity in kiwi, Apteryx mantelli. Brain Behav Evol. 2014;84:214–26.
40. Niimura Y, Nei M. Extensive gains and losses of olfactory receptor genes in mammalian evolution. PLoS One. 2007;2, e708.
41. Hasin-Brumshtein Y, Lancet D, Olender T. Human olfaction: from genomic variation to phenotypic diversity. Trends Genet. 2009;25:178–84.
42. Steiger SS, Fidler AE, Kempenaers B. Evidence for increased olfactory receptor gene repertoire size in two nocturnal bird species with well-developed olfactory ability. BMC Evol Biol. 2009;9:117.
43. Preston GM. Cloning gene family members using PCR with degenerate oligonucleotide primers. In: White BA (ed.) PCR cloning protocols: from molecular cloning to genetic engineering; In series: Methods in molecular biology (Clifton, N.J.) 67; Humana Press: 1997 pg 433-49. ISBN 0896034436
44. Liu S, Wei W, Chu Y, Zhang L, Shen J, An C. De novo transcriptome analysis of wing development-related signaling pathways in Locusta migratoria manilensis and Ostrinia furnacalis (Guenee). PLoS One. 2014;9, e106770.
45. Steiger SS, Kuryshev VY, Stensmyr MC, Kempenaers B, Mueller JC. A comparison of reptilian and avian olfactory receptor gene repertoires: species-specific expansion of group gamma genes in birds. BMC Genomics. 2009;10:446.
46. Morrison SS, Pyzh R, Jeon MS, Amaro C, Roig FJ, Baker-Austin C, et al. Impact of analytic provenance in genome analysis. BMC Genomics. 2014;15:S1.
47. Margulies DH, Natarajan K, Rossjohn J, McCluskey J. Fundamental Immunology. 7th ed. Philadelphia, PA: Wolters Kluwer Health/Lippincott Williams & Wilkins; 2012. p. 511.
48. Shannon CE. The mathematical theory of communication. Bell System Tech J. 1948;27:379–243. 623–56.
49. Litwin S, Jores R. Shannon information as a measure of amino acid diversity. In: Perelson AS, Weisbuch G, editors. Theoretical and experimental insights into immunology, vol. 66. NATO ASI Series. Berlin: Springer Berlin Heidelberg; 1992. p. 279–87.
50. McNab BK. Resource use and the survival of land and freshwater vertebrates on oceanic islands. American Naturalist. 1994;144:643–60.
51. Cooper A, Cooper RA. The Oligocene bottleneck and New Zealand biota: genetic record of a past environmental crisis. Proc Biol Sci. 1995;261:293–302.
52. Grzimek B, Schlager N, Olendorf D, McDade MC. Grzimek’s animal life encyclopedia. Gale: Gale, MI; 2004.
53. Tanaka M. Molecular and evolutionary basis of limb field specification and limb initiation. Dev Growth Differ. 2013;55:149–63.
54. Pascual-Anaya J, D’Aniello S, Kuratani S, Garcia-Fernandez J. Evolution of Hox gene clusters in deuterostomes. BMC Dev Biol. 2013;13:26.
55. Dimitrieva S, Bucher P. UCNEbase–a database of ultraconserved non-coding elements and genomic regulatory blocks. Nucleic Acids Res. 2013;41:D101–9.
56. Woolfe A, Elgar G. Organization of conserved elements near key developmental regulators in vertebrate genomes. Adv Genet. 2008;61:307–38.
57. Pennacchio LA, Ahituv N, Moses AM, Prabhakar S, Nobrega MA, Shoukry M, et al. In vivo enhancer analysis of human conserved non-coding sequences. Nature. 2006;444:499–502.
58. Bell SM, Schreiner CM, Waclaw RR, Campbell K, Potter SS, Scott WJ. Sp8 is crucial for limb outgrowth and neuropore closure. Proc Natl Acad Sci U S A. 2003;100:12195–200.
59. Gestri G, Osborne RJ, Wyatt AW, Gerrelli D, Gribble S, Stewart H, et al. Reduced TFAP2A function causes variable optic fissure closure and retinal defects and sensitizes eye development to mutations in other morphogenetic regulators. Hum Genet. 2009;126:791–803.
60. Reid B, Williams GR. The kiwi. In: Kuschel G, editor. Biogeography and Ecology in New Zealand, vol. 27. The Hague: Springer Netherlands; 1975. p. 301–30.
61. van Tuinen M, Sibley CG, Hedges SB. Phylogeny and biogeography of ratite birds inferred from DNA sequences of the mitochondrial ribosomal genes. Mol Biol Evol. 1998;15:370–6.
62. Hackett SJ, Kimball RT, Reddy S, Bowie RC, Braun EL, Braun MJ, et al. A
37. Vassart G, Costagliola S. G protein-coupled receptors: mutations and phylogenomic study of birds reveals their evolutionary history. Science.
endocrine diseases. Nat Rev Endocrinol. 2011;7:362–72. 2008;320:1763–8.
275
Le Duc et al. Genome Biology (2015) 16:147 Page 15 of 15
63. Worthy TH, Worthy JP, Tennyson AJD, Salisbury SW, Hand SJ, Scofield 89. Kiwi Annotated UCNEs. Available at: https://bioinf.eva.mpg.de/KIWI-UCNEs/
RP. Miocene fossils show that kiwi (Apteryx, Apterygidae) are probably 90. Ellegren H, Smeds L, Burri R, Olason PI, Backstrom N, Kawakami T, et al. The
not phyletic dwarves. In: Göhlich UB, Kroh A, editors. Proceedings of genomic landscape of species divergence in Ficedula flycatchers. Nature.
the 8th International Meeting Society of Avian Paleontology and 2012;491:756–60.
Evolution. Vienna, 2012, Verlag des Naturhistorischen Museums in Wien, 91. Dalloul RA, Long JA, Zimin AV, Aslam L, Beal K, Le Blomberg A, et al.
Vienna; 2013. p. 63–80. Multi-platform next-generation sequencing of the domestic turkey
64. Jacobs GH. Losses of functional opsin genes, short-wavelength cone (Meleagris gallopavo): genome assembly and analysis. PLoS Biol.
photopigments, and color vision–a significant trend in the evolution of 2010;8:1–21.
mammalian vision. Vis Neurosci. 2013;30:39–53.
65. Striedter GF. Principles of brain evolution. Sinauer Associates Inc.,U.S. ISBN:
978-0-87893-820-9. 2004/2005
66. Walls GL. The vertebrate eye and its adaptive radiation. Oxford: Cranbook
Institute of Science; 1942.
67. Crompton AW, Taylor CR, Jagger JA. Evolution of homeothermy in
mammals. Nature. 1978;272:333–6.
68. McNab BK. Metabolism and temperature regulation of kiwis (Apterygidae).
The Auk. 1996;113:687–92.
69. Sales J. The endangered kiwi: a review. Folia Zoologica Praha. 2005;54:1.
70. Brooke ML, Hanley S, Laughlin SB. The scaling of eye size with body mass in
birds. Proc Biol Sci. 1999;266:405–12.
71. Hall MI, Kamilar JM, Kirk EC. Eye shape and the nocturnal bottleneck of
mammals. Proc Biol Sci. 2012;279:4962–8.
72. Cunningham S, Castro I, Alley M. A new prey‐detection mechanism for kiwi
(Apteryx spp.) suggests convergent evolution between paleognathous and
neognathous birds. J Anat. 2007;211:493–502.
73. Gilad Y, Przeworski M, Lancet D. Loss of olfactory receptor genes
coincides with the acquisition of full trichromatic vision in primates.
PLoS Biol. 2004;2, E5.
74. Stanke M, Keller O, Gunduz I, Hayes A, Waack S, Morgenstern B. AUGUSTUS:
ab initio prediction of alternative transcripts. Nucleic Acids Res.
2006;34:W435–9.
75. Gertz EM, Yu YK, Agarwala R, Schaffer AA, Altschul SF. Composition-based
statistics and translated nucleotide searches: improving the TBLASTN
module of BLAST. BMC Biol. 2006;4:41.
76. Prüfer K, Muetzel B, Do HH, Weiss G, Khaitovich P, Rahm E, et al. FUNC: a
package for detecting significant associations between gene sets and
ontological annotations. BMC Bioinform. 2007;8:41.
77. De Bie T, Cristianini N, Demuth JP, Hahn MW. CAFE: a computational tool
for the study of gene family evolution. Bioinformatics. 2006;22:1269–71.
78. Bindea G, Mlecnik B, Hackl H, Charoentong P, Tosolini M, Kirilovsky A, et al.
ClueGO: a Cytoscape plug-in to decipher functionally grouped gene
ontology and pathway annotation networks. Bioinformatics. 2009;25:1091–3.
79. Kall L, Krogh A, Sonnhammer EL. An HMM posterior decoder for sequence
feature prediction that includes homology information. Bioinformatics.
2005;21:i251–7.
80. Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, et al. The
Sequence Alignment/Map format and SAMtools. Bioinformatics.
2009;25:2078–9.
81. Katoh K, Standley DM. MAFFT multiple sequence alignment software
version 7: improvements in performance and usability. Mol Biol Evol.
2013;30:772–80.
82. Hall TA. BioEdit: a user-friendly biological sequence alignment editor and
analysis program for Windows 95/98/NT. Nucleic Acids Symp Ser.
1999;41:95–8.
83. Yang Z. Computational Molecular Evolution. Oxford: Oxford University Press;
2006.
84. Prohaska SJ, Fried C, Flamm C, Wagner GP, Stadler PF. Surveying
phylogenetic footprints in large gene clusters: applications to Hox cluster
duplications. Mol Phylogenet Evol. 2004;31:581–604.
85. Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ. Basic local alignment
search tool. J Mol Biol. 1990;215:403–10.
86. Edgar RC. MUSCLE: multiple sequence alignment with high accuracy and
high throughput. Nucleic Acids Res. 2004;32:1792–7.
87. Kiwi Genome. Available at: http://www.bioinf.uni-leipzig.de/~studla/
KIWI-HOX/.
88. Kiwi Annotated HOX Cluster. Available at: https://bioinf.eva.mpg.de/
KIWI-HOX/
276
Appendix A.4.
Personal copy of the manuscript “Heterogeneity of dN/dS Ratios at the Classical
HLA Class I Genes over Divergence Time and Across the Allelic Phylogeny”:
Journal of Molecular Evolution (2015) 82(1): 38-50. The final version of the article
(after editorial processing) may not be redistributed from this document,
because it is not Open Access. I therefore provide the version accepted for
publication, without the journal's formatting; the formatted version is
available at DOI 10.1007/s00239-015-9713-9.
This article is the result of my master's work, which was refined over the
course of my doctorate. It shares elements with the article presented in
Appendix A.2, since in both we investigate the units of selection in the HLA
genes: here, allelic lineages; in the other article (A.2), supertypes.
In this work I was supervised by Diogo Meyer, who conceived the original ideas
of the project. We both developed the methodologies adopted throughout the
project. I carried out all the analyses and wrote the manuscript together
with DM. RSF and I performed the detection of recombinant sequences, and all
co-authors took part in discussing and verifying the results.
bioRxiv preprint first posted online Aug. 22, 2014; doi: http://dx.doi.org/10.1101/008342. The copyright holder for this preprint (which was
not peer-reviewed) is the author/funder. It is made available under a CC-BY-NC-ND 4.0 International license.
Abstract The classical class I HLA loci of humans show an excess of nonsynonymous with respect to
synonymous substitutions at codons of the antigen recognition site (ARS), a hallmark of adaptive evolution.
Additionally, high polymorphism, linkage disequilibrium and disease associations suggest that one or more
balancing selection regimes have acted upon these genes. However, several questions about these selective
regimes remain open. First, it is unclear if stronger evidence for selection on deep timescales is due to changes
in the intensity of selection over time or to a lack of power of most methods to detect selection on recent
timescales. Another question concerns the functional entities which define the selected phenotype. While most
analyses focus on selection acting on individual alleles, it is also plausible that phylogenetically defined groups
of alleles ("lineages") are targets of selection. To address these questions we analyzed how dN/dS (ω) varies
with respect to divergence times between alleles and phylogenetic placement (position of branches). We find
that ω for ARS codons of class I HLA genes increases with divergence time and is higher for inter-lineage
branches. Throughout our analyses, we used non-selected codons to control for possible effects of inflation of ω
associated with intra-specific analyses, and showed that our results are not artifactual. Our findings indicate the
importance of considering the timescale effect when analysing ω over a wide spectrum of divergences. Finally,
our results support the divergent allele advantage model, whereby heterozygotes with more divergent alleles
Keywords balancing selection, HLA, MHC, dN/dS, allelic lineages, antigen recognition site, divergent allele
advantage
Address: Department of Genetics and Evolutionary Biology, University of São Paulo, Rua do Matão, 277, São Paulo. Tel.:
+55(11)3091-8092 E-mail: bdbitarello@gmail.com
1 Introduction
MHC class I and II classical molecules are cell-surface glycoproteins which mediate presentation of peptides
to T-cell receptors, and play a key role in triggering adaptive immune responses when the bound peptide is
recognized as foreign (Klein and Sato 2000). In humans, they are encoded by HLA class I (HLA-A, -B, and -C)
and II (HLA-DR, -DQ, and -DP) classical genes. The class I and class II HLA classical genes are the most
polymorphic in the human genome (Meyer and Thomson 2001), and knowledge about their function in the
immune response supports a role for balancing selection in driving the diversity patterns at these loci.
A number of findings suggest MHC genes have experienced balancing selection: unusually high level of
heterozygosity with respect to neutral expectations (Hedrick and Thomson 1983); existence of trans-species
polymorphisms (Takahata and Nei 1990); high levels of linkage disequilibrium (Huttley et al 1999); site
frequency spectra with excess of common variants (Garrigan and Hedrick 2003); high levels of identity-by-descent
compared to genomic averages (Albrechtsen et al 2010); positive correlation between HLA polymorphism and
pathogen diversity (Prugnolle et al 2005), and significant associations of HLA alleles with the course of infectious
diseases (e.g. Apps et al 2013). Information on the crystal structure of MHC molecules (Bjorkman et al 1987)
allowed the identification of a specific set of amino acids that make up the antigen recognition site (ARS),
which determines the peptides that the molecule is able to bind (Bjorkman et al 1987; Chelvanayagam 1996).
The codons of the ARS were shown to have increased nonsynonymous substitution rates (Hughes and Nei
1988, 1989), consistent with the hypothesis that adaptive evolution at HLA loci is driven by peptide binding
properties.
Several models of selection are compatible with balancing selection at MHC genes. Heterozygote advantage
assumes that heterozygotes have higher fitness values because they are able to mount an immune response
to a greater array of pathogens, an idea originally proposed by Doherty and Zinkernagel (1975), who showed
that mice which were heterozygous for the MHC had increased immunological surveillance. Heterozygote
advantage has received support from experiments in semi-natural populations of mice (Penn et al 2002), which
show increased resistance of heterozygotes to multiple-strain infection, and through the finding that among
humans infected with HIV, those which are heterozygous for HLA genes have slower progression to AIDS
(reviewed in Dean et al 2002). Heterozygote advantage has also received support from substitution rate studies
(Hughes and Nei 1988, 1989) as well as simulation-based studies (e.g. Takahata and Nei 1990). A second
model for balancing selection at MHC genes is negative frequency dependent selection (or apostatic selection),
according to which rare variants have a selective advantage over common ones, because pathogens are more
likely to evade presentation by common molecules (Slade and McCallum 1992). Although both models are biologically
compelling, decades of research have shown that most summaries of genetic variation cannot
distinguish between these two modes of selection (Hughes and Nei 1989; Meyer and Thomson 2001; Spurgin and
Richardson 2010), and the functional insights for the action of heterozygote advantage at least partially explain
A third model involves selective pressures that are heterogeneous over space and/or time, favoring different
alleles in different temporal or geographic compartments, and thus resulting in an overall increase in diversity
at MHC loci. This model has been shown to be capable of accounting for features of HLA variation (Hedrick
2002). Many studies have investigated this model by comparing the degree of population differentiation at MHC
and putatively neutral loci, with the expectation being that selection that is geographically heterogeneous will
result in increased differentiation at HLA genes. As reviewed in Spurgin and Richardson (2010), the results are
mixed, and interpretation is hampered due to differences in the mutational models underlying the evolution
of HLA genes and loci used as neutral controls. Although the specific form of selection acting on MHC genes
remains an open question, the fact that these genes have evolved in a non-neutral way and are under balancing
selection is an undisputed finding, which is robust to complications introduced by demographic history (Harris
and Meyer 2006; Hughes and Yeager 1998; Garrigan and Hedrick 2003).
While studies of MHC have convincingly documented a role for selection, certain questions remain unresolved
in the context of variation of the human MHC genes (termed HLA loci). The first of these concerns the
"timescale" of selection: while most tests for selection have provided strong evidence for selection at classical
HLA class I genes on deep timescales, there is comparatively less support for selection on recent timescales
(Garrigan and Hedrick 2003). It has proved difficult to tease apart the possibility that selection differs across
timescales from reduced statistical power of tests for recent selection, and thus the question of the timescale
The second question concerns targets of selection, i.e., which biological entity is targeted by selection in
HLA class I genes: individual alleles or groups of similar alleles? Classical MHC genes have many alleles,
which can be hierarchically classified into groups of alleles which reflect the phylogenetic relatedness and
shared functional attributes of these alleles. Wakeland et al (1990) proposed a mechanism coined "divergent
allele advantage", which is a specific case of heterozygote advantage, according to which the fitness values
of heterozygotes are proportional to the degree of divergence between the alleles they carry. This model was
motivated by the observation that, in MHC class II murine genes, alleles from a given allelic lineage often differ
by only minor structural variations in the ARS, while alleles in different lineages have functionally different
ARS. The open question is whether individual alleles or allelic lineages are the main targets of selection for
HLA genes. Although nucleotide diversity within lineages exceeds genome-wide averages, diversity between lineages
is substantially higher (Takahata and Satta 1998). This raises the question of whether intra-lineage
variation is under a different mode and intensity of selection with respect to differences between lineages.
We address these questions by analysing the temporal and phylogenetic dynamics of dN/dS (or ω) for ARS
codons at the classical class I loci (HLA-A, -B and -C), using both pairwise and phylogenetic approaches.
These loci are all highly polymorphic and there is an abundance of data available for most exons of their coding
sequence, which makes our analyses of non-ARS codons (as a control) possible. Our pairwise comparisons of
alleles show that more divergent pairs have higher ω for ARS codons than closely related pairs of alleles.
The phylogenetic analyses support the hypothesis that selection is stronger for inter-lineage branches (i.e.,
those connecting clades from different lineages, as opposed to those within a lineage), and also for branches that are
internal to the phylogeny (when compared to terminal branches), provided that a bias toward overestimating
ω for recent divergence is taken into account (Rocha et al 2006). Although evidence for balancing selection
on the intra-lineage scale is weaker than on the inter-lineage scale, our findings show that there is statistical
support for deviation from a regime of neutrality for intra-lineage branches of the allelic tree. We conclude
that intra-lineage divergence has also evolved under a regime of balancing selection, and that inter-lineage
2.1 Data
Alignments for HLA-A, HLA-B and HLA-C were obtained from the IMGT/HLA Database (Robinson et al
2013). All dN/dS estimates and related analyses were implemented in CODEML (PAML package, Yang 2007).
First codon position was considered to be the first codon of exon 2, as indicated by annotation on IMGT
alignments. Our initial data sets comprised complete coding sequences, i.e., exons 2-7 (for HLA-A and
HLA-C ) and 2-6 (HLA-B). These data sets were used for the site models (SM) approach. For the pairwise and
branch model (BM) approaches, we used two datasets: one with 48 ARS codons (Chelvanayagam 1996) and
the other, referred to as "non-ARS", consisting of the remaining codons (Table 1).
In order to be able to use the methods available in CODEML we restricted our analysis to HLA alleles
which had complete coding sequences, no stop codons, were expressed on the cell surface and only differed with
respect to others by base changes (i.e. no insertions or deletions). Alleles with mutations putatively linked to
low or absent cell surface expression were also removed from analyses. The non-ARS data sets were used for
estimation of dS, used in the pairwise approach as a proxy for allelic divergence, and as an internal control for ARS
analyses. For the branch models, further pruning of the phylogenetic trees was done, as described below.
Phylogenetic trees. Complete alignments, described above, were used to generate NJ trees for each gene (Saitou
and Nei 1987). The program NEIGHBOR, from the PHYLIP package (Felsenstein 1989) was used with the
F84 method, k (transition/transversion ratio) = 2 and empirical base frequencies for the distance matrices
Recombination detection. Intragenic recombinants were detected by applying RDP3 (Martin et al 2010) to
the complete alignments, followed by manual inspection. The RDP3 program combines several non-parametric
recombination detection methods for sequence data, and we used six independent tests:
RDP, Chimaera, Maxchi, GENECONV, BootScan and SiScan (see Martin et al
2010 and references therein). Window size was set to 100 for BootScan and SiScan, and to 15 for RDP.
The number of variable sites per window was adjusted to 35 and 30 for Maxchi and Chimaera, respectively.
These sizes were chosen based on a test alignment we provided to the software, in which parental and daughter
HLA-B sequences were known a priori. Based on this training set, we adjusted the parameters as described, and
for other parameters default values were used. Since these six tests are mostly independent, and have different
strengths, we considered a recombination event to be significant when p < 0.05 in at least 3 of the above
methods, which means we were somewhat conservative in removing recombinants from the data sets. "Trace
evidence" cases, i.e., those that bear a signal of recombination but are not statistically significant,
were kept in the data sets. Following this initial procedure, we visually inspected the filtered alignments to
detect additional recombinant sequences. This procedure generated two data sets for each locus, one
with recombinants and one without ("recombinant" (R) and "non-recombinant" (NR), respectively; Table 1).
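The decision rule described above (an event counts as significant when at least three of the six methods give p < 0.05) can be sketched as follows. This is only an illustration: the function name, the dictionary layout and the p-values are hypothetical and do not correspond to RDP3 output or to its interface.

```python
# The six tests named in the text; the spelling of the keys is an assumption.
METHODS = ("RDP", "Chimaera", "Maxchi", "GENECONV", "BootScan", "SiScan")

def is_recombinant(p_values, alpha=0.05, min_methods=3):
    """Return True when at least `min_methods` methods report p < alpha.

    Methods absent from `p_values` are treated as non-significant.
    """
    n_significant = sum(
        1 for method in METHODS
        if p_values.get(method) is not None and p_values[method] < alpha
    )
    return n_significant >= min_methods

# Hypothetical p-values for two candidate events.
clear_event = {"RDP": 0.01, "Chimaera": 0.20, "Maxchi": 0.03,
               "GENECONV": 0.04, "BootScan": 0.50, "SiScan": 0.07}
trace_event = {"RDP": 0.01, "Maxchi": 0.04, "SiScan": 0.30}

print(is_recombinant(clear_event))  # three methods fall below 0.05
print(is_recombinant(trace_event))  # only two do: a "trace evidence" case
```

Under this rule the second event would stay in the data set, which matches the treatment of "trace evidence" cases described in the text.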
Clade filter. For the branch models, we used t (expected number of nucleotide substitutions per codon) matrices
obtained in pairwise analyses of the non-recombinant non-ARS data sets as input for NEIGHBOR. The trees
were visualized for manual pruning and labeling in Mesquite (v2.75, http://mesquiteproject.org/). We imposed
that alleles from a given HLA lineage (as defined by the standard HLA nomenclature, which identifies lineage
membership by the first field of an allele’s name) had to group together in a clade, and alleles which did not
group in such manner were manually pruned from trees in order to fulfill this "clade membership criterion".
The effect of this filtering on inclusion of alleles is presented in Figure S1 in the Online Resource 1. After
pruning of the trees, the corresponding pruned alleles were removed from the NR data sets and these reduced
data sets were used for the branch model analyses. Table 1 shows the number of alleles used for each analysis.
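The clade membership criterion can be illustrated with a small check that reports which lineages fail to form a clade. This is a sketch under stated assumptions, not the procedure used here (pruning was done manually in Mesquite): the tree is assumed to be rooted and written as nested tuples, the allele names are hypothetical, and the sketch only flags lineages rather than choosing which alleles to prune.

```python
def lineage_of(allele):
    """Lineage is the locus plus the first field, e.g. 'A*02:01' -> 'A*02'."""
    return allele.split(":")[0]

def clade_leaf_sets(tree):
    """Collect the leaf set of every node of a rooted nested-tuple tree."""
    found = []

    def walk(node):
        if isinstance(node, str):          # a leaf is an allele name
            leaves = frozenset([node])
        else:                              # an internal node is a tuple of children
            leaves = frozenset().union(*(walk(child) for child in node))
        found.append(leaves)
        return leaves

    walk(tree)
    return found

def lineages_failing_criterion(tree):
    """Return the lineages whose alleles do not group together in one clade."""
    all_clades = set(clade_leaf_sets(tree))
    all_leaves = max(all_clades, key=len)  # the root holds every leaf
    members = {}
    for allele in all_leaves:
        members.setdefault(lineage_of(allele), set()).add(allele)
    return sorted(lineage for lineage, alleles in members.items()
                  if frozenset(alleles) not in all_clades)

# Hypothetical topology in which A*02:06 falls outside the main A*02 clade.
tree = ((("A*02:01", "A*02:05"), "A*03:01"), ("A*01:01", "A*02:06"))
pruned = ((("A*02:01", "A*02:05"), "A*03:01"), "A*01:01")

print(lineages_failing_criterion(tree))
print(lineages_failing_criterion(pruned))
```

In the first tree A*02 is flagged; after removing the misplaced allele, no lineage violates the criterion.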
Branch models (BM). With the pruned data sets we compared branch models 0 (one ω for all branches) and
2 (two or more categories of branches with independent ω) from CODEML. We provided CODEML with a
topology based on the non-ARS pruned data set, using branch lengths as starting points for ML estimation
(fix_blength=1). For all CODEML analyses (BM, site models and pairwise), the Goldman and Yang (1994)
model was used for estimation of substitution rates. Other parameters defined in the control file were as
follows: option F3x4 for codon frequency estimation, κ = 2 and ω = 0.4 as initial values. Tables S14-S16
(Online Resource 1) show likelihood convergence for the branch models, assuming different initial parameter
values and codon frequency estimation methods. BM analyses were performed solely for the NR data sets (see
tables 2 and 3). Branch models 0 (one omega for all branches) and 2 (two or more omegas) were compared,
where branches were labeled either as "intra" or "inter" lineages (Figure 3), or as "terminal" or "internal". The
two models were compared via a likelihood ratio test (LRT) with one degree of freedom (see below). BM
analyses were performed only for the NR (and pruned) data sets. See Figure 3 for a schema of the labels
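For orientation, the settings listed in this subsection map onto a CODEML control file roughly as sketched below. The helper and all file names are hypothetical; only the option values stated in the text (F3x4, κ = 2, ω = 0.4, fix_blength = 1, branch models 0 and 2) are taken from the source, and the remaining entries (runmode, seqtype, NSsites, fix_kappa, fix_omega) are assumptions about a typical branch-model run.

```python
def codeml_branch_ctl(seqfile, treefile, outfile, model):
    """Build the text of a codeml control file for branch model 0 or 2."""
    if model not in (0, 2):
        raise ValueError("expected branch model 0 or 2")
    settings = [
        ("seqfile", seqfile),
        ("treefile", treefile),
        ("outfile", outfile),
        ("runmode", 0),       # assumption: user-supplied topology
        ("seqtype", 1),       # codon sequences
        ("CodonFreq", 2),     # F3x4 codon frequencies
        ("model", model),     # 0: one omega; 2: omega per branch label
        ("NSsites", 0),       # assumption: no variation among sites
        ("fix_kappa", 0),     # estimate kappa ...
        ("kappa", 2),         # ... starting from 2
        ("fix_omega", 0),     # estimate omega ...
        ("omega", 0.4),       # ... starting from 0.4
        ("fix_blength", 1),   # tree branch lengths used as starting values
    ]
    return "\n".join(f"{key:>12} = {value}" for key, value in settings) + "\n"

# Hypothetical file names for an ARS data set of one locus.
ctl_text = codeml_branch_ctl("ars_pruned.phy", "nonars_pruned.tre",
                             "bm2_ars.out", model=2)
print(ctl_text)
```

For model 2 the tree file must also carry the branch labels (for example `#1` on the branches of one category), which this sketch does not generate.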
Site models (SM). For the SM approach, the clade filter was not applied, which resulted in minor differences
between this data set and the other two (pairwise and branch models approach, see Table 1). We used the
site models from CODEML to identify codons with ω > 1 and thus to test if ARS codons bear evidence for
adaptive evolution. M0 (one ratio) assumes a single ω for all codons. M1 (neutral)
assumes two categories of sites, one with ω1 = 1 (sites evolving neutrally) and
the other with ω0 < 1 (sites evolving under purifying selection). M2 (selection) adds an extra category
to M1, with ω2 > 1, corresponding to sites with evidence for adaptive evolution. M7 (beta) is a flexible
null model in which ω is sampled from a beta distribution, with 0 < ω < 1, while M8
adds an extra category to M7, ω2, which is estimated from the data (Yang 2006). Codons with posterior
probabilities P > 0.95 of ω > 1 in the Bayes Empirical Bayes (BEB) (Yang et al 2005) approach implemented
in CODEML were considered to have significant evidence for adaptive evolution, following criteria described
elsewhere (Yang and Swanson 2002; Yang et al 2005). The ARS codon classification proposed by Bjorkman
et al. (1987) is referred to as BJOR, while the "peptide binding environments", i.e., the amino acid residues in
a fixed neighborhood of the peptide binding residues known from crystal structure complexes (which provide
a less restrictive description of the antigen binding sites), are referred to as CHEV (Chelvanayagam 1996).
Finally, the list of codons in HLA genes with evidence of ω > 1 from Yang and Swanson (2002) is referred to as
YANG (Figure 1 and Online Resource 1, Table S9). M1 vs M2 and M7 vs M8 models were compared through
an LRT with two degrees of freedom. Tables S3-S8 (Online Resource 1) show likelihoods obtained when altering
initial CODEML conditions for the SM analyses. SM analyses were performed for R and NR data sets.
Codons with P > 0.95 for ω > 1 in M8 (34 in total) were combined for the three loci, and the R and NR
data sets, and compared to CHEV, BJOR and YANG. Of these 34 codons, only one was outside of the exons
2 and 3 range (codon 305), which is where all ARS codons are located. Figure 1 shows the overlap between
the codons defined as making up the ARS in the BJOR and CHEV classifications, as well as those identified as
In order to evaluate if our site model analyses were robust to features of the estimation method, the analyses
were repeated with DATAMONKEY, from the HYPHY package (Pond et al 2005). The substitution model
used for construction of the NJ tree was HKY85 (very closely related to F84, used for CODEML analyses).
Two criteria for detecting significant dN/dS > 1 were considered: SLAC and FEL (both with significance level
of 0.1), with the former being the most conservative criterion available in the package. Tables S10-S12 report
the overlap of sites with evidence for dN/dS > 1 for BEB (CODEML), SLAC and FEL.
LRT. When comparing two nested models, the LRT test statistic is twice the log-likelihood difference
between the more parameter-rich model and the less parameter-rich model. The difference in parameter number
yields the degrees of freedom. It is expected that the use of a chi-square distribution for significance evaluation
of this test is a conservative approach (Yang 2006). Both site models and branch models comparisons were
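As a worked illustration of the LRT described above, the sketch below computes the statistic and a chi-square p-value for the two cases used in this work (one degree of freedom for the branch models, two for the site models). The log-likelihood values are hypothetical, and the closed-form tail probabilities cover only df = 1 and df = 2.

```python
import math

def chi2_sf(x, df):
    """Upper-tail probability of a chi-square variable, for df = 1 or 2 only."""
    if x <= 0:
        return 1.0
    if df == 1:
        return math.erfc(math.sqrt(x / 2.0))
    if df == 2:
        return math.exp(-x / 2.0)
    raise ValueError("this sketch only covers df = 1 or 2")

def lrt(lnl_null, lnl_alt, df):
    """Return (2 * delta lnL, p-value) for two nested models."""
    stat = 2.0 * (lnl_alt - lnl_null)
    return stat, chi2_sf(stat, df)

# Hypothetical log-likelihoods for a site-model comparison (df = 2).
stat, p_value = lrt(lnl_null=-1050.0, lnl_alt=-1040.0, df=2)
print(stat, p_value)
```

With these made-up values the statistic is 20 and the p-value is exp(-10), far below the 0.01 threshold.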
Breslow-Day test. In order to compare ARS and non-ARS codons with respect to the distribution of synonymous
and nonsynonymous changes within and between lineages (or for internal or terminal branches), we used a
contingency table approach similar to the one described in Templeton (1996). We estimated the synonymous
(S) and nonsynonymous (N) changes on each branch in CODEML, using the branch models. Next we
counted N (nonsynonymous changes) and S (synonymous changes) for intra/inter or terminal/internal branches
for each locus, and for ARS and non-ARS codons (Table 5).
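We then computed the odds ratio
\[ \mathrm{OR} = \frac{N_{\mathrm{intra}} \cdot S_{\mathrm{inter}}}{N_{\mathrm{inter}} \cdot S_{\mathrm{intra}}} \]
and used a Breslow-Day test for homogeneity of OR to test the hypothesis that contingency tables from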
ARS and non-ARS codons have the same OR. We applied the same test to internal/terminal branches. Data
from the three loci were combined into the same analysis to increase power.
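The odds ratio and the homogeneity test can be sketched as below, with ARS and non-ARS codons as the two strata. This is a minimal sketch assuming the standard Breslow-Day statistic without the Tarone correction; the text does not say which variant was used, and all counts are hypothetical.

```python
import math

def odds_ratio(n_intra, n_inter, s_intra, s_inter):
    """OR = (N_intra * S_inter) / (N_inter * S_intra)."""
    return (n_intra * s_inter) / (n_inter * s_intra)

def breslow_day(tables):
    """Breslow-Day statistic for tables of (N_intra, N_inter, S_intra, S_inter).

    Returns the Mantel-Haenszel common odds ratio, the statistic and its df.
    """
    num = sum(a * d / (a + b + c + d) for a, b, c, d in tables)
    den = sum(b * c / (a + b + c + d) for a, b, c, d in tables)
    psi = num / den
    stat = 0.0
    for a, b, c, d in tables:
        row1, row2, col1 = a + b, c + d, a + c
        # Expected first cell A under a common odds ratio psi solves
        # A * (row2 - col1 + A) = psi * (row1 - A) * (col1 - A).
        qa = 1.0 - psi
        qb = row2 - col1 + psi * (row1 + col1)
        qc = -psi * row1 * col1
        if abs(qa) < 1e-12:
            roots = [-qc / qb]
        else:
            disc = math.sqrt(qb * qb - 4.0 * qa * qc)
            roots = [(-qb + disc) / (2.0 * qa), (-qb - disc) / (2.0 * qa)]
        low, high = max(0.0, col1 - row2), min(row1, col1)
        expected = next(r for r in roots if low < r < high)
        variance = 1.0 / (1.0 / expected + 1.0 / (row1 - expected)
                          + 1.0 / (col1 - expected)
                          + 1.0 / (row2 - col1 + expected))
        stat += (a - expected) ** 2 / variance
    return psi, stat, len(tables) - 1

# Hypothetical counts: (N_intra, N_inter, S_intra, S_inter).
ars = (40, 120, 20, 30)
non_ars = (30, 60, 50, 100)

psi, stat, df = breslow_day([ars, non_ars])
p_value = math.erfc(math.sqrt(stat / 2.0))  # chi-square upper tail, valid for df = 1
print(odds_ratio(*ars), odds_ratio(*non_ars), stat, p_value)
```

An OR below 1 in the ARS table, as in these made-up counts, would indicate relatively more nonsynonymous changes on inter-lineage branches than on intra-lineage ones.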
Pairwise approach. We also performed analyses where statistics were estimated in comparisons between all pairs
of alleles (pairwise analyses, see Table 1) using runmode=-2 in CODEML. This approach does not require a
phylogenetic tree. Because IMGT/HLA nomenclature allows information about allelic lineages to be known
without a tree, pairs were also classified as intra or inter-lineage. Correlations between allelic divergence and
omega values were tested with a Mantel Test using Pearson’s correlation index (Online Resource 1, Table S13).
We obtained quantiles of the non-ARS dS distribution and divided pairwise values according to these quantiles
(Online Resource 1, Table S1 for non-ARS data set and Table 4 in main text for ARS data set). Differences
in mean ω values for "intra" and "inter" comparisons were tested for significance by a Wilcoxon rank sum test
(Figure 2).
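Because lineage membership can be read from the allele name, pairs can be labeled as intra- or inter-lineage without a tree. The sketch below shows that labeling step and a per-class mean; the allele names and ω values are hypothetical, and the significance test itself (Wilcoxon rank sum) is not reproduced.

```python
from itertools import combinations
from statistics import mean

def lineage_of(allele):
    """Lineage is the locus plus the first field, e.g. 'B*07:02' -> 'B*07'."""
    return allele.split(":")[0]

def classify_pairs(alleles):
    """Label every unordered pair of alleles as 'intra' or 'inter' lineage."""
    return {
        (a, b): "intra" if lineage_of(a) == lineage_of(b) else "inter"
        for a, b in combinations(sorted(alleles), 2)
    }

def mean_omega_by_class(labels, omega):
    """Average pairwise omega within each class of pairs."""
    grouped = {}
    for pair, label in labels.items():
        grouped.setdefault(label, []).append(omega[pair])
    return {label: mean(values) for label, values in grouped.items()}

# Hypothetical alleles and pairwise omega estimates for ARS codons.
labels = classify_pairs(["B*07:02", "B*07:05", "B*08:01"])
omega = {
    ("B*07:02", "B*07:05"): 0.9,
    ("B*07:02", "B*08:01"): 2.1,
    ("B*07:05", "B*08:01"): 1.9,
}
print(labels)
print(mean_omega_by_class(labels, omega))
```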
The IMGT/HLA database contains all HLA alleles described to date, regardless of their population frequencies.
Therefore, it is possible that rare variants can contribute disproportionately to patterns identified in the dN/dS
analyses. To address this concern, we investigated patterns of variation at the HLA loci in a population (Yoruba,
YRI) from the 1000 Genomes Project (1000G), for which frequency of alleles at specific SNP positions is
available (N = 88 individuals).
To test for a possible enrichment of rare variants in the IMGT data we compared patterns of variation
seen in the IMGT and 1000G phase I data (The 1000 Genomes Project Consortium, 2012). To this end, we
defined a set of sites, for each locus, which were variable in our IMGT-derived data sets (referred to as the
"OVERALL" set of sites). Next, we classified these sites as variable only within a single lineage ("INTRA"), or
variable in more than one lineage ("INTER"). For each site, we converted the positions within the HLA locus
Next, we verified if these positions are polymorphic in the 1000G Phase I low-coverage dataset (ftp://ftp.1000genomes.ebi.ac.uk/v
3 Results
Before investigating how ω varies over time and phylogenetic context, we tested (a) whether selection is
detectable in our data set with pairwise comparisons and phylogenetic dN/dS approaches; (b) if the presence
of HLA alleles resulting from intragenic recombination influences our inferences; and (c) if there is agreement
between the ARS codons defined by crystal structure (Bjorkman et al 1987; Chelvanayagam 1996) and the
codons inferred to have ω > 1 in our data set. The results of these tests are prerequisites for subsequent
analyses addressing the more specific hypotheses about heterogeneity in dN/dS estimated across the allelic
We quantified the mean pairwise dN/dS (ω), and found ωARS > 1 for all loci (Table 4). We used the
non-ARS codons from the same sequences as an internal control, and found that ωARS is 3.9-fold (HLA-A),
4.0-fold (HLA-B) and 3.2-fold (HLA-C) greater than ωnon-ARS (Table 4). This effect is not driven by a subset of the
pairwise comparisons, since dN > dS for the majority (between 67 and 84%) of ARS pairwise comparisons, in
contrast to the non-ARS comparisons, where fewer than 7% show dN > dS (Table 4). Importantly, we find
that the result ωARS > ωnon-ARS is due to increased dN (3.5 to 14-fold higher for ARS), and not to decreased dS
(0.5 to 2.8-fold higher for ARS, Table 4). Qualitatively similar results were obtained when we computed the
ratio of mean substitution rates, dN/dS (Table 4). These findings are robust to the presence of recombinants
(Online Resource 1, Table S1). Overall, our results document that pairwise comparison of alleles provides
Evidence for adaptive evolution in ARS codons was also strongly supported by phylogenetic methods
from CODEML (see Methods), where models allowing for selection (M2 and M8) in a subset of codons were
significantly favored over the neutral models M1 and M7 (Online Resource 1, Table S2; p < 0.01, LRT). Results
were robust to starting conditions for HLA-A and HLA-B (Online Resource 1, Tables S3-S6), and less so for
We next quantified the overlap between codons we inferred to be under selection (using site models
from CODEML, "SM") and those defined as ARS based on structural analyses of HLA (Chelvanayagam
1996; Bjorkman et al 1987). Within exons 2 and 3 (which contain all ARS codons) we identified 33 codons
with significant ω > 1 for the M8 site model (see Methods and Table S2, Online Resource 1) in at least
one locus, of which 27 (82%) are contained within the set that forms the ARS according to the crystal
structure-based classification (Bjorkman et al 1987), 25 (76%) are contained within the peptide binding
environments (Chelvanayagam 1996), and 25 (76%) overlap with Yang and Swanson’s (2002) site models
approach to detect codons with ω > 1 in the three classical class I HLA loci (Figure 1 and Online Resource 1,
Table S9). The association between ARS and selected sites for all loci is highly significant (p < 10^-11, chi-square
test). There is extensive overlap between the two ARS classifications (Bjorkman et al 1987; Chelvanayagam
1996) (Figure 1) and we also find a high overlap of selected sites between the R and NR data sets for each
Overall, our results show that: (a) the pairwise and phylogenetic site models methods implemented in
CODEML strongly support adaptive evolution on the ARS codons of HLA loci - as also described by Yang
and Swanson (2002) through site models; (b) there is an enrichment of codons with ω > 1 in the CHEV set
of codons (see Online Resource 1, Table S9, for the names given to the sets of codons), supporting the use of
this classification for our study; (c) although the results were robust to the presence of recombinants, a finding
consistent with simulation studies (Anisimova et al, 2003), the estimated values for ω appear to be sensitive
to the inclusion of recombinants. Therefore, where appropriate, in subsequent pairwise analyses, we contrast
results of non-recombinant (NR) and recombinant (R) datasets, while for the branch models we use the NR
In addition, the results of HLA-C, although following the same trend observed for HLA-A and HLA-B,
show that absolute divergence values for ARS codons are on average half of those observed for the other two
loci, both for dN and dS (Table 4). This result might reflect the fact that HLA-C not only has an
antigen presentation function, but also plays a major role in interactions with NK receptors (KIR) and that, unlike
HLA-A and HLA-B, all HLA-C allotypes form ligands for KIR receptors (Hilton et al 2015; Single et al 2007).
Because the KIR loci have been shown to evolve quite rapidly across primate species, plausibly faster than
their MHC class I ligands (Single et al, 2007), it is possible that this important selective pressure is responsible
for the lower substitution rates seen for the ARS of HLA-C, as well as for the lack of consistency observed in
ML estimates
Having confirmed that selection at ARS sites is detectable with pairwise comparisons and phylogenetic approaches,
we investigated if recent evolutionary change (accounting for differences among recently diverged alleles) shows
different signatures of selection with respect to changes that occurred over greater timescales. Our first approach
consisted of examining the distribution of ωARS as a function of the time since divergence between allele pairs.
Our estimate of divergence time between allele pairs was based on the values of dS (estimated from non-ARS
codons) for each allele pair, thus avoiding statistical non-independence with ωARS. Because very recently diverged
alleles have low synonymous divergence (dSnon-ARS), the corresponding ωARS values were often undefined or
extremely large. We therefore followed a strategy adopted by Wolf et al (2009) to filter out the allele pairs with
ωARS > 5 (resulting in the removal of 1.1%, 1.4% and 3.9% of ω values for pairwise comparisons at HLA-A, -B,
Pairwise estimates show that ωARS increases as a function of divergence time (Table 4). Indeed, ωARS and
dSnon-ARS are positively correlated (Online Resource 1, Table S13; r = 0.17 for HLA-A, r = 0.20 for HLA-B and
r = 0.20 for HLA-C, with p < 0.001 in all cases; Pearson correlation, significance obtained by Mantel test). Qualitatively similar
results were found for NR data sets and were robust to different correlation measures (Online Resource 1,
Table S13). We also compared the ω between allele pairs classified as intra and inter-lineage (Figure 2). For
all loci, the median value of ωARS is > 1 for the inter lineage contrasts, and < 1 for the intra-lineage contrasts,
and the distribution of ω is significantly higher for inter-lineage contrasts (p < 0.001, Wilcoxon rank sum test;
The above pairwise comparison approach suffers from the limitation that allele pairs with ω > 5 were
treated as missing data, possibly underestimating ω for recently diverged alleles. This prompted us to use a
phylogenetic model to contrast alleles at different levels of differentiation, which is more robust to the effects
of low differentiation between specific allele pairs. We compared a branch model that estimates a single ω for
all branches to one that estimates two values of ω (inter versus intra-lineage; terminal versus internal; see
Figure 3). For all loci we found higher ωARS for inter-lineage branches than for intra-lineage branches, although
significance was not attained for these tests (Table 2). For the contrast between internal and terminal branches,
we found higher ωARS for internal branches at all loci and this result was statistically significant for HLA-C
(Table 2).
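The comparison between the one-ratio and two-ratio branch models is a likelihood ratio test between nested models. A minimal sketch follows, assuming that the two-ratio model adds exactly one free parameter (one degree of freedom) and that the test statistic is referred to a chi-square distribution. The function name `lrt_one_df` and the log-likelihood values are hypothetical, not taken from the CODEML output of this study.

```python
import math

def lrt_one_df(lnl_null, lnl_alt):
    """Likelihood ratio test for nested models differing by one parameter.

    lnl_null: log-likelihood of the simpler model (single omega).
    lnl_alt:  log-likelihood of the richer model (two omega values).
    Returns the test statistic and its chi-square (df = 1) p-value.
    """
    stat = max(2.0 * (lnl_alt - lnl_null), 0.0)
    # For df = 1 the chi-square survival function is erfc(sqrt(stat / 2)).
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value

# Hypothetical log-likelihoods, for illustration only.
stat, p_value = lrt_one_df(lnl_null=-1000.00, lnl_alt=-998.08)
print(f"2*delta(lnL) = {stat:.2f}, p = {p_value:.4f}")
```

A statistic of about 3.84 corresponds to p of about 0.05 under these assumptions, so smaller improvements in log-likelihood would not reach significance.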
Our results show that both pairwise comparisons and branch models indicate a heterogeneity of ω throughout
the diversification of HLA alleles, with higher ω values associated with contrasts between more divergent alleles
(pairwise approach) or to branches connecting different lineages or that are internal to the phylogeny (BM
approach), although the difference was not significant for the "intra-inter" contrasts.
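The pairwise part of this analysis combines two steps described above: discarding allele pairs whose ω is undefined or above 5, and testing the correlation between ωARS and dSnon-ARS with a Mantel permutation test. The sketch below shows one way to do this, under stated assumptions: all function names are hypothetical, the test is one-sided (positive correlation), and the number of permutations is arbitrary. It is not the authors' implementation.

```python
import math
import random

OMEGA_CAP = 5.0  # filtering threshold stated in the text

def omega_or_missing(dn, ds, cap=OMEGA_CAP):
    """Return dN/dS, or None when it is undefined (dS == 0) or above the cap."""
    if ds == 0:
        return None
    omega = dn / ds
    return omega if omega <= cap else None

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def mantel(omega, divergence, permutations=999, seed=1):
    """One-sided Mantel test between two symmetric allele-by-allele matrices.

    omega[i][j] may be None (filtered pair); such pairs are skipped.
    Allele labels of the divergence matrix are permuted to build the null.
    """
    n = len(omega)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)
             if omega[i][j] is not None]
    xs = [omega[i][j] for i, j in pairs]

    def corr(order):
        return pearson(xs, [divergence[order[i]][order[j]] for i, j in pairs])

    observed = corr(list(range(n)))
    rng = random.Random(seed)
    hits = 0
    for _ in range(permutations):
        order = list(range(n))
        rng.shuffle(order)
        if corr(order) >= observed:
            hits += 1
    return observed, (hits + 1) / (permutations + 1)
```

Permuting allele labels, rather than individual pairs, respects the non-independence of pairwise values that share an allele, which is the reason a Mantel test is used instead of a standard correlation test.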
In this study we estimate ω for allele pairs or branches sampled within a single species, and over varying
timescales. Both features can bias the estimation of ω, as we now discuss.
Kryazhimskiy and Plotkin (2008) used analytical and simulation approaches to show that under positive
selection the behavior of ω within a single population is not a monotonic function of the intensity of selection, so
that ω within a population can be low, even under positive selection. This occurs because, when an advantageous
nonsynonymous variant is fixed in a population, nonsynonymous variation can be decreased due to the
homogeneity generated by the selective sweep. However, this scenario clearly does not apply to HLA genes,
where balancing selection maintains multiple nonsynonymous polymorphisms simultaneously segregating within
Another challenge to the interpretation of ω arises from the fact that many studies have shown that
genes under purifying selection show surprisingly high ω (often close to 1) when samples with short divergence
times are analyzed (e.g., those from a single population or species). For example, Rocha et al (2006) showed
that dN/dS between two samples is negatively correlated with their divergence times, and exemplified these
predictions with bacterial genomes. Likewise, a decrease of dN/dS with divergence time has been described in
Wolf et al (2009), but considering a much deeper timescale. Kryazhimskiy and Plotkin (2008) demonstrated
that this pattern is expected even under a regime of purifying selection that is constant over time. Thus, it
is plausible that the recent divergence times among alleles within HLA allelic lineages could result in inflated
intra-lineage ω values, explaining the modest differences between intra and inter-lineage ω values seen in the
phylogenetic analyses (Tables 2 and 3). To explore this issue further, we used non-ARS codons as an internal
control for this putative build-up of dN/dS at recent timescales, and to do so we compared their patterns
of variation to those of ARS codons. We found that non-ARS codons have larger intra-lineage ω values than
inter-lineage values, and also higher ω for terminal than internal branches (p < 0.05 for HLA-A in the intra
versus inter-lineage contrast, and for HLA-A and HLA-C in the tips versus internal contrast; LRT; Table
3). This distribution of ω values is in the exact opposite direction to that observed for the ARS (Table 2),
consistent with an effect of short divergence times inflating the estimates of ω (Kryazhimskiy and Plotkin
2008).
In order to formally test whether ARS and non-ARS codons have a different distribution of synonymous
and nonsynonymous changes intra and inter-lineages (or for internal and terminal branches) we employed a
contingency table approach similar to that of Templeton (1996). We used the inferred number of synonymous
(S ) and nonsynonymous (N ) changes on each branch of the allelic phylogeny from each locus to estimate the
total number of each type of change in a specific class of branches (see Figure 3 for a schematic representation
of the branch labeling). The odds ratio was defined as presented in the Methods. For all loci, we find that
OR > 1 for non-ARS codons (proportionally more nonsynonymous changes on the intra-lineage branches) and OR < 1
for ARS codons (proportionally more nonsynonymous changes on the inter-lineage branches), as shown in
Table 5. This finding is consistent with the maximum likelihood estimates of ω for branches (Tables 2 and 3),
and the increased pairwise ω inter-lineage, relative to intra-lineage (Figure 2). To test for differences between
ARS and non-ARS codons, we pooled the contingency tables of all loci (due to the fact that several cells
for individual loci had low counts) and rejected the null hypothesis that contingency tables from ARS and
non-ARS codons have the same OR (p-value = 0.0069; Breslow-Day test). Our analysis comparing internal
and terminal branches showed the same pattern, with proportionally more nonsynonymous changes in internal
branches for ARS codons (p-value = 0.00013; Breslow-Day test; Table 5).
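The contingency-table logic can be illustrated with a short sketch. The exact definition of the odds ratio is given in the Methods and is not reproduced in this section, so the form used here is an assumption chosen to match the stated interpretation (OR > 1 means proportionally more nonsynonymous changes on intra-lineage branches). The function names and all counts are hypothetical, and the Breslow-Day test itself is not implemented.

```python
def odds_ratio(n_intra, s_intra, n_inter, s_inter):
    """Odds ratio of nonsynonymous (N) to synonymous (S) changes.

    Assumed form: (N_intra / S_intra) / (N_inter / S_inter), so that
    OR > 1 indicates proportionally more nonsynonymous changes on
    intra-lineage branches and OR < 1 on inter-lineage branches.
    """
    return (n_intra / s_intra) / (n_inter / s_inter)

def pool(tables):
    """Sum the cells of several (n_intra, s_intra, n_inter, s_inter) tables."""
    return tuple(sum(cells) for cells in zip(*tables))

# Hypothetical per-locus counts, for illustration only.
ars_tables = [(10, 20, 30, 15), (12, 18, 28, 14), (8, 16, 20, 10)]
non_ars_tables = [(30, 10, 20, 20), (24, 12, 18, 18), (27, 9, 15, 15)]

print("ARS OR (pooled):", round(odds_ratio(*pool(ars_tables)), 2))
print("non-ARS OR (pooled):", round(odds_ratio(*pool(non_ars_tables)), 2))
```

Pooling cells across loci mirrors the step described in the text, which was motivated by low counts in several cells for individual loci.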
In summary, although there is evidence for an excess of inter-lineage nonsynonymous changes (or for
internal branches) for ARS codons, there is also an enrichment for intra-lineage nonsynonymous changes for
ARS codons, when compared to non-ARS codons (P < 0.001; Fisher’s exact test). Next, we discuss possible
Our analyses are based on allele sequences available in the IMGT/HLA database, which is a curated resource
to which newly discovered alleles are contributed. Because submission of every newly discovered allele to IMGT
is encouraged, this data set is likely to be biased with respect to population frequencies: very rare HLA alleles
probably make up a disproportionately larger fraction of it than of true population samples. We
therefore investigated if this bias influenced our findings. Specifically, we were concerned that the enrichment
for rare variants could result in an inflation of weakly deleterious nonsynonymous variants for recent divergence,
a well documented population genetic signature (Henn et al 2015). This signature could create an artificially
We found that only a subset of variable positions present in our IMGT-derived datasets are present in the
1000 Genomes Phase I low coverage data (Tables 6, 7, and 8 for HLA-A, HLA-B, and HLA-C, respectively).
This is in accordance with the greater degree of sampling of rare variants in the IMGT data set.
We next divided positions into two groups: those which are only variable within a single lineage (’INTRA’),
and those variable in more than one lineage (’INTER’). For comparison, a third group, which consists of all
variable sites (’OVERALL’), was also defined. We found that, considering all INTRA and INTER positions
present in the 1000G data, there is no significant difference in minor allele frequency (MAF) between the two
categories (Tables 6, 7, and 8). Furthermore, when we classify the 1000G HLA SNPs into low (MAF ≤ 0.1)
and high frequency (MAF > 0.1), we do not see an enrichment for low frequency variants within the "INTRA"
set of SNPs when compared to the "INTER" set (Wilcoxon test, not shown).
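The INTRA/INTER/OVERALL classification of variable positions can be sketched as below. This is an illustration under assumptions: the function name `classify_sites`, the input layout (aligned sequences grouped by allelic lineage), and the toy sequences are hypothetical. The text does not say how a position that differs between lineages but is invariant within every lineage should be labeled, so the sketch marks it "UNCLASSIFIED" rather than guessing.

```python
def classify_sites(alleles_by_lineage):
    """Label each variable alignment position as INTRA or INTER.

    alleles_by_lineage: dict mapping lineage name -> list of aligned,
    equal-length sequences. Returns {position: label} for the OVERALL
    set (positions variable across all sequences); invariant positions
    are omitted.
    """
    sequences = [s for seqs in alleles_by_lineage.values() for s in seqs]
    length = len(sequences[0])
    labels = {}
    for pos in range(length):
        if len({s[pos] for s in sequences}) < 2:
            continue  # invariant overall, not part of the OVERALL set
        variable_in = [lineage for lineage, seqs in alleles_by_lineage.items()
                       if len({s[pos] for s in seqs}) > 1]
        if len(variable_in) == 1:
            labels[pos] = "INTRA"         # variable within a single lineage
        elif len(variable_in) > 1:
            labels[pos] = "INTER"         # variable in more than one lineage
        else:
            labels[pos] = "UNCLASSIFIED"  # fixed difference between lineages
    return labels

# Toy alignment, for illustration only.
toy = {
    "lineage_1": ["ACGT", "TCGA"],
    "lineage_2": ["ACTT", "ACTA"],
}
print(classify_sites(toy))
```

The positions labeled here would then be mapped to genomic coordinates and looked up in the 1000G calls, a step that depends on files not described in this section.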
These results reassure us that the intra-lineage variation we observe is not biased in the direction of
extremely rare variants, and that our observation that there is evidence for stronger intra-lineage balancing
selection for ARS codons than for neutrally evolving regions (non-ARS) is not a spurious result driven by an
4 Discussion
Our study documents a positive correlation between dN/dS values and the degree of divergence between
allele pairs. This result is supported by phylogenetic analyses, which show higher ω values for branches
connecting different lineages, or branches which are internal to the phylogeny. A heterogeneous nonsynonymous
substitution rate (dN ) for HLA genes was also reported in a study which found that dN for ARS codons is
not linearly correlated with divergence time in classical HLA loci (Yasukochi and Satta 2014). By further
investigating the temporal dynamics in the DRB1 gene, these authors showed that this rate heterogeneity
is likely the consequence of a reduction in the substitution rates in specific allelic lineages, possibly as a
consequence of continuous selective pressure by a specific pathogen. In the present study our goal was to
explicitly test for heterogeneity in the ω ratios over a priori defined groups of alleles (the HLA allelic lineages)
and for timescales of divergence (low and high divergence). As was the case with the study of Yasukochi
and Satta (2014), we find heterogeneity in the intensity of selection, in our case with evidence of increased
selection at deeper timescales than at more recent ones, and for greater selection on inter-lineage branches of
the allelic phylogeny, with respect to intra-lineage branches. Our findings indicate that long-term balancing
selection has resulted in an enrichment for adaptive changes between allelic lineages for HLA class I genes,
with proportionally weaker signatures of molecular adaptation for recent (terminal and intra-lineage branches)
Although previous studies have shown that low divergence is often associated with inflated ω estimates (Rocha
et al, 2006), the phylogenetic analyses carried out in the present work relied on non-ARS codons as a control
to show that the low divergence times of intra-lineage contrasts do not explain the ω > 1 values within lineages,
at ARS codons. Thus, while we show that inter-lineage selection is stronger than intra-lineage selection, our
results also demonstrate that intra-lineage variation bears a signature of balancing selection.
Recently several papers have drawn attention to the effects of divergence times on dN/dS estimation (e.g.
Wolf et al 2009; Stoletzki and Eyre-Walker 2011), and the complexities of interpreting these values when data
is drawn from a single population (Rocha et al 2006; Kryazhimskiy and Plotkin 2008). Our finding of increased
ωARS among more divergent alleles (or for inter-lineage branches) is conservative in light of these findings, which
predict decreased ω for more divergent alleles. We accounted for this effect by using non-ARS codons, which
have a similar phylogenetic structure to that of ARS codons (after removal of recombinants) to control for
the background inflation of ω in recently diverged alleles, and found that ARS codons have a very different
distribution of ω, with increased inter-lineage evidence for selection, exactly the opposite to what is seen for
non-ARS codons.
An important caveat to this interpretation is that the temporal dynamics of dN/dS appears to be sensitive
to the selective regime which is assumed to be operating. Thus, while several authors have shown that, under
purifying selection, increased dN/dS at low divergence is expected, positive selection can produce a positive
correlation with divergence times (Dos Reis and Yang 2013; Mugal et al 2014), which could account for part of
the results we describe in this study. However, the case of directional positive selection, involving the sequential
substitution of adaptive mutations, is markedly different from the dynamics of a balanced polymorphism, as
Assuming that balancing selection has been the main selective regime shaping the molecular evolution
of HLA genes, and that heterozygote advantage is one of the mechanisms (even if not the only one) through
which selection has acted upon this system, our finding that inter-lineage ωARS is greater than intra-lineage
is consistent with the divergent allele advantage model, according to which heterozygotes for more divergent
alleles have higher fitness than those carrying similar alleles (Wakeland et al 1990). Under this model, excess
of inter-lineage nonsynonymous changes in HLA genes would be expected, which is a result we have shown for
the ARS data set. This model has been shown to explain patterns of variation in the DRB locus in Galapagos
sea lions, where local allelic divergence at this locus positively influences fitness directly (Lenz et al 2013),
and not mere heterozygosity or number of alleles at the MHC locus. Most likely several selective regimes have
shaped the evolutionary history of MHC genes, as suggested by previous observations, and our contribution
suggests that these selective regimes could be operating alongside divergent allele advantage.
Our results suggest that groups of functionally related alleles (in our analysis, the allelic lineages) should
be regarded as important targets of selection, rather than individual alleles. In line with our observations,
it has been proposed that HLA supertypes - groups of alleles sharing chemical properties at the B and F
pockets of the ARS region (Sidney et al 1996) - constitute the level of variation that is the primary target of
natural selection in HLA-B genes (Francisco et al 2015). Since there is a high overlap between allelic lineage
and supertype classifications (Sidney et al 1996), our results indicate that attempts to understand how natural
selection acts on HLA variation benefit by comparing the effects of selection on the allelic, allelic lineage or
Competing Interests
Author’s Contributions
BDB participated in the design of the study, performed analyses, discussed results and drafted the
manuscript. RSF performed analyses and participated in discussions. DM conceived of the study, participated in its design
and discussion and in the drafting of the manuscript. All authors read and approved the final manuscript.
Acknowledgements The authors thank Kelly Nunes for thoughtful comments on the manuscript, Richard Single for
comments on the statistical aspects of this work, Aida M. Andrés for general comments and Débora Y.C.Brandt for help
with the 1000 Genomes data sets. This work was supported by the São Paulo Research Foundation (grants #2008/09127-8
and #2011/12500-2 to BDB; #08/56502-6 to DM) and Conselho Nacional de Desenvolvimento Científico e Tecnológico
(#152676/2011-2 to BDB, #142130/2009-5 to RSF and #308960/2009-2 to DM). The final publication is available at
Springer via http://dx.doi.org/10.1007/s00239-015-9713-9
https://github.com/bbitarello/dNdS-hla-allelic-lineages
References
Albrechtsen A, Moltke I, Nielsen R (2010) Natural selection and the distribution of identity-by-descent in the
Anisimova M, Nielsen R, Yang Z (2003) Effect of recombination on the accuracy of the likelihood method for
Apps R, Qi Y, Carlson JM, Chen H, Gao X, Thomas R, Yuki Y, Del Prete GQ, Goulder P, Brumme ZL,
Brumme CJ, John M, Mallal S, Nelson G, Bosch R, Heckerman D, Stein JL, Soderberg Ka, Moody MA,
Denny TN, Zeng X, Fang J, Moffett A, Lifson JD, Goedert JJ, Buchbinder S, Kirk GD, Fellay J, McLaren
P, Deeks SG, Pereyra F, Walker B, Michael NL, Weintrob A, Wolinsky S, Liao W, Carrington M (2013)
Bjorkman PJ, Saper MA, Samraoui B, Bennett WS, Strominger JL, Wiley DC (1987) Structure of the human
Chelvanayagam G (1996) A roadmap for HLA-A, HLA-B, and HLA-C peptide binding specificities.
Immunogenetics 45(1):15–26
Dean M, Carrington M, O’Brien SJ (2002) Balanced polymorphism selected by genetic versus infectious human
Doherty PC, Zinkernagel RM (1975) Enhanced immunological surveillance in mice heterozygous at the H-2
Dos Reis M, Yang Z (2013) Why do more divergent sequences produce smaller nonsynonymous/synonymous
Felsenstein J (1989) PHYLIP - Phylogeny Inference Package (Version 3.2). Cladistics 5:164–166
Francisco RS, Buhler S, Nunes JM, Bitarello BD, França GS, Meyer D, Sanchez-Mazas A (2015) HLA supertype
variation in human populations: new insights about the role of natural selection on the evolution of HLA-A
Garrigan D, Hedrick PW (2003) Detecting adaptive molecular polymorphism : Lessons from the MHC.
Evolution (N Y) 57(8):1707–1722
Goldman N, Yang Z (1994) A codon-based model of nucleotide substitution for protein-coding DNA sequences.
Harris E, Meyer F (2006) The Molecular Signature of Selection Underlying Human Adaptations. Yearb Phys
Anthropol 130:89-130
Hedrick PW (2002) Pathogen resistance and genetic variation at MHC loci. Evolution (N Y) 56(10):1902–1908
Hedrick PW, Thomson G (1983) Evidence for balancing selection at HLA. Genetics 104(3):449–56
Henn B, Botigué LR, Bustamante C, Clark AG, Gravel S (2015) Estimating the mutation load in human
Hilton HG, Guethlein LA, Goyos A, Nemat-Gorgani N, Bushnell DA, Norman PJ, Parham P (2015)
Polymorphic HLA-C Receptors Balance the Functional Characteristics of KIR Haplotypes. J Immunol
195:3160-3170
Hughes AL, Nei M (1988) Pattern of nucleotide substitution at major histocompatibility complex class I loci
Hughes AL, Nei M (1989) Nucleotide substitution at major histocompatibility complex class II loci: evidence
Hughes AL, Yeager M (1998) Natural selection at major histocompatibility complex of vertebrates. Annu Rev
Genet pp 415–435
Huttley G, Smith MW, Carrington M, O’Brien S (1999) A scan for linkage disequilibrium across the human
Klein J, Sato A (2000) The HLA system. First of two parts. N Engl J Med 343(10):702–709
Kryazhimskiy S, Plotkin JB (2008) The Population Genetics of dN/dS. PLoS Genet 4(12):10
Lenz T, Mueller B, Trillmich F, Wolf JBW (2013) Divergent allele advantage at MHC-DRB through direct
and maternal genotypic effects and its consequences for allele pool composition and mating. Proc R Soc B
280: 20130714
Martin DP, Lemey P, Lott M, Moulton V, Posada D, Lefeuvre P (2010) RDP3: a flexible and fast computer
Meyer D, Thomson G (2001) How selection shapes variation of the human major histocompatibility complex:
Mugal CF, Wolf JBW, Kaj I (2014) Why time matters: codon evolution and the temporal dynamics of dN/dS.
Penn DJ, Damjanovich K, Potts WK (2002) MHC heterozygosity confers a selective advantage against
Pond SLK, Frost SDW, Muse SV (2005) HyPhy: hypothesis testing using phylogenies. Bioinformatics
21(5):676-679
Prugnolle F, Manica A, Charpentier M, Guégan JF, Guernier V, Balloux F (2005) Pathogen-driven selection
Robinson J, Halliwell Ja, McWilliam H, Lopez R, Parham P, Marsh SGE (2013) The IMGT/HLA database.
Rocha EPC, Smith JM, Hurst LD, Holden MTG, Cooper JE, Smith NH, Feil EJ (2006) Comparisons of dN/dS
are time dependent for closely related bacterial genomes. J Theor Biol 239(2):226–235
Saitou N, Nei M (1987) The neighbor-joining method: A new method for reconstructing phylogenetic trees.
Sidney J, Grey HM, Kubo RT, Sette A. (1996) Practical, biochemical and evolutionary implications of the
Single RM, Martin MP, Gao X, Meyer D, Yeager M, Kidd JR, Kidd K, Carrington M (2007) Global diversity
and evidence for coevolution of KIR and HLA. Nat Genetics 9:1114–1119
Slade R, McCallum H (1992) Overdominant vs. frequency-dependent selection at MHC loci. Genetics
132:861–864
Spurgin LG, Richardson DS (2010) How pathogens drive genetic diversity: MHC, mechanisms and
Stoletzki N, Eyre-Walker A (2011) The positive correlation between dN/dS and dS in mammals is due to runs
Takahata N, Nei M (1990) Allelic Genealogy Under Overdominant and Frequency-Dependent Selection and
Takahata N, Satta Y (1998) Footprints of intragenic recombination at HLA loci. Immunogenetics 47(6):430–441
Templeton AR (1996) Contingency tests of neutrality using intra/interspecific gene trees: the rejection of
neutrality for the evolution of the mitochondrial Cytochrome Oxidase II gene in the hominoid primates.
Genetics 144(3):1263–1270
The 1000 Genomes Project Consortium (2012) An integrated map of genetic variation from 1,092 human
Wakeland EK, Boehme S, She JX, Lu CC, McIndoe RA, Cheng I, Ye Y, Potts WK (1990) Ancestral
Polymorphisms of MHC Class II Genes : Divergent Allele Advantage. Immunol Res 9:115–122
Wolf JBW, Künstner A, Nam K, Jakobsson M, Ellegren H (2009) Nonlinear dynamics of nonsynonymous (dN)
and synonymous (dS) substitution rates affects inference of selection. Genome Biol Evol 1:308–319
Yang Z (2007) PAML 4: Phylogenetic Analysis by Maximum Likelihood. Mol Biol Evol 24(8):1586–1591
Yang Z, Swanson WJ (2002) Codon-Substitution Models to Detect Adaptive Evolution that Account for
Heterogeneous Selective Pressures Among Site Classes. Mol Biol Evol 19(1):49 –57
Yang Z, Wong WSW, Nielsen R (2005) Bayes empirical bayes inference of amino acid sites under positive
Yasukochi Y, Satta Y (2014) Nonsynonymous Substitution Rate Heterogeneity in the Peptide-Binding Region
                                                                               Codons
Locus   All alleles (a)  SM (R/NR) (b)  Pairwise (R/NR) (c)  BM pruned data set (d)  Total  Non-ARS  ARS
HLA-A   1193             144/107        138/104              93                      340    292      48
HLA-B   1799             233/78         173/71               63                      324    276      48
HLA-C   829              133/109        125/110              105                     341    293      48
Table 1 Number of alleles and codons for different data sets. a, all alleles available in release 3.1.0 (2010-07-15),
including possible recombinants; b, SM, data set used for site models, i.e., after selection of alleles with complete coding
sequences; c, R/NR, data sets with and without recombinants; d, BM (branch models) pruned data set is the NR data set
after pruning of alleles which do not cluster within their respective allelic lineages (see Methods)
                         non-ARS                                          ARS
Locus   Quantile (a)     dN     dS     ω (b)   dN/dS   dN > dS (d)        dN     dS     ω      dN/dS   dN > dS
HLA-A   average (c)      0.02   0.05   0.35    0.35    628 (6.64%)        0.12   0.07   1.36   1.74    7364 (77.90%)
        1                0.00   0.01   0.35    0.42    628                0.05   0.04   1.08   1.41    2132
        2                0.02   0.05   0.398   0.397   0                  0.12   0.06   1.47   1.94    2347
        3                0.02   0.06   0.37    0.37    0                  0.14   0.09   1.34   1.55    2316
        4                0.02   0.08   0.29    0.29    0                  0.15   0.08   1.50   1.97    2339
HLA-B   average (c)      0.01   0.04   0.33    0.30    470 (3.16%)        0.14   0.11   1.33   1.26    9908 (66.59%)
        1                0.01   0.02   0.46    0.46    470                0.10   0.09   1.17   1.08    2405
        2                0.01   0.03   0.35    0.35    0                  0.15   0.12   1.25   1.21    2460
        3                0.01   0.05   0.27    0.27    0                  0.15   0.13   1.28   1.18    2229
        4                0.02   0.06   0.25    0.25    0                  0.17   0.11   1.58   1.59    2814
HLA-C   average (c)      0.02   0.05   0.38    0.37    474 (6.12%)        0.07   0.02   1.22   3.04    6514 (84.05%)
        1                0.00   0.01   0.44    0.46    474                0.04   0.02   0.99   1.71    1303
        2                0.01   0.04   0.31    0.31    0                  0.07   0.02   1.04   3.52    1791
        3                0.02   0.06   0.41    0.41    0                  0.08   0.02   1.63   3.95    1810
        4                0.03   0.08   0.37    0.37    0                  0.09   0.03   1.55   3.35    1682
Table 4 Pairwise estimations for substitution rates (data sets prior to the removal of recombinants). a, quantiles of
divergence (dS non-ARS ); b, average pairwise dN/dS; c, bold refers to the average pairwise values for each locus; d, percentages
correspond to the proportion of pairs for which dN > dS in relation to the total number of pairwise comparisons
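The fold differences between ARS and non-ARS codons quoted in the Results can be recomputed from the per-locus averages in Table 4. The sketch below is only a consistency check on the rounded table entries; the names `TABLE4` and `fold_difference` are introduced here for illustration. Because the table values are rounded to two decimals, the recomputed ratios can differ slightly from those in the text (for instance, the rounded dS entries for HLA-C give a ratio of 0.4 rather than 0.5).

```python
# Per-locus average values copied from Table 4 (rounded as printed there).
TABLE4 = {
    "HLA-A": {"non_ars": {"dN": 0.02, "dS": 0.05, "omega": 0.35},
              "ars":     {"dN": 0.12, "dS": 0.07, "omega": 1.36}},
    "HLA-B": {"non_ars": {"dN": 0.01, "dS": 0.04, "omega": 0.33},
              "ars":     {"dN": 0.14, "dS": 0.11, "omega": 1.33}},
    "HLA-C": {"non_ars": {"dN": 0.02, "dS": 0.05, "omega": 0.38},
              "ars":     {"dN": 0.07, "dS": 0.02, "omega": 1.22}},
}

def fold_difference(locus, quantity):
    """Ratio of the ARS value to the non-ARS value for one quantity."""
    row = TABLE4[locus]
    return row["ars"][quantity] / row["non_ars"][quantity]

for locus in TABLE4:
    summary = ", ".join(
        f"{quantity}: {fold_difference(locus, quantity):.1f}-fold"
        for quantity in ("omega", "dN", "dS")
    )
    print(f"{locus}: {summary}")
```

The ω ratios show how much larger ωARS is than ωnon-ARS, while the separate dN and dS ratios indicate whether that difference comes from the numerator or the denominator.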
Set of SNPs   Var. Pos   Var. Pos. 1000g   MAF ≤ 0.1   MAF > 0.1   Mean MAF
Intra         68         29                5           24          0.15
Inter         88         55                12          43          0.14
Overall       156        84                17          67
Table 6 HLA-A: MAFs for SNPs in the 1000 Genomes dataset. Overall, set of variable positions considering all sequences
in the site models dataset after removal of recombinants. Intra, subset of the 'Overall' set which is variable only within one
allelic lineage for the locus. Inter, subset of the 'Overall' set which is variable within more than one allelic lineage. Var.Pos,
set of all variable positions in the site models dataset. Var.Pos.1000g, subset of Var.Pos which is a SNP in the 1000G
low-coverage Phase I data. MAF, minor allele frequency. For details, see Methods.
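The Intra and Inter definitions in this caption can be sketched in Python as follows. This is only an illustration under stated assumptions: alleles are given as aligned, equal-length strings grouped by lineage, the lineage names and sequences are made up, and positions that vary overall but not within any single lineage are reported separately because the caption does not say how they were assigned.

```python
def classify_variable_positions(lineages):
    """Classify variable alignment positions by where they vary.

    lineages: dict mapping a lineage name to a list of aligned sequences
    of equal length (hypothetical input layout).
    """
    all_seqs = [seq for seqs in lineages.values() for seq in seqs]
    result = {"intra": [], "inter": [], "between_only": []}
    for pos in range(len(all_seqs[0])):
        if len({seq[pos] for seq in all_seqs}) < 2:
            continue  # invariant position, not part of the 'Overall' set
        # Count the lineages within which this position is variable.
        n_variable = sum(
            1 for seqs in lineages.values() if len({seq[pos] for seq in seqs}) > 1
        )
        if n_variable == 1:
            result["intra"].append(pos)         # variable within exactly one lineage
        elif n_variable > 1:
            result["inter"].append(pos)         # variable within more than one lineage
        else:
            result["between_only"].append(pos)  # assumption: category not defined in the caption
    return result

# Made-up sequences for two hypothetical lineages.
example = {
    "lineage_1": ["ACGTA", "ACGAC"],
    "lineage_2": ["TCGTA", "TCCTC"],
}
print(classify_variable_positions(example))
```

Positions are 0-based alignment columns in this sketch; mapping them to codon or genomic coordinates is left out.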
Set of SNPs   Var.Pos   Var.Pos.1000g   MAF <= 0.1   MAF > 0.1   Mean MAF
Intra         44        24              6            18          0.30
Inter         59        38              8            30          0.39
Overall       103       62              14           48
Table 7 HLA-B: MAFs for SNPs in the 1000 Genomes dataset. Overall, set
of variable positions considering all sequences in the site models dataset after removal of recombinants. Intra, subset of the
'Overall' set which is variable only within one allelic lineage for the locus. Inter, subset of the 'Overall' set which is variable
within more than one allelic lineage. Var.Pos, set of all variable positions in the site models dataset. Var.Pos.1000g, subset
of Var.Pos which is a SNP in the 1000G low-coverage Phase I data. MAF, minor allele frequency. For details, see Methods.
Set of SNPs   Var.Pos   Var.Pos.1000g   MAF <= 0.1   MAF > 0.1   Mean MAF
Intra         78        27              8            19          0.26
Inter         68        55              19           36          0.24
Overall       146       82              27           55
Table 8 HLA-C: MAFs for SNPs in the 1000 Genomes dataset. Overall, set
of variable positions considering all sequences in the site models dataset after removal of recombinants. Intra, subset of the
'Overall' set which is variable only within one allelic lineage for the locus. Inter, subset of the 'Overall' set which is variable
within more than one allelic lineage. Var.Pos, set of all variable positions in the site models dataset. Var.Pos.1000g, subset
of Var.Pos which is a SNP in the 1000G low-coverage Phase I data. MAF, minor allele frequency. For details, see Methods.
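The MAF columns of Tables 6 to 8 amount to a simple tally, sketched below in Python. The sketch assumes biallelic SNPs described by two allele counts, uses the 0.1 threshold from the column headers, and reads the last column as a mean MAF. The function names and the example counts are hypothetical.

```python
from statistics import mean

def minor_allele_frequency(allele_counts):
    """MAF of a biallelic SNP given its two allele counts, e.g. (90, 10)."""
    return min(allele_counts) / sum(allele_counts)

def summarize_mafs(mafs, threshold=0.1):
    """Count SNPs at or below, and above, a MAF threshold and average the MAFs."""
    n_low = sum(1 for m in mafs if m <= threshold)
    return {
        "n_snps": len(mafs),
        "maf_le_threshold": n_low,
        "maf_gt_threshold": len(mafs) - n_low,
        "mean_maf": mean(mafs),
    }

# Made-up allele counts for four SNPs (not 1000 Genomes data).
counts = [(95, 5), (90, 10), (80, 20), (55, 45)]
print(summarize_mafs([minor_allele_frequency(c) for c in counts]))
```

Applying `summarize_mafs` separately to the Intra and Inter subsets would give one row each, with the Overall row obtained from the union.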
Fig. 1 Overlap between two ARS classifications and two site models studies. BJOR and CHEV are ARS classifications
(Bjorkman et al 1987; Chelvanayagam 1996); YANG is a list of codons with significant ω > 1 in HLA genes; BIT is the set of
codons with ω > 1 from our SM (site models) approach (see Materials and Methods for details)
Fig. 2 Pairwise estimates for intra-lineage and inter-lineage pairs of alleles. These results refer to ARS data sets prior to
the removal of recombinants, for pairwise analyses. Green, inter-lineage; purple, intra-lineage; gray, non-ARS; *, significant
difference between ω̄ (intra) and ω̄ (inter) (p < 0.001, Wilcoxon rank sum test)
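For readers unfamiliar with the test named in this caption, the following Python sketch implements a two-sided Wilcoxon rank sum (Mann-Whitney) test using the normal approximation with a tie correction and no continuity correction. It is a generic textbook version under those assumptions, not necessarily the implementation used for the figure, and the ω values in the example are made up.

```python
import math
from collections import Counter

def rank_sum_test(x, y):
    """Two-sided Wilcoxon rank sum test (normal approximation, tie-corrected).

    Returns (U statistic for x, z score, p-value). Assumes the pooled sample
    is not constant; otherwise the variance is zero.
    """
    n1, n2 = len(x), len(y)
    pooled = sorted(list(x) + list(y))
    # Assign each distinct value the average of the ranks it occupies.
    ranks = {}
    i = 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j] == pooled[i]:
            j += 1
        ranks[pooled[i]] = (i + 1 + j) / 2  # mean of ranks i+1 .. j
        i = j
    u1 = sum(ranks[v] for v in x) - n1 * (n1 + 1) / 2
    n = n1 + n2
    tie_term = sum(t ** 3 - t for t in Counter(pooled).values())
    variance = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    z = (u1 - n1 * n2 / 2) / math.sqrt(variance)
    p_value = math.erfc(abs(z) / math.sqrt(2))  # equals 2 * (1 - Phi(|z|))
    return u1, z, p_value

# Hypothetical pairwise omega values for intra- and inter-lineage pairs.
intra = [0.8, 1.1, 0.9, 1.0, 0.7]
inter = [1.4, 1.6, 1.2, 1.9, 1.5]
print(rank_sum_test(intra, inter))
```

With thousands of pairwise comparisons per locus, as in the data sets described here, the normal approximation is the usual choice; for very small samples an exact test would be preferable.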
Fig. 3 Schematic representation of the allelic phylogenies used in the branch models approach. Left: terminal vs internal
branches; right: intra-lineage vs inter-lineage branches. For the branch models approach, we labeled the branches of each tree
(HLA-A, -B and -C) as "intra/inter" or "terminal/internal" and ran model 2 (CODEML), which allows two independent ω values
to be estimated according to these labels
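The terminal vs internal labeling can be sketched in Python as below. The sketch assumes PAML's convention of marking one branch class with `#1` in the tree file (unlabeled branches form the other class), a Newick topology without branch lengths, and tip names free of Newick special characters. The helper name and the toy tree are hypothetical; the intra/inter labeling would additionally need each allele's lineage and is not covered.

```python
import re

def label_terminal_branches(newick, label="#1"):
    """Append a branch label to every tip of a Newick topology.

    Assumes no branch lengths and tip names without whitespace or any of
    the characters ( ) , : ; #. Internal branches are left unlabeled.
    """
    # A tip name directly follows "(" or ","; internal node labels follow ")".
    pattern = r"(?<=[(,])\s*([^\s(),:;#]+)"
    return re.sub(pattern, r"\1 " + label, newick)

# Toy topology with made-up tip names.
tree = "((a1,a2),(b1,b2));"
print(label_terminal_branches(tree))
```

The labeled tree would then be supplied to CODEML together with a control file selecting model 2, as described in the caption.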