
BGI MetaHIT Project

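Food

Microbes

Host

Studies show that the microbial community in the human gut and the human host form a symbiotic system, and that the presence of these microbes is closely linked to human health.
MetaHIT Project Goals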

Genome

Foods
Drugs Health or Diseases

Microbiome
BGI's Work

10% → 80%: expanded knowledge of the genes of the human gastrointestinal microbiota


Insert size: 140 bp, 180 bp, 350 bp, 500 bp, 2 kb, … , 10 kb
Read length: (2×)44 bp, (2×)75 bp, (2×)100 bp …
Throughput: tens of millions of reads per lane, still increasing

16S rDNA
Metagenome
Metatranscriptome
Metaproteome
Data Production in MetaHIT

                            2009.2                   Total
Sample Source               85 Danish, 39 Spanish    124
Data Production (Average for 124 samples)
Read Pairs (#)              31M
Read Length (bp)            75
Bases (bp)                  4.5G                     555.5G
High Quality Bases (bp)     3.2G                     390.8G

To date:
306 samples have been sequenced in MetaHIT;
73 samples come from the Chinese Diabetes study.

In total, 1.5 Tb of data.


Different samples have quite different microbial communities

We listed 64 high-frequency bacteria in the human gut
(read coverage > 1% in more than 90% of individuals).
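The selection rule above can be sketched in a few lines. This is only an illustration, assuming that "read coverage" is stored as a fraction between 0 and 1 for each genome in each individual; the function name, genome names and numbers are hypothetical and not from the project.

```python
def frequent_genomes(coverage, min_cov=0.01, min_frac=0.90):
    """Return genomes whose coverage exceeds min_cov in more than min_frac of individuals.

    coverage: dict mapping genome name -> list of per-individual coverage fractions
    (assumed layout, hypothetical data).
    """
    selected = []
    for genome, per_individual in coverage.items():
        hits = sum(1 for c in per_individual if c > min_cov)
        if hits / len(per_individual) > min_frac:
            selected.append(genome)
    return sorted(selected)


# Toy data for 10 individuals (made up for illustration).
coverage = {
    "genome_A": [0.30, 0.25, 0.40, 0.22, 0.35, 0.28, 0.31, 0.27, 0.33, 0.29],
    "genome_B": [0.30, 0.00, 0.40, 0.00, 0.35, 0.28, 0.31, 0.27, 0.33, 0.29],
}
print(frequent_genomes(coverage))  # ['genome_A']
```

genome_B passes the 1% threshold in only 8 of 10 individuals, so it is not counted as high-frequency.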
BGI’s current pipeline in MetaHIT
[Flowchart. Inputs: Sequenced Bacteria Genomes; Other Human Gut Metagenomics Data; BGI’s Illumina Data. Steps: Assembly; Gene Set; Non-redundant Gene Reference; Mapping. Outputs: Phylogenetic Classification; Functional Annotation (COGs, KOs, …); Digital Profiling.]
Merge all samples’ contigs

• Filtering: >500bp contigs


• Merge all unassembled reads from 124 samples and
assemble again.

Total Size (bp)   Number      N50 Size (bp)   N90 Size (bp)   Max. Length (bp)
10.7G             6,976,367   2,118           642             192,090
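For readers unfamiliar with the N50 and N90 columns, the sketch below shows how such statistics are usually computed from a list of contig lengths after the >500bp filter. The contig lengths are made-up toy values, and `nx` is a hypothetical helper name, not part of the project's pipeline.

```python
def nx(lengths, x):
    """Return the Nx length: walking contigs from longest to shortest,
    the length of the contig at which the running total first reaches
    x% of the total assembly size."""
    ordered = sorted(lengths, reverse=True)
    threshold = sum(ordered) * x / 100.0
    running = 0
    for length in ordered:
        running += length
        if running >= threshold:
            return length
    raise ValueError("empty contig list")


# Toy contig lengths in bp (hypothetical).
contigs = [100, 200, 300, 400, 500, 600, 700, 800, 900, 1000]
kept = [n for n in contigs if n > 500]  # the ">500bp contigs" filter

print(sum(kept), len(kept), nx(kept, 50), nx(kept, 90), max(kept))
# 4000 5 800 600 1000
```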
After assembly, over 80% of reads could be mapped to the assembled contigs
E ∩ B: 0.66%±0.08%
B: 0.83%±0.29%
S: 26.66%±1.26%
S ∩ E ∩ B: 8.16%±1.01%
E ∩ S: 0.82%±0.06%

(S ∩ E)-(S ∩ E ∩ B):
40.59%±1.86%

E: 4.08%±0.12%

Unmapped: 17.7%

S : Solexa dataset E : EMBL dataset B : Known Bacterial genomes


< 0.1% of reads mapped to the human genome
[Flowchart, repeated for Sample 1 … Sample 124: Sequencing → PE reads → Assembling → Contigs. Finding unmapped reads → Unassembled PE reads → Mix-assembling → Contigs. Removing redundancy → Unique Contigs. Finding solid overlap → Overlap Relationships → Contig Set. Removing redundancy → Unique Contig Set.]

Assembly is important!
Solexa reads usage

2 mismatches were allowed in the first 35 bp, while 4 and 6 mismatches were
allowed in the remaining bases for 44 bp reads and 75 bp reads, respectively.
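The mismatch rule can be written down as a small check. This is a sketch under one reading of the rule, namely that the 4 or 6 mismatches are counted only in the bases after position 35; the slides do not name the aligner settings, so the constants and function name here are illustrative assumptions.

```python
SEED_LEN = 35                        # first 35 bp of the read
SEED_MAX_MISMATCH = 2                # mismatches allowed in the first 35 bp
REST_MAX_MISMATCH = {44: 4, 75: 6}   # read length -> mismatches allowed after the first 35 bp


def passes_mismatch_rule(read, ref):
    """Check an ungapped read/reference pair of equal length against the rule."""
    if len(read) != len(ref):
        raise ValueError("read and reference segment must have equal length")
    if len(read) not in REST_MAX_MISMATCH:
        raise ValueError("only 44 bp and 75 bp reads are covered by the rule")
    seed = sum(a != b for a, b in zip(read[:SEED_LEN], ref[:SEED_LEN]))
    rest = sum(a != b for a, b in zip(read[SEED_LEN:], ref[SEED_LEN:]))
    return seed <= SEED_MAX_MISMATCH and rest <= REST_MAX_MISMATCH[len(read)]


ref = "A" * 44
ok_read = "CC" + "A" * 33 + "CCCC" + "A" * 5   # 2 seed mismatches, 4 in the rest
bad_read = "CCC" + "A" * 41                    # 3 seed mismatches
print(passes_mismatch_rule(ok_read, ref), passes_mismatch_rule(bad_read, ref))
# True False
```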
Map other data to total Contig Set

> 90% of read length and > 90% identity
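The two thresholds can be expressed as a simple filter on each alignment. The sketch below assumes that identity is measured over the aligned part of the read; the field names and numbers are hypothetical and do not come from any particular aligner's output format.

```python
def keep_alignment(read_len, aligned_len, matches, min_cov=0.90, min_ident=0.90):
    """Keep an alignment only if it spans > min_cov of the read
    and > min_ident of the aligned bases are identical to the contig."""
    if read_len <= 0 or aligned_len <= 0:
        return False
    coverage = aligned_len / read_len
    identity = matches / aligned_len
    return coverage > min_cov and identity > min_ident


# Hypothetical 400 bp read (e.g. a long 454 or Sanger read).
print(keep_alignment(400, 380, 350))  # True: 95% of the read, ~92% identity
print(keep_alignment(400, 300, 290))  # False: only 75% of the read aligned
print(keep_alignment(400, 380, 330))  # False: ~87% identity
```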


Our contig set is more complete

Solexa reads (124 EU individuals)


454 reads (18 US individuals)
Sanger reads (13 Japanese individuals)
Our contig set covers some frequent bacteria in the human gut
at high coverage
Gene prediction

• First, construct a non-redundant reference gene set
for the human gut microbiome.
• Total number: 3,299,822
• Total length: 2.3 Gb
• Complete genes: 1,526,558
• Partial genes: 1,773,264

• The total gene number was estimated at
3-4M (relative abundance > 10^-6).
Our gene catalog is more complete
Phylogeny composition
Deep sequencing confirmed newly found
species
• The H2-producing Prevotellaceae (a family
belonging to the Bacteroidetes)
• The H2-using Methanobacteriales (an order of
methanogenic archaea)
• Methanogens increase the extraction of energy
by the host from otherwise indigestible
polysaccharides
Zhang, H. et al. Proc. Natl. Acad. Sci. USA, 2009
Gene families in our gene catalog
OGs + Novel gene families

Known + Unknown OGs

Known OGs
Core-metagenome in human gut
You work together with your gut genomes!

KEGG pathway: metabolism-related pathways

Green: Human Gut
Red: Human
Association Study
Gene profiling table

            Gene 1   Gene 2   …
Sample 1    X11      X12      …
Sample 2    X21      X22      …
Sample 3    …        …        …

Phenotype variable table

            Var 1    Var 2    …
Sample 1    Y11      Y12      …
Sample 2    Y21      Y22      …
Sample 3    …        …        …

How to find the associations between genotype and phenotype?
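One simple way to relate the two tables is to score each gene column against one phenotype column. The sketch below uses a plain Pearson correlation as the score; this is an assumption chosen for illustration, not necessarily the test used in the study, and the gene names and values are toy data.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length lists (0.0 if either is constant)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def rank_genes(profile, phenotype):
    """profile: gene -> abundance per sample (one column of the gene profiling table).
    phenotype: one column of the phenotype variable table, same sample order."""
    scores = {gene: pearson(values, phenotype) for gene, values in profile.items()}
    return sorted(scores.items(), key=lambda kv: abs(kv[1]), reverse=True)


# Toy data for 4 samples (hypothetical).
profile = {
    "gene_1": [1.0, 2.0, 3.0, 4.0],
    "gene_2": [2.0, 2.0, 2.0, 2.0],
    "gene_3": [4.0, 3.0, 2.0, 1.0],
}
bmi = [20.0, 25.0, 30.0, 35.0]
for gene, r in rank_genes(profile, bmi):
    print(gene, round(r, 3))
```

Each gene is scored on its own here, so combinations of genes are not captured.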
Crohn’s Disease
PCA analysis for 85 Danish samples
BMI related genes
208 genes with p-value < 1e-3

Genes enriched in obese samples are involved in:
Fatty acid biosynthesis
Carbon fixation in photosynthetic organisms
Benzoate degradation via hydroxylation
Glycine, serine and threonine metabolism
Naphthalene and anthracene degradation
Genes enriched in lean samples are involved in:
Phenylalanine, tyrosine and tryptophan biosynthesis
Purine metabolism
Glycosaminoglycan degradation
Lipopolysaccharide biosynthesis
Sphingolipid metabolism
Starch and sucrose metabolism
Oxidative phosphorylation
Androgen and estrogen metabolism
One carbon pool by folate
PCA analysis for all samples
Crohn’s disease related genes
368 genes with p-value < 1e-7
Genes enriched in disease samples are involved in:
Androgen and estrogen metabolism
Ethylbenzene degradation
Valine, leucine and isoleucine degradation
Glutathione metabolism
Urea cycle and metabolism of amino groups
Benzoate degradation via CoA ligation
Methionine metabolism
Genes underrepresented in disease samples are involved in:
Nicotinate and nicotinamide metabolism
Peptidoglycan biosynthesis
Stilbene, coumarine and lignin biosynthesis
Cysteine metabolism
Nitrobenzene degradation
2,4-Dichlorobenzoate degradation
Ubiquinone biosynthesis
Phenylalanine metabolism
Histidine metabolism
Folate biosynthesis
Riboflavin metabolism
Starch and sucrose metabolism
N-Glycan biosynthesis
The current methods
• Single gene analysis (test, correlation)
Advantage: Simple to compute
Disadvantage: Treat all genes independently, can’t
find the correlations among genes
• Multivariable statistics analysis (PCA, CCA,
PLS)
Advantage: Can find the combinations of some
genes
Disadvantage: Unable to handle very large datasets,
and very slow
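To make the multivariable option concrete, here is a minimal sketch of the first principal component of a samples × genes table. It uses power iteration in pure Python, which is a technique swapped in for illustration; the slides do not say how the PCA was computed, and the data below are made up.

```python
import math
import random


def first_pc(rows, iters=200, seed=0):
    """First principal component (unit vector) of rows = samples x features.

    Power iteration on X^T X, applied as two matrix-vector products so the
    features x features covariance matrix is never built explicitly.
    """
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]

    rng = random.Random(seed)            # random start avoids an unlucky orthogonal vector
    v = [rng.gauss(0.0, 1.0) for _ in range(d)]
    norm = math.sqrt(sum(x * x for x in v))
    v = [x / norm for x in v]

    for _ in range(iters):
        scores = [sum(c[j] * v[j] for j in range(d)) for c in centered]             # X v
        w = [sum(centered[i][j] * scores[i] for i in range(n)) for j in range(d)]   # X^T (X v)
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:                  # all samples identical: no variance to explain
            break
        v = [x / norm for x in w]
    return v


# Toy table: 4 samples, 2 genes, varying along the direction (1, 2).
rows = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0], [4.0, 8.0]]
pc = first_pc(rows)
print([round(abs(x), 3) for x in pc])  # [0.447, 0.894]
```

Each iteration touches every entry of the table once, which hints at why such methods become slow when the table holds millions of genes.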
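Project workflow

Project design: biological background, scientific thinking (logic, creativity)

Project execution: broad skills or a specialty (computer science, mathematics, biology, etc.), a sense of coordination, team spirit

Project reporting: biological knowledge, scientific thinking, English proficiency, communication skills, aesthetic sense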
