Escolar Documentos
Profissional Documentos
Cultura Documentos
ORIGINAL ARTICLE
Carlos W Nossa, William E Oberdorf, Liying Yang, Jørn A Aas, Bruce J Paster, Todd Z DeSantis, Eoin L Brodie,
Daniel Malamud, Michael A Poles, Zhiheng Pei
Carlos W Nossa, William E Oberdorf, Michael A Poles, primers for use in high throughput sequencing to clas-
Department of Medicine, New York University School of Medi- sify bacteria isolated from the human foregut microbi-
cine, New York, NY 10016, United States ome.
Liying Yang, Department of Pathology, New York University
School of Medicine, New York, NY 10016, United States
METHODS: A foregut microbiome dataset was con-
Jørn A Aas, Bruce J Paster, Department of Molecular Genet-
ics, The Forsyth Institute, Boston, MA 02115, United States
structed using 16S rRNA gene sequences obtained from
Jørn A Aas, Faculty of Dentistry, University of Oslo, PO Box oral, esophageal, and gastric microbiomes produced by
1052 Blindern, 0316 Oslo, Norway Sanger sequencing in previous studies represented by
Bruce J Paster, Harvard School of Dental Medicine, Boston, 219 bacterial species. Candidate primers evaluated were
MA 02115, United States from the European rRNA database. To assess the effect
Todd Z DeSantis, Eoin L Brodie, Lawrence Berkeley National of sequence length on accuracy of classification, 16S
Laboratory, Center for Environmental Biotechnology, Berkeley, rRNA genes of various lengths were created by trimming
CA 94720, United States the full length sequences. Sequences spanning various
Daniel Malamud, New York University College of Dentistry, hypervariable regions were selected to simulate the
New York, NY 10016, United States
amplicons that would be obtained using possible primer
Zhiheng Pei, Department of Pathology and Medicine, New
pairs. The sequences were compared with full length
York University School of Medicine, New York, NY 10016,
United States; Department of Veterans Affairs New York Har- 16S rRNA genes for accuracy in taxonomic classification
bor Health System, New York, NY 10010, United States using online software at the Ribosomal Database Project
Author contributions: Pei Z and Nossa CW designed this study; (RDP). The universality of the primer set was evaluated
Nossa CW, Oberdorf WE, Yang L and Aas JA performed com- using the RDP 16S rRNA database which is comprised
putational analyses; Nossa CW, Pei Z, Paster BJ, DeSantis TZ, of 433 306 16S rRNA genes, represented by 36 phyla.
Brodie EL, Malamud D and Poles MA participated in manuscript
preparation. RESULTS: Truncation to 100 nucleotides (nt) down-
Supported by (in part) Grants UH2CA140233 from the Human stream from the position corresponding to base 28 in
Microbiome Project of the NIH Roadmap Initiative and National the Escherichia coli 16S rRNA gene caused misclassifi-
Cancer Institute; R01AI063477 from the National Institute of
cation of 87 (39.7%) of the 219 sequences, compared
Allergy and Infectious Diseases; DE-11443 from the National In-
stitute of Dental and Craniofacial Research; and U19DE018385 with misclassification of only 29 (13.2%) sequences
from the National Institute of Dental & Craniofacial Research with truncation to 350 nt. Among 350-nt sequence
Correspondence to: Zhiheng Pei, MD, PhD, Department of reads within various regions of the 16S rRNA gene,
Veterans Affairs New York Harbor Health System, 423 E 23rd the reverse read of an amplicon generated using the
Street, New York, NY 10010, 343F/798R primers had the least (8.2%) effect on clas-
United States. zhiheng.pei@nyumc.org sification. In comparison, truncation to 900 nt mimick-
Telephone: +1-212-9515492 Fax: +1-212-2634108 ing single pass Sanger reads misclassified 5.0% of the
Received: March 12, 2010 Revised: April 16, 2010 219 sequences. The 343F/798R amplicon accurately
Accepted: April 23, 2010 assigned 91.8% of the 219 sequences at the species
Published online: September 7, 2010
level. Weighted by abundance of the species in the
esophageal dataset, the 343F/798R amplicon yielded
similar classification accuracy without a significant
loss in species coverage (92%). Modification of the
Abstract 343F/798R primers to 347F/803R increased their uni-
AIM: To design and validate broad-range 16S rRNA versality among foregut species. Assuming that a typical
polymerase chain reaction can tolerate 2 mismatches have been developed. The most commonly used culture-
between a primer and a template, the modified 347F independent method relies on amplification and analysis
and 803R primers should be able to anneal 98% and of the 16S rRNA genes in a microbiome[1]. 16S rRNA
99.6% of all 16S rRNA genes in the RDP database. genes are widely used for documentation of the evolu-
tionary history and taxonomic assignment of individual
CONCLUSION: 347F/803R is the most suitable pair of organisms[2-6] because they have highly conserved regions
primers for classification of foregut 16S rRNA genes but for construction of universal primers and highly variable
also possess universality suitable for analyses of other regions for identification of individual species[7]. Analysis
complex microbiomes. of a microbiome is classically performed by amplifying
all 16S rRNA genes in a sample, cloning the 16S ampli-
© 2010 Baishideng. All rights reserved.
cons into a vector transformed into Escherichia coli (E. coli),
randomly selecting transformed colonies, producing high
Key words: Foregut; Microbiome; 16S; 454 sequencing;
Primer
copies of the amplicon containing plasmid, purifying the
plasmids, and sequencing the 16S rRNA inserts by the
Peer reviewers: Richard Anderson, PhD, Department of Anat- Sanger method[8].
omy and Cell Biology, University of Melbourne, Victoria 3010, Advantages of this protocol are potential identifica-
Australia; Jong Park, PhD, MPH, MS, Associate Professor, tion of both cultivatable and uncultivable organisms and
Division of Cancer Prevention and Control, H. Lee Moffitt Can- acceptable adequacy of single pass Sanger sequencing of
cer Center, College of Medicine, University of South Florida, 900-1000 bases for classifying bacteria. Disadvantages in-
12902 Magnolia Dr. MRC209, Tampa, FL 33612, United States
clude annealing bias[9], cloning bias[10], as well as time and
Nossa CW, Oberdorf WE, Yang L, Aas JA, Paster BJ, DeSantis
expenses. The high cost associated with this approach has
TZ, Brodie EL, Malamud D, Poles MA, Pei Z. Design of 16S been prohibitive for in-depth sampling of a complex mi-
rRNA gene primers for 454 pyrosequencing of the human foregut crobiome. Inadequate sampling overlooks low abundance
microbiome. World J Gastroenterol 2010; 16(33): 4135-4144 species. As a result, a low abundance species that could be
Available from: URL: http://www.wjgnet.com/1007-9327/full/ a microbiome core species often cannot be consistently
v16/i33/4135.htm DOI: http://dx.doi.org/10.3748/wjg.v16. observed amongst different individuals. For example, a re-
i33.4135 cent 16S rRNA gene survey of the human distal esopha-
geal microbiome yielded 6800 sequences but revealed only
one of the 166 species, Streptococcus mitis, shared by all 34
subjects sampled[11].
INTRODUCTION High throughput sequencing technology introduces a
new way to perform microbiome surveys without limita-
Currently there is a broad international collaboration tions of cloning/Sanger sequencing. One 454 sequenc-
underway aimed at defining the microbiome (http://ni- ing run can produce 1.2 million sequences in 8 h [12],
hroadmap.nih.gov/hmp/) on the internal and external which would require months or years of work with older
surfaces of the human body [Human Microbiome Project methods. Because many sequences are acquired in one
(HMP)]. The goals of the HMP collaborative project are run, the unit cost per sequence read is a very small frac-
to determine a core microbiome, assess how it changes tion of that for Sanger sequencing. This improvement
during disease, and establish whether a change in the mi- allows sufficient sampling of a complex microbiome at
crobiome is associated with disease. It is anticipated that affordable cost. The new technology also eliminates the
knowledge gained from the HMP will shed light on the cloning bias by directly sequencing the 16S rRNA genes
etiology and pathogenesis of idiopathic chronic diseases generated by polymerase chain reaction (PCR). There-
that have a high impact on human health. fore, high throughput sequencing is ideal if adaptable to
The human body is a highly complex conglomeration meet the requirements needed for microbiome work.
of human cells, bacteria, Archaea, fungi, and viruses, in The main limitation of high throughput sequencing
which the number of microbes outnumbers the human is read length. Reads from next generation sequencing
cells by a factor of 10 to 1. Despite their intimate associ- technologies are considerably shorter than those from
ation, the microbial influence upon human development, Sanger sequencing. Illumina’s Solexa and Applied Biosys-
physiology, immunity, and nutrition remains largely tem’s SOLiD platforms generate reads of about 25-100
unstudied. This can be partly attributed to the lack of bases[13,14], while 454 sequencing technology reads up to
robust techniques needed for exploring a complex mi- 400-500 bases per sequence. The general approach to ana-
crobial community. lyzing microbiomes in health and disease focuses at 2 lev-
Previous attempts to fully characterize the human els. Population level analyses, such as the Fst test, P test,
microbiome have had certain limitations. The classi- Unifrac analysis, clustering analysis, and normal reference
cal method of culturing bacteria from human subjects range, relate samples of microbiomes by combined genet-
excludes a large number of unculturable or not-yet- ic distance between the samples. These types of analyses
cultivated bacteria, and also misrepresents the abundance are relatively insensitive to variation in read length[15]. In
of some species due to selection by culture conditions. To Unifrac clustering analysis, reads of 100-200 nucleotides
overcome these drawbacks, culture-independent methods can yield the same clustering as full-length sequences if
the correct regions are chosen for sequencing. Detailed pseudopneumoniae and Streptococcus pneumoniae[28], as the two
analyses at the level of the operational taxonomic unit species differ by only 5 base pairs (bp) between their 16S
(OTU) depend on read length - the longer the more ac- rRNA genes corresponding to a 0.03% difference (16S-
curate[16]. Compared with full length sequences, single pass based operational threshold for separation between two
reads of approximate 900 bases (of full length 1500-1600) species is 3% diversity), they were considered as one spe-
from Sanger sequencing is associated with a slight loss of cies in this study. As a result, 219 species were represented
classification accuracy, which has been acceptable consid- in the dataset.
ering the significant reduction in sequencing cost from
sequencing the entire 16S rRNA genes. Sequence alignment
In the recent studies utilizing 454 sequencing tech- The 219 16S rRNA gene sequences were aligned using
nology to perform 16S rRNA gene surveys of micro- the NAST alignment program[29]. The program was set
biomes [17-20], a major concern has been reduction of to recognize sequences at least 1250 bases long with at
classification accuracy with short sequence reads. Several least 75% identity. Because NAST may remove bases not
strategies have been tried to maximize the information found to be sufficiently homologous, and thus unalign-
obtained from short sequences. One is to utilize a paired- able, several sequences were truncated after alignment,
end sequencing strategy to increase sequence length[21]. particularly at the 5’ end. The missing portion of these
Another is to target certain hypervariable regions (HVR) sequences was manually replaced after alignment with
that are most informative for a specific microbiome of the corresponding region from the GenBank sequence
interest. Currently, various HVRs, individually[17,19] or in of the same species so that all 16S sequences would be
combination[18,22], have been used in analysis of a microbi- complete, beginning from position 28 of the E. coli 16S
ome but their efficacies often are not validated. rRNA gene.
Early studies using Sanger sequencing revealed in-
teresting associations between human microbiomes and Amplicon design
disease, such as that seen between the human foregut mi- To simulate the sequence data that would be obtained us-
crobiome, dental/periodontal diseases and gastroesopha- ing specific primer pairs, representative amplicons were
geal reflux diseases (GERD)[11,23,24]. To further study such constructed from full length 16S rRNA gene sequences.
associations, the Human Microbiome Project[25] (part of The criteria for design of the representative amplicons
the NIH Roadmap Initiative) currently supports in-depth was based on 454 requirements. Amplicons could be
analysis of the association between the human microbi- no more than 600 bases in total (including primers and
ome of various anatomical sites and related diseases. Our nucleotide barcode) for optimal emulsion PCR. Because
area of focus is the foregut microbiome during disease on average 454 read lengths are approximately 400 bases,
progression from GERD to esophageal adenocarcinoma only the portion of the amplicon that would be sequenced
(the fastest increasing cancer in the Western world). As the was considered. For example, for a forward read amplicon
first step of the foregut microbiome project, we identified of a total 500 base pairs with 50 bases of forward primer
the most informative HVRs, and designed and validated and nucleotide barcode, only the first 350 bases after the
a broad range primer set most suitable for the foregut primer would be read, with the final 50 bases ignored.
microbiome. These will be used to facilitate our in-depth These amplicons were designed using 6 universal primer
analysis of the human foregut microbiome and its role in sets from the European Ribosome Database[30] as shown
GERD-related disease progression. in Table 1.
Sequence trimming
MATERIALS AND METHODS The simulated amplicons were generated by trimming
Sequence collection full length sequences of the 219 foregut species using
Sequences collected for our in silico analysis were obtained the MEGA version 4 program (MEGA4) [31]. Aligned
from separate, previously conducted 16S rRNA gene sur- FASTA sequences were uploaded onto MEGA4, with
veys of 3 foregut sites. Esophageal 16S sequences (6800) the sequence file including an aligned E. coli 16S rRNA
were obtained from research by Yang et al[11], oral 16S gene sequence as a positional template.
rRNA gene sequences (2458) were from research by Aas To simulate data that would be obtained by cloning/
et al[26], and gastric 16S rRNA gene sequences (839) were Sanger sequencing methods, amplicons of approximately
from research by Bik et al[27], totaling 10 097 sequences. 900 bases downstream of the starting position were first
In addition, we removed sequences that were derived generated for the 219 sequences. These amplicons were
from patient gastric samples with Helicobacter pylori (300), used to theoretically compare how reads obtained by
chimera sequences (127), as well as sequences with more 454 sequencing technologies would compare to Sanger
than 8 bases missing after E. coli position 27 (186). The sequence reads. The 8F primer was located in the 219
final foregut dataset contained 9484 sequences (2373 aligned sequences (bases 8-27) and all bases upstream of
oral, 6626 esophageal, 485 gastric). These sequences were 28 were deleted. The gaps from the aligned sequences
represented by 220 species. Because 16S rRNA-based were then removed, and all bases downstream of 928
operational classification criterion does not have sufficient were deleted leaving the sequences between positions 28
discriminatory power to differentiate between Streptococcus and 928 in E. coli (900 bases total) as a reference for the
Table 1 Primers used in the study Sequence classification and comparison of amplicon
accuracy
Primer # bases Sequence (5’ to 3’) Species with Sequence trims for size and designed amplicons were clas-
identical
match (%)
sified at the phylum to genus level using RDPⅡ Classi-
fier[32] and at the species level using RDPⅡ Seqmatch[33].
Forward primers
8F 20 AGAGTTTGATCCTGGCTCAG n/a1
For classification at the phylum to genus level, FASTA
343F 15 TACGGRAGGCAGCAG 99.1 files were uploaded onto the RDPⅡ Classifier at 80%
517F 17 GCCAGCAGCCGCGGTAA 93.2 confidence threshold and results were viewed at a display
784F 15 AGGATTAGATACCCT 90.0 depth of 7 to see assignment data down to the genus
917F 16 GAATTGACGGGGRCCC 84.0
level. The resultant classification dataset for the trimmed
1099F 16 GYAACGAGCGCAACCC 73.1
Reverse primers sequences was individually compared to the dataset ob-
534R 18 ATTACCGCGGCTGCTGGC 91.8 tained using full length sequences and/or 900 bp Sanger
798R 15 AGGGTATCTAATCCT 90.0 length sequences. The number of discrepant assignments
926R 20 CCGTCAATTYYTTTRAGTTT 83.0
was recorded.
1114R 16 GGGTTGCGCTCGTTRC 74.9
1407R 16 GACGGGCGGTGTGTRC 91.3
For classification at the species level, FASTA files were
1541R 20 AAGGAGGTGATCCAGCCGCA n/a1 uploaded onto RDPⅡ Seqmatch. The following parame-
ters were selected: typed and non typed strains, uncultured
1
Sequences near the ends of the 16S rRNA genes are often designed for and isolates, ≥ 1200 bases, good quality, nomenclature
primer binding and are not included in sequences deposited in GenBank. taxonomy. The species classification of every sequence
n/a: Not available. R = A, G; Y = C, T.
was individually analyzed and compared to the species
that the full length version of the sequence was assigned
Sanger sequencing data that would be obtained for the to, as described previously[11].
219 species.
For length-based amplicon comparison, aligned se- Homology between universal primers and foregut
quences were again trimmed upstream of E. coli position bacterial species
28, representing the first base after the 8F forward primer. Universal 16S rRNA primers were obtained from the Eu-
This E. coli position 28 was set as base #1. All gaps were ropean ribosomal RNA database[30]. The primer sequences
then deleted from the aligned sequences and downstream are shown in Table 1. Determination of the percentage
bases were trimmed to leave amplicons with 900, 800, of the 219 foregut species with the exact primer sequence
700, 600, 500, 400, 300, 200, 100 and 50 bases. These were was done using MEGA4 software[31].
compared against full length 16S rRNA gene sequences
for classification accuracy. Evaluation of the universality of the foregut primers in
For amplicons based on different primer sets (within the domain Bacteria
varying regions of the 16S rRNA gene), sequences of re- The optimized primers were searched using Probe Match
verse and forward reads from both ends of the amplicons against the bacterial 16S rRNA gene database at Ribo-
were analyzed. To trim full length sequences to the am- somal Database Project (http://rdp.cme.msu.edu).
plicons of interest, the position of the forward or reverse
primer was first located by searching for the primer se-
quence. For theoretical forward reads of the amplicons, the RESULTS
sequence upstream and including the forward primer was
Accuracy of taxonomical classification is dependent on
deleted. Gaps were then deleted in the aligned sequences
amplicon length
leaving all sequences with the base after the forward prim-
To compare shorter reads that would be obtained with
er as base #1. From this position, the site of 350 bases
downstream of position #1 was determined, and all bases high throughput sequencing against longer reads that
downstream of #350 were deleted. This represented 454 would be obtained using Sanger sequencing, we per-
sequencing’s ability to read the 350 bases downstream of formed in silico analysis; looking at the capability of dif-
the primer (after the first 50 or so bases of the primers and ferent length 16S rRNA gene sequences to accurately
barcodes were read). For theoretical reverse reads of am- classify foregut bacteria. Using full length sequences from
plicons, the position of the reverse primer was located as 219 representative foregut species, we created sequence
before. From the 3’ end of the reverse primer complement truncations that were 50-900 bases long beginning from
(sense strand) the first base of the sequence of interest base 28, the first base after the 8F primer. A length of
was designated position #1’. Using the E. coli 16S rRNA 900 bases was chosen as the longest sequence to analyze
gene as a reference, the site of 360 bases upstream of base because it is the length of a single pass Sanger sequence
#1’ was determined and set as #360’. All bases upstream and generally accepted as accurate taxonomically. Each of
of #360’ were deleted. The gaps in the aligned sequences the truncated versions of the 16S rRNA gene sequences
were then deleted and the number of bases in each se- was analyzed using RDPⅡ classifier down to the genus
quence was determined. For any sequences with more than level. The results showed that loss of classification accu-
350 bases, extra bases were removed manually from the 5’ racy was length dependent (Figure 1). At the genus level,
end so that all sequences were 350 bases long. All sequence sequences as small as 200 bases showed relatively good
trims were then saved as FASTA files for analysis. classification accuracies (94.1% accuracy for 900 base se-
100
Table 2 Amplicons designed for analysis
75
% accuracy
Amplicon C’ 906-556
Figure 2 Design of amplicons for in silico evaluation. Amplicons designed using the 6 universal primer sets as described in Table 3 were evaluated for theoreti-
cal forward and reverse reads. The schematic shows the relative position of each amplicon read and read direction, as well as which bases would be included in the
sequence. The positional template is the Escherichia coli (E. coli) 16S rDNA gene with the approximate locations of the hypervariable regions labeled.
Table 3 Accuracy of taxonomic classification of 219 foregut species using 350-bp sequences
Forward reads
A 97.7/97.7 96.3/95.9 95.9/95.4 93.2/94.5 87.7/86.8
B 99.1/99.1 98.2/96.8 97.7/96.3 97.3/95.9 91.8/89.0
C 99.5/99.5 98.2/96.8 97.7/96.8 97.3/95.9 90.4/88.1
D 98.6/99.1 98.6/98.2 97.3/97.3 95.9/96.3 88.1/86.8
E 98.2/98.6 97.7/98.2 95.4/95.9 92.2/93.6 85.4/84.9
F 97.7/98.2 96.3/97.3 95.0/95.4 91.8/93.6 83.6/84.5
Reverse reads
A’ 98.6/98.6 97.3/97.7 97.3/97.7 95.0/96.3 90.0/90.9
B’ 99.5/99.5 98.2/96.8 97.7/96.8 97.7/96.3 93.6/91.8
C’ 99.5/99.5 98.2/96.8 97.7/96.8 97.3/95.9 91.3/90.0
D’ 99.5/100 98.6/99.5 96.8/98.2 94.5/96.3 87.2/89.5
E’ 98.2/98.6 96.8/98.2 94.5/96.3 91.3/94.1 83.6/87.2
F’ 98.2/98.6 96.8/98.2 95.0/96.4 92.2/95.0 82.6/85.8
6800). The oral and gastric sequences were not analyzed primer. This new primer, designated as 347F is a 19mer
because some were too short to span the full amplicon but this modification resulted in only 113 sequences
B’ region. We compared the taxonomical classifications (51.6%) with 100% homology. The 798R primer was
obtained using the amplicon B’ truncations to those ob- also lengthened to base 803, but shortened from 784 to
tained using the original sequences (approximate 900 bp 785 to ensure CG at 5’ end of replication and to bring
in length). Amplicon B’ accurately classified 6332 of the the melting temperature closer to the forward primer’s.
6800 sequences (93.1% accuracy) at the species level. The This modified primer, designated as 803R is a 19mer and
468 misclassified sequences belong to 14 species (Table 5). the modification resulted in only 180 sequences (81.4%)
with 100% homology. To improve the universality of
Optimization of primers used to generate amplicon B’ the modified 343F and 798R primers, we analyzed both
To maximize the probability of amplifying all bacterial at each individual base (Figure 3). For both the modi-
species in the foregut, the 343F (15mer between positions fied 343F and 798R there are positions of relatively low
343 and 357) and 798R (15mer between positions 784 consensus. In the modified 343F, the consensus bases at
and 798) primer set were examined against 16S rRNA positions 359 and 360 match with only 77.2% and 74.4%
genes from these species to determine their universality. of the 219 species, respectively. In 798R, the consensus
Of the 219 species, 217 (99.1%) had 100% homology bases at positions 798 and 784 have 90.0% matches, re-
to 343F and 197 (90.0%) had 100% homology to 798R. spectively. To improve on these mismatches degenerative
While the 343F primer had favorable homology with all primers were constructed. Thus bases 359 and 360 in
sequences, it was extended 8 bases to position 365 and the modified 343F were changed from G (guanine) to R
deleted 4 bases between positions 343 and 346 because (guanine or adenosine). For 798R, base 798 was changed
the original primer sequence was too short, having a low from A (adenosine) to R (guanine or adenosine) result-
melting temperature that would not have worked with ing in improved base matching as shown in Figure 3.
PCR conditions when used in tandem with the 798R The optimized 347F was 100% homologous with 205
No. of matches
343F (343-357) 803R (785-803)
199
347F (347-365)
199 Base 798 Bases 784
179 Bases 359, 360
A T deleted
G
159 189
343 345 347 349 351 353 355 357 359 361 363 365 803 801 799 797 795 793 791 789 787 785
E. coli base # E. coli base #
Figure 3 Optimization of primers used to generate amplicon B. European Ribosome Database primers 343F (A) and 798R (B) were optimized to generate
maximal % match with corresponding region in the 16S sequences of foregut species on a base by base manner. Each primer base was analyzed for the number of
matches with the corresponding base of all 219 foregut species studied [base # assigned by position within Escherichia coli (E. coli) 16S sequence]. Most bases for
both 343F and 798R showed 100% match (219/219), however some bases had slight mismatch % and a few bases had significant mismatch % (base 798, 784, and
359, 360). To have a better homology between the primers and the foregut 16S sequences, the bases with significant mismatch were adjusted to result in lower %
mismatch. This was accomplished by changing 798A to R (increasing match from 197 to 213/219) and by changing 359G and 360G to R (increasing match from 169
and 163 to 212 and 219/219, respectively). Further modifications of primers were made to make them more suitable for polymerase chain reaction (PCR) reactions. R
primer 5’ end was shifted from 798 to 803, and 3’ end from 784 to 785. F primer 5’ end was shifted from 343 to 347 and 3’ end from 357 to 365. These changes pro-
vided suitable melting and annealing temperatures for the designed primer pairs in our PCR reactions. Resulting primers were designated 347F and 803R.
Table 4 Foregut species misclassified using amplicon B’ Table 5 Esophageal species misclassified using amplicon B’
compared with full length sequences compared with Sanger sequences
(93.6%) species and 803R had 100% homology with 209 rRNA genes from both cultured and uncultured bacte-
(95.4%) species. The sequences of the optimized prim- rial species, representing 36 phyla. For optimized primer
ers are 5’-GGAGGCAGCAGTRRGGAAT (347F) and 347F, the introduction of ambiguity codons improved
5’-CTACCRGGGTATCTAATCC (803R). the coverage to 91.1% from 65.7% for the type strain
sequences and to 90.4% from 63.7% for all 16S rRNA
347F/803R primer set has a broad taxonomic coverage genes (Table 6). In comparison, the optimized primer
of domain bacteria 803R had an identical match with 91.8% of the type
The 347F/803R primer set has a broad taxonomic cov- strain sequences and 84.9% of all 16S rRNA genes. As-
erage of common foregut bacterial species identified suming a typical PCR can tolerate 2 mismatches between
from the limited number of sequences generated by a primer and a template, the modified 374F will be able
Sanger sequencing, but deep sampling of the foregut to anneal 99% of the type strain sequences and 98% of
microbiome by 454 sequencing will uncover numerous all 16S rRNA genes, compared with 99.9% and 99.6%
species that were below the detectable level by Sanger for 803R primer, respectively.
sequencing. To evaluate their potential to detect species
not included in the foregut dataset, 347F/803R primer
sequences were searched against the 16S rRNA gene
DISCUSSION
database at RDP (http://rdp.cme.msu.edu/). The da- The introduction of next generation sequencing has
tabase has a collection of 16S rRNA genes from 5165 been a boon to many fields of research, but has not yet
type strains and 433 306 high quality, near full length 16S been fully applied to microbial ecology. One reason for
Primer Optimization Total species Coverage at mismatches n (%) Total sequence Coverage at mismatches n (%)
0 1 2 0 1 2
374F Before 5165 3392 (65.7) 4835 (93.6) 5042 (97.6) 433 306 275 801 (63.7) 406 626 (93.8) 418 613 (96.6)
After 5165 4703 (91.1) 4996 (96.7) 5114 (99.0) 433 306 391 695 (90.4) 418 832 (96.7) 424 756 (98.0)
803R Before 5165 4584 (88.8) 5091 (98.6) 5159 (99.9) 433 306 352 827 (81.4) 417 612 (96.4) 430 967 (99.5)
After 5165 4741 (91.8) 5131 (99.3) 5159 (99.9) 433 306 367 771 (84.9) 427 791 (98.7) 431 725 (99.6)
this is that previous read lengths of approximately 200 of our primer set allowed us to find a region of the 16S
bases for 454 sequencing were not sufficient to accu- rRNA gene that gives maximum classification accuracy
rately classify bacteria based on their 16S rRNA genes[33]. within the current size limitations of 454 sequencing.
However, with the recent improvements to 454 sequenc- This region was between the universal primers 343F and
ing which allows for longer (approximately 400 base) 798R, encompassing bases 361-784 and HVR 3 and 4.
read lengths, this is now changing. Focusing on just the 219 species of the foregut also al-
This increase in read length greatly improves 454’s lowed us to tailor our primers for better match, resulting
applicability to 16S rRNA gene studies. We have shown in optimization of primers 343F and 798R (which we
that read lengths comparable to current 454 output have modified to 347F and 803R).
(300-400 bases) give satisfactory 16S rRNA gene clas- We are using these primers in several foregut mi-
sification of bacteria at the genus and species level while crobiome projects including one supported by the Hu-
shorter read lengths (such as those from Solexa and man Microbiome Project. Without the time and cost
SOLiD technology) do not. However, with a change restraints of cloning, 454 sequencing will allow us to
in sequencing read lengths of 16S rRNA from about analyze many more sequences than in previous foregut
900 bp (as is commonly used with Sanger sequencing) microbiome research (thousands vs 50-200 sequences
to read lengths of 400, it is important to revisit which per sample as was done previously). We expect that this
portion of the 16S rRNA gene to focus on to gain the more in-depth analysis of the foregut microbiome made
most information from shorter reads. With the selection possible by 454 sequencing will allow us to more clearly
of the most informative stretch of the 16S rRNA gene, characterize the association between commensal bacteria
it is also important to ensure that the designed primers of the esophagus and GERD-related esophageal disease
achieve a level of universality to be able to detect a wide progression to esophageal adenocarcinoma. In addition
array of microbial species. Although the primer pairs we to identifying bacteria already shown to occupy the fore-
chose to analyze were already available from the Euro- gut, 454 sequencing will broaden the range of bacteria
pean Ribosomal Database, Wang has recently reported found within the human foregut with the increased cov-
coverage rates of existing and predicted primers which erage. This may lead to discovery of additional species,
span various regions of the 16S rRNA gene[34]. Some of as well as previously undiscovered species, residing with-
those predicted primers with good coverage overlapped in the human foregut and a more accurate estimation of
those that we had chosen. the species diversity of the foregut in normal and disease
Using the designed primers and 454 sequencing data, conditions.
a complex microbiome can also be characterized by the Any species that was not included in our computation-
relative abundance of 16S rRNA genes representing al analyses will have a good chance of being identified by
bacterial species in the microbiome, by assigning the 16S the foregut primers, based on our data showing the broad
genes to specific species and calculating their relative taxonomic coverage of these primers within domain bac-
weights. This is a much more accurate method of deter- teria. Thus, even though these primers were optimized for
mining species relative abundance within the microbiome foregut species, they could potentially be used for other
than the traditional methods such as colony forming unit 16S rRNA based applications such as microbiome analysis
counting (which is biased against fastidious bacteria) or on other anatomic sites or environmental samples. These
cloning (which may be skewed due to cloning bias). As carefully selected and designed primers will be essential
a comparison, the absolute amount of 16S rRNA genes tools in our efforts to harness the maximum potential that
from each organism can be determined by quantitative 454 sequencing offers to the field of microbial ecology of
PCR (qPCR) but the number of species that can be tested the human body.
in qPCR is limited.
The primer pairs we analyzed could theoretically be COMMENTS
COMMENTS
used for a wide range of bacteria-containing sources,
such as environmental and clinical samples. By concen- Background
The study of the human foregut microbiome has generated interest recently
trating on species known to reside in the foregut (mouth, because of its association with gastroesophageal reflux disease-related com-
esophagus, and stomach), we have confirmed that 454 plications. New technologies, such as 454 pyrosequencing, have increased
sequencing is an appropriate method for 16S rRNA the capability and scope of the study of complex microbiomes but have not yet
gene analysis of the foregut microbiome. Careful choice been adapted for use with foregut microbiome samples.
Proc Natl Acad Sci USA 2006; 103: 732-737 Evolutionary Genetics Analysis (MEGA) software version
28 Arbique JC, Poyart C, Trieu-Cuot P, Quesne G, Carvalho 4.0. Mol Biol Evol 2007; 24: 1596-1599
Mda G, Steigerwalt AG, Morey RE, Jackson D, Davidson RJ, 32 Wang Q, Garrity GM, Tiedje JM, Cole JR. Naive Bayesian
Facklam RR. Accuracy of phenotypic and genotypic testing classifier for rapid assignment of rRNA sequences into the
for identification of Streptococcus pneumoniae and descrip- new bacterial taxonomy. Appl Environ Microbiol 2007; 73:
tion of Streptococcus pseudopneumoniae sp. nov. J Clin Mi- 5261-5267
crobiol 2004; 42: 4686-4696 33 Cole JR, Chai B, Farris RJ, Wang Q, Kulam-Syed-Mohideen
29 DeSantis TZ Jr, Hugenholtz P, Keller K, Brodie EL, Larsen AS, McGarrell DM, Bandela AM, Cardenas E, Garrity GM,
N, Piceno YM, Phan R, Andersen GL. NAST: a multiple Tiedje JM. The ribosomal database project (RDP-II): intro-
sequence alignment server for comparative analysis of 16S ducing myRDP space and quality controlled public data.
rRNA genes. Nucleic Acids Res 2006; 34: W394-W399 Nucleic Acids Res 2007; 35: D169-D172
30 Wuyts J, Perrière G, Van De Peer Y. The European ribosomal 34 Wang Y, Qian PY. Conservative fragments in bacterial 16S
RNA database. Nucleic Acids Res 2004; 32: D101-D103 rRNA genes and primer design for 16S ribosomal DNA am-
31 Tamura K, Dudley J, Nei M, Kumar S. MEGA4: Molecular plicons in metagenomic studies. PLoS One 2009; 4: e7401