Metagenomics is an increasingly common method of studying microbial communities, as it allows estimation of functional potential and recovery of population genomes. However, unlike amplicon-based studies, metagenomics has no concept of a sequence-based operational taxonomic unit (OTU); instead, community profiling relies on taxonomic classification or assesses only a subset of lineages.
Here we introduce SingleM, an algorithm and software toolbox that finds de novo nucleotide sequence OTUs in metagenome reads. Because SingleM targets reads encoding conserved sections of ribosomal proteins, each read can be directly compared with all other reads from the same conserved section. The resultant species- or strain-level OTUs are more precise than those derived from 16S rRNA gene amplicons, so application to environmental datasets often reveals at least 10x more diversity than taxonomy-based methods do. OTU-based ecological diversity metrics can be calculated naturally, even for complex communities containing novel lineages.
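The idea of window-based OTUs can be illustrated with a minimal Python sketch. It assumes that reads have already been aligned to a marker gene and trimmed to the nucleotides covering one fixed window; the 20 bp toy sequences and the function names `otu_table` and `shannon` are hypothetical and are not taken from SingleM itself. Identical window sequences are collapsed into OTUs, and a Shannon diversity index is computed from the resulting counts.

```python
from collections import Counter
import math

def otu_table(window_seqs):
    """Collapse identical window sequences into OTUs with read counts.

    Sequences are upper-cased; empty or ambiguous (N-containing)
    sequences are skipped. This is an illustrative simplification.
    """
    cleaned = (s.upper() for s in window_seqs)
    return Counter(s for s in cleaned if s and "N" not in s)

def shannon(counts):
    """Shannon diversity (natural log) of an OTU count table."""
    total = sum(counts.values())
    return -sum((c / total) * math.log(c / total)
                for c in counts.values() if c > 0)

# Toy reads, all covering the same hypothetical 20 bp window.
reads = [
    "ACGTACGTACGTACGTACGT",
    "ACGTACGTACGTACGTACGT",
    "ACGTACGTACGTACGTACGA",
    "acgtacgtacgtacgtacga",
]
otus = otu_table(reads)
diversity = shannon(otus)
```

Here the four reads collapse into two OTUs of equal abundance, so the Shannon index equals ln 2.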
Since the position of OTU sequences within each gene is standardised, OTUs obtained from unrelated datasets can also be directly compared. More than 4000 environmental metagenomes publicly available in the Sequence Read Archive (SRA) were scanned, revealing >500,000 distinct lineages. This resource can be efficiently searched to find the global distribution of lineages of interest and their co-occurrence.
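Because every OTU sequence covers the same window, cross-dataset comparison can in principle reduce to exact string matching. The following Python sketch shows one way such a search might look; the sample names, the 4 bp placeholder sequences, and the functions `build_index` and `cooccurring` are assumptions made for illustration, not a description of how the SingleM resource is actually stored or queried.

```python
from collections import defaultdict

def build_index(sample_otus):
    """Map each OTU sequence to the set of samples in which it occurs."""
    index = defaultdict(set)
    for sample, otus in sample_otus.items():
        for seq in otus:
            index[seq].add(sample)
    return index

def cooccurring(index, sample_otus, query):
    """Return OTUs present in every sample that contains the query OTU."""
    samples = index.get(query, set())
    if not samples:
        return set()
    shared = set.intersection(*(set(sample_otus[s]) for s in samples))
    return shared - {query}

# Hypothetical per-sample OTU sets (sequences shortened for readability).
sample_otus = {
    "soil_A": {"AAAA", "CCCC"},
    "soil_B": {"AAAA", "CCCC", "GGGG"},
    "marine_C": {"TTTT"},
}
index = build_index(sample_otus)
distribution = index.get("AAAA", set())          # samples containing the query
partners = cooccurring(index, sample_otus, "AAAA")
```

In this toy data the query OTU occurs in both soil samples, and one other OTU is found alongside it in each of them.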
One important application of SingleM is predicting which population genomes are likely to result from assembly and binning. An algorithm was designed to predict which lineages in a given sample are both sufficiently abundant and exhibit sufficiently low strain heterogeneity for the recovery of a high-quality population genome. Metagenomes from the SRA strongly predicted to yield genomes from the elusive SAR11 clade, perhaps the most abundant microbial lineage on Earth, were assembled and binned, recovering the first population genomes from this clade. Targeted assembly of samples dominated by a single homogeneous lineage provides a new and scalable method for improving population genome recovery.
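The two criteria named above, abundance and low strain heterogeneity, can be made concrete with a simple heuristic. The Python sketch below is not the prediction algorithm itself, whose details are not given here; it assumes that OTUs have already been grouped into lineages with an estimated coverage per OTU, and the thresholds `min_coverage` and `min_dominance`, like the function name `recoverable_lineages`, are hypothetical placeholders.

```python
def recoverable_lineages(lineage_otus, min_coverage=10.0, min_dominance=0.9):
    """Flag lineages that look abundant and strain-homogeneous.

    lineage_otus maps a lineage name to the coverages of its OTUs.
    A lineage passes if its total coverage reaches min_coverage and
    its most abundant OTU accounts for at least min_dominance of that
    total. Both thresholds are illustrative assumptions.
    """
    candidates = []
    for lineage, coverages in lineage_otus.items():
        total = sum(coverages)
        if not coverages or total <= 0 or total < min_coverage:
            continue  # absent or too rare to assemble
        dominance = max(coverages) / total
        if dominance >= min_dominance:
            candidates.append(lineage)
    return sorted(candidates)

# Hypothetical coverages of the OTUs assigned to each lineage.
lineage_otus = {
    "lineage_1": [40.0, 1.0],   # abundant, one dominant strain
    "lineage_2": [15.0, 14.0],  # abundant, but two co-dominant strains
    "lineage_3": [3.0],         # homogeneous, but too rare
}
candidates = recoverable_lineages(lineage_otus)
```

Only the first lineage satisfies both conditions in this toy example; the second fails on heterogeneity and the third on coverage.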
SingleM is available at http://github.com/wwood/SingleM.