Data Sets
Random reads from 113 isolate microbial genomes present in the Integrated Microbial Genomes system v1.3 (IMG) and sequenced at the DOE Joint Genome Institute were selected to form three distinct simulated metagenomic datasets of increasing complexity. Reads corresponding to the two ends of the same clone (paired reads) were selected where possible. The selected reads came from 3 kb and 8 kb libraries.
Follow the links below to retrieve detailed information for the simulated metagenomes.
Methods
Several methods for assembly, gene prediction and binning of metagenomic data were applied to the simulated datasets. If you wish to contribute your method, please read the How to submit section.
For every method that required genomes for training, the genomes corresponding to the dominant populations in the metagenomic datasets were excluded from the training set.
Assembly methods
JAZZ is a genome assembler developed at the DOE Joint Genome Institute.
Binning methods
kmer calculated the oligonucleotide frequencies of all metagenomic sequences and compared them to a reference set of finished genomes from the Integrated Microbial Genomes system v1.3 (IMG). Each metagenomic bin was assigned to the taxonomic family of the best-matching isolate bin, using a chi-square measure. The comparison was performed on both strands of every sequence and the best match was chosen. Two kinds of oligomers were used: (a) seven consecutive nucleotides, and (b) eight-nucleotide windows following the pattern NNxNNxNN, to account for the degeneracy of the genetic code.
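The frequency comparison described above can be sketched roughly as follows. This is only an illustration, not the kmer tool itself: the function names, the pseudocount, the exact form of the chi-square score, and the reading of 'x' in NNxNNxNN as an ignored position are all assumptions, and the reference profiles are assumed to be supplied as one frequency table per family.

```python
from collections import Counter

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def oligo_counts(seq, pattern):
    """Count oligomers in seq; 'N' positions are kept, 'x' positions skipped."""
    keep = [i for i, c in enumerate(pattern) if c == "N"]
    counts = Counter()
    for start in range(len(seq) - len(pattern) + 1):
        window = seq[start:start + len(pattern)]
        word = "".join(window[i] for i in keep)
        if set(word) <= set("ACGT"):  # skip windows with ambiguous bases
            counts[word] += 1
    return counts


def frequencies(counts):
    """Turn raw counts into relative frequencies."""
    total = sum(counts.values())
    return {w: n / total for w, n in counts.items()} if total else {}


def chi_square(observed, expected_freq, pseudo=1e-6):
    """Chi-square distance between observed counts and a reference profile.

    The small pseudocount (an assumption) avoids division by zero.
    """
    n = sum(observed.values())
    score = 0.0
    for word in set(observed) | set(expected_freq):
        expected = n * expected_freq.get(word, 0.0) + pseudo
        score += (observed.get(word, 0) - expected) ** 2 / expected
    return score


def best_family(seq, references, pattern="NNNNNNN"):
    """Score both strands against every family profile; return the closest."""
    best = None
    seq = seq.upper()
    for strand in (seq, reverse_complement(seq)):
        observed = oligo_counts(strand, pattern)
        for family, profile in references.items():
            distance = chi_square(observed, profile)
            if best is None or distance < best[0]:
                best = (distance, family)
    return best[1] if best else None
```

Passing pattern="NNNNNNN" corresponds to the seven-nucleotide oligomers and pattern="NNxNNxNN" to the spaced eight-nucleotide windows; the reference profiles would be built with frequencies(oligo_counts(genome, pattern)) using the same pattern.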
PhyloPythia was based on the comparison of sequence patterns between the metagenomic sequences and a reference set of genomes, resulting in the assignment of sequence fragments to phylogenetic clades. Initially, sequences were binned using a classifier trained on a generic training set (gen PhyloPythia), and bins were assigned to taxonomic levels ranging from domain to family. Subsequently, contigs containing single-copy genes belonging to the most abundant species were used as the training set (ssp PhyloPythia); in this case bins were assigned to genera as well. In both cases two groups of bins were created, the first with a high p-value of 0.85 and the second with a lower p-value of 0.5.
BLAST distr classified sequences according to the distribution of BLAST hits of their predicted genes across taxonomic classes. Genes were predicted using fgenesb and compared to a database composed of protein sequences from 253 complete bacterial and archaeal genomes, downloaded from NCBI's ftp site. Homologs with an E-value less than 1e-05 were assigned a normalized BLAST score (the BLAST score divided by the self score of the query). A metagenomic sequence was assigned to the phylogenetic class with the highest total normalized score if at least 50% of its predicted genes had hits in this class and the average normalized score per ORF was greater than 0.2.
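The assignment rule above can be sketched as a small function. This is a minimal illustration under stated assumptions, not the original implementation: the input tuple layout and the function name are hypothetical, only the best hit per gene and class is counted, and the average score per ORF is taken over all predicted genes of the sequence. The text does not specify either of the last two points.

```python
from collections import defaultdict


def assign_class(hits, n_genes, max_evalue=1e-5,
                 min_fraction=0.5, min_avg_score=0.2):
    """Assign one sequence to a taxonomic class from its genes' BLAST hits.

    hits: iterable of (gene_id, tax_class, score, self_score, evalue)
    n_genes: number of genes predicted on the sequence
    Returns the class name, or None if the thresholds are not met.
    """
    # Best normalized score per (gene, class); assumption: one hit per pair.
    best = {}
    for gene, tax_class, score, self_score, evalue in hits:
        if evalue >= max_evalue or self_score <= 0:
            continue
        normalized = score / self_score
        key = (gene, tax_class)
        if normalized > best.get(key, 0.0):
            best[key] = normalized

    totals = defaultdict(float)     # total normalized score per class
    gene_counts = defaultdict(int)  # genes with a hit in each class
    for (gene, tax_class), normalized in best.items():
        totals[tax_class] += normalized
        gene_counts[tax_class] += 1

    if not totals or n_genes <= 0:
        return None

    top = max(totals, key=totals.get)
    fraction = gene_counts[top] / n_genes
    average = totals[top] / n_genes  # assumption: averaged over all ORFs
    if fraction >= min_fraction and average > min_avg_score:
        return top
    return None
```

With four predicted genes, three of which hit the same class with normalized scores of 0.9, 0.5 and 0.6, the sequence is assigned to that class; with the same hits spread over ten predicted genes the 50% requirement fails and no assignment is made.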
Gene prediction methods
A combination of CRITICA and GLIMMER, as used by the Oak Ridge National Laboratory microbial pipeline (View a publication with the description of the method)
