Download Data
Simulated Metagenomes
- SimLC. Low complexity dataset
- SimMC. Medium complexity dataset
- SimHC. High complexity dataset
Sequence files contain the sequences of the paired reads in FASTA format. Quality files contain the information necessary for assembly.
| Dataset | sequence files | quality files |
| SimLC | Download | Download |
| SimMC | Download | Download |
| SimHC | Download | Download |
Note: filenames correspond to the taxon_oids found in IMG.
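The sequence files can be read with any FASTA parser. The following is a minimal sketch in Python, assuming plain multi-line FASTA records; the function name `read_fasta` and the example read identifiers are hypothetical and are not taken from the datasets.

```python
import io
from typing import Iterator, TextIO, Tuple


def read_fasta(handle: TextIO) -> Iterator[Tuple[str, str]]:
    """Yield (identifier, sequence) pairs from a FASTA-formatted stream."""
    name = None
    chunks = []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if name is not None:
                yield name, "".join(chunks)
            header = line[1:].split()
            # The identifier is the first whitespace-delimited token.
            name = header[0] if header else ""
            chunks = []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


# Hypothetical two-read example; real files would be opened with open(path).
example = io.StringIO(">read1 some description\nACGT\nAC\n>read2\nGG\n")
records = list(read_fasta(example))
```

Sequences are concatenated across lines, so records wrapped at a fixed line width are handled the same way as single-line records.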
The sizes of the libraries used for this project can be found here. Reads that do not correspond to any of the libraries in this file can be treated either as single reads (i.e. reads without a valid paired read) or as belonging to the 3 kb libraries.
The list of the reads and their corresponding origin can be found here.
The genes that are included in the "reference" genomes are here.
NOTE: this file contains all the protein-coding genes that are included in the three metagenomes.
The overlap of the genes with the sequencing reads is here.
Assemblers
- phrap
- Arachne
- Jazz
A file that contains the coordinates of the reads on the assembled sequences can be found here.
| Dataset | phrap | Arachne | Jazz |
| SimLC | Contigs Singlets | Contigs Singlets | Scaffolds Singlets |
| SimMC | Contigs Singlets | Contigs Singlets | Scaffolds Singlets |
| SimHC | Contigs Singlets | Contigs Singlets | Scaffolds Singlets |
Jazz generates scaffold sequences (i.e. multiple contigs connected by stretches of Ns).
The coordinates of contigs on scaffolds can be found here.
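Because scaffolds are contigs joined by runs of Ns, contig boundaries can also be recovered directly from a scaffold sequence. The sketch below is one way to do this in Python. The function name `split_scaffold`, the `min_gap` parameter, and the 0-based half-open coordinate convention are assumptions made for illustration and may not match the convention used in the downloadable coordinates file.

```python
import re
from typing import List, Tuple


def split_scaffold(scaffold: str, min_gap: int = 1) -> List[Tuple[int, int, str]]:
    """Split a scaffold at runs of N and return (start, end, sequence) per contig.

    Coordinates are 0-based and half-open (an assumption of this sketch).
    Runs of N shorter than min_gap are kept inside the contig.
    """
    contigs = []
    pos = 0
    gap_pattern = re.compile(r"[Nn]{%d,}" % min_gap)
    for gap in gap_pattern.finditer(scaffold):
        if gap.start() > pos:
            contigs.append((pos, gap.start(), scaffold[pos:gap.start()]))
        pos = gap.end()
    # Trailing contig after the last gap, if any sequence remains.
    if pos < len(scaffold):
        contigs.append((pos, len(scaffold), scaffold[pos:]))
    return contigs


# Toy scaffold: three contigs separated by a 4-N gap and a 1-N gap.
pieces = split_scaffold("ACGTNNNNGGCCNAT")
```

Raising `min_gap` treats short runs of N as ambiguous bases within a contig rather than as gaps between contigs.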
The taxonomic assignment of each contig can be found here.
Gene Prediction Methods
- Fgenes
- CRITICA/Glimmer
| Dataset | phrap | Arachne | Jazz | Method |
| SimLC | Download | Download | Download | Fgenes |
| SimLC | Download | Download | Download | CRITICA/Glimmer |
| SimMC | Download | Download | Download | Fgenes |
| SimMC | Download | Download | Download | CRITICA/Glimmer |
| SimHC | Download | Download | Download | Fgenes |
| SimHC | Download | Download | Download | CRITICA/Glimmer |
Binning Methods
- kmer (7mer, 8mer)
- PhyloPythia (generic and sample-specific algorithms)
- BLAST distribution
| Dataset | phrap | Arachne | Jazz | Method |
| SimLC | Download | Download | Download | kmer |
| SimLC | Download | Download | Download | PhyloPythia |
| SimLC | Download | Download | Download | BLAST distribution |
| SimMC | Download | Download | Download | kmer |
| SimMC | Download | Download | Download | PhyloPythia |
| SimMC | Download | Download | Download | BLAST distribution |
| SimHC | Download | Download | Download | kmer |
| SimHC | Download | Download | Download | PhyloPythia |
| SimHC | Download | Download | Download | BLAST distribution |
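The kmer binning results above are based on oligonucleotide composition (7mers and 8mers). As a rough illustration of the kind of feature such methods work from, the following Python sketch computes normalized k-mer frequencies for a single sequence. It does not reproduce the binning procedure used for these files; the function name `kmer_frequencies` and the choice to skip k-mers containing non-ACGT characters are assumptions of this sketch.

```python
from collections import Counter
from typing import Dict


def kmer_frequencies(seq: str, k: int = 7) -> Dict[str, float]:
    """Return the relative frequency of each k-mer observed in seq.

    k-mers containing characters other than A, C, G, T (e.g. N) are skipped.
    """
    seq = seq.upper()
    valid = set("ACGT")
    counts = Counter()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if set(kmer) <= valid:
            counts[kmer] += 1
    total = sum(counts.values())
    if total == 0:
        return {}
    return {kmer: n / total for kmer, n in counts.items()}


# Toy example with k=4; contigs would normally be far longer.
freqs = kmer_frequencies("ACGTACGT", k=4)
```

Composition-based methods typically compare such frequency vectors between sequences; how the vectors are compared or clustered is outside the scope of this sketch.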
OTU analysis
FASTA-formatted file of all 1677 sequences from this study.
Taxonomic identity of each sequence down to the species level.
Spreadsheet that displays all methodologies examined in this study. The VI column indicates the VI distance from the true species clustering. The ACE and CHAO1 columns are nonparametric estimators of the total number of species in the sampled environment. The SHANNON column is the computed Shannon diversity index.
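For readers unfamiliar with these columns, the Python sketch below shows how a Shannon index and a Chao1 estimate can be computed from per-cluster sequence counts. It is illustrative only: the spreadsheet does not state which variants were used, so the natural-log Shannon index and the bias-corrected form of Chao1 are assumptions here, and ACE is omitted. The function names and the example counts are hypothetical.

```python
import math
from typing import Sequence


def shannon_index(abundances: Sequence[int]) -> float:
    """Shannon diversity H = -sum(p_i * ln(p_i)) over non-empty clusters."""
    total = sum(abundances)
    return -sum((n / total) * math.log(n / total) for n in abundances if n > 0)


def chao1(abundances: Sequence[int]) -> float:
    """Bias-corrected Chao1: S_obs + F1 * (F1 - 1) / (2 * (F2 + 1)).

    F1 and F2 are the numbers of clusters seen exactly once and twice.
    """
    observed = sum(1 for n in abundances if n > 0)
    f1 = sum(1 for n in abundances if n == 1)
    f2 = sum(1 for n in abundances if n == 2)
    return observed + f1 * (f1 - 1) / (2 * (f2 + 1))


# Hypothetical cluster sizes: one cluster of 5, two singletons, one doubleton.
counts = [5, 1, 1, 2]
h = shannon_index(counts)
richness = chao1(counts)
```

With no singletons, the Chao1 estimate equals the observed number of clusters, which reflects the idea that rare clusters drive the estimate of unseen species.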
| Distance matrix | Multiple sequence alignment | | | |
| | NAST | CLUSTALW | MUSCLE | |
| Kimura | Download | Download | Download | |
| Jukes | Download | Download | Download | |
| Felsenstein | Download | Download | Download | Pairwise |
| Olsen | Download | Download | Download | |
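Each row of the table above is a distance correction applied to aligned sequences. As an illustration of what such a correction does, the sketch below computes the Jukes-Cantor distance, d = -3/4 ln(1 - 4p/3), where p is the proportion of differing sites between two aligned sequences. Reading the "Jukes" row as the Jukes-Cantor model is an assumption, as are the function name and the choice to ignore any column in which either sequence has a gap or ambiguous base.

```python
import math


def jukes_cantor_distance(a: str, b: str) -> float:
    """Jukes-Cantor corrected distance between two aligned sequences.

    Only columns where both sequences have A, C, G, or T are compared.
    Returns math.inf when the proportion of differences reaches 0.75,
    where the correction is undefined.
    """
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    differing = 0
    for x, y in zip(a.upper(), b.upper()):
        if x in "ACGT" and y in "ACGT":
            compared += 1
            if x != y:
                differing += 1
    if compared == 0:
        raise ValueError("no comparable columns")
    p = differing / compared
    if p >= 0.75:
        return math.inf
    return -0.75 * math.log(1 - 4 * p / 3)


# Toy alignment: one mismatch in ten compared columns (p = 0.1).
d = jukes_cantor_distance("ACGTACGTAC", "ACGTACGTAA")
```

The corrected distance is slightly larger than the raw proportion of differences because the model accounts for multiple substitutions at the same site.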
