Gene Function Prediction Method Comparison
Genes identified on the simulated datasets were compared to the genes originally predicted on the corresponding reads of the isolate genomes (reference genes) using blastp. Genes were categorized into four groups.
- The first group comprised genes common to both datasets (correctly identified genes). Only genes identified on the same sequence reads, with >80% amino acid identity over 50% of the length of the shorter gene, were included in this group.
- Genes falling below these thresholds formed the second group (inaccurately predicted genes).
- The third group contained genes predicted in the simulated datasets with no corresponding reference gene (newly predicted genes).
- Finally, reference genes without a corresponding predicted gene in the simulated dataset formed the fourth group (missed genes). Reference genes represented by <90 bp in the meta-assemblies were expected to be missed by blastp because of their short length and were excluded from the comparisons.
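
The four-way categorization described above can be illustrated with a minimal Python sketch. It is not the original pipeline: the `Hit` record, the `categorize` function, and all field names are hypothetical, and the blastp results are assumed to have been parsed already. Two further assumptions are made where the text is not explicit: the 50% coverage threshold is treated as inclusive, and hits between genes on different reads are ignored, so those genes end up as newly predicted or missed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    """One parsed blastp hit (hypothetical record layout)."""
    predicted: str    # ID of the gene predicted on the simulated dataset
    reference: str    # ID of the reference gene from the isolate genome
    same_read: bool   # True if both genes lie on the same sequence read
    identity: float   # percent amino acid identity
    aln_len: int      # alignment length in amino acids
    pred_len: int     # predicted gene length in amino acids
    ref_len: int      # reference gene length in amino acids


def categorize(predicted_ids, reference_bp, hits,
               min_identity=80.0, min_coverage=0.5, min_ref_bp=90):
    """Sort genes into the four groups described in the text.

    reference_bp maps each reference gene ID to the number of base pairs
    by which it is represented in the meta-assembly.
    """
    correct, inaccurate, matched_refs = set(), set(), set()
    for h in hits:
        if not h.same_read:
            continue  # assumption: hits across different reads do not count
        matched_refs.add(h.reference)
        coverage = h.aln_len / min(h.pred_len, h.ref_len)
        if h.identity > min_identity and coverage >= min_coverage:
            correct.add(h.predicted)
        else:
            inaccurate.add(h.predicted)
    # A gene with at least one passing hit is counted as correct only.
    inaccurate -= correct
    # Reference genes represented by fewer than min_ref_bp are excluded.
    eligible_refs = {r for r, bp in reference_bp.items() if bp >= min_ref_bp}
    return {
        "correct": correct,
        "inaccurate": inaccurate,
        "new": set(predicted_ids) - correct - inaccurate,
        "missed": eligible_refs - matched_refs,
    }
```

Calling `categorize` with the predicted gene IDs, a reference length mapping, and the parsed hits returns one set of gene IDs per group; the thresholds are exposed as keyword arguments so that other cut-offs can be tried.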
Reference genes for each simulated dataset, as well as genes predicted both on assembled sequences (contigs) and on singlets, were compared to the COG database. For each combination of assembly and gene prediction method, the relative abundance of each COG group was calculated.
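
As a rough illustration of the relative abundance calculation, the sketch below assumes that each gene has already been assigned a COG group label and defines relative abundance as the fraction of assigned genes that fall in each group. That definition, the function name `cog_relative_abundance`, and the example labels are assumptions made for illustration only.

```python
from collections import Counter


def cog_relative_abundance(assignments):
    """Return the fraction of genes assigned to each COG group.

    assignments is an iterable of COG group labels, one per gene.
    Genes without a COG assignment are assumed to be left out already.
    """
    counts = Counter(assignments)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {group: n / total for group, n in counts.items()}
```

The function would be called once per combination of assembly and gene prediction method, and once for the reference genes, so that the resulting profiles can be compared group by group.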


