All tools are accessible as Apps in either the CyVerse Discovery Environment or in KBase's App Catalog. We plan to extend the list of tools for viruses as long as we continue to receive funding (and sometimes beyond). We’ve also included more generalized apps for metagenomics and microbial ecology available through the iMicrobe Project.
Below is a list of apps we've used in an iVirus protocol or used successfully with viral data. We'll do our best to keep this updated as frequently as time allows, though feel free to contact us if there’s any mistakes or omissions.
Generally speaking, quality control (QC) is a technique most commonly applied to raw read data. This ensures that the data going into the assembly (common next step) is of high quality. Poor read quality can result in mis- or incorrectly assembled sequences. Most frequently, read data QC involves trimming reads according to their quality scores and removing barcoding sequences (if applicable). Although some assemblers do not require QC’d reads, we highly recommend it!
Joshi NA, Fass JN. (2011). Sickle: A sliding-window, adaptive, quality-based trimming tool for FastQ files (Version 1.33) [Software]. Available at https://github.com/najoshi/sickle.
Buffalo V. Scythe - A Bayesian adapter trimmer (version 0.994 BETA) [Software]. Available at https://github.com/vsbuffalo/scythe
Kong, Y. (2011) Btrim: a fast, lightweight adapter and quality trimming program for next-generation sequencing technologies. Genomics. DOI: 10.1016/j.ygeno.2011.05.009
Bolger, A. M., Lohse, M., & Usadel, B. (2014). Trimmomatic: A flexible trimmer for Illumina Sequence Data. Bioinformatics, btu170.
Andrews, S. (2010). FastQC: A Quality Control Tool for High Throughput Sequence Data [Online]. Available online at: http://www.bioinformatics.babraham.ac.uk/projects/fastqc/
Once reads have passed quality control and are 'cleaned', the next usual step is assemble them. Since reads are fragments of a longer DNA template, assembly attempts to piece back together the original DNA sequence from the short-reads. This process is called assembly, and results in commonly called 'contigs' - or contiguous sequences - that represent a larger piece of DNA from the original DNA library. Multiple assemblers are available, and have different methods and algorithms to piece back together contigs. The choice of assembler can also depend on the complexity of the genome, as well as the type of organism. For viruses, SPAdes or MetaSPAdes have yielded good results.
Nurk, S., Meleshko, D., Korobeynikov, A. & Pevzner, P. A. metaSPAdes: a new versatile metagenomic assembler. Genome Res. 27, 824–834 (2017).
Bankevich, A. et al. SPAdes: A New Genome Assembly Algorithm and Its Applications to Single-Cell Sequencing. J. Comput. Biol. 19, 455–477 (2012).
Peng, Y., et al. (2012) IDBA-UD: a de novo assembler for single-cell and metagenomic sequencing data with highly uneven depth, Bioinformatics, 28, 1420-1428.
Luo et al.: SOAPdenovo2: an empirically improved memory-efficient short-read de novo assembler. GigaScience 2012 1:18.
Post assembly, gene prediction is the next step. Genes determined from a gene prediction tool can be fed into numerous sequence analysis tools.
Hyatt, D. Prodigal (2.6.3) [Software]. Available at https://github.com/hyattpd/Prodiga
Seemann T. Prokka: rapid prokaryotic genome annotation Bioinformatics 2014 Jul 15;30(14):2068-9. PMID:24642063
Once genes are called (and sometimes that's not required), the real "fun" of analyzing viral sequence data begins. The tools featured here aren't virus-specific, but they're often used with viral data.
B. Buchfink, Xie C., D. Huson, “Fast and sensitive protein alignment using DIAMOND”, Nature Methods 12, 59-60 (2015)
Analyzing viral data remains a major challenge in the field of viral ecology. A variety of approaches have been proposed, each dependent on the source of data and the underlying biological question. A relatively recent method of analyzing complex viral data is by organizing viral sequence space, often through the use of protein clustering techniques. Protein clusters can be used as a diversity metric, or as units for ecological studies when compared against other datasets, or functional profiling of the community.
Guo, J. et al. VirSorter2: a multi-classifier, expert-guided approach to detect diverse DNA and RNA viruses. Microbiome 9, 37 (2021).
Roux S, Enault F, Hurwitz BL, Sullivan MB. (2015) VirSorter: mining viral signal from microbial genomic data. PeerJ 3:e985
Kieft, K., Zhou, Z. & Anantharaman, K. VIBRANT: automated recovery, annotation and curation of microbial viruses, and evaluation of viral community function from genomic sequences. Microbiome 8, 90 (2020).
Amgarten, D., Braga, L. P. P., da Silva, A. M. & Setubal, J. C. MARVEL, a Tool for Prediction of Bacteriophage Sequences in Metagenomic Bins. Front. Genet. 9, 1–8 (2018).
Vik, D. R. et al. Putative archaeal viruses from the mesopelagic ocean. PeerJ 5, e3428 (2017).
Ren, J. et al. Identifying viruses from metagenomic data by deep learning. (2018).
Analyzing viral data remains a major challenge in the field of viral ecology. A variety of approaches have been proposed, each dependent on the source of data and the underlying biological question. A relatively recent method of analyzing complex viral data is by organizing viral sequence space, often through the use of protein clustering techniques. Protein clusters can be used as a diversity metric, or as units for ecological studies when compared against other datasets, or functional profiling of the community. For taxonomic classification on phages, we recommend vConTACT-related software. For virus genome gene annotation, DRAM-v and Cenote-Taker2 can generate files containing annotation data.
Bin Jang, H., Bolduc, B., Zablocki, O., Kuhn, J. H., Roux, S., Adriaenssens, E. M., … Sullivan, M. B. (2019). Taxonomic assignment of uncultivated prokaryotic virus genomes is enabled by gene-sharing networks. Nature Biotechnology.
Bin Jang, H., Bolduc, B., Zablocki, O., Kuhn, J. H., Roux, S., Adriaenssens, E. M., … Sullivan, M. B. (2019). Taxonomic assignment of uncultivated prokaryotic virus genomes is enabled by gene-sharing networks. Nature Biotechnology.
Bolduc, B. et al. vConTACT: an iVirus tool to classify double-stranded DNA viruses that infect Archaea and Bacteria. PeerJ 5, e3243 (2017).
Bolduc, B. et al. vConTACT: an iVirus tool to classify double-stranded DNA viruses that infect Archaea and Bacteria. PeerJ 5, e3243 (2017).
Shaffer, M. et al. DRAM for distilling microbial metabolism to automate the curation of microbiome function. Nucleic Acids Res. 48, 8883–8900 (2020).
Tisza, M. J., Belford, A. K., Domínguez-Huerta, G., Bolduc, B. & Buck, C. B. Cenote-Taker 2 democratizes virus discovery and sequence annotation. Virus Evol. 7, 1–12 (2021).
Tisza, M. J. et al. Discovery of several thousand highly diverse circular DNA viruses. Elife 9, 1–26 (2020).
Analyses based on reads can be used for a variety of different reasons. Principle among them is estimating genome or population abundance.
Bolduc, B., Youens-Clark, K., Roux, S., Hurwitz, B. L. & Sullivan, M. B. iVirus: facilitating new insights in viral ecology with software and community data sets imbedded in a cyberinfrastructure. ISME J. 11, 7–14 (2017).
Bolduc, B., Youens-Clark, K., Roux, S., Hurwitz, B. L. & Sullivan, M. B. iVirus: facilitating new insights in viral ecology with software and community data sets imbedded in a cyberinfrastructure. ISME J. 11, 7–14 (2017).