All tools are accessible as Apps in the CyVerse Discovery Environment (formerly iPlant). The CyVerse Cyberinfrastructure is a freely available resource for computation, storage, and data analysis for the life sciences. We plan to extend the list of tools for viruses pending funding. There are also general Apps for metagenomics and microbial ecology available through the iMicrobe Project.
The lists below include virus-focused tools (developed either through the iVirus project or by others) and tools that were not specifically built for viruses but can be applied to viral metagenomic analyses.
Clicking on tools will take you to the App on CyVerse!
Once you’ve uploaded your read data, you’ll need to QC (“Quality Control”) it. This ensures that the data going into the assembly (the next step) is of high quality. Poor read quality can result in misassembled sequences. Most frequently, read data QC involves trimming reads according to their quality scores. Although some assemblers do not require QC’d reads, we highly recommend it!
|Trimmomatic||Identifies adapter sequences and quality filters|
|Btrim||Trims adapters and low quality regions|
|Scythe||Identifies contaminating sequences in read data based on a Bayesian approach|
|Sickle||Sliding window quality trimmer, designed to be used after Scythe|
|MetaGeneMark||Ab initio gene prediction|
|FragGeneScan||Ab initio gene prediction|
|Prodigal||Ab initio gene prediction|
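To make the idea of trimming reads by quality score concrete, here is a minimal Python sketch of a sliding-window trimmer. It is an illustration only and does not reproduce the exact behavior of Trimmomatic, Sickle, or any other App listed above. The function names, the window size of 4, the mean-quality threshold of 20, and the Phred+33 encoding are all assumptions chosen for the example.

```python
def phred_scores(quality_line, offset=33):
    """Decode a FASTQ quality string into Phred scores (Phred+33 is assumed)."""
    return [ord(ch) - offset for ch in quality_line]


def sliding_window_trim(sequence, quality_line, window=4, min_mean_quality=20):
    """Cut the read at the start of the first window whose mean quality is too low."""
    scores = phred_scores(quality_line)
    if len(scores) < window:
        # Too short for a full window: keep or drop the read as a whole.
        passes = bool(scores) and sum(scores) / len(scores) >= min_mean_quality
        keep = len(scores) if passes else 0
        return sequence[:keep], quality_line[:keep]
    for start in range(len(scores) - window + 1):
        mean_quality = sum(scores[start:start + window]) / window
        if mean_quality < min_mean_quality:
            return sequence[:start], quality_line[:start]
    return sequence, quality_line


# Hypothetical read: six high-quality bases followed by four poor ones.
read = "ACGTACGTAC"
qual = "IIIIII##!!"  # 'I' = Q40, '#' = Q2, '!' = Q0
trimmed_seq, trimmed_qual = sliding_window_trim(read, qual)
print(trimmed_seq, trimmed_qual)
```

Production trimmers add further steps (adapter removal, minimum-length filters, paired-read handling), so this sketch is best read as a picture of the core idea rather than a replacement for the Apps.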
Following read trimming and QC, reads can be assembled into contiguous sequences (“contigs”). Most “recent” assemblers are designed to assemble Illumina data (short read lengths, massively deep sequencing) and are based on de Bruijn graphs (original ref). Assembler selection depends on the type of read data being assembled (often 454 vs. Illumina vs. PacBio), the source material (DNA vs. RNA, eukaryotic vs. prokaryotic), and/or sample-specific determinants that may have biased the reads (high/low coverage, repetitive sequences, amplification polymerase, etc.). There is no “best” assembler, though some assemblers perform better with viral metagenomes than others.
|SOAPdenovo||Single-genome assembler tuned for metagenomics|
|Newbler (gs Assembler)||De novo assembly based on read overlap|
|SPAdes (multiple memory)||De Bruijn graph assembler|
|IDBA-UD (multiple memory)||De Bruijn graph multiple alignments|
|Trinity (multiple memory)||RNA-Seq de novo assembler|
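For readers unfamiliar with de Bruijn graphs, the following toy Python sketch shows the basic idea: reads are broken into k-mers, each k-mer becomes an edge between its (k-1)-mer prefix and suffix, and a contig is read off by walking an unambiguous path. The function names, the three example reads, and the choice of k=4 are hypothetical. Real assemblers such as those listed above also handle sequencing errors, reverse complements, coverage, and branching, none of which is attempted here.

```python
from collections import defaultdict


def build_de_bruijn_graph(reads, k):
    """Map each (k-1)-mer to the (k-1)-mers that follow it in at least one read."""
    graph = defaultdict(list)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].append(kmer[1:])
    return graph


def extend_unambiguous_path(graph, start):
    """Walk from `start` while every node has exactly one distinct successor."""
    contig = start
    node = start
    visited = {start}
    while True:
        successors = set(graph.get(node, []))
        if len(successors) != 1:
            break  # dead end or branch: stop extending
        node = successors.pop()
        if node in visited:
            break  # cycle: stop rather than loop forever
        visited.add(node)
        contig += node[-1]
    return contig


# Three hypothetical overlapping reads drawn from one short sequence.
reads = ["ATGGCGT", "GGCGTGC", "GTGCAAT"]
graph = build_de_bruijn_graph(reads, k=4)
contig = extend_unambiguous_path(graph, "ATG")
print(contig)
```

The walk stops at the first branch or dead end, which is one reason repetitive sequences and uneven coverage, mentioned above as sample-specific determinants, complicate assembly.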
Analyzing viral data remains a major challenge in the field of viral ecology. A variety of approaches have been proposed, each dependent on the source of data and the underlying biological question. A relatively recent approach to analyzing complex viral data is to organize viral sequence space, often through the use of protein clustering techniques. Protein clusters can be used as a diversity metric, as units for ecological studies when compared against other datasets, or for functional profiling of the community.
|PCPipe||Protein clustering pipeline and annotation|
|VirSorter||Find viral contigs in a microbial metagenome (reference)|
|vContact||Guilt-by-contig-association automatic classification of viral contigs|
|vContact-PCs||Generate PC-profiles using vContact/MCL|
|vContact-Gene2Contig||Conditions files for use in vContact|
|GAAS (Genome Abundance and Average Size) (In development)||Estimates relative abundance and average size of metagenomic sequences|
|Circonspect (Control In Research on CONtig SPECTra) (In development)||Generates contig spectra for downstream modeling of community structure|
|PHACCS (PHAge Communities from Contig Spectrum) (In development)||Estimates structure and diversity of viral communities|
|BowtieBatch||Performs mass alignment of paired and unpaired reads against a reference dataset using Bowtie2 and Samtools.|
|Read2RefMapper||Consumes input from BowtieBatch to generate coverage profiles.|
|Prokka||Annotates bacterial, archaeal, and viral genomes quickly and produces standards-compliant output files|
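The coverage profiles mentioned for Read2RefMapper can be pictured as per-base read depth along a reference sequence. The Python sketch below computes such a profile from a list of alignment intervals. It is not the Read2RefMapper implementation: the function names, the 0-based end-exclusive interval convention, and the example alignments are assumptions made for illustration, and a real workflow would take its alignments from the Bowtie2/Samtools output rather than a hand-written list.

```python
def coverage_profile(reference_length, alignments):
    """Per-base read depth from (start, end) alignments (0-based, end-exclusive)."""
    # Difference array: +1 where an alignment starts, -1 just past where it ends.
    deltas = [0] * (reference_length + 1)
    for start, end in alignments:
        start = max(start, 0)
        end = min(end, reference_length)
        if start < end:
            deltas[start] += 1
            deltas[end] -= 1
    depth = []
    running = 0
    for delta in deltas[:reference_length]:
        running += delta
        depth.append(running)
    return depth


def summarize(depth):
    """Mean depth and fraction of reference positions covered by at least one read."""
    covered = sum(1 for d in depth if d > 0)
    return {
        "mean_depth": sum(depth) / len(depth),
        "fraction_covered": covered / len(depth),
    }


# Hypothetical alignments of three reads against a 10 bp reference.
alignments = [(0, 5), (3, 8), (4, 6)]
depth = coverage_profile(10, alignments)
print(depth)
print(summarize(depth))
```

Summary statistics such as mean depth and fraction covered are common ways to decide whether a reference contig should count as present in a sample, though the thresholds to apply depend on the study.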
iVirus tool updates (both improvements and bug fixes) will be worked on pending funding and time availability.