An Overview of the Tools
One of the strengths of iVirus (thanks to its underlying CyVerse cyberinfrastructure) is a focus on bringing bioinformatic tools to the viral ecology community. Here are a few examples of using Apps available through iVirus/CyVerse to process data.
A few quick notes:
- Guides are not intended to assist users in understanding the biology behind the tools nor how the tools function.
- Where possible, Apps have links to their documentation on CyVerse as well as their citations (or original home pages).
- In some cases, many Apps are available to solve a particular problem. Guides will choose to highlight one or two.
- These guides assume you’ve created a CyVerse account and can access it. Check out the getting started guide for assistance.
Several “use cases” are available at protocols.io. For nearly all these use cases, we’ll use (as a basis) actual reads from the Ocean Sampling Day (2014) and process them using CyVerse. In some cases we’ll take the user from raw read files through assembly to identifying viral sequences and preliminary analysis. Other use cases will tackle ways of analyzing a viral metagenome, either reads or contigs, using traditional and non-traditional approaches. As a reminder, all these protocols are on protocols.io, and the versions there should be considered the most up to date.
All example files can be found within the CyVerse Data Store. To find these files, log in to the Discovery Environment. Under “Data”, go to Community Data –> iVirus –> ExampleData. Alternatively, you can copy-and-paste the following into the “Viewing” bar under the data browser: /iplant/home/shared/iVirus/ExampleData/
Each tool has “Input” and “Output” directories, so the user has both valid input data and the expected output data.
Processing a Viral Metagenome
Description: A long-standing challenge in viral metagenomics is actually processing a viral metagenome (we’re not talking about the science side!). For many reasons enumerated elsewhere, processing these datasets requires skilled bioinformaticians and computational resources not available to many researchers/labs. iVirus seeks to tackle this head-on.
Protocol “Collection”: protocols.io (collections are just that – collections of protocols)
- Cleaning up sequencing reads using Trimmomatic
- Assembling QC’d reads using SPAdes
- Identifying putative viral sequences using VirSorter
- Preparing data for vContact
- Running vContact and Visualization in Cytoscape
Mapping Metagenomic Reads to References
Description: One of the most commonly used procedures for analyzing viral metagenomic data is to map a metagenome’s reads (or reads from another dataset) against a set of references, often the contigs assembled from those same reads. For example, if one wanted to know how well-represented viruses in NCBI’s Viral Reference Sequences (ViralRefSeq) were in ocean viromes, they could map reads from many ocean viral metagenomes against ViralRefSeq. This is generally done with Bowtie2 or BWA by selecting a reference set of sequences and then providing paired or unpaired reads. The results must then be processed and filtered to generate coverage tables. Setting up multiple read files (10 paired metagenomes = 10 alignment runs) and then processing the resulting alignments can be challenging (not to mention the computational resources required).
- Mapping reads from multiple metagenomes to a set of references
- Filtering mapped reads and generating coverage tables
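As a rough illustration of why "10 paired metagenomes = 10 alignment runs" quickly becomes bookkeeping, the following Python sketch builds one Bowtie2 command per paired-end sample and computes a simple mean-coverage table. It is a sketch under stated assumptions, not the workflow the iVirus Apps run: the file names, the `references` index prefix, and the `alignments` output folder are hypothetical, and mean coverage is taken here as aligned bases divided by contig length. The commands are only printed, not executed.

```python
from pathlib import Path


def bowtie2_commands(index_prefix, samples, out_dir, threads=4):
    """Return one bowtie2 argument list per paired-end sample.

    samples maps a sample name to its (forward, reverse) read files.
    """
    commands = []
    for name, (fwd, rev) in sorted(samples.items()):
        sam_path = Path(out_dir) / f"{name}.sam"
        commands.append([
            "bowtie2",
            "-x", index_prefix,   # reference index (built beforehand with bowtie2-build)
            "-1", fwd,            # forward reads
            "-2", rev,            # reverse reads
            "-S", str(sam_path),  # SAM output file
            "-p", str(threads),   # number of alignment threads
        ])
    return commands


def coverage_table(contig_lengths, alignments):
    """Mean depth per contig: total aligned bases divided by contig length.

    alignments is an iterable of (contig, start, end) with half-open intervals.
    """
    aligned = {contig: 0 for contig in contig_lengths}
    for contig, start, end in alignments:
        aligned[contig] += end - start
    return {c: aligned[c] / contig_lengths[c] for c in contig_lengths}


# Hypothetical sample names: ten paired metagenomes give ten alignment runs.
samples = {
    f"metagenome_{i:02d}": (f"metagenome_{i:02d}_R1.fastq",
                            f"metagenome_{i:02d}_R2.fastq")
    for i in range(1, 11)
}
commands = bowtie2_commands("references", samples, "alignments")
for cmd in commands:
    print(" ".join(cmd))

# Toy alignments: two 50-base hits on a 100-base contig.
table = coverage_table({"contig_1": 100},
                       [("contig_1", 0, 50), ("contig_1", 25, 75)])
print(table)
```

Each printed command could be passed to a job scheduler or to `subprocess.run`; the coverage function would normally be fed intervals parsed from the filtered alignments rather than hand-written tuples.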
Before processing any data, users will need to upload their data to CyVerse’s Data Store. The Data Store is built on iRODS, an open source data management system. Data can be uploaded directly through the Discovery Environment’s (DE) upload menu (this is limited to 2 GB per file) or through one of the iRODS clients (click here for a list of available offerings). The easiest way to upload files securely and quickly is by using Cyberduck. Here we’ll assume you’ve installed Cyberduck and are connecting to the Data Store (a complete guide is available here):
user: your CyVerse username
password: your CyVerse password
Once you’ve logged in, you should be at your home folder. Drag and drop your read files into your home directory.
Quality Control of Read Data
Once you’ve uploaded your read data, you’ll need to QC (“Quality Control”) it. This ensures that the data going into the assembly (the next step) is of high quality. Poor read quality can result in misassembled sequences. Most frequently, read data QC involves trimming reads according to their quality scores. Although some assemblers do not require QC’d reads, we highly recommend it! A number of Apps are available for trimming reads:
- Btrim (documentation, citation)
- fastx quality trimmer (site)
- Trimmomatic (documentation, citation)
- Scythe (documentation, site)
- Sickle (site)
A highly detailed overview of read processing using the CyVerse system is available here.
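To make "trimming reads according to their quality scores" concrete, here is a minimal Python sketch of sliding-window trimming. It is a simplified illustration of the general idea, not the algorithm implemented by Trimmomatic or any of the Apps listed above; the window size, the threshold, and the Phred+33 encoding are assumptions chosen for the example.

```python
def phred_scores(quality_line, offset=33):
    """Decode a FASTQ quality string into integer Phred scores (Phred+33 assumed)."""
    return [ord(ch) - offset for ch in quality_line]


def sliding_window_trim(sequence, quality_line, window=4, min_mean_quality=20):
    """Cut the read at the start of the first window whose mean quality is too low.

    Reads shorter than the window are returned unchanged.
    """
    scores = phred_scores(quality_line)
    for start in range(0, len(scores) - window + 1):
        window_scores = scores[start:start + window]
        if sum(window_scores) / window < min_mean_quality:
            return sequence[:start], quality_line[:start]
    return sequence, quality_line


# Toy read: six high-quality bases ('I' = Q40) followed by four poor ones ('#' = Q2).
trimmed = sliding_window_trim("ACGTACGTAC", "IIIIII####")
print(trimmed)

# A read with uniformly high quality passes through untouched.
untouched = sliding_window_trim("ACGTACGT", "IIIIIIII")
print(untouched)
```

Real trimmers typically combine several steps (adapter removal, leading and trailing trimming, minimum-length filtering), so the parameters above should not be read as recommended settings.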
Following read trimming and QC, reads can now be assembled into contiguous sequences (“contigs”). Most “recent” assemblers are designed to assemble Illumina data (short read lengths, massively deep sequencing) and are based on de Bruijn graphs (original ref). Assembler selection depends on the type of read data being assembled (often 454 vs. Illumina vs. PacBio), the source material (DNA vs. RNA, eukaryotic vs. prokaryotic) and/or sample-specific determinants that may have biased the reads (high/low coverage, repetitive sequences, amplification polymerase, etc.). There is no “best” assembler, though some assemblers perform better with viral metagenomes than others.
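For readers unfamiliar with de Bruijn graphs, the following Python sketch shows the core idea on a toy example: reads are broken into k-mers, each k-mer becomes an edge between its two overlapping (k-1)-mers, and unbranched paths through the graph spell out contigs. This is a teaching sketch only. Real assemblers add error correction, coverage handling, and graph simplification, and the function names here are hypothetical.

```python
from collections import defaultdict


def de_bruijn_graph(reads, k):
    """Map each (k-1)-mer to the set of (k-1)-mers that follow it in some read."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph


def walk_unbranched(graph, start):
    """Extend a contig from start while each node has exactly one successor."""
    contig = start
    node = start
    visited = {start}
    while len(graph.get(node, ())) == 1:
        nxt = next(iter(graph[node]))
        if nxt in visited:  # stop on cycles instead of looping forever
            break
        visited.add(nxt)
        contig += nxt[-1]   # each step adds one new base
        node = nxt
    return contig


# Two overlapping toy reads drawn from the sequence ACGTCA.
graph = de_bruijn_graph(["ACGTC", "CGTCA"], k=3)
contig = walk_unbranched(graph, "AC")
print(contig)
```

Repeats and sequencing errors create branches in the graph, which is one reason assembler choice and read QC matter for the final contigs.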
A list of assembly Apps that are available:
- IDBA-UD (documentation, citation)
- Meta-Velvet (citation)
- Newbler (also known as “gs Assembler”, site)
- SOAPdenovo2 (documentation, citation)
- Trinity (documentation, citation)
- SPAdes (documentation, citation)
Identifying Viral Sequences
After assembly, a file containing the assembled sequences (contigs) will be created. Depending on the assembler, the contigs file may include all the contigs or only contigs above a certain threshold (length, quality, etc.). Consult the assembler’s documentation to see which files are generated.
VirSorter (documentation, reference) is an iVirus-exclusive app that can identify viral signal from a variety of different sources – single cell genomes, microbial and viral metagenomes, fragmented genomes, complete genomes, etc.
- To use VirSorter, navigate to “Apps” -> “VirSorter 1.0.2”
- Select VirSorter, opening up its app menu.
- Select the contigs file generated from the assembler and a reference database (“RefSeq” or “Virome”), then launch the analysis.
Once complete (which can take minutes, hours, or days), VirSorter will generate several folders containing a number of files. The two most important are VIRSorter_global-phage-signal.csv and the files in /Predicted_viral_sequences. The global-phage-signal.csv file documents all sequences identified as viral and their “confidence” category, with “1” being the highest confidence and “3” being the lowest. It also includes summary information about what went into the scoring for each sequence. The files in /Predicted_viral_sequences are nucleotide-sequence files organized by the categories described above and split into predicted viruses and prophages.
For a more detailed explanation of the output, please see the documentation.
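As a hedged illustration of working with the confidence categories, the Python sketch below tallies sequences per category and keeps only those at or above a chosen confidence. It assumes a simplified, hypothetical layout of comma-separated rows with the sequence ID first and the category second, and comment lines starting with "#". The real VIRSorter_global-phage-signal.csv contains more columns, so check the documentation and adjust the column positions before using anything like this.

```python
import csv
import io
from collections import Counter


def tally_categories(handle, max_category=3):
    """Count sequences per category and list IDs with category <= max_category.

    Assumes (hypothetically) that column 0 is the sequence ID and column 1 the
    category, and that lines starting with '#' are comments or headers.
    """
    counts = Counter()
    kept = []
    for row in csv.reader(handle):
        if not row or row[0].startswith("#"):
            continue
        seq_id, category = row[0], int(row[1])
        counts[category] += 1
        if category <= max_category:
            kept.append(seq_id)
    return counts, kept


# Toy stand-in for the signal file, using the simplified two-column layout.
example = io.StringIO(
    "# sequence_id,category\n"
    "contig_1,1\n"
    "contig_2,3\n"
    "contig_3,1\n"
    "contig_4,2\n"
)
counts, kept = tally_categories(example, max_category=2)
print(dict(counts))
print(kept)
```

Restricting to categories 1 and 2, as in the usage above, is just one possible filtering choice and not a recommendation from the VirSorter authors.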
Analyzing viral data remains a major challenge in the field of viral ecology. A variety of approaches have been proposed, each dependent on the source of data and the underlying biological question. A relatively recent method of analyzing complex viral data is to organize viral sequence space, often through protein clustering techniques. Protein clusters can be used as a diversity metric, as units for ecological studies when compared against other datasets, or for functional profiling of the community.
Protein Cluster Based
- vContact-PCs (original site, original documentation) is an MCL-based protein clustering App that uses an all-versus-all BLASTp file to generate the input files required for vContact. Additionally, it takes an annotation file containing protein IDs, the source contig name for each protein, and any keywords. This is generally the easiest way to prepare data for use with vContact.
- vContact (original site, original documentation) is an MCL-based contig clustering App inspired by Lima-Mendez et al. that classifies viruses based on gene content. Unlike more traditional approaches to classifying viruses (nucleic acid type, morphology, host range), vContact takes a genomic perspective. Three files are required: a protein clustering file listing the proteins associated with each protein cluster, a file recording which proteins were generated from which contigs, and a file with information about the protein clusters themselves. All three of these files can be generated by vContact-PCs.
- BLAST is arguably the most popular suite of similarity-based sequence analysis tools.
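To show what "classifying viruses based on gene content" can look like at the data level, the following Python sketch combines a protein-to-contig mapping with a protein-to-cluster mapping to give each contig a protein-cluster profile, then counts the clusters shared by each pair of contigs. This only illustrates the kind of information held in the input files described above. It does not reproduce vContact's scoring or its MCL clustering step, and all identifiers are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def contig_profiles(protein_to_contig, protein_to_cluster):
    """Return, for each contig, the set of protein clusters found on it."""
    profiles = defaultdict(set)
    for protein, contig in protein_to_contig.items():
        cluster = protein_to_cluster.get(protein)
        if cluster is not None:  # skip proteins that were not clustered
            profiles[contig].add(cluster)
    return profiles


def shared_clusters(profiles):
    """Count protein clusters shared by each pair of contigs (pairs with none are omitted)."""
    shared = {}
    for a, b in combinations(sorted(profiles), 2):
        common = profiles[a] & profiles[b]
        if common:
            shared[(a, b)] = len(common)
    return shared


# Toy data: contigA and contigB carry the same two clusters, contigC is unrelated.
protein_to_contig = {
    "p1": "contigA", "p2": "contigA",
    "p3": "contigB", "p4": "contigB",
    "p5": "contigC",
}
protein_to_cluster = {"p1": "PC1", "p2": "PC2", "p3": "PC1", "p4": "PC2", "p5": "PC3"}

profiles = contig_profiles(protein_to_contig, protein_to_cluster)
shared = shared_clusters(profiles)
print(shared)
```

Contigs that share many protein clusters are candidates for grouping together; deciding whether that sharing is meaningful is the part that the App itself handles.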