Rapid detection of viruses in plants is essential to avoid the large economic losses caused by these pathogens. Currently, common methods to identify viruses rely on biological indicators and molecular assays, which are limited to the virus or group of viruses they are designed to detect. Because of this, next-generation sequencing (NGS) techniques, such as smallRNA-seq and RNA-seq, are being rapidly adopted in the field. These techniques allow the unbiased detection of multiple viruses and the identification of novel viruses, an increasingly important issue in viral diagnostics. This is particularly true for preventing the entry of emerging viruses at borders, as post-entry quarantine systems are under pressure to detect new pathogens because of increases in global trade and movement.
However, using NGS methods for viral diagnosis requires lengthy and robust bioinformatics analysis. As a result, the last few years have seen an increase in bioinformatics tools designed for viral diagnosis from mixed (host and pathogen) samples. Most of these are designed for clinical samples, which renders them suboptimal for the identification of viruses in plants. In addition, many tools rely on alignment methods for de novo assembly and for mapping short reads to a reference, which (in addition to being slow) can lead to related viral sequences going undetected because of the high mutation rates of viral genomes.
To address this, we have developed a new bioinformatics pipeline that takes mixed raw smallRNA- or RNA-seq reads from plant samples and produces a viral index using k-mer profiles. This method divides sequences into k-mers of specific lengths and uses exact matching. The pipeline has been developed on Galaxy, an open platform for data-intensive analysis, which allows analyses to be run with no command-line input and makes the pipeline accessible to all researchers.
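The general idea of exact k-mer matching can be illustrated with a minimal Python sketch. This is not the pipeline's implementation: the function names (`kmers`, `build_index`, `profile_reads`), the toy sequences, and the choice of k = 5 are all assumptions made for illustration, and the sketch ignores reverse complements, ambiguous bases, and host-read filtering.

```python
from collections import Counter, defaultdict


def kmers(seq, k):
    """Yield every overlapping substring of length k in seq."""
    seq = seq.upper()
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def build_index(references, k):
    """Map each k-mer to the set of reference names that contain it."""
    index = defaultdict(set)
    for name, seq in references.items():
        for kmer in kmers(seq, k):
            index[kmer].add(name)
    return index


def profile_reads(reads, index, k):
    """Count, per reference, how many read k-mers match the index exactly."""
    hits = Counter()
    for read in reads:
        for kmer in kmers(read, k):
            # .get avoids adding unseen k-mers to the defaultdict
            for name in index.get(kmer, ()):
                hits[name] += 1
    return hits


# Hypothetical toy data, not real viral sequences.
references = {
    "virus_A": "ATGCGTACGTTAGC",
    "virus_B": "TTTTGGGGCCCCAAAA",
}
reads = ["GCGTACG", "GGGGCC", "ACACACAC"]

index = build_index(references, k=5)
hits = profile_reads(reads, index, k=5)
for name, count in hits.most_common():
    print(name, count)
```

Because lookups are exact dictionary hits rather than alignments, a read that shares only part of its length with a reference can still contribute matching k-mers, which is one reason k-mer profiles may tolerate divergent viral sequences better than whole-read mapping.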