Optimizing genome assembly and loci extraction for evolutionary analyses
Bioinformatic processing of raw DNA sequence data is a fundamental part of evolutionary biology and molecular ecology, and optimizing specific steps of this processing is currently a key challenge in these fields. For instance, there are no specific guidelines on the optimal parameters for assembling contiguous sequences of DNA (contigs) without a reference genome (de novo assembly), or for extracting targeted loci from these contigs. Moreover, there are no assessments of which available bioinformatic tools can simultaneously process data from different types of sequencing techniques (e.g., whole genome sequencing (WGS) or target capture sequencing) while optimizing the performance of de novo assembly and loci extraction.
Fig.: Using the guidelines and approaches described in our study may improve the efficiency of evolutionary analyses, especially for organisms for which no reference genome is available. The specimen of the Neotropical genus Urbanus (Lepidoptera: Hesperiidae: Eudaminae) seen in the picture is one such organism, and addressing questions about this group's evolutionary history may shed light on the outstanding patterns of Neotropical biodiversity. Photo credit: Daniel Linke.
In this study, we compared two of the most widely used genome assemblers, ABySS and SPAdes, on their ability to assemble contigs de novo from reads produced by different sequencing techniques: whole genome sequencing (at depths of coverage of 10X, 5X, and 2X) and target capture sequencing. We also developed a new loci extraction approach, in which multiple contigs are merged after assembly with the program ABySS, and implemented it in the SECAPR pipeline (Andermann et al., 2018). Using this pipeline, which allows WGS and target capture sequencing data to be processed together, we show that SPAdes and our newly developed approach with ABySS assemble better contigs and therefore allow more loci of interest to be extracted from both types of sequencing data. We also show that whole genome sequencing at a depth of coverage of 5X is currently the most cost-efficient sequencing approach for extracting loci for phylogenomics and molecular ecology. Our study represents a step forward for bioinformatics and evolutionary biology, since it provides guidelines that let users make an informed choice of the sequencing techniques and assembly approaches best suited to their data.
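To make the depths of coverage mentioned above (10X, 5X, 2X) more concrete, the following is a minimal sketch of the standard relationship between read count, read length, genome size, and depth (depth = number of reads × read length / genome size). It is not taken from the study: the function names, the 300 Mb genome size, and the 150 bp read length are hypothetical values chosen only for illustration.

```python
import math


def depth_of_coverage(n_reads, read_length, genome_size):
    """Average depth of coverage: total sequenced bases / genome size."""
    if genome_size <= 0:
        raise ValueError("genome_size must be positive")
    return n_reads * read_length / genome_size


def reads_for_depth(target_depth, read_length, genome_size):
    """Smallest number of reads that reaches the target average depth."""
    if read_length <= 0:
        raise ValueError("read_length must be positive")
    return math.ceil(target_depth * genome_size / read_length)


# Hypothetical example values (assumptions, not from the study).
GENOME_SIZE = 300_000_000  # 300 Mb genome
READ_LENGTH = 150          # 150 bp reads

for depth in (10, 5, 2):
    n = reads_for_depth(depth, READ_LENGTH, GENOME_SIZE)
    print(f"{depth}X needs about {n:,} reads of {READ_LENGTH} bp")
```

Under these assumed values, halving the target depth halves the number of reads required, which is why a lower depth such as 5X can be attractive on cost grounds when it still yields enough loci.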
de Gusmão Ribeiro P., Torres Jiménez M.F., Andermann T., Antonelli A., Bacon C.D., Matos Maravi P.F. (2021). A bioinformatic platform to integrate target capture and whole genome sequences of various read depths for phylogenomics. Molecular Ecology, Early View. DOI: 10.1111/mec.16240