Mendes, Catarina Inês; Vila-Cerqueira, Pedro; Motro, Yair; Moran-Gilad, Jacob; Carriço, João André; Ramirez, Mário
doi: 10.1093/gigascience/giac122
pmid: 36576131
Background: The de novo assembly of raw sequence data is key in metagenomic analysis. It allows recovering draft genomes from a pool of mixed raw reads, yielding longer sequences that offer contextual information and provide a more complete picture of the microbial community.

Findings: To better compare de novo assemblers for metagenomic analysis, LMAS (Last Metagenomic Assembler Standing) was developed as a flexible platform allowing users to evaluate assembler performance given known standard communities. Overall, in our test datasets, k-mer De Bruijn graph assemblers outperformed the alternative approaches but came with a greater computational cost. Furthermore, assemblers branded as metagenomic specific did not consistently outperform other genomic assemblers in metagenomic samples. Some assemblers still in use, such as ABySS, MetaHipmer2, minia, and VelvetOptimiser, perform relatively poorly and should be used with caution when assembling complex samples. Meaningful strain resolution at the single-nucleotide polymorphism level was not achieved, even by the best assemblers tested.

Conclusions: The choice of a de novo assembler depends on the computational resources available, the replicon of interest, and the major goals of the analysis. No single assembler appeared an ideal choice for short-read metagenomic prokaryote replicon assembly, each showing specific strengths. The choice of metagenomic assembler should be guided by user requirements and characteristics of the sample of interest, and LMAS provides an interactive evaluation platform for this purpose. LMAS is open source, and the workflow and its documentation are available at https://github.com/B-UMMI/LMAS and https://lmas.readthedocs.io/, respectively.
Shukla, Harsh; Suryamohan, Kushal; Khan, Anubhab; Mohan, Krishna; Perumal, Rajadurai C; Mathew, Oommen K; Menon, Ramesh; Dixon, Mandumpala Davis; Muraleedharan, Megha; Kuriakose, Boney; Michael, Saju; Krishnankutty, Sajesh P; Zachariah, Arun; Seshagiri, Somasekar; Ramakrishnan, Uma
Avila Cartes, Jorge; Anand, Santosh; Ciccolella, Simone; Bonizzoni, Paola; Della Vedova, Gianluca
doi: 10.1093/gigascience/giac119
pmid: 36576129
Background: Since the beginning of the coronavirus disease 2019 pandemic, there has been an explosion of sequencing of the severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) virus, making it the most widely sequenced virus in history. Several databases and tools have been created to keep track of genome sequences and variants of the virus; most notably, the GISAID platform hosts millions of complete genome sequences and is expanding every day. A challenging task is the development of fast and accurate tools that are able to distinguish between the different SARS-CoV-2 variants and assign them to a clade.

Results: In this article, we leverage the frequency chaos game representation (FCGR) and convolutional neural networks (CNNs) to develop an original method that learns to classify genome sequences, which we implement in CouGaR-g, a tool for the clade assignment problem on SARS-CoV-2 sequences. On a testing subset of GISAID, CouGaR-g achieved a 96.29% overall accuracy, while a similar tool, Covidex, obtained a 77.12% overall accuracy. As far as we know, our method is the first to use deep learning and FCGR for intraspecies classification. Furthermore, using feature importance methods, CouGaR-g can identify k-mers that match SARS-CoV-2 marker variants.

Conclusions: By combining FCGR and CNNs, we develop a method that achieves better accuracy than Covidex (which is based on random forest) for clade assignment of SARS-CoV-2 genome sequences, also thanks to our training on a much larger dataset, with comparable running times. Our method implemented in CouGaR-g is able to detect k-mers that capture relevant biological information that distinguishes the clades, known as marker variants.

Availability: The trained models can be tested online by providing a FASTA file (with 1 or multiple sequences) at https://huggingface.co/spaces/BIASLab/sars-cov-2-classification-fcgr. CouGaR-g is also available at https://github.com/AlgoLab/CouGaR-g under the GPL.
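
For readers unfamiliar with the FCGR mentioned in this abstract, the following is a minimal sketch of how k-mer counts can be arranged into a 2^k x 2^k matrix that a CNN could then treat as an image. It is not taken from CouGaR-g: the function names `kmer_to_cell` and `fcgr`, the corner layout in `CORNER`, and the choice to skip k-mers containing ambiguous bases are all assumptions made for illustration, and the tool's own conventions may differ.

```python
# Illustrative FCGR sketch (not the CouGaR-g implementation).
# Assumed corner layout for the chaos game square; other layouts exist.
CORNER = {"A": (0, 0), "C": (0, 1), "G": (1, 1), "T": (1, 0)}


def kmer_to_cell(kmer):
    """Map a k-mer to its (row, col) cell in a 2^k x 2^k grid.

    In the chaos game, the most recent nucleotide decides the coarsest
    quadrant, so the k-mer is read from its last base to its first.
    """
    row = col = 0
    for base in reversed(kmer):
        r, c = CORNER[base]
        row = (row << 1) | r
        col = (col << 1) | c
    return row, col


def fcgr(sequence, k):
    """Count every k-mer of `sequence` into a 2^k x 2^k matrix."""
    size = 2 ** k
    matrix = [[0] * size for _ in range(size)]
    seq = sequence.upper()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        # Assumption: k-mers with ambiguous bases (e.g. N) are skipped.
        if any(base not in CORNER for base in kmer):
            continue
        row, col = kmer_to_cell(kmer)
        matrix[row][col] += 1
    return matrix


if __name__ == "__main__":
    for line in fcgr("ACGTACGTNACG", 2):
        print(line)
```

Each cell corresponds to exactly one k-mer, so a cell that a feature importance method highlights can be traced back to a specific k-mer, which is the property the abstract relies on when matching k-mers to marker variants.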
The tiger, a poster child for conservation, remains an endangered apex predator. Continued survival and recovery will require a comprehensive understanding of genetic diversity and the use of such information for population management. A high-quality tiger genome assembly will be an important tool for conservation genetics, especially for the Indian tiger, the most abundant subspecies in the wild. Here, we present high-quality near-chromosomal genome assemblies of a female and a male wild Indian tiger (Panthera tigris tigris). Our assemblies had a scaffold N50 of >140 Mb, with 19 scaffolds corresponding to the 19 numbered chromosomes, containing 95% of the genome. Our assemblies also enabled detection of longer stretches of runs of homozygosity compared to previous assemblies, which will help improve estimates of genomic inbreeding. Comprehensive genome annotation identified 26,068 protein-coding genes, including several gene families involved in key morphological features such as the teeth, claws, vision, olfaction, taste, and body stripes. We also identified 301 microRNAs, 365 small nucleolar RNAs, 632 transfer RNAs, and other noncoding RNA elements, several of which are predicted to regulate key biological pathways that likely contribute to the tiger's apex predatory traits. We identify signatures of positive selection in the tiger genome that are consistent with the Panthera lineage. Our high-quality genome will enable use of noninvasive samples for comprehensive assessment of genetic diversity, thus supporting effective conservation and management of wild tiger populations.
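
The scaffold N50 quoted in this abstract is the length of the shortest scaffold in the smallest set of longest scaffolds that together cover at least half of the assembly. The sketch below shows that calculation under the usual definition; the function name `n50` and the scaffold lengths are hypothetical toy values, not figures from the tiger assemblies.

```python
def n50(lengths):
    """Return the N50 of a collection of scaffold lengths.

    Scaffolds are visited from longest to shortest; the N50 is the length
    of the scaffold at which the running total first reaches at least half
    of the total assembly length.
    """
    if not lengths:
        raise ValueError("at least one scaffold length is required")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


if __name__ == "__main__":
    # Hypothetical scaffold lengths in Mb, for illustration only.
    toy_scaffolds = [80, 70, 50, 40, 30, 20]
    print(n50(toy_scaffolds))
```

A higher N50 indicates that more of the assembly sits in long scaffolds, which is why it is reported alongside the number of chromosome-scale scaffolds.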