HiFine: integrating Hi-C-based and shotgun-based methods to refine binning of metagenomic contigs
Du, Yuxuan; Sun, Fengzhu
doi: 10.1093/bioinformatics/btac295
pmid: 35482530
Motivation: Metagenomic binning aims to retrieve microbial genomes directly from ecosystems by clustering metagenomic contigs assembled from short reads into draft genomic bins. Traditional shotgun-based binning methods depend on the contigs’ composition and abundance profiles and are impaired when too few samples are available to construct reliable co-abundance profiles. When applied to a single sample, shotgun-based binning methods struggle to distinguish closely related species using composition information alone. As an alternative binning approach, Hi-C-based binning employs the metagenomic Hi-C technique to measure the proximity contacts between metagenomic fragments. However, spurious inter-species Hi-C contacts, inevitably generated by incorrect ligations of DNA fragments between species, link contigs from different genomes, reducing the purity of the final draft genomic bins. Therefore, it is imperative to develop a binning pipeline that overcomes the shortcomings of both types of binning methods on a single sample.
Results: We develop HiFine, a novel binning pipeline to refine the binning results of metagenomic contigs by integrating both Hi-C-based and shotgun-based binning tools. HiFine designs a fragmentation strategy for the original bin sets derived from the Hi-C-based and shotgun-based binning methods, which considerably increases the purity of the initial bins, followed by merging fragmented bins and recruiting unbinned contigs. We demonstrate that HiFine significantly improves the existing binning results of both types of binning methods and achieves better performance in constructing species genomes on publicly available datasets. To the best of our knowledge, HiFine is the first pipeline to integrate different types of tools for the binning of metagenomic contigs.
Availability and implementation: HiFine is available at https://github.com/dyxstat/HiFine.
Supplementary information: Supplementary data are available at Bioinformatics online.
Deephos: predicted spectral database search for TMT-labeled phosphopeptides and its false discovery rate estimation
Na, Seungjin; Choi, Hyunjin; Paek, Eunok
doi: 10.1093/bioinformatics/btac280
pmid: 35441674
Motivation: Tandem mass tag (TMT)-based tandem mass spectrometry (MS/MS) has become the method of choice for the quantification of post-translational modifications in complex mixtures. Many cancer proteogenomic studies have highlighted the importance of large-scale phosphopeptide quantification coupled with TMT labeling. Herein, we propose a predicted Spectral DataBase (pSDB) search strategy called Deephos that can improve both sensitivity and specificity in identifying MS/MS spectra of TMT-labeled phosphopeptides.
Results: With deep learning-based fragment ion prediction, we compiled a pSDB of TMT-labeled phosphopeptides generated from ∼8000 human phosphoproteins annotated in UniProt. Deep learning could successfully recognize the fragmentation patterns altered by both TMT labeling and phosphorylation. In addition, we discuss the decoy spectra for false discovery rate (FDR) estimation in the pSDB search. We show that FDR could be inaccurately estimated by the existing decoy spectra generation methods and propose an innovative method to generate decoy spectra for more accurate FDR estimation. The utilities of Deephos were demonstrated in multi-stage analyses (coupled with database searches) of glioblastoma, acute myeloid leukemia and breast cancer phosphoproteomes.
Availability and implementation: Deephos pSDB and the search software are available at https://github.com/seungjinna/deephos.
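As a rough illustration of why decoy quality matters for the FDR estimate discussed in the Deephos abstract, the sketch below shows the generic target-decoy calculation: the number of decoy matches above a score threshold is taken as an estimate of the number of false target matches. This is not the Deephos decoy-generation method; the function name `estimate_fdr` and the example scores are hypothetical.

```python
def estimate_fdr(matches, threshold):
    """Generic target-decoy FDR estimate at a score threshold.

    `matches` is a list of (score, is_decoy) pairs. The estimate assumes
    that decoy matches model false target matches one-to-one, which is
    exactly the assumption that poorly constructed decoys can violate.
    """
    targets = sum(1 for score, is_decoy in matches
                  if score >= threshold and not is_decoy)
    decoys = sum(1 for score, is_decoy in matches
                 if score >= threshold and is_decoy)
    if targets == 0:
        return 0.0  # nothing accepted, so nothing can be a false discovery
    return min(1.0, decoys / targets)


# Hypothetical spectrum matches: (search score, matched a decoy spectrum?)
matches = [
    (0.95, False), (0.90, False), (0.85, True),
    (0.80, False), (0.40, True), (0.30, False),
]
fdr = estimate_fdr(matches, 0.75)  # 1 decoy and 3 targets pass the threshold
```

If decoy spectra are systematically easier or harder to match than false targets, the ratio above is biased, which is the kind of inaccuracy the abstract refers to.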
Scoring protein sequence alignments using deep learning
Shrestha, Bikash; Adhikari, Badri
doi: 10.1093/bioinformatics/btac210
pmid: 35385080
Motivation: A high-quality sequence alignment (SA) is the most important input feature for accurate protein structure prediction. For a protein sequence, there are many methods to generate an SA. However, when more than one SA is available for a protein sequence, there are no methods to predict which SA may lead to more accurate models without actually building the models. In this work, we describe a method to predict the quality of a protein’s SA.
Results: We created our own dataset by generating a variety of SAs for a set of 1351 representative proteins and investigated various deep learning architectures to predict the local distance difference test (lDDT) scores of distance maps predicted with SAs as the input. These lDDT scores serve as indicators of the quality of the SAs. Using two independent test datasets consisting of CASP13 and CASP14 targets, we show that our method is effective for scoring and ranking SAs when a pool of SAs is available for a protein sequence. With an example, we further discuss how SA selection using our method can lead to improved structure prediction.
Availability and implementation: Code and the data underlying this article are available at https://github.com/ba-lab/Alignment-Score/.
Supplementary information: Supplementary data are available at Bioinformatics online.
scGraph: a graph neural network-based approach to automatically identify cell types
Yin, Qijin; Liu, Qiao; Fu, Zhuoran; Zeng, Wanwen; Zhang, Boheng; Zhang, Xuegong; Jiang, Rui; Lv, Hairong
doi: 10.1093/bioinformatics/btac199
pmid: 35394015
Motivation: Single-cell technologies have played a crucial role in revolutionizing biological research over the past decade, strengthening our understanding of cell differentiation, development and regulation from a single-cell perspective. Single-cell RNA sequencing (scRNA-seq) is one of the most common single-cell technologies and enables probing transcriptional states in thousands of cells in one experiment. Identification of cell types from scRNA-seq measurements is a fundamental and crucial problem. Most previous studies directly take gene expression as input while ignoring the comprehensive gene–gene interactions.
Results: We propose scGraph, an automatic cell identification algorithm that leverages gene interaction relationships to enhance the performance of cell-type identification. scGraph is based on a graph neural network that aggregates the information of interacting genes. In a series of experiments, we demonstrate that scGraph is accurate and outperforms eight comparison methods in the task of cell-type identification. Moreover, scGraph automatically learns the gene interaction relationships from biological data, and the pathway enrichment analysis shows findings consistent with previous analyses, providing insights into the analysis of regulatory mechanisms.
Availability and implementation: scGraph is freely available at https://github.com/QijinYin/scGraph and https://figshare.com/articles/software/scGraph/17157743.
Supplementary information: Supplementary data are available at Bioinformatics online.
Robust and accurate estimation of cellular fraction from tissue omics data via ensemble deconvolution
Cai, Manqi; Yue, Molin; Chen, Tianmeng; Liu, Jinling; Forno, Erick; Lu, Xinghua; Billiar, Timothy; Celedón, Juan; McKennan, Chris; Chen, Wei; Wang, Jiebiao
doi: 10.1093/bioinformatics/btac279
pmid: 35438146
Motivation: Tissue-level omics data such as transcriptomics and epigenomics are an average across diverse cell types. To extract cell-type-specific (CTS) signals, dozens of cellular deconvolution methods have been proposed to infer cell-type fractions from tissue-level data. However, these methods produce vastly different results under various real data settings. Simulation-based benchmarking studies showed no universally best deconvolution approach. Ensemble methods have been attempted, but they only aggregate multiple single-cell references or reference-free deconvolution methods.
Results: To achieve a robust estimation of cellular fractions, we proposed EnsDeconv (Ensemble Deconvolution), which adopts CTS robust regression to synthesize the results from 11 single deconvolution methods, 10 reference datasets, 5 marker gene selection procedures, 5 data normalizations and 2 transformations. Unlike most benchmarking studies based on simulations, we compiled four large real datasets of 4937 tissue samples in total with measured cellular fractions and bulk gene expression from different tissues. Comprehensive evaluations demonstrated that EnsDeconv yields more stable, robust and accurate fractions than existing methods. We illustrated that the cellular fractions estimated by EnsDeconv enable various CTS downstream analyses, such as differential fractions associated with clinical variables. We further extended EnsDeconv to analyze bulk DNA methylation data.
Availability and implementation: EnsDeconv is freely available as an R package from https://github.com/randel/EnsDeconv. The RNA microarray data from the TRAUMA study are available and can be accessed in GEO (GSE36809). The demographic and clinical phenotypes can be shared on reasonable request to the corresponding authors. The RNA-seq data from the EVAPR study cannot be shared publicly due to the privacy of individuals who participated in the clinical research, in compliance with the IRB approval at the University of Pittsburgh. The RNA microarray data from the FHS study are available from dbGaP (phs000007.v32.p13). The RNA-seq data from the ROS study were downloaded from the AD Knowledge Portal.
Supplementary information: Supplementary data are available at Bioinformatics online.
scSGL: kernelized signed graph learning for single-cell gene regulatory network inference
Karaaslanli, Abdullah; Saha, Satabdi; Aviyente, Selin; Maiti, Tapabrata
doi: 10.1093/bioinformatics/btac288
pmid: 35451460
Motivation: Elucidating the topology of gene regulatory networks (GRNs) from large single-cell RNA sequencing datasets, while effectively capturing their inherent cell-cycle heterogeneity and dropouts, is currently one of the most pressing problems in computational systems biology. Recently, graph learning (GL) approaches based on graph signal processing have been developed to infer graph topology from signals defined on graphs. However, existing GL methods are not suitable for learning signed graphs, a characteristic feature of GRNs, which can account for both activating and inhibitory relationships in the gene network. They are also incapable of handling the high proportion of zero values present in single-cell datasets.
Results: To this end, we propose a novel signed GL approach, scSGL, that learns GRNs based on the assumption of smoothness and non-smoothness of gene expressions over activating and inhibitory edges, respectively. scSGL is then extended with kernels to account for the non-linearity of co-expression and to handle the frequently occurring zero values effectively. The proposed approach is formulated as a non-convex optimization problem and solved using an efficient ADMM framework. Performance assessment using simulated datasets demonstrates the superior performance of kernelized scSGL over existing state-of-the-art methods in GRN recovery. The performance of scSGL is further investigated using human and mouse embryonic datasets.
Availability and implementation: The scSGL code and analysis scripts are available at https://github.com/Single-Cell-Graph-Learning/scSGL.
Supplementary information: Supplementary data are available at Bioinformatics online.
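The smoothness assumption in the scSGL abstract can be made concrete with a small sketch. A common way to quantify smoothness of a signal over graph edges is the sum of squared differences between connected nodes: small for co-varying (activating) pairs, large for opposing (inhibitory) pairs. The code below only illustrates that quantity under this assumption; it is not the scSGL objective or its ADMM solver, and the gene names and values are hypothetical.

```python
def edge_variation(expression, edges):
    """Sum of squared expression differences across the given gene pairs.

    expression: dict mapping gene name -> list of values, one per cell
    edges: list of (gene_a, gene_b) pairs
    A small value means expression is smooth over these edges.
    """
    total = 0.0
    for gene_a, gene_b in edges:
        total += sum((x - y) ** 2
                     for x, y in zip(expression[gene_a], expression[gene_b]))
    return total


# Hypothetical expression of three genes across three cells.
expression = {
    "g1": [1.0, 2.0, 3.0],
    "g2": [1.0, 2.0, 2.5],  # tracks g1, as expected for an activating edge
    "g3": [3.0, 2.0, 1.0],  # opposes g1, as expected for an inhibitory edge
}
smooth = edge_variation(expression, [("g1", "g2")])  # small variation
rough = edge_variation(expression, [("g1", "g3")])   # large variation
```

A signed graph learner in this spirit would favor positive edges where the variation is small and negative edges where it is large.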
GMHCC: high-throughput analysis of biomolecular data using graph-based multiple hierarchical consensus clustering
Lu, Yifu; Yu, Zhuohan; Wang, Yunhe; Ma, Zhiqiang; Wong, Ka-Chun; Li, Xiangtao
doi: 10.1093/bioinformatics/btac290
pmid: 35451457
Motivation: Thanks to the development of high-throughput sequencing technologies, massive amounts of various biomolecular data have been accumulated to revolutionize the study of genomics and molecular biology. One of the main challenges in analyzing these biomolecular data is to cluster their subtypes into subpopulations to facilitate subsequent downstream analysis. Recently, many clustering methods have been developed to analyze biomolecular data. However, these computational methods are often limited by high dimensionality, data heterogeneity and noise.
Results: In our study, we develop a novel Graph-based Multiple Hierarchical Consensus Clustering (GMHCC) method with an unsupervised graph-based feature ranking (FR) and a graph-based linking method to explore the multiple hierarchical information of the underlying partitions of the consensus clustering for multiple types of biomolecular data. Specifically, we first propose a graph-based unsupervised FR model that measures each feature by building a graph over pairwise features and then assigning each feature a rank. Subsequently, to maintain the diversity and robustness of basic partitions (BPs), we propose using multiple diverse feature subsets to generate several BPs and then explore the hierarchical structures of the multiple BPs by refining the global consensus function. Finally, we develop a new graph-based linking method, which explicitly considers the relationships between clusters to generate the final partition. Experiments on multiple types of biomolecular data, including 35 cancer gene expression datasets and eight single-cell RNA-seq datasets, validate the effectiveness of our method over several state-of-the-art consensus clustering approaches. Furthermore, differential gene analysis, gene ontology enrichment analysis and KEGG pathway analysis are conducted, providing novel insights into cell developmental lineages and characterization mechanisms.
Availability and implementation: The source code is available at GitHub: https://github.com/yifuLu/GMHCC. The software and the supporting data can be downloaded from: https://figshare.com/articles/software/GMHCC/17111291.
Supplementary information: Supplementary data are available at Bioinformatics online.
Continuous chromatin state feature annotation of the human epigenome
Daneshpajouh, Habib; Chen, Bowen; Shokraneh, Neda; Masoumi, Shohre; Wiese, Kay C; Libbrecht, Maxwell W
doi: 10.1093/bioinformatics/btac283
pmid: 35451453
Motivation: Segmentation and genome annotation (SAGA) algorithms are widely used to understand genome activity and gene regulation. These methods take as input a set of sequencing-based assays of epigenomic activity, such as ChIP-seq measurements of histone modification and transcription factor binding. They output an annotation of the genome that assigns a chromatin state label to each genomic position. Existing SAGA methods have several limitations caused by the discrete annotation framework: such annotations cannot easily represent varying strengths of genomic elements, and they cannot easily represent combinatorial elements that simultaneously exhibit multiple types of activity. To remedy these limitations, we propose an annotation strategy that instead outputs a vector of chromatin state features at each position rather than a single discrete label. Continuous modeling is common in other fields, such as in topic modeling of text documents. We propose a method, epigenome-ssm-nonneg, that uses a non-negative state space model to efficiently annotate the genome with chromatin state features. We also propose several measures of the quality of a chromatin state feature annotation and we compare the performance of several alternative methods according to these quality measures.
Results: We show that chromatin state features from epigenome-ssm-nonneg are more useful for several downstream applications than both continuous and discrete alternatives, including their ability to identify expressed genes and enhancers. Therefore, we expect that these continuous chromatin state features will be valuable reference annotations to be used in visualization and downstream analysis.
Availability and implementation: Source code for epigenome-ssm is available at https://github.com/habibdanesh/epigenome-ssm and Zenodo (DOI: 10.5281/zenodo.6507585).
Supplementary information: Supplementary data are available at Bioinformatics online.
Comprehensive comparison of two types of algorithm for circRNA detection from short-read RNA-Seq
Liu, Hongfei; Akhatayeva, Zhanerke; Pan, Chuanying; Liao, Mingzhi; Lan, Xianyong
doi: 10.1093/bioinformatics/btac302
pmid: 35482518
Motivation: Circular RNA is generally formed by the ‘back-splicing’ process between the upstream splice acceptor and the downstream donor, whether or not under the regulation of the corresponding RNA-binding proteins or cis-elements. Therefore, a growing number of software packages have been developed, most of them based on the identification of back-spliced junction reads. However, recent studies developed two software tools that can detect circRNA candidates by constructing a k-mer table and/or a de Bruijn graph rather than by read mapping.
Results: Here, we compared the precision, sensitivity and detection efficiency of software tools based on different algorithms. Eleven representative detection tools covering the two types of algorithm were selected for the overall pipeline analysis of RNA-seq datasets with/without RNase R treatment in two cell lines. Precision, sensitivity, AUC, F1 score and detection efficiency metrics were assessed to compare the prediction tools. Meanwhile, the sensitivity and distribution of highly expressed circRNAs before and after RNase R treatment were also revealed by their enriched, unaffected and depleted candidate frequencies. Ultimately, we found that, compared to the k-mer based tools, CIRI2 and KNIFE, which are based on read mapping, had relatively superior and more balanced detection performance regardless of the cell line or RNase R (-/+) datasets.
Availability and implementation: All predicted results and source code can be retrieved from https://github.com/luffy563/circRNA_tools_comparison.
Supplementary information: Supplementary data are available at Bioinformatics online.
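For readers unfamiliar with the metrics named in the circRNA comparison, the sketch below computes precision, sensitivity and F1 score from a set of predicted circRNAs and a reference set. How the study defined its reference set is not stated in the abstract, so the coordinate strings and the function name `detection_metrics` are purely hypothetical.

```python
def detection_metrics(predicted, reference):
    """Return (precision, sensitivity, f1) for predicted vs. reference IDs."""
    predicted, reference = set(predicted), set(reference)
    tp = len(predicted & reference)   # predicted and in the reference
    fp = len(predicted - reference)   # predicted but not in the reference
    fn = len(reference - predicted)   # in the reference but missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    if precision + sensitivity == 0:
        return precision, sensitivity, 0.0
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return precision, sensitivity, f1


# Hypothetical back-splice junction coordinates.
predicted = {"chr1:100-500", "chr1:900-1500", "chr2:50-400", "chr3:10-90"}
reference = {"chr1:100-500", "chr2:50-400", "chr4:700-1200"}
precision, sensitivity, f1 = detection_metrics(predicted, reference)
```

Here two of four predictions are correct (precision 0.5) and two of three reference circRNAs are recovered (sensitivity about 0.67), which shows how a tool can trade one metric against the other.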
KIMGENS: a novel method to estimate kinship in organisms with mixed haploid diploid genetic systems robust to population structure
Wang, Yen-Wen; Ané, Cécile
doi: 10.1093/bioinformatics/btac293
pmid: 35482481
Motivation: Kinship estimation is necessary for evaluating violations of assumptions or testing certain hypotheses in many population genomic studies. However, kinship estimators are usually designed for diploid systems and cannot be used in populations with mixed haploid diploid genetic systems. The only estimators for different ploidies require datasets free of population structure, limiting their usage.
Results: We present KIMGENS (Kinship Inference for Mixed GENetic Systems), an estimator of kinship among individuals of various ploidies that is robust to population structure. This estimator is based on the popular KING-robust estimator but uses diploid relatives of the individuals of interest as references of heterozygosity and extends its use to haploid–diploid and haploid pairs of individuals. We demonstrate that KIMGENS estimates kinship more accurately than previously developed estimators in simulated panmictic, structured and admixed populations, but has lower accuracy when the individual of interest is inbred. KIMGENS also outperforms other estimators in a honeybee dataset. Therefore, KIMGENS is a valuable addition to a population geneticist’s toolbox.
Availability and implementation: KIMGENS and its associated simulation tool are implemented and available open-source at https://github.com/YenWenWang/HapDipKinship.
Supplementary information: Supplementary data are available at Bioinformatics online.
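For context, the sketch below implements the standard diploid KING-robust kinship estimator that the KIMGENS abstract names as its starting point. It covers diploid pairs only and does not include the haploid extensions or the use of diploid relatives as heterozygosity references described above. Genotypes are assumed to be coded as alternate-allele counts (0, 1, 2) with `None` for missing calls; the function name and example data are hypothetical.

```python
def king_robust(genotypes_i, genotypes_j):
    """KING-robust kinship estimate for two diploid individuals.

    Uses counts of heterozygous sites per individual, sites where both
    are heterozygous, and sites with opposite homozygotes (0 vs. 2).
    """
    het_i = het_j = both_het = opposite_hom = 0
    for a, b in zip(genotypes_i, genotypes_j):
        if a is None or b is None:
            continue  # skip sites with a missing call in either individual
        het_i += a == 1
        het_j += b == 1
        both_het += a == 1 and b == 1
        opposite_hom += {a, b} == {0, 2}
    het_min = min(het_i, het_j)
    if het_min == 0:
        raise ValueError("no heterozygous sites in at least one individual")
    return ((both_het - 2 * opposite_hom) / (2 * het_min)
            + 0.5
            - (het_i + het_j) / (4 * het_min))


# Identical genotypes (e.g. duplicate samples) give a kinship of 0.5.
same = king_robust([0, 1, 2, 1, 1, 0], [0, 1, 2, 1, 1, 0])
# Many opposite homozygotes drive the estimate below zero.
distant = king_robust([1, 1, 0, 2, 1, 0], [1, 0, 2, 0, 1, 2])
```

Because the denominator relies on heterozygous counts, the estimator is undefined for haploid individuals, which have no heterozygous sites; this is the gap that a mixed-ploidy estimator has to address.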