Genome-wide histone acetylation data improve prediction of mammalian transcription factor binding sites
Ramsey, Stephen A.; Knijnenburg, Theo A.; Kennedy, Kathleen A.; Zak, Daniel E.; Gilchrist, Mark; Gold, Elizabeth S.; Johnson, Carrie D.; Lampano, Aaron E.; Litvak, Vladimir; Navarro, Garnet; Stolyar, Tetyana; Aderem, Alan; Shmulevich, Ilya
doi: 10.1093/bioinformatics/btq405
pmid: 20663846
Motivation: Histone acetylation (HAc) is associated with open chromatin, and HAc has been shown to facilitate transcription factor (TF) binding in mammalian cells. In the innate immune system context, epigenetic studies strongly implicate HAc in the transcriptional response of activated macrophages. We hypothesized that using data from large-scale sequencing of a HAc chromatin immunoprecipitation assay (ChIP-Seq) would improve the performance of computational prediction of binding locations of TFs mediating the response to a signaling event, namely, macrophage activation.
Results: We tested this hypothesis using a multi-evidence approach for predicting binding sites. As a training/test dataset, we used ChIP-Seq-derived TF binding site locations for five TFs in activated murine macrophages. Our model combined TF binding site motif scanning with evidence from sequence-based sources and from HAc ChIP-Seq data, using a weighted sum of thresholded scores. We find that using HAc data significantly improves the performance of motif-based TF binding site prediction. Furthermore, we find that within regions of high HAc, local minima of the HAc ChIP-Seq signal are particularly strongly correlated with TF binding locations. Our model, using motif scanning and HAc local minima, improves the sensitivity for TF binding site prediction by ∼50% over a model based on motif scanning alone, at a false positive rate cutoff of 0.01.
Availability: The data and software source code for model training and validation are freely available online at http://magnet.systemsbiology.net/hac.
Contact: [email protected]; [email protected]
Supplementary information: Supplementary data are available at Bioinformatics online.
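The two ideas named in the abstract above, local minima of the HAc signal inside high-HAc regions and a weighted sum of thresholded scores, can be illustrated with a minimal Python sketch. The function names, the evidence names, and all thresholds and weights are hypothetical and chosen only for illustration; the published model may define and train these quantities differently.

```python
def local_minima_in_high_regions(signal, high_cutoff):
    """Return indices of strict local minima whose own value is still
    at or above `high_cutoff` (an assumed definition of "high HAc")."""
    minima = []
    for i in range(1, len(signal) - 1):
        is_dip = signal[i - 1] > signal[i] < signal[i + 1]
        if is_dip and signal[i] >= high_cutoff:
            minima.append(i)
    return minima


def combined_score(evidence, thresholds, weights):
    """Weighted sum of thresholded scores: each evidence source adds
    its weight only if its raw score reaches its threshold."""
    return sum(
        weights[name]
        for name, value in evidence.items()
        if value >= thresholds[name]
    )


# Hypothetical binned ChIP-Seq signal: a dip inside an acetylated region
# (index 3) and a dip in a low-signal region (index 7).
signal = [1, 8, 9, 6, 9, 8, 1, 0, 1]
dips = local_minima_in_high_regions(signal, high_cutoff=5)

# Hypothetical evidence for one candidate site.
evidence = {"motif": 0.9, "hac_minimum": 1.0, "conservation": 0.2}
thresholds = {"motif": 0.8, "hac_minimum": 0.5, "conservation": 0.5}
weights = {"motif": 2.0, "hac_minimum": 1.5, "conservation": 0.5}
score = combined_score(evidence, thresholds, weights)
```

In this toy input only the dip at index 3 is reported, because the dip at index 7 lies in a low-signal region, and the conservation evidence contributes nothing to the score because it falls below its threshold.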
EGM: encapsulated gene-by-gene matching to identify gene orthologs and homologous segments in genomes
Mahmood, Khalid; Konagurthu, Arun S.; Song, Jiangning; Buckle, Ashley M.; Webb, Geoffrey I.; Whisstock, James C.
doi: 10.1093/bioinformatics/btq339
pmid: 20581400
Motivation: Identification of functionally equivalent genes in different species is essential to understand the evolution of biological pathways and processes. At the same time, identification of strings of conserved orthologous genes helps identify complex genomic rearrangements across different organisms. Such an insight is particularly useful, for example, in the transfer of experimental results between different experimental systems such as Drosophila and mammals.
Results: Here, we describe the Encapsulated Gene-by-gene Matching (EGM) approach, a method that employs a graph matching strategy to identify gene orthologs and conserved gene segments. Given a pair of genomes, EGM constructs a global gene match for all genes taking into account gene context and family information. The Hungarian method for identifying the maximum weight matching in bipartite graphs is employed, where the resulting matching reveals one-to-one correspondences between nodes (genes) in a manner that maximizes the gene similarity and context.
Conclusion: We tested our approach by performing several comparisons including a detailed Human versus Mouse genome mapping. We find that the algorithm is robust and sensitive in detecting orthologs and conserved gene segments. EGM can sensitively detect rearrangements within large and small chromosomal segments. The EGM tool is fully automated and easy to use compared to other more complex methods that also require extensive manual intervention and input.
Availability: The EGM software, Supplementary information and other tools are available online from http://vbc.med.monash.edu.au/~kmahmood/EGM
Contacts: [email protected]; [email protected]
Supplementary information: Supplementary data are available at Bioinformatics online.
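To make the matching objective in the abstract above concrete, the following Python sketch finds a maximum weight one-to-one matching on a tiny bipartite graph. It deliberately uses exhaustive enumeration of permutations rather than the Hungarian method, so it is only practical for a handful of genes; the Hungarian method reaches the same optimum in polynomial time. The similarity matrix is invented for illustration and does not come from EGM.

```python
from itertools import permutations


def best_one_to_one_matching(weights):
    """Brute-force maximum weight perfect matching on a square matrix.

    weights[i][j] is an assumed similarity between gene i of genome A
    and gene j of genome B. Returns (total_weight, [(i, j), ...]).
    """
    n = len(weights)
    best_total, best_perm = float("-inf"), None
    for perm in permutations(range(n)):
        total = sum(weights[i][perm[i]] for i in range(n))
        if total > best_total:
            best_total, best_perm = total, perm
    return best_total, list(enumerate(best_perm))


# Hypothetical similarity scores. A greedy choice would pair gene 0 with
# gene 0 (weight 9), but the globally best assignment does not.
similarity = [
    [9, 8, 1],
    [8, 2, 1],
    [1, 1, 5],
]
total, pairs = best_one_to_one_matching(similarity)
```

The example shows why a global matching is used instead of picking the best hit per gene: pairing gene 0 with its top hit forces gene 1 into a poor match, whereas the global optimum pairs 0 with 1 and 1 with 0.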
Dealing with sparse data in predicting outcomes of HIV combination therapies
Bogojeska, Jasmina; Bickel, Steffen; Altmann, André; Lengauer, Thomas
doi: 10.1093/bioinformatics/btq361
pmid: 20624779
Motivation: Because no cure or vaccine exists for infection with human immunodeficiency virus (HIV), the standard approach to treating HIV patients is to repeatedly administer different combinations of several antiretroviral drugs. Because of the large number of possible drug combinations, manually finding a successful regimen becomes practically impossible. This presents a major challenge for HIV treatment. The application of machine learning methods for predicting virological responses to potential therapies is a possible approach to solving this problem. However, due to evolving trends in treating HIV patients, the available clinical datasets have a highly unbalanced representation of therapies, which might negatively affect the usefulness of derived statistical models.
Results: This article presents an approach that tackles the problem of predicting virological response to combination therapies by learning a separate logistic regression model for each therapy. The models are fitted by using not only the data from the target therapy but also the information from similar therapies. For this purpose, we introduce and evaluate two different measures of therapy similarity. The models are also able to incorporate phenotypic knowledge on the therapy outcomes through a Gaussian prior. With our approach we balance the uneven therapy representation in the datasets and produce higher quality models for therapies with very few training samples. According to the results from the computational experiments, our therapy similarity model performs significantly better than training separate models for each therapy by using solely their examples. Furthermore, the model's performance is as good as an approach that encodes therapy information in the input feature space, with the advantage of delivering better results for therapies with very few training samples.
Availability: Code of the efficient logistic regression is available from http://www.mpi-inf.mpg.de/%7Ejasmina/fastLogistic.zip
Contact: [email protected]
Supplementary information: Supplementary data are available at Bioinformatics online.
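One plausible way to write down a per-therapy logistic regression that borrows data from similar therapies and uses a Gaussian prior is given below. This is a sketch under stated assumptions, not the objective from the article: the similarity weight $s(t, t_i)$, the prior mean $\mu$ (standing in for phenotypic knowledge) and the prior variance $\sigma^2$ are all notation introduced here for illustration.

```latex
% Assumed notation: (x_i, y_i, t_i) is a training sample with features x_i,
% binary outcome y_i and therapy t_i; t is the target therapy.
\hat{w}_t \;=\; \arg\max_{w}\;
  \sum_{i} s(t, t_i)\,
  \Bigl[\, y_i \log \pi_i(w) + (1 - y_i) \log\bigl(1 - \pi_i(w)\bigr) \Bigr]
  \;-\; \frac{1}{2\sigma^{2}} \,\lVert w - \mu \rVert^{2},
\qquad
\pi_i(w) \;=\; \frac{1}{1 + e^{-w^{\top} x_i}} .
```

Under this reading, samples from therapies similar to $t$ contribute strongly to the likelihood, samples from dissimilar therapies contribute little, and the quadratic penalty pulls the weights toward $\mu$ when few samples are available.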
PERMORY: an LD-exploiting permutation test algorithm for powerful genome-wide association testing
Pahl, Roman; Schäfer, Helmut
doi: 10.1093/bioinformatics/btq399
pmid: 20605926
Motivation: In genome-wide association studies (GWAS) examining hundreds of thousands of genetic markers, the potentially high number of false positive findings requires statistical correction for multiple testing. Permutation tests are considered the gold standard for multiple testing correction in GWAS, because they simultaneously provide unbiased type I error control and high power. At the same time, they demand heavy computational effort, especially with large-scale datasets of modern GWAS. In recent years, the computational problem has been circumvented by using approximations to permutation tests, which, however, may be biased.
Results: We have tackled the original computational problem of permutation testing in GWAS and herein present a permutation test algorithm one or more orders of magnitude faster than existing implementations, which enables efficient permutation testing on a genome-wide scale. Our algorithm does not rely on any kind of approximation and hence produces unbiased results identical to a standard permutation test. A noteworthy feature of our algorithm is a particularly effective performance when analyzing high-density marker sets.
Availability: Freely available on the web at http://www.permory.org
Contact: [email protected]
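For readers unfamiliar with the baseline that the abstract above refers to, here is a minimal Python sketch of a standard maximum-statistic permutation test for multiple testing correction. It is the plain, slow procedure, not the PERMORY algorithm, and the test statistic (absolute difference in mean allele count between cases and controls) is an assumption chosen for brevity.

```python
import random


def mean_difference(genotypes, labels):
    """Absolute difference in mean allele count, cases (1) vs controls (0)."""
    cases = [g for g, y in zip(genotypes, labels) if y == 1]
    controls = [g for g, y in zip(genotypes, labels) if y == 0]
    return abs(sum(cases) / len(cases) - sum(controls) / len(controls))


def max_stat_permutation_pvalues(markers, labels, n_permutations=200, seed=0):
    """Multiplicity-adjusted p-values from the permutation distribution
    of the maximum statistic over all markers."""
    rng = random.Random(seed)
    observed = [mean_difference(m, labels) for m in markers]
    exceed = [0] * len(markers)
    shuffled = list(labels)
    for _ in range(n_permutations):
        rng.shuffle(shuffled)  # permute phenotype labels, keep genotypes fixed
        max_stat = max(mean_difference(m, shuffled) for m in markers)
        for j, obs in enumerate(observed):
            if max_stat >= obs:
                exceed[j] += 1
    # The +1 terms count the observed labelling as one permutation.
    return [(c + 1) / (n_permutations + 1) for c in exceed]


# Hypothetical data: 10 cases followed by 10 controls.
labels = [1] * 10 + [0] * 10
associated = [2] * 10 + [0] * 10   # genotype tracks phenotype exactly
unrelated = [0, 1, 2, 1] * 5       # genotype unrelated to phenotype
adjusted = max_stat_permutation_pvalues([associated, unrelated], labels)
```

Every permutation requires recomputing the statistic for every marker, which is the cost that grows prohibitive at genome-wide scale and that faster exact algorithms aim to reduce.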
Global sequence characterization of rice centromeric satellite based on oligomer frequency analysis in large-scale sequencing data
Macas, Jiří; Neumann, Pavel; Novák, Petr; Jiang, Jiming
doi: 10.1093/bioinformatics/btq343
pmid: 20616383
Motivation: Satellite DNA makes up a significant portion of many eukaryotic genomes, yet it is relatively poorly characterized even in extensively sequenced species. This is, in part, due to methodological limitations of traditional methods of satellite repeat analysis, which are based on multiple alignments of monomer sequences. Therefore, we employed an alternative, alignment-free approach utilizing k-mer frequency statistics, which is in principle more suitable for analyzing large sets of satellite repeat data, including sequence reads from next generation sequencing technologies.
Results: k-mer frequency spectra were determined for two sets of rice centromeric satellite CentO sequences, including 454 reads from ChIP-sequencing of CENH3-bound DNA (7.6 Mb) and the whole genome Sanger sequencing reads (5.8 Mb). k-mer frequencies were used to identify the most conserved sequence regions and to reconstruct consensus sequences of complete monomers. Reconstructed consensus sequences as well as the assessment of overall divergence of k-mer spectra revealed high similarity of the two datasets, suggesting that CentO sequences associated with functional centromeres (CENH3-bound) do not significantly differ from the total population of CentO, which includes both centromeric and pericentromeric repeat arrays. On the other hand, considerable differences were revealed when these methods were used for comparison of CentO populations between individual chromosomes of the rice genome assembly, demonstrating preferential sequence homogenization of the clusters within the same chromosome. k-mer frequencies were also successfully used to identify and characterize smRNAs derived from CentO repeats.
Contact: [email protected]
Supplementary information: Supplementary data are available at Bioinformatics online.
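The basic quantity behind the analysis described above, a k-mer frequency spectrum computed directly from unaligned reads, can be sketched in a few lines of Python. The reads and the value of k below are made up for illustration; the article's actual pipeline, choice of k and downstream statistics are not reproduced here.

```python
from collections import Counter


def kmer_counts(reads, k):
    """Count every overlapping k-mer across a collection of reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def kmer_frequencies(reads, k):
    """Normalize k-mer counts to relative frequencies that sum to 1."""
    counts = kmer_counts(reads, k)
    total = sum(counts.values())
    return {kmer: n / total for kmer, n in counts.items()}


# Hypothetical reads; real input would be sequencing reads of the satellite.
reads = ["ACGTACGT", "CGTAC"]
counts = kmer_counts(reads, k=4)
frequencies = kmer_frequencies(reads, k=4)
most_common = counts.most_common(3)  # the most frequent, i.e. most conserved, k-mers
```

Because no alignment is needed, the cost grows linearly with the total read length, which is what makes this kind of statistic attractive for large read sets.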
An alignment-free method to identify candidate orthologous enhancers in multiple Drosophila genomes
Arunachalam, Manonmani; Jayasurya, Karthik; Tomancak, Pavel; Ohler, Uwe
doi: 10.1093/bioinformatics/btq358
pmid: 20624780
Motivation: Evolutionarily conserved non-coding genomic sequences represent a potentially rich source for the discovery of gene regulatory regions such as transcriptional enhancers. However, detecting orthologous enhancers using alignment-based methods in higher eukaryotic genomes is particularly challenging, as regulatory regions can undergo considerable sequence changes while maintaining their functionality.
Results: We have developed an alignment-free method which identifies conserved enhancers in multiple diverged species. Our method uses similarity metrics between two sequences that are based on the co-occurrence of sequence patterns regardless of their order and orientation, thus tolerating sequence changes observed in non-coding evolution. We show that our method is highly successful in detecting orthologous enhancers in distantly related species without requiring additional information such as knowledge about the transcription factors involved or predicted binding sites. By estimating the significance of similarity scores, we are able to discriminate experimentally validated functional enhancers from seemingly equally conserved candidates without function. We demonstrate the effectiveness of this approach on a wide range of enhancers in Drosophila, and also present encouraging results in detecting conserved functional regions across large evolutionary distances. Our work provides encouraging steps on the way to ab initio unbiased enhancer prediction to complement ongoing experimental efforts.
Availability: The software, data and the results used in this article are available at http://www.genome.duke.edu/labs/ohler/research/transcription/fly_enhancer/
Contact: [email protected]; [email protected]
Supplementary information: Supplementary data are available at Bioinformatics online.
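The general idea of comparing sequences by shared patterns, ignoring where the patterns occur and on which strand, can be illustrated with the Python sketch below. It is a generic stand-in, not the similarity metric from the article: it assumes fixed-length k-mers as the "patterns", merges each k-mer with its reverse complement, and uses cosine similarity between the resulting count profiles.

```python
from collections import Counter
from math import sqrt

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical(kmer):
    """Represent a k-mer and its reverse complement by one key."""
    reverse_complement = kmer.translate(COMPLEMENT)[::-1]
    return min(kmer, reverse_complement)


def canonical_kmer_profile(sequence, k):
    """Orientation-independent k-mer counts; positions are discarded."""
    return Counter(
        canonical(sequence[i:i + k]) for i in range(len(sequence) - k + 1)
    )


def cosine_similarity(p, q):
    """Cosine of the angle between two sparse count vectors."""
    dot = sum(p[key] * q[key] for key in p.keys() & q.keys())
    norm = sqrt(sum(v * v for v in p.values())) * sqrt(sum(v * v for v in q.values()))
    return dot / norm if norm else 0.0


# A sequence and its reverse complement share every canonical 3-mer,
# so their similarity is 1 (up to floating-point rounding).
forward = canonical_kmer_profile("ACGGTCA", k=3)
reverse = canonical_kmer_profile("TGACCGT", k=3)
same_content = cosine_similarity(forward, reverse)
```

A score like this tolerates rearranged or inverted binding sites that would break a base-by-base alignment; in practice its significance would still need to be assessed against a background, as the abstract notes.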
Genome-wide functional element detection using pairwise statistical alignment outperforms multiple genome footprinting techniques
Satija, R.; Hein, J.; Lunter, G. A.
doi: 10.1093/bioinformatics/btq360
pmid: 20610610
Motivation: Comparative genomic sequence analysis is a powerful approach for identifying putative functional elements in silico. The availability of full-genome sequences from many vertebrate species has resulted in the development of popular tools, such as the phastCons software package, that search large numbers of genomes to identify conserved elements. While phastCons can analyze many genomes simultaneously, it ignores potentially informative insertion and deletion events and relies on a fixed, precomputed multiple sequence alignment.
Results: We have developed a new method, GRAPeFoot, which simultaneously aligns two full genomes and annotates a set of conserved regions exhibiting reduced rates of insertion, deletion and substitution mutations. We tested GRAPeFoot using the human and mouse genomes and compared its performance to a set of phastCons predictions hosted on the UCSC genome browser. Our results demonstrate that despite the use of only two genomes, GRAPeFoot identified constrained elements at rates comparable with phastCons, which analyzed data from 28 vertebrate genomes. This study demonstrates how integrated modelling of substitutions, indels and purifying selection allows a pairwise analysis to exhibit a sensitivity similar to a heuristic analysis of many genomes.
Availability: The GRAPeFoot software and set of genome-wide functional element predictions are freely available to download online at http://www.stats.ox.ac.uk/~satija/GRAPeFoot/
Contact: [email protected]
Supplementary information: Supplementary data are available at Bioinformatics online.
First insight into the prediction of protein folding rate change upon point mutation
Huang, Liang-Tsung; Gromiha, M. Michael
doi: 10.1093/bioinformatics/btq350
pmid: 20616385
Summary: The accurate prediction of protein folding rate change upon mutation is an important and challenging problem in protein folding kinetics and design. In this work, we have collected experimental data on protein folding rate change upon mutation from various sources and constructed a reliable and non-redundant dataset with 467 mutants. These mutants are widely distributed based on secondary structure, solvent accessibility, conservation score and long-range contacts. From systematic analysis of these parameters along with a set of 49 amino acid properties, we have selected a set of 12 features for discriminating the mutants that speed up or slow down the folding process. We have developed a method based on quadratic regression models for discriminating the accelerating and decelerating mutants, which showed an accuracy of 74% using the 10-fold cross-validation test. The sensitivity and specificity are 63% and 76%, respectively. The method can be improved with the inclusion of physical interactions and structure-based parameters.
Availability: http://bioinformatics.myweb.hinet.net/freedom.htm
Contact: [email protected]
Supplementary information: Supplementary data are available at Bioinformatics online.
Mining metabolic pathways through gene expression
Hancock, Timothy; Takigawa, Ichigaku; Mamitsuka, Hiroshi
doi: 10.1093/bioinformatics/btq344
pmid: 20587705
Motivation: An observed metabolic response is the result of the coordinated activation and interaction between multiple genetic pathways. However, the complex structure of metabolism has meant that a complete understanding of which pathways are required to produce an observed metabolic response has not yet been achieved. In this article, we propose an approach that can identify the genetic pathways which dictate the response of the metabolic network to specific experimental conditions.
Results: Our approach is a combination of probabilistic models for pathway ranking, clustering and classification. First, we use a non-parametric pathway extraction method to identify the most highly correlated paths through the metabolic network. We then extract the defining structure within these top-ranked pathways using both Markov clustering and classification algorithms. Furthermore, we define detailed node and edge annotations, which enable us to track each pathway, not only with respect to its genetic dependencies, but also allow for an analysis of the interacting reactions, compounds and KEGG sub-networks. We show that our approach identifies biologically meaningful pathways within two microarray expression datasets using entire KEGG metabolic networks.
Availability and implementation: An R package containing a full implementation of our proposed method is currently available from http://www.bic.kyoto-u.ac.jp/pathway/timhancock
Contact: [email protected]
Supplementary information: Supplementary data are available at Bioinformatics online.
Integration of pathway knowledge into a reweighted recursive feature elimination approach for risk stratification of cancer patients
Johannes, Marc; Brase, Jan C.; Fröhlich, Holger; Gade, Stephan; Gehrmann, Mathias; Fälth, Maria; Sültmann, Holger; Beißbarth, Tim
doi: 10.1093/bioinformatics/btq345
pmid: 20591905
Motivation: One of the main goals of high-throughput gene-expression studies in cancer research is to identify prognostic gene signatures, which have the potential to predict the clinical outcome. It is common practice to investigate these questions using classification methods. However, standard methods merely rely on gene-expression data and assume the genes to be independent. Including pathway knowledge a priori into the classification process has recently been indicated as a promising way to increase classification accuracy as well as the interpretability and reproducibility of prognostic gene signatures.
Results: We propose a new method called Reweighted Recursive Feature Elimination. It is based on the hypothesis that a gene with a low fold-change should have an increased influence on the classifier if it is connected to differentially expressed genes. We used a modified version of Google's PageRank algorithm to alter the ranking criterion of the SVM-RFE algorithm. Evaluations of our method on an integrated breast cancer dataset comprising 788 samples showed an improvement in the area under the receiver operating characteristic curve as well as in the reproducibility and interpretability of selected genes.
Availability: The R code of the proposed algorithm is given in Supplementary Material.
Contact: [email protected]; [email protected]
Supplementary information: Supplementary data are available at Bioinformatics online.
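The hypothesis stated in the abstract above, that a gene with a low fold-change should gain influence when its network neighbours are differentially expressed, can be illustrated with a PageRank-style iteration. The Python sketch below is a generic personalized PageRank over an undirected gene network; the abstract does not specify how the authors modified PageRank or how the result enters SVM-RFE, so the update rule, the damping value and the toy network are all assumptions.

```python
def network_smoothed_scores(adjacency, fold_changes, damping=0.5, iterations=100):
    """Personalized PageRank on an undirected gene network.

    adjacency maps each gene to its neighbours (assumed symmetric);
    fold_changes supplies the restart distribution via |fold change|.
    """
    genes = list(adjacency)
    total = sum(abs(fold_changes[g]) for g in genes)
    prior = {g: abs(fold_changes[g]) / total for g in genes}
    rank = dict(prior)
    for _ in range(iterations):
        updated = {}
        for g in genes:
            # Each neighbour shares its current score equally among its edges.
            inflow = sum(rank[n] / len(adjacency[n]) for n in adjacency[g])
            updated[g] = (1 - damping) * prior[g] + damping * inflow
        rank = updated
    return rank


# Hypothetical network: B and D have the same low fold-change, but B sits
# between two strongly changed genes while D is linked only to E.
adjacency = {
    "A": ["B"],
    "B": ["A", "C"],
    "C": ["B"],
    "D": ["E"],
    "E": ["D"],
}
fold_changes = {"A": 2.0, "B": 0.1, "C": 2.0, "D": 0.1, "E": 0.1}
scores = network_smoothed_scores(adjacency, fold_changes)
```

In this toy network gene B ends up with a much higher score than gene D despite their identical fold-changes, which is the kind of reweighting that could then be combined with a classifier's feature ranking.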