MeV+R: using MeV as a graphical user interface for Bioconductor applications in microarray analysis

We present MeV+R, an integration of the Java MultiExperiment Viewer program with Bioconductor packages. This integration of MultiExperiment Viewer and R is easily extensible to other R packages and provides users with point and click access to traditionally command line driven tools written in R. We demonstrate the ability to use MultiExperiment Viewer as a graphical user interface for Bioconductor applications in microarray data analysis by incorporating three Bioconductor packages, RAMA, BRIDGE and iterativeBMA.

Genome Biology 2008, 9:R118 http://genomebiology.com/2008/9/7/R118 Genome Biology 2008, Volume 9, Issue 7, Article R118 Chu et al.

Rationale

While microarray technology has given biologists unprecedented access to gene expression data, reliable and effective data analysis remains a difficult problem. There are many freely or commercially available software packages, but biologists are often faced with trading off power and flexibility for usability and accessibility. In addition to the potentially prohibitive costs, researchers using commercial software tools may find themselves waiting for state-of-the-art algorithms to be implemented with the packages. The Bioconductor project [1,2] is an open source software project that provides a wide range of statistical tools primarily based on the R programming environment and language [3,4]. Taking advantage of R's powerful statistical and graphical capabilities, developers have created and contributed numerous Bioconductor packages to solve a variety of data analysis needs. The use of these packages, however, requires a basic understanding of the R programming/command language and an understanding of the documentation accompanying each package. The primary users of R and the Bioconductor packages have been computational scientists, statisticians and the more computationally oriented biologists. However, in our experience, many biologists find themselves uncomfortable issuing command lines in a terminal. Hence, there is a need for a graphical user interface (GUI) for Bioconductor packages that will allow biologists easy access to data analytical tools without learning the command line syntax.

The tcltk package in R adds GUI elements to R by allowing programmers to write GUI-driven modules by embedding Tk commands into the R language [5]. There are also GUIs developed for basic statistical analysis in R, such as the R Commander [6] and the Windows-based SciViews [7]. However, these GUIs are not designed for microarray analysis. There are Bioconductor packages, such as limmaGUI [8], affylmGUI [9] and OLINgui [10], that are built on the R tcltk package to provide GUIs. LimmaGUI and affylmGUI provide GUIs for the analysis of designed experiments and the assessment of differential expression for two-color spotted microarrays and single-color Affymetrix data, respectively. OLINgui provides a GUI for the visualization, normalization and quality testing of two-channel microarray data. However, no such GUIs are available for the majority of Bioconductor packages. In addition, since each Bioconductor package is often written by a different research group, there is generally no uniformity in the look and feel of the GUIs available for the different packages. Hence, the end user may not be able to easily transfer experience gained with one analysis tool to the use of another.

An alternative microarray data analysis tool is the MultiExperiment Viewer (MeV), a component of the TM4 suite of microarray analysis tools [11]. MeV is an open source Java application with a simple to learn, easy to use GUI designed with the biological community in mind. It comes with many popular microarray analytical algorithms for clustering, visualization, classification and biological theme discovery, such as hierarchical clustering [12] and Expression Analysis Systematic Explorer (EASE) [13]. MeV was carefully designed to provide an application programming interface (API), thus allowing straightforward contributions by the community. MeV is hosted at SourceForge [14] in a concurrent versions system repository. As such, frequent builds of the source code are made possible, greatly reducing the lag time between version releases.

In this paper, we present MeV+R, which is an effort to provide more consistent and well-integrated GUIs for Bioconductor packages by using MeV as a 'wrapper' application for Bioconductor methods. Our work brings the best of both worlds together: providing state-of-the-art statistical algorithms from Bioconductor through the open source and easy to use MeV graphical interface to the biomedical community. MeV+R has many advantages, including platform independence, a well-defined modular API, and a point and click GUI that is easy to learn and use. We demonstrate the successful integration and advantages of three Bioconductor packages (RAMA [15], BRIDGE [16], and iterativeBMA [17]) over existing tools in the MeV environment through case studies. The underlying framework that we used to integrate these Bioconductor packages with MeV is easily extensible to other analysis tools developed in R. The software, documentation and a tutorial are publicly available from our project home page [18].

Implementation

Our integration effort is composed of three separate entities (Figure 1). MeV provides the graphical user interface while Rserve serves as the communication layer and R is the language and environment in which the analysis packages run. Rserve is a TCP/IP server that allows various languages to use the facilities of R without the need to initialize R or link against an R library [19]. In other words, we use R as the back end to run Bioconductor packages through the use of Rserve. Rserve is open source, freely available [20], and licensed under GPL.

As such, Java, Rserve, and R must all be installed on the user's computer, and we provide an automated installer on our project web site. Furthermore, Rserve needs to be running to be used. However, R does not need to be started. Since Rserve works through TCP/IP, it can run on the user's own machine, on an internal network or over the internet. By default, our code assumes Rserve to be running on the local host, but the user can change, add and save additional new hosts using a pull down menu. Once a connection is established, the Java code in MeV converts the user's data from the MeV data structure to the R format and loads it into R. The appropriate R libraries are loaded, followed by the R commands that are necessary to initiate the analysis. Upon completion, the returned data from R are explicitly called back into MeV and presented to the user.

Figure 1. Our integration effort is composed of three separate entities: MeV as the GUI, Rserve as the communication layer, and R as the language and environment in which the analysis packages run.

We have incorporated three Bioconductor packages, RAMA [15], BRIDGE [16], and iterativeBMA [17], into MeV to illustrate the successful MeV+R integration. The Robust Analysis of MicroArray (RAMA) algorithm computes robust estimates of expression intensities from two-color microarray data, which typically consist of a few replicates and potential outliers [15]. RAMA also takes advantage of dye swap experimental designs. Bayesian Robust Inference for Differential Gene Expression (BRIDGE) is a robust algorithm that selects differentially expressed genes under different experimental conditions on both one- and two-color microarray data [16]. Both RAMA and BRIDGE make use of a computationally intensive technique called Markov Chain Monte Carlo for parameter estimation, and it is non-trivial to re-implement these algorithms in Java. Hence, we took advantage of our previous development work by simply using MeV as an interface to the Bioconductor packages. The iterative BMA algorithm is a multivariate gene selection and classification algorithm, which considers multiple genes simultaneously and typically leads to a small number of relevant genes to classify microarray data [17]. The iterativeBMA Bioconductor package implements the iterative BMA algorithm as previously described [17] in R, and its implementation is part of our current integration effort. Both RAMA and BRIDGE are included in the latest release of MeV (version 4.1), and iterativeBMA will be included in future releases. The user interfaces, usage and case studies for RAMA, BRIDGE and iterativeBMA are briefly described below. Detailed documentation is included with the software distribution [21] as well as linked in the MeV application. Help pages are also available as Help Dialogs accessed via buttons on the MeV dialog boxes. Our MeV+R implementation is publicly available and runs on Windows, Mac OS X and Linux.

Integrated Bioconductor packages: description and user interfaces

RAMA: Robust Analysis of MicroArrays

RAMA uses a Bayesian hierarchical model for the robust estimation of cDNA microarray intensities with replicates. This is highly relevant for replicated microarray experiments because even one outlying replicate (such as due to scratches or dust) can have a disastrous effect on the estimated signal intensity. Outliers are modeled explicitly using a t-distribution, which is more robust than the usual Gaussian model. Our model borrows strength from all the genes to decide if a measurement is an outlier, and hence it is better at detecting outliers based on a small number of replicate measurements than other classical robust estimators. Our algorithm uses Markov Chain Monte Carlo for parameter estimation, and addresses classical issues such as design effects, normalization, transformation, and nonconstant variance. Please refer to [15] for a detailed description of the algorithm.

User interface

The user can start RAMA by clicking 'Adjust Data' - 'Replicate Analysis' - 'RAMA' from the MeV main menu. The RAMA dialog box is then displayed asking the user to label the arrays that were loaded into MeV with their appropriate dye color. At this time, the user is asked to make sure that Rserve is running. On a Win32 system, double clicking Rserve.exe accomplishes this. On a UNIX, Linux or Mac OS X system, the user issues the command 'R CMD Rserve' at a prompt. By default, RAMA will look on the local machine for an Rserve server. However, since Rserve is a TCP/IP server, the Rserve server can be a remote machine. The user is allowed to adjust a few advanced parameters, though suggested values are given as defaults. If an Rserve connection is successfully made, the location of Rserve is written to the user's MeV configuration file and will be available in later sessions. After clicking 'OK', the input data are sent to R. An indeterminate progress bar is displayed while RAMA runs. Unfortunately, the architecture of Rserve and R does not allow for an accurate indication of the time remaining in an ongoing analysis. Once completed, the user is given a dialog box to save the results. The returned results will then replace the loaded data in a new Multiple Array Viewer (MAV). The old MAV is deleted. The user can then choose to continue using MeV as if the data were loaded through the native loading modules.

BRIDGE: Bayesian Robust Inference for Differential Gene Expression

BRIDGE fits a robust Bayesian hierarchical model to test for differentially expressed genes on microarray data. It can be used with both two-color microarrays and single-channel Affymetrix chips. BRIDGE builds on the previous work of Gottardo et al. [15] by allowing each gene to have a different variance and by allowing the detection of differentially expressed genes under multiple (up to three in our current implementation) experimental conditions. Robust inference is accomplished by modeling outliers using a t-distribution, and hence BRIDGE is powerful even with a small number of samples (either biological or technical replicates) under each experimental condition. Parameter estimation is carried out using a novel version of Markov Chain Monte Carlo. The current implementation of BRIDGE does not handle missing values. Please refer to [16] for a detailed description of the model.

User interface

BRIDGE starts when a user clicks the 'BRIDGE' button in the toolbar located on top of the MeV window. The user is once again presented with a dialog box similar to that of RAMA asking for the dye labeling identity of each loaded slide. The user is offered the option to adjust the advanced parameters and to establish an Rserve connection. After clicking OK, the input data are sent to R. An indeterminate progress bar is displayed while BRIDGE runs. The results are presented to the user in three formats: heat maps, expression graphs or tables. In each format, the genes for which there is strong evidence of differential expression are identified as 'Significant Genes', defined by the posterior probability being above 0.5.

IterativeBMA: Iterative Bayesian Model Averaging

The iterativeBMA algorithm is a multivariate technique for gene selection and classification of microarray data. Bayesian Model Averaging (BMA) takes model uncertainty into consideration by averaging over the predicted probabilities based on multiple models, weighted by their posterior model probabilities [22]. The most commonly used BMA algorithm is limited to data in which the number of variables is smaller than the number of responses, and the algorithm is inefficient for datasets containing more than 30 genes (variables). In the case of classifying samples using microarray data, there are typically thousands or tens of thousands of genes (variables) under a few dozen samples (responses). In the iterative BMA algorithm, we start by ranking the genes using the ratio of between-group to within-group sum of squares (BSS/WSS) [23]. In this initial preprocessing step, genes with large BSS/WSS ratios (that is, genes with relatively large variation between classes and relatively small variation within classes) receive high rankings. We then apply the traditional BMA algorithm to the 30 top ranked genes, and remove genes with low posterior probabilities. Genes from the rank ordered BSS/WSS ratios are then added to the set of genes to replace genes with low probabilities. These steps of gene swaps and iterative applications of BMA are repeated until all genes have been considered. We have previously shown that the iterative BMA algorithm selects small numbers of relevant genes, achieves high prediction accuracy, and produces posterior probabilities for the predictions, selected genes and models [17].

The iterativeBMA Bioconductor package implements the iterative BMA algorithm described in Yeung et al. [17] (previously implemented in Splus) when there are two classes. It is part of the original work for this publication. The user documentation (vignette) is included in the package.

User interface

We have integrated the iterativeBMA Bioconductor package in MeV. IterativeBMA starts after the user clicks on the 'iBMA' icon on top of the MeV window. The current implementation of the iterativeBMA Bioconductor package is limited to only two classes. After loading the data, the user is asked to label the two classes. The default labels for the two classes are 0 and 1, respectively. In the same dialog box, the user is asked to establish an Rserve connection. The user is also given the option of specifying advanced parameters for the analysis. The next dialog box asks the user to assign labels to each of the samples in the data, either by using a pull-down menu or loading an assignment file. At this point, if Rserve is not already running, the user is reminded to start the connection. Then, the data and the parameters are sent to R, and a progress bar is shown warning the user that the computation could take a long time. After the iterativeBMA Bioconductor package finishes running, the following analysis results are displayed: the predicted probability and class for each test sample; the posterior probabilities of the selected genes sorted in descending order; the posterior probabilities of the selected models sorted in descending order; and the heatmaps of the selected genes in both classes.

Case studies illustrating the merits of the integrated Bioconductor packages

In this section, we compare the performance of the integrated Bioconductor packages (RAMA, BRIDGE and iterativeBMA) to existing tools in MeV in order to illustrate the merits of the integrated packages. In addition, we demonstrate that our MeV+R modules can be used together with other MeV modules in the integrated analysis of microarray data, hence extending the capabilities of MeV.

RAMA: Robust Analysis of MicroArrays

We compared the log ratios computed from the gene intensities estimated using RAMA with the log ratios computed from intensities averaged over all the replicates on two microarray datasets; the results are summarized in Table 1. The first dataset is a subset of the HIV data [24] consisting of the expression levels of 1,028 transcripts, including 13 positive controls and 24 negative controls, in CD4-T-cell lines at time t = 1 hour after infection with HIV type 1, hybridized to two-color cDNA arrays. The experimental design consists of four technical replicates and a balanced dye swap, in which two of the four replicates were hybridized with Cy3 for the control and Cy5 for the treatment and then the dyes were reversed on the other two replicates. The second dataset is a subset of the like and like data [15] consisting of 1,000 genes over four experiments using the same RNA preparation isolated from a HeLa cell line on four different microarray slides. Since the same RNA was used in both channels, no genes from these data should show any differential expression. Both sample datasets are available on our project web site and are included as part of our MeV+R package release.

Figure 2 shows the log ratios of all genes sorted in descending order after applying RAMA integrated in MeV+R to the HIV data. As shown in Figure 2, the log ratios (to base 2) computed with the robust intensities estimated using RAMA for all 13 positive controls are all greater than one. The log ratios from RAMA for all 24 negative controls are smaller than one (data not shown in Figure 2). In contrast, computing the log ratios by simply averaging the gene intensities over the four replicates produces log ratios greater than one for three negative controls. Applying RAMA to the like and like data produces no log ratio greater than one, as desired, since we do not expect any differentially expressed genes. In contrast, the average log ratio of gene intensities yields six genes with log ratios greater than one. Please refer to the supplementary material [18] for the details of our case studies. To summarize, RAMA produced the desired results on both datasets while the averaged log ratio produced three and six false positives, respectively, on these two datasets.

Table 1. Comparing the results of RAMA to the averaged log ratios on the HIV data and the like and like data

Data | Benchmark | RAMA | Averaged log ratio
HIV data | 13 positive controls | All 13 positive controls have log ratios >1 | All 13 positive controls have log ratios >1
HIV data | 24 negative controls | All 24 negative controls have log ratios <1 | 3 negative controls have log ratios >1
Like and like data | No genes expected to be differentially expressed | All log ratios <1 | 6 genes with log ratios >1

RAMA produced the desired results on both datasets while the averaged log ratio produced three and six false positives, respectively, on these two datasets.

Figure 2. The results of applying RAMA to the HIV data. The log ratios computed from RAMA are sorted in descending order, and the top 13 genes with log ratios greater than one are the positive controls.

BRIDGE: Bayesian Robust Inference for Differential Gene Expression

We compared the differentially expressed genes identified using BRIDGE, t-test and SAM (Significance Analysis of Microarrays) [25] as implemented in MeV on two datasets. Applying BRIDGE to the HIV data described in the previous section identified all 13 positive controls as 'significant' genes (Figure 3). On the other hand, applying the one-sample t-test as implemented in MeV to the same HIV data identified a total of 14 significant genes, including all 13 positive controls and one negative control, using a p-value cut-off of 0.01 without any Bonferroni correction. Using a p-value cut-off of 0.05 and standard Bonferroni correction, the one-sample t-test identified only one significant gene (which is one of the 13 positive controls) and incorrectly assigned the remaining 12 positive controls as 'insignificant'. Similarly, one-sample SAM as implemented in MeV identified 12 of the 13 positive controls using default parameters.

Figure 3. The significant genes identified by applying BRIDGE to the HIV data.

The second dataset we used comprises the Affymetrix U133 spike-in data [26], which consists of three technical replicates of 14 separate hybridizations of 42 spiked transcripts in a complex human background at varying concentrations. Thirty of the spikes are isolated from a human cell line, four spikes are bacterial controls, and eight spikes are artificially engineered sequences believed to be unique in the human genome. The data were preprocessed using GCRMA [27], resulting in a dataset of 22,300 genes across 42 samples. In addition to the original 42 spiked-in genes, we included an additional 20 genes that consistently showed significant differential expression across the array groups and an additional three genes containing probe sequences exactly matching those for the spiked-in genes [28,29]. As a result, our expanded spiked-in gene list contains 65 entries in total. We used a subset of this spiked-in data consisting of 1,059 genes that include all 65 spiked-in genes across two samples in triplicate. In our comparison, only the 65 spiked-in genes should be identified as differentially expressed.

BRIDGE identified 45 differentially expressed genes on this data subset. All of these 45 genes identified by BRIDGE are spiked-in genes. On the other hand, the t-test with a p-value cut-off of 0.01 without any correction for multiple comparisons identified a total of 33 significant genes, of which 31 were spiked-in genes. Using a p-value cut-off of 0.05 and the standard Bonferroni correction, the t-test identified only four significant genes (which are among the spiked-in genes). SAM identified eight spiked-in genes as differentially expressed.

Our comparison results are summarized in Table 2. BRIDGE is the only tool that identified all 13 positive controls as 'significant' on the HIV data without any false positives. In addition, BRIDGE identified the highest number of true positives (spiked-in genes) without any false positives on the Affymetrix spike-in data.

Table 2. Comparing the results of BRIDGE to t-test and SAM on the HIV data and the Affymetrix spike-in data

Dataset | Benchmark | Measure | BRIDGE | t-test (p-value cut-off 0.01, no correction) | t-test (p-value cut-off 0.05, standard Bonferroni correction) | SAM
HIV data | 13 positive controls, 24 negative controls | DE | 13 | 14 | 1 | 12
HIV data | 13 positive controls, 24 negative controls | TP | 13* | 13* | 1 | 12
HIV data | 13 positive controls, 24 negative controls | FP | 0* | 1 | 0* | 0*
Affymetrix spike-in data | 65 spike-in genes | DE | 45 | 33 | 4 | 8
Affymetrix spike-in data | 65 spike-in genes | TP | 45* | 31 | 4 | 8
Affymetrix spike-in data | 65 spike-in genes | FP | 0* | 2 | 0* | 0*

For each dataset and each method, the number of differentially expressed (DE) genes, true positives (TP) and false positives (FP) are shown. For each dataset, the maximum TP and the minimum FP across all methods are marked with an asterisk. BRIDGE produced the best results on both datasets in identifying the highest number of true positives without any false positives.

IterativeBMA: Iterative Bayesian Model Averaging

We compared the performance of iterativeBMA (abbreviated as iBMA in our MeV+R implementation) to KNN (k-nearest neighbor) [30] and USC (Uncorrelated Shrunken Centroid) [31] implemented in MeV using the well-studied leukemia data [32]. We used the filtered leukemia dataset, which consists of 3,051 genes, 38 samples in the training data and 34 samples in the test set. The data consist of samples from patients with either acute lymphoblastic leukemia (ALL) or acute myeloid leukemia (AML). On the leukemia data, iterativeBMA produced 2 classification errors using 11 selected genes over 11 models (Figures 4 and 5). On the other hand, KNN does not have a gene selection procedure and produced 2 classification errors using all 3,051 genes. Similarly, USC produced 2 classification errors using 51 selected genes.

Figure 4. The results of applying iterativeBMA to the leukemia data. A heatmap showing the selected genes from iterativeBMA under the training samples labeled as class 0 and the test samples assigned to class 0 by the algorithm.

Figure 5. The results of applying iterativeBMA to the leukemia data. A heatmap showing the selected genes from iterativeBMA under the training samples labeled as class 1 and the test samples assigned to class 1 by the algorithm.

The second dataset we used is the breast cancer prognosis dataset [33], which consists of 4,919 genes with 76 samples in the training set, and 19 samples in the test set [17]. The patient samples are divided into two categories: the good prognosis group (patients who remained disease free for at least five years) and the poor prognosis group (patients who developed distant metastases within five years). The iterativeBMA algorithm produced three classification errors using four genes averaged over three models. On the other hand, KNN does not have a gene selection procedure and produced five classification errors using all genes. Similarly, USC produced four classification errors using 662 genes.

Our results are summarized in Table 3. On the breast cancer prognosis data, iterativeBMA produced higher prediction accuracy using far fewer genes. On the leukemia data, iterativeBMA produced comparable prediction accuracy using far fewer genes.

Table 3. Comparing the results of iterativeBMA to KNN and USC on the leukemia data and the breast cancer prognosis data

Data | Size of data | Measure | iterativeBMA | KNN | USC
Leukemia data [32] | 38 training samples, 34 test samples | Genes | 11* | 3,051 | 51
Leukemia data [32] | 38 training samples, 34 test samples | Errors | 2* | 2* | 2*
Breast cancer prognosis data [33] | 76 training samples, 19 test samples | Genes | 4* | 4,919 | 662
Breast cancer prognosis data [33] | 76 training samples, 19 test samples | Errors | 3* | 5 | 4

The number of selected genes and the number of classification errors are shown for each method. For each dataset, the smallest number of genes and the smallest number of classification errors across all three methods are marked with an asterisk. On the leukemia data, iterativeBMA produced the same number of classification errors using far fewer genes. On the breast cancer prognosis data, iterativeBMA produced fewer errors using far fewer genes.

Using other MeV modules in an integrated data analysis

The previous sub-sections showed that our MeV+R modules achieved superior performance when compared to other existing tools implemented in MeV. Here we demonstrate how the R packages that we incorporated into MeV can be used in combination with other existing tools in MeV. This illustrates the fact that the MeV+R framework has extended the capabilities of MeV, and that using these R packages through the MeV GUI adds value to the integrated analysis of microarray data.

In this case study, we follow up on the results from applying the iterativeBMA algorithm to the leukemia data [32]. The iterativeBMA algorithm is a multivariate gene selection method designed to select a small set of predictive genes for the classification of microarray data. In the case of the leukemia data, the iterativeBMA algorithm selected 11 genes that produced two classification errors on the 34-sample test set. It would be interesting to identify the biological theme in this 11-gene list. Towards this end, we applied EASE [13] as implemented in MeV to determine the over-represented Gene Ontology categories in this gene list relative to all the genes on the microarray. Figure 6 shows the tabular view from the EASE analysis.

Figure 6. The results of applying EASE to the 11 genes selected by iterativeBMA on the leukemia data.

Since iterativeBMA identifies a small set of predictive genes for classification, other genes that exhibit similar expression patterns to the selected genes are likely of biological interest. For example, we would like to explore the gene with the highest posterior probability, 'X95735_at', from the iterativeBMA analysis on the leukemia data [32]. We applied PTM (Template Matching) [34] as implemented in MeV to identify genes that are highly correlated with 'X95735_at'. Using a p-value threshold of 0.0001, PTM identified 209 such genes. Our next task was to find the biological theme among these 209 genes, so we applied EASE and TEASE (Tree-EASE). TEASE is a combined analytical tool for hierarchical clustering and EASE. TEASE computes the dendrogram using the hierarchical clustering method and displays the significantly enriched Gene Ontology categories for each subtree in the dendrogram. Please refer to the supplementary materials [18] for the details of our case studies.

Incorporating additional R packages

We have developed a framework with built-in functions for the integration of Bioconductor packages into MeV. Detailed documentation of these built-in functions is provided on our project web site for software developers. Using this framework, we have integrated three Bioconductor packages (RAMA, BRIDGE and iterativeBMA) into MeV as proof of concept. To integrate additional Bioconductor packages into MeV, a software developer can simply call our built-in functions except for complex and non-standard data views.

Conclusion

MeV+R is a convenient platform to provide biologists with point and click GUI access to Bioconductor packages. We have demonstrated the successful integration of Bioconductor and MeV through three Bioconductor packages, RAMA, BRIDGE and iterativeBMA, and that the incorporated Bioconductor packages produced superior results in the analysis of microarray data compared to existing tools in MeV. Additional Bioconductor packages are straightforward to add: the framework for moving data from MeV to R and back is generalized for code re-use, and each new package will merely require the development of a GUI for input and output.

Abbreviations

API, application programming interface; BMA, Bayesian Model Averaging; BRIDGE, Bayesian Robust Inference for Differential Gene Expression; BSS/WSS, ratio of between-group to within-group sum of squares; EASE, Expression Analysis Systematic Explorer; GUI, graphical user interface; iterativeBMA, iterative Bayesian Model Averaging; KNN, k-nearest neighbor; MAV, Multiple Array Viewer; MeV, MultiExperiment Viewer; PTM, Template Matching; RAMA, Robust Analysis of MicroArray; SAM, Significance Analysis of Microarrays; USC, Uncorrelated Shrunken Centroid.

Authors' contributions

VC carried out the software implementation, and drafted part of the initial manuscript. RG and AER designed and wrote the Bioconductor packages RAMA and BRIDGE, and assisted in incorporating these packages into MeV. REB conceived of the study, and designed and coordinated the project. KYY participated in the design and coordination of the study, wrote the iterativeBMA Bioconductor package, carried out the case studies and prepared the manuscript. All authors read and approved the final manuscript.

Acknowledgements

We would like to give special thanks to Dr John Quackenbush and his research group for the original development of MeV and for insightful discussions related to this work. We would also like to thank Drs Chris Volinsky and Ian Painter for their help on the development of the iterativeBMA Bioconductor package. We would also like to thank Dr Renee Ireton for editing the manuscript. VTC is supported by NIH-NCI grant K25CA106988 and 5R01HL072370. RG's research is supported by the Natural Sciences and Engineering Research Council of Canada. AER's research was supported by NICHD grant R01 HD054511, NIH grant 8 R01 EB002137-02, NSF grant ATM 0724721 and ONR grant N00014-01-10745. REB is funded by NIH-NIAID grants 5P01 AI052106-02, 1R21AI052028-01 and 1U54AI057141-01, NIH-NIEHA grant 1U19ES011387-02, NIH-NHLBI grants 5R01HL072370-02 and 1P50HL073996-01. KYY is supported by NIH-NCI grant K25CA106988.

References

1. Gentleman RC, Carey VJ, Bates DM, Bolstad B, Dettling M, Dudoit S, Ellis B, Gautier L, Ge Y, Gentry J, Hornik K, Hothorn T, Huber W, Iacus S, Irizarry R, Leisch F, Li C, Maechler M, Rossini AJ, Sawitzki G, Smith C, Smyth G, Tierney L, Yang JY, Zhang J: Bioconductor: open software development for computational biology and bioinformatics. Genome Biol 2004, 5:R80.
2. The Bioconductor Project [http://www.bioconductor.org]
3. Ihaka R, Gentleman RC: R: a language for data analysis and graphics. J Computational Graphical Stat 1996, 5:299-314.
4. The R Project for Statistical Computing [http://www.r-project.org]
5. Dalgaard P: The R-Tcl/Tk interface. Proceedings of the Second International Workshop on Distributed Statistical Computing: 15-17 March 2001; Vienna, Austria. [http://www.ci.tuwien.ac.at/Conferences/DSC-2001/Proceedings/Dalgaard.pdf].
20-22 March 2003; Vienna, Austria. [http://www.ci.tuwien.ac.at/Conferences/DSC-2003/Proceedings/Grosjean.pdf].
8. Wettenhall JM, Smyth GK: limmaGUI: a graphical user interface for linear modeling of microarray data. Bioinformatics 2004, 20:3705-3706.
9. Wettenhall JM, Simpson KM, Satterley K, Smyth GK: affylmGUI: a graphical user interface for linear modeling of single channel microarray data. Bioinformatics 2006, 22:897-899.
10. Futschik ME, Crompton T: OLIN: optimized normalization, visualization and quality testing of two-channel microarray data. Bioinformatics 2005, 21:1724-1726.
11. Saeed AI, Sharov V, White J, Li J, Liang W, Bhagabati N, Braisted J, Klapa M, Currier T, Thiagarajan M, Sturn A, Snuffin M, Rezantsev A, Popov D, Ryltsov A, Kostukovich E, Borisovsky I, Liu Z, Vinsavich A, Trush V, Quackenbush J: TM4: a free, open-source system for microarray data management and analysis. Biotechniques 2003, 34:374-378.
12. Eisen MB, Spellman PT, Brown PO, Botstein D: Cluster analysis and display of genome-wide expression patterns. Proc Natl Acad Sci USA 1998, 95:14863-14868.
13. Hosack DA, Dennis G Jr, Sherman BT, Lane HC, Lempicki RA: Identifying biological themes within lists of genes with EASE. Genome Biol 2003, 4:R70.
14. SourceForge.net [http://www.sourceforge.net/projects/mev-tm4]
15. Gottardo R, Raftery AE, Yeung KY, Bumgarner RE: Quality control
6.
Fox J: The R commander: a basic-statistics graphical user and robust estimation for cDNA microarrays with interface to R. J Stat Software 2005, 14: [http://www.jstatsoft.org/ replicates. J Am Stat Assoc 2006, 101:30-40. v14/i09/paper]. 16. Gottardo R, Raftery AE, Yeung KY, Bumgarner RE: Bayesian robust 7. Grosjean P: SciViews: an object-oriented abstraction layer to inference for differential gene expression in cDNA microar- design GUIs on top of various calculation kernels. Proceedings rays with multiple samples. Biometrics 2006, 62:10-18. of the Third International Workshop on Distributed Statistical Computing: Genome Biology 2008, 9:R118 http://genomebiology.com/2008/9/7/R118 Genome Biology 2008, Volume 9, Issue 7, Article R118 Chu et al. R118.11 17. Yeung KY, Bumgarner RE, Raftery AE: Bayesian model averaging: development of an improved multi-class, gene selection and classification tool for microarray data. Bioinformatics 2005, 21:2394-2402. 18. MeV+R Supplementary Web Site [http://expression.washing ton.edu/mevr] 19. Urbanek S: Rserve - a fast way to provide R functionality to applications. Proceedings of the Third International Workshop on Dis- tributed Statistical Computing: 20-22 March 2003; Vienna, Austria. [http:/ /www.ci.tuwien.ac.at/Conferences/DSC-2003/Proceedings/ Urbanek.pdf]. 20. Rserve [http://stats.math.uni-augsburg.de/Rserve] 21. MeV Manual [http://www.tm4.org/mev.html] 22. Raftery AE: Bayesian model selection in social research (with discussion). Sociol Methodol 1995, 25:111-196. 23. Dudoit S, Fridlyand J, Speed TP: Comparison of discrimination methods for the classification of tumors using gene expres- sion data. J Am Stat Assoc 2002, 97:77-87. 24. van 't Wout AB, Lehrman GK, Mikheeva SA, O'Keeffe GC, Katze MG, Bumgarner RE, Geiss GK, Mullins JI: Cellular gene expression upon human immunodeficiency virus type 1 infection of CD4(+)-T-cell lines. J Virol 2003, 77:1392-1402. 25. 
Tusher VG, Tibshirani R, Chu G: Significance analysis of micro- arrays applied to the ionizing radiation response. Proc Natl Acad Sci USA 2001, 98:5116-5121. 26. Affymetrix U133 Spike-in Data [http://www.affymetrix.com/ support/technical/sample_data/datasets.affx] 27. Irizarry RA, Bolstad BM, Collin F, Cope LM, Hobbs B, Speed TP: Summaries of Affymetrix GeneChip probe level data. Nucleic Acids Res 2003, 31:e15. 28. Sheffler W, Upfal E, Sedivy J, Noble WS: A learned comparative expression measure for affymetrix genechip DNA microarrays. Proc IEEE Comput Syst Bioinform Conf 2005:144-154. 29. Lo K, Gottardo R: Flexible empirical Bayes models for differ- ential gene expression. Bioinformatics 2007, 23:328-335. 30. Theilhaber J, Connolly T, Roman-Roman S, Bushnell S, Jackson A, Call K, Garcia T, Baron R: Finding genes in the C2C12 osteogenic pathway by k-nearest-neighbor classification of expression data. Genome Res 2002, 12:165-176. 31. Yeung KY, Bumgarner RE: Multiclass classification of microarray data with repeated measurements: application to cancer. Genome Biol 2003, 4:R83. 32. Golub TR, Slonim DK, Tamayo P, Huard C, Gaasenbeek M, Mesirov JP, Coller H, Loh ML, Downing JR, Caligiuri MA, Bloomfield CD, Lander ES: Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science 1999, 286:531-537. 33. van 't Veer LJ, Dai H, Vijver MJ van de, He YD, Hart AA, Mao M, Peterse HL, Kooy K van der, Marton MJ, Witteveen AT, Schreiber GJ, Kerkhoven RM, Roberts C, Linsley PS, Bernards R, Friend SH: Gene expression profiling predicts clinical outcome of breast cancer. Nature 2002, 415:530-536. 34. Pavlidis P, Noble WS: Analysis of strain and regional variation in gene expression in mouse brain. Genome Biol 2001, 2:research0042.1-0042.15. Genome Biology 2008, 9:R118 http://www.deepdyve.com/assets/images/DeepDyve-Logo-lg.png Genome Biology Springer Journals

MeV+R: using MeV as a graphical user interface for Bioconductor applications in microarray analysis

Publisher
Springer Journals
Copyright
2008 Chu et al.; licensee BioMed Central Ltd.
eISSN
1474-760X
DOI
10.1186/gb-2008-9-7-r118

Abstract

We present MeV+R, an integration of the JAVA MultiExperiment Viewer program with Bioconductor packages. This integration of MultiExperiment Viewer and R is easily extensible to other R packages and provides users with point and click access to traditionally command line driven tools written in R. We demonstrate the ability to use MultiExperiment Viewer as a graphical user interface for Bioconductor applications in microarray data analysis by incorporating three Bioconductor packages, RAMA, BRIDGE and iterativeBMA. tational scientists, statisticians and the more computationally Rationale While microarray technology has given biologists unprece- oriented biologists. However, in our experience, many biolo- dented access to gene expression data, reliable and effective gists find themselves uncomfortable issuing command lines data analysis remains a difficult problem. There are many in a terminal. Hence, there is a need for a graphical user inter- freely or commercially available software packages, but biol- face (GUI) for Bioconductor packages that will allow biolo- ogists are often faced with trading off power and flexibility for gists easy access to data analytical tools without learning the usability and accessibility. In addition to the potentially pro- command line syntax. The tcltk package in R adds GUI ele- hibitive costs, researchers using commercial software tools ments to R by allowing programmers to write GUI-driven may find themselves waiting for state-of-the-art algorithms to modules by embedding Tk commands into the R language [5]. be implemented with the packages. The Bioconductor project There are also GUIs developed for basic statistical analysis in [1,2] is an open source software project that provides a wide R, such as the R Commander [6] and windows-based range of statistical tools primarily based on the R program- SciViews [7]. However, these GUIs are not designed for ming environment and language [3,4]. Taking advantage of microarray analysis. 
There are Bioconductor packages, such R's powerful statistical and graphical capabilities, developers as limmaGUI [8], affylmGUI [9] and OLINgui [10] that are have created and contributed numerous Bioconductor pack- built on the R tcltk package to provide GUIs. LimmaGUI and ages to solve a variety of data analysis needs. The use of these affylmGUI provide GUIs for the analysis of designed experi- packages, however, requires a basic understanding of the R ments and the assessment of differential expression for two- programming/command language and an understanding of color spotted microarrays and single-color Affymetrix data, the documentation accompanying each package. The primary respectively. OLINgui provides a GUI for the visualization, users of R and the Bioconductor packages have been compu- normalization and quality testing of two-channel microarray Genome Biology 2008, 9:R118 http://genomebiology.com/2008/9/7/R118 Genome Biology 2008, Volume 9, Issue 7, Article R118 Chu et al. R118.2 data. However, no such GUIs are available for the majority of integration and advantages of three Bioconductor packages Bioconductor packages. In addition, since each Bioconductor (RAMA [15], BRIDGE [16], and iterativeBMA [17]) over exist- package is often written by a different research group, there is ing tools in the MeV environment through case studies. The generally no uniformity in the look and feel of the GUIs avail- underlying framework that we used to integrate these Biocon- able for the different packages. Hence, the end user may not ductor packages with MeV is easily extensible to other analy- be able to easily transfer experience gained with one analysis sis tools developed in R. The software, documentation and a tool to the use of another. tutorial are publicly available from our project home page [18]. An alternative microarray data analysis tool is the MultiEx- periment Viewer (MeV), a component of the TM4 suite of microarray analysis tools [11]. 
MeV has a user-friendly GUI Implementation designed with the biological community in mind. MeV is an Our integration effort is composed of three separate entities open source Java application with a simple to learn, easy to (Figure 1). MeV provides the graphical user interface while use GUI. It comes with many popular microarray analytical Rserve serves as the communication layer and R is the lan- algorithms for clustering, visualization, classification and guage and environment in which the analysis packages run. biological theme discovery, such as hierarchical clustering Rserve is a TCP/IP server that allows various languages to use [12] and Expression Analysis Systematic Explorer (EASE) the facilities of R without the need to initialize R or link [13]. MeV was carefully designed to provide an application against an R library [19]. In other words, we use R as the back programming interface (API), thus allowing straightforward end to run Bioconductor packages through the use of Rserve. contributions by the community. MeV is hosted at Source- Rserve is open source, freely available [20], and licensed Forge [14] in a concurrent versions system repository. As under GPL. such, frequent builds of the source code are made possible, greatly reducing the lag time between version releases. As such, Java, Rserve, and R must all be installed on the user's computer, and we provide an automated installer on our In this paper, we present MeV+R, which is an effort to provide project web site. Furthermore, Rserve needs to be running to more consistent and well-integrated GUIs for Bioconductor be used. However, R does not need to be started. Since Rserve packages by using MeV as a 'wrapper' application for Biocon- works through TCP/IP, it can run on the user's own machine, ductor methods. Our work brings the best of both worlds on an internal network or over the internet. 
By default, our together: providing state-of-the-art statistical algorithms code assumes Rserve to be running on the local host, but the from Bioconductor through the open source and easy to use user can change, add and save additional new hosts using a MeV graphical interface to the biomedical community. pull down menu. Once a connection is established, the Java MeV+R has many advantages, including platform independ- code in MeV converts the user's data from the MeV data ence, a well-defined modular API, and a point and click GUI structure to the R format and loads it into R. The appropriate that is easy to learn and use. We demonstrate the successful in Fi Our integration which the an gure 1 alysis p effort is co ackages run mposed of three separate entities: MeV as the GUI, Rserve as the communication layer, and R as the language and environment Our integration effort is composed of three separate entities: MeV as the GUI, Rserve as the communication layer, and R as the language and environment in which the analysis packages run. Genome Biology 2008, 9:R118 http://genomebiology.com/2008/9/7/R118 Genome Biology 2008, Volume 9, Issue 7, Article R118 Chu et al. R118.3 R libraries are loaded followed by the R commands that are tion, transformation, and nonconstant variance. Please refer necessary to initiate the analysis. Upon completion, the to [15] for a detailed description of the algorithm. returned data from R are explicitly called back into MeV and presented to the user. User interface The user can start RAMA by clicking 'Adjust Data' - 'Replicate We have incorporated three Bioconductor packages, RAMA Analysis' - 'RAMA' from the MeV main menu. The RAMA dia- [15], BRIDGE [16], and iterativeBMA [17], into MeV to illus- log box is then displayed asking the user to label the arrays trate the successful MeV+R integration. The Robust Analysis that were loaded into MeV with their appropriate dye color. 
of MicroArray (RAMA) algorithm computes robust estimates At this time, the user is asked to make sure that Rserve is run- of expression intensities from two-color microarray data, ning. On a Win32 system, double clicking Rserve.exe accom- which typically consist of a few replicates and potential out- plishes this. On a UNIX or Linux or Mac OS X system, the liers [15]. RAMA also takes advantage of dye swap experimen- user issues the command 'R CMD Rserve' at a prompt. By tal designs. Bayesian Robust Inference for Differential Gene default, RAMA will look on the local machine for an Rserve Expression (BRIDGE) is a robust algorithm that selects dif- server. However, since Rserve is a TCP/IP server, the Rserve ferentially expressed genes under different experimental con- server can be a remote machine. The user is allowed to adjust ditions on both one- and two-color microarray data [16]. Both a few advanced parameters, though suggested values are RAMA and BRIDGE make use of a computationally intensive given as defaults. If an Rserve connection is successfully technique called Markov Chain Monte Carlo for parameter made, the location of Rserve is written to the user's MeV con- estimation, and it is non-trivial to re-implement these algo- figuration file and will be available in later sessions. After rithms in Java. Hence, we took advantage of our previous clicking 'OK', the input data are sent to R. An indeterminate development work by simply using MeV as an interface to the progress bar is displayed while RAMA runs - unfortunately, Bioconductor packages. The iterative BMA algorithm is a the architecture of RServe and the R Server do not allow for multivariate gene selection and classification algorithm, an accurate indication of the time remaining in an ongoing which considers multiple genes simultaneously and typically analysis. Once completed, the user is given a dialog box to leads to a small number of relevant genes to classify microar- save the results. 
The returned results will then replace the ray data [17]. The iterativeBMA Bioconductor package imple- loaded data in a new Multiple Array Viewer (MAV). The old ments the iterative BMA algorithm as previously described MAV is deleted. The user can then choose to continue using [17] in R, and its implementation is part of our current inte- MeV as if the data were loaded through the native loading gration effort. Both RAMA and BRIDGE are included in the modules. latest release of MeV (version 4.1), and iterativeBMA will be included in future releases. The user interfaces, usage and BRIDGE: Bayesian Robust Inference for Differential case studies for RAMA, BRIDGE and iterativeBMA are briefly Gene Expression described below. Detailed documentation is included with the BRIDGE fits a robust Bayesian hierarchical model to test for software distribution [21] as well as linked in the MeV appli- differentially expressed genes on microarray data. It can be cation. Help pages are also available as Help Dialogs accessed used with both two-color microarrays and single-channel via buttons on the MeV dialog boxes. Our MeV+R implemen- Affymetrix chips. BRIDGE builds on the previous work of tation is publicly available and runs on Windows, Mac OS X Gottardo et al. [15] by allowing each gene to have a different and Linux. variance and the detection of differentially expressed genes under multiple (up to three in our current implementation) experimental conditions. Robust inference is accomplished by modeling outliers using a t-distribution, and hence Integrated Bioconductor packages: description BRIDGE is powerful even with a small number of samples and user interfaces RAMA: Robust Analysis of MicroArrays (either biological or technical replicates) under each experi- RAMA uses a Bayesian hierarchical model for the robust esti- mental condition. Parameter estimation is carried out using a mation of cDNA microarray intensities with replicates. 
This is novel version of Markov Chain Monte Carlo. The current highly relevant for replicated microarray experiments implementation of BRIDGE does not handle missing values. because even one outlying replicate (such as due to scratches Please refer to [16] for a detailed description of the model. or dust) can have a disastrous effect on the estimated signal intensity. Outliers are modeled explicitly using a t-distribu- User interface tion, which is more robust than the usual Gaussian model. BRIDGE starts when a user clicks the 'BRIDGE' button in the Our model borrows strength from all the genes to decide if a toolbar located on top of the MeV window. The user is once measurement is an outlier, and hence it is better at detecting again presented with a dialog box similar to that of RAMA outliers based on a small number of replicate measurements asking for the dye labeling identity of each loaded slide. The than other classical robust estimators. Our algorithm uses user is offered the option to adjust the advanced parameters Markov Chain Monte Carlo for parameter estimation, and and to establish an Rserve connection. After clicking OK, the addresses classical issues such as design effects, normaliza- input data are sent to R. An indeterminate progress bar is Genome Biology 2008, 9:R118 http://genomebiology.com/2008/9/7/R118 Genome Biology 2008, Volume 9, Issue 7, Article R118 Chu et al. R118.4 displayed while BRIDGE runs. The results are presented to tion. Then, the data and the parameters are sent to R, and a the user in three formats: heat maps, expression graphs or progress bar is shown warning the user that the computation tables. In each format, the genes for which there is strong evi- could take a long time. After the iterativeBMA Bioconductor dence of differential expression are identified as 'Significant package finishes running, the following analysis results are Genes', defined by the posterior probability being above 0.5. 
displayed: the predicted probability and class for each test sample; the posterior probabilities of the selected genes IterativeBMA: Iterative Bayesian Model Averaging sorted in descending order; the posterior probabilities of the The iterativeBMA algorithm is a multivariate technique for selected models sorted in descending order; and the heat- gene selection and classification of microarray data. Bayesian maps of the selected genes in both classes. Model Averaging (BMA) takes model uncertainty into consid- eration by averaging over the predicted probabilities based on multiple models, weighted by their posterior model probabil- Case studies illustrating the merits of the ities [22]. The most commonly used BMA algorithm is limited integrated Bioconductor packages to data in which the number of variables is greater than the In this section, we compare the performance of the integrated number of responses, and the algorithm is inefficient for Bioconductor packages (RAMA, BRIDGE and iterativeBMA) datasets containing more than 30 genes (variables). In the to existing tools in MeV in order to illustrate the merits of the case of classifying samples using microarray data, there are integrated packages. In addition, we demonstrate that our typically thousands or tens of thousands of genes (variables) MeV+R modules can be used together with other MeV mod- under a few dozen samples (responses). In the iterative BMA ules in the integrated analysis of microarray data, hence, algorithm, we start by ranking the genes using the ratio of extending the capabilities of MeV. between-group to within-group sum of squares (BSS/WSS) [23]. 
In this initial preprocessing step, genes with large BSS/ RAMA: Robust Analysis of MicroArrays WSS ratios (that is, genes with relatively large variation We compared the microarray gene intensities estimated between classes and relatively small variation within classes) using RAMA to that of the log ratios over intensities averaged receive high rankings. We then apply the traditional BMA over all the replicates on two microarray datasets and the algorithm to the 30 top ranked genes, and remove genes with results are summarized in Table 1. The first dataset is a subset low posterior probabilities. Genes from the rank ordered of the HIV data [24] consisting of the expression levels of BSS/WSS ratios are then added to the set of genes to replace 1,028 transcripts, including 13 positive controls and 24 nega- genes with low probabilities. These steps of gene swaps and tive controls, in CD4-T-cell lines at time t = 1 hour after infec- iterative applications of BMA are repeated until all genes are tion with HIV virus type 1 hybridized to two-color cDNA subsequently considered. We have previously shown that the arrays. The experimental design consists of four technical iterative BMA algorithm selects small numbers of relevant replicates and balanced dye swap in which two of the four rep- genes, achieves high prediction accuracy, and produces pos- licates were hybridized with Cy3 for the control and Cy5 for terior probabilities for the predictions, selected genes and the treatment and then the dyes were reversed on the other models [17]. two replicates. The second dataset is a subset of the like and like data [15] consisting of 1,000 genes over four experiments The iterativeBMA Bioconductor package implements the iter- using the same RNA preparation isolated from a HeLa cell ative BMA algorithm described in Yeung et al. [17] (previ- line on four different microarray slides. Since the same RNA ously implemented in Splus) when there are two classes. 
It is was used in both channels, no genes from these data should part of the original work for this publication. The user docu- show any differential expression. Both sample datasets are mentation (vignette) is included in the package. available on our project web site and are included as part of our MeV+R package release. User interface We have integrated the iterativeBMA Bioconductor package Figure 2 shows the log ratios of all genes sorted in descending in MeV. IterativeBMA starts after the user clicks on the order after applying RAMA integrated in MeV+R to the HIV 'iBMA' icon on top of the MeV window. The current imple- data. As shown in Figure 2, the log ratios (to base 2) computed mentation of the iterativeBMA Bioconductor package is lim- with the robust intensities estimated using RAMA for all 13 ited to only two classes. After loading the data, the user is positive controls are all greater than one. The log ratios from asked to label the two classes. The default labels for the two RAMA for all 24 negative controls are smaller than one (data classes are 0 and 1, respectively. In the same dialog box, the not shown in Figure 2). On the contrary, computing the log user is asked to establish an Rserve connection. The user is ratios by simply averaging the gene intensities over the four also given the option of specifying advanced parameters for replicates produces log ratios greater than one for three neg- the analysis. The next dialog box asks the user to assign labels ative controls. Applying RAMA to the like and like data pro- to each of the samples in the data, either by using a pull-down duces no log ratio greater than one as desired since we do not menu or loading an assignment file. At this point, if Rserve is expect any differentially expressed genes. 
On the contrary, not already running, the user is reminded to start the connec- the average log ratio of gene intensities yields six genes with Genome Biology 2008, 9:R118 13 positive controls http://genomebiology.com/2008/9/7/R118 Genome Biology 2008, Volume 9, Issue 7, Article R118 Chu et al. R118.5 Table 1 Comparing the results of RAMA to the averaged log ratios on the HIV data and the like and like data Data Benchmark RAMA Averaged log ratio HIV data 13 positive controls All 13 positive controls have log ratios All 13 positive controls have log ratios >1 >1 24 negative controls All 24 negative controls have log ratios 3 negative controls have log ratios >1 <1 Like and like data No genes expected to be differentially All log ratios <1 6 genes with log ratios >1 expressed RAMA produced the desired results on both datasets while the averaged log ratio produced three and six false positives, respectively, on these two datasets. log ratios greater than one. Please refer to the supplementary out any Bonferroni correction. Using a p-value cut-off of 0.05 material [18] for the details of our case studies. To summa- and standard Bonferroni correction, the one-sample t-test rize, RAMA produced the desired results on both datasets identified only one significant gene (which is one of the 13 while the averaged log ratio produced three and six false pos- positive controls) and incorrectly assigned the remaining 12 itives, respectively, on these two datasets. positive controls as 'insignificant'. Similarly, using one-sam- ple SAM as implemented in MeV identified 12 out of 13 posi- BRIDGE: Bayesian Robust Inference for Differential tive controls using default parameters. Gene Expression We compared the differentially expressed genes identified The second dataset we used comprises the Affymetrix U133 using BRIDGE, t-test and SAM (Significance Analysis of spike-in data [26], which consists of three technical replicates Microarrays) [25] as implemented in MeV on two datasets. 
of 14 separate hybridizations of 42 spiked transcripts in a Applying BRIDGE to the HIV data described in the previous complex human background at varying concentrations. section identified all 13 positive controls as 'significant' genes Thirty of the spikes are isolated from a human cell line, four (Figure 3). On the other hand, applying the one-sample t-test spikes are bacterial controls, and eight spikes are artificially as implemented in MeV to the same HIV data identified a engineered sequences believed to be unique in the human total of 14 significant genes, including all 13 positive controls genome. The data were preprocessed using GCRMA [27], and one negative control using a p-value cut-off of 0.01 with- The Figu res re 2 ults of applying RAMA to the HIV data The results of applying RAMA to the HIV data. The log ratios computed from RAMA are sorted in descending order, and the top 13 genes with log ratios greater than one are the positive controls. Genome Biology 2008, 9:R118 http://genomebiology.com/2008/9/7/R118 Genome Biology 2008, Volume 9, Issue 7, Article R118 Chu et al. R118.6 The Figu sign re 3 ificant genes identified by applying BRIDGE to the HIV data The significant genes identified by applying BRIDGE to the HIV data. resulting in a dataset of 22,300 genes across 42 samples. In cut-off of 0.01 without any correction for multiple compari- addition to the original 42 spiked-in genes, we included an son identified a total of 33 significant genes, of which 31 were additional 20 genes that consistently showed significant dif- spiked-in genes. Using a p-value cut-off of 0.05 and the ferential expression across the array groups and an additional standard Bonferroni correction, the t-test identified only four three genes containing probe sequences exactly matching significant genes (which are among the spiked-in genes). those for the spiked-in genes [28,29]. 
As a result, our SAM identified eight spiked-in genes as differentially expanded spiked-in gene list contains 65 entries in total. We expressed. used a subset of this spiked-in data consisting of 1,059 genes that include all 65 spiked-in genes across two samples in Our comparison results are summarized in Table 2. We have triplicate. In our comparison, only the 65 spiked-in genes shown that BRIDGE is the only tool that successfully identi- should be identified as differentially expressed. fied all 13 positive controls as 'significant' on the HIV data. In addition, BRIDGE identified the highest number of true pos- BRIDGE identified 45 differentially expressed genes on this itives (spiked-in genes) without any false positives on the data subset. All of these 45 genes identified by BRIDGE are Affymetrix spike-in data. spiked-in genes. On the other hand, the t-test with a p-value Genome Biology 2008, 9:R118 http://genomebiology.com/2008/9/7/R118 Genome Biology 2008, Volume 9, Issue 7, Article R118 Chu et al. R118.7 Table 2 Comparing the results of BRIDGE to t-test and SAM on the HIV data and the Affymetrix spike-in data t-test Dataset Benchmark BRIDGE p-value cut-off 0.01, no p-value cut-off 0.05, standard SAM correction Bonferroni correction HIV data 13 positive controls, 24 negative controls DE 13 14 1 12 TP 13 13 112 FP 0 1 00 Affymetrix spike-in data 65 spike-in genes DE 45 33 4 8 TP 45 31 4 8 FP 0 2 00 For each dataset and each method, the number of differentially expressed (DE) genes, true positives (TP) and false positives (FP) are shown. For each dataset, the maximum TP and the minimum FP across all methods are shown in bold. BRIDGE produced the best results on both datasets in identifying the highest number of true positives without any false positives. 
IterativeBMA: Iterative Bayesian Model Averaging

We compared the performance of iterativeBMA (abbreviated as iBMA in our MeV+R implementation) to KNN (k-nearest neighbor) [30] and USC (Uncorrelated Shrunken Centroid) [31] implemented in MeV using the well-studied leukemia data [32]. We used the filtered leukemia dataset, which consists of 3,051 genes, 38 samples in the training data and 34 samples in the test set. The data consist of samples from patients with either acute lymphoblastic leukemia (ALL) or acute myeloid leukemia (AML). On the leukemia data, iterativeBMA produced 2 classification errors using 11 selected genes over 11 models (Figures 4 and 5). On the other hand, KNN does not have a gene selection procedure and produced 2 classification errors using all 3,051 genes. Similarly, USC produced 2 classification errors using 51 selected genes.

The second dataset we used is the breast cancer prognosis dataset [33], which consists of 4,919 genes with 76 samples in the training set, and 19 samples in the test set [17]. The patient samples are divided into two categories: the good prognosis group (patients who remained disease free for at least five years) and the poor prognosis group (patients who developed distant metastases within five years). The iterativeBMA algorithm produced three classification errors using four genes averaged over three models. On the other hand, KNN does not have a gene selection procedure and produced five classification errors using all genes. Similarly, USC produced four classification errors using 662 genes.

Our results are summarized in Table 3. On the breast cancer prognosis data, iterativeBMA produced higher prediction accuracy using far fewer genes. On the leukemia data, iterativeBMA produced comparable prediction accuracy using far fewer genes.

Table 3. Comparing the results of iterativeBMA to KNN and USC on the leukemia data and the breast cancer prognosis data

Data                                Size of data          iterativeBMA   KNN           USC
Leukemia data [32]                  38 training samples   11 genes       3,051 genes   51 genes
                                    34 test samples       2 errors       2 errors      2 errors
Breast cancer prognosis data [33]   76 training samples   4 genes        4,919 genes   662 genes
                                    19 test samples       3 errors       5 errors      4 errors

The number of selected genes and the number of classification errors are shown for each method. For each dataset, the smallest number of genes and the smallest number of classification errors across all three methods are shown in bold. On the leukemia data, iterativeBMA produced the same number of classification errors using far fewer genes. On the breast cancer prognosis data, iterativeBMA produced fewer errors using far fewer genes.

Figure 4. The results of applying iterativeBMA to the leukemia data. A heatmap showing the selected genes from iterativeBMA under the training samples labeled as class 0 and the test samples assigned to class 0 by the algorithm.

Figure 5. The results of applying iterativeBMA to the leukemia data. A heatmap showing the selected genes from iterativeBMA under the training samples labeled as class 1 and the test samples assigned to class 1 by the algorithm.

Using other MeV modules in an integrated data analysis

The previous sub-sections showed that our MeV+R modules achieved superior performance when compared to other existing tools implemented in MeV. Here we demonstrate how the R packages that we incorporated into MeV can be used in combination with other existing tools in MeV. This illustrates the fact that the MeV+R framework has extended the capabilities of MeV, and that using these R packages through the MeV GUI adds value to the integrated analysis of microarray data.

In this case study, we will follow up on the results from applying the iterativeBMA algorithm to the leukemia data [32]. The iterativeBMA algorithm is a multivariate gene selection method designed to select a small set of predictive genes for the classification of microarray data. In the case of the leukemia data, the iterativeBMA algorithm selected 11 genes that produced two classification errors on the 34-sample test set. It would be interesting to identify the biological theme in this 11-gene list. Towards this end, we applied EASE [13] as implemented in MeV to determine the over-represented Gene Ontology categories in this gene list relative to all the genes on the microarray. Figure 6 shows the tabular view from the EASE analysis.

Figure 6. The results of applying EASE to the 11 genes selected by iterativeBMA on the leukemia data.

Since iterativeBMA identifies a small set of predictive genes for classification, other genes that exhibit similar expression patterns to the selected genes are likely of biological interest. For example, we would like to explore the gene with the highest posterior probability, 'X95735_at', from the iterativeBMA analysis on the leukemia data [32]. We applied PTM (Template Matching) [34] as implemented in MeV to identify genes that are highly correlated with 'X95735_at'. Using a p-value threshold of 0.0001, PTM identified 209 genes that are highly correlated with 'X95735_at'. Our next task was to find the biological theme among these 209 genes, so we applied EASE and TEASE (Tree-EASE). TEASE is a combined analytical tool for hierarchical clustering and EASE. TEASE computes the dendrogram using the hierarchical clustering method and displays the significantly enriched Gene Ontology categories for each subtree in the dendrogram. Please refer to the supplementary materials [18] for the details of our case studies.

Incorporating additional R packages

We have developed a framework with built-in functions for the integration of Bioconductor packages into MeV. Detailed documentation of these built-in functions is provided on our project web site for software developers. Using this framework, we have integrated three Bioconductor packages (RAMA, BRIDGE and iterativeBMA) into MeV as proof of concept. To integrate additional Bioconductor packages into MeV, a software developer can simply call our built-in functions except for complex and non-standard data views.

Conclusion

MeV+R is a convenient platform to provide biologists with point and click GUI access to Bioconductor packages. We have demonstrated the successful integration of Bioconductor and MeV through three Bioconductor packages, RAMA, BRIDGE and iterativeBMA, and that the incorporated Bioconductor packages produced superior results in the analysis of microarray data compared to existing tools in MeV. Additional Bioconductor packages are straightforward to add: the framework for moving data from MeV to R and back is generalized for code re-use, and each new package will merely require the development of a GUI for input and output.

Abbreviations

API, application programming interface; BMA, Bayesian Model Averaging; BRIDGE, Bayesian Robust Inference for Differential Gene Expression; BSS/WSS, ratio of between-group to within-group sum of squares; EASE, Expression Analysis Systematic Explorer; GUI, graphical user interface; iterativeBMA, iterative Bayesian Model Averaging; KNN, k-nearest neighbor; MAV, Multiple Array Viewer; MeV, MultiExperiment Viewer; PTM, Template Matching; RAMA, Robust Analysis of MicroArray; SAM, Significance Analysis of Microarrays; USC, Uncorrelated Shrunken Centroid.

Authors' contributions

VC carried out the software implementation, and drafted part of the initial manuscript. RG and AER designed and wrote the Bioconductor packages RAMA and BRIDGE, and assisted in incorporating these packages into MeV. REB conceived of the study, and designed and coordinated the project. KYY participated in the design and coordination of the study, wrote the iterativeBMA Bioconductor package, carried out the case studies and prepared the manuscript. All authors read and approved the final manuscript.

Acknowledgements

We would like to give special thanks to Dr John Quackenbush and his research group for the original development of MeV and for insightful discussions related to this work. We would also like to thank Drs Chris Volinsky and Ian Painter for their help on the development of the iterativeBMA Bioconductor package. We would also like to thank Dr Renee Ireton for editing the manuscript. VTC is supported by NIH-NCI grant K25CA106988 and 5R01HL072370. RG's research is supported by the Natural Sciences and Engineering Research Council of Canada. AER's research was supported by NICHD grant R01 HD054511, NIH grant 8 R01 EB002137-02, NSF grant ATM 0724721 and ONR grant N00014-01-10745. REB is funded by NIH-NIAID grants 5P01 AI052106-02, 1R21AI052028-01 and 1U54AI057141-01, NIH-NIEHA grant 1U19ES011387-02, NIH-NHLBI grants 5R01HL072370-02 and 1P50HL073996-01. KYY is supported by NIH-NCI grant K25CA106988.

References

1. Gentleman RC, Carey VJ, Bates DM, Bolstad B, Dettling M, Dudoit S, Ellis B, Gautier L, Ge Y, Gentry J, Hornik K, Hothorn T, Huber W, Iacus S, Irizarry R, Leisch F, Li C, Maechler M, Rossini AJ, Sawitzki G, Smith C, Smyth G, Tierney L, Yang JY, Zhang J: Bioconductor: open software development for computational biology and bioinformatics. Genome Biol 2004, 5:R80.
2. The Bioconductor Project [http://www.bioconductor.org]
3. Ihaka R, Gentleman RC: R: a language for data analysis and graphics. J Computational Graphical Stat 1996, 5:299-314.
4. The R Project for Statistical Computing [http://www.r-project.org]
5. Dalgaard P: The R-Tcl/Tk interface. Proceedings of the Second International Workshop on Distributed Statistical Computing: 15-17 March 2001; Vienna, Austria. [http://www.ci.tuwien.ac.at/Conferences/DSC-2001/Proceedings/Dalgaard.pdf].
6. Fox J: The R commander: a basic-statistics graphical user interface to R. J Stat Software 2005, 14: [http://www.jstatsoft.org/v14/i09/paper].
7. Grosjean P: SciViews: an object-oriented abstraction layer to design GUIs on top of various calculation kernels. Proceedings of the Third International Workshop on Distributed Statistical Computing: 20-22 March 2003; Vienna, Austria. [http://www.ci.tuwien.ac.at/Conferences/DSC-2003/Proceedings/Grosjean.pdf].
8. Wettenhall JM, Smyth GK: limmaGUI: a graphical user interface for linear modeling of microarray data. Bioinformatics 2004, 20:3705-3706.
9. Wettenhall JM, Simpson KM, Satterley K, Smyth GK: affylmGUI: a graphical user interface for linear modeling of single channel microarray data. Bioinformatics 2006, 22:897-899.
10. Futschik ME, Crompton T: OLIN: optimized normalization, visualization and quality testing of two-channel microarray data. Bioinformatics 2005, 21:1724-1726.
11. Saeed AI, Sharov V, White J, Li J, Liang W, Bhagabati N, Braisted J, Klapa M, Currier T, Thiagarajan M, Sturn A, Snuffin M, Rezantsev A, Popov D, Ryltsov A, Kostukovich E, Borisovsky I, Liu Z, Vinsavich A, Trush V, Quackenbush J: TM4: a free, open-source system for microarray data management and analysis. Biotechniques 2003, 34:374-378.
12. Eisen MB, Spellman PT, Brown PO, Botstein D: Cluster analysis and display of genome-wide expression patterns. Proc Natl Acad Sci USA 1998, 95:14863-14868.
13. Hosack DA, Dennis G Jr, Sherman BT, Lane HC, Lempicki RA: Identifying biological themes within lists of genes with EASE. Genome Biol 2003, 4:R70.
14. SourceForge.net [http://www.sourceforge.net/projects/mev-tm4]
15. Gottardo R, Raftery AE, Yeung KY, Bumgarner RE: Quality control and robust estimation for cDNA microarrays with replicates. J Am Stat Assoc 2006, 101:30-40.
16. Gottardo R, Raftery AE, Yeung KY, Bumgarner RE: Bayesian robust inference for differential gene expression in cDNA microarrays with multiple samples. Biometrics 2006, 62:10-18.
17. Yeung KY, Bumgarner RE, Raftery AE: Bayesian model averaging: development of an improved multi-class, gene selection and classification tool for microarray data. Bioinformatics 2005, 21:2394-2402.
18. MeV+R Supplementary Web Site [http://expression.washington.edu/mevr]
19. Urbanek S: Rserve - a fast way to provide R functionality to applications. Proceedings of the Third International Workshop on Distributed Statistical Computing: 20-22 March 2003; Vienna, Austria. [http://www.ci.tuwien.ac.at/Conferences/DSC-2003/Proceedings/Urbanek.pdf].
20. Rserve [http://stats.math.uni-augsburg.de/Rserve]
21. MeV Manual [http://www.tm4.org/mev.html]
22. Raftery AE: Bayesian model selection in social research (with discussion). Sociol Methodol 1995, 25:111-196.
23. Dudoit S, Fridlyand J, Speed TP: Comparison of discrimination methods for the classification of tumors using gene expression data. J Am Stat Assoc 2002, 97:77-87.
24. van 't Wout AB, Lehrman GK, Mikheeva SA, O'Keeffe GC, Katze MG, Bumgarner RE, Geiss GK, Mullins JI: Cellular gene expression upon human immunodeficiency virus type 1 infection of CD4(+)-T-cell lines. J Virol 2003, 77:1392-1402.
25. Tusher VG, Tibshirani R, Chu G: Significance analysis of microarrays applied to the ionizing radiation response. Proc Natl Acad Sci USA 2001, 98:5116-5121.
26. Affymetrix U133 Spike-in Data [http://www.affymetrix.com/support/technical/sample_data/datasets.affx]
27. Irizarry RA, Bolstad BM, Collin F, Cope LM, Hobbs B, Speed TP: Summaries of Affymetrix GeneChip probe level data. Nucleic Acids Res 2003, 31:e15.
28. Sheffler W, Upfal E, Sedivy J, Noble WS: A learned comparative expression measure for affymetrix genechip DNA microarrays. Proc IEEE Comput Syst Bioinform Conf 2005:144-154.
29. Lo K, Gottardo R: Flexible empirical Bayes models for differential gene expression. Bioinformatics 2007, 23:328-335.
30. Theilhaber J, Connolly T, Roman-Roman S, Bushnell S, Jackson A, Call K, Garcia T, Baron R: Finding genes in the C2C12 osteogenic pathway by k-nearest-neighbor classification of expression data. Genome Res 2002, 12:165-176.
31. Yeung KY, Bumgarner RE: Multiclass classification of microarray data with repeated measurements: application to cancer. Genome Biol 2003, 4:R83.
32. Golub TR, Slonim DK, Tamayo P, Huard C, Gaasenbeek M, Mesirov JP, Coller H, Loh ML, Downing JR, Caligiuri MA, Bloomfield CD, Lander ES: Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science 1999, 286:531-537.
33. van 't Veer LJ, Dai H, Vijver MJ van de, He YD, Hart AA, Mao M, Peterse HL, Kooy K van der, Marton MJ, Witteveen AT, Schreiber GJ, Kerkhoven RM, Roberts C, Linsley PS, Bernards R, Friend SH: Gene expression profiling predicts clinical outcome of breast cancer. Nature 2002, 415:530-536.
34. Pavlidis P, Noble WS: Analysis of strain and regional variation in gene expression in mouse brain. Genome Biol 2001, 2:research0042.1-0042.15.

Journal

Genome Biology, Springer Journals

Published: Jul 1, 2008

Keywords: Animal Genetics and Genomics; Human Genetics; Plant Genetics and Genomics; Microbial Genetics and Genomics; Bioinformatics; Evolutionary Biology
