Feature construction from synergic pairs to improve microarray-based classification

BIOINFORMATICS ORIGINAL PAPER, Vol. 23 no. 21 2007, pages 2866–2872, doi:10.1093/bioinformatics/btm429

Gene expression

Blaise Hanczar (1,*,†), Jean-Daniel Zucker (1,2,3,4,†), Corneliu Henegar (2,3,4) and Lorenza Saitta (5)

(1) Laboratoire d'Informatique Médicale et Bioinformatique (Lim & Bio), Université Paris 13, 93017 Bobigny, (2) Université Paris Descartes, F-75006, (3) Université Pierre et Marie Curie - Paris 6, Centre de recherche des Cordeliers, UMR S 872, (4) INSERM, U872, Paris, F-75006, France and (5) Dipartimento di Informatica, Università del Piemonte Orientale, 15100 Alessandria, Italy

Received on February 5, 2007; revised on July 6, 2007; accepted on August 18, 2007
Advance Access publication October 9, 2007
Associate Editor: Joaquin Dopazo

ABSTRACT

Motivation: Microarray experiments that allow simultaneous expression profiling of thousands of genes in various conditions (tissues, cells or time) generate data whose analysis raises difficult problems. In particular, there is a vast disproportion between the number of attributes (tens of thousands) and the number of examples (several tens). Dimension reduction is therefore a key step before applying classification approaches. Many methods have been proposed for this purpose, but only a few of them considered a direct quantification of transcriptional interactions. We describe and experimentally validate a new dimension reduction and feature construction method, which assesses interactions between expression profiles to improve microarray-based classification accuracy.

Results: Our approach relies on a mutual information measure that exposes some elementary constituents of the information contained in a pair of gene expression profiles. We show that their analysis implies a term that represents the information of the interaction between the two genes. The principle of our method, called FeatKNN, is to exploit the information provided by highly synergic gene pairs to improve classification accuracy. First, a heuristic search selects the most informative gene pairs. Then, for each selected pair, a new feature, representing the classification margin of a KNN classifier in the gene pairs space, is constructed. We show experimentally that the interactional information has a degree of significance comparable to that of the gene expression profiles considered separately. Our method has been tested with different classifiers and yielded significant improvements in accuracy on several public microarray databases. Moreover, a synthetic assessment of the biological significance of the concept of synergic gene pairs suggested its ability to uncover relevant mechanisms underlying interactions among various cellular processes.

Contact: [email protected]

Supplementary information: Complementary results can be found on the companion website at http://featknn.nutriomique.org

1 INTRODUCTION

Most cellular processes need to accommodate concomitantly various types of solicitations, related either to the specificities of a particular cellular state, or to variations of the parameters of the intra- or the extracellular environments. Complex regulatory mechanisms therefore integrate internal demands, environmental fluctuations as well as various extracellular signals (e.g. growth factors, mediators, hormones, other autocrine and paracrine signals, etc.), and initiate specific adaptive processes to maintain the metabolic homeostasis of the cell and to assure its systemic role in the organism. In this article, we propose an original approach designed to capture synergic interactions between cellular processes from the information encoded in the gene expression profiles, to improve the accuracy of the classification of microarray experiments. We investigate the feasibility and the potential advantages of considering gene information interactions in the very phase of dimension reduction. We show that pairs of genes with a high discrimination power need not include genes that are both individually discriminant. Therefore, a feature reduction method that does not consider interactions explicitly is likely to miss such useful pairs. Figure 1 illustrates this situation. Genes Hsa.1221 and Hsa.9025 are used to discriminate between two classes (control subjects and patients affected by colon cancer), represented by black and white dots, respectively. Expression of gene Hsa.1221 is very useful for discrimination, as it appears from the presence of a large proportion of white dots between the values 0.5 and 2. This is not the case for gene Hsa.9025: the values of the samples in both classes are spread over the whole range of expression levels. A standard feature selection method is likely to consider only the first gene as relevant for the classification. However, as Fig. 1 suggests, the association of these two genes may significantly improve the discrimination between the two classes.

We devised an original approach that computes the mutual information contained in the gene expression profiles to identify gene pairs showing the strongest synergies, and then used them to improve the accuracy of the classification of microarray experiments. We point out that such synergic interactions capture biological information which is contextually relevant.

*To whom correspondence should be addressed.
†The authors wish it to be known that in their opinion the first two authors should be regarded as First Authors.

© The Author 2007. Published by Oxford University Press. All rights reserved.
For Permissions, please email: [email protected]

Fig. 1. Example of synergy between two genes. The plot shows the expressions of genes Hsa.9025 and Hsa.1221 from the colon cancer dataset. White dots represent sick patients and black dots normal controls. The association of the two genes clearly distinguishes the two conditions.

2 RELATED WORK

There is a vast amount of work on gene reduction methods to improve microarray data classification. Widely used, the scoring approaches take an individual perspective by computing for each gene a relevance score, depending on how well the gene distinguishes the examples of different classes. A good review of this kind of approach is provided by Ben-Dor et al. (2000). These methods are useful for microarray data because they are fast (linear complexity in the number of dimensions). However, they can only evaluate the relevance of genes with respect to the class, and cannot discover redundancy and basic interactions among genes. For this reason, the most competitive methods are multivariate ones that rely on groups of genes (i.e. the selection of a specific gene depends on the others), instead of considering each gene individually. Researchers generally assume that a good gene subset is one that contains genes that are highly correlated with the class, yet uncorrelated among them. Based on this idea, several selection methods have been developed. For example, MRMR (Maximum Relevance, Minimum Redundancy) (Ding and Peng, 2003) uses mutual information to select genes with maximum relevance and minimal redundancy. ProGene (Hanczar et al., 2003) reduces redundancy by building new features from subsets of similar genes. In a recent work, Dai et al. (2006) have carried out an extensive study to compare three reduction methods, including partial least squares (PLS). PLS builds new features corresponding to components that maximize the covariance between the variables and the class.

Other types of approaches aim to improve microarray classification by integrating available a priori biological knowledge about gene interactions. In this category, Rapaport et al. (2007) have recently proposed an original method, which integrates KEGG metabolic interaction networks into a spectral decomposition of gene expression profiles to derive a classification algorithm. Whenever available, this type of data integration should be used to improve any of the gene selection approaches mentioned above.

Most of the available dimension-reduction methods do not explicitly take into account the interactions among genes, although some proposals of using pairwise gene interactions do exist. Bo and Jonassen (2002) evaluate the gene pairs by computing the projected coordinates of each example on the axis of the diagonal linear discriminant in the gene-pair space. The score is the two-sample t-statistic on the projected points. Geman et al. (2004) do not use the expression value, but the expression rank of the genes. The pair score is computed from the probability that the expression rank of the first gene of the pair is higher than that of the second one, in each class. Their experimental results confirm the claim that class prediction can be improved using pairs of genes. In this article, we propose to identify strongly interacting genes and systematically exploit these pairs of synergies to improve the classification accuracy and the biological significance of the results.

3 DECOMPOSITION OF GENE PAIR INFORMATION

Let us first recall some definitions concerning mutual information (Shannon, 1948). The entropy $H(X)$ of a variable $X$, which can take $m$ values $\{x_1, \ldots, x_m\}$, each value $x_i$ with a probability $p(X = x_i)$, is defined as follows:

$$H(X) = -\sum_{i=1}^{m} p(X = x_i) \log p(X = x_i)$$

The following property of entropy holds for any pair of stochastic variables $X$ and $Y$:

$$H(X, Y) = H(Y, X) = H(X \mid Y) + H(Y)$$

The mutual information $I(X, Y)$ is a measure of the dependency between two variables $X$ and $Y$:

$$I(X, Y) = H(X) - H(X \mid Y) = H(Y) - H(Y \mid X) = I(Y, X)$$

The following property of information holds for any pair of stochastic variables $X$ and $Y$:

$$0 \le I(X, Y) \le \inf(H(X), H(Y))$$

Unlike the second order mutual information, the third order mutual information can be either positive or negative. The mutual information between three variables $(X, Y, Z)$ is defined as follows (Matsuda, 2000):

$$I(X, Y, Z) = H(X) + H(Y) + H(Z) - H(X, Y) - H(X, Z) - H(Y, Z) + H(X, Y, Z)$$

In the case of microarray-based classification, the mutual information between a gene $G_i$ and a class $C$ represents the information that the gene provides for classification. The higher the mutual information, the more informative the gene. For reasons of tractability and simplicity, expression levels are often discretized. The most straightforward and widely used approach relies on a histogram-based technique (Butte and Kohane, 2000). The data is partitioned into equal-width discrete bins, and the equations above may be used. The mutual information $I(G_i, C)$ can be expressed as follows:

$$I(G_i, C) = H(C) - H(C \mid G_i)$$

According to the above formula, the mutual information between a gene $G_i$ and the class $C$ can be seen as the reduction of the class entropy caused by the knowledge of $G_i$.
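The histogram-based estimate described above can be sketched as follows. This is a minimal illustration rather than the authors' implementation (the experiments in this paper were run in R): the function names, the base-2 logarithm and the default of three bins are assumptions made for the example.

```python
import numpy as np

def discretize(expr, n_bins=3):
    """Equal-width binning of a 1-D expression vector into integer codes."""
    expr = np.asarray(expr, dtype=float)
    edges = np.linspace(expr.min(), expr.max(), n_bins + 1)
    # Use interior edges only, so the maximum value falls in the last bin.
    return np.digitize(expr, edges[1:-1])

def joint_entropy(*cols):
    """Shannon entropy (bits, an assumed unit) of one or more discrete columns."""
    joint = np.stack([np.asarray(c) for c in cols], axis=1)
    _, counts = np.unique(joint, axis=0, return_counts=True)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log2(p)))

def mutual_information(gene, cls, n_bins=3):
    """I(G, C) = H(G) + H(C) - H(G, C), which equals H(C) - H(C | G)."""
    g = discretize(gene, n_bins)
    # Map arbitrary class labels to integer codes before stacking.
    c = np.unique(np.asarray(cls), return_inverse=True)[1]
    return joint_entropy(g) + joint_entropy(c) - joint_entropy(g, c)
```

With so few samples per dataset, the estimate is sensitive to the number of bins, which is why `n_bins` is left as a parameter here.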
In the same way, we define the mutual information between the class $C$ and a pair of genes $\{G_i G_j\}$ formed by the two genes $G_i$ and $G_j$:

$$I(G_i G_j, C) = I(G_i, C) + I(G_j, C) - I(G_i, G_j, C) \qquad (1)$$

Formula (1) can be proved by the following argument. The left-hand side of (1) can also be written:

$$I(G_i G_j, C) = H(C) - H(C \mid G_i, G_j) = H(C) - H(G_i, G_j, C) + H(G_i, G_j)$$

By introducing the terms $H(G_i, C)$, $H(G_j, C)$, $H(G_i)$, $H(G_j)$ and $H(C)$, we obtain:

$$
\begin{aligned}
I(G_i G_j, C) ={} & -H(G_i, C) + H(G_i) + H(C) \\
& - H(G_j, C) + H(G_j) + H(C) \\
& - H(G_i, G_j, C) + H(G_i, G_j) + H(G_i, C) + H(G_j, C) - H(G_i) - H(G_j) - H(C) \\
={} & I(G_i, C) + I(G_j, C) - I(G_i, G_j, C)
\end{aligned}
$$

Formula (1) shows that the information of a gene pair is the sum of the information of the first gene and the information of the second gene, minus the third order mutual information between $G_i$, $G_j$ and $C$. This last term represents the information provided to the classification by the association of the two genes. We call this term the interaction, which can be either positive or negative. In the case of a positive interaction, the information of the gene pair is lower than the sum of the information of the two genes. In this case part of the information provided by the genes is similar, and therefore we may speak of redundancy between the two genes. On the contrary, when the interaction is negative the information of the gene pair is higher than the sum of the information of the two genes, which means that the association of the two genes provides new information. We speak then of synergy between the genes (Jakulin and Bratko, 2003). In the example of Figure 1, the information contained in the expression of gene Hsa.1221 is 0.20, whereas it is 0.03 for gene Hsa.9025, and −0.27 for their synergic interaction. The information of the pair formed by these two genes is 0.50 = 0.20 + 0.03 − (−0.27).

It should be underlined that the mutual information is computed from the probability distributions of the gene expression (Steuer et al., 2002). However, because the real probabilities are unknown, as they are only estimated from limited data, we have conducted a set of experiments (details of these experiments can be found on the companion website), which show that the mutual information is accurate enough to identify the most informative pairs of genes.

4 REDUCING DIMENSIONALITY USING SYNERGIES

In this section, we describe a new dimension reduction method, called FeatKNN, based on the use of synergic pairs of genes. The most informative gene pairs are identified using a sequential forward search (SFS) procedure. Then, for each gene pair, a new feature, which summarizes the information contained in the pair, is constructed.

4.1 The search for the most informative pairs of genes

The naive approach to finding informative pairs of genes consists in computing the mutual information with respect to the class of all the $N(N-1)/2$ different pairs, where $N$ is the number of genes, and then selecting the $p$ best ones. However, this approach has a complexity $O(N^2)$; in the context of microarray data, where the number of genes is of the order of several thousands, this solution is often computationally infeasible. But, even in the case where the complexity may not be a problem, there are two more reasons why this simple approach might be unsuitable. The first is that we would like to select pairs that not only provide high information for the classification, but which provide information superior to that brought by the single genes (i.e. the genes should interact negatively). In fact, only those genes that satisfy this property are interesting in the context of this work, which is based on the assumption (grounded on biological findings) that synergic gene interaction is important for improving classification. The second reason is that correlated attributes usually provide duplicated (redundant) information; in learning, it is well known that it is preferable to exploit sources of information that are as diverse and independent as possible. Preliminary experiments showed that in some rare cases a dataset may contain an exceptionally informative gene, such that it forms a 'good' pair when coupled with a large number of the other genes. This situation is undesirable, and we want to avoid it.

In order to face the above problems, we propose in this article a simple search algorithm, guided by a powerful heuristic, which allows the $p$ most informative pairs of genes to be found with a complexity $O(pN)$. The search proceeds as follows: in the beginning, the mutual information $I(G_i, C)$ between each single gene $G_i$ and the class $C$ is computed, and the most informative gene $G_{i^*}$ is selected. Then, the mutual information $I(G_{i^*} G_j, C)$, between the class and each pair of genes which includes the gene $G_{i^*}$, is computed. The gene pair $(G_{i^*} G_{j^*})$ that maximizes $I(G_{i^*} G_j, C)$ is selected, and the genes $G_{i^*}$ and $G_{j^*}$ are removed from the list of genes to be analyzed. This procedure is iterated $p$ times to obtain $p$ pairs of genes. The deletion of the selected genes from the list of the available ones is motivated by the goal of eliminating the redundant pairs mentioned earlier.

4.2 Feature construction from gene pairs

For each of the $p$ informative gene pairs $(G_i G_j)$, a new feature $A_{i,j}$ is constructed. The idea is the following: the higher the difference between the densities of the classes around a point, the higher the probability that this point belongs to the higher density class. The value of the new feature is the difference between the local densities of the classes.

More precisely, let $E = \{e_1, \ldots, e_M\}$ be a set of $M$ instances, each belonging to one of the two classes $\{C_a, C_b\}$. From an informative pair of genes $(G_i G_j)$ we construct a new feature $A_{i,j}$ as follows: the instances in $E$ are projected onto the two-dimensional space defined by the expressions of $G_i$ and $G_j$. We have to define the value $A_{i,j}(x)$ of the feature $A_{i,j}$ for every point $x$ of this space. The probabilities $p_a(x)$ and $p_b(x)$ that point $x$ belongs to class $C_a$ or to class $C_b$, respectively, can be approximated by using the $k$-nearest neighbors of $x$: $p_a(x) \approx n_a(x)/k$ and $p_b(x) \approx 1 - p_a(x)$, where $n_a(x)$ is the number of points belonging to class $C_a$ among the $k$-nearest neighbors of $x$. The value of the new feature $A_{i,j}$ at the point $x$ is the difference between $p_a(x)$ and $p_b(x)$:

$$A(x) = p_a(x) - p_b(x) \approx \frac{n_a(x)}{k} - \left(1 - \frac{n_a(x)}{k}\right) = -1 + \frac{2\,n_a(x)}{k}$$

The values of the new feature are between −1 and +1. When an instance is close to instances belonging to class $C_a$ (respectively, $C_b$), the feature tends to +1 (respectively, −1).

It should be underlined that the definition of this new feature is the same as the margin defined by Schapire for the voting methods in machine learning (Schapire et al., 1997). The new feature that we have defined represents the classification margin of the $k$-nearest neighbor classifier in the space of the gene pairs.

5 RESULTS AND DISCUSSION

In order to test the effectiveness of the proposed method, an experimental study was designed and set up to answer the following questions: is our selection heuristic adapted to find informative pairs of genes? What is the amount of information contained in the interaction compared to that of the individual genes? Is our feature construction method effective for synthesizing the information contained in a pair of genes? Does FeatKNN improve classification accuracy?

Six public datasets are used in these experiments; their characteristics are described in Table 1. It shows the data type, the number of genes measured and the number of samples contained in each class.

Table 1. Description of the datasets

Dataset name       Number of genes   Number of samples   Class $C_a$   Class $C_b$
Leukemia           7129              72                  47            25
Colon cancer       2000              62                  40            22
Prostate cancer    12 600            102                 52            50
SRBCT              6567              63                  43            20
Lung cancer        3588              43                  22            21
Breast cancer      7129              49                  25            24

5.1 Identification of the most informative pairs of genes

Our method is based on the identification of the most informative gene pairs, which is performed by the SFS procedure. The choice of this SFS has been based on the assumption that the most informative pairs of genes include at least one of the most informative genes. To validate this assumption, we examined the rank of the genes forming the best pairs. We defined a ranking of the genes based on their mutual information with the class. The gene with the highest mutual information has rank 1, and the one with the lowest mutual information has the lowest rank. In the same way, we computed the mutual information between the class and all gene pairs, and we defined a ranking of the gene pairs. Notice that here we do not select the best pairs using the SFS procedure, but we compute the mutual information for all exclusive pairs of genes. A figure on the companion website shows the average rank of the two genes forming the most informative gene pairs. It shows that the first genes of the best pairs were among the top informative genes, while the second genes have a much higher rank. For example, the two genes forming the 50 best pairs have on average rank 58 and 209, respectively. The same results were observed on the other datasets described in Table 1.

The observation of the values of the mutual information of the genes and pairs of genes leads to the conclusion that all the best pairs include, on average, a highly informative gene. This observation validates our assumption for microarray data. Also, this suggests a positive answer (from an empirical point of view) to the second question regarding the ability of our heuristic to find highly informative gene pairs. It also indicates the usefulness of this heuristic choice for selecting pairs, which are formed starting from the gene with the best rank. It should be underlined that standard feature selection methods, disregarding interactions, may miss many useful pairs, as was already said in the introduction.

5.2 Analysis of the most informative pairs

Our assumption is that the explicit account of the synergic interaction between genes may improve classification accuracy. To validate this assumption we computed, for each dataset, the information of all genes and all gene pairs; both genes and gene pairs have been ordered according to increasing rank. Figure 2 shows the decomposition of the information of the 100 best gene pairs of the six datasets. It can be seen that around 40% of the information provided by the pairs of genes resides in the interaction of their components. For example, in the 100 most informative gene pairs of the breast cancer dataset, 38% of the information is provided by the best gene, 24% by the second gene and 38% by their interaction.

Fig. 2. Decomposition of the information contained in the best gene pairs. The black part shows the amount of information of the most informative genes. The dark grey part shows the amount of information of the second genes. The light grey part shows the amount of information of the interaction.

5.3 Information obtained by feature construction

The aim of feature construction is to synthesize the information contained in the genes and their interactions. In order to measure the effectiveness of our feature construction method, we empirically compared the information contained in the most informative gene pairs $(G_i, G_j)$ and in the associated newly constructed features $A_{i,j}$. For the $p = 100$ best gene pairs of the six datasets described above, the mutual information of the newly constructed attribute is on average in 90% of the cases higher than the information of the best gene in the pair. These results suggest that our feature construction method is effective for synthesizing the information contained in a gene pair, thus answering our third question.

5.4 Classification accuracy

To measure the impact of our dimension reduction method on the classification, we examined classification accuracy on the six datasets. The cross-validation estimator is the most commonly used error estimation method in microarray-based classification. We have used the 10-times 10-fold cross-validation procedure in our experiment to measure the generalization error. However, Braga-Neto and Dougherty (2004) have shown that this estimator is not the most appropriate one for small samples like the ones available in microarray analysis. Cross-validation has a high variance and therefore bootstrap estimators are preferred, in particular the 0.632 estimator (Efron, 1983). The 0.632 bootstrap estimator is a weighted sum of the empirical error and the out-of-bag bootstrap error. We have also used the 0.632 bootstrap estimator in our experiment to complete the results obtained by cross-validation. A total of 100 bootstrap iterations were performed. It should be noted that for the evaluation procedure (both cross-validation and bootstrap) the test samples were not used in the dimension reduction and classifier design. Thus, we avoided the problem of selection bias pointed out by Ambroise and McLachlan (2002) and Reunanen (2003). It should be underlined that the number of features used in classifier design is a meta-parameter, whose value is chosen by an internal cross-validation procedure. A figure representing the study design can be found on the companion website (Fig. 1).

In our experiments, each dimension reduction method is associated with three different classification algorithms: support vector machines (SVM), k-nearest neighbors (KNN) and diagonal linear discriminant (DLD). We have chosen these algorithms because they are among the most efficient for microarray data classification. Dudoit et al. (2002) have pointed out the excellent results of the simplest methods like KNN and DLD. Furey et al. (2000) and Lee et al. (2005) have published comparative studies of the classification methods for microarray data, and they concluded that SVMs are the best model. All of the experiments have been performed with the statistical environment R; the number of neighbors k for KNN was 3 and the SVM has been implemented using the package 'e1071' with a radial kernel.

We have compared FeatKNN to other methods that are widely used in the literature and reach good performances. These dimension reduction methods are the following:

All genes: all the genes are used.
Single MI: the genes with the highest mutual information with respect to the class are selected. This method is commonly used in the literature (Ben-Dor et al., 2000; Wang et al., 2005).
Pair MI: the gene pairs with the highest mutual information with the class are selected (i.e. FeatKNN without the feature construction step). This method is tested to show the importance of feature construction.
BO: the gene pair-based method developed by Bo and Jonassen (2002).
Geman: the gene pair selection method used by Geman et al. (2004) in their TSP classifier.
PLS: new features are constructed as the components which maximize the covariance between the class and the variables (Dai et al., 2006).

In this article, we focus on the results obtained by 0.632 bootstrap; the results by cross-validation can be found on the companion website, and lead to the same conclusions. Table 2 reports the classification error rates for different reduction methods. We used a paired Wilcoxon test to compare the results. The details of the P-values of significance can be found on the companion website.

It is not surprising to see that dimension reduction methods improve classification performance considerably. In all cases the performances reported in the column 'All genes' are worse than the others. The columns 'Single MI' and 'Pair MI' represent the results obtained when we select the genes (respectively, gene pairs) having the highest mutual information with respect to the class. We see that the methods using single genes (column 'Single MI') and pairs of genes (column 'Pair MI') obtain similar results. We have shown that the gene pairs were more informative than single genes. This may suggest that the information contained in the interaction between the genes composing the pairs is not well exploited by the classification algorithms, and therefore much of the information computed during the pair-selection phase is lost. This phenomenon is avoided in FeatKNN, thanks to the feature construction step. The new features constructed by FeatKNN synthesize the information contained in the genes and their interactions, which explains the better results. FeatKNN outperforms Geman's method in all datasets with the three classifiers. Bo's method is competitive, especially with the DLD classifier: 9 times out of 18 Bo's results are as good as FeatKNN's, and it outperforms FeatKNN on the SRBCT dataset with a DLD classifier. FeatKNN is statistically significantly (95% level) better than all other methods but PLS. PLS results are almost as good as FeatKNN's. It is the only method that is not significantly worse than FeatKNN.
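The pair search of Section 4.1 and the margin feature of Section 4.2 can be sketched together as below. This is an illustrative sketch under stated assumptions, not the authors' R code: all function names are hypothetical, the expression matrix is assumed to be already discretized for the search, entropies use a base-2 logarithm, and the query points passed to `knn_margin` are assumed to be distinct from the training points (how a training instance treats itself as a neighbor is not specified in the text).

```python
import numpy as np

def H(*cols):
    """Joint Shannon entropy (bits) of one or more discrete columns."""
    joint = np.stack(cols, axis=1)
    _, counts = np.unique(joint, axis=0, return_counts=True)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log2(p)))

def mi(g, c):
    """I(G, C) = H(G) + H(C) - H(G, C)."""
    return H(g) + H(c) - H(g, c)

def pair_mi(gi, gj, c):
    """I(GiGj, C) = H(Gi, Gj) + H(C) - H(Gi, Gj, C)."""
    return H(gi, gj) + H(c) - H(gi, gj, c)

def interaction(gi, gj, c):
    """Interaction term of Formula (1); negative values indicate synergy."""
    return mi(gi, c) + mi(gj, c) - pair_mi(gi, gj, c)

def select_pairs(X, c, p):
    """Greedy search over a discretized (samples x genes) matrix X."""
    available = list(range(X.shape[1]))
    pairs = []
    for _ in range(p):
        if len(available) < 2:
            break
        # Most informative single gene among those still available.
        best = max(available, key=lambda i: mi(X[:, i], c))
        # Partner that maximizes the pair information with that gene.
        partner = max((j for j in available if j != best),
                      key=lambda j: pair_mi(X[:, best], X[:, j], c))
        pairs.append((best, partner))
        # Remove both genes so they cannot appear in later pairs.
        available.remove(best)
        available.remove(partner)
    return pairs

def knn_margin(train_xy, train_is_a, query_xy, k=3):
    """A(x) = -1 + 2 * n_a(x) / k for each query point in the gene-pair plane."""
    train_xy = np.asarray(train_xy, dtype=float)
    query_xy = np.asarray(query_xy, dtype=float)
    is_a = np.asarray(train_is_a, dtype=bool)
    # Euclidean distances between every query point and every training point.
    d = np.linalg.norm(query_xy[:, None, :] - train_xy[None, :, :], axis=2)
    nearest = np.argsort(d, axis=1)[:, :k]
    n_a = is_a[nearest].sum(axis=1)
    return -1.0 + 2.0 * n_a / k
```

On an XOR-like toy example, where neither gene alone says anything about the class, `interaction` returns a negative value, which matches the reading of a negative interaction as synergy.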
Table 2. Classification results on six public datasets

Classifier  Data             All genes    Single MI    Pair MI      FeatKNN      Bo           Geman        PLS
SVM         Leukemia         12.3 ± 1.1   4.3 ± 1      4.8 ± 1.2    2.8 ± 1.0    3.9 ± 0.9    6.1 ± 1.1    2.4 ± 1.3
SVM         Colon cancer     17.5 ± 1.1   12.5 ± 1.3   11.8 ± 1.0   10.7 ± 1.1   13.9 ± 1.3   14.6 ± 1.0   11.1 ± 1
SVM         Prostate cancer  9.5 ± 1.1    6.1 ± 1.0    6 ± 1.0      6.0 ± 0.8    5.6 ± 0.9    6 ± 0.7      6.2 ± 0.6
SVM         SRBCT            7.6 ± 0.5    2.1 ± 0.4    3.8 ± 0.3    0.2 ± 0.2    0.2 ± 0.2    0.7 ± 0.3    1.7 ± 0.5
SVM         Lung cancer      24.9 ± 1.1   21.7 ± 0.8   21 ± 0.9     19.5 ± 1.0   21.5 ± 1.1   21 ± 1.0     20.7 ± 1.3
SVM         Breast cancer    14.6 ± 0.8   11.4 ± 1.1   11.2 ± 0.9   8.7 ± 1.0    11.4 ± 1.1   11.2 ± 1.0   9.7 ± 1
KNN         Leukemia         8.4 ± 1.1    6.1 ± 0.9    6.2 ± 1.0    5.0 ± 1.2    4.6 ± 1.2    6.3 ± 1.1    5.4 ± 0.9
KNN         Colon cancer     20.0 ± 1.2   14.9 ± 1.2   14.4 ± 1.0   12.8 ± 0.9   15.9 ± 1.0   16 ± 1.1     12.4 ± 1.1
KNN         Prostate cancer  20.2 ± 1.0   17.8 ± 0.8   18 ± 0.9     8.1 ± 0.7    8.7 ± 0.8    9.8 ± 1.0    8.5 ± 0.9
KNN         SRBCT            11.6 ± 0.7   1.1 ± 0.2    1 ± 0.5      0.1 ± 0.1    0.1 ± 0.1    0.1 ± 0.1    1.3 ± 0.4
KNN         Lung cancer      35 ± 0.7     29.1 ± 1.0   28.2 ± 0.9   20.8 ± 1.0   23.7 ± 1.3   24.2 ± 1.0   21.7 ± 1.4
KNN         Breast cancer    20.8 ± 0.7   14.3 ± 0.9   13.5 ± 1.0   9.0 ± 1.0    12 ± 1.1     13.1 ± 1.0   8.4 ± 1
DLD         Leukemia         11.5 ± 1.2   4.8 ± 1.2    4.8 ± 1.0    3.8 ± 1.1    4.1 ± 1.0    5.0 ± 1.1    2.7 ± 0.8
DLD         Colon cancer     19.5 ± 1.4   15.7 ± 1.2   15.4 ± 1.0   12.5 ± 1.0   14.4 ± 1.1   15 ± 1.3     12.9 ± 1.1
DLD         Prostate cancer  37.5 ± 1.0   10.5 ± 0.8   10.1 ± 0.9   7.6 ± 0.9    7.3 ± 1.0    8 ± 0.7      7.3 ± 1
DLD         SRBCT            5.4 ± 0.7    0.8 ± 0.2    0.5 ± 0.2    0.7 ± 0.2    0.2 ± 0.1    0.1 ± 0.1    2.5 ± 0.6
DLD         Lung cancer      25.4 ± 1.2   21.6 ± 0.8   22.1 ± 0.9   20.6 ± 1.0   20.3 ± 0.8   20.2 ± 1.0   20.2 ± 1.1
DLD         Breast cancer    14.9 ± 0.6   10.7 ± 1.0   10.9 ± 1.0   9.1 ± 1.1    9.3 ± 0.9    10.1 ± 0.9   9.6 ± 1

All errors are estimated using the 0.632 bootstrap estimators. Boldfaced values highlight the best results.
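Since all errors in Table 2 come from the 0.632 bootstrap, a minimal sketch of that estimator may help. It assumes the usual weights from Efron (1983), 0.368 for the empirical (resubstitution) error and 0.632 for the out-of-bag error, and it averages the out-of-bag error per bootstrap iteration, which is a simplification. The `fit` and `predict` callables and the nearest-centroid toy classifier are hypothetical stand-ins for the SVM, KNN and DLD classifiers used in the paper.

```python
import numpy as np

def bootstrap_632(X, y, fit, predict, n_iter=100, seed=0):
    """0.632 bootstrap error: 0.368 * resubstitution + 0.632 * out-of-bag."""
    X, y = np.asarray(X), np.asarray(y)
    rng = np.random.default_rng(seed)
    n = len(y)
    # Empirical (resubstitution) error: train and test on the full sample.
    err_resub = float(np.mean(predict(fit(X, y), X) != y))
    oob_errors = []
    for _ in range(n_iter):
        idx = rng.integers(0, n, size=n)        # draw n samples with replacement
        oob = np.setdiff1d(np.arange(n), idx)   # samples not drawn this round
        if oob.size == 0:
            continue
        model = fit(X[idx], y[idx])
        oob_errors.append(np.mean(predict(model, X[oob]) != y[oob]))
    err_oob = float(np.mean(oob_errors))
    return 0.368 * err_resub + 0.632 * err_oob

def fit_centroids(X, y):
    """Toy classifier (assumption for the example): one centroid per class."""
    classes = np.unique(y)
    return classes, np.array([X[y == c].mean(axis=0) for c in classes])

def predict_centroids(model, X):
    """Assign each row of X to the class of its nearest centroid."""
    classes, centroids = model
    d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    return classes[np.argmin(d, axis=1)]
```

To respect the selection-bias precaution described in Section 5.4, any gene or pair selection would have to happen inside `fit`, so that out-of-bag samples never influence the dimension reduction.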
5.5 A biological interpretation of the concept of synergic transcript pairs

The exploration of the biological significance that may be enfolded in the concept of synergic transcript pairs had to consider two distinct aspects, one more particular, related to the analyzed clinical situations, and another more general, regarding the type of biological interactions that may explain the synergic behavior exhibited by the genes belonging to the same pair. To answer these questions, we started by separating the two components of gene pairs and then we carried out a discriminative functional profiling of the two resulting lists of genes (i.e. a first pairs component list and a second pairs component list). Based on the functional assignments provided by the Gene Ontology (GO) consortium (http://www.geneontology.org), and by the NCBI genomic repository (http://www.ncbi.nlm.nih.gov), an automated annotation procedure combined with a gene set enrichment analysis allowed us to identify biological themes significantly overrepresented in each of the two lists of genes.

Figure 3 shows overrepresented biological themes characterizing each of the two components of the first 100 most informative pairs extracted from the colon cancer dataset, which resulted from microarray experiments performed to compare expression profiles of tumor and normal colon tissues. The functional profiles depicted in Figure 3 seem to indicate highly distinct biological assignments for the two components of gene pairs. Thus, while the first pair component seems to be related mostly to intracellular processes located either in the nucleus (i.e. nucleus, nuclear part) or in the cytoplasm (i.e. intracellular part, intracellular non-membrane-bound organelle, cytoskeletal part), the second pair component appears to be involved exclusively in cell membrane-related processes (i.e. plasma membrane, membrane part, intracellular membrane-bound organelle). Moreover, these findings seem to be well supported by the molecular functions assigned to the translation products of these genes, which were found to be related to nucleotide binding, structural molecule activity and transcriptional activator activity for the first pairs component, and to transmembrane receptor activity and transporter activity for the second pairs component.

Fig. 3. Overrepresented biological themes discriminating the two components of the first 100 most informative synergic transcript pairs extracted from the colon cancer dataset (see text for details).

Considering the experimental framework from which this dataset resulted, these functional profiles seem to indicate that the biological themes that best distinguish tumoral from normal colon cells concern essentially the nuclear transcriptional control of a large panel of intracellular processes (i.e. cell differentiation, proliferation, metabolism, apoptosis, extracellular matrix production, etc.) on one side, and the cell communication and extracellular signaling modulated by cell membrane structures (i.e. involved in processes such as focal adhesions, cell attachment and migration, growth factor receptor expression and signaling, etc.) on the other.

These findings are in total agreement with the most up-to-date understanding of tumoral biology. Indeed, it is well acknowledged that tumor cell survival is dictated by both internal properties of the cell, such as the status of components of the apoptotic machinery, and its extracellular environment, such as extracellular matrix and growth factor receptor expression and signaling (Dennis and Kastan, 1998) that modulate apoptosis regulation. Apoptotic anomalies represent a major distinction between tumor and normal cells, as they allow tumor cells to avoid programmed cellular death, and confer on them the capacity to proliferate indefinitely. Some of these anomalies, which involve the nuclear compartment, may result in an inactivation of key tumor suppressor genes (TSGs), acknowledged as being central to the development of all forms of human cancer (Rhee et al., 2002). This inactivation is induced by a combination of epigenetic silencing and promoter hypermethylation of various TSGs. On the other side, the importance of extracellular survival signals as key regulators of apoptosis is now being recognized by the ability of growth factors (GFs), GF receptors (GFRs) and GFR signaling to promote cellular survival. Indeed, recent evidence suggests that a number of abnormal cell membrane constituents may disrupt apoptosis regulation in tumor cells by pathologically amplifying the anti-apoptotic effect of normal extracellular survival signals (Leask and Abraham, 2006). All this evidence suggests that the concept of synergic gene pairs not only uncovered a noteworthy functional behavior singularizing tumor cells, but also captured the synergic aspect of the

construction method that forces learning algorithms to take into account pairs with a high level of information. The usefulness of this approach was experimentally assessed on six datasets and yielded a significant improvement in performance. Moreover, a synthetic assessment of the biological significance of the concept of synergic gene pairs suggested its ability to uncover relevant mechanisms underlying interactions among various cellular processes.

Conflict of Interest: none declared.

REFERENCES

Ambroise,C. and McLachlan,G. (2002) Selection bias in gene extraction on the basis of microarray gene expression data. Proc. Natl Acad. Sci. USA, 99, 6562–6566.
Ben-Dor,A. et al. (2000) Scoring genes for relevance. Technical report AGL-2000-13, Agilent Technologies. Institute of Computer Science, Hebrew University, Jerusalem.
Bo,T. and Jonassen,I. (2002) New feature subset selection procedures for classification of expression profiles. Genome Biology, 3, research0017.1–research0017.11.
Braga-Neto,U. and Dougherty,E. (2004) Is cross-validation valid for small-sample microarray classification? Bioinformatics, 20, 374–380.
Butte,A.J. and Kohane,I.S. (2000) Mutual information relevance networks: functional genomic clustering using pairwise entropy measurements. Pac. Symp. Biocomput., 418–429.
Dai,J. et al. (2006) Dimension reduction for classification with gene expression microarray data. Stat. Appl. Genet. Mol. Biol., 5, article 6.
Dennis,P.A. and Kastan,M.B. (1998) Cellular survival pathways and resistance to cancer therapy. Drug Resist. Updat., 1, 301–309.
Ding,C. and Peng,H. (2003) In Proceedings of the IEEE Computer Society Conference on Bioinformatics. Stanford, CA, USA, pp. 523–529.
Dudoit,S. et al. (2002) Comparison of discrimination methods for classification of tumors using gene expression data. J. Am. Stat. Assoc., 97, 77–87.
Efron,B. (1983) Estimating the error rate of a prediction rule: improvement on cross-validation. J. Am. Stat. Assoc., 78, 316–331.
Furey,T. et al. (2000) Support vector machine classification and validation of cancer tissue samples using microarray expression data. Bioinformatics, 16, 906–914.
Geman,D. et al. (2004) Classifying gene expression profiles from pairwise mRNA comparisons. Stat. Appl. Genet. Mol. Biol., 3, article 19.
Hanczar,B. et al. (2003) Improving classification of microarray data using prototype-based feature selection. SIGKDD Explor., 5, 23–30.
Jakulin,A. and Bratko,I. (2003) Analyzing attribute dependencies. In Proceedings of the 7th European Conference on Principles and Practice of Knowledge Discovery in Databases (PKDD), pp. 229–240.
Leask,A. and Abraham,D.J. (2006) All in the CCN family: essential matricellular signaling modulators emerge from the bunker. J. Cell. Sci, 119, 4803–4810.
Lee,J. et al. (2005) An extensive comparison of recent classification tools applied to microarray data. Comput. Stat. Data Analy., 48, 869–885.
Matsuda,H.
(2000) Physical nature of higher-order mutual information: intrinsic interaction between underlying biological mechanisms. correlations and frustration. Phys. Rev. E, 62, 3096–3102. Rapaport,F. et al. (2007) Classification of microarray data using gene networks. BMC Bioinformatics, 8. 6 CONCLUSION Reunanen,J. (2003) Overfitting in making comparisons between variable selection In this article, we have presented a dimension reduction methods. J. Mach. Learn. Res., 3, 1371–1382. Rhee,I. et al. (2002) DNMT1 and DNMT3b cooperate to silence genes in human procedure for microarray data oriented towards improving cancer cells. Nature, 416, 552–556. classification performance. This method is based on the idea Schapire,R.E. et al. (1997) Boosting the margin: a new explanation for the that the information provided by the interaction between genes effectiveness of voting methods. In Proceedings 14th International Conference cannot be ignored in the feature selection phase. We have on Machine Learning. Morgan Kaufmann, Nashville, TN, USA, pp. 322–330. Shannon,E. (1948) A mathematical theory of communication. Bell Sys. Tech. J., proposed a decomposition of the information contained in the 27, 623–656. gene pairs. Although it is natural to quantify information from Steuer,R. et al. (2002) The mutual information: detecting and evaluating genes and interactions from the computation of mutual dependencies between variables. Bioinformatics, 18, 231–240. information, this simple reduction does not necessarily improve Wang,Y. et al. (2005) Gene selection from microarray data for cancer performance. Therefore, we have developed a feature classification – a machine learning approach. Comput. Biol. Chem., 29, 37–46. http://www.deepdyve.com/assets/images/DeepDyve-Logo-lg.png Bioinformatics Oxford University Press


References (29)

Publisher
Oxford University Press
Copyright
© The Author 2007. Published by Oxford University Press. All rights reserved. For Permissions, please email: [email protected]
eISSN
1367-4811
DOI
10.1093/bioinformatics/btm429
pmid
17925306

Abstract

Vol. 23 no. 21 2007, pages 2866–2872
BIOINFORMATICS ORIGINAL PAPER
doi:10.1093/bioinformatics/btm429
Gene expression

Feature construction from synergic pairs to improve microarray-based classification

Blaise Hanczar (1,*,†), Jean-Daniel Zucker (1,2,3,4,†), Corneliu Henegar (2,3,4) and Lorenza Saitta (5)

(1) Laboratoire d'Informatique Médicale et Bioinformatique (Lim & Bio), Université Paris 13, 93017 Bobigny, (2) Université Paris Descartes, F-75006, (3) Université Pierre et Marie Curie - Paris 6, Centre de recherche des Cordeliers, UMR S 872, (4) INSERM, U872, Paris, F-75006, France and (5) Dipartimento di Informatica, Università del Piemonte Orientale, 15100 Alessandria, Italy

Received on February 5, 2007; revised on July 6, 2007; accepted on August 18, 2007
Advance Access publication October 9, 2007
Associate Editor: Joaquin Dopazo

*To whom correspondence should be addressed.
†The authors wish it to be known that in their opinion the first two authors should be regarded as First Authors.

ABSTRACT

Motivation: Microarray experiments that allow simultaneous expression profiling of thousands of genes in various conditions (tissues, cells or time) generate data whose analysis raises difficult problems. In particular, there is a vast disproportion between the number of attributes (tens of thousands) and the number of examples (several tens). Dimension reduction is therefore a key step before applying classification approaches. Many methods have been proposed for this purpose, but only a few of them considered a direct quantification of transcriptional interactions. We describe and experimentally validate a new dimension reduction and feature construction method, which assesses interactions between expression profiles to improve microarray-based classification accuracy.

Results: Our approach relies on a mutual information measure that exposes some elementary constituents of the information contained in a pair of gene expression profiles. We show that their analysis implies a term that represents the information of the interaction between the two genes. The principle of our method, called FeatKNN, is to exploit the information provided by highly synergic gene pairs to improve classification accuracy. First, a heuristic search selects the most informative gene pairs. Then, for each selected pair, a new feature, representing the classification margin of a KNN classifier in the gene pairs space, is constructed. We show experimentally that the interactional information has a degree of significance comparable to that of the gene expression profiles considered separately. Our method has been tested with different classifiers and yielded significant improvements in accuracy on several public microarray databases. Moreover, a synthetic assessment of the biological significance of the concept of synergic gene pairs suggested its ability to uncover relevant mechanisms underlying interactions among various cellular processes.

Contact: [email protected]

Supplementary information: Complementary results can be found on the companion website at http://featknn.nutriomique.org

1 INTRODUCTION

Most cellular processes need to accommodate concomitantly various types of solicitations, related either to the specificities of a particular cellular state, or to variations of the parameters of the intra- or the extracellular environments. Complex regulatory mechanisms therefore integrate internal demands, environmental fluctuations as well as various extracellular signals (e.g. growth factors, mediators, hormones, other autocrine and paracrine signals, etc.), and initiate specific adaptive processes to maintain the metabolic homeostasis of the cell and to assure its systemic role in the organism. In this article, we propose an original approach designed to capture synergic interactions between cellular processes from the information encoded in the gene expression profiles, to improve the accuracy of the classification of microarray experiments. We investigate the feasibility and the potential advantages of considering gene information interactions in the very phase of dimension reduction. We show that pairs of genes with a high discrimination power need not include genes that are both individually discriminant. Therefore, a feature reduction method that does not consider interactions explicitly is likely to miss such useful pairs. Figure 1 illustrates this situation. Genes Hsa.1221 and Hsa.9025 are used to discriminate between two classes (control subjects and patients affected by colon cancer), represented by black and white dots, respectively. Expression of gene Hsa.1221 is very useful for discrimination, as appears from the presence of a large proportion of white dots between the values 0.5 and 2. This is not the case for gene Hsa.9025: the values of the samples in both classes are spread over the whole range of expression levels. A standard feature selection method is likely to consider only the first gene as relevant for the classification. However, as Fig. 1 suggests, the association of these two genes may significantly improve the discrimination between the two classes.

We devised an original approach that computes the mutual information contained in the gene expression profiles to identify gene pairs showing the strongest synergies, and then used them to improve the accuracy of microarray experiments classification. We point out that such synergic interactions capture biological information that is contextually relevant.

Fig. 1. Example of synergy between two genes. The plot shows the expressions of genes Hsa.9025 and Hsa.1221 from the colon cancer dataset. White dots represent sick patients and black dots normal controls. The association of the two genes clearly distinguishes the two conditions.

2 RELATED WORK

There is a vast amount of work on gene reduction methods to improve microarray data classification. Widely used, the scoring approaches take an individual perspective by computing for each gene a relevance score, depending on how well the gene distinguishes the examples of different classes. A good review of this kind of approach is provided by Ben-Dor et al. (2000). These methods are useful for microarray data because they are fast (linear complexity in the number of dimensions). However, they can only evaluate the relevance of genes with respect to the class, and cannot discover redundancy and basic interactions among genes. For this reason, the most competitive methods are multivariate ones that rely on groups of genes (i.e. the selection of a specific gene depends on the others), instead of considering each gene individually. Researchers generally assume that a good gene subset is one that contains genes that are highly correlated with the class, yet uncorrelated among themselves. Based on this idea, several selection methods have been developed. For example, MRMR (Maximum Relevance, Minimum Redundancy) (Ding and Peng, 2003) uses mutual information to select genes with maximum relevance and minimal redundancy. ProGene (Hanczar et al., 2003) reduces redundancy by building new features from subsets of similar genes. In a recent work, Dai et al. (2006) have carried out an extensive study to compare three reduction methods, including partial least squares (PLS). PLS builds new features corresponding to components that maximize the covariance between the variables and the class.

Other types of approaches aim to improve microarray classification by integrating available a priori biological knowledge about gene interactions. In this category, Rapaport et al. (2007) have recently proposed an original method, which integrates KEGG metabolic interaction networks into a spectral decomposition of gene expression profiles to derive a classification algorithm. Whenever available, this type of data integration should be used to improve any of the gene selection approaches mentioned above.

Most of the available dimension-reduction methods do not explicitly take into account the interactions among genes, although some proposals for using pairwise gene interactions do exist. Bo and Jonassen (2002) evaluate the gene pairs by computing the projected coordinates of each example on the axis of the diagonal linear discriminant in the gene-pair space. The score is the two-sample t-statistic on the projected points. Geman et al. (2004) do not use the expression value, but the expression rank of the genes. The pair score is computed from the probability that the expression rank of the first gene of the pair is higher than that of the second one, in each class. Their experimental results confirm the claim that class prediction can be improved using pairs of genes. In this article, we propose to identify strongly interacting genes and systematically exploit these pairs of synergies to improve the classification accuracy and the biological significance of the results.

3 DECOMPOSITION OF GENE PAIR INFORMATION

Let us first recall some definitions concerning mutual information (Shannon, 1948). The entropy $H(X)$ of a variable $X$, which can take $m$ values $\{x_1, \ldots, x_m\}$, each value $x_i$ with a probability $p(X = x_i)$, is defined as follows:

$$H(X) = -\sum_{i=1}^{m} p(X = x_i) \log p(X = x_i)$$

The following property of entropy holds for any pair of stochastic variables $X$ and $Y$:

$$H(X, Y) = H(Y, X) = H(X \mid Y) + H(Y)$$

The mutual information $I(X, Y)$ is a measure of the dependency between two variables $X$ and $Y$:

$$I(X, Y) = H(X) - H(X \mid Y) = H(Y) - H(Y \mid X) = I(Y, X)$$

The following property of information holds for any pair of stochastic variables $X$ and $Y$:

$$0 \le I(X, Y) \le \min(H(X), H(Y))$$

Unlike the second order mutual information, the third order mutual information can be either positive or negative. The mutual information between three variables $(X, Y, Z)$ is defined as follows (Matsuda, 2000), with the signs written so as to agree with formula (1) below:

$$I(X, Y, Z) = H(X) + H(Y) + H(Z) - H(X, Y) - H(X, Z) - H(Y, Z) + H(X, Y, Z)$$

In the case of microarray-based classification, the mutual information between a gene $G_i$ and a class $C$ represents the information that the gene provides for classification. The higher the mutual information, the more informative the gene. For reasons of tractability and simplicity, expression levels are often discretized. The most straightforward and widely used approach relies on a histogram-based technique (Butte and Kohane, 2000). The data are partitioned into equal-width discrete bins, and the equations above may be used. The mutual information $I(G_i, C)$ can be expressed as follows:

$$I(G_i, C) = H(C) - H(C \mid G_i)$$

According to the above formula, the mutual information between a gene $G_i$ and the class $C$ can be seen as the reduction of the class entropy caused by the knowledge of $G_i$. In the same way, we define the mutual information between the class $C$ and a pair of genes $\{G_iG_j\}$ formed by the two genes $G_i$ and $G_j$:

$$I(G_iG_j, C) = I(G_i, C) + I(G_j, C) - I(G_i, G_j, C) \qquad (1)$$

Formula (1) can be proved by the following argument. The left-hand side of (1) can also be written:

$$I(G_iG_j, C) = H(C) - H(C \mid G_i, G_j) = H(C) - H(G_i, G_j, C) + H(G_i, G_j)$$

By introducing the terms $H(G_i, C)$, $H(G_j, C)$, $H(G_i)$, $H(G_j)$ and $H(C)$, we obtain:

$$
\begin{aligned}
I(G_iG_j, C) ={}& -H(G_i, C) + H(G_i) + H(C) \\
& - H(G_j, C) + H(G_j) + H(C) \\
& - H(G_i, G_j, C) + H(G_i, G_j) + H(G_i, C) + H(G_j, C) - H(G_i) - H(G_j) - H(C) \\
={}& I(G_i, C) + I(G_j, C) - I(G_i, G_j, C)
\end{aligned}
$$

Formula (1) shows that the information of a gene pair is the sum of the information of the first gene and the information of the second gene, minus the third order mutual information between $G_i$, $G_j$ and $C$. This last term represents the information provided to the classification by the association of the two genes. We call this term the interaction, which can be either positive or negative. In the case of a positive interaction, the information of the gene pair is lower than the sum of the information of the two genes. In this case part of the information provided by the genes is similar, and therefore we may speak of redundancy between the two genes. On the contrary, when the interaction is negative the information of the gene pair is higher than the sum of the information of the two genes, which means that the association of the two genes provides new information. We speak then of synergy between the genes (Jakulin and Bratko, 2003). In the example of Figure 1, the information contained in the expression of gene Hsa.1221 is 0.20, whereas it is 0.03 for gene Hsa.9025, and −0.27 for their synergic interaction. The information of the pair formed by these two genes is 0.50 = 0.20 + 0.03 − (−0.27).

It should be underlined that the mutual information is computed from the probability distributions of the gene expression (Steuer et al., 2002). However, because the real probabilities are unknown and can only be estimated from limited data, we have conducted a set of experiments (details of these experiments can be found on the companion website), which show that the estimated mutual information is accurate enough to identify the most informative pairs of genes.

4 REDUCING DIMENSIONALITY USING SYNERGIES

In this section, we describe a new dimension reduction method, called FeatKNN, based on the use of synergic pairs of genes. The most informative gene pairs are identified using a sequential forward search (SFS) procedure. Then, for each gene pair, a new feature, which summarizes the information contained in the pair, is constructed.

4.1 The search for the most informative pairs of genes

The naive approach to finding informative pairs of genes consists in computing the mutual information with respect to the class of all the $N(N-1)/2$ different pairs, where $N$ is the number of genes, and then selecting the $p$ best ones. However, this approach has a complexity $O(N^2)$; in the context of microarray data, where the number of genes is of the order of several thousands, this solution is often computationally infeasible. But, even in the case where the complexity may not be a problem, there are two more reasons why this simple approach might be unsuitable. The first is that we would like to select pairs that not only provide high information for the classification, but also provide more information than that brought by the single genes (i.e. the genes should interact negatively). In fact, only those genes that satisfy this property are interesting in the context of this work, which is based on the assumption (grounded on biological findings) that synergic gene interaction is important for improving classification.

The second reason is that correlated attributes usually provide duplicated (redundant) information; in learning, it is well known that it is preferable to exploit sources of information that are as diverse and independent as possible. Preliminary experiments showed that in some rare cases a dataset may contain an exceptionally informative gene, such that it forms a 'good' pair when coupled with a large number of the other genes. This situation is undesirable, and we want to avoid it.

In order to face the above problems, we propose in this article a simple search algorithm, guided by a powerful heuristic, which allows the $p$ most informative pairs of genes to be found with a complexity $O(pN)$. The search proceeds as follows: in the beginning, the mutual information $I(G_i, C)$ between each single gene $G_i$ and the class $C$ is computed, and the most informative gene $G_{i^*}$ is selected. Then, the mutual information $I(G_{i^*}G_j, C)$ between the class and each pair of genes that includes the gene $G_{i^*}$ is computed. The gene pair $(G_{i^*}G_{j^*})$ that maximizes $I(G_{i^*}G_j, C)$ is selected, and the genes $G_{i^*}$ and $G_{j^*}$ are removed from the list of genes to be analyzed. This procedure is iterated $p$ times to obtain $p$ pairs of genes. The deletion of the selected genes from the list of the available ones is motivated by the goal of eliminating the redundant pairs mentioned earlier.

4.2 Feature construction from gene pairs

For each of the $p$ informative gene pairs $(G_iG_j)$, a new feature $A_{i,j}$ is constructed. The idea is the following: the higher the difference between the densities of the classes around a point, the higher the probability that this point belongs to the higher density class. The value of the new feature is the difference between the local densities of the classes.

More precisely, let $E = \{e_1, \ldots, e_M\}$ be a set of $M$ instances, each belonging to one of the two classes $\{C_a, C_b\}$. From an informative pair of genes $(G_iG_j)$ we construct a new feature $A_{i,j}$ as follows: the instances in $E$ are projected onto the two-dimensional space defined by the expressions of $G_i$ and $G_j$. We have to define the value $A_{i,j}(x)$ of the feature $A_{i,j}$ for every point $x$ of this space.
The probabilities p (x) and p (x) at point The observation of the values of the mutual information of a b x belongs either to class C or to class C , respectively, can the genes and pairs of genes leads to the conclusion that all the a b be approximated by using the k-nearest neighbors of best pairs include, on average, a highly informative gene. This x : p ðxÞ n ðxÞ=k and p (x)  1  p (x), where n (x) is the observation validates our assumption for microarray data. a a b a a number of points belonging to class C among the k-nearest Also, this suggests a positive answer (from an empirical point of neighbors of x. The value of the new feature A at the point x view) to the second question regarding the ability of our i, j is the difference between p (x) and p (x): heuristic to find highly informative gene pairs. It also indicates a b the usefulness of this heuristic choice for selecting pairs, which n ðxÞ n ðxÞ a a AðxÞ¼ p ðxÞ p ðxÞ ð1  Þ are formed starting from the gene with the best rank. It should a b k k be underlined that standard feature selection methods, dis- n ðxÞ ¼1 þ 2 regarding interactions, may miss many useful pairs, as it was already said in the introduction. The values of the new feature are between 1 and þ1. When an instance is close to intances belonging to class C (respectively, 5.2 Analysis of the most informative pairs C ), the feature tends to þ1 (respectively, 1). Our assumption is that the explicit account of the synergic It should be underlined that the definition of this new feature interaction between genes may improve classification accuracy. is the same as the margin defined by Shapire for the voting To validate this assumption we computed, for each dataset, the methods in machine learning (Schapire et al., 1997). The new information of all genes and all gene pairs; both genes and gene feature, that we have defined, represents the classification pairs have been ordered according to increasing rank. 
Figure 2 margin of the k nearest neighbor classifier in the space of the shows the decomposition of the information of the 100 best gene pairs. gene pairs of the six datasets. It can be seen that around 40% of Table 1. Description of the datasets 5 RESULTS AND DISCUSSION In order to test the effectiveness of the proposed method, an experimental study was designed and set up to answer the Dataset name Number Number of Class C Class C a b following questions: is our selection heuristic adapted to find of Genes Samples informative pairs of genes? What is the amount of information Leukemia 7129 72 47 25 contained in the interaction compared to that of the individual Colon cancer 2000 62 40 22 genes? Is our feature construction method effective for Prostate cancer 12 600 102 52 50 synthesizing the information contained in a pair of genes? SRBCT 6567 63 43 20 Does FeatKNN improve classification accuracy? Lung cancer 3588 43 22 21 Six public datasets are used in these experiments, their Breast cancer 7129 49 25 24 characteristics are described in Table 1. It shows the data type, the number of genes measured and the number of samples contained in each class. 5.1 Identification of the most informative pairs of genes Our method is based on the identification of the most informative gene pairs that is performed by SFS procedure. The choice of this SFS has been based on the assumption that the most informative pairs of genes include at least one of the most informative gene. To validate this assumption, we exam- ined the rank of the genes forming the best pairs. We defined a ranking of the genes based on their mutual information with the class. The gene with the highest mutual information has rank 1, and the one with the lowest mutual information has the lowest rank. In the same way, we compute the mutual information between the class and all gene pairs, and we defined a ranking of the gene pairs. 
Notice that here we do not select the best pairs using the SFS procedure, but we compute the mutual information for all exclusive pairs of genes. A figure on the companion website shows the average rank of the two genes forming the most informative gene pairs. It shows that the first genes of the best pairs were among the top informative genes, while the second genes have a much higher rank. For Fig. 2. Decomposition of the information contained in the best gene example, the two genes forming the 50 best pairs have on pairs. The black part shows the amount of information of the most average rank 58 and 209, respectively. The same results were informative genes. The dark grey part shows the amount of information observed on the other six datasets which are described in the of the second genes. The light grey part shows the amount of Table 1. information of the interaction. 2869 B.Hanczar et al. the information provided by the pairs of genes resides in the best model. All of the experiments have been performed with interaction of their components. For example, in the 100 most the statistical environment R, the numbers of neighbors k for informative gene pairs of the breast cancer dataset, 38% of the KNN was 3 and the SVM has been implemented using the information is provided by the best gene, 24% by the second package ‘e1071’ with a radial kernel. gene and 38% by their interaction. We have compared FeatKNN to other methods that are widely used in the literature and reach good performances. These dimension reduction methods are the following: 5.3 Information obtained by feature construction All genes: all the genes are used. The aim of feature construction is to synthesize the information contained in the genes and their interactions. Single MI: the genes with the highest mutual information In order to measure the effectiveness of our feature construc- with respect to the class are selected. 
tion method, we empirically compared the information contained in the most informative gene pairs (G_i, G_j) and in the associated newly constructed features A_{i,j}. For the p = 100 best gene pairs of the six datasets described above, the mutual information of the newly constructed attribute is, on average, higher than the information of the best gene in the pair in 90% of the cases. These results suggest that our feature construction method is effective for synthesizing the information contained in a gene pair, thus answering our third question.

5.4 Classification accuracy

To measure the impact of our dimension reduction method on the classification, we examined classification accuracy on the six datasets. The cross-validation estimator is the most commonly used error estimation method in microarray-based classification. We have used the 10-times 10-fold cross-validation procedure in our experiment to measure the generalization error. However, Braga-Neto and Dougherty (2004) have shown that this estimator is not the most appropriate one for small samples like the ones available in microarray analysis. Cross-validation has a high variance and therefore bootstrap estimators are preferred, in particular the 0.632 estimator (Efron, 1983). The 0.632 bootstrap estimator is a weighted sum of the empirical error and the out-of-bag bootstrap error. We have also used the 0.632 bootstrap estimator in our experiment to complete the results obtained by cross-validation. A total of 100 bootstrap iterations were performed. It should be noted that for the evaluation procedure (both cross-validation and bootstrap) the test samples were not used in the dimension reduction and classifier design. Thus, we avoided the problem of selection bias pointed out by Ambroise and McLachlan (2002) and Reunanen (2003). It should be underlined that the number of features used in classifier design is a meta-parameter, whose value is chosen by an internal cross-validation procedure. A figure representing the study design can be found on the companion website (Fig. 1).

In our experiments, each dimension reduction method is associated with three different classification algorithms: the support vector machines (SVM), k-nearest neighbors (KNN) and diagonal linear discriminant (DLD). We have chosen these algorithms because they are among the most efficient for microarray data classification. Dudoit et al. (2002) have pointed out the excellent results of the simplest methods like KNN and DLD. Furey et al. (2000) and Lee et al. (2005) have published a comparative study of the classification methods for microarray data, and they concluded that the SVMs are the

This method is commonly used in the literature (Ben-Dor et al., 2000; Wang et al., 2005).
Pair MI: the gene pairs with the highest mutual information with the class are selected (i.e. FeatKNN without the feature construction step). This method is tested to show the importance of feature construction.
BO: the gene pair-based method developed by Bo and Jonassen (2002).
Geman: the gene pair selection method used by Geman et al. (2004) in their TSP classifier.
PLS: new features are constructed as the components which maximize the covariance between the class and the variables (Dai et al., 2006).

In this article, we focus on the results obtained by 0.632 bootstrap; the results by cross-validation can be found on the companion website and lead to the same conclusions. Table 2 reports the classification error rates for the different reduction methods. We used a paired Wilcoxon test to compare the results. The details of the P-values can be found on the companion website.

It is not surprising to see that dimension reduction methods improve classification performance considerably. In all cases the performances reported in the column ‘All genes’ are worse than the others. The columns ‘Single MI’ and ‘Pair MI’ represent the results obtained when we select the genes (respectively, gene pairs) having the highest mutual information with respect to the class. We see that the methods using single genes (column ‘Single MI’) and pairs of genes (column ‘Pair MI’) obtain similar results. We have shown that the gene pairs were more informative than single genes. This may suggest that the information contained in the interaction between the genes composing the pairs is not well exploited by the classification algorithms, and therefore much of the information computed during the pair-selection phase is lost. This phenomenon is avoided in FeatKNN, thanks to the feature construction step. The new features constructed by FeatKNN synthesize the information contained in the genes and their interactions, which explains the better results. FeatKNN outperforms Geman’s method in all datasets with the three classifiers. Bo’s method is competitive, especially with the DLD classifier: 9 times out of 18 Bo’s results are as good as FeatKNN’s, and it outperforms FeatKNN on the SRBCT dataset with a DLD classifier. FeatKNN is statistically significantly (95% level) better than all other methods but PLS. PLS results are almost as good as FeatKNN’s. It is the only method that is not significantly worse than FeatKNN.

Table 2.
Classification results on six public datasets

| Classifier | Data | All genes | Single MI | Pair MI | FeatKNN | Bo | Geman | PLS |
|---|---|---|---|---|---|---|---|---|
| SVM | Leukemia | 12.3 ± 1.1 | 4.3 ± 1 | 4.8 ± 1.2 | 2.8 ± 1.0 | 3.9 ± 0.9 | 6.1 ± 1.1 | 2.4 ± 1.3 |
| SVM | Colon cancer | 17.5 ± 1.1 | 12.5 ± 1.3 | 11.8 ± 1.0 | 10.7 ± 1.1 | 13.9 ± 1.3 | 14.6 ± 1.0 | 11.1 ± 1 |
| SVM | Prostate cancer | 9.5 ± 1.1 | 6.1 ± 1.0 | 6 ± 1.0 | 6.0 ± 0.8 | 5.6 ± 0.9 | 6 ± 0.7 | 6.2 ± 0.6 |
| SVM | SRBCT | 7.6 ± 0.5 | 2.1 ± 0.4 | 3.8 ± 0.3 | 0.2 ± 0.2 | 0.2 ± 0.2 | 0.7 ± 0.3 | 1.7 ± 0.5 |
| SVM | Lung cancer | 24.9 ± 1.1 | 21.7 ± 0.8 | 21 ± 0.9 | 19.5 ± 1.0 | 21.5 ± 1.1 | 21 ± 1.0 | 20.7 ± 1.3 |
| SVM | Breast cancer | 14.6 ± 0.8 | 11.4 ± 1.1 | 11.2 ± 0.9 | 8.7 ± 1.0 | 11.4 ± 1.1 | 11.2 ± 1.0 | 9.7 ± 1 |
| KNN | Leukemia | 8.4 ± 1.1 | 6.1 ± 0.9 | 6.2 ± 1.0 | 5.0 ± 1.2 | 4.6 ± 1.2 | 6.3 ± 1.1 | 5.4 ± 0.9 |
| KNN | Colon cancer | 20.0 ± 1.2 | 14.9 ± 1.2 | 14.4 ± 1.0 | 12.8 ± 0.9 | 15.9 ± 1.0 | 16 ± 1.1 | 12.4 ± 1.1 |
| KNN | Prostate cancer | 20.2 ± 1.0 | 17.8 ± 0.8 | 18 ± 0.9 | 8.1 ± 0.7 | 8.7 ± 0.8 | 9.8 ± 1.0 | 8.5 ± 0.9 |
| KNN | SRBCT | 11.6 ± 0.7 | 1.1 ± 0.2 | 1 ± 0.5 | 0.1 ± 0.1 | 0.1 ± 0.1 | 0.1 ± 0.1 | 1.3 ± 0.4 |
| KNN | Lung cancer | 35 ± 0.7 | 29.1 ± 1.0 | 28.2 ± 0.9 | 20.8 ± 1.0 | 23.7 ± 1.3 | 24.2 ± 1.0 | 21.7 ± 1.4 |
| KNN | Breast cancer | 20.8 ± 0.7 | 14.3 ± 0.9 | 13.5 ± 1.0 | 9.0 ± 1.0 | 12 ± 1.1 | 13.1 ± 1.0 | 8.4 ± 1 |
| DLD | Leukemia | 11.5 ± 1.2 | 4.8 ± 1.2 | 4.8 ± 1.0 | 3.8 ± 1.1 | 4.1 ± 1.0 | 5.0 ± 1.1 | 2.7 ± 0.8 |
| DLD | Colon cancer | 19.5 ± 1.4 | 15.7 ± 1.2 | 15.4 ± 1.0 | 12.5 ± 1.0 | 14.4 ± 1.1 | 15 ± 1.3 | 12.9 ± 1.1 |
| DLD | Prostate cancer | 37.5 ± 1.0 | 10.5 ± 0.8 | 10.1 ± 0.9 | 7.6 ± 0.9 | 7.3 ± 1.0 | 8 ± 0.7 | 7.3 ± 1 |
| DLD | SRBCT | 5.4 ± 0.7 | 0.8 ± 0.2 | 0.5 ± 0.2 | 0.7 ± 0.2 | 0.2 ± 0.1 | 0.1 ± 0.1 | 2.5 ± 0.6 |
| DLD | Lung cancer | 25.4 ± 1.2 | 21.6 ± 0.8 | 22.1 ± 0.9 | 20.6 ± 1.0 | 20.3 ± 0.8 | 20.2 ± 1.0 | 20.2 ± 1.1 |
| DLD | Breast cancer | 14.9 ± 0.6 | 10.7 ± 1.0 | 10.9 ± 1.0 | 9.1 ± 1.1 | 9.3 ± 0.9 | 10.1 ± 0.9 | 9.6 ± 1 |

All errors are estimated using the 0.632 bootstrap estimator. Boldfaced values highlight the best results.
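The 0.632 bootstrap estimator behind the error rates in Table 2 is described in the text as a weighted sum of the empirical error and the out-of-bag bootstrap error. The following is a minimal sketch of that idea, not the authors' implementation. The function names (`bootstrap_632`, `error_rate`, `nearest_centroid_predict`) are hypothetical, and the nearest-centroid rule is only a stand-in for SVM, KNN or DLD. The weights 0.368 and 0.632 follow the usual definition of the estimator. Averaging the out-of-bag error per resample, and skipping resamples that miss a class, are simplifying assumptions.

```python
import random


def nearest_centroid_predict(train_x, train_y, test_x):
    """Stand-in classifier (assumption): assign each test point to the closest class centroid."""
    centroids = {}
    for label in sorted(set(train_y)):
        rows = [x for x, y in zip(train_x, train_y) if y == label]
        centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]

    def closest(x):
        return min(centroids,
                   key=lambda c: sum((a - b) ** 2 for a, b in zip(x, centroids[c])))

    return [closest(x) for x in test_x]


def error_rate(predict, train_x, train_y, test_x, test_y):
    """Fraction of test points misclassified by a model built on the training points only."""
    predictions = predict(train_x, train_y, test_x)
    return sum(p != y for p, y in zip(predictions, test_y)) / len(test_y)


def bootstrap_632(xs, ys, predict=nearest_centroid_predict, n_iter=100, seed=0):
    """0.632 bootstrap error: 0.368 * empirical error + 0.632 * out-of-bag error."""
    rng = random.Random(seed)
    n = len(xs)
    labels = set(ys)
    # Empirical (resubstitution) error: train and test on the full sample.
    resub = error_rate(predict, xs, ys, xs, ys)
    oob_errors = []
    for _ in range(n_iter):
        bag = [rng.randrange(n) for _ in range(n)]       # draw n indices with replacement
        in_bag = set(bag)
        out = [i for i in range(n) if i not in in_bag]   # samples never drawn
        # Simplification: skip resamples with no held-out points or a missing class.
        if not out or {ys[i] for i in bag} != labels:
            continue
        oob_errors.append(error_rate(predict,
                                     [xs[i] for i in bag], [ys[i] for i in bag],
                                     [xs[i] for i in out], [ys[i] for i in out]))
    if not oob_errors:
        raise ValueError("no usable bootstrap resample")
    oob = sum(oob_errors) / len(oob_errors)
    return 0.368 * resub + 0.632 * oob
```

Any dimension reduction step (gene or pair selection, feature construction) would have to run inside `predict`, on the training points only, so that held-out samples never influence it. That is the selection-bias precaution the text attributes to Ambroise and McLachlan (2002). A call such as `bootstrap_632(xs, ys, n_iter=100)` mirrors the 100 bootstrap iterations mentioned above.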
5.5 A biological interpretation of the concept of synergic transcript pairs

The exploration of the biological significance that may be embodied in the concept of synergic transcript pairs had to consider two distinct aspects: a more specific one, related to the clinical situations analyzed, and a more general one, regarding the type of biological interactions that may explain the synergic behavior exhibited by the genes belonging to the same pair. To answer these questions, we started by separating the two components of gene pairs and then we carried out a discriminative functional profiling of the two resulting lists of genes (i.e. a first pairs component list and a second pairs component list). Based on the functional assignments provided by the Gene Ontology (GO) consortium (http://www.geneontology.org), and by the NCBI genomic repository (http://www.ncbi.nlm.nih.gov), an automated annotation procedure combined with a gene set enrichment analysis allowed us to identify biological themes significantly overrepresented in each of the two lists of genes.

Figure 3 shows overrepresented biological themes characterizing each of the two components of the first 100 most informative pairs extracted from the colon cancer dataset, which resulted from microarray experiments performed to compare expression profiles of tumor and normal colon tissues. The functional profiles depicted in Figure 3 seem to indicate highly distinct biological assignments for the two components of gene pairs. Thus, while the first pair component seems to be related mostly to intracellular processes located either in the nucleus (i.e. nucleus, nuclear part) or in the cytoplasm (i.e. intracellular part, intracellular non-membrane-bound organelle, cytoskeletal part), the second pair component appears to be involved exclusively in cell membrane-related processes (i.e. plasma membrane, membrane part, intracellular membrane-bound organelle). Moreover, these findings seem to be well supported by the molecular functions assigned to the translation products of these genes, which were found to be related to nucleotide binding, structural molecule activity and transcriptional activator activity for the first pairs component, and to transmembrane receptor activity and transporter activity for the second pairs component.

Fig. 3. Overrepresented biological themes discriminating the two components of the first 100 most informative synergic transcript pairs extracted from the colon cancer dataset (see text for details).

Considering the experimental framework from which this dataset resulted, these functional profiles seem to indicate that the biological themes that best distinguish tumoral from normal colon cells concern essentially the nuclear transcriptional control of a large panel of intracellular processes (i.e. cell differentiation, proliferation, metabolism, apoptosis, extracellular matrix production, etc.) on one side, and the cell communication and extracellular signaling modulated by cell membrane structures (i.e. involved in processes such as focal adhesion, cell attachment and migration, growth factor receptor expression and signaling, etc.) on the other.

These findings are in total agreement with the most up-to-date understanding of tumor biology. Indeed, it is well acknowledged that tumor cell survival is dictated both by internal properties of the cell, such as the status of components of the apoptotic machinery, and by its extracellular environment, such as extracellular matrix and growth factor receptor expression and signaling (Dennis and Kastan, 1998), which modulate apoptosis regulation. Apoptotic anomalies represent a major distinction between tumor and normal cells, as they allow tumor cells to avoid programmed cellular death and confer on them the capacity to proliferate indefinitely. Some of these anomalies, which involve the nuclear compartment, may result in an inactivation of key tumor suppressor genes (TSGs), acknowledged as being central to the development of all forms of human cancer (Rhee et al., 2002). This inactivation is induced by a combination of epigenetic silencing and promoter hypermethylation of various TSGs. On the other hand, the importance of extracellular survival signals as key regulators of apoptosis is now being recognized through the ability of growth factors (GFs), GF receptors (GFRs) and GFR signaling to promote cellular survival. Indeed, recent evidence suggests that a number of abnormal cell membrane constituents may disrupt apoptosis regulation in tumor cells by pathologically amplifying the anti-apoptotic effect of normal extracellular survival signals (Leask and Abraham, 2006). All this evidence suggests that the concept of synergic gene pairs not only uncovered a noteworthy functional behavior singularizing tumor cells, but also captured the synergic aspect of the interaction between underlying biological mechanisms.

6 CONCLUSION

In this article, we have presented a dimension reduction procedure for microarray data oriented towards improving classification performance. This method is based on the idea that the information provided by the interaction between genes cannot be ignored in the feature selection phase. We have proposed a decomposition of the information contained in the gene pairs. Although it is natural to quantify information from genes and interactions from the computation of mutual information, this simple reduction does not necessarily improve performance. Therefore, we have developed a feature construction method that forces learning algorithms to take into account pairs with a high level of information. The usefulness of this approach was experimentally assessed on six datasets and yielded a significant improvement in performance. Moreover, a synthetic assessment of the biological significance of the concept of synergic gene pairs suggested its ability to uncover relevant mechanisms underlying interactions among various cellular processes.

Conflict of Interest: none declared.

REFERENCES

Ambroise,C. and McLachlan,G. (2002) Selection bias in gene extraction on the basis of microarray gene expression data. Proc. Natl Acad. Sci. USA, 99, 6562–6566.
Ben-Dor,A. et al. (2000) Scoring genes for relevance. Technical report AGL-2000-13, Agilent Technologies. Institute of Computer Science, Hebrew University, Jerusalem.
Bo,T. and Jonassen,I. (2002) New feature subset selection procedures for classification of expression profiles. Genome Biology, 3, research0017.1–research0017.11.
Braga-Neto,U. and Dougherty,E. (2004) Is cross-validation valid for small-sample microarray classification? Bioinformatics, 20, 374–380.
Butte,A.J. and Kohane,I.S. (2000) Mutual information relevance networks: functional genomic clustering using pairwise entropy measurements. Pac. Symp. Biocomput., 418–429.
Dai,J. et al. (2006) Dimension reduction for classification with gene expression microarray data. Stat. Appl. Genet. Mol. Biol., 5, article 6.
Dennis,P.A. and Kastan,M.B. (1998) Cellular survival pathways and resistance to cancer therapy. Drug Resist. Updat., 1, 301–309.
Ding,C. and Peng,H. (2003) In Proceedings of the IEEE Computer Society Conference on Bioinformatics. Stanford, CA, USA, pp. 523–529.
Dudoit,S. et al. (2002) Comparison of discrimination methods for classification of tumors using gene expression data. J. Am. Stat. Assoc., 97, 77–87.
Efron,B. (1983) Estimating the error rate of a prediction rule: improvement on cross-validation. J. Am. Stat. Assoc., 78, 316–331.
Furey,T. et al. (2000) Support vector machine classification and validation of cancer tissue samples using microarray expression data. Bioinformatics, 16, 906–914.
Geman,D. et al. (2004) Classifying gene expression profiles from pairwise mRNA comparisons. Stat. Appl. Genet. Mol. Biol., 3, article 19.
Hanczar,B. et al. (2003) Improving classification of microarray data using prototype-based feature selection. SIGKDD Explor., 5, 23–30.
Jakulin,A. and Bratko,I. (2003) Analyzing attribute dependencies. In Proceedings of the 7th European Conference on Principles and Practice of Knowledge Discovery in Databases (PKDD), pp. 229–240.
Leask,A. and Abraham,D.J. (2006) All in the CCN family: essential matricellular signaling modulators emerge from the bunker. J. Cell. Sci, 119, 4803–4810.
Lee,J. et al. (2005) An extensive comparison of recent classification tools applied to microarray data. Comput. Stat. Data Analy., 48, 869–885.
Matsuda,H. (2000) Physical nature of higher-order mutual information: intrinsic correlations and frustration. Phys. Rev. E, 62, 3096–3102.
Rapaport,F. et al. (2007) Classification of microarray data using gene networks. BMC Bioinformatics, 8.
Reunanen,J. (2003) Overfitting in making comparisons between variable selection methods. J. Mach. Learn. Res., 3, 1371–1382.
Rhee,I. et al. (2002) DNMT1 and DNMT3b cooperate to silence genes in human cancer cells. Nature, 416, 552–556.
Schapire,R.E. et al. (1997) Boosting the margin: a new explanation for the effectiveness of voting methods. In Proceedings 14th International Conference on Machine Learning. Morgan Kaufmann, Nashville, TN, USA, pp. 322–330.
Shannon,E. (1948) A mathematical theory of communication. Bell Sys. Tech. J., 27, 623–656.
Steuer,R. et al. (2002) The mutual information: detecting and evaluating dependencies between variables. Bioinformatics, 18, 231–240.
Wang,Y. et al. (2005) Gene selection from microarray data for cancer classification – a machine learning approach. Comput. Biol. Chem., 29, 37–46.

Journal

Bioinformatics, Oxford University Press

Published: Oct 9, 2007
