Support vector machine classification and validation of cancer tissue samples using microarray expression data

Bioinformatics, Vol. 16 no. 10 2000, Pages 906–914

Terrence S. Furey 1,*, Nello Cristianini 2, Nigel Duffy 1, David W. Bednarski 3, Michèl Schummer 3 and David Haussler 1

1 Department of Computer Science, University of California, Santa Cruz, Santa Cruz, CA 95064, USA, 2 Department of Engineering Mathematics, University of Bristol, Bristol, BS8 1TH, UK and 3 Department of Molecular Biotechnology, University of Washington, Seattle, WA 98195, USA

* To whom correspondence should be addressed.

Received on April 4, 2000; accepted on May 19, 2000

Abstract

Motivation: DNA microarray experiments generating thousands of gene expression measurements are being used to gather information from tissue and cell samples regarding gene expression differences that will be useful in diagnosing disease. We have developed a new method to analyse this kind of data using support vector machines (SVMs). This analysis consists of both classification of the tissue samples, and an exploration of the data for mis-labeled or questionable tissue results.

Results: We demonstrate the method in detail on samples consisting of ovarian cancer tissues, normal ovarian tissues, and other normal tissues. The dataset consists of expression experiment results for 97 802 cDNAs for each tissue. As a result of computational analysis, a tissue sample is discovered and confirmed to be wrongly labeled. Upon correction of this mistake and the removal of an outlier, perfect classification of tissues is achieved, but not with high confidence. We identify and analyse a subset of genes from the ovarian dataset whose expression is highly differentiated between the types of tissues. To show robustness of the SVM method, two previously published datasets from other types of tissues or cells are analysed. The results are comparable to those previously obtained. We show that other machine learning methods also perform comparably to the SVM on many of those datasets.

Availability: The SVM software is available at http://www.cs.columbia.edu/~bgrundy/svm.

Contact: [email protected]

Introduction

Microarray expression experiments allow the recording of expression levels of thousands of genes simultaneously. These experiments primarily consist of either monitoring each gene multiple times under many conditions (Spellman et al., 1998; Chu et al., 1998; DeRisi et al., 1997; Wen et al., 1998; Roberts et al., 2000), or alternately evaluating each gene in a single environment but in different types of tissues, especially cancerous tissues (DeRisi et al., 1996; Alon et al., 1999; Golub et al., 1999; Perou et al., 1999; Zhu et al., 1998; Wang et al., 1999; Schummer et al., 1999; Zhang et al., 1997; Slonim et al., 2000). Those of the first type have allowed for the identification of functionally related genes due to common expression patterns (Brown et al., 2000; Eisen et al., 1998; Wen et al., 1998; Roberts et al., 2000), while the latter experiments have shown promise in classifying tissue types (diagnosis) and in the identification of genes whose expressions are good diagnostic indicators (Golub et al., 1999; Alon et al., 1999; Slonim et al., 2000). In order to extract information from gene expression measurements, different methods have been employed to analyse this data, including SVMs (Brown et al., 2000; Mukherjee et al., 1999), clustering methods (Eisen et al., 1998; Spellman et al., 1998; Alon et al., 1999; Perou et al., 1999; Ben-Dor et al., 2000; Hastie et al., 2000), self-organizing maps (Tamayo et al., 1999; Golub et al., 1999), and a weighted correlation method (Golub et al., 1999; Slonim et al., 2000).

Support vector machines (SVMs), a supervised machine learning technique, have been shown to perform well in multiple areas of biological analysis, including evaluating microarray expression data (Brown et al., 2000), detecting remote protein homologies (Jaakkola et al., 1999), and recognizing translation initiation sites (Zien et al., 2000). We have also recently become aware of another effort that uses SVMs in analyzing expression data (Mukherjee et al., 1999). SVMs have demonstrated the ability not only to correctly separate entities into appropriate classes, but also to identify instances whose established classification is not supported by the data. Expression datasets contain measurements for thousands of genes, which proves problematic for many traditional methods. SVMs, though, are well suited to working with high dimensional data such as this.
Here a systematic and principled method is introduced that analyses microarray expression data from thousands of genes tested in multiple tissue or cell samples. The primary goal is the proper classification of new samples. We do this by training the SVM on samples classified by experts, then testing the SVM on samples it has not seen before. We demonstrate how SVMs can not only classify new samples, but can also help in the identification of those which have been wrongly classified by experts. SVMs are not unique among classification methods in this regard, but we show they are effective. Our method is demonstrated in detail on data from experiments involving 31 ovarian cancer, normal ovarian and other normal tissues. We are able to identify one tissue sample as mis-labeled, and another as an outlier, which is shown in the Results section and illustrated in Figure 1. Though perfect classification is finally achieved in one instance, this performance is not consistently shown in multiple tests and, therefore, cannot be considered too significant.

We also experimented with the method introduced in Golub et al. (1999) to focus the analysis on a smaller subset of genes that appear to be the best diagnostic indicators. This amounts to a kind of dimensionality reduction on the dataset. If one can identify particular genes that are diagnostic for the classification one is trying to make, e.g. the presence of cancer, then there is also hope that some of these genes may be found to be of value in further investigations of the disease and in future therapies. Here we find that this dimensionality reduction does not significantly improve classification performance. It does reveal some genes that may be of interest in ovarian cancer. However, further work needs to be carried out to identify the most effective feature selection/dimensionality reduction methods for this kind of data.

To test the generality of the approach, we also ran experiments using leukemia data from Golub et al. (1999) (72 patient samples) and colon tumor data from Alon et al. (1999) (62 tissue samples). Our results are comparable to other methods used by the authors of those papers. Since no special effort was made to tune the method to these other datasets, this increases our confidence that our approach will have broad applications in analyzing data of this type.

It is difficult to show that one diagnostic method is significantly better than another with small data sets such as those we have examined. We have conducted a full hold-one-out cross-validation (jackknife) evaluation of the classification performance of the methods we tested. These include both SVM methods and variants of the perceptron algorithm. No single classification technique has proven to be significantly superior to all others in the experiments we have performed. Indeed, the different kernels we tried performed nearly equally well, and variations of the perceptron algorithm are shown to perform comparably to the SVM on all tests. It is unfortunate that typical diagnostic gene expression datasets today involve only a few tissue samples. As more datasets become available with larger numbers of samples, we predict that our method will continue to demonstrate good performance.

Methods

In recent years, several methods have been developed for performing gene expression experiments. Measurements from these experiments can give expression levels for genes (or ESTs) in tissue or cell samples. For more in-depth discussions of these techniques, see Lockhart et al. (1996) and Schummer et al. (1999). Datasets used for our experiments consist of a relatively small number of tissue samples (less than 100), each with expression measurements for thousands of genes.
Previous methods used in the analysis of similar datasets start with a procedure to extract the most relevant features. Most learning techniques do not perform well on datasets where the number of features is large compared to the number of examples. SVMs are believed to be an exception. We are able to begin with tests using the full dataset, and systematically reduce the number of features, selecting those we believe to be the most relevant. In this way, we can show whether an improvement is made using smaller sets, thus indicating whether these contain the most meaningful genes.

To understand our method, a familiarity with SVMs is required, and a brief introduction follows. We explain below how we rank the features, and present an outline of how we use the SVM to perform classification and error detection.

Support vector machines

SVMs (Cristianini and Shawe-Taylor, 2000) are a relatively new type of learning algorithm, originally introduced by Vapnik and co-workers (Boser et al., 1992; Vapnik, 1998) and successively extended by a number of other researchers. Their remarkably robust performance with respect to sparse and noisy data is making them the system of choice in a number of applications from text categorization to protein function prediction.

When used for classification, they separate a given set of binary labeled training data with a hyper-plane that is maximally distant from them (known as 'the maximal margin hyper-plane'). For cases in which no linear separation is possible, they can work in combination with the technique of 'kernels', which automatically realizes a non-linear mapping to a feature space. The hyper-plane found by the SVM in feature space corresponds to a non-linear decision boundary in the input space.
Let the jth input point $x^j = (x^j_1, \ldots, x^j_n)$ be the realization of the random vector $X^j$. Let this input point be labeled by the random variable $Y^j \in \{-1, +1\}$. Let $\phi : I \subseteq \mathbb{R}^n \to F \subseteq \mathbb{R}^N$ be a mapping from the input space $I \subseteq \mathbb{R}^n$ to a feature space $F$. Let us assume that we have a sample $S$ of $m$ labeled data points: $S = \{(x^1, y^1), \ldots, (x^m, y^m)\}$. The SVM learning algorithm finds a hyper-plane $(w, b)$ such that the quantity

    $\gamma = \min_i \, y^i \{\langle w, \phi(x^i) \rangle - b\}$    (1)

is maximized, where $\langle \cdot , \cdot \rangle$ denotes an inner product, the vector $w$ has the same dimensionality as $F$, $\|w\|$ is held constant, $b$ is a real number, and $\gamma$ is called the margin. The quantity $(\langle w, \phi(x^i) \rangle - b)$ corresponds to the distance between the point $x^i$ and the decision boundary. When multiplied by the label $y^i$, it gives a positive value for all correct classifications and a negative value for the incorrect ones. The minimum of this quantity over all the data is positive if the data is linearly separable, and is called the margin. Given a new data point $x$ to classify, a label is assigned according to its relationship to the decision boundary, and the corresponding decision function is

    $f(x) = \mathrm{sign}(\langle w, \phi(x) \rangle - b)$.    (2)

It is easy to prove (Cristianini and Shawe-Taylor, 2000) that, for the maximal margin hyper-plane,

    $w = \sum_{i=1}^{m} \alpha_i y^i \phi(x^i)$    (3)

where the $\alpha_i$ are positive real numbers that maximize

    $\sum_{i=1}^{m} \alpha_i - \sum_{i,j=1}^{m} \alpha_i \alpha_j y^i y^j \langle \phi(x^i), \phi(x^j) \rangle$    (4)

subject to

    $\sum_{i=1}^{m} \alpha_i y^i = 0, \quad \alpha_i > 0$.    (5)

The decision function can equivalently be expressed as

    $f(x) = \mathrm{sign}\left( \sum_{i=1}^{m} \alpha_i y^i \langle \phi(x^i), \phi(x) \rangle - b \right)$.    (6)

From this equation it is possible to see that the $\alpha_i$ associated with the training point $x^i$ expresses the strength with which that point is embedded in the final decision function. A remarkable property of this alternative representation is that often only a subset of the points will be associated with non-zero $\alpha_i$. These points are called support vectors and are the points that lie closest to the separating hyper-plane. The sparseness of the $\alpha$ vector has several computational and learning theoretic consequences.

Notice that for a test point $(x, y)$, the quantity $y \left( \sum_{i=1}^{m} \alpha_i y^i \langle \phi(x^i), \phi(x) \rangle - b \right)$ is negative if the prediction of the machine is wrong, and a large negative value would indicate that the point $(x, y)$ is regarded by the algorithm as 'different' from the training data.

The matrix $K_{ij} = \langle \phi(x^i), \phi(x^j) \rangle$ is called the kernel matrix and will be particularly important in the extensions of the algorithm that will be discussed later. In the case when the data are not linearly separable, one can use more general functions, $K_{ij} = K(x^i, x^j)$, that provide non-linear decision boundaries. Two classical choices are polynomial kernels $K(x^i, x^j) = (\langle x^i, x^j \rangle + 1)^d$ and Gaussian kernels $K(x^i, x^j) = e^{-\|x^i - x^j\|^2/\sigma^2}$, where $d$ and $\sigma$ are kernel parameters. In our experiments, we use $K(x^i, x^j) = \langle x^i, x^j \rangle + 1$.
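The dual form (6) is what a practical implementation evaluates: the decision value of a new point is a weighted sum of its kernel values with the support vectors. The following minimal sketch, which uses scikit-learn rather than the software used in this paper, recovers exactly this expression from a fitted linear SVM; scikit-learn stores $\alpha_i y^i$ in dual_coef_ and adds the offset rather than subtracting it as in (6). The data and names here are purely illustrative.

```python
# Sketch only: reconstructing the dual decision function (eq. 6) from a fitted SVM.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(1)
X = rng.normal(size=(40, 5))                      # toy stand-in for expression vectors
y = np.where(X[:, 0] + 0.3 * X[:, 1] > 0, 1, -1)  # toy +1/-1 labels

clf = SVC(kernel="linear", C=10.0).fit(X, y)

def decision(x):
    # sum_i alpha_i y^i <x^i, x> + intercept; note the paper writes the offset as -b
    return clf.dual_coef_[0] @ (clf.support_vectors_ @ x) + clf.intercept_[0]

x_new = rng.normal(size=5)
print(np.sign(decision(x_new)), clf.predict(x_new.reshape(1, -1))[0])  # should agree
```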
In the presence of noise, the standard maximum margin algorithm described above can be subject to over-fitting, and more sophisticated techniques are necessary. This problem arises because the maximum margin algorithm always finds a perfectly consistent hypothesis and does not tolerate training error. Sometimes, however, it is necessary to trade some training accuracy for better predictive power. The need for tolerating training error has led to the development of the soft-margin and the margin-distribution classifiers (Cortes and Vapnik, 1995). One of these techniques (Shawe-Taylor and Cristianini, 1999) replaces the kernel matrix in the training phase as follows:

    $K \leftarrow K + \lambda \mathbf{1}$,    (7)

while still using the standard kernel function in the decision phase (6). We call $\lambda$ the diagonal factor. By tuning $\lambda$, one can control the training error, and it is possible to prove that the risk of misclassifying unseen points can be decreased with a suitable choice of $\lambda$ (Shawe-Taylor and Cristianini, 1999).

If instead of controlling the overall training error one wants to control the trade-off between false positives and false negatives, it is possible to modify $K$ as follows:

    $K \leftarrow K + \lambda D$,    (8)

where $D$ is a diagonal matrix whose entries are either $d^+$ or $d^-$, in locations corresponding to positive and negative examples. It is possible to prove that this technique is equivalent to controlling the size of the $\alpha_i$ in a way that depends on the size of the class, introducing a bias for larger $\alpha_i$ in the class with smaller $d$. This in turn corresponds to an asymmetric margin; i.e. the class with smaller $d$ will be kept further away from the decision boundary (Brown et al., 2000). In the case of imbalanced data sets, choosing $d^+$ and $d^-$ according to the respective class cardinalities provides a heuristic way to automatically adjust the relative importance of the two classes.

The experiments presented in this paper were performed using a freely available implementation of the SVM classifier which can be obtained at http://www.cs.columbia.edu/~bgrundy/svm. This implementation is based on that described in Jaakkola et al. (1999) and differs slightly from the above explanation in that it does not include a bias term, b, forcing all decision boundaries to contain the origin in feature space.
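As a concrete illustration of the diagonal factor, the sketch below trains on a kernel matrix with $\lambda$ added to its diagonal while evaluating test points with the unmodified kernel, mirroring equations (7) and (6). It uses scikit-learn's precomputed-kernel interface rather than the SVM package used in the paper (which, as noted above, also omits the bias term), so it only approximates the authors' setup; the function names and the large C value standing in for a hard margin are assumptions.

```python
# Minimal sketch, not the authors' implementation.
import numpy as np
from sklearn.svm import SVC

def kernel(A, B):
    """Dot-product kernel with the +1 offset used in the paper."""
    return A @ B.T + 1.0

def fit_with_diagonal_factor(X_train, y_train, lam=2.0):
    K = kernel(X_train, X_train)
    K[np.diag_indices_from(K)] += lam        # K <- K + lambda * 1, training phase only
    clf = SVC(kernel="precomputed", C=1e6)   # large C approximates a hard margin
    clf.fit(K, y_train)
    return clf

def margins(clf, X_train, X_test):
    """Decision values of test points, computed with the unmodified kernel (eq. 6)."""
    return clf.decision_function(kernel(X_test, X_train))

# For the asymmetric variant (eq. 8) one would instead add lam * d_plus or
# lam * d_minus to each training point's diagonal entry according to its class.
```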
Feature selection

Our feature selection criterion is essentially that used in Golub et al. (1999) and Slonim et al. (2000). We start with a dataset $S$ consisting of $m$ expression vectors $x^i = (x^i_1, \ldots, x^i_n)$, $1 \le i \le m$, where $m$ is the number of tissue or cell samples and $n$ is the number of genes measured. Each sample is labeled with $Y \in \{+1, -1\}$ (e.g. cancer vs normal). For each gene $x_j$, we calculate the mean $\mu^+_j$ (resp. $\mu^-_j$) and standard deviation $\sigma^+_j$ (resp. $\sigma^-_j$) using only the tissues labeled +1 (resp. -1). We want to find genes that will help discriminate between the two classes, therefore we calculate a score

    $F(x_j) = \dfrac{|\mu^+_j - \mu^-_j|}{\sigma^+_j + \sigma^-_j}$    (9)

which gives the highest score to those genes whose expression levels differ most on average in the two classes, while also favoring those with small deviations in scores in the respective classes. We then simply take the genes with the highest $F(x_j)$ scores as our top features. (This score is closely related to the Fisher criterion score for the jth feature, $F(j) = (\mu^+_j - \mu^-_j)^2 / ((\sigma^+_j)^2 + (\sigma^-_j)^2)$; Bishop, 1995.)
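A minimal sketch of this ranking follows, assuming the expression matrix is arranged samples x genes with labels in {+1, -1}; the function names are illustrative and not taken from the paper's software.

```python
import numpy as np

def f_scores(X, y):
    """Score of equation (9) for every gene; X is (samples, genes), y in {+1, -1}."""
    pos, neg = X[y == 1], X[y == -1]
    num = np.abs(pos.mean(axis=0) - neg.mean(axis=0))
    den = pos.std(axis=0) + neg.std(axis=0)
    return num / den

def top_features(X, y, k=50):
    """Indices of the k genes with the highest F scores."""
    return np.argsort(f_scores(X, y))[::-1][:k]
```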
Complete SVM method

The complete SVM method can be described as follows: we begin by choosing a kernel, starting with the simple dot-product kernel, and tune the diagonal factor to achieve the best performance on hold-one-out cross-validation tests using the full dataset. The SVM tuning procedure is then repeated with a specified number of the top-ranked features. In these cases, for each individual hold-one-out test, the features are ranked based on (9) using the scores from only the known samples, some number of the top features are extracted, and these are then used to train the SVM and classify the unknown sample. Examples which have been consistently misclassified in all tests are identified. These examples can then be investigated by the biologist, and if it is determined that the original label is incorrect, a correction is made, and the process is repeated. Alternatively, an example may be deemed an outlier that is very different from the rest, and is therefore removed.

In the SVM tests reported here, only the simple dot-product kernel is used. A more complex kernel is not required. As possibly more complex datasets become available providing more examples, higher-order kernels may become necessary (Mukherjee et al., 1999).
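The sketch below puts the pieces together as one reading of this protocol: for every held-out sample, genes are ranked on the remaining samples only, an SVM is trained on the top-ranked genes, and samples that come back misclassified in every configuration are flagged for review. It substitutes scikit-learn's linear SVC for the software used in the paper (so the diagonal-factor handling differs), and all names and parameter values are illustrative.

```python
import numpy as np
from sklearn.svm import SVC

def rank_genes(X, y):
    """Equation (9), computed from the training samples only (small guard added)."""
    pos, neg = X[y == 1], X[y == -1]
    score = np.abs(pos.mean(0) - neg.mean(0)) / (pos.std(0) + neg.std(0) + 1e-12)
    return np.argsort(score)[::-1]

def hold_one_out_errors(X, y, n_features=50):
    """Boolean array marking samples misclassified when held out."""
    m = len(y)
    wrong = np.zeros(m, dtype=bool)
    for i in range(m):
        train = np.arange(m) != i
        genes = rank_genes(X[train], y[train])[:n_features]
        clf = SVC(kernel="linear", C=1.0).fit(X[train][:, genes], y[train])
        wrong[i] = clf.predict(X[[i]][:, genes])[0] != y[i]
    return wrong

# Samples flagged across several feature-set sizes and diagonal-factor settings
# would be the candidates for re-examination (mis-labeled or outlier), as was
# the case for N039 and HWBC3 in the ovarian data described below.
```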
Results

Our method is tested in detail using a previously unpublished ovarian tissue dataset. A short analysis of the feature selection is included. To demonstrate the generality of our method, we also performed experiments using previously published datasets. The first dataset contains examples of patients with human acute leukemia, originally analysed by Golub et al. (1999) with further results reported by Slonim et al. (2000); the dataset can be obtained at http://waldo.wi.mit.edu/MPR/cancer class.html. The second dataset is comprised of human tumor and normal colon tissues, originally analysed by Alon et al. (1999); this data is available on their website, http://www.molbio.princeton.edu/colondata.

Ovarian dataset

Microarray expression experiments are performed using 97 802 DNA clones, each of which may or may not correspond to human genes, for 31 tissue samples. These samples are either cancerous ovarian tissue, normal ovarian tissue, or normal non-ovarian tissue. For the purpose of these experiments, the two types of normal tissue are considered together as a single class. The expression values for each of the genes are normalized such that the distribution over the samples has a zero mean and unit variance.

Hold-one-out cross-validation experiments are performed. The SVM is trained using data from all but one of the tissue samples. The sample not used in training is then assigned a class by the SVM. A single SVM experiment consists of a series of hold-one-out experiments, each sample being held out and tested exactly once.
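A small sketch of the per-gene normalization just described (and of the analogous step used later for the leukemia data) follows: each gene is standardized to zero mean and unit variance across the samples. The samples-by-genes orientation is an assumption made only for illustration.

```python
import numpy as np

def normalize_genes(X):
    """Standardize each gene (column) across the samples (rows); assumes no constant genes."""
    return (X - X.mean(axis=0)) / X.std(axis=0)
```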
Initially, experiments are carried out using all expression scores with diagonal factor settings of 0, 2, 5 and 10. (We use the default values set in the software except for the diagonal factor, which varies, the convergence threshold, which we set to $10^{-11}$, and the 'noconstraint' option.) The genes are then ranked in the manner described previously, and datasets consisting of the top 25, 50, 100, 500 and 1000 features are created. Experiments using similar diagonal factors to those above are performed using these smaller feature sets. (We also experimented with polynomial and radial basis kernels on the ovarian data, and found that on data containing the mis-labeled point they performed worse than the linear kernel, while on the correctly labeled data performance is similar to the linear kernel.) Table 1 displays the most significant results from these experiments. The best classification is done using the top 50 features with a diagonal factor of 2 or 5. Though the smaller datasets achieve slightly better scores compared to using all features, we do not believe this improvement to be significant.

Table 1. Error rates for ovarian cancer tissue experiments

Kernel        DF   Features   FP   FN   TP   TN
Dot-product    0       25      5    4   10   12
Dot-product    2       25      5    2   12   12
Dot-product    5       25      4    2   12   13
Dot-product   10       25      4    2   12   13
Dot-product    0       50      4    2   12   13
Dot-product    2       50      3    2   12   14
Dot-product    5       50      3    2   12   14
Dot-product   10       50      3    2   12   14
Dot-product    0      100      4    3   11   13
Dot-product    2      100      5    3   11   12
Dot-product    5      100      5    3   11   12
Dot-product   10      100      5    3   11   12
Dot-product    0   97 802     17    0   14    0
Dot-product    2   97 802      9    2   12    8
Dot-product    5   97 802      7    3   11   10
Dot-product   10   97 802      5    3   11   12

For each setting of the SVM, consisting of a kernel and diagonal factor (DF), each tissue was classified. The Features column gives the number of features (clones) used. Reported are the number of normal tissues misclassified (FP), tumor tissues misclassified (FN), tumor tissues classified correctly (TP), and normal tissues classified correctly (TN).

An analysis of the misclassified examples reveals that one normal ovarian tissue sample, N039, is misclassified in all instances. In addition, the margin of misclassification, calculated using (6), is relatively large, meaning the SVM strongly believes it to be cancerous. Figure 1 shows classification margins for experiments using the top 50 features and a diagonal factor of 2. Upon investigation, it is discovered that this tissue had been mistakenly labeled and is, in fact, cancerous.

Fig. 1. SVM classification margins for ovarian tissues (figure not reproduced; x-axis: Tissue Samples, y-axis: Size of Margin). When classifying, the SVM calculates a margin which is the distance of an example from the decision boundary it has learned. In this graph, the margin for each tissue sample calculated using (6) is shown. A positive value indicates a correct classification, and a negative value indicates an incorrect classification. The most negative point corresponds to tissue N039. The second most negative point corresponds to tissue HWBC3.

With a corrected label, the above experiments are run again but, disappointingly, classification results do not improve. A second tissue, called HWBC3, is consistently misclassified by a large margin in these new tests, and was also strongly misclassified in the original tests, as shown in Figure 1. This non-ovarian normal tissue is the only tissue of its type, and an SVM trained on tissues with little similarity might give spurious classification results. Therefore, we remove this tissue and repeat the experiments. Perfect classification is achieved using all features and a diagonal factor of 0. No other setting is able to make fewer than three mistakes and most make at least four; therefore, we cannot place much confidence in one perfect experiment.

After ranking the features using all 31 samples, we attempt to sequence the ten top-ranked genes to determine if they are biologically significant. Three of these did not yield a readable sequence, and two are repetitive sequences which occur naturally at 3' ends of messenger RNAs and do not correspond to actual genes. Therefore, only five represent expressed genes for which cancer-relatedness information is available, either by homology to a known or assumed tumor gene, or by presence in cDNA libraries from tumor tissues in the case of ESTs. Indeed, three of these five are cancer-related (Ferritin H and two cancer-library ESTs), and one is related to the presence of white blood cells in the tumor. This analysis seems to suggest that the feature selection method is able to identify clones that are cancer-related, and rank them highly. Some clones, however, obtain a high ranking while not having a meaningful biological explanation. Random sequencing of some of the bottom-ranked clones also reveals some known tumor genes which would be expected to be ranked highly. Given this, and the inability of this feature selection method to significantly improve classification performance, we conclude that additional effort is needed to develop ways of identifying meaningful features in these types of datasets. From a tumor biologist's point of view, however, the accumulation of tumor-related genes at the top is a very useful feature.

AML/ALL dataset

Bone marrow or peripheral blood samples are taken from 72 patients with either acute myeloid leukemia (AML) or acute lymphoblastic leukemia (ALL). Following the experimental setup of the original authors, the data is split into a training set consisting of 38 samples, of which 27 are ALL and 11 are AML, and a test set of 34 samples, 20 ALL and 14 AML. The dataset provided contains expression levels for 7129 human genes produced by Affymetrix high-density oligonucleotide microarrays. The scores in the dataset represent the intensity of gene expression after being re-scaled to make overall intensities for each chip equivalent. Following the methods in Golub et al. (1999), we normalize these scores by subtracting the mean and dividing by the standard deviation of the expression values for each gene.

Golub et al. perform hold-one-out cross-validation tests using a weighted voting scheme to classify the training set, and also cluster this set using self-organizing maps (SOMs). (The weighted voting scheme selects 50 genes as described in the subsection 'Feature selection'. Each gene predicts a class for each sample. These predictions are combined, each being weighted by the F score defined above, and if a threshold is exceeded in favor of one class over the other, a prediction is made.) The first method correctly classifies all samples for which a prediction is made, 36 of the 38 samples, while a two-cluster SOM produces one cluster with 24 ALL and one AML sample, and a second with 10 AML and three ALL samples.

We also ran full hold-one-out cross-validation tests on the training set, and our SVM method correctly classifies all samples with a diagonal factor setting of two. Retesting subsets containing the top-ranked 25, 250, 500, and 1000 features, perfect classification is obtained using a diagonal factor of two in all cases.

Using an SVM trained only with examples in the training set and the subsets of features that perform optimally on this training set, we classify examples in the test set, producing results ranging between 30 and 32 of the 34 samples classified correctly. Golub et al. use a predictor trained with their weighted voting scheme on all the training samples, and classify correctly all samples for which a prediction is made, 29 of the 34, declining to predict for the other five. In all tests, our SVM correctly classifies the 29 predicted by their method, and each of the five unpredicted samples is misclassified in at least one SVM test. Two samples, patients 54 and 66, are misclassified in all SVM tests.

Lineage information, either T-cell or B-cell, is provided for the ALL samples. Using all 47 ALL samples from the training and test sets, the SVM achieves perfect classification using the 250 and 500 top-ranked features with multiple diagonal factor settings on hold-one-out cross-validation tests. Using the full dataset, the SVM misclassifies a single tissue using a zero diagonal factor. Golub et al. use SOMs to create four clusters containing all training set examples, including the AML samples. The first cluster contains 10 AML samples, the second eight T-lineage ALL samples and one B-lineage ALL sample, the third five B-lineage ALL samples, and the last 13 B-lineage ALL samples and a single AML sample. Additional tests in Slonim et al. (2000) use the weighted voting predictor to classify 33 samples, making predictions for 32 of them, all correct.

Lastly, the success of chemotherapy treatment for 15 of the AML patients is available. Slonim et al. report that they were able to create a predictor which made only two mistakes using a single gene, HOXA9, but that other predictors using more than this gene generally had error rates above 30%. On hold-one-out cross-validation tests, the SVM is able to classify 10 of the 15 patients correctly using the top 5 or 10 ranked features and a diagonal factor of two, thus performing only slightly better than chance. One sample, patient 37, is consistently misclassified by a relatively large margin.
Colon tumor dataset

Using Affymetrix oligonucleotide arrays, expression levels for 40 tumor and 22 normal colon tissues are measured for 6500 human genes. Of these genes, the 2000 with the highest minimal intensity across the tissues are selected for classification purposes, and these scores are publicly available. Each score represents a gene intensity derived in a process described in Alon et al. (1999). The data is not processed further before performing classification. Alon et al. use a clustering method to create clusters of tissues. In their experiments, one cluster consists of 35 tumor and three normal tissues, and the other of 19 normal and five tumor tissues.

Using the SVM method with full hold-one-out cross-validation, we classify correctly all but six tissues using all 2000 features and a diagonal factor of two. Using the top 1000 genes, the SVM misclassifies these same six samples, which correspond to three tumor tissues (T30, T33, T36) and three normal tissues (N8, N34, N36). T30, T33, and T36 are among the five tumor tissues in the Alon et al. cluster with a majority of normal tissues, and N8 and N32 are in the cluster containing a majority of the tumor tissues.

Alon et al. define a muscle index based on the average intensity of ESTs that are homologous to 17 smooth muscle genes, and hypothesize that tumor tissues should have a smaller muscle index. In general this proves correct, with the notable exceptions that all tumor tissues have a muscle index less than or equal to 0.3 except for T30, T33, and T36, and all normal tissues have an index greater than or equal to 0.3 except N8, N34, and N36. Two samples, N36 and T36, are especially interesting because their names indicate that they originate from the same patient, both are consistently misclassified by the SVM, and N36 has a muscle index of 0.1 while T36 has a muscle index of 0.7, contrary to the proposed hypothesis.

Comparison to perceptron-like classification algorithms

As discussed in the introduction, we do not claim that we can prove the superiority of the SVM method over other classification techniques on this type of dataset. The second family of algorithms we test are generalizations of the perceptron algorithm (Rosenblatt, 1958). This simple algorithm considers each sample individually, and updates its weight vector each time it makes a mistake according to

    $w^{i+1} = w^i + y^i x^i$.    (10)

The resulting decision rule is linear (no bias is used), and classification is given by $\mathrm{sign}(\langle w, x \rangle)$. However, this algorithm requires modification when there is no perfect linear decision rule. Helmbold and Warmuth (1995) show that taking a linear combination of the decision rules used at each iteration of the algorithm is sufficient, and are able to derive performance guarantees. The final decision rule is $\mathrm{sign}(\langle \sum_i w^i, x \rangle)$. Results for this modified perceptron are comparable to those for the SVM, and scores using all features in each dataset are given in Table 2.

Table 2. Results for the perceptron using all features

                            Perceptron      SVM
Dataset          Features    FP    FN     FP   FN
Ovarian I          97 802    4.6   4.8     5    3
Ovarian II         97 802    4.4   3.4     0    0
AML/ALL train       7 129    0.6   2.8     0    0
AML treatment       7 129    4.8   3.5     3    6
Colon               2 000    3.8   3.7     3    3

Perceptron results are averaged over five shufflings of the data, as this algorithm is sensitive to the order of the samples. The first column is the dataset and the second is the number of features considered. Ovarian I refers to the original full dataset with the incorrectly labeled N039 tissue, while Ovarian II is the dataset with the label corrected and the HWBC3 tissue removed. For the ovarian and colon datasets, FP is the number of normal tissues misclassified and FN the number of tumor tissues misclassified. For the AML/ALL training dataset, FP is the number of AML samples misclassified and FN the number of ALL patients misclassified. For the AML treatment dataset, FP is the number of unsuccessfully treated patients misclassified and FN the number of successfully treated patients misclassified. The last two columns report the corresponding SVM results using all features.

Freund and Schapire (1998) demonstrate that kernels other than the simple inner product can be applied effectively to this algorithm, achieving performance comparable to the best SVM on a benchmark test of hand-written digits. As in the case of SVMs, though, the use of a more complex kernel did not improve performance.

We also test an algorithm known as the p-norm perceptron (Grove et al., 1997), using the same averaging procedure. Theoretical results suggest that these algorithms will perform well when good sparse hypotheses are available. The p-norm perceptron, though, did not perform as well as the theory might suggest (results not shown).
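For concreteness, below is one reading of the modified perceptron used here: the mistake-driven update of equation (10) with no bias term, with the final classifier formed from the sum of the weight vectors visited during training. The number of passes, the treatment of points exactly on the boundary, and the names are illustrative assumptions, and the paper additionally averages results over five shufflings of the sample order.

```python
import numpy as np

def averaged_perceptron(X, y, passes=5):
    """X: (samples, features); y: labels in {+1, -1}. Returns summed weight iterates."""
    w = np.zeros(X.shape[1])
    w_sum = np.zeros_like(w)
    for _ in range(passes):
        for xi, yi in zip(X, y):
            if yi * (w @ xi) <= 0:   # mistake (or on the boundary): update, eq. (10)
                w = w + yi * xi
            w_sum += w               # accumulate the iterates for the final rule
    return w_sum

def predict(w_sum, X):
    """Final decision rule: sign of <sum_i w^i, x> for each row x of X."""
    return np.sign(X @ w_sum)
```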
Conclusion

We have presented a method to analyse microarray expression data for genes from several tissue or cell types using SVMs. While our results indicate that SVMs are able to classify tissue and cell types based on this data, we show that other methods, such as the ones based on the perceptron algorithm, are able to perform similarly. The datasets currently available contain relatively few examples and thus do not allow one method to demonstrate superiority. The SVM performs well using a simple kernel, and we believe that as datasets containing more examples become available, the use of more complex kernels may become necessary and will allow the SVM to continue its good performance. As an added feature of our SVM method, we demonstrate that it can be used to identify mis-labeled data.

Microarray expression experiments have great potential for use as part of standard diagnostic tests performed in the medical community. We have shown, along with others, that expression data can be used in the identification of the presence of a disease and the determination of its cell lineage. In addition, there is hope that predictions of the success or failure of a particular treatment may be possible but, so far, results from these types of experiments are inconclusive.

Acknowledgements

We used SVM software written by Bill Grundy and thank him for his assistance and for comments on an earlier draft. We are particularly grateful to Tomaso Poggio for pointing out a flaw in our method in earlier experiments. We thank Manuel Ares for suggesting we look at the Alon et al. data, and Dick Karp for putting us in contact with each other to study the ovarian cancer data. Finally, we are grateful to Al Globus, Computer Sciences Corporation at NASA Ames Research Center, for providing some of the computational resources required to perform our experiments.
References

Alon,U., Barkai,N., Notterman,D.A., Gish,K., Ybarra,S., Mack,D. and Levine,A.J. (1999) Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays. Proc. Natl. Acad. Sci. USA, 96, 6745–6750.

Ben-Dor,A., Bruhn,L., Friedman,N., Nachman,I., Schummer,M. and Yakhini,Z. (2000) Tissue classification with gene expression profiles. In Proceedings of the 4th Annual International Conference on Computational Molecular Biology (RECOMB), Universal Academy Press, Tokyo.

Bishop,C. (1995) Neural Networks for Pattern Recognition. Oxford University Press, Oxford.

Boser,B.E., Guyon,I.M. and Vapnik,V.N. (1992) A training algorithm for optimal margin classifiers. In Proceedings of the 5th Annual ACM Workshop on Computational Learning Theory, ACM Press, Pittsburgh, PA, pp. 144–152.

Brown,M., Grundy,W., Lin,D., Cristianini,N., Sugnet,C., Furey,T., Ares,M.,Jr and Haussler,D. (2000) Knowledge-based analysis of microarray gene expression data by using support vector machines. Proc. Natl. Acad. Sci. USA, 97, 262–267.

Chu,S., DeRisi,J., Eisen,M., Mulholland,J., Botstein,D., Brown,P. and Herskowitz,I. (1998) The transcriptional program of sporulation in budding yeast. Science, 282, 699–705.

Cortes,C. and Vapnik,V. (1995) Support-vector networks. Machine Learning, 20, 273–297.

Cristianini,N. and Shawe-Taylor,J. (2000) An Introduction to Support Vector Machines. Cambridge University Press, Cambridge, www.support-vector.net.

DeRisi,J., Iyer,V. and Brown,P. (1997) Exploring the metabolic and genetic control of gene expression on a genomic scale. Science, 278, 680–686.

DeRisi,J., Penland,L., Brown,P., Bittner,M., Meltzer,P., Ray,M., Chen,Y., Su,Y. and Trent,J. (1996) Use of a cDNA microarray to analyse gene expression patterns in human cancer. Nat. Genet., 14, 457–460.

Eisen,M., Spellman,P., Brown,P. and Botstein,D. (1998) Cluster analysis and display of genome-wide expression patterns. Proc. Natl. Acad. Sci. USA, 95, 14863–14868.

Freund,Y. and Schapire,R.E. (1998) Large margin classification using the perceptron algorithm. In Proceedings of the 11th Annual Conference on Computational Learning Theory, ACM Press, New York, pp. 209–217.

Golub,T., Slonim,D., Tamayo,P., Huard,C., Gaasenbeek,M., Mesirov,J., Coller,H., Loh,M., Downing,J., Caligiuri,M., Bloomfield,C. and Lander,E. (1999) Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science, 286, 531–537.

Grove,A.J., Littlestone,N. and Schuurmans,D. (1997) General convergence results for linear discriminant updates. In Proceedings of the 10th Annual Conference on Computational Learning Theory, ACM Press, New York, pp. 171–183.

Hastie,T., Tibshirani,R., Eisen,M., Brown,P., Ross,D., Scherf,U., Weinstein,J., Alizadeh,A., Staudt,L. and Botstein,D. (2000) Gene Shaving: a new class of clustering methods for expression arrays. Stanford University Technical Report.

Helmbold,D. and Warmuth,M.K. (1995) On weak learning. J. Comput. Syst. Sci., 50, 551–573.

Jaakkola,T., Diekhans,M. and Haussler,D. (1999) Using the Fisher kernel method to detect remote protein homologies. In Proceedings of the 7th International Conference on Intelligent Systems for Molecular Biology, AAAI Press, Menlo Park, CA.

Lockhart,D., Dong,H., Byrne,M., Follettie,M., Gallo,M., Chee,M., Mittmann,M., Wang,C., Kobayashi,M., Horton,H. and Brown,E. (1996) Expression monitoring by hybridization to high-density oligonucleotide arrays. Nature Biotechnol., 14, 1675–1680.

Mukherjee,S., Tamayo,P., Mesirov,J., Slonim,D., Verri,A. and Poggio,T. (1999) Support vector machine classification of microarray data. Technical Report CBCL Paper 182/AI Memo 1676, MIT.

Perou,C., Jeffrey,S., van de Rijn,M., Rees,C., Eisen,M., Ross,D., Pergamenschikov,A., Williams,C., Zhu,S., Lee,J., Lashkari,D., Shalon,D., Brown,P. and Botstein,D. (1999) Distinctive gene expression patterns in human mammary epithelial cells and breast cancers. Proc. Natl. Acad. Sci. USA, 96, 9212–9217.

Roberts,C., Nelson,B., Marton,M., Stoughton,R., Meyer,M., Bennett,H., He,Y., Dai,H., Walker,W., Hughes,T., Tyers,M., Boone,C. and Friend,S. (2000) Signaling and circuitry of multiple MAPK pathways revealed by a matrix of global gene expression profiles. Science, 287, 873–880.

Rosenblatt,F. (1958) The perceptron: a probabilistic model for information storage and organization in the brain. Psych. Rev., 65, 386–407.
Schummer,M., Ng,W., Bumgarner,R., Nelson,P., Schummer,B., Bednarski,D., Hassell,L., Baldwin,R., Karlan,B. and Hood,L. (1999) Comparative hybridization of an array of 21 500 ovarian cDNAs for the discovery of genes overexpressed in ovarian carcinomas. Gene, 238, 375–385.

Shawe-Taylor,J. and Cristianini,N. (1999) Further results on the margin distribution. In Proceedings of the 12th Annual Conference on Computational Learning Theory, ACM Press, New York.

Slonim,D., Tamayo,P., Mesirov,J., Golub,T. and Lander,E. (2000) Class prediction and discovery using gene expression data. In Proceedings of the 4th Annual International Conference on Computational Molecular Biology (RECOMB), Universal Academy Press, Tokyo, Japan, pp. 263–272.

Spellman,P., Sherlock,G., Zhang,M., Iyer,V., Anders,K., Eisen,M., Brown,P., Botstein,D. and Futcher,B. (1998) Comprehensive identification of cell cycle-regulated genes of the yeast Saccharomyces cerevisiae by microarray hybridization. Mol. Biol. Cell, 9, 3273–3297.

Tamayo,P., Slonim,D., Mesirov,J., Zhu,Q., Kitareewan,S., Dmitrovsky,E., Lander,E. and Golub,T. (1999) Interpreting patterns of gene expression with self-organizing maps. Proc. Natl. Acad. Sci. USA, 96, 2907–2912.

Vapnik,V. (1998) Statistical Learning Theory. Wiley, New York.

Wang,K., Gan,L., Jefferey,E., Gayle,M., Gown,A., Skelly,M., Nelson,P., Ng,W., Schummer,M., Hood,L. and Mulligan,J. (1999) Monitoring gene expression profile changes in ovarian carcinomas using cDNA microarray. Gene, 229, 101–108.

Wen,X., Fuhrman,S., Michaels,G., Carr,D., Smith,S., Barker,J. and Somogyi,R. (1998) Large-scale temporal gene expression mapping of central nervous system development. Proc. Natl. Acad. Sci. USA, 95, 334–339.

Zhang,L., Zhou,W., Velculescu,V., Kern,S., Hruban,R., Hamilton,S., Vogelstein,B. and Kinzler,K. (1997) Gene expression profiles in normal and cancer cells. Science, 276, 1268–1272.

Zhu,H., Cong,J., Mamtora,G., Gingeras,T. and Schenk,T. (1998) Cellular gene expression altered by human cytomegalovirus: global monitoring with oligonucleotide arrays. Proc. Natl. Acad. Sci. USA, 95, 14470–14475.

Zien,A., Rätsch,G., Mika,S., Schölkopf,B., Lemmen,C., Smola,A., Lengauer,T. and Müller,K. (2000) Engineering support vector machine kernels that recognize translation initiation sites. Bioinformatics, to appear.


Publisher: Oxford University Press
Copyright: © Oxford University Press 2000
ISSN: 1367-4803
eISSN: 1460-2059
DOI: 10.1093/bioinformatics/16.10.906

Abstract

Vol. 16 no. 10 2000 BIOINFORMATICS Pages 906–914 Support vector machine classification and validation of cancer tissue samples using microarray expression data 1,∗ 2 1 Terrence S. Furey , Nello Cristianini , Nigel Duffy , David W. 3 3 1 Bednarski , Michel ` Schummer and David Haussler Department of Computer Science, University of California, Santa Cruz, Santa Cruz, CA 95064, USA, Department of Engineering Mathematics, University of Bristol, Bristol, BS8 ITH, UK and Department of Molecular Biotechnology, University of Washington, Seattle, WA 98195, USA Received on April 4, 2000; accepted on May 19, 2000 Abstract These experiments primarily consist of either monitoring each gene multiple times under many conditions (Spell- Motivation: DNA microarray experiments generating man et al., 1998; Chu et al., 1998; DeRisi et al., 1997; thousands of gene expression measurements, are being used to gather information from tissue and cell sam- Wen et al., 1998; Roberts et al., 2000), or alternately eval- ples regarding gene expression differences that will uating each gene in a single environment but in different be useful in diagnosing disease. We have developed a types of tissues, especially cancerous tissues (DeRisi et new method to analyse this kind of data using support al., 1996; Alon et al., 1999; Golub et al., 1999; Perou et vector machines (SVMs). This analysis consists of both al., 1999; Zhu et al., 1998; Wang et al., 1999; Schum- classification of the tissue samples, and an exploration of mer et al., 1999; Zhang et al., 1997; Slonim et al., 2000). the data for mis-labeled or questionable tissue results. Those of the first type have allowed for the identification Results: We demonstrate the method in detail on samples of functionally related genes due to common expression consisting of ovarian cancer tissues, normal ovarian patterns (Brown et al., 2000; Eisen et al., 1998; Wen et al., tissues, and other normal tissues. The dataset consists 1998; Roberts et al., 2000), while the latter experiments of expression experiment results for 97 802 cDNAs for have shown promise in classifying tissue types (diagno- each tissue. As a result of computational analysis, a sis) and in the identification of genes whose expressions tissue sample is discovered and confirmed to be wrongly are good diagnostic indicators (Golub et al., 1999; Alon et al., 1999; Slonim et al., 2000). In order to extract in- labeled. Upon correction of this mistake and the removal formation from gene expression measurements, different of an outlier, perfect classification of tissues is achieved, methods have been employed to analyse this data includ- but not with high confidence. We identify and analyse a ing SVMs (Brown et al., 2000; Mukherjee et al., 1999) subset of genes from the ovarian dataset whose expression clustering methods (Eisen et al., 1998; Spellman et al., is highly differentiated between the types of tissues. To 1998; Alon et al., 1999; Perou et al., 1999; Ben-Dor et al., show robustness of the SVM method, two previously 2000; Hastie et al., 2000), self-organizing maps (Tamayo published datasets from other types of tissues or cells are et al., 1999; Golub et al., 1999), and a weighted correla- analysed. The results are comparable to those previously tion method (Golub et al., 1999; Slonim et al., 2000). obtained. We show that other machine learning methods also perform comparably to the SVM on many of those Support vector machines (SVMs), a supervised machine datasets. 
learning technique, have been shown to perform well in Availability: The SVM software is available at http:// www. multiple areas of biological analysis including evaluating cs.columbia.edu/∼bgrundy/ svm. microarray expression data (Brown et al., 2000), detect- Contact: [email protected] ing remote protein homologies (Jaakkola et al., 1999), and recognizing translation initiation sites (Zien et al., 2000). Introduction We have also recently become aware of another effort that uses SVMs in analyzing expression data (Mukherjee et al., Microarray expression experiments allow the recording of 1999). SVMs have demonstrated the ability to not only expression levels of thousands of genes simultaneously. correctly separate entities into appropriate classes, but also To whom correspondence should be addressed. to identify instances whose established classification is not 906  c Oxford University Press 2000 SVM tissue classification using expression data supported by the data. Expression datasets contain mea- of the perceptron algorithm. No single classification surements for thousands of genes which proves problem- technique has proven to be significantly superior to all atic for many traditional methods. SVMs, though, are well others in the experiments we have performed. Indeed, suited to working with high dimensional data such as this. the different kernels we tried performed nearly equally Here a systematic and principled method is introduced well and variations of the perceptron algorithm are shown that analyses microarray expression data from thousands to perform comparably to the SVM on all tests. It of genes tested in multiple tissue or cell samples. The is unfortunate that typical diagnostic gene expression primary goal is the proper classification of new samples. datasets today involve only a few tissue samples. As more We do this by training the SVM on samples classified datasets become available with larger numbers of samples, by experts, then testing the SVM on samples it has not we predict that our method will continue to demonstrate good performance. seen before. We demonstrate how SVMs can not only classify new samples, but can also help in the identification of those which have been wrongly classified by experts. Methods SVMs are not unique among classification methods in this In recent years, several methods have been developed for regard, but we show they are effective. Our method is performing gene expression experiments. Measurements demonstrated in detail on data from experiments involving from these experiments can give expression levels for 31 ovarian cancer, normal ovarian and other normal genes (or ESTs) in tissue or cell samples. For more in tissues. We are able to identify one tissue sample as mis- depth discussions of these techniques, see Lockhart et labeled, and another as an outlier, which is shown in al. (1996) and Schummer et al. (1999). Datasets used the Results Section and illustrated in Figure 1. Though for our experiments consist of a relatively small number perfect classification is finally achieved in one instance, of tissue samples (less than 100) each with expression this performance is not consistently shown in multiple measurements for thousands of genes. tests and, therefore, cannot be considered too significant. Previous methods used in the analysis of similar datasets We also experimented with the method introduced in start with a procedure to extract the most relevant features. 
(Golub et al., 1999) to focus the analysis on a smaller Most learning techniques do not perform well on datasets subset of genes that appear to be the best diagnostic where the number of features is large compared to indicators. This amounts to a kind of dimensionality the number of examples. SVMs are believed to be an reduction on the dataset. If one can identify particular exception. We are able to begin with tests using the full genes that are diagnostic for the classification one is dataset, and systematically reduce the number of features trying to make, e.g. the presence of cancer, then there selecting those we believe to be the most relevant. In this is also hope that some of these genes may be found to way, we can show whether an improvement is made using be of value in further investigations of the disease and smaller sets, thus indicating whether these contain the in future therapies. Here we find that this dimensionality most meaningful genes. reduction does not significantly improve classification To understand our method, a familiarity with SVMs performance. It does reveal some genes that may be of is required, and a brief introduction follows. We explain interest in ovarian cancer. However, further work needs below how we rank the features, and present an outline of to be carried out to identify the most effective feature how we use the SVM to perform classification and error selection/dimensionality reduction methods for this kind detection. of data. Support vector machines To test the generality of the approach, we also ran experiments using leukemia data from Golub et al. (1999) SVMs (Cristianini and Shawe-Taylor, 2000) are a (72 patient samples) and colon tumor data from (Alon et relatively new type of learning algorithm, originally al., 1999) (62 tissue samples). Our results are comparable introduced by Vapnik and co-workers (Boser et al., 1992; to other methods used by the authors of those papers. Vapnik, 1998) and successively extended by a number of Since no special effort was made to tune the method to other researchers. Their remarkably robust performance these other datasets, this increases our confidence that our with respect to sparse and noisy data is making them the approach will have broad applications in analyzing data of system of choice in a number of applications from text this type. categorization to protein function prediction. It is difficult to show that one diagnostic method is When used for classification, they separate a given significantly better than another with small data sets set of binary labeled training data with a hyper-plane such as those we have examined. We have conducted a that is maximally distant from them (known as ‘the full hold-one-out cross-validation (jackknife) evaluation maximal margin hyper-plane’). For cases in which no of the classification performance of the methods we linear separation is possible, they can work in combination tested. These include both SVM methods and variants with the technique of ‘kernels’, that automatically realizes 907 T.S.Furey et al. a non-linear mapping to a feature space. The hyper-plane which that point is embedded in the final decision func- found by the SVM in feature space corresponds to a non- tion. A remarkable property of this alternative representa- linear decision boundary in the input space. tion is that often only a subset of the points will be asso- j j ciated with non-zero α . 
These points are called support Let the j th input point x = (x ,..., x ) be the j vectors and are the points that lie closest to the separating realization of the random vector X . Let this input point j hyper-plane. The sparseness of the α vector has several be labeled by the random variable Y ∈{−1, +1}. n N computational and learning theoretic consequences. Let φ : I ⊆ → F ⊆ be a mapping n Notice that for a test point (x, y) the quantity from the input space I ⊆ to a feature space F . Let y α y φ(x ),φ(x)− b is negative if the i i i=1 us assume that we have a sample S of m labeled data 1 1 m m prediction of the machine is wrong, and a large negative points: S ={(x , y ), ...,(x , y )}. The SVM learning value would indicate that the point (x, y) is regarded by algorithm finds a hyper-plane (w, b) such that the quantity the algorithm as ‘different’ from the training data. i j The matrix K =φ(x ),φ(x ) is called the kernel i i ij γ = min y {w,φ(x )− b} (1) matrix and will be particularly important in the extensions of the algorithm that will be discussed later. In the case is maximized, where ,  denotes an inner product, the when the data are not linearly separable, one can use vector w has the same dimensionality as F , ||w|| is 2 i j more general functions, K = K (x , x ), that provide ij held constant, b is a real number, and γ is called the non-linear decision boundaries. Two classical choices are margin. The quantity (w,φ(x )− b) corresponds to the i j i j d polynomial kernels K (x , x ) = (x , x + 1) and distance between the point x and the decision boundary. i j x −x i i j When multiplied by the label y , it gives a positive value σ Gaussian kernels K (x , x ) = e , where d and for all correct classifications and a negative value for σ are kernel parameters. In our experiments, we use i j i j the incorrect ones. The minimum of this quantity over K (x , x ) = (x , x + 1). all the data is positive if the data is linearly separable, In the presence of noise, the standard maximum margin and is called the margin. Given a new data point x to algorithm described above can be subject to over-fitting, classify, a label is assigned according to its relationship and more sophisticated techniques are necessary. This to the decision boundary, and the corresponding decision problem arises because the maximum margin algorithm function is always finds a perfectly consistent hypothesis and does not tolerate training error. Sometimes, however, it is necessary f (x) = sign(w,φ(x)− b). (2) to trade some training accuracy for better predictive power. The need for tolerating training error has led It is easy to prove (Cristianini and Shawe-Taylor, 2000) to the development of the soft-margin and the margin- that, for the maximal margin hyper-plane, distribution classifiers (Cortes and Vapnik, 1995). One of these techniques (Shawe-Taylor and Cristianini, 1999) replaces the kernel matrix in the training phase as follows: i i w = α y φ(x ) (3) i=1 K ← K + λ1, (7) where α are positive real numbers that maximize while still using the standard kernel function in the decision phase (6). We call λ the diagonal factor. By tuning m m i j i j λ, one can control the training error, and it is possible to α − α α y y φ(x ),φ(x ) (4) i i j prove that the risk of misclassifying unseen points can be i=1 ij=1 decreased with a suitable choice of λ (Shawe-Taylor and subject to Cristianini, 1999). 
If instead of controlling the overall training error one wants to control the trade-off between false positives and false negatives, it is possible to modify K as follows:

K \leftarrow K + \lambda D,    (8)

where D is a diagonal matrix whose entries are either d^+ or d^−, in locations corresponding to positive and negative examples. It is possible to prove that this technique is equivalent to controlling the size of the α_i in a way that depends on the size of the class, introducing a bias for larger α_i in the class with smaller d. This in turn corresponds to an asymmetric margin; i.e. the class with smaller d will be kept further away from the decision boundary (Brown et al., 2000). In the case of imbalanced data sets, choosing d^+ and d^− proportional to the respective class sizes provides a heuristic way to automatically adjust the relative importance of the two classes, based on their respective cardinalities.

The experiments presented in this paper were performed using a freely available implementation of the SVM classifier, which can be obtained at http://www.cs.columbia.edu/~bgrundy/svm. (We use the default values set in the software except for the diagonal factor, which varies; the convergence threshold, which we set to 10^{-11}; and the 'noconstraint' option.) This implementation is based on that described in Jaakkola et al. (1999) and differs slightly from the above explanation in that it does not include a bias term, b, forcing all decision boundaries to contain the origin in feature space.

In the SVM tests reported here, only the simple dot-product kernel is used; a more complex kernel is not required. (We experimented with polynomial and radial basis kernels on the ovarian data, and found that on data containing the mis-labeled point they performed worse than the linear kernel, while on the correctly labeled data performance is similar to the linear kernel.) As possibly more complex datasets become available, providing more examples, higher-order kernels may become necessary (Mukherjee et al., 1999).

Feature selection

Our feature selection criterion is essentially that used in Golub et al. (1999) and Slonim et al. (2000). We start with a dataset S consisting of m expression vectors x^i = (x^i_1, ..., x^i_n), 1 ≤ i ≤ m, where m is the number of tissue or cell samples and n is the number of genes measured. Each sample is labeled with Y ∈ {+1, −1} (e.g. cancer vs normal). For each gene x_j, we calculate the mean μ_j^+ (resp. μ_j^−) and standard deviation σ_j^+ (resp. σ_j^−) using only the tissues labeled +1 (resp. −1). We want to find genes that will help discriminate between the two classes, therefore we calculate a score

F(x_j) = \frac{|\mu_j^+ - \mu_j^-|}{\sigma_j^+ + \sigma_j^-},    (9)

which gives the highest score to those genes whose expression levels differ most on average between the two classes, while also favoring those with small deviations in the respective classes. We then simply take the genes with the highest F(x_j) scores as our top features. (This score is closely related to the Fisher criterion score for the jth feature, F(j) = (\mu_j^+ - \mu_j^-)^2 / ((\sigma_j^+)^2 + (\sigma_j^-)^2) (Bishop, 1995).)
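The ranking score in (9) is straightforward to compute. The sketch below, assuming NumPy (not part of the original work; names are illustrative), scores every gene of an expression matrix and returns the indices of the top-ranked genes; a small constant guards against genes with zero spread in both classes.

```python
import numpy as np

def f_scores(X, y):
    """F(x_j) from equation (9). X: (samples, genes); y: labels in {-1, +1}."""
    pos, neg = X[y == +1], X[y == -1]
    mu_pos, mu_neg = pos.mean(axis=0), neg.mean(axis=0)
    sd_pos, sd_neg = pos.std(axis=0), neg.std(axis=0)
    return np.abs(mu_pos - mu_neg) / (sd_pos + sd_neg + 1e-12)

def top_features(X, y, k):
    """Indices of the k genes with the highest F(x_j) scores."""
    return np.argsort(f_scores(X, y))[::-1][:k]
```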
Complete SVM method

The complete SVM method can be described as follows: we begin by choosing a kernel, starting with the simple dot-product kernel, and tune the diagonal factor to achieve the best performance on hold-one-out cross-validation tests using the full dataset. The SVM tuning procedure is then repeated with a specified number of the top-ranked features. In these cases, for each individual hold-one-out test, the features are ranked based on (9) using the scores from only the known samples, some number of the top features are extracted, and these are then used to train the SVM and classify the unknown sample. Examples which have been consistently misclassified in all tests are identified. These examples can then be investigated by the biologist and, if it is determined that the original label is incorrect, a correction is made and the process is repeated. Alternatively, an example may be deemed an outlier that is very different from the rest, and is therefore removed.
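A minimal sketch of this procedure, with scikit-learn's linear SVM standing in for the implementation actually used and the ranking of (9) re-implemented inline (names are illustrative): for every held-out sample, genes are re-ranked from the remaining samples only, the top k genes are used to train the classifier, and samples that are misclassified are collected for inspection.

```python
import numpy as np
from sklearn.svm import SVC

def rank_genes(X, y):
    # Equation (9): |mu+ - mu-| / (sd+ + sd-), computed from the given samples only.
    pos, neg = X[y == +1], X[y == -1]
    score = np.abs(pos.mean(0) - neg.mean(0)) / (pos.std(0) + neg.std(0) + 1e-12)
    return np.argsort(score)[::-1]

def hold_one_out(X, y, k):
    """Flag samples that are misclassified when held out; features are re-ranked per fold."""
    misclassified = []
    for i in range(len(y)):
        train = np.delete(np.arange(len(y)), i)
        feats = rank_genes(X[train], y[train])[:k]          # top-k genes from training folds only
        clf = SVC(kernel="linear").fit(X[train][:, feats], y[train])
        if clf.predict(X[i:i+1, feats])[0] != y[i]:
            misclassified.append(i)
    return misclassified   # candidates for mis-labeled samples or outliers
```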
Results

Our method is tested in detail using a previously unpublished ovarian tissue dataset. A short analysis of the feature selection is included. To demonstrate the generality of our method, we also performed experiments using previously published datasets. The first dataset contains examples of patients with human acute leukemia, originally analysed by Golub et al. (1999), with further results reported by Slonim et al. (2000); it can be obtained at http://waldo.wi.mit.edu/MPR/cancer_class.html. The second dataset is comprised of human tumor and normal colon tissues, originally analysed by Alon et al. (1999); the data is available on their website, http://www.molbio.princeton.edu/colondata.

Ovarian dataset

Microarray expression experiments are performed using 97 802 DNA clones, each of which may or may not correspond to human genes, for 31 tissue samples. These samples are either cancerous ovarian tissue, normal ovarian tissue, or normal non-ovarian tissue. For the purpose of these experiments, the two types of normal tissue are considered together as a single class. The expression values for each of the genes are normalized such that the distribution over the samples has zero mean and unit variance.

Hold-one-out cross-validation experiments are performed. The SVM is trained using data from all but one of the tissue samples. The sample not used in training is then assigned a class by the SVM. A single SVM experiment consists of a series of hold-one-out experiments, each sample being held out and tested exactly once.

Initially, experiments are carried out using all expression scores with diagonal factor settings of 0, 2, 5 and 10. The genes are then ranked in the manner described previously, and datasets consisting of the top 25, 50, 100, 500 and 1000 features are created. Experiments using similar diagonal factors to those above are performed using these smaller feature sets. Table 1 displays the most significant results from these experiments. The best classification is done using the top 50 features with a diagonal factor of 2 or 5. Though the smaller datasets achieve slightly better scores compared to using all features, we do not believe this improvement to be significant.

Table 1. Error rates for ovarian cancer tissue experiments

Kernel        DF   Features   FP   FN   TP   TN
Dot-product    0         25    5    4   10   12
Dot-product    2         25    5    2   12   12
Dot-product    5         25    4    2   12   13
Dot-product   10         25    4    2   12   13
Dot-product    0         50    4    2   12   13
Dot-product    2         50    3    2   12   14
Dot-product    5         50    3    2   12   14
Dot-product   10         50    3    2   12   14
Dot-product    0        100    4    3   11   13
Dot-product    2        100    5    3   11   12
Dot-product    5        100    5    3   11   12
Dot-product   10        100    5    3   11   12
Dot-product    0     97 802   17    0   14    0
Dot-product    2     97 802    9    2   12    8
Dot-product    5     97 802    7    3   11   10
Dot-product   10     97 802    5    3   11   12

For each setting of the SVM, consisting of a kernel and a diagonal factor (DF), each tissue was classified. The Features column gives the number of features (clones) used. Reported are the number of normal tissues misclassified (FP), tumor tissues misclassified (FN), tumor tissues classified correctly (TP), and normal tissues classified correctly (TN).

An analysis of the misclassified examples reveals that one normal ovarian tissue sample, N039, is misclassified in all instances. In addition, the margin of misclassification, calculated using (6), is relatively large, meaning the SVM strongly believes it to be cancerous. Figure 1 shows classification margins for experiments using the top 50 features and a diagonal factor of 2. Upon investigation, it is discovered that this tissue had been mistakenly labeled and is, in fact, cancerous.

Fig. 1. SVM classification margins for ovarian tissues (y-axis: size of margin; x-axis: tissue samples). When classifying, the SVM calculates a margin which is the distance of an example from the decision boundary it has learned. In this graph, the margin for each tissue sample, calculated using (6), is shown. A positive value indicates a correct classification, and a negative value indicates an incorrect classification. The most negative point corresponds to tissue N039. The second most negative point corresponds to tissue HWBC3.

With a corrected label, the above experiments are run again, but disappointingly, classification results do not improve. A second tissue, called HWBC3, is consistently misclassified by a large margin in these new tests, and was also strongly misclassified in the original tests, as shown in Figure 1. This non-ovarian normal tissue is the only tissue of its type, and an SVM trained on tissues with little similarity to it might give spurious classification results. Therefore, we remove this tissue and repeat the experiments. Perfect classification is achieved using all features and a diagonal factor of 0. No other setting is able to make fewer than three mistakes and most make at least four; therefore we cannot place much confidence in one perfect experiment.

After ranking the features using all 31 samples, we attempt to sequence the ten top-ranked genes to determine if they are biologically significant. Three of these did not yield a readable sequence, and two are repetitive sequences which occur naturally at 3' ends of messenger RNAs and do not correspond to actual genes. Therefore, only five represent expressed genes for which cancer-relatedness information is available, either by their homology to a known or assumed tumor gene, or by their presence in cDNA libraries from tumor tissues in the case of ESTs. Indeed, three of these five are cancer-related (Ferritin H and two cancer-library ESTs), and one is related to the presence of white blood cells in the tumor. This analysis seems to suggest that the feature selection method is able to identify clones that are cancer-related and rank them highly. Some clones, however, obtain a high ranking while not having a meaningful biological explanation. Random sequencing of some of the bottom-ranked clones also reveals some known tumor genes which would be expected to be ranked highly. Given this and the inability of this feature selection method to significantly improve classification performance, we conclude that additional effort is needed to develop ways of identifying meaningful features in these types of datasets. From a tumor biologist's point of view, however, the accumulation of tumor-related genes at the top is a very useful feature.
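For reference, the kind of margin check summarized in Figure 1 can be sketched with a short hold-one-out loop (scikit-learn assumed, labels in {−1, +1}, names illustrative): the signed distance of each held-out sample from the learned decision boundary is recorded, so that samples misclassified by a large margin, as N039 and HWBC3 were, stand out.

```python
import numpy as np
from sklearn.svm import SVC

def hold_one_out_margins(X, y):
    """Signed margin of each held-out sample; large negative values flag suspect labels."""
    margins = np.zeros(len(y))
    for i in range(len(y)):
        train = np.delete(np.arange(len(y)), i)
        clf = SVC(kernel="linear").fit(X[train], y[train])
        # positive: correctly classified; strongly negative: confidently misclassified
        margins[i] = y[i] * clf.decision_function(X[i:i+1])[0]
    return margins

# Example usage: the two most suspicious samples
# suspects = np.argsort(hold_one_out_margins(X, y))[:2]
```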
AML/ALL dataset

Bone marrow or peripheral blood samples are taken from 72 patients with either acute myeloid leukemia (AML) or acute lymphoblastic leukemia (ALL). Following the experimental setup of the original authors, the data is split into a training set consisting of 38 samples, of which 27 are ALL and 11 are AML, and a test set of 34 samples, 20 ALL and 14 AML. The dataset provided contains expression levels for 7129 human genes produced by Affymetrix high-density oligonucleotide microarrays. The scores in the dataset represent the intensity of gene expression after being re-scaled to make the overall intensities for each chip equivalent. Following the methods in Golub et al. (1999), we normalize these scores by subtracting the mean and dividing by the standard deviation of the expression values for each gene.

Golub et al. perform hold-one-out cross-validation tests using a weighted voting scheme to classify the training set and also cluster this set using self-organizing maps (SOMs). (The weighted voting scheme selects 50 genes as described in the subsection 'Feature selection'. Each gene predicts a class for each sample. These predictions are combined, each being weighted by the F(g) score defined above, and if a threshold is exceeded in favor of one class over the other, a prediction is made.) The first method correctly classifies all samples for which a prediction is made, 36 of the 38 samples, while a two-cluster SOM produces one cluster with 24 ALL and one AML sample, and a second with 10 AML and three ALL samples.

We also performed full hold-one-out cross-validation tests on the training set, and our SVM method correctly classifies all samples with a diagonal factor setting of two. Retesting subsets containing the top-ranked 25, 250, 500, and 1000 features, perfect classification is obtained using a diagonal factor of two in all cases.

Using an SVM trained only with examples in the training set and the subsets of features that perform optimally on this training set, we classify the examples in the test set, producing results ranging from 30 to 32 of the 34 samples classified correctly. Golub et al. use a predictor trained with their weighted voting scheme on all the training samples, and classify correctly all samples for which a prediction is made, 29 of the 34, declining to predict for the other five. In all tests, our SVM correctly classifies the 29 predicted by their method, and for the five unpredicted samples, each is misclassified in at least one SVM test. Two samples, patients 54 and 66, are misclassified in all SVM tests.

Lineage information, either T-cell or B-cell, is provided for the ALL samples. Using all 47 ALL samples from the training and test sets, the SVM achieves perfect classification using the 250 and 500 top-ranked features with multiple diagonal factor settings on hold-one-out cross-validation tests. Using the full dataset, the SVM misclassifies a single tissue using a zero diagonal factor. Golub et al. use SOMs to create four clusters containing all training set examples, including the AML samples. The first cluster contains 10 AML samples; the second, eight T-lineage ALL samples and one B-lineage ALL sample; the third, five B-lineage ALL samples; and the last, 13 B-lineage ALL samples and a single AML sample. Additional tests in Slonim et al. (2000) use the weighted voting predictor to classify 33 samples, of which it predicted on 32, all being correct.

Lastly, the success of chemotherapy treatments for 15 of the AML patients is available. Slonim et al. report that they were able to create a predictor which made only two mistakes using a single gene, HOXA9, but that other predictors using more than this gene generally had error rates above 30%. On hold-one-out cross-validation tests, the SVM is able to correctly classify 10 of the 15 patients using the top 5 or 10 ranked features and a diagonal factor of two, thus performing only slightly better than chance. One misclassified sample, patient 37, is consistently misclassified by a relatively large margin.
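The per-gene normalization used for the AML/ALL data (and, in the same spirit, for the ovarian data above) is a plain standardization. A minimal NumPy sketch, with an illustrative guard for constant genes, is:

```python
import numpy as np

def standardize_genes(X):
    """X: (samples, genes). Shift and scale each gene to zero mean and unit variance."""
    mu = X.mean(axis=0)
    sd = X.std(axis=0)
    return (X - mu) / np.where(sd > 0, sd, 1.0)   # leave constant genes unscaled
```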
Colon tumor dataset

Using Affymetrix oligonucleotide arrays, expression levels for 40 tumor and 22 normal colon tissues are measured for 6500 human genes. Of these genes, the 2000 with the highest minimal intensity across the tissues are selected for classification purposes, and these scores are publicly available. Each score represents a gene intensity derived in a process described in Alon et al. (1999). The data is not processed further before performing classification.

Alon et al. use a clustering method to create clusters of tissues. In their experiments, one cluster consists of 35 tumor and three normal tissues, and the other of 19 normal and five tumor tissues.

Using the SVM method with full hold-one-out cross-validation, we classify correctly all but six tissues using all 2000 features and a diagonal factor of two. Using the top 1000 genes, the SVM misclassifies these same six samples, which correspond to three tumor tissues (T30, T33, T36) and three normal tissues (N8, N34, N36). T30, T33, and T36 are among the five tumor tissues in the Alon et al. cluster with a majority of normal tissues, and N8 and N32 are in the cluster containing a majority of the tumor tissues.

Alon et al. define a muscle index based on the average intensity of ESTs that are homologous to 17 smooth muscle genes, and hypothesize that tumor tissues should have a smaller muscle index. In general, this proves correct, with the notable exceptions that all tumor tissues have a muscle index less than or equal to 0.3 except for T30, T33, and T36, and all normal tissues have an index greater than or equal to 0.3 except N8, N34, and N36. Two samples, N36 and T36, are especially interesting because their names indicate that they originate from the same patient, both are consistently misclassified by the SVM, and N36 has a muscle index of 0.1 while T36 has a muscle index of 0.7, contrary to the proposed hypothesis.

Comparison to perceptron-like classification algorithms

As discussed in the introduction, we do not claim that we can prove the superiority of the SVM method over other classification techniques on this type of dataset. The second family of algorithms we test are generalizations of the perceptron algorithm (Rosenblatt, 1958). This simple algorithm considers each sample individually, and updates its weight vector each time it makes a mistake according to

w^{i+1} = w^i + y^i x^i.    (10)

The resulting decision rule is linear (no bias is used), and classification is given by sign(⟨w, x⟩). However, this algorithm requires modification when there is no perfect linear decision rule. Helmbold and Warmuth (1995) show that taking a linear combination of the decision rules used at each iteration of the algorithm is sufficient, and are able to derive performance guarantees. The final decision rule is sign(⟨w̄, x⟩), where w̄ is the average of the weight vectors from each iteration. Results for this modified perceptron are comparable to those for the SVM, and scores using all features in each dataset are given in Table 2.

Table 2. Results for the perceptron using all features

                            Perceptron       SVM
Dataset          Features    FP    FN      FP   FN
Ovarian I          97 802   4.6   4.8       5    3
Ovarian II         97 802   4.4   3.4       0    0
AML/ALL train       7 129   0.6   2.8       0    0
AML treatment       7 129   4.8   3.5       3    6
Colon               2 000   3.8   3.7       3    3

Perceptron results are averaged over five shufflings of the data, as this algorithm is sensitive to the order of the samples. The first column is the dataset and the second the number of features considered. Ovarian I refers to the original full dataset with the incorrectly labeled N039 tissue, while Ovarian II is the dataset with the label corrected and the HWBC3 tissue removed. For the ovarian and colon datasets, FP is the number of normal tissues misclassified and FN the number of tumor tissues misclassified. For the AML/ALL training dataset, FP is the number of AML samples misclassified and FN the number of ALL samples misclassified. For the AML treatment dataset, FP is the number of unsuccessfully treated patients misclassified and FN the number of successfully treated patients misclassified. The last two columns report the corresponding SVM scores using all features.

Freund and Schapire (1998) demonstrate that kernels other than the simple inner product can be applied effectively to this algorithm, achieving performance comparable to the best SVM on a benchmark test of hand-written digits. As in the case of SVMs, though, the use of a more complex kernel did not improve performance.

We also test an algorithm known as the p-norm perceptron (Grove et al., 1997), using the same averaging procedure. Theoretical results suggest that these algorithms will perform well when good sparse hypotheses are available. The p-norm perceptron, though, did not perform as well as the theory might suggest (results not shown).
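A minimal sketch of such an averaged perceptron, assuming NumPy; the number of passes over the data and the treatment of a zero dot product as a mistake are illustrative choices not specified in the text.

```python
import numpy as np

def averaged_perceptron(X, y, epochs=10):
    """Train with the update in (10) and return the average of the weight vectors."""
    w = np.zeros(X.shape[1])
    w_sum = np.zeros(X.shape[1])
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            if yi * (w @ xi) <= 0:        # mistake (no bias term is used)
                w = w + yi * xi           # equation (10)
            w_sum += w                    # accumulate for the averaged decision rule
    return w_sum / (epochs * len(y))

def perceptron_predict(w_avg, X_new):
    return np.sign(X_new @ w_avg)
```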
Conclusion

We have presented a method to analyse microarray expression data for genes from several tissue or cell types using SVMs. While our results indicate that SVMs are able to classify tissue and cell types based on this data, we show that other methods, such as the ones based on the perceptron algorithm, are able to perform similarly. The datasets currently available contain relatively few examples and thus do not allow one method to demonstrate superiority. The SVM performs well using a simple kernel, and we believe that as datasets containing more examples become available, the use of more complex kernels may become necessary and will allow the SVM to continue its good performance. As an added feature of our SVM method, we demonstrate that it can be used to identify mis-labeled data.

Microarray expression experiments have great potential for use as part of standard diagnostic tests performed in the medical community. We have shown, along with others, that expression data can be used in the identification of the presence of a disease and the determination of its cell lineage. In addition, there is hope that predictions of the success or failure of a particular treatment may be possible, but so far results from these types of experiments are inconclusive.

Acknowledgements

We used SVM software written by Bill Grundy and thank him for his assistance and for comments on an earlier draft. We are particularly grateful to Tomaso Poggio for pointing out a flaw in our method in earlier experiments. We thank Manuel Ares for suggesting we look at the Alon et al. data, and Dick Karp for putting us in contact with each other to study the ovarian cancer data. Finally, we are grateful to Al Globus, Computer Sciences Corporation at NASA Ames Research Center, for providing some of the computational resources required to perform our experiments.
References

Alon,U., Barkai,N., Notterman,D.A., Gish,K., Mack,S.Y.D. and Levine,J. (1999) Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays. Proc. Natl. Acad. Sci. USA, 96, 6745–6750.
Ben-Dor,A., Bruhn,L., Friedman,N., Nachman,I., Schummer,M. and Yakhini,Z. (2000) Tissue classification with gene expression profiles. In Proceedings of the 4th Annual International Conference on Computational Molecular Biology (RECOMB). Universal Academy Press, Tokyo.
Bishop,C. (1995) Neural Networks for Pattern Recognition. Oxford University Press, Oxford.
Boser,B.E., Guyon,I.M. and Vapnik,V.N. (1992) A training algorithm for optimal margin classifiers. In Proceedings of the 5th Annual ACM Workshop on Computational Learning Theory. ACM Press, Pittsburgh, PA, pp. 144–152.
Brown,M., Grundy,W., Lin,D., Cristianini,N., Sugnet,C., Furey,T., Ares,M.,Jr and Haussler,D. (2000) Knowledge-based analysis of microarray gene expression data by using support vector machines. Proc. Natl. Acad. Sci. USA, 97, 262–267.
Chu,S., DeRisi,J., Eisen,M., Mulholland,J., Botstein,D., Brown,P. and Herskowitz,I. (1998) The transcriptional program of sporulation in budding yeast. Science, 282, 699–705.
Cortes,C. and Vapnik,V. (1995) Support-vector networks. Machine Learning, 20, 273–297.
Cristianini,N. and Shawe-Taylor,J. (2000) An Introduction to Support Vector Machines. Cambridge University Press, Cambridge, www.support-vector.net.
DeRisi,J., Iyer,V. and Brown,P. (1997) Exploring the metabolic and genetic control of gene expression on a genomic scale. Science, 278, 680–686.
DeRisi,J., Penland,L., Brown,P., Bittner,M., Meltzer,P., Ray,M., Chen,Y., Su,Y. and Trent,J. (1996) Use of a cDNA microarray to analyse gene expression patterns in human cancer. Nat. Genet., 14, 457–460.
Eisen,M., Spellman,P., Brown,P. and Botstein,D. (1998) Cluster analysis and display of genome-wide expression patterns. Proc. Natl. Acad. Sci. USA, 95, 14863–14868.
Freund,Y. and Schapire,R.E. (1998) Large margin classification using the perceptron algorithm. In Proceedings of the 11th Annual Conference on Computational Learning Theory. ACM Press, New York, pp. 209–217.
Golub,T., Slonim,D., Tamayo,P., Huard,C., Gaasenbeek,M., Mesirov,J., Coller,H., Loh,M., Downing,J., Caligiuri,M., Bloomfield,C. and Lander,E. (1999) Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science, 286, 531–537.
Grove,A.J., Littlestone,N. and Schuurmans,D. (1997) General convergence results for linear discriminant updates. In Proceedings of the 10th Annual Conference on Computational Learning Theory. ACM Press, New York, pp. 171–183.
Hastie,T., Tibshirani,R., Eisen,M., Brown,P., Ross,D., Scherf,U., Weinstein,J., Alizadeh,A., Staudt,L. and Botstein,D. (2000) Gene shaving: a new class of clustering methods for expression arrays. Stanford University technical report.
Helmbold,D. and Warmuth,M.K. (1995) On weak learning. J. Comput. Syst. Sci., 50, 551–573.
Jaakkola,T., Diekhans,M. and Haussler,D. (1999) Using the Fisher kernel method to detect remote protein homologies. In Proceedings of the 7th International Conference on Intelligent Systems for Molecular Biology. AAAI Press, Menlo Park, CA.
Lockhart,D., Dong,H., Byrne,M., Follettie,M., Gallo,M., Chee,M., Mittmann,M., Wang,C., Kobayashi,M., Horton,H. and Brown,E. (1996) Expression monitoring by hybridization to high-density oligonucleotide arrays. Nature Biotechnol., 14, 1675–1680.
Mukherjee,S., Tamayo,P., Mesirov,J., Slonim,D., Verri,A. and Poggio,T. (1999) Support vector machine classification of microarray data. Technical Report CBCL Paper 182/AI Memo 1676, MIT.
Perou,C., Jeffrey,S., van de Rijn,M., Rees,C., Eisen,M., Ross,D., Pergamenschikov,A., Williams,C., Zhu,S., Lee,J., Lashkari,D., Shalon,D., Brown,P. and Botstein,D. (1999) Distinctive gene expression patterns in human mammary epithelial cells and breast cancers. Proc. Natl. Acad. Sci. USA, 96, 9212–9217.
Roberts,C., Nelson,B., Marton,M., Stoughton,R., Meyer,M., Bennett,H., He,Y., Dai,H., Walker,W., Hughes,T., Tyers,M., Boone,C. and Friend,S. (2000) Signaling and circuitry of multiple MAPK pathways revealed by a matrix of global gene expression profiles. Science, 287, 873–880.
Rosenblatt,F. (1958) The perceptron: a probabilistic model for information storage and organization in the brain. Psych. Rev., 65, 386–407.
Schummer,M., Ng,W., Bumgarner,R., Nelson,P., Schummer,B., Bednarski,D., Hassell,L., Baldwin,R., Karlan,B. and Hood,L. (1999) Comparative hybridization of an array of 21 500 ovarian cDNAs for the discovery of genes overexpressed in ovarian carcinomas. Gene, 238, 375–385.
Shawe-Taylor,J. and Cristianini,N. (1999) Further results on the margin distribution. In Proceedings of the 12th Annual Conference on Computational Learning Theory. ACM Press, New York.
Slonim,D., Tamayo,P., Mesirov,J., Golub,T. and Lander,E. (2000) Class prediction and discovery using gene expression data. In Proceedings of the 4th Annual International Conference on Computational Molecular Biology (RECOMB). Universal Academy Press, Tokyo, pp. 263–272.
Spellman,P., Sherlock,G., Zhang,M., Iyer,V., Anders,K., Eisen,M., Brown,P., Botstein,D. and Futcher,B. (1998) Comprehensive identification of cell cycle-regulated genes of the yeast Saccharomyces cerevisiae by microarray hybridization. Mol. Biol. Cell, 9, 3273–3297.
Tamayo,P., Slonim,D., Mesirov,J., Zhu,Q., Kitareewan,S., Dmitrovsky,E., Lander,E. and Golub,T. (1999) Interpreting patterns of gene expression with self-organizing maps. Proc. Natl. Acad. Sci. USA, 96, 2907–2912.
Vapnik,V. (1998) Statistical Learning Theory. Wiley, New York.
Wang,K., Gan,L., Jefferey,E., Gayle,M., Gown,A., Skelly,M., Nelson,P., Ng,W., Schummer,M., Hood,L. and Mulligan,J. (1999) Monitoring gene expression profile changes in ovarian carcinomas using cDNA microarray. Gene, 229, 101–108.
Wen,X., Fuhrman,S., Michaels,G., Carr,D., Smith,S., Barker,J. and Somogyi,R. (1998) Large-scale temporal gene expression mapping of central nervous system development. Proc. Natl. Acad. Sci. USA, 95, 334–339.
Zhang,L., Zhou,W., Velculescu,V., Kern,S., Hruban,R., Hamilton,S., Vogelstein,B. and Kinzler,K. (1997) Gene expression profiles in normal and cancer cells. Science, 276, 1268–1272.
Zhu,H., Cong,J., Mamtora,G., Gingeras,T. and Shenk,T. (1998) Cellular gene expression altered by human cytomegalovirus: global monitoring with oligonucleotide arrays. Proc. Natl. Acad. Sci. USA, 95, 14470–14475.
Zien,A., Rätsch,G., Mika,S., Schölkopf,B., Lemmen,C., Smola,A., Lengauer,T. and Müller,K. (2000) Engineering support vector machine kernels that recognize translation initiation sites. Bioinformatics, to appear.
