Abstract: DNA arrays yield a global view of the cell by enabling the simultaneous measurement of expression levels of thousands of genes. When used to compare normal tissues with tissues at various stages of disease, or diseased tissues with different responses to treatment, arrays present opportunities for improved disease diagnosis and a deeper understanding of the molecular basis of observed phenotypes. Several machine learning methods have been applied to array data to classify genes on the basis of their expression levels in particular samples, and to classify tissue samples on the basis of their global patterns of gene expression [2-4, 9, 12, 21]. These tasks are made more difficult by the noisy nature of array data and, when classifying tissues, by the overwhelming number of gene attributes relative to the number of training samples. In this paper, we present a naive Bayes method for classifying tissues on the basis of DNA array data, and use a likelihood-based metric to select the most useful subset of genes for inclusion in the classifier. We applied this method to three data sets with tissues of two different classes, and found its accuracy to exceed that of a recently described method [12, 21] in two of the three cases. Furthermore, our method is easily extensible to multiclass classification, and performed well when applied to a data set with three different classes of tissues.
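As a rough illustration of the approach summarized above, the following is a minimal sketch of a naive Bayes tissue classifier with likelihood-based gene selection. The abstract does not give the model details, so everything specific here is an assumption: each gene is modeled with a class-conditional Gaussian, and the gene score (the mean log posterior of the true class when that gene is used alone) is a hypothetical stand-in for the paper's metric. The function names and the synthetic data are likewise illustrative only.

```python
import numpy as np

def fit_gaussians(X, y, classes, min_std=1e-3):
    """Estimate a per-class, per-gene mean and standard deviation."""
    means = np.array([X[y == c].mean(axis=0) for c in classes])
    stds = np.array([X[y == c].std(axis=0) for c in classes])
    return means, np.maximum(stds, min_std)  # floor avoids division by zero

def log_density(X, means, stds):
    """Gaussian log density with shape (n_samples, n_classes, n_genes)."""
    z = (X[:, None, :] - means[None, :, :]) / stds[None, :, :]
    return -0.5 * z**2 - np.log(stds[None, :, :]) - 0.5 * np.log(2 * np.pi)

def gene_scores(X, y, classes):
    """Assumed metric: mean log posterior of the true class, one gene at a time."""
    means, stds = fit_gaussians(X, y, classes)
    ll = log_density(X, means, stds)
    # Normalize over classes (uniform prior) with a stable log-sum-exp.
    m = ll.max(axis=1, keepdims=True)
    log_post = ll - (m + np.log(np.exp(ll - m).sum(axis=1, keepdims=True)))
    idx = np.searchsorted(classes, y)  # classes is sorted (from np.unique)
    return log_post[np.arange(len(y)), idx, :].mean(axis=0)

def train(X, y, n_genes):
    """Select the n_genes highest-scoring genes and fit the classifier on them."""
    classes = np.unique(y)
    selected = np.argsort(gene_scores(X, y, classes))[::-1][:n_genes]
    means, stds = fit_gaussians(X[:, selected], y, classes)
    log_priors = np.log(np.array([(y == c).mean() for c in classes]))
    return {"classes": classes, "selected": selected,
            "means": means, "stds": stds, "log_priors": log_priors}

def predict(model, X):
    """Sum per-gene log likelihoods (the naive independence assumption)."""
    ll = log_density(X[:, model["selected"]], model["means"], model["stds"])
    joint = ll.sum(axis=2) + model["log_priors"]
    return model["classes"][np.argmax(joint, axis=1)]

# Synthetic stand-in for array data: 3 tissue classes, 200 genes,
# of which only the first 5 differ between classes.
rng = np.random.default_rng(0)
n_per_class, n_total, n_informative = 20, 200, 5
y = np.repeat(np.arange(3), n_per_class)
X = rng.normal(size=(y.size, n_total))
X[:, :n_informative] += 5.0 * y[:, None]

model = train(X, y, n_genes=n_informative)
accuracy = (predict(model, X) == y).mean()  # training accuracy only
```

Because the per-gene likelihoods are simply summed over however many classes are present, the same code handles two-class and three-class data without modification. The accuracy computed here is on the training samples; a realistic evaluation would score and select genes inside a cross-validation loop so that held-out samples do not influence selection.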
Download: PDF (contains color figures)
E-mail: ruzzo /at/ cs /dot/ washington /dot/ edu