Validating Clustering for Gene Expression Data

Ka Yee Yeung, David R. Haynor and Walter L. Ruzzo

Technical Report UW-CSE-00-01-01, January, 2000.

Abstract: Many clustering algorithms have been proposed to analyze gene expression data, but little guidance is available to help choose among them. We provide a systematic and quantitative framework to assess the results of clustering algorithms. A typical gene expression data set contains measurements of the expression levels of a fixed set of genes under various experimental conditions. Clustering algorithms attempt to partition the genes into groups exhibiting similar patterns of variation in expression level, with the aim of revealing biologically meaningful patterns of activity or control. Our methodology is to apply a clustering algorithm to the data from all but one experimental condition. The remaining condition is used to assess the predictive power of the resulting clusters: meaningful clusters should exhibit less variation in the remaining condition than clusters formed by coincidence. We have successfully applied the methodology to compare three clustering algorithms on three published gene expression data sets. In particular, we found our quantitative measures of cluster quality to be positively correlated with external standards of cluster quality (functional categorizations of genes known for two of the three data sets).
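The leave-one-condition-out idea in the abstract can be illustrated with a minimal sketch. It is not the report's implementation, and several things here are assumptions. The names `kmeans`, `left_out_deviation` and `leave_one_out_score` are hypothetical. "Variation in the remaining condition" is taken to mean the root-mean-square deviation of each gene from its cluster mean. The score is summed over every choice of held-out condition. The tiny deterministic k-means is only a stand-in for whichever clustering algorithm is being assessed, and it requires that k not exceed the number of genes.

```python
import math

def kmeans(rows, k, iters=20):
    """Tiny k-means stand-in; seeds with the first k rows (deterministic)."""
    centers = [list(r) for r in rows[:k]]
    labels = [0] * len(rows)
    for _ in range(iters):
        # Assign each gene to its nearest center (squared Euclidean distance).
        labels = [
            min(range(k),
                key=lambda c: sum((x - y) ** 2 for x, y in zip(r, centers[c])))
            for r in rows
        ]
        # Recompute each center as the mean of its members;
        # an empty cluster keeps its previous center.
        for c in range(k):
            members = [r for r, lab in zip(rows, labels) if lab == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

def left_out_deviation(data, labels, e):
    """RMS deviation of condition e from its cluster mean, over all genes."""
    clusters = {}
    for row, lab in zip(data, labels):
        clusters.setdefault(lab, []).append(row[e])
    squared = 0.0
    for values in clusters.values():
        mean = sum(values) / len(values)
        squared += sum((v - mean) ** 2 for v in values)
    return math.sqrt(squared / len(data))

def leave_one_out_score(data, k, cluster=kmeans):
    """Sum the left-out deviation over every choice of held-out condition."""
    total = 0.0
    for e in range(len(data[0])):
        # Cluster the genes using every condition except e.
        reduced = [row[:e] + row[e + 1:] for row in data]
        labels = cluster(reduced, k)
        # Score the clusters on the condition they never saw.
        total += left_out_deviation(data, labels, e)
    return total

# Made-up toy data: 4 genes (rows) x 3 conditions (columns), two obvious groups.
toy = [
    [0.0, 0.1, 0.0],
    [5.0, 5.1, 5.0],
    [0.1, 0.0, 0.1],
    [5.1, 5.0, 5.1],
]
score_two_clusters = leave_one_out_score(toy, 2)
score_one_cluster = leave_one_out_score(toy, 1)
```

Under these assumptions a lower score means the clusters predict the held-out condition better, so on the toy data the two-cluster result should score lower than lumping all genes together. Passing a different function as `cluster` would let the same score compare clustering algorithms on equal terms.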

Download: PostScript, PDF (contains color figures; best printed on a color printer).

E-mail: ruzzo /at/ cs /dot/ washington /dot/ edu