Model-Based Clustering and Data Transformations for Gene Expression Data

Ka Yee Yeung¹, Chris Fraley², Alejandro Murua³, Adrian E. Raftery², and Walter L. Ruzzo¹

Technical Report UW-CSE-2001-04-02, April, 2001.

(Also UW Statistics Department Technical Report 396.)

Abstract: Clustering is a useful exploratory technique for the analysis of gene expression data. Many different heuristic clustering algorithms have been proposed in this context. Clustering algorithms based on probability models offer a principled alternative to heuristic algorithms. In particular, model-based clustering assumes that the data is generated by a finite mixture of underlying probability distributions such as multivariate normal distributions. This Gaussian mixture model has been shown to be a powerful tool for many applications. In addition, the issues of selecting a "good" clustering method and determining the "correct" number of clusters are reduced to model selection problems in the probability framework.

We benchmarked the performance of model-based clustering on several synthetic and real gene expression data sets for which external evaluation criteria were available. The model-based approach has superior performance on our synthetic data sets, consistently selecting the correct model and the right number of clusters. On real expression data, the model-based approach produced clusters of quality comparable to a leading heuristic clustering algorithm, but with the key advantage of suggesting the number of clusters and an appropriate model. We also assessed the degree to which these real gene expression data sets fit multivariate Gaussian distributions both before and after subjecting them to commonly used data transformations. Suitably chosen transformations seem to result in reasonable fits.
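
As a rough illustration of how choosing the number of clusters becomes a model selection problem, the sketch below fits one-dimensional Gaussian mixtures by EM and scores each candidate with the Bayesian Information Criterion (BIC). It is a minimal sketch under stated assumptions, not the method used in the report: the abstract does not name the selection criterion, BIC is assumed here as one common choice, the data are univariate rather than multivariate, and the names `fit_mixture` and `bic` are hypothetical.

```python
import math


def log_normal_pdf(x, mean, var):
    """Log density of a univariate normal distribution."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - mean) ** 2 / var)


def point_log_terms(x, weights, means, variances):
    """Per-component log(weight * density) for one observation."""
    return [math.log(w) + log_normal_pdf(x, m, v)
            for w, m, v in zip(weights, means, variances)]


def log_sum_exp(values):
    """Numerically stable log(sum(exp(v)))."""
    top = max(values)
    return top + math.log(sum(math.exp(v - top) for v in values))


def fit_mixture(data, k, n_iter=200, var_floor=1e-6):
    """Fit a k-component 1D Gaussian mixture by EM (illustrative only).

    Returns (weights, means, variances, log_likelihood).
    """
    n = len(data)
    ordered = sorted(data)
    # Assumption: initialise means at evenly spaced quantiles of the data.
    means = [ordered[int((j + 0.5) * n / k)] for j in range(k)]
    centre = sum(data) / n
    spread = max(sum((x - centre) ** 2 for x in data) / n, var_floor)
    variances = [spread] * k
    weights = [1.0 / k] * k

    for _ in range(n_iter):
        # E-step: posterior probability of each component for each point.
        resp = []
        for x in data:
            logs = point_log_terms(x, weights, means, variances)
            total = log_sum_exp(logs)
            resp.append([math.exp(v - total) for v in logs])
        # M-step: re-estimate weights, means, and variances.
        for j in range(k):
            nj = max(sum(r[j] for r in resp), 1e-12)
            weights[j] = nj / n
            means[j] = sum(r[j] * x for r, x in zip(resp, data)) / nj
            variances[j] = max(
                sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, data)) / nj,
                var_floor,  # floor keeps a component from collapsing
            )

    loglik = sum(log_sum_exp(point_log_terms(x, weights, means, variances))
                 for x in data)
    return weights, means, variances, loglik


def bic(data, k):
    """BIC as 2 * log-likelihood - (free parameters) * log(n); larger is better."""
    loglik = fit_mixture(data, k)[3]
    n_params = 3 * k - 1  # (k - 1) weights + k means + k variances
    return 2.0 * loglik - n_params * math.log(len(data))
```

Under these assumptions, computing `bic(data, k)` for each candidate `k` and keeping the largest score selects the number of clusters. Candidate covariance structures could be compared with the same score in the multivariate case.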

Download: PDF

Supplementary Web Site

¹Computer Science and Engineering, Box 352350, University of Washington, Seattle, WA 98195, USA
²Statistics, Box 354322, University of Washington, Seattle, WA 98195, USA
³Insightful Corporation, 1700 Westlake Avenue North, Suite 500, Seattle, WA 98109, USA

E-mail: ruzzo /at/ cs /dot/ washington /dot/ edu
