Abstract:
Motivation: Clustering is a useful exploratory
technique for the analysis of gene expression data. Many
different heuristic clustering algorithms have been proposed in
this context. Clustering algorithms based on probability models
offer a principled alternative to heuristic algorithms. In
particular, model-based clustering assumes that the data is
generated by a finite mixture of underlying probability
distributions such as multivariate normal distributions. The
issues of selecting a "good" clustering method and determining
the "correct" number of clusters are reduced to model selection
problems in the probability framework. Gaussian mixture models
have been shown to be a powerful tool for clustering in many
applications.
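As an illustration of how choosing the number of clusters becomes a model selection problem, the following minimal sketch fits one-dimensional Gaussian mixtures with 1 to 4 components by EM and compares them with the Bayesian Information Criterion (BIC). Everything in it is an assumption made for illustration and is not taken from the text above: the choice of BIC as the selection criterion (here in the lower-is-better form), the toy data, the helper names `fit_gmm_1d` and `bic`, the quantile-based initialization, and the variance floor. Real expression data would call for multivariate mixtures.

```python
import math
import random


def fit_gmm_1d(xs, k, iters=100, var_floor=1e-2):
    """Fit a k-component 1-D Gaussian mixture by EM (illustrative helper).

    Returns (log-likelihood from the final E-step, (weights, means, variances)).
    """
    n = len(xs)
    srt = sorted(xs)
    # Deterministic start: one mean per equal-sized chunk of the sorted data.
    chunks = [srt[i * n // k:(i + 1) * n // k] for i in range(k)]
    means = [sum(c) / len(c) for c in chunks]
    grand = sum(xs) / n
    var0 = max(sum((x - grand) ** 2 for x in xs) / n, var_floor)
    variances = [var0] * k
    weights = [1.0 / k] * k
    loglik = float("-inf")
    for _ in range(iters):
        # E-step: responsibilities, computed in log space for numerical stability.
        resp = []
        loglik = 0.0
        for x in xs:
            logs = [
                math.log(w) - 0.5 * math.log(2 * math.pi * v) - (x - m) ** 2 / (2 * v)
                for w, m, v in zip(weights, means, variances)
            ]
            top = max(logs)
            total = top + math.log(sum(math.exp(l - top) for l in logs))
            loglik += total
            resp.append([math.exp(l - total) for l in logs])
        # M-step: re-estimate weight, mean, and variance of each component.
        for j in range(k):
            nj = max(sum(r[j] for r in resp), 1e-12)
            means[j] = sum(r[j] * x for r, x in zip(resp, xs)) / nj
            spread = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, xs)) / nj
            variances[j] = max(spread, var_floor)  # floor avoids degenerate components
            weights[j] = nj / n
    return loglik, (weights, means, variances)


def bic(loglik, k, n):
    """BIC in the lower-is-better convention."""
    n_params = 3 * k - 1  # k means + k variances + (k - 1) free mixing weights
    return -2.0 * loglik + n_params * math.log(n)


# Toy data (assumed for illustration): two well-separated groups.
rng = random.Random(0)
data = [rng.gauss(0.0, 1.0) for _ in range(300)] + [rng.gauss(10.0, 1.0) for _ in range(300)]

# Score each candidate number of clusters and keep the one with the lowest BIC.
scores = {k: bic(fit_gmm_1d(data, k)[0], k, len(data)) for k in range(1, 5)}
best_k = min(scores, key=scores.get)
print(scores)
print("selected number of clusters:", best_k)
```

The same comparison can be extended to different covariance structures, so that the choice of clustering model and the choice of cluster count are both decided by one score.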
Results: We benchmarked the performance of model-based clustering on several synthetic and real gene expression data sets for which external evaluation criteria were available. The model-based approach showed superior performance on our synthetic data sets, consistently selecting the correct model and the correct number of clusters. On real expression data, the model-based approach produced clusters of quality comparable to those of a leading heuristic clustering algorithm, but with the key advantage of suggesting the number of clusters and an appropriate model. We also explored the validity of the Gaussian mixture assumption on different transformations of real data.
Preprint: PDF
E-mail: ruzzo /at/ cs /dot/ washington /dot/ edu