An empirical study of Principal Component Analysis for clustering gene expression data

Ka Yee Yeung and Walter L. Ruzzo

Technical Report UW-CSE-00-11-03, November 2000.

Abstract: There is a great need to develop analytical methodology to analyze and exploit the information contained in gene expression data. Because of the large number of genes and the complexity of biological networks, clustering is a useful exploratory technique for the analysis of gene expression data. Other classical techniques, such as principal component analysis (PCA), have also been applied to gene expression data. Using different data analysis techniques and different clustering algorithms on the same data set can lead to very different conclusions. Our goal is to study the effectiveness of principal components (PCs) in capturing cluster structure. Specifically, using both real gene expression data sets and synthetic data sets, we empirically compared the quality of clusters obtained from the original data to the quality of clusters obtained from clustering the PCs. Our empirical study showed that clustering with the PCs instead of the original variables does not necessarily improve cluster quality. In particular, the first few PCs (which contain most of the variation in the data) do not necessarily capture most of the cluster structure. We also showed that clustering with PCs has a different impact on different algorithms and different similarity metrics.
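
The comparison described in the abstract can be illustrated with a small sketch. It is not the procedure used in the report: the toy data set, the single-link clustering, the adjusted Rand index as the quality measure, and all function names (single_link, adjusted_rand, pc_scores) are assumptions chosen to keep the example short and dependency-free. The data has two known classes separated along one variable, while the other variable carries most of the variance but no class information.

```python
import math
from collections import Counter
from itertools import combinations


def single_link(points, k=2):
    """Single-link clustering (assumed here for brevity): merge the
    closest pairs of groups until only k groups remain."""
    parent = list(range(len(points)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    edges = sorted((math.dist(points[i], points[j]), i, j)
                   for i, j in combinations(range(len(points)), 2))
    groups = len(points)
    for _, i, j in edges:
        if groups == k:
            break
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj
            groups -= 1
    return [find(i) for i in range(len(points))]


def adjusted_rand(truth, found):
    """Adjusted Rand index between two labelings (1.0 = identical)."""
    def pairs(counts):
        return sum(math.comb(c, 2) for c in counts)

    index = pairs(Counter(zip(truth, found)).values())
    rows = pairs(Counter(truth).values())
    cols = pairs(Counter(found).values())
    expected = rows * cols / math.comb(len(truth), 2)
    return (index - expected) / (0.5 * (rows + cols) - expected)


def pc_scores(data):
    """Project 2-D points onto their two principal components, using the
    closed-form major-axis angle of the 2x2 covariance matrix."""
    n = len(data)
    mx = sum(x for x, _ in data) / n
    my = sum(y for _, y in data) / n
    centered = [(x - mx, y - my) for x, y in data]
    sxx = sum(x * x for x, _ in centered) / (n - 1)
    syy = sum(y * y for _, y in centered) / (n - 1)
    sxy = sum(x * y for x, y in centered) / (n - 1)
    theta = 0.5 * math.atan2(2 * sxy, sxx - syy)
    c, s = math.cos(theta), math.sin(theta)
    pc1 = [(x * c + y * s,) for x, y in centered]
    pc2 = [(-x * s + y * c,) for x, y in centered]
    return pc1, pc2


# Hypothetical toy data: x is wide but uninformative, y separates the classes.
data, truth = [], []
for label, center in enumerate((-2.0, 2.0)):
    for i in range(21):
        data.append((i - 10.0, center + 0.1 * (i % 3 - 1)))
        truth.append(label)

pc1, pc2 = pc_scores(data)


def sum_sq(scores):
    return sum(s[0] ** 2 for s in scores)  # scores are already centered


share_pc1 = sum_sq(pc1) / (sum_sq(pc1) + sum_sq(pc2))
ari_original = adjusted_rand(truth, single_link(data))
ari_pc1 = adjusted_rand(truth, single_link(pc1))
ari_pc2 = adjusted_rand(truth, single_link(pc2))

print(f"variance share of PC1: {share_pc1:.2f}")
print(f"ARI original: {ari_original:.2f}  PC1: {ari_pc1:.2f}  PC2: {ari_pc2:.2f}")
```

On this constructed example the first PC holds most of the variance, yet clustering its scores does not recover the two classes, while the original variables and the second PC do. It is only a contrived illustration of how variance and cluster structure can diverge and says nothing about the magnitude of the effect on real expression data.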

Download: PDF

Supplementary Web Site

E-mail: ruzzo /at/ cs /dot/ washington /dot/ edu