Intro to Cluster Analysis

Clustering refers to a very broad set of techniques for finding subgroups, or clusters, in a data set. When we cluster the observations, we partition the profiles into distinct groups so that profiles within the same group are similar to each other and different from the profiles in other groups. To do this, we must define what makes observations similar or different.

PCA vs Clustering

Both clustering and PCA seek to simplify the data via a small number of summaries, but their mechanisms differ: PCA looks for a low-dimensional representation of the observations that explains a good fraction of the variance, whereas clustering looks for homogeneous subgroups among the observations.

Notation

The input is a data set of p variables measured on n subjects:
image-1668562361706.png
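
As a concrete (made-up) illustration of this layout, the data can be stored as an n × p matrix with one row per subject and one column per variable:

```python
import numpy as np

# Hypothetical data: n = 5 subjects measured on p = 3 variables.
# Row i is the profile of subject i; column j holds variable j.
X = np.array([
    [1.2, 0.4, 3.1],
    [0.9, 0.6, 2.8],
    [4.5, 2.2, 0.3],
    [4.8, 2.0, 0.1],
    [1.1, 0.5, 3.0],
])

n, p = X.shape  # n = 5, p = 3
```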

A distance d(i, j) between two profile vectors i and j must obey several rules:

d(i, j) ≥ 0 (non-negativity)
d(i, j) = 0 when the two profiles are identical
d(i, j) = d(j, i) (symmetry)
d(i, j) ≤ d(i, k) + d(k, j) for any third profile k (triangle inequality)
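
For example, the Euclidean distance between two profiles satisfies all of these rules. A minimal sketch in Python (the numbers are invented for illustration):

```python
import numpy as np
from scipy.spatial.distance import euclidean, pdist, squareform

x_i = np.array([1.2, 0.4, 3.1])
x_j = np.array([4.5, 2.2, 0.3])

# Euclidean distance between two profiles
d_ij = euclidean(x_i, x_j)  # same as np.sqrt(np.sum((x_i - x_j) ** 2))

# All pairwise distances for a small data matrix (rows = subjects)
X = np.array([[1.2, 0.4, 3.1],
              [0.9, 0.6, 2.8],
              [4.5, 2.2, 0.3]])
D = squareform(pdist(X, metric="euclidean"))  # symmetric n x n distance matrix
print(d_ij)
print(D)
```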

Clustering Procedures

K-Means Clustering

We partition the observations into K clusters so as to maximize the similarity within clusters and minimize the similarity between clusters.

We can represent the data in a cluster with a vector of means, which is the overall profile (centroid) of that cluster:
image-1668606369350.png

Suppose we split the data into 2 clusters: C1 with n1 observations of the p variables, and C2 with n2 observations of the p variables. We would then have two different vectors of means representing the centroids of the clusters. The total sum of squares within clusters (WSS) would be:
image-1668606548035.png
We seek to make the WSS as small as possible. The K-means algorithm converges to a local optimum, so it is NOT guaranteed to reach the minimum WSS; ideally one should therefore run it from several different initial values and keep the best solution.
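
A minimal sketch of K-means in Python using scikit-learn (the simulated data and parameter choices are illustrative assumptions, not part of the original notes). Running the algorithm from several random starts (n_init) and keeping the solution with the smallest WSS (reported as inertia_) addresses the local-optimum issue described above:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Simulated data: n = 150 subjects, p = 2 variables, three true groups
X = np.vstack([
    rng.normal(loc=[0, 0], scale=0.5, size=(50, 2)),
    rng.normal(loc=[4, 0], scale=0.5, size=(50, 2)),
    rng.normal(loc=[2, 3], scale=0.5, size=(50, 2)),
])

# K = 3 clusters; n_init = 25 random starts, the best (lowest WSS) solution is kept
km = KMeans(n_clusters=3, n_init=25, random_state=0).fit(X)

print(km.cluster_centers_)  # the three centroid (mean) vectors
print(km.inertia_)          # total within-cluster sum of squares (WSS)
print(km.labels_[:10])      # cluster assignments of the first 10 subjects
```

Since this sketch uses K = 3 and p = 2, it matches the setting of the example pictured below.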

Example with K = 3, p = 2:
image-1668607255893.png

Standardization

When the variables are measured on different scales, the measurement units may bias the cluster analysis, because the Euclidean distance is not scale invariant. Standardizing each variable (for example to mean 0 and standard deviation 1) before clustering removes this dependence on the units.
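
A short sketch of standardizing each variable to mean 0 and standard deviation 1 before running K-means (the use of scikit-learn's StandardScaler is an assumption, not prescribed by the notes):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)

# Two variables on very different scales; the large-scale variable dominates
# the Euclidean distance if the data are left unstandardized
X = np.column_stack([
    rng.normal(0, 1000, size=100),
    rng.normal(0, 1, size=100),
])

Xz = StandardScaler().fit_transform(X)  # each column now has mean 0, sd 1

labels_raw = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
labels_std = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(Xz)
```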

Hierarchical Clustering

This is an alternative approach that does not require fixing the number of clusters in advance. The algorithm essentially rearranges the profiles so that similar profiles are displayed next to each other in a tree (dendrogram), while dissimilar profiles end up in different branches.

image-1668611245622.png

image-1668611376451.png
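
A minimal sketch of agglomerative hierarchical clustering and the resulting dendrogram using SciPy (the simulated data and the choice of complete linkage are illustrative assumptions):

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

rng = np.random.default_rng(2)

# Small simulated data set: 20 profiles measured on 4 variables
X = np.vstack([
    rng.normal(0, 1, size=(10, 4)),
    rng.normal(5, 1, size=(10, 4)),
])

# Agglomerative clustering with Euclidean distance and complete linkage
Z = linkage(X, method="complete", metric="euclidean")

# Similar profiles end up on neighbouring leaves; dissimilar ones in different branches
dendrogram(Z)
plt.show()
```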

To do this, we must define the similarity (distance) not only between individual profiles but also between clusters:

Similarity Between Clusters

Complete-Linkage Clustering: the distance between two clusters is the largest pairwise distance between a profile in one cluster and a profile in the other
Single-Linkage Clustering: the distance between two clusters is the smallest such pairwise distance
Average-Linkage Clustering: the distance between two clusters is the average of all pairwise distances between profiles in the two clusters
Centroid Clustering: the distance between two clusters is the distance between their centroids (mean vectors)

Complete and average linkage are similar, but complete linkage is faster because it does not require recalculation of the similarity matrix at each step.

image-1668612158830.png
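
In SciPy the linkage criterion is simply an argument to the same routine, and the tree can then be cut into a chosen number of clusters with fcluster (a sketch with simulated data; the function names are SciPy's, the rest is illustrative):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(3)
X = np.vstack([rng.normal(0, 1, size=(10, 4)),
               rng.normal(5, 1, size=(10, 4))])

# Same data, four different definitions of between-cluster distance
trees = {m: linkage(X, method=m) for m in ["complete", "single", "average", "centroid"]}

# Cut each tree so that it yields 2 clusters and inspect the labels
for method, Z in trees.items():
    labels = fcluster(Z, t=2, criterion="maxclust")
    print(method, labels)
```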

Detection of Clusters

Inspection of a QQ-plot against a reference distribution can indicate whether clusters exist in the data.

image-1668612379283.png

We can also detect the number of clusters using extreme percentiles of the reference distribution, for example its 95th percentile. A related display colors each entry of the data set so that the colors represent the standardized difference of the cell intensity from a baseline. Typically, columns are samples and rows are variables.
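
One common way to build such a reference distribution (a sketch in the spirit of the gap statistic; not necessarily the exact procedure from these notes) is to simulate data with no cluster structure, recompute the WSS for each candidate K, and compare the observed WSS with an extreme percentile of the simulated values:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(4)

# Hypothetical observed data with three groups (p = 2)
X = np.vstack([rng.normal(m, 0.5, size=(50, 2)) for m in ([0, 0], [4, 0], [2, 3])])

def wss(data, k):
    return KMeans(n_clusters=k, n_init=10, random_state=0).fit(data).inertia_

ks = range(1, 7)
obs = [wss(X, k) for k in ks]

# Reference distribution: data simulated uniformly over the observed range (no clusters)
ref = []
for _ in range(50):
    Xr = rng.uniform(X.min(axis=0), X.max(axis=0), size=X.shape)
    ref.append([wss(Xr, k) for k in ks])
ref = np.array(ref)

# Compare the observed WSS with an extreme (here 5th) percentile of the reference WSS
lower = np.percentile(ref, 5, axis=0)
for k, o, l in zip(ks, obs, lower):
    print(k, round(o, 1), round(l, 1))
```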

image-1668612715262.png

image-1668612745091.png
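
The colored display described above is essentially a heatmap. A minimal sketch using seaborn's clustermap (an assumed tool choice), which standardizes each row and attaches dendrograms to both the rows and the columns:

```python
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

rng = np.random.default_rng(5)

# Rows = variables, columns = samples, as described above (simulated values)
data = pd.DataFrame(
    rng.normal(size=(20, 12)),
    index=[f"var{i}" for i in range(20)],
    columns=[f"sample{j}" for j in range(12)],
)

# z_score=0 standardizes each row; colors show the standardized deviation from the row mean
sns.clustermap(data, method="average", metric="euclidean", z_score=0, cmap="RdBu_r")
plt.show()
```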

 

