Unsupervised Learning: Clustering
This is the 14th in a series of class notes as I go through the Georgia Tech/Udacity Machine Learning course. The class textbook is Machine Learning by Tom Mitchell.
Defining the basic Clustering Problem
We want to take a set of objects and put them into groups.
Formally,
Given:
- set of objects X
- inter object distances D(x,y)
Output:
- Partition Pd(x) = Pd(y)
if x & y in same cluster
D
, The distance algorithm, can be any similarity metric you choose depending on your domain knowledge.
Single Linkage Clustering
The simplest clustering algorithm is Single Linkage.
- start by considering each object a cluster (so you have n objects and n clusters)
- define intercluster distance as the distance between the closest 2 points in the two clusters
- merge the two closest clusters
- repeat
n-k
times to arrive atk
clusters.
This helps you build a Hierarchical Agglomerative Cluster Structure. By cutting off the tree at k
levels from the top you can get a predetermined number of clusters, which is handy.
Instead of the closest 2 points, you can also modify this algorithm to use the furthest 2 points or the average (mean, median) point of each cluster.
SLC is deterministic, and also generates a minimum spanning tree. The running time is O(n^3)
. However it generates some nonintuitive clusters because of the “closest points” rule. So we need something better.
K-Means Clustering
Try:
- Pick k “centers” at random
- Each center “claims” all its closest points
- recompute centers by averaging the clustered points
- repeat until converged (recomputing doesn’t have any change)
Centers do not have to be any of the points, its just the “center” of the cluster.
K-Means can be viewed as an optimization problem, in that we are always trying to get better and better answers with each round. To use optimization terminology, we are saying that we can score each set of configurations (the centers), and we try to move from neighborhood to neighborhood trying to improve.
If we assume we know the “true” partitions, P(x)
, we can sum the squared errors between center of P(x) - x
for an Error score to minimize. “Neighborhood” is a bit less intuitive - its the set of possible moves at the end of a round, basically, you can either move the partition, or you can move the centers.
You can prove that K-means Clustering:
- always improves or statys the same at each step (when you move partitions, you only move if error goes down, and when you move centers, by recomputing the center you immediately jump to the least error possible given the partitions) aka “monotonically non-increasing in Error”
- converges (because you have a finite number of configurations) as long as you have a way to do consistent tie breaking
If you recall the Randomized Optimization chapter, this sounds a lot like hill-climbing, where you pick and move to neighbors based on their score. This also means it can get stuck in local optima (bad clustering based on the random starting point), so the only fix is random restarts.
There’s also an edge case where if a point is somehow equidistant between two perfectly converged clusters, it would nondeterministically be either part of one or the other. To deal with that we’ll use an algorithm where sharing is allowed…
Soft Clustering (Best!) also known as Expectation Maximization or Gaussian Mixing
The main trick here is to do like we did with MIMIC and assume points come from a probability distribution. Try:
- Assume the data was generated by 1 of
k
possible Gaussians with known variance - Sample Xi from that Gaussian
- Repeat n times
The task is to find a hypothesis h=<u1, u2,... uk>
of “k means” that maximizes the probability of the data (aka Maximum Likelihood). The Max Likelihood mean of the gaussian is just the mean of the data, so there’s nothing really new here. However you can use k
Hidden Variables ([0,1]
membership in Partition 1 to k
) run the algorithm in a similar way to k-Means Clustering. Instead of always improving in the Error metric, you’re improving in the probabilistic metric. Running the algorithm is a process of expectation maximization.
Properties of Expectation Maximization:
- monotonically non-decreasing likelihood
- theoretically might not converge (because Gaussian has an infinite extent, but in practice it does)
- cannot diverge
- can get stuck
- works with any distribution
Practical issue: Picking k
: Elbows, Silhouettes, and more
Although you can lean on domain knowledge to pick k
, there can algorithms to help pick the “Elbow”, the “best” point given a tradeoff between number of clusters vs explanatory power.
You can choose any number of “good” metrics to quantify cluster quality: adj. Mutual Information, Homogeneity, Completeness, and V_Measure (see original paper).
Many ways to do this exist, see the Wikipedia article for more.
Silhouette profiles can help visually inspect the appropriateness of clusters.
Practical issue: Visualizing Clusters
Using t-SNE
is very helpful:
A better alternative may have emerged in Feb 2018: UMAP, and the benefits are worth considering. Here is the author explaining the idea:
(theoretical aside) Kleinberg Impossibility Theorem
There are three desirable properties of clustering algorithms:
- Richness (can describe any clustering as long as we tweak variables)
- Scale invariance (works the same regardless of units)
- Consistency (clusters the same every time)
SLC is consistent and but not Rich. You can modify SLC’s stopping point to make it richer, but you will lose scale invariance.
So on and so forth. This is a proven theorem, that these three properties are mutually contradictory.
Jon Kleinberg’s original paper is called An impossibility theorem for clustering.
Next in our series
Further notes on this topic:
- The Expectation Maximization Algorithm
- K-Means vs Gaussian Mixture Models - a nice slide deck
- Supervised Learning of Gaussian Mixture Models for Visual Vocabulary Generation - a useful application
Hopefully that was a good introduction to Clustering. I am planning more primers and would love your feedback and questions on:
- Overview
- Supervised Learning
- Unsupervised Learning
- Randomized Optimization
- Information Theory
- Clustering
- Feature Selection
- Feature Transformation - week of Mar 11
- Reinforcement Learning
- Markov Decision Processes - week of Mar 25
- “True” RL - week of Apr 1
- Game Theory - week of Apr 15