Tutorial Time: 20 Minutes
K Means Clustering is a way of finding K groups in your data. This tutorial will walk you a simple example of clustering by hand / in excel (to make the calculations a little bit faster).
Customer Segmentation K Means Example
A very common task is to segment your customer set in to distinct groups. You’re hoping that you find a set of customers that behave similarly and thus can be given tailored marketing pieces. Using K Means, we can do just that.
Download Data: CSV | Excel with Possible Solution
Data Exploration: 20 customers. 4 products + customer ID. Binary where 1 = purchased. 0 = no purchase.
- 9 / 20 (45%) have purchased product A
- 9 / 20 (45%) purchased product B
- 8 / 20 (40%) purchased product C
- 11/ 20 (55%) purchased product D
We could also calculate the probability of each product being bought with another given product (e.g. 44% of the time a customer will by both Product A and B). While a good step to understanding your data, as a toy example it’s not necessary for this analysis. Since the data is binary, K Means will do something like this for us.
Applying K Means: Based on the definition of K Means, we’re going to pick a random K points and then iteratively assign all other points to one of the K clusters by looking for the smallest distance. Based on advice from Andrew Ng’s Machine Learning course (Coursera), I like to use K actual data points to make sure you don’t end up in some empty space in your data.
In this case, let’s use K = 3 and see if we can find three interesting clusters. The Excel file linked above has a walkthrough
I chose customer number 4, 19, and 20 randomly. To start off we’ll calculate the distance of each customer from these three starting centroids.
We’ll use Euclidean Distance to measure how far apart these customers are from the centroids.
We’ll start with C# 1 and determine which cluster it belongs to
[table]
Cluster, ProductA,ProductB, ProductC, ProductD
1,1,1,0,1
2,0,1,1,1
3,0,1,0,1
[/table]
Iteration # 1
C# 1 has the values 0, 0, 1, 1. Now we’ll calculate the Euclidean distance by doing SQRT[(Cluster.ProductA-Customer.ProductA)^2+(Cluster.ProductB-Customer.ProductB)^2+(Cluster.ProductC-Customer.ProductC)^2+(Cluster.ProductD-Customer.ProductD)^2]
- Cluster 1: SQRT[ (1-0)^2+(1-0)^2+(0-1)^2+(1-1)^2] =SQRT(1+1+1+0) = ~1.73
- Cluster 2: SQRT[ (0-0)^2+(1-0)^2+(1-1)^2+(1-1)^2] =SQRT(0+1+0+0) = 1
- Cluster 3: SQRT[ (0-0)^2+(1-0)^2+(0-1)^2+(1-1)^2] = SQRT(0+1+1+0) = ~1.41
For the first iteration, Customer # 1 belongs to Cluster 2 since its distance to the centroid is smallest.
C# 7 has the values 1, 0, 1, 1. We’ll calculate which cluster this customer belongs to as well.
- Cluster 1: SQRT[ (1-1)^2+(1-0)^2+(0-1)^2+(1-1)^2] =SQRT(0+1+1+0) = ~1.41
- Cluster 2: SQRT[ (0-1)^2+(1-0)^2+(1-1)^2+(1-1)^2] =SQRT(1+1+0+0) = ~1.41
- Cluster 3: SQRT[ (0-1)^2+(1-0)^2+(0-1)^2+(1-1)^2] = SQRT(1+1+1+0) = ~1.73
In this case, Cluster 1 and Cluster 2 are tied so I’m going to arbitrarily decide it belongs to cluster 1. Sometimes you just need to break a tie and I like to break it by taking the earliest cluster.
Let’s pretend you did this for each of the 20 customers (resulting in 60 distance calculations). With the same tie breaking method I used (Cluster 1 beats 2 and 3, Cluster 2 beats 3) you should get these results for the first iteration.
[table]
Cust#, Cluster ID, Cust#, Cluster ID
1,2,11,2
2,3,12,1
3,1,13,1*
4,1,14,3
5,1,15,3
6,2,16,2
7,1*,17,1
8,1,18,3
9,1,19,2
10,2,20,3
[/table]
Customers 7 and 13 resulted in a tie and Cluster 1 was chosen arbitrarily.
Using these new cluster assignments, we have to move the centroid to the new center of the points assigned to that cluster.
You should get the following for your new centroids after this first iteration.
[table]
Cluster,ProductA,ProductB,ProductC,ProductD
1,1,0.444,0.222,0.2222
2,0,0.167,1,0.833
3,0,0.8,0,0.8
[/table]
Using these new centers, you’ll reassign each customer to the clusters.
Iteration # 2
C# 1 has the values 0, 0, 1, 1. C# 1 belonged to cluster 1 during the first iteration. Using the new centroids, here are the distance calculations.
- Cluster 1: SQRT[ (1-0)^2+(0.444-0)^2+(0.222-1)^2+(0.222-1)^2] =~1.55
- Cluster 2: SQRT[ (0-0)^2+(0.167-0)^2+(1-1)^2+(0.833-1)^2] = ~0.24
- Cluster 3: SQRT[ (0-0)^2+(0.8-0)^2+(0-1)^2+(0.8-1)^2] = ~1.30
For the second iteration, C# 1 still belongs to cluster 1
C# 7 has the values 1, 0, 1, 1. Previous iteration C# 7 belonged to cluster 1.
- Cluster 1: SQRT[ (1-1)^2+(0.444-0)^2+(0.222-1)^2+(0.222-1)^2] =~1.19
- Cluster 2: SQRT[ (0-1)^2+(0.167-0)^2+(1-1)^2+(0.833-1)^2] = ~1.03
- Cluster 3: SQRT[ (0-1)^2+(0.8-0)^2+(0-1)^2+(0.8-1)^2] = ~1.64
For the second iteration, C# 7 now belongs to cluster 2. This is the only customer to change its cluster assignment during this iteration. However, since there is at least one change, we must iterate until the cluster assignments are stabel OR we reach some maximum number of iterations.
You should get the following for your new centroids after this second iteration.
[table]
Cluster,ProductA,ProductB,ProductC,ProductD
1,1.000,0.500,0.125,0.125
2,0.143,0.143,1.000,0.857
3,0.000,0.800,0.000,0.800
[/table]
Iteration # 3
C# 1 has the values 0, 0, 1, 1. C# 1 belonged to cluster 1 during the second iteration. Using the new centroids, here are the distance calculations.
- Cluster 1: SQRT[ (1-0)^2+(0.5-0)^2+(0.125-1)^2+(0.125-1)^2] =~1.67
- Cluster 2: SQRT[ (0.143-0)^2+(0.143-0)^2+(1-1)^2+(0.857-1)^2] = ~0.25
- Cluster 3: SQRT[ (0-0)^2+(0.8-0)^2+(0-1)^2+(0.8-1)^2] = ~1.30
For the third iteration, C# 1 still belongs to cluster 1
C# 7 has the values 1, 0, 1, 1. Previous iteration C# 7 belonged to cluster 2.
- Cluster 1: SQRT[ (1-1)^2+(0.5-0)^2+(0.125-1)^2+(0.125-1)^2] =~1.33
- Cluster 2: SQRT[ (0.143-1)^2+(0.143-0)^2+(1-1)^2+(0.857-1)^2] = ~0.88
- Cluster 3: SQRT[ (0-1)^2+(0.8-0)^2+(0-1)^2+(0.8-1)^2] = ~1.64
Repeat this for every customer for iteration 3 and you’ll find that the cluster assignments are stable (i.e. no customer changes clusters).
The final centroids are:
[table]
Cluster,ProductA,ProductB,ProductC,ProductD
1,1.000,0.500,0.125,0.125
2,0.143,0.143,1.000,0.857
3,0.000,0.800,0.000,0.800
[/table]
You now have done K Means clustering by hand!
Going Further
There’s more to K Means than just running it once and accepting the results.
- You should try different values of K. You may think 3 is the right number but maybe not.
- You should evaluate the clustering. Examine how close are the data points to the centroid and how far apart are the centroids to each other.
- You should run K Means multiple times. Since K Means is dependent on where your initial centroids are placed, you should try randomly selecting points a dozen to a hundred times to see if you’re getting similar results (both cluster centers, distance of points to center, and distance between each centroid).
- Are you sure K Means is the right way? K Means is known to be sensitive to outliers.
K Means is a common but useful tool in your analytical toolbelt. However, you’d never want to do this all by hand. Take a look at using R for K Means or using SciPy / SciKitLearn for K Means.