K-means¶

Distance function L2 type.
C = \({C_1,C_2,.......,C_k}\), cost(C) = \(\sum_x\) \(\underset{c_x}{min}\) \(d(x,Cx)^2\),
where d is euclidean distance. \(C_x\) is chosen as the nearest point to data X.

Alternative View¶

Objective : Define best k-clustering of the data by -

Minimizing the variance of each cluster. i.e, the cost of K-mean is the variance of the point in each cluster.
The mean \(C_i\) = \(\frac{1}{|C_i|}\) \(\sum_{x \epsilon C_i}\)x is the expected location of a point.
Therefore, variance of \(C_i\) = \(\sum_{x \epsilon C_i}x\) \(||x - C_i||^2\)

Connection with max likelihood estimation :

Consider the likelihood of set points if center is u, co-variance is \(𝜎^2\)I. Take Gaussian PDF.

p(\(x_1,x_2,.......x_k | 𝜇,\sigma\)) \(\propto\) \(\underset{i}{\prod}\) exp(-\(\frac{|x_i-\mu|^2}{𝜎^2}\)) \(\propto\) exp(-\(\underset{i}{\sum}\) \(\frac{|x_i-\mu|^2}{𝜎^2}\))

Taking log of probability likelihood,

log(p(\(x_1,x_2,.......x_k | 𝜇,\sigma\))) \(\propto\) \(\underset{i}{\sum}\) \(|X_i - 𝜇^2|\) + function(\(\sigma\),dimension).

Likelihood of gaussian is exactly the same as K-means cost of each set of data points.

Choosing Optimal position for Center

Given partition C, C is the proposed center then C must be equal to u. i.e, if optimal center is given then optimal partition can be determined and vice versa.

Lloyd’s algorithm (Iterative algorithm)

Steps:-

Find current centers of partition.
Assign Points to nearest centers.
Recalculate centers.
Iterate the above 3 steps.

Stopping Criteria:-

When no points change clusters.
When cluster centers don’t shift much.

Analysis

K centers, N points, d dimensions
Time taken to calculate the new cluster - O(kNd)
Time for the new center - O(Nd).

CS328-2022 Notes

K-means

Contents

K-means¶

Alternative View¶