[Unsupervised Clustering #1] How mean can a K-Mean be?

Sandeep Nayak - Sep 10 - - Dev Community

Intuition and simple example

K-Means is an unsupervised machine learning algorithm used for clustering data into groups based on similarity. It aims to partition data points into 'k' clusters, where each cluster represents a group of data points with similar characteristics.

Intuition:
Imagine you have a dataset of points scattered in space. K-Means works by finding 'k' cluster centers in the data such that each point is assigned to the cluster with the nearest center. The centers represent the "average" of the points in their cluster.

Step-by-Step Explanation:

1. Initialization:

  • Start with your dataset of data points.
  • Choose a value for 'k,' which represents the number of clusters you want to create.
  • Randomly initialize 'k' cluster centers in the feature space.

2. Assignment Step:

  • For each data point, calculate the distance (e.g., Euclidean distance) to all cluster centers.
  • Assign the data point to the cluster whose center is closest (i.e., the cluster that minimizes the distance).

3. Update Step:

  • Recalculate the cluster centers by taking the mean of all data points assigned to each cluster.
  • The new centers become the centroids for their respective clusters.

4. Repeat Assignment and Update:

  • Repeat the Assignment and Update steps until one of the stopping criteria is met (e.g., a maximum number of iterations or convergence, where cluster assignments and centers no longer change significantly).
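The four steps above can be sketched directly in NumPy. This is a minimal from-scratch implementation (function name, default parameters, and the empty-cluster handling are my own simplifications, not part of any library):

```python
import numpy as np

def kmeans(X, k, max_iter=100, tol=1e-6, rng=None):
    """Plain NumPy K-Means: initialize, assign, update, repeat."""
    rng = np.random.default_rng(rng)
    # 1. Initialization: pick k distinct data points as the starting centers
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # 2. Assignment: Euclidean distance from every point to every center
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # 3. Update: each center becomes the mean of its assigned points
        # (note: a cluster left with no points would need special handling)
        new_centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # 4. Stop when the centers no longer move significantly
        if np.linalg.norm(new_centers - centers) < tol:
            centers = new_centers
            break
        centers = new_centers
    return labels, centers

# Running it on the five points used later in this article:
X = np.array([[2.0, 3.0], [2.5, 3.5], [5.0, 5.0], [5.5, 4.5], [6.0, 6.0]])
labels, centers = kmeans(X, k=2, rng=0)
```

For this small, well-separated dataset the algorithm converges to the same two clusters regardless of which points are picked as initial centers; in general, K-Means can converge to different local optima depending on initialization.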

Mathematical Explanation:

Let's illustrate K-Means with a simple mathematical example. Suppose we have a dataset of 2D points:

Data Point     X     Y
-----------------------
Point 1       2.0   3.0
Point 2       2.5   3.5
Point 3       5.0   5.0
Point 4       5.5   4.5
Point 5       6.0   6.0


And let's say we want to find 2 clusters (k=2):

Initialization:
Randomly initialize two cluster centers, e.g., Center 1 at (2.0, 3.0) and Center 2 at (5.0, 5.0).

Assignment Step:
Calculate the distances and assign points to the nearest cluster center:

  • Point 1 is closer to Center 1.
  • Point 2 is closer to Center 1.
  • Point 3 is closer to Center 2.
  • Point 4 is closer to Center 2.
  • Point 5 is closer to Center 2.

Update Step:
Recalculate the cluster centers:

  • Center 1: (2.25, 3.25) - the mean of points 1 and 2.
  • Center 2: (5.5, 5.17) - the mean of points 3, 4, and 5.

Repeat Assignment and Update:
Repeat the Assignment and Update steps until convergence (cluster assignments and centers no longer change significantly).
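One full iteration of this worked example can be checked numerically. The snippet below reproduces the assignment step (nearest center by Euclidean distance) and the update step (per-cluster mean), confirming that the new centers are (2.25, 3.25) and (5.5, ≈5.17):

```python
import numpy as np

points = np.array([[2.0, 3.0], [2.5, 3.5], [5.0, 5.0], [5.5, 4.5], [6.0, 6.0]])
centers = np.array([[2.0, 3.0], [5.0, 5.0]])  # initial centers from the example

# Assignment step: distance from every point to each center, pick the nearest
dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
labels = dists.argmin(axis=1)
print(labels)  # [0 0 1 1 1] -> points 1-2 in cluster 1, points 3-5 in cluster 2

# Update step: each new center is the mean of its cluster's points
new_centers = np.array([points[labels == j].mean(axis=0) for j in range(2)])
print(new_centers)  # Center 1 = (2.25, 3.25), Center 2 = (5.5, ~5.17)
```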

Python Code Example:

Here's a simplified Python code example for K-Means clustering using the scikit-learn library:

import matplotlib.pyplot as plt
import numpy as np
from sklearn.cluster import KMeans

# Sample dataset
X = np.array([[2.0, 3.0], [2.5, 3.5], [5.0, 5.0], [5.5, 4.5], [6.0, 6.0]])

# Create a K-Means clusterer with k = 2
kmeans = KMeans(n_clusters=2, init='random', n_init=10, max_iter=100, random_state=42)

# Fit the model to the data
kmeans.fit(X)

# Get cluster assignments and centers
cluster_assignments = kmeans.labels_
cluster_centers = kmeans.cluster_centers_

# Plot the original points before clustering
plt.figure(figsize=(10, 5))

# Before clustering
plt.subplot(1, 2, 1)
plt.scatter(X[:, 0], X[:, 1], color='blue', label='Data Points')
plt.title('Before Clustering')
plt.xlabel('X1')
plt.ylabel('X2')
plt.legend()

# After clustering with cluster centers
plt.subplot(1, 2, 2)
plt.scatter(X[:, 0], X[:, 1], c=cluster_assignments, cmap='rainbow', label='Data Points')
plt.scatter(cluster_centers[:, 0], cluster_centers[:, 1], color='black', marker='x', s=200, label='Cluster Centers')
plt.title('After Clustering')
plt.xlabel('X1')
plt.ylabel('X2')
plt.legend()

# Show plots
plt.tight_layout()
plt.show()


Use Cases:

  • Customer Segmentation: Segmenting customers based on their behavior for targeted marketing.
  • Image Compression: Reducing the number of colors in an image by clustering similar colors together.
  • Anomaly Detection: Identifying anomalous data points as those that don't belong to any cluster.
  • Document Clustering: Grouping similar documents together in text analysis.
  • Recommendation Systems: Clustering users or items for collaborative filtering-based recommendations.
  • Image Segmentation: Separating an image into meaningful regions or objects based on similarity.
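To make the image-compression use case concrete, here is a sketch of color quantization: each pixel is treated as a 3-D point in RGB space, the colors are clustered, and every pixel is replaced by its cluster's centroid color. The 4x4 random array below is a hypothetical stand-in for real pixel data; with a real image you would reshape its pixel array the same way:

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical 4x4 RGB "image" (random values standing in for real pixels)
rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(4, 4, 3)).astype(float)

# Treat each pixel as a 3-D point and cluster the colors into a small palette
pixels = image.reshape(-1, 3)
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(pixels)

# Replace every pixel with its cluster's centroid color
compressed = kmeans.cluster_centers_[kmeans.labels_].reshape(image.shape)
# The result has the same shape but at most 2 distinct colors
```

Storing only the palette (the cluster centers) plus one small label per pixel is what yields the compression.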