The homework for this week consisted of implementing in Python an algorithm called k-means clustering, that is an automatized way of slicing a dataset into a pre-defined number of groups or clusters.

Here is the code in GitHub and I copied it below.

This was a good challenge, mainly because of the use of Python (getting used to the ways of looping), numpy (making sure I’m using the right dimensions) and matplotlib (figuring out how to use it). I tried to keep calm and to build one functional block at a time. At some point I had some problem with assignments between numpy arrays – apparently when copying them we have to use the copyto() function, otherwise a reference mess starts to appear (similar to the case when working with PVectors in Processing or p5.Vector in p5.js) .

I used randomly generated points, but now that I have this I want to try it with another dataset that makes more sense.

I was able to create a figure for each iteration of the process, and afterwards I created animated GIFs showing three different examples:

Here’s my code:

import numpy as np import matplotlib.pyplot as plt # # PARAMETERS AND VARIABLES N = 500 # Number of data points D = 2 # Number of dimensions of data points C = 8 # Number of clusters pointSize = 3 # Data point size in the plot # X = np.random.random((N,D)) # Set of data points V = np.random.random((C,D)) # Set of cluster centers clusterColors = np.random.random( (C,3) ) # Colors for each cluster assignments = np.random.randint(0,C, N)# List of correspondences between X and V figCount = 0 # # # FUNCTIONS # Return distance between two points def distance(p1,p2): # calculate vector difference difference = p2-p1 # calculate magnitude of difference (i.e. Euclidean distance) return np.linalg.norm(difference) # # Return index of closest cluster center def closestClusterCenter(p): # Prepare variables of minimum distance minDistance = 10 minIndex = 0 for index, clusterCenter in enumerate(V): d = distance(p,clusterCenter) if d < minDistance: minDistance = d minIndex = index return minIndex # # Check distances to cluster centers and updates assignments # Returns true if there was a change in assignments def updateAssignments(): # Store a copy of the assignments before the process originalAssignments = np.zeros(N) np.copyto(originalAssignments,assignments) # For each point with index in X... for index,point in enumerate(X): # get the closest cluster center c = closestClusterCenter(point) # Assign it to the assignments list assignments[index] = c # Returns True if there was a change return not np.array_equal(originalAssignments, assignments) # # Make cluster centers the centroid of the points that belong to it def updateClusterCenters(): # Count how many points are there per cluster center counts = np.zeros(C) # Initialize a place where to sum the points sums = np.zeros( (C, D) ) # For each point with index in X for index, point in enumerate(X): # Get currently assigned cluster center index ccIndex = assignments[index] # Increment counter for that cluster center counts[ccIndex] += 1 # Vector sum of the points in that cluster sums[ccIndex] += point # We can know calculate the centroids for index, clusterCenter in enumerate(V): # Calculate centroid newClusterCenter = sums[index] / counts[index] # Copy new cluster center to cluster center np.copyto(clusterCenter, newClusterCenter) # # PLOTTING! def plot(): plt.clf() # For each point with index in X... for index, point in enumerate(X): # Get the cluster it belongs to clusterNumber = int(assignments[index]) # Get the corresponding color color = clusterColors[ clusterNumber ] # Add to the scatter plot with the specific color plt.scatter(point[0],point[1], c=color, s=pointSize) # For each cluster center... for index, clusterCenter in enumerate(V): # Get corresponding color color = clusterColors[index] # Add to the scatter plot with a star plt.scatter(clusterCenter[0], clusterCenter[1], c=color, marker='*') # Generate a name for the image figname = 'Figure%02d.png' % (figCount) # Save it as an image plt.savefig(figname) # # # SCRIPT # # Repeat the process of updating assignments while there are changes while updateAssignments(): # After updating assignments, update the cluster centers updateClusterCenters() # Make a plot plot() figCount += 1 print( "Iteration %d..." % figCount ) # Show the plot plt.show()

## Deja un comentario