Skip to main content
Commonmark migration
Source Link

I have implemented the K-Mean clustering Algorithm in Numpy:

from __future__ import division
import numpy as np

def kmean_step(centroids, datapoints):
    ds = centroids[:,np.newaxis]-datapoints
    e_dists =  np.sqrt(np.sum(np.square(ds),axis=-1))
    cluster_allocs = np.argmin(e_dists, axis=0)
    clusters = [datapoints[cluster_allocs==ci] for ci in range(len(centroids))]
    new_centroids = np.asarray([(1/len(cl))*np.sum(cl, axis=0) for cl in clusters if len(cl)>0])
    return new_centroids, clusters

def kmean(centroids, datapoints, n_gen, do_print=True):
    clusters=None
    for ii in range(1,n_gen):
        new_centroids, clusters = kmean_step(centroids, datapoints)
        if np.array_equal(centroids,new_centroids):
            print "After %i generations stabalised" % ii
            break
        else:
            centroids = new_centroids
            if do_print:
                print_clusters(clusters,centroids)
    return clusters

I have left out the print_clusters method. It as expected, displays the current clusters using coloured plots in matplotlib.

###Issues

Issues

  • kmean_step seems really ugly for an algorithm that can be described by:
    • Recluster the dataset around the centroids;
    • Recenter the centroids with their clusters.
  • I am a bit unsure about the termination conditions in the kmean function. I have read that they are:
    • Centroids don't move,
    • or Cluster membership does not change,
    • or number of iterations exceeds preset max.
  • I have the first and last of those, but the second seems like it will happen automatically by the first. If the cluster membership doesn't change, then the center doesn't move. Do I really need to include it? (it is surprisingly hard to implement) Or is it only included in the algorithm description for clarity?
  • Is this an optimal way to implement it? To fully take advantage of vectorized operations? I am a bit iff about having 2 list comprehensions, particularly since one immediately cases to a ndarrray.

I have implemented the K-Mean clustering Algorithm in Numpy:

from __future__ import division
import numpy as np

def kmean_step(centroids, datapoints):
    ds = centroids[:,np.newaxis]-datapoints
    e_dists =  np.sqrt(np.sum(np.square(ds),axis=-1))
    cluster_allocs = np.argmin(e_dists, axis=0)
    clusters = [datapoints[cluster_allocs==ci] for ci in range(len(centroids))]
    new_centroids = np.asarray([(1/len(cl))*np.sum(cl, axis=0) for cl in clusters if len(cl)>0])
    return new_centroids, clusters

def kmean(centroids, datapoints, n_gen, do_print=True):
    clusters=None
    for ii in range(1,n_gen):
        new_centroids, clusters = kmean_step(centroids, datapoints)
        if np.array_equal(centroids,new_centroids):
            print "After %i generations stabalised" % ii
            break
        else:
            centroids = new_centroids
            if do_print:
                print_clusters(clusters,centroids)
    return clusters

I have left out the print_clusters method. It as expected, displays the current clusters using coloured plots in matplotlib.

###Issues

  • kmean_step seems really ugly for an algorithm that can be described by:
    • Recluster the dataset around the centroids;
    • Recenter the centroids with their clusters.
  • I am a bit unsure about the termination conditions in the kmean function. I have read that they are:
    • Centroids don't move,
    • or Cluster membership does not change,
    • or number of iterations exceeds preset max.
  • I have the first and last of those, but the second seems like it will happen automatically by the first. If the cluster membership doesn't change, then the center doesn't move. Do I really need to include it? (it is surprisingly hard to implement) Or is it only included in the algorithm description for clarity?
  • Is this an optimal way to implement it? To fully take advantage of vectorized operations? I am a bit iff about having 2 list comprehensions, particularly since one immediately cases to a ndarrray.

I have implemented the K-Mean clustering Algorithm in Numpy:

from __future__ import division
import numpy as np

def kmean_step(centroids, datapoints):
    ds = centroids[:,np.newaxis]-datapoints
    e_dists =  np.sqrt(np.sum(np.square(ds),axis=-1))
    cluster_allocs = np.argmin(e_dists, axis=0)
    clusters = [datapoints[cluster_allocs==ci] for ci in range(len(centroids))]
    new_centroids = np.asarray([(1/len(cl))*np.sum(cl, axis=0) for cl in clusters if len(cl)>0])
    return new_centroids, clusters

def kmean(centroids, datapoints, n_gen, do_print=True):
    clusters=None
    for ii in range(1,n_gen):
        new_centroids, clusters = kmean_step(centroids, datapoints)
        if np.array_equal(centroids,new_centroids):
            print "After %i generations stabalised" % ii
            break
        else:
            centroids = new_centroids
            if do_print:
                print_clusters(clusters,centroids)
    return clusters

I have left out the print_clusters method. It as expected, displays the current clusters using coloured plots in matplotlib.

Issues

  • kmean_step seems really ugly for an algorithm that can be described by:
    • Recluster the dataset around the centroids;
    • Recenter the centroids with their clusters.
  • I am a bit unsure about the termination conditions in the kmean function. I have read that they are:
    • Centroids don't move,
    • or Cluster membership does not change,
    • or number of iterations exceeds preset max.
  • I have the first and last of those, but the second seems like it will happen automatically by the first. If the cluster membership doesn't change, then the center doesn't move. Do I really need to include it? (it is surprisingly hard to implement) Or is it only included in the algorithm description for clarity?
  • Is this an optimal way to implement it? To fully take advantage of vectorized operations? I am a bit iff about having 2 list comprehensions, particularly since one immediately cases to a ndarrray.
.
Source Link

I have implemented the K-Mean clustering Algorithm in Numpy:

from __future__ import division
import numpy as np

def kmean_step(centroids, datapoints):
 
    ds = centroids[:,np.newaxis]-datapoints
    e_dists =  np.sqrt(np.sum(np.square(ds),axis=-1))
    cluster_allocs = np.argmin(e_dists, axis=0)
    clusters = [datapoints[cluster_allocs==ci] for ci in range(len(centroids))]
    
    new_centroids = np.asarray([(1/len(cl))*np.sum(cl, axis=0) for cl in clusters if len(cl)>0])
    return new_centroids, clusters

def kmean(centroids, datapoints, n_gen, do_print=True):
    clusters=None
    for ii in range(1,n_gen):
        new_centroids, clusters = kmean_step(centroids, datapoints)
        if np.array_equal(centroids,new_centroids):
            print "After %i generations stabalised" % ii
            break
        else:
            centroids = new_centroids
            if do_print:
                print_clusters(clusters,centroids)
    return clusters

I have left out the print_clusters method. It as expected, displays the current clusters using coloured plots in matplotlib.

###Issues

  • 'kmean_step'kmean_step seems really ugly for an algorithm that can be described by:
    • Recluster the dataset around the centroids;
    • Recenter the centroids with their clusters.
  • I am a bit unsure about the termination conditions in the kmean function. I have read that they are:
    • Centroids don't move,
    • or Cluster membership does not change,
    • or number of iterations exceeds preset max.
  • I have the first and last of those, but the second seems like it will happen automatically by the first. If the cluster membership doesn't change, then the center doesn't move. Do I really need to include it? (it is surprisingly hard to implement) Or is it only included in the algorithm description for clarity?
  • Is this an optimal way to implement it? To fully take advantage of vectorized operations? I am a bit iff about having 2 list comprehensions, particularly since one immediately cases to a ndarrray.

I have implemented the K-Mean clustering Algorithm in Numpy:

from __future__ import division
import numpy as np

def kmean_step(centroids, datapoints):
 
    ds = centroids[:,np.newaxis]-datapoints
    e_dists =  np.sqrt(np.sum(np.square(ds),axis=-1))
    cluster_allocs = np.argmin(e_dists, axis=0)
    clusters = [datapoints[cluster_allocs==ci] for ci in range(len(centroids))]
    
    new_centroids = np.asarray([(1/len(cl))*np.sum(cl, axis=0) for cl in clusters if len(cl)>0])
    return new_centroids, clusters

def kmean(centroids, datapoints, n_gen, do_print=True):
    clusters=None
    for ii in range(1,n_gen):
        new_centroids, clusters = kmean_step(centroids, datapoints)
        if np.array_equal(centroids,new_centroids):
            print "After %i generations stabalised" % ii
            break
        else:
            centroids = new_centroids
            if do_print:
                print_clusters(clusters,centroids)
    return clusters

I have left out the print_clusters method. It as expected, displays the current clusters using coloured plots in matplotlib.

###Issues

  • 'kmean_step' seems really ugly for an algorithm that can be described by:
    • Recluster the dataset around the centroids;
    • Recenter the centroids with their clusters.
  • I am a bit unsure about the termination conditions in the kmean function. I have read that they are:
    • Centroids don't move,
    • or Cluster membership does not change,
    • or number of iterations exceeds preset max.
  • I have the first and last of those, but the second seems like it will happen automatically by the first. If the cluster membership doesn't change, then the center doesn't move. Do I really need to include it? (it is surprisingly hard to implement) Or is it only included in the algorithm description for clarity?
  • Is this an optimal way to implement it? To fully take advantage of vectorized operations? I am a bit iff about having 2 list comprehensions, particularly since one immediately cases to a ndarrray.

I have implemented the K-Mean clustering Algorithm in Numpy:

from __future__ import division
import numpy as np

def kmean_step(centroids, datapoints):
    ds = centroids[:,np.newaxis]-datapoints
    e_dists =  np.sqrt(np.sum(np.square(ds),axis=-1))
    cluster_allocs = np.argmin(e_dists, axis=0)
    clusters = [datapoints[cluster_allocs==ci] for ci in range(len(centroids))]
    new_centroids = np.asarray([(1/len(cl))*np.sum(cl, axis=0) for cl in clusters if len(cl)>0])
    return new_centroids, clusters

def kmean(centroids, datapoints, n_gen, do_print=True):
    clusters=None
    for ii in range(1,n_gen):
        new_centroids, clusters = kmean_step(centroids, datapoints)
        if np.array_equal(centroids,new_centroids):
            print "After %i generations stabalised" % ii
            break
        else:
            centroids = new_centroids
            if do_print:
                print_clusters(clusters,centroids)
    return clusters

I have left out the print_clusters method. It as expected, displays the current clusters using coloured plots in matplotlib.

###Issues

  • kmean_step seems really ugly for an algorithm that can be described by:
    • Recluster the dataset around the centroids;
    • Recenter the centroids with their clusters.
  • I am a bit unsure about the termination conditions in the kmean function. I have read that they are:
    • Centroids don't move,
    • or Cluster membership does not change,
    • or number of iterations exceeds preset max.
  • I have the first and last of those, but the second seems like it will happen automatically by the first. If the cluster membership doesn't change, then the center doesn't move. Do I really need to include it? (it is surprisingly hard to implement) Or is it only included in the algorithm description for clarity?
  • Is this an optimal way to implement it? To fully take advantage of vectorized operations? I am a bit iff about having 2 list comprehensions, particularly since one immediately cases to a ndarrray.
edited tags
Link
200_success
  • 145.7k
  • 22
  • 191
  • 481
Fixed mistake in finding number of generations
Source Link
Loading
.
Source Link
Loading
edited tags
Link
Jamal
  • 35.2k
  • 13
  • 134
  • 238
Loading
fix grammatial errors
Source Link
Phrancis
  • 20.5k
  • 6
  • 70
  • 155
Loading
Source Link
Loading