Revisions to K-Mean with Numpy - Code Review Stack Exchange

Commonmark migration

Source Link

edited Jun 10, 2020 at 13:24

1

I have implemented the K-Mean clustering Algorithm in Numpy:

from __future__ import division
import numpy as np

def kmean_step(centroids, datapoints):
    ds = centroids[:,np.newaxis]-datapoints
    e_dists =  np.sqrt(np.sum(np.square(ds),axis=-1))
    cluster_allocs = np.argmin(e_dists, axis=0)
    clusters = [datapoints[cluster_allocs==ci] for ci in range(len(centroids))]
    new_centroids = np.asarray([(1/len(cl))*np.sum(cl, axis=0) for cl in clusters if len(cl)>0])
    return new_centroids, clusters

def kmean(centroids, datapoints, n_gen, do_print=True):
    clusters=None
    for ii in range(1,n_gen):
        new_centroids, clusters = kmean_step(centroids, datapoints)
        if np.array_equal(centroids,new_centroids):
            print "After %i generations stabalised" % ii
            break
        else:
            centroids = new_centroids
            if do_print:
                print_clusters(clusters,centroids)
    return clusters

I have left out the print_clusters method. It as expected, displays the current clusters using coloured plots in matplotlib.

###Issues

Issues

kmean_step seems really ugly for an algorithm that can be described by:
- Recluster the dataset around the centroids;
- Recenter the centroids with their clusters.
I am a bit unsure about the termination conditions in the kmean function. I have read that they are:
- Centroids don't move,
- or Cluster membership does not change,
- or number of iterations exceeds preset max.
I have the first and last of those, but the second seems like it will happen automatically by the first. If the cluster membership doesn't change, then the center doesn't move. Do I really need to include it? (it is surprisingly hard to implement) Or is it only included in the algorithm description for clarity?
Is this an optimal way to implement it? To fully take advantage of vectorized operations? I am a bit iff about having 2 list comprehensions, particularly since one immediately cases to a ndarrray.

I have implemented the K-Mean clustering Algorithm in Numpy:

from __future__ import division
import numpy as np

def kmean_step(centroids, datapoints):
    ds = centroids[:,np.newaxis]-datapoints
    e_dists =  np.sqrt(np.sum(np.square(ds),axis=-1))
    cluster_allocs = np.argmin(e_dists, axis=0)
    clusters = [datapoints[cluster_allocs==ci] for ci in range(len(centroids))]
    new_centroids = np.asarray([(1/len(cl))*np.sum(cl, axis=0) for cl in clusters if len(cl)>0])
    return new_centroids, clusters

def kmean(centroids, datapoints, n_gen, do_print=True):
    clusters=None
    for ii in range(1,n_gen):
        new_centroids, clusters = kmean_step(centroids, datapoints)
        if np.array_equal(centroids,new_centroids):
            print "After %i generations stabalised" % ii
            break
        else:
            centroids = new_centroids
            if do_print:
                print_clusters(clusters,centroids)
    return clusters

I have left out the print_clusters method. It as expected, displays the current clusters using coloured plots in matplotlib.

###Issues

kmean_step seems really ugly for an algorithm that can be described by:
- Recluster the dataset around the centroids;
- Recenter the centroids with their clusters.
I am a bit unsure about the termination conditions in the kmean function. I have read that they are:
- Centroids don't move,
- or Cluster membership does not change,
- or number of iterations exceeds preset max.
I have the first and last of those, but the second seems like it will happen automatically by the first. If the cluster membership doesn't change, then the center doesn't move. Do I really need to include it? (it is surprisingly hard to implement) Or is it only included in the algorithm description for clarity?
Is this an optimal way to implement it? To fully take advantage of vectorized operations? I am a bit iff about having 2 list comprehensions, particularly since one immediately cases to a ndarrray.

I have implemented the K-Mean clustering Algorithm in Numpy:

from __future__ import division
import numpy as np

def kmean_step(centroids, datapoints):
    ds = centroids[:,np.newaxis]-datapoints
    e_dists =  np.sqrt(np.sum(np.square(ds),axis=-1))
    cluster_allocs = np.argmin(e_dists, axis=0)
    clusters = [datapoints[cluster_allocs==ci] for ci in range(len(centroids))]
    new_centroids = np.asarray([(1/len(cl))*np.sum(cl, axis=0) for cl in clusters if len(cl)>0])
    return new_centroids, clusters

def kmean(centroids, datapoints, n_gen, do_print=True):
    clusters=None
    for ii in range(1,n_gen):
        new_centroids, clusters = kmean_step(centroids, datapoints)
        if np.array_equal(centroids,new_centroids):
            print "After %i generations stabalised" % ii
            break
        else:
            centroids = new_centroids
            if do_print:
                print_clusters(clusters,centroids)
    return clusters

I have left out the print_clusters method. It as expected, displays the current clusters using coloured plots in matplotlib.

Issues

kmean_step seems really ugly for an algorithm that can be described by:
- Recluster the dataset around the centroids;
- Recenter the centroids with their clusters.
I am a bit unsure about the termination conditions in the kmean function. I have read that they are:
- Centroids don't move,
- or Cluster membership does not change,
- or number of iterations exceeds preset max.
I have the first and last of those, but the second seems like it will happen automatically by the first. If the cluster membership doesn't change, then the center doesn't move. Do I really need to include it? (it is surprisingly hard to implement) Or is it only included in the algorithm description for clarity?
Is this an optimal way to implement it? To fully take advantage of vectorized operations? I am a bit iff about having 2 list comprehensions, particularly since one immediately cases to a ndarrray.

.

Source Link

edited Aug 31, 2014 at 7:07

Frames Catherine White

1.3k
3
15
27

I have implemented the K-Mean clustering Algorithm in Numpy:

from __future__ import division
import numpy as np

def kmean_step(centroids, datapoints):
 
    ds = centroids[:,np.newaxis]-datapoints
    e_dists =  np.sqrt(np.sum(np.square(ds),axis=-1))
    cluster_allocs = np.argmin(e_dists, axis=0)
    clusters = [datapoints[cluster_allocs==ci] for ci in range(len(centroids))]
    
    new_centroids = np.asarray([(1/len(cl))*np.sum(cl, axis=0) for cl in clusters if len(cl)>0])
    return new_centroids, clusters

def kmean(centroids, datapoints, n_gen, do_print=True):
    clusters=None
    for ii in range(1,n_gen):
        new_centroids, clusters = kmean_step(centroids, datapoints)
        if np.array_equal(centroids,new_centroids):
            print "After %i generations stabalised" % ii
            break
        else:
            centroids = new_centroids
            if do_print:
                print_clusters(clusters,centroids)
    return clusters

I have left out the print_clusters method. It as expected, displays the current clusters using coloured plots in matplotlib.

###Issues

'kmean_step'kmean_step seems really ugly for an algorithm that can be described by:
- Recluster the dataset around the centroids;
- Recenter the centroids with their clusters.
I am a bit unsure about the termination conditions in the kmean function. I have read that they are:
- Centroids don't move,
- or Cluster membership does not change,
- or number of iterations exceeds preset max.
I have the first and last of those, but the second seems like it will happen automatically by the first. If the cluster membership doesn't change, then the center doesn't move. Do I really need to include it? (it is surprisingly hard to implement) Or is it only included in the algorithm description for clarity?
Is this an optimal way to implement it? To fully take advantage of vectorized operations? I am a bit iff about having 2 list comprehensions, particularly since one immediately cases to a ndarrray.

I have implemented the K-Mean clustering Algorithm in Numpy:

from __future__ import division
import numpy as np

def kmean_step(centroids, datapoints):
 
    ds = centroids[:,np.newaxis]-datapoints
    e_dists =  np.sqrt(np.sum(np.square(ds),axis=-1))
    cluster_allocs = np.argmin(e_dists, axis=0)
    clusters = [datapoints[cluster_allocs==ci] for ci in range(len(centroids))]
    
    new_centroids = np.asarray([(1/len(cl))*np.sum(cl, axis=0) for cl in clusters if len(cl)>0])
    return new_centroids, clusters

def kmean(centroids, datapoints, n_gen, do_print=True):
    clusters=None
    for ii in range(1,n_gen):
        new_centroids, clusters = kmean_step(centroids, datapoints)
        if np.array_equal(centroids,new_centroids):
            print "After %i generations stabalised" % ii
            break
        else:
            centroids = new_centroids
            if do_print:
                print_clusters(clusters,centroids)
    return clusters

I have left out the print_clusters method. It as expected, displays the current clusters using coloured plots in matplotlib.

###Issues

'kmean_step' seems really ugly for an algorithm that can be described by:
- Recluster the dataset around the centroids;
- Recenter the centroids with their clusters.
I am a bit unsure about the termination conditions in the kmean function. I have read that they are:
- Centroids don't move,
- or Cluster membership does not change,
- or number of iterations exceeds preset max.
I have the first and last of those, but the second seems like it will happen automatically by the first. If the cluster membership doesn't change, then the center doesn't move. Do I really need to include it? (it is surprisingly hard to implement) Or is it only included in the algorithm description for clarity?
Is this an optimal way to implement it? To fully take advantage of vectorized operations? I am a bit iff about having 2 list comprehensions, particularly since one immediately cases to a ndarrray.

I have implemented the K-Mean clustering Algorithm in Numpy:

from __future__ import division
import numpy as np

def kmean_step(centroids, datapoints):
    ds = centroids[:,np.newaxis]-datapoints
    e_dists =  np.sqrt(np.sum(np.square(ds),axis=-1))
    cluster_allocs = np.argmin(e_dists, axis=0)
    clusters = [datapoints[cluster_allocs==ci] for ci in range(len(centroids))]
    new_centroids = np.asarray([(1/len(cl))*np.sum(cl, axis=0) for cl in clusters if len(cl)>0])
    return new_centroids, clusters

def kmean(centroids, datapoints, n_gen, do_print=True):
    clusters=None
    for ii in range(1,n_gen):
        new_centroids, clusters = kmean_step(centroids, datapoints)
        if np.array_equal(centroids,new_centroids):
            print "After %i generations stabalised" % ii
            break
        else:
            centroids = new_centroids
            if do_print:
                print_clusters(clusters,centroids)
    return clusters

I have left out the print_clusters method. It as expected, displays the current clusters using coloured plots in matplotlib.

###Issues

kmean_step seems really ugly for an algorithm that can be described by:
- Recluster the dataset around the centroids;
- Recenter the centroids with their clusters.
I am a bit unsure about the termination conditions in the kmean function. I have read that they are:
- Centroids don't move,
- or Cluster membership does not change,
- or number of iterations exceeds preset max.
I have the first and last of those, but the second seems like it will happen automatically by the first. If the cluster membership doesn't change, then the center doesn't move. Do I really need to include it? (it is surprisingly hard to implement) Or is it only included in the algorithm description for clarity?
Is this an optimal way to implement it? To fully take advantage of vectorized operations? I am a bit iff about having 2 list comprehensions, particularly since one immediately cases to a ndarrray.

edited tags

Link

edited Aug 31, 2014 at 5:02

200_success

145.7k
22
191
481

Fixed mistake in finding number of generations

Source Link

edited Aug 31, 2014 at 4:43

Frames Catherine White

1.3k
3
15
27

Loading

.

Source Link

edited Aug 31, 2014 at 4:37

Frames Catherine White

1.3k
3
15
27

Loading

edited tags

Link

edited Aug 31, 2014 at 4:35

Jamal

35.2k
13
134
238

Loading

fix grammatial errors

Source Link

edited Aug 31, 2014 at 4:33

Phrancis

20.5k
6
70
155

Loading

Source Link

asked Aug 31, 2014 at 4:28

Frames Catherine White

1.3k
3
15
27

Loading

Stack Exchange Network

Return to Question

Issues

Issues