Since we can assume all your points lie on a sphere, there is a cheaper way to compute distances than the L2 norm (the Euclidean distance you're currently using).

If we transform your data points to spherical coordinates, we can use a simple Manhattan distance on the theta and phi coordinates, which is less expensive to compute. Since you compute this distance about \$Nm\$ times (6 million points * 1000 sample points), I'm sure it would make a good performance difference.
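
Concretely, each Euclidean distance needs three subtractions, three squarings, two additions and a square root, \$\sqrt{(x_1-x_2)^2 + (y_1-y_2)^2 + (z_1-z_2)^2}\$, while the Manhattan distance on the two angles is just \$|\theta_1-\theta_2| + |\phi_1-\phi_2|\$.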

A small note on spherical coordinates:

Instead of x, y, z coordinates, we use r, theta, phi, which correspond to:

  • r: the length of the vector
  • theta: the inclination, i.e. the angle measured from the z-axis
  • phi: the azimuth, i.e. the angle in the XY plane

So, to change data to spherical coordinates, we'd do the following (I assume data is arranged in an x, y, z manner). The conversion is very fast, but it could be further optimized if needed:

import math

import numpy as np

def convert_to_spherical(data):
    spherical_data = np.zeros_like(data)
    for i, d in enumerate(data):
        # r: the vector length; always 1 for points on the unit sphere.
        r = math.sqrt(d[0]**2 + d[1]**2 + d[2]**2)
        # theta: inclination, measured from the z-axis.
        theta = math.acos(d[2] / r)
        # phi: azimuth in the XY plane. atan2 handles all four quadrants
        # and avoids a ZeroDivisionError when d[0] == 0, unlike
        # atan(d[1] / d[0]).
        phi = math.atan2(d[1], d[0])

        spherical_data[i] = np.array([r, theta, phi])

    return spherical_data

Then you can use this instead of data and change the metric used by pairwise_distances to manhattan. You'll need to dig a little to understand why this works, but consider that, since r is the same for every point, you only need to compare the angles.
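
As a minimal sketch (assuming scikit-learn's pairwise_distances; data and samples are placeholder names for your full point set and your sampled points):

from sklearn.metrics import pairwise_distances

# data and samples are hypothetical (N, 3) and (m, 3) arrays of
# x, y, z points on the unit sphere.
data_sph = convert_to_spherical(data)
samples_sph = convert_to_spherical(samples)

# r (column 0) is constant on a sphere, so compare only the angles:
# the Manhattan distance per pair is then |dtheta| + |dphi|.
distances = pairwise_distances(data_sph[:, 1:], samples_sph[:, 1:],
                               metric='manhattan')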

Since my laptop is pretty weak, I couldn't generate enough data to sample 1000 points out of 1000000. These times were measured sampling 300 points from 100000 points; I started timing after the points were generated, but the coordinate transformation is included in the times:

Time using your optimized solution: 213.1844208240509s
Time using the spherical coordinates: 140.83173203468323s

Considering this sizeable time difference, I'm fairly sure performance would be much better on your data sample as well.

If you afterwards need to get your coordinates back to the Cartesian plane, the formulas are simple and easy to find online; a sketch follows.
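
For completeness, here is a minimal sketch of that inverse conversion (convert_to_cartesian is a made-up name; it assumes the same r, theta, phi convention as convert_to_spherical above):

import numpy as np

def convert_to_cartesian(spherical_data):
    # Columns are r, theta (inclination), phi (azimuth),
    # matching convert_to_spherical above.
    r = spherical_data[:, 0]
    theta = spherical_data[:, 1]
    phi = spherical_data[:, 2]
    x = r * np.sin(theta) * np.cos(phi)
    y = r * np.sin(theta) * np.sin(phi)
    z = r * np.cos(theta)
    return np.column_stack((x, y, z))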

I generated the test points thanks to this answer:

import numpy

def sample_spherical(npoints, ndim=3):
    # Normalising Gaussian samples gives uniform points on the sphere.
    vec = numpy.random.randn(ndim, npoints)
    vec /= numpy.linalg.norm(vec, axis=0)
    return numpy.transpose(vec)
