Does scipy.spatial.distance.cdist run multiple cores in parallel on your machine? On my Mac with the Accelerate framework it runs all 4 cores, but the equivalent numpy expression seems to run on only 1. (This is regardless of VECLIB_MAXIMUM_THREADS, which I don't understand.)
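
A rough way to see this on your own machine is to time cdist against a plain numpy product while watching core usage in top / Activity Monitor. A minimal sketch, with made-up shapes:

import numpy as np
from scipy.spatial.distance import cdist
from time import time

A = np.random.rand( 10000, 300 )
B = np.random.rand( 100, 300 )

t0 = time()
D = cdist( A, B, metric="cosine" )   # scipy: may fan out to all cores
print( "cdist: %.3f sec" % (time() - t0) )

t0 = time()
G = A.dot( B.T )                     # numpy dot: one BLAS gemm call
print( "numpy dot: %.3f sec" % (time() - t0) )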

If the big matrix doesn't change very often, normalize it once outside the loop:

A /= np.linalg.norm( A, axis=1 )[:, np.newaxis]  # <-- once: normalize each row of A

def cosdist( A, b ):
    """ (A . b) / |b| -- rows of A already normalized """
    Adotb = A.dot(b)
    Adotb /= np.linalg.norm(b)
    return Adotb

Since A.dot(b) and norm(A) take roughly the same time, hoisting the normalization out of the loop makes this run about twice as fast -- still on 1 core.
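
For example, a minimal usage sketch (data and shapes are made up) for a stream of query vectors b:

A = np.random.rand( 100000, 300 ).astype(np.float32)
A /= np.linalg.norm( A, axis=1 )[:, np.newaxis]   # once

for b in np.random.rand( 10, 300 ).astype(np.float32):
    sims = cosdist( A, b )                # cosine similarities, shape (100000,)
    nearest10 = sims.argsort()[-10:]      # indices of the 10 most similar rows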

How much memory do you have? 1M rows x 300 columns x 8 bytes is 2.4 Gbytes, which may be a reasonable chunk size; 4.5M rows is 11 Gbytes and will be memory-bound. Can you monitor memory usage / swapping?
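
If the whole matrix doesn't fit comfortably, one option is to scan it in row blocks. Continuing the sketch above, with a made-up chunk size:

chunk = 100000                # 100k rows x 300 x 4 bytes = 120 MB per block (float32)
best = []
for j in range( 0, len(A), chunk ):
    sims = cosdist( A[j:j+chunk], b )
    best.append( (sims.max(), j + sims.argmax()) )
print( max(best) )            # (best similarity, row index in A)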

Using np.float32 instead of the default np.float64 might be faster, or allow bigger chunks. But scipy's cdist seems to convert float32 inputs to float64, which slows it down.
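
A quick check of that upcast (shapes made up):

Af = np.random.rand( 1000, 300 ).astype(np.float32)
bf = np.random.rand( 1, 300 ).astype(np.float32)
print( cdist( Af, bf, metric="cosine" ).dtype )   # float64 -- the float32 inputs were upcast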

Cosine similarity is mainly dot and norm, for which numpy and scipy link to BLAS libraries (Basic Linear Algebra Subprograms). These libs are usually vendor-tuned and faster than plain C or Cython loops. To see what numpy / scipy link to:

import numpy, scipy
print( "numpy blas:\n", numpy.__config__.get_info( "blas_opt" ) )
print( "scipy blas:\n", scipy.__config__.get_info( "blas_opt" ) )
# newer versions: numpy.show_config() and scipy.show_config()

Even though these point to the same library (the Accelerate framework on my Mac), numpy and scipy have different wrappers around BLAS (numpy/linalg/lapack_lite.so, scipy/linalg/cython_blas.so) with different overheads.
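
One way to feel that overhead is to time the same matrix-vector product through numpy and through one of scipy's BLAS wrappers, scipy.linalg.blas -- a rough sketch with made-up shapes:

from scipy.linalg import blas
from time import time

A = np.random.rand( 100000, 300 )
b = np.random.rand( 300 )

t0 = time(); y1 = A.dot( b );              print( "numpy dot  : %.4f sec" % (time() - t0) )
t0 = time(); y2 = blas.dgemv( 1.0, A, b ); print( "scipy dgemv: %.4f sec" % (time() - t0) )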

Summary: getting numpy / scipy dot and norm through BLAS onto multiple cores is a messy business.
Don't trust any runtimes that you haven't run yourself.
