Does scipy.spatial.distance.cdist run multiple cores in parallel on your machine ?
On my mac with Accelerate framework, it runs all 4 cores,
but equivalent numpy code seems to run on only 1.
(This is regardless of OMP_NUM_THREADS and VECLIB_MAXIMUM_THREADS, which I don't understand -- see ... on stackoverflow.)
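A quick way to check on your own machine (the array sizes here are just an example; watch a CPU monitor such as top while each line runs):

import numpy as np
from scipy.spatial import distance

rng = np.random.RandomState(0)
A = rng.rand( 100000, 300 )  # example size, adjust to taste
b = rng.rand( 1, 300 )

D = distance.cdist( A, b, "cosine" )  # all 4 cores on my mac
S = A.dot( b.ravel() )                # seems to run on 1 core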
If the big matrix doesn't change very often, normalize it once outside the loop:
import numpy as np

A /= np.linalg.norm( A, axis=1 )[:, np.newaxis]  # <-- once, normalize each row

def cosdist( A, b ):
    """ (A . b) / |b| -- A row-normalized """
    Adotb = A.dot(b)
    Adotb /= np.linalg.norm(b)
    return Adotb
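Note that this returns the cosine *similarity*; scipy's cosine distance is 1 minus that. A quick sanity check (sizes hypothetical):

import numpy as np
from scipy.spatial import distance

rng = np.random.RandomState(0)
A = rng.rand( 1000, 300 )
b = rng.rand( 300 )

An = A / np.linalg.norm( A, axis=1 )[:, np.newaxis]  # normalized copy
sim = cosdist( An, b )
ref = 1 - distance.cdist( A, b[np.newaxis, :], "cosine" ).ravel()
print( np.allclose( sim, ref ))  # True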
Since A.dot(b) and norm(A) take roughly the same time,
this runs about twice as fast -- on 1 core.
How much memory do you have ? 1M rows x 300 columns x 8 bytes is 2.4 Gbytes, which may be a reasonable chunk size; 4.5M rows x 300 x 8 bytes, 11 Gbytes, will be memory-bound, so process the big matrix in chunks (sketch below). Can you monitor memory usage / swapping ?
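A minimal chunking sketch, assuming A is already row-normalized as above (cosdist_chunked and the default chunk size are my own, pick what fits your memory):

import numpy as np

def cosdist_chunked( A, b, chunk=1000000 ):
    """ cosdist in row chunks, to bound the working set """
    out = np.empty( len(A), dtype=A.dtype )
    bnorm = np.linalg.norm(b)
    for i in range( 0, len(A), chunk ):
        out[i:i+chunk] = A[i:i+chunk].dot(b) / bnorm
    return out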
Using np.float32 instead of the default np.float64
might be faster, or allow bigger chunks.
But scipy's cdist seems to convert float32 inputs to float64, which is slower.
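You can see this on your scipy version with a small sketch (sizes arbitrary):

import numpy as np
from scipy.spatial import distance

A32 = np.random.rand( 10000, 300 ).astype( np.float32 )
b32 = np.random.rand( 1, 300 ).astype( np.float32 )

print( A32.dot( b32.T ).dtype )                       # float32 -- numpy stays in single precision
print( distance.cdist( A32, b32, "cosine" ).dtype )   # float64 -- cdist upcasts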
Numpy and scipy link to BLAS libraries (Basic Linear Algebra Subprograms). These libs are usually vendor-tuned, and faster than plain C or Cython loops. To see what numpy / scipy link to:
from numpy import __config__
print( "numpy blas:\n", __config__.get_info( "blas_opt" ))
from scipy import __config__
print( "scipy blas:\n", __config__.get_info( "blas_opt" ))
Even though these are the same (Accelerate framework on my mac),
numpy and scipy have different wrappers around BLAS
(numpy/linalg/lapack_lite.so, scipy/linalg/cython_blas.so)
with different overheads.
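You can get a rough feel for the wrapper overheads by timing small vectors, where overhead dominates the actual arithmetic (a sketch only; absolute times will vary):

import timeit

setup = """
import numpy as np
from scipy.linalg import blas
a = np.random.rand( 300 )
b = np.random.rand( 300 )
"""
print( timeit.timeit( "a.dot(b)", setup=setup, number=100000 ))        # numpy wrapper
print( timeit.timeit( "blas.ddot(a, b)", setup=setup, number=100000 )) # scipy cython_blas wrapper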
Summary: the chain from numpy / scipy dot and norm, through BLAS,
to multiple cores is a messy business.
Don't trust any runtimes that you haven't run yourself.