Does scipy.spatial.distance.cdist run multiple cores in parallel on your machine ?
On my mac with Accelerate framework, it runs all 4 cores,
but equivalent numpy code seems to run on only 1.
(This is regardless of OMP_NUM_THREADS and VECLIB_MAXIMUM_THREADS, which I don't understand -- see ... on stackoverflow.)
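A quick way to check on your own machine (the array sizes here are just an example; watch a CPU monitor such as top while each line runs):

import numpy as np
from scipy.spatial import distance

rng = np.random.RandomState(0)
A = rng.rand( 100000, 300 )  # example size, adjust to taste
b = rng.rand( 1, 300 )

D = distance.cdist( A, b, "cosine" )  # all 4 cores on my mac
S = A.dot( b.ravel() )                # seems to run on 1 core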
If the big matrix doesn't change very often, normalize it once outside the loop:
import numpy as np

A /= np.linalg.norm( A, axis=1 )[:, np.newaxis]  # <-- once, normalize each row

def cosdist( A, b ):
    """ (A . b) / |b| -- A row-normalized """
    Adotb = A.dot(b)
    Adotb /= np.linalg.norm(b)
    return Adotb
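Note that this returns the cosine *similarity*; scipy's cosine distance is 1 minus that. A quick sanity check (sizes hypothetical):

import numpy as np
from scipy.spatial import distance

rng = np.random.RandomState(0)
A = rng.rand( 1000, 300 )
b = rng.rand( 300 )

An = A / np.linalg.norm( A, axis=1 )[:, np.newaxis]  # normalized copy
sim = cosdist( An, b )
ref = 1 - distance.cdist( A, b[np.newaxis, :], "cosine" ).ravel()
print( np.allclose( sim, ref ))  # True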
Since A.dot(b) and norm(A) take roughly the same time,
this runs about twice as fast -- on 1 core.
How much memory do you have ? 1M rows x 300 columns x 8 bytes is 2.4 Gbytes, which may be a reasonable chunk size; 4.5M rows x 300 x 8 bytes, 11 Gbytes, will be memory-bound, so process the big matrix in chunks (sketch below). Can you monitor memory usage / swapping ?
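A minimal chunking sketch, assuming A is already row-normalized as above (cosdist_chunked and the default chunk size are my own, pick what fits your memory):

import numpy as np

def cosdist_chunked( A, b, chunk=1000000 ):
    """ cosdist in row chunks, to bound the working set """
    out = np.empty( len(A), dtype=A.dtype )
    bnorm = np.linalg.norm(b)
    for i in range( 0, len(A), chunk ):
        out[i:i+chunk] = A[i:i+chunk].dot(b) / bnorm
    return out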
Using np.float32 instead of the default np.float64
might be faster, or allow bigger chunks.
But scipy's cdist seems to convert float32 inputs to float64, which is slower.
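You can see this on your scipy version with a small sketch (sizes arbitrary):

import numpy as np
from scipy.spatial import distance

A32 = np.random.rand( 10000, 300 ).astype( np.float32 )
b32 = np.random.rand( 1, 300 ).astype( np.float32 )

print( A32.dot( b32.T ).dtype )                       # float32 -- numpy stays in single precision
print( distance.cdist( A32, b32, "cosine" ).dtype )   # float64 -- cdist upcasts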
Numpy and scipy link to BLAS libraries (Basic Linear Algebra Subprograms). These libs are usually vendor-tuned, and faster than plain C or Cython loops. To see what numpy / scipy link to:
from numpy import __config__
print( "numpy blas:\n", __config__.get_info( "blas_opt" ))
from scipy import __config__
print( "scipy blas:\n", __config__.get_info( "blas_opt" ))
Even though these are the same (Accelerate framework on my mac),
numpy and scipy have different wrappers around BLAS
(numpy/linalg/lapack_lite.so, scipy/linalg/cython_blas.so)
with different overheads.
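You can get a rough feel for the wrapper overheads by timing small vectors, where overhead dominates the actual arithmetic (a sketch only; absolute times will vary):

import timeit

setup = """
import numpy as np
from scipy.linalg import blas
a = np.random.rand( 300 )
b = np.random.rand( 300 )
"""
print( timeit.timeit( "a.dot(b)", setup=setup, number=100000 ))        # numpy wrapper
print( timeit.timeit( "blas.ddot(a, b)", setup=setup, number=100000 )) # scipy cython_blas wrapper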
Summary: the chain from numpy / scipy dot and norm, through BLAS,
to multiple cores is a messy business.
Don't trust any runtimes that you haven't run yourself.