For building an LSH-based system in Python, I need a very fast calculation of the hash code.
I won't explain the LSH algorithm itself here, but I need help improving the performance of the generate-hash-code operation:
Given a large number of features n (e.g. 5,000 or even 50,000):
- multiply a dense matrix (13xn) by a sparse vector (nx1)
- in the resulting vector (13x1), set each entry to 1 if positive and 0 otherwise
- generate a binary number out of the result (a sketch of these steps follows below)
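In plain NumPy on dense arrays, those three steps look roughly like this (a minimal sketch for clarity only; planes and x are made-up names, and in my real code the vector is sparse):

import numpy as np

planes = np.random.randn(13, 5000)    # dense projection matrix (13 x n)
x = np.random.rand(5000)              # feature vector (dense here just for the sketch)

projected = planes @ x                # step 1: matrix-vector product -> shape (13,)
bits = (projected > 0).astype(int)    # step 2: 1 where positive, 0 otherwise
code = ''.join(map(str, bits))        # step 3: binary string such as '1011001010110'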
I tried various approaches, and I include two implementations below, generateCode1 and generateCode2, but both are still too slow.
generateCode1 takes 56 seconds for 100 calls (avg: 0.56 s)
generateCode2 takes 20 seconds for 100 calls (avg: 0.20 s)
I am sure it is possible to do it faster, but I'm not sure how. So that you can play with it, here is a full working sample program:
import time
import numpy as np
from scipy import sparse
from scipy import stats
def randomMatrix(rows, cols, density):
    # random sparse matrix whose non-zero entries are drawn from a normal distribution
    rvs = stats.norm(loc=0).rvs  # scale=2,
    M = sparse.random(rows, cols, format='lil', density=density, data_rvs=rvs)
    return M
def generateCode1(matrix, vector):
    # compute each dot product by hand, iterating only over the vector's non-zeros
    nonzeros = np.nonzero(vector)
    temp_hashcode = []
    for i in range(matrix.shape[0]):
        d = 0
        for j in nonzeros[1]:
            d += matrix[i, j] * vector[0, j]
        temp_hashcode.append('1' if d > 0 else '0')
    return ''.join(temp_hashcode)
def generateCode2(matrix, vector):
    # sparse matrix-vector product, then threshold to 0/1 and join into a string
    m = matrix * vector.T
    m[m > 0] = 1
    m[m < 0] = 0
    txt = ''.join(m.A.ravel().astype('U1'))
    return txt
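If an integer code would also work downstream instead of a string, the sign bits could be packed directly; a rough sketch of that variant (generateCodeInt is just an illustrative name, I have not timed it):

def generateCodeInt(matrix, vector):
    # same projection, but fold the sign bits into a single integer
    m = (matrix * vector.T).toarray().ravel()  # dense 13-element result
    code = 0
    for v in m:                                # most significant bit first
        code = (code << 1) | (1 if v > 0 else 0)
    return code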
features = 50000
test_count = 100
matrix = randomMatrix(13, features, 1.0)   # fully dense 13xn projection matrix (stored sparse)
vector = randomMatrix(1, features, 0.01)   # sparse 1xn feature vector
vector = abs(vector)
methods = [generateCode1, generateCode2]
for func in methods:
    print('run {0} times of method {1}:'.format(test_count, func.__name__))
    time1 = time.time()
    for i in range(test_count):
        code1 = func(matrix, vector)
    time1 = time.time() - time1
    print('\tcode: {0} - run time: '.format(code1), end="")
    print('{0}(s) == {1}(m) [ average {2}(s) ]'.format(time1, time1 / 60, time1 / test_count))
Note: Please don't suggest parallelism or multiprocessing; that won't fit into my overall application (unfortunately).