For building an LSH-based system in Python, I need a very fast calculation of the hash code.
I won't explain the LSH algorithm itself here, but I need help improving the performance of the generate-hash-code operation:
Given a large number of features n (e.g. 5,000 or even 50,000):
- multiply a dense matrix (13xn) by a sparse vector (nx1)
- in the resulting vector (13x1), set each entry to 1 if positive and 0 otherwise
- generate a binary number out of the result (a sketch of these steps follows below)
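In plain NumPy on dense arrays, those three steps look roughly like this (a minimal sketch for clarity only; planes and x are made-up names, and in my real code the vector is sparse):

import numpy as np

planes = np.random.randn(13, 5000)    # dense projection matrix (13 x n)
x = np.random.rand(5000)              # feature vector (dense here just for the sketch)

projected = planes @ x                # step 1: matrix-vector product -> shape (13,)
bits = (projected > 0).astype(int)    # step 2: 1 where positive, 0 otherwise
code = ''.join(map(str, bits))        # step 3: binary string such as '1011001010110'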
I tried various approaches, and I include two implementations below, generateCode1 and generateCode2, but both are still too slow.
generateCode1 takes 56 seconds for 100 calls (avg: 0.56 s)
generateCode2 takes 20 seconds for 100 calls (avg: 0.20 s)
I am sure it is possible to do it faster, but I'm not sure how. So that you can play with it, here is a full working sample program:
import time
import numpy as np
from scipy import sparse
from scipy import stats
def randomMatrix(rows, cols, density):
    # random sparse matrix whose non-zero entries are drawn from a normal distribution
    rvs = stats.norm(loc=0).rvs  # scale=2,
    M = sparse.random(rows, cols, format='lil', density=density, data_rvs=rvs)
    return M
def generateCode1(matrix, vector):
    # compute each dot product by hand, iterating only over the vector's non-zeros
    nonzeros = np.nonzero(vector)
    temp_hashcode = []
    for i in range(matrix.shape[0]):
        d = 0
        for j in nonzeros[1]:
            d += matrix[i, j] * vector[0, j]
        temp_hashcode.append('1' if d > 0 else '0')
    return ''.join(temp_hashcode)
def generateCode2(matrix, vector):
    # sparse matrix-vector product, then threshold to 0/1 and join into a string
    m = matrix * vector.T
    m[m > 0] = 1
    m[m < 0] = 0
    txt = ''.join(m.A.ravel().astype('U1'))
    return txt
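If an integer code would also work downstream instead of a string, the sign bits could be packed directly; a rough sketch of that variant (generateCodeInt is just an illustrative name, I have not timed it):

def generateCodeInt(matrix, vector):
    # same projection, but fold the sign bits into a single integer
    m = (matrix * vector.T).toarray().ravel()  # dense 13-element result
    code = 0
    for v in m:                                # most significant bit first
        code = (code << 1) | (1 if v > 0 else 0)
    return code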
features = 50000
test_count = 100
matrix = randomMatrix(13, features, 1.0)   # fully dense 13xn projection matrix (stored sparse)
vector = randomMatrix(1, features, 0.01)   # sparse 1xn feature vector
vector = abs(vector)
methods = [generateCode1, generateCode2]
for func in methods:
    print('run {0} times of method {1}:'.format(test_count, func.__name__))
    time1 = time.time()
    for i in range(test_count):
        code1 = func(matrix, vector)
    time1 = time.time() - time1
    print('\tcode: {0} - run time: '.format(code1), end="")
    print('{0}(s) == {1}(m) [ average {2}(s) ]'.format(time1, time1 / 60, time1 / test_count))
Note: Please don't suggest parallelism or multiprocessing; that won't fit into my overall application (unfortunately).