Numpy versus Theano GPU parallelization

I am learning Theano to accelerate my image processing functions. As a start, I am trying to reimplement a function that converts color images to black and white (keeping the same number of channels), taken from http://www.marcogiordanotd.com/blog/python/image-processing-pycuda . The author of that post wants to show how efficient this is with CUDA.

I want to do the same using OpenCL and an AMD GPU. My installation worked fine, and I successfully passed the test from http://deeplearning.net/software/theano/tutorial/using_gpu.html# .

However, I am dumbfounded by the result: my NumPy implementation of the black-and-white function is faster than the Theano one. Note that I couldn't help but vectorize the "blackWhite" function from the above-mentioned author, who was using for loops to do matrix calculations.
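To make the loop-versus-vectorized difference concrete, here is a minimal sketch. The helper names `gray_loop` and `gray_vectorized` are hypothetical and written for illustration only; they are not the blog author's code. The weights are the luminosity weights used in the code further down.

```python
import numpy as np

# Luminosity weights (R, G, B), same values as in the question's code.
WEIGHTS = np.array([0.21, 0.71, 0.07])

def gray_loop(px):
    """Per-pixel loop: weighted sum of R, G, B copied into all 3 channels."""
    h, w, _ = px.shape
    out = np.empty_like(px)
    for i in range(h):
        for j in range(w):
            r, g, b = px[i, j]
            # int() truncates the float, matching astype('uint8') on in-range values
            out[i, j] = int(WEIGHTS[0] * r + WEIGHTS[1] * g + WEIGHTS[2] * b)
    return out

def gray_vectorized(px):
    """Vectorized: one dot product over the channel axis, then repeat 3 times."""
    lum = np.dot(px, WEIGHTS).astype(np.uint8)          # shape (h, w)
    return np.repeat(lum[:, :, np.newaxis], 3, axis=2)  # shape (h, w, 3)
```

Both functions take an `(h, w, 3)` `uint8` array. Because floating-point summation order can differ between the two, individual pixels may differ by one gray level after truncation.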

My code is as follows. To test it, change inPath (under if __name__=='__main__':) to the path of a color image.

from PIL import Image
import time
import os
#import pycuda.driver as cuda
#import pycuda.autoinit
#from pycuda.compiler import SourceModule
from theano import function, config, shared, tensor
import numpy as np

#
def blackWhite(inPath , outPath , mode = "luminosity",log = 0):

    if log == 1 :
        print ("----------> SERIAL CONVERSION")
    totalT0 = time.time()

    im = Image.open(inPath)
    px = np.array(im)

    getDataT1 = time.time()

    print ("-----> Opening path :" , inPath)

    processT0 =  time.time()
    if mode == 'luminosity':
        px=np.rollaxis(np.tile(np.dot(px,(0.21,0.71,0.07)).astype('uint8'),(3,1,1)),0,3)
    else:
        px=np.rollaxis(np.tile(np.dot(px,(1/3,1/3,1/3)).astype('uint8'),(3,1,1)),0,3)

    processT1= time.time()
    #px = np.array(im.getdata())
    im = Image.fromarray(px)
    im.save(outPath)

    print ("-----> Saving path :" , outPath)
    totalT1 = time.time()

    if log == 1 :
        print ("Image size : ",im.size)
        print ("get and convert Image data  : " ,getDataT1-totalT0 )
        print ("Processing data : " , processT1 - processT0 )
        print ("Save image time : " , totalT1-processT1)
        print ("total  Execution time : " ,totalT1-totalT0 )
        print ("\n")

def TheanoBlackWhite(inPath, outPath, mode = "luminosity" , log = 0):
    if log == 1 :
        print ("----------> THEANO CONVERSION")
    totalT0 = time.time()

    im = Image.open(inPath)
    px = np.array(im)

    getDataT1 = time.time()

    print ("-----> Opening path :" , inPath)

    processT0 =  time.time()
    x = shared(px)

    if mode == 'luminosity':
        weights = shared(np.array([0.21,0.71,0.07]))
    else:
        weights = shared(np.array([1/3,1/3,1/3]))
    f = function([], tensor.tile(tensor.dot(x,weights),(3,1,1)).dimshuffle((1,2,0)))

    px=f()

    processT1= time.time()
    im = Image.fromarray(px.astype('uint8'))
    im.save(outPath)

    print ("-----> Saving path :" , outPath)
    totalT1 = time.time()
    if np.any([isinstance(x.op, tensor.Elemwise) and
              ('Gpu' not in type(x.op).__name__)
              for x in f.maker.fgraph.toposort()]):
        print('Used the cpu')
    else:
        print('Used the gpu')
    if log == 1 :
        print ("Image size : ",im.size)
        print ("get and convert Image data  : " ,getDataT1-totalT0 )
        print ("Processing data : " , processT1 - processT0 )
        print ("Save image time : " , totalT1-processT1)
        print ("total  Execution time : " ,totalT1-totalT0 )
        print ("\n")

if __name__=='__main__':
    lim = os.listdir(path=r'./train_sample')
    inPath = os.path.join(r'./train_sample',lim[0])#just first image
    blackWhite(inPath, 'out_np.jpg', mode = "luminosity" , log = 1)
    TheanoBlackWhite(inPath, 'out_theano.jpg', mode = "luminosity" , log = 1)

Result:

----------> SERIAL CONVERSION
-----> Opening path : ./train_sample/e04e300c909046d8.jpg
-----> Saving path : out_np.jpg
Image size :  (1600, 1200)
get and convert Image data  :  0.03160548210144043
Processing data :  0.06990170478820801
Save image time :  0.08301019668579102
total  Execution time :  0.18471074104309082


----------> THEANO CONVERSION
-----> Opening path : ./train_sample/e04e300c909046d8.jpg
-----> Saving path : out_theano.jpg
Used the cpu
Image size :  (1600, 1200)
get and convert Image data  :  0.03149127960205078
Processing data :  0.13131022453308105
Save image time :  0.06376218795776367
total  Execution time :  0.2267618179321289
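
One thing to keep in mind when reading these numbers: in TheanoBlackWhite, the timed "Processing data" region also contains the call to function(...), so building and compiling the graph is counted together with the actual computation. A minimal sketch of timing one-time setup separately from the per-call cost is given below. `time_setup_and_calls` and `build_numpy_converter` are hypothetical helpers, and the NumPy converter is only a stand-in for the setup step; nothing here is Theano API.

```python
import time
import numpy as np

def time_setup_and_calls(build, arg, n_runs=10):
    """Time build() once, then average n_runs calls of the callable it returns."""
    t0 = time.perf_counter()
    fn = build()                       # one-time cost (e.g. compiling a function)
    setup_s = time.perf_counter() - t0

    result = fn(arg)                   # warm-up call, not timed
    t0 = time.perf_counter()
    for _ in range(n_runs):
        result = fn(arg)
    per_call_s = (time.perf_counter() - t0) / n_runs
    return setup_s, per_call_s, result

def build_numpy_converter():
    """Stand-in for the setup step: returns a grayscale conversion function."""
    weights = np.array([0.21, 0.71, 0.07])

    def convert(px):
        lum = np.dot(px, weights).astype(np.uint8)
        return np.repeat(lum[:, :, np.newaxis], 3, axis=2)

    return convert
```

To apply this to the Theano version, `build` would presumably wrap the function(...) call from the code above, adapted so that the compiled function takes the image as an input. That adaptation is an assumption and has not been tested here.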

edited Mar 10, 2018 at 15:10 by 200_success

asked Mar 10, 2018 at 13:37 by hyamanieu

(I was under the impression that CUDA run-times depend on driver quality and heavily on the hardware used.)

– greybeard
Commented Mar 10, 2018 at 14:00

You are certainly right. I believe for this particular task, my CPU was fast enough. It would be nice to try on 100 images at the same time but I have no idea how to do that.

– hyamanieu
Commented Mar 10, 2018 at 18:44
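
For the idea of processing many images at once raised in the comment above, one possible NumPy sketch is to stack same-sized images into a single 4-D array and apply the weights in one call. The helper name `gray_batch` is hypothetical, and the sketch assumes all images share the same height and width (images of differing sizes would need resizing or padding first).

```python
import numpy as np

# Luminosity weights (R, G, B), same values as in the question's code.
WEIGHTS = np.array([0.21, 0.71, 0.07])

def gray_batch(images):
    """Convert a list of same-sized (h, w, 3) uint8 images in one shot."""
    batch = np.stack(images)                        # shape (n, h, w, 3)
    lum = np.dot(batch, WEIGHTS).astype(np.uint8)   # shape (n, h, w)
    return np.repeat(lum[..., np.newaxis], 3, axis=-1)  # shape (n, h, w, 3)
```

A larger batch gives the device more work per transfer, which is usually where a GPU has a chance of paying off; whether it does for this operation on a given machine would need to be measured.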
