I would like to share some observations about your main concerns given at the end of the question. Let's start from the back:

5. extend_col/reduce_row

From what I can see, the "trick" here is to bring the points into a homogeneous coordinate system and back. Therefore, I would propose to change the name of both functions to to_homogeneous and from_homogeneous or something similar. I would also propose a more straightforward implementation of from_homogeneous:

def to_homogeneous(array):
    """Extend `array` adding a last columns of ones"""
    # alternatively:
    # rows, cols = array.shape
    # arr = np.empty((rows, cols+1))
    # arr[:, :-1] = array
    # arr[:, -1] = 1
    # or:
    # arr = np.ones((rows, cols+1))
    # arr[:, :-1] = array
    return np.hstack((array, np.ones((array.shape[0], 1))))


def from_homogeneous(array):
    """Divide the array by the last column and return all but the last one"""
    return array[:, :-1] / array[:, -1, np.newaxis]
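
Since both helpers are tiny, a quick round trip shows how they fit together (the sample points below are made up for illustration):

points = np.array([[1.0, 2.0], [3.0, 4.0]])
homogeneous = to_homogeneous(points)  # appends a column of ones -> shape (2, 3)
# from_homogeneous divides by the last column, so the round trip
# returns the original points unchanged
assert np.allclose(from_homogeneous(homogeneous), points)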

Since \$(A\cdot B)^T=B^T\cdot A^T\$ you can get rid of some of the transpositions in image_warp:

rxc = np.array(list(product(range(shape[0]), range(shape[1]))))
uv = from_homogeneous(to_homogeneous(rxc) @ la.inv(M).T)
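
If you want to convince yourself of that identity numerically, a short check with random matrices (purely illustrative) does the trick:

A = np.random.rand(3, 3)
B = np.random.rand(3, 4)
# (A @ B).T must equal B.T @ A.T up to floating point noise
assert np.allclose((A @ B).T, B.T @ A.T)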

I'm also relatively sure that there has to be a better way than itertools.product here, but I haven't found it yet ;-)

4. clip vs. np.clip

As you rightfully suspected, you could use np.clip for the task directly. If you look at its documentation, you'll see that you can pass array-like arguments as lower and upper bounds.

uv_neigh = np.clip(
    uv_neigh,
    [lower_u, lower_u, lower_v, lower_v],
    [upper_u, upper_u, upper_v, upper_v]
)

Please verify this yourself; from what I've seen, it always delivers the same results as your original implementation.
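
A minimal self-check along those lines could look as follows (random test data; a sketch, not a full test suite):

rng = np.random.default_rng(42)
uv_neigh = rng.uniform(-5.0, 15.0, size=(100, 4))
lower_u, upper_u, lower_v, upper_v = 0, 9, 0, 7

clipped = np.clip(
    uv_neigh,
    [lower_u, lower_u, lower_v, lower_v],
    [upper_u, upper_u, upper_v, upper_v]
)

# reference: clip each column separately against its own bounds
bounds = [(lower_u, upper_u), (lower_u, upper_u), (lower_v, upper_v), (lower_v, upper_v)]
reference = np.column_stack([
    np.clip(uv_neigh[:, i], low, high) for i, (low, high) in enumerate(bounds)
])
assert np.allclose(clipped, reference)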

3. Rotating and flipping the results in image_warp

Maybe later.

2. Reducing conversions and transpositions

While talking about the 5th point, I already presented a first step towards reducing the amount of transpositions. The part where uv_neigh is computed is another candidate to cut some transpositions. Rewriting it to make use of the full power of numpy will help as a first step:

def neighboring_points(points):
    """Return the neighbor points of given uv"""
    neigh_np = np.empty((points.shape[0], 4))
    neigh_np[:, 0::2] = np.floor(points)
    neigh_np[:, 1::2] = np.ceil(points)
    return neigh_np

This implementation creates a numpy array of the right size and then fills its 1st and 3rd columns with the floored coordinates and its 2nd and 4th columns with the ceiled coordinates of the points. Again, this should be fully compatible with your original implementation, but without all the transpositions and converting back and forth between Python and numpy.
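
A tiny example (values invented) makes the column layout explicit:

points = np.array([[1.2, 3.7],
                   [0.5, 2.0]])
print(neighboring_points(points))
# [[1. 2. 3. 4.]
#  [0. 1. 2. 2.]]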

With this change also in place, image_warp now looks much friendlier:

def image_warp(M, img, shape):
    rxc = np.array(list(product(range(shape[0]), range(shape[1]))))
    uv = from_homogeneous(to_homogeneous(rxc) @ la.inv(M).T)

    uv_neigh = neighboring_points(uv)

    # you could also move this into a function as before
    lower_u, upper_u, lower_v, upper_v = 0, img.shape[1]-1, 0, img.shape[0]-1
    uv_neigh = np.clip(
        uv_neigh,
        [lower_u, lower_u, lower_v, lower_v],
        [upper_u, upper_u, upper_v, upper_v]
    )
    coords = np.hstack((uv, uv_neigh))

    return np.flip(np.rot90(interpolate(coords, img).reshape(shape), 3), 1).astype('uint8')

1. List comprehension in bilinear_interp

Indeed, the list comprehension here seems to be the biggest bottleneck of the code. Since I wasn't fully able to decipher all the cryptic variable names in your code and didn't have much time at hand to really wrap my head around the problem, I took the lazy approach and threw the just-in-time compiler numba (i.e. it compiles plain Python/numpy code to faster, platform-specific code) at the problem to see how it went. This is the code I ended up with:

from numba import jit


@jit(nopython=True)
def _bilinear_core(xx, q, yy):
    """Evaluate xx[i] @ q[i] @ yy[i] for every point i"""
    n = xx.shape[1]
    zz = np.empty((n, ))
    xx = xx.T
    q = q.T
    yy = yy.T
    for i in range(n):
        zz[i] = xx[i, :] @ q[i, :] @ yy[i, :].T
    return zz


def bilinear_interp(x, y, x0, x1, y0, y1, q00, q01, q10, q11):
    """Do bilinear interpolation given a point, a box and
    values on box vetices"""
    q = np.array([[q00, q01], [q10, q11]])
    xx = np.array([(x1 - x), (x - x0)])
    yy = np.array([(y1 - y), (y - y0)])
    return _bilinear_core(xx, q, yy)
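
To make the calling convention concrete, here is a small made-up example with two query points, each interpolated inside a unit box:

x = np.array([0.25, 1.5])
y = np.array([0.75, 1.5])
x0, x1 = np.floor(x), np.floor(x) + 1
y0, y1 = np.floor(y), np.floor(y) + 1
# one corner value per query point
q00 = np.array([0.0, 1.0])
q01 = np.array([1.0, 1.0])
q10 = np.array([2.0, 1.0])
q11 = np.array([3.0, 1.0])
print(bilinear_interp(x, y, x0, x1, y0, y1, q00, q01, q10, q11))
# [1.25 1.  ]  (the second point sits in a box with constant value 1)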

As you can see, I had to make some changes to use numba's faster nopython mode. The biggest change is that you cannot use zip(...) in that mode, nor some of the other convenient functions available in plain Python. Splitting the code up into two functions was likely not necessary, but I like to do it nevertheless to keep the numba-specific modifications contained to a small scope. Other than this, the code is almost unchanged and it's still written in pure Python/numpy.

But what are the benefits of these extra hoops you now have to jump through? Ignoring all the plotting and OpenCV's reference implementation, your main function runs in about \$10s\$ on my old laptop. Using numba and all the changes presented in this answer up to this point, the same main function now runs in \$3.2s\$*. Not bad, is it?


* When timing functions that (try to) use numba JIT, you have to take care to run the code at least twice and ignore the first measurement, since the first measurement would otherwise include the time numba needs to compile the function. In this example here, the first run takes about \$4.3s\$, which means that roughly \$1.1s\$ is spent on the compilation process.
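
A minimal timing harness illustrating this run-twice pattern might look like the following (the arguments are placeholders for whatever data you feed bilinear_interp):

import time

def timed(func, *args):
    """Return the runtime of a single call to func in seconds"""
    start = time.perf_counter()
    func(*args)
    return time.perf_counter() - start

first = timed(bilinear_interp, x, y, x0, x1, y0, y1, q00, q01, q10, q11)
second = timed(bilinear_interp, x, y, x0, x1, y0, y1, q00, q01, q10, q11)
# `first` includes numba's compilation time, `second` does not,
# so only `second` reflects the steady-state performance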