I would like to share some observations about the main concerns you listed at the end of your question. Let's start from the back:
5. extend_col/reduce_row
From what I can see, the "trick" here is to bring the points into a homogeneous coordinate system and back. I would therefore propose renaming the two functions to to_homogeneous and from_homogeneous, or something similar. I would also propose a more straightforward implementation of from_homogeneous:
import numpy as np

def to_homogeneous(array):
    """Extend `array` by appending a last column of ones"""
    # alternatively:
    # rows, cols = array.shape
    # arr = np.empty((rows, cols+1))
    # arr[:, :-1] = array
    # arr[:, -1] = 1
    # or:
    # arr = np.ones((rows, cols+1))
    # arr[:, :-1] = array
    return np.hstack((array, np.ones((array.shape[0], 1))))
def from_homogeneous(array):
    """Divide `array` by its last column and return all columns but the last"""
    return array[:, :-1] / array[:, -1, np.newaxis]
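A quick round-trip check (the sample points are made up for illustration):

points = np.array([[1.0, 2.0],
                   [3.0, 4.0]])
homogeneous = to_homogeneous(points)      # appends a column of ones
recovered = from_homogeneous(homogeneous) # divides it out again
assert np.allclose(points, recovered)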
Since \$(A\cdot B)^T=B^T\cdot A^T\$, you can get rid of some of the transpositions in image_warp:
rxc = np.array(list(product(range(shape[0]), range(shape[1]))))
uv = from_homogeneous(to_homogeneous(rxc) @ la.inv(M).T)
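If you want to convince yourself numerically, a minimal check of the identity in the form it is used here (random matrices, and assuming `la` is `numpy.linalg` as in your code):

import numpy as np
import numpy.linalg as la

A = np.random.rand(3, 3)
B = np.random.rand(5, 3)  # points as rows, as in rxc
assert np.allclose((A @ B.T).T, B @ A.T)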
I'm also relatively sure that there has to be a better way than itertools.product here, but I haven't found one that is clearly better yet ;-)
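If you want to experiment, np.indices builds the same row-major coordinate grid without a Python-level loop; a sketch (the shape value is made up, and I haven't benchmarked this against itertools.product):

import numpy as np

shape = (480, 640)  # made-up example shape
rxc = np.indices(shape).reshape(2, -1).T
# same ordering as np.array(list(product(range(shape[0]), range(shape[1]))))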
4. clip vs. np.clip
As you rightfully suspected, you can use np.clip for the task directly. If you look at its documentation, you'll see that you can pass array-like arguments as upper and lower bounds.
uv_neigh = np.clip(
    uv_neigh,
    [lower_u, lower_u, lower_v, lower_v],
    [upper_u, upper_u, upper_v, upper_v]
)
Please verify this yourself; from what I've seen, it always delivers the same results as your original implementation.
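The bounds are broadcast along the last axis, so each column gets its own limits; a minimal illustration with made-up numbers:

import numpy as np

a = np.array([[-1.0,  5.0],
              [ 3.0, 10.0]])
np.clip(a, [0, 0], [4, 8])
# -> array([[0., 5.],
#           [3., 8.]])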
3. Rotating and flipping the results in image_warp
Maybe later.
2. Reducing conversions and transpositions
While talking about the 5th point, I already presented a first step towards reducing the number of transpositions. The part where uv_neigh is computed is another candidate for cutting some transpositions. Rewriting it to make use of the full power of numpy helps as a first step:
def neighboring_points(points):
    """Return the neighboring grid points of the given points"""
    neigh_np = np.empty((points.shape[0], 4))
    neigh_np[:, 0::2] = np.floor(points)
    neigh_np[:, 1::2] = np.ceil(points)
    return neigh_np
This implementation creates a numpy array of the right size and then fills its 1st and 3rd columns with the floored coordinates and its 2nd and 4th columns with the ceiled coordinates of the points; see the small example below. Again, this should be fully compatible with your original implementation, but without all the transpositions and without converting back and forth between Python lists and numpy arrays.
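For a single made-up point:

point = np.array([[1.25, 3.75]])
neighboring_points(point)
# -> array([[1., 2., 3., 4.]])  i.e. (floor(u), ceil(u), floor(v), ceil(v))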
With this change also in place, image_warp now looks much friendlier:
def image_warp(M, img, shape):
    rxc = np.array(list(product(range(shape[0]), range(shape[1]))))
    uv = from_homogeneous(to_homogeneous(rxc) @ la.inv(M).T)
    uv_neigh = neighboring_points(uv)
    # you could also move this into a function as before
    lower_u, upper_u, lower_v, upper_v = 0, img.shape[1]-1, 0, img.shape[0]-1
    uv_neigh = np.clip(
        uv_neigh,
        [lower_u, lower_u, lower_v, lower_v],
        [upper_u, upper_u, upper_v, upper_v]
    )
    coords = np.hstack((uv, uv_neigh))
    return np.flip(np.rot90(interpolate(coords, img).reshape(shape), 3), 1).astype('uint8')
1. List comprehension in bilinear_interp
Indeed, the list comprehension here seems to be the biggest bottleneck of the code. Since I wasn't fully able to decipher all the cryptic variable names in your code and didn't have enough time at hand to really wrap my head around the problem, I took the lazy approach and threw the just-in-time compiler numba (which compiles plain Python/numpy code to faster, platform-specific code) at the problem to see how it went. This is the code I ended up with:
import numpy as np
from numba import jit

@jit(nopython=True)
def _bilinear_core(xx, q, yy):
    n = xx.shape[1]
    zz = np.empty((n, ))
    xx = xx.T
    q = q.T
    yy = yy.T
    for i in range(n):
        zz[i] = xx[i, :] @ q[i, :] @ yy[i, :].T
    return zz

def bilinear_interp(x, y, x0, x1, y0, y1, q00, q01, q10, q11):
    """Do bilinear interpolation given a point, a box and
    values on box vertices"""
    q = np.array([[q00, q01], [q10, q11]])
    xx = np.array([(x1 - x), (x - x0)])
    yy = np.array([(y1 - y), (y - y0)])
    return _bilinear_core(xx, q, yy)
As you can see, I had to make some changes to use numba's faster nopython mode. The biggest change is that you cannot use zip(...) in that mode, along with some other convenient functions available in Python. Splitting the code up into two functions was likely not necessary, but I like to do it anyway to keep the numba-specific modifications contained in a small scope. Other than that, the code is almost unchanged, and it's still written in pure Python/numpy.
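A quick sanity check with made-up values: on a unit box the usual \$(x_1-x_0)(y_1-y_0)\$ normalization is 1, so interpolating at the midpoint should yield the mean of the four corner values:

ones = np.ones(1)
x = y = 0.5 * ones
x0 = y0 = 0.0 * ones
x1 = y1 = 1.0 * ones
q00, q01, q10, q11 = 1.0 * ones, 2.0 * ones, 3.0 * ones, 4.0 * ones
print(bilinear_interp(x, y, x0, x1, y0, y1, q00, q01, q10, q11))  # [2.5]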
But what do you gain for the extra hoops you now have to jump through? Ignoring all the plotting and OpenCV's reference implementation, your main function runs in about \$10s\$ on my old laptop. Using numba and all the changes presented in this answer up to this point, the same main function now runs in \$3.2s\$*. Not bad, is it?
* When timing functions that (try to) use numba's JIT, you have to take care to run the code at least twice and ignore the first measurement, since the first measurement would otherwise include the time numba needs to compile the function. In this example, the first run takes about \$4.3s\$, which means that roughly \$1.1s\$ is spent on the compilation process.
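A minimal timing pattern that accounts for this (a sketch; `main` here stands for your script's main function):

import time

def measure(fn):
    start = time.perf_counter()
    fn()
    return time.perf_counter() - start

measure(main)                              # first run includes JIT compilation, discard
print(f"warm run: {measure(main):.2f}s")   # subsequent runs give the figure to report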