
It doesn’t make sense to me. It looks like nonzero(arr != 0) just creates an intermediate array, allocating more memory. There’s no way it should be faster; otherwise, why wouldn’t NumPy do that conversion internally? But here is my benchmark:

import numpy as np
from timeit import timeit

arr = np.random.randint(0, 2, 10_000_000)

a = timeit(lambda: np.nonzero(arr != 0), number=10)
b = timeit(lambda: np.nonzero(arr), number=10)
print(f"nonzero(arr != 0): {a}")
print(f"nonzero(arr):      {b}")

Results:

nonzero(arr != 0): 0.20066774962469935
nonzero(arr):      0.5988789172843099

It seems that nonzero is just much faster if you convert the input into a Boolean array first. I tested it with a smaller integer dtype, and the results are consistent:

arr = np.random.randint(0, 2, 10_000_000, dtype=np.uint8)

a = timeit(lambda: np.nonzero(arr != 0), number=10)
b = timeit(lambda: np.nonzero(arr), number=10)
c = timeit(lambda: np.nonzero(arr.view(bool)), number=10)
print(f"nonzero(arr != 0): {a}")
print(f"nonzero(arr):      {b}")
print(f"nonzero(view):     {c}")

Results:

nonzero(arr != 0): 0.15293325018137693
nonzero(arr):      0.5660374169237912
nonzero(view):     0.13332204101607203

So an intermediate array does add some overhead, but the overhead is much smaller than the nonzero() overhead on non-Boolean arrays.

Are there some reasons for this? My guess is that nonzero() is optimized (maybe using SIMD?) for a Boolean array, but somehow the optimization doesn’t happen for other types.

  • Sounds like != has a faster implementation than nonzero for integer inputs, and nonzero just hits a fast path when the input is already Boolean. Commented Feb 24 at 3:49

1 Answer


NumPy reads the array differently when it is Boolean: it can process something like 64 or 256 values at a time using SIMD instructions (CPU features that operate on many values in bulk). For integer dtypes it falls back to checking each element one by one, which is slower.

arr != 0 creates a Boolean array, which then takes the fast path, while arr stays an integer array and is scanned element by element.

Also, arr.view(bool) is the fastest because it skips allocating a new array and just reinterprets the same memory as if it were already Boolean.
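As a quick sanity check (a minimal sketch, assuming a uint8 array whose values are only 0 and 1, so the Boolean view is well defined), you can verify that the view allocates nothing and gives the same result:

```python
import numpy as np

arr = np.random.randint(0, 2, 1000, dtype=np.uint8)

# The view shares memory with the original -- no intermediate array is made.
print(np.shares_memory(arr, arr.view(bool)))  # True

# For a 0/1 uint8 array, nonzero on the view matches nonzero on the original.
print(np.array_equal(np.nonzero(arr.view(bool))[0],
                     np.nonzero(arr)[0]))     # True
```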

For further reading, NumPy's CPU/SIMD optimizations documentation covers how these fast paths are built.


3 Comments

That's silly. Seems like an obvious missing feature. All mainstream SIMD ISAs have packed-integer comparison, like x86 vpcmpeqd for SIMD vectors of 32-bit integers, or AArch64 cmeq v0.4s, v1.4s, v2.4s, maybe allowing an immediate operand to compare against zero without even having to broadcast a constant first. Branching on every SIMD vector as an early-out isn't cheap even on x86 (where SIMD-int to scalar-int is always fairly fast, with vpmovmskb extracting a mask of the high bits of the vector elements), but it can be amortized by checking every few vectors. Seems like a major missed optimization.
Narrowing an array to boolean before checking it for all-zero takes more work if you're only doing one pass. (If you're converting once and then testing multiple times, each pass over the bool array is 8x or 32x faster.) So yeah, this is something NumPy should optimize better.
I would refine your thinking on arr.view(bool). Read the documentation for view: For a.view(somedtype), if somedtype has a different number of bytes per entry than the previous dtype (for example, converting a regular array to a structured array) [...] This axis will be resized in the result. If your integer type is not 8-bit, then this is going to severely limit what kind of operation is still valid on this data. Something like .all() is safe, but .nonzero() is NOT.
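The caveat in the last comment is easy to demonstrate (a hypothetical illustration; the exact byte positions assume a little-endian machine). Viewing a wider integer dtype as bool reinterprets every byte, so the array's length changes and nonzero() reports byte offsets instead of element indices:

```python
import numpy as np

arr = np.array([0, 1, 2], dtype=np.int64)

# Each int64 element is 8 bytes, so the bool view has 8x as many entries.
print(arr.view(bool).shape)           # (24,)

# nonzero on the original gives element indices...
print(np.nonzero(arr)[0])             # [1 2]

# ...but nonzero on the view gives positions of nonzero *bytes*,
# which are not valid indices into the int64 array.
print(np.nonzero(arr.view(bool))[0])
```
So the view trick is only safe when the original dtype is one byte wide (uint8, int8, bool).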

