communicate the problem
define the issue
It is not clear at how high a cardinality this operation becomes so costly.
This would have been a stronger submission if you had more carefully
defined what "costly" means in this context.
Does your Use Case mostly care about time complexity or memory complexity?
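For instance, a tiny harness along these lines (a sketch; measure is a made-up name, and intersect stands in for the function under review) would report both costs for one call:

    import timeit
    import tracemalloc

    def measure(fn, *args):
        # Report elapsed time and peak traced memory for one call of fn.
        # tracemalloc only sees allocations made through Python's memory
        # APIs; recent NumPy versions do report their buffers to it.
        tracemalloc.start()
        elapsed = timeit.timeit(lambda: fn(*args), number=1)
        _, peak = tracemalloc.get_traced_memory()
        tracemalloc.stop()
        print(f"{fn.__name__}: {elapsed:.3f} s, peak {peak / 1e6:.1f} MB")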
demonstrate the issue
This would have been a stronger submission if it included automated
tests which construct "big" synthetic inputs, display the elapsed time
for an intersection, and verify the correctness of the result.
Then we would have a better idea of what "efficient" means to you,
and what "high cardinality" means in your Use Case.
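Something along these lines would do, where intersect is again a hypothetical stand-in for the function under review, assumed to return a boolean membership mask for gs:

    import time
    import numpy as np

    def test_intersection(intersect, n=1_000_000, seed=42):
        # Construct "big" synthetic inputs with plenty of overlap.
        rng = np.random.default_rng(seed)
        gs = rng.integers(0, 2 * n, size=n)
        ni = rng.integers(0, 2 * n, size=n)

        # Display elapsed time for one intersection.
        start = time.perf_counter()
        result = intersect(gs, ni)
        elapsed = time.perf_counter() - start
        print(f"n = {n}: {elapsed:.3f} s")

        # Verify correctness against an independent set-based reference.
        ni_set = set(ni.tolist())
        expected = np.fromiter((x in ni_set for x in gs.tolist()),
                               dtype=bool, count=n)
        assert np.array_equal(result, expected)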
use an appropriate algorithm
Let n = max(len(gs), len(ni)).
We would expect a solution to have time complexity
of $O(n \log n)$, or perhaps better, perhaps even linear.
The OP's solution appears to have quadratic cost.
This is actually feasible, given that the ni_unique vector is ordered,
constructed in $O(n \log n)$ time if it's the bigger one.
A sensible algorithm could exploit binary search of ni_unique
in processing each element of gs.
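Sketched by hand, that idea looks something like this (illustrative only, not the OP's code; it assumes ni is non-empty):

    import numpy as np

    def mask_via_binary_search(gs, ni):
        # np.unique returns a *sorted* vector, so each element of gs can
        # be located with an O(log n) binary search.
        ni_unique = np.unique(ni)
        idx = np.searchsorted(ni_unique, gs)   # insertion points in [0, len]
        idx[idx == len(ni_unique)] = 0         # clamp overflow to a safe index
        return ni_unique[idx] == gs            # True where the value was found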
But from the man page it's not clear to me that we have exactly armed
the library routine with that fact, since "unique" is not "ordered",
and we only told it to assume unique.
In fact the man page advises us:
Deprecated since version 2.0: Use isin instead of in1d for new code.
use the modern solution
Call isin instead.

    return np.isin(gs, ni)
The documentation explains that there will be some temporary memory
cost (of course the OP already allocated a large temporary vector),
and that this will proceed at the speed of mergesort.
Presumably this better fits your efficiency needs.
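As a quick sanity check (reusing the hypothetical mask_via_binary_search sketched above), the two masks should agree:

    rng = np.random.default_rng(0)
    gs = rng.integers(0, 100, size=50)
    ni = rng.integers(0, 100, size=50)
    assert np.array_equal(np.isin(gs, ni), mask_via_binary_search(gs, ni))

And recent NumPy releases accept a kind= keyword on isin, so you can trade the sort-based strategy against a table lookup when your integer range is small.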