
I have two one-dimensional numpy arrays in Python, gs and ni. I want to intersect them while preserving the order of elements in gs.

When they have high cardinality, this operation becomes very costly. Is there an efficient way to do this?

I am trying this:

import numpy as np

def intersect_preserve_order_np(gs, ni):
    # Deduplicate ni, then keep the elements of gs that appear in it;
    # the boolean mask preserves the original order of gs.
    ni_unique = np.unique(ni)
    mask = np.in1d(gs, ni_unique, assume_unique=True)
    return gs[mask]

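For example (a small illustrative call; the values are made up):

gs = np.array([5, 3, 9, 1])
ni = np.array([1, 3, 7, 3])
print(intersect_preserve_order_np(gs, ni))  # [3 1], in gs order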

  • Add some example calls to the function, especially a "costly" one. – toolic, Feb 25 at 13:32
1 Answer


communicate the problem

define the issue

When they have high cardinality, this operation becomes very costly.

This would have been a stronger submission if you had more carefully defined what "costly" means in this context. Does your Use Case mostly care about time complexity or memory complexity?

demonstrate the issue

This would have been a stronger submission if it had included automated tests which construct "big" synthetic inputs, display the elapsed time for an intersection, and verify the correctness of the result. Then we would have a better idea of what "efficient" means to you, and what "high cardinality" means in your Use Case.
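For example, a minimal harness along these lines (the sizes and the set-based reference implementation are illustrative assumptions):

import time
import numpy as np

def intersect_preserve_order_np(gs, ni):
    # The OP's function, repeated so this script is self-contained.
    ni_unique = np.unique(ni)
    mask = np.in1d(gs, ni_unique, assume_unique=True)
    return gs[mask]

rng = np.random.default_rng(42)
gs = rng.choice(10_000_000, size=1_000_000, replace=False)  # unique, as assumed
ni = rng.integers(0, 10_000_000, size=2_000_000)

start = time.perf_counter()
result = intersect_preserve_order_np(gs, ni)
elapsed = time.perf_counter() - start

# Verify against a plain-Python reference implementation.
ni_set = set(ni.tolist())
expected = np.array([x for x in gs.tolist() if x in ni_set])
assert np.array_equal(result, expected)
print(f"{len(result)} matches in {elapsed:.3f} s")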

use an appropriate algorithm

Let n = max(len(gs), len(ni)). We would expect a solution to have time complexity of \$O(n \log n)\$, or perhaps better, perhaps linear. The OP's solution appears to have quadratic cost.

This is actually feasible, given that the ni_unique vector is ordered; it is constructed in \$O(n \log n)\$ time even if ni is the bigger input. A sensible algorithm could then exploit binary search of ni_unique while processing each element of gs, as sketched below. But from the man page it is not clear to me that we have actually armed the library routine with that fact, since "unique" is not "ordered", and we only told it to assume uniqueness. In fact, the man page advises us:

Deprecated since version 2.0: Use isin instead of in1d for new code.
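To make the binary-search idea concrete, here is a sketch built on np.searchsorted (this is an assumed alternative implementation, not a claim about what in1d does internally):

import numpy as np

def intersect_via_searchsorted(gs, ni):
    # np.unique returns a *sorted* array, which is what enables binary search.
    ni_unique = np.unique(ni)
    if ni_unique.size == 0:
        return gs[:0]  # empty result with the same dtype
    # Locate each element of gs in ni_unique: O(len(gs) * log(len(ni))).
    idx = np.searchsorted(ni_unique, gs).clip(max=ni_unique.size - 1)
    mask = ni_unique[idx] == gs  # exact-match check
    return gs[mask]

On a sorted ni_unique this gives the \$O(n \log n)\$ behavior described above.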

use modern solution

Call isin instead, and keep the mask-then-index pattern so the function still returns the intersected elements rather than a boolean mask:

    return gs[np.isin(gs, ni)]

The documentation explains that there will be some temporary memory cost (of course, the OP already allocates a large temporary vector), and that this will proceed at the speed of mergesort. Presumably that better fits your efficiency needs.
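Putting it together, the whole function collapses to a one-liner (a sketch; whether it meets your budget is something a benchmark like the one above would confirm):

import numpy as np

def intersect_preserve_order_np(gs, ni):
    # isin deduplicates and sorts ni internally, then produces a boolean
    # mask aligned with gs, so the order of gs is preserved.
    return gs[np.isin(gs, ni)]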

