How to optimize a prefix-based array_intersect in Hive SQL executed by the Spark engine?

buckets is a column of type array<string>. The logic is similar to array_intersect, except only the prefix of each string in buckets (before the first -) is compared. How can I optimize the following clause?

```sql
transform(
    a.buckets, -- buckets is a set of strings
    x -> if(
        array_contains(c.names, split(x, '-')[0]), -- c is a CTE, names is an array/set of strings
        x,
        null
    )
) AS buckets
```
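To make the behavior concrete, here is a minimal sketch with made-up sample data (the rows in `a` and `c` are hypothetical; only the column names come from the clause above). It runs the original expression next to one possible variant that uses `filter` and `substring_index`. Note that the variant is not a drop-in replacement: it removes non-matching elements instead of turning them into NULLs, which is closer to what array_intersect does. Whether it is actually faster would need to be measured on real data.

```sql
-- Hypothetical sample data; a and c stand in for the real table and CTE.
WITH a AS (
    SELECT array('foo-1', 'bar-2', 'baz-3') AS buckets
),
c AS (
    SELECT array('foo', 'baz') AS names
)
SELECT
    -- Original form: non-matching elements are replaced by NULL.
    -- Expected for the sample row: ['foo-1', NULL, 'baz-3']
    transform(
        a.buckets,
        x -> if(array_contains(c.names, split(x, '-')[0]), x, null)
    ) AS buckets_with_nulls,
    -- Variant: filter drops non-matching elements, and substring_index
    -- takes the text before the first '-' without the regex-based split.
    -- Expected for the sample row: ['foo-1', 'baz-3']
    filter(
        a.buckets,
        x -> array_contains(c.names, substring_index(x, '-', 1))
    ) AS buckets_filtered
FROM a CROSS JOIN c;
```

Both `filter` and `transform` are higher-order functions available in Spark SQL 2.4 and later. If the NULL placeholders are required downstream, the `substring_index` swap can still be applied inside the original `transform`.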

There is also a case where bucket is a single comma-separated string column:

```sql
transform(
    split(bucket, ','), -- bucket is a string
    x -> if(
        array_contains(c.names, split(x, '-')[0]), -- c is a CTE, names is an array/set of strings
        x,
        null
    )
) AS buckets
```
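Since `array_contains` scans `names` linearly for every element, one alternative worth considering when `names` is large is to explode both sides and let Spark do an equi-join on the prefix. The sketch below is only an illustration under several assumptions: the sample rows, the `id` key column, and the CTE names `elems` and `wanted` are all hypothetical, and `c` is assumed to hold a single row.

```sql
-- Hypothetical sample data; id is an assumed row key used to regroup elements.
WITH a AS (
    SELECT 1 AS id, 'foo-1,bar-2,baz-3' AS bucket
),
c AS (
    SELECT array('foo', 'baz') AS names
),
-- One row per element of the split string, with its prefix precomputed.
elems AS (
    SELECT a.id, t.x, substring_index(t.x, '-', 1) AS prefix
    FROM a
    LATERAL VIEW explode(split(a.bucket, ',')) t AS x
),
-- One row per distinct name to match against.
wanted AS (
    SELECT DISTINCT n.name
    FROM c
    LATERAL VIEW explode(c.names) n AS name
)
-- The equi-join keeps only elements whose prefix appears in names.
SELECT e.id, collect_list(e.x) AS buckets
FROM elems e
JOIN wanted w ON e.prefix = w.name
GROUP BY e.id;
```

This has different semantics from the `transform` version: non-matching elements are dropped rather than nulled, rows with no matching element disappear because of the inner join, and `collect_list` does not guarantee the original element order. It also introduces a shuffle for the aggregation, so for small `names` arrays the in-place higher-order function is likely the cheaper option.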

How can this prefix-based filtering be optimized in Hive SQL executed by the Spark engine?

asked Nov 22 at 17:27

Dong Ye
