polars assign to multiple columns

Question

I am trying to understand whether there is a way to do when/then/otherwise in polars and assign to multiple columns at once. I have an Elo dataset with millions of rows, where I want to assign the current Elo to every row for a given id on or after a given date. In pandas, I would do

elo_df.loc[pd.IndexSlice[id, date:], ["elo", "true_skill_mu", "true_skill_sigma"]] = elo, true_skill_mu, true_skill_sigma

The code below works but is very slow. I am hoping I can increase the speed by at least 3x by making the filter happen once. Also, if you have any suggestions on how to make this faster, please let me know.

elo_df = elo_df.with_columns([
    pl.when((pl.col("id") == col) & (pl.col("date") >= date)).then(pl.lit(new_rating)).otherwise(pl.col("elo")).alias("elo"),
    pl.when((pl.col("id") == col) & (pl.col("date") >= date)).then(pl.lit(new_mu)).otherwise(pl.col("true_skill_mu")).alias("true_skill_mu"),
    pl.when((pl.col("id") == col) & (pl.col("date") >= date)).then(pl.lit(new_sigma)).otherwise(pl.col("true_skill_sigma")).alias("true_skill_sigma"),
])

Since they are in a with_columns context, the three when/then/otherwise expressions will run in parallel (as long as your CPU has at least 3 cores). So from a wall-clock standpoint, you will not gain much by trying to rewrite them as one filter. That said, are you updating large batches of ids at one time? If so, then there is a speed-up for that. – user18559875, Commented Jul 24, 2022 at 19:34
Hmm, I'm somewhat puzzled. I created a dataset of 605 million records and ran the three when/then/otherwise expressions as you described. I'm getting times of about 3-4 seconds. Are you getting something significantly worse? – user18559875, Commented Jul 24, 2022 at 23:21
My issue is I am probably doing something silly. I am rolling through each date and, on each date, updating every row from that date forward. – Michael WS, Commented Jul 24, 2022 at 23:52

David Waterworth · Accepted Answer · 2023-03-14 11:42:28Z

Use polars.col followed by polars.Expr.map (renamed Expr.map_batches in later Polars releases).

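def on_when(s: pl.Series, df=elo_df):
    # The new values are read from module-level variables here;
    # they could also be passed in as default arguments.
    if s.name == "elo":
        value = new_rating
    elif s.name == "true_skill_mu":
        value = new_mu
    elif s.name == "true_skill_sigma":
        value = new_sigma
    # map expects a Series back, so repeat the scalar to the column's length
    return pl.repeat(value, len(s), eager=True).alias(s.name)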


def otherwise(s : pl.Series, df = elo_df):
    if s.name == "elo":
        return df.get_column("elo")
    elif s.name == "true_skill_mu":
        return df.get_column("true_skill_mu")
    elif s.name == "true_skill_sigma":
        return df.get_column("true_skill_sigma")
    # or you can simply just use `return s` since we are not doing any operations
    # this is just an example if we want to do calculations based on other columns


elo_df = elo_df.with_columns([
    pl.when((pl.col("id") == col) & (pl.col("date") >= date))
    .then(pl.col(["elo", "true_skill_mu", "true_skill_sigma"]).map(on_when))
    .otherwise(pl.col(["elo", "true_skill_mu", "true_skill_sigma"]).map(otherwise))
])

edited Mar 14, 2023 at 11:42 by David Waterworth

answered Jul 28, 2022 at 16:38 by lost bit

I assume that's faster? – Michael WS, Commented Jul 29, 2022 at 12:26
Could I update a bunch of ids at once with map in some way? – Michael WS, Commented Jul 29, 2022 at 12:27
Have you tried using is_in? I can't guarantee that it is faster. – lost bit, Commented Jul 29, 2022 at 13:44
Yes. I was testing a giant pandas project in polars and this part is the only slow bit. – Michael WS, Commented Jul 30, 2022 at 12:17
