0

I am trying to understand if there is any way to do when..then..otherwise in polars and assign to multiple columns. I have a elo dataset with millions of rows where I want to assign the current elo to anything greater than date. In pandas, I would do

elo_df.loc[(id, date:)), ["elo", "true_skill_mu", "true_skill_sigma"]] = elo, true_skill_mu, true_skill_sigma

The code below works but is very slow. I am hoping I can increase the speed by at least 3x by making the filter happen once. Also, if have any suggestion on to how tomake this faster, please let me know.

elo_df = elo_df.with_columns([pl.when((pl.col("id") == col) & (pl.col("date") >= date)).then(pl.lit(new_rating)).otherwise(pl.col("elo")).alias("elo"),
                                           
pl.when((pl.col("id") == col) & (pl.col("date") >= date)).then(pl.lit(new_mu)).otherwise(pl.col("true_skill_mu")).alias("true_skill_mu"),
                                            pl.when((pl.col("id") == col) & (pl.col("date") >= date)).then(pl.lit(new_sigma)).otherwise(pl.col("true_skill_sigma")).alias("true_skill_sigma")]
8
  • Since they are in a with_columns context, the three when/then/otherwise expressions will run in parallel (as long as your CPU has at least 3 cores). So from a wall-clock standpoint, you will not gain much by trying to rewrite them as one filter. That said, are you updating large batches of id's at one time? If so, then there is a speed-up for that.
    – user18559875
    Commented Jul 24, 2022 at 19:34
  • I am updating one id at a time
    – Michael WS
    Commented Jul 24, 2022 at 19:44
  • @cbilot Is there a faster way to update than these filters?
    – Michael WS
    Commented Jul 24, 2022 at 22:26
  • Hmm, I'm somewhat puzzled. I created a dataset of 605 million records, and ran the three when/then/otherwise expressions are you described. I'm getting times of about 3-4 seconds. Are you getting something significantly worse?
    – user18559875
    Commented Jul 24, 2022 at 23:21
  • My issue is I am probably doing something silly. I am rolling thru each date and updating all going forward on each date
    – Michael WS
    Commented Jul 24, 2022 at 23:52

1 Answer 1

0

use polars.col followed by polars.Expr.map

def on_when(s : pl.Series, df = elo_df):
    # or pass them as default arguments
    global new_rating, new_mu, new_sigma
    if s.name == "elo":
        return new_rating
    elif s.name == "true_skill_mu":
        return new_mu
    elif s.name == "true_skill_sigma":
        return new_sigma


def otherwise(s : pl.Series, df = elo_df):
    if s.name == "elo":
        return df.get_column("elo")
    elif s.name == "true_skill_mu":
        return df.get_column("true_skill_mu")
    elif s.name == "true_skill_sigma":
        return df.get_column("true_skill_sigma")
    # or you can simply just use `return s` since we are not doing any operations
    # this is just an example if we want to do calculations based on other columns


elo_df = elo_df.with_columns([
    pl.when((pl.col("id") == col) & (pl.col("date") >= date))
    .then(pl.col(["elo", "true_skill_mu", "true_skill_sigma"]).map(on_when))
    .otherwise(pl.col(["elo", "true_skill_mu", "true_skill_sigma"]).map(otherwise))
])
4
  • I assume that's faster?
    – Michael WS
    Commented Jul 29, 2022 at 12:26
  • could update a bunch of "id"s at once with map in some way?
    – Michael WS
    Commented Jul 29, 2022 at 12:27
  • 1
    have you tried using is_in. I can't guarantee that is is faster
    – lost bit
    Commented Jul 29, 2022 at 13:44
  • Yes. I was testing a giant pandas project in polars and this part is the only slow bit
    – Michael WS
    Commented Jul 30, 2022 at 12:17

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.