Looking for performance improvement of my custom dataframe aggregation function (Python/Pandas)

Question

I would like to aggregate rows of my dataframe according to these rules:

Rows are grouped by "Id" and sorted by "Anticipation".
A row can only be aggregated with adjacent rows that have the same "Id".
The "Anticipation" of the new row is the mean of the "Anticipation" values of the aggregated rows, weighted by "Size".
A row is added to the current group of rows to be aggregated if, before adding it, the sum of "Size" in that group is less than or equal to "max_size".

In other words, with this dataframe as input:

   Anticipation   Id  Size
0            10  foo    10
1             9  foo    11
2             8  foo    30
3            10  bar    10
4             9  bar     9
5             8  bar    10
6            10  baz     7

and max_size = 10, the function should return this:

    Id  Size  Anticipation
0  bar    19      9.526316
1  bar    10      8.000000
2  baz     7     10.000000
3  foo    21      9.476190
4  foo    30      8.000000
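
As a quick check of the weighted-mean rule, here is a minimal sketch that recomputes the first "foo" row of the expected output from the first two "foo" input rows. The variable names are only illustrative:

```python
# First two "foo" rows from the input: (Anticipation, Size)
rows = [(10, 10), (9, 11)]

# Total size of the aggregated row
total_size = sum(size for _, size in rows)

# Mean of "Anticipation" weighted by "Size"
weighted_anticipation = sum(ant * size for ant, size in rows) / total_size

print(total_size, round(weighted_anticipation, 6))
```

The third "foo" row (Size 30) is not merged in, because the running sum is already 21, which is above max_size = 10.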

I'm looking to improve my function's performance (perhaps through more idiomatic pandas code?).

Here is my current code:

import pandas as pd
records = [{"Id": "foo", "Size": 10, "Anticipation":10},
{"Id": "foo", "Size": 11, "Anticipation":9},
{"Id": "foo", "Size": 30, "Anticipation":8},
{"Id": "bar", "Size": 10, "Anticipation":10},
{"Id": "bar", "Size": 9, "Anticipation":9},
{"Id": "bar", "Size": 10, "Anticipation":8},
{"Id": "baz", "Size": 7, "Anticipation":10}]

df = pd.DataFrame(records)
max_size = 10

def assembly_lines(df, max_size):

    # sort_values returns a new dataframe, so the result has to be assigned
    df = df.sort_values("Anticipation", ascending=False)
    df["Cumsum"] = df[["Id", "Size"]].groupby(["Id"]).cumsum()
    df["Anticipation*Size"] = df["Anticipation"] * df["Size"]

    L_fin = [] # L_fin stores aggregated rows.

    for name, group in df.groupby(["Id"]): # Group "foo", "bar", "baz" together
        i = 0
        L = [] # L temporarily stores rows before aggregation

        for index, row in group.iterrows():

            L.append(row.to_dict()) # Stores row in L

            if row["Cumsum"] > i + max_size: # If the cumulative size of the rows in L exceeds the maximum allowed size

                i = row["Cumsum"]
                temp = pd.DataFrame.from_dict(L).drop(["Cumsum", "Anticipation"], axis=1) \
                    .groupby("Id") \
                    .agg({"Size": "sum", "Anticipation*Size": "sum"}) #  Then rows are aggregated
                temp["Anticipation"] = temp["Anticipation*Size"] / temp["Size"] # Mean anticipation is calculated
                L_fin.append(temp) # Aggregated row is added to aggregated rows list
                L = [] # Bucket is emptied

        if L: # If rows remain in the bucket after the last row of the group

            temp = pd.DataFrame.from_dict(L).drop(["Cumsum", "Anticipation"], axis=1) \
                .groupby(["Id"]) \
                .agg({"Size": "sum", "Anticipation*Size": "sum"}) # Rows in the bucket are aggregated
            temp["Anticipation"] = temp["Anticipation*Size"] / temp["Size"]
            L_fin.append(temp) # And added to L_fin

    df = pd.concat(L_fin).drop(["Anticipation*Size"], axis=1).reset_index() 

    return df

scnerd · Accepted Answer · 2019-09-12 19:58:59Z

Normally, I really despise the fact that itertools.groupby only groups adjacent elements with the same key... in your case, though, this seems ideal. Forgive me for just rewriting the code rather than critiquing what you have, but using this grouping function completely changes how the overall task is best approached.

Let's use itertools.groupby (from the standard library, so it needs an `import itertools` first) to perform the grouping rather than pandas' DataFrame.groupby, specifically because of this adjacency behavior:

In [1]: grouped = itertools.groupby(df.itertuples(False), key=lambda x: x.Id)

In [2]: {k: list(v) for k, v in grouped}
Out[2]:
{'foo': [Pandas(Anticipation=10, Id='foo', Size=10),
  Pandas(Anticipation=9, Id='foo', Size=11),
  Pandas(Anticipation=8, Id='foo', Size=30)],
 'bar': [Pandas(Anticipation=10, Id='bar', Size=10),
  Pandas(Anticipation=9, Id='bar', Size=9),
  Pandas(Anticipation=8, Id='bar', Size=10)],
 'baz': [Pandas(Anticipation=10, Id='baz', Size=7)]}

Note that non-adjacent rows won't be aggregated together:

In [3]: [list(v) for _, v in itertools.groupby([1, 1, 2, 1], key=lambda x: x)]
Out[3]: [[1, 1], [2], [1]]
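
To illustrate the flip side with the standard library alone: if the input is sorted by key first, itertools.groupby yields exactly one group per key, just as a key-based groupby would. This is a small sketch with made-up values:

```python
import itertools

values = [1, 1, 2, 1]

# Unsorted input: each run of equal adjacent values is its own group
runs = [list(g) for _, g in itertools.groupby(values)]

# Sorted input: equal values are adjacent, so there is one group per key
per_key = [list(g) for _, g in itertools.groupby(sorted(values))]

print(runs)
print(per_key)
```

So whether the adjacency behavior matters depends on whether the rows are already ordered by "Id" before grouping.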

So then we just need a custom aggregation function. Let's use a generator function that can produce multiple outputs per input, so that we can aggregate each group and output one or more rows as appropriate. It'll operate on a single adjacent group of rows and handle the funkiness of your grouping logic:

def funky_aggregate(k, vs, max_size=10):
    cur_size = 0
    cur_ant = 0
    for v in vs:
        cur_size += v.Size
        cur_ant += v.Anticipation * v.Size
        if cur_size > max_size:
            yield {'Id': k, 'Size': cur_size, 'Anticipation': cur_ant / cur_size}
            cur_size = cur_ant = 0
    if cur_size != 0:
        yield {'Id': k, 'Size': cur_size, 'Anticipation': cur_ant / cur_size}
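
To see how the generator splits a group, here is a self-contained sketch that runs it on the "bar" rows from the question. `Row` is a hypothetical stand-in for the named tuples that `df.itertuples(False)` produces, so no dataframe is needed; the function body is the one above, repeated so the snippet runs on its own:

```python
from collections import namedtuple

# Hypothetical stand-in for the rows yielded by df.itertuples(False)
Row = namedtuple("Row", ["Anticipation", "Id", "Size"])

def funky_aggregate(k, vs, max_size=10):
    cur_size = 0
    cur_ant = 0
    for v in vs:
        cur_size += v.Size
        cur_ant += v.Anticipation * v.Size
        # Close the bucket once its total size exceeds max_size
        if cur_size > max_size:
            yield {'Id': k, 'Size': cur_size, 'Anticipation': cur_ant / cur_size}
            cur_size = cur_ant = 0
    # Flush whatever is left at the end of the group
    if cur_size != 0:
        yield {'Id': k, 'Size': cur_size, 'Anticipation': cur_ant / cur_size}

bar_rows = [Row(10, "bar", 10), Row(9, "bar", 9), Row(8, "bar", 10)]
result = list(funky_aggregate("bar", bar_rows))

for row in result:
    print(row)
```

The first row (Size 10) does not exceed max_size on its own, so the second row joins it and the bucket closes at Size 19. The third row is left over and is emitted by the final flush.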

This can then be easily joined back together into a dataframe:

In [4]: pd.DataFrame([row
            for key, group in itertools.groupby(df.itertuples(False), key=lambda x: x.Id)
            for row in funky_aggregate(key, group)
        ])
Out[4]:
   Anticipation   Id  Size
0      9.476190  foo    21
1      8.000000  foo    30
2      9.526316  bar    19
3      8.000000  bar    10
4     10.000000  baz     7

The original post says 'Rows are grouped by "Id"'. I interpret this as meaning that all rows with the same ID are already adjacent, so itertools.groupby and pandas' groupby behave the same.
– GZ0, Commented Sep 13, 2019 at 5:57
@GZ0, I'm pretty sure that "One row can only be aggregated with rows next to it, with the same 'Id'" is specifically meant to indicate that this grouping is a local thing. Of course, it doesn't matter if the dataframe is sorted by ID then anticipation, but this sentence made me think that the locality was important to the OP. Can the OP clarify what kind of grouping is desired?
– scnerd, Commented Sep 13, 2019 at 14:20
I interpret that second sentence as no more than "aggregation cannot go beyond group boundaries (indicated by ID, which is consistent with the first sentence)".
– GZ0, Commented Sep 13, 2019 at 14:24
@scnerd Sorry for the delay! First, thank you for the time you have taken to answer my question. The dataframe is already sorted by "Id"/"Anticipation", so you won't run into the problem you describe in your answer (I'll still keep it in mind in case I have to deal with such an issue one day). What I meant by "merge only with rows next to it" is that if you have, for the same "Id", three rows with anticipation 10, 9, and 8, you can't merge 10 and 8 and put 9 in another cluster.
– Doe Jowns, Commented Sep 16, 2019 at 12:26
