I would like to aggregate rows of my dataframe together, according to the following rules:
- Rows are grouped by "Id" and sorted by "Anticipation" (descending)
- A row can only be aggregated with adjacent rows that have the same "Id"
- The "Anticipation" of the new row is the weighted mean of the "Anticipation" of the aggregated rows, weighted by "Size"
- A row is added to the current group of rows to be aggregated if the sum of "Size" in that group, before adding the new row, is less than or equal to "max_size"
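To make the last rule concrete, here is a minimal sketch of it on a plain list of sizes for a single "Id". It only reflects how I read the rule; `bucket_labels` is a hypothetical helper name, not a pandas function.

```python
def bucket_labels(sizes, max_size):
    """Return one bucket number per size, in input order.

    A row joins the current bucket while the bucket's total *before*
    adding the row is <= max_size; otherwise it opens a new bucket.
    """
    labels = []
    label = 0   # number of the bucket currently being filled
    total = 0   # sum of the sizes already in that bucket
    for size in sizes:
        if total > max_size:
            # The current bucket is already over the limit: start a new one.
            label += 1
            total = 0
        labels.append(label)
        total += size
    return labels


foo_labels = bucket_labels([10, 11, 30], max_size=10)
bar_labels = bucket_labels([10, 9, 10], max_size=10)
```

With the "foo" sizes, the first two rows share a bucket (10 is not above the limit when the second row arrives) and the third row opens a new one.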
In other words, with this dataframe as input:
   Anticipation   Id  Size
0            10  foo    10
1             9  foo    11
2             8  foo    30
3            10  bar    10
4             9  bar     9
5             8  bar    10
6            10  baz     7
and max_size = 10, the function should return this:
    Id  Size  Anticipation
0  bar    19      9.526316
1  bar    10      8.000000
2  baz     7     10.000000
3  foo    21      9.476190
4  foo    30      8.000000
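As a sanity check on the expected values, the weighted means of the two merged rows can be recomputed by hand. This is a small sketch in plain Python; `weighted_mean` is a hypothetical helper used only for this check.

```python
def weighted_mean(values, weights):
    """Mean of `values`, each weighted by the matching entry of `weights`."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


# First two "foo" rows: Anticipation 10 and 9, Size 10 and 11.
foo_mean = weighted_mean([10, 9], [10, 11])  # (10*10 + 9*11) / 21
# First two "bar" rows: Anticipation 10 and 9, Size 10 and 9.
bar_mean = weighted_mean([10, 9], [10, 9])   # (10*10 + 9*9) / 19

print(f"{foo_mean:.6f} {bar_mean:.6f}")
```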
I would like to improve the performance of my function, perhaps through more idiomatic pandas code.
Here is my current code:
import pandas as pd
records = [{"Id": "foo", "Size": 10, "Anticipation": 10},
           {"Id": "foo", "Size": 11, "Anticipation": 9},
           {"Id": "foo", "Size": 30, "Anticipation": 8},
           {"Id": "bar", "Size": 10, "Anticipation": 10},
           {"Id": "bar", "Size": 9, "Anticipation": 9},
           {"Id": "bar", "Size": 10, "Anticipation": 8},
           {"Id": "baz", "Size": 7, "Anticipation": 10}]
df = pd.DataFrame(records)
max_size = 10
def assembly_lines(df, max_size):
    # sort_values returns a new DataFrame, so the result has to be assigned
    df = df.sort_values("Anticipation", ascending=False)
    df["Cumsum"] = df.groupby("Id")["Size"].cumsum()
    df["Anticipation*Size"] = df["Anticipation"] * df["Size"]
    L_fin = []  # L_fin stores the aggregated rows
    for name, group in df.groupby("Id"):  # Group "foo", "bar", "baz" together
        i = 0  # Cumulative size at which the previous bucket was closed
        L = []  # L temporarily stores rows before aggregation
        for index, row in group.iterrows():
            L.append(row.to_dict())  # Store the row in L
            if row["Cumsum"] > i + max_size:  # If the cumulative size of the rows in L exceeds the maximum allowed size
                i = row["Cumsum"]
                temp = pd.DataFrame(L).drop(["Cumsum", "Anticipation"], axis=1) \
                    .groupby("Id") \
                    .agg({"Size": "sum", "Anticipation*Size": "sum"})  # Then the rows are aggregated
                temp["Anticipation"] = temp["Anticipation*Size"] / temp["Size"]  # Mean anticipation is calculated
                L_fin.append(temp)  # The aggregated row is added to the list of aggregated rows
                L = []  # The bucket is emptied
        if L:  # If rows remain in the bucket after the loop
            temp = pd.DataFrame(L).drop(["Cumsum", "Anticipation"], axis=1) \
                .groupby("Id") \
                .agg({"Size": "sum", "Anticipation*Size": "sum"})  # The rows in the bucket are aggregated
            temp["Anticipation"] = temp["Anticipation*Size"] / temp["Size"]
            L_fin.append(temp)  # And added to L_fin
    df = pd.concat(L_fin).drop(["Anticipation*Size"], axis=1).reset_index()
    return df
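For comparison, here is a rough sketch of the direction I have in mind: label each row with a bucket number first, then aggregate everything in a single groupby instead of building one small DataFrame per bucket. The names `bucket_ids` and `assembly_lines_sketch` and the helper columns "Bucket" and "Weighted" are hypothetical, and the labelling step is still a plain Python loop because the running total resets each time a bucket closes. I have not measured whether this is actually faster.

```python
import pandas as pd


def bucket_ids(sizes, max_size):
    """Label the rows of one Id: a new bucket opens once the running total exceeds max_size."""
    labels, label, total = [], 0, 0
    for size in sizes:
        if total > max_size:
            label, total = label + 1, 0
        labels.append(label)
        total += size
    return pd.Series(labels, index=sizes.index)


def assembly_lines_sketch(df, max_size):
    # Sort by Id, then by descending Anticipation inside each Id.
    out = df.sort_values(["Id", "Anticipation"], ascending=[True, False]).copy()
    # One bucket label per row, computed independently for each Id.
    out["Bucket"] = out.groupby("Id")["Size"].transform(
        lambda sizes: bucket_ids(sizes, max_size)
    )
    out["Weighted"] = out["Anticipation"] * out["Size"]
    # A single aggregation over (Id, Bucket) replaces the per-bucket DataFrames.
    agg = out.groupby(["Id", "Bucket"], as_index=False).agg(
        Size=("Size", "sum"), Weighted=("Weighted", "sum")
    )
    agg["Anticipation"] = agg["Weighted"] / agg["Size"]
    return agg[["Id", "Size", "Anticipation"]]


df = pd.DataFrame({
    "Id": ["foo", "foo", "foo", "bar", "bar", "bar", "baz"],
    "Size": [10, 11, 30, 10, 9, 10, 7],
    "Anticipation": [10, 9, 8, 10, 9, 8, 10],
})
result = assembly_lines_sketch(df, max_size=10)
print(result)
```

On the sample data this should give the same five rows as the expected output shown earlier in the question.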