I am attempting to build a function that subsets data across two combinations of dimensions, groups on a status label, and sums on price, producing a single-row dataframe whose columns are the summed prices for each combination of subsets.
Edit to clarify: what I'm looking for is to subset on two different dimensions, a time delta and an association label. I'm then looking to group on a separate status label (which is different from the association label) and sum price within each group.
Combinations of subsets:
- The association labels are in the "Association Label" column; the three of interest are ["SDAR", "NSDCAR", "PSAR"]. There are others in the column/data, but they can be ignored.
- The time intervals are [7, 30, 60, 90, 120, None] and apply to the "Status Date" column.
What's being grouped and summed per those combinations of subsets:
- The "Status Labelled" values are transaction statuses, which are grouped on for each combination of the time-delta and association-label subsets above. They include ["Active", "Pending", "Sold", "Withdrawn", "Contingent", "Unknown"] (not an exhaustive list, just an example).
- Finally, 'List Price (H)' is summed per status label, for each combination of the first two subsets.
So example columns of desired output would be something like PSAR_7_Contingent_price
or SDAR_60_Withdrawn_price
This builds off of this question and answer, which worked fantastically for value counts, but I'm having difficulty modifying it for summing on a price variable. The code I built off of is:
def crossubsets(df):
    labels = ["SDAR", "NSDCAR", "PSAR"]
    time_intervals = [7, 30, 60, 90, 120, None]
    group_dfs = df.loc[
        df["Association Label"].isin(labels)
    ].groupby("Association Label")
    data = []
    for l, g in group_dfs:
        for ti in time_intervals:
            # keep rows from the last `ti` days; None means no time filter
            s = (
                g[g["Status Date"] > (pd.Timestamp.now() - pd.Timedelta(ti, "d"))]
                if ti is not None else g
            )
            data.append(s["Status Labelled"].value_counts().rename(f"counts_{l}_{ti}"))
    return pd.concat(data, axis=1)  # with optional .T to have 18 rows instead of cols
# additional code to flatten the output to a (1, 180) dataframe
counts_processed = counts_processed.unstack().to_frame().sort_index(level=1).T
counts_processed.columns = counts_processed.columns.map('_'.join)
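For context, the flatten step behaves like this on a toy frame (the column names and values here are invented, just to show the shape change):

```python
import pandas as pd

# A tiny stand-in for the counts output: statuses as rows, combinations as columns
counts = pd.DataFrame(
    {"counts_PSAR_7": [1, 2], "counts_SDAR_7": [3, 4]},
    index=["Active", "Sold"],
)

# unstack() stacks everything into a MultiIndex Series (combination, status),
# then to_frame().T turns it into a single-row dataframe
flat = counts.unstack().to_frame().sort_index(level=1).T
flat.columns = flat.columns.map("_".join)
```

Here `flat` ends up as one row with four columns named like `counts_PSAR_7_Active`.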
This worked great for the value_counts per Status Labelled, but now I'm looking to sum the associated price per status label, across those same subset dimensions. I naively attempted to modify the above function with:
def crossubsetsprice(df):
    labels = ["SDAR", "NSDCAR", "PSAR"]
    time_intervals = [7, 30, 60, 90, 120, None]
    group_dfs = df.loc[
        df["Association Label"].isin(labels)
    ].groupby("Association Label")
    data = []
    for l, g in group_dfs:
        for ti in time_intervals:
            s = (
                g[g["Status Date"] > (pd.Timestamp.now() - pd.Timedelta(ti, "d"))]
                if ti is not None else g
            )
            # note: .sum() on a Series yields a scalar here, not a Series
            data.append(s['List Price (H)'].sum().rename(f"price_{l}_{ti}"))
    return pd.concat(data, axis=1)  # with optional .T to have 18 rows instead of cols
But that throws an error:
AttributeError: 'numpy.float64' object has no attribute 'rename'
and I don't think it makes much sense or would get the desired output anyway.
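For what it's worth, my understanding of the error is that summing the whole price Series collapses it to a single float, which has no .rename(). Grouping by the status column before summing seems to give back a named Series instead; a minimal sketch on made-up data (column names as above, values invented):

```python
import pandas as pd

# Hypothetical minimal frame standing in for the real data
df = pd.DataFrame({
    "Association Label": ["PSAR", "PSAR", "SDAR", "PSAR"],
    "Status Labelled": ["Sold", "Sold", "Active", "Pending"],
    "List Price (H)": [100.0, 250.0, 400.0, 300.0],
    "Status Date": pd.to_datetime(["2024-01-05"] * 4),
})

def crossubsetsprice(df):
    labels = ["SDAR", "NSDCAR", "PSAR"]
    time_intervals = [7, 30, 60, 90, 120, None]
    group_dfs = df.loc[
        df["Association Label"].isin(labels)
    ].groupby("Association Label")
    data = []
    for l, g in group_dfs:
        for ti in time_intervals:
            s = (
                g[g["Status Date"] > (pd.Timestamp.now() - pd.Timedelta(ti, "d"))]
                if ti is not None else g
            )
            # group by status FIRST, then sum: the result is a Series
            # (one total per status) that .rename() can label
            data.append(
                s.groupby("Status Labelled")["List Price (H)"].sum()
                .rename(f"price_{l}_{ti}")
            )
    return pd.concat(data, axis=1)

out = crossubsetsprice(df)
```

Each appended Series is indexed by status, so pd.concat lines them up the same way the value_counts version did.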
The alternative I want to avoid, but know would work, is creating 18 distinct functions, one per combination of subsets, and concatenating their output. An example would be:
def price_PSAR_90(df):
    subset_90 = df[df['Status Date'] > (datetime.now() - pd.to_timedelta("90day"))]
    subset_90_PSAR = subset_90[subset_90['Association Label'] == "PSAR"]
    grouped_90_PSAR = subset_90_PSAR.groupby(['Status Labelled'])
    price_summed_90_PSAR = pd.DataFrame(grouped_90_PSAR['List Price (H)'].sum())
    price_summed_90_PSAR.columns = ['Price']
    price_summed_90_PSAR = price_summed_90_PSAR.reset_index()
    price_summed_90_PSAR = price_summed_90_PSAR.T
    price_summed_90_PSAR = price_summed_90_PSAR.reset_index()
    price_summed_90_PSAR.drop(price_summed_90_PSAR.columns[[0]], axis=1, inplace=True)
    price_summed_90_PSAR_header = price_summed_90_PSAR.iloc[0]  # grab the first row for the header
    price_summed_90_PSAR = price_summed_90_PSAR[1:]  # take the data less the header row
    price_summed_90_PSAR.columns = price_summed_90_PSAR_header
    return price_summed_90_PSAR
The last code snippet works, but without looping it would need to be repeated with the time delta and association label changed for each combination, and then the output columns relabelled and concatenated together, which I want to avoid if possible.
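On the relabelling point: assuming the summed output comes back as a statuses-by-combination frame (like the counts case), producing the desired single-row names like PSAR_7_Contingent_price might look like this on a toy frame (names and values invented):

```python
import pandas as pd

# Stand-in for the summed output: statuses as rows, combinations as columns
price_df = pd.DataFrame(
    {"price_PSAR_7": [100.0, 350.0], "price_SDAR_60": [400.0, 0.0]},
    index=pd.Index(["Contingent", "Withdrawn"], name="Status Labelled"),
)

# Flatten to one row; columns become (combination, status) tuples
flat = price_df.unstack().to_frame().T

# Reorder each pair ("price_PSAR_7", "Contingent") into "PSAR_7_Contingent_price"
flat.columns = [
    f"{col.split('_', 1)[1]}_{status}_price" for col, status in flat.columns
]
```

The split('_', 1) just drops the leading "price" prefix so it can be moved to the end of the name.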