
I am attempting to build a function that subsets data across two combinations of dimensions, groups on a status label, and sums on price, producing a single-row dataframe whose columns are the summed prices for each combination of subsets.

Edit to clarify: what I'm looking for is to subset on two different dimensions: a time delta and an association label.

I'm then looking to group on a status label (which is different from the association label) and sum the price within each group.

Combinations of subsets:

  • the association labels are in the "Association Label" column; the three of interest are ["SDAR", "NSDCAR", "PSAR"]. There are others in the column/data, but they can be ignored
  • the time intervals are [7, 30, 60, 90, 120, None] and apply to the "Status Date" column

What's being grouped and summed as per those combination of subsets:

  • The Status Labelled values are transaction statuses, which are to be grouped on per each combination of the above subsets of time deltas and association labels. They include ["Active", "Pending", "Sold", "Withdrawn", "Contingent", "Unknown"] (this is not an exhaustive list, just an example)
  • And finally, ['List Price (H)'] is to be summed per each of those status labels and per each combination of the first two subsets.

So example column names in the desired output would be something like PSAR_7_Contingent_price or SDAR_60_Withdrawn_price.

This builds off of this question and answer, which worked fantastically for value counts, but I'm having difficulty modifying it for summing on a price variable.

The code I used to build off of is:

def crossubsets(df):
    labels = ["SDAR", "NSDCAR", "PSAR"]
    time_intervals = [7, 30, 60, 90, 120, None]
    group_dfs = df.loc[
        df["Association Label"].isin(labels)
    ].groupby("Association Label")

    data = []
    for l, g in group_dfs:
        for ti in time_intervals:
            s = (
                g[g["Status Date"] > (pd.Timestamp.now() - pd.Timedelta(ti, "d"))]
                if ti is not None else g
            )
            data.append(s["Status Labelled"].value_counts().rename(f"counts_{l}_{ti}"))

    return pd.concat(data, axis=1) #with optional .T to have 18 rows instead of cols

# additional code to flatten the output to a (1, 180) dataframe

counts_processed = counts_processed.unstack().to_frame().sort_index(level=1).T
counts_processed.columns = counts_processed.columns.map('_'.join)
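For reference, the unstack/transpose flattening can be traced on a tiny hypothetical frame (two statuses, two label/interval columns) shaped like the concat output:

```python
import pandas as pd

# Hypothetical miniature of the concat output: statuses as rows,
# one "counts_<label>_<interval>" column per subset combination.
counts = pd.DataFrame(
    {"counts_SDAR_7": [2, 5], "counts_SDAR_30": [4, 9]},
    index=pd.Index(["Active", "Sold"], name="Status Labelled"),
)

# unstack() turns the frame into a Series with a (column, status) MultiIndex;
# to_frame().T flips it into a single row; joining the index levels with "_"
# yields flat names like "counts_SDAR_7_Active".
flat = counts.unstack().to_frame().sort_index(level=1).T
flat.columns = flat.columns.map("_".join)
```

With `sort_index(level=1)` the columns come out grouped by status, e.g. `counts_SDAR_30_Active`, `counts_SDAR_7_Active`, then the `Sold` pair.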

This worked great for the value_counts per Status Labelled, but now I'm looking to sum the associated price for those status labels, across the same subset dimensions. I naively attempted to modify the above function:

def crossubsetsprice(df):
    labels = ["SDAR", "NSDCAR", "PSAR"]
    time_intervals = [7, 30, 60, 90, 120, None]
    group_dfs = df.loc[
        df["Association Label"].isin(labels)
    ].groupby("Association Label")

    data = []
    for l, g in group_dfs:
        for ti in time_intervals:
            s = (
                g[g["Status Date"] > (pd.Timestamp.now() - pd.Timedelta(ti, "d"))]
                if ti is not None else g
            )
            data.append(s['List Price (H)'].sum().rename(f"price_{l}_{ti}"))

    return pd.concat(data, axis=1) #with optional .T to have 18 rows instead of cols

But that throws an error, AttributeError: 'numpy.float64' object has no attribute 'rename', and I don't think it makes much sense or would produce the desired output anyway.
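The error arises because `Series.sum()` collapses to a plain scalar (a `numpy.float64`), which has no `.rename`; only a Series does. Grouping by status before summing keeps the result a Series. A minimal sketch on hypothetical data (the `price_SDAR_7` name is illustrative):

```python
import pandas as pd

# Tiny illustrative subset, standing in for one (label, interval) slice.
g = pd.DataFrame({
    "Status Labelled": ["Active", "Active", "Sold"],
    "List Price (H)": [100.0, 200.0, 50.0],
})

total = g["List Price (H)"].sum()  # scalar: numpy.float64
# total.rename(...) would raise AttributeError, exactly as above.

# Grouping first keeps a Series (one sum per status), which supports .rename:
per_status = (
    g.groupby("Status Labelled")["List Price (H)"]
     .sum()
     .rename("price_SDAR_7")  # hypothetical label/interval tag
)
```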

The alternative, which I want to avoid but know would work, is creating 18 distinct functions, one for each combination of subsets, then concatenating the output. An example would be:

from datetime import datetime
import pandas as pd

def price_PSAR_90(df):
    # Subset on the time delta, then on the association label
    subset_90 = df[df['Status Date'] > (datetime.now() - pd.to_timedelta("90day"))]
    subset_90_PSAR = subset_90[subset_90['Association Label'] == "PSAR"]

    # Group on status and sum the price
    grouped_90_PSAR = subset_90_PSAR.groupby(['Status Labelled'])
    price_summed_90_PSAR = pd.DataFrame(grouped_90_PSAR['List Price (H)'].sum())
    price_summed_90_PSAR.columns = ['Price']

    # Flip to a single row with the statuses as column headers
    price_summed_90_PSAR = price_summed_90_PSAR.reset_index().T.reset_index()
    price_summed_90_PSAR.drop(price_summed_90_PSAR.columns[[0]], axis=1, inplace=True)
    header = price_summed_90_PSAR.iloc[0]  # grab the first row for the header
    price_summed_90_PSAR = price_summed_90_PSAR[1:]  # take the data less the header row
    price_summed_90_PSAR.columns = header

    return price_summed_90_PSAR

The last code snippet works, but without looping it would need to be repeated with the time delta and association label changed for each combination, then the output columns relabelled and concatenated together, which I want to avoid if possible.

1 Answer


Maybe you can try using a dict for data instead of a list. Something like:

def crossubsetsprice(df):
    labels = ["SDAR", "NSDCAR", "PSAR"]
    time_intervals = [7, 30, 60, 90, 120, None]
    group_dfs = df.loc[
        df["Association Label"].isin(labels)
    ].groupby(["Association Label", 'Status Labelled'])

    data = {}  # HERE
    for (l1, l2), g in group_dfs:
        for ti in time_intervals:
            s = (
                g[g["Status Date"] > (pd.Timestamp.now() - pd.Timedelta(ti, "d"))]
                if ti is not None else g
            )
            data[(l1, l2, ti)] = s['List Price (H)'].sum()  # HERE

    names = ['Association Label', 'Status Labelled', 'Time Interval']
    return pd.Series(data, name='Price').rename_axis(names)  # HERE

Output:

>>> crossubsetsprice(df)
Association Label  Status Labelled  Time Interval
NSDCAR             Active           7.0               1393
                                    30.0              6090
                                    60.0             11397
                                    90.0             16540
                                    120.0            21660
                                                     ...  
SDAR               Withdrawn        30.0              3167
                                    60.0              8897
                                    90.0             15768
                                    120.0            21806
                                    NaN              28379
Name: Price, Length: 108, dtype: int64
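From that long Series, the single-row wide frame the question describes (columns like PSAR_7_Contingent_price) is one transpose and a column rename away. A possible sketch, using a hypothetical two-entry miniature of the output (the NaN interval, if present, would be rendered as "None"):

```python
import pandas as pd

# Hypothetical miniature of the answer's output Series.
idx = pd.MultiIndex.from_tuples(
    [("PSAR", "Contingent", 7.0), ("SDAR", "Withdrawn", 60.0)],
    names=["Association Label", "Status Labelled", "Time Interval"],
)
price = pd.Series([1234, 8897], index=idx, name="Price")

# Transpose to one row; the MultiIndex becomes the columns,
# then join each (label, status, interval) tuple into a flat name.
wide = price.to_frame().T
wide.columns = [
    f"{label}_{'None' if pd.isna(ti) else int(ti)}_{status}_price"
    for label, status, ti in wide.columns
]
```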

Minimal Reproducible Example:

import pandas as pd
import numpy as np

N = 1000
rng = np.random.default_rng(42)
labels = rng.choice(["SDAR", "NSDCAR", "PSAR"], N)
status = rng.choice(["Active", "Pending", "Sold", "Withdrawn", "Contingent", "Unknown"], N)
today = pd.Timestamp.today()
start = pd.Timestamp('2023-01-01 00:00:00')
offsets = rng.integers(0, int((today - start).total_seconds()), N)
dates = start + pd.to_timedelta(offsets, unit='s')
prices = rng.integers(1, 1001, N)
df = pd.DataFrame({'Association Label': labels,
                   'Status Date': dates,
                   'Status Labelled': status,
                   'List Price (H)': prices})
  • So just to clarify, you're missing a column Status Labelled which is the column that would actually be grouped on. That is different from the Association Label.
    – JLuu
    Commented Jun 4, 2023 at 15:14
  • I updated my answer. Can you check it please. I think you just have to add 'Status Labelled' as a key of groupby.
    – Corralien
    Commented Jun 4, 2023 at 16:29
  • 1
    This is magnificent.
    – JLuu
    Commented Jun 5, 2023 at 0:37
