7
$\begingroup$

I am currently working on a dataset that has two columns: customerID and date.

I want to find the minimum date for each customerID.

Initially, I used the following code:

dataframe['min_date'] = dataframe.groupby('customerID')['date'].min()

However, this returned null values.

Then, I used this code instead:

dataframe['min_date'] = dataframe.groupby('customerID')['date'].transform('min')

This returned the correct values.

I would like to understand the difference between these two operations.

$\endgroup$
1
  • $\begingroup$ dataframe.groupby('customerID')['date'].min() returns one value per customerID, not one per row. So how could those values be spread across every row and assigned to the new 'min_date' column? $\endgroup$ Commented Mar 27 at 15:19

1 Answer

8
$\begingroup$

df.groupby('customerID')['date'].transform('min') returns a Series with the same index (and therefore the same number of rows) as the original dataframe df: every row holds the minimum date of its customerID group.
So you can assign it directly to a new column of df.

df.groupby('customerID')['date'].min() also returns a Series, but with a different index and fewer rows: its index consists of the unique values of the 'customerID' column.
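
To make the difference in index concrete, here is a minimal sketch on a made-up five-row dataframe (the data is hypothetical; only the column names come from the question). It prints the index of each result so you can compare them with the row labels of df.

```python
import pandas as pd

# Hypothetical toy data; the column names follow the question.
df = pd.DataFrame({
    "customerID": ["a", "a", "b", "b", "b"],
    "date": pd.to_datetime(
        ["2024-03-05", "2024-01-10", "2024-02-20", "2024-02-01", "2024-04-15"]
    ),
})

# One value per original row, indexed like df.
per_row = df.groupby("customerID")["date"].transform("min")

# One value per group, indexed by the unique customerID values.
per_group = df.groupby("customerID")["date"].min()

print(per_row.index.tolist())    # the row labels of df
print(per_group.index.tolist())  # the unique customerID values
```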

For example, if df has 1000 rows and the column 'customerID' contains 200 distinct IDs, transform() returns a Series with 1000 rows and min() returns a Series with 200 rows. In the second case, you cannot successfully create a new df column from the result of min(): pandas aligns the assigned Series on index labels, and the customerID labels generally do not match the row labels of df, so the new column is filled with NaN.
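
The sketch below reproduces the NaN result on a small, made-up dataframe (hypothetical data; the column names min_direct, min_mapped and min_transform are only illustrative). It also shows one possible alternative, assuming you already have the per-group result and want to reuse it: look each row's customerID up in it with Series.map.

```python
import pandas as pd

# Hypothetical toy data; the column names follow the question.
df = pd.DataFrame({
    "customerID": ["a", "a", "b"],
    "date": pd.to_datetime(["2024-03-05", "2024-01-10", "2024-02-20"]),
})

per_group = df.groupby("customerID")["date"].min()

# Direct assignment aligns on index labels: df has 0, 1, 2 while
# per_group has "a", "b". No label matches, so every value is missing.
df["min_direct"] = per_group

# Looking up each row's customerID in per_group broadcasts the
# group minimum back onto the rows.
df["min_mapped"] = df["customerID"].map(per_group)

# transform('min') does the same broadcast in a single step.
df["min_transform"] = df.groupby("customerID")["date"].transform("min")

print(df)
```

With map you keep the aggregated Series around for other uses; transform is shorter when all you need is the new column.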

$\endgroup$
