I have a Pandas Dataframe that I derive from a process like this:
import numpy as np
import pandas as pd
df1 = pd.DataFrame({'c1':['A','B','C','D','E'],'c2':[1,2,3,4,5]})
df2 = pd.DataFrame({'c1':['A','B','C'],'c2':[1,2,3],'c3': [np.array((1,2,3,4,5,6)),np.array((6,7,8,9,10,11)),np.full((6,),np.nan)]})
df3 = df1.merge(df2,how='left',on=['c1','c2'])
This looks like this:
c1 | c2 | c3 |
---|---|---|
A | 1 | [1,2,3,4,5,6] |
B | 2 | [6,7,8,9,10,11] |
C | 3 | [nan,nan,nan,nan,nan,nan] |
D | 4 | NaN |
E | 5 | NaN |
In order to run the next step of my code, I need all of the arrays in c3 to have a consistent length. For the inputs that were present in the join (i.e. rows 1 through 3), this was already taken care of. However, for the rows that were missing from df2, where I now have only a single NaN value (rows 4 and 5), I need to replace those NaNs with an array of NaN values like in row 3. The problem is that I can't figure out how to do that.
I've tried a number of things, starting with the obvious:
df3.loc[pd.isnull(df3.c3),'c3'] = np.full((6,),np.nan)
Which gave me a
ValueError: Must have equal len keys and value when setting with an iterable
Fair enough; I understand this error and why Python is confused about what I'm trying to do. How about this?
for i in df3.index:
    df3.at[i,'c3'] = np.full((6,),np.nan) if all(pd.isnull(df3.c3)) else df3.c3
That code runs without error but then when I go to print out df3 (or use it) I get this error:
RecursionError: maximum recursion depth exceeded
That one I don't understand, but moving on, what if I preassign a column full of my NaN arrays and then I can do some logic after the join:
for i in df1.index: df1.at[i,'c4'] = np.full((6,),np.nan)
This gives me the understandable error:
ValueError: setting an array element with a sequence
How about another variation of the same idea:
df1['c4'] = np.full((6,),np.nan)
This one gives a different, also understandable error:
ValueError: Length of values (6) does not match length of index (5)
Hence, the question: How do I replace values in my dataframe (in this case null values) with a NaN-filled numpy array of a given length?
For clarity, the desired final result is this:
c1 | c2 | c3 |
---|---|---|
A | 1 | [1,2,3,4,5,6] |
B | 2 | [6,7,8,9,10,11] |
C | 3 | [nan,nan,nan,nan,nan,nan] |
D | 4 | [nan,nan,nan,nan,nan,nan] |
E | 5 | [nan,nan,nan,nan,nan,nan] |
In the loop, use df3.loc[i].c3 or df3.at[i,'c3'] instead of df3.c3, because df3.c3 gives all values in the column but you need only the value from the current row.
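A minimal sketch of the corrected loop, reusing the df1/df2/df3 setup from the question. It assumes that any cell in c3 that is not already a numpy array is a missing value from the left join; the variable name value and that isinstance check are illustrative choices, not part of the original code.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'c1': ['A', 'B', 'C', 'D', 'E'], 'c2': [1, 2, 3, 4, 5]})
df2 = pd.DataFrame({
    'c1': ['A', 'B', 'C'],
    'c2': [1, 2, 3],
    'c3': [np.array((1, 2, 3, 4, 5, 6)),
           np.array((6, 7, 8, 9, 10, 11)),
           np.full((6,), np.nan)],
})
df3 = df1.merge(df2, how='left', on=['c1', 'c2'])

for i in df3.index:
    # Read only the current row's cell, not the whole column.
    value = df3.at[i, 'c3']
    # Assumption: anything that is not already an array is a missing value.
    if not isinstance(value, np.ndarray):
        # c3 has object dtype after the merge, so a cell can hold an array.
        df3.at[i, 'c3'] = np.full((6,), np.nan)
```

The attempt in the question tested all(pd.isnull(df3.c3)) against the whole column and then stored the entire df3.c3 Series in every cell, which is most likely what triggers the RecursionError when df3 is printed.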