Convert pandas column of numpy arrays to numpy array of higher dimension

Question

I have a pandas dataframe of shape (75,9).

Only one of those columns holds numpy arrays, each of shape (100, 4, 3).

I see a strange phenomenon:

    data = self.df[self.column_name].values[0]

is of shape (100, 4, 3), but

    data = self.df[self.column_name].values

is of shape (75,), and its min and max are reported as 'not a numeric object'.

I expected data = self.df[self.column_name].values to be of shape (75, 100, 4, 3), with some min and max.

How can I make a column of numpy arrays behave like a numpy array of a higher dimension (with length=number of rows in the dataframe)?

Reproducing:

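    import numpy as np
    import pandas as pd

    # Build a single-column frame whose cells each hold a (4, 6) array
    some_df = pd.DataFrame(columns=['A'])
    for i in range(10):
        some_df.loc[i] = [np.random.rand(4, 6)]
    print some_df['A'].values.shape
    print some_df['A'].values[0].shape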

This prints (10L,) and (4L, 6L) instead of the desired (10L, 4L, 6L) and (4L, 6L).

Hopefully not for long. I believe a solution would be the same for any python — Gulzar, Commented Jun 16, 2019 at 10:12
np.stack(....values) may create an array with the desired shape. It doesn't change the dataframe's own storage. — hpaulj, Commented Jun 16, 2019 at 10:21
@hpaulj That's it! I'll accept if you post it as an answer. I'm guessing it isn't the best performance-wise, but still works for me — Gulzar, Commented Jun 16, 2019 at 10:27

hpaulj · Accepted Answer · 2019-06-16 15:34:36Z

In [42]: some_df = pd.DataFrame(columns=['A'])
    ...: for i in range(4):
    ...:     some_df.loc[i] = [np.random.randint(0,10,(1,3))]
    ...:
In [43]: some_df
Out[43]: 
             A
0  [[7, 0, 9]]
1  [[3, 6, 8]]
2  [[9, 7, 6]]
3  [[1, 6, 3]]

The numpy values of the column are an object dtype array, containing arrays:

In [44]: some_df['A'].to_numpy()                                                          
Out[44]: 
array([array([[7, 0, 9]]), array([[3, 6, 8]]), array([[9, 7, 6]]),
       array([[1, 6, 3]])], dtype=object)
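As a self-contained sketch of the same observation (the seeded generator and the dict-based construction are my assumptions, not part of the session above), you can inspect the column directly. The outer array is one-dimensional with object dtype, and each element is an ordinary ndarray with its own shape:

```python
import numpy as np
import pandas as pd

# Assumption: a seeded generator, only so that the sketch is repeatable.
rng = np.random.default_rng(0)

# One column, four rows, each cell holding a (1, 3) integer array.
some_df = pd.DataFrame({"A": [rng.integers(0, 10, (1, 3)) for _ in range(4)]})

col = some_df["A"].to_numpy()
print(col.dtype)     # object: the outer array stores references to arrays
print(col.shape)     # (4,): one entry per row, no inner dimensions
print(type(col[0]))  # <class 'numpy.ndarray'>
print(col[0].shape)  # (1, 3)
```

NumPy only sees four Python objects here, which is why the column's shape stops at the number of rows.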

If those arrays all have the same shape, stack does a nice job of concatenating them on a new dimension:

In [45]: np.stack(some_df['A'].to_numpy())                                                
Out[45]: 
array([[[7, 0, 9]],

       [[3, 6, 8]],

       [[9, 7, 6]],

       [[1, 6, 3]]])
In [46]: _.shape                                                                          
Out[46]: (4, 1, 3)
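Applied to the shapes from the question, a minimal sketch might look like this (the seed and the dict-based construction are assumptions for the sake of a runnable example):

```python
import numpy as np
import pandas as pd

# Assumption: seeded generator, ten rows of (4, 6) arrays as in the question.
rng = np.random.default_rng(0)
some_df = pd.DataFrame({"A": [rng.random((4, 6)) for _ in range(10)]})

# Stack the ten (4, 6) arrays along a new leading axis.
stacked = np.stack(some_df["A"].to_numpy())
print(stacked.shape)                 # (10, 4, 6)
print(stacked.dtype)                 # float64
print(stacked.min(), stacked.max())  # ordinary numeric reductions now work

# The stacked result is a new array; the column itself is unchanged.
print(some_df["A"].to_numpy().shape)  # (10,)
```

Note that this builds a separate array. The dataframe keeps storing ten independent array objects.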

This only works with one column. stack, like all the concatenate functions, treats its input argument as an iterable, effectively a list of arrays.

In [48]: some_df['A'].to_list()                                                           
Out[48]: 
[array([[7, 0, 9]]),
 array([[3, 6, 8]]),
 array([[9, 7, 6]]),
 array([[1, 6, 3]])]
In [50]: np.stack(some_df['A'].to_list()).shape                                           
Out[50]: (4, 1, 3)
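Since everything depends on the arrays sharing one shape, it may be worth checking that up front. Here is a rough sketch. The helper name column_to_ndarray is hypothetical (it is not a pandas or numpy function), and the error message is my own wording:

```python
import numpy as np
import pandas as pd

def column_to_ndarray(series):
    """Hypothetical helper: stack a column of equal-shaped arrays into one ndarray."""
    arrays = series.to_list()
    shapes = {np.shape(a) for a in arrays}
    if len(shapes) != 1:
        # np.stack would also raise ValueError here; this message names the shapes.
        raise ValueError(f"cannot stack arrays with shapes {sorted(shapes)}")
    return np.stack(arrays)

good = pd.Series([np.zeros((1, 3)), np.ones((1, 3))])
print(column_to_ndarray(good).shape)  # (2, 1, 3)

# to_list() and to_numpy() give np.stack the same sequence of arrays.
same = np.array_equal(np.stack(good.to_list()), np.stack(good.to_numpy()))
print(same)  # True

ragged = pd.Series([np.zeros((1, 3)), np.ones((2, 3))])
try:
    column_to_ndarray(ragged)
except ValueError as exc:
    print("refused:", exc)
```

If the shapes differ there is no regular higher-dimensional array to build, so failing loudly seems preferable to silently getting an object array back.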

After over a year, we meet again. I remember this method giving me many headaches, and I wonder whether it is the wrong way to go. Is there a standard way of handling tabular data made up of long lists of multi-dimensional arrays [each with its own title, and the same shape]? — Gulzar, Commented Oct 26, 2020 at 16:01

John Zwinck · Accepted Answer · 2019-06-16 10:15:54Z


What you're asking for is not quite possible. Pandas DataFrames are 2D. Yes, you can store NumPy arrays as objects (references) inside DataFrame cells, but this is not really well supported, and expecting to get a shape which has one dimension from the DataFrame and two from the arrays inside is not possible at all.

You should consider storing your data either entirely in NumPy arrays of the appropriate shape, or in a single, properly 2D DataFrame with a MultiIndex. For example, you can "pivot" a column of 1D arrays to become a column of scalars if you move the extra dimension to a new level of a MultiIndex on the rows:

  A
x [2, 3]
y [5, 6]

becomes:

     A
x 0  2
  1  3
y 0  5
  1  6

or pivot to the columns:

   A
   0  1
x  2  3
y  5  6
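One way to do both pivots, as a sketch rather than the only approach: first break each array into separate columns, then either call DataFrame.stack() to move them into a row level, or relabel them with a two-level column index. The variable names (wide, long, by_cols) are just for illustration:

```python
import numpy as np
import pandas as pd

# The small example from above: one column of 1D arrays.
df = pd.DataFrame({"A": [np.array([2, 3]), np.array([5, 6])]}, index=["x", "y"])

# Step 1: break each 1D array into its own columns (labelled 0, 1, ...).
wide = pd.DataFrame(df["A"].to_list(), index=df.index)

# Pivot to the rows: stack() moves the column labels into a new inner index level.
long = wide.stack().to_frame("A")
print(long)

# Pivot to the columns: put the split columns under a top-level "A" label.
by_cols = wide.copy()
by_cols.columns = pd.MultiIndex.from_product([["A"], wide.columns])
print(by_cols)

# The scalar layout converts back to a plain 2D NumPy array when needed.
restored = long["A"].to_numpy().reshape(len(df), -1)
print(restored)
```

This covers 1D arrays per cell. For higher-dimensional arrays you would need one extra index level per extra dimension, which is where keeping the data in NumPy may be simpler.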


Now I have time to make this right. What is the code that pivots in each direction? – Gulzar, Commented Jun 23, 2019 at 12:33
DataFrame.stack(), after you break the lists into separate columns (see stackoverflow.com/questions/35491274/… for that). – John Zwinck, Commented Jun 23, 2019 at 14:37
