I am working with a dataframe of over 21M rows.
df.head()
+---+-----------+-------+--------------+-----------+----------------+------------+
| | id | speed | acceleration | jerk | bearing_change | travelmode |
+---+-----------+-------+--------------+-----------+----------------+------------+
| 0 | 533815001 | 17.63 | 0.000000 | -0.000714 | 209.028008 | 3 |
| 1 | 533815001 | 17.63 | -0.092872 | 0.007090 | 56.116237 | 3 |
| 2 | 533815001 | 0.17 | 1.240000 | -2.040000 | 108.494680 | 3 |
| 3 | 533815001 | 1.41 | -0.800000 | 0.510000 | 11.847480 | 3 |
| 4 | 533815001 | 0.61 | -0.290000 | 0.150000 | 36.7455703 | 3 |
+---+-----------+-------+--------------+-----------+----------------+------------+
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21713545 entries, 0 to 21713544
Data columns (total 6 columns):
 #   Column          Dtype
---  ------          -----
 0   id              int64
 1   speed           float64
 2   acceleration    float64
 3   jerk            float64
 4   bearing_change  float64
 5   travelmode      int64
dtypes: float64(4), int64(2)
memory usage: 994.0 MB
I would like to convert this dataframe to a multi-dimensional array, so I wrote this function to do it:
import numpy as np


def transform(dataframe, chunk_size=5):
    grouped = dataframe.groupby('id')
    # initialize accumulators
    X, y = np.zeros([0, 1, chunk_size, 4]), np.zeros([0, ])
    # loop over each group (df[df.id==1] and df[df.id==2])
    for _, group in grouped:
        inputs = group.loc[:, 'speed':'bearing_change'].values
        label = group.loc[:, 'travelmode'].values[0]
        # calculate number of splits
        N = (len(inputs) - 1) // chunk_size
        if N > 0:
            inputs = np.array_split(
                inputs, [chunk_size + (chunk_size*i) for i in range(N)])
        else:
            inputs = [inputs]
        # loop over splits
        for inpt in inputs:
            inpt = np.pad(
                inpt, [(0, chunk_size-len(inpt)), (0, 0)],
                mode='constant')
            # add each inputs split to accumulators
            X = np.concatenate([X, inpt[np.newaxis, np.newaxis]], axis=0)
            y = np.concatenate([y, label[np.newaxis]], axis=0)
    return X, y
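To illustrate what the chunking and padding are meant to do, here is a minimal toy sketch on a made-up 12×4 group with chunk_size=5 (the values and sizes are purely illustrative, not taken from the real data):

import numpy as np

# Hypothetical group: 12 rows x 4 feature columns, chunk_size=5.
inputs = np.arange(12 * 4, dtype=float).reshape(12, 4)
chunk_size = 5

# Same split-point logic as in transform(): (12 - 1) // 5 == 2 split points.
N = (len(inputs) - 1) // chunk_size
splits = np.array_split(inputs, [chunk_size + chunk_size*i for i in range(N)])

# Pad the trailing partial chunk with zeros up to chunk_size rows.
padded = [np.pad(s, [(0, chunk_size - len(s)), (0, 0)], mode='constant')
          for s in splits]

print([s.shape for s in splits])   # [(5, 4), (5, 4), (2, 4)]
print([p.shape for p in padded])   # [(5, 4), (5, 4), (5, 4)]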
When I attempt to convert the above df, the operation takes nearly 2 hours, so I cancelled it:
Input, Label = transform(df, 100)
Is there a way I can optimize the code to speed it up? I am using Ubuntu 18.04 with 16 GB RAM, running a Jupyter notebook.
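One way I could gauge the runtime before launching the full frame is to time progressively larger slices in the notebook (a rough sketch using IPython's %time magic; the slice sizes are arbitrary):

# Jupyter cell: time transform() on progressively larger slices
# (slice sizes are arbitrary, just to see how the runtime grows).
%time _ = transform(df[:10_000], 100)
%time _ = transform(df[:100_000], 100)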
EDIT
Expected output (using a subset of df):
Input, Label = transform(df[:300], 100)
Input.shape
(3, 1, 100, 4)
Input
array([[[[ 1.76300000e+01, 0.00000000e+00, -7.14402619e-04,
2.09028008e+02],
[ 1.76300000e+01, -9.28723404e-02, 7.08974649e-03,
5.61162369e+01],
[ 1.70000000e-01, 1.24000000e+00, -2.04000000e+00,
1.08494680e+02],
...,
[ 1.31000000e+00, -5.90000000e-01, 1.16000000e+00,
3.72171697e+01],
[ 7.20000000e-01, 5.70000000e-01, -1.28000000e+00,
4.38722198e+01],
[ 1.29000000e+00, -7.10000000e-01, 1.30000000e-01,
5.55044975e+01]]],
[[[ 5.80000000e-01, -5.80000000e-01, 7.60000000e-01,
6.89803288e+01],
[ 0.00000000e+00, 1.80000000e-01, 2.20000000e-01,
1.31199034e+02],
[ 1.80000000e-01, 4.00000000e-01, -4.80000000e-01,
1.09246728e+02],
...,
[ 5.80000000e-01, 1.70000000e-01, -1.30000000e-01,
2.50337736e+02],
[ 7.50000000e-01, 4.00000000e-02, 2.40000000e-01,
1.94073476e+02],
[ 7.90000000e-01, 2.80000000e-01, -8.10000000e-01,
1.94731287e+02]]],
[[[ 1.07000000e+00, -5.30000000e-01, 6.30000000e-01,
2.02516564e+02],
[ 5.40000000e-01, 1.00000000e-01, 2.80000000e-01,
3.74852074e+01],
[ 6.40000000e-01, 3.80000000e-01, -7.70000000e-01,
2.56066654e+02],
...,
[ 3.90000000e-01, 1.14000000e+00, -7.20000000e-01,
5.72686112e+01],
[ 1.53000000e+00, 4.20000000e-01, -4.30000000e-01,
1.62305984e+01],
[ 1.95000000e+00, -1.00000000e-02, 1.10000000e-01,
2.43819280e+01]]]])