
I am working with a dataframe of over 21M rows.

df.head()
+---+-----------+-------+--------------+-----------+----------------+------------+
|   |    id     | speed | acceleration |   jerk    | bearing_change | travelmode |
+---+-----------+-------+--------------+-----------+----------------+------------+
| 0 | 533815001 | 17.63 | 0.000000     | -0.000714 | 209.028008     |          3 |
| 1 | 533815001 | 17.63 | -0.092872    | 0.007090  | 56.116237      |          3 |
| 2 | 533815001 | 0.17  | 1.240000     | -2.040000 | 108.494680     |          3 |
| 3 | 533815001 | 1.41  | -0.800000    | 0.510000  | 11.847480      |          3 |
| 4 | 533815001 | 0.61  | -0.290000    | 0.150000  | 36.7455703     |          3 |
+---+-----------+-------+--------------+-----------+----------------+------------+

df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21713545 entries, 0 to 21713544
Data columns (total 6 columns):
 #   Column          Dtype  
---  ------          -----  
 0   id              int64  
 1   speed           float64
 2   acceleration    float64
 3   jerk            float64
 4   bearing_change  float64
 5   travelmode      int64  
dtypes: float64(4), int64(2)
memory usage: 994.0 MB

I would like to convert this dataframe to a multi-dimensional array, so I wrote the following function to do it:

import numpy as np


def transform(dataframe, chunk_size=5):
    
    grouped = dataframe.groupby('id')

    # initialize accumulators
    X, y = np.zeros([0, 1, chunk_size, 4]), np.zeros([0,])

    # loop over the rows belonging to each id
    for _, group in grouped:

        inputs = group.loc[:, 'speed':'bearing_change'].values
        label = group.loc[:, 'travelmode'].values[0]

        # calculate number of splits
        N = (len(inputs)-1) // chunk_size

        if N > 0:
            inputs = np.array_split(
                 inputs, [chunk_size + (chunk_size*i) for i in range(N)])
        else:
            inputs = [inputs]

        # loop over splits
        for inpt in inputs:
            inpt = np.pad(
                inpt, [(0, chunk_size-len(inpt)),(0, 0)], 
                mode='constant')
            # add each inputs split to accumulators
            X = np.concatenate([X, inpt[np.newaxis, np.newaxis]], axis=0)
            y = np.concatenate([y, label[np.newaxis]], axis=0) 

    return X, y

When I attempted to convert the above df, the operation ran for nearly 2 hours, so I cancelled it:

Input, Label = transform(df, 100)

Is there a way I can optimize the code to speed it up? I am using Ubuntu 18.04 with 16 GB RAM, running a Jupyter notebook.

EDIT

Expected output (using a subset of df):

Input, Label = transform(df[:300], 100)
Input.shape
 (3, 1, 100, 4)

Input
array([[[[ 1.76300000e+01,  0.00000000e+00, -7.14402619e-04,
           2.09028008e+02],
         [ 1.76300000e+01, -9.28723404e-02,  7.08974649e-03,
           5.61162369e+01],
         [ 1.70000000e-01,  1.24000000e+00, -2.04000000e+00,
           1.08494680e+02],
         ...,
         [ 1.31000000e+00, -5.90000000e-01,  1.16000000e+00,
           3.72171697e+01],
         [ 7.20000000e-01,  5.70000000e-01, -1.28000000e+00,
           4.38722198e+01],
         [ 1.29000000e+00, -7.10000000e-01,  1.30000000e-01,
           5.55044975e+01]]],


       [[[ 5.80000000e-01, -5.80000000e-01,  7.60000000e-01,
           6.89803288e+01],
         [ 0.00000000e+00,  1.80000000e-01,  2.20000000e-01,
           1.31199034e+02],
         [ 1.80000000e-01,  4.00000000e-01, -4.80000000e-01,
           1.09246728e+02],
         ...,
         [ 5.80000000e-01,  1.70000000e-01, -1.30000000e-01,
           2.50337736e+02],
         [ 7.50000000e-01,  4.00000000e-02,  2.40000000e-01,
           1.94073476e+02],
         [ 7.90000000e-01,  2.80000000e-01, -8.10000000e-01,
           1.94731287e+02]]],


       [[[ 1.07000000e+00, -5.30000000e-01,  6.30000000e-01,
           2.02516564e+02],
         [ 5.40000000e-01,  1.00000000e-01,  2.80000000e-01,
           3.74852074e+01],
         [ 6.40000000e-01,  3.80000000e-01, -7.70000000e-01,
           2.56066654e+02],
         ...,
         [ 3.90000000e-01,  1.14000000e+00, -7.20000000e-01,
           5.72686112e+01],
         [ 1.53000000e+00,  4.20000000e-01, -4.30000000e-01,
           1.62305984e+01],
         [ 1.95000000e+00, -1.00000000e-02,  1.10000000e-01,
           2.43819280e+01]]]])
Comments:

  • Welcome to Code Review! To increase the odds of more detailed answers, you could add the performance tag to your question. Commented Aug 13, 2020 at 13:51
  • Welcome to Code Review! Just so we don't have to parse the code to understand the desired output, can you include a sample of your desired output? Commented Aug 13, 2020 at 15:15
  • @Dannnno I added the expected output, using a subset of the dataframe (df[:300]), to the question edit.
    – arilwan
    Commented Aug 13, 2020 at 15:26
  • Did you check how much time the groupby operation takes? If that is the bottleneck, or one of the bottlenecks, you could consider using VAEX or Modin to speed it up in pandas. The rest of your code uses numpy and fills in zeroes to make the group sizes equal, which could require considerably more memory to store the same information on disk and in your RAM. You could consider using sparse matrices in that case to keep the memory footprint of the multi-dimensional array manageable.
    – CypherX
    Commented Aug 13, 2020 at 16:50
  • Have you heard of tensorflow image pipelines? If not, try them. Commented Aug 14, 2020 at 10:27

1 Answer


Use cProfile to determine what portion of the function is consuming the bulk of the elapsed time.
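
For example, a quick way to profile a run on a small slice (the slice size and output filename here are arbitrary choices for illustration):

import cProfile
import pstats

# Profile the function on a manageable slice of the data and report the
# ten most expensive calls by cumulative time.
cProfile.run('transform(df[:100000], 100)', 'transform.prof')
pstats.Stats('transform.prof').sort_stats('cumulative').print_stats(10)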


Add a unit test that times a reasonably large run and verifies there are no performance regressions as the code is maintained.
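
A minimal sketch of such a test, assuming a synthetic DataFrame and an arbitrary time budget (both are placeholders to tune for your data and hardware):

import time
import numpy as np
import pandas as pd

def test_transform_performance():
    # Build a synthetic frame: 1000 ids with 500 rows each, same columns as df.
    n_ids, rows_per_id = 1000, 500
    n = n_ids * rows_per_id
    frame = pd.DataFrame({
        'id': np.repeat(np.arange(n_ids), rows_per_id),
        'speed': np.random.rand(n),
        'acceleration': np.random.rand(n),
        'jerk': np.random.rand(n),
        'bearing_change': np.random.rand(n),
        'travelmode': 3,
    })
    start = time.perf_counter()
    transform(frame, 100)
    elapsed = time.perf_counter() - start
    # Fail loudly if a future change makes the function much slower.
    assert elapsed < 30.0, f"transform() took {elapsed:.1f}s, expected < 30s"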


The .groupby('id') is potentially expensive. Consider sorting the values first and using .set_index('id').
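
One way that suggestion could look (a sketch; whether it actually beats groupby here is worth checking with the profiler above):

# Sort once and index by id; lookups on a sorted (monotonic) index are fast slices.
indexed = df.sort_values('id').set_index('id')

for uid in indexed.index.unique():
    group = indexed.loc[[uid]]   # all rows for one id, always as a DataFrame
    inputs = group.loc[:, 'speed':'bearing_change'].values
    label = group['travelmode'].values[0]
    ...  # same per-group chunking as in the original loop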


Per-row storage really matters when you have 21M rows. Consider using int16 or float32 dtypes where feasible.
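
For example (assuming the value ranges fit the narrower types; check df.describe() first):

# Roughly halves the memory footprint if the value ranges allow it.
df['id'] = df['id'].astype('int32')              # ids in the sample fit in int32
df['travelmode'] = df['travelmode'].astype('int8')
float_cols = ['speed', 'acceleration', 'jerk', 'bearing_change']
df[float_cols] = df[float_cols].astype('float32')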


The public API of transform() hands back some very large final objects:

    return X, y

Consider offering an API with a more economical memory footprint, where within the loop you generate small intermediate objects:

        yield X, y

Or use .to_hdf and return a filename rather than a very large object.
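
A generator-based variant might look like this (a sketch; transform_iter is a made-up name, and the caller decides whether to accumulate the chunks, write them to disk, or feed them straight to a model):

import numpy as np

def transform_iter(dataframe, chunk_size=5):
    """Yield one (chunk, label) pair at a time instead of one huge array."""
    for _, group in dataframe.groupby('id'):
        inputs = group.loc[:, 'speed':'bearing_change'].values
        label = group['travelmode'].values[0]
        for start in range(0, len(inputs), chunk_size):
            chunk = inputs[start:start + chunk_size]
            chunk = np.pad(
                chunk, [(0, chunk_size - len(chunk)), (0, 0)],
                mode='constant')
            # chunk has shape (chunk_size, 4); add leading axes in the consumer if needed.
            yield chunk, label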


An inner loop allocates memory in this way:

            X = np.concatenate([X, inpt[np.newaxis, np.newaxis]], axis=0)
            y = np.concatenate([y, label[np.newaxis]], axis=0) 

Repeatedly extending a numpy allocation can involve lots of copying, with quadratic total cost. Prefer to identify the total number of rows up front and do the allocation all at once. If a loop must write output incrementally, it is better to overwrite pre-allocated storage than to keep requesting additional storage.
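
A sketch of what pre-allocation could look like (transform_prealloc is a hypothetical name; the chunk count mirrors the splitting logic of the original function):

import numpy as np

def transform_prealloc(dataframe, chunk_size=5):
    grouped = dataframe.groupby('id')

    # Each group contributes ceil(group_size / chunk_size) chunks.
    sizes = grouped.size().to_numpy()
    total_chunks = int(np.ceil(sizes / chunk_size).sum())

    # Allocate the outputs once instead of growing them inside the loop.
    X = np.zeros((total_chunks, 1, chunk_size, 4))
    y = np.zeros(total_chunks)

    i = 0
    for _, group in grouped:
        inputs = group.loc[:, 'speed':'bearing_change'].values
        label = group['travelmode'].values[0]
        for start in range(0, len(inputs), chunk_size):
            chunk = inputs[start:start + chunk_size]
            # Write into the pre-allocated slot; shorter final chunks stay
            # zero-padded because X was created with np.zeros.
            X[i, 0, :len(chunk), :] = chunk
            y[i] = label
            i += 1

    return X, y

This keeps the per-group logic the same while removing the repeated copies that the quadratic cost comes from.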

