
I'm analysing some imaging data that consists of large 3-dimensional arrays of pixel intensities with dimensions [frame, x, y]. Since these are usually too big to hold in memory, they reside on the hard disk as PyTables arrays.

What I'd like to be able to do is read out the intensities in an arbitrary subset of pixels across all frames. The natural way to do this seems to be list indexing:

import numpy as np
import tables

tmph5 = tables.open_file('temp.hdf5', 'w')
bigarray = tmph5.create_array('/', 'bigarray', np.random.randn(1000, 200, 100))

# x- and y-coordinates of the pixels in the region of interest
roipixels = [[0, 1, 2, 4, 6], [34, 35, 36, 40, 41]]
roidata = bigarray[:, roipixels[0], roipixels[1]]
# IndexError: Only one selection list is allowed

Unfortunately it seems that PyTables currently only supports a single set of list indices. A further problem is that a list index can't contain duplicates - I couldn't simultaneously read pixels [1, 2] and [1, 3], since my list of pixel x-coordinates would contain [1, 1]. I know that I can iterate over rows in the array:

roidata = np.asarray([row[roipixels[0], roipixels[1]] for row in bigarray])

but these iterative reads become quite slow for the large number of frames I'm processing.

Is there a nicer way of doing this? I'm relatively new to PyTables, so if you have any tips on organising datasets in large arrays I'd love to hear them.

  • I think that's actually an underlying limitation of hdf5, rather than a pytables limitation (I could be entirely wrong there...). At any rate, I find h5py is much more natural for non-table-like data. However, it has the same limitation in this case. Commented Jun 17, 2012 at 0:37

1 Answer


For whatever it's worth, I often do the same thing with 3D seismic data stored in hdf format.

The iterative read is slow because it does a separate small read for every frame. If you do a single loop over the pixel coordinates instead (rather than looping over each row), it's quite fast (at least with h5py; I typically only store table-like data with pytables) and does exactly what you want.

In most cases, you'll want to iterate over your lists of indices rather than over each row.

Basically, you want:

# one read per pixel: each bigarray[:, i, j] pulls that pixel's full time series
roidata = np.vstack([bigarray[:, i, j] for i, j in zip(*roipixels)])

Instead of:

roidata = np.asarray([row[roipixels[0], roipixels[1]] for row in bigarray])
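
Note that the two versions return the data in different orientations: np.vstack stacks one row per pixel, whereas the row loop gives one row per frame, so the results are transposes of each other. A quick check with the example arrays from the question:

roidata_fast = np.vstack([bigarray[:, i, j] for i, j in zip(*roipixels)])
roidata_slow = np.asarray([row[roipixels[0], roipixels[1]] for row in bigarray])

print(roidata_fast.shape)                          # (5, 1000) - one row per pixel
print(roidata_slow.shape)                          # (1000, 5) - one row per frame
print(np.allclose(roidata_fast, roidata_slow.T))   # True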

If this is your most common access pattern, adjusting the chunk shape of the stored array will help dramatically. In your case, you'll want long, narrow chunks, with the longest dimension along the first (frame) axis.

(Caveat: I haven't tested this with pytables, but it works perfectly with h5py.)
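
To make the chunking suggestion concrete, here's a minimal sketch of how you might create the array with an explicit chunk shape in PyTables (the chunkshape value is just an illustrative guess that you'd want to tune; with h5py you'd pass chunks=... to create_dataset instead):

import numpy as np
import tables

nframes, nx, ny = 1000, 200, 100   # same dimensions as the question's example

h5 = tables.open_file('temp_chunked.hdf5', 'w')

# a chunked, disk-backed array: each chunk spans the whole frame axis but is
# narrow in x and y, so a bigarray[:, i, j] read only touches a few chunks
bigarray = h5.create_carray('/', 'bigarray',
                            atom=tables.Float64Atom(),
                            shape=(nframes, nx, ny),
                            chunkshape=(nframes, 1, ny))

# fill the array one frame at a time so it never has to fit in memory at once
for frame in range(nframes):
    bigarray[frame, :, :] = np.random.randn(nx, ny)

h5.close()

The trade-off is that whole-frame writes now touch many chunks, so you'd want to balance the chunk shape against how the data is written (and against any compression you're using).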

  • Thanks, Joe. I guess that kind of answers my question, but I was surprised to find that your suggested line actually reads much slower in my case (by a factor of about 15x). I'm not sure if this is a Pytables vs h5py issue, or whether it has something to do with the chunk shape or compression I'm using. I'll check out h5py and see if it performs any better.
    – ali_m
    Commented Jun 17, 2012 at 14:31
  • It probably is controlled by the chunksize that your array is stored with. If the underlying hdf file is optimized for access by row, then row-based access will be faster. I don't know offhand how pytables calculates a chunksize to use. Good luck, at any rate! Commented Jun 18, 2012 at 15:00
