
I'm a little confused about the best practice for implementing a PyTorch data pipeline on time-series data.

I have an HDF5 file which I read using a custom Dataset. It seems that I should return each data sample as a (features, targets) tuple, with each element shaped (L, C), where L is seq_len and C is the number of channels; i.e. don't perform batching in the dataset, just return individual samples.

PyTorch modules seem to require a batch dim, i.e. nn.Conv1d expects (N, C, L).
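For instance (a minimal check with made-up sizes):

```python
import torch
import torch.nn as nn

# nn.Conv1d expects input of shape (N, C, L): batch, input channels, sequence length.
conv = nn.Conv1d(in_channels=3, out_channels=8, kernel_size=3)
x = torch.randn(5, 3, 10)  # N=5, C=3, L=10
print(conv(x).shape)       # torch.Size([5, 8, 8]), since L_out = 10 - 3 + 1
```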

I was under the impression that the DataLoader class would prepend the batch dimension, but it doesn't: I'm getting data shaped (N, L).

dataset = HD5Dataset(args.dataset)

dataloader = DataLoader(dataset,
                        batch_size=N,
                        shuffle=True,
                        pin_memory=is_cuda,
                        num_workers=num_workers)

for i, (x, y) in enumerate(dataloader):
    ...

In the code above the shape of x is (N, C), not (1, N, C), which causes the code below (from a public git repo) to fail on its first line.

def forward(self, x):
    """expected input shape is (N, L, C)"""
    x = x.transpose(1, 2).contiguous() # input should have dimension (N, C, L)

The documentation states that "When automatic batching is enabled … it always prepends a new dimension as the batch dimension", which leads me to believe that automatic batching is disabled, but I don't understand why.

  • "I'm getting data shaped (N,L)" and "the shape of x is (N,C)": these two statements are contradictory. Did you make a typo in one of them? What is the shape of dataset?
    – iacob
    Commented Apr 20, 2021 at 17:14

2 Answers


If you have a dataset of pairs of tensors (x, y), where each x is of shape (C,L), then:

import torch
import torch.utils.data as data_utils

N, C, L = 5, 3, 10
dataset = [(torch.randn(C, L), torch.ones(1)) for i in range(50)]
dataloader = data_utils.DataLoader(dataset, batch_size=N)

for i, (x,y) in enumerate(dataloader):
    print(x.shape)

will produce 50/N = 10 batches of shape (N, C, L) for x:

torch.Size([5, 3, 10])
torch.Size([5, 3, 10])
torch.Size([5, 3, 10])
torch.Size([5, 3, 10])
torch.Size([5, 3, 10])
torch.Size([5, 3, 10])
torch.Size([5, 3, 10])
torch.Size([5, 3, 10])
torch.Size([5, 3, 10])
torch.Size([5, 3, 10])
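Conversely, if each sample were only 1-D (shape (L,)), the same collation would yield batches of shape (N, L), which matches the shape reported in the question. A sketch with hypothetical shapes:

```python
import torch
from torch.utils.data import DataLoader

# Hypothetical 1-D samples of shape (L,): default collation stacks them
# into (N, L) batches, with no channel dimension to preserve.
N, L = 5, 10
dataset = [(torch.randn(L), torch.ones(1)) for _ in range(50)]
x, y = next(iter(DataLoader(dataset, batch_size=N)))
print(x.shape)  # torch.Size([5, 10])
```

So the first thing to check is the shape your Dataset's __getitem__ actually returns.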

I've found a few things that seem to work. One option is to use the DataLoader's collate_fn, but a simpler option is to use a BatchSampler, i.e.

from sklearn.model_selection import train_test_split
from torch.utils.data import DataLoader, BatchSampler, SequentialSampler

dataset = HD5Dataset(args.dataset)
train, test = train_test_split(list(range(len(dataset))), test_size=.1)

train_dataloader = DataLoader(dataset,
                        pin_memory=is_cuda,
                        num_workers=num_workers,
                        sampler=BatchSampler(SequentialSampler(train),batch_size=len(train), drop_last=True)
                        )

test_dataloader = DataLoader(dataset,
                        pin_memory=is_cuda,
                        num_workers=num_workers,
                        sampler=BatchSampler(SequentialSampler(test),batch_size=len(test), drop_last=True)
                        )

for i, (x, y) in enumerate(train_dataloader):
    print (x,y)

This converts the dataset's (L, C) samples into a single batch of shape (1, L, C) (not particularly efficiently).
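The collate_fn option mentioned above can be sketched like this (hypothetical sizes; assumes each sample's features are shaped (L, C)):

```python
import torch
from torch.utils.data import DataLoader

# Hypothetical collate_fn that stacks (L, C) feature tensors into (N, L, C).
def collate(batch):
    xs, ys = zip(*batch)
    return torch.stack(xs), torch.stack(ys)

samples = [(torch.randn(10, 3), torch.zeros(1)) for _ in range(8)]
loader = DataLoader(samples, batch_size=4, collate_fn=collate)
x, y = next(iter(loader))
print(x.shape)  # torch.Size([4, 10, 3])
```

This keeps the DataLoader's normal batching and shuffling while giving full control over how samples are assembled into a batch.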
