
I'm a little confused about the best practice for implementing a PyTorch data pipeline on time-series data.

I have an HDF5 file which I read using a custom Dataset. It seems that I should return each data sample as a (features, targets) tuple, with each element shaped (L, C), where L is seq_len and C is the number of channels; i.e. don't perform batching in the dataset, just return individual samples.

PyTorch modules seem to require a batch dim, i.e. nn.Conv1d expects (N, C, L).
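For instance (a minimal check with made-up sizes):

```python
import torch
import torch.nn as nn

# nn.Conv1d expects input of shape (N, C, L): batch, input channels, sequence length.
conv = nn.Conv1d(in_channels=3, out_channels=8, kernel_size=3)
x = torch.randn(5, 3, 10)  # N=5, C=3, L=10
print(conv(x).shape)       # torch.Size([5, 8, 8]), since L_out = 10 - 3 + 1
```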

I was under the impression that the DataLoader class would prepend the batch dimension, but it doesn't: I'm getting data shaped (N, L).

dataset = HD5Dataset(args.dataset)

dataloader = DataLoader(dataset,
                        batch_size=N,
                        shuffle=True,
                        pin_memory=is_cuda,
                        num_workers=num_workers)

for i, (x, y) in enumerate(dataloader):
    ...

In the code above the shape of x is (N, C), not (1, N, C), which causes the code below (from a public git repo) to fail on its first line.

def forward(self, x):
    """expected input shape is (N, L, C)"""
    x = x.transpose(1, 2).contiguous() # input should have dimension (N, C, L)

The documentation states that "When automatic batching is enabled … it always prepends a new dimension as the batch dimension", which leads me to believe that automatic batching is disabled, but I don't understand why.

  • "I'm getting data shaped (N,L)" and "the shape of x is (N,C)": these two statements are contradictory. Did you make a typo in one of them? What is the shape of dataset?
    – iacob
    Commented Apr 20, 2021 at 17:14

2 Answers


If you have a dataset of pairs of tensors (x, y), where each x is of shape (C,L), then:

import torch
import torch.utils.data as data_utils

N, C, L = 5, 3, 10
dataset = [(torch.randn(C, L), torch.ones(1)) for i in range(50)]
dataloader = data_utils.DataLoader(dataset, batch_size=N)

for i, (x,y) in enumerate(dataloader):
    print(x.shape)

will produce 50/N = 10 batches of shape (N, C, L) for x:

torch.Size([5, 3, 10])
torch.Size([5, 3, 10])
torch.Size([5, 3, 10])
torch.Size([5, 3, 10])
torch.Size([5, 3, 10])
torch.Size([5, 3, 10])
torch.Size([5, 3, 10])
torch.Size([5, 3, 10])
torch.Size([5, 3, 10])
torch.Size([5, 3, 10])
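Conversely, if each sample were only 1-D (shape (L,)), the same collation would yield batches of shape (N, L), which matches the shape reported in the question. A sketch with hypothetical shapes:

```python
import torch
from torch.utils.data import DataLoader

# Hypothetical 1-D samples of shape (L,): default collation stacks them
# into (N, L) batches, with no channel dimension to preserve.
N, L = 5, 10
dataset = [(torch.randn(L), torch.ones(1)) for _ in range(50)]
x, y = next(iter(DataLoader(dataset, batch_size=N)))
print(x.shape)  # torch.Size([5, 10])
```

So the first thing to check is the shape your Dataset's __getitem__ actually returns.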

I've found a few things that seem to work. One option is to use the DataLoader's collate_fn, but a simpler option is to use a BatchSampler, i.e.

from sklearn.model_selection import train_test_split
from torch.utils.data import DataLoader, BatchSampler, SequentialSampler

dataset = HD5Dataset(args.dataset)
train, test = train_test_split(list(range(len(dataset))), test_size=.1)

train_dataloader = DataLoader(dataset,
                        pin_memory=is_cuda,
                        num_workers=num_workers,
                        sampler=BatchSampler(SequentialSampler(train),batch_size=len(train), drop_last=True)
                        )

test_dataloader = DataLoader(dataset,
                        pin_memory=is_cuda,
                        num_workers=num_workers,
                        sampler=BatchSampler(SequentialSampler(test),batch_size=len(test), drop_last=True)
                        )

for i, (x, y) in enumerate(train_dataloader):
    print (x,y)

This converts the dataset's (L, C) samples into a single batch of shape (1, L, C) (not particularly efficiently).
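The collate_fn option mentioned above can be sketched like this (hypothetical sizes; assumes each sample's features are shaped (L, C)):

```python
import torch
from torch.utils.data import DataLoader

# Hypothetical collate_fn that stacks (L, C) feature tensors into (N, L, C).
def collate(batch):
    xs, ys = zip(*batch)
    return torch.stack(xs), torch.stack(ys)

samples = [(torch.randn(10, 3), torch.zeros(1)) for _ in range(8)]
loader = DataLoader(samples, batch_size=4, collate_fn=collate)
x, y = next(iter(loader))
print(x.shape)  # torch.Size([4, 10, 3])
```

This keeps the DataLoader's normal batching and shuffling while giving full control over how samples are assembled into a batch.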
