
I use the Iris dataset to train a simple network with PyTorch.

trainset = iris.Iris(train=True)
trainloader = torch.utils.data.DataLoader(trainset, batch_size=150,
                                          shuffle=True, num_workers=2)

dataiter = iter(trainloader)

The dataset itself has only 150 data points, and the PyTorch DataLoader iterates just once over the whole dataset because of the batch size of 150.

My question is now: is there generally any way to tell PyTorch's DataLoader to repeat over the dataset once it is done with an iteration?

Thanks

Update

Got it running :) I just created a subclass of the DataLoader and implemented my own __next__().
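Roughly, the idea looks something like this (a sketch of the approach rather than my exact code; it wraps the loader and restarts the underlying iterator whenever it raises StopIteration):

class RepeatingLoader:
    """Wraps a DataLoader and restarts it whenever it is exhausted."""

    def __init__(self, loader):
        self.loader = loader
        self.iterator = iter(loader)

    def __iter__(self):
        return self

    def __next__(self):
        try:
            return next(self.iterator)
        except StopIteration:
            # re-create the iterator; with shuffle=True this also reshuffles
            self.iterator = iter(self.loader)
            return next(self.iterator)

dataiter = RepeatingLoader(trainloader)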

  • Of course I did, what has that got to do with my question? Commented Dec 8, 2017 at 17:40

5 Answers


Using itertools.cycle has an important drawback, in that it does not shuffle the data after each iteration:

When the iterable is exhausted, return elements from the saved copy.

This can negatively affect the performance of your model in some situations. A solution to this can be to write your own cycle generator:

def cycle(iterable):
    while True:
        for x in iterable:
            yield x

You would use it as:

dataiter = iter(cycle(trainloader))
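
For example, in a step-based training loop (a sketch; it assumes the loader yields (x, y) pairs and that max_steps is defined elsewhere):

for step in range(max_steps):
    x, y = next(dataiter)  # never raises StopIteration
    # ... forward pass, loss, backward, optimizer step ...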
  • You could get over this limitation by resetting the dataset, right? If the dataset object uses a key list to iterate through, simply shuffling the keys would work: trainloader.dataset.reset() Commented Sep 16, 2020 at 2:04
  • The previous comment touches on it, but just to make it more explicit: this solution will not reshuffle the samples, which will likely affect the performance of the trained model negatively. Commented Dec 10, 2021 at 8:52
  • @VictorZuanazzi you mean to say that even cycle(trainloader) will not shuffle after exhaustion? Or that cycle(trainloader) will shuffle after the exhaustion but itertools.cycle will not shuffle? Commented Dec 22, 2022 at 12:09
  • @RakshitKothari could you please elaborate on your comment? I did not get it. Perhaps if posting as a separate answer helps in giving a code snippet. Commented Dec 22, 2022 at 12:11
  • @Pim isn't iter redundant in dataiter = iter(cycle(trainloader)), i.e. won't just dataiter = cycle(trainloader) work? Commented Dec 22, 2022 at 12:13

To complement the previous answers: to be comparable across datasets, it is often better to use the total number of steps rather than the total number of epochs as a hyperparameter, because the number of iterations should not rely on the dataset size but on its complexity.

I am using the following code for training. It ensures that the data loader reshuffles the data every time it is re-created.

# main training loop
generator = iter(trainloader)
for i in range(max_steps):
    try:
        # sample the next batch
        x, y = next(generator)
    except StopIteration:
        # restart the generator once the previous one is exhausted
        generator = iter(trainloader)
        x, y = next(generator)

I agree this is not the most elegant solution, but it keeps me from having to rely on epochs for training.


The simplest option is to just use a nested loop:

for i in range(10):
    for batch in trainloader:
        do_something(batch)

Another option would be to use itertools.cycle, perhaps in combination with itertools.islice.
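
For example (a sketch; num_steps is a hypothetical name for the total number of batches you want to draw):

import itertools

for batch in itertools.islice(itertools.cycle(trainloader), num_steps):
    do_something(batch)

Keep in mind the caveat from the answer above: itertools.cycle replays a saved copy, so batches are not reshuffled between passes.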

Of course, using a DataLoader with a batch size equal to the whole dataset is a bit unusual. You also don't need to call iter() on the trainloader.


If you want to use only one for loop:
Without tqdm, the best solution is:

import itertools

for batch_index, (x, y) in enumerate(itertools.chain(validation_loader,
                                                     validation_loader,
                                                     validation_loader,
                                                     validation_loader)):  # loop 4 times
    ...

With tqdm, the best solution is:

from tqdm import tqdm
pbar = tqdm(itertools.chain(validation_loader,
                            validation_loader,
                            validation_loader,
                            validation_loader))  # loop through 4 times
for batch_index, (x, y) in enumerate(pbar):
    ...
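
If the number of passes is a variable rather than a literal four, the same chaining can be written more compactly (a sketch; n_passes is a hypothetical name):

import itertools

n_passes = 4
chained = itertools.chain.from_iterable(itertools.repeat(validation_loader, n_passes))
for batch_index, (x, y) in enumerate(chained):
    ...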
  • tqdm has nothing to do with this question (or answer). Just loop over the loader multiple times. – scnerd Commented Aug 14, 2019 at 14:27

Below I discuss two ways of iterating over the dataset. Although this has already been covered in the answers above, the code below should make things crystal clear.

import torch
from torch.utils.data import Dataset, DataLoader
import itertools

def cycle(iterable):
    while True:
        for x in iterable:
            yield x

class CustomImageDataset(Dataset):
    def __init__(self):
        self.my_list = [1,2,3,4,5,6]

    def __len__(self):
        return len(self.my_list)

    def __getitem__(self, idx):
        return self.my_list[idx]


def print_iterations(dataiter,batchsize):
    for idx in range(20):
        print(f'In iteration {idx+1} sample is {next(dataiter)}')
        if (idx+1)%(6/batchsize)==0:
            print('----')

def test(batchsize):
    print(f'****** Batch size = {batchsize} **********')

    train_dataloader = DataLoader(CustomImageDataset(), batch_size=batchsize, shuffle=True)

    dataiter = cycle(train_dataloader) # Note I do not wrap "iter" before "cycle()"

    print_iterations(dataiter,batchsize)
    print('\n---> Custom cycle works fine, i.e. after exhaustion samples are reshuffled\n\n')

    dataiter = itertools.cycle(train_dataloader)
    print_iterations(dataiter,batchsize)
    print('\n---> itertools.cycle does NOT work, i.e. after exhaustion samples are NOT reshuffled')

test(2)
test(1)

The expected output is:

****** Batch size = 2 **********
In iteration 1 sample is tensor([4, 1])
In iteration 2 sample is tensor([6, 3])
In iteration 3 sample is tensor([2, 5])
----
In iteration 4 sample is tensor([1, 3])
In iteration 5 sample is tensor([5, 4])
In iteration 6 sample is tensor([6, 2])
----
In iteration 7 sample is tensor([4, 1])
In iteration 8 sample is tensor([2, 6])
In iteration 9 sample is tensor([5, 3])
----
In iteration 10 sample is tensor([2, 1])
In iteration 11 sample is tensor([4, 3])
In iteration 12 sample is tensor([6, 5])
----
In iteration 13 sample is tensor([5, 2])
In iteration 14 sample is tensor([4, 6])
In iteration 15 sample is tensor([3, 1])
----
In iteration 16 sample is tensor([2, 1])
In iteration 17 sample is tensor([6, 5])
In iteration 18 sample is tensor([4, 3])
----
In iteration 19 sample is tensor([6, 3])
In iteration 20 sample is tensor([5, 1])

---> Custom cycle works fine, i.e. after exhaustion samples are reshuffled


In iteration 1 sample is tensor([5, 4])
In iteration 2 sample is tensor([6, 2])
In iteration 3 sample is tensor([1, 3])
----
In iteration 4 sample is tensor([5, 4])
In iteration 5 sample is tensor([6, 2])
In iteration 6 sample is tensor([1, 3])
----
In iteration 7 sample is tensor([5, 4])
In iteration 8 sample is tensor([6, 2])
In iteration 9 sample is tensor([1, 3])
----
In iteration 10 sample is tensor([5, 4])
In iteration 11 sample is tensor([6, 2])
In iteration 12 sample is tensor([1, 3])
----
In iteration 13 sample is tensor([5, 4])
In iteration 14 sample is tensor([6, 2])
In iteration 15 sample is tensor([1, 3])
----
In iteration 16 sample is tensor([5, 4])
In iteration 17 sample is tensor([6, 2])
In iteration 18 sample is tensor([1, 3])
----
In iteration 19 sample is tensor([5, 4])
In iteration 20 sample is tensor([6, 2])

---> itertools.cycle does NOT work, i.e. after exhaustion samples are NOT reshuffled
****** Batch size = 1 **********
In iteration 1 sample is tensor([3])
In iteration 2 sample is tensor([5])
In iteration 3 sample is tensor([4])
In iteration 4 sample is tensor([2])
In iteration 5 sample is tensor([6])
In iteration 6 sample is tensor([1])
----
In iteration 7 sample is tensor([5])
In iteration 8 sample is tensor([4])
In iteration 9 sample is tensor([3])
In iteration 10 sample is tensor([1])
In iteration 11 sample is tensor([2])
In iteration 12 sample is tensor([6])
----
In iteration 13 sample is tensor([3])
In iteration 14 sample is tensor([2])
In iteration 15 sample is tensor([1])
In iteration 16 sample is tensor([5])
In iteration 17 sample is tensor([4])
In iteration 18 sample is tensor([6])
----
In iteration 19 sample is tensor([1])
In iteration 20 sample is tensor([3])

---> Custom cycle works fine, i.e. after exhaustion samples are reshuffled


In iteration 1 sample is tensor([3])
In iteration 2 sample is tensor([1])
In iteration 3 sample is tensor([6])
In iteration 4 sample is tensor([4])
In iteration 5 sample is tensor([5])
In iteration 6 sample is tensor([2])
----
In iteration 7 sample is tensor([3])
In iteration 8 sample is tensor([1])
In iteration 9 sample is tensor([6])
In iteration 10 sample is tensor([4])
In iteration 11 sample is tensor([5])
In iteration 12 sample is tensor([2])
----
In iteration 13 sample is tensor([3])
In iteration 14 sample is tensor([1])
In iteration 15 sample is tensor([6])
In iteration 16 sample is tensor([4])
In iteration 17 sample is tensor([5])
In iteration 18 sample is tensor([2])
----
In iteration 19 sample is tensor([3])
In iteration 20 sample is tensor([1])

---> itertools.cycle does NOT work, i.e. after exhaustion samples are NOT reshuffled
