
I use the Iris dataset to train a simple network with PyTorch.

trainset = iris.Iris(train=True)
trainloader = torch.utils.data.DataLoader(trainset, batch_size=150,
                                          shuffle=True, num_workers=2)

dataiter = iter(trainloader)

The dataset itself has only 150 data points, and the PyTorch DataLoader iterates just once over the whole dataset because of the batch size of 150.

My question is now: is there generally any way to tell PyTorch's DataLoader to repeat over the dataset once it is done with an iteration?

Thanks

Update

Got it running :) I just created a subclass of the DataLoader and implemented my own __next__().
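Roughly, the idea looks something like this (a sketch of the approach rather than my exact code; it wraps the loader and restarts the underlying iterator whenever it raises StopIteration):

class RepeatingLoader:
    """Wraps a DataLoader and restarts it whenever it is exhausted."""

    def __init__(self, loader):
        self.loader = loader
        self.iterator = iter(loader)

    def __iter__(self):
        return self

    def __next__(self):
        try:
            return next(self.iterator)
        except StopIteration:
            # re-create the iterator; with shuffle=True this also reshuffles
            self.iterator = iter(self.loader)
            return next(self.iterator)

dataiter = RepeatingLoader(trainloader)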

  • Of course I did, what has that got to do with my question? Commented Dec 8, 2017 at 17:40

5 Answers


Using itertools.cycle has an important drawback, in that it does not shuffle the data after each iteration:

When the iterable is exhausted, return elements from the saved copy.

This can negatively affect the performance of your model in some situations. A solution to this can be to write your own cycle generator:

def cycle(iterable):
    while True:
        for x in iterable:
            yield x

You would use it as:

dataiter = iter(cycle(trainloader))
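
For example, in a step-based training loop (a sketch; it assumes the loader yields (x, y) pairs and that max_steps is defined elsewhere):

for step in range(max_steps):
    x, y = next(dataiter)  # never raises StopIteration
    # ... forward pass, loss, backward, optimizer step ...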
  • You could get over this limitation by resetting the dataset, right? If the dataset object uses a key list to iterate through, simply shuffling the keys would work: trainloader.dataset.reset() Commented Sep 16, 2020 at 2:04
  • The previous comment touches on it, but just to make it more explicit: this solution will not reshuffle the samples, which will likely affect the performance of the trained model negatively. Commented Dec 10, 2021 at 8:52
  • @VictorZuanazzi you mean to say that even cycle(trainloader) will not shuffle after exhaustion? Or that cycle(trainloader) will shuffle after the exhaustion but itertools.cycle will not shuffle? Commented Dec 22, 2022 at 12:09
  • @RakshitKothari could you please elaborate on your comment? I did not get it. Perhaps if posting as a separate answer helps in giving a code snippet. Commented Dec 22, 2022 at 12:11
  • @Pim isn't iter redundant in dataiter = iter(cycle(trainloader)), i.e. won't just dataiter = cycle(trainloader) work? Commented Dec 22, 2022 at 12:13

To complement the previous answers: to be comparable across datasets, it is often better to use the total number of steps rather than the total number of epochs as a hyperparameter, because the number of iterations should not rely on the dataset size but on its complexity.

I am using the following code for training. It ensures that the data loader reshuffles the data every time it is re-created.

# main training loop
generator = iter(trainloader)
for i in range(max_steps):
    try:
        # sample the next batch
        x, y = next(generator)
    except StopIteration:
        # restart the generator once the previous one is exhausted
        generator = iter(trainloader)
        x, y = next(generator)

I agree this is not the most elegant solution, but it keeps me from having to rely on epochs for training.


The simplest option is to just use a nested loop:

for i in range(10):
    for batch in trainloader:
        do_something(batch)

Another option would be to use itertools.cycle, perhaps in combination with itertools.islice.
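
For example (a sketch; num_steps is a hypothetical name for the total number of batches you want to draw):

import itertools

for batch in itertools.islice(itertools.cycle(trainloader), num_steps):
    do_something(batch)

Keep in mind the caveat from the answer above: itertools.cycle replays a saved copy, so batches are not reshuffled between passes.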

Of course, using a DataLoader with a batch size equal to the whole dataset is a bit unusual. You also don't need to call iter() on the trainloader.


If you want to use only one for loop:
Without tqdm, the best solution is:

import itertools

for batch_index, (x, y) in enumerate(itertools.chain(validation_loader,
                                                     validation_loader,
                                                     validation_loader,
                                                     validation_loader)):  # loop 4 times
    ...

With tqdm, the best solution is:

from tqdm import tqdm
pbar = tqdm(itertools.chain(validation_loader,
                            validation_loader,
                            validation_loader,
                            validation_loader))  # loop through 4 times
for batch_index, (x, y) in enumerate(pbar):
    ...
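
If the number of passes is a variable rather than a literal four, the same chaining can be written more compactly (a sketch; n_passes is a hypothetical name):

import itertools

n_passes = 4
chained = itertools.chain.from_iterable(itertools.repeat(validation_loader, n_passes))
for batch_index, (x, y) in enumerate(chained):
    ...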
  • tqdm has nothing to do with this question (or answer). Just loop over the loader multiple times. – scnerd Commented Aug 14, 2019 at 14:27

Below I discuss two ways of iterating over the dataset. Although this has already been covered in the answers above, the code below should make things crystal clear.

import torch
from torch.utils.data import Dataset, DataLoader
import itertools

def cycle(iterable):
    while True:
        for x in iterable:
            yield x

class CustomImageDataset(Dataset):
    def __init__(self):
        self.my_list = [1,2,3,4,5,6]

    def __len__(self):
        return len(self.my_list)

    def __getitem__(self, idx):
        return self.my_list[idx]


def print_iterations(dataiter,batchsize):
    for idx in range(20):
        print(f'In iteration {idx+1} sample is {next(dataiter)}')
        if (idx+1)%(6/batchsize)==0:
            print('----')

def test(batchsize):
    print(f'****** Batch size = {batchsize} **********')

    train_dataloader = DataLoader(CustomImageDataset(), batch_size=batchsize, shuffle=True)

    dataiter = cycle(train_dataloader) # Note I do not wrap "iter" before "cycle()"

    print_iterations(dataiter,batchsize)
    print('\n---> Custom cycle works fine, i.e. after exhaustion samples are reshuffled\n\n')

    dataiter = itertools.cycle(train_dataloader)
    print_iterations(dataiter,batchsize)
    print('\n---> itertools.cycle does NOT work, i.e. after exhaustion samples are NOT reshuffled')

test(2)
test(1)

The expected output is:

****** Batch size = 2 **********
In iteration 1 sample is tensor([4, 1])
In iteration 2 sample is tensor([6, 3])
In iteration 3 sample is tensor([2, 5])
----
In iteration 4 sample is tensor([1, 3])
In iteration 5 sample is tensor([5, 4])
In iteration 6 sample is tensor([6, 2])
----
In iteration 7 sample is tensor([4, 1])
In iteration 8 sample is tensor([2, 6])
In iteration 9 sample is tensor([5, 3])
----
In iteration 10 sample is tensor([2, 1])
In iteration 11 sample is tensor([4, 3])
In iteration 12 sample is tensor([6, 5])
----
In iteration 13 sample is tensor([5, 2])
In iteration 14 sample is tensor([4, 6])
In iteration 15 sample is tensor([3, 1])
----
In iteration 16 sample is tensor([2, 1])
In iteration 17 sample is tensor([6, 5])
In iteration 18 sample is tensor([4, 3])
----
In iteration 19 sample is tensor([6, 3])
In iteration 20 sample is tensor([5, 1])

---> Custom cycle works fine, i.e. after exhaustion samples are reshuffled


In iteration 1 sample is tensor([5, 4])
In iteration 2 sample is tensor([6, 2])
In iteration 3 sample is tensor([1, 3])
----
In iteration 4 sample is tensor([5, 4])
In iteration 5 sample is tensor([6, 2])
In iteration 6 sample is tensor([1, 3])
----
In iteration 7 sample is tensor([5, 4])
In iteration 8 sample is tensor([6, 2])
In iteration 9 sample is tensor([1, 3])
----
In iteration 10 sample is tensor([5, 4])
In iteration 11 sample is tensor([6, 2])
In iteration 12 sample is tensor([1, 3])
----
In iteration 13 sample is tensor([5, 4])
In iteration 14 sample is tensor([6, 2])
In iteration 15 sample is tensor([1, 3])
----
In iteration 16 sample is tensor([5, 4])
In iteration 17 sample is tensor([6, 2])
In iteration 18 sample is tensor([1, 3])
----
In iteration 19 sample is tensor([5, 4])
In iteration 20 sample is tensor([6, 2])

---> itertools.cycle does NOT work, i.e. after exhaustion samples are NOT reshuffled
****** Batch size = 1 **********
In iteration 1 sample is tensor([3])
In iteration 2 sample is tensor([5])
In iteration 3 sample is tensor([4])
In iteration 4 sample is tensor([2])
In iteration 5 sample is tensor([6])
In iteration 6 sample is tensor([1])
----
In iteration 7 sample is tensor([5])
In iteration 8 sample is tensor([4])
In iteration 9 sample is tensor([3])
In iteration 10 sample is tensor([1])
In iteration 11 sample is tensor([2])
In iteration 12 sample is tensor([6])
----
In iteration 13 sample is tensor([3])
In iteration 14 sample is tensor([2])
In iteration 15 sample is tensor([1])
In iteration 16 sample is tensor([5])
In iteration 17 sample is tensor([4])
In iteration 18 sample is tensor([6])
----
In iteration 19 sample is tensor([1])
In iteration 20 sample is tensor([3])

---> Custom cycle works fine, i.e. after exhaustion samples are reshuffled


In iteration 1 sample is tensor([3])
In iteration 2 sample is tensor([1])
In iteration 3 sample is tensor([6])
In iteration 4 sample is tensor([4])
In iteration 5 sample is tensor([5])
In iteration 6 sample is tensor([2])
----
In iteration 7 sample is tensor([3])
In iteration 8 sample is tensor([1])
In iteration 9 sample is tensor([6])
In iteration 10 sample is tensor([4])
In iteration 11 sample is tensor([5])
In iteration 12 sample is tensor([2])
----
In iteration 13 sample is tensor([3])
In iteration 14 sample is tensor([1])
In iteration 15 sample is tensor([6])
In iteration 16 sample is tensor([4])
In iteration 17 sample is tensor([5])
In iteration 18 sample is tensor([2])
----
In iteration 19 sample is tensor([3])
In iteration 20 sample is tensor([1])

---> itertools.cycle does NOT work, i.e. after exhaustion samples are NOT reshuffled
