TensorFlow has tf.data.Dataset.repeat(x), which iterates through the data x times. It also has iterator.initializer, which can be used to restart the iteration once iterator.get_next() has exhausted the dataset. My question is: is there a difference between using the tf.data.Dataset.repeat(x) technique and using iterator.initializer?
1 Answer
As we know, each epoch in the training process of a model takes in the whole dataset and breaks it into batches; this happens on every epoch. Suppose we have a dataset with 100 samples. On every epoch, the 100 samples are broken into 5 batches (of 20 each) for feeding to the model. But if I have to train the model for, say, 5 epochs, then I need to repeat the dataset 5 times, meaning the repeated dataset will contain 500 samples in total (100 samples repeated 5 times).
Now, this job is done by the tf.data.Dataset.repeat() method. Usually we pass the number of epochs as its count argument.
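As a minimal sketch of this (assuming TensorFlow 2.x with eager execution, where a tf.data.Dataset can be iterated directly), the 100-sample example above looks like:

```python
import tensorflow as tf

# A toy dataset of 100 samples, repeated 5 times and split into
# batches of 20 -- mirroring the 100-sample example above.
dataset = tf.data.Dataset.range(100).repeat(5).batch(20)

# 500 total elements / 20 per batch = 25 batches.
num_batches = sum(1 for _ in dataset)
print(num_batches)  # 25
```

A single loop over this dataset therefore already covers all 5 epochs, because the repetition is part of the input pipeline itself.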
iterator.get_next() is just a way of fetching the next batch of data from the tf.data.Dataset: you are iterating over the dataset batch by batch.
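The iterator.initializer alternative keeps the epoch loop in your own code: you build the dataset without repeat() and re-run the initializer whenever get_next() raises OutOfRangeError. A sketch, assuming the TF1-style graph API accessed through tf.compat.v1 (the API the question refers to):

```python
import tensorflow as tf

tf.compat.v1.disable_eager_execution()

# Same 100-sample dataset, batched into 5 batches of 20 -- note there is
# no repeat() here; we make multiple passes by re-initializing the iterator.
dataset = tf.compat.v1.data.Dataset.range(100).batch(20)
iterator = tf.compat.v1.data.make_initializable_iterator(dataset)
next_batch = iterator.get_next()

batches_per_epoch = []
with tf.compat.v1.Session() as sess:
    for epoch in range(5):
        sess.run(iterator.initializer)  # restart iteration for this epoch
        count = 0
        while True:
            try:
                sess.run(next_batch)
                count += 1
            except tf.errors.OutOfRangeError:  # dataset exhausted
                break
        batches_per_epoch.append(count)

print(batches_per_epoch)  # [5, 5, 5, 5, 5]
```

Each pass sees the same 5 batches; the total work is the same as repeat(5), but the epoch boundary is explicit in your training loop.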
That's the difference. tf.data.Dataset.repeat() repeats the samples in the dataset itself, so one pass over the repeated dataset covers all the epochs, whereas iterator.get_next() fetches the data one batch at a time and iterator.initializer lets you restart that iteration manually at the start of each epoch. Either way you make multiple passes over the data; repeat() bakes the repetition into the input pipeline, while iterator.initializer leaves the epoch loop (and the epoch boundary) explicit in your training code.