I am training a WGAN-GP on the EuroSAT dataset, split into train/val/test sets of 18900/4050/4050 images. Since FID is widely used to evaluate GANs for image generation, I based my hyperparameter search on the FID score: I loop over 12 hyperparameter combinations. Each run trains for 300 epochs, and the FID (against my val set) is computed every 10 epochs. At the end of each run, the epoch with the lowest FID is saved along with that run's hyperparameter values. When all 12 runs finish, the hyperparameter values and epoch count that achieved the lowest FID among the 12 are chosen as the best result so far.
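In case it helps to see the selection protocol concretely, here is a minimal sketch of the loop described above. The grid values and the `compute_fid` stand-in are hypothetical placeholders (not my real hyperparameters); the point is only the "evaluate every 10 epochs, keep the global minimum" logic:

```python
import itertools

# assumed 3 x 4 = 12 hyperparameter combinations (placeholder values)
grid = list(itertools.product([1e-4, 2e-4, 4e-4], [5, 10, 15, 20]))

def compute_fid(lr, gp_weight, epoch):
    # stand-in for: train to `epoch`, generate samples, compute FID vs. val set
    return (lr * 1e4 - 2) ** 2 + (gp_weight - 10) ** 2 / 100 + 300 / epoch

# track the (FID, lr, gp_weight, epoch) tuple with the lowest FID overall,
# evaluating every 10th epoch up to 300, as in the setup described above
best = min(
    (compute_fid(lr, gp, epoch), lr, gp, epoch)
    for lr, gp in grid
    for epoch in range(10, 301, 10)
)
best_fid, best_lr, best_gp, best_epoch = best
```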
Here is how I use inceptionV3:
tf.keras.backend.clear_session()
inception = tf.keras.applications.InceptionV3(
    include_top=False, pooling="avg", input_shape=(75, 75, 3)
)
Instead of the default 299x299 input, I am using 75x75 (the minimum input size InceptionV3 accepts).
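For reference, once the pooled Inception features are extracted with a model like the one above, FID is just the Fréchet distance between Gaussians fitted to the two feature sets. A minimal NumPy/SciPy sketch (the feature arrays here are random stand-ins, not real Inception activations):

```python
import numpy as np
from scipy import linalg

def fid_from_features(feat_a, feat_b):
    """Fréchet distance between Gaussians fitted to two (N, D) feature sets."""
    mu_a, mu_b = feat_a.mean(axis=0), feat_b.mean(axis=0)
    cov_a = np.cov(feat_a, rowvar=False)
    cov_b = np.cov(feat_b, rowvar=False)
    covmean = linalg.sqrtm(cov_a @ cov_b)
    if np.iscomplexobj(covmean):  # drop tiny imaginary parts from numerics
        covmean = covmean.real
    diff = mu_a - mu_b
    return float(diff @ diff + np.trace(cov_a + cov_b - 2.0 * covmean))

# sanity check: a feature set compared against itself should give FID ~ 0
rng = np.random.default_rng(0)
feats = rng.normal(size=(500, 8))
print(fid_from_features(feats, feats))
```

Note that with only 8 feature dimensions the covariance estimate is easy; with the full 2048-dimensional pooled features, a small validation set makes the covariance estimate noisy, which is one reason small-sample FID values run high.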
Also, my validation set is small, arguably at the minimum size needed for a healthy FID estimate.
My question is: the FID scores I am observing are quite high, but can I still rely on them for my hyperparameter search? My reasoning is that I am ultimately comparing FID at the same training epochs and against the same validation set across all hyperparameter combinations, so it serves as a relative ranking, not as an image-quality report yet. Does that sound reasonable?
Any help is much appreciated. Also, I couldn't find an academic paper on a similar case that could answer, or hint at an answer to, my question. If you know of any, please let me know.