I have a dataset that I have split into training and testing sets, with approximately 160 samples in the training set and 40 in the testing set. I fitted a probability distribution to each set separately and used the negative log-likelihood (NLL) to assess how well the distribution fits each set. I am using the following formula for the NLL:
$$\text{NLL}= -\sum_{i=1}^n\log(P(y_i)) $$
Now I want to compare the NLL values of the two datasets. However, there is a problem: the NLL is a sum over all $n$ samples, so its magnitude grows with the sample size. Consequently, I believe the raw NLL values for the training and testing sets cannot be compared directly. How can I properly compare this metric across the two sets? Would it be fair to compare the per-sample averages $\dfrac{\text{NLL}_{\text{train}}}{160}$ and $\dfrac{\text{NLL}_{\text{test}}}{40}$?
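To make the question concrete, here is a minimal sketch of the comparison I have in mind. It assumes a Gaussian model and synthetic data (both are illustrative choices, not my actual dataset), fits the distribution by maximum likelihood, and contrasts the total NLL with the per-sample average:

```python
import numpy as np

# Synthetic stand-ins for my actual data (sizes match: 160 train, 40 test).
rng = np.random.default_rng(0)
train = rng.normal(loc=0.0, scale=1.0, size=160)
test = rng.normal(loc=0.0, scale=1.0, size=40)

def gaussian_nll(y, mu, sigma):
    """Total NLL of samples y under N(mu, sigma^2): -sum_i log P(y_i)."""
    log_p = -0.5 * np.log(2 * np.pi * sigma**2) - (y - mu) ** 2 / (2 * sigma**2)
    return -np.sum(log_p)

# MLE fit of the Gaussian parameters on each set separately.
nll_train = gaussian_nll(train, train.mean(), train.std())
nll_test = gaussian_nll(test, test.mean(), test.std())

# The totals sum 160 vs. 40 terms, so they live on different scales;
# dividing by the sample count puts both on a per-sample scale.
mean_nll_train = nll_train / len(train)
mean_nll_test = nll_test / len(test)
print(mean_nll_train, mean_nll_test)
```

The printed per-sample values are what I am proposing to compare, rather than `nll_train` and `nll_test` directly.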