I am working on a multiple regression model examining the effects of several predictors on morphological traits to make interspecific comparisons across many species:
Length ~ Height + Average Weight + Diet Category
First, because my data were collected from deceased specimens, some variables could not be obtained specimen by specimen, including weight and individual diet. Second, because I am making interspecific comparisons, I find it acceptable to use species-level averages in lieu of specimen-level data in certain circumstances. I have obtained average species weight and species diet category from published databases. (I am assuming that both my data and the database data are representative of the population.)
My question is about using averaged data when the value is identical for every observation (specimen) within a group (species), and this holds for every group.
For example, in the sample data below, length and height are measured per specimen, but every specimen of species A has the same weight and diet category, and likewise for species B and C:
| Species | Length | Height | Weight | Diet Category |
|---|---|---|---|---|
| A | 10 | 10 | 20 | Carnivore |
| A | 11 | 13 | 20 | Carnivore |
| A | 12 | 11 | 20 | Carnivore |
| B | 5 | 7 | 15 | Herbivore |
| B | 6 | 5 | 15 | Herbivore |
| B | 7 | 6 | 15 | Herbivore |
| C | 13 | 11 | 30 | Omnivore |
| C | 14 | 16 | 30 | Omnivore |
| C | 15 | 16 | 30 | Omnivore |
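To make the structure of the sample data explicit, here is a minimal sketch in plain Python (standard library only) that checks which columns never vary within a species. The variable names (`rows`, `columns`, `constant_within_species`) are hypothetical and exist only for this illustration; the values are copied from the table above.

```python
from collections import defaultdict

# Sample rows from the table above: (species, length, height, weight, diet)
rows = [
    ("A", 10, 10, 20, "Carnivore"),
    ("A", 11, 13, 20, "Carnivore"),
    ("A", 12, 11, 20, "Carnivore"),
    ("B", 5, 7, 15, "Herbivore"),
    ("B", 6, 5, 15, "Herbivore"),
    ("B", 7, 6, 15, "Herbivore"),
    ("C", 13, 11, 30, "Omnivore"),
    ("C", 14, 16, 30, "Omnivore"),
    ("C", 15, 16, 30, "Omnivore"),
]
columns = ["length", "height", "weight", "diet"]

# Collect the distinct values each column takes within each species
values = defaultdict(set)
for species, *measurements in rows:
    for name, value in zip(columns, measurements):
        values[(species, name)].add(value)

species_names = sorted({row[0] for row in rows})

# A column is "species-level" if it has exactly one value in every species
constant_within_species = [
    name
    for name in columns
    if all(len(values[(s, name)]) == 1 for s in species_names)
]
print(constant_within_species)  # ['weight', 'diet']
```

In this sample, weight and diet category are the species-level columns, while length and height vary from specimen to specimen.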
My initial thought was to average length and height for each species, so that I would be comparing species averages to species averages, with the species as my unit of comparison. However, it was recommended that I look into using all of my observations for higher statistical power (500 specimens instead of 40 species) and to avoid losing intraspecific variance.
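The species-averaging approach can be sketched as follows, again in plain Python and using only the sample table. The names `by_species` and `species_level` are hypothetical. The sketch only shows what the collapsed data set looks like (one row per species); it does not fit any model.

```python
from collections import defaultdict
from statistics import mean

# Sample rows from the table: (species, length, height, weight, diet)
rows = [
    ("A", 10, 10, 20, "Carnivore"),
    ("A", 11, 13, 20, "Carnivore"),
    ("A", 12, 11, 20, "Carnivore"),
    ("B", 5, 7, 15, "Herbivore"),
    ("B", 6, 5, 15, "Herbivore"),
    ("B", 7, 6, 15, "Herbivore"),
    ("C", 13, 11, 30, "Omnivore"),
    ("C", 14, 16, 30, "Omnivore"),
    ("C", 15, 16, 30, "Omnivore"),
]

# Group the specimens by species
by_species = defaultdict(list)
for species, length, height, weight, diet in rows:
    by_species[species].append((length, height, weight, diet))

# Collapse to one row per species: average the specimen-level columns,
# and carry over the species-level columns unchanged
species_level = []
for species, specimens in sorted(by_species.items()):
    species_level.append({
        "species": species,
        "mean_length": mean(s[0] for s in specimens),
        "mean_height": mean(s[1] for s in specimens),
        "weight": specimens[0][2],
        "diet": specimens[0][3],
        "n_specimens": len(specimens),
    })

for row in species_level:
    print(row)
print(len(rows), "specimens collapse to", len(species_level), "species rows")
```

The collapsed table has one row per species, so the within-species spread in length and height is no longer visible to any model fitted to it.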
I first looked into mixed modeling, but from what I found, it does seem to be a problem when a predictor is perfectly confounded with the grouping variable. In my case, two or more predictors are perfectly confounded with species, depending on the model I'm running.
So I went back to regular regression. From what I have found, this kind of model seems to be acceptable in multiple regression as long as there is 'enough' variation across the observations overall. However, I have struggled to find answers on this specific problem, where predictors are identical within every single group, or any way to judge what constitutes 'enough' variation. I understand that a single "average" variable is not much of a concern, because grouping is not part of the statistical analysis in regular regression the way it is in mixed models, so I am especially concerned about what happens when multiple variables are identical within groups.
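One concrete way to see why several species-level predictors might matter is to check the rank of the design matrix. The sketch below is only an illustration on the nine-row sample table, not on my real data. It assumes diet is dummy-coded with Carnivore as the reference level, and `matrix_rank` is a hypothetical helper written here with exact fractions so that no third-party package is needed.

```python
from fractions import Fraction

def matrix_rank(matrix):
    """Rank via Gaussian elimination with exact rational arithmetic."""
    m = [[Fraction(x) for x in row] for row in matrix]
    n_rows, n_cols = len(m), len(m[0])
    rank = 0
    for col in range(n_cols):
        # Find a row at or below the current pivot position with a nonzero entry
        pivot = next((r for r in range(rank, n_rows) if m[r][col] != 0), None)
        if pivot is None:
            continue  # this column adds no new direction
        m[rank], m[pivot] = m[pivot], m[rank]
        for r in range(rank + 1, n_rows):
            factor = m[r][col] / m[rank][col]
            m[r] = [a - factor * b for a, b in zip(m[r], m[rank])]
        rank += 1
    return rank

# (height, weight, diet) for each specimen in the sample table
specimens = [
    (10, 20, "Carnivore"), (13, 20, "Carnivore"), (11, 20, "Carnivore"),
    (7, 15, "Herbivore"), (5, 15, "Herbivore"), (6, 15, "Herbivore"),
    (11, 30, "Omnivore"), (16, 30, "Omnivore"), (16, 30, "Omnivore"),
]

# Columns: intercept, height, weight, is_herbivore, is_omnivore
# (Carnivore is the assumed reference level for the diet dummies)
design = [
    [1, h, w, int(d == "Herbivore"), int(d == "Omnivore")]
    for h, w, d in specimens
]
# Same model without the diet dummies: intercept, height, weight
design_no_diet = [row[:3] for row in design]

print(len(design[0]), matrix_rank(design))                  # 5 4
print(len(design_no_diet[0]), matrix_rank(design_no_diet))  # 3 3
```

In this toy table, each species happens to have its own diet category, so the intercept and the two diet dummies already span everything that is constant within a species, and weight becomes an exact linear combination of them (5 columns, rank 4). With weight as the only species-level predictor, the matrix has full column rank. More generally, columns that are constant within species, including the intercept, can contribute at most as many independent directions as there are species, so with 40 species and only a few such columns this exact collinearity need not occur. The sketch says nothing about standard errors or about how much variation counts as 'enough'.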