7
$\begingroup$

I am working on a multiple regression model examining the effects of several predictors on morphological traits to make interspecific comparisons across many species:

Length ~ Height + Average Weight + Diet Category

Firstly, because my data was collected from deceased specimens, certain data was not obtainable specimen by specimen, including weight and individuals' diets. Secondly, because I am making interspecies comparisons, I find it acceptable to use species-wide/species average data in certain circumstances in lieu of specimen-level data. I have obtained average species weight and species diet category from published databases. (I am assuming that my data and the database data are representative of the population.)

My question is about the use of averaged data when it is identical for every observation (specimen) of the same group (species)...and this is the case for every group.

For example, this sample data where length and height are taken on a by-specimen basis but every specimen that's a part of species A has the same weight and diet category, and the same for species B and C:

Species  Length  Height  Weight  Diet Category
A        10      10      20      Carnivore
A        11      13      20      Carnivore
A        12      11      20      Carnivore
B        5       7       15      Herbivore
B        6       5       15      Herbivore
B        7       6       15      Herbivore
C        13      11      30      Omnivore
C        14      16      30      Omnivore
C        15      16      30      Omnivore

My initial thought to deal with this was to average length and height for each species so I would be comparing species averages to species averages and my unit of comparison is the species. However, it was recommended to me to look into using all my observations for higher statistical power (500 specimens instead of 40 species) and to avoid losing intraspecific variance.
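To make the two options concrete, here is a minimal sketch in plain Python (standard library only) using the toy table above. The names `specimens`, `species_means` and `ols_slope` are illustrative, not from any package, and the fit is deliberately reduced to a single predictor (weight) rather than the full model.

```python
from collections import defaultdict
from statistics import mean

# Toy data from the question: (species, length, height, species-level weight)
specimens = [
    ("A", 10, 10, 20), ("A", 11, 13, 20), ("A", 12, 11, 20),
    ("B", 5, 7, 15), ("B", 6, 5, 15), ("B", 7, 6, 15),
    ("C", 13, 11, 30), ("C", 14, 16, 30), ("C", 15, 16, 30),
]


def ols_slope(x, y):
    """Least-squares slope of y on a single predictor x."""
    x_bar, y_bar = mean(x), mean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    return sxy / sxx


# Option 1: collapse to one row per species by averaging each column.
by_species = defaultdict(list)
for species, length, height, weight in specimens:
    by_species[species].append((length, height, weight))

species_means = {
    species: tuple(mean(column) for column in zip(*values))
    for species, values in by_species.items()
}

# Option 2: keep every specimen, with weight repeated within each species.
slope_specimens = ols_slope(
    [row[3] for row in specimens], [row[1] for row in specimens]
)
slope_species = ols_slope(
    [m[2] for m in species_means.values()],
    [m[0] for m in species_means.values()],
)

print(species_means)
print(slope_specimens, slope_species)
```

With equal numbers of specimens per species and weight as the only predictor, both fits give the same slope (about 0.5 here). The specimen-level fit has nine rows, but still only three distinct weight values to learn from. With unbalanced groups or additional specimen-level predictors such as height, the two fits need not agree.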

I first looked into mixed modeling, but I found that it does seem to be an issue when a variable is perfectly confounded with the grouping variable. In this case, I have two or more perfectly confounded variables, depending on the model I'm running.

So I went back to regular regression. My finding seems to be that this kind of model is OK in multiple regression as long as there is 'enough' variation across the observations overall. However, I have struggled to find answers on this specific kind of problem, where predictors are identical within every single group, or any way to judge what constitutes 'enough' variation. I understand that a single "average" variable is not too much of a concern, because grouping is not part of the actual statistical analysis in regular regression like it is in mixed models, so I am especially concerned with what happens if there are multiple variables that are identical within groups.

$\endgroup$
1
  • $\begingroup$ Welcome to CV, and thank you for a nice first question! $\endgroup$ Commented Mar 25 at 8:00

2 Answers

11
$\begingroup$

First off:

I find it acceptable to use species-wide/species average data in certain circumstances in lieu of specimen-level data

This is essentially a strategy for dealing with missing data, specifically replacing all missing values with a single per-group mean. This is already problematic by itself, because your regression does not know that this is an imputed value, and will assume that this is an "actual" value. So it will miss a source of uncertainty, and all conclusions from your model will be too certain.

Dealing with missing data is a large field all by itself. The first step always is to find out whether the missingness is at random, or correlated with some other features. For instance, perhaps you couldn't collect the diet of some specimens because they were dead because they were sick, which correlated with weight or height. Then missingness would be systematic, and this should really be factored into the treatment, especially if missingness affects a large proportion of your data (as it seems to do here).

However, the above is an issue for "isolated" cases of missingness. What you seem to have here is missingness for a feature across one or multiple entire groups. And that is a bigger problem. More precisely, it is not a statistical problem. If you don't have the weight for any specimen of a species, then there is no way to disentangle the contribution of weight from that of the species. Is the first specimen of a certain length because it is of species A or because it has a certain weight? You can't tell. And note that this is not because you imputed the same weight to all specimens in that species; the same issue would be there if you measured the weight and it came out identical for all specimens in a species.

Any conclusions about the relationship in the contributions of species and weight will derive solely from additional structure you impose on the problem and the model, which could take the form of functional assumptions (e.g., assuming identical relationships between length and weight across all species) or regularization or Bayesian priors (which are a form of regularization).
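The last two paragraphs can be checked on the toy data from the question. The following is a minimal sketch in plain Python, not a full analysis: `matrix_rank` is a small hand-written helper (in practice `numpy.linalg.matrix_rank` would be the usual tool), and the three design matrices are illustrative choices of mine. A design matrix whose rank is lower than its number of columns has at least one coefficient that the data cannot pin down.

```python
from fractions import Fraction

# Toy data from the question: (species, length, height, species-level weight)
rows = [
    ("A", 10, 10, 20), ("A", 11, 13, 20), ("A", 12, 11, 20),
    ("B", 5, 7, 15), ("B", 6, 5, 15), ("B", 7, 6, 15),
    ("C", 13, 11, 30), ("C", 14, 16, 30), ("C", 15, 16, 30),
]
species_levels = ["A", "B", "C"]


def matrix_rank(matrix):
    """Exact rank via Gaussian elimination on fractions (illustrative helper)."""
    m = [[Fraction(value) for value in row] for row in matrix]
    rank = 0
    for col in range(len(m[0])):
        # Find a row at or below the current pivot position with a nonzero entry.
        pivot = next((r for r in range(rank, len(m)) if m[r][col] != 0), None)
        if pivot is None:
            continue
        m[rank], m[pivot] = m[pivot], m[rank]
        # Clear this column in every other row.
        for r in range(len(m)):
            if r != rank and m[r][col] != 0:
                factor = m[r][col] / m[rank][col]
                m[r] = [a - factor * b for a, b in zip(m[r], m[rank])]
        rank += 1
    return rank


def species_dummies(species):
    """One indicator column per species (so no separate intercept is needed)."""
    return [1 if species == level else 0 for level in species_levels]


# Species indicators + height: 4 columns.
X_species_only = [species_dummies(s) + [h] for s, _, h, _ in rows]
# Species indicators + height + weight: 5 columns, but weight is an exact
# linear combination of the indicators (20*A + 15*B + 30*C).
X_species_plus_weight = [species_dummies(s) + [h, w] for s, _, h, w in rows]
# Added structure: no species term at all, i.e. one shared intercept and one
# shared weight slope for every species. 3 columns.
X_pooled = [[1, h, w] for _, _, h, w in rows]

rank_species_only = matrix_rank(X_species_only)
rank_species_plus_weight = matrix_rank(X_species_plus_weight)
rank_pooled = matrix_rank(X_pooled)

print(rank_species_only, rank_species_plus_weight, rank_pooled)
```

Adding weight to the species-indicator design adds a column but no rank, so the weight effect cannot be separated from the species effects. The pooled design is full rank only because it assumes species matters solely through weight, which is exactly the kind of imposed structure described above.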

Finally, note that the sheer fact that your statistical software will likely give you some result can obscure the fact that you will not be able to learn anything about these relationships from your data. So you need to be careful about interpreting statistical software outputs.

As a way forward, it looks like you may simply not have the data to learn the kind of things you would like to investigate (point 2 above). As to point 1, if a feature is missing only for some specimens rather than for an entire group, there are established methods for dealing with that, like MICE (multiple imputation by chained equations). Do read up on that, and ask another question here if necessary.

$\endgroup$
7
$\begingroup$

I very much like Stephen's answer. In particular, he notes that you may not have the data you need to answer the questions you ask.

From your question, I'd say you do not have the data.

And this gives me an excuse to put in a poem I wrote when I was bored in a stats class, based on a quote from Ronald Fisher:

To consult the statistician after an experiment is finished is often merely to ask him to conduct a post mortem examination. He can perhaps say what the experiment died of.

and here's the poem:

Dissertation Blues


I’ve designed a great experiment
And collected all my data.
I’ve no idea what it all means —
I’ll get to that stuff later.

I’ve forgotten all the stats I learned
(And I never learned that much)
I needed it to pass my comps
But since then I’ve lost touch.

I’ll do another lit review
And find another theory
But when it’s time to analyze
Everything goes bleary.

So I hired a consultant
To tell me what I’d got.
He looked at three years of my life
And answered “not a lot.”

“There is no dissertation here
"There aren’t any theses.
"Basically what you have got
"Is a great big pile of feces.”

“You should have called me years ago
"Now get this through your head:
"You’ve hired a physician
"But the patient is quite dead”.

Sorry.

As a way forward, redo the study with living organisms or, at least, with individual data available.

$\endgroup$
1
  • 4
    $\begingroup$ I am a simple creature. I see a good cheeky poem (this is the second time you used this), I upvote. $\endgroup$ Commented Mar 25 at 13:00
