$\begingroup$

I'm fitting a regression of Y on predictors X₁, X₂, X₃ using pre-intervention data, then plugging in observed post-intervention X values to generate a counterfactual Y and compare it to the observed Y.
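
To make the procedure concrete, here is a minimal sketch, assuming ordinary least squares and purely synthetic data. The helper names (`fit_ols`, `counterfactual`), the coefficient values, and the simulated effect of 3.0 are all illustrative assumptions, not part of my actual analysis.

```python
import numpy as np

def fit_ols(X, y):
    """Fit OLS on pre-intervention data; returns coefficients, intercept first."""
    A = np.column_stack([np.ones(len(X)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    return beta

def counterfactual(beta, X):
    """Predict Y from observed X values using the pre-intervention fit."""
    return np.column_stack([np.ones(len(X)), X]) @ beta

# Synthetic data (assumed values, for illustration only).
rng = np.random.default_rng(0)
coefs = np.array([2.0, -1.0, 0.5])
X_pre = rng.normal(size=(200, 3))
y_pre = 1.0 + X_pre @ coefs + rng.normal(scale=0.5, size=200)

# Post-intervention: predictors are unaffected, Y is shifted by the intervention.
true_effect = 3.0
X_post = rng.normal(size=(100, 3))
y_post = 1.0 + X_post @ coefs + true_effect + rng.normal(scale=0.5, size=100)

beta_hat = fit_ols(X_pre, y_pre)
y_cf = counterfactual(beta_hat, X_post)        # counterfactual Y without intervention
effect_hat = float(np.mean(y_post - y_cf))     # observed minus counterfactual
print(effect_hat)
```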

I'm willing to assume the intervention didn't affect the predictors themselves.

However, the predictors plausibly have causal relationships among themselves (e.g., X₁ → X₂ → X₃) — which I know matters for inference on individual coefficients (mediator/collider issues), but I'm using the model purely for forecasting Y.

Does inter-predictor causal structure bias my counterfactual estimate, or is it irrelevant in the forecasting setting?

[Figure: DAG of the predictors and Y]

$\endgroup$
  • $\begingroup$ A DAG would really illustrate what you're thinking. Can you post a DAG of your situation? $\endgroup$ Commented 17 hours ago
  • $\begingroup$ I have now added a DAG. The t index denotes the time of measurement (t is the most recent, t1 is one time unit ago, t2 is two time units ago). $\endgroup$ Commented 11 hours ago
  • $\begingroup$ You might be interested in en.wikipedia.org/wiki/Multicollinearity, which concerns correlation rather than causality but is a problem likely to show up in your setting. $\endgroup$ Commented 1 hour ago

1 Answer

$\begingroup$

For building the model, it doesn't make a difference. Causation is a conceptual explanation of the data rather than a property of the data itself: you can have perfectly correlated variables with no causal relationship, a causal relationship with no correlation whatsoever, or anything in between. That one variable causes another doesn't, by itself, imply anything about their statistical relationship. And if you observe all the predictors anyway, their upstream causes don't really matter: there is no need to infer X2 from its upstream cause X1 when you have X2 directly.
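
As a small illustration of those two extremes, here is a sketch on synthetic data. The variable names and the specific functional forms are my own assumptions, chosen only to make the point.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000

# Causation without correlation: y is fully determined by x,
# but because x is symmetric around zero, corr(x, x**2) is about 0.
x = rng.normal(size=n)
y_caused = x ** 2
corr_causal = float(np.corrcoef(x, y_caused)[0, 1])

# Correlation without causation between a and b: both are driven by
# a common cause z, and neither one causes the other.
z = rng.normal(size=n)
a = 2.0 * z
b = 3.0 * z + 1.0
corr_common = float(np.corrcoef(a, b)[0, 1])

print(corr_causal, corr_common)
```

The first correlation comes out near zero and the second near one, even though only the first pair has a direct causal link.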

Suppose you take some set of X's and Y's and build a predictive model. That model is no less valid if I tell you that some of the X's cause one another, or that none of them cause anything, or even that it's actually Y that causes X. The predictive model captures the observed statistical relationship between the variables, which is not affected by the conceptual relationship among them. It's possible to observe the exact same data, and build the exact same model with the exact same statistical performance, in scenarios with causation and in scenarios without. Any observed relationship among the variables could have arisen equally well with or without causation.
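
Here is a rough simulation of that idea, not a proof. It assumes Y depends linearly on all three predictors and that this relationship is the same at training and forecast time. The coefficients, the chain strengths of 0.8, and the function names are hypothetical. One set of predictors is generated by a chain X1 → X2 → X3; the other is drawn jointly with the same covariance but with no causal arrows among the X's.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 5000
beta_true = np.array([1.0, 2.0, -1.0, 0.5])  # intercept, X1, X2, X3 (assumed values)

def chain_predictors(n):
    """X1 -> X2 -> X3: each predictor is caused by the previous one."""
    x1 = rng.normal(size=n)
    x2 = 0.8 * x1 + rng.normal(size=n)
    x3 = 0.8 * x2 + rng.normal(size=n)
    return np.column_stack([x1, x2, x3])

def noncausal_predictors(n, cov):
    """Same covariance as the chain, but drawn jointly with no causal ordering."""
    return rng.multivariate_normal(np.zeros(3), cov, size=n)

def outcome(X):
    """Y depends on the observed X values only, plus noise."""
    return beta_true[0] + X @ beta_true[1:] + rng.normal(scale=0.5, size=len(X))

def holdout_bias(X_train, X_test):
    """Fit OLS on the training set, return the mean forecast error on the test set."""
    y_train, y_test = outcome(X_train), outcome(X_test)
    A = np.column_stack([np.ones(len(X_train)), X_train])
    beta, *_ = np.linalg.lstsq(A, y_train, rcond=None)
    pred = np.column_stack([np.ones(len(X_test)), X_test]) @ beta
    return float(np.mean(y_test - pred))

X_chain = chain_predictors(n)
cov = np.cov(X_chain, rowvar=False)

bias_chain = holdout_bias(X_chain, chain_predictors(n))
bias_flat = holdout_bias(noncausal_predictors(n, cov), noncausal_predictors(n, cov))
print(bias_chain, bias_flat)
```

In both scenarios the mean forecast error is close to zero, so under these assumptions the causal structure among the predictors does not bias the forecast of Y.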

What might change is your interpretation of the model. If you have, say, two variables with equal importance in the model, you may prefer to focus on the one with a causal relationship. Causation can help explain what the model does, but knowing the causal relationships among the input variables won't change how the model performs statistically.

$\endgroup$
