1
$\begingroup$

I'm not sure whether the following question sounds silly, but I thought I would go ahead and ask. Probability textbooks often contain questions of the following form:

Let $X_i$ denote the percentage of votes cast in a given election that are for candidate $i$, and suppose that $X_1$ and $X_2$ have a joint density function $$f_{X_1,X_2}(x,y) = \begin{cases} 3(x+y),&\quad \text{if } x\ge 0,\ y\ge 0,\ 0\le x+y\le 1\\ 0,&\quad\text{otherwise.}\end{cases}$$

Is it possible to know (closed-form expressions of) the density functions in real-life situations? Or do we use the available data to find them approximately? Is this related to the research area of 'Density Estimation'? I am asking in the context of teaching: I am teaching a first-level course in probability and statistics, so what should I say if a student asks such a question?

$\endgroup$
1
  • 1
    $\begingroup$ Because one uses probability distributions to reason about hypothetical situations, it's not necessary to "know" them. Students need the skills and knowledge to understand what probability hypotheses mean and how to reason about them. The function of textbook questions like this is to provide examples. $\endgroup$ Commented Oct 6 at 14:46

1 Answer

4
$\begingroup$

Is it possible to know (closed-form expressions of) the density functions in real-life situations?

No, usually not – certainly not exactly. An obvious exception is a numerical simulation where you provide a specific, exactly-known distribution to be sampled from.

Or do we use the available data to somehow approximately find it? Is this related to the research area of 'Density Estimation'?

Yes and yes – numerically approximating the density (and/or the cumulative distribution) function based on a finite sample is certainly one possibility if you want to know it.
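To make the density-estimation route concrete, here is a minimal sketch of a Gaussian kernel density estimate built from a finite sample, using only the Python standard library. The sample is simulated from a standard normal so that the estimate can be compared with a known density. The function names, the sample size of 500, and the normal-reference bandwidth rule ($1.06\,\hat\sigma\,n^{-1/5}$) are my own illustrative choices, not the only reasonable ones.

```python
import math
import random


def rule_of_thumb_bandwidth(sample):
    """Normal-reference bandwidth: 1.06 * sample sd * n^(-1/5)."""
    n = len(sample)
    mean = sum(sample) / n
    sd = math.sqrt(sum((x - mean) ** 2 for x in sample) / (n - 1))
    return 1.06 * sd * n ** (-0.2)


def gaussian_kde(sample, bandwidth):
    """Return a function that evaluates the Gaussian KDE of `sample`."""
    n = len(sample)
    norm = 1.0 / (n * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x):
        # Average of Gaussian bumps centred on the observations.
        return norm * sum(
            math.exp(-0.5 * ((x - xi) / bandwidth) ** 2) for xi in sample
        )

    return density


def true_pdf(x):
    """Standard normal density, known here only because we simulate."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


rng = random.Random(0)  # fixed seed so the run is reproducible
sample = [rng.gauss(0.0, 1.0) for _ in range(500)]

h = rule_of_thumb_bandwidth(sample)
kde = gaussian_kde(sample, h)

for x in (-2.0, -1.0, 0.0, 1.0, 2.0):
    print(f"x = {x:+.1f}  estimate = {kde(x):.3f}  true = {true_pdf(x):.3f}")
```

With real data there is no `true_pdf` to compare against, so the bandwidth has to be chosen and judged by other means, such as cross-validation.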

Another possibility is a parametric approach, where you assume a specific closed-form PDF/PMF/CDF, usually based on some theoretical consideration. For example, continuous variables that arise as a combination of many independent additive (resp. multiplicative) factors are often assumed to follow a normal (resp. log-normal) distribution. Random variables that count events, taking non-negative integer values with no practical upper bound, are often modeled using a Poisson distribution, and so forth.

In this case, the approximation "only" lies in estimating the parameters that fully specify the assumed theoretical distribution. This is often convenient for carrying out exact symbolic calculations. However, the assumptions are rarely (if ever) met exactly in practice, so the question becomes how much deviation between your model and reality you are willing to tolerate.
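As a rough illustration of the parametric route, the sketch below assumes a normal model and estimates its two parameters by maximum likelihood. The data are hypothetical: each observation is simulated as a sum of 50 independent Uniform(0, 1) terms, mimicking the "many additive factors" case. The threshold of 27 and the sample size of 2000 are likewise arbitrary choices made for the example.

```python
import random
import statistics

rng = random.Random(1)  # fixed seed so the run is reproducible

# Hypothetical data: each value is a sum of 50 independent Uniform(0, 1) terms.
# In theory the sum has mean 25 and standard deviation sqrt(50 / 12) ~ 2.04.
data = [sum(rng.random() for _ in range(50)) for _ in range(2000)]

# Maximum-likelihood estimates under the assumed normal model.
mu_hat = statistics.fmean(data)
sigma_hat = statistics.pstdev(data, mu=mu_hat)  # divides by n, as the MLE does

model = statistics.NormalDist(mu_hat, sigma_hat)

# Compare a model-based probability with the raw empirical fraction.
threshold = 27.0
empirical = sum(x <= threshold for x in data) / len(data)

print(f"mu_hat = {mu_hat:.3f}, sigma_hat = {sigma_hat:.3f}")
print(f"P(X <= {threshold}) from model = {model.cdf(threshold):.3f}")
print(f"P(X <= {threshold}) empirical  = {empirical:.3f}")
```

Here the normal assumption is nearly right by construction, so the two probabilities should agree closely. With real data, the gap between them is one crude way of seeing how far the model is from reality.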

$\endgroup$
2
  • $\begingroup$ Thank you. In the example I mentioned, assuming elections happen once every five years, we would have at most three or four data values (unless the candidate is a veteran politician). What can one do in such situations in practice? $\endgroup$ Commented Oct 6 at 13:28
  • 1
    $\begingroup$ @Ashok In theory, kernel-based density estimation methods (for example) can work with few data points. However, there's nothing special about density estimation in this regard: if you only have 3-4 data points, there is hardly anything you can model well enough to be useful. In the vast majority of cases, it's practically useless to try to fit models to so little data. $\endgroup$ Commented Oct 6 at 13:41
