2
$\begingroup$

I am neither a mathematician nor a statistician, but I try to use statistics in my work, so I will do my best to explain the problem I have.

I have a map covering millions of hectares that consists of nine predicted categories, and a second map that serves as the reference (the one I consider to be the truth). For the predicted map, I want to know the correctly classified area ($\hat{t}$), the proportion of the area that is correctly classified ($\hat{p}$), and their 95% confidence intervals.

To evaluate the uncertainty of the predicted map, the area was split into $N$ small segments, of which $n$ were sampled using stratified simple random sampling. The first issue is that the segments differ in size (between 10 and 25 hectares), but every segment within a stratum had the same sampling probability. So, within each stratum the inclusion probability was $\pi_{hi} = n_h/N_h$, where $\pi_{hi}$ is the inclusion probability of segment $i$ in stratum $h$, and $n_h$ and $N_h$ are the numbers of sampled and total segments in that stratum. This design is given; there is no chance to change it.

A segment $i$ in stratum $h$ is correctly classified when the categories of the predicted and reference maps are equal. Then $y_{hi} = a_{hi}$ when the segment is correctly classified and $y_{hi} = 0$ when it is wrongly classified, where $a_{hi}$ is the area of the segment. Then $\hat{t}_h = \sum_{i=1}^{n_h} y_{hi}/\pi_{hi}$ and $\hat{t} = \sum_{h=1}^{L} \hat{t}_h$. The proportions are $\hat{p}_h = \hat{t}_h / A_h$, where $A_h$ is the area of stratum $h$.
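As a minimal sketch of the estimator defined above, the following R snippet uses a toy stratum with made-up values (`N_h`, `A_h`, `a`, and `ok` are hypothetical and not taken from the real maps). It illustrates that, with equal inclusion probabilities within a stratum, the Horvitz-Thompson sum of $y_{hi}/\pi_{hi}$ is the same as $N_h$ times the sample mean of $y_{hi}$.

```r
# Toy stratum: all values are hypothetical, chosen only for illustration
N_h <- 1000                     # total number of segments in the stratum
A_h <- 16000                    # total area of the stratum (ha)
a   <- c(12, 18, 10, 25, 15)    # areas (ha) of the sampled segments
ok  <- c(1, 0, 1, 1, 0)         # 1 = predicted and reference categories match
n_h <- length(a)                # number of sampled segments

y    <- a * ok                  # y_hi = a_hi if correct, 0 otherwise
pi_h <- n_h / N_h               # equal inclusion probability within the stratum

t_hat_ht   <- sum(y / pi_h)     # Horvitz-Thompson form of the estimated total
t_hat_mean <- N_h * mean(y)     # equivalent expansion form: N_h * ybar
p_hat      <- t_hat_ht / A_h    # estimated proportion of area correctly classified

print(c(t_hat_ht = t_hat_ht, t_hat_mean = t_hat_mean, p_hat = p_hat))
```

Because the two forms coincide, the variance of $\hat{t}_h$ can be reasoned about as the variance of a scaled sample mean.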

So far so good.

To get the 95% CI, I apply this equation (which I assume is correct):

$CI_h = \hat{t}_h \pm t_{\alpha/2,\, df=n_h-1} \cdot \sqrt{var(\hat{t}_h)}$

My problem is: is the variance of $\hat{t}_h$ given by $var(\hat{t}_h) = var(y_{hi})/n_h$, or by $var(\hat{t}_h) = var(y_{hi}/\pi_{hi})$?

The first gives nonsensical values. The second is more reasonable, but still quite low, I think. Here is a reproducible example to run in R:

####### for R #############################################################
# this is an example to run for a stratum
set.seed(seed = 1234)
A = 2500000 # area of the stratum
n = 2300     # number of sampled segments within the stratum (by simple random sampling)
N = 175000  # total number of segments within the stratum
a_i = rnorm(n = 2300, mean = 14, sd = 4) # area of the i segments for i = 1, 2, ..., n.
index = rbinom(n = 2300,size = 1, prob = .7) # assume 70% of the segments well classified (1 means reference map and predicted map matched)
y_i = a_i * index # y_i = 0 for wrongly classified; y_i = a_i segment correctly classified
(pi_i = n/N)      # probability of inclusion (equal for all segments within the stratum, since their areas (a_i) were neglected)
#[1] 0.01314286
(hatt = sum(y_i/pi_i)) # area well classified
#[1] 1734970
(hatp = hatt/A) # proportion of area well classified
#[1] 0.693988

# Option 1
(var_hatt1 = var(y_i)/n) # variance of hatt (too small, isn't it?)
#[1] 0.02278085
# I think here is the mistake in the equation. I think it should be
# Option 2
(var_hatt2 = var(y_i / pi_i)) # based on my common sense (probably wrong?)
#[1] 303332

# Confidence intervals with option 1
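# note: df = 15 - 1 is kept as originally run; the CI formula above implies df = n - 1
(error = qt(0.975, df = 15 - 1) * sqrt(var_hatt1)) # half-width of the confidence interval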
#[1] 0.3237197
(hatt.ul.CI = hatt + error) # area upper limit confidence interval
#[1] 1734970
(hatt.ll.CI = hatt - error) # area lower limit confidence interval
#[1] 1734970
(hatt.ul.CI/A) # proportions
#[1] 0.6939882
(hatt.ll.CI/A)
#[1] 0.6939879

# Confidence intervals with option 2
(error = qt(0.975, df = 15 - 1) * sqrt(var_hatt2)) # half-width of the confidence interval
#[1] 1181.254
(hatt.ul.CI = hatt + error) # area upper limit confidence interval
#[1] 1736151
(hatt.ll.CI = hatt - error) # area lower limit confidence interval
#[1] 1733789
(hatt.ul.CI/A) # proportions
#[1] 0.6944605
(hatt.ll.CI/A)
#[1] 0.6935155
##############################################################################

Is either option (Option 1 or Option 2) correct? Can you see anything else that looks wrong?

Should I compute a prediction interval instead of a confidence interval? If so, how?

$\endgroup$
2
  • $\begingroup$ Could you calculate the confidence intervals using options 1 and 2 to compare them? $\endgroup$ Commented Jun 26, 2019 at 12:11
  • 1
    $\begingroup$ I added a new piece of code to make it clearer $\endgroup$ Commented Jun 26, 2019 at 12:34

1 Answer

0
$\begingroup$

The correct equation for $var(\hat{t}_h)$ is

$$ var(\hat{t}_h) = N^2 \, var(y_i)/n $$

That solved the problem.
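To see where this form comes from, note that with equal inclusion probabilities $\pi_i = n/N$ the estimator is $\hat{t}_h = (N/n)\sum y_i = N\bar{y}$, so its variance is $N^2$ times the variance of the sample mean, $var(y_i)/n$. The sketch below reruns the question's simulated stratum with that formula. It is an illustration rather than part of the original answer: the finite population correction $(1 - n/N)$ and the use of `df = n - 1` are assumptions added here, following the standard result for simple random sampling without replacement.

```r
# Simulated stratum, same setup as in the question
set.seed(1234)
A <- 2500000   # area of the stratum (ha)
n <- 2300      # number of sampled segments
N <- 175000    # total number of segments in the stratum
a_i   <- rnorm(n, mean = 14, sd = 4)       # areas of the sampled segments
index <- rbinom(n, size = 1, prob = 0.7)   # 1 = correctly classified
y_i   <- a_i * index                       # a_i if correct, 0 otherwise
pi_i  <- n / N                             # equal inclusion probability

hatt <- sum(y_i / pi_i)                    # estimated correctly classified area

# Variance of the estimated total: N^2 * var(y) / n
var_hatt <- N^2 * var(y_i) / n

# Assumption (not in the original answer): finite population correction
var_hatt_fpc <- var_hatt * (1 - n / N)

# Option 2 from the question is var(y_i / pi_i) = (N / n)^2 * var(y_i),
# i.e. the variance above divided by n, which is why it looked too small
var_opt2 <- var(y_i / pi_i)

# 95% confidence interval, assuming df = n - 1 as in the question's formula
half_width <- qt(0.975, df = n - 1) * sqrt(var_hatt)
ci_area <- c(lower = hatt - half_width, upper = hatt + half_width)
ci_prop <- ci_area / A                     # same interval on the proportion scale

print(ci_area)
print(ci_prop)
```

Taking the question's reported `var(y_i)/n` of 0.02278085, the half-width works out to roughly 52,000 ha, or about 0.02 on the proportion scale, which is far wider than either interval computed in the question.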

$\endgroup$
