I think a different approach is needed to achieve better performance. The current approach recomputes the Jaccard similarity from scratch for each possible threshold value. However, going from one threshold to the next, only a small fraction of predictions change, and with them the intersection and the union. Therefore a lot of unnecessary computation is performed.

A better approach first computes, for each patch, a histogram of the prediction probabilities where ground truth = 1, using the thresholds as bin edges. The frequency count in each bin gives the number of predictions that change when going from one threshold to the next. Therefore, the Jaccard similarity values for all thresholds can be computed directly from cumulative frequency counts derived from the histogram, as sketched below.
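A minimal sketch of this idea in NumPy, assuming an ascending array of candidate `thresholds` is given explicitly and that `probs` and `labels` are the flattened predictions and ground truth for one patch (the function and argument names are illustrative):

```python
import numpy as np

def jaccard_per_threshold(probs, labels, thresholds):
    # Bin edges: each threshold opens a new bin; everything >= the last
    # threshold falls into the final bin. `thresholds` is assumed ascending.
    edges = np.concatenate([np.asarray(thresholds, dtype=float), [np.inf]])
    # Count positives (label == 1) and all predictions per bin.
    pos_per_bin, _ = np.histogram(probs[labels == 1], bins=edges)
    all_per_bin, _ = np.histogram(probs, bins=edges)
    # Predictions at threshold t_k are those with prob >= t_k,
    # i.e. the suffix sums of the per-bin counts.
    tp = np.cumsum(pos_per_bin[::-1])[::-1]         # |Label ∩ Predicted|
    predicted = np.cumsum(all_per_bin[::-1])[::-1]  # |Predicted|
    total_pos = labels.sum()                        # |Label|
    return tp / (total_pos + predicted - tp)        # Jaccard per threshold
```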

In your program, the prediction probabilities themselves are used as thresholds. The histogram therefore coincides with the inputs sorted by probability. Consider the following example of input probabilities and true labels:

Labels   1    1    0    0    1     0    0    0
Prob     0.9  0.8  0.7  0.6  0.45  0.4  0.2  0.1

The labels themselves are also the counts of positive instances within each interval. Given a threshold $t$ and its index $i$, $|Label \cap Predicted|$ is just the sum of the labels at indices $\leq i$, i.e. the cumulative sum of the labels up to index $i$. Also note that $|Predicted| = i + 1$ and $|Label|$ is the total count of positive instances. Therefore the Jaccard similarity

$$ \begin{align*} Jaccard(Label, Predicted) & = \frac{|Label \cap Predicted|}{|Label \cup Predicted|} \\ & = \frac{|Label \cap Predicted|}{|Label| + |Predicted| - |Label \cap Predicted|} \\ & = \frac{\operatorname{cumsum}(Label, i)}{(\text{\# of positive instances}) + i + 1 - \operatorname{cumsum}(Label, i)} \end{align*} $$

This computation can easily be vectorized over all values of $i$ to get a vector of Jaccard similarities, one per threshold.
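As a sketch of that vectorized formula, assuming NumPy arrays `probs` and `labels` as above (the function name is illustrative), the whole computation reduces to a sort and a cumulative sum:

```python
import numpy as np

def jaccard_all_thresholds(probs, labels):
    order = np.argsort(probs)[::-1]           # sort by probability, descending
    labels_sorted = labels[order]
    tp = np.cumsum(labels_sorted)             # |Label ∩ Predicted| for each i
    predicted = np.arange(1, len(probs) + 1)  # |Predicted| = i + 1
    total_pos = labels.sum()                  # |Label|
    return tp / (total_pos + predicted - tp)

# The example above:
labels = np.array([1, 1, 0, 0, 1, 0, 0, 0])
probs = np.array([0.9, 0.8, 0.7, 0.6, 0.45, 0.4, 0.2, 0.1])
print(jaccard_all_thresholds(probs, labels))
# approximately [0.333 0.667 0.5 0.4 0.6 0.5 0.429 0.375]
```

The returned vector is ordered by decreasing threshold, so picking the best threshold is a single `argmax` over it.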
