Talk:Additive smoothing


"Add-one" vs. "additive"

The entry claims that Jurafsky and Martin use the term "additive smoothing", but they actually use the term "add-one smoothing" (as do Russell/Norvig). It would be worth seeing which term Manning/Schütze actually use, but I would have to get the book out of the library to find out. -AlanUS (talk) 18:22, 9 October 2011 (UTC)

Proposed merge with Pseudocount

Same concept. QVVERTYVS (hm?) 15:31, 8 May 2014 (UTC)

Not really the same; rather, the pseudocount is used as a tool in additive smoothing, so it doesn't have independent notability. So, agree that it's worth merging. Klbrain (talk) 18:13, 5 April 2017 (UTC)

Merger proposal

We should merge Bayesian_average with this page. The methods are the same, and only the interpretations differ. Bscan (talk) 16:26, 20 July 2018 (UTC)

Weak oppose on the grounds that the application is sufficiently different to warrant separate discussion. Klbrain (talk) 07:04, 16 August 2019 (UTC)

Disadvantages

The issues with this method, as outlined in Ken Church's paper "What's wrong with adding one", should be discussed here: https://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.134.2237 - Francis Tyers · 04:47, 14 February 2020 (UTC)

Slightly agree. Is there any reason you don't start a small section about this yourself? BernardoSulzbach (talk) 04:50, 14 February 2020 (UTC)

Overly technical

This article is all jargon and inside baseball, without one explicit statement of the obvious and important point that 0/5 is a different strength of evidence than 0/500, which add-one smoothing is supposed to help address. 2601:647:CD02:88A0:5560:1900:364C:CFB2 (talk) 03:01, 9 October 2023 (UTC)

Simple improved pseudocount for empty bins using empirical Bayesian inference

Conventional additive smoothing adds $1$ (Laplace) or $\tfrac{1}{2}$ (Jeffreys). Both are "non-informative" priors. We can improve on these by applying rudimentary empirical Bayes to estimate a symmetric Dirichlet prior, using the Good-Turing insight that the singleton count ($B_1$) and the empty-bin count ($B_0$) are the most valuable pieces of information for estimating the unseen mass in empty bins, as demonstrated by Gale & Sampson (1995). We can use this insight about singletons to empirically determine the best prior parameter $\alpha$, as shown below.

  • Conventional pseudocount smoothing with a symmetric Dirichlet prior (Lidstone): $p_i = \frac{c_i + \alpha}{N + B_t \alpha}$
  • Good-Turing smoothing for an empty bin: $p_0 = \frac{B_1}{N B_0}$. (Note that this is the probability for a single empty bin, not the cumulative probability for all empty bins. Here I impose an exchangeability assumption by dividing the unseen mass ($B_1/N$) equally between the $B_0$ empty bins.)

where:
$\alpha$ = tunable pseudocount; Jeffreys assumed $\alpha = \tfrac{1}{2}$
$B_r$ = number of bins containing count $r$; e.g. $B_0$ is the number of empty bins, $B_1$ the number of singleton bins
$B_t$ = total number of bins
$c_i$ = count in bin $i$
$N$ = sum of all observed counts in all bins

We want to tune the value of $\alpha$ such that the two expressions above are equal for empty bins:
$\frac{0 + \alpha}{N + B_t \alpha} = \frac{B_1}{N B_0}$, where $c_i = 0$ in empty bins
$\alpha = \frac{B_1 N}{B_0 N - B_1 B_t}$, obtained by solving the matching condition above for $\alpha$ (see the worked derivation below)
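
For completeness, the intermediate algebra (my own working, using only the definitions above): cross-multiplying the matching condition and collecting the $\alpha$ terms gives
$\alpha N B_0 = B_1 (N + B_t \alpha) \;\Rightarrow\; \alpha (B_0 N - B_1 B_t) = B_1 N \;\Rightarrow\; \alpha = \frac{B_1 N}{B_0 N - B_1 B_t}$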

Edge cases:

  • When the denominator $B_0 N - B_1 B_t$ becomes zero or negative, or if $\alpha > 1$, it is an indication that the true distribution is close to uniform; I would recommend limiting the prior to Laplace's $\alpha = 1$ in such situations.
  • When $B_1 = 0$, this method fails, in which case you can fall back on more advanced methods: either (i) Simple Good-Turing (SGT) regression, giving a rigorous data-driven estimate of the unseen mass even when $B_1 = 0$, or (ii) Minka's maximum-likelihood estimate by fixed-point iteration.
  • Incidentally, as $N \to \infty$ (i.e. more Monte Carlo trials) with a fixed number of bins, such that $N \gg B_t$, the value of $\alpha$ asymptotically approaches $B_1 / B_0$, though below I choose not to use this asymptotic simplification; see the worked example after this list.
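
As an illustration, take some hypothetical numbers (chosen only for arithmetic convenience, not from any actual simulation): $B_t = 1000$ bins, $N = 5000$ total counts, $B_0 = 200$ empty bins and $B_1 = 100$ singletons. Then
$\alpha = \frac{B_1 N}{B_0 N - B_1 B_t} = \frac{100 \cdot 5000}{200 \cdot 5000 - 100 \cdot 1000} = \frac{500000}{900000} \approx 0.556$,
compared with the asymptotic value $B_1/B_0 = 0.5$. Each empty bin then gets $p_0 = \frac{0 + \alpha}{N + B_t \alpha} = \frac{5/9}{5000 + 1000 \cdot 5/9} = 10^{-4}$, which matches the Good-Turing target $\frac{B_1}{N B_0} = \frac{100}{5000 \cdot 200} = 10^{-4}$ exactly, as it must by construction.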

Pseudocode (VBA-style) showing how to implement this:

Const B_t As Long = 1000 '1000 bins
Dim N As Long, B_0 As Long, B_1 As Long 'long integers
Dim a As Double 'double-precision pseudocount
Dim denom As Double 'denominator
Dim i As Long 'bin index
Dim c(1 To B_t) As Long 'input discrete count histogram from Monte Carlo simulation
Dim p(1 To B_t) As Double 'output smoothed probability
'... insert code here to precalculate c() with values from the MC simulation, then count N, B_0 and B_1 ...
If B_0 = 0 Then
   a = 0 'Edge case: no unseen mass, so no need to add a pseudocount
ElseIf B_1 = 0 Then
   a = 1E-15 'Edge case: should strictly use SGT or Minka's estimation instead!
Else
   denom = CDbl(B_0) * N - CDbl(B_1) * B_t 'CDbl avoids 32-bit integer overflow in the products
   If denom <= 0 Then
      a = 1 'Edge case: distribution close to uniform
   Else
      a = CDbl(B_1) * N / denom 'additive pseudocount (empirical prior value)
      If a > 1 Then a = 1 'Edge case: limit to max 1, to prevent overzealous smoothing (without more information)
   End If
End If
For i = 1 To B_t 'loop through all bins
   p(i) = (c(i) + a) / (N + B_t * a) 'pseudocount smoothing; ensures all empty bins have non-zero probability
Next i
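
As a quick sanity check, one could append the following (my own addition, in the same VBA-style pseudocode; Debug.Print assumes a VBA-like host). The smoothed probabilities must sum to 1, since the denominator $N + B_t \alpha$ equals the sum of all pseudocount-adjusted counts:

Dim sumP As Double 'accumulator for the total probability mass
For i = 1 To B_t
   sumP = sumP + p(i) 'add up all smoothed bin probabilities
Next i
Debug.Print sumP 'should print 1 (up to floating-point rounding)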

Peter.schild (talk) 11:31, 2 October 2025 (UTC)