3.4 Smoothing

DEF Smoothing / discounting means shaving off some probability mass from frequent events and giving it to unseen events.

  • Laplace smoothing / add-one smoothing increases every word count by 1. Therefore $P_L(w_i)=\frac{\mathrm{count}(w_i)+1}{N+|V|}$, where $N$ is the total number of word tokens and $|V|$ is the vocabulary size.

    • adjusted count $c_i^*$ is a virtual count measuring the effect of a smoothing algorithm. Normalizing the adjusted count by $N$ gives the smoothed probability. In the case of Laplace smoothing, $c_i^* = N P_L(w_i) = (c_i+1)\frac{N}{N+|V|}$.

    • discount is defined as the ratio of the adjusted count to the original count: $d_c = c^*/c$.

  • Add-k smoothing increases every word count by $k$. Therefore $P_k(w_i) = \frac{c_i+k}{N+k|V|}$.
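The definitions above can be sketched in a few lines of Python. This is only an illustration, not a reference implementation: the function names (`add_k_probs`, `adjusted_counts`, `discounts`) and the toy corpus are made up for this example, and the sketch assumes a unigram model whose vocabulary contains every word in the counts.

```python
from collections import Counter


def add_k_probs(counts, vocab, k=1.0):
    """Add-k smoothed unigram probabilities; k=1 gives Laplace smoothing.

    Assumes every counted word is in `vocab` (hypothetical helper).
    """
    n = sum(counts.values())        # N: total number of tokens
    denom = n + k * len(vocab)      # N + k|V|
    return {w: (counts.get(w, 0) + k) / denom for w in vocab}


def adjusted_counts(counts, vocab, k=1.0):
    """Adjusted count c* = N * P_k(w) for every word in the vocabulary."""
    n = sum(counts.values())
    return {w: n * p for w, p in add_k_probs(counts, vocab, k).items()}


def discounts(counts, vocab, k=1.0):
    """Discount d_c = c* / c, defined only for words with c > 0."""
    adj = adjusted_counts(counts, vocab, k)
    return {w: adj[w] / counts[w] for w in vocab if counts.get(w, 0) > 0}


# Toy corpus (invented for illustration): N = 6 tokens, |V| = 4, "d" is unseen.
counts = Counter("a a a b b c".split())
vocab = {"a", "b", "c", "d"}

laplace = add_k_probs(counts, vocab, k=1.0)
adjusted = adjusted_counts(counts, vocab, k=1.0)
ratios = discounts(counts, vocab, k=1.0)
```

In this toy setting the unseen word "d" receives probability 1/10, the frequent word "a" is discounted to 0.8 of its original count, and the rare word "c" actually gains mass (ratio 1.2), which shows how add-one shifts mass away from frequent events.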

It turns out that add-k still doesn't work well for language modeling: it generates counts with poor variances and often inappropriate discounts (ref: [1]).
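A small worked example may help show what an inappropriate discount looks like. The numbers here are hypothetical, chosen only to make the arithmetic easy, and are not taken from [1]. Suppose $N = 100$, $|V| = 1000$, and a word has count $c = 10$. Using the adjusted-count formula with add-one and with add-k for $k = 0.01$:

$$
c^*_{k=1} = (10+1)\frac{100}{100+1000} = 1.0, \qquad d_c = \frac{1.0}{10} = 0.1
$$

$$
c^*_{k=0.01} = (10+0.01)\frac{100}{100+0.01\cdot 1000} = 9.1, \qquad d_c = \frac{9.1}{10} = 0.91
$$

Under these assumed numbers, add-one takes away 90% of the word's count, while a small $k$ takes away far less, so the result is very sensitive to the choice of $k$.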

Reference

[1] Gale, W. A. and Church, K. W. (1994). What is wrong with adding one? In Oostdijk, N. and de Haan, P. (Eds.), Corpus-Based Research into Language, 189–198. Rodopi.
