3.3 Generalization and Zeros

  • N-gram model performs better with increase of N.

  • Performance of n-gram models can be visualized by random sentence generation.

  • Outputs of n-gram models are influenced by genre and dialect of the training corpora.

  • Problem of sparsity: some valid combination of words may not appear in training set.

DEF zeros are things that do not occur in the training set but do occur in the test set.

3.3.1 Unknown Words

DEF In a closed vocabulary system we know all the words that can occur.

DEF Unknown / out of vocabulary / OOV words are words the model haven't seen before.

  • OOV rate is the percentage of OOV words in the test set.

  • In a open vocabulary system, OOV words are modeled by <UNK>.

  • 2 ways to train the probabilities of <UNK>:

    • Before training, choose a fixed vocabulary, convert in the training set any word that is not in the vocabulary to <UNK>, and then treat <UNK> like other words.

    • Before training, convert in the training set any word with a small enough frequency to <UNK>, and then treat <UNK> like other words.

A language model can achieve low perplexity by choosing a small vocabulary and assigning the <UNK> a high probability, so perplexities should only be compared across models with the same vocabularies.

Last updated

Was this helpful?