# 3.3 Generalization and Zeros

* N-gram models perform better as N increases.
* The behavior of an n-gram model can be visualized by generating random sentences from it.
* The output of an n-gram model is strongly influenced by the genre and dialect of its training corpus.
* The problem of **sparsity**: some perfectly valid combinations of words may never appear in the training set.

**DEF** **Zeros** are n-grams that do not occur in the training set but do occur in the test set.
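A minimal sketch of the problem, using a hypothetical toy corpus: an unsmoothed maximum-likelihood bigram model assigns probability zero to any bigram it never saw in training, even if the bigram is perfectly valid.

```python
from collections import Counter

# Hypothetical toy training corpus of bigrams; "the offer" never occurs
# in training, but it may well occur in a test set -- a "zero".
train_bigrams = Counter([
    ("denied", "the"), ("the", "allegations"),
    ("denied", "the"), ("the", "reports"),
])

def bigram_prob(w1, w2):
    """Unsmoothed MLE estimate: count(w1 w2) / count(w1 *)."""
    total = sum(c for (a, _), c in train_bigrams.items() if a == w1)
    return train_bigrams[(w1, w2)] / total if total else 0.0

print(bigram_prob("the", "allegations"))  # 0.5 (seen in training)
print(bigram_prob("the", "offer"))        # 0.0 -- a zero
```

A single zero is fatal for evaluation: the probability of the whole test sentence becomes zero, and its perplexity is undefined.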

## **3.3.1 Unknown Words**

**DEF** In a **closed vocabulary** system we know all the words that can occur.

**DEF** **Unknown / out of vocabulary / OOV** words are words the model hasn't seen before.

* **OOV rate** is the percentage of OOV words in the test set.
* In an **open vocabulary** system, OOV words are modeled by a special token `<UNK>`.
* Two ways to train the probabilities of `<UNK>`:
  * Before training, choose a fixed vocabulary, convert any word in the training set that is not in the vocabulary to `<UNK>`, and then treat `<UNK>` like any other word.
  * Before training, convert any word in the training set whose frequency is below some threshold to `<UNK>`, and then treat `<UNK>` like any other word.
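The second strategy can be sketched in a few lines (the threshold `min_count=2` is an arbitrary choice for illustration):

```python
from collections import Counter

def unkify(tokens, min_count=2):
    """Map every word whose training frequency is below min_count to <UNK>."""
    counts = Counter(tokens)
    return [w if counts[w] >= min_count else "<UNK>" for w in tokens]

train = ["the", "cat", "sat", "the", "cat", "ran", "quickly"]
print(unkify(train))
# ['the', 'cat', '<UNK>', 'the', 'cat', '<UNK>', '<UNK>']
```

After this preprocessing, n-gram counts are collected as usual, and any OOV word encountered at test time is simply mapped to `<UNK>` before lookup.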

A language model can achieve artificially low perplexity by choosing a small vocabulary and assigning `<UNK>` a high probability, so perplexities should only be compared across models with the same vocabulary.

