# 3.1 N-Grams

**DEF** Our task is calculating $$P(w|h)$$ , where $$w$$ is a word and $$h$$ is a text history. For example $$P(\mathrm{apple}|\mathrm{I\ have\ an})$$ . To formalize the problem, treat every word in a sequence as a random variable $$X\_i$$ . The probability of $$X\_i$$ taking on a value $$w\_i$$ is written as $$P(X\_i=w\_i)$$ . $$w\_1^n$$ represents a sequence of N words $$w\_1\dots w\_n$$ . $$P(w\_1,\dots,w\_n)$$ represents the joint probability $$P(X\_1=w\_1,\dots,X\_n=w\_n)$$ .

* So we have $$P(w\_1^n) = \prod \_{k=1}^nP(w\_k|w\_1^{k-1})$$&#x20;
* 这一条件概率难以统计获得，因为自然语言的特质就是要不断产生新文本，相同的文字序列几乎不可能重复出现。

**LM** **bigram** model makes the following approximation (**Markov** assumption): $$P(w\_n|w\_1^{n-1}) \approx P(w\_n|w\_{n-1})$$

* As a generalization, in **n-gram:** $$P(w\_n|w\_1^{n-1}) \approx P(w\_n|w\_{n-N+1}^{n-1})$$

  and in **trigram** model N takes on the value of 3.

  * To estimate the probability, we use **maximum likelihood estimation / MLE**. We count its frequency and **normalize** it to a value between 0 and 1: $$P(w\_n|w\_{n-1}) = \frac{C(w\_{n-1}w\_n)}{\sum\_w C(w\_{n-1}w)}= \frac{C(w\_{n-1}w\_n)}{C(w)}$$&#x20;
  * For the general case of MLE n-gram parameter estimation: $$P(w\_n|w\_{n-N+1}^{n-1}) = \frac{C(w\_{n-N+1}^{n-1}w\_n)}{C(w\_{n-N+1}^{n-1})}$$&#x20;
* How do we understand MLE?
* If a word occurs k times in a corpus of size n, then MLE p=k/n is the probability that makes it **most likely** that the word will occur k times in a corpus of size n. (may be under assumption that all words in a corpus take on values independently?)


---

# Agent Instructions: Querying This Documentation

If you need additional information that is not directly available in this page, you can query the documentation dynamically by asking a question.

Perform an HTTP GET request on the current page URL with the `ask` query parameter:

```
GET https://deemolover.gitbook.io/log-os/main/theory/language-processing/chapter-3-n-gram-language-models/3.1-n-grams.md?ask=<question>
```

The question should be specific, self-contained, and written in natural language.
The response will contain a direct answer to the question and relevant excerpts and sources from the documentation.

Use this mechanism when the answer is not explicitly present in the current page, you need clarification or additional context, or you want to retrieve related documentation sections.
