3.1 N-Grams
DEF Our task is to compute $P(w \mid h)$, where $w$ is a word and $h$ is a history of preceding text, for example $P(\text{apple} \mid \text{I have an})$. To formalize the problem, treat every word in a sequence as a random variable $X_i$. The probability of $X_i$ taking on the value $w_i$ is written $P(X_i = w_i)$. $w_1^n$ denotes a sequence of $n$ words $w_1 \ldots w_n$, and $P(w_1, \ldots, w_n)$ denotes the joint probability $P(X_1 = w_1, \ldots, X_n = w_n)$.
So, by the chain rule of probability, we have $P(w_1^n) = \prod_{k=1}^{n} P(w_k \mid w_1^{k-1})$
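For example, applying the chain rule to the sentence from the definition above (a worked expansion, just to make the product concrete):

$$
P(\text{I have an apple}) = P(\text{I})\,P(\text{have} \mid \text{I})\,P(\text{an} \mid \text{I have})\,P(\text{apple} \mid \text{I have an})
$$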
This conditional probability is hard to estimate from counts, because language is creative and constantly produces new text, so the same long word sequence almost never recurs.
The bigram language model makes the following approximation (the Markov assumption): $P(w_n \mid w_1^{n-1}) \approx P(w_n \mid w_{n-1})$
As a generalization, an n-gram model conditions only on the previous $N-1$ words: $P(w_n \mid w_1^{n-1}) \approx P(w_n \mid w_{n-N+1}^{n-1})$,
and in the trigram model $N$ takes the value 3.
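To make the approximation concrete, here is a minimal Python sketch of truncating the history to its last $N-1$ words (the function name `ngram_context` and the toy history are purely illustrative, not from the text):

```python
def ngram_context(history, N):
    """Keep only the last N-1 words of the history (Markov assumption)."""
    return history[-(N - 1):] if N > 1 else []

history = ["I", "have", "an"]
print(ngram_context(history, 2))  # bigram:  ['an']
print(ngram_context(history, 3))  # trigram: ['have', 'an']
```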
To estimate these probabilities, we use maximum likelihood estimation (MLE): count how often each n-gram occurs and normalize the count to a value between 0 and 1. For the bigram case: $P(w_n \mid w_{n-1}) = \frac{C(w_{n-1} w_n)}{\sum_{w} C(w_{n-1} w)} = \frac{C(w_{n-1} w_n)}{C(w_{n-1})}$
For the general case of MLE n-gram parameter estimation: $P(w_n \mid w_{n-N+1}^{n-1}) = \frac{C(w_{n-N+1}^{n-1}\, w_n)}{C(w_{n-N+1}^{n-1})}$
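As a rough sketch of this count-and-normalize idea for bigrams (the toy corpus, the function `bigram_mle`, and the variable names are my own, for illustration only):

```python
from collections import Counter

# Toy corpus; in practice this would be a large tokenized text.
corpus = "i have an apple i have an orange i have a pen".split()

unigram_counts = Counter(corpus)
bigram_counts = Counter(zip(corpus, corpus[1:]))

def bigram_mle(prev, word):
    """MLE estimate P(word | prev) = C(prev word) / C(prev)."""
    if unigram_counts[prev] == 0:
        return 0.0
    return bigram_counts[(prev, word)] / unigram_counts[prev]

print(bigram_mle("an", "apple"))  # 0.5: "an" is followed by "apple" once out of 2
print(bigram_mle("i", "have"))    # 1.0: "i" is always followed by "have"
```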
How do we understand MLE?
If a word occurs $k$ times in a corpus of size $n$, then the MLE $p = k/n$ is the value of $p$ that makes it most likely to observe the word exactly $k$ times in a corpus of size $n$ (this holds under the assumption that tokens are drawn independently, so the word's count follows a binomial distribution).
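A quick numeric check of this claim under that independence (binomial) assumption; the values $k=3$, $n=10$ are arbitrary:

```python
from math import comb

k, n = 3, 10  # the word is observed k times in a corpus of n tokens

def likelihood(p):
    """Probability of seeing the word exactly k times in n independent draws."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

# Scan candidate probabilities; the likelihood peaks at p = k/n = 0.3.
candidates = [i / 100 for i in range(1, 100)]
print(max(candidates, key=likelihood))  # 0.3
```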