3.1 N-Grams

DEF Our task is calculating $P(w \mid h)$, where $w$ is a word and $h$ is a text history; for example, $P(\mathrm{apple} \mid \mathrm{I\ have\ an})$. To formalize the problem, treat every word in a sequence as a random variable $X_i$. The probability of $X_i$ taking on a value $w_i$ is written as $P(X_i = w_i)$. $w_1^n$ represents a sequence of $n$ words $w_1 \dots w_n$, and $P(w_1, \dots, w_n)$ represents the joint probability $P(X_1 = w_1, \dots, X_n = w_n)$.

  • So by the chain rule of probability, $P(w_1^n) = \prod_{k=1}^{n} P(w_k \mid w_1^{k-1})$.

  • This conditional probability is hard to obtain by counting, because it is in the nature of natural language to keep producing new text: the same word sequence almost never occurs twice.

The bigram language model makes the following approximation (the Markov assumption): $P(w_n \mid w_1^{n-1}) \approx P(w_n \mid w_{n-1})$

  • As a generalization, the n-gram model approximates $P(w_n \mid w_1^{n-1}) \approx P(w_n \mid w_{n-N+1}^{n-1})$

    and in the trigram model $N$ takes the value 3.

    • To estimate this probability, we use maximum likelihood estimation (MLE): we count the bigram's frequency and normalize it to a value between 0 and 1 by dividing by the count of its context word (see the code sketch after this list): $P(w_n \mid w_{n-1}) = \frac{C(w_{n-1}w_n)}{\sum_w C(w_{n-1}w)} = \frac{C(w_{n-1}w_n)}{C(w_{n-1})}$

    • For the general case of MLE n-gram parameter estimation: $P(w_n \mid w_{n-N+1}^{n-1}) = \frac{C(w_{n-N+1}^{n-1}\,w_n)}{C(w_{n-N+1}^{n-1})}$

  • How do we understand MLE?

  • If a word occurs $k$ times in a corpus of size $n$, then the MLE $\hat{p} = k/n$ is the value of $p$ that makes it most likely that the word would occur exactly $k$ times in a corpus of size $n$ (presumably under the assumption that all word occurrences in the corpus are independent? See the derivation below).
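
The two estimation formulas above amount to counting n-grams and dividing by context counts. Here is a minimal Python sketch of that procedure for the bigram case (my own illustration, not from the source): it trains MLE bigram probabilities on a toy corpus and then scores a sentence as a product of bigram probabilities, i.e., the chain rule under the Markov assumption. The toy corpus, the function names, and the `<s>`/`</s>` sentence-boundary markers are assumptions made for the example.

```python
from collections import Counter

def train_bigram_mle(sentences):
    """Estimate P(w_n | w_{n-1}) by MLE: count each bigram
    and normalize by the count of its context word w_{n-1}."""
    bigram_counts = Counter()
    context_counts = Counter()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        for prev, cur in zip(tokens, tokens[1:]):
            bigram_counts[(prev, cur)] += 1
            context_counts[prev] += 1  # C(w_{n-1}) = sum_w C(w_{n-1} w)
    return {bg: c / context_counts[bg[0]] for bg, c in bigram_counts.items()}

def sentence_probability(probs, sentence):
    """Score a sentence as the product of its bigram probabilities:
    the chain rule under the bigram Markov assumption."""
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    p = 1.0
    for prev, cur in zip(tokens, tokens[1:]):
        p *= probs.get((prev, cur), 0.0)  # unseen bigram -> MLE gives 0
    return p

corpus = ["I have an apple", "I have a pen"]  # toy corpus (assumption)
probs = train_bigram_mle(corpus)
print(probs[("have", "an")])                           # C(have an)/C(have) = 0.5
print(sentence_probability(probs, "I have an apple"))  # 0.5
```

Note the design consequence visible in `sentence_probability`: MLE assigns probability 0 to any bigram not seen in training, which is the usual motivation for smoothing.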
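
As for the claim that $p = k/n$ is the maximizer, here is a short derivation under the independence assumption hedged above, so that the count $k$ of the word in $n$ draws is binomially distributed (the binomial framing is mine, not the source's):

$$L(p) = \binom{n}{k} p^k (1-p)^{n-k}, \qquad \log L(p) = \mathrm{const} + k \log p + (n-k) \log (1-p)$$

$$\frac{d}{dp} \log L(p) = \frac{k}{p} - \frac{n-k}{1-p} = 0 \;\Longrightarrow\; k(1-p) = (n-k)\,p \;\Longrightarrow\; \hat{p} = \frac{k}{n}$$

The log-likelihood is concave in $p$, so this stationary point is indeed the maximum.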
