3.1 N-Grams
DEF Our task is to compute $P(w \mid h)$, where $w$ is a word and $h$ is a history of preceding text, for example $P(\text{apple} \mid \text{I have an})$. To formalize the problem, treat every word in a sequence as a random variable $X_i$. The probability of $X_i$ taking on the value $w_i$ is written $P(X_i = w_i)$. $w_1^n$ denotes a sequence of $n$ words $w_1 \ldots w_n$, and $P(w_1, \ldots, w_n)$ denotes the joint probability $P(X_1 = w_1, \ldots, X_n = w_n)$.
So, by the chain rule of probability, we have $P(w_1^n) = \prod_{k=1}^{n} P(w_k \mid w_1^{k-1})$
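For example, applying the chain rule to the sentence from the definition above (a worked expansion, just to make the product concrete):

$$
P(\text{I have an apple}) = P(\text{I})\,P(\text{have} \mid \text{I})\,P(\text{an} \mid \text{I have})\,P(\text{apple} \mid \text{I have an})
$$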
This conditional probability is hard to estimate from counts, because language is creative and constantly produces new text, so the same long word sequence almost never recurs.
The bigram language model makes the following approximation (the Markov assumption): $P(w_n \mid w_1^{n-1}) \approx P(w_n \mid w_{n-1})$
As a generalization, an n-gram model conditions only on the previous $N-1$ words: $P(w_n \mid w_1^{n-1}) \approx P(w_n \mid w_{n-N+1}^{n-1})$,
and in the trigram model $N$ takes the value 3.
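To make the approximation concrete, here is a minimal Python sketch of truncating the history to its last $N-1$ words (the function name `ngram_context` and the toy history are purely illustrative, not from the text):

```python
def ngram_context(history, N):
    """Keep only the last N-1 words of the history (Markov assumption)."""
    return history[-(N - 1):] if N > 1 else []

history = ["I", "have", "an"]
print(ngram_context(history, 2))  # bigram:  ['an']
print(ngram_context(history, 3))  # trigram: ['have', 'an']
```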
To estimate these probabilities, we use maximum likelihood estimation (MLE): count how often each n-gram occurs and normalize the count to a value between 0 and 1. For the bigram case: $P(w_n \mid w_{n-1}) = \frac{C(w_{n-1} w_n)}{\sum_{w} C(w_{n-1} w)} = \frac{C(w_{n-1} w_n)}{C(w_{n-1})}$
For the general case of MLE n-gram parameter estimation: $P(w_n \mid w_{n-N+1}^{n-1}) = \frac{C(w_{n-N+1}^{n-1}\, w_n)}{C(w_{n-N+1}^{n-1})}$
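As a rough sketch of this count-and-normalize idea for bigrams (the toy corpus, the function `bigram_mle`, and the variable names are my own, for illustration only):

```python
from collections import Counter

# Toy corpus; in practice this would be a large tokenized text.
corpus = "i have an apple i have an orange i have a pen".split()

unigram_counts = Counter(corpus)
bigram_counts = Counter(zip(corpus, corpus[1:]))

def bigram_mle(prev, word):
    """MLE estimate P(word | prev) = C(prev word) / C(prev)."""
    if unigram_counts[prev] == 0:
        return 0.0
    return bigram_counts[(prev, word)] / unigram_counts[prev]

print(bigram_mle("an", "apple"))  # 0.5: "an" is followed by "apple" once out of 2
print(bigram_mle("i", "have"))    # 1.0: "i" is always followed by "have"
```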
How do we understand MLE?
If a word occurs $k$ times in a corpus of size $n$, then the MLE $p = k/n$ is the value of $p$ that makes it most likely to observe the word exactly $k$ times in a corpus of size $n$ (this holds under the assumption that tokens are drawn independently, so the word's count follows a binomial distribution).
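A quick numeric check of this claim under that independence (binomial) assumption; the values $k=3$, $n=10$ are arbitrary:

```python
from math import comb

k, n = 3, 10  # the word is observed k times in a corpus of n tokens

def likelihood(p):
    """Probability of seeing the word exactly k times in n independent draws."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

# Scan candidate probabilities; the likelihood peaks at p = k/n = 0.3.
candidates = [i / 100 for i in range(1, 100)]
print(max(candidates, key=likelihood))  # 0.3
```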