3.1 N-Grams
DEF Our task is calculating $P(w \mid h)$, where $w$ is a word and $h$ is a text history. For example, $P(\text{the} \mid \text{its water is so transparent that})$. To formalize the problem, treat every word in a sequence as a random variable $X_i$. The probability of $X_i$ taking on the value "the" is written as $P(X_i = \text{the})$. $w_{1:n}$ represents a sequence of $n$ words $w_1 \dots w_n$. $P(w_{1:n})$ represents the joint probability $P(X_1 = w_1, X_2 = w_2, \dots, X_n = w_n)$.
So we have, applying the chain rule of probability:

$$P(w_{1:n}) = P(w_1)\,P(w_2 \mid w_1)\,P(w_3 \mid w_{1:2}) \cdots P(w_n \mid w_{1:n-1}) = \prod_{k=1}^{n} P(w_k \mid w_{1:k-1})$$
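To make the chain rule of probability concrete, here is a minimal sketch; the conditional probability values below are made up for illustration, not estimated from any corpus:

```python
# Chain-rule factorization of a joint probability:
# P(w1, w2, w3) = P(w1) * P(w2 | w1) * P(w3 | w1, w2)
# Hypothetical values for the sequence "its water is":
conditional_probs = [
    0.05,  # P("its")
    0.20,  # P("water" | "its")
    0.40,  # P("is" | "its water")
]

joint = 1.0
for p in conditional_probs:
    joint *= p

# the product P(w1) * P(w2|w1) * P(w3|w1,w2) is about 0.004
print(joint)
```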
This conditional probability is hard to estimate directly by counting: natural language constantly produces new text, so the exact same long word sequence will almost never recur in a corpus.
The bigram language model makes the following approximation (the Markov assumption):

$$P(w_n \mid w_{1:n-1}) \approx P(w_n \mid w_{n-1})$$
As a generalization, an N-gram model conditions each word on only the previous $N-1$ words:

$$P(w_n \mid w_{1:n-1}) \approx P(w_n \mid w_{n-N+1:n-1})$$
In a trigram model, $N$ takes the value 3: each word is conditioned on the two preceding words.
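The Markov approximation can be read directly off the token window: under an N-gram model, the context for each word is just its previous $N-1$ tokens. A minimal sketch, assuming `"<s>"` as a start-of-sentence padding symbol (the function name is my own):

```python
def ngram_contexts(tokens, n):
    """For each position, return (context, word): the previous n-1
    tokens (padded with "<s>" at the start) and the current token."""
    padded = ["<s>"] * (n - 1) + tokens
    return [
        (tuple(padded[i : i + n - 1]), padded[i + n - 1])
        for i in range(len(tokens))
    ]

tokens = "I am Sam".split()
# bigram (n=2): each word is conditioned on the single previous token
print(ngram_contexts(tokens, 2))
# → [(('<s>',), 'I'), (('I',), 'am'), (('am',), 'Sam')]
```

With `n=3` the same function yields trigram contexts of two tokens each.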
To estimate these probabilities, we use maximum likelihood estimation (MLE): count occurrences in a training corpus and normalize the counts to values between 0 and 1. For the bigram case:

$$P(w_n \mid w_{n-1}) = \frac{C(w_{n-1} w_n)}{\sum_{w} C(w_{n-1} w)} = \frac{C(w_{n-1} w_n)}{C(w_{n-1})}$$
For the general case of MLE N-gram parameter estimation:

$$P(w_n \mid w_{n-N+1:n-1}) = \frac{C(w_{n-N+1:n-1}\, w_n)}{C(w_{n-N+1:n-1})}$$
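As a sketch of MLE counting on a toy corpus (plain counters, not any particular library's API), bigram probabilities come out as the ratio of a bigram count to its history's count:

```python
from collections import Counter

# Toy corpus with explicit sentence boundary markers.
corpus = [
    ["<s>", "I", "am", "Sam", "</s>"],
    ["<s>", "Sam", "I", "am", "</s>"],
    ["<s>", "I", "do", "not", "like", "green", "eggs", "and", "ham", "</s>"],
]

bigram_counts = Counter()
history_counts = Counter()
for sentence in corpus:
    for prev, cur in zip(sentence, sentence[1:]):
        bigram_counts[(prev, cur)] += 1
        history_counts[prev] += 1  # count of prev as a bigram history

def p_mle(cur, prev):
    """MLE bigram estimate: C(prev cur) / C(prev)."""
    return bigram_counts[(prev, cur)] / history_counts[prev]

print(p_mle("I", "<s>"))  # C("<s> I") / C("<s>") = 2/3
print(p_mle("am", "I"))   # C("I am") / C("I") = 2/3
```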
How do we understand MLE?
If a word occurs $k$ times in a corpus of size $n$, then the MLE estimate $p = k/n$ is the value of $p$ that makes observing the word exactly $k$ times in a corpus of size $n$ most likely, under the assumption that the $n$ word occurrences are independent draws (so the count follows a binomial distribution).
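Under that independence assumption the count is binomial, and $p = k/n$ does maximize the likelihood. A quick numerical check over a grid of candidate probabilities ($k$ and $n$ chosen arbitrarily):

```python
from math import comb

def binomial_likelihood(p, k, n):
    """P(word occurs exactly k times in n independent draws)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

k, n = 3, 10
candidates = [i / 100 for i in range(1, 100)]
best = max(candidates, key=lambda p: binomial_likelihood(p, k, n))
print(best)  # 0.3, i.e. k / n
```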