3.2 Evaluating Language Models
ways to evaluate:
DEF extrinsic evaluation embeds the LM in an application and measures how much the application improves.
DEF intrinsic evaluation measures the quality of the LM on its own, independent of any application.
training set / training corpus is the corpus the LM is trained on
test set / test corpus is the data used to measure its performance
sometimes called a held-out corpus because it is unseen by the LM during training
When two LMs are compared, the one that assigns a higher probability to the test set is better and we say that it has a tighter fit to the test data.
Training on the test set will introduce a bias that makes probabilities look too high.
development set / devset is an initial test set that the LM is tuned on. It is not a true test set, which must remain completely unseen by the LM until final evaluation.
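A minimal sketch of such a split, assuming the corpus is simply a list of sentences and an illustrative 80/10/10 ratio (the function name and ratios are not from the text):

```python
import random

def split_corpus(sentences, dev_frac=0.1, test_frac=0.1, seed=0):
    """Shuffle sentences and return (train, dev, test) lists."""
    rng = random.Random(seed)
    shuffled = sentences[:]                  # copy so the caller's list is untouched
    rng.shuffle(shuffled)
    n = len(shuffled)
    n_test = int(n * test_frac)
    n_dev = int(n * dev_frac)
    test = shuffled[:n_test]                 # held out: never seen during training or tuning
    dev = shuffled[n_test:n_test + n_dev]    # used for tuning the LM
    train = shuffled[n_test + n_dev:]        # everything else trains the LM
    return train, dev, test
```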
DEF perplexity of a language model on a test set is the inverse probability of the test set, normalized by the number of words: $\text{perplexity}(W) = P(w_1 w_2 \dots w_N)^{-\frac{1}{N}} = \sqrt[N]{\dfrac{1}{P(w_1 w_2 \dots w_N)}}$. Sentence-boundary tokens are often included in the word count $N$.
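Since the chain rule writes $P(w_1 \dots w_N)$ as a product of conditional probabilities, perplexity is easy to compute in log space (which avoids underflow on long test sets). A minimal sketch, assuming a hypothetical `word_prob(history, word)` callable stands in for any trained LM:

```python
import math

def perplexity(test_words, word_prob):
    """perplexity(W) = P(w_1 ... w_N)^(-1/N), computed via log probabilities."""
    log_prob_sum = 0.0
    for i, w in enumerate(test_words):
        p = word_prob(test_words[:i], w)     # P(w_i | w_1 ... w_{i-1})
        log_prob_sum += math.log(p)
    n = len(test_words)
    return math.exp(-log_prob_sum / n)       # inverse probability, N-th root
```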
it can be interpreted as the weighted average branching factor of a language. The branching factor of a language is the number of possible next words that can follow any word. Suppose a language consists of 10 characters (say, the digits 0–9) that each occur with equal frequency and independently of one another; then the branching factor, and the perplexity, is 10. But if some characters appear much more frequently than others, the branches are no longer weighted equally and the perplexity decreases.
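A quick numeric sketch of that weighted-branching-factor idea: for a long test set drawn independently from a distribution $p$, and a model whose probabilities match $p$, the perplexity converges to $2^{H(p)}$. The skewed distribution below is an invented example, not from the text:

```python
import math

def expected_perplexity(probs):
    """Perplexity of a long i.i.d. test set drawn from `probs`: 2 ** entropy."""
    entropy = -sum(p * math.log2(p) for p in probs if p > 0)
    return 2 ** entropy

uniform = [0.1] * 10             # all 10 characters equally likely
skewed = [0.91] + [0.01] * 9     # one character dominates the corpus

print(expected_perplexity(uniform))   # ≈ 10.0 -> plain branching factor
print(expected_perplexity(skewed))    # ≈ 1.65 -> much lower perplexity
```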