# 3.2 Evaluating Language Models

ways to evaluate:

* **DEF** **extrinsic evaluation** embeds the LM in an application and measure how it improves.
* **DEF** **intrinsic evaluation** need not any application.
  * **training set / training corpus** is the corpus LM is trained on
  * **test set / test corpus** is the data used to measure its performance
    * sometimes called **held out corpora** because it's unseen to LM
* When two LMs are compared, the one that assigns a higher probability to the test set is better and we say that it has a tighter fit to the test data.
* **Training on the test set** will introduce a bias that makes probabilities look too high.
* **development set / devset** is an initial test set that LM is tuned to. It's not really a test set, which is completely unseen to the LM.

**DEF** **perplexity** of a language model on a test set is the inverse probability of the test set, normalized by the number of words: $$PP(W)=P(w\_1\dots w\_N)^{-1/N}$$ . Sentence boundaries are often counted in.

* it can be interpreted as the **weighed average branching factor** of a language. Branching factor of a language is the number of possible next words that can follow any word. Suppose that in all corpora of some language, 10 characters appear randomly on a uniform frequency and are completely independent of others. In this case the branching factor is 10. But if some characters appear much more frequently, the weight for branches will change and perplexity will decrease.
