2.2 Words

DEF A corpus (plural: corpora) is a computer-readable collection of text or speech.

DEF An utterance is the spoken correlate of a sentence.

Alternatively, an utterance is a continuous piece of speech that begins and ends with a clear pause. [1]

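DEF Disfluencies are interruptions in the flow of speech, such as broken-off words and filled pauses. Example:

I do uh main- mainly business data processing.

  • main- is called a fragment (a broken-off word).

  • uh is called a filler or filled pause.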
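
As a rough illustration, the two disfluencies in the example can be flagged with a simple rule-based pass. This is a minimal sketch under stated assumptions: the filler set `FILLERS` and the helper `tag_disfluencies` are hypothetical names introduced here, and treating any token that ends in a hyphen as a fragment is a simplification that relies on this particular transcription convention.

```python
# Hypothetical, hand-picked filler inventory (an assumption for illustration).
FILLERS = {"uh", "um"}

def tag_disfluencies(utterance):
    """Label each whitespace-separated token as 'filler', 'fragment', or 'word'."""
    tagged = []
    for token in utterance.split():
        # Drop sentence punctuation but keep a trailing hyphen, which marks a fragment.
        bare = token.strip(".,?!").lower()
        if bare in FILLERS:
            label = "filler"
        elif bare.endswith("-"):
            label = "fragment"
        else:
            label = "word"
        tagged.append((token, label))
    return tagged

tagged = tag_disfluencies("I do uh main- mainly business data processing.")
disfluencies = [(tok, label) for tok, label in tagged if label != "word"]
print(disfluencies)  # [('uh', 'filler'), ('main-', 'fragment')]
```

Whether such tokens should be kept or removed depends on the application; the sketch only labels them.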

DEF A part of speech (POS) is a category of words that have similar grammatical properties.

DEF A lemma is a set of lexical forms having the same stem, the same major part of speech, and the same word sense. The wordform is the full inflected or derived form of the word. For example, cat and cats are two wordforms of the same lemma, cat.

DEF The number of types is the number of distinct words in a corpus; if V is the set of words in the vocabulary, it is written |V|. The number of tokens is the total number of running words, often denoted by N.
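
The distinction can be made concrete by counting both quantities for a short text. The following is a minimal sketch, assuming lowercased, whitespace-split words are an acceptable stand-in for real tokenization; the helper `count_types_and_tokens` and the sample sentence are made up for illustration.

```python
from collections import Counter

def count_types_and_tokens(text):
    """Return (number of types |V|, number of tokens N) for a text.

    Assumption: words are obtained by lowercasing and splitting on whitespace.
    """
    tokens = text.lower().split()
    counts = Counter(tokens)          # one entry per distinct word
    return len(counts), len(tokens)

# Made-up sample sentence (no punctuation, so splitting on spaces is enough).
sentence = "the cat sat on the mat and the dog sat too"
num_types, num_tokens = count_types_and_tokens(sentence)
print(num_types, num_tokens)  # 8 11
```

Here "the" occurs three times and "sat" twice, so the 11 tokens collapse to 8 types.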

Herdan's Law / Heaps' Law: For a large corpus we have $|V| = kN^\beta$, with roughly $0.67 < \beta < 0.75$.
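
To get a feel for what this relationship implies, the sketch below evaluates the formula for two corpus sizes. It is only an illustration: the constant `K = 30.0` is a hypothetical value chosen for the example (the law itself does not fix k), `BETA = 0.7` is simply a value inside the range quoted above, and `heaps_vocabulary_size` is a name introduced here.

```python
def heaps_vocabulary_size(num_tokens, k, beta):
    """Predicted number of types |V| = k * N**beta."""
    return k * num_tokens ** beta

K = 30.0     # hypothetical constant, chosen only for illustration
BETA = 0.7   # an arbitrary value inside the quoted range 0.67 to 0.75

small = heaps_vocabulary_size(1_000_000, K, BETA)
large = heaps_vocabulary_size(2_000_000, K, BETA)

# Doubling N multiplies |V| by 2**beta, independent of k.
growth = large / small
print(f"{growth:.2f}")
```

Because β is less than 1, doubling the number of tokens less than doubles the number of types: the vocabulary keeps growing, but more slowly than the corpus.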

Reference

[1] Utterance - Wikipedia
