2.2 Words
DEF A corpus (plural: corpora) is a computer-readable collection of text or speech.
DEF An utterance is the spoken correlate of a sentence.
An utterance is a continuous piece of speech beginning and ending with a clear pause. [1]
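DEF Disfluencies are interruptions in the flow of speech, such as fragments and fillers. Example:
I do uh main- mainly business data processing.
main- is called a fragment: a word that is broken off partway through.
uh is called a filler or filled pause.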
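As a rough illustration of the two kinds of disfluency, the sketch below labels each token of the example utterance. The filler inventory `FILLERS`, the function name `label_disfluencies`, and the rule that a fragment is any token ending in a hyphen are all assumptions made for this example, not a standard method; real transcripts need transcription-specific conventions.

```python
# Assumed filler inventory; real corpora use their own conventions.
FILLERS = {"uh", "um"}

def label_disfluencies(utterance):
    """Label each whitespace-separated token as 'filler', 'fragment', or 'word'.

    Heuristic (an assumption for this sketch): a fragment is a token that
    ends with a hyphen, as in the transcription "main-".
    """
    labels = []
    for token in utterance.split():
        # Drop sentence punctuation but keep a trailing hyphen.
        word = token.strip(".,?!").lower()
        if word in FILLERS:
            labels.append((token, "filler"))
        elif word.endswith("-"):
            labels.append((token, "fragment"))
        else:
            labels.append((token, "word"))
    return labels

labels = label_disfluencies("I do uh main- mainly business data processing.")
for token, label in labels:
    print(f"{token}\t{label}")
```

Whether fillers and fragments are kept or removed depends on the application: a clean transcript may drop them, while a speech recognizer may treat them as regular words.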
DEF A part of speech (POS) is a category of words that have similar grammatical properties.
DEF A lemma is a set of lexical forms having the same stem, the same major part of speech, and the same word sense. The wordform is the full inflected or derived form of the word. For example, cat and cats are two wordforms with the same lemma, cat.
DEF The number of types is the number of distinct words in a corpus, often denoted |V|, where V is the vocabulary. The number of tokens is the total number of running words, often denoted N.
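The distinction between types and tokens can be made concrete with a short count. This is a minimal sketch that assumes a word is a maximal run of letters or apostrophes and that case is ignored; the function name `count_types_and_tokens` and the sample sentence are illustrative only, and other tokenization choices will give different counts.

```python
import re

def count_types_and_tokens(text):
    """Return (number of types, number of tokens) for a text.

    Assumption: tokens are lowercased runs of letters or apostrophes,
    so punctuation is not counted as a word.
    """
    tokens = re.findall(r"[a-z']+", text.lower())
    return len(set(tokens)), len(tokens)

sentence = ("They picnicked by the pool, then lay back on the grass "
            "and looked at the stars.")
num_types, num_tokens = count_types_and_tokens(sentence)
print(num_types, num_tokens)  # prints: 14 16
```

The sentence has 16 running words, but "the" occurs three times, so there are only 14 distinct words.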
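Herdan's Law / Heaps' Law: for a large corpus we have $|V| = kN^{\beta}$, where $|V|$ is the number of types, $N$ is the number of tokens, and $k$ and $\beta$ are positive constants; roughly, $0.67 < \beta < 0.75$.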
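To see what a power-law relationship between vocabulary size and corpus size implies, the sketch below assumes the usual form of Herdan's / Heaps' Law, |V| = k N^β, and recovers k and β from (N, |V|) pairs by a least-squares fit in log-log space. The function names and the constants k = 30 and β = 0.7 are hypothetical, and the data points are synthetic (generated from the law itself), not measurements from any corpus.

```python
import math

def heaps_vocab_size(n_tokens, k, beta):
    """Vocabulary size predicted by |V| = k * N**beta."""
    return k * n_tokens ** beta

def fit_heaps(points):
    """Fit log|V| = log k + beta * log N by ordinary least squares.

    points: iterable of (N, |V|) pairs with positive values.
    Returns (k, beta).
    """
    xs = [math.log(n) for n, _ in points]
    ys = [math.log(v) for _, v in points]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    beta = sxy / sxx
    k = math.exp(mean_y - beta * mean_x)
    return k, beta

# Synthetic points generated from the law with hypothetical constants.
points = [(n, heaps_vocab_size(n, k=30.0, beta=0.7))
          for n in (10_000, 100_000, 1_000_000)]
k_hat, beta_hat = fit_heaps(points)
print(f"k = {k_hat:.2f}, beta = {beta_hat:.2f}")
```

Because β is below 1, doubling the number of tokens multiplies the predicted vocabulary by 2^β, which is less than 2: the vocabulary keeps growing with corpus size, but more slowly than the corpus itself.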