Chapter 2 Regular Expressions, Text Normalization, Edit Distance
Introduction
DEF text normalization is the task of converting text to a more convenient, standard form.
Roughly speaking, this is what is called normalizing or structuring the text.
EX text normalization tasks include:
DEF tokenization of / tokenizing words from running text (the main text of a document, as distinguished from captions, titles, lists, etc. - Merriam-Webster) means chopping the text up into pieces called tokens. A token is an instance of a sequence of characters in some particular document that are grouped together as a useful semantic unit for processing.
reference: Tokenization
In other words, the text is split into a sequence of semantic units; the elements of the resulting sequence, i.e. these semantic units, are called tokens. A token is not necessarily a word; for example, an idiomatic phrase may form a single token in a particular context.
Even emoticons and hashtags can be tokens.
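A minimal regex-based tokenization sketch in Python; the pattern and the token classes it keeps (hashtags, a few emoticons, words, punctuation) are illustrative assumptions, not a standard tokenizer:

```python
import re

# Illustrative token pattern: hashtags, a couple of emoticons, words
# (optionally with an internal apostrophe), and single punctuation marks.
TOKEN_PATTERN = re.compile(r"""
    \#\w+                 # hashtags, e.g. #NLP
  | [:;][-~]?[)(DPp]      # a few common emoticons, e.g. :-) ;D
  | \w+(?:'\w+)?          # words, optionally with an apostrophe: don't, it's
  | [^\w\s]               # any other single non-space character (punctuation)
""", re.VERBOSE)

def tokenize(text):
    """Split running text into a list of tokens."""
    return TOKEN_PATTERN.findall(text)

print(tokenize("Don't miss #NLP class :-)"))
# -> ["Don't", 'miss', '#NLP', 'class', ':-)']
```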
DEF lemmatization is the task of determining that two words are derived from the same lemma (a form of a word that appears as an entry in a dictionary and is used to represent all the other possible forms - Cambridge). A lemmatizer maps a word to its lemma.
A lemma is essentially the base form or dictionary form of a word. Lemmatization is quite necessary for some smaller languages.
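A toy dictionary-based lemmatizer sketch, assuming a small hand-written lookup table; real lemmatizers rely on a morphological lexicon and part-of-speech information instead of a hard-coded table like this one:

```python
# Toy lookup table mapping inflected forms to their lemma; purely illustrative.
LEMMA_TABLE = {
    "am": "be", "is": "be", "are": "be", "was": "be", "were": "be",
    "sang": "sing", "sung": "sing", "sings": "sing",
    "mice": "mouse", "geese": "goose",
}

def lemmatize(word):
    """Map a word to its lemma; unknown words fall back to themselves."""
    return LEMMA_TABLE.get(word.lower(), word.lower())

assert lemmatize("was") == "be"
assert lemmatize("Mice") == "mouse"
assert lemmatize("dog") == "dog"   # unknown word passes through unchanged
```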
DEF stemming means stripping suffixes from a word; it is a simpler, cruder alternative to lemmatization.
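A crude suffix-stripping stemmer sketch; the suffix list and the minimum-stem-length rule are made-up illustrations, far simpler than the cascade of rewrite rules used by the classic Porter stemmer:

```python
# Illustrative suffixes, tried in this order; a real stemmer such as
# Porter's applies a much more careful sequence of conditional rules.
SUFFIXES = ["ational", "ization", "ing", "edly", "ed", "es", "ly", "s"]

def stem(word):
    """Strip the first matching suffix, keeping a stem of length >= 3."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

print(stem("walking"))     # walk
print(stem("cats"))        # cat
print(stem("relational"))  # rel (crudely over-stemmed, unlike Porter)
```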
DEF sentence segmentation means breaking up a text into individual sentences, typically using cues such as periods, question marks, and exclamation points.
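A minimal regex-based sentence segmenter sketch, assuming sentences end in '.', '?', or '!' followed by whitespace; abbreviations such as "Mr." will break this assumption, which is why real segmenters use abbreviation lists or learned models:

```python
import re

# Split after ., ?, or ! followed by whitespace; abbreviations like
# "Mr." or "U.S." will be wrongly treated as sentence boundaries.
SENT_BOUNDARY = re.compile(r"(?<=[.?!])\s+")

def segment_sentences(text):
    """Break a text into a list of sentences."""
    return [s for s in SENT_BOUNDARY.split(text.strip()) if s]

print(segment_sentences("It works. Does it? Yes!"))
# -> ['It works.', 'Does it?', 'Yes!']
```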
Regular Expressions
Please refer to Regular Expressions.