2.4 Text Normalization
Useful UNIX tools:
tr: translates or deletes characters (e.g., turn every run of non-letters into a newline)
sort: sorts the lines of a text
uniq: removes adjacent duplicate lines (uniq -c also counts them)
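As a rough illustration (my own sketch, not from the chapter), the following Python snippet mirrors the classic word-counting pipeline tr -sc 'A-Za-z' '\n' < input.txt | sort | uniq -c | sort -rn; the sample sentence is made up.

```python
# Rough Python equivalent of: tr -sc 'A-Za-z' '\n' < input.txt | sort | uniq -c | sort -rn
import re
from collections import Counter

def word_counts(text):
    # tr -sc 'A-Za-z' '\n': split the text on every run of non-alphabetic characters
    words = [w for w in re.split(r"[^A-Za-z]+", text.lower()) if w]
    # sort | uniq -c | sort -rn: count duplicates and order by descending frequency
    return Counter(words).most_common()

print(word_counts("The cat sat on the mat. The mat sat still."))
# [('the', 3), ('sat', 2), ('mat', 2), ('cat', 1), ('on', 1), ('still', 1)]
```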
2.4.2 Word Tokenization
DEF clitic contractions (for example, they are → they're)
DEF named entity detection: the task of detecting names, dates, organizations, etc.
Penn Treebank tokenization is a commonly used tokenization standard.
nltk, a useful Python-based NLP toolkit.
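For example, a minimal sketch of Penn-Treebank-style tokenization with nltk (assuming nltk is installed; the sample sentence is made up):

```python
# Minimal sketch: Penn-Treebank-style word tokenization with nltk
# (no extra data downloads are needed for this tokenizer).
from nltk.tokenize import TreebankWordTokenizer

tokenizer = TreebankWordTokenizer()
tokens = tokenizer.tokenize("They're selling U.S. stocks, aren't they?")
print(tokens)
# Clitic contractions are split off: "They're" -> "They" + "'re", "aren't" -> "are" + "n't".
```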
For many Chinese NLP tasks, it turns out to work better to tokenize at the character level. ( ref: [1] )
For languages like Japanese and Thai that require word segmentation, a neural sequence model is usually applied.
2.4.3 Byte-Pair Encoding for Tokenization
ALGO byte-pair encoding / BPE ( ref: [2] )
We learn a list of merge rules from given training data; every rule is a token pair. Learning process:
1. At the initial state, the merge rule list is empty. The vocabulary (a list of tokens) contains all single characters, with end-of-word treated as a special character. All words are segmented into characters.
2. At every step, merge the most frequent token pair (subsequence of length 2) within words (never across words) to create a new token. Append the token pair to the end of the merge rule list and the new token to the end of the vocabulary.
3. Re-tokenize all words according to the new vocabulary.
4. Repeat steps 2-3 until k merges are done (k is a parameter of the algorithm).
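Below is a minimal sketch of this learning loop (my own illustration, not the reference implementation from [2]). Words are stored as space-separated symbols with _ as the end-of-word symbol, and the toy word-frequency dictionary matches the worked example further below.

```python
# Minimal sketch of BPE merge-rule learning on a toy word-frequency dictionary.
from collections import Counter

def get_pair_counts(word_freqs):
    """Count adjacent token pairs inside words (never across word boundaries)."""
    pairs = Counter()
    for word, freq in word_freqs.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, word_freqs):
    """Re-tokenize every word, replacing each occurrence of the pair with the merged token."""
    a, b = pair
    new_freqs = {}
    for word, freq in word_freqs.items():
        symbols, out, i = word.split(), [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and symbols[i] == a and symbols[i + 1] == b:
                out.append(a + b)
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = " ".join(out)
        new_freqs[key] = new_freqs.get(key, 0) + freq
    return new_freqs

def learn_bpe(word_freqs, k):
    """Perform k merges; return the ordered merge-rule list and the re-tokenized words."""
    rules = []
    for _ in range(k):
        pairs = get_pair_counts(word_freqs)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)   # most frequent pair; ties broken arbitrarily here
        word_freqs = merge_pair(best, word_freqs)
        rules.append(best)
    return rules, word_freqs

word_freqs = {"l o w _": 5, "l o w e s t _": 2, "n e w e r _": 6, "w i d e r _": 3}
rules, word_freqs = learn_bpe(word_freqs, k=2)
print(rules)       # learned merge rules, in order
print(word_freqs)  # the re-tokenized training words
```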
We use the learned rules to tokenize a given corpus. Tokenization / decoding:
1. At the initial state, all words are segmented into characters.
2. At every step, pop the first token pair (A, B) from the merge rule list. Linearly scan the corpus and greedily merge every within-word occurrence of (A, B) into the token AB.
3. Repeat step 2 until the rule list is empty.
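A minimal sketch of this replay step, using the same representation as the training sketch above; "lower" is an unseen test word, and the two rules are the ones learned in the worked example below.

```python
# Minimal sketch of BPE tokenization: segment a word into characters plus the
# end-of-word symbol, then replay the learned merge rules in order.
def bpe_tokenize(word, rules):
    symbols = list(word) + ["_"]                  # character-level segmentation
    for a, b in rules:                            # rules in the order they were learned
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and symbols[i] == a and symbols[i + 1] == b:
                out.append(a + b)                 # greedily merge this occurrence
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        symbols = out
    return symbols

print(bpe_tokenize("lower", [("r", "_"), ("e", "r_")]))   # ['l', 'o', 'w', 'er_']
```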
EX Suppose we have the following words and their frequencies in the training corpus (_ is the end-of-word symbol):

frequency    dictionary
5            l o w _
2            l o w e s t _
6            n e w e r _
3            w i d e r _
During training, we first merge r and _ for their 9 occurrences (the pair (e, r) is also tied at 9 here; ties are broken arbitrarily). r_ is added to the vocabulary, the rule (r, _) is appended to the rule list, and all occurrences of r _ inside words are merged by re-tokenization:

frequency    dictionary
5            l o w _
2            l o w e s t _
6            n e w e r_
3            w i d e r_
Then the steps are repeated: e and r_ (9 occurrences) are merged into er_, and another rule (e, r_) is created:

frequency    dictionary
5            l o w _
2            l o w e s t _
6            n e w er_
3            w i d er_
During tokenization, we first apply the rule (r, _) and merge every r and _, and then the rule (e, r_) is applied. In short: training merges the most frequent token pair at each iteration, and tokenization simply replays that sequence of merge operations on the new text.
ALGO wordpiece ( ref: [3] )
The word-boundary token _ appears at the beginning of words rather than at the end as in BPE. Rather than merging the most frequent pairs, wordpiece merges the pair whose merge most increases the language-model likelihood of the training data.
The BERT tokenizer is a variant of the wordpiece algorithm. It uses a greedy maximum-matching (longest-match-first) algorithm for tokenization (decoding).
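A minimal sketch of greedy maximum matching over a subword vocabulary, in the spirit of (but not identical to) the BERT decoder; the tiny vocabulary below and its ## continuation marks are illustrative, not BERT's real vocabulary.

```python
# Minimal sketch of greedy maximum matching (longest-match-first) over a subword vocabulary.
def max_match(word, vocab, unk="[UNK]", max_len=20):
    tokens, start = [], 0
    while start < len(word):
        end = min(len(word), start + max_len)
        piece = None
        while end > start:                          # try the longest candidate first
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate        # mark word-internal pieces, BERT-style
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:                           # nothing in the vocabulary matches
            return [unk]
        tokens.append(piece)
        start = end
    return tokens

vocab = {"un", "##aff", "##able"}
print(max_match("unaffable", vocab))                # ['un', '##aff', '##able']
```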
ALGO SentencePiece is similar to wordpiece but runs training and decoding directly on raw text, so merges can take place not only within words but also across word boundaries. It can therefore be applied to languages without whitespace, such as Chinese. ( ref: [4] )
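A minimal sketch of how the sentencepiece Python package is typically driven (assuming it is installed); corpus.txt, the sp_demo prefix, and the hyperparameters are placeholders chosen for illustration.

```python
# Minimal sketch of the sentencepiece package; corpus.txt is assumed to be a
# reasonably large raw (untokenized) text file, and all names are placeholders.
import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input="corpus.txt",        # raw text, one sentence per line
    model_prefix="sp_demo",    # writes sp_demo.model and sp_demo.vocab
    vocab_size=8000,
    model_type="bpe",
)

sp = spm.SentencePieceProcessor(model_file="sp_demo.model")
print(sp.encode("raw text goes in directly, with no pre-tokenization", out_type=str))
```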
2.4.4 Word Normalization, Lemmatization and Stemming
DEF Word normalization is the task of putting tokens in a standard format, choosing a single normal form for words with multiple forms, like USA and US.
Case folding maps everything to lower case.
Sometimes we also want morphologically different forms of a word to behave similarly (e.g., plural vs. singular forms).
Lemmatization: determining that two words share the same root (lemma), e.g. am / are / is → be.
Morphemes: the smallest meaning-bearing units of a word.
Stems: the central morpheme of a word, supplying its main meaning.
Affixes: add additional meanings.
Morphological analysis: parsing a word into its morphemes; full lemmatization relies on it.
Stemming: a simpler, cruder approach that just chops off word-final affixes.
ALGO Porter stemmer: a widely used rule-based stemmer for English, built as a cascade of rewrite rules.
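A minimal sketch of stemming with nltk's implementation of the Porter stemmer (assuming nltk is installed; the word list is arbitrary):

```python
# Minimal sketch: the Porter stemmer as implemented in nltk.
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ["computational", "computer", "computing", "relational", "flies"]:
    print(word, "->", stemmer.stem(word))
# The three "comput-" words are conflated to the same stem; note that stems
# need not be real words (e.g. "computational" -> "comput").
```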
2.4.5 Sentence Segmentation
Punctuation is important! Punctuation marks are the key cue for splitting sentences.
But they may be ambiguous: sentence-final marks such as the period can also appear sentence-internally (e.g., in abbreviations and numbers).
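A minimal sketch of sentence segmentation with nltk's pretrained Punkt model (assuming nltk is installed and the punkt models have been downloaded via nltk.download('punkt'); the sample text is made up):

```python
# Minimal sketch: sentence segmentation with nltk's pretrained Punkt model.
# Requires nltk.download('punkt') to have been run once beforehand.
from nltk.tokenize import sent_tokenize

text = "Mr. Smith arrived at 3 p.m. He left on Jan. 5."
print(sent_tokenize(text))
# The periods in "Mr.", "p.m." and "Jan." are not sentence boundaries; the
# segmenter has to disambiguate them from true sentence-final periods.
```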
Reference
[1] Li, X., Meng, Y., Sun, X., Han, Q., Yuan, A., and Li, J. (2019). Is word segmentation necessary for deep learning of Chinese representations? In ACL 2019, 3242–3252.
[2] Sennrich, R., Haddow, B., and Birch, A. (2016). Neural machine translation of rare words with subword units. In ACL 2016.
[3] Wu et al. (2016). Google's neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144.
[4] Kudo, T. and Richardson, J. (2018). SentencePiece: A simple and language independent subword tokenizer and detokenizer for neural text processing. In EMNLP 2018, 66–71.