2.4 Text Normalization
Useful UNIX tools:
tr: translates or deletes characters (e.g., turn every run of non-letters into a newline)
sort: sorts the lines of a text
uniq: removes adjacent duplicate lines (uniq -c also counts them)
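As a rough illustration (my own sketch, not from the chapter), the following Python snippet mirrors the classic word-counting pipeline tr -sc 'A-Za-z' '\n' < input.txt | sort | uniq -c | sort -rn; the sample sentence is made up.

```python
# Rough Python equivalent of: tr -sc 'A-Za-z' '\n' < input.txt | sort | uniq -c | sort -rn
import re
from collections import Counter

def word_counts(text):
    # tr -sc 'A-Za-z' '\n': split the text on every run of non-alphabetic characters
    words = [w for w in re.split(r"[^A-Za-z]+", text.lower()) if w]
    # sort | uniq -c | sort -rn: count duplicates and order by descending frequency
    return Counter(words).most_common()

print(word_counts("The cat sat on the mat. The mat sat still."))
# [('the', 3), ('sat', 2), ('mat', 2), ('cat', 1), ('on', 1), ('still', 1)]
```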
2.4.2 Word Tokenization
DEF clitic contractions (for example, they are → they're)
DEF named entity detection: the task of detecting names, dates, organizations, etc.
Penn Treebank tokenization is a commonly used tokenization standard.
nltk, a useful Python-based NLP toolkit.
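For example, a minimal sketch of Penn-Treebank-style tokenization with nltk (assuming nltk is installed; the sample sentence is made up):

```python
# Minimal sketch: Penn-Treebank-style word tokenization with nltk
# (no extra data downloads are needed for this tokenizer).
from nltk.tokenize import TreebankWordTokenizer

tokenizer = TreebankWordTokenizer()
tokens = tokenizer.tokenize("They're selling U.S. stocks, aren't they?")
print(tokens)
# Clitic contractions are split off: "They're" -> "They" + "'re", "aren't" -> "are" + "n't".
```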
For many Chinese NLP tasks, it turns out to work better to tokenize at the character level. ( ref: [1] )
For languages like Japanese and Thai that require word segmentation, a neural sequence model is usually applied.
2.4.3 Byte-Pair Encoding for Tokenization
ALGO byte-pair encoding / BPE ( ref: [2] )
We learn a list of merge rules from given training data; every rule is a token pair. Learning process:
1. At the initial state, the merge rule list is empty. The vocabulary (a list of tokens) contains all single characters, with end-of-word treated as a special character. All words are segmented into characters.
2. At every step, merge the most frequent token pair (subsequence of length 2) within words (never across words) to create a new token. Append the token pair to the end of the merge rule list and the new token to the end of the vocabulary.
3. Re-tokenize all words according to the new vocabulary.
4. Repeat steps 2-3 until k merges are done (k is a parameter of the algorithm).
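Below is a minimal sketch of this learning loop (my own illustration, not the reference implementation from [2]). Words are stored as space-separated symbols with _ as the end-of-word symbol, and the toy word-frequency dictionary matches the worked example further below.

```python
# Minimal sketch of BPE merge-rule learning on a toy word-frequency dictionary.
from collections import Counter

def get_pair_counts(word_freqs):
    """Count adjacent token pairs inside words (never across word boundaries)."""
    pairs = Counter()
    for word, freq in word_freqs.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, word_freqs):
    """Re-tokenize every word, replacing each occurrence of the pair with the merged token."""
    a, b = pair
    new_freqs = {}
    for word, freq in word_freqs.items():
        symbols, out, i = word.split(), [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and symbols[i] == a and symbols[i + 1] == b:
                out.append(a + b)
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = " ".join(out)
        new_freqs[key] = new_freqs.get(key, 0) + freq
    return new_freqs

def learn_bpe(word_freqs, k):
    """Perform k merges; return the ordered merge-rule list and the re-tokenized words."""
    rules = []
    for _ in range(k):
        pairs = get_pair_counts(word_freqs)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)   # most frequent pair; ties broken arbitrarily here
        word_freqs = merge_pair(best, word_freqs)
        rules.append(best)
    return rules, word_freqs

word_freqs = {"l o w _": 5, "l o w e s t _": 2, "n e w e r _": 6, "w i d e r _": 3}
rules, word_freqs = learn_bpe(word_freqs, k=2)
print(rules)       # learned merge rules, in order
print(word_freqs)  # the re-tokenized training words
```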
We use the learned rules to tokenize a given corpus. Tokenization / decoding:
1. At the initial state, all words are segmented into characters.
2. At every step, pop the first token pair (A, B) from the merge rule list. Linearly scan the corpus and greedily merge every within-word occurrence of (A, B) into the token AB.
3. Repeat step 2 until the rule list is empty.
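A minimal sketch of this replay step, using the same representation as the training sketch above; "lower" is an unseen test word, and the two rules are the ones learned in the worked example below.

```python
# Minimal sketch of BPE tokenization: segment a word into characters plus the
# end-of-word symbol, then replay the learned merge rules in order.
def bpe_tokenize(word, rules):
    symbols = list(word) + ["_"]                  # character-level segmentation
    for a, b in rules:                            # rules in the order they were learned
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and symbols[i] == a and symbols[i + 1] == b:
                out.append(a + b)                 # greedily merge this occurrence
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        symbols = out
    return symbols

print(bpe_tokenize("lower", [("r", "_"), ("e", "r_")]))   # ['l', 'o', 'w', 'er_']
```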
EX Suppose we have the following words and their frequencies in the training corpus (_ is the end-of-word symbol):

frequency    dictionary
5            l o w _
2            l o w e s t _
6            n e w e r _
3            w i d e r _
During training, we first merge r and _ for their 9 occurrences (the pair (e, r) is also tied at 9 here; ties are broken arbitrarily). r_ is added to the vocabulary, the rule (r, _) is appended to the rule list, and all occurrences of r _ inside words are merged by re-tokenization:

frequency    dictionary
5            l o w _
2            l o w e s t _
6            n e w e r_
3            w i d e r_
Then the steps are repeated: e and r_ (9 occurrences) are merged into er_, and another rule (e, r_) is created:

frequency    dictionary
5            l o w _
2            l o w e s t _
6            n e w er_
3            w i d er_
During tokenization, we first apply the rule (r, _) and merge every r and _, and then the rule (e, r_) is applied. In short: training merges the most frequent token pair at each iteration, and tokenization simply replays that sequence of merge operations on the new text.
ALGO wordpiece ( ref: [3] )
The word-boundary token _ appears at the beginning of words rather than at the end as in BPE. Rather than merging the most frequent pairs, wordpiece merges the pair whose merge most increases the language-model likelihood of the training data.
The BERT tokenizer is a variant of the wordpiece algorithm. It uses a greedy maximum-matching (longest-match-first) algorithm for tokenization (decoding).
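A minimal sketch of greedy maximum matching over a subword vocabulary, in the spirit of (but not identical to) the BERT decoder; the tiny vocabulary below and its ## continuation marks are illustrative, not BERT's real vocabulary.

```python
# Minimal sketch of greedy maximum matching (longest-match-first) over a subword vocabulary.
def max_match(word, vocab, unk="[UNK]", max_len=20):
    tokens, start = [], 0
    while start < len(word):
        end = min(len(word), start + max_len)
        piece = None
        while end > start:                          # try the longest candidate first
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate        # mark word-internal pieces, BERT-style
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:                           # nothing in the vocabulary matches
            return [unk]
        tokens.append(piece)
        start = end
    return tokens

vocab = {"un", "##aff", "##able"}
print(max_match("unaffable", vocab))                # ['un', '##aff', '##able']
```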
ALGO SentencePiece is similar to wordpiece but runs training and decoding directly on raw text, so merges can take place not only within words but also across word boundaries. It can therefore be applied to languages without whitespace, such as Chinese. ( ref: [4] )
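A minimal sketch of how the sentencepiece Python package is typically driven (assuming it is installed); corpus.txt, the sp_demo prefix, and the hyperparameters are placeholders chosen for illustration.

```python
# Minimal sketch of the sentencepiece package; corpus.txt is assumed to be a
# reasonably large raw (untokenized) text file, and all names are placeholders.
import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input="corpus.txt",        # raw text, one sentence per line
    model_prefix="sp_demo",    # writes sp_demo.model and sp_demo.vocab
    vocab_size=8000,
    model_type="bpe",
)

sp = spm.SentencePieceProcessor(model_file="sp_demo.model")
print(sp.encode("raw text goes in directly, with no pre-tokenization", out_type=str))
```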
2.4.4 Word Normalization, Lemmatization and Stemming
DEF Word normalization is the task of putting tokens in a standard format, choosing a single normal form for words with multiple forms, like USA and US.
Case folding maps everything to lower case.
Sometimes we also want morphologically different forms of a word to behave similarly (e.g., plural vs. singular forms).
Lemmatization: determining that two words share the same root (lemma), e.g. am / are / is → be.
Morphemes: the smallest meaning-bearing units of a word.
Stems: the central morpheme of a word, supplying its main meaning.
Affixes: add additional meanings.
Morphological analysis: parsing a word into its morphemes; full lemmatization relies on it.
Stemming: a simpler, cruder approach that just chops off word-final affixes.
ALGO Porter stemmer: a widely used rule-based stemmer for English, built as a cascade of rewrite rules.
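A minimal sketch of stemming with nltk's implementation of the Porter stemmer (assuming nltk is installed; the word list is arbitrary):

```python
# Minimal sketch: the Porter stemmer as implemented in nltk.
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ["computational", "computer", "computing", "relational", "flies"]:
    print(word, "->", stemmer.stem(word))
# The three "comput-" words are conflated to the same stem; note that stems
# need not be real words (e.g. "computational" -> "comput").
```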
2.4.5 Sentence Segmentation
Punctuation is important! Punctuation marks are the key cue for splitting sentences.
But they may be ambiguous: sentence-final marks such as the period can also appear sentence-internally (e.g., in abbreviations and numbers).
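A minimal sketch of sentence segmentation with nltk's pretrained Punkt model (assuming nltk is installed and the punkt models have been downloaded via nltk.download('punkt'); the sample text is made up):

```python
# Minimal sketch: sentence segmentation with nltk's pretrained Punkt model.
# Requires nltk.download('punkt') to have been run once beforehand.
from nltk.tokenize import sent_tokenize

text = "Mr. Smith arrived at 3 p.m. He left on Jan. 5."
print(sent_tokenize(text))
# The periods in "Mr.", "p.m." and "Jan." are not sentence boundaries; the
# segmenter has to disambiguate them from true sentence-final periods.
```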
Reference
[1] Li, X., Meng, Y., Sun, X., Han, Q., Yuan, A., and Li, J. (2019). Is word segmentation necessary for deep learning of Chinese representations? In ACL 2019, 3242–3252.
[2] Sennrich, R., Haddow, B., and Birch, A. (2016). Neural machine translation of rare words with subword units. In ACL 2016.
[3] Wu et al. (2016). Google's neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144.
[4] Kudo, T. and Richardson, J. (2018). SentencePiece: A simple and language independent subword tokenizer and detokenizer for neural text processing. In EMNLP 2018, 66–71.