Regular Expressions
DEF Regular Expression, pattern, corpus
A regular expression search function will search through the corpus, returning all texts that match the pattern.
Basic patterns (note: the slashes in // mark their content as an RE pattern but are not part of the pattern):
A sequence of digits and Latin letters (and other characters, except the special ones listed below) searches for an identical substring.
/Buttercup/ searches for the substring Buttercup.
[] specifies a disjunction of characters to match. /[Bb]/ matches B or b.
[-] specifies a range of characters. /[a-d]/ matches a, b, c or d. /[0-9]/ matches an arbitrary digit. /[A-Z]/ matches an arbitrary upper case letter.
[^] matches any character that is not among the characters following the ^ symbol. The ^ must be the first symbol after [; otherwise it simply matches a literal ^. In other words, placing a ^ at the very start inside the brackets negates the set of characters to match. /[^A-Z]/ matches any character that is not an upper case letter. /[e^]/ matches the letter e or ^.
? specifies that the preceding character is optional. /colou?r/ matches color or colour.
* (Kleene star) means "zero or more occurrences" of the preceding character. /a*/ matches the empty string, a, aa, etc.
+ (Kleene plus) means "one or more occurrences" of the preceding character. /a+/ matches a, aa, etc.
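The basic patterns above can be tried out with a short sketch. It assumes Python's built-in re module as the search function; the sample strings and variable names are made up for illustration.

```python
import re

# A plain sequence of letters searches for an identical substring.
literal = re.findall(r"Buttercup", "I'm called little Buttercup")  # ['Buttercup']

# [] is a disjunction of characters, [-] a range, [^] a negated set.
either_case = re.findall(r"[Bb]", "Bob")        # ['B', 'b']
a_to_d = re.findall(r"[a-d]", "abcdef")         # ['a', 'b', 'c', 'd']
not_upper = re.findall(r"[^A-Z]", "AbC")        # ['b']
e_or_caret = re.findall(r"[e^]", "e^x")         # ['e', '^'] (^ is literal when not first)

# ? makes the preceding character optional.
spellings = re.findall(r"colou?r", "color colour")  # ['color', 'colour']

# * allows zero occurrences, + requires at least one.
star_matches_empty = re.fullmatch(r"a*", "") is not None   # True
plus_matches_empty = re.fullmatch(r"a+", "") is not None   # False
plus_matches_aa = re.fullmatch(r"a+", "aa") is not None    # True
```

re.fullmatch is used for the star and plus checks so that the whole string, including the empty one, has to match the pattern.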
Anchors tie an RE to particular places in a string; that is, they are used for positioning.
^ matches the start of a line. /^The/ matches a The at the start of a line.
$ matches the end of a line. /\.$/ matches a . at the end of a line (the backslash is needed because an unescaped . matches any character).
\b matches a word boundary, while \B matches a non-boundary. Here a word is a sequence of digits, underscores or letters.
| is a disjunction operator that matches the pattern on either side. /cat|dog/ matches cat or dog.
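A minimal sketch of the anchors and the disjunction operator, assuming Python's re module. Note one Python-specific detail: ^ and $ anchor to lines only when re.MULTILINE is set; by default they anchor to the whole string. The sample text is invented.

```python
import re

text = "The cat sat.\nThe other dog ran"

# With re.MULTILINE, ^ and $ match at the start and end of every line.
line_starts = re.findall(r"^The", text, flags=re.MULTILINE)   # ['The', 'The']
string_start = re.findall(r"^The", text)                      # ['The'] (start of string only)
period_at_end = re.findall(r"\.$", text, flags=re.MULTILINE)  # ['.']

# \b requires a word boundary, \B requires the absence of one.
whole_words = re.findall(r"\bthe\b", "the other theology the")  # ['the', 'the']
inside_words = re.findall(r"\Bthe\B", "the other theology")     # ['the'] (from "other")

# | matches the pattern on either side.
animals = re.findall(r"cat|dog", text)  # ['cat', 'dog']
```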
DEF false positive - incorrectly matched
DEF false negative - incorrectly missed
DEF increasing precision - minimizing false positives
DEF increasing recall - minimizing false negatives
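To make these four definitions concrete, here is a small sketch that searches for the word "the" with three patterns of increasing care. It assumes Python's re module; the sample sentence is made up and contains the target word exactly twice (The, the).

```python
import re

text = "The other one is there, but the theory holds."

# /the/ misses "The" (a false negative) and also hits the "the" inside
# "other", "there" and "theory" (three false positives).
naive = re.findall(r"the", text)

# /[tT]he/ fixes the false negative (better recall) but keeps the false positives.
better_recall = re.findall(r"[tT]he", text)

# /\b[tT]he\b/ also removes the false positives (better precision).
better_precision = re.findall(r"\b[tT]he\b", text)

print(len(naive), len(better_recall), better_precision)
# prints: 4 5 ['The', 'the']
```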
s/// performs a substitution that replaces one string with another. s/colour/color/ replaces colour with color.
If we need to refer to a particular subpart of a matching result, we can use a number operator:
s/([0-9]+)/<\1>/ encloses every number in <>. \1 refers to the number captured by (). /(.*) are \1/ will match cats are cats but not cats are dogs.
() serves as a capture group. Strings matching the RE inside are stored in numbered registers. Data in registers can be referred to through number operators like \1, \2, etc.
Sometimes we use () only to specify the priority of operators. In that case we use a non-capturing group (?:), as in /(?:some|a few) people/.
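Substitution, numbered registers and non-capturing groups can be sketched as follows, assuming Python's re module, where re.sub plays the role of s/// and the replacement string may use \1. The sample strings are invented.

```python
import re

# s/colour/color/
american = re.sub(r"colour", "color", "my favourite colour")
# 'my favourite color'

# s/([0-9]+)/<\1>/ : \1 in the replacement refers to the captured number.
bracketed = re.sub(r"([0-9]+)", r"<\1>", "the 35 boxes and 7 bags")
# 'the <35> boxes and <7> bags'

# \1 inside the pattern must repeat exactly what the group captured.
# fullmatch is used so that an empty capture cannot match just " are ".
same = re.fullmatch(r"(.*) are \1", "cats are cats") is not None       # True
different = re.fullmatch(r"(.*) are \1", "cats are dogs") is not None  # False

# (?:...) groups for operator priority without filling a register.
m = re.search(r"(?:some|a few) people", "a few people came")
matched_text = m.group(0)   # 'a few people'
registers = m.groups()      # () because nothing was captured
```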
DEF lookahead assertions help to look ahead in the text without advancing the match cursor.
(This no longer seems to be regular expression syntax in the formal-language sense.)
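(?=pattern) returns true if pattern occurs. To understand this, keep the notion of a match cursor in mind. Put simply, what pattern matches does not become part of the result, and because this is only a lookahead, once it has matched, the match state of the string being searched is the same as it was before pattern was tried.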
(?!pattern) returns true if pattern does not occur. /^(?!Volcano)[A-Za-z]+/ matches a word at the start of a line that does not begin with Volcano.
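The zero-width behaviour of lookahead can be checked with a short sketch, assuming Python's re module, which supports both (?=...) and (?!...). The sample strings are made up.

```python
import re

# Positive lookahead: the "!" must follow, but it is not part of the result.
before_bang = re.findall(r"\w+(?=!)", "wow! nice! ok")  # ['wow', 'nice']

# The cursor does not advance: after (?=cat) succeeds, \w+ still starts at "c".
not_consumed = re.match(r"(?=cat)\w+", "catalog").group(0)  # 'catalog'

# Negative lookahead: a word at the start that does not begin with Volcano.
pattern = re.compile(r"^(?!Volcano)[A-Za-z]+")
accepted = pattern.match("Mountain view").group(0)    # 'Mountain'
rejected = pattern.match("Volcano erupts")            # None
also_rejected = pattern.match("Volcanoes erupt")      # None (it begins with Volcano)
```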