Regular Expressions
DEF Regular Expression, pattern, corpus
A regular expression search function will search through the corpus, returning all texts that match the pattern.
Basic patterns (note: the slashes in // mark their content as an RE pattern and are not part of the pattern):
A sequence of digits and Latin letters (and any other characters except the special ones listed below) searches for an identical substring.
/Buttercup/ searches for the substring Buttercup.
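As a minimal sketch of a search function over a corpus, the following uses Python's re module. The corpus contents and the variable names are made up for illustration; they are not from these notes.

```python
import re

# Hypothetical toy corpus: one text per list entry.
corpus = ["Buttercup the cow", "a field of buttercups", "no flowers here"]

# re.search scans a text for the first place where the pattern matches.
# A plain sequence of letters matches only an identical, case-sensitive substring.
matches = [text for text in corpus if re.search(r"Buttercup", text)]

print(matches)  # ['Buttercup the cow']
```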
[] specifies a disjunction of characters to match.
/[Bb]/ matches B or b.
[-] specifies a range of characters.
/[a-d]/ matches a, b, c, or d.
/[0-9]/ matches any single digit.
/[A-Z]/ matches any single upper case letter.
[^] specifies any single character that is not among the characters following the ^ symbol.
The ^ must be the first symbol after [, otherwise it simply matches a literal ^. In other words, placing ^ at the very start of the brackets negates the set of characters to match.
/[^A-Z]/ matches any character that is not an upper case letter.
/[e^]/ matches the letter e or ^.
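The bracket patterns above can be tried out with re.findall, which returns every non-overlapping match. This is only a sketch; the sample strings and variable names are invented for illustration.

```python
import re

either_case = re.findall(r"[Bb]", "Bob")         # disjunction of B and b
a_to_d = re.findall(r"[a-d]", "abcdef")          # range a through d
digits = re.findall(r"[0-9]", "room 42")         # any single digit
not_upper = re.findall(r"[^A-Z]", "AbCd")        # ^ first: negated set
e_or_caret = re.findall(r"[e^]", "e^x")          # ^ not first: literal ^

print(either_case)  # ['B', 'b']
print(a_to_d)       # ['a', 'b', 'c', 'd']
print(digits)       # ['4', '2']
print(not_upper)    # ['b', 'd']
print(e_or_caret)   # ['e', '^']
```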
? specifies that the preceding character is optional.
/colou?r/ matches color or colour.
* (Kleene star) means "zero or more occurrences" of the preceding character.
/a*/ matches the empty string, a, aa, etc.
+ (Kleene plus) means "one or more occurrences" of the preceding character.
/a+/ matches a, aa, etc.
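To compare ?, * and + side by side, the sketch below uses re.fullmatch, which succeeds only when the whole string matches the pattern. The candidate strings are made-up test inputs.

```python
import re

# ? makes the preceding u optional.
optional_u = [w for w in ["color", "colour", "colouur"] if re.fullmatch(r"colou?r", w)]

candidates = ["", "a", "aa", "b"]
# * accepts zero or more a's, so the empty string is included.
star = [s for s in candidates if re.fullmatch(r"a*", s)]
# + requires at least one a, so the empty string is excluded.
plus = [s for s in candidates if re.fullmatch(r"a+", s)]

print(optional_u)  # ['color', 'colour']
print(star)        # ['', 'a', 'aa']
print(plus)        # ['a', 'aa']
```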
Anchors tie an RE to particular places in a string; that is, they are used for positioning.
^ matches the start of a line.
/^The/ matches a The at the start of a line.
$ matches the end of a line.
/\.$/ matches a . at the end of a line (the backslash escapes the ., which would otherwise match any character).
\b matches a word boundary, while \B matches a non-boundary.
Here a word is a sequence of digits, underscores, or letters.
| is a disjunction operator that matches the pattern on either side of it.
/cat|dog/ matches cat or dog.
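A small sketch of the anchors and the disjunction operator in Python's re module. Note one assumption specific to Python: ^ and $ match at every line only when the re.MULTILINE flag is passed; without it they match at the start and end of the whole string. The sample text is invented.

```python
import re

# Hypothetical two-line sample text.
text = "The cat saw the other dog.\nThen it left."

# re.MULTILINE makes ^ and $ match at every line start and end.
line_starts = re.findall(r"^The", text, flags=re.MULTILINE)
line_end_periods = re.findall(r"\.$", text, flags=re.MULTILINE)

# \b keeps "the" from matching inside "other".
any_the = re.findall(r"the", text)
whole_word_the = re.findall(r"\bthe\b", text)

either_animal = re.findall(r"cat|dog", text)

print(line_starts)                        # ['The', 'The'] (the second is the start of "Then")
print(line_end_periods)                   # ['.', '.']
print(len(any_the), len(whole_word_the))  # 2 1
print(either_animal)                      # ['cat', 'dog']
```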
DEF false positive - incorrectly matched
DEF false negative - incorrectly missed
DEF increasing precision - minimizing false positives
DEF increasing recall - minimizing false negatives
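The four definitions above can be illustrated by trying to find the word "the" in a sentence. This is a sketch with an invented sentence; each pattern fixes one kind of error from the previous one.

```python
import re

# Hypothetical sample sentence.
text = "The other one is there, but the cat is not."

# Misses "The" (false negative) and also hits "other" and "there" (false positives).
naive = re.findall(r"the", text)

# Better recall: "The" is now found, but the false positives remain.
better_recall = re.findall(r"[tT]he", text)

# Better precision: word boundaries remove the hits inside "other" and "there".
better_precision = re.findall(r"\b[tT]he\b", text)

print(naive)             # ['the', 'the', 'the']
print(better_recall)     # ['The', 'the', 'the', 'the']
print(better_precision)  # ['The', 'the']
```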
s/// performs a substitution, replacing one string with another.
s/colour/color/ replaces colour with color.
If we need to refer to a particular subpart of a match, we can use a number operator:
s/([0-9]+)/<\1>/ encloses every number in <>. Here \1 refers to the number captured by ().
/(.*) are \1/, applied to the whole string, will match cats are cats but not cats are dogs.
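In Python the s/// notation corresponds roughly to re.sub, which by default replaces every match (similar to s///g in tools that use that flag). The sketch below uses invented sample strings. It uses re.fullmatch for the backreference test because an unanchored search could succeed on "cats are dogs" by letting (.*) capture the empty string.

```python
import re

# s/colour/color/ as a re.sub call.
american = re.sub(r"colour", "color", "my favourite colour")

# \1 in the replacement refers to what the first () captured.
bracketed = re.sub(r"([0-9]+)", r"<\1>", "the 35 boxes and 7 bags")

# \1 in the pattern must repeat exactly what the group matched.
same = re.fullmatch(r"(.*) are \1", "cats are cats")
different = re.fullmatch(r"(.*) are \1", "cats are dogs")

print(american)           # my favourite color
print(bracketed)          # the <35> boxes and <7> bags
print(same is not None)   # True
print(different is None)  # True
```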
() serves as a capture group. Strings matching the RE inside are stored in numbered registers, and the data in the registers can be referred to through number operators like \1, \2, etc.
Sometimes we use () only to specify the priority of operators. In that case we use a non-capturing group (?:).
/(?:some|a few) people/ matches some people or a few people without storing the alternative in a register.
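A short sketch of the difference between the two kinds of group, using a made-up sample string. The groups() method of a Python match object lists the contents of the numbered registers.

```python
import re

text = "a few people left"

capturing = re.search(r"(some|a few) people", text)
non_capturing = re.search(r"(?:some|a few) people", text)

# Without any group, | splits the whole pattern into "some" and "a few people".
no_group = re.fullmatch(r"some|a few people", "some people")

print(capturing.groups())      # ('a few',)
print(non_capturing.groups())  # ()
print(non_capturing.group(0))  # a few people
print(no_group)                # None
```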
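DEF lookahead assertions look ahead in the text without advancing the match cursor.
(This no longer seems to be regular expression syntax in the formal-language sense.)
(?=pattern) returns true if pattern occurs. To understand it, keep the notion of a match cursor in mind. Put simply, what pattern matches does not become part of the result, and because this is only a look ahead, once the check completes the matching state of the string is the same as it was before pattern was tried.
(?!pattern) returns true if pattern does not occur.
/^(?!Volcano)[A-Za-z]+/ matches, at the start of a line, any word that does not begin with Volcano.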
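Both kinds of lookahead can be sketched in Python as follows. The word list and the "dollars" sentence are invented for illustration.

```python
import re

# Negative lookahead: reject anything that starts with "Volcano".
words = ["Volcano", "Volcanoes", "Island", "volcano"]
kept = [w for w in words if re.search(r"^(?!Volcano)[A-Za-z]+", w)]

# Positive lookahead: " dollars" must follow, but it is not consumed,
# so it does not become part of the match result.
price = re.search(r"[0-9]+(?= dollars)", "it costs 35 dollars")

print(kept)            # ['Island', 'volcano']
print(price.group(0))  # 35
```

The lowercase volcano is kept because the lookahead, like the rest of the pattern, is case-sensitive.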