- Very useful to use ranges in conjunction with the quantifiers ?, *, and +
- E.g. let’s find all words with at least three consecutive vowels
str_view(words, "[aeiou][aeiou][aeiou]+")
[79] │ b<eau>ty
[565] │ obv<iou>s
[644] │ prev<iou>s
[670] │ q<uie>t
[741] │ ser<iou>s
[915] │ var<iou>s
- Useful for parsing strings which are partitioned by letters/numbers
name_score <- c("Mary_92", "Pat_35", "Will_85")
str_view(name_score, "[a-zA-Z]+")
[1] │ <Mary>_92
[2] │ <Pat>_35
[3] │ <Will>_85
str_view(name_score, "[0-9]+")
[1] │ Mary_<92>
[2] │ Pat_<35>
[3] │ Will_<85>
- E.g. replace all names with John, scores with 100
name_score %>% str_replace("[a-zA-Z]+", "John") %>%
str_replace("[0-9]+", "100")
[1] "John_100" "John_100" "John_100"
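The same two patterns can also pull each piece out into its own vector. A minimal sketch, assuming stringr's str_extract() (not otherwise used in these slides); the variable names player and score are made up for illustration:

```r
library(stringr)

name_score <- c("Mary_92", "Pat_35", "Will_85")

# First run of letters in each string
player <- str_extract(name_score, "[a-zA-Z]+")

# First run of digits, converted from character to integer
score <- as.integer(str_extract(name_score, "[0-9]+"))
```

str_extract() returns only the first match per string, which is enough here because each string holds exactly one name and one score.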
Extracting variables
separate_wider_regex(): go from long to wide by splitting one string column into several columns using regex.
df <- tribble(
~str,
"<Sheryl>-F_34",
"<Kisha>-F_45",
"<Pat>-X_33",
"<Sharon>-F_38",
"<Penny>-F_58",
"<Justin>-M_41",
"<Patricia>-F_84",
)
- To extract data, construct a sequence of regexes that match each piece.
- If you want the contents of a piece to appear in the output, give it a name.
df %>% separate_wider_regex(
str,
patterns = c(
"<",
name = "[A-Za-z]+",
">-",
gender = ".",
"_",
age = "[0-9]+"))
# A tibble: 7 × 3
name gender age
<chr> <chr> <chr>
1 Sheryl F 34
2 Kisha F 45
3 Pat X 33
4 Sharon F 38
5 Penny F 58
6 Justin M 41
7 Patricia F 84
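Note that every extracted column, including age, comes back as character. A small sketch of converting it afterwards, assuming dplyr's mutate() is available; df_small and parsed are illustrative names, and only two rows are used to keep it short:

```r
library(tibble)
library(tidyr)
library(dplyr)

df_small <- tribble(
  ~str,
  "<Sheryl>-F_34",
  "<Pat>-X_33"
)

parsed <- df_small %>%
  separate_wider_regex(
    str,
    patterns = c(
      "<",
      name = "[A-Za-z]+",
      ">-",
      gender = ".",
      "_",
      age = "[0-9]+")) %>%
  mutate(age = as.integer(age))  # captured pieces are character until converted
```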
Escaping
- Since the characters “.”, “?”, “+”, “*” have special meanings in regex, we need escapes to match literal instances of these characters.
- In regex, a \ in front of a character denotes an escape.
- But to create a string with an actual \ in it, we also need a string escape, so the pattern is written with a double backslash \\:
str_view(c("abc", "a.c", "bef"), "a\\.c")
[2] │ <a.c>
str_view(c("a*rdvark", "*pple", "m*n"), "\\*")
[1] │ a<*>rdvark
[2] │ <*>pple
[3] │ m<*>n
- Recall that to represent backslash in a string, need to escape:
str_view("a\\b")
[1] │ a\b
- To match a literal backslash, the regex itself must be \\ (an escaped backslash).
- Each of those two backslashes must in turn be escaped in the R string, so we write four backslashes:
str_view("a\\b", "\\\\")
[1] │ a<\>b
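- A single backslash does not work, because R tries to read \. as a string escape before the regex engine ever sees it:
str_replace("mary.elizabeth", "\.", "-")
# Error: '\.' is an unrecognized escape in character string (<input>:1:33)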
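To see what the regex engine actually receives, writeLines() prints a string with its escapes resolved. The sketch below also uses R's raw-string syntax r"(...)" (available from R 4.0 and not otherwise used in these slides) as an alternative way to write the same pattern; the variable names are illustrative:

```r
library(stringr)

pattern <- "\\."
writeLines(pattern)  # prints the two characters the regex engine sees: \.

# Escaped string: two backslashes in the source, one in the actual pattern
escaped <- str_replace("mary.elizabeth", "\\.", "-")

# Raw string: backslashes inside r"(...)" are taken literally
raw <- str_replace("mary.elizabeth", r"(\.)", "-")
```

Both calls produce the same result; raw strings mainly help readability when a pattern contains many backslashes.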
Anchors
- By default: regex will match any part of a string.
- If you only want to match at beginning or end, you need to anchor:
^ indicates “starts with”
$ indicates “ends with”
str_view(fruit, "^a")
[1] │ <a>pple
[2] │ <a>pricot
[3] │ <a>vocado
str_view(fruit, "a$")
[4] │ banan<a>
[15] │ cherimoy<a>
[30] │ feijo<a>
[36] │ guav<a>
[56] │ papay<a>
[74] │ satsum<a>
- To force a regex to match only the full string, not a substring, anchor it with both ^ and $:
str_view(fruit, "apple")
[1] │ <apple>
[62] │ pine<apple>
str_view(fruit, "^apple$")
[1] │ <apple>
- Example: in every fruit name that starts with “a”, replace that leading “a” with an “o”
str_replace(fruit, "^a", "o")
[1] "opple" "opricot" "ovocado"
[4] "banana" "bell pepper" "bilberry"
[7] "blackberry" "blackcurrant" "blood orange"
[10] "blueberry" "boysenberry" "breadfruit"
...
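The effect of each anchor is easy to check on a small vector with str_detect(), which returns TRUE or FALSE per string. A sketch on made-up inputs rather than the fruit data:

```r
library(stringr)

x <- c("apple", "pineapple", "apple pie")

starts <- str_detect(x, "^apple")   # TRUE FALSE TRUE
ends   <- str_detect(x, "apple$")   # TRUE TRUE FALSE
whole  <- str_detect(x, "^apple$")  # TRUE FALSE FALSE
```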
Character sets
We already saw how we can construct sets with []: e.g. [abc] matches a single character that is an “a”, “b”, or “c”
We also saw how to use - to denote ranges, e.g. [a-z] for lowercase letters, [0-9] for digits
A few others:
\d matches any digit; \D matches anything that isn’t a digit.
\s matches any whitespace (e.g., space, tab, newline); \S matches anything that isn’t whitespace.
\w matches any “word” character, i.e. letters, digits, and the underscore; \W matches any “non-word” character.
Remember: to represent \ in a string, we need a double backslash.
x <- "abcd ABCD 12345 -!@#%."
str_view(x, "\\d+")
#> [1] │ abcd ABCD <12345> -!@#%.
str_view(x, "\\D+")
#> [1] │ <abcd ABCD >12345< -!@#%.>
str_view(x, "\\s+")
#> [1] │ abcd< >ABCD< >12345< >-!@#%.
str_view(x, "\\S+")
#> [1] │ <abcd> <ABCD> <12345> <-!@#%.>
str_view(x, "\\w+")
#> [1] │ <abcd> <ABCD> <12345> -!@#%.
str_view(x, "\\W+")
#> [1] │ abcd< >ABCD< >12345< -!@#%.>
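One detail worth checking by hand is that the underscore counts as a “word” character while the hyphen does not. A small sketch, assuming stringr's str_extract_all() (which returns a list with one element per input string); the input string is made up:

```r
library(stringr)

x <- "snake_case name-2"

# Runs of word characters: the underscore stays inside the first run
word_runs <- str_extract_all(x, "\\w+")[[1]]
# "snake_case" "name" "2"

# Runs of non-word characters: the space and the hyphen
nonword_runs <- str_extract_all(x, "\\W+")[[1]]
# " " "-"
```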
Anchors: boundaries of words
- You can specify the beginning or end of a word using \b
- This works by treating letters, digits, and the underscore as “word” characters, and everything else as “non-word” characters; \b matches at the boundary between the two
x <- c("summary(x)", "summarize(df)", "rowsum(x)", "sum(x)")
str_view(x, "sum")
[1] │ <sum>mary(x)
[2] │ <sum>marize(df)
[3] │ row<sum>(x)
[4] │ <sum>(x)
str_view(x, "\\bsum\\b")
[4] │ <sum>(x)
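Word boundaries matter most when replacing, since an unanchored pattern also rewrites the inside of longer words. A sketch on a made-up sentence, assuming stringr's str_replace_all() (the replace-every-match counterpart of str_replace()):

```r
library(stringr)

x <- "cat concatenate cat."

# Without boundaries, the "cat" inside "concatenate" is replaced too
loose <- str_replace_all(x, "cat", "dog")
# "dog condogenate dog."

# With \\b on both sides, only the standalone word is replaced
strict <- str_replace_all(x, "\\bcat\\b", "dog")
# "dog concatenate dog."
```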
Quantifiers
We already discussed ? (0 or 1 match), + (1+ matches), * (0+ matches)
colou?r: matches both the American (“color”) and British (“colour”) spellings
\d+: matches 1+ digits
\s?: matches 0 or 1 whitespace characters
Can specify the exact number of matches using {}:
{n} matches exactly n times.
{n,} matches at least n times.
{n,m} matches between n and m times.
- Words with >= 3 consecutive vowels?
str_view(words, "[aeiou]{3,}")
[79] │ b<eau>ty
[565] │ obv<iou>s
[644] │ prev<iou>s
[670] │ q<uie>t
[741] │ ser<iou>s
...
- Words with between 4 and 6 consecutive consonants (here, any character other than a, e, i, o, u, so “y” counts as a consonant):
str_view(words, "[^aeiou]{4,6}")
[45] │ a<pply>
[198] │ cou<ntry>
[424] │ indu<stry>
[830] │ su<pply>
[836] │ <syst>em
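Counted quantifiers are handy for fixed-width formats. A sketch with a made-up phone-number format (three digits, a hyphen, four digits), which also shows that quantifiers are greedy and take as many repetitions as the upper bound allows:

```r
library(stringr)

phones <- c("555-1234", "55-1234", "5555-1234")

# Exactly 3 digits, a hyphen, exactly 4 digits, and nothing else
valid <- str_detect(phones, "^\\d{3}-\\d{4}$")
# TRUE FALSE FALSE

# {2,3} prefers the longest allowed run, so it takes three of the five a's
greedy <- str_extract("aaaaa", "a{2,3}")
# "aaa"
```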
Order of operations in regex
- It is not immediately clear in which order regex operators are applied.
ab+: is this “a” and then 1+ “b”, or is it “ab” repeated 1+ times? (1st case)
^a|b$: match exactly the string “a” or the string “b”, OR: a string starting with “a” or a string ending with “b”? (2nd case)
- Generally: quantifiers (?, +, *) have high precedence, alternation | low.
- You can also introduce parentheses to be more explicit about what you want, similar to normal math.
str_view(words, "a(b+)") # same as `ab+`
[2] │ <ab>le
[3] │ <ab>out
[4] │ <ab>solute
[62] │ avail<ab>le
[66] │ b<ab>y
[452] │ l<ab>our
[648] │ prob<ab>le
[837] │ t<ab>le
str_view(words, "(^a)|(b$)") # same as `^a|b$`
[1] │ <a>
[2] │ <a>ble
[3] │ <a>bout
[4] │ <a>bsolute
[5] │ <a>ccept
[6] │ <a>ccount
[7] │ <a>chieve
[8] │ <a>cross
[9] │ <a>ct
[10] │ <a>ctive
[11] │ <a>ctual
[12] │ <a>dd
[13] │ <a>ddress
[14] │ <a>dmit
[15] │ <a>dvertise
[16] │ <a>ffect
[17] │ <a>fford
[18] │ <a>fter
[19] │ <a>fternoon
[20] │ <a>gain
... and 47 more
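Moving the parentheses changes the meaning, which is easiest to see side by side. A sketch on made-up inputs, using str_extract() and str_detect():

```r
library(stringr)

s <- "ababbb"

# Quantifier binds to the single character before it: "a" then 1+ "b"
one_b_run <- str_extract(s, "ab+")    # "ab"

# Parentheses make the quantifier apply to the whole group "ab"
ab_repeat <- str_extract(s, "(ab)+")  # "abab"

x <- c("a", "apple", "crab", "banana")

# Alternation binds last: starts with "a" OR ends with "b"
loose <- str_detect(x, "^a|b$")     # TRUE TRUE TRUE FALSE

# Grouping the alternation: the whole string is exactly "a" or "b"
strict <- str_detect(x, "^(a|b)$")  # TRUE FALSE FALSE FALSE
```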
Grouping and capturing with parentheses
- With parentheses, you can back-reference the text matched inside them, using \1 for the match in the first parens, \2 for the match in the second, etc.
- e.g. all fruits which have a repeated pair of letters.
- Pair of letters = “(..)”; back-ref: “\1”
str_view(fruit, "(..)\\1")
[4] │ b<anan>a
[20] │ <coco>nut
[22] │ <cucu>mber
[41] │ <juju>be
[56] │ <papa>ya
[73] │ s<alal> berry
- Words that start and end with same pair of letters:
# "starts with" a pair: ^(..)
# "ends with": need to end regex with \\1$
# to allow any chars between, put .* in middle
str_view(words, "^(..).*\\1$")
[152] │ <church>
[217] │ <decide>
[617] │ <photograph>
[699] │ <require>
[739] │ <sense>
- Words that are repetitions of the same pair of letters (capture the pair once, then repeat the back-reference with \\1+):
str_view(c("haha", "miumiu"), "^(..)\\1+$")
[1] │ <haha>
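To inspect what a group actually captured, stringr's str_match() returns a character matrix with the full match in the first column and each group in the following columns. The sketch below also illustrates one subtlety: when a captured group is itself repeated, \1 refers only to its last repetition, so the position of the quantifier matters. The input "abcdcd" is made up for illustration:

```r
library(stringr)

m <- str_match("banana", "(..)\\1")
full_match <- m[1, 1]  # "anan"
group_one  <- m[1, 2]  # "an"

# (..)+ can consume "ab" then "cd"; \\1 then refers to "cd" only,
# so a string that is not a pure repetition still matches
last_iteration <- str_detect("abcdcd", "^(..)+\\1$")  # TRUE

# Capturing once and repeating the back-reference requires every pair
# to equal the first one
pure_repeat <- str_detect("abcdcd", "^(..)\\1+$")     # FALSE
```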
More grouping and capturing
- Can also use back references in str_replace(), e.g. switching the second and third words in sentences:
sentences |>
str_replace(
"(\\w+) (\\w+) (\\w+)",
"\\1 \\3 \\2") |>
str_view()
[1] │ The canoe birch slid on the smooth planks.
[2] │ Glue sheet the to the dark blue background.
[3] │ It's to easy tell the depth of a well.
[4] │ These a days chicken leg is a rare dish.
[5] │ Rice often is served in round bowls.
[6] │ The of juice lemons makes fine punch.
[7] │ The was box thrown beside the parked truck.
...
(\\w+): matches 1+ “word characters” (letters, digits, underscore)
- The spaces between the (\\w+) groups ensure we are looking for sequences of the form: word-chars, space, word-chars, space, word-chars
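The same idea works for any reordering of captured pieces. A sketch on made-up “Last, First” names, swapping the two groups in the replacement:

```r
library(stringr)

people <- c("Doe, Jane", "Smith, Alan")

# Group 1 is the surname, group 2 the given name; emit them in reverse order
swapped <- str_replace(people, "(\\w+), (\\w+)", "\\2 \\1")
# "Jane Doe" "Alan Smith"
```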
Examples
- Words that start with “y”:
str_view(words, "^y")
[975] │ <y>ear
[976] │ <y>es
[977] │ <y>esterday
[978] │ <y>et
[979] │ <y>ou
...
- Words that don’t start with “y”:
str_view(words, "^[^y]")
[1] │ <a>
[2] │ <a>ble
[3] │ <a>bout
[4] │ <a>bsolute
[5] │ <a>ccept
...
- Ends with a vowel-vowel-consonant triplet:
str_view(words, "[aeiou]{2}[^aeiou]$")
[3] │ ab<out>
[11] │ act<ual>
[19] │ aftern<oon>
[20] │ ag<ain>
[26] │ <air>
...
- Words containing a run of seven or more letters:
str_view(words, "[a-z]{7,}")
[4] │ <absolute>
[6] │ <account>
[7] │ <achieve>
[13] │ <address>
[15] │ <advertise>
...
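When the matching elements themselves are wanted rather than a highlighted view, stringr's str_subset() returns them directly. The sketch below, on a made-up vector, also shows a small difference between the negated class ^[^y] and negating the result of ^y: the class still needs one character to match, so an empty string behaves differently.

```r
library(stringr)

x <- c("year", "bay", "yes", "")

# Keep only the elements that start with "y"
starts_y <- str_subset(x, "^y")
# "year" "yes"

# "Does not start with y", written two ways
by_class    <- str_detect(x, "^[^y]")  # FALSE TRUE FALSE FALSE
by_negation <- !str_detect(x, "^y")    # FALSE TRUE FALSE TRUE
```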
Boolean operations
- We already saw how ^ inside [] negates the set, e.g. words with no vowels:
str_view(words, "^[^aeiou]+$")
[123] │ <by>
[249] │ <dry>
[328] │ <fly>
[538] │ <mrs>
[895] │ <try>
...
- Another way: return vector of booleans indicating presence of vowels, then negate:
str_view(words[!str_detect(words, "[aeiou]")])
[1] │ by
[2] │ dry
[3] │ fly
[4] │ mrs
[5] │ try
...
- This is useful since there’s no “and” operator built into regex.
- e.g., find all words that contain an “a” and a “b”: trickier in standard regex, since the pattern must allow for either order,
str_view(words, "a.*b|b.*a")
[2] │ <ab>le
[3] │ <ab>out
[4] │ <ab>solute
[62] │ <availab>le
...
- Easier with str_detect() and &:
str_view(words[str_detect(words, "a")
& str_detect(words, "b")])
[1] │ able
[2] │ about
[3] │ absolute
...
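The negation and “and” approaches can be compared on a small made-up vector. This sketch assumes the negate argument of str_detect() (present in recent stringr versions), which is equivalent to applying ! to the result:

```r
library(stringr)

x <- c("able", "by", "dry", "cab", "kiwi")

# No vowels: negate outside the regex, in two equivalent ways
no_vowels     <- x[!str_detect(x, "[aeiou]")]
no_vowels_alt <- x[str_detect(x, "[aeiou]", negate = TRUE)]
# "by" "dry"

# Contains both "a" and "b": combined logical vectors vs. one regex
a_and_b   <- x[str_detect(x, "a") & str_detect(x, "b")]
one_regex <- x[str_detect(x, "a.*b|b.*a")]
# "able" "cab"
```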
More Boolean operations
- What if we wanted to find a word that contains “a”, “e”, “i”, and “o”?
- If we tried to use a single standard regex, we would need one alternative for each of the 24 possible orderings of the four letters.
- Much easier using str_detect() and &:
words[
str_detect(words, "a") &
str_detect(words, "e") &
str_detect(words, "i") &
str_detect(words, "o")
]
[1] "appropriate" "associate" "organize" "relation"
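If the list of required letters changes often, the chain of & can be folded programmatically. A minimal sketch: contains_all() is a hypothetical helper written for illustration, not a stringr function. It uses stringr's fixed() so each entry is matched literally, and base R's Reduce() to combine the logical vectors.

```r
library(stringr)

# Hypothetical helper: TRUE where a string contains every entry of `chars`
contains_all <- function(x, chars) {
  # One logical vector per required character
  hits <- lapply(chars, function(ch) str_detect(x, fixed(ch)))
  # Combine them element-wise with &
  Reduce(`&`, hits)
}

x <- c("appropriate", "apple", "relation")
has_aeio <- contains_all(x, c("a", "e", "i", "o"))
# TRUE FALSE TRUE
```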
Creating patterns with code
- What if we want all sentences which mention a color?
- Combine alternation with word boundaries \b:
str_view(sentences, "\\b(red|green|blue)\\b")
[2] │ Glue the sheet to the dark <blue> background.
[26] │ Two <blue> fish swam in the tank.
[92] │ A wisp of cloud hung in the <blue> air.
[148] │ The spot on the blotter was made by <green> ink.
[160] │ The sofa cushion is <red> and of light weight.
...
- But if we wanted to update this code to have more colors, it would be tedious to construct the pattern by hand.
- We can build up larger regexes using functions we have seen before.
match_colors <- c("red","green", "blue")
for_regex <- str_c(
"\\b(",
str_flatten(match_colors, "|"),
")\\b")
str_view(sentences, for_regex)
[2] │ Glue the sheet to the dark <blue> background.
[26] │ Two <blue> fish swam in the tank.
[92] │ A wisp of cloud hung in the <blue> air.
[148] │ The spot on the blotter was made by <green> ink.
[160] │ The sofa cushion is <red> and of light weight.
...
- Then we can change the list of colors simply by modifying match_colors.
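The pattern-building step can be wrapped in a small function so it is reusable for other word lists. A sketch: word_regex() is a hypothetical helper name, and it assumes the words contain no regex metacharacters (such as “.” or “+”), which would otherwise need escaping first.

```r
library(stringr)

# Hypothetical helper: whole-word alternation over a character vector.
# Assumes the entries are plain words without regex metacharacters.
word_regex <- function(words) {
  str_c("\\b(", str_flatten(words, "|"), ")\\b")
}

pattern <- word_regex(c("red", "green", "blue", "orange"))
# The resulting regex is \b(red|green|blue|orange)\b

hits <- str_detect(c("a red car", "bored", "orange peel"), pattern)
# TRUE FALSE TRUE  ("bored" contains "red" but not as a whole word)
```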