Regular Expressions

Data we will look at:
babynames (install with `install.packages("babynames")`): year/sex/name/number/proportion vars
stringr::fruit: 80 fruits
stringr::words: 980 common English words
stringr::sentences: 720 short sentences

We will use str_view(string, pattern = NULL) a lot. The pattern argument is interpreted as a regular expression (regex); only the strings that match are printed, with each match wrapped in <>.

str_view(fruit, "berry")

 [6] │ bil<berry>
 [7] │ black<berry>
[10] │ blue<berry>
[11] │ boysen<berry>
[19] │ cloud<berry>
[21] │ cran<berry>
[29] │ elder<berry>
...

Literal characters, metacharacters

Letters and numbers which match exactly are literal characters
Punctuation characters typically have special regex meanings (., +, *, etc), and are called metacharacters
The metacharacter . will match any character, so “a.” matches any string which contains “a” followed by another character.

str_view(c("a", "ab", "ae", "bd", "ea", "eab"), "a.")

[2] │ <ab>
[3] │ <ae>
[6] │ e<ab>

Or all fruits which have an “a”, then 3 letters, then an “e”:

str_view(fruit, "a...e")

 [1] │ <apple>
 [7] │ bl<ackbe>rry
[48] │ mand<arine>
[51] │ nect<arine>
...

Quantifiers

? makes a pattern optional (i.e. it matches 0 or 1 times)
+ lets a pattern repeat (i.e. it matches at least once)
* lets a pattern be optional or repeat (i.e. it matches any number of times, including 0).

# ab? matches an "a", optionally followed by "b".
str_view(c("a", "ab", "abb"), "ab?")

[1] │ <a>
[2] │ <ab>
[3] │ <ab>b

# ab+ matches an "a", followed by at least one "b".
str_view(c("a", "ab", "abb"), "ab+")

[2] │ <ab>
[3] │ <abb>

# ab* matches an "a", followed by any num of "b"s.
str_view(c("a", "ab", "abb"), "ab*")

[1] │ <a>
[2] │ <ab>
[3] │ <abb>

Character classes

Defined by [], character classes let you match any one character from a set (similar idea to %in%)
- [abcd] matches “a”, “b”, “c”, or “d”
- Can negate/invert by using ^: [^abcd] matches any character except “a”, “b”, “c”, “d”
e.g. any word containing “x” surrounded by vowels, or “y” surrounded by consonants
Alternation | picks between alternative patterns, e.g. fruits containing “apple”, “melon”, or “nut”, or fruits containing a repeated vowel

str_view(words, "[aeiou]x[aeoiu]")

[284] │ <exa>ct
[285] │ <exa>mple
[288] │ <exe>rcise
[289] │ <exi>st

str_view(words, "[^aeiou]y[^aeiou]")

[836] │ <sys>tem
[901] │ <typ>e

str_view(fruit, "apple|melon|nut")

 [1] │ <apple>
[13] │ canary <melon>
[20] │ coco<nut>
...

str_view(fruit, "aa|ee|ii|oo|uu")

 [9] │ bl<oo>d orange
[33] │ g<oo>seberry
[47] │ lych<ee>
[66] │ purple mangost<ee>n

`str_detect()`

str_detect(character_vector, pattern) returns a logical vector, TRUE if pattern matches element of vector and FALSE otherwise.

str_detect(c("a", "b", "c"), "[aeiou]")

[1]  TRUE FALSE FALSE

Since str_detect() returns a logical vector, it can be used with filter(), e.g. most popular names containing an “x”:

babynames

# A tibble: 1,924,665 × 5
    year sex   name          n   prop
   <dbl> <chr> <chr>     <int>  <dbl>
 1  1880 F     Mary       7065 0.0724
 2  1880 F     Anna       2604 0.0267
...

babynames |> 
  filter(str_detect(name, "x")) |> 
  count(name, wt = n, sort = TRUE)

# A tibble: 974 × 2
   name            n
   <chr>       <int>
 1 Alexander  665492
 2 Alexis     399551
 3 Alex       278705
 4 Alexandra  232223
 5 Max        148787
 6 Alexa      123032
...

`str_detect()`

You can also use str_detect() in conjunction with group_by(), summarize() etc.
- sum() will return number of strings which have pattern
- mean() will return proportion of strings which have pattern
E.g. proportion of names per year that have an “x”

babynames %>% 
  group_by(year) %>%
  summarize(prop_x = mean(str_detect(name, "x"))) %>%
  arrange(desc(prop_x))

# A tibble: 138 × 2
    year prop_x
   <dbl>  <dbl>
 1  2016 0.0163
 2  2017 0.0159
 3  2015 0.0154
 4  2014 0.0146
 5  2013 0.0145
 6  2012 0.0136
 7  2011 0.0130
 8  2010 0.0126
 9  2009 0.0118
10  2007 0.0108
# ℹ 128 more rows
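
To make the sum()/mean() idea concrete, here is a minimal sketch on a made-up vector of four names (the vector `nms` is hypothetical and not taken from babynames):

```r
library(stringr)

# Hypothetical toy vector, not taken from babynames
nms <- c("Alex", "Max", "Mary", "Anna")

has_x <- str_detect(nms, "x")  # one TRUE/FALSE per name

n_with_x <- sum(has_x)      # number of names containing "x"
prop_with_x <- mean(has_x)  # proportion of names containing "x"

n_with_x
prop_with_x
```

Note that in the babynames example each row counts once, regardless of n, so prop_x is a proportion of rows rather than a proportion of babies.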

Counting matches

str_count() tells how many matches there are in a string

x <- c("apple", "banana", "pear")
str_count(x, "p")

[1] 2 0 1

Regex matches never overlap - always start after the end of previous match

str_count("abababa", "aba")

[1] 2

str_view("abababa", "aba")

[1] │ <aba>b<aba>

Counting vowels and consonants in baby names

Can use str_count() with mutate(), e.g. to compute the number of vowels/consonants in baby names:

babynames %>%
  count(name) %>%
  mutate(
    vowels = str_count(name, "[aeiou]"),
    consonants = str_count(name, "[^aeiou]")
  )

# A tibble: 97,310 × 4
   name          n vowels consonants
   <chr>     <int>  <int>      <int>
 1 Aaban        10      2          3
 2 Aabha         5      2          3
 3 Aabid         2      2          3
 4 Aabir         1      2          3
 5 Aabriella     5      4          5
 6 Aada          1      2          2
 7 Aadam        26      2          3
 8 Aadan        11      2          3
 9 Aadarsh      17      2          5
10 Aaden        18      2          3
# ℹ 97,300 more rows

Note that pattern matching is case sensitive, so the leading “A” isn’t counted as a vowel (and [^aeiou] counts it as a consonant instead).
Ways around this:
- Add the upper case vowels to the character class:
  str_count(name, "[aeiouAEIOU]")
- Use str_to_lower() to convert the names to lower case: str_count(str_to_lower(name), "[aeiou]")

babynames %>% count(name) %>%  mutate(
    name = str_to_lower(name),
    vowels = str_count(name, "[aeiou]"),
    consonants = str_count(name, "[^aeiou]"))

# A tibble: 97,310 × 4
   name          n vowels consonants
   <chr>     <int>  <int>      <int>
 1 aaban        10      3          2
...
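
A third option, not used in the notes above, is to ask the regex engine to ignore case. This is a small sketch using stringr's regex() wrapper with ignore_case = TRUE; the three names are made up for illustration:

```r
library(stringr)

# regex() wraps a pattern so that matching options can be set explicitly
vowel <- regex("[aeiou]", ignore_case = TRUE)

# Hypothetical example names
n_vowels <- str_count(c("Aaban", "Eve", "Ann"), vowel)
n_vowels
```

This leaves the original capitalisation of the names untouched, which can be handy if the names are printed later.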

Replacing and removing values

str_replace(): replaces first match
str_replace_all() replace all matches

x <- c("apple", "pear", "banana")
(str_replace(x, "[aeiou]", "-"))

[1] "-pple"  "p-ar"   "b-nana"

str_replace_all(x, "[aeiou]", "-")

[1] "-ppl-"  "p--r"   "b-n-n-"

You can remove patterns by setting the replacement to "", or by using str_remove() / str_remove_all()

str_remove(x, "[aeiou]")

[1] "pple"  "par"   "bnana"

str_remove_all(x, "[aeiou]")

[1] "ppl" "pr"  "bnn"

Replacing characters and `?`, `*`, `+`

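The question mark ? matches the preceding element zero or one time.
The plus sign + matches the preceding element at least once.
The asterisk * matches the preceding element zero or more times.
Because ? and * are allowed to match zero characters, they also produce empty matches (shown as <>) wherever there is no “a”. str_remove() only removes the first match, which may be one of these empty matches.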

x <- c("apple", "aardvark", "happy", "haaahaha")
( str_view(x, "a*") )

[1] │ <a><>p<>p<>l<>e<>
[2] │ <aa><>r<>d<>v<a><>r<>k<>
[3] │ <>h<a><>p<>p<>y<>
[4] │ <>h<aaa><>h<a><>h<a><>

( str_remove(x, "a*") )

[1] "pple"     "rdvark"   "happy"    "haaahaha"

str_remove_all(x, "a*")

[1] "pple"  "rdvrk" "hppy"  "hhh"

( str_view(x, "a?") )

[1] │ <a><>p<>p<>l<>e<>
[2] │ <a><a><>r<>d<>v<a><>r<>k<>
[3] │ <>h<a><>p<>p<>y<>
[4] │ <>h<a><a><a><>h<a><>h<a><>

( str_remove(x, "a?") )

[1] "pple"     "ardvark"  "happy"    "haaahaha"

str_remove_all(x, "a?")

[1] "pple"  "rdvrk" "hppy"  "hhh"

Replacing characters and `?`, `*`, `+`

Compare ?, +, and *

x <- c("apple", "aardvark", "happy", "haaahaha")
(str_view(x, "a?"))

[1] │ <a><>p<>p<>l<>e<>
[2] │ <a><a><>r<>d<>v<a><>r<>k<>
[3] │ <>h<a><>p<>p<>y<>
[4] │ <>h<a><a><a><>h<a><>h<a><>

(str_view(x, "a+"))

[1] │ <a>pple
[2] │ <aa>rdv<a>rk
[3] │ h<a>ppy
[4] │ h<aaa>h<a>h<a>

(str_view(x, "a*"))

[1] │ <a><>p<>p<>l<>e<>
[2] │ <aa><>r<>d<>v<a><>r<>k<>
[3] │ <>h<a><>p<>p<>y<>
[4] │ <>h<aaa><>h<a><>h<a><>

Ranges of characters

Suppose you have a vector of strings, and you want to replace every lowercase letter between “a” and “u” with an “x”.
Instead of spelling out all of these letters manually, you can use the character class operator [] together with - to give a range. (To cover upper case as well, combine two ranges: [a-uA-U].)

An example with letters:

x <- c("happy", "ab", "zap", "war")
( str_view(x, "[a-u]") )

[1] │ <h><a><p><p>y
[2] │ <a><b>
[3] │ z<a><p>
[4] │ w<a><r>

str_replace_all(x, "[a-u]", "x")

[1] "xxxxy" "xx"    "zxx"   "wxx"

An example with numbers: replace all digits between 0 and 5 with x’s

x <- c("code9202", "apple2850", "0352")
(str_view(x, "[0-5]"))

[1] │ code9<2><0><2>
[2] │ apple<2>8<5><0>
[3] │ <0><3><5><2>

str_replace_all(x, "[0-5]", "x")

[1] "code9xxx"  "applex8xx" "xxxx"

Ranges of characters and `?`, `*`

Very useful to use ranges in conjunction with ?, *, +
E.g. let’s find all words with at least three consecutive vowels

str_view(words, "[aeiou][aeiou][aeiou]+")

 [79] │ b<eau>ty
[565] │ obv<iou>s
[644] │ prev<iou>s
[670] │ q<uie>t
[741] │ ser<iou>s
[915] │ var<iou>s

Useful for parsing strings which are partitioned by letters/numbers

name_score <- c("Mary_92", "Pat_35", "Will_85")
( str_view(name_score, "[a-zA-Z]+"))

[1] │ <Mary>_92
[2] │ <Pat>_35
[3] │ <Will>_85

str_view(name_score, "[0-9]+")

[1] │ Mary_<92>
[2] │ Pat_<35>
[3] │ Will_<85>

E.g. replace all names with John, scores with 100

name_score %>% str_replace("[a-zA-Z]+", "John") %>%
  str_replace("[0-9]+", "100")

[1] "John_100" "John_100" "John_100"
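
If the goal is to pull the pieces out rather than replace them, one possible approach is str_extract(), a stringr function not used elsewhere in these notes. A minimal sketch, reusing the name_score vector from above:

```r
library(stringr)

name_score <- c("Mary_92", "Pat_35", "Will_85")

# str_extract() returns the first match in each string (NA if there is none)
student <- str_extract(name_score, "[a-zA-Z]+")
score <- as.integer(str_extract(name_score, "[0-9]+"))

student
score
mean(score)
```

Converting with as.integer() is what makes numeric summaries such as mean() possible, since the extracted matches are character strings.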

Extracting variables

separate_wider_regex(): split one string column into several columns using regex.

df <- tribble(
  ~str,
  "<Sheryl>-F_34",
  "<Kisha>-F_45", 
  "<Pat>-X_33",
  "<Sharon>-F_38", 
  "<Penny>-F_58",
  "<Justin>-M_41", 
  "<Patricia>-F_84", 
)

To extract data, construct sequence of regex that match each piece.
If you want contents of that piece to appear in output, give it a name.

df %>%  separate_wider_regex(
    str,
    patterns = c(
      "<", 
      name = "[A-Za-z]+", 
      ">-", 
      gender = ".",
      "_",
      age = "[0-9]+"))

# A tibble: 7 × 3
  name     gender age  
  <chr>    <chr>  <chr>
1 Sheryl   F      34   
2 Kisha    F      45   
3 Pat      X      33   
4 Sharon   F      38   
5 Penny    F      58   
6 Justin   M      41   
7 Patricia F      84
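
Note that every extracted column, including age, comes back as character. The sketch below repeats the same pattern on a hypothetical two-row tibble (`df_small` is made up here) and converts age afterwards; it assumes tidyr 1.3.0 or later, where separate_wider_regex() is available:

```r
library(tibble)
library(tidyr)
library(dplyr)

# Hypothetical two-row version of the data above
df_small <- tibble(str = c("<Sheryl>-F_34", "<Justin>-M_41"))

parsed <- df_small %>%
  separate_wider_regex(
    str,
    patterns = c(
      "<",                 # unnamed: matched but dropped
      name = "[A-Za-z]+",  # named: becomes a column
      ">-",
      gender = ".",
      "_",
      age = "[0-9]+"
    )
  ) %>%
  mutate(age = as.integer(age))  # extracted pieces are character by default

parsed
```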

Escaping

Since the characters “.”, “?”, “+”, “*” have special meanings in regex, we need to escape them in order to match literal instances of these characters.
In regex, putting a \ in front of a metacharacter escapes it.
But \ is also the escape character in R strings, so to create a string that contains an actual \ we have to type \\:

str_view(c("abc", "a.c", "bef"), "a\\.c")

[2] │ <a.c>

str_view(c("a*rdvark", "*pple", "m*n"), "\\*")

[1] │ a<*>rdvark
[2] │ <*>pple
[3] │ m<*>n

Recall that to represent backslash in a string, need to escape:

str_view("a\\b")

[1] │ a\b

To match a literal backslash, the regex itself must be \\ (an escaped backslash).
Each of those two backslashes must in turn be typed as \\ in an R string, giving four backslashes in total.

str_view("a\\b", "\\\\")

[1] │ a<\>b

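With only a single backslash, R rejects the string itself before any regex matching happens, since \. is not a valid string escape:

str_replace("mary.elizabeth", "\.", "-")
# Error: '\.' is an unrecognized escape in character string (<input>:1:33)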
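
The difference between what is typed and what the string contains can be checked directly. The sketch below also shows two ways to avoid the doubled backslashes that are not covered in these notes: raw strings (assuming R 4.0.0 or later) and stringr's fixed(), which treats the pattern as literal text:

```r
library(stringr)

pattern <- "a\\.c"   # typed with two backslashes...
writeLines(pattern)  # ...but the string itself contains only one
n_chars <- nchar(pattern)  # counts a, \, ., c

# Raw strings (R >= 4.0.0) avoid the doubling
raw_pattern <- r"(a\.c)"
same_string <- identical(pattern, raw_pattern)

# Correctly escaped version of the failing call
regex_result <- str_replace("mary.elizabeth", "\\.", "-")

# fixed() matches the pattern literally, so no escaping is needed
fixed_result <- str_replace("mary.elizabeth", fixed("."), "-")
```

fixed() is a reasonable choice when the whole pattern is literal; escaping is still needed when a literal character sits inside a larger regex.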

Anchors

By default: regex will match any part of a string.
If you only want to match at beginning or end, you need to anchor:
- ^ indicates “starts with”
- $ indicates “ends with”

str_view(fruit, "^a")

[1] │ <a>pple
[2] │ <a>pricot
[3] │ <a>vocado

str_view(fruit, "a$")

 [4] │ banan<a>
[15] │ cherimoy<a>
[30] │ feijo<a>
[36] │ guav<a>
[56] │ papay<a>
[74] │ satsum<a>

To force a regex to match only the full string, not subsets, anchor it with both ^ and $:

str_view(fruit, "apple")

 [1] │ <apple>
[62] │ pine<apple>

str_view(fruit, "^apple$")

[1] │ <apple>

Example: replace every fruit name which starts with “a” with an “o”

str_replace(fruit, "^a", "o")

 [1] "opple"             "opricot"           "ovocado"          
 [4] "banana"            "bell pepper"       "bilberry"         
 [7] "blackberry"        "blackcurrant"      "blood orange"     
[10] "blueberry"         "boysenberry"       "breadfruit"       
...

Character sets

We already saw how we can construct sets with []: e.g. [abc] matches if any character is an “a”, “b”, or “c”
We also saw how to use - to denote ranges, e.g. [a-z] lowercase letters, [0-9] numbers
A few others:
- \d matches any digit; \D matches anything that isn’t a digit.
- \s matches any whitespace (e.g., space, tab, newline); \S matches anything that isn’t whitespace.
- \w matches any “word” character, i.e. letters, digits, and the underscore; \W matches any “non-word” character.
Remember: to represent \ in a string, need double backslash.

x <- "abcd ABCD 12345 -!@#%."
str_view(x, "\\d+")
#> [1] │ abcd ABCD <12345> -!@#%.
str_view(x, "\\D+")
#> [1] │ <abcd ABCD >12345< -!@#%.>
str_view(x, "\\s+")
#> [1] │ abcd< >ABCD< >12345< >-!@#%.
str_view(x, "\\S+")
#> [1] │ <abcd> <ABCD> <12345> <-!@#%.>
str_view(x, "\\w+")
#> [1] │ <abcd> <ABCD> <12345> -!@#%.
str_view(x, "\\W+")
#> [1] │ abcd< >ABCD< >12345< -!@#%.>

Anchors: boundaries of words

You can specify the beginning or end of the word using \b
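- This works by treating all letters, digits, and the underscore as “word” characters, and everything else as “non-word” characters; \b matches the position between a word and a non-word character (or the start/end of the string)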

x <- c("summary(x)", "summarize(df)", "rowsum(x)", "sum(x)")
str_view(x, "sum")

[1] │ <sum>mary(x)
[2] │ <sum>marize(df)
[3] │ row<sum>(x)
[4] │ <sum>(x)

str_view(x, "\\bsum\\b")

[4] │ <sum>(x)
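
One consequence of the definition of “word” characters is easy to miss: an underscore does not create a word boundary. A small sketch with made-up strings:

```r
library(stringr)

# Hypothetical strings that differ only in the separator before "sum"
x <- c("row sum", "row-sum", "row_sum", "rowsum")

# "_" is a word character, so there is no boundary between "_" and "s"
has_word_sum <- str_detect(x, "\\bsum\\b")
has_word_sum
```

So \b is a good fit for prose, but it may need adjusting for snake_case identifiers.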

Quantifiers

We already discussed ? (0 or 1 match), + (1+ matches), * (0+ matches)
- colou?r: matches both the American (color) and British (colour) spellings
- \d+: matches 1+ digits
- \s?: matches 0 or 1 whitespace characters
Can specify exact number of matches using {}:
- {n} matches exactly n times.
- {n,} matches at least n times.
- {n,m} matches between n and m times.

Words with >= 3 consecutive vowels?

str_view(words, "[aeiou]{3,}")

 [79] │ b<eau>ty
[565] │ obv<iou>s
[644] │ prev<iou>s
[670] │ q<uie>t
[741] │ ser<iou>s
...

Words with between 4 and 6 consecutive consonants:

str_view(words, "[^aeiou]{4,6}")

 [45] │ a<pply>
[198] │ cou<ntry>
[424] │ indu<stry>
[830] │ su<pply>
[836] │ <syst>em
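
The three {} forms can be compared side by side on a single string. This is a minimal sketch using str_extract() and str_count(); the test string is made up:

```r
library(stringr)

x <- "aaaaa"

exactly_two  <- str_extract(x, "a{2}")    # exactly 2
at_least_two <- str_extract(x, "a{2,}")   # 2 or more; greedy, so it takes as many as it can
two_to_three <- str_extract(x, "a{2,3}")  # between 2 and 3; greedy, so it prefers 3

# Matches never overlap, so five "a"s hold only two separate "aa" matches
n_pairs <- str_count(x, "a{2}")
```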

Order of operations in regex

It is not immediately clear in which order the regex engine applies different operators.
- ab+: is this “a” and then 1+ “b”, or is it “ab” repeated 1+ times? (1st case)
- ^a|b$: does this match exactly the string “a” or the string “b”, OR any string starting with “a” or ending with “b”? (2nd case)
Generally: quantifiers (?+*) have high precedence, alternation | low.

You can also introduce parentheses to be more explicit about what you want, similar to normal math.

str_view(words, "a(b+)") # same as `ab+`

  [2] │ <ab>le
  [3] │ <ab>out
  [4] │ <ab>solute
 [62] │ avail<ab>le
 [66] │ b<ab>y
[452] │ l<ab>our
[648] │ prob<ab>le
[837] │ t<ab>le

str_view(words, "(^a)|(b$)") # same as `^a|b$`

 [1] │ <a>
 [2] │ <a>ble
 [3] │ <a>bout
 [4] │ <a>bsolute
 [5] │ <a>ccept
 [6] │ <a>ccount
 [7] │ <a>chieve
 [8] │ <a>cross
 [9] │ <a>ct
[10] │ <a>ctive
[11] │ <a>ctual
[12] │ <a>dd
[13] │ <a>ddress
[14] │ <a>dmit
[15] │ <a>dvertise
[16] │ <a>ffect
[17] │ <a>fford
[18] │ <a>fter
[19] │ <a>fternoon
[20] │ <a>gain
... and 47 more
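
Parentheses can also change the meaning, not just make it explicit. The following sketch contrasts the default grouping with the alternative reading of each pattern; the test strings are made up:

```r
library(stringr)

# Quantifiers bind tighter than concatenation
b_repeated  <- str_extract("ababbb", "ab+")    # + applies to "b" only
ab_repeated <- str_extract("ababbb", "(ab)+")  # + applies to the group "ab"

# Alternation binds loosest
x <- c("a", "b", "apple", "crab", "kiwi")
starts_a_or_ends_b <- str_detect(x, "^a|b$")    # "^a" or "b$"
exactly_a_or_b     <- str_detect(x, "^(a|b)$")  # whole string is "a" or "b"
```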

Grouping and capturing with parentheses

With parentheses, you can back-reference matches that appeared in parens, using \1 for the match in the first parens, \2 for the match in the second, etc.
e.g. all fruits which have a repeated pair of letters.
- Pair of letters = “(..)”; back-ref: “\1” (typed as "\\1" in an R string)

str_view(fruit, "(..)\\1")

 [4] │ b<anan>a
[20] │ <coco>nut
[22] │ <cucu>mber
[41] │ <juju>be
[56] │ <papa>ya
[73] │ s<alal> berry

Words that start and end with same pair of letters:

# "starts with" a pair: ^(..)
# "ends with": need to end regex with \\1$
# to allow any chars between, put .* in middle
str_view(words, "^(..).*\\1$")

[152] │ <church>
[217] │ <decide>
[617] │ <photograph>
[699] │ <require>
[739] │ <sense>

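Words that consist only of one pair of letters repeated (the back-reference is the part that must repeat; in "^(..)+\\1$" the group could match a different pair on each repetition):

str_view(c("haha", "miumiu"), "^(..)\\1+$")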

[1] │ <haha>

More grouping and capturing

Can also use back references in str_replace(), e.g. switching second and third words in sentences:

sentences |> 
  str_replace(
    "(\\w+) (\\w+) (\\w+)", 
    "\\1 \\3 \\2") |> 
  str_view()

 [1] │ The canoe birch slid on the smooth planks.
 [2] │ Glue sheet the to the dark blue background.
 [3] │ It's to easy tell the depth of a well.
 [4] │ These a days chicken leg is a rare dish.
 [5] │ Rice often is served in round bowls.
 [6] │ The of juice lemons makes fine punch.
 [7] │ The was box thrown beside the parked truck.
...

(\\w+): matches 1+ “word characters” (letters, digits, underscore)
The spaces between the (\\w+) groups ensure we are looking for sequences of the form: word-chars, space, word-chars, space, word-chars. str_replace() only rewrites the first such sequence in each sentence.
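
The same technique works for any fixed layout. As a smaller sketch, here is a made-up vector of "Last, First" names being reordered:

```r
library(stringr)

# Hypothetical input in "Last, First" form
people <- c("Doe, Jane", "Smith, Will")

# \\1 is the last name, \\2 the first name; the replacement swaps them
swapped <- str_replace(people, "(\\w+), (\\w+)", "\\2 \\1")
swapped
```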

Examples

Words that start with “y”:

str_view(words, "^y")

[975] │ <y>ear
[976] │ <y>es
[977] │ <y>esterday
[978] │ <y>et
[979] │ <y>ou
...

Words that don’t start with “y”:

str_view(words, "^[^y]")

 [1] │ <a>
 [2] │ <a>ble
 [3] │ <a>bout
 [4] │ <a>bsolute
 [5] │ <a>ccept
...

Ends with a vowel-vowel-consonant triplet:

str_view(words, "[aeiou]{2}[^aeiou]$")

  [3] │ ab<out>
 [11] │ act<ual>
 [19] │ aftern<oon>
 [20] │ ag<ain>
 [26] │ <air>
...

Has 7 or more letters:

str_view(words, "[a-z]{7,}")

 [4] │ <absolute>
 [6] │ <account>
 [7] │ <achieve>
[13] │ <address>
[15] │ <advertise>
...

Boolean operations

We already saw how ^ inside [] negates the set, e.g. to find words with no vowels:

str_view(words, "^[^aeiou]+$")

[123] │ <by>
[249] │ <dry>
[328] │ <fly>
[538] │ <mrs>
[895] │ <try>
...

Another way: return vector of booleans indicating presence of vowels, then negate:

str_view(words[!str_detect(words, "[aeiou]")])

[1] │ by
[2] │ dry
[3] │ fly
[4] │ mrs
[5] │ try
...
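
stringr also offers a negate argument that does the same job as !. The sketch below shows three equivalent spellings on a made-up vector; str_subset() is a stringr function not used elsewhere in these notes:

```r
library(stringr)

# Hypothetical toy vector
x <- c("by", "dry", "apple", "fly", "pear")

no_vowels_a <- x[!str_detect(x, "[aeiou]")]                # negate with !
no_vowels_b <- x[str_detect(x, "[aeiou]", negate = TRUE)]  # negate inside str_detect()
no_vowels_c <- str_subset(x, "[aeiou]", negate = TRUE)     # subset in one step
```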

This is useful since there’s no “and” operator built into regex.
e.g., finding all words that contain both an “a” and a “b” is trickier in a single regex, since both orders have to be spelled out:

str_view(words, "a.*b|b.*a")

  [2] │ <ab>le
  [3] │ <ab>out
  [4] │ <ab>solute
 [62] │ <availab>le
...

Easier with str_detect() and &:

str_view(words[str_detect(words, "a") 
               & str_detect(words, "b")])

 [1] │ able
 [2] │ about
 [3] │ absolute
...

Boolean operations

What if we wanted to find a word that contains “a”, “e”, “i”, and “o”?
If we tried to use standard regex, this would be very complex.
Much easier using str_detect() and &:

words[
  str_detect(words, "a") &
  str_detect(words, "e") &
  str_detect(words, "i") &
  str_detect(words, "o") 
]

[1] "appropriate" "associate"   "organize"    "relation"

Creating patterns with code

What if we want all sentences which mention a color?
- Combine alternation with word boundaries \b:

str_view(sentences, "\\b(red|green|blue)\\b")

  [2] │ Glue the sheet to the dark <blue> background.
 [26] │ Two <blue> fish swam in the tank.
 [92] │ A wisp of cloud hung in the <blue> air.
[148] │ The spot on the blotter was made by <green> ink.
[160] │ The sofa cushion is <red> and of light weight.
...

But if we wanted to update this code to cover more colors, it would be pretty tedious to construct the pattern by hand.
We can build up larger regexes using functions we have seen before.

match_colors <- c("red","green", "blue")
for_regex <- str_c(
  "\\b(", 
  str_flatten(match_colors, "|"), 
  ")\\b")
str_view(sentences, for_regex)

  [2] │ Glue the sheet to the dark <blue> background.
 [26] │ Two <blue> fish swam in the tank.
 [92] │ A wisp of cloud hung in the <blue> air.
[148] │ The spot on the blotter was made by <green> ink.
[160] │ The sofa cushion is <red> and of light weight.
...

Then we can easily modify the list of colors by simply modifying match_colors.
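
One way to package this is a small helper function. The sketch below is an illustration only: `words_to_regex()` is a hypothetical name, the test sentences are made up, and it assumes stringr 1.5.0 or later for str_escape(), which guards against list entries that happen to contain regex metacharacters:

```r
library(stringr)

# Hypothetical helper: turn a vector of literal words into one regex
words_to_regex <- function(x) {
  # str_escape() (stringr >= 1.5.0) escapes regex metacharacters in x
  str_c("\\b(", str_flatten(str_escape(x), "|"), ")\\b")
}

color_regex <- words_to_regex(c("red", "green", "blue", "yellow"))
color_regex

# Made-up sentences; "bored" contains "red" but not as a whole word
test_sentences <- c("A yellow bird sang.", "He was bored.", "The sky is blue.")
detected <- str_detect(test_sentences, color_regex)
detected
```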
