• Data we will look at:
  • babynames (install with `install.packages("babynames")`): variables year, sex, name, n (number), prop (proportion)
  • stringr::fruit: 80 fruits
  • stringr::words: 980 common English words
  • stringr::sentences: 720 short sentences

We will use str_view(string, pattern = NULL) a lot. Its pattern argument is interpreted as a regular expression (regex), and each match is shown wrapped in <>.

str_view(fruit, "berry")
 [6] │ bil<berry>
 [7] │ black<berry>
[10] │ blue<berry>
[11] │ boysen<berry>
[19] │ cloud<berry>
[21] │ cran<berry>
[29] │ elder<berry>
...

Literal characters, metacharacters

  • Letters and numbers which match exactly are literal characters
  • Punctuation characters typically have special regex meanings (., +, *, etc), and are called metacharacters
  • The metacharacter . will match any character, so “a.” matches any string which contains “a” followed by another character.
str_view(c("a", "ab", "ae", "bd", "ea", "eab"), "a.")
[2] │ <ab>
[3] │ <ae>
[6] │ e<ab>
  • Or all fruits which have an “a”, then 3 letters, then an “e”:
str_view(fruit, "a...e")
 [1] │ <apple>
 [7] │ bl<ackbe>rry
[48] │ mand<arine>
[51] │ nect<arine>
...
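  • A minimal sketch on a made-up vector (not from the slides) showing that . must consume exactly one character, so "ct" and "coat" do not match "c.t":

```r
library(stringr)

# Hypothetical toy vector, chosen only to illustrate the "." metacharacter
x <- c("cat", "cot", "ct", "coat")

# "c.t" needs a "c", then exactly one character of any kind, then a "t"
dot_hits <- str_detect(x, "c.t")
dot_hits
# TRUE TRUE FALSE FALSE
```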

Quantifiers

  • ? makes a pattern optional (i.e. it matches 0 or 1 times)
  • + lets a pattern repeat (i.e. it matches at least once)
  • * lets a pattern be optional or repeat (i.e. it matches any number of times, including 0).
# ab? matches an "a", optionally followed by "b".
str_view(c("a", "ab", "abb"), "ab?")
[1] │ <a>
[2] │ <ab>
[3] │ <ab>b
# ab+ matches an "a", followed by at least one "b".
str_view(c("a", "ab", "abb"), "ab+")
[2] │ <ab>
[3] │ <abb>
# ab* matches an "a", followed by any num of "b"s.
str_view(c("a", "ab", "abb"), "ab*")
[1] │ <a>
[2] │ <ab>
[3] │ <abb>
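  • As a further illustration, here is a small sketch that applies the three quantifiers to the "u" in a spelling pattern. The vector of spellings is invented for the example:

```r
library(stringr)

# Hypothetical spellings; only the first two are real words
spellings <- c("color", "colour", "colr", "colouur")

# u? : zero or one "u"
q_optional <- str_detect(spellings, "colou?r")  # TRUE TRUE FALSE FALSE
# u+ : one or more "u"
q_plus <- str_detect(spellings, "colou+r")      # FALSE TRUE FALSE TRUE
# u* : any number of "u", including none
q_star <- str_detect(spellings, "colou*r")      # TRUE TRUE FALSE TRUE
```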

Character classes

  • Defined by [], let you match from a set of characters (similar idea to %in%)
    • [abcd] matches any one of “a”, “b”, “c”, or “d”
    • Can negate/invert by using ^: [^abcd] matches any character except “a”, “b”, “c”, “d”
  • e.g. any word containing “x” surrounded by vowels, or “y” surrounded by consonants
  • alternation | picks between alternative patterns, e.g. words containing “apple”, “melon”, or “nut”; repeated vowels
str_view(words, "[aeiou]x[aeoiu]")
[284] │ <exa>ct
[285] │ <exa>mple
[288] │ <exe>rcise
[289] │ <exi>st
str_view(words, "[^aeiou]y[^aeiou]")
[836] │ <sys>tem
[901] │ <typ>e
str_view(fruit, "apple|melon|nut")
 [1] │ <apple>
[13] │ canary <melon>
[20] │ coco<nut>
...
str_view(fruit, "aa|ee|ii|oo|uu")
 [9] │ bl<oo>d orange
[33] │ g<oo>seberry
[47] │ lych<ee>
[66] │ purple mangost<ee>n
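  • A small sketch contrasting a character class, its negation, and alternation on one made-up vector:

```r
library(stringr)

# Hypothetical inputs for illustration
x <- c("gray", "grey", "groy", "gry")

# One character from the set {a, e} between "gr" and "y"
in_class <- str_detect(x, "gr[ae]y")       # TRUE TRUE FALSE FALSE
# One character that is NOT "a" or "e" between "gr" and "y"
not_in_class <- str_detect(x, "gr[^ae]y")  # FALSE FALSE TRUE FALSE
# Alternation spells out the whole alternatives instead
either <- str_detect(x, "gray|grey")       # TRUE TRUE FALSE FALSE
```

    Note that "gry" fails the negated class too: [^ae] still has to consume one character.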

str_detect()

  • str_detect(character_vector, pattern) returns a logical vector, TRUE if pattern matches element of vector and FALSE otherwise.
str_detect(c("a", "b", "c"), "[aeiou]")
[1]  TRUE FALSE FALSE
  • Since it returns a logical vector, str_detect() can be used with filter(), e.g. most popular names containing an “x”:
babynames
# A tibble: 1,924,665 × 5
    year sex   name          n   prop
   <dbl> <chr> <chr>     <int>  <dbl>
 1  1880 F     Mary       7065 0.0724
 2  1880 F     Anna       2604 0.0267
...
babynames |> 
  filter(str_detect(name, "x")) |> 
  count(name, wt = n, sort = TRUE)
# A tibble: 974 × 2
   name            n
   <chr>       <int>
 1 Alexander  665492
 2 Alexis     399551
 3 Alex       278705
 4 Alexandra  232223
 5 Max        148787
 6 Alexa      123032
...

str_detect()

  • You can also use str_detect() in conjunction with group_by(), summarize() etc.
    • sum() will return number of strings which have pattern
    • mean() will return proportion of strings which have pattern
  • E.g. proportion of names per year that have an “x”
babynames %>% 
  group_by(year) %>%
  summarize(prop_x = mean(str_detect(name, "x"))) %>%
  arrange(desc(prop_x))
# A tibble: 138 × 2
    year prop_x
   <dbl>  <dbl>
 1  2016 0.0163
 2  2017 0.0159
 3  2015 0.0154
 4  2014 0.0146
 5  2013 0.0145
 6  2012 0.0136
 7  2011 0.0130
 8  2010 0.0126
 9  2009 0.0118
10  2007 0.0108
# ℹ 128 more rows
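  • The sum()/mean() idea is easier to check by hand on a tiny vector. A minimal sketch, with names made up for the example:

```r
library(stringr)

# Hypothetical toy vector of names
toy_names <- c("Alex", "Max", "Mary", "Anna")

# Logical vector: TRUE where the name contains an "x"
has_x <- str_detect(toy_names, "x")

n_with_x <- sum(has_x)      # TRUE counts as 1, so this is the number of matches: 2
prop_with_x <- mean(has_x)  # share of TRUEs: 0.5
```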

Counting matches

  • str_count() tells how many matches there are in a string
x <- c("apple", "banana", "pear")
str_count(x, "p")
[1] 2 0 1
  • Regex matches never overlap - always start after the end of previous match
str_count("abababa", "aba")
[1] 2
str_view("abababa", "aba")
[1] │ <aba>b<aba>
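  • To see where the non-overlapping matches fall, str_locate_all() reports their start and end positions. A short sketch on a made-up string:

```r
library(stringr)

# "aaaa" contains "aa" at positions 1-2, 2-3 and 3-4,
# but matching resumes after each match, so only two are found
n_matches <- str_count("aaaa", "aa")  # 2

# One matrix per input string, with columns "start" and "end"
positions <- str_locate_all("aaaa", "aa")[[1]]
positions
```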

Counting vowels and consonants in baby names

  • Can use str_count() with mutate, i.e. computing number of vowels/consonants in baby names:
babynames %>%
  count(name) %>%
  mutate(
    vowels = str_count(name, "[aeiou]"),
    consonants = str_count(name, "[^aeiou]")
  )
# A tibble: 97,310 × 4
   name          n vowels consonants
   <chr>     <int>  <int>      <int>
 1 Aaban        10      2          3
 2 Aabha         5      2          3
 3 Aabid         2      2          3
 4 Aabir         1      2          3
 5 Aabriella     5      4          5
 6 Aada          1      2          2
 7 Aadam        26      2          3
 8 Aadan        11      2          3
 9 Aadarsh      17      2          5
10 Aaden        18      2          3
# ℹ 97,300 more rows
  • Note that pattern matching is case sensitive, so the leading “A” isn’t counted as a vowel; [^aeiou] counts it as a consonant instead.
  • Ways around this:
    • Add the upper case vowels to the character class:
      str_count(name, "[aeiouAEIOU]")
    • Use str_to_lower() to convert the names to lower case: str_count(str_to_lower(name), "[aeiou]")
babynames %>% count(name) %>%  mutate(
    name = str_to_lower(name),
    vowels = str_count(name, "[aeiou]"),
    consonants = str_count(name, "[^aeiou]"))
# A tibble: 97,310 × 4
   name          n vowels consonants
   <chr>     <int>  <int>      <int>
 1 aaban        10      3          2
...
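  • A third option, not shown in the slides above, is to leave the names untouched and ask stringr to ignore case through regex(ignore_case = TRUE). A minimal sketch:

```r
library(stringr)

# Case-sensitive by default: the leading "A" is missed
default_count <- str_count("Aaban", "[aeiou]")  # 2

# regex() wraps the pattern and switches on case-insensitive matching
ignore_case_count <- str_count("Aaban", regex("[aeiou]", ignore_case = TRUE))  # 3
```

    This keeps the original capitalization in the name column, which str_to_lower() inside mutate() would overwrite.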

Replacing and removing values

  • str_replace(): replaces the first match
  • str_replace_all(): replaces all matches
x <- c("apple", "pear", "banana")
(str_replace(x, "[aeiou]", "-"))
[1] "-pple"  "p-ar"   "b-nana"
str_replace_all(x, "[aeiou]", "-")
[1] "-ppl-"  "p--r"   "b-n-n-"
  • You can remove patterns by setting the replacement to "", or by using str_remove() / str_remove_all()
str_remove(x, "[aeiou]")
[1] "pple"  "par"   "bnana"
str_remove_all(x, "[aeiou]")
[1] "ppl" "pr"  "bnn"
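  • A quick check, as a sketch, that replacing with the empty string and removing give the same result:

```r
library(stringr)

x <- c("apple", "pear", "banana")

# Replace every vowel with the empty string ...
via_replace <- str_replace_all(x, "[aeiou]", "")
# ... or remove every vowel directly
via_remove <- str_remove_all(x, "[aeiou]")

identical(via_replace, via_remove)  # TRUE
```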

Replacing characters and ?, *, +

  • The question mark ? matches the preceding element zero or one time.
  • The plus sign + matches the preceding element at least once.
  • The asterisk * matches the preceding element zero or more times.
  • Because ? and * may match zero times, they also produce empty matches (shown as <>) at every position where there is no “a”.
x <- c("apple", "aardvark", "happy", "haaahaha")
( str_view(x, "a*") )
[1] │ <a><>p<>p<>l<>e<>
[2] │ <aa><>r<>d<>v<a><>r<>k<>
[3] │ <>h<a><>p<>p<>y<>
[4] │ <>h<aaa><>h<a><>h<a><>
( str_remove(x, "a*") )
[1] "pple"     "rdvark"   "happy"    "haaahaha"
str_remove_all(x, "a*")
[1] "pple"  "rdvrk" "hppy"  "hhh"  
( str_view(x, "a?") )
[1] │ <a><>p<>p<>l<>e<>
[2] │ <a><a><>r<>d<>v<a><>r<>k<>
[3] │ <>h<a><>p<>p<>y<>
[4] │ <>h<a><a><a><>h<a><>h<a><>
( str_remove(x, "a?") )
[1] "pple"     "ardvark"  "happy"    "haaahaha"
str_remove_all(x, "a?")
[1] "pple"  "rdvrk" "hppy"  "hhh"  

Replacing characters and ?, *, +

  • Compare ?, +, and *
x <- c("apple", "aardvark", "happy", "haaahaha")
(str_view(x, "a?"))
[1] │ <a><>p<>p<>l<>e<>
[2] │ <a><a><>r<>d<>v<a><>r<>k<>
[3] │ <>h<a><>p<>p<>y<>
[4] │ <>h<a><a><a><>h<a><>h<a><>
(str_view(x, "a+"))
[1] │ <a>pple
[2] │ <aa>rdv<a>rk
[3] │ h<a>ppy
[4] │ h<aaa>h<a>h<a>
(str_view(x, "a*"))
[1] │ <a><>p<>p<>l<>e<>
[2] │ <aa><>r<>d<>v<a><>r<>k<>
[3] │ <>h<a><>p<>p<>y<>
[4] │ <>h<aaa><>h<a><>h<a><>
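  • str_extract() returns the first match as a string, which makes the difference between the three quantifiers easy to compare. A small sketch:

```r
library(stringr)

# First match of each pattern at the start of "aardvark"
first_optional <- str_extract("aardvark", "a?")  # "a"  (at most one "a")
first_plus <- str_extract("aardvark", "a+")      # "aa" (as many as possible, at least one)
first_star <- str_extract("aardvark", "a*")      # "aa" (as many as possible, possibly none)

# With +, the first match can start later in the string
later_plus <- str_extract("happy", "a+")         # "a"
```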

Ranges of characters

  • Suppose you have a vector of strings, and you want to do the following modifications:
    • If the string has a lowercase letter between “a” and “u”, replace it with an “x”
  • Instead of spelling out manually what all of these letters are, you can use the character class operator [] together with -

An example with letters:

x <- c("happy", "ab", "zap", "war")
( str_view(x, "[a-u]") )
[1] │ <h><a><p><p>y
[2] │ <a><b>
[3] │ z<a><p>
[4] │ w<a><r>
str_replace_all(x, "[a-u]", "x")
[1] "xxxxy" "xx"    "zxx"   "wxx"  

An example with numbers: replace all digits between 0 and 5 with x’s

x <- c("code9202", "apple2850", "0352")
(str_view(x, "[0-5]"))
[1] │ code9<2><0><2>
[2] │ apple<2>8<5><0>
[3] │ <0><3><5><2>
str_replace_all(x, "[0-5]", "x")
[1] "code9xxx"  "applex8xx" "xxxx"     
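  • Ranges are case sensitive like everything else, so [a-u] leaves capital letters alone. A sketch (inputs invented for the example) that puts two ranges in one class to cover both cases:

```r
library(stringr)

# Hypothetical mixed-case inputs
x <- c("Happy", "ZAP")

# Lowercase range only: capitals are untouched
lower_only <- str_replace_all(x, "[a-u]", "x")     # "Hxxxy" "ZAP"

# Two ranges in one class: a-u and A-U
both_cases <- str_replace_all(x, "[a-uA-U]", "x")  # "xxxxy" "Zxx"
```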

Ranges of characters and ?, *

  • Very useful to use ranges in conjunction with ?, *, +
  • E.g. let’s find all words with at least three consecutive vowels
str_view(words, "[aeiou][aeiou][aeiou]+")
 [79] │ b<eau>ty
[565] │ obv<iou>s
[644] │ prev<iou>s
[670] │ q<uie>t
[741] │ ser<iou>s
[915] │ var<iou>s
  • Useful for parsing strings which are partitioned by letters/numbers
name_score <- c("Mary_92", "Pat_35", "Will_85")
( str_view(name_score, "[a-zA-Z]+")) 
[1] │ <Mary>_92
[2] │ <Pat>_35
[3] │ <Will>_85
str_view(name_score, "[0-9]+")
[1] │ Mary_<92>
[2] │ Pat_<35>
[3] │ Will_<85>
  • E.g. replace all names with John, scores with 100
name_score %>% str_replace("[a-zA-Z]+", "John") %>%
  str_replace("[0-9]+", "100")
[1] "John_100" "John_100" "John_100"
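  • The same two patterns can pull the pieces out instead of replacing them. A minimal sketch using str_extract(); the variable names student and score are made up here:

```r
library(stringr)

name_score <- c("Mary_92", "Pat_35", "Will_85")

# First run of letters in each string
student <- str_extract(name_score, "[a-zA-Z]+")          # "Mary" "Pat" "Will"
# First run of digits, converted from character to integer
score <- as.integer(str_extract(name_score, "[0-9]+"))  # 92 35 85
```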

Extracting variables

  • separate_wider_regex(): split one string column into several columns using regex.
df <- tribble(
  ~str,
  "<Sheryl>-F_34",
  "<Kisha>-F_45", 
  "<Pat>-X_33",
  "<Sharon>-F_38", 
  "<Penny>-F_58",
  "<Justin>-M_41", 
  "<Patricia>-F_84", 
)
  • To extract data, construct sequence of regex that match each piece.
  • If you want contents of that piece to appear in output, give it a name.
df %>%  separate_wider_regex(
    str,
    patterns = c(
      "<", 
      name = "[A-Za-z]+", 
      ">-", 
      gender = ".",
      "_",
      age = "[0-9]+"))
# A tibble: 7 × 3
  name     gender age  
  <chr>    <chr>  <chr>
1 Sheryl   F      34   
2 Kisha    F      45   
3 Pat      X      33   
4 Sharon   F      38   
5 Penny    F      58   
6 Justin   M      41   
7 Patricia F      84   
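  • Every extracted column comes back as character (note the <chr> under age). A sketch of one way to follow up, assuming a two-row version of the data and a conversion with mutate():

```r
library(tibble)
library(tidyr)
library(dplyr)

# Shortened, hypothetical version of the data frame
df_small <- tribble(
  ~str,
  "<Sheryl>-F_34",
  "<Justin>-M_41"
)

out <- df_small %>%
  separate_wider_regex(
    str,
    patterns = c(
      "<",
      name = "[A-Za-z]+",  # named pieces become columns
      ">-",
      gender = ".",
      "_",
      age = "[0-9]+"
    )
  ) %>%
  mutate(age = as.integer(age))  # character -> integer
```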

Escaping

  • Since the characters “.”, “?”, “+”, “*” have special meanings in regex, we need escapes to match literal instances of these characters
  • In regex, a \ in front of a metacharacter escapes it
  • But \ is also the escape character in R strings, so to create a string containing an actual \ we need to write a double \\:
str_view(c("abc", "a.c", "bef"), "a\\.c")
[2] │ <a.c>
str_view(c("a*rdvark", "*pple", "m*n"), "\\*")
[1] │ a<*>rdvark
[2] │ <*>pple
[3] │ m<*>n
  • Recall that to represent backslash in a string, need to escape:
str_view("a\\b")
[1] │ a\b
  • To match a literal backslash, the regex itself must be \\ (an escaped backslash).
  • Each of those two backslashes must in turn be escaped in the R string, so we write four: "\\\\".
str_view("a\\b", "\\\\")
[1] │ a<\>b
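# A single backslash fails: "\." is not a valid R string escape, so R raises an error before stringr is called
str_replace("mary.elizabeth", "\.", "-")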
# Error: '\.' is an unrecognized escape in character string (<input>:1:33)
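  • A sketch of two working alternatives to the failing call: double the backslash, or wrap the pattern in stringr's fixed() so it is compared literally rather than as a regex. The unescaped version is included for contrast:

```r
library(stringr)

# Escaped dot: the regex is \. and the R string is "\\."
escaped <- str_replace("mary.elizabeth", "\\.", "-")       # "mary-elizabeth"

# fixed() turns off regex interpretation, so "." is just a dot
literal <- str_replace("mary.elizabeth", fixed("."), "-")  # "mary-elizabeth"

# Unescaped dot: matches any character, so the first letter is replaced
unescaped <- str_replace("mary.elizabeth", ".", "-")       # "-ary.elizabeth"
```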

Anchors

  • By default: regex will match any part of a string.
  • If you only want to match at beginning or end, you need to anchor:
    • ^ indicates “starts with”
    • $ indicates “ends with”
str_view(fruit, "^a")
[1] │ <a>pple
[2] │ <a>pricot
[3] │ <a>vocado
str_view(fruit, "a$")
 [4] │ banan<a>
[15] │ cherimoy<a>
[30] │ feijo<a>
[36] │ guav<a>
[56] │ papay<a>
[74] │ satsum<a>
  • To force a regex to match only the full string, not subsets, anchor it with both ^ and $:
str_view(fruit, "apple")
 [1] │ <apple>
[62] │ pine<apple>
str_view(fruit, "^apple$")
[1] │ <apple>
  • Example: for every fruit name which starts with “a”, replace that leading “a” with an “o”
str_replace(fruit, "^a", "o")
 [1] "opple"             "opricot"           "ovocado"          
 [4] "banana"            "bell pepper"       "bilberry"         
 [7] "blackberry"        "blackcurrant"      "blood orange"     
[10] "blueberry"         "boysenberry"       "breadfruit"       
...
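  • The four anchor combinations side by side, as a sketch on a made-up vector:

```r
library(stringr)

# Hypothetical inputs for illustration
x <- c("apple", "pineapple", "apple pie")

anywhere <- str_detect(x, "apple")     # TRUE TRUE TRUE
at_start <- str_detect(x, "^apple")    # TRUE FALSE TRUE
at_end <- str_detect(x, "apple$")      # TRUE TRUE FALSE
whole <- str_detect(x, "^apple$")      # TRUE FALSE FALSE
```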

Character sets

  • We already saw how we can construct sets with []: e.g. [abc] matches if any character is an “a”, “b”, or “c”

  • We also saw how to use - to denote ranges, e.g. [a-z] lowercase letters, [0-9] numbers

  • A few others:

    • \d matches any digit; \D matches anything that isn’t a digit.
    • \s matches any whitespace (e.g., space, tab, newline); \S matches anything that isn’t whitespace.
    • \w matches any “word” character, i.e. letters, digits, and underscore; \W matches any “non-word” character.
  • Remember: to represent \ in a string, need double backslash.

x <- "abcd ABCD 12345 -!@#%."
str_view(x, "\\d+")
#> [1] │ abcd ABCD <12345> -!@#%.
str_view(x, "\\D+")
#> [1] │ <abcd ABCD >12345< -!@#%.>
str_view(x, "\\s+")
#> [1] │ abcd< >ABCD< >12345< >-!@#%.
str_view(x, "\\S+")
#> [1] │ <abcd> <ABCD> <12345> <-!@#%.>
str_view(x, "\\w+")
#> [1] │ <abcd> <ABCD> <12345> -!@#%.
str_view(x, "\\W+")
#> [1] │ abcd< >ABCD< >12345< -!@#%.>
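  • A few hedged usage sketches for these shorthand classes; the input strings are invented for the example:

```r
library(stringr)

# \d+ : pull out every run of digits
digit_runs <- str_extract_all("Order 66 shipped 2024-01-05", "\\d+")[[1]]
# "66" "2024" "01" "05"

# \w also matches the underscore, so the whole identifier is one match
first_word <- str_extract("snake_case name", "\\w+")  # "snake_case"

# \s+ : collapse any run of spaces/tabs into a single space
squeezed <- str_replace_all("a  b \t c", "\\s+", " ")  # "a b c"
```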

Anchors: boundaries of words

  • You can specify the beginning or end of the word using \b
    • This works by treating all letters, digits, and underscores as “word” characters, and everything else as “non-word” characters
x <- c("summary(x)", "summarize(df)", "rowsum(x)", "sum(x)")
str_view(x, "sum")
[1] │ <sum>mary(x)
[2] │ <sum>marize(df)
[3] │ row<sum>(x)
[4] │ <sum>(x)
str_view(x, "\\bsum\\b")
[4] │ <sum>(x)
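  • Word boundaries matter most when replacing. A minimal sketch that renames sum but leaves summary and rowsum alone; the replacement name total is made up:

```r
library(stringr)

code <- "sum(x) + summary(x) + rowsum(x)"

# Only a standalone "sum" has a boundary on both sides
renamed <- str_replace_all(code, "\\bsum\\b", "total")
renamed
# "total(x) + summary(x) + rowsum(x)"
```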

Quantifiers

  • We already discussed ? (0 or 1 match), + (1+ matches), * (0+ matches)

    • colou?r: matches American and British English
    • \d+: matches 1+ digits
    • \s?: matches 0 or 1 whitespace characters
  • Can specify exact number of matches using {}:

    • {n} matches exactly n times.
    • {n,} matches at least n times.
    • {n,m} matches between n and m times.
  • Words with >= 3 consecutive vowels?
str_view(words, "[aeiou]{3,}")
 [79] │ b<eau>ty
[565] │ obv<iou>s
[644] │ prev<iou>s
[670] │ q<uie>t
[741] │ ser<iou>s
...
  • Words with between 4 and 6 consecutive consonants:
str_view(words, "[^aeiou]{4,6}")
 [45] │ a<pply>
[198] │ cou<ntry>
[424] │ indu<stry>
[830] │ su<pply>
[836] │ <syst>em
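  • The three {} forms compared on a made-up vector of digit strings. The anchors ^ and $ are added in this sketch so that the count applies to the whole string:

```r
library(stringr)

# Hypothetical inputs with 2, 3, 4 and 5 digits
x <- c("12", "123", "1234", "12345")

exactly_3 <- str_detect(x, "^\\d{3}$")     # FALSE TRUE FALSE FALSE
at_least_3 <- str_detect(x, "^\\d{3,}$")   # FALSE TRUE TRUE TRUE
from_3_to_4 <- str_detect(x, "^\\d{3,4}$") # FALSE TRUE TRUE FALSE
```

    Without the anchors, "\\d{3}" would also match inside "12345", since any three consecutive digits are enough.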

Order of operations in regex

  • Not immediately clear in which order R processes different operators.
    • ab+: is this “a” and then 1+ “b”, or is it “ab” repeated 1+ times? (1st case)
    • ^a|b$: match the string “a” or the string “b”, OR: string starting with “a” or string ending with “b” (2nd case)
  • Generally: quantifiers (?+*) have high precedence, alternation | low.
  • You can also introduce parentheses to be more explicit about what you want, similar to normal math.
str_view(words, "a(b+)") # same as `ab+`
  [2] │ <ab>le
  [3] │ <ab>out
  [4] │ <ab>solute
 [62] │ avail<ab>le
 [66] │ b<ab>y
[452] │ l<ab>our
[648] │ prob<ab>le
[837] │ t<ab>le
str_view(words, "(^a)|(b$)") # same as `^a|b$`
 [1] │ <a>
 [2] │ <a>ble
 [3] │ <a>bout
 [4] │ <a>bsolute
 [5] │ <a>ccept
 [6] │ <a>ccount
 [7] │ <a>chieve
 [8] │ <a>cross
 [9] │ <a>ct
[10] │ <a>ctive
[11] │ <a>ctual
[12] │ <a>dd
[13] │ <a>ddress
[14] │ <a>dmit
[15] │ <a>dvertise
[16] │ <a>ffect
[17] │ <a>fford
[18] │ <a>fter
[19] │ <a>fternoon
[20] │ <a>gain
... and 47 more
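  • Parentheses can also change the meaning, not just restate it. A sketch of the two alternative readings discussed above, on made-up inputs:

```r
library(stringr)

# Quantifier binds to "b" only, unless "ab" is grouped
b_repeated <- str_extract("ababbb", "ab+")     # "ab"
ab_repeated <- str_extract("ababbb", "(ab)+")  # "abab"

# Hypothetical inputs for the anchor/alternation case
x <- c("a", "b", "about", "grab", "cat")

# Starts with "a" OR ends with "b"
loose <- str_detect(x, "^a|b$")     # TRUE TRUE TRUE TRUE FALSE
# The whole string is exactly "a" or exactly "b"
strict <- str_detect(x, "^(a|b)$")  # TRUE TRUE FALSE FALSE FALSE
```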

Grouping and capturing with parentheses

  • With parentheses, you can back-reference earlier matches, using \1 for the match in the first pair of parentheses, \2 for the match in the second, etc.
  • e.g. all fruits which have repeated pair of letters.
    • Pair of letters = “(..)”; back-ref: “\1”
str_view(fruit, "(..)\\1")
 [4] │ b<anan>a
[20] │ <coco>nut
[22] │ <cucu>mber
[41] │ <juju>be
[56] │ <papa>ya
[73] │ s<alal> berry
  • Words that start and end with same pair of letters:
# "starts with" a pair: ^(..)
# "ends with": need to end regex with \\1$
# to allow any chars between, put .* in middle
str_view(words, "^(..).*\\1$")
[152] │ <church>
[217] │ <decide>
[617] │ <photograph>
[699] │ <require>
[739] │ <sense>
  • Words that are repetitions of the same pair of letters:
str_view(c("haha", "miumiu"), "^(..)+\\1$")
[1] │ <haha>
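  • The size of the group decides what "repeated" means. A small sketch comparing a repeated pair with a doubled single letter:

```r
library(stringr)

x <- c("banana", "coconut", "apple")

# (..)\1 : some two characters, immediately followed by the same two
repeated_pair <- str_detect(x, "(..)\\1")   # TRUE TRUE FALSE

# (.)\1 : some single character, immediately followed by itself
doubled_letter <- str_detect(x, "(.)\\1")   # FALSE FALSE TRUE
```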

More grouping and capturing

  • Can also use back references in str_replace(), e.g. switching second and third words in sentences:
sentences |> 
  str_replace(
    "(\\w+) (\\w+) (\\w+)", 
    "\\1 \\3 \\2") |> 
  str_view()
 [1] │ The canoe birch slid on the smooth planks.
 [2] │ Glue sheet the to the dark blue background.
 [3] │ It's to easy tell the depth of a well.
 [4] │ These a days chicken leg is a rare dish.
 [5] │ Rice often is served in round bowls.
 [6] │ The of juice lemons makes fine punch.
 [7] │ The was box thrown beside the parked truck.
...
  • (\\w+): matches 1+ “word characters” (letters, digits, underscore)
  • Spacing between (\\w+) ensures we are looking for sequences of the form: word-chars, space, word-chars, space, word-chars
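  • The same technique on a shorter, made-up string: a sketch that swaps "Last, First" into "First Last" with two groups.

```r
library(stringr)

# Group 1 is the word before the comma, group 2 the word after it
swapped <- str_replace("Smith, John", "(\\w+), (\\w+)", "\\2 \\1")
swapped
# "John Smith"
```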

Examples

  • Words that start with “y”:
str_view(words, "^y")
[975] │ <y>ear
[976] │ <y>es
[977] │ <y>esterday
[978] │ <y>et
[979] │ <y>ou
...
  • Words that don’t start with “y”:
str_view(words, "^[^y]")
 [1] │ <a>
 [2] │ <a>ble
 [3] │ <a>bout
 [4] │ <a>bsolute
 [5] │ <a>ccept
...
  • Ends with a vowel-vowel-consonant triplet:
str_view(words, "[aeiou]{2}[^aeiou]$")
  [3] │ ab<out>
 [11] │ act<ual>
 [19] │ aftern<oon>
 [20] │ ag<ain>
 [26] │ <air>
...
  • Has 7 or more letters:
str_view(words, "[a-z]{7,}")
 [4] │ <absolute>
 [6] │ <account>
 [7] │ <achieve>
[13] │ <address>
[15] │ <advertise>
...

Boolean operations

  • We already saw how ^ inside [] negates the set, e.g. words with no vowels:
str_view(words, "^[^aeiou]+$")
[123] │ <by>
[249] │ <dry>
[328] │ <fly>
[538] │ <mrs>
[895] │ <try>
...
  • Another way: return vector of booleans indicating presence of vowels, then negate:
str_view(words[!str_detect(words, "[aeiou]")])
[1] │ by
[2] │ dry
[3] │ fly
[4] │ mrs
[5] │ try
...
  • This is useful since there’s no “and” operator built into regex.
  • e.g., find all words that contain an “a” and a “b”: trickier in standard regex,
str_view(words, "a.*b|b.*a")
  [2] │ <ab>le
  [3] │ <ab>out
  [4] │ <ab>solute
 [62] │ <availab>le
...
  • Easier with str_detect() and &:
str_view(words[str_detect(words, "a") 
               & str_detect(words, "b")])
 [1] │ able
 [2] │ about
 [3] │ absolute
...

Boolean operations

  • What if we wanted to find a word that contains “a”, “e”, “i”, and “o”?
  • If we tried to use standard regex, this would be very complex.
  • Much easier using str_detect() and &:
words[
  str_detect(words, "a") &
  str_detect(words, "e") &
  str_detect(words, "i") &
  str_detect(words, "o") 
]
[1] "appropriate" "associate"   "organize"    "relation"   
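  • When the list of required letters grows, the chain of & can be generated instead of typed. A sketch using base R's lapply() and Reduce() on a made-up vector; the names required, checks and has_all are hypothetical:

```r
library(stringr)

# Hypothetical inputs for illustration
x <- c("appropriate", "banana", "organize", "tree")
required <- c("a", "e", "i", "o")

# One logical vector per required letter
checks <- lapply(required, function(letter) str_detect(x, letter))

# Combine them element-wise with &
has_all <- Reduce(`&`, checks)

x[has_all]
# "appropriate" "organize"
```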

Creating patterns with code

  • What if we want all sentences which mention a color?
    • Combine alternation with word boundaries \b:
str_view(sentences, "\\b(red|green|blue)\\b")
  [2] │ Glue the sheet to the dark <blue> background.
 [26] │ Two <blue> fish swam in the tank.
 [92] │ A wisp of cloud hung in the <blue> air.
[148] │ The spot on the blotter was made by <green> ink.
[160] │ The sofa cushion is <red> and of light weight.
...
  • But if we wanted to update this code to have more colors, would be pretty tedious to construct pattern by hand.
  • We can build up larger regex’s using functions we have seen before.
match_colors <- c("red","green", "blue")
for_regex <- str_c(
  "\\b(", 
  str_flatten(match_colors, "|"), 
  ")\\b")
str_view(sentences, for_regex)
  [2] │ Glue the sheet to the dark <blue> background.
 [26] │ Two <blue> fish swam in the tank.
 [92] │ A wisp of cloud hung in the <blue> air.
[148] │ The spot on the blotter was made by <green> ink.
[160] │ The sofa cushion is <red> and of light weight.
...
  • Then we can easily modify the list of colors by simply modifying match_colors.
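  • One way to package this, sketched here under the assumption that the terms contain no regex metacharacters, is a small helper function. The name build_word_regex is made up for the example:

```r
library(stringr)

# Hypothetical helper: whole-word alternation over a character vector.
# Assumes the terms are plain words with no characters that need escaping.
build_word_regex <- function(terms) {
  str_c("\\b(", str_flatten(terms, "|"), ")\\b")
}

pattern <- build_word_regex(c("red", "green", "blue", "orange"))

# "bored" contains "red" but not as a whole word, so it does not match
hits <- str_detect(c("a red car", "bored now", "orange peel"), pattern)
hits
# TRUE FALSE TRUE
```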
