Transformations of Strings

class: center, middle, inverse, title-slide

.title[
# Transformations of Strings
]
.subtitle[
## <br><br> STA35B: Statistical Data Science 2
]
.author[
### Spencer Frei
]

---

We'll focus on transforming strings.

```r
library(tidyverse)
library(babynames)
```
We'll primarily be working with `stringr`, whose functions all start with `str_`.

---

### Creating strings

* Can create strings using either `'` or `"` - single or double quotes (`"` preferred by Hadley Wickham)
* If you want quotes within your string, use `'` on outside and `"` on inside (or reverse)

```r
string1 <- "example of a string"
string2 <- 'this string has a "quote" inside of it'
```

* In RStudio editor, if you highlight text and then press `'` or `"`, it puts quotes around it
* If you forget to close a quote, the console will print `+` and wait for you to complete the string
  * This can be confusing: everything you type is treated as more of the string until the quote is closed. Press Escape to cancel and start over.

```r
> "This is a string without a closing quote
+ 
+ 
+ more text
```

---

### Escapes
* If you want to include a literal single or double quote in a string, use `\` to escape it.
  * This is what R is implicitly doing when you put quotes inside of strings.

```r
string2 <- 'this string has a "quote" inside of it'
string3 <- "this string has a \"quote\" inside of it"
string2 == string3
```

```
[1] TRUE
```

* Another special character you need to escape: `\`, using `\\`.

```r
x <- c('\'', "\"", "\\")
```

There are other special characters.

---


### Other special characters

* In addition to `\"`, `\'`, `\\`, there is:
  * `\n`: new line
  * `\t`: tab
  * `\u` or `\U`: unicode characters
* Base R function `writeLines()` prints the raw contents of a string, similar to `stringr::str_view()`

```r
x <- c("one\ntwo", "one\ttwo", "\u00b5", "\U0001f604")
writeLines(x)
```

```
one
two
one	two
µ
😄
```
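
To see the difference between a string's printed representation and its actual contents, here is a small sketch using only base R (the vector `esc` is a hypothetical example, not from the slides):

```r
# Hypothetical example vector: a newline, a backslash, and a double quote
esc <- c("one\ntwo", "a\\b", "\"")

print(esc)       # printed representation: escapes are shown
writeLines(esc)  # raw contents: escapes are interpreted

# Each escape sequence counts as a single character
n_chars <- nchar(esc)
n_chars
```

The character counts make the point: `"a\\b"` is typed with four characters between the quotes but holds only three.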

---

### Examples
* Creating a string with value `He said "That's amazing!"`:

```r
x <- 'He said "That\'s amazing!"'
y <- "He said \"That's amazing!\""
writeLines(x)
```

```
He said "That's amazing!"
```

```r
writeLines(y)
```

```
He said "That's amazing!"
```
* ... with value `\a\b\c\d`

```r
x <- "\\a\\b\\c\\d"
writeLines(x)
```

```
\a\b\c\d
```
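
When a string needs many escapes, raw strings can be easier to read. A minimal sketch, assuming R 4.0.0 or later, where everything between `r"(` and `)"` is taken literally (the variable names are illustrative):

```r
# Raw strings (assumes R >= 4.0.0): contents between r"( and )" are literal
raw_quote <- r"(He said "That's amazing!")"
raw_slash <- r"(\a\b\c\d)"

writeLines(raw_quote)
writeLines(raw_slash)

# These hold the same values as the escaped versions
same_quote <- identical(raw_quote, "He said \"That's amazing!\"")
same_slash <- identical(raw_slash, "\\a\\b\\c\\d")
```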

---

### Creating strings from data
* We'll now go over ways to create new strings from tibbles.
* There are many functions which work well with `dplyr`

#### `str_c()`
* Similar to `paste0()` in base R, but friendlier for `dplyr`: it obeys the tidyverse rules for recycling and propagating missing values.
* Concatenates any number of vectors and returns a character vector

```r
( str_c("x", "y") )
```

```
[1] "xy"
```

```r
( str_c("x", "y", "z") )
```

```
[1] "xyz"
```

```r
( str_c("Hello ", c("John", "Susan")))
```

```
[1] "Hello John"  "Hello Susan"
```

---

### `str_c()` vs `paste0()`
.pull-left[ 
* Compare `str_c()` and `paste0()`:

```r
df <- tibble(name = c("Flora", "David", "Terra", NA))
df %>% 
  mutate(greeting = str_c("Hi ", name, "!"))
```

```
# A tibble: 4 × 2
  name  greeting 
  <chr> <chr>    
1 Flora Hi Flora!
2 David Hi David!
3 Terra Hi Terra!
4 <NA>  <NA>     
```

]

.pull-right[

```r
df %>% 
  mutate(greeting = paste0("Hi ", name, "!"))
```

```
# A tibble: 4 × 2
  name  greeting 
  <chr> <chr>    
1 Flora Hi Flora!
2 David Hi David!
3 Terra Hi Terra!
4 <NA>  Hi NA!   
```
]
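
If you would rather show a placeholder than propagate `NA`, one option is `dplyr::coalesce()`, which replaces missing values with a fallback. A small sketch (the fallback strings `"you"` and `"Hi!"` are arbitrary choices for illustration):

```r
library(dplyr)
library(stringr)
library(tibble)

df <- tibble(name = c("Flora", "David", "Terra", NA))

greetings <- df %>%
  mutate(
    # Replace the missing name before concatenating
    greeting1 = str_c("Hi ", coalesce(name, "you"), "!"),
    # Or replace the missing result after concatenating
    greeting2 = coalesce(str_c("Hi ", name, "!"), "Hi!")
  )
greetings
```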

---

### `str_glue()`
.pull-left[ 
* If you're mixing many fixed strings with string variables, all the `"`s make the code hard to read
* `str_glue()` works like Python's f-strings: anything inside `{}` is evaluated as R code rather than treated as literal text:

```r
df <- tibble(name = c("Flora", "David", "Terra", NA))
df %>%
  mutate(greeting = str_glue("Hi {name}!"))
```

```
# A tibble: 4 × 2
  name  greeting 
  <chr> <glue>   
1 Flora Hi Flora!
2 David Hi David!
3 Terra Hi Terra!
4 <NA>  Hi NA!   
```

* Note that by default `str_glue()` converts `NA` to the literal string `"NA"`, which is inconsistent with `str_c()`. Setting `.na = NULL` makes missing values propagate, matching `str_c()`:
]

.pull-right[

```r
df %>% 
  mutate(greeting = str_glue("Hi {name}!", .na=NULL))
```

```
# A tibble: 4 × 2
  name  greeting 
  <chr> <glue>   
1 Flora Hi Flora!
2 David Hi David!
3 Terra Hi Terra!
4 <NA>  <NA>     
```
* For literal `{` or `}`, use double `{{` or `}}`:

```r
df %>% 
  mutate(greeting = str_glue("Hi {{{name}}}!", .na=NULL))
```

```
# A tibble: 4 × 2
  name  greeting   
  <chr> <glue>     
1 Flora Hi {Flora}!
...
```
]
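
Because the contents of `{}` are evaluated as R code, they can be full expressions, not just variable names. A brief sketch (the vector `name` is example data):

```r
library(stringr)

name <- c("Flora", "David")

# Any R expression can appear inside the braces
msg <- str_glue("{name} has {str_length(name)} letters")
msg
```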

---

#### `str_flatten()`

.pull-left[

* `str_c()` and `str_glue()` operate over vectors and return a vector of the same length as their inputs.
* This is useful for `mutate()` but not for `summarize()`, where we want to take a vector and return a single string, e.g. the concatenation of all strings in a group.

```r
( str_flatten(c("x", "y", "z")) )
```

```
[1] "xyz"
```

```r
( str_flatten(c("x", "y", "z"), ", "))
```

```
[1] "x, y, z"
```

```r
( str_flatten(c("x", "y", "z"), ", ", last = ", and ") )
```

```
[1] "x, y, and z"
```

]

.pull-right[

.pull-left[ 
* Allows for easy computation of gluing together strings per group:

```r
df
```

```
# A tibble: 5 × 2
  name    fruit     
  <chr>   <chr>     
1 Carmen  banana    
2 Carmen  apple     
3 Marvin  nectarine 
4 Terence cantaloupe
5 Terence papaya    
```
]
.pull-right[

```r
df %>%
  group_by(name) %>%
  summarize(fruits = str_flatten(fruit, ", "))
```

```
# A tibble: 3 × 2
  name    fruits            
  <chr>   <chr>             
1 Carmen  banana, apple     
2 Marvin  nectarine         
3 Terence cantaloupe, papaya
```

]
]
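
The slide prints `df` without showing how it was built. Here is a sketch that reconstructs a matching tibble with `tribble()` (named `fruit_df` here; its contents are an assumption based on the printed output) and then flattens per group:

```r
library(dplyr)
library(stringr)
library(tibble)

# Reconstructed from the printed output; the original definition is not shown
fruit_df <- tribble(
  ~name,     ~fruit,
  "Carmen",  "banana",
  "Carmen",  "apple",
  "Marvin",  "nectarine",
  "Terence", "cantaloupe",
  "Terence", "papaya"
)

fruit_summary <- fruit_df %>%
  group_by(name) %>%
  summarize(fruits = str_flatten(fruit, ", "))  # one string per group
fruit_summary
```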

---

### Extracting data from strings
.pull-left[ 
* We'll focus on four useful tidyr functions for extracting data from strings:

```r
    df |> separate_longer_delim(col, delim)
    df |> separate_longer_position(col, width)
    df |> separate_wider_delim(col, delim, names)
    df |> separate_wider_position(col, widths)
```

* `_longer` creates new rows / collapses columns to make df longer
* `_wider` creates new columns / collapses rows to make df wider
* `delim` splits up a string with a delimiter like ", " or " "
* `position` splits at specified widths of the string, like `c(3,5,2)`

]

.pull-right[

```r
( df1 <- tibble(x = c("a,b,c", "d,e", "f")) ) 
```

```
# A tibble: 3 × 1
  x    
  <chr>
1 a,b,c
2 d,e  
3 f    
```

```r
df1 %>% 
  separate_longer_delim(x, delim = ",")
```

```
# A tibble: 6 × 1
  x    
  <chr>
1 a    
2 b    
3 c    
4 d    
5 e    
6 f    
```

]

---

### `separate_longer_position()`
.pull-left[ 
* Less common, but sometimes you might have a dataset where each character in a value records a value itself, e.g. if you record all grades for each student in a single continuous string:

```r
df2 <- tibble(name = c("Mary", "Sam", "Bill"), grades = c("ABBA", "AAC", "CD"))
df2
```

```
# A tibble: 3 × 2
  name  grades
  <chr> <chr> 
1 Mary  ABBA  
2 Sam   AAC   
3 Bill  CD    
```
]
.pull-right[

```r
df2 %>% separate_longer_position(grades, width = 1)
```

```
# A tibble: 9 × 2
  name  grades
  <chr> <chr> 
1 Mary  A     
2 Mary  B     
3 Mary  B     
4 Mary  A     
5 Sam   A     
6 Sam   A     
7 Sam   C     
8 Bill  C     
9 Bill  D     
```
]

---

.pull-left[ 
* Compare `delim` vs `position` based on different formatting:

```r
df3
```

```
# A tibble: 3 × 2
  name  grades 
  <chr> <chr>  
1 Mary  A,B,B,A
2 Sam   A,A,C  
3 Bill  C,D    
```

```r
df3 %>% separate_longer_delim(grades, delim = ",")
```

```
# A tibble: 9 × 2
  name  grades
  <chr> <chr> 
1 Mary  A     
2 Mary  B     
3 Mary  B     
4 Mary  A     
5 Sam   A     
6 Sam   A     
7 Sam   C     
8 Bill  C     
9 Bill  D     
```
]
.pull-right[

```r
df2
```

```
# A tibble: 3 × 2
  name  grades
  <chr> <chr> 
1 Mary  ABBA  
2 Sam   AAC   
3 Bill  CD    
```

```r
df2 %>% separate_longer_position(grades, width = 1)
```

```
# A tibble: 9 × 2
  name  grades
  <chr> <chr> 
1 Mary  A     
2 Mary  B     
3 Mary  B     
4 Mary  A     
5 Sam   A     
6 Sam   A     
7 Sam   C     
8 Bill  C     
9 Bill  D     
```

]
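
`df3` is printed but never defined on the slide. Here is a sketch with a definition inferred from the printed output, checking that both layouts lead to the same long result:

```r
library(tidyr)
library(tibble)

# df3 is inferred from the printed output; df2 matches the earlier slide
df3 <- tibble(name = c("Mary", "Sam", "Bill"), grades = c("A,B,B,A", "A,A,C", "C,D"))
df2 <- tibble(name = c("Mary", "Sam", "Bill"), grades = c("ABBA", "AAC", "CD"))

by_delim    <- df3 |> separate_longer_delim(grades, delim = ",")
by_position <- df2 |> separate_longer_position(grades, width = 1)

# Both approaches produce the same grades, one per row
same_grades <- identical(by_delim$grades, by_position$grades)
same_grades
```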

---

### Separating into columns (wider)
.pull-left[
* Slightly more complicated than `longer` as we need to name the columns we are creating
* Consider following tibble:

```r
df4
```

```
# A tibble: 3 × 1
  x         
  <chr>     
1 a10.1.2022
2 b10.2.2011
3 e15.1.2015
```
* `x` has a code, edition number, and year, separated by "."
* To separate, need to supply delimiter and names of new columns
]
.pull-right[

```r
df4 |> 
  separate_wider_delim(
    x,
    delim = ".",
    names = c("code", "edition", "year")
  )
```

```
# A tibble: 3 × 3
  code  edition year 
  <chr> <chr>   <chr>
1 a10   1       2022 
2 b10   2       2011 
3 e15   1       2015 
```

* If you want to remove one of the output columns, can supply `NA` for name of column

```r
df4 |>  separate_wider_delim(x, delim = ".",
    names = c("code", NA, "year"))
```
]
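
`df4` is also printed without a definition, and the `NA` example is shown without output. Here is a sketch with `df4` reconstructed from the printed output, so the result of dropping the middle piece can be inspected:

```r
library(tidyr)
library(tibble)

# Reconstructed from the printed output
df4 <- tibble(x = c("a10.1.2022", "b10.2.2011", "e15.1.2015"))

no_edition <- df4 |>
  separate_wider_delim(
    x,
    delim = ".",                   # matched as a fixed string, not a regular expression
    names = c("code", NA, "year")  # NA drops the middle piece
  )
no_edition
```

Only `code` and `year` remain; the edition number is parsed but discarded.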

---

### `separate_wider_position()`
.pull-left[ 
* We now need to supply two things: name of each column, and the width (=number of characters) per column
* We do this using a named integer vector, each name = name of new column, value = number of characters

```r
df5 <- tibble(x = c("202215TX", "202122LA", "202325CA")) 
df5 |> 
  separate_wider_position(
    x,
    widths = c(year = 4, age = 2, state = 2)
  )
```

```
# A tibble: 3 × 3
  year  age   state
  <chr> <chr> <chr>
1 2022  15    TX   
2 2021  22    LA   
3 2023  25    CA   
```

]

.pull-right[
.pull-left[ 
* To omit a value from the output (i.e. not create a column for it), leave its entry in the vector unnamed and supply only the number of characters to skip.
* e.g. let's not include age in the output.

```r
df5 %>% 
  separate_wider_position(x, 
    widths = c(year = 4, 2, state=2))
```

```
# A tibble: 3 × 2
  year  state
  <chr> <chr>
1 2022  TX   
2 2021  LA   
3 2023  CA   
```
]
.pull-right[
* Alternatively, just use `select(-name)`:

```r
df5 |> 
  separate_wider_position(x,
    widths = c(year = 4, age = 2, state = 2)
  ) %>%
  select(-age)
```

```
# A tibble: 3 × 2
  year  state
  <chr> <chr>
1 2022  TX   
2 2021  LA   
3 2023  CA   
```
]
]

---

### Diagnosing widening problems

* `separate_wider_delim()` requires a fixed & known set of columns
* If some rows don't have expected number of pieces, problem!
* `too_few` and `too_many` args of `separate_wider_delim()` can help here.

```r
df <- tibble(x = c("1-1-1", "1-1-2", "1-3", "1-3-2", "1"))

df |> 
  separate_wider_delim(
    x,
    delim = "-",
    names = c("x", "y", "z")
  )
#> Error in `separate_wider_delim()`:
#> ! Expected 3 pieces in each element of `x`.
#> ! 2 values were too short.
#> ℹ Use `too_few = "debug"` to diagnose the problem.
#> ℹ Use `too_few = "align_start"/"align_end"` to silence this message.
```

---

.pull-left[

* Let's try its suggestion to use `debug`:

```r
df <- tibble(u = c("1-1-1", "1-1-2", "1-3", "1-3-2", "1"))
debug <- df |> 
  separate_wider_delim(
    u,
    delim = "-",
    names = c("x", "y", "z"),
    too_few = "debug"
  )
#> Warning: Debug mode activated: adding variables `u_ok`, `u_pieces`, and
#> `u_remainder`.
debug
#> A tibble: 5 × 7
#>  x     y     z     u     u_ok  u_pieces u_remainder
#>  <chr> <chr> <chr> <chr> <lgl>    <int> <chr>      
#>1 1     1     1     1-1-1 TRUE         3 ""         
#>2 1     1     2     1-1-2 TRUE         3 ""         
#>3 1     3     NA    1-3   FALSE        2 ""         
#>4 1     3     2     1-3-2 TRUE         3 ""         
#>5 1     NA    NA    1     FALSE        1 ""     
```

]
.pull-right[
* Three columns get added: `u_ok`, `u_pieces`, `u_remainder`
* `u_ok` helps find inputs which failed:

```r
debug %>% filter(!u_ok)
#> A tibble: 2 × 7
#>  x     y     z     u     u_ok  u_pieces u_remainder
#>  <chr> <chr> <chr> <chr> <lgl>    <int> <chr>      
#>1 1     3     NA    1-3   FALSE        2 ""         
#>2 1     NA    NA    1     FALSE        1 ""   
```
* `u_pieces` tells how many pieces were found, compared with the expected number (3, i.e. `length(names)`)
* `u_remainder` isn't useful when there are too few pieces, but we will see it is useful when there are too many.
* Using `debug` will typically reveal a problem with the delimiter strategy, suggesting the tibble needs preprocessing
]

---

### `too_few`

* By setting `too_few = 'align_start'` or `too_few = 'align_end'`, `separate_wider_delim()` will fill in the missing pieces with `NA`s, either on the tail end (`align_start`) or on the front end (`align_end`)

```r
df <- tibble(u = c("1-1-1", "1-1-2", "1-3", "1-3-2", "1"))
```

.pull-left[

```r
df %>% 
  separate_wider_delim(
    u,
    delim = "-",
    names = c("x", "y", "z"),
    too_few = 'align_start'
)
```

```
# A tibble: 5 × 3
  x     y     z    
  <chr> <chr> <chr>
1 1     1     1    
2 1     1     2    
3 1     3     <NA> 
4 1     3     2    
5 1     <NA>  <NA> 
```

]
.pull-right[

```r
df %>% 
  separate_wider_delim(
    u,
    delim = "-",
    names = c("x", "y", "z"),
    too_few = 'align_end'
)
```

```
# A tibble: 5 × 3
  x     y     z    
  <chr> <chr> <chr>
1 1     1     1    
2 1     1     2    
3 <NA>  1     3    
4 1     3     2    
5 <NA>  <NA>  1    
```
]

---

### `too_many`

* Same principles apply for too many pieces.

.pull-left[

```r
df <- tibble(u = c("1-1-1", "1-1-2", "1-3-5-6", "1-3-2", "1-3-5-7-9"))

df |> 
  separate_wider_delim(
    u,
    delim = "-",
    names = c("x", "y", "z")
  )
# Error in `separate_wider_delim()`:
# ! Expected 3 pieces in each element of `u`.
# ! 2 values were too long.
# ℹ Use `too_many = "debug"` to diagnose the problem.
# ℹ Use `too_many = "drop"/"merge"` to silence this message.
```
]

.pull-right[
* Debugging shows purpose of `u_remainder`:

```r
debug <- df |>
     separate_wider_delim(
         u,
         delim = "-",
         names = c("x", "y", "z"),
         too_many = 'debug'
     )
debug %>% filter(!u_ok)
# A tibble: 2 × 7
#   x     y     z     u         u_ok  u_pieces u_remainder
#   <chr> <chr> <chr> <chr>     <lgl>    <int> <chr>      
# 1 1     3     5     1-3-5-6   FALSE        4 -6         
# 2 1     3     5     1-3-5-7-9 FALSE        5 -7-9    
```

]

---

### Handling `too_many`

* To handle too many pieces, you can either "drop" the additionals or "merge" into a single column.

```r
df <- tibble(u = c("1-1-1", "1-1-2", "1-3-5-6", "1-3-2", "1-3-5-7-9"))
```

.pull-left[

```r
df |>
     separate_wider_delim(
         u,
         delim = "-",
         names = c("x", "y", "z"),
         too_many = 'drop'
     )
# # A tibble: 5 × 3
#   x     y     z    
#   <chr> <chr> <chr>
# 1 1     1     1    
# 2 1     1     2    
# 3 1     3     5    
# 4 1     3     2    
# 5 1     3     5    
```
]

.pull-right[

```r
df |>
   separate_wider_delim(
       u,
       delim = "-",
       names = c("x", "y", "z"),
       too_many = 'merge'
   )
# # A tibble: 5 × 3
#   x     y     z    
#   <chr> <chr> <chr>
# 1 1     1     1    
# 2 1     1     2    
# 3 1     3     5-6  
# 4 1     3     2    
# 5 1     3     5-7-9
```

]
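
The two arguments can be combined when a column has both short and long rows. Here is a sketch using a hypothetical mixed input (`mixed`), padding short rows with `NA` and merging extras into the last column:

```r
library(tidyr)
library(tibble)

# Hypothetical input with one short row and one long row
mixed <- tibble(u = c("1-1-1", "1-3", "1-3-5-6"))

fixed <- mixed |>
  separate_wider_delim(
    u,
    delim = "-",
    names = c("x", "y", "z"),
    too_few = "align_start",  # pad missing pieces with NA at the end
    too_many = "merge"        # keep extra pieces in the last column
  )
fixed
```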

---

### Individual characters in a string
.pull-left[ 
* `str_length()`: returns number of characters in the string

```r
str_length(c("a", "R for data science", NA))
```

```
[1]  1 18 NA
```
* `str_sub(string, start, end)`: returns the **sub**string from character `start` to character `end` (inclusive).

```r
x <- c("Apple", "Banana", "Pear")
str_sub(x, 1, 3)
```

```
[1] "App" "Ban" "Pea"
```
]

.pull-right[
* Can also use negative values for `start`, `end`: -1 is last char, -2 second to last, etc.

```r
str_sub(x, -3, -1)
```

```
[1] "ple" "ana" "ear"
```

]
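
Negative indices make it easy to pull out the first and last character of each string. A short sketch; note that `str_sub()` does not fail when a string is shorter than the requested range and simply returns as much as it can:

```r
library(stringr)

x <- c("Apple", "Banana", "Pear")

first <- str_sub(x, 1, 1)    # first character of each string
last  <- str_sub(x, -1, -1)  # last character of each string

# Asking for more characters than exist returns what is available
short <- str_sub("a", 1, 5)
```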