class: center, middle, inverse, title-slide

.title[
# Functions 1
]
.subtitle[
## STA35B: Statistical Data Science 2
]
.author[
### Spencer Frei
]

---

### Functions

* We'll talk in the next couple of lectures about writing **functions**
* Functions allow us to automate tasks, with a number of advantages:
  - Readability: well-defined tasks are encapsulated within individual functions, which makes code easier to debug
  - Portability: makes it easy to re-use code

---

### Vector functions

.pull-left[
* We'll take a look at **vector functions**: they take one or more vectors and return a vector as a result
* Try to understand this code.

```r
df <- tibble(
  a = rnorm(5),
  b = rnorm(5),
  c = rnorm(5),
  d = rnorm(5),
)

df |> mutate(
  a = (a - min(a, na.rm = TRUE)) / (max(a, na.rm = TRUE) - min(a, na.rm = TRUE)),
  b = (b - min(b, na.rm = TRUE)) / (max(b, na.rm = TRUE) - min(a, na.rm = TRUE)),
  c = (c - min(c, na.rm = TRUE)) / (max(c, na.rm = TRUE) - min(c, na.rm = TRUE)),
  d = (d - min(d, na.rm = TRUE)) / (max(d, na.rm = TRUE) - min(d, na.rm = TRUE)),
)
#> # A tibble: 5 × 4
#>       a     b     c     d
#>   <dbl> <dbl> <dbl> <dbl>
#> 1 0.339  2.59 0.291 0
#> 2 0.880  0    0.611 0.557
#> 3 0      1.37 1     0.752
#> 4 0.795  1.37 0     1
#> 5 1      1.34 0.580 0.394
```
]

--

.pull-right[
* Seems to be rescaling each column to have range between 0 and 1
* Did you spot the error? When creating column `b`, we forgot to change an "a" to a "b"
* Functions generally make it harder to make these kinds of mistakes
]

---

### Writing functions

.pull-left[
* Here's the computation we were trying to do:

```r
(a - min(a, na.rm = TRUE)) / (max(a, na.rm = TRUE) - min(a, na.rm = TRUE))
(b - min(b, na.rm = TRUE)) / (max(b, na.rm = TRUE) - min(b, na.rm = TRUE))
(c - min(c, na.rm = TRUE)) / (max(c, na.rm = TRUE) - min(c, na.rm = TRUE))
(d - min(d, na.rm = TRUE)) / (max(d, na.rm = TRUE) - min(d, na.rm = TRUE))
```

* More clearly:

```r
(█ - min(█, na.rm = TRUE)) / (max(█, na.rm = TRUE) - min(█, na.rm = TRUE))
```

* To write a function, we need 3 things:
  - **name**: we'll use `rescale01`
  - **arguments**: things that vary across calls; we'll use `x`
  - **body**: the code that repeats
]

.pull-right[

```r
name <- function(arguments) {
  body
}
```

* In our case:

```r
rescale01 <- function(x) {
  (x - min(x, na.rm = TRUE)) / (max(x, na.rm = TRUE) - min(x, na.rm = TRUE))
}
```

* Can test it out:

```r
rescale01(c(-10, 0, 10))
#> [1] 0.0 0.5 1.0
rescale01(c(1, 2, 3, NA, 5))
#> [1] 0.00 0.25 0.50   NA 1.00
```
]

---

.pull-left[
* Compare code with functions...

```r
rescale01 <- function(x) {
  (x - min(x, na.rm = TRUE)) / (max(x, na.rm = TRUE) - min(x, na.rm = TRUE))
}

df |> mutate(
  a = rescale01(a),
  b = rescale01(b),
  c = rescale01(c),
  d = rescale01(d),
)
#> # A tibble: 5 × 4
#>       a     b     c     d
#>   <dbl> <dbl> <dbl> <dbl>
#> 1 0.339 1     0.291 0
#> 2 0.880 0     0.611 0.557
#> 3 0     0.530 1     0.752
#> 4 0.795 0.531 0     1
#> 5 1     0.518 0.580 0.394
```
]

.pull-right[
* ... to code without functions:

```r
df <- tibble(
  a = rnorm(5),
  b = rnorm(5),
  c = rnorm(5),
  d = rnorm(5),
)

df |> mutate(
  a = (a - min(a, na.rm = TRUE)) / (max(a, na.rm = TRUE) - min(a, na.rm = TRUE)),
  b = (b - min(b, na.rm = TRUE)) / (max(b, na.rm = TRUE) - min(a, na.rm = TRUE)),
  c = (c - min(c, na.rm = TRUE)) / (max(c, na.rm = TRUE) - min(c, na.rm = TRUE)),
  d = (d - min(d, na.rm = TRUE)) / (max(d, na.rm = TRUE) - min(d, na.rm = TRUE)),
)
#> # A tibble: 5 × 4
#>       a     b      c     d
#>   <dbl> <dbl>  <dbl> <dbl>
#> 1 0.555 -88.7 0.545  0.597
#> 2 0.522 -36.1 0.0426 0
#> 3 0     -93.8 0.0871 0.548
#> 4 0.640 -43.0 1      0.536
#> 5 1       0   0      1
```
]

---

### Improving the function

.pull-left[

```r
rescale01 <- function(x) {
  (x - min(x, na.rm = TRUE)) / (max(x, na.rm = TRUE) - min(x, na.rm = TRUE))
}
```

* Suppose we want to modify the function so that we only compute the min once rather than twice.

```r
rescale01 <- function(x) {
  minval <- min(x, na.rm = TRUE)
  maxval <- max(x, na.rm = TRUE)
  (x - minval) / (maxval - minval)
}
```

* Much easier than having to go through copy/pasted code!
]

--

.pull-right[
* What if we want to modify it to deal with infinite values differently?

```r
x <- c(1:5, Inf)
rescale01(x)
#> [1] 0 0 0 0 0 NaN
```

* We can use `range()`: it returns the min and max of a vector, and has arguments `na.rm =` and `finite =`, which remove `NA`s and `Inf`s respectively

```r
rescale01 <- function(x) {
  rng <- range(x, na.rm = TRUE, finite = TRUE)
  minval <- rng[1]
  maxval <- rng[2]
  (x - minval) / (maxval - minval)
}

rescale01(x)
#> [1] 0.00 0.25 0.50 0.75 1.00 Inf
```
]

---

### Mutate functions

.pull-left[
* We'll now look at functions which work well with `mutate` and `filter`
* Consider a variation of `rescale01()` where we compute the **z-score**: rescaling a vector to have mean zero and standard deviation one.

```r
z_score <- function(x) {
  (x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE)
}
```

* The last line is returned as the *output* of the function
* We can also use `return()` explicitly

```r
z_score <- function(x) {
  results <- (x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE)
  return(results)
}
```
]

.pull-right[
* A function `clamp()` which makes sure all values in a vector lie between a value `min` and a value `max`

```r
clamp <- function(x, min, max) {
  case_when(
    x < min ~ min,
    x > max ~ max,
    .default = x
  )
}

clamp(1:10, min = 3, max = 7)
#> [1] 3 3 3 4 5 6 7 7 7 7
```
]

---

### Mutate functions

.pull-left[
* Functions can also be applied to non-numeric variables.
* Example: remove all percent signs, commas, and dollar signs from a string and then convert it to a number.

```r
clean_number <- function(x) {
  is_pct <- str_detect(x, "%")
  num <- x |>
    str_remove_all("%") |>
    str_remove_all(",") |>
    str_remove_all("\\$") |>
    as.numeric()
  if_else(is_pct, num / 100, num)
}

clean_number("$12,300")
#> [1] 12300
clean_number("45%")
#> [1] 0.45
```
]

--

.pull-right[
* Another example: take a vector where someone previously used the numbers 997 / 998 / 999 to record missing values, and replace those with `NA`:

```r
fix_na <- function(x) {
  if_else(x %in% c(997, 998, 999), NA, x)
}
```

* Functions can take multiple arguments

```r
make_a_pair <- function(x, y) {
  str_c("The pair will be ", x, " and ", y)
}

make_a_pair("William", "Henry")
#> [1] "The pair will be William and Henry"
```
]

---

### Summary functions

.pull-left[
* Functions which take a vector and return a single value are useful for `summarize()`

```r
commas <- function(x) {
  str_flatten(x, collapse = ", ", last = " and ")
}

commas(c("cat", "dog", "pigeon"))
#> [1] "cat, dog and pigeon"
```

* You can set default values for arguments, e.g. for computing the "coefficient of variation":

```r
cv <- function(x, na.rm = FALSE) {
  sd(x, na.rm = na.rm) / mean(x, na.rm = na.rm)
}

cv(runif(100, min = 0, max = 50))
#> [1] 0.5175436
cv(c(1, 2, 3, NA))
#> [1] NA
cv(c(1, 2, 3, NA), na.rm = TRUE)
#> [1] 0.5
```
]

.pull-right[
* Sometimes we just want to create functions which make code easier to read:

```r
n_missing <- function(x) {
  sum(is.na(x))
}
```
]

---

### Examples

.pull-left[
* Write a function which takes in a vector of strings of birthdates of the form "YYYY-MM-DD" and computes the age in years.

```r
age_in_years <- function(dates) {
  time_intervals <- ymd(dates) %--% today()
  floor(time_intervals / years(1))
}

age_in_years(c("2004-11-11", "1997-02-08"))
#> [1] 19 26
```
]

.pull-right[
* A function `both_na()` which takes in two vectors of the same length and returns the number of positions which have an `NA` in both vectors

```r
both_na <- function(vec1, vec2) {
  sum(is.na(vec1) & is.na(vec2))
}

both_na(c(1, 2, NA, 3), c(4, 5, NA, NA))
#> [1] 1
both_na(c(1, 2, NA, NA), c(4, 5, NA, NA))
#> [1] 2
```
]

---

### Data frame functions

.pull-left[
* We will often do manipulations of tibbles in a repetitive way
* Translating this into functions is a bit complicated, for reasons we'll discuss shortly
* Consider trying to compute the mean of a tibble by groups.

```r
grouped_mean <- function(df, group_var, mean_var) {
  df |>
    group_by(group_var) |>
    summarize(mean(mean_var))
}
```

* If we try to use it on the `diamonds` dataset, we get an error:

```r
diamonds |> grouped_mean(cut, carat)
#> Error in `group_by()`:
#> ! Must group by variables found in `.data`.
#> ✖ Column `group_var` is not found.
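
# Why the error: inside group_by(), column names are looked up in the data
# frame itself (tidy evaluation), so R searches for a column literally named
# "group_var" rather than the column the caller passed in. The fix, shown on
# the next slides, is "embracing".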
```
]

.pull-right[
* Let's make the example a bit clearer:

```r
df <- tibble(
  mean_var = 1,
  group_var = "g",
  group = 1,
  x = 10,
  y = 100
)

df |> grouped_mean(group, x)
#> # A tibble: 1 × 2
#>   group_var `mean(mean_var)`
#>   <chr>                <dbl>
#> 1 g                        1
df |> grouped_mean(group, y)
#> # A tibble: 1 × 2
#>   group_var `mean(mean_var)`
#>   <chr>                <dbl>
#> 1 g                        1
```

* The function always does `df %>% group_by(group_var)` instead of `df %>% group_by(group)`
]

---

### Embracing

.pull-left[
* The solution is to use **embracing**: wrap the variable inside `{{ }}`

```r
df <- tibble(
  mean_var = 1,
  group_var = "g",
  group = 1,
  x = 10,
  y = 100
)

grouped_mean <- function(df, group_var, mean_var) {
  df |>
    group_by({{ group_var }}) |>
    summarize(mean({{ mean_var }}))
}

df |> grouped_mean(group, x)
#> # A tibble: 1 × 2
#>   group `mean(x)`
#>   <dbl>     <dbl>
#> 1     1        10
```
]

.pull-right[
* We typically use embracing whenever a function passes its arguments to verbs like `arrange`, `filter`, `summarize`, `select`, `rename`, etc.
* We'll see examples of how to apply it in the next few slides
]

---

.pull-left[
* Let's create a function which does initial summary statistics of a variable

```r
summary6 <- function(data, var) {
  data |>
    summarize(
      min = min({{ var }}, na.rm = TRUE),
      mean = mean({{ var }}, na.rm = TRUE),
      median = median({{ var }}, na.rm = TRUE),
      max = max({{ var }}, na.rm = TRUE),
      n = n(),
      n_miss = sum(is.na({{ var }})),
      .groups = "drop"
    )
}

diamonds |> summary6(carat)
#> # A tibble: 1 × 6
#>     min  mean median   max     n n_miss
#>   <dbl> <dbl>  <dbl> <dbl> <int>  <int>
#> 1   0.2 0.798    0.7  5.01 53940      0
```
]

.pull-right[
* Since the computation happens inside `summarize()`, we can use it on grouped data:

```r
diamonds |>
  group_by(cut) |>
  summary6(carat)
#> # A tibble: 5 × 7
#>   cut     min  mean median   max     n n_miss
#>   <ord> <dbl> <dbl>  <dbl> <dbl> <int>  <int>
#> 1 Fair   0.22 1.05    1     5.01  1610      0
#> 2 Good   0.23 0.849   0.82  3.01  4906      0
...
```

* We can even do computations on the variable we pass in:

```r
diamonds |>
  group_by(cut) |>
  summary6(log10(carat))
#> # A tibble: 5 × 7
#>   cut      min    mean  median   max     n n_miss
#>   <ord>  <dbl>   <dbl>   <dbl> <dbl> <int>  <int>
#> 1 Fair  -0.658 -0.0273  0      0.700  1610      0
#> 2 Good  -0.638 -0.133  -0.0862 0.479  4906      0
...
```
]
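
---

### Aside: embracing with tidy-select verbs

* A minimal sketch: embracing works the same way with tidy-select verbs such as `select()` (the helper `top_values()` below is just a made-up example, not from the lecture exercises)

```r
library(tidyverse)

# Sort by a user-supplied column and keep only that column for the top rows.
top_values <- function(df, var, rows = 5) {
  df |>
    arrange(desc({{ var }})) |>
    select({{ var }}) |>
    slice_head(n = rows)
}

diamonds |> top_values(price)
```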

---

.pull-left[
* You can supply conditions as well:

```r
library(nycflights13)

unique_where <- function(df, condition, var) {
  df |>
    filter({{ condition }}) |>
    distinct({{ var }}) |>
    arrange({{ var }})
}

flights |> unique_where(month == 12, dest)
#> # A tibble: 96 × 1
#>   dest
#>   <chr>
#> 1 ABQ
#> 2 ALB
#> 3 ATL
#> 4 AUS
#> 5 AVL
#> 6 BDL
#> # ℹ 90 more rows
```
]

---

### Examples

* For the `flights` tibble, find all flights that were cancelled (`is.na(arr_time)`) or delayed by more than an hour; create a function `filter_severe()` to do so

```r
filter_severe <- function(df) {
  df %>% filter(is.na(arr_time) | dep_delay > 60)
}

flights %>% filter_severe()
```

* Now find all flights that were cancelled or delayed by more than a user-supplied number of hours:

```r
filter_severe <- function(df, hours = 1) {
  df %>% filter(is.na(arr_time) | dep_delay > 60 * hours)
}

flights %>% filter_severe(hours = 3)
```

---

### Plot functions

.pull-left[
* If you are repeatedly making small variations on plots, you'll find it useful to write functions involving ggplot

```r
diamonds |>
  ggplot(aes(x = carat)) +
  geom_histogram(binwidth = 0.1)

diamonds |>
  ggplot(aes(x = carat)) +
  geom_histogram(binwidth = 0.05)
```

* As in the dplyr case, you need to `{{ embrace }}` arguments that refer to variables
]

.pull-right[

```r
histogram <- function(df, var, binwidth = NULL) {
  df |>
    ggplot(aes(x = {{ var }})) +
    geom_histogram(binwidth = binwidth)
}

diamonds |> histogram(carat, 0.1)
```

<img src="lec13-functions-1_files/figure-html/unnamed-chunk-35-1.png" width="432" />
]

---

### Plot functions

.pull-left[

```r
histogram <- function(df, var, binwidth = NULL) {
  df |>
    ggplot(aes(x = {{ var }})) +
    geom_histogram(binwidth = binwidth)
}
```

* Since this function returns a ggplot, you can still add additional components, e.g.

```r
diamonds |> histogram(carat, 0.1) +
  labs(x = "Size (in carats)", y = "Number of diamonds")
```
]

.pull-right[
<img src="lec13-functions-1_files/figure-html/unnamed-chunk-38-1.png" width="432" />
]

---

### Plot functions

.pull-left[
* It's easy to add more variables to the function

```r
linearity_check <- function(df, x, y) {
  df |>
    ggplot(aes(x = {{ x }}, y = {{ y }})) +
    geom_point() +
    geom_smooth(method = "loess", formula = y ~ x, color = "red", se = FALSE) +
    geom_smooth(method = "lm", formula = y ~ x, color = "blue", se = FALSE)
}

starwars |>
  filter(mass < 1000) |>
  linearity_check(mass, height)
```

<img src="lec13-functions-1_files/figure-html/unnamed-chunk-39-1.png" width="216" />
]

.pull-right[
* Example of creating a hexagon plot function that is quite flexible:

```r
hex_plot <- function(df, x, y, z, bins = 20, fun = "mean") {
  df |>
    ggplot(aes(x = {{ x }}, y = {{ y }}, z = {{ z }})) +
    stat_summary_hex(
      aes(color = after_scale(fill)), # make border same color as fill
      bins = bins,
      fun = fun,
    )
}

diamonds |> hex_plot(carat, price, depth)
```

<img src="lec13-functions-1_files/figure-html/unnamed-chunk-40-1.png" width="252" />
]

---

.pull-left[
* Let's now try to create a function which makes a bar plot given a tibble and a (factor) variable name, where the bars are ordered top-to-bottom from most frequent to least frequent.
* Useful function: `fct_infreq()` re-orders factor levels by the number of observations within each level (largest first)
* See how it works:

```r
diamonds %>%
  count(clarity) %>%
  arrange(desc(n))
#> # A tibble: 8 × 2
#>   clarity     n
#>   <ord>   <int>
#> 1 SI1     13065
#> 2 VS2     12258
#> 3 SI2      9194
#> 4 VS1      8171
#> 5 VVS2     5066
#> 6 VVS1     3655
#> # ℹ 2 more rows

diamonds$clarity %>% fct_infreq() %>% levels()
#> [1] "SI1"  "VS2"  "SI2"  "VS1"  "VVS2" "VVS1" "IF"   "I1"
```
]

--

.pull-right[
* Now create the function:

```r
sorted_bars <- function(df, var) {
  df |>
    mutate({{ var }} := fct_infreq({{ var }})) |>
    ggplot(aes(y = {{ var }})) +
    geom_bar()
}

diamonds |> sorted_bars(clarity)
```

<img src="lec13-functions-1_files/figure-html/unnamed-chunk-42-1.png" width="432" />

* Note `:=` is needed because of embracing (see the aside two slides ahead)
* Largest is at bottom, not top; to fix this, use `fct_rev()`, which reverses the factor levels
]

---

.pull-left[
* Let's now try to create a function which makes a bar plot given a tibble and a (factor) variable name, where the bars are ordered top-to-bottom from most frequent to least frequent.
* Useful function: `fct_infreq()` re-orders factor levels by the number of observations within each level (largest first)
* See how it works:

```r
diamonds %>%
  count(clarity) %>%
  arrange(desc(n))
#> # A tibble: 8 × 2
#>   clarity     n
#>   <ord>   <int>
#> 1 SI1     13065
#> 2 VS2     12258
#> 3 SI2      9194
#> 4 VS1      8171
#> 5 VVS2     5066
#> 6 VVS1     3655
#> # ℹ 2 more rows

diamonds$clarity %>% fct_infreq() %>% levels()
#> [1] "SI1"  "VS2"  "SI2"  "VS1"  "VVS2" "VVS1" "IF"   "I1"
```
]

.pull-right[
* Now wrap `fct_infreq()` in `fct_rev()`:

```r
sorted_bars <- function(df, var) {
  df |>
    mutate({{ var }} := fct_rev(fct_infreq({{ var }}))) |>
    ggplot(aes(y = {{ var }})) +
    geom_bar()
}

diamonds |> sorted_bars(clarity)
```

<img src="lec13-functions-1_files/figure-html/unnamed-chunk-44-1.png" width="432" />

* Now the largest bar is at the top, since `fct_rev()` reverses the factor levels
]
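
---

### Aside: `:=` and embracing

* A minimal sketch of why `:=` is needed (the helper `mean_by()` below is just a made-up example): R's parser doesn't allow an expression like `{{ var }}` on the left-hand side of `=`, so rlang provides `:=`, which works like `=` but accepts an embraced argument as the name

```r
library(tidyverse)

# mean_by() groups by one user-supplied column and averages another;
# {{ mean_var }} := ... names the output column after the variable passed in.
mean_by <- function(df, group_var, mean_var) {
  df |>
    group_by({{ group_var }}) |>
    summarize({{ mean_var }} := mean({{ mean_var }}, na.rm = TRUE))
}

diamonds |> mean_by(cut, carat)
```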

---

### Style

* Function names should be verbs, arguments should be nouns (typically)
* Function names should be descriptive and tell the reader something about them

```r
# Too short
f()

# Not a verb, or descriptive
my_awesome_function()

# Long, but clear
impute_missing()
collapse_years()
```

.pull-left[
* Although R doesn't care about whitespace (unlike Python), it is very helpful to readers to use consistent spacing conventions
* Use a 2-space indent inside the `{}` of a function body, and a 2-space indent after starting a pipe
* In RStudio, `Ctrl + I` on a given line will indent it correctly (usually)
]

.pull-right[

```r
density <- function(color, facets, binwidth = 0.1) {
  diamonds |>
    ggplot(aes(x = carat, y = after_stat(density), color = {{ color }})) +
    geom_freqpoly(binwidth = binwidth) +
    facet_wrap(vars({{ facets }}))
}
```
]

---

<!--
### Coming lectures

* In the coming lectures, we'll look at **linear models** and how to perform **inference**
* A quick refresher on plotting lines:
* `y = m x + b`: slope of `m`, intercept of `b` (when `x=0`, `y=b`)
* `y - v = m (x - u)`: point-slope form. Line going through `(u, v)` with slope `m`
* If you have two points, you can always connect a line through them and find the slope by substituting into the first equation above.
-->