class: center, middle, inverse, title-slide .title[ # Functions and Intro to Linear Regression ] .subtitle[ ##
STA35B: Statistical Data Science 2 ] .author[ ### Spencer Frei ] --- ### Iteration * We've already seen a few ways to do computations repeatedly in a clean / easy way: - `facet_grid()` and `facet_wrap()`, drawing plots for each group - `group_by()` + `summarize()` for summary statistics for each group - creating custom functions * We'll see what other ways R allows us to flexibly do calculations and save time --- ### Modifying multiple columns .pull-left[ * Consider a simple tibble (`rnorm(n)`: n independent standard normal draws), and suppose we want to compute the median of every column: ```r df <- tibble( a = rnorm(10), b = rnorm(10), c = rnorm(10), d = rnorm(10) ) df |> summarize( n = n(), a = median(a), b = median(b), c = median(c), d = median(d), ) #> # A tibble: 1 × 5 #> n a b c d #> <int> <dbl> <dbl> <dbl> <dbl> #> 1 10 -0.246 -0.287 -0.0567 0.144 ``` ] .pull-right[ * You should never copy and paste more than twice (what if we had 500 columns!) * Helpful function: `across()`: ```r df %>% summarize( n = n(), across(a:d, median) ) #> # A tibble: 1 × 5 #> n a b c d #> <int> <dbl> <dbl> <dbl> <dbl> #> 1 10 -0.246 -0.287 -0.0567 0.144 ``` * In coming slides, we'll see how `across()` works and how to modify this behavior. * Three especially important arguments to `across()`: - `.cols`: which columns to iterate over - `.fns`: what to do (a function) for each column - `.names`: how to name the output for each column ] --- ### `across()`: Selecting columns with `.cols` .pull-left[ * For `.cols`, we can use the same things we used for `select()`: ```r df %>% summarize(across(-a, median)) #> # A tibble: 1 × 3 #> b c d #> <dbl> <dbl> <dbl> #> 1 -0.287 -0.0567 0.144 df %>% summarize(across(c(a,c), median)) #> # A tibble: 1 × 2 #> a c #> <dbl> <dbl> #> 1 -0.246 -0.0567 ``` * Two additional selection helpers are useful: `everything()` and `where()`. ```r df %>% summarize(across(everything(), median)) #> # A tibble: 1 × 4 #> a b c d #> <dbl> <dbl> <dbl> <dbl> #> 1 -0.246 -0.287 -0.0567 0.144 ``` ] .pull-right[ * `everything()` computes summaries for every non-grouping variable * `where()` allows for selecting columns based on type, e.g. `where(is.numeric)` for numbers, `where(is.character)` for strings, `where(is.logical)` for logicals, etc. ```r df <- tibble( grp = sample(2, 10, replace = TRUE), # either 1 or 2 a = rnorm(10), b = rnorm(10), c = rnorm(10), d = rnorm(10)) df |> group_by(grp) |> summarize(across(everything(), median)) #> # A tibble: 2 × 5 #> grp a b c d #> <int> <dbl> <dbl> <dbl> <dbl> #> 1 1 -0.0935 -0.0163 0.363 0.364 #> 2 2 0.312 -0.0576 0.208 0.565 ``` ] --- ### `across()`: calling a single function * `.fns` says how we want the data to be transformed * We pass the *function itself* to `across()`; we do not call the function. - Never add `()` after the function name when passing it to `across()`, otherwise you get an error. ```r df |> group_by(grp) |> summarize(across(everything(), median())) #> Error in `summarize()`: #> ℹ In argument: `across(everything(), median())`. #> Caused by error in `median.default()`: #> ! argument "x" is missing, with no default ``` * This is the same reason calling `median()` by itself in the console results in an error: it has no input.
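* A minimal sketch of the distinction, using only base R: a function is an ordinary object that can be passed around by name, and `across()` takes care of calling it on each column.

```r
f <- median    # assigns the function object itself -- note: no parentheses
f(c(1, 5, 9))  # f() can now be called exactly like median(); returns 5
```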
--- ### `across()`: calling multiple functions .pull-left[ * We may want to apply multiple transformations or pass along additional arguments * Motivating example: tibble with missing data ```r rnorm_na <- function(n, n_na, mean = 0, sd = 1) { sample(c(rnorm(n - n_na, mean = mean, sd = sd), rep(NA, n_na))) } df_miss <- tibble( a = rnorm_na(5, 1), b = rnorm_na(5, 1), c = rnorm_na(5, 2), d = rnorm(5)) df_miss |> summarize( across(a:d, median), n = n()) #> # A tibble: 1 × 5 #> a b c d n #> <dbl> <dbl> <dbl> <dbl> <int> #> 1 NA NA NA 1.15 5 ``` ] .pull-right[ * If we want to pass along the argument `na.rm = TRUE`, we can create a new function in-line which calls `median()`: ```r df_miss |> summarize( across(a:d, function(x) median(x, na.rm = TRUE)), n = n() ) #> # A tibble: 1 × 5 #> a b c d n #> <dbl> <dbl> <dbl> <dbl> <int> #> 1 0.139 -1.11 -0.387 1.15 5 ``` * R also provides a shortcut for in-line function creation: `\(x)`: ```r df_miss |> summarize( across(a:d, \(x) median(x, na.rm = TRUE)), n = n() ) ``` * Equivalent to: ```r df_miss |> summarize( a = median(a, na.rm = TRUE), b = median(b, na.rm = TRUE), c = median(c, na.rm = TRUE), d = median(d, na.rm = TRUE), n = n() ) ``` ] --- .pull-left[ * So we can simplify code like ... ```r df_miss |> summarize( a = median(a, na.rm = TRUE), b = median(b, na.rm = TRUE), c = median(c, na.rm = TRUE), d = median(d, na.rm = TRUE), n = n() ) #> # A tibble: 1 × 5 #> a b c d n #> <dbl> <dbl> <dbl> <dbl> <int> #> 1 0.139 -1.11 -0.387 1.15 5 ``` * ... to ... ```r df_miss |> summarize( across(a:d, \(x) median(x, na.rm = TRUE)), n = n() ) #> # A tibble: 1 × 5 #> a b c d n #> <dbl> <dbl> <dbl> <dbl> <int> #> 1 0.139 -1.11 -0.387 1.15 5 ``` ] .pull-right[ * When we remove missing values, we may also be interested in how many were removed. We can do that with `across()` by passing a named list to the `.fns` argument: ```r df_miss |> summarize( across(a:d, list( median = \(x) median(x, na.rm = TRUE), n_miss = \(x) sum(is.na(x)) )), n = n() ) #> # A tibble: 1 × 9 #> a_median a_n_miss b_median b_n_miss c_median c_n_miss d_median d_n_miss #> <dbl> <int> <dbl> <int> <dbl> <int> <dbl> <int> #> 1 0.139 1 -1.11 1 -0.387 2 1.15 0 #> # ℹ 1 more variable: n <int> ``` * Columns are named using a "glue" specification, by default `"{.col}_{.fn}"`, where `.col` is the name of the original column and `.fn` is the name of the function. * Next: more on how to name columns in the output ] --- ### Column names * Specifying the `.names` argument allows for custom output names: ```r df_miss |> summarize( across( a:d, list( median = \(x) median(x, na.rm = TRUE), n_miss = \(x) sum(is.na(x)) ), .names = "{.fn}_for_{.col}" ), n = n(), ) #> # A tibble: 1 × 9 #> median_for_a n_miss_for_a median_for_b n_miss_for_b median_for_c #> <dbl> <int> <dbl> <int> <dbl> #> 1 0.139 1 -1.11 1 -0.387 #> # ℹ 4 more variables: n_miss_for_c <int>, median_for_d <dbl>, #> # n_miss_for_d <int>, n <int> ``` --- ### Column names * Specifying `.names` is especially important when using `mutate()`: by default `across()` gives the output the same names as the input, and thus replaces the original columns.
.pull-left[ * e.g., `coalesce(x, y)` replaces all appearances of `NA` in `x` with the value `y` ```r df_miss |> mutate( across(a:d, \(x) coalesce(x, 0)) ) #> # A tibble: 5 × 4 #> a b c d #> <dbl> <dbl> <dbl> <dbl> #> 1 0.434 -1.25 0 1.60 #> 2 0 -1.43 -0.297 0.776 #> 3 -0.156 -0.980 0 1.15 #> 4 -2.61 -0.683 -0.785 2.13 #> 5 1.11 0 -0.387 0.704 ``` ] .pull-right[ * If we wanted to create new columns, use `.names` to give output new names: ```r df_miss |> mutate( across(a:d, \(x) coalesce(x, 0), .names = "{.col}_na_zero") ) #> # A tibble: 5 × 8 #> a b c d a_na_zero b_na_zero c_na_zero d_na_zero #> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> #> 1 0.434 -1.25 NA 1.60 0.434 -1.25 0 1.60 #> 2 NA -1.43 -0.297 0.776 0 -1.43 -0.297 0.776 #> 3 -0.156 -0.980 NA 1.15 -0.156 -0.980 0 1.15 #> 4 -2.61 -0.683 -0.785 2.13 -2.61 -0.683 -0.785 2.13 #> 5 1.11 NA -0.387 0.704 1.11 0 -0.387 0.704 ``` ] --- ### Filtering * `across()` is great with `summarize()` and `mutate()`, but not so much with `filter()` because there we usually combine conditions with `&` / `|`. * dplyr provides two variants: `if_any()` and `if_all()` to help combine logicals across columns ```r # same as df_miss |> filter(is.na(a) | is.na(b) | is.na(c) | is.na(d)) df_miss |> filter(if_any(a:d, is.na)) #> # A tibble: 4 × 4 #> a b c d #> <dbl> <dbl> <dbl> <dbl> #> 1 0.434 -1.25 NA 1.60 #> 2 NA -1.43 -0.297 0.776 #> 3 -0.156 -0.980 NA 1.15 #> 4 1.11 NA -0.387 0.704 # same as df_miss |> filter(is.na(a) & is.na(b) & is.na(c) & is.na(d)) df_miss |> filter(if_all(a:d, is.na)) #> # A tibble: 0 × 4 #> # ℹ 4 variables: a <dbl>, b <dbl>, c <dbl>, d <dbl> ``` --- ### `across()` in functions * Let's see an example of expanding all date columns into year / month / day columns. ```r expand_dates <- function(df) { df |> mutate( across(where(is.Date), list(year = year, month = month, day = mday)) ) } df_date <- tibble( name = c("Amy", "Bob"), date = ymd(c("2009-08-03", "2010-01-16")) ) df_date |> expand_dates() #> # A tibble: 2 × 5 #> name date date_year date_month date_day #> <chr> <date> <dbl> <dbl> <int> #> 1 Amy 2009-08-03 2009 8 3 #> 2 Bob 2010-01-16 2010 1 16 ``` --- ### `across()` in functions .pull-left[ * You can supply multiple columns in a single argument using `c()` in addition to `where()`: ```r summarize_means <- function(df, summary_vars = where(is.numeric)) { df |> summarize( across({{ summary_vars }}, \(x) mean(x, na.rm = TRUE)), n = n(), .groups = "drop") } ``` ] .pull-right[ ```r diamonds |> group_by(cut) |> summarize_means() #> # A tibble: 5 × 9 #> cut carat depth table price x y z n #> <ord> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <int> #> 1 Fair 1.05 64.0 59.1 4359. 6.25 6.18 3.98 1610 #> 2 Good 0.849 62.4 58.7 3929. 5.84 5.85 3.64 4906 #> 3 Very Good 0.806 61.8 58.0 3982. 5.74 5.77 3.56 12082 #> 4 Premium 0.892 61.3 58.7 4584. 5.97 5.94 3.65 13791 #> 5 Ideal 0.703 61.7 56.0 3458. 
5.51 5.52 3.40 21551 diamonds |> group_by(cut) |> summarize_means(c(carat, x:z)) #> # A tibble: 5 × 6 #> cut carat x y z n #> <ord> <dbl> <dbl> <dbl> <dbl> <int> #> 1 Fair 1.05 6.25 6.18 3.98 1610 #> 2 Good 0.849 5.84 5.85 3.64 4906 #> 3 Very Good 0.806 5.74 5.77 3.56 12082 #> 4 Premium 0.892 5.97 5.94 3.65 13791 #> 5 Ideal 0.703 5.51 5.52 3.40 21551 ``` ] --- ### `across()` vs `pivot_longer()` .pull-left[ Consider calculating medians/means for all columns: ```r df |> summarize(across(a:d, list(median = median, mean = mean))) #> # A tibble: 1 × 8 #> a_median a_mean b_median b_mean c_median c_mean d_median d_mean #> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> #> 1 0.0380 0.205 -0.0163 0.0910 0.260 0.0716 0.540 0.508 ``` An alternative way to compute this: pivot longer, then group by and summarize: ```r long <- df |> pivot_longer(cols = a:d, names_to = "name", values_to = "value") |> group_by(name) |> summarize( median = median(value), mean = mean(value) ) ``` ] .pull-right[ ``` #> # A tibble: 4 × 3 #> name median mean #> <chr> <dbl> <dbl> #> 1 a 0.0380 0.205 #> 2 b -0.0163 0.0910 #> 3 c 0.260 0.0716 #> 4 d 0.540 0.508 ``` * Then pivot wider: ```r long %>% pivot_wider( names_from = 'name', values_from = c(median, mean), names_vary = "slowest", names_glue = "{name}_{.value}") #> # A tibble: 1 × 8 #> a_median a_mean b_median b_mean c_median c_mean d_median d_mean #> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> #> 1 0.0380 0.205 -0.0163 0.0910 0.260 0.0716 0.540 0.508 ``` * `names_vary = "slowest"` keeps each column's summaries adjacent (`a_median, a_mean, b_median, ...`); the default `"fastest"` varies the `names_from` values fastest (`a_median, b_median, ..., a_mean, ...`). ] --- .pull-left[ * This approach is useful when you have groups of columns that you want to compute with simultaneously * e.g. suppose df contains both values and weights, and we want to compute a weighted mean. ```r df_paired <- tibble( a_val = rnorm(10), a_wts = runif(10), b_val = rnorm(10), b_wts = runif(10), c_val = rnorm(10), c_wts = runif(10), d_val = rnorm(10), d_wts = runif(10) ) ``` * There's no direct way to do this with `across()`, but it's easy with `pivot_longer()` ] .pull-right[ ```r ( df_long <- df_paired |> pivot_longer( cols = everything(), names_to = c("group", ".value"), names_sep = "_" ) ) #> # A tibble: 40 × 3 #> group val wts #> <chr> <dbl> <dbl> #> 1 a 0.715 0.518 #> 2 b -0.709 0.691 #> 3 c 0.718 0.216 #> 4 d -0.217 0.733 #> 5 a -1.09 0.979 #> 6 b -0.209 0.675 #> # ℹ 34 more rows df_long |> group_by(group) |> summarize(mean = weighted.mean(val, wts)) #> # A tibble: 4 × 2 #> group mean #> <chr> <dbl> #> 1 a 0.126 #> 2 b -0.0704 #> 3 c -0.360 #> 4 d -0.248 ``` ] --- ### Examples .pull-left[ * Number of unique values in each column of `palmerpenguins::penguins`: ```r penguins %>% summarize(across(everything(), \(x) length(unique(x)))) #> # A tibble: 1 × 8 #> species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g #> <int> <int> <int> <int> <int> <int> #> 1 3 3 165 81 56 95 #> # ℹ 2 more variables: sex <int>, year <int> ``` * The mean of every column in `mtcars`: ```r mtcars %>% summarize(across(everything(), mean)) #> mpg cyl disp hp drat wt qsec vs am #> 1 20.09062 6.1875 230.7219 146.6875 3.596563 3.21725 17.84875 0.4375 0.40625 #> gear carb #> 1 3.6875 2.8125 ``` ] .pull-right[ * Group diamonds by `cut`, `clarity`, and `color`, then count the number of observations and compute the mean of each numeric column. ```r diamonds %>% group_by(cut, clarity, color) %>% summarize(num = n(), across(where(is.numeric), mean)) #> `summarise()` has grouped output by 'cut', 'clarity'. You can override using #> the `.groups` argument.
#> # A tibble: 276 × 11 #> # Groups: cut, clarity [40] #> cut clarity color num carat depth table price x y z #> <ord> <ord> <ord> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> #> 1 Fair I1 D 4 1.88 65.6 56.8 7383 7.52 7.42 4.90 #> 2 Fair I1 E 9 0.969 65.6 58.1 2095. 6.17 6.06 4.01 #> 3 Fair I1 F 35 1.02 65.7 58.4 2544. 6.14 6.04 4.00 #> 4 Fair I1 G 53 1.23 65.3 57.7 3187. 6.52 6.43 4.23 #> 5 Fair I1 H 52 1.50 65.8 58.4 4213. 6.96 6.86 4.55 #> 6 Fair I1 I 34 1.32 65.7 58.4 3501 6.76 6.65 4.41 #> # ℹ 270 more rows ``` ] --- ### Coming lectures * We're done with the R4DS book now; moving to Introduction to Modern Statistics (IMS). * We'll use `library(openintro)`. * In the coming lectures, we'll look at **linear models** and how to perform **inference** * A quick refresher on plotting lines: * `y = m x + b`: slope of `m`, intercept of `b` (when `x=0`, `y=b`) * `y - v = m (x - u)`: point-slope form. Line going through `(u, v)` with slope `m` * Given two points, there is exactly one line through them: compute the slope as rise over run, `m = (y2 - y1)/(x2 - x1)`, then substitute either point into `y = m x + b` to find the intercept `b`. --- ### Basic single-variable linear model * The basic structure of a linear model: $$ y = b_0 + b_1 x + \epsilon $$ * `\(b_0\)` is the **intercept**, `\(b_1\)` is the **slope**, `\(\epsilon\)` is an **error term** * We call `\(x\)` the **predictor**, `\(y\)` the **response** or **outcome**. * The error term can be small or large - its typical size relative to the trend `\(b_0 + b_1 x\)` determines how useful the linear model is <div class="figure"> <img src="lec14-functions-2_v0_files/figure-html/fig-imperfLinearModel-1.png" alt="Three scatterplots with fabricated data. The first panel shows a strong negative linear relationship. The second panel shows a moderate positive linear relationship. The last panel shows no relationship between the x and y variables." width="288" /><img src="lec14-functions-2_v0_files/figure-html/fig-imperfLinearModel-2.png" alt="Three scatterplots with fabricated data. The first panel shows a strong negative linear relationship. The second panel shows a moderate positive linear relationship. The last panel shows no relationship between the x and y variables." width="288" /><img src="lec14-functions-2_v0_files/figure-html/fig-imperfLinearModel-3.png" alt="Three scatterplots with fabricated data. The first panel shows a strong negative linear relationship. The second panel shows a moderate positive linear relationship. The last panel shows no relationship between the x and y variables." width="288" /> <p class="caption">Three datasets where a linear model may be useful even though the data do not all fall exactly on the line.
</p> </div> --- ### Linear regression for predicting possum head lengths .pull-left[ * We'll look at a dataset with measurements of possum total (body) lengths and head lengths * We want to see if we can predict head length given total length * Let's examine a scatterplot of this data (`possum` is in the `openintro` library) ```r possum #> # A tibble: 104 × 8 #> site pop sex age head_l skull_w total_l tail_l #> <int> <fct> <fct> <int> <dbl> <dbl> <dbl> <dbl> #> 1 1 Vic m 8 94.1 60.4 89 36 #> 2 1 Vic f 6 92.5 57.6 91.5 36.5 #> 3 1 Vic f 6 94 60 95.5 39 #> 4 1 Vic f 6 93.2 57.1 92 38 #> 5 1 Vic f 2 91.5 56.3 85.5 36 #> 6 1 Vic f 1 93.1 54.8 90.5 35.5 #> # ℹ 98 more rows ``` ] .pull-right[ * Let's visualize the relationship between total length and head length: ```r ggplot(possum, aes(x = total_l, y = head_l)) + geom_point(alpha = 0.7, size = 2) + labs( x = "Total Length (cm)", y = "Head Length (mm)" ) ``` <div class="figure"> <img src="lec14-functions-2_v0_files/figure-html/fig-scattHeadLTotalL-1.png" alt="A scatterplot with total length on the x-axis and head length on the y-axis. The variables show a moderately strong positive linear relationship. A single observation is circled in red with coordinates of approximately 84cm of total length and 87mm of head length." width="432" /> <p class="caption">A scatterplot showing head length against total length for 104 brushtail possums. A point representing a possum with head length 86.7 mm and total length 84 cm is highlighted. </p> </div> ] --- .pull-left[ * We can try to fit a line to the data by picking two points and drawing the line through them. ```r ggplot(possum, aes(x = total_l, y = head_l)) + geom_point(alpha = 0.7, size = 2) + labs( x = "Total Length (cm)", y = "Head Length (mm)" ) + geom_point(data = sample_n(possum, 2), aes(x = total_l, y = head_l), color = "red", size = 5, shape = "circle open", stroke = 2) ``` <img src="lec14-functions-2_v0_files/figure-html/unnamed-chunk-35-1.png" width="432" /> * Not ideal ] .pull-right[ * A more reasonable line: ```r ggplot(possum, aes(x = total_l, y = head_l)) + geom_point(alpha = 0.7, size = 2) + labs( x = "Total Length (cm)", y = "Head Length (mm)" ) + geom_smooth(method = "lm", se = FALSE) #> `geom_smooth()` using formula = 'y ~ x' ``` <img src="lec14-functions-2_v0_files/figure-html/unnamed-chunk-36-1.png" width="432" /> ] --- ### Residuals * Suppose we have data `\((x_i, y_i)_{i=1}^n\)`, we fit a line to the data, and we make predictions `\(\hat{y}(x_i) = \hat y_i\)`. * The **residual** for the `\(i\)`-th observation `\((x_i, y_i)\)` is the difference between the observed outcome and predicted outcome: $$ e_i := y_i - \hat y_i .$$ .pull-left[ <img src="lec14-functions-2_v0_files/figure-html/unnamed-chunk-37-1.png" width="432" /> ] .pull-right[ * On the left we plot a linear model for predicting possum head length from body length. * The vertical distance from the line to each circled point is the residual: positive if the point is above the line, negative if it is below. * Suppose the line is given by the equation $$ \hat y = 41 + 0.59 x $$ * What is the residual for the observation (76.0, 85.1)?
$$ \hat y = 41 + 0.59 \times 76.0 = 85.84 $$ $$ e = y - \hat y = 85.1 - 85.84 = -0.74 $$ ] --- ### Residual plots * Residuals are often helpful when trying to understand how well a linear model fits the data * We can visualize this by plotting predicted values on the `\(x\)`-axis and residuals on the `\(y\)`-axis .pull-left[ <img src="lec14-functions-2_v0_files/figure-html/unnamed-chunk-38-1.png" width="432" /> ] .pull-right[ * Residual plot: <img src="lec14-functions-2_v0_files/figure-html/unnamed-chunk-39-1.png" width="432" /> ] --- ### Residual plots .pull-left[ * Let's look at a few more linear relationships and their corresponding residual plots. <img src="lec14-functions-2_v0_files/figure-html/unnamed-chunk-40-1.png" width="432" /> * What are some observed trends? ] -- .pull-right[ * First dataset: no obvious pattern in the residuals; they look random around the dashed line * Second dataset: some curvature is visible, suggesting the errors are *not* random and the true relationship is likely not linear-plus-noise. * Third plot: not clear if the underlying relationship is linear, but the residuals don't appear to show any pattern. ] --- ### Correlation for describing linear relationships .pull-left[ * **Correlation** measures the strength of a linear relationship * Always takes values between -1 and 1, denoted by `\(r\)` * Correlation for observations `\((x_1, y_1), \dots, (x_n, y_n)\)`: $$ r = \frac{1}{n-1} \sum_{i=1}^n \frac{x_i - \bar x}{s_x} \cdot \frac{y_i - \bar y}{s_y}, $$ where `\(\bar x, \bar y\)` are the sample means and `\(s_x, s_y\)` the sample standard deviations of the `\(x_i, y_i\)` respectively. * Correlation of +1: perfect positive linear relationship; -1: perfect negative; 0: no linear relationship. ] .pull-right[ <img src="lec14-functions-2_v0_files/figure-html/unnamed-chunk-41-1.png" width="432" /> ] --- ### Correlation is "scale-invariant" .pull-left[ * The definition of correlation: $$ r = \frac{1}{n-1} \sum_{i=1}^n \frac{x_i - \bar x}{s_x} \cdot \frac{y_i - \bar y}{s_y}, $$ * If we multiply every `\(x_i\)` by a positive number `\(c\)`, then both `\(x_i - \bar x\)` and `\(s_x\)` scale by the same factor `\(c\)`, so each ratio `\((x_i - \bar x)/s_x\)`, and hence `\(r\)`, is unaffected * Same thing for the `\(y_i\)`'s * E.g. the correlation between height and weight is the same whether we use metric or American units ] .pull-right[ <img src="lec14-functions-2_v0_files/figure-html/fig-bdims-units-1.png" width="432" /> ] --- ### Least squares regression .pull-left[ * How do we define the "best" line of fit? * We would like the residuals to be small, but there are many ways to measure their overall size. * For residuals `\(e_i = y_i - \hat y_i\)`, consider these two ways: $$ |e_1| + |e_2| + \dots + |e_n|$$ $$ e_1^2 + e_2^2 + \dots + e_n^2$$ * On the right we plot the two lines which minimize the sum of the absolute values and the sum of the squares, respectively ] .pull-right[ <img src="lec14-functions-2_v0_files/figure-html/unnamed-chunk-43-1.png" width="432" /> * The most commonly used criterion is least squares; we'll focus on this ] --- ### Fitting least squares lines in R .pull-left[ * Let's consider the `elmhurst` dataset: ```r elmhurst #> # A tibble: 50 × 3 #> family_income gift_aid price_paid #> <dbl> <dbl> <dbl> #> 1 92.9 21.7 14.3 #> 2 0.25 27.5 8.53 #> 3 53.1 27.8 14.2 #> 4 50.2 27.2 8.78 #> 5 138. 18 24 #> 6 48.0 18.5 23.5 #> # ℹ 44 more rows ``` * Suppose we want to predict the amount of aid given as a function of family income.
We want to form predictions $$ \widehat{\mathsf{aid}} = \beta_0 + \beta_1 \times \mathsf{familyincome}$$ ] .pull-right[ * To compute the line of best fit of variable `\(y\)` given `\(x\)`, we use `lm(y ~ x)`: ```r lm(gift_aid ~ family_income, data = elmhurst) #> #> Call: #> lm(formula = gift_aid ~ family_income, data = elmhurst) #> #> Coefficients: #> (Intercept) family_income #> 24.31933 -0.04307 lm(gift_aid ~ family_income, data = elmhurst) %>% broom::tidy() #> # A tibble: 2 × 5 #> term estimate std.error statistic p.value #> <chr> <dbl> <dbl> <dbl> <dbl> #> 1 (Intercept) 24.3 1.29 18.8 8.28e-24 #> 2 family_income -0.0431 0.0108 -3.98 2.29e- 4 ``` * The `broom` package takes the messy output of functions like `lm()` and turns it into tidy tibbles ] --- .pull-left[ ```r lm(gift_aid ~ family_income, data = elmhurst) #> #> Call: #> lm(formula = gift_aid ~ family_income, data = elmhurst) #> #> Coefficients: #> (Intercept) family_income #> 24.31933 -0.04307 lm(gift_aid ~ family_income, data = elmhurst) %>% summary() #> #> Call: #> lm(formula = gift_aid ~ family_income, data = elmhurst) #> #> Residuals: #> Min 1Q Median 3Q Max #> -10.1128 -3.6234 -0.2161 3.1587 11.5707 #> #> Coefficients: #> Estimate Std. Error t value Pr(>|t|) #> (Intercept) 24.31933 1.29145 18.831 < 2e-16 *** #> family_income -0.04307 0.01081 -3.985 0.000229 *** #> --- #> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 #> #> Residual standard error: 4.783 on 48 degrees of freedom #> Multiple R-squared: 0.2486, Adjusted R-squared: 0.2329 #> F-statistic: 15.88 on 1 and 48 DF, p-value: 0.0002289 ``` ] .pull-right[ ```r lm(gift_aid ~ family_income, data = elmhurst) %>% tidy() #> # A tibble: 2 × 5 #> term estimate std.error statistic p.value #> <chr> <dbl> <dbl> <dbl> <dbl> #> 1 (Intercept) 24.3 1.29 18.8 8.28e-24 #> 2 family_income -0.0431 0.0108 -3.98 2.29e- 4 ``` * How do we interpret these numbers? * The intercept and slope estimates for the Elmhurst data are `\(b_0\)` = 24.319329 and `\(b_1\)` = -0.0430717 (both variables are recorded in $1000s). * Slope: for each additional $1,000 of family income, we expect a student's aid to change by `\(1{,}000 \times (-0.0431) = -43.10\)` dollars on average, i.e., $43.10 *less*. * This is an **association**, not a **causal** relationship: increasing a student's family income does not necessarily cause their aid to drop; the data only show the two move together * Intercept: `\(b_0\)` = 24.319329 describes the average aid if a student's family had no income: about $24,319 * The intercept is interpretable here, but it isn't always * p-values provide insight into how "significant" the relationship between the predictor and the response is. We'll discuss this more later.
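* A quick sketch of putting this fitted model to work, using base R's `coef()` and `predict()` (the income value 80, i.e. $80,000, is made up for illustration):

```r
fit <- lm(gift_aid ~ family_income, data = elmhurst)
coef(fit)  # named vector with intercept b0 and slope b1
# predicted gift aid (in $1000s) at a hypothetical family income of $80,000
predict(fit, newdata = data.frame(family_income = 80))
```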
] --- #### Computing the line of best fit from means and standard deviations .pull-left[ * Two crucial facts for computing the line of best fit: * (1) The slope of the least squares line can be computed by $$ b_1 = \frac{s_y}{s_x} \cdot r,$$ where `\(r\)` is the correlation and `\(s_x, s_y\)` are the sample standard deviations of the predictor and outcome, respectively * (2) If `\(\bar x\)` is the sample mean of the predictor and `\(\bar y\)` the sample mean of the outcome, then `\((\bar x, \bar y)\)` is on the least squares line <table class="table table-striped table-condensed" style="width: auto !important; margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="border-bottom:hidden;padding-bottom:0; padding-left:3px;padding-right:3px;text-align: center; " colspan="2"><div style="border-bottom: 1px solid #ddd; padding-bottom: 5px; ">Family income, x</div></th> <th style="border-bottom:hidden;padding-bottom:0; padding-left:3px;padding-right:3px;text-align: center; " colspan="2"><div style="border-bottom: 1px solid #ddd; padding-bottom: 5px; ">Gift aid, y</div></th> <th style="empty-cells: hide;border-bottom:hidden;" colspan="1"></th> </tr> <tr> <th style="text-align:center;"> mean </th> <th style="text-align:center;"> sd </th> <th style="text-align:center;"> mean </th> <th style="text-align:center;"> sd </th> <th style="text-align:center;"> r </th> </tr> </thead> <tbody> <tr> <td style="text-align:center;"> 101.7785 </td> <td style="text-align:center;"> 63.20645 </td> <td style="text-align:center;"> 19.93556 </td> <td style="text-align:center;"> 5.460581 </td> <td style="text-align:center;"> -0.4985561 </td> </tr> </tbody> </table> ] .pull-right[ * Compute the slope of the least squares line: $$ b_1 = \frac{s_y}{s_x} \cdot r = \frac{5.46}{63.2} \cdot (-0.499) = -0.0431$$ * Point-slope form of a line: for slope `\(m\)` and point `\((x_0, y_0)\)` on the line, $$ y - y_0 = m \cdot (x - x_0) $$ * Since `\((\bar x, \bar y)\)` is on the line with slope `\(b_1\)`, we get $$ y - \bar y = b_1 (x - \bar x)$$ * This gives $$ y = \bar y - b_1 \bar x + b_1 x \implies $$ $$ b_0 = \bar y - b_1 \bar x = 19.9 + 0.0431 \times 102 = 24.3. $$ * Results in the equation: $$ \widehat {\mathsf{aid}} = 24.3 - 0.0431 \times \mathsf{familyincome} $$ ] --- * Let's compare our calculated line of best fit with what R returns to us with `lm()`: $$ \widehat {\mathsf{aid}} = 24.3 - 0.0431 \times \mathsf{familyincome} $$ ```r lm(gift_aid ~ family_income, data = elmhurst) %>% tidy() #> # A tibble: 2 × 5 #> term estimate std.error statistic p.value #> <chr> <dbl> <dbl> <dbl> <dbl> #> 1 (Intercept) 24.3 1.29 18.8 8.28e-24 #> 2 family_income -0.0431 0.0108 -3.98 2.29e- 4 ``` * If we have a senior who is considering Elmhurst College, can they use this to predict their financial aid? * They may use it as an *estimate*. Aid can vary year to year and the fitted line is imperfect, so there will likely be some error. --- ### Extrapolation and its perils .pull-left[ * A linear model can give informed predictions when the new predictor values are similar to those in the data used to fit it
* Consider data like the plot on the right, where we try to make predictions at predictor values outside the range where we have data ] .pull-right[ <img src="lec14-functions-2_v0_files/figure-html/unnamed-chunk-51-1.png" width="432" /> ] --- ### Describing the strength of a fit .pull-left[ * Previously we evaluated the strength of the linear relationship between two variables using the correlation `\(r\)` * It is more common to describe this using `\(R^2\)`: "R-squared" * Roughly: it describes the amount of variation in the outcome variable explained by the least-squares line * `\(R^2\)` is called the **coefficient of determination** * `\(R^2\)` is equal to the *square of the correlation*: `\(R^2 = (r)^2\)`. * e.g. if `\(r=-0.5\)`, `\(R^2 = 0.25\)`. * Since `\(r\in [-1,1]\)`, `\(R^2 \in [0,1]\)`. ] .pull-right[ * `\(R^2\)` can be calculated in terms of the *total sum of squares* SST and the *sum of squared errors/residuals* SSE. $$ \mathsf{SST} = (y_1 - \bar y)^2 + \cdots + (y_n - \bar y)^2, $$ $$ \mathsf{SSE} = (y_1 - \hat y_1)^2 + \cdots + (y_n - \hat y_n)^2 = e_1^2 + \cdots + e_n^2.$$ `$$R^2 = \frac{\mathsf{SST} - \mathsf{SSE}}{\mathsf{SST}} = 1 - \frac{\mathsf{SSE}}{\mathsf{SST}}.$$` ] --- ### Categorical predictors with two levels .pull-left[ * Let's consider the dataset `mariokart`, which has outcomes of eBay auctions for the game Mario Kart for the Nintendo Wii. * The final price of each auction and the condition of the game were recorded ```r mariokart #> # A tibble: 143 × 12 #> id duration n_bids cond start_pr ship_pr total_pr ship_sp #> <dbl> <int> <int> <fct> <dbl> <dbl> <dbl> <fct> #> 1 150377422259 3 20 new 0.99 4 51.6 standard #> 2 260483376854 7 13 used 0.99 3.99 37.0 firstClass #> 3 320432342985 3 16 new 0.99 3.5 45.5 firstClass #> 4 280405224677 3 18 new 0.99 0 44 standard #> 5 170392227765 1 20 new 0.01 0 71 media #> 6 360195157625 3 19 new 0.99 4 45 standard #> # ℹ 137 more rows #> # ℹ 4 more variables: seller_rate <int>, stock_photo <fct>, wheels <int>, … ``` * To deal with a binary categorical variable, we record one of the values as a 0 and the other as a 1 * Coercing a factor with `as.numeric()` gives the integer index of each value's level (1, 2, ...). ```r levels(mariokart$cond) #> [1] "new" "used" ``` ] .pull-right[ ```r head(mariokart)$cond %>% as.numeric #> [1] 1 2 1 1 1 1 ``` * To convert to 0/1, we subtract one, as sketched below.
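* A quick sketch of the conversion on a small made-up factor:

```r
# as.numeric() on a factor returns the level index (1 = "new", 2 = "used"),
# so subtracting 1 gives the usual 0/1 dummy coding
cond <- factor(c("new", "used", "new"))
as.numeric(cond) - 1
#> [1] 0 1 0
```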
```r mariokart |> filter(total_pr < 100) |> ggplot(aes(x = as.numeric(cond) - 1, y = total_pr)) + geom_jitter(alpha = 0.8, position = position_jitter(width = 0.05)) + geom_smooth(method = "lm", se = FALSE, fullrange = TRUE) + scale_x_continuous( breaks = c(0, 1), minor_breaks = c(0, 1), limits = c(-0.2, 1.2), labels = c("0\n(new)", "1\n(used)") ) + labs( x = "Condition", y = "Total price" ) #> `geom_smooth()` using formula = 'y ~ x' ``` <img src="lec14-functions-2_v0_files/figure-html/unnamed-chunk-55-1.png" width="432" /> ] --- .pull-left[ ```r mariokart |> filter(total_pr < 100) |> ggplot(aes(x = as.numeric(cond) - 1, y = total_pr)) + geom_jitter(alpha = 0.8, position = position_jitter(width = 0.05)) + geom_smooth(method = "lm", se = FALSE, fullrange = TRUE) + scale_x_continuous( breaks = c(0, 1), minor_breaks = c(0, 1), limits = c(-0.2, 1.2), labels = c("0\n(new)", "1\n(used)") ) + labs( x = "Condition", y = "Total price" ) ``` We are forming a linear model for $$ \widehat{\texttt{price}} = b_0 + b_1 \times \texttt{condused}, $$ where `\(\texttt{condused}\)` is the 0/1 indicator that the game is used (R creates this indicator automatically from the factor `cond`, using the first level, "new", as the baseline). ] .pull-right[ ``` #> `geom_smooth()` using formula = 'y ~ x' ``` <img src="lec14-functions-2_v0_files/figure-html/unnamed-chunk-57-1.png" width="432" /> ```r lm(total_pr ~ cond, data = mariokart) %>% tidy #> # A tibble: 2 × 5 #> term estimate std.error statistic p.value #> <chr> <dbl> <dbl> <dbl> <dbl> #> 1 (Intercept) 53.8 3.33 16.2 7.04e-34 #> 2 condused -6.62 4.34 -1.52 1.30e- 1 ``` * We thus see the estimated linear model is `$$\widehat{\texttt{price}} = 53.8 - 6.62 \times \texttt{condused}$$` ] --- ### Outliers in linear regression .pull-left[ * The least squares line is determined by minimizing the sum of the squares of the residuals: $$ e_1^2 + \dots + e_n^2.$$ * If one point is very far from the rest of the points, the least squares line can be highly affected by it. <img src="lec14-functions-2_v0_files/figure-html/fig-outlier-plots-1-1.png" width="100%" /> ] -- .pull-right[ * Example A: a clear outlier, but it doesn't have a large influence on the line * Example B: an outlier, but it appears to be on trend with the behavior of the non-outlier points * Example C: an outlier that pulls the least squares line up, causing residuals to become more negative as x increases. ] --- ### Outliers in linear regression .pull-left[ <img src="lec14-functions-2_v0_files/figure-html/fig-outlier-plots-2-1.png" width="100%" /> ] .pull-right[ * In E, the main cloud of data shows no obvious trend in the residuals, but the single outlier makes it appear as if there is a strong positive relationship * In F, there is one outlier far from the cloud, but it falls close to the least squares line, so it doesn't have much influence ] --- ### Examples .pull-left[ * For each of these scatterplots, describe in words what the least squares fit would look like, and sketch the corresponding residual plot. <img src="lec14-functions-2_v0_files/figure-html/unnamed-chunk-61-1.png" width="432" /> (Residual plot sketch made in class) ] .pull-right[ * Consider the following two residual plots for linear models fit to two different sets of data * Describe what type of data would generate such a plot, and whether a linear model would be a good fit for these data <img src="lec14-functions-2_v0_files/figure-html/unnamed-chunk-62-1.png" width="432" /> (Data sketch made in class) ]
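* For reference, a minimal sketch of how a residual plot like these can be drawn in R, assuming the `possum` model from earlier and the `broom` package:

```r
library(broom)
# augment() attaches fitted values (.fitted) and residuals (.resid) to the data
lm(head_l ~ total_l, data = possum) |>
  augment() |>
  ggplot(aes(x = .fitted, y = .resid)) +
  geom_point(alpha = 0.7) +
  geom_hline(yintercept = 0, linetype = "dashed") +
  labs(x = "Predicted head length (mm)", y = "Residual (mm)")
```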