class: center, middle, inverse, title-slide

.title[
# Linear Regression
]
.subtitle[
## STA35B: Statistical Data Science 2
]
.author[
### Spencer Frei
]

---

### Linear regression with multiple predictors

.pull-left[

* Multiple regression extends single-predictor regression to the setting where there is still a single response but multiple predictors:

$$ y = b_0 + b_1 x_1 + \cdots + b_k x_k.$$

* Data: `loans_full_schema` data, with some tidying


```r
loans <- loans_full_schema |>
  mutate(
    credit_util = total_credit_utilized / total_credit_limit,
    bankruptcy = as.factor(if_else(public_record_bankrupt == 0, 0, 1)),
    verified_income = droplevels(verified_income)
  ) |>
  rename(credit_checks = inquiries_last_12m) |>
  select(interest_rate, verified_income, debt_to_income, credit_util,
         bankruptcy, term, credit_checks, issue_month)
```

<table class="table table-striped table-condensed" style="width: auto !important; margin-left: auto; margin-right: auto;">
<thead>
<tr> <th style="text-align:right;"> interest_rate </th> <th style="text-align:left;"> verified_income </th> <th style="text-align:right;"> debt_to_income </th> <th style="text-align:right;"> credit_util </th> <th style="text-align:right;"> bankruptcy </th> <th style="text-align:right;"> term </th> <th style="text-align:right;"> credit_checks </th> <th style="text-align:left;"> issue_month </th> </tr>
</thead>
<tbody>
<tr> <td style="text-align:right;"> 14.07 </td> <td style="text-align:left;"> Verified </td> <td style="text-align:right;"> 18.01 </td> <td style="text-align:right;"> 0.5475952 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 60 </td> <td style="text-align:right;"> 6 </td> <td style="text-align:left;"> Mar-2018 </td> </tr>
<tr> <td style="text-align:right;"> 12.61 </td> <td style="text-align:left;"> Not Verified </td> <td style="text-align:right;"> 5.04 </td> <td style="text-align:right;"> 0.1500347 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 36 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:left;"> Feb-2018 </td> </tr>
</tbody>
</table>

]

.pull-right[


```r
loans_var_def <- tribble(
  ~variable,         ~description,
  "interest_rate",   "Interest rate on the loan, annual percentage.",
  "verified_income", "Categorical variable: whether the borrower's income source and amount have been verified.",
  "debt_to_income",  "Total debt of the borrower divided by their total income.",
  "credit_util",     "Fraction of available credit the borrower is utilizing.",
  "bankruptcy",      "Indicator 0/1: whether the borrower has a past bankruptcy in their record.",
  "term",            "The length of the loan, in months.",
  "issue_month",     "The month and year the loan was issued.",
  "credit_checks",   "Number of credit checks in the last 12 months.",
)
```

]

---

### Indicators and categorical variables

.pull-left[

* We saw previously how to fit a linear regression model with a single predictor, e.g.

$$ \widehat{\texttt{interest_rate}} = b_0 + b_1 \times \texttt{bankruptcy} $$

* Bankruptcy is 0 if no history of bankruptcy, 1 if there is
* Let's inspect the least-squares linear model resulting from this


```r
m_bankruptcy <- lm(interest_rate ~ bankruptcy, data = loans)
m_bankruptcy |> tidy()
#> # A tibble: 2 × 5
#>   term        estimate std.error statistic    p.value
#>   <chr>          <dbl>     <dbl>     <dbl>      <dbl>
#> 1 (Intercept)   12.3      0.0533    231.   0
#> 2 bankruptcy1    0.737    0.153       4.82 0.00000147
```

* Slope of 0.74 means the model predicts a 0.74% higher interest rate for those with a bankruptcy

]

.pull-right[

* Let's now consider a 3-level categorical variable like `verified_income`:


```r
loans$verified_income %>% levels
#> [1] "Not Verified"    "Source Verified" "Verified"
m_verifincome <- lm(interest_rate ~ verified_income, data = loans)
m_verifincome %>% tidy()
#> # A tibble: 3 × 5
#>   term                           estimate std.error statistic   p.value
#>   <chr>                             <dbl>     <dbl>     <dbl>     <dbl>
#> 1 (Intercept)                       11.1     0.0809     137.  0
#> 2 verified_incomeSource Verified     1.42    0.111       12.8 3.79e- 37
#> 3 verified_incomeVerified           3.25    0.130       25.1 8.61e-135
```

* Each row represents the relative difference for each level of verified income
* The level "Not Verified" is missing - this is because it is the **reference level**, the default level that other levels are measured against
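* If we wanted a different baseline, we could change the reference level before fitting. A minimal sketch using base R's `relevel()` (the choice of `"Verified"` as the reference here is just for illustration):


```r
# Refit with "Verified" as the reference level instead of "Not Verified";
# coefficients then measure differences relative to Verified borrowers
loans_releveled <- loans |>
  mutate(verified_income = relevel(verified_income, ref = "Verified"))
lm(interest_rate ~ verified_income, data = loans_releveled) |> tidy()
```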
]

---

### Indicators and categorical variables

.pull-left[


```r
m_verifincome %>% tidy()
#> # A tibble: 3 × 5
#>   term                           estimate std.error statistic   p.value
#>   <chr>                             <dbl>     <dbl>     <dbl>     <dbl>
#> 1 (Intercept)                       11.1     0.0809     137.  0
#> 2 verified_incomeSource Verified     1.42    0.111       12.8 3.79e- 37
#> 3 verified_incomeVerified           3.25    0.130       25.1 8.61e-135
```

* The equation for the regression model can be written as:

$$
`\begin{aligned}
\widehat{\texttt{interest_rate}} &= 11.10 \\
&+ 1.42 \times \texttt{verified_income}_{\texttt{Source Verified}} \\
&+ 3.25 \times \texttt{verified_income}_{\texttt{Verified}}
\end{aligned}`
$$

* When `verified_income` takes value `Not Verified`, both indicator variables in the equation for the linear model are set to 0:

$$ \widehat{\texttt{interest_rate}} = 11.10 + 1.42 \times 0 + 3.25 \times 0 = 11.10 $$

]

.pull-right[

* Average interest rate for these borrowers is 11.1%.
* When `verified_income` takes value `Source Verified`, the corresponding indicator takes a value of 1 while the other is 0:

$$ \widehat{\texttt{interest_rate}} = 11.10 + 1.42 \times 1 + 3.25 \times 0 = 12.52 $$

* Average interest rate for these borrowers is 12.52%.

]

---

.pull-left[

### Multiple predictors in a linear model

* Let's suppose we want to construct a model for interest rate accounting for not just bankruptcy, but also other variables:

$$
`\begin{aligned}
&\widehat{\texttt{interest_rate}} = b_0 \\
&+ b_1 \times \texttt{verified_income}_{\texttt{Source Verified}} \\
&+ b_2 \times \texttt{verified_income}_{\texttt{Verified}} \\
&+ b_3 \times \texttt{debt_to_income} \\
&+ b_4 \times \texttt{credit_util} \\
&+ b_5 \times \texttt{bankruptcy} \\
&+ b_6 \times \texttt{term} \\
&+ b_7 \times \texttt{credit_checks} \\
&+ b_8 \times \texttt{issue_month}_{\texttt{Jan-2018}} \\
&+ b_9 \times \texttt{issue_month}_{\texttt{Mar-2018}}
\end{aligned}`
$$

]

.pull-right[

* Just as before, we find those `\(b_0, \cdots, b_9\)` such that the sum of squared residuals is small,

$$
`\begin{aligned}
SSE = e_1^2 + e_2^2 + \cdots + e_n^2 = \sum_{i=1}^n e_i^2,
\end{aligned}`
$$

where `\(n = 10{,}000\)` is the number of loans.

* We can use `lm()` just as before. But now, we use `lm(response ~ .)` to mean: use ALL predictors in the data except `response` to fit a linear model.


```r
m_full <- lm(interest_rate ~ ., data = loans)
m_full |> tidy()
#> # A tibble: 10 × 5
#>   term                           estimate std.error statistic   p.value
#>   <chr>                             <dbl>     <dbl>     <dbl>     <dbl>
#> 1 (Intercept)                      1.89     0.210        9.01 2.49e- 19
#> 2 verified_incomeSource Verified   0.997    0.0992      10.1  1.12e- 23
#> 3 verified_incomeVerified          2.56     0.117       21.9  1.25e-103
#> 4 debt_to_income                   0.0218   0.00294      7.43 1.14e- 13
#> 5 credit_util                      4.90     0.162       30.2  2.21e-192
#> 6 bankruptcy1                      0.391    0.132        2.96 3.12e-  3
#> # ℹ 4 more rows
```
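* As a check on the quantity being minimized and on the overall fit, we can pull the SSE and the adjusted `\(R^2\)` straight from the fitted model — a quick sketch using `broom::glance()` (the same package that provides `tidy()`):


```r
# Sum of squared residuals that least squares minimizes
sum(residuals(m_full)^2)
# Adjusted R^2 of the full model: about 0.2597 (used for model selection later)
glance(m_full)$adj.r.squared
```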
]

---

.pull-left[

* The fitted model from the previous slide has the following form:

$$
`\begin{aligned}
&\widehat{\texttt{interest_rate}} = 1.89 \\
&+ 1.00 \times \texttt{verified_income}_{\texttt{Source Verified}} \\
&+ 2.56 \times \texttt{verified_income}_{\texttt{Verified}} \\
&+ 0.02 \times \texttt{debt_to_income} \\
&+ 4.90 \times \texttt{credit_util} \\
&+ 0.39 \times \texttt{bankruptcy} \\
&+ 0.15 \times \texttt{term} \\
&+ 0.23 \times \texttt{credit_checks} \\
&+ 0.05 \times \texttt{issue_month}_{\texttt{Jan-2018}} \\
&- 0.04 \times \texttt{issue_month}_{\texttt{Mar-2018}}
\end{aligned}`
$$

]

.pull-right[

* Note that `verified_income` and `issue_month` are individual categorical predictors, but we have *two* coefficients for each.
* If a categorical predictor has `\(p\)` different levels, it contributes `\(p-1\)` terms to a multiple regression model
* So the effective number of predictors grows with more and more levels in a categorical variable.
* Seven total variables (verified income, debt to income, credit utilization, bankruptcy, term, credit checks, issue month) but a total of NINE predictors because of the extra levels in the categorical variables.
* Note that the intercept term here is *not* interpretable, since we cannot set `term` (= length of loan) to 0 in a meaningful way.

]

---

### Multiple regression vs. single linear regression

* When we looked at `lm(interest_rate ~ bankruptcy)`, we found a coefficient of 0.74, while for `lm(interest_rate ~ .)`, we found a coefficient of 0.39 for bankruptcy.
* This is because some of the variables are **correlated**: when just modeling interest rate as a function of bankruptcy, we didn't account for the role of income verification, debt to income ratio, etc.
* In general, correlation between variables complicates linear regression analyses and interpretation.

---

### Adjusted R squared

.pull-left[

* We saw in single-variable linear regression that `\(R^2\)` helps determine the amount of variability in the response explained by the model:

$$
`\begin{aligned}
R^2 &= 1 - \frac{\text{variability in residuals}}{\text{variability in the outcome}}\\
&= 1 - \frac{Var(e_i)}{Var(y_i)}
\end{aligned}`
$$

* For multiple linear regression, we prefer the *adjusted* `\(R^2\)`. If there are `\(k\)` predictor variables in the model,

$$
`\begin{aligned}
R_{adj}^{2} &= 1 - \frac{s_{\text{residuals}}^2 / (n-k-1)}{s_{\text{outcome}}^2 / (n-1)} \\
&= 1 - \frac{s_{\text{residuals}}^2}{s_{\text{outcome}}^2} \times \frac{n-1}{n-k-1}
\end{aligned}`
$$

(a categorical variable with `\(p\)` levels contributes `\(p-1\)` predictor variables)

]

.pull-right[

* Adjusted `\(R^2\)` is smaller than unadjusted `\(R^2\)`
* One can show that just by adding additional variables to a linear regression model, you can increase the unadjusted `\(R^2\)`
* The adjusted `\(R^2\)` accounts for this by normalizing by the number of predictors you're using (see the sketch below)
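* A small demonstration of this point — the `noise` column below is a synthetic predictor we add purely for illustration; it is not part of `loans`:


```r
# Adding a pure-noise predictor: unadjusted R^2 can only go up,
# while adjusted R^2 typically goes down, penalizing the extra variable
set.seed(1)
loans_noise <- loans |> mutate(noise = rnorm(n()))
glance(lm(interest_rate ~ ., data = loans)) |>
  select(r.squared, adj.r.squared)
glance(lm(interest_rate ~ ., data = loans_noise)) |>
  select(r.squared, adj.r.squared)
```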
]

---

### Model selection

.pull-left[

* How do we decide which variables to include when devising a linear model?
* Generally referred to as "model selection"
* Main idea: **stepwise selection**. Either start from a large model and remove variables one-by-one (*backward elimination*), or start from a small model and add variables one-by-one (*forward selection*)
* Different criteria can be used to decide whether to add/eliminate variables
* We'll mainly use adjusted R-squared - an improvement means a larger adjusted R-squared

]

.pull-right[

* Example: using `lm(interest_rate ~ ., data = loans)` results in a linear model with adjusted `\(R^2\)` of 0.2597, using the variables "verified_income", "debt_to_income", "credit_util", "bankruptcy", "term", "credit_checks", "issue_month".
* We then create new linear models where we remove exactly one variable, and we record the adjusted `\(R^2\)` as:
  * Excluding verified_income: 0.2238
  * Excluding debt_to_income: 0.2557
  * Excluding credit_util: 0.1916
  * Excluding bankruptcy: 0.2589
  * Excluding term: 0.1468
  * Excluding credit_checks: 0.2484
  * Excluding issue_month: 0.2598
* Since removing issue_month gives an adjusted `\(R^2\)` of 0.2598 > 0.2597, we drop issue_month from the model.
* We can then continue and check whether removing another variable, in addition to issue_month, results in a better adjusted `\(R^2\)`.

]

---

### Model selection: forward selection

.pull-left[

* Let's now see how to build a model using forward selection, adding one predictor at a time
* Start with a model with no predictors. This always has `\(R^2_{adj}=0\)`. Compare to models with a single variable each, which have `\(R^2_{adj}\)` of:
  - Including verified_income: 0.05926
  - Including debt_to_income: 0.01946
  - Including credit_util: 0.06452
  - Including bankruptcy: 0.00222
  - Including term: 0.12855
  - Including credit_checks: -0.0001
  - Including issue_month: 0.01711
* "term" has the largest `\(R^2_{adj}\)`, so we add this variable to the model

]

.pull-right[

* Now let's try adding a second variable in addition to "term"
* Two-predictor models with "term" and their adjusted `\(R^2\)`:
  - Including term and verified_income: 0.16851
  - Including term and debt_to_income: 0.14368
  - Including term and credit_util: 0.20046
  - Including term and bankruptcy: 0.13070
  - Including term and credit_checks: 0.12840
  - Including term and issue_month: 0.14294
* Adding "credit_util" yields the highest increase in `\(R^2_{adj}\)`, and this is higher than the baseline level of 0.12855, so we add credit_util to the model as a predictor
* We can continue this way until we do not see improvement in adjusted `\(R^2\)` — one such step is sketched below
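* One way this step could be automated — a sketch that refits the model with each candidate variable added to `term` and records the adjusted `\(R^2\)` (`candidates` is our own helper vector; assumes `broom` is loaded):


```r
# Forward-selection step 2: try adding each remaining variable to ~ term;
# "credit_util" should come out highest (about 0.20046 per the list above)
candidates <- c("verified_income", "debt_to_income", "credit_util",
                "bankruptcy", "credit_checks", "issue_month")
m_term <- lm(interest_rate ~ term, data = loans)
sapply(candidates, function(v) {
  glance(update(m_term, as.formula(paste(". ~ . +", v))))$adj.r.squared
})
```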
style="text-align:right;"> 1.38 </td> </tr> <tr> <td style="text-align:right;"> 420,000 </td> <td style="text-align:right;"> 2 </td> <td style="text-align:right;"> 3 </td> <td style="text-align:right;"> 1,745 </td> <td style="text-align:right;"> 1,959 </td> <td style="text-align:left;"> central </td> <td style="text-align:right;"> 0.51 </td> </tr> <tr> <td style="text-align:right;"> 680,000 </td> <td style="text-align:right;"> 4 </td> <td style="text-align:right;"> 3 </td> <td style="text-align:right;"> 2,091 </td> <td style="text-align:right;"> 1,961 </td> <td style="text-align:left;"> central </td> <td style="text-align:right;"> 0.84 </td> </tr> </tbody> </table> ] .pull-right[ * Variables and their descriptions: <table class="table table-striped table-condensed" style="width: auto !important; margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="text-align:left;"> Variable </th> <th style="text-align:left;"> Description </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;width: 15em; font-family: monospace;"> price </td> <td style="text-align:left;width: 25em; "> Sale price, in USD </td> </tr> <tr> <td style="text-align:left;width: 15em; font-family: monospace;"> bed </td> <td style="text-align:left;width: 25em; "> Number of bedrooms </td> </tr> <tr> <td style="text-align:left;width: 15em; font-family: monospace;"> bath </td> <td style="text-align:left;width: 25em; "> Number of bathrooms </td> </tr> <tr> <td style="text-align:left;width: 15em; font-family: monospace;"> area </td> <td style="text-align:left;width: 25em; "> Area of home, in square feet </td> </tr> <tr> <td style="text-align:left;width: 15em; font-family: monospace;"> year_built </td> <td style="text-align:left;width: 25em; "> Year the home was built </td> </tr> <tr> <td style="text-align:left;width: 15em; font-family: monospace;"> cooling </td> <td style="text-align:left;width: 25em; "> Cooling system: central or other (other is baseline) </td> </tr> <tr> <td style="text-align:left;width: 15em; font-family: monospace;"> lot </td> <td style="text-align:left;width: 25em; "> Area of the entire property, in acres </td> </tr> </tbody> </table> ] --- .pull-left[ * Let's do some exploratory data analysis ```r pr_bed <- ggplot(duke_forest, aes(x = bed, y = price)) + geom_point(alpha = 0.8) + labs( x = "Number of bedrooms", y = "Sale price (USD)" ) + stat_cor(aes(label = paste("r", ..r.., sep = "~`=`~"))) + scale_y_continuous(labels = label_dollar(scale = 1/1000, suffix = "K")) pr_bath <- ggplot(duke_forest, aes(x = bath, y = price)) + geom_point(alpha = 0.8) + labs( x = "Number of bathrooms", y = "Sale price (USD)" ) + stat_cor(aes(label = paste("r", ..r.., sep = "~`=`~"))) + scale_y_continuous(labels = label_dollar(scale = 1/1000, suffix = "K")) ``` ] .pull-right[ <img src="lec15-linear-regression-2_files/figure-html/fig-single-scatter-1.png" width="100%" /> * Seems that all variables are positively associated with price ] --- .pull-left[ * Let's now look at trying to predict `price` from `area`: single linear regression ```r price_from_area <- lm(price ~ area, data = duke_forest) price_from_area %>% tidy #> # A tibble: 2 × 5 #> term estimate std.error statistic p.value #> <chr> <dbl> <dbl> <dbl> <dbl> #> 1 (Intercept) 116652. 53302. 2.19 3.11e- 2 #> 2 area 159. 
* Residual plot:

]

.pull-right[

<img src="lec15-linear-regression-2_files/figure-html/fig-price-resid-slr-1.png" width="70%" />

* There does appear to be a linear relationship, but the residuals for expensive homes are quite large. It is not clear that a linear model is the best model for large/expensive homes; more advanced methods are needed there

]

---

.pull-left[

### Modeling price with multiple variables

* Let's now model price with multiple variables


```r
m_full <- lm(price ~ area + bed + bath + year_built + cooling + lot,
             data = duke_forest)
m_full %>% tidy()
#> # A tibble: 7 × 5
#>   term            estimate std.error statistic   p.value
#>   <chr>              <dbl>     <dbl>     <dbl>     <dbl>
#> 1 (Intercept)    -2910715.  1787934.    -1.63  0.107
#> 2 area                102.       23.1     4.42 0.0000279
#> 3 bed              -13692.    25928.    -0.528 0.599
#> 4 bath              41076.    24662.     1.67  0.0993
#> 5 year_built         1459.      914.     1.60  0.114
#> 6 coolingcentral    84065.    30338.     2.77  0.00679
#> # ℹ 1 more row
```

* This forms the predictive model

$$
`\begin{aligned}
\widehat{\texttt{price}} &= -2,910,715 + 102 \times \texttt{area} - 13,692 \times \texttt{bed} \\
&+ 41,076 \times \texttt{bath} + 1,459 \times \texttt{year_built}\\
&+ 84,065 \times \texttt{cooling}_{\texttt{central}} + 356,141 \times \texttt{lot}
\end{aligned}`
$$

]

.pull-right[

* Residual plot:

<img src="lec15-linear-regression-2_files/figure-html/fig-price-resid-mlr-nobed-1.png" width="70%" />

* Residuals appear to be randomly scattered around zero, but again there are pretty large residuals for expensive homes
* One apparent outlier at a predicted value of about -750k
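* The residual plot shown here comes from the slide image; one way such a plot can be produced is with `broom::augment()` — a sketch, assuming `broom` and `ggplot2` are loaded:


```r
# Residuals against predicted sale price for the multiple regression model
augment(m_full) |>
  ggplot(aes(x = .fitted, y = .resid)) +
  geom_point(alpha = 0.8) +
  geom_hline(yintercept = 0, linetype = "dashed") +
  labs(x = "Predicted sale price (USD)", y = "Residual")
```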
]

---

.pull-left[

### Backward elimination

* Let's see if excluding certain variables helps to improve the model's adjusted `\(R^2\)`
* Helpful function in R: `update()`
* Takes the output of `lm()` and allows you to specify what changes you'd like to make


```r
m_full <- lm(price ~ area + bed + bath + year_built + cooling + lot,
             data = duke_forest)
m_area_r_sq_adj <- update(m_full, . ~ . - area, data = duke_forest)
m_area_r_sq_adj %>% tidy()
#> # A tibble: 6 × 5
#>   term            estimate std.error statistic     p.value
#>   <chr>              <dbl>     <dbl>     <dbl>       <dbl>
#> 1 (Intercept)    -3056731.  1960957.    -1.56  0.123
#> 2 bed               15103.    27528.     0.549 0.585
#> 3 bath              91076.    24034.     3.79  0.000271
#> 4 year_built         1521.     1002.     1.52  0.133
#> 5 coolingcentral    67210.    33015.     2.04  0.0447
#> 6 lot              447962.    80120.     5.59  0.000000234
```

* `. ~ . - variable` means: use the same formula as before, but remove `variable` from it

]

.pull-right[

* The baseline adjusted `\(R^2\)` from the full model is 0.5896, and we need to determine whether dropping a predictor will improve the adjusted `\(R^2\)`. To check, we fit models that each drop a different predictor, and we record the adjusted `\(R^2\)`:
  - Excluding `area`: 0.5062
  - Excluding `bed`: 0.5929
  - Excluding `bath`: 0.5816
  - Excluding `year_built`: 0.5826
  - Excluding `cooling`: 0.5595
  - Excluding `lot`: 0.4894
* Excluding bed improves the adjusted `\(R^2\)`, so we eliminate it
  - Excluding bed and area: 0.51
  - Excluding bed and bath: 0.586
  - Excluding bed and year_built: 0.586
  - Excluding bed and cooling: 0.563
  - Excluding bed and lot: 0.493
* None of these improve on the (new) baseline adjusted `\(R^2\)` of `\(0.593\)`, so we don't change the model any further

]

---

* Removing bed, we have the following model:


```r
m_full <- lm(price ~ area + bed + bath + year_built + cooling + lot,
             data = duke_forest)
m_bed_r_sq_adj <- update(m_full, . ~ . - bed, data = duke_forest)
m_bed_r_sq_adj %>% tidy()
#> # A tibble: 6 × 5
#>   term            estimate std.error statistic    p.value
#>   <chr>              <dbl>     <dbl>     <dbl>      <dbl>
#> 1 (Intercept)    -2952641.  1779079.    -1.66  0.100
#> 2 area                99.1      22.3     4.44  0.0000249
#> 3 bath              36228.    22799.     1.59  0.116
#> 4 year_built         1466.      910.     1.61  0.111
#> 5 coolingcentral    83856.    30215.     2.78  0.00669
#> 6 lot              357119.    75617.     4.72  0.00000841
```

* This results in the following linear model:

$$
`\begin{aligned}
\widehat{\texttt{price}} &= -2,952,641 + 99 \times \texttt{area}\\
&+ 36,228 \times \texttt{bath} + 1,466 \times \texttt{year_built}\\
&+ 83,856 \times \texttt{cooling}_{\texttt{central}} + 357,119 \times \texttt{lot}
\end{aligned}`
$$
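* As a closing check, we can confirm the adjusted `\(R^2\)` gain from dropping `bed` — a sketch using `broom::glance()`:


```r
glance(m_full)$adj.r.squared          # about 0.5896 with bed included
glance(m_bed_r_sq_adj)$adj.r.squared  # about 0.5929 with bed removed
```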