class: center, middle, inverse, title-slide

.title[
# Linear Regression
]
.subtitle[
## STA35B: Statistical Data Science 2
]
.author[
### Spencer Frei
]

---

### Linear regression with multiple predictors

.pull-left[

* Multiple regression extends single-predictor regression to the setting where there is still a single response but multiple predictors:

$$ y = b_0 + b_1 x_1 + \cdots + b_k x_k.$$

* Data: `loans_full_schema` data, with some tidying


```r
loans <- loans_full_schema |>
  mutate(
    credit_util = total_credit_utilized / total_credit_limit,
    bankruptcy = as.factor(if_else(public_record_bankrupt == 0, 0, 1)),
    verified_income = droplevels(verified_income)
  ) |>
  rename(credit_checks = inquiries_last_12m) |>
  select(interest_rate, verified_income, debt_to_income, credit_util,
         bankruptcy, term, credit_checks, issue_month)
```

<table class="table table-striped table-condensed" style="width: auto !important; margin-left: auto; margin-right: auto;">
<thead>
<tr> <th style="text-align:right;"> interest_rate </th> <th style="text-align:left;"> verified_income </th> <th style="text-align:right;"> debt_to_income </th> <th style="text-align:right;"> credit_util </th> <th style="text-align:right;"> bankruptcy </th> <th style="text-align:right;"> term </th> <th style="text-align:right;"> credit_checks </th> <th style="text-align:left;"> issue_month </th> </tr>
</thead>
<tbody>
<tr> <td style="text-align:right;"> 14.07 </td> <td style="text-align:left;"> Verified </td> <td style="text-align:right;"> 18.01 </td> <td style="text-align:right;"> 0.5475952 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 60 </td> <td style="text-align:right;"> 6 </td> <td style="text-align:left;"> Mar-2018 </td> </tr>
<tr> <td style="text-align:right;"> 12.61 </td> <td style="text-align:left;"> Not Verified </td> <td style="text-align:right;"> 5.04 </td> <td style="text-align:right;"> 0.1500347 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 36 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:left;"> Feb-2018 </td> </tr>
</tbody>
</table>

]

.pull-right[


```r
loans_var_def <- tribble(
  ~variable,         ~description,
  "interest_rate",   "Interest rate on the loan, annual percentage.",
  "verified_income", "Categorical variable: whether the borrower's income source and amount have been verified.",
  "debt_to_income",  "Total debt of the borrower divided by their total income.",
  "credit_util",     "Fraction of available credit the borrower is utilizing.",
  "bankruptcy",      "Indicator 0/1: whether the borrower has a past bankruptcy in their record.",
  "term",            "The length of the loan, in months.",
  "issue_month",     "The month and year the loan was issued.",
  "credit_checks",   "Number of credit checks in the last 12 months.",
)
```

]

---

### Indicators and categorical variables

.pull-left[

* We saw previously how to fit a linear regression model with a single predictor, e.g.

$$ \widehat{\texttt{interest_rate}} = b_0 + b_1 \times \texttt{bankruptcy} $$

* Bankruptcy is 0 if no history of bankruptcy, 1 if there is
* Let's inspect the least-squares linear model resulting from this


```r
m_bankruptcy <- lm(interest_rate ~ bankruptcy, data = loans)
m_bankruptcy |> tidy()
#> # A tibble: 2 × 5
#>   term        estimate std.error statistic    p.value
#>   <chr>          <dbl>     <dbl>     <dbl>      <dbl>
#> 1 (Intercept)   12.3      0.0533    231.   0
#> 2 bankruptcy1    0.737    0.153       4.82 0.00000147
```

* Slope of 0.74 means the model predicts a 0.74% higher interest rate for those with a bankruptcy

]

.pull-right[

* Let's now consider a 3-level categorical variable like `verified_income`:


```r
loans$verified_income %>% levels
#> [1] "Not Verified"    "Source Verified" "Verified"
m_verifincome <- lm(interest_rate ~ verified_income, data = loans)
m_verifincome %>% tidy()
#> # A tibble: 3 × 5
#>   term                           estimate std.error statistic   p.value
#>   <chr>                             <dbl>     <dbl>     <dbl>     <dbl>
#> 1 (Intercept)                       11.1     0.0809     137.  0
#> 2 verified_incomeSource Verified     1.42    0.111       12.8 3.79e- 37
#> 3 verified_incomeVerified           3.25    0.130       25.1 8.61e-135
```

* Each row represents the relative difference for each level of verified income
* The level "Not Verified" is missing - this is because it is the **reference level**, the default level that other levels are measured against
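* If we wanted a different baseline, we could change the reference level before fitting. A minimal sketch using base R's `relevel()` (the choice of `"Verified"` as the reference here is just for illustration):


```r
# Refit with "Verified" as the reference level instead of "Not Verified";
# coefficients then measure differences relative to Verified borrowers
loans_releveled <- loans |>
  mutate(verified_income = relevel(verified_income, ref = "Verified"))
lm(interest_rate ~ verified_income, data = loans_releveled) |> tidy()
```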
]

---

### Indicators and categorical variables

.pull-left[


```r
m_verifincome %>% tidy()
#> # A tibble: 3 × 5
#>   term                           estimate std.error statistic   p.value
#>   <chr>                             <dbl>     <dbl>     <dbl>     <dbl>
#> 1 (Intercept)                       11.1     0.0809     137.  0
#> 2 verified_incomeSource Verified     1.42    0.111       12.8 3.79e- 37
#> 3 verified_incomeVerified           3.25    0.130       25.1 8.61e-135
```

* The equation for the regression model can be written as:

$$
`\begin{aligned}
\widehat{\texttt{interest_rate}} &= 11.10 \\
&+ 1.42 \times \texttt{verified_income}_{\texttt{Source Verified}} \\
&+ 3.25 \times \texttt{verified_income}_{\texttt{Verified}}
\end{aligned}`
$$

* When `verified_income` takes value `Not Verified`, both indicator variables in the equation for the linear model are set to 0:

$$ \widehat{\texttt{interest_rate}} = 11.10 + 1.42 \times 0 + 3.25 \times 0 = 11.10 $$

]

.pull-right[

* Average interest rate for these borrowers is 11.1%.
* When `verified_income` takes value `Source Verified`, the corresponding indicator takes a value of 1 while the other is 0:

$$ \widehat{\texttt{interest_rate}} = 11.10 + 1.42 \times 1 + 3.25 \times 0 = 12.52 $$

* Average interest rate for these borrowers is 12.52%.

]

---

.pull-left[

### Multiple predictors in a linear model

* Let's suppose we want to construct a model for interest rate accounting for not just bankruptcy, but also other variables:

$$
`\begin{aligned}
&\widehat{\texttt{interest_rate}} = b_0 \\
&+ b_1 \times \texttt{verified_income}_{\texttt{Source Verified}} \\
&+ b_2 \times \texttt{verified_income}_{\texttt{Verified}} \\
&+ b_3 \times \texttt{debt_to_income} \\
&+ b_4 \times \texttt{credit_util} \\
&+ b_5 \times \texttt{bankruptcy} \\
&+ b_6 \times \texttt{term} \\
&+ b_7 \times \texttt{credit_checks} \\
&+ b_8 \times \texttt{issue_month}_{\texttt{Jan-2018}} \\
&+ b_9 \times \texttt{issue_month}_{\texttt{Mar-2018}}
\end{aligned}`
$$

]

.pull-right[

* Just as before, we find those `\(b_0, \cdots, b_9\)` such that the sum of squared residuals is small,

$$
`\begin{aligned}
SSE = e_1^2 + e_2^2 + \cdots + e_n^2 = \sum_{i=1}^n e_i^2,
\end{aligned}`
$$

where `\(n = 10{,}000\)` is the number of loans.

* We can use `lm()` just as before. But now, we use `lm(response ~ .)` to mean: use ALL predictors in the data except `response` to fit a linear model.


```r
m_full <- lm(interest_rate ~ ., data = loans)
m_full |> tidy()
#> # A tibble: 10 × 5
#>   term                           estimate std.error statistic   p.value
#>   <chr>                             <dbl>     <dbl>     <dbl>     <dbl>
#> 1 (Intercept)                      1.89     0.210        9.01 2.49e- 19
#> 2 verified_incomeSource Verified   0.997    0.0992      10.1  1.12e- 23
#> 3 verified_incomeVerified          2.56     0.117       21.9  1.25e-103
#> 4 debt_to_income                   0.0218   0.00294      7.43 1.14e- 13
#> 5 credit_util                      4.90     0.162       30.2  2.21e-192
#> 6 bankruptcy1                      0.391    0.132        2.96 3.12e-  3
#> # ℹ 4 more rows
```
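* As a check on the quantity being minimized and on the overall fit, we can pull the SSE and the adjusted `\(R^2\)` straight from the fitted model — a quick sketch using `broom::glance()` (the same package that provides `tidy()`):


```r
# Sum of squared residuals that least squares minimizes
sum(residuals(m_full)^2)
# Adjusted R^2 of the full model: about 0.2597 (used for model selection later)
glance(m_full)$adj.r.squared
```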
]

---

.pull-left[

* The fitted model from the previous slide has the following form:

$$
`\begin{aligned}
&\widehat{\texttt{interest_rate}} = 1.89 \\
&+ 1.00 \times \texttt{verified_income}_{\texttt{Source Verified}} \\
&+ 2.56 \times \texttt{verified_income}_{\texttt{Verified}} \\
&+ 0.02 \times \texttt{debt_to_income} \\
&+ 4.90 \times \texttt{credit_util} \\
&+ 0.39 \times \texttt{bankruptcy} \\
&+ 0.15 \times \texttt{term} \\
&+ 0.23 \times \texttt{credit_checks} \\
&+ 0.05 \times \texttt{issue_month}_{\texttt{Jan-2018}} \\
&- 0.04 \times \texttt{issue_month}_{\texttt{Mar-2018}}
\end{aligned}`
$$

]

.pull-right[

* Note that `verified_income` and `issue_month` are individual categorical predictors, but we have *two* coefficients for each.
* If a categorical predictor has `\(p\)` different levels, it contributes `\(p-1\)` terms to a multiple regression model
* So the effective number of predictors grows with more and more levels in a categorical variable.
* Seven total variables (verified income, debt to income, credit utilization, bankruptcy, term, credit checks, issue month) but a total of NINE predictors because of the extra levels in the categorical variables.
* Note that the intercept term here is *not* interpretable, since we cannot set `term` (= length of loan) to 0 in a meaningful way.

]

---

### Multiple regression vs. single linear regression

* When we looked at `lm(interest_rate ~ bankruptcy)`, we found a coefficient of 0.74, while for `lm(interest_rate ~ .)`, we found a coefficient of 0.39 for bankruptcy.
* This is because some of the variables are **correlated**: when just modeling interest rate as a function of bankruptcy, we didn't account for the role of income verification, debt to income ratio, etc.
* In general, correlation between variables complicates linear regression analyses and interpretation.

---

### Adjusted R squared

.pull-left[

* We saw in single-variable linear regression that `\(R^2\)` helps determine the amount of variability in the response explained by the model:

$$
`\begin{aligned}
R^2 &= 1 - \frac{\text{variability in residuals}}{\text{variability in the outcome}}\\
&= 1 - \frac{Var(e_i)}{Var(y_i)}
\end{aligned}`
$$

* For multiple linear regression, we prefer the *adjusted* `\(R^2\)`. If there are `\(k\)` predictor variables in the model,

$$
`\begin{aligned}
R_{adj}^{2} &= 1 - \frac{s_{\text{residuals}}^2 / (n-k-1)}{s_{\text{outcome}}^2 / (n-1)} \\
&= 1 - \frac{s_{\text{residuals}}^2}{s_{\text{outcome}}^2} \times \frac{n-1}{n-k-1}
\end{aligned}`
$$

(a categorical variable with `\(p\)` levels contributes `\(p-1\)` predictor variables)

]

.pull-right[

* Adjusted `\(R^2\)` is smaller than unadjusted `\(R^2\)`
* One can show that just by adding additional variables to a linear regression model, you can increase the unadjusted `\(R^2\)`
* The adjusted `\(R^2\)` accounts for this by normalizing by the number of predictors you're using (see the sketch below)
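* A small demonstration of this point — the `noise` column below is a synthetic predictor we add purely for illustration; it is not part of `loans`:


```r
# Adding a pure-noise predictor: unadjusted R^2 can only go up,
# while adjusted R^2 typically goes down, penalizing the extra variable
set.seed(1)
loans_noise <- loans |> mutate(noise = rnorm(n()))
glance(lm(interest_rate ~ ., data = loans)) |>
  select(r.squared, adj.r.squared)
glance(lm(interest_rate ~ ., data = loans_noise)) |>
  select(r.squared, adj.r.squared)
```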
]

---

### Model selection

.pull-left[

* How do we decide which variables to include when devising a linear model?
* Generally referred to as "model selection"
* Main idea: **stepwise selection**. Either start from a large model and remove variables one-by-one (*backward elimination*), or start from a small model and add variables one-by-one (*forward selection*)
* Different criteria can be used to decide whether to add/eliminate variables
* We'll mainly use adjusted R-squared - an improvement means a larger adjusted R-squared

]

.pull-right[

* Example: using `lm(interest_rate ~ ., data = loans)` results in a linear model with adjusted `\(R^2\)` of 0.2597, using the variables "verified_income", "debt_to_income", "credit_util", "bankruptcy", "term", "credit_checks", "issue_month".
* We then create new linear models where we remove exactly one variable, and we record the adjusted `\(R^2\)` as:
  * Excluding verified_income: 0.2238
  * Excluding debt_to_income: 0.2557
  * Excluding credit_util: 0.1916
  * Excluding bankruptcy: 0.2589
  * Excluding term: 0.1468
  * Excluding credit_checks: 0.2484
  * Excluding issue_month: 0.2598
* Since removing issue_month gives an adjusted `\(R^2\)` of 0.2598 > 0.2597, we drop issue_month from the model.
* We can then continue and check whether removing another variable, in addition to issue_month, results in a better adjusted `\(R^2\)`.

]

---

### Model selection: forward selection

.pull-left[

* Let's now see how to build a model using forward selection, adding one predictor at a time
* Start with a model with no predictors. This always has `\(R^2_{adj}=0\)`. Compare to models with a single variable each, which have `\(R^2_{adj}\)` of:
  - Including verified_income: 0.05926
  - Including debt_to_income: 0.01946
  - Including credit_util: 0.06452
  - Including bankruptcy: 0.00222
  - Including term: 0.12855
  - Including credit_checks: -0.0001
  - Including issue_month: 0.01711
* "term" has the largest `\(R^2_{adj}\)`, so we add this variable to the model

]

.pull-right[

* Now let's try adding a second variable in addition to "term"
* Two-predictor models with "term" and their adjusted `\(R^2\)`:
  - Including term and verified_income: 0.16851
  - Including term and debt_to_income: 0.14368
  - Including term and credit_util: 0.20046
  - Including term and bankruptcy: 0.13070
  - Including term and credit_checks: 0.12840
  - Including term and issue_month: 0.14294
* Adding "credit_util" yields the highest increase in `\(R^2_{adj}\)`, and this is higher than the baseline level of 0.12855, so we add credit_util to the model as a predictor
* We can continue this way until we do not see improvement in adjusted `\(R^2\)` — one such step is sketched below
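* One way this step could be automated — a sketch that refits the model with each candidate variable added to `term` and records the adjusted `\(R^2\)` (`candidates` is our own helper vector; assumes `broom` is loaded):


```r
# Forward-selection step 2: try adding each remaining variable to ~ term;
# "credit_util" should come out highest (about 0.20046 per the list above)
candidates <- c("verified_income", "debt_to_income", "credit_util",
                "bankruptcy", "credit_checks", "issue_month")
m_term <- lm(interest_rate ~ term, data = loans)
sapply(candidates, function(v) {
  glance(update(m_term, as.formula(paste(". ~ . +", v))))$adj.r.squared
})
```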
style="text-align:right;"> 1.38 </td> </tr> <tr> <td style="text-align:right;"> 420,000 </td> <td style="text-align:right;"> 2 </td> <td style="text-align:right;"> 3 </td> <td style="text-align:right;"> 1,745 </td> <td style="text-align:right;"> 1,959 </td> <td style="text-align:left;"> central </td> <td style="text-align:right;"> 0.51 </td> </tr> <tr> <td style="text-align:right;"> 680,000 </td> <td style="text-align:right;"> 4 </td> <td style="text-align:right;"> 3 </td> <td style="text-align:right;"> 2,091 </td> <td style="text-align:right;"> 1,961 </td> <td style="text-align:left;"> central </td> <td style="text-align:right;"> 0.84 </td> </tr> </tbody> </table> ] .pull-right[ * Variables and their descriptions: <table class="table table-striped table-condensed" style="width: auto !important; margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="text-align:left;"> Variable </th> <th style="text-align:left;"> Description </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;width: 15em; font-family: monospace;"> price </td> <td style="text-align:left;width: 25em; "> Sale price, in USD </td> </tr> <tr> <td style="text-align:left;width: 15em; font-family: monospace;"> bed </td> <td style="text-align:left;width: 25em; "> Number of bedrooms </td> </tr> <tr> <td style="text-align:left;width: 15em; font-family: monospace;"> bath </td> <td style="text-align:left;width: 25em; "> Number of bathrooms </td> </tr> <tr> <td style="text-align:left;width: 15em; font-family: monospace;"> area </td> <td style="text-align:left;width: 25em; "> Area of home, in square feet </td> </tr> <tr> <td style="text-align:left;width: 15em; font-family: monospace;"> year_built </td> <td style="text-align:left;width: 25em; "> Year the home was built </td> </tr> <tr> <td style="text-align:left;width: 15em; font-family: monospace;"> cooling </td> <td style="text-align:left;width: 25em; "> Cooling system: central or other (other is baseline) </td> </tr> <tr> <td style="text-align:left;width: 15em; font-family: monospace;"> lot </td> <td style="text-align:left;width: 25em; "> Area of the entire property, in acres </td> </tr> </tbody> </table> ] --- .pull-left[ * Let's do some exploratory data analysis ```r pr_bed <- ggplot(duke_forest, aes(x = bed, y = price)) + geom_point(alpha = 0.8) + labs( x = "Number of bedrooms", y = "Sale price (USD)" ) + stat_cor(aes(label = paste("r", ..r.., sep = "~`=`~"))) + scale_y_continuous(labels = label_dollar(scale = 1/1000, suffix = "K")) pr_bath <- ggplot(duke_forest, aes(x = bath, y = price)) + geom_point(alpha = 0.8) + labs( x = "Number of bathrooms", y = "Sale price (USD)" ) + stat_cor(aes(label = paste("r", ..r.., sep = "~`=`~"))) + scale_y_continuous(labels = label_dollar(scale = 1/1000, suffix = "K")) ``` ] .pull-right[ <img src="lec15-linear-regression-2_files/figure-html/fig-single-scatter-1.png" width="100%" /> * Seems that all variables are positively associated with price ] --- .pull-left[ * Let's now look at trying to predict `price` from `area`: single linear regression ```r price_from_area <- lm(price ~ area, data = duke_forest) price_from_area %>% tidy #> # A tibble: 2 × 5 #> term estimate std.error statistic p.value #> <chr> <dbl> <dbl> <dbl> <dbl> #> 1 (Intercept) 116652. 53302. 2.19 3.11e- 2 #> 2 area 159. 
* Residual plot:

]

.pull-right[

<img src="lec15-linear-regression-2_files/figure-html/fig-price-resid-slr-1.png" width="70%" />

* There does appear to be a linear relationship, but the residuals for expensive homes are quite large. It is not clear that a linear model is the best model for large/expensive homes; more advanced methods are needed there

]

---

.pull-left[

### Modeling price with multiple variables

* Let's now model price with multiple variables


```r
m_full <- lm(price ~ area + bed + bath + year_built + cooling + lot,
             data = duke_forest)
m_full %>% tidy()
#> # A tibble: 7 × 5
#>   term            estimate std.error statistic   p.value
#>   <chr>              <dbl>     <dbl>     <dbl>     <dbl>
#> 1 (Intercept)    -2910715.  1787934.    -1.63  0.107
#> 2 area                102.       23.1     4.42 0.0000279
#> 3 bed              -13692.    25928.    -0.528 0.599
#> 4 bath              41076.    24662.     1.67  0.0993
#> 5 year_built         1459.      914.     1.60  0.114
#> 6 coolingcentral    84065.    30338.     2.77  0.00679
#> # ℹ 1 more row
```

* This forms the predictive model

$$
`\begin{aligned}
\widehat{\texttt{price}} &= -2,910,715 + 102 \times \texttt{area} - 13,692 \times \texttt{bed} \\
&+ 41,076 \times \texttt{bath} + 1,459 \times \texttt{year_built}\\
&+ 84,065 \times \texttt{cooling}_{\texttt{central}} + 356,141 \times \texttt{lot}
\end{aligned}`
$$

]

.pull-right[

* Residual plot:

<img src="lec15-linear-regression-2_files/figure-html/fig-price-resid-mlr-nobed-1.png" width="70%" />

* Residuals appear to be randomly scattered around zero, but again there are pretty large residuals for expensive homes
* One apparent outlier at a predicted value of about -750k
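* The residual plot shown here comes from the slide image; one way such a plot can be produced is with `broom::augment()` — a sketch, assuming `broom` and `ggplot2` are loaded:


```r
# Residuals against predicted sale price for the multiple regression model
augment(m_full) |>
  ggplot(aes(x = .fitted, y = .resid)) +
  geom_point(alpha = 0.8) +
  geom_hline(yintercept = 0, linetype = "dashed") +
  labs(x = "Predicted sale price (USD)", y = "Residual")
```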
]

---

.pull-left[

### Backward elimination

* Let's see if excluding certain variables helps to improve the model's adjusted `\(R^2\)`
* Helpful function in R: `update()`
* Takes the output of `lm()` and allows you to specify what changes you'd like to make


```r
m_full <- lm(price ~ area + bed + bath + year_built + cooling + lot,
             data = duke_forest)
m_area_r_sq_adj <- update(m_full, . ~ . - area, data = duke_forest)
m_area_r_sq_adj %>% tidy()
#> # A tibble: 6 × 5
#>   term            estimate std.error statistic     p.value
#>   <chr>              <dbl>     <dbl>     <dbl>       <dbl>
#> 1 (Intercept)    -3056731.  1960957.    -1.56  0.123
#> 2 bed               15103.    27528.     0.549 0.585
#> 3 bath              91076.    24034.     3.79  0.000271
#> 4 year_built         1521.     1002.     1.52  0.133
#> 5 coolingcentral    67210.    33015.     2.04  0.0447
#> 6 lot              447962.    80120.     5.59  0.000000234
```

* `. ~ . - variable` means: use the same formula as before, but remove `variable` from it

]

.pull-right[

* The baseline adjusted `\(R^2\)` from the full model is 0.5896, and we need to determine whether dropping a predictor will improve the adjusted `\(R^2\)`. To check, we fit models that each drop a different predictor, and we record the adjusted `\(R^2\)`:
  - Excluding `area`: 0.5062
  - Excluding `bed`: 0.5929
  - Excluding `bath`: 0.5816
  - Excluding `year_built`: 0.5826
  - Excluding `cooling`: 0.5595
  - Excluding `lot`: 0.4894
* Excluding bed improves the adjusted `\(R^2\)`, so we eliminate it
  - Excluding bed and area: 0.51
  - Excluding bed and bath: 0.586
  - Excluding bed and year_built: 0.586
  - Excluding bed and cooling: 0.563
  - Excluding bed and lot: 0.493
* None of these improve on the (new) baseline adjusted `\(R^2\)` of `\(0.593\)`, so we don't change the model any further

]

---

* Removing bed, we have the following model:


```r
m_full <- lm(price ~ area + bed + bath + year_built + cooling + lot,
             data = duke_forest)
m_bed_r_sq_adj <- update(m_full, . ~ . - bed, data = duke_forest)
m_bed_r_sq_adj %>% tidy()
#> # A tibble: 6 × 5
#>   term            estimate std.error statistic    p.value
#>   <chr>              <dbl>     <dbl>     <dbl>      <dbl>
#> 1 (Intercept)    -2952641.  1779079.    -1.66  0.100
#> 2 area                99.1      22.3     4.44  0.0000249
#> 3 bath              36228.    22799.     1.59  0.116
#> 4 year_built         1466.      910.     1.61  0.111
#> 5 coolingcentral    83856.    30215.     2.78  0.00669
#> 6 lot              357119.    75617.     4.72  0.00000841
```

* This results in the following linear model:

$$
`\begin{aligned}
\widehat{\texttt{price}} &= -2,952,641 + 99 \times \texttt{area}\\
&+ 36,228 \times \texttt{bath} + 1,466 \times \texttt{year_built}\\
&+ 83,856 \times \texttt{cooling}_{\texttt{central}} + 357,119 \times \texttt{lot}
\end{aligned}`
$$
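* As a closing check, we can confirm the adjusted `\(R^2\)` gain from dropping `bed` — a sketch using `broom::glance()`:


```r
glance(m_full)$adj.r.squared          # about 0.5896 with bed included
glance(m_bed_r_sq_adj)$adj.r.squared  # about 0.5929 with bed removed
```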