class: center, middle, inverse, title-slide

.title[
# Inference for comparing many means, and inference for linear regression
]
.subtitle[
## STA35B: Statistical Data Science 2
]
.author[
### Spencer Frei
Figures taken from [IMS], Cetinkaya-Rundel and Hardin text
]

---

.pull-left[
#### F distribution

* For comparing many means, we use the `\(F\)` distribution/statistic
* **Goal**: assess whether the variability we see in the sample means is so large that it is unlikely to be due to random chance
* To do this, we compute the **sum of squares between groups**.
* For each group `\(j = 1, \ldots, k\)` (with `\(n_j\)` observations in group `\(j\)`), we can calculate a sample mean `\(\bar x_j\)`
* We can also calculate the sample mean across *all* observations, `\(\bar x\)`

`$$\begin{align*}\text{sum of squares btwn groups}&=SSG \newline\\ &= \sum_{j=1}^k n_j (\bar x_j - \bar x)^2\end{align*}$$`
]

.pull-right[
* We compare this to the total variability of all observations around the overall mean, measured by the **sum of squares total**

`$$\begin{align*}SST&= \sum_{i=1}^n ( x_i - \bar x)^2\end{align*}$$`

* The variability left over within groups is the **sum of squared errors**, `\(SSE = SST - SSG\)`
* The **mean square between groups** is a normalized version of SSG, and the **mean squared error** is a normalized version of SSE:

`$$\begin{align*} MSG &= \frac{1}{\mathsf{df}_G} SSG = \frac{1}{k-1} SSG,\newline MSE&=\frac{1}{\mathsf{df}_E} SSE =\frac{1}{n-k} SSE.\end{align*}$$`

* The F statistic is the ratio of these two quantities:

`$$F = \frac{MSG}{MSE}$$`

* Under the null hypothesis that all means are the same, and provided the ANOVA conditions hold, `\(F\)` follows an `\(F\)` distribution with `\(df_1=k-1\)`, `\(df_2=n-k\)`.
]
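---

#### Computing the F statistic by hand

* Before looking at R's ANOVA output, here is a minimal sketch (using a small *hypothetical* dataset `dat`) of how SSG, SSE, MSG, MSE, and the F statistic can be computed directly from the formulas above:

```r
library(dplyr)

# hypothetical data: numeric response x, grouping variable group
dat <- tibble(
  x     = c(4.1, 5.2, 6.3, 5.8, 7.9, 8.4, 6.6, 7.1, 5.0),
  group = rep(c("A", "B", "C"), each = 3)
)

n    <- nrow(dat)               # total number of observations
k    <- n_distinct(dat$group)   # number of groups
xbar <- mean(dat$x)             # overall mean

# per-group sizes and means
grp <- dat %>%
  group_by(group) %>%
  summarize(nj = n(), xbarj = mean(x))

ssg <- sum(grp$nj * (grp$xbarj - xbar)^2)  # sum of squares between groups
sst <- sum((dat$x - xbar)^2)               # sum of squares total
sse <- sst - ssg                           # sum of squared errors

msg    <- ssg / (k - 1)   # mean square between groups
mse    <- sse / (n - k)   # mean squared error
f_stat <- msg / mse

# p-value: upper-tail area under the F(k - 1, n - k) distribution
pf(f_stat, df1 = k - 1, df2 = n - k, lower.tail = FALSE)
```

* These values match the `sumsq`, `meansq`, `statistic`, and `p.value` columns that `anova()` reports for the one-way model `lm(x ~ group, data = dat)`.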
---

#### ANOVA tables from R

* Let's again look at the MLB data (note we are using a modified version of the `mlb_players_18` dataset)

<table class="table table-striped table-condensed" style="margin-left: auto; margin-right: auto;">
 <thead>
  <tr>
   <th style="text-align:left;"> name </th>
   <th style="text-align:left;"> team </th>
   <th style="text-align:left;"> position </th>
   <th style="text-align:right;"> OBP </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;"> Abreu, J </td>
   <td style="text-align:left;"> CWS </td>
   <td style="text-align:left;"> IF </td>
   <td style="text-align:right;"> 0.325 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Acuna Jr., R </td>
   <td style="text-align:left;"> ATL </td>
   <td style="text-align:left;"> OF </td>
   <td style="text-align:right;"> 0.366 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Adames, W </td>
   <td style="text-align:left;"> TB </td>
   <td style="text-align:left;"> IF </td>
   <td style="text-align:right;"> 0.348 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Adams, M </td>
   <td style="text-align:left;"> STL </td>
   <td style="text-align:left;"> IF </td>
   <td style="text-align:right;"> 0.309 </td>
  </tr>
</tbody>
</table>

* We checked last time that all of the conditions needed for applying the F statistic hold, except possibly independence, which we couldn't check (let's assume it does hold)
* Output of ANOVA and `lm()` in R:

```r
mod <- lm(OBP ~ position, data = mlb_players_18)
anova(mod) %>%
  tidy() %>%
  kable()
```

|term      |  df|     sumsq|    meansq| statistic|   p.value|
|:---------|---:|---------:|---------:|---------:|---------:|
|position  |   2| 0.0160627| 0.0080314|  5.076574| 0.0066242|
|Residuals | 426| 0.6739500| 0.0015820|        NA|        NA|

---

#### Understanding the ANOVA output in R

|term      |  df|     sumsq|    meansq| statistic|   p.value|
|:---------|---:|---------:|---------:|---------:|---------:|
|position  |   2| 0.0160627| 0.0080314|  5.076574| 0.0066242|
|Residuals | 426| 0.6739500| 0.0015820|        NA|        NA|

* You can derive this entire tibble using just `df` and `sumsq` (and `pf()` to calculate the p-value from the statistic)

|term      |  df|     sumsq|meansq |statistic |p.value |
|:---------|---:|---------:|:------|:---------|:-------|
|position  |   2| 0.0160627|X      |X         |X       |
|Residuals | 426| 0.6739500|X      |X         |NA      |

* For `position` (the independent variable): `df` `\(=k-1\)` with `\(k\)` groups; `sumsq` <-> SSG; `meansq` <-> MSG
* For `Residuals`: `df` `\(=n-k\)`; `sumsq` <-> SSE; `meansq` <-> MSE
* `statistic` is the F statistic, `\(F = MSG/MSE\)`

---

#### Understanding the ANOVA output in R

|term      |  df|     sumsq|    meansq| statistic|   p.value|
|:---------|---:|---------:|---------:|---------:|---------:|
|position  |   2| 0.0160627| 0.0080314|  5.076574| 0.0066242|
|Residuals | 426| 0.6739500| 0.0015820|        NA|        NA|

* You can derive this entire tibble using just `df` and `sumsq` (and `pf()` to calculate the p-value from the statistic)

|term      |  df|     sumsq|    meansq|statistic |p.value |
|:---------|---:|---------:|---------:|:---------|:-------|
|position  |   2| 0.0160627| 0.0080314|X         |X       |
|Residuals | 426| 0.6739500| 0.0015820|X         |NA      |

* For `position` (the independent variable): `df` `\(=k-1\)` with `\(k\)` groups; `sumsq` <-> SSG; `meansq` <-> MSG
* For `Residuals`: `df` `\(=n-k\)`; `sumsq` <-> SSE; `meansq` <-> MSE
* `statistic` is the F statistic, `\(F = MSG/MSE\)`

.pull-left[
* For `position`, `meansq` `\(\approx 0.016 / 2 = 0.008\)`
* For `Residuals`, `meansq` `\(\approx 0.674 / 426 = 0.00158\)`
]

---

#### Understanding the ANOVA output in R

|term      |  df|     sumsq|    meansq| statistic|   p.value|
|:---------|---:|---------:|---------:|---------:|---------:|
|position  |   2| 0.0160627| 0.0080314|  5.076574| 0.0066242|
|Residuals | 426| 0.6739500| 0.0015820|        NA|        NA|

* You can derive this entire tibble using just `df` and `sumsq` (and `pf()` to calculate the p-value from the statistic)

|term      |  df|     sumsq|    meansq| statistic|p.value |
|:---------|---:|---------:|---------:|---------:|:-------|
|position  |   2| 0.0160627| 0.0080314|  5.076574|X       |
|Residuals | 426| 0.6739500| 0.0015820|        NA|NA      |

* For `position` (the independent variable): `df` `\(=k-1\)` with `\(k\)` groups; `sumsq` <-> SSG; `meansq` <-> MSG
* For `Residuals`: `df` `\(=n-k\)`; `sumsq` <-> SSE; `meansq` <-> MSE
* `statistic` is the F statistic, `\(F = MSG/MSE\)`

.pull-left[
* For `position`, `meansq` `\(\approx 0.016 / 2 = 0.008\)`
* For `Residuals`, `meansq` `\(\approx 0.674 / 426 = 0.00158\)`
* `statistic`: MSG/MSE `\(\approx 0.008/0.00158 \approx 5.07\)`
]

---

#### Understanding the ANOVA output in R

|term      |  df|     sumsq|    meansq| statistic|   p.value|
|:---------|---:|---------:|---------:|---------:|---------:|
|position  |   2| 0.0160627| 0.0080314|  5.076574| 0.0066242|
|Residuals | 426| 0.6739500| 0.0015820|        NA|        NA|

* You can derive this entire tibble using just `df` and `sumsq` (and `pf()` to calculate the p-value from the statistic)

|term      |  df|     sumsq|    meansq| statistic|   p.value|
|:---------|---:|---------:|---------:|---------:|---------:|
|position  |   2| 0.0160627| 0.0080314|  5.076574| 0.0066242|
|Residuals | 426| 0.6739500| 0.0015820|        NA|        NA|

* For `position` (the independent variable): `df` `\(=k-1\)` with `\(k\)` groups; `sumsq` <-> SSG; `meansq` <-> MSG
* For `Residuals`: `df` `\(=n-k\)`; `sumsq` <-> SSE; `meansq` <-> MSE
* `statistic` is the F statistic, `\(F = MSG/MSE\)`

.pull-left[
* For `position`, `meansq` `\(\approx 0.016 / 2 = 0.008\)`
* For `Residuals`, `meansq` `\(\approx 0.674 / 426 = 0.00158\)`
]

.pull-right[
* `statistic`: MSG/MSE `\(\approx 0.008/0.00158 \approx 5.07\)`
* For `p.value`, can use `pf()`:

```r
1 - pf(5.08, df1 = 2, df2 = 426)
#> [1] 0.006602098
```
]
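---

#### Rebuilding the ANOVA table from `df` and `sumsq`

* As a check on the arithmetic above, here is a short sketch that fills in the masked columns using only `df`, `sumsq`, and `pf()` (the two input vectors are copied from the table on the previous slide):

```r
dfs    <- c(2, 426)                # df: position, Residuals
sumsqs <- c(0.0160627, 0.6739500)  # sumsq: SSG, SSE

meansq <- sumsqs / dfs             # MSG, MSE
f_stat <- meansq[1] / meansq[2]    # F = MSG / MSE, approx 5.0766

# upper-tail area gives the p-value, approx 0.0066 as in the table
pf(f_stat, df1 = dfs[1], df2 = dfs[2], lower.tail = FALSE)
```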
---

.pull-left[
### Inference for linear regression with a single variable

* Let's think about linear regression again
* We'll consider a setting with chain sandwich stores, where each store spent a different amount on advertising and then received some amount of revenue
* Let's imagine we had access to **all** data, for all chain sandwich stores and the amount each spent on advertising, and it looked like this

<img src="lec23-inference-regression_files/figure-html/unnamed-chunk-12-1.png" width="80%" />
]

.pull-right[
The population model is:

`$$y = \beta_0 + \beta_1 x + \varepsilon.$$`

If we had full information on the entire population, the model would be:

`$$\texttt{expected revenue} = 11.23 + 4.8 \times \texttt{advertising}.$$`

* For every \$1,000 spent on advertising, revenue increases by \$4,800 on average
* If we spent nothing on advertising, we expect \$11,230 revenue on average
* However, it is unrealistic to expect that we can have knowledge of **all** data---more realistically, we have knowledge of a sample.
]

---

.pull-left[
* When we take samples, our estimates for the slope of the line in linear regression change from sample to sample

<img src="lec23-inference-regression_files/figure-html/fig-sand-samp1-1.png" width="432" />
]

--

.pull-right[
<img src="lec23-inference-regression_files/figure-html/fig-sand-samp2-1.png" width="80%" />

If we did this on many different samples of 20 stores, we would get varied lines:

<img src="lec23-inference-regression_files/figure-html/fig-slopes-1.png" width="80%" />
]

---

.pull-left[
* The relevant hypotheses for a linear model are about whether or not the population slope parameter is zero.
  - `\(H_0\)`: `\(\beta_1 = 0\)`, no linear relationship between response and independent variable
  - `\(H_A\)`: `\(\beta_1 \neq 0\)`, some linear relationship between response and independent variable
* We can perform analyses similar to those we used before:
  - Randomization test
  - Bootstrap confidence interval
  - Mathematical approach
]

--

.pull-right[
#### Randomization test

* In the randomization test, we randomly permute the responses
* Then any structure which existed in the original dataset (independent variable --> response) disappears
* This gives us a baseline for the variability we should expect under the null hypothesis
* E.g., let's consider the **births14** dataset as before, which has baby birth weight and number of weeks of gestation

<img src="lec23-inference-regression_files/figure-html/fig-permweightScatter-1.png" width="432" />
]

---

.pull-left[
* When we do repeated randomizations, we see what types of slopes we would get when there is no relationship between the response and the independent variable.

<img src="lec23-inference-regression_files/figure-html/fig-permweekslm-1.png" width="432" />
]

--

.pull-right[
<img src="lec23-inference-regression_files/figure-html/fig-nulldistBirths-1.png" width="432" />

* If there were no linear relationship between `weight` and `weeks`, the natural variability of the slopes would produce estimates between approximately -0.15 and +0.15.
* We reject the null hypothesis.
* We believe that the slope observed on the original data is not just due to natural variability---rather, we believe there is a linear relationship between the `weight` of the baby and `weeks` of gestation
]
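---

#### Sketch: simulating the permutation null distribution

* The permutation figures above were generated separately; as a rough sketch, here is one way such a null distribution of slopes could be computed (assuming the `births14` data from the **openintro** package, with missing values dropped):

```r
library(tidyverse)
library(openintro)

births <- births14 %>% drop_na(weight, weeks)

# slope fit on the original data
obs_slope <- coef(lm(weight ~ weeks, data = births))[["weeks"]]

# shuffle the response many times, refit, and record each slope
set.seed(1)
null_slopes <- map_dbl(1:1000, ~ {
  shuffled <- births %>% mutate(weight = sample(weight))
  coef(lm(weight ~ weeks, data = shuffled))[["weeks"]]
})

# two-sided permutation p-value: how often is a shuffled slope
# at least as extreme as the observed slope?
mean(abs(null_slopes) >= abs(obs_slope))
```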
---

.pull-left[
#### Bootstrap confidence interval for the slope

* Let's now consider predicting the weight of the baby as a function of the mother's age
* When we run the following code, we get

```r
model <- lm(weight ~ mage, data = births14_100)
```

<table class="table table-striped table-condensed" style="margin-left: auto; margin-right: auto;">
 <thead>
  <tr>
   <th style="text-align:left;"> term </th>
   <th style="text-align:right;"> estimate </th>
   <th style="text-align:right;"> std.error </th>
   <th style="text-align:right;"> statistic </th>
   <th style="text-align:right;"> p.value </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;width: 10em; font-family: monospace;"> (Intercept) </td>
   <td style="text-align:right;width: 5em; "> 7.34 </td>
   <td style="text-align:right;width: 5em; "> 0.60 </td>
   <td style="text-align:right;width: 5em; "> 12.29 </td>
   <td style="text-align:right;width: 5em; "> <0.0001 </td>
  </tr>
  <tr>
   <td style="text-align:left;width: 10em; font-family: monospace;"> mage </td>
   <td style="text-align:right;width: 5em; "> 0.00 </td>
   <td style="text-align:right;width: 5em; "> 0.02 </td>
   <td style="text-align:right;width: 5em; "> -0.11 </td>
   <td style="text-align:right;width: 5em; "> 0.915 </td>
  </tr>
</tbody>
</table>

* For the bootstrap, we will take bootstrap samples of the data (sampling with replacement)
* A given observation may appear multiple times in a bootstrap sample, or not at all
]

--

.pull-right[
<img src="lec23-inference-regression_files/figure-html/fig-birth2BS-1.png" width="85%" /><img src="lec23-inference-regression_files/figure-html/fig-birth2BS-2.png" width="85%" />
]

---

.pull-left[
* For each bootstrap sample, we compute a linear model on the data
* We then plot a histogram of all of the different slopes we get
* This gives us a sense of the variability of the slopes across random samples
* Suppose all of the bootstrap slopes are given in the vector `slopes` (a sketch of generating `slopes` follows on the next slide)
* The following code then calculates the bootstrap 95% confidence interval

```r
quantile(slopes, probs = c(0.025, 0.975))
```
]

--

.pull-right[
<img src="lec23-inference-regression_files/figure-html/fig-mageBSslopes-1.png" width="85%" />

* From this we say that we are 95% confident that, for the model of the weight of a baby described by the mother's age, a one unit increase in mother's age is associated with a change in predicted average baby weight of between -0.01 and 0.081 pounds
* This interval includes 0, so it is possible that mother's age does not have a meaningful linear relationship with baby's weight
]
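---

#### Sketch: bootstrapping the slope

* As a sketch (not the code used for the figures), here is one way the vector `slopes` could be produced from the sample `births14_100` used above:

```r
library(tidyverse)

set.seed(2)
slopes <- map_dbl(1:1000, ~ {
  # resample the rows with replacement, then refit the model
  boot_sample <- births14_100 %>%
    slice_sample(n = nrow(births14_100), replace = TRUE)
  coef(lm(weight ~ mage, data = boot_sample))[["mage"]]
})

# percentile bootstrap 95% confidence interval for the slope
quantile(slopes, probs = c(0.025, 0.975))
```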
---

.pull-left[
### Mathematical model for testing a slope

* Under certain conditions, we can use the `\(t\)`-distribution to test and estimate the slope parameter in linear regression. Conditions:
  - **Linearity.** The data should show a linear trend.
  - **Independent observations.** Be cautious about applying regression to data that are sequential observations in time, e.g., daily stock prices
  - **Nearly normal residuals.** Generally, the residuals should be nearly normal. When this condition is found to be unreasonable, it is often because of outliers or concerns about influential points
  - **Constant or equal variability.** The variability of points around the least squares line remains roughly constant.
]

--

.pull-right[
<img src="lec23-inference-regression_files/figure-html/unnamed-chunk-26-1.png" width="100%" />
]

---

.pull-left[
### Example

* Let's consider a dataset with info on elections, with the unemployment rate and election outcomes in midterm elections
* The question is whether a higher unemployment rate entails worse performance for the current President's party
* `midterms_house` data in *openintro*

```
#> tibble [31 × 5] (S3: tbl_df/tbl/data.frame)
#>  $ year        : int [1:31] 1899 1903 1907 1911 1915 1919 1923 1927 1931 1935 ...
#>  $ potus       : Factor w/ 19 levels "Barack Obama",..: 18 16 16 17 19 19 3 3 11 6 ...
#>  $ party       : Factor w/ 2 levels "Democrat","Republican": 2 2 2 2 1 1 2 2 2 1 ...
#>  $ unemp       : num [1:31] 11.62 4.3 3.29 5.86 6.63 ...
#>  $ house_change: num [1:31] -9.22 -4.28 -12.29 -26.59 -20.96 ...
```
]

--

.pull-right[
* Let's build a linear model for the change in House seats for a president's party as a function of the unemployment rate
* Let's restrict to election years OUTSIDE of the Great Depression (i.e., not 1935 or 1939)
* First check that there are no substantial outliers and no clear nonlinearity

```r
midterms_house %>%
  filter(!(year %in% c(1935, 1939))) %>%
  ggplot(aes(x = unemp, y = house_change)) +
  geom_point(aes(color = party, shape = party),
             size = 3, alpha = 0.7) +
  theme(legend.position = c(0.8, 0.8),
        legend.background = element_rect(color = "white"))
```

<img src="lec23-inference-regression_files/figure-html/fig-unemploymentAndChangeInHouse-1.png" width="65%" />
]

---

Let's investigate the residuals:

.pull-left[
<img src="lec23-inference-regression_files/figure-html/unnamed-chunk-29-1.png" width="432" />
]

.pull-right[
<img src="lec23-inference-regression_files/figure-html/unnamed-chunk-30-1.png" width="432" />

* No clear problems with the residuals
* It does appear the linear model might not be the best fit, but let's investigate
]
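---

#### Sketch: producing the residual plots

* The residual plots above come from pre-rendered figures; here is a minimal sketch of how similar diagnostics could be drawn with **broom** and **ggplot2** (the columns `.fitted` and `.resid` are added by `augment()`; the object names are our own):

```r
library(tidyverse)
library(broom)
library(openintro)

not_gd_data <- midterms_house %>%
  filter(!(year %in% c(1935, 1939)))
mod <- lm(house_change ~ unemp, data = not_gd_data)
aug <- augment(mod)

# residuals vs. fitted values: look for nonlinearity or unequal spread
ggplot(aug, aes(x = .fitted, y = .resid)) +
  geom_point() +
  geom_hline(yintercept = 0, linetype = "dashed")

# histogram of residuals: check the nearly-normal condition
ggplot(aug, aes(x = .resid)) +
  geom_histogram(bins = 10)
```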
---

.pull-left[
* Let's build the linear model for the non-Great Depression data

```r
not_gd_data <- midterms_house %>%
  filter(!(year %in% c(1935, 1939)))
lm(house_change ~ unemp, data = not_gd_data) %>%
  tidy()
#> # A tibble: 2 × 5
#>   term        estimate std.error statistic p.value
#>   <chr>          <dbl>     <dbl>     <dbl>   <dbl>
#> 1 (Intercept)   -7.36      5.16      -1.43   0.165
#> 2 unemp         -0.890     0.835     -1.07   0.296
```

* From this we see the following predictive model:

$$ `\begin{aligned} &\texttt{Percent change House seats for President's party} \newline \\ &\qquad\qquad= -7.36 - 0.89 \times \texttt{(unemployment rate)} \end{aligned}` $$

* This predicts that a greater unemployment rate implies worse performance in the House midterm election
* Is there significant evidence that the "true" linear model has a negative slope?
]

--

.pull-right[
* We can frame this as a hypothesis test:
  - `\(H_0:\ \beta_1 = 0\)`, the true linear model has slope zero
  - `\(H_A:\ \beta_1 \neq 0\)`, the true linear model has slope different from zero; the unemployment rate is predictive of the change in seats after a midterm election
* To test the null, we use the following test statistic

$$ T = \frac{\text{estimate} - \text{null}}{SE}.$$

* This follows a `\(t\)` distribution with `\(n-2\)` degrees of freedom. The 2 comes from the linear model having two parameters: `\(\beta_0\)`, the intercept, and `\(\beta_1\)`, the slope.
* In this course we will not teach how to compute SE---just use the R table output.
* We can thus compute `\(T = \frac{-0.890-0}{0.835} = -1.0659\)`.
]

---

.pull-left[
#### P-value for vote vs. unemployment

```r
not_gd_data <- midterms_house %>%
  filter(!(year %in% c(1935, 1939)))
nrow(not_gd_data)
#> [1] 29
lm(house_change ~ unemp, data = not_gd_data) %>%
  tidy()
#> # A tibble: 2 × 5
#>   term        estimate std.error statistic p.value
#>   <chr>          <dbl>     <dbl>     <dbl>   <dbl>
#> 1 (Intercept)   -7.36      5.16      -1.43   0.165
#> 2 unemp         -0.890     0.835     -1.07   0.296
```

* We can frame this as a hypothesis test:
  - `\(H_0:\ \beta_1 = 0\)`, the true linear model has slope zero
  - `\(H_A:\ \beta_1 \neq 0\)`, the true linear model has slope different from zero; the unemployment rate is predictive of the change in seats after a midterm election
]

--

.pull-right[
* To test the null, we use the following test statistic

$$ T = \frac{\text{estimate} - \text{null}}{SE} = -1.0659.$$

* We can use R to compute a p-value for this (although `lm()` does it already)---we have a `\(t\)` distribution and we are testing the two-sided alternative `\((\beta_1 \neq 0)\)`, so we need to double the left-tail area

```r
2 * pt(-1.0659, df = nrow(not_gd_data) - 2)
#> [1] 0.2959081
```
]

---

### Practice

* IMS 22.12(a,b)
* IMS 22.13(a,b)
* IMS 24.3(a,b)
* Take 3 minutes to think about the problem, discuss with your neighbors
* I'll call on one of you and ask what your thoughts are for how to approach the problem
* Not expecting you to have the right answer! Just want to hear how you approach the problem, then help guide you through thinking about the next steps!