class: center, middle, inverse, title-slide

.title[
# Inference for a single mean
]
.subtitle[
## STA35B: Statistical Data Science 2
]
.author[
### Spencer Frei
Figures taken from [IMS], Cetinkaya-Rundel and Hardin text
]

---

.pull-left[
* We will see how to use the bootstrap to develop confidence intervals for a single mean
* We want to know the average price of cars on a lot by taking a random sample
* Suppose we have a sample of five cars, with prices
  - $18,300; $20,100; $9,600; $10,700; $27,000
  - Sample mean: $17,140
  - Standard deviation:
$$
\begin{align}
s^2 &= \frac{1}{5-1} \Big( (18300-17140)^2 + (20100-17140)^2 \newline \\
&\quad + \dots + (27000-17140)^2 \Big) \newline \\
&\quad\implies s = \$7,170.29
\end{align}
$$
* How much can we expect the sample mean to change from sample to sample?
]
.pull-right[
* If we don't have the ability to sample more from the population, we can use the bootstrap to estimate this

<img src="bootpop1mean.png" width="100%" />
]

---

.pull-left[
* Each bootstrap re-sample gives a new sample that we can calculate a sample mean for
* The variability of the sample means across bootstrap re-samples gives us an estimate of the variability of the sample mean from the population

<img src="bootmeans1mean.png" width="200%" />
]
.pull-right[
* Let's suppose we took 1,000 bootstrap samples, and for each bootstrap sample we calculate the sample mean. It would result in a plot as follows

<img src="lec20-inference-single-mean_files/figure-html/fig-carsbsmean-1.png" width="100%" />
* To develop a bootstrap confidence interval for the population mean at e.g. the 90% confidence level, we can simply calculate the 5th and 95th percentiles of the bootstrapped statistics
]

---

.pull-left[
#### Computing bootstrap means and confidence intervals

* Let's build a function which takes in a sample (a vector of numerics), computes a desired number of bootstrap re-samples, and returns the sample mean of each bootstrap re-sample.
```r
bootstrap_samp_means <- function(observations, num_resample = 500) {
  means <- numeric(num_resample)
  num_observations <- length(observations)
  for (i in 1:num_resample) {
    resampled <- sample(observations, size = num_observations, replace = TRUE)
    means[i] <- mean(resampled, na.rm = TRUE)
  }
  return(means)
}
```
]
.pull-right[
* For the car example, here's how we could do this:

```r
obs <- c(18300, 20100, 9600, 10700, 27000)
bs_means <- bootstrap_samp_means(obs, num_resample = 500)
mean(obs)
#> [1] 17140
bs_means[1:5]
#> [1] 13300 17360 16560 20620 22360
```
* If we want to construct a 90% confidence interval, we want to find the 5th and 95th percentiles

```r
(qs <- quantile(bs_means, probs = c(0.05, 0.95)))
#>    5%   95% 
#> 12000 22140
```
* We are 90% confident that the true average price is between $12,000 and $22,140
* We can also see variability across bootstrap samples:

```r
quantile(bootstrap_samp_means(obs, num_resample = 100), probs = c(0.05, 0.95))
#>    5%   95% 
#> 12360 20980
```
]

---

.pull-left[
#### Bootstrap percentile confidence interval for a standard deviation

* We have thus far focused on trying to understand the **mean** or average of a population
* We are also often interested in the standard deviation `\(\sigma\)` of the variable - this represents the natural variability of the population around its mean `\(\mu\)`
* We have constructed a point estimate using samples via
$$ s^2 = \frac{1}{n-1} \sum_{i=1}^n (x_i - \bar x)^2 $$
* Then we have that `\(s^2\)` is a good approximation for `\(\sigma^2\)`
* If we want to construct a *confidence interval* for `\(\sigma\)`, then we can use the bootstrap idea as before.
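* As a sketch, the same resampling loop with `sd()` in place of `mean()` gives bootstrap standard deviations (the function name `bootstrap_samp_sds` is our own, not from the text):

```r
# bootstrap re-sampling for the sample standard deviation
# (hypothetical helper, mirroring bootstrap_samp_means above)
bootstrap_samp_sds <- function(observations, num_resample = 500) {
  sds <- numeric(num_resample)
  num_observations <- length(observations)
  for (i in 1:num_resample) {
    # re-sample the data with replacement, same size as original
    resampled <- sample(observations, size = num_observations, replace = TRUE)
    sds[i] <- sd(resampled)
  }
  return(sds)
}

# car prices from the earlier slides
obs <- c(18300, 20100, 9600, 10700, 27000)
# 90% bootstrap percentile interval for sigma
quantile(bootstrap_samp_sds(obs), probs = c(0.05, 0.95))
```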
]

--

.pull-right[
* Recall, for constructing a confidence interval for the population mean using the bootstrap:
  - Take bootstrap re-samples of the dataset
  - Compute the **sample mean** for each bootstrap sample
  - Create a histogram of the **sample means** across the bootstrap samples
  - Use the percentiles of the histogram to get confidence intervals
* Same idea for confidence intervals for the population variance `\(\sigma^2\)` / s.d. `\(\sigma\)`:
  - Take bootstrap re-samples of the dataset
  - Compute the **sample variance `\(s^2\)`** for each bootstrap sample
  - Create a histogram of the **sample variances** across the bootstrap samples
  - Use the percentiles of the histogram to get confidence intervals
]

---

.pull-left[
* Results of bootstrap *standard deviations*:

<img src="lec20-inference-single-mean_files/figure-html/fig-carsbssd-1.png" width="100%" />
* Very high variability
* This is due to the very small sample size - only 5 observations
* As we increase the original dataset sample size, the bootstrap gets better
* A precise characterization of how the sample size / number of bootstrap trials affects the accuracy of the bootstrap is beyond this course -- subject of current research
]

--

.pull-right[
#### Mathematical model for a mean

* Recall that we previously used the Central Limit Theorem to show that the variability of the sample proportion is well-described by a normal distribution
* For proportions, the single parameter `\(p\in [0,1]\)` determines the variance `\(p(1-p)\)` (and thus s.d. `\(\sqrt{p(1-p)}\)`)
* Thus the only source of uncertainty is in measuring `\(p\)`
* For numerical variables, there are two sources of uncertainty:
  - The average/mean `\(\mu\)` - the typical value
  - The standard deviation `\(\sigma\)` - the typical variability in the variable
* Thus we need a slightly different approach
]

---

.pull-left[
#### Mathematical model for a mean

* Central Limit Theorem for the sample mean:
  - If we have a sufficiently large number `\(n\)` of independent samples from a population with mean `\(\mu\)`, s.d.
`\(\sigma\)`, then:
  - The sampling distribution of `\(\bar x_n\)` is `\(\approx N(\mu, \sigma / \sqrt n)\)` - approximately normal with mean `\(\mu\)` and standard deviation `\(\sigma/\sqrt n\)`
  - However, we generally do not know the population-level `\(\sigma\)` - we have to estimate `\(\sigma\)`
  - When we have to estimate `\(\sigma\)` by `\(s\)`, the standardized sample mean `\((\bar x_n - \mu)/(s/\sqrt n)\)` is no longer approximately standard normal, but instead follows what's called a "t distribution"
]

--

.pull-right[
* The `\(t\)` distribution is defined in terms of its **degrees of freedom** `df`
* It has a similar bell-shaped curve to the normal, but with "thicker tails" -- it allows for more extreme events to occur than the normal

<img src="lec20-inference-single-mean_files/figure-html/fig-tDistCompareToNormalDist-1.png" width="90%" />
* This shows the `\(t\)` distribution with 1 d.o.f.
* For the normal, there is very little probability beyond 2.5, while for the `\(t\)` distribution there is relatively more
]

---

.pull-left[
* When looking at a dataset with `\(n\)` samples, we will typically use the `\(t\)` distribution with `\(n-1\)` degrees of freedom
* As the degrees of freedom increase, the `\(t\)` distribution gets thinner tails and looks more like a standard normal
* When d.o.f. `\(>30\)`, it is almost indistinguishable from a standard normal
* Intuition: height/thickness represents how likely values are; when the d.o.f. ($n-1$) is small, we are more uncertain and so more extreme values are more likely

<img src="lec20-inference-single-mean_files/figure-html/fig-tDistConvergeToNormalDist-1.png" width="432" />
]
.pull-right[
* Recall in R, to calculate probabilities under the normal, we used
  - `pnorm(val, mean, sd)` to calculate the probability that `\(N(\mu, \sigma)\)` is `\(<= \text{val}\)`
  - `qnorm(quantile, mean, sd)` to calculate the value corresponding to `quantile`
* Similarly, for the `\(t\)`-distribution, we use
  - `pt(val, df)` to calculate the probability that a `\(t\)` distr.
with `df` degrees of freedom is `\(<= \text{val}\)`
  - `qt(quantile, df)` to calculate the value corresponding to `quantile` for a `\(t\)` distr. with `df` degrees of freedom

```r
pnorms
#> # A tibble: 7 × 5
#>     val   pnorm   pt_2   pt_5   pt_30
#>   <dbl>   <dbl>  <dbl>  <dbl>   <dbl>
#> 1    -3 0.00135 0.0477 0.0150 0.00269
#> 2    -2 0.0228  0.0918 0.0510 0.0273 
#> 3    -1 0.159   0.211  0.182  0.163  
#> 4     0 0.5     0.5    0.5    0.5    
#> 5     1 0.841   0.789  0.818  0.837  
#> 6     2 0.977   0.908  0.949  0.973  
#> # ℹ 1 more row
```
]

---

.pull-left[
### `\(t\)` distribution calculations

* Probability that a `\(t\)` distribution with 18 degrees of freedom is less than -1.5?

```r
# use pt() to find probability under the t-distribution
pt(-1.5, df = 18)
#> [1] 0.07547523
```

<img src="lec20-inference-single-mean_files/figure-html/fig-tDistDF18LeftTail2Point10-1.png" width="100%" />
]
.pull-right[
* Probability that a `\(t\)` distribution with 11 degrees of freedom is bigger than 2.5?

```r
1 - pt(2.5, df = 11)
#> [1] 0.01475319
```

<img src="lec20-inference-single-mean_files/figure-html/unnamed-chunk-16-1.png" width="80%" />
]

---

.pull-left[
* Probability that a `\(t\)` distribution with 2 degrees of freedom is more than 3 units away from the mean?

```r
# use pt() to find probability under the t-distribution
pt(-3, df = 2) + (1 - pt(3, df = 2))
#> [1] 0.09546597
```

<img src="lec20-inference-single-mean_files/figure-html/fig-tDistDF23UnitsFromMean-1.png" width="80%" />
* The `\(t\)` distribution is symmetric around 0, so we could also do

```r
2*pt(-3, df = 2)
#> [1] 0.09546597
```
]

--

.pull-right[
* Compare with what happens with the standard normal: the 68-95-99.7 rule says that only 0.3% (=0.003) would be more than 3 units from the mean
* Since the `\(t\)` distribution has fatter tails, it assigns greater probability to extreme values, so we get significantly more area for the `\(t\)` distribution
* As the degrees of freedom increase, this becomes less and less the case
]

---

### Conditions needed for using `\(t\)`-distr.
confidence intervals

* Previously: *if* we have independent samples and a sufficiently large dataset where the population has mean `\(\mu\)` and s.d. `\(\sigma\)`, then `\(\bar x_n\)` is approximately normal, with mean `\(\mu\)` and s.d. `\(\sigma/\sqrt n\)`
* However, we don't know `\(\sigma\)`, and if we want to use the sample estimate `\(s\)` for `\(\sigma\)` then we can't say that `\(\bar x_n\)` is approximately normal; instead we use the `\(t\)` distribution with `\(n-1\)` d.o.f., under certain conditions
* Conditions needed:
  - Independent sample observations (satisfied w/ random sample)
  - Normality of samples - each `\(x_i\)` is from a normal distribution (or approximately). How to check?
    * If `\(n<30\)` and no clear outliers, then OK
    * If `\(n\geq 30\)` and no particularly *extreme* outliers, then OK
* If these assumptions hold, then the confidence interval for the mean is

<img src="t.png" width="25%" />
* `\(t^*_{df}\)` is found the same way as `\(z^*\)`: `\(t^*_{df}\)` is s.t. the proportion of the `\(t\)` distr. with `\(df\)` d.o.f. within a distance `\(t^*_{df}\)` of 0 matches the confidence level of interest

---

.pull-left[
### One sample t-intervals

<img src="t.png" width="50%" />
* `\(t^*_{df}\)` is found the same way as `\(z^*\)`: `\(t^*_{df}\)` is s.t. the proportion of the `\(t\)` distr. with `\(df\)` d.o.f. within a distance `\(t^*_{df}\)` of 0 matches the confidence level
* Recall, if we wanted to find the `\(z^*\)` corresponding to the 95% confidence level, we could calculate
  - `qnorm(1 - 0.05/2) = qnorm(0.975)` `\(= 1.96 \approx 2\)`

<img src="lec20-inference-single-mean_files/figure-html/unnamed-chunk-22-1.png" width="65%" />
]

--

.pull-right[
* The same idea holds for finding `\(t^*_{df}\)`: to get a confidence level of `\(1-\alpha\)`, we use
  - `qt(1 - alpha/2, df = df)`
* E.g. if d.o.f.
= 5 and we want the 95% confidence level,

```r
qt(1 - 0.05/2, df = 5)
#> [1] 2.570582
```

<img src="lec20-inference-single-mean_files/figure-html/unnamed-chunk-24-1.png" width="75%" />
]

---

.pull-left[
#### Example: mercury in tuna

* Let's consider the problem of measuring the amount of mercury in tuna
* High mercury concentrations can be dangerous for the tuna and the humans that eat them
* Suppose we have a random sample of 19 tunas, with measurements as follows; measurements in micrograms mercury / gram tuna

<table class="table table-striped table-condensed" style="width: auto !important; margin-left: auto; margin-right: auto;">
<thead>
<tr>
<th style="text-align:center;"> n </th>
<th style="text-align:center;"> Mean </th>
<th style="text-align:center;"> SD </th>
<th style="text-align:center;"> Min </th>
<th style="text-align:center;"> Max </th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align:center;width: 6em; "> 19 </td>
<td style="text-align:center;width: 6em; "> 4.4 </td>
<td style="text-align:center;width: 6em; "> 2.3 </td>
<td style="text-align:center;width: 6em; "> 1.7 </td>
<td style="text-align:center;width: 6em; "> 9.2 </td>
</tr>
</tbody>
</table>

* Conditions for applying the `\(t\)` distr. satisfied?
  - Independent since random sample
  - `\(n<30\)` and summary stats suggest no clear outliers
  - All good
]

--

.pull-right[
* Let's calculate a 95% confidence interval
* Calculate the standard error:
$$ SE = \frac{s}{\sqrt n} = \frac{2.3}{\sqrt{19}} = 0.528$$
* Since the d.o.f. is `\(n-1=19-1=18\)`, calculate `\(t^*_{18}\)` for the `\(1-0.05\)` confidence level:

```r
qt(1-0.05/2, df = 18)
#> [1] 2.100922
```
* Thus the 95% confidence interval is
$$
\begin{aligned}
\bar{x} \ &\pm\ t^{\star}_{18} \times SE \newline \\
4.4 \ &\pm\ 2.10 \times 0.528 \newline \\
(3.29 \ &, \ 5.51)
\end{aligned}
$$
* We are 95% confident that the average mercury content in tunas is between 3.29 and 5.51 micrograms per gram of tuna - very high.
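* These numbers can be checked directly in R from the summary statistics (a sketch; the variable names are our own):

```r
# summary statistics from the tuna sample
xbar <- 4.4
s    <- 2.3
n    <- 19

se    <- s / sqrt(n)                 # standard error, ~0.528
tstar <- qt(1 - 0.05/2, df = n - 1)  # critical value t*_18, ~2.10

# 95% confidence interval for the mean
round(c(lower = xbar - tstar * se, upper = xbar + tstar * se), 2)
#> lower upper 
#>  3.29  5.51
```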
]

---

.pull-left[
#### One sample `\(t\)`-tests

* We now know how to calculate a `\(t\)`-distribution based confidence interval
* Let's look now at hypothesis tests for the mean
* The **T score** is a ratio of how much the sample mean differs from the hypothesized mean compared to how much the observations vary:
$$ T = \frac{\bar{x} - \mbox{null value}}{s/\sqrt{n}} $$
* When the null hypothesis is true and the conditions are met, T has a t-distribution with `\(df = n - 1\)`. Conditions:
  - Independent observations
  - Large samples and no extreme outliers
]

--

.pull-right[
* Example: testing whether runners in Washington, DC races are getting slower or faster over time
* In 2006, the DC Cherry Blossom Race (10 miles) had an average time of 93.29 minutes
* We will use data from 100 participants from 2017 to determine whether runners are getting faster or slower (vs. the possibility of no change)
  - Null hypothesis `\(H_0\)`: the average 10 mile time is the same in 2006 as in 2017, so `\(\mu = 93.29\)`
  - Alternative hypothesis `\(H_A\)`: the average 10 mile run time is different in 2017; `\(\mu \neq 93.29\)`
* The data looks like:

<img src="lec20-inference-single-mean_files/figure-html/unnamed-chunk-27-1.png" width="60%" />
* Large enough samples, no extreme outliers, so we can proceed
]

---

.pull-left[
* To do the hypothesis test, same procedure as before
  - Normal setting: find the Z-score using the observed value, null value, and standard error, then use the normal distribution to calculate the tail area / p-value
  - This setting: find the T-score using the observed value, null value, and standard error, then use the **t distribution** to calculate the tail area / p-value
* For the sample of 100 runners, the 2017 data had an average of 98.78 and s.d. of 16.59; the average run time in 2006 was 93.29
* First calculate the standard error:
$$ SE = \frac{s}{\sqrt{n}} = \frac{16.59}{\sqrt{100}} = 1.66$$
* T score:
$$ T = \frac{\text{observed}-\text{null}}{SE} = \frac{98.8 - 93.29}{1.66} = 3.32$$
]

--

.pull-right[
* We have `\(n=100\)` observations, so `\(df=100-1=99\)`.
<img src="lec20-inference-single-mean_files/figure-html/unnamed-chunk-28-1.png" width="75%" />
* We can use `pt` to find this; by symmetry, the area to the right of 3.32 is the same as the area to the left of -3.32, so

```r
pt(-3.32, df = 99)
#> [1] 0.0006305322
```
* Double this to get the p-value: 0.00126
* Since the `\(p\)`-value is `\(<0.05\)`, we can reject the null hypothesis at the 95% confidence level
  - We can reject even at the 99% confidence level
  - Thus the average run time in 2017 is different from 2006
]
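* The whole test can be assembled in R from the summary statistics (a sketch with our own variable names; with the raw run times one could instead call `t.test(times, mu = 93.29)`):

```r
# summary statistics for the 2017 Cherry Blossom sample
xbar <- 98.78   # 2017 sample mean
s    <- 16.59   # 2017 sample s.d.
n    <- 100
mu0  <- 93.29   # null value: the 2006 average

se      <- s / sqrt(n)        # standard error, ~1.66
t_score <- (xbar - mu0) / se  # ~3.31 (3.32 with the rounded SE above)

# two-sided p-value from the t distribution with n-1 d.o.f.
p_value <- 2 * pt(-abs(t_score), df = n - 1)
p_value  # ~0.0013, consistent with the slides (small rounding differences)
```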