class: center, middle, inverse, title-slide

.title[
# Inference II
]

.subtitle[
## STA35B: Statistical Data Science 2
]

.author[
### Spencer Frei
Figures taken from [IMS], Cetinkaya-Rundel and Hardin text
]

---

.pull-left[
### Inference with mathematical models

* We've talked about how to do bootstrapping and randomization tests
* We'll now talk about how to use mathematical models to think about inference
* In both randomization and bootstrapping, we formed **sampling distributions**: the distribution of all possible values of a *sample statistic* across samples from a population
* Describes how much sample statistics vary from one sample to another
* Can only be visualized after using computational methods / mathematical models; a single dataset's observations do not suffice
]

.pull-right[
* In the four examples we described in the last lectures, we ran 10,000 simulations under the null hypothesis
* We sampled 10,000 versions of the sample proportion under the null hypothesis
* *Sampling distribution* of the sample proportion under the null hypothesis, i.e. the **null distribution**

<img src="lec17-inference-2_files/figure-html/unnamed-chunk-2-1.png" width="432" />
]

---

.pull-left[
#### The central limit theorem and normal distributions

* Whenever we take enough *independent* samples from a population, the sample proportion/mean will approximately follow a **normal distribution**: a bell-shaped curve that looks like

<img src="lec17-inference-2_files/figure-html/fig-er6895997-1.png" width="432" />

* `\(\mu\)` is the mean (or proportion) for the population; `\(\sigma\)` is the standard deviation
* This idea is called the "central limit theorem"
]

.pull-right[
* Values of `\(\mu\)`, `\(\sigma\)` may change from plot to plot, but the shape remains the same
* The sample proportion `\(\hat p\)` will look like a normal distribution centered at the population proportion `\(p\)` provided:
  - The observations in the sample are *independent*: samples are truly randomly sampled from a population
  - The sample size is *large enough*: we can make this exact, but generally need at least 10 observed examples in each class (treatment/control)
* The same ideas hold for the sample mean `\(\bar x\)`: centered at the population mean `\(\mu\)`
]

---

.pull-left[
#### Normal distribution model

* Symmetric, unimodal, bell-shaped.
* Exact values of center / spread can change.
* The mean `\(\mu\)` shifts the curve left or right; the standard deviation `\(\sigma\)` squishes or stretches it

<img src="lec17-inference-2_files/figure-html/fig-twoSampleNormals-1.png" alt="Two normal curves. The first one has a mean at zero with a standard deviation of one and is called the standard normal distribution. The second one has a mean at 19 with a standard deviation of four." width="100%" />
<img src="lec17-inference-2_files/figure-html/fig-twoSampleNormalsStacked-1.png" width="100%" />
]

.pull-right[
* We denote the normal distribution with mean `\(k\)` and standard deviation `\(r\)` by `\(N(\mu = k, \sigma = r)\)`.
  - If `\(k=0\)`, `\(r=1\)`, i.e. `\(N(\mu=0, \sigma=1)\)`, we call this the "standard normal distribution"
  - The figures on the left show `\(N(\mu=0, \sigma=1)\)` and `\(N(\mu=19, \sigma=4)\)`.
* The way to quantify how "unusual" or "extreme" an observation is, is to look at how many standard deviations `\(\sigma\)` it lies away from the mean `\(\mu\)`.
* This is called the `\(Z\)` score:
$$ Z := \frac{ x - \mu}{\sigma}. $$
* E.g. which is more extreme: a value of 27 for `\(N(\mu=19, \sigma=4)\)` or a value of 3 for `\(N(\mu=0, \sigma=1)\)`?
  - 27 for `\(N(\mu=19, \sigma=4)\)`: Z-score of `\(\frac{27-19}{4} = 2\)`.
  - 3 for `\(N(\mu=0, \sigma=1)\)`: Z-score of `\(\frac{3-0}{1} = 3\)`. This is more extreme.
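* We can check this arithmetic in R with a small helper (`z_score()` here is a hypothetical function for illustration, not from the text):

```r
# hypothetical helper: Z score of an observation x under N(mu, sigma)
z_score <- function(x, mu, sigma) (x - mu) / sigma
z_score(27, mu = 19, sigma = 4)
#> [1] 2
z_score(3, mu = 0, sigma = 1)
#> [1] 3
```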
]

---

.pull-left[
#### Z-score examples

* SAT scores are approximately normally distributed with mean 1500 and standard deviation 300; ACT scores are approximately normal with mean 21 and standard deviation 5.
* If Maria scored 1800 on the SAT and Abed scored 24 on the ACT, who performed better?
  - Z-score for Maria: (1800 - 1500) / 300 = 1
  - Z-score for Abed: (24 - 21) / 5 = 0.6
  - Maria did better on the exam
* What score corresponds to a z-score of 2.5 in each of the SAT and ACT?
  - SAT: 1500 + 2.5 * 300 = 2250
  - ACT: 21 + 2.5 * 5 = 33.5
]

.pull-right[
#### Normal probability calculations

* Suppose that Kavitha scored 1800 on her SAT. How can she find what **percentile** this score is? We know this corresponds to a Z-score of 1.

<img src="lec17-inference-2_files/figure-html/fig-satBelow1800-1.png" width="85%" />

* The total area under a normal curve is always equal to 1.
* The proportion of people who scored below Kavitha is equal to the area of the shaded region, which we can calculate to be 0.8413 using a computer
* R and other statistical software can calculate probabilities/percentiles as a function of the z-score.
]

---

.pull-left[
#### Normal probability calculations in R

* The `pnorm()` function in base R gives the percentile associated with a cutoff in the normal curve.
  - E.g. Kavitha got 1800 on the SAT, which has mean 1500 and sd 300:

```r
pnorm(1800, mean = 1500, sd = 300)
#> [1] 0.8413447
```

* The `normTail()` function in **openintro** draws the associated normal curve:

```r
normTail(m = 1500, s = 300, L = 1800)
```

<img src="lec17-inference-2_files/figure-html/unnamed-chunk-8-1.png" width="60%" style="display: block; margin: auto;" />
]

.pull-right[
* We can also do the reverse calculation: identify the Z score associated with a percentile.
* `qnorm()`: identifies the **quantile** for a given percentile

```r
qnorm(0.841, mean = 1500, sd = 300)
#> [1] 1799.573
```

```r
normTail(m = 0, s = 1, L = 0.841)
```

<img src="lec17-inference-2_files/figure-html/unnamed-chunk-10-1.png" width="60%" style="display: block; margin: auto;" />

* **quantile** and **percentile** are inverse operations:

```r
qnorm(pnorm(3, mean = 0, sd = 1), mean = 0, sd = 1)
#> [1] 3
pnorm(qnorm(0.99, mean = 5, sd = 3), mean = 5, sd = 3)
#> [1] 0.99
```
]

---

.pull-left[
#### Normal probability calculations

* Again consider SAT scores: normal, mean `\(\mu=1500\)`, s.d. `\(\sigma=300\)`.
* Suppose we take a random SAT taker. What is the probability they score `\(\geq 1630\)` on the SAT?
* Draw the normal curve and visualize the problem:

<img src="lec17-inference-2_files/figure-html/fig-subtractingArea-1.png" width="432" />
]

.pull-right[
* Calculate the Z score of 1630:
$$ Z = \frac{x - \mu}{\sigma} = \frac{1630 - 1500}{300} = 0.43.$$
* Then we want to calculate the **p**ercentile:

```r
pnorm(0.43, mean = 0, sd = 1)
#> [1] 0.6664022
pnorm(1630, mean = 1500, sd = 300)
#> [1] 0.6676137
```

* Thus a proportion of 0.6664 of people have a Z score **lower** than 0.43
* To compute the area **above**, we take one minus this (total area = 1); see the one-step check below
  - Total proportion: `\(1-0.6664 = 0.3336\)`.
  - Probability of scoring at least 1630 is 0.3336, or 33.36%.
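* As a quick check, the upper tail can also be computed in one step via `pnorm()`'s `lower.tail` argument (the small difference from 0.3336 comes from rounding the Z score to 0.43):

```r
# upper-tail probability directly: P(score >= 1630)
pnorm(1630, mean = 1500, sd = 300, lower.tail = FALSE)
#> [1] 0.3323863
```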
]

---

.pull-left[
* Suppose Ed scored 1400 on his SAT. What percentile is this?
* Draw a picture:

<img src="lec17-inference-2_files/figure-html/fig-edward-percentile-1.png" width="60%" />

* Calculate the Z score:
$$ Z = \frac{x - \mu}{\sigma} = \frac{1400 - 1500}{300} = -0.333.$$
* `pnorm()` returns the percentile

```r
pnorm(-0.3334, mean = 0, sd = 1)
#> [1] 0.3694162
pnorm(1400, mean = 1500, sd = 300)
#> [1] 0.3694413
```

* Ed did better than ~37% of SAT takers
]

.pull-right[
* Suppose the height of men is approx. normal with average 70" and standard deviation 3.3"
* If Kamron is 5'7" and Adrian is 6'4", what percentiles are their heights? Draw a picture and use `pnorm()` to calculate the percentages.
  - 5'7" = 67"; 6'4" = 76"

<img src="lec17-inference-2_files/figure-html/fig-kamron-adrian-percentiles-1.png" width="50%" /><img src="lec17-inference-2_files/figure-html/fig-kamron-adrian-percentiles-2.png" width="50%" />

```r
pnorm(67, mean = 70, sd = 3.3)
#> [1] 0.1816511
pnorm(76, mean = 70, sd = 3.3)
#> [1] 0.9654818
```
]

---

.pull-left[
* Let's now calculate the 40th percentile for height
* Mean: 70", s.d.: 3.3"
* Always draw a picture first:

```r
normTail(70, 3.3, L = qnorm(0.4, 70, 3.3), col = IMSCOL["blue", "full"])
text(67, 0.03, "40%\n(0.40)", cex = 1, col = IMSCOL["black", "full"])
```

<img src="lec17-inference-2_files/figure-html/unnamed-chunk-18-1.png" width="60%" />

* Z score associated with the 40th percentile:

```r
qnorm(0.4, mean = 0, sd = 1)
#> [1] -0.2533471
```
]

.pull-right[
* With the Z-score, mean, and s.d., we can calculate the height:
$$-0.253 = Z = \frac{x-\mu}{\sigma} = \frac{x - 70}{3.3} $$
$$ \implies x - 70 = 3.3 \times -0.253 $$
$$ \implies x = 70 - 3.3\times 0.253 = 69.16 $$
* 69.16" is approximately 5'9"
]

---

.pull-left[
### Quantifying the variability of statistics

* Most of the statistics we have seen (sample proportion, sample mean, difference in two sample proportions/means, slope of a linear model fit to data) are **approximately normal** when the observations are independent and there are enough samples.
* It is thus very useful to have an intuition for how much variability there is within a few standard deviations of the mean in a normal distribution.
* **68 - 95 - 99.7 rule**: pictorially,

<img src="lec17-inference-2_files/figure-html/unnamed-chunk-20-1.png" width="432" />
]

.pull-right[
* 68% of the data lies within 1 s.d. of the mean
* 95% lies within 2 s.d.'s
* 99.7% lies within 3 s.d.'s
* Only 0.3% lies more than 3 s.d.'s away from the mean.
* We can confirm this with `pnorm()` in R:

```r
pnorm(1, mean = 0, sd = 1) - pnorm(-1, mean = 0, sd = 1)
#> [1] 0.6826895
pnorm(2, mean = 0, sd = 1) - pnorm(-2, mean = 0, sd = 1)
#> [1] 0.9544997
pnorm(3, mean = 0, sd = 1) - pnorm(-3, mean = 0, sd = 1)
#> [1] 0.9973002
2 * pnorm(-3, mean = 0, sd = 1)
#> [1] 0.002699796
```
]

---

.pull-left[
#### Z score tables

* In addition to using `pnorm()`, people use "Z score tables"
* It's important to be able to read and interpret these - imagine if all computers disappeared :)
* Example: what is the percentile corresponding to a Z score of -2.33? (checked with `pnorm()` below)
  - Draw! (see chalkboard)
  - Look for -2.3 on the table
  - Go to the column of 0.03
  - Read out value: 0.0099, i.e. 0.99%
  - Sanity check with 68 / 95 / 99.7 rule:
    * 5% of data > 2 s.d.'s from mean; 2.5% below -2 s.d.'s (Z of -2) of mean
    * 0.3% of data > 3 s.d.'s from mean; 0.15% below -3 s.d.'s (Z of -3) of mean
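* We can double-check the table lookup with `pnorm()`:

```r
# percentile for a Z score of -2.33 (compare with the table entry)
pnorm(-2.33, mean = 0, sd = 1)
```

* This returns approximately 0.0099, agreeing with the table.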
]

.pull-right[
<img src="zscores.png" width="411" />
]

---

.pull-left[
#### Z score tables

* What is the percentile corresponding to a Z score of 1.37? Draw! (see chalkboard)
  - By symmetry, the percentile for a Z score of 1.37 is one minus the percentile for a Z score of -1.37 (draw!)
  - Look for -1.3 on the table
  - Go to the column of 0.07
  - Read out value: 0.0853
  - 8.53% percentile for Z = -1.37
  - 1 - 0.0853 = 0.9147 --> 91.47 percentile for Z of 1.37
  - Sanity check with 68 / 95 / 99.7 rule:
    * 68% of data within 1 s.d. of mean; 34% between mean and 1 s.d. above mean; thus 84% (=50+34) are less than 1 s.d. above mean.
    * 95% of data within 2 s.d. of mean; 47.5% between mean and 2 s.d. above mean; thus 97.5% (=50+47.5) are less than 2 s.d. above mean.
]

.pull-right[
<img src="zscores.png" width="411" />
]

---

.pull-left[
#### Standard error

* Whenever we estimate proportions/means, these estimates vary from sample to sample
* This variability is quantified using the **standard error**: it corresponds to the standard deviation of the statistic
* The actual variability of the statistic in the population is unknown; we use data to estimate the standard error (just as we use data to estimate the unknown population proportion/mean)
* We typically estimate the standard error using the central limit theorem; we will see how to calculate this in future lectures
]

.pull-right[
#### Margin of error

* You may also see the term "margin of error": it describes how far away sampled statistics typically are from their mean
* Margin of error = `\(z^* \times \mathsf{StdError}\)`, where `\(z^*\)` is a cutoff value from the normal distribution
* E.g. if we have `\(z^* = 1.96 \approx 2\)`, then the margin of error corresponds to the variability associated with 95% of sampled statistics
]

---

.pull-left[
#### Opportunity cost case study

* Recall the study where students are reminded about being able to save $15 if they don't spend it on a video game right now
* We estimated that the difference in proportions (buy vs. not buy when given the reminder) was 0.20.
* We then used a randomization test to look at the variability in the difference of proportions under the *null hypothesis*: the difference in proportions does NOT depend on being presented with the reminder

<img src="lec17-inference-2_files/figure-html/fig-OpportunityCostDiffs-w-normal-1.png" width="432" />
]

.pull-right[
<img src="lec17-inference-2_files/figure-html/OpportunityCostDiffs_normal_only-1.png" width="60%" />

* The Z score in a hypothesis test is given by substituting the standard error for the standard deviation:
`$$Z = \frac{\text{observed difference} - \text{null value}}{SE}$$`
* Here: the observed difference is 0.20, and let's assume we know the standard error is 0.078 (methods to calculate: to come)
* `$$Z = \frac{\text{observed difference} - 0}{SE} = \frac{0.20 - 0}{0.078} = 2.56$$`

```r
1 - pnorm(2.56, mean = 0, sd = 1)
#> [1] 0.005233608
```
]

---

.pull-left[
#### Z score tables

* What is the percentile corresponding to a Z score of 2.56? Draw! (see chalkboard, and the `pnorm()` check below)
  - By symmetry, the percentile for a Z score of 2.56 is one minus the percentile for a Z score of -2.56 (draw!)
  - Look for -2.5 on the table
  - Go to the column of 0.06
  - Read out value: 0.0052
  - 0.52% percentile for Z = -2.56
  - 1 - 0.0052 = 0.9948 --> 99.48 percentile for Z of 2.56
  - Sanity check with 68 / 95 / 99.7 rule:
    * 95% of data within 2 s.d. of mean; 47.5% between mean and 2 s.d. above mean; thus 97.5% (=50+47.5) are less than 2 s.d. above mean.
    * 99.7% of data within 3 s.d. of mean; 49.85% between mean and 3 s.d. above mean; thus 99.85% (=50+49.85) are less than 3 s.d. above mean.
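* `pnorm()` confirms the table lookup (compare with the `1 - pnorm(2.56)` computation on the previous slide):

```r
# percentile for Z = -2.56 ...
pnorm(-2.56, mean = 0, sd = 1)
#> [1] 0.005233608
# ... and for Z = 2.56, by symmetry
pnorm(2.56, mean = 0, sd = 1)
#> [1] 0.9947664
```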
]

.pull-right[
<img src="zscores.png" width="411" />
]

---

#### Drawbacks of the central limit theorem / normal approximation

* Under certain conditions, we can be guaranteed that the sample mean and sample proportion will be approximately normal, with appropriate mean `\(\mu\)` and s.d. `\(\sigma\)`
* We require:
  - Independent samples (no correlation between them!)
  - A large number of samples (>= 10 per treatment for proportions)
* We sometimes cannot be certain samples are independent
* The normal approximation assigns a positive chance to EVERY value - this is not ideal if you are talking about things like time or other variables with constraints (e.g., time is always >= 0); see the small sketch below
* Developing statistics which incorporate *constraints* on the variables is more challenging, and requires more mathematical work
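* A small illustration of the constraint issue (the numbers here are hypothetical, not from any dataset in the text):

```r
# suppose waiting times are modeled as N(mu = 2, sigma = 1.5) minutes
# the normal model assigns positive probability to impossible negative times
pnorm(0, mean = 2, sd = 1.5)
# roughly 0.09: about a 9% chance of a "negative" waiting time under this model
```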