class: center, middle, inverse, title-slide .title[ # Inference for two proportions ] .subtitle[ ##
STA35B: Statistical Data Science 2 ] .author[ ### Spencer Frei
Figures taken from [IMS], Cetinkaya-Rundel and Hardin text ]

---

.pull-left[
* We will now use similar ideas from previous lectures to derive confidence intervals and hypothesis tests for differences in population proportions from two groups: `\(p_1 - p_2\)`.
* For samples of group 1 and group 2, we use the point estimate `\(\hat p_1 - \hat p_2\)` to estimate `\(p_1-p_2\)`
* To estimate variability / perform inference, we'll use three methods:
  - Randomization test
  - Bootstrap for confidence intervals
  - Normal approximation
]

.pull-right[
* Example dataset: `openintro::cpr`
* Experiment consists of two treatments of patients who underwent CPR for a heart attack and were then admitted to a hospital
* Patients were randomly assigned to either receive a blood thinner (treatment) or not (control)
* Outcome: whether the patient survived for >= 24 hours post heart attack

```r
str(cpr)
#> tibble [90 × 2] (S3: tbl_df/tbl/data.frame)
#>  $ group  : Factor w/ 2 levels "control","treatment": 1 1 1 1 1 1 1 1 1 1 ...
#>  $ outcome: Factor w/ 2 levels "died","survived": 2 2 2 2 2 2 2 2 2 2 ...
```

<table class="table table-striped table-condensed" style="width: auto !important; margin-left: auto; margin-right: auto;"> <caption></caption> <thead> <tr> <th style="text-align:left;"> Group </th> <th style="text-align:center;"> Died </th> <th style="text-align:center;"> Survived </th> <th style="text-align:center;"> Total </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;width: 7em; "> Control </td> <td style="text-align:center;width: 7em; "> 39 </td> <td style="text-align:center;width: 7em; "> 11 </td> <td style="text-align:center;width: 7em; "> 50 </td> </tr> <tr> <td style="text-align:left;width: 7em; "> Treatment </td> <td style="text-align:center;width: 7em; "> 26 </td> <td style="text-align:center;width: 7em; "> 14 </td> <td style="text-align:center;width: 7em; "> 40 </td> </tr> <tr> <td style="text-align:left;width: 7em; "> Total </td> <td style="text-align:center;width: 7em; "> 65 </td> <td style="text-align:center;width: 7em; "> 25 </td> <td style="text-align:center;width: 7em; "> 90 </td> </tr> </tbody> </table>
]
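---

#### Computing the observed proportions in R

* The survival proportions in the table can be tallied directly from the data. A minimal `dplyr` sketch (assuming `cpr` is loaded from the openintro package; the summary column names below are ours):

```r
library(dplyr)
library(openintro)

# survival proportion within each group
cpr %>%
  group_by(group) %>%
  summarize(n = n(), p_hat = mean(outcome == "survived"))
# matches the table: control 11/50 = 0.22, treatment 14/40 = 0.35
```

* The difference `\(\hat p_T - \hat p_C = 0.35 - 0.22 = 0.13\)` is the point estimate we will assess with the randomization test, the bootstrap, and the normal approximation.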
---

.pull-left[
<table class="table table-striped table-condensed" style="width: auto !important; margin-left: auto; margin-right: auto;"> <caption></caption> <thead> <tr> <th style="text-align:left;"> Group </th> <th style="text-align:center;"> Died </th> <th style="text-align:center;"> Survived </th> <th style="text-align:center;"> Total </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;width: 7em; "> Control </td> <td style="text-align:center;width: 7em; "> 39 </td> <td style="text-align:center;width: 7em; "> 11 </td> <td style="text-align:center;width: 7em; "> 50 </td> </tr> <tr> <td style="text-align:left;width: 7em; "> Treatment </td> <td style="text-align:center;width: 7em; "> 26 </td> <td style="text-align:center;width: 7em; "> 14 </td> <td style="text-align:center;width: 7em; "> 40 </td> </tr> <tr> <td style="text-align:left;width: 7em; "> Total </td> <td style="text-align:center;width: 7em; "> 65 </td> <td style="text-align:center;width: 7em; "> 25 </td> <td style="text-align:center;width: 7em; "> 90 </td> </tr> </tbody> </table>

* Proportion of treatment patients who survive: `\(\hat p_T = \frac{14}{40} = 0.35\)`
* Proportion of control patients who survive: `\(\hat p_C = \frac{11}{50} = 0.22\)`
* Difference `\(\hat p_T - \hat p_C = 0.13\)` - how likely would a difference this large occur due to random chance alone?
* We can formulate a hypothesis test, and a randomization procedure to examine the hypothesis.
  - `\(H_0\)`: receiving blood thinners after CPR is independent of survival, `\(p_1-p_2=0\)`.
  - `\(H_A\)`: receiving blood thinners after CPR increases the chance of survival, `\(p_1-p_2>0\)`
* Since the assignment to treatment/control was random initially, if we reject the null, we can say the difference was due to the usage of blood thinners
]

--

.pull-right[
* We keep the same number of people who died / survived, then look at what happens if we randomly assign treatment or control labels to each patient
* Then calculate proportions within each simulated treatment/control group.
* We plot the difference in 100 randomizations below.

<img src="lec19-inference-two-proportions_files/figure-html/fig-cpr-rand-dot-plot-1.png" width="432" />

* 12/100 simulations had a difference of >= 13%
* If the null were true, there would be about a 12% chance of a difference at least this large; this is fairly likely
* Cannot reject the null - but this just means we don't have enough evidence to suggest blood thinners are effective
]

---

.pull-left[
#### Bootstrap confidence interval for difference in proportions

* Last slide: randomization distribution to understand the distribution of `\(\hat p_1 - \hat p_2\)` under the null hypothesis `\(H_0: p_1-p_2=0\)`.
* We previously applied bootstrap sampling to a single population
]

--

.pull-right[
* We now have samples from two different groups, and we will re-sample from each group to calculate bootstrap proportions `\(\hat p_{1,boot}\)`, `\(\hat p_{2,boot}\)` from each group.
* We use this to tally up different estimates of the proportions in each bootstrap sample, which gives us an estimate of how much variability there is in the sample proportions (and thus the difference in sample proportions)
]

<img src="boot2propresamps.png" width="110%" />

---

<img src="boot2prop1.png" width="100%" />

---

.pull-left[
#### Confidence intervals from bootstrap samples

<img src="lec19-inference-two-proportions_files/figure-html/fig-bootCPR1000-1.png" width="432" />

* We have already discussed one method for constructing a confidence interval: suppose we saved the bootstrap data as

```r
head(cpr_boot_dist %>% as_tibble())
#> # A tibble: 6 × 2
#>   replicate   stat
#>       <int>  <dbl>
#> 1         1 0.0847
#> 2         2 0.102
#> 3         3 0.167
#> 4         4 0.0979
#> 5         5 0.0618
#> 6         6 0.169
```
]

--

.pull-right[
* For a 95% confidence interval, we want to find the 2.5th and 97.5th percentiles/quantiles:

```r
quantile(cpr_boot_dist$stat, 0.975)
#>     97.5%
#> 0.3242973
quantile(cpr_boot_dist$stat, 0.025)
#>        2.5%
#> -0.06631288
```

* 95% conf interval: (-0.066, 0.324)
  - 95% confidence that in the population, the true difference in probability of survival for people receiving blood thinners post CPR is between 0.066 lower and 0.324 higher
* Similar calculation for a 90% conf interval:

```r
quantile(cpr_boot_dist$stat, 0.95)
#>       95%
#> 0.2840669
quantile(cpr_boot_dist$stat, 0.05)
#>          5%
#> -0.03240614
```

* 90% conf interval: (-0.032, 0.284)
  - Even at 90% confidence, still includes 0
]
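---

#### Generating the bootstrap distribution in R

* A bootstrap distribution like `cpr_boot_dist` can be built by resampling the data with replacement and recomputing the difference in proportions for each resample. A minimal sketch assuming the infer package (the exact code behind these slides may differ):

```r
library(dplyr)
library(infer)
library(openintro)

set.seed(35)  # arbitrary seed, for reproducibility

cpr_boot_dist <- cpr %>%
  specify(outcome ~ group, success = "survived") %>%   # survival proportion by group
  generate(reps = 1000, type = "bootstrap") %>%        # 1000 bootstrap resamples
  calculate(stat = "diff in props",
            order = c("treatment", "control"))         # treatment minus control

# percentile bootstrap confidence intervals
quantile(cpr_boot_dist$stat, c(0.025, 0.975))  # 95% interval
quantile(cpr_boot_dist$stat, c(0.05, 0.95))    # 90% interval
```

* Each row of the result is one bootstrap replicate's `\(\hat p_{T,boot} - \hat p_{C,boot}\)`; the quantiles of `stat` give the percentile intervals on the previous slide (up to simulation noise).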
---

.pull-left[
#### Mathematical model for difference in proportions

* Like with `\(\hat p\)` for a single proportion, the difference in proportions `\(\hat p_1 - \hat p_2\)` can be modeled with a normal distribution under certain assumptions.
* Assumptions are similar to those from before, but more stringent:
  - **Independence**: data *within each group* and *between groups* are independent; generally satisfied in a randomized experiment
  - **Success-failure condition**: at least 10 successes and 10 failures *within each group*
* When these are satisfied, the standard error of `\(\hat p_1 - \hat p_2\)` is
`$$SE(\hat{p}_1 - \hat{p}_2) = \sqrt{\frac{p_1(1-p_1)}{n_1} + \frac{p_2(1-p_2)}{n_2}}$$`
where `\(p_1\)` and `\(p_2\)` represent the population proportions, and `\(n_1\)` and `\(n_2\)` represent the sample sizes.
]

--

.pull-right[
* In most cases, the standard error is approximated using the observed data:
`$$SE(\hat{p}_1 - \hat{p}_2) = \sqrt{\frac{\hat{p}_1(1-\hat{p}_1)}{n_1} + \frac{\hat{p}_2(1-\hat{p}_2)}{n_2}}$$`
where `\(\hat{p}_1\)` and `\(\hat{p}_2\)` represent the observed sample proportions
* NOTE: `\(\sqrt{a+b} \neq \sqrt{a} + \sqrt{b}\)`! Think of `\(\sqrt{2 + 2} = \sqrt{4}=2\)`, but `\(\sqrt{2} + \sqrt{2} = 2 \sqrt{2} \approx 2.82\)`.
* **Margin of error** defined in terms of the standard error:
`$$z^\star \times \sqrt{\frac{\hat{p}_1(1-\hat{p}_1)}{n_1} + \frac{\hat{p}_2(1-\hat{p}_2)}{n_2}}$$`
where `\(z^\star\)` is calculated from a specified percentile of the normal distribution (e.g. 1.96 for 95%)
]

---

.pull-left[
* **Margin of error** defined in terms of the standard error:
`$$z^\star \times \sqrt{\frac{\hat{p}_1(1-\hat{p}_1)}{n_1} + \frac{\hat{p}_2(1-\hat{p}_2)}{n_2}}$$`
  - where `\(z^\star\)` is calculated from a specified percentile of the normal distribution (e.g. 1.96 for 95%)
  - this describes the variability of the difference in proportions (when the assumptions for the normal model hold)
* Can then use the standard approach for defining a confidence interval:
$$
`\begin{aligned}
\text{point estimate} \ &\pm \ z^{\star} \ \times \ SE \\
(\hat{p}_1 - \hat{p}_2) \ &\pm \ z^{\star} \times \sqrt{\frac{\hat{p}_1(1-\hat{p}_1)}{n_1} + \frac{\hat{p}_2(1-\hat{p}_2)}{n_2}}
\end{aligned}`
$$
* Let's return to the CPR dataset
]

--

.pull-right[
<table class="table table-striped table-condensed" style="width: auto !important; margin-left: auto; margin-right: auto;"> <caption></caption> <thead> <tr> <th style="text-align:left;"> Group </th> <th style="text-align:center;"> Died </th> <th style="text-align:center;"> Survived </th> <th style="text-align:center;"> Total </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;width: 7em; "> Control </td> <td style="text-align:center;width: 7em; "> 39 </td> <td style="text-align:center;width: 7em; "> 11 </td> <td style="text-align:center;width: 7em; "> 50 </td> </tr> <tr> <td style="text-align:left;width: 7em; "> Treatment </td> <td style="text-align:center;width: 7em; "> 26 </td> <td style="text-align:center;width: 7em; "> 14 </td> <td style="text-align:center;width: 7em; "> 40 </td> </tr> <tr> <td style="text-align:left;width: 7em; "> Total </td> <td style="text-align:center;width: 7em; "> 65 </td> <td style="text-align:center;width: 7em; "> 25 </td> <td style="text-align:center;width: 7em; "> 90 </td> </tr> </tbody> </table>

* Are conditions met?
  - Random assignment of treatment/control, so independence is satisfied
  - Success-failure condition in each group: the counts 11, 14, 39, 26 are all at least 10, so the condition is satisfied
  - all conditions satisfied
* Let's create a 90% confidence interval for the difference in survival rates, writing `\(p_T\)` for the survival rate in the treatment group and `\(p_C\)` for the survival rate in the control group:
`$$\hat{p}_{T} - \hat{p}_{C} = \frac{14}{40} - \frac{11}{50} = 0.35 - 0.22 = 0.13$$`
]
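---

#### The same interval in R

* As a quick check of the arithmetic on the next slide, the point estimate, standard error, and 90% interval can be computed directly in R (a sketch; the variable names are ours, the counts come from the table):

```r
# counts from the CPR table
p_T <- 14 / 40; n_T <- 40   # treatment survival proportion
p_C <- 11 / 50; n_C <- 50   # control survival proportion

pt_est <- p_T - p_C                                            # 0.13
se     <- sqrt(p_T * (1 - p_T) / n_T + p_C * (1 - p_C) / n_C)  # ~0.095

z_star <- qnorm(0.95)            # ~1.64 for a 90% interval
pt_est + c(-1, 1) * z_star * se  # ~(-0.027, 0.287)
```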
---

.pull-left[
#### Normal approximation confidence interval

`$$\hat{p}_{T} - \hat{p}_{C} = \frac{14}{40} - \frac{11}{50} = 0.35 - 0.22 = 0.13$$`

* Formula for the SE:
`$$SE \approx \sqrt{\frac{0.35 (1 - 0.35)}{40} + \frac{0.22 (1 - 0.22)}{50}} = 0.095$$`
* For a 90% confidence interval, need to find `\(z^*\)` for this:

```r
qnorm(0.05)
#> [1] -1.644854
```

* `\(z^* = 1.64\)`, so
$$
`\begin{aligned}
\text{point estimate} \ &\pm \ z^{\star} \ \times \ SE \\
0.13 \ &\pm \ 1.64 \ \times \ 0.095 \\
(-0.027 \ &, \ 0.287)
\end{aligned}`
$$
]

--

.pull-right[
* We are 90% confident that individuals receiving blood thinners have between a 2.7% lower chance of survival and a 28.7% greater chance of survival
* Since 0% is in the interval, we cannot conclude that blood thinners have an effect on survival after CPR
* Using the bootstrap confidence interval estimate, we found a 90% confidence interval of (-0.032, 0.284)
* Not exactly the same - the bootstrap uses simulations while the interval above uses a mathematical approximation - but roughly similar ideas.
* Similarly, using the randomization test, we found around a 12% likelihood that the difference we observed in the original dataset (13%) would occur due to random chance if blood thinner treatment is independent of survival.
]

---

.pull-left[
### Hypothesis test for difference in proportions

* For a hypothesis test on the *difference in proportions* (null `\(H_0\)`: `\(p_1-p_2=0\)`), we use similar ideas from before: calculate the SE and check technical conditions like independence and success-failure
* One difference: success-failure is defined in terms of the **pooled proportion**,
$$ \hat p_{pool} = \frac{\text{number of successes}}{\text{number of cases}} = \frac{\hat p_1 n_1 + \hat p_2 n_2}{n_1 + n_2}. $$
Note that
$$ \hat p_1 = \frac{\text{number of successes in sample 1}}{n_1}.$$
* From this we can formulate a test statistic for assessing the difference in two proportions - a Z score
]

--

.pull-right[
* Standard error: approximation for the standard deviation of the sampling distribution for `\(\hat p_1 - \hat p_2\)`

<img src="se_pool.png" width="80%" />

* General form for the Z-score:
$$
`\begin{align*}
Z &= \frac{\text{point estimate} - \text{null value}}{SE} \\
&= \frac{\hat p_1 - \hat p_2 - 0}{ \sqrt{\hat p_{pool}(1-\hat p_{pool}) \left( \frac{1}{n_1} + \frac{1}{n_2}\right) }}
\end{align*}`
$$
* When the null hypothesis is true and the following conditions hold, `\(Z\)` has an approximately standard normal distribution:
  - Independent observations
  - Sufficiently large samples: `\(n_1 \hat p_{pool} \geq 10\)`, `\(n_1 (1-\hat p_{pool}) \geq 10\)`, `\(n_2 \hat p_{pool} \geq 10\)`, `\(n_2 (1-\hat p_{pool}) \geq 10\)`
]
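---

#### Pooled Z test for the CPR data in R

* Here is a sketch of the pooled two-proportion Z test applied to the CPR data (variable names are ours; the counts come from the table shown earlier):

```r
# survived ("successes") and group sizes from the CPR table
surv <- c(treatment = 14, control = 11)
n    <- c(treatment = 40, control = 50)

p_hat  <- surv / n               # 0.35 and 0.22
p_pool <- sum(surv) / sum(n)     # 25/90, about 0.278

se_pool <- sqrt(p_pool * (1 - p_pool) * (1 / n[1] + 1 / n[2]))
z <- (p_hat[1] - p_hat[2] - 0) / se_pool   # about 1.37

1 - pnorm(z)   # one-sided p-value, about 0.09
```

* This is in the same ballpark as the roughly 12% estimate from the randomization test: either way, not enough evidence to reject the null at the 5% level. Base R's `prop.test()` carries out an equivalent test (it reports a chi-squared statistic, which equals `\(Z^2\)` when `correct = FALSE`).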
---

.pull-left[
#### Example: mammogram study

* We'll look at data on whether getting mammograms as part of regular breast cancer exams is effective at reducing death from breast cancer
  - Control: non-mammogram breast cancer exam
  - Treatment: mammograms included in breast cancer exam
* ~90,000 people were randomly assigned to the control/treatment groups (roughly 45,000 in each)
* Data: `mammogram` from the openintro package

<table class="table table-striped table-condensed" style="width: auto !important; margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="empty-cells: hide;border-bottom:hidden;" colspan="1"></th> <th style="border-bottom:hidden;padding-bottom:0; padding-left:3px;padding-right:3px;text-align: center; " colspan="2"><div style="border-bottom: 1px solid #ddd; padding-bottom: 5px; ">Death from breast cancer?</div></th> </tr> <tr> <th style="text-align:left;"> Treatment </th> <th style="text-align:right;"> Yes </th> <th style="text-align:right;"> No </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> control </td> <td style="text-align:right;"> 505 </td> <td style="text-align:right;"> 44,405 </td> </tr> <tr> <td style="text-align:left;"> mammogram </td> <td style="text-align:right;"> 500 </td> <td style="text-align:right;"> 44,425 </td> </tr> </tbody> </table>
]

--

.pull-right[
* We want to test whether mammograms are effective
* Formulate hypotheses:
  - `\(H_0\)`: mammograms do not affect survival likelihood
  - `\(H_A\)`: mammograms improve likelihood of survival
* Need to check conditions for being able to use the Z-score:
  - Independence: random treatment assignment; good
  - Success-failure: need to calculate `\(\hat p_{pool}\)`,

<img src="breast_cancer.png" width="80%" />

* This is our best estimate of `\(p_1\)` and `\(p_2\)` *if the null hypothesis is true* that `\(p_1 = p_2\)`; we will use this to compute the standard error
]

---

.pull-left[
<table class="table table-striped table-condensed" style="width: auto !important; margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="empty-cells: hide;border-bottom:hidden;" colspan="1"></th> <th style="border-bottom:hidden;padding-bottom:0; padding-left:3px;padding-right:3px;text-align: center; " colspan="2"><div style="border-bottom: 1px solid #ddd; padding-bottom: 5px; ">Death from breast cancer?</div></th> </tr> <tr> <th style="text-align:left;"> Treatment </th> <th style="text-align:right;"> Yes </th> <th style="text-align:right;"> No </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> control </td> <td style="text-align:right;"> 505 </td> <td style="text-align:right;"> 44,405 </td> </tr> <tr> <td style="text-align:left;"> mammogram </td> <td style="text-align:right;"> 500 </td> <td style="text-align:right;"> 44,425 </td> </tr> </tbody> </table>

* Calculated before: `\(\hat p_{pool} = 0.0112\)`
* Now to check the success-failure conditions:
$$
`\begin{align*}
\hat{p}_{\textit{pool}} \times n_{MGM} &= 0.0112 \times \text{44,925} = 503\\
(1 - \hat{p}_{\textit{pool}}) \times n_{MGM} &= 0.9888 \times \text{44,925} = \text{44,422} \\
\hat{p}_{\textit{pool}} \times n_{C} &= 0.0112 \times \text{44,910} = 503\\
(1 - \hat{p}_{\textit{pool}}) \times n_{C} &= 0.9888 \times \text{44,910} = \text{44,407}
\end{align*}`
$$
* All are at least 10, so this (plus independence) means we can model the difference of proportions using a normal distribution, and we know that the Z score is approximately standard normal
]

.pull-right[
* To carry out the hypothesis test, we need the point estimate as well as the standard error. The breast cancer death rate in the mammogram group was 0.012% less than in the control group:
`$$\hat{p}_{M} - \hat{p}_{C}= \frac{500}{500 + 44425} - \frac{505}{505 + 44405} = -0.00012$$`
Next, the standard error is calculated *using the pooled proportion* `\(\hat{p}_{\textit{pool}}\)`:
`$$SE = \sqrt{\frac{\hat{p}_{\textit{pool}}(1-\hat{p}_{\textit{pool}})}{n_{MGM}} + \frac{\hat{p}_{\textit{pool}}(1-\hat{p}_{\textit{pool}})}{n_{C}}}= 0.00070$$`
Just like in past tests, we first compute a test statistic and draw a picture:
`$$Z = \frac{\text{point est} - \text{null val}}{SE} = \frac{-0.00012 - 0}{0.00070} = -0.17$$`
.pull-left[
<img src="lec19-inference-two-proportions_files/figure-html/unnamed-chunk-17-1.png" width="100%" />
]
.pull-right[
* Lower tail area: 0.4325, via `pnorm(-0.17)`
]
]
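---

#### The mammogram test in R

* The same test statistic and p-value can be computed in R (a sketch; the counts come from the table, the variable names are ours):

```r
# deaths from breast cancer ("successes" for this calculation) and group sizes
deaths <- c(mammogram = 500, control = 505)
n      <- c(mammogram = 500 + 44425, control = 505 + 44405)

p_hat  <- deaths / n
p_pool <- sum(deaths) / sum(n)   # about 0.0112

se <- sqrt(p_pool * (1 - p_pool) * (1 / n[1] + 1 / n[2]))  # about 0.00070
z  <- (p_hat[1] - p_hat[2]) / se   # about -0.16; the slides' -0.17 uses rounded intermediate values

2 * pnorm(-abs(z))   # two-sided p-value, about 0.87 (0.865 with the rounded Z of -0.17)
```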
---

<img src="lec19-inference-two-proportions_files/figure-html/unnamed-chunk-18-1.png" width="75%" />

* Z score: -0.17
* Lower tail area: 0.4325, via `pnorm(-0.17)`
* Double to get the p-value: 0.8650
* Can we conclude mammograms have no benefits or do not cause harm?
  - We do not "accept" the null hypothesis
  - We do not reject the null given the evidence - if mammograms help or hurt, the data suggest the effect is not too large

---

.pull-left[
#### Midterm prep: `ggplot`, `dplyr` and functions

* Know and understand the core ggplot functions (and the dplyr basics: `mutate`, `group_by`, `summarize`, ...)
  - `ggplot`, `aes()`, `geom_point()`, `geom_smooth()`, `geom_bar()`, `geom_histogram()`, `geom_boxplot()`, `geom_density()`, `geom_freqpoly()`
  - `facet_wrap` / `facet_grid`
  - `labs`, `guides`, `theme`
* factors, how to relevel them, how to use these in conjunction with ggplot
* functions: general syntax, how default values are used, how to use them, embracing (for use with dplyr), `across()`
]

.pull-right[
#### Linear regression

* Slope-intercept / point-slope equations for lines
* Residuals, residual plots, correlation, R^2 (coefficient of determination), SST/SSE
* Understanding outputs of `lm()` in R, translating these into predictions given `\(x\)` values
* Understanding how outliers affect the least squares line
* Plotting data from residuals and vice-versa
* Multiple linear regression; categorical variables with 2+ categories; adjusted R^2; model selection with stepwise selection

#### Inference

* Permutation/randomization tests, `sample()`
* Bootstrap samples, bootstrap statistics, bootstrap confidence intervals
* Z score / normal distribution approximation: `qnorm()`, `pnorm()`, `quantile()`, Z table
* Hypothesis tests, standard error, margin of error
* All of the above for proportions and for differences of proportions
]