Visualizations IV / Factors

class: center, middle, inverse, title-slide

.title[
# Visualizations IV / Factors
]
.subtitle[
## <br><br> STA35B: Statistical Data Science 2
]
.author[
### Spencer Frei
]

---

### Factors
.pull-left[ 
* Factors are used for categorical variables that have a fixed and known set of values
* Useful when we want to display character vectors in non-alphabetical order
* E.g. let's suppose we have a variable that records month:

```r
x1 <- c("Dec", "Apr", "Jan", "Mar")
```
Two issues:
* There are only twelve possible months, but a character vector does not enforce that: if we typed "Jam" instead of "Jan", the error would be hard to catch.
* Sorting by characters is not meaningful:

```r
sort(x1)
#> [1] "Apr" "Dec" "Jan" "Mar"
```

]

.pull-right[
* We can fix both issues by storing the months as a factor
* Factors require a list of valid **levels**:

```r
month_levels <- c(
  "Jan", "Feb", "Mar", "Apr", "May", "Jun", 
  "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"
)
```
* With levels, you can create factors:

```r
y1 <- factor(x1, levels = month_levels)
y1
#> [1] Dec Apr Jan Mar
#> Levels: Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
sort(y1)
#> [1] Jan Mar Apr Dec
#> Levels: Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
```

]
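
---

### Factors
* A minimal sketch (base R only, not from the original slides) of why sorting a factor follows the level order: a factor stores integer codes that index into its levels. The built-in constant `month.abb` stands in for `month_levels` here.

```r
# month.abb is a built-in constant: "Jan", "Feb", ..., "Dec"
x1 <- c("Dec", "Apr", "Jan", "Mar")
y1 <- factor(x1, levels = month.abb)

# Each value is stored as the position of its level
codes <- as.integer(y1)
codes
#> [1] 12  4  1  3

# Sorting orders by those codes, not by the labels
sorted_labels <- as.character(sort(y1))
sorted_labels
#> [1] "Jan" "Mar" "Apr" "Dec"
```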

---

### Factors
.pull-left[
* When you create a factor from a vector with explicit levels, any value that is not one of the levels is silently converted to `NA`:

```r
x2 <- c("Dec", "Apr", "Jam", "Mar")
(y2 <- factor(x2, levels = month_levels))
#> [1] Dec  Apr  <NA> Mar 
#> Levels: Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
```
* If you would rather get an error for values that are not valid levels, use `forcats::fct()`:

```r
y2 <- fct(x2, levels = month_levels)
#> Error in `fct()`:
#> ! All values of `x` must appear in `levels` or `na`
#> ℹ Missing level: "Jam"
```

]

.pull-right[
* If you omit `levels`, `factor()` takes the levels from the data, in alphabetical order:

```r
factor(x1)
#> [1] Dec Apr Jan Mar
#> Levels: Apr Dec Jan Mar
```
* `forcats::fct()` instead orders the levels by first appearance in the vector:

```r
fct(x1)
#> [1] Dec Apr Jan Mar
#> Levels: Dec Apr Jan Mar
```
* You can extract levels from a factor using `levels()`:

```r
levels(y2)
#>  [1] "Jan" "Feb" "Mar" "Apr" "May" "Jun" "Jul" "Aug" "Sep" "Oct" "Nov" "Dec"
```

]
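
---

### Factors
* A hedged sketch, in base R, of two ways to find values that are not valid levels before they silently become `NA`. The vector `x2` is the one from the slides; `month.abb` again stands in for `month_levels`, and the names `bad_values` and `bad_positions` are illustrative only.

```r
x2 <- c("Dec", "Apr", "Jam", "Mar")

# Values of x2 that are not valid month abbreviations
bad_values <- setdiff(x2, month.abb)
bad_values
#> [1] "Jam"

# Positions that became NA in the factor but were not NA originally
y2 <- factor(x2, levels = month.abb)
bad_positions <- which(is.na(y2) & !is.na(x2))
bad_positions
#> [1] 3
```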

---

### Factor: General Social Survey
* We'll look at survey data from the General Social Survey (`forcats::gss_cat`, loaded in tidyverse)

```r
gss_cat
#> # A tibble: 21,483 × 9
#>    year marital         age race  rincome        partyid           
#>   <int> <fct>         <int> <fct> <fct>          <fct>             
#> 1  2000 Never married    26 White $8000 to 9999  Ind,near rep      
#> 2  2000 Divorced         48 White $8000 to 9999  Not str republican
#> 3  2000 Widowed          67 White Not applicable Independent       
#> 4  2000 Never married    39 White Not applicable Ind,near rep      
#> 5  2000 Divorced         25 White Not applicable Not str democrat  
#> 6  2000 Married          25 White $20000 - 24999 Strong democrat   
#> # ℹ 21,477 more rows
#> # ℹ 3 more variables: relig <fct>, denom <fct>, tvhours <int>
```
* When a factor is stored in a tibble, its levels are not shown directly, but you can see them using `count()`:

.pull-left[

```r
gss_cat %>%
  count(race)
```
]

.pull-right[

```
#> # A tibble: 3 × 2
#>   race      n
#>   <fct> <int>
#> 1 Other  1959
#> 2 Black  3129
#> 3 White 16395
```
]
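
---

### Factors
* `count()` by default lists only the levels that actually occur in the data. A small base R sketch on a toy factor (hypothetical data, not `gss_cat`) shows how to see every allowed level, including unused ones:

```r
# Toy factor: level "c" is allowed but never observed
f <- factor(c("a", "b", "a"), levels = c("a", "b", "c"))

# levels() lists every allowed level, observed or not
levels(f)
#> [1] "a" "b" "c"

# table() keeps the unused level with a count of zero
counts <- table(f)
counts[["c"]]
#> [1] 0

# droplevels() removes levels that never occur
kept <- levels(droplevels(f))
kept
#> [1] "a" "b"
```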

---

### GSS Data: exploratory data analysis
.pull-left[ 
* Let's look at the distribution of `rincome` (reported income)

```r
ggplot(gss_cat, aes(x = rincome)) + geom_bar()
```

<img src="lec12-visualization-4_files/figure-html/unnamed-chunk-13-1.png" width="432" />
* A little hard to see what's happening on the x-axis!
]

.pull-right[

```r
ggplot(gss_cat, aes(y = rincome)) + geom_bar()
```

]

---


### Exploratory data analysis
.pull-left[ 
* What are the most common religions in the survey?  What about party ID?

```r
gss_cat %>% count(relig) %>% arrange(n) %>% print(n=Inf)
#> # A tibble: 15 × 2
#>    relig                       n
#>    <fct>                   <int>
#>  1 Don't know                 15
#>  2 Native american            23
#>  3 Other eastern              32
#>  4 Hinduism                   71
#>  5 No answer                  93
#>  6 Orthodox-christian         95
#>  7 Moslem/islam              104
#>  8 Inter-nondenominational   109
#>  9 Buddhism                  147
#> 10 Other                     224
#> 11 Jewish                    388
#> 12 Christian                 689
#> 13 None                     3523
#> 14 Catholic                 5124
#> 15 Protestant              10846
```

]

.pull-right[

```r
gss_cat %>% count(partyid) %>% arrange(n) %>% print(n=Inf)
#> # A tibble: 10 × 2
#>    partyid                n
#>    <fct>              <int>
#>  1 Don't know             1
#>  2 No answer            154
#>  3 Other party          393
#>  4 Ind,near rep        1791
#>  5 Strong republican   2314
#>  6 Ind,near dem        2499
#>  7 Not str republican  3032
#>  8 Strong democrat     3490
#>  9 Not str democrat    3690
#> 10 Independent         4119
```

]
---

.pull-left[ 
* Let's look at how TV watching habits differ by marital status.  We can first try a `geom_freqpoly()` with color given by marital status.

```r
ggplot(gss_cat,
       aes(x = tvhours, y = after_stat(density), color = marital)) + 
  geom_freqpoly()
```

<img src="lec12-visualization-4_files/figure-html/unnamed-chunk-19-1.png" width="432" />
]

.pull-right[
* If we try the same thing for religion (TV hours watched, with color given by `relig`), it is trickier to visualize:

```r
ggplot(gss_cat,
       aes(x = tvhours, y = after_stat(density), color = relig)) + 
  geom_freqpoly()
```

* There are too many groups to tell apart: we need to reduce the amount of data we're trying to show.
]

---
.pull-left[ 
* To reduce the amount of information, we can summarize the data, e.g. by taking the average TV hours per religion.

```r
relig_summary <- gss_cat |>
  group_by(relig) |>
  summarize(
    tvhours = mean(tvhours, na.rm = TRUE),
    n = n()
  )

ggplot(relig_summary, aes(x = tvhours, y = relig)) + 
  geom_point()
```

<img src="lec12-visualization-4_files/figure-html/unnamed-chunk-21-1.png" width="360" />
* Still hard to read: the religions appear in level order, which is unrelated to `tvhours`, so there's no clear pattern in the plot.

]

.pull-right[
* We can improve the plot by reordering the levels of `relig` using `fct_reorder(.f, .x, .fun)`
* `.f`: factor whose levels to modify
* `.x`: numeric vector used to order the levels
* `.fun`: function to use if there are multiple values of `.x` for a given value of `.f` (default: `median`)

```r
ggplot(relig_summary,
       aes(x = tvhours, y = fct_reorder(relig, tvhours))) +
  geom_point()
```

]
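
---

* A minimal sketch of how `.fun` changes the result of `fct_reorder()` when each level has several values of `.x`. The toy vectors `f` and `x` are hypothetical and chosen so that the median and the minimum give different orders.

```r
library(forcats)

# Toy data: two x values per level of f
f <- factor(c("a", "a", "b", "b", "c", "c"))
x <- c(1, 100, 4, 5, 2, 3)

# Default .fun = median: medians are a = 50.5, b = 4.5, c = 2.5
by_median <- levels(fct_reorder(f, x))
by_median
#> [1] "c" "b" "a"

# .fun = min: minima are a = 1, b = 4, c = 2
by_min <- levels(fct_reorder(f, x, .fun = min))
by_min
#> [1] "a" "c" "b"
```

* In `relig_summary` there is exactly one `tvhours` value per religion, so the choice of `.fun` does not matter there.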

---

### Modifying factor levels

.pull-left[
* We often want to change the values of levels so that plots look better

```r
gss_cat %>% count(partyid)
#> # A tibble: 10 × 2
#>   partyid                n
#>   <fct>              <int>
#> 1 No answer            154
#> 2 Don't know             1
#> 3 Other party          393
#> 4 Strong republican   2314
#> 5 Not str republican  3032
#> 6 Ind,near rep        1791
#> # ℹ 4 more rows
```
* Not ideal for plotting: the labels are inconsistent (abbreviations vs. full spelling, etc.)
* Key function: `fct_recode()`

]

.pull-right[

```r
gss_cat %>%
  mutate(
    partyid = fct_recode(partyid,
      "Republican, strong"    = "Strong republican",
      "Republican, weak"      = "Not str republican",
      "Independent, near rep" = "Ind,near rep",
      "Independent, near dem" = "Ind,near dem",
      "Democrat, weak"        = "Not str democrat",
      "Democrat, strong"      = "Strong democrat")) %>%
  count(partyid)
#> # A tibble: 10 × 2
#>   partyid                   n
#>   <fct>                 <int>
#> 1 No answer               154
#> 2 Don't know                1
#> 3 Other party             393
#> 4 Republican, strong     2314
#> 5 Republican, weak       3032
#> 6 Independent, near rep  1791
#> # ℹ 4 more rows
```
* Levels not explicitly mentioned will be kept as is. 
]
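
---

### Modifying factor levels
* A small sketch of `fct_recode()` on a toy factor (a hypothetical subset of the `partyid` labels), showing that only the label changes: the values, their counts, and the number of levels stay the same.

```r
library(forcats)

# Toy factor built from a few of the partyid labels
party <- factor(c("Ind,near rep", "Strong democrat",
                  "Ind,near rep", "No answer"))

# New label on the left, old label on the right
recoded <- fct_recode(party, "Independent, near rep" = "Ind,near rep")

# The first level has been relabelled in place
levels(recoded)[1]
#> [1] "Independent, near rep"

# Same number of levels, same number of observations with that label
nlevels(recoded)
#> [1] 3
sum(recoded == "Independent, near rep")
#> [1] 2
```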

---

### Modifying factor levels

.pull-left[
* To combine groups, assign multiple old levels to the same new level:

```r
gss_cat |>
  mutate(
    partyid = fct_recode(partyid,
      "Republican, strong"    = "Strong republican",
      "Republican, weak"      = "Not str republican",
      "Independent, near rep" = "Ind,near rep",
      "Independent, near dem" = "Ind,near dem",
      "Democrat, weak"        = "Not str democrat",
      "Democrat, strong"      = "Strong democrat",
      "Other"                 = "No answer",
      "Other"                 = "Don't know",
      "Other"                 = "Other party"
    )
  )
```

]

.pull-right[
* If you are collapsing many levels, you can use `fct_collapse()`:

```r
gss_cat |>
  mutate(
    partyid = fct_collapse(partyid,
      "other" = c("No answer", "Don't know", "Other party"),
      "rep" = c("Strong republican", "Not str republican"),
      "ind" = c("Ind,near rep", "Independent", "Ind,near dem"),
      "dem" = c("Not str democrat", "Strong democrat")
    )
  ) |>
  count(partyid)
#> # A tibble: 4 × 2
#>   partyid     n
#>   <fct>   <int>
#> 1 other     548
#> 2 rep      5346
#> 3 ind      8409
#> 4 dem      7180
```
]
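
---

### Modifying factor levels
* A hedged sketch of `fct_collapse()` on a toy factor (a hypothetical subset of the `partyid` labels): each new level is given as a character vector of old levels, and levels that are not mentioned are kept as is.

```r
library(forcats)

# Toy factor with five distinct labels
party <- factor(c("No answer", "Strong republican", "Not str republican",
                  "Independent", "Strong democrat"))

collapsed <- fct_collapse(party,
  rep = c("Strong republican", "Not str republican"),
  dem = "Strong democrat"
)

# Two old levels were merged into "rep"
sum(collapsed == "rep")
#> [1] 2

# Five levels became four; the unmentioned ones survive unchanged
nlevels(collapsed)
#> [1] 4
"Independent" %in% levels(collapsed)
#> [1] TRUE
```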

---

### Examples
* Let's now try to put together some of the ideas we've seen so far
* Look at the `weather` tibble in nycflights13:

```r
library(nycflights13)
weather
#> # A tibble: 26,115 × 15
#>   origin  year month   day  hour  temp  dewp humid wind_dir wind_speed
#>   <chr>  <int> <int> <int> <int> <dbl> <dbl> <dbl>    <dbl>      <dbl>
#> 1 EWR     2013     1     1     1  39.0  26.1  59.4      270      10.4 
#> 2 EWR     2013     1     1     2  39.0  27.0  61.6      250       8.06
#> 3 EWR     2013     1     1     3  39.0  28.0  64.4      240      11.5 
#> 4 EWR     2013     1     1     4  39.9  28.0  62.2      250      12.7 
#> 5 EWR     2013     1     1     5  39.0  28.0  64.4      260      12.7 
#> 6 EWR     2013     1     1     6  37.9  28.0  67.2      240      11.5 
#> # ℹ 26,109 more rows
#> # ℹ 5 more variables: wind_gust <dbl>, precip <dbl>, pressure <dbl>, …
```
* Let's see if we can come up with a way of visualizing temperature trends over the course of the year.
* We want to visualize the minimum, maximum, and average temperature per month at each location

---

### Data prep
* We need to compute the min/max/average temp per month in every year in every location

```r
weather2 <- weather %>% 
  # Step 1: min/max/mean temp for each year-month-airport combination
  group_by(year, month, origin) %>%
  summarize(min_temp = min(temp, na.rm = TRUE), max_temp = max(temp, na.rm = TRUE),
            avg_temp = mean(temp, na.rm = TRUE)) %>%
  # Step 2: average those monthly statistics across years
  group_by(month, origin) %>%
  summarize(min_temp = mean(min_temp, na.rm = TRUE),
            max_temp = mean(max_temp, na.rm = TRUE),
            avg_temp = mean(avg_temp, na.rm = TRUE), n = n())
```
* This tibble only has one year in it, so we don't actually need the code following the second `group_by()`.
* However, if the tibble contained more than one year, this code would compute the average min/max/mean temp per month across years.
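
---

### Data prep
* A minimal sketch of the two-stage summary on toy data with two years, to show what the second `group_by()` contributes. The tibble `toy` and its values are hypothetical; `.groups = "drop"` is added here only to silence the grouping message and is not in the original code.

```r
library(dplyr)

# Toy data: one airport, January in two different years
toy <- tibble(
  year   = c(2012, 2012, 2013, 2013),
  month  = 1,
  origin = "EWR",
  temp   = c(10, 30, 20, 50)
)

toy_summary <- toy %>%
  # Step 1: per-year extremes (2012: 10 and 30; 2013: 20 and 50)
  group_by(year, month, origin) %>%
  summarize(min_temp = min(temp), max_temp = max(temp),
            .groups = "drop") %>%
  # Step 2: average the per-year extremes across years
  group_by(month, origin) %>%
  summarize(min_temp = mean(min_temp), max_temp = mean(max_temp),
            n = n(), .groups = "drop")

toy_summary$min_temp
#> [1] 15
toy_summary$max_temp
#> [1] 40
toy_summary$n   # number of years that were averaged
#> [1] 2
```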

---

.pull-left[

* Let's try plotting the monthly temps at Newark (EWR).

```r
weather2 %>% filter(origin == 'EWR') %>%
  ggplot(aes(x = month)) + 
  geom_line(aes(y = min_temp), color = 'blue') +
  geom_line(aes(y = max_temp), color = 'red') + 
  geom_line(aes(y = avg_temp), color = 'black')
```

<img src="lec12-visualization-4_files/figure-html/unnamed-chunk-29-1.png" width="360" />
* This is OK, but not ideal: no legend appears by default, and we have to add three `geom_line()` layers manually.
* A better idea is to reshape the tibble so that the three measurements form groups in a single column: for this we need the data to be *long*
]

.pull-right[

```r
(weather3 <- weather2 %>%
  pivot_longer(cols = c(min_temp, max_temp, avg_temp),
               values_to = "temperature",
               names_to = "measurement")
)
#> # A tibble: 108 × 5
#> # Groups:   month [12]
#>   month origin     n measurement temperature
#>   <int> <chr>  <int> <chr>             <dbl>
#> 1     1 EWR        1 min_temp           10.9
#> 2     1 EWR        1 max_temp           64.4
#> 3     1 EWR        1 avg_temp           35.6
#> 4     1 JFK        1 min_temp           12.0
#> 5     1 JFK        1 max_temp           57.9
#> 6     1 JFK        1 avg_temp           35.4
#> # ℹ 102 more rows
```

]

---
### Visualizing
.pull-left[
* Now we can map color to the measurement type:

```r
ggplot(weather3 %>% filter(origin == 'EWR'),
       aes(x = month, y = temperature, color = measurement)) +
  geom_line()
```

<img src="lec12-visualization-4_files/figure-html/unnamed-chunk-31-1.png" width="432" />
* Now we'd like a better legend (blue for min, black for avg, red for max) and fixed axis ticks
]

.pull-right[

```r
ggplot(weather3 %>% filter(origin == 'EWR'),
       aes(x = month, y = temperature, color = measurement)) +
  geom_line(linewidth = 2) + 
  scale_color_manual(
    values = c(min_temp = "blue", avg_temp = "black", max_temp = "red"),
    labels = c("min_temp" = "minimum", "avg_temp" = "average", "max_temp" = "maximum")) + 
  scale_x_continuous(breaks = 1:12, minor_breaks = 1:12) + 
  labs(title = "Temperature at EWR per month")
```

<img src="lec12-visualization-4_files/figure-html/unnamed-chunk-32-1.png" width="432" />
]
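
---

### Visualizing
* A hedged sketch tying this back to factors: `measurement` is a character column, so the legend lists its values alphabetically (avg, max, min). Converting it to a factor with explicit levels gives the order min, avg, max. The data frame `wide` and its values are hypothetical; only the column names follow `weather2`.

```r
library(tidyr)
library(ggplot2)

# Toy wide data with the same measurement columns as weather2
wide <- data.frame(
  month    = c(1, 2),
  min_temp = c(10, 15),
  max_temp = c(60, 65),
  avg_temp = c(35, 40)
)

long <- pivot_longer(wide, cols = c(min_temp, max_temp, avg_temp),
                     names_to = "measurement", values_to = "temperature")

# Explicit levels fix the legend order: min, avg, max
long$measurement <- factor(long$measurement,
                           levels = c("min_temp", "avg_temp", "max_temp"))

p <- ggplot(long, aes(x = month, y = temperature, color = measurement)) +
  geom_line() +
  scale_color_manual(
    values = c(min_temp = "blue", avg_temp = "black", max_temp = "red"))
```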

---
### Visualizing
.pull-left[ 
* Now let's do the same type of plot but with each airport's min/avg/max temp per month plotted side-by-side.

```r
ggplot(weather3,
       aes(x = month, y = temperature, color = measurement)) +
  geom_line(linewidth = 2) + 
  scale_color_manual(
    values = c(min_temp = "blue", avg_temp = "black", max_temp = "red"),
    labels = c("min_temp" = "minimum", "avg_temp" = "average", "max_temp" = "maximum")) + 
  scale_x_continuous(breaks = 1:12, minor_breaks = 1:12) + 
  facet_grid(. ~ origin)
```
]

.pull-right[
<img src="lec12-visualization-4_files/figure-html/unnamed-chunk-34-1.png" width="576" />

]

* What happened at JFK in May?  Why is it so different from EWR and LGA?

---

.pull-left[

```r
weather %>%
  filter(month == 5, temp < 25)
#> # A tibble: 1 × 15
#>   origin  year month   day  hour  temp  dewp humid wind_dir wind_speed
#>   <chr>  <int> <int> <int> <int> <dbl> <dbl> <dbl>    <dbl>      <dbl>
#> 1 JFK     2013     5     8    22  13.1  12.0  95.3       80       8.06
#> # ℹ 5 more variables: wind_gust <dbl>, precip <dbl>, pressure <dbl>,
#> #   visib <dbl>, time_hour <dttm>
```
* The reading really is in the data: 13.1°F at JFK on May 8 at hour 22.

]

.pull-right[

* How does it compare to EWR and LGA?

```r
weather %>% filter(month == 5, day == 8) %>%
  filter(between(hour, 21, 23)) %>% print(n = Inf)
#> # A tibble: 9 × 15
#>   origin  year month   day  hour  temp  dewp humid wind_dir wind_speed
#>   <chr>  <int> <int> <int> <int> <dbl> <dbl> <dbl>    <dbl>      <dbl>
#> 1 EWR     2013     5     8    21  59    53.1  80.6       80       4.60
#> 2 EWR     2013     5     8    22  59    52.0  77.5       90       3.45
#> 3 EWR     2013     5     8    23  57.9  52.0  80.6        0       0   
#> 4 JFK     2013     5     8    21  57.0  48.9  74.3      170      11.5 
#> 5 JFK     2013     5     8    22  13.1  12.0  95.3       80       8.06
#> 6 JFK     2013     5     8    23  57.2  53.6  87.7      120       4.60
#> 7 LGA     2013     5     8    21  59    48.9  69.2       NA       5.75
#> 8 LGA     2013     5     8    22  59    51.1  75.0      100       6.90
#> 9 LGA     2013     5     8    23  55.9  51.1  83.7       90       6.90
#> # ℹ 5 more variables: wind_gust <dbl>, precip <dbl>, pressure <dbl>,
#> #   visib <dbl>, time_hour <dttm>
```

* This is the only such reading, and the neighboring hours and airports are all between about 56 and 59°F, so it is definitely an outlier; we can't be certain why it happened.

]
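
---

* One possible way to flag such a reading automatically, sketched in base R: mark a value as suspicious when it differs sharply from both its neighbors in time. The three temperatures are the JFK readings from the previous slide; the 20-degree threshold is an arbitrary assumption for illustration.

```r
# JFK, May 8, hours 21-23 (from the output on the previous slide)
temps <- c(57.0, 13.1, 57.2)

# Absolute change relative to the previous and the next reading
jump_in  <- abs(temps - c(NA, head(temps, -1)))
jump_out <- abs(temps - c(tail(temps, -1), NA))

# Suspicious if it jumps by more than 20 degrees in both directions;
# which() ignores the NA comparisons at the two ends
suspicious <- which(jump_in > 20 & jump_out > 20)
suspicious
#> [1] 2
```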

---

### Midterm studying tips
* Review labs, homeworks (and homework solutions - HW 2 sol up tonight), and practice midterm
* Make sure you understand all of the core functions: `min()`, `max()`, `mean()`, `pmin()`, `pmax()`, `group_by()`, `summarize()`, `pivot_wider()`, `pivot_longer()`, regex, joins, etc.   Understand how `NA`s work and what the typical default behavior for `NA`s is.
* Test-taking strategy: go through the exam and solve all of the easy questions **first**.  If a question takes more than 2 minutes, skip it and return later.
* Work through examples on the margins / side / back of the exam to make sure you're understanding things correctly.
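
---

### Midterm studying tips
* A short base R refresher (toy vectors, not from the labs or homeworks) on two of the points above: the default behavior of `NA` and the difference between `min()`/`max()` and `pmin()`/`pmax()`.

```r
x <- c(3, NA, 7)

# Summaries return NA if any input is NA ...
mean(x)
#> [1] NA
# ... unless missing values are removed explicitly
mean(x, na.rm = TRUE)
#> [1] 5

# Comparisons with NA are NA; use is.na() to test for missingness
x > 4
#> [1] FALSE    NA  TRUE
is.na(x)
#> [1] FALSE  TRUE FALSE

# min() reduces everything to one value; pmin()/pmax() work element-wise
a <- c(1, 5, 2)
b <- c(3, 2, 2)
min(a, b)
#> [1] 1
pmin(a, b)
#> [1] 1 2 2
pmax(a, b)
#> [1] 3 5 2
```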