STA35B: Statistical Data Science 2
]
.author[
### Spencer Frei
]

---

### Factors

.pull-left[
* Factors are used for categorical variables that have a fixed and known set of values
* Useful when we want to display character vectors in non-alphabetical order
* E.g. let's suppose we have a variable that records month:

```r
x1 <- c("Dec", "Apr", "Jan", "Mar")
```

Two issues:

* Only twelve possible months, but if we typed "Jam" instead of "Jan", it would be hard to catch the error.
* Sorting by characters is not meaningful:

```r
sort(x1)
#> [1] "Apr" "Dec" "Jan" "Mar"
```
]

.pull-right[
* We can fix these by treating them as a factor
* Factors require a list of valid **levels**:

```r
month_levels <- c(
  "Jan", "Feb", "Mar", "Apr", "May", "Jun",
  "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"
)
```
* With levels, you can create factors:

```r
y1 <- factor(x1, levels = month_levels)
y1
#> [1] Dec Apr Jan Mar
#> Levels: Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec

sort(y1)
#> [1] Jan Mar Apr Dec
#> Levels: Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
```
]

---

### Factors

.pull-left[
* If you create a factor from a vector, every element of the vector must be one of the levels; otherwise it becomes `NA`:

```r
x2 <- c("Dec", "Apr", "Jam", "Mar")
(y2 <- factor(x2, levels = month_levels))
#> [1] Dec  Apr  <NA> Mar
#> Levels: Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
```
* If you want an error instead when a value is not among the levels, you can use `forcats::fct()`

```r
y2 <- fct(x2, levels = month_levels)
#> Error in `fct()`:
#> ! All values of `x` must appear in `levels` or `na`
#> ℹ Missing level: "Jam"
```
]

.pull-right[
* If you omit levels, `factor()` sets the levels in alphabetical order:

```r
factor(x1)
#> [1] Dec Apr Jan Mar
#> Levels: Apr Dec Jan Mar
```
* `forcats::fct()` orders by first appearance in the vector

```r
fct(x1)
#> [1] Dec Apr Jan Mar
#> Levels: Dec Apr Jan Mar
```
* You can extract levels from a factor using `levels()`:

```r
levels(y2)
#>  [1] "Jan" "Feb" "Mar" "Apr" "May" "Jun" "Jul" "Aug" "Sep" "Oct" "Nov" "Dec"
```
]

---

### Factor: General Social Survey

* We'll look at survey data from the General Social Survey (`forcats::gss_cat`, loaded in tidyverse)

```r
gss_cat
#> # A tibble: 21,483 × 9
#>    year marital         age race  rincome        partyid
#>   <int> <fct>         <int> <fct> <fct>          <fct>
#> 1  2000 Never married    26 White $8000 to 9999  Ind,near rep
#> 2  2000 Divorced         48 White $8000 to 9999  Not str republican
#> 3  2000 Widowed          67 White Not applicable Independent
#> 4  2000 Never married    39 White Not applicable Ind,near rep
#> 5  2000 Divorced         25 White Not applicable Not str democrat
#> 6  2000 Married          25 White $20000 - 24999 Strong democrat
#> # ℹ 21,477 more rows
#> # ℹ 3 more variables: relig <fct>, denom <fct>, tvhours <int>
```
* When factors are stored in a tibble, their levels are not shown directly, but you can see them using `count()`:

.pull-left[
```r
gss_cat %>%
  count(race)
```
]

.pull-right[
```
#> # A tibble: 3 × 2
#>   race      n
#>   <fct> <int>
#> 1 Other  1959
#> 2 Black  3129
#> 3 White 16395
```
]

---

### GSS Data: exploratory data analysis

.pull-left[
* Let's look at the distribution of `rincome` (reported income)

```r
ggplot(gss_cat, aes(x = rincome)) +
  geom_bar()
```
<img src="lec12-visualization-4_files/figure-html/unnamed-chunk-13-1.png" width="432" />
* A little hard to see what's happening on the x-axis!
]

--

.pull-right[
```r
ggplot(gss_cat, aes(y = rincome)) +
  geom_bar()
```
<img src="lec12-visualization-4_files/figure-html/unnamed-chunk-14-1.png" width="432" />
]

---

### GSS Data: exploratory data analysis

.pull-left[
* Does income vary by religion much?

```r
ggplot(gss_cat, aes(x = rincome)) +
  geom_bar()
```
<img src="lec12-visualization-4_files/figure-html/unnamed-chunk-15-1.png" width="432" />
* A little hard to see what's happening on the x-axis!
]

--

.pull-right[
```r
ggplot(gss_cat, aes(y = rincome)) +
  geom_bar()
```
<img src="lec12-visualization-4_files/figure-html/unnamed-chunk-16-1.png" width="432" />
]

---

### Exploratory data analysis

.pull-left[
* What are the most common religions in the survey? What about party ID?

```r
gss_cat %>%
  count(relig) %>%
  arrange(n) %>%
  print(n=Inf)
#> # A tibble: 15 × 2
#>    relig                       n
#>    <fct>                   <int>
#>  1 Don't know                 15
#>  2 Native american            23
#>  3 Other eastern              32
#>  4 Hinduism                   71
#>  5 No answer                  93
#>  6 Orthodox-christian         95
#>  7 Moslem/islam              104
#>  8 Inter-nondenominational   109
#>  9 Buddhism                  147
#> 10 Other                     224
#> 11 Jewish                    388
#> 12 Christian                 689
#> 13 None                     3523
#> 14 Catholic                 5124
#> 15 Protestant              10846
```
]

.pull-right[
```r
gss_cat %>%
  count(partyid) %>%
  arrange(n) %>%
  print(n=Inf)
#> # A tibble: 10 × 2
#>    partyid                n
#>    <fct>              <int>
#>  1 Don't know             1
#>  2 No answer            154
#>  3 Other party          393
#>  4 Ind,near rep        1791
#>  5 Strong republican   2314
#>  6 Ind,near dem        2499
#>  7 Not str republican  3032
#>  8 Strong democrat     3490
#>  9 Not str democrat    3690
#> 10 Independent         4119
```
]

---

.pull-left[
* Let's look at different TV watching habits by marital status. Maybe we can first try doing a `geom_freqpoly()` with color given by marital status.
```r
ggplot(gss_cat, aes(x = tvhours, y = after_stat(density), color = marital)) +
  geom_freqpoly()
```
<img src="lec12-visualization-4_files/figure-html/unnamed-chunk-19-1.png" width="432" />
]

--

.pull-right[
* If we try the same thing for religion, it is trickier to visualize:
* Let's look at TV hours watched per religion.
* Level names can be inconsistent (acronyms vs. full spelling, etc.)
* Key function: `fct_recode()`
]

.pull-left[
```r
gss_cat %>%
  mutate(
    partyid = fct_recode(partyid,
      "Republican, strong"    = "Strong republican",
      "Republican, weak"      = "Not str republican",
      "Independent, near rep" = "Ind,near rep",
      "Independent, near dem" = "Ind,near dem",
      "Democrat, weak"        = "Not str democrat",
      "Democrat, strong"      = "Strong democrat"
    )
  ) %>%
  count(partyid)
#> # A tibble: 10 × 2
#>   partyid                   n
#>   <fct>                 <int>
#> 1 No answer               154
#> 2 Don't know                1
#> 3 Other party             393
#> 4 Republican, strong     2314
#> 5 Republican, weak       3032
#> 6 Independent, near rep  1791
#> # ℹ 4 more rows
```
* Levels not explicitly mentioned will be kept as is.
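
A minimal sketch of that last point, using a small made-up factor (the `party` vector is hypothetical, not taken from `gss_cat`): only the levels named in the call are renamed, and the rest keep their name and position.

```r
library(forcats)

# Hypothetical toy factor with three levels (not from gss_cat)
party <- factor(c("dem", "rep", "ind", "dem"),
                levels = c("dem", "ind", "rep"))

# Rename two of the levels using new = old pairs;
# "ind" is not mentioned, so it is kept unchanged
party2 <- fct_recode(party,
  "Democrat"   = "dem",
  "Republican" = "rep"
)

levels(party2)
#> [1] "Democrat"   "ind"        "Republican"
```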
]

---

### Modifying factor levels

.pull-left[
* To combine groups, assign multiple old levels to the same new level

```r
gss_cat |>
  mutate(
    partyid = fct_recode(partyid,
      "Republican, strong"    = "Strong republican",
      "Republican, weak"      = "Not str republican",
      "Independent, near rep" = "Ind,near rep",
      "Independent, near dem" = "Ind,near dem",
      "Democrat, weak"        = "Not str democrat",
      "Democrat, strong"      = "Strong democrat",
      "Other"                 = "No answer",
      "Other"                 = "Don't know",
      "Other"                 = "Other party"
    )
  )
```
]

.pull-right[
* If you are collapsing many levels, you can use `fct_collapse()`:

```r
gss_cat |>
  mutate(
    partyid = fct_collapse(partyid,
      "other" = c("No answer", "Don't know", "Other party"),
      "rep" = c("Strong republican", "Not str republican"),
      "ind" = c("Ind,near rep", "Independent", "Ind,near dem"),
      "dem" = c("Not str democrat", "Strong democrat")
    )
  ) |>
  count(partyid)
#> # A tibble: 4 × 2
#>   partyid     n
#>   <fct>   <int>
#> 1 other     548
#> 2 rep      5346
#> 3 ind      8409
#> 4 dem      7180
```
]

---

### Examples

* Let's now try to put together some of the ideas we've seen so far
* Look at the `weather` tibble in nycflights13:

```r
library(nycflights13)
weather
#> # A tibble: 26,115 × 15
#>   origin  year month   day  hour  temp  dewp humid wind_dir wind_speed
#>   <chr>  <int> <int> <int> <int> <dbl> <dbl> <dbl>    <dbl>      <dbl>
#> 1 EWR     2013     1     1     1  39.0  26.1  59.4      270      10.4
#> 2 EWR     2013     1     1     2  39.0  27.0  61.6      250       8.06
#> 3 EWR     2013     1     1     3  39.0  28.0  64.4      240      11.5
#> 4 EWR     2013     1     1     4  39.9  28.0  62.2      250      12.7
#> 5 EWR     2013     1     1     5  39.0  28.0  64.4      260      12.7
#> 6 EWR     2013     1     1     6  37.9  28.0  67.2      240      11.5
#> # ℹ 26,109 more rows
#> # ℹ 5 more variables: wind_gust <dbl>, precip <dbl>, pressure <dbl>, …
```
* Let's see if we can come up with a way of visualizing the temperature trends across every year.
* We want to visualize the minimum, maximum, and average temperature per month at each location

---

### Data prep

* We need to compute the min/max/average temp per month in every year in every location

```r
weather2 <- weather %>%
  group_by(year, month, origin) %>%
  summarize(min_temp = min(temp, na.rm=TRUE),
            max_temp = max(temp, na.rm=TRUE),
            avg_temp = mean(temp, na.rm=TRUE)) %>%
  group_by(month, origin) %>%
  summarize(min_temp = mean(min_temp, na.rm=TRUE),
            max_temp = mean(max_temp, na.rm=TRUE),
            avg_temp = mean(avg_temp, na.rm=TRUE),
            n = n())
```
* This tibble only has one year in it, so we don't actually need the second `group_by()` and the code that follows it.
* However, if the tibble had more than one year, this code would compute the average min/max/mean temp per month across years.

---

.pull-left[
* Let's try plotting the monthly temps at Newark (EWR).

```r
weather2 %>%
  filter(origin == 'EWR') %>%
  ggplot(aes(x = month)) +
  geom_line(aes(y = min_temp), color = 'blue') +
  geom_line(aes(y = max_temp), color = 'red') +
  geom_line(aes(y = avg_temp), color = 'black')
```
<img src="lec12-visualization-4_files/figure-html/unnamed-chunk-29-1.png" width="360" />
* This is OK, but not ideal. No legend appears by default, and we have to manually add three `geom_line()` layers.
* A better idea is to reshape the tibble so that we have groups of data: for this we need the data to be *long*
]

.pull-right[
```r
(weather3 <- weather2 %>%
  pivot_longer(cols = c(min_temp, max_temp, avg_temp),
               values_to = "temperature",
               names_to = "measurement")
)
#> # A tibble: 108 × 5
#> # Groups:   month [12]
#>   month origin     n measurement temperature
#>   <int> <chr>  <int> <chr>             <dbl>
#> 1     1 EWR        1 min_temp           10.9
#> 2     1 EWR        1 max_temp           64.4
#> 3     1 EWR        1 avg_temp           35.6
#> 4     1 JFK        1 min_temp           12.0
#> 5     1 JFK        1 max_temp           57.9
#> 6     1 JFK        1 avg_temp           35.4
#> # ℹ 102 more rows
```
]

---

### Visualizing

.pull-left[
* Now we can color by temperature type:

```r
ggplot(weather3 %>% filter(origin == 'EWR'),
       aes(x = month, y = temperature, color = measurement)) +
  geom_line()
```
<img src="lec12-visualization-4_files/figure-html/unnamed-chunk-31-1.png" width="432" />
* Now we'd like a better legend (blue for min, black for avg, red for max) and fixed axis ticks
]

.pull-left[
```r
ggplot(weather3 %>% filter(origin == 'EWR'),
       aes(x = month, y = temperature, color = measurement)) +
  geom_line(linewidth = 2) +
  scale_color_manual(
    values = c(min_temp = "blue", avg_temp = "black", max_temp = "red"),
    labels = c("min_temp" = "minimum", "avg_temp" = "average", "max_temp" = "maximum")) +
  scale_x_continuous(breaks = 1:12, minor_breaks = 1:12) +
  labs(title = "Temperature at EWR per month")
```
<img src="lec12-visualization-4_files/figure-html/unnamed-chunk-32-1.png" width="432" />
]

---

### Visualizing

.pull-left[
* Now let's do the same type of plot but with each airport's min/avg/max temp per month plotted side-by-side.

```r
ggplot(weather3, aes(x = month, y = temperature, color = measurement)) +
  geom_line(linewidth = 2) +
  scale_color_manual(
    values = c(min_temp = "blue", avg_temp = "black", max_temp = "red"),
    labels = c("min_temp" = "minimum", "avg_temp" = "average", "max_temp" = "maximum")) +
  scale_x_continuous(breaks = 1:12, minor_breaks = 1:12) +
  facet_grid(. ~ origin)
```
]

.pull-right[
<img src="lec12-visualization-4_files/figure-html/unnamed-chunk-34-1.png" width="576" />
]

* What happened at JFK in May? Why is it so different from EWR and LGA?

---

.pull-left[
```r
weather %>%
  filter(month == 5, temp < 25)
#> # A tibble: 1 × 15
#>   origin  year month   day  hour  temp  dewp humid wind_dir wind_speed
#>   <chr>  <int> <int> <int> <int> <dbl> <dbl> <dbl>    <dbl>      <dbl>
#> 1 JFK     2013     5     8    22  13.1  12.0  95.3       80       8.06
#> # ℹ 5 more variables: wind_gust <dbl>, precip <dbl>, pressure <dbl>,
#> #   visib <dbl>, time_hour <dttm>
```
* Appears real!
]

.pull-right[
* How does it compare to EWR and LGA?

```r
weather %>%
  filter(month == 5, day == 8) %>%
  filter(between(hour, 21, 23)) %>%
  print(n = Inf)
#> # A tibble: 9 × 15
#>   origin  year month   day  hour  temp  dewp humid wind_dir wind_speed
#>   <chr>  <int> <int> <int> <int> <dbl> <dbl> <dbl>    <dbl>      <dbl>
#> 1 EWR     2013     5     8    21  59    53.1  80.6       80       4.60
#> 2 EWR     2013     5     8    22  59    52.0  77.5       90       3.45
#> 3 EWR     2013     5     8    23  57.9  52.0  80.6        0       0
#> 4 JFK     2013     5     8    21  57.0  48.9  74.3      170      11.5
#> 5 JFK     2013     5     8    22  13.1  12.0  95.3       80       8.06
#> 6 JFK     2013     5     8    23  57.2  53.6  87.7      120       4.60
#> 7 LGA     2013     5     8    21  59    48.9  69.2       NA       5.75
#> 8 LGA     2013     5     8    22  59    51.1  75.0      100       6.90
#> 9 LGA     2013     5     8    23  55.9  51.1  83.7       90       6.90
#> # ℹ 5 more variables: wind_gust <dbl>, precip <dbl>, pressure <dbl>,
#> #   visib <dbl>, time_hour <dttm>
```
* This appears to be the only such reading, so it is definitely an outlier, but we can't be certain why it happens.
]

---

### Midterm studying tips

* Review labs, homeworks (and homework solutions - HW 2 sol up tonight), and practice midterm
* Make sure you understand all of the core functions: `min`, `max`, `mean`, `pmin`, `pmax`, `group_by()`, `summarize()`, `pivot_wider`, `pivot_longer`, regex, joins, etc. Understand how NAs work, what the typical default behavior for NAs is, etc.
* Test-taking strategy: Go through the exam and solve all of the easy questions **first**.
If a question takes more than 2 minutes, skip it and return later.
* Work through examples on the margins / side / back of the exam to make sure you're understanding things correctly.
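
As a quick refresher on the NA behavior mentioned in the tips above, here is a small sketch using base R only. The vectors `x`, `a`, and `b` are made up for illustration: by default the summary functions return `NA` as soon as one input is missing, and `na.rm = TRUE` drops missing values first.

```r
# Hypothetical toy vector with one missing value
x <- c(3, NA, 1)

# By default, summaries propagate NA
mean(x)
#> [1] NA
min(x)
#> [1] NA

# na.rm = TRUE removes missing values before computing
mean(x, na.rm = TRUE)
#> [1] 2
max(x, na.rm = TRUE)
#> [1] 3

# pmin() / pmax() compare vectors element by element,
# whereas min() / max() return a single value
a <- c(1, 5, NA)
b <- c(4, 2, 6)
pmin(a, b)
#> [1]  1  2 NA
pmax(a, b, na.rm = TRUE)
#> [1] 4 5 6
```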