class: center, middle, inverse, title-slide

.title[
# Transformations of vectors I
]
.subtitle[
## STA35B: Statistical Data Science 2
]
.author[
### Spencer Frei
]

---

Main data types we use in R:

* Logical/boolean (`TRUE`, `FALSE`)
* Numeric (13.8)
* Character/string ("hello")
* Missing (`NA`)

For logical vectors, every element takes one of 3 values: `TRUE`, `FALSE`, `NA`

We'll investigate how to manipulate and transform data to get logicals, and how to use logicals.

```r
library(tidyverse)
library(nycflights13)
flights
```

```
# A tibble: 336,776 × 19
   year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
  <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
1  2013     1     1      517            515         2      830            819
2  2013     1     1      533            529         4      850            830
3  2013     1     1      542            540         2      923            850
4  2013     1     1      544            545        -1     1004           1022
...
```

---

### Logical comparators

.pull-left[
Three basic logical operators that we will use over and over:

* AND (denoted `&` in R): operation between two logicals
* OR (denoted `|` in R): operation between two logicals
* NOT (denoted `!` in R): operation on a single logical.

Truth table for AND:

| A | B | A AND B |
|-------|-------|---------|
| `TRUE` | `TRUE` | `TRUE` |
| `TRUE` | `FALSE` | `FALSE` |
| `FALSE` | `TRUE` | `FALSE` |
| `FALSE` | `FALSE` | `FALSE` |
]

.pull-right[
Truth table for OR:

| A | B | A OR B |
|-------|-------|---------|
| `TRUE` | `TRUE` | `TRUE` |
| `TRUE` | `FALSE` | `TRUE` |
| `FALSE` | `TRUE` | `TRUE` |
| `FALSE` | `FALSE` | `FALSE` |

Truth table for NOT:

| A | NOT A |
|-------|-------|
| `TRUE` | `FALSE` |
| `FALSE` | `TRUE` |
]

---

### Comparisons

A common way to create a logical vector: numeric comparison with `<`, `!=`, etc.

We have implicitly been using this when doing filtering.

```r
flights$dep_time > 600
```

```
 [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[13] FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE
...
```

Applying a logical operator (`&`, `|`) to two logical vectors works element by element.
```r
x <- c(TRUE, FALSE, TRUE)
y <- c(FALSE, FALSE, TRUE)
(x & y) # x AND y
```

```
[1] FALSE FALSE  TRUE
```

```r
(x | y) # x OR y
```

```
[1]  TRUE FALSE  TRUE
```

---

### Comparisons

So when we use multiple comparisons in `filter()`, we are building a new vector of logicals.

We only keep those rows where the vector is `TRUE`.

```r
flights %>%
  filter(dep_time > 600 & dep_time < 2000 & abs(arr_delay) < 20)
```

```
# A tibble: 172,286 × 19
   year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
  <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
1  2013     1     1      601            600         1      844            850
2  2013     1     1      602            610        -8      812            820
3  2013     1     1      602            605        -3      821            805
...
```

---

## Comparisons

`filter()` and `mutate()` can be used in conjunction:

```r
flights %>%
  mutate(
    daytime = dep_time > 600 & dep_time < 2000,
    approx_ontime = abs(arr_delay) < 20,
  ) %>%
  filter(daytime & approx_ontime)
```

```
# A tibble: 172,286 × 21
   year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
  <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
1  2013     1     1      601            600         1      844            850
2  2013     1     1      602            610        -8      812            820
3  2013     1     1      602            605        -3      821            805
...
```

---

## Floating point comparisons

.pull-left[
Using `==` with floating-point numbers can cause problems. This is because numbers are stored with finite precision: R's doubles carry 53 binary digits (roughly 15 to 17 significant decimal digits), so most fractions are rounded slightly.

```r
x <- c( (1/49) * 49, sqrt(2)^2)
x == c(1,2)
```

```
[1] FALSE FALSE
```

What's going on? Let's look at a more precise representation in R using `print(x, digits=)`:

```r
print(x, digits=10)
```

```
[1] 1 2
```

```r
print(x, digits=20)
```

```
[1] 0.99999999999999988898 2.00000000000000044409
```
]

--

.pull-right[
`dplyr::near()` helps with this: it ignores small differences.

```r
near(x, c(1,2))
```

```
[1] TRUE TRUE
```
]

---

## Missing values

.pull-left[
Almost any operation involving an `NA` returns `NA`.

```r
(NA > 5)
```

```
[1] NA
```

```r
(10 == NA)
```

```
[1] NA
```
]

.pull-right[
What about `NA==NA`?

```r
NA==NA
```

```
[1] NA
```

Why?
Think of this example:

```r
# Suppose we don't know Spencer's age
age_spencer <- NA
# And we also don't know Zelda's age
age_zelda <- NA
# Then we shouldn't know whether Spencer and
# Zelda are the same age
age_spencer == age_zelda
```

```
[1] NA
```
]

---

### Missing values

A useful function for dealing with `NA`: `is.na()`

`is.na(x)` works with any type of vector and returns TRUE for missing values and FALSE for everything else:

```r
( is.na(c(TRUE, NA, FALSE)) )
```

```
[1] FALSE  TRUE FALSE
```

```r
( is.na(c(1, NA, 3)) )
```

```
[1] FALSE  TRUE FALSE
```

```r
( is.na(c("a", NA, "b")) )
```

```
[1] FALSE  TRUE FALSE
```

---

### Missing values

Since `is.na()` returns logicals, it can be used in `filter()`:

```r
flights %>%
  filter(is.na(dep_time))
```

```
# A tibble: 8,255 × 19
   year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
  <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
1  2013     1     1       NA           1630        NA       NA           1815
2  2013     1     1       NA           1935        NA       NA           2240
...
```

It can also help identify where `NA`s come from, e.g., why are there `NA`s in `air_time`?

```r
flights %>%
  select(time_hour, flight, dep_time, arr_time, air_time) %>%
  filter(is.na(air_time) & !is.na(arr_time) & !is.na(dep_time))
```

```
# A tibble: 717 × 5
  time_hour           flight dep_time arr_time air_time
  <dttm>               <int>    <int>    <int>    <dbl>
1 2013-01-01 15:00:00   4525     1525     1934       NA
2 2013-01-01 14:00:00   3806     1528     2002       NA
...
```

---

### Missing values

Let's examine how `dep_time`, `dep_delay`, and `sched_dep_time` are related.

```r
flights %>%
  mutate(missing_dep_time = is.na(dep_time),
         missing_dep_delay = is.na(dep_delay),
         missing_sched_dep_time = is.na(sched_dep_time)) %>%
  count(missing_dep_time, missing_dep_delay, missing_sched_dep_time)
```

```
# A tibble: 2 × 4
  missing_dep_time missing_dep_delay missing_sched_dep_time      n
  <lgl>            <lgl>             <lgl>                   <int>
1 FALSE            FALSE             FALSE                  328521
2 TRUE             TRUE              FALSE                    8255
```

* `dep_delay` is missing exactly when `dep_time` is missing; `sched_dep_time` is never missing.
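---

### Missing values

The same kind of missingness check can be sketched on a small made-up data frame. This is only an illustration: `toy` and its values are hypothetical (not rows taken from `flights`), and it uses base R only so that it runs without loading any packages.

```r
# Toy data: hypothetical values, NOT taken from nycflights13
toy <- data.frame(
  dep_time       = c(517, NA, 542, NA),
  sched_dep_time = c(515, 1630, 540, 1935),
  dep_delay      = c(2, NA, 2, NA)
)

# is.na() on a data frame gives a logical matrix;
# colSums() then counts the TRUEs (missing values) in each column
n_missing <- colSums(is.na(toy))
n_missing

# Is dep_delay missing in exactly the same rows as dep_time?
same_pattern <- all(is.na(toy$dep_delay) == is.na(toy$dep_time))
same_pattern
```

Here `n_missing` reports how many `NA`s each column has, and `same_pattern` is `TRUE` only if the two columns are missing in exactly the same rows.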
---

### Missing values

* Is it the case that `dep_delay` = `dep_time` - `sched_dep_time`?

```r
flights %>%
  mutate(dep_delay_manual = dep_time - sched_dep_time,
         manual_matches_given = near(dep_delay_manual, dep_delay)) %>%
  count(manual_matches_given)
```

```
# A tibble: 3 × 2
  manual_matches_given      n
  <lgl>                 <int>
1 FALSE                 99777
2 TRUE                 228744
3 NA                     8255
```

Quite weird: the manual calculation matches for most flights but is wrong for many others.

---

### Missing values

Let's inspect further. What do those observations where manual doesn't match given look like?

```r
flights %>%
  mutate(manual_delay = dep_time - sched_dep_time,
         manual_matches_given = near(manual_delay, dep_delay)) %>%
  filter(!manual_matches_given) %>%
  select(time_hour, flight, dep_time, sched_dep_time, dep_delay, manual_delay)
```

```
# A tibble: 99,777 × 6
  time_hour           flight dep_time sched_dep_time dep_delay manual_delay
  <dttm>               <int>    <int>          <int>     <dbl>        <int>
1 2013-01-01 06:00:00    461      554            600        -6          -46
2 2013-01-01 06:00:00    507      555            600        -5          -45
3 2013-01-01 06:00:00   5708      557            600        -3          -43
4 2013-01-01 06:00:00     79      557            600        -3          -43
5 2013-01-01 06:00:00    301      558            600        -2          -42
...
```

The problem comes from the fact that R is treating `dep_time` and `sched_dep_time` as integers, not times! The value 554 means 5:54, so subtracting integers doesn't account for the fact that 5:54 is only 6 minutes away from 6:00, rather than 46.

We will later see how to properly treat dates and times.

---

### Boolean algebra

.pull-left[
* Recall the basic Boolean algebra operators, AND and OR
* There is a third one, XOR, which we won't use that often
* Can combine AND/OR with NOT and cover any combination of a pair of Booleans
]

.pull-right[
![Visualization of Boolean algebra](transform.png)
]

---

### Boolean algebra and missing values

* Booleans and missing values interact in logical, but possibly counterintuitive, ways.
```r
df <- tibble(x = c(TRUE, FALSE, NA))

df %>%
  mutate(
    and_NA = x & NA,
    or_NA = x | NA
  )
```

```
# A tibble: 3 × 3
  x     and_NA or_NA
  <lgl> <lgl>  <lgl>
1 TRUE  NA     TRUE
2 FALSE FALSE  NA
3 NA    NA     NA
```

* `NA | TRUE` returns `TRUE`, since the result is TRUE whether the missing value is FALSE or TRUE.
* `NA & TRUE` returns `NA`, since the result depends on the missing value.
* `NA | FALSE` returns `NA`, since the result depends on the missing value.
* `NA & FALSE` returns `FALSE`, since the missing value doesn't affect the result: it is always FALSE.

---

### Order of operations

.pull-left[
Consider finding all flights departing in November or December in the tibble.

```r
flights %>%
  filter(month == 11 | month == 12)
```

This gives the correct result. However, the following does not:
]

.pull-right[
```r
flights %>%
  filter(month == 11 | 12)
```

```
# A tibble: 336,776 × 19
   year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
  <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
1  2013     1     1      517            515         2      830            819
2  2013     1     1      533            529         4      850            830
3  2013     1     1      542            540         2      923            850
4  2013     1     1      544            545        -1     1004           1022
...
```

Why?

* R first evaluates `month == 11`, creating a logical vector `vec`.
* R then evaluates `vec | 12`.
* When a number is used in a logical operation, any nonzero number is treated as TRUE.
* So `vec | 12` is TRUE for every element, and no rows are filtered out.
]

---

### `%in%`

Instead of worrying about how to combine `|` and `==`, just use `%in%`. `x %in% y` returns a logical vector, the same length as `x`, that is TRUE wherever an element of `x` appears in `y`.

```r
1:10 %in% c(1, 5, 10)
```

```
 [1]  TRUE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE  TRUE
```

So to find all flights from November and December:

```r
flights %>%
  filter(month %in% c(11, 12))
```

* `%in%` obeys different rules for `NA` than `==` does, since `NA %in% NA` is TRUE:

```r
(c(1,2,NA) == NA)
```

```
[1] NA NA NA
```

```r
(c(1,2,NA) %in% NA)
```

```
[1] FALSE FALSE  TRUE
```

---

### Logical summaries

.pull-left[
Two main functions for logical summaries: `any()` and `all()`.
* `any(x)` returns TRUE if there are any TRUEs in `x`
* `all(x)` returns TRUE only if all values in `x` are TRUE

For instance, was there a day where every flight was delayed on departure by at most an hour? Or a day where there were any flights delayed on arrival by >= 5 hours?
]

.pull-right[
```r
flights %>%
  group_by(year, month, day) %>%
  summarize(
    all_delayed = all(dep_delay <= 60, na.rm=TRUE),
    any_long_delay = any(arr_delay >= 300, na.rm=TRUE)
  )
```

```
`summarise()` has grouped output by 'year', 'month'. You can override using the
`.groups` argument.
```

```
# A tibble: 365 × 5
# Groups:   year, month [12]
   year month   day all_delayed any_long_delay
  <int> <int> <int> <lgl>       <lgl>
1  2013     1     1 FALSE       TRUE
2  2013     1     2 FALSE       TRUE
3  2013     1     3 FALSE       FALSE
4  2013     1     4 FALSE       FALSE
...
```
]

---

### Logical summaries

* When coerced into a numeric, TRUE = 1 and FALSE = 0
* If you want to find percentages/proportions that are TRUE/FALSE, this is very useful, e.g. `mean()`, `sum()`
* Example: proportion of flights delayed > 1 hour on departure, and number of flights delayed on arrival by > 5 hours:

```r
flights %>%
  group_by(year, month, day) %>%
  summarise(
    prop_delayed_1hour = mean(dep_delay > 60, na.rm=TRUE),
    num_long_delay = sum(arr_delay > 300, na.rm=TRUE)
  )
```

```
`summarise()` has grouped output by 'year', 'month'. You can override using the
`.groups` argument.
```

```
# A tibble: 365 × 5
# Groups:   year, month [12]
   year month   day prop_delayed_1hour num_long_delay
  <int> <int> <int>              <dbl>          <int>
1  2013     1     1             0.0609              3
2  2013     1     2             0.0856              3
3  2013     1     3             0.0586              0
4  2013     1     4             0.0473              0
...
```

---

### Logical summaries

* When coerced into a numeric, TRUE = 1 and FALSE = 0
* If you want to find percentages/proportions that are TRUE/FALSE, this is very useful, e.g.
`mean()`, `sum()`
* Example: proportion of flights delayed > 1 hour on departure, and number of flights delayed on arrival by > 5 hours:

```r
flights %>%
  group_by(year, month, day) %>%
  summarise(
    prop_delayed_1hour = mean(dep_delay > 60, na.rm=TRUE),
    num_long_delay = sum(arr_delay > 300, na.rm=TRUE),
    .groups = 'drop'
  )
```

```
# A tibble: 365 × 5
   year month   day prop_delayed_1hour num_long_delay
  <int> <int> <int>              <dbl>          <int>
1  2013     1     1             0.0609              3
2  2013     1     2             0.0856              3
3  2013     1     3             0.0586              0
4  2013     1     4             0.0473              0
5  2013     1     5             0.0363              1
...
```

---

### Logical subsetting

.pull-left[
* Logical vectors can also be used for subsetting
* Subset operator: `[]`
* e.g. to compute the average delay for flights with actual (> 0 minutes) delays, we would typically do:

```r
flights |>
  filter(arr_delay > 0) |>
  group_by(year, month, day) |>
  summarize(
    behind = mean(arr_delay),
    n = n(),
    .groups = 'drop'
  )
```

```
# A tibble: 365 × 5
   year month   day behind     n
  <int> <int> <int>  <dbl> <int>
1  2013     1     1   32.5   461
2  2013     1     2   32.0   535
3  2013     1     3   27.7   460
4  2013     1     4   28.3   297
...
```
]

.pull-right[
Another way is to use subsetting:

```r
flights |>
  group_by(year, month, day) |>
  summarize(
    behind = mean(arr_delay[arr_delay > 0], na.rm=TRUE),
    early = mean(arr_delay[arr_delay < 0], na.rm=TRUE),
    n = n(),
    .groups = 'drop'
  )
```

```
# A tibble: 365 × 6
   year month   day behind early     n
  <int> <int> <int>  <dbl> <dbl> <int>
1  2013     1     1   32.5 -12.5   842
2  2013     1     2   32.0 -14.3   943
3  2013     1     3   27.7 -18.2   914
...
```

Note the difference in `n`: in the first calculation, `n()` counts only the delayed flights, while in the second it counts all flights that day, so the two `n` columns are not comparable.
]

---

### Conditional transformations: `if_else()`

* `if_else(CONDITION, TRUE_VAL, FALSE_VAL, MISSING_VAL)` is useful when you want one value where the condition is TRUE and another value where it is FALSE.
```r
x <- c(-2, -1, 1, 2, NA)
if_else(x > 0, "+pos", "-neg")
```

```
[1] "-neg" "-neg" "+pos" "+pos" NA
```

The fourth argument of `if_else()` specifies what to return where the condition is `NA`:

```r
if_else(x > 0, "+pos", "-neg", "?????")
```

```
[1] "-neg"  "-neg"  "+pos"  "+pos"  "?????"
```

We can also use vectors as the arguments for what to return when true/false.

```r
if_else(x < 0, -x, x)
```

```
[1]  2  1  1  2 NA
```

---

### Conditional transformations: `if_else()`

We can use general vectors inside of `if_else()`:

```r
x1 <- c(NA, 1, 2, NA)
y1 <- c(3, NA, 4, 6)
if_else(is.na(x1), y1, x1)
```

```
[1] 3 1 2 6
```

If you have many different conditions for which you want to specify values, e.g.

* If number is between `a` and `b` then do...
* If number is between `b` and `c` then do...
* If number is between `c` and `d` then do...

Your best tool is `case_when()`.

---

### Conditional transformations: `case_when()`

Inspired by SQL's `CASE` statement. Has a very weird syntax:

* `condition ~ output`
* `condition` is a logical vector
* when it is `TRUE`, `output` is used.

Weird, but pretty readable:

```r
x <- c(-3:3, NA)
case_when(
  x == 0 ~ "0",
  x < 0 ~ "-ve",
  x > 0 ~ "+ve",
  is.na(x) ~ "???"
)
```

```
[1] "-ve" "-ve" "-ve" "0"   "+ve" "+ve" "+ve" "???"
```

---

### Conditional transformations: `case_when()`

.pull-left[
If no cases match, then returns NA:

```r
x <- c(-3:3, NA)
case_when(
  x < 0 ~ "-ve",
  x > 0 ~ "+ve"
)
```

```
[1] "-ve" "-ve" "-ve" NA    "+ve" "+ve" "+ve" NA
```

The argument `.default` specifies the value to use when no condition is satisfied, which includes `NA` values that no condition handles.

```r
x <- c(-3:3, NA)
case_when(
  x < 0 ~ "-ve",
  x > 0 ~ "+ve",
  .default = "???"
)
```

```
[1] "-ve" "-ve" "-ve" "???" "+ve" "+ve" "+ve" "???"
```
]

.pull-right[
If there are multiple conditions which match, only the first is used, so be careful!
```r
case_when(
  x > 0 ~ "+ve",
  x > 2 ~ "big"
)
```

```
[1] NA    NA    NA    NA    "+ve" "+ve" "+ve" NA
```
]

---

### `case_when()`

.pull-left[
Here's a more complex example of `case_when()`: providing human-readable labels to flight delays.

```r
(df1 <- flights |>
  mutate(
    status = case_when(
      is.na(arr_delay) ~ "cancelled",
      arr_delay < -30 ~ "very early",
      arr_delay < -15 ~ "early",
      abs(arr_delay) <= 15 ~ "on time",
      arr_delay < 60 ~ "late",
      arr_delay < Inf ~ "very late",
    ),
    .keep = "used" # only returns those columns used in calc
  )
)
```

```
# A tibble: 336,776 × 2
  arr_delay status
      <dbl> <chr>
1        11 on time
2        20 late
...
```
]

.pull-right[
Some things to note:

* We can refer to variables inside the dataframe inside `case_when()`, just as in most other dplyr functions
* The first conditional that is true is what gets assigned
* So when `arr_delay < -30`, the remaining conditionals do not get checked
]

---

### `case_when()`

.pull-left[
Two equivalent ways of using `case_when()` for this problem:

```r
df1 <- flights |>
  mutate(
    status = case_when(
      is.na(arr_delay) ~ "cancelled",
      arr_delay < -30 ~ "very early",
      arr_delay < -15 ~ "early",
      abs(arr_delay) <= 15 ~ "on time",
      arr_delay < 60 ~ "late",
      arr_delay < Inf ~ "very late",
    ),
    .keep = "used" # only returns those columns used in calc
  )
```
]

.pull-right[
```r
df2 <- flights |>
  mutate(
    status = case_when(
      is.na(arr_delay) ~ "cancelled",
      arr_delay < -30 ~ "very early",
      arr_delay < -15 ~ "early",
      abs(arr_delay) <= 15 ~ "on time",
      arr_delay < 60 ~ "late",
      .default = "very late"
    ),
    .keep = "used" # only returns those columns used in calc
  )
all.equal(df1, df2)
```

```
[1] TRUE
```

Recall that `.default` says how all `NA` and non-specified conditions are handled. Since `NA` is already labelled `"cancelled"` by the first condition, `.default` here only catches delays of 60 minutes or more, so both versions do the same thing.
]

---

### Compatible types

* Both `if_else()` and `case_when()` require the outputs to be of consistent types. If not, you'll get errors, e.g.

```r
if_else(TRUE, "a", 1)
#> Error in `if_else()`:
#> ! Can't combine `true` <character> and `false` <double>.

case_when(
  x < -1 ~ TRUE,
  x > 0 ~ now()
)
#> Error in `case_when()`:
#> ! Can't combine `..1 (right)` <logical> and `..2 (right)` <datetime<local>>
```

Most types are incompatible with each other, which helps catch errors. Types which are compatible:

* Numeric and logical (treats TRUE=1, FALSE=0)
* Dates and "date-times" - we will discuss these types later
* `NA` is compatible with everything
* Strings and factors are compatible - will discuss later

---

### Example: labelling numbers as even or odd

* Even number = divisible by two.
* In R, operator `%%` (read "modulo") does "modular arithmetic": `a %% b` returns the remainder when dividing `a` by `b`, e.g.
    * `17 %% 12 = 5`
    * `34 %% 6 = 4`
* A number `n` is even if and only if `n %% 2 == 0`; otherwise, odd.
* We can use `if_else()` to label numbers between 0 and 20 as even or odd

```r
x <- 0:20
if_else(x %% 2 == 0, 'even', 'odd')
```

```
 [1] "even" "odd"  "even" "odd"  "even" "odd"  "even" "odd"  "even" "odd"
[11] "even" "odd"  "even" "odd"  "even" "odd"  "even" "odd"  "even" "odd"
[21] "even"
```

---

### Example: