Transformations of vectors I

class: center, middle, inverse, title-slide

.title[
# Transformations of vectors I
]
.subtitle[
## <br><br> STA35B: Statistical Data Science 2
]
.author[
### Spencer Frei
]

---

Main data types we use in R:
* Logical/boolean (`TRUE`, `FALSE`)
* Numeric (13.8)
* Character/string ("hello")
* Missing (`NA`)

For logical vectors, every element takes one of 3 values: `TRUE`, `FALSE`, `NA`

We'll investigate how to manipulate and transform data to get logicals, and how to use logicals.

```r
library(tidyverse)
library(nycflights13)
flights
```

---

### Logical comparators

.pull-left[
Three basic logical operators that we will use over and over:
* AND (denoted `&` in R): operation between two logicals
* OR (denoted `|` in R): operation between two logicals
* NOT (denoted `!` in R): operation on a single logical.

Truth table for AND:

| A     | B     | `A AND B` |
|-------|-------|---------|
| `TRUE`  | `TRUE`  | `TRUE`    |
| `TRUE`  | `FALSE` | `FALSE`   |
| `FALSE` | `TRUE`  | `FALSE`   |
| `FALSE` | `FALSE` | `FALSE`   |

]

.pull-right[

Truth table for OR:

| A     | B     | A OR B |
|-------|-------|---------|
| `TRUE`  | `TRUE`  | `TRUE`    |
| `TRUE`  | `FALSE` | `TRUE`   |
| `FALSE` | `TRUE`  | `TRUE`   |
| `FALSE` | `FALSE` | `FALSE`   |

Truth table for NOT:

| A     | NOT A |
|-------|-------|
| `TRUE`  | `FALSE` |
| `FALSE` | `TRUE`  |

]

---

### Comparisons
Common way to create a logical vector: numeric comparison with `<`, `!=`, etc.

We have implicitly been using this when doing filtering.

```r
flights$dep_time > 600
```

```
    [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
   [13] FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE
...
```

Using a comparator between two vectors of logicals returns pairwise comparisons.

```r
x <- c(TRUE, FALSE, TRUE)
y <- c(FALSE, FALSE, TRUE)
(x & y) # x AND y
```

```
[1] FALSE FALSE  TRUE
```

```r
(x | y) # x OR y
```

```
[1]  TRUE FALSE  TRUE
```

---

### Comparisons
So when we use multiple comparisons in `filter()`, we are building a new vector of logicals.

We only keep those rows where the vector is `TRUE`.

```r
flights %>%
  filter(dep_time > 600 & dep_time < 2000 & abs(arr_delay) < 20)
```

```
# A tibble: 172,286 × 19
    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
 1  2013     1     1      601            600         1      844            850
 2  2013     1     1      602            610        -8      812            820
 3  2013     1     1      602            605        -3      821            805
...
```

---

## Comparisons
Filter and mutate can be used in conjunction

```r
flights %>%
  mutate(
    daytime = dep_time > 600 & dep_time < 2000,
    approx_ontime = abs(arr_delay) < 20,
  ) %>%
  filter(daytime & approx_ontime)
```

```
# A tibble: 172,286 × 21
    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
 1  2013     1     1      601            600         1      844            850
 2  2013     1     1      602            610        -8      812            820
 3  2013     1     1      602            605        -3      821            805
...
```

---

## Floating point comparisons
.pull-left[
Using `==` with floating points can cause problems.
This is because numbers are represented with finite "precision", i.e. only up to 2^{-32} or 2^{-64}.

```r
x <- c( (1/49) * 49, sqrt(2)^2)
x == c(1,2)
```

```
[1] FALSE FALSE
```

What's going on?  Let's look at more precise representation in R using `print(x, digits=)`:

```r
print(x, digits=10)
```

```
[1] 1 2
```

```r
print(x, digits=20)
```

```
[1] 0.99999999999999988898 2.00000000000000044409
```
]

--
.pull-right[
`dplyr::near()` helps with this, ignores small differences

```r
near(x, c(1,2))
```

```
[1] TRUE TRUE
```
]

---

## Missing values
.pull-left[ 
Almost any operation involving an `NA` returns `NA`.

```r
(NA > 5)
```

```
[1] NA
```

```r
(10 == NA)
```

```
[1] NA
```

]

.pull-right[
What about `NA==NA`?

```r
NA==NA
```

```
[1] NA
```
Why?  Think of this example

```r
# Suppose we don't know Spencer's age
age_spencer <- NA

# And we also don't know Zelda's age
age_zelda <- NA

# Then we shouldn't know whether Spencer and
# Zelda are the same age
age_spencer == age_zelda
```

```
[1] NA
```

]

---

### Missing values
A useful function for dealing with `NA`: `is.na()`

`is.na(x)` works with any type of vector and returns TRUE for missing values and FALSE for everything else:

```r
( is.na(c(TRUE, NA, FALSE)) )
```

```
[1] FALSE  TRUE FALSE
```

```r
( is.na(c(1, NA, 3)) )
```

```
[1] FALSE  TRUE FALSE
```

```r
( is.na(c("a", NA, "b")) ) 
```

```
[1] FALSE  TRUE FALSE
```

---
### Missing values
Since `is.na()` returns logicals, can be used in `filter()`:

```r
flights %>%
  filter(is.na(dep_time))
```

```
# A tibble: 8,255 × 19
    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
 1  2013     1     1       NA           1630        NA       NA           1815
 2  2013     1     1       NA           1935        NA       NA           2240
...
```

Can be used to help identify where `NA` come from. e.g., why are there air_time `NA`s?

```r
flights %>%
  select(time_hour, flight, dep_time, arr_time, air_time) %>%
  filter(is.na(air_time) & !is.na(arr_time) & !is.na(dep_time)) 
```

```
# A tibble: 717 × 5
   time_hour           flight dep_time arr_time air_time
   <dttm>               <int>    <int>    <int>    <dbl>
 1 2013-01-01 15:00:00   4525     1525     1934       NA
 2 2013-01-01 14:00:00   3806     1528     2002       NA
...
```

---

### Missing values
Let's examine how `dep_time`, `dep_delay`, and `sched_dep_time` are related.

```r
flights %>% 
  mutate(missing_dep_time = is.na(dep_time),
         missing_dep_delay = is.na(dep_delay),
         missing_sched_dep_time = is.na(sched_dep_time)) %>% 
  count(missing_dep_time, missing_dep_delay, missing_sched_dep_time)
```

```
# A tibble: 2 × 4
  missing_dep_time missing_dep_delay missing_sched_dep_time      n
  <lgl>            <lgl>             <lgl>                   <int>
1 FALSE            FALSE             FALSE                  328521
2 TRUE             TRUE              FALSE                    8255
```

* The only instances where `dep_delay` is missing have `dep_time` missing.

---

### Missing values

* Is it the case that `dep_delay` = `dep_time` - `sched_dep_time`?

```r
flights %>% 
  mutate(dep_delay_manual = dep_time - sched_dep_time,
         manual_matches_given = near(dep_delay_manual, dep_delay)) %>%
  count(manual_matches_given)
```

```
# A tibble: 3 × 2
  manual_matches_given      n
  <lgl>                 <int>
1 FALSE                 99777
2 TRUE                 228744
3 NA                     8255
```

Quite weird, since we are getting a lot right but also getting a lot wrong.

---

### Missing values

Let's inspect further.  What do those observations where manual doesn't match given look like?

```r
flights %>% 
  mutate(manual_delay = dep_time - sched_dep_time,
         manual_matches_given = near(manual_delay, dep_delay)) %>%
  filter(!manual_matches_given) %>%
  select(time_hour, flight, dep_time, sched_dep_time, dep_delay, manual_delay)
```

```
# A tibble: 99,777 × 6
   time_hour           flight dep_time sched_dep_time dep_delay manual_delay
   <dttm>               <int>    <int>          <int>     <dbl>        <int>
 1 2013-01-01 06:00:00    461      554            600        -6          -46
 2 2013-01-01 06:00:00    507      555            600        -5          -45
 3 2013-01-01 06:00:00   5708      557            600        -3          -43
 4 2013-01-01 06:00:00     79      557            600        -3          -43
 5 2013-01-01 06:00:00    301      558            600        -2          -42
...
```

The problem comes from the fact that `R` is treating `dep_time` and `sched_dep_time` as integers, not time!

Our calculation doesn't account for the fact that 5:54 is only 6 minutes away from 6:00, rather than 46.

We will later see how to properly treat dates and times.

---

### Boolean algebra

.pull-left[

* Recall the basic Boolean algebra comparators, AND and OR
* There is a third one, XOR, which we won't use that often
* Can combine AND/OR with NOT and cover any combination of a pair of Booleans
]

.pull-right[ 
![Visualization of Boolean algebra](transform.png)

]

---

### Boolean algebra and missing values
* Booleans and missing values interact in logical, but possibly counterintuitive ways.

```r
df <- tibble(x = c(TRUE, FALSE, NA))

df %>% 
  mutate(
    and_NA = x & NA,
    or_NA = x | NA
  )
```

```
# A tibble: 3 × 3
  x     and_NA or_NA
  <lgl> <lgl>  <lgl>
1 TRUE  NA     TRUE 
2 FALSE FALSE  NA   
3 NA    NA     NA   
```

* NA OR TRUE returns true, since it is TRUE regardless of NA being FALSE or TRUE.
* NA AND TRUE returns NA since it depends on value of NA.
* NA OR FALSE returns NA since it depends on value of NA.
* NA AND FALSE returns FALSE since NA value doesn't affect result, always false.

---

### Order of operations

.pull-left[ 
Consider finding all flights departing between November and December in the tibble.

```r
flights %>%
  filter(month == 11 | month == 12)
```
This results in the correct calculation.

However, the following calculation does not: 
]

.pull-right[

```r
flights %>%
  filter(month == 11 | 12)
```

```
# A tibble: 336,776 × 19
    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
 1  2013     1     1      517            515         2      830            819
 2  2013     1     1      533            529         4      850            830
 3  2013     1     1      542            540         2      923            850
 4  2013     1     1      544            545        -1     1004           1022
...
```
Why?
* R first evaluates `month==11`, creates a logical vector `vec`.
* R then compares `vec | 12`
* When comparing a number to any logical, every nonzero number is considered TRUE.
* So `vec | 12` returns a vector with TRUE for every element

]

---

### `%in%`
Instead of worrying about `|` and `==` in order, just use `%in%`.

```r
1:10 %in% c(1, 5, 10)
```

```
 [1]  TRUE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE  TRUE
```
So to find all flights from November and December:

```r
flights %>%
  filter(month %in% c(11, 12))
```

* `%in%` obeys different rules for `NA` vs. `==`, since `NA %in% NA` is TRUE:

```r
(c(1,2,NA) == NA)
```

```
[1] NA NA NA
```

```r
(c(1,2,NA) %in% NA)
```

```
[1] FALSE FALSE  TRUE
```

---

### Logical summaries

.pull-left[
Two main functions for logical summaries: `any()` and `all()`.
* `any(x)` returns TRUE if there any TRUEs in `x`
* `all(x)` returns TRUE only if all values in `x` are TRUE

For instance, was there a day where every flight was delayed on departure by less than an hour?  Or a day where there were any flights delayed on arrival by >= 5 hours?
]

.pull-right[

```r
flights %>%
  group_by(year, month, day) %>%
  summarize(
    all_delayed = all(dep_delay <= 60, na.rm=TRUE),
    any_long_delay = any(arr_delay >= 300, na.rm=TRUE)
  )
```

```
`summarise()` has grouped output by 'year', 'month'. You can override using the
`.groups` argument.
```

```
# A tibble: 365 × 5
# Groups:   year, month [12]
    year month   day all_delayed any_long_delay
   <int> <int> <int> <lgl>       <lgl>         
 1  2013     1     1 FALSE       TRUE          
 2  2013     1     2 FALSE       TRUE          
 3  2013     1     3 FALSE       FALSE         
 4  2013     1     4 FALSE       FALSE         
...
```

]

---

### Logical summaries
* When coerced into a numeric, TRUE = 1 and FALSE = 0
* If you want to find percentages/proportions that are TRUE/FALSE, this is very useful, e.g. `mean()`, `sum()`
* Example: proportion of flights delayed > 1 hour on departure, and number of flights delayed on arrival by > 5 hours:

```r
flights %>% 
  group_by(year, month, day) %>%
  summarise(
    prop_delayed_1hour = mean(dep_delay > 60, na.rm=TRUE),
    num_long_delay = sum(arr_delay > 300, na.rm=TRUE)
  )
```

```
`summarise()` has grouped output by 'year', 'month'. You can override using the
`.groups` argument.
```

```
# A tibble: 365 × 5
# Groups:   year, month [12]
    year month   day prop_delayed_1hour num_long_delay
   <int> <int> <int>              <dbl>          <int>
 1  2013     1     1             0.0609              3
 2  2013     1     2             0.0856              3
 3  2013     1     3             0.0586              0
 4  2013     1     4             0.0473              0
...
```

---

```r
flights %>% 
  group_by(year, month, day) %>%
  summarise(
    prop_delayed_1hour = mean(dep_delay > 60, na.rm=TRUE),
    num_long_delay = sum(arr_delay > 300, na.rm=TRUE),
    .groups = 'drop'
  )
```

```
# A tibble: 365 × 5
    year month   day prop_delayed_1hour num_long_delay
   <int> <int> <int>              <dbl>          <int>
 1  2013     1     1             0.0609              3
 2  2013     1     2             0.0856              3
 3  2013     1     3             0.0586              0
 4  2013     1     4             0.0473              0
 5  2013     1     5             0.0363              1
...
```

---

### Logical subsetting
.pull-left[ 
* Logical vectors can also be used for subsetting
* Subset operator: `[]` 
* e.g. computing average delay for flights with actual (>=0 minutes) delays, we would typically do:

```r
flights |> 
  filter(arr_delay > 0) |> 
  group_by(year, month, day) |> 
  summarize(
    behind = mean(arr_delay),
    n = n(),
    .groups = 'drop'
  )
```

```
# A tibble: 365 × 5
    year month   day behind     n
   <int> <int> <int>  <dbl> <int>
 1  2013     1     1   32.5   461
 2  2013     1     2   32.0   535
 3  2013     1     3   27.7   460
 4  2013     1     4   28.3   297
...
```
]

.pull-right[

Another way is to use subsetting:

```r
flights |> 
  group_by(year, month, day) |> 
  summarize(
    behind = mean(arr_delay[arr_delay > 0], na.rm=TRUE),
    early = mean(arr_delay[arr_delay < 0], na.rm=TRUE),
    n = n(),
    .groups = 'drop'
  )
```

```
# A tibble: 365 × 6
    year month   day behind early     n
   <int> <int> <int>  <dbl> <dbl> <int>
 1  2013     1     1   32.5 -12.5   842
 2  2013     1     2   32.0 -14.3   943
 3  2013     1     3   27.7 -18.2   914
...
```

In first calc, `n()` gives number of delayed flights while second gives total number of flights, not ideal.

]

---

### Conditional transformations: `if_else()`
* `if_else(CONDITION, TRUE_VAL, FALSE_VAL, MISSING_VAL)` is useful when:
  * When condition is TRUE, it's one value.  When FALSE, it's another value.

```r
x <- c(-2, -1, 1, 2, NA)
if_else(x > 0, "+pos", "-neg")
```

```
[1] "-neg" "-neg" "+pos" "+pos" NA    
```
The fourth arg of `if_else()` specifies what to fill `NA`'s with:

```r
if_else(x > 0, "+pos", "-neg", "?????")
```

```
[1] "-neg"  "-neg"  "+pos"  "+pos"  "?????"
```

We can also use vectors as an argument for what to do when true/false.

```r
if_else(x < 0, -x, x)
```

```
[1]  2  1  1  2 NA
```

---
### Conditional transformations: `if_else()`
We can use general vectors inside of `if_else()`:

```r
x1 <- c(NA, 1, 2, NA)
y1 <- c(3, NA, 4, 6)
if_else(is.na(x1), y1, x1)
```

```
[1] 3 1 2 6
```

If you have many different conditions for which you want to specify values, e.g.
* If number is between `a` and `b` then do...
* If number is between `b` and `c` then do...
* If number is between `c` and `d` then do...
Your best tool is `case_when()`.

---

### Conditional transformations: `case_when()`
Inspired by SQL's `CASE` statement.  Has a very weird syntax:
* `condition ~ output`
* `condition` is a logical vector
* when is is `TRUE`, `output` is used.
Weird, but pretty readable:

```r
x <- c(-3:3, NA)
case_when(
  x == 0   ~ "0",
  x < 0    ~ "-ve", 
  x > 0    ~ "+ve",
  is.na(x) ~ "???"
)
```

```
[1] "-ve" "-ve" "-ve" "0"   "+ve" "+ve" "+ve" "???"
```

---

### Conditional transformations: `case_when()`

.pull-left[ 
If no cases match, then returns NA:

```r
x <- c(-3:3, NA)
case_when(
  x < 0 ~ "-ve",
  x > 0 ~ "+ve"
)
```

```
[1] "-ve" "-ve" "-ve" NA    "+ve" "+ve" "+ve" NA   
```

The argument `.default` specifies what to do if there is no condition satisfied, or if value is NA.

```r
x <- c(-3:3, NA)
case_when(
  x < 0 ~ "-ve",
  x > 0 ~ "+ve",
  .default = "???"
)
```

```
[1] "-ve" "-ve" "-ve" "???" "+ve" "+ve" "+ve" "???"
```

]

.pull-right[
If there are multiple conditions which match, only the first is used -- be careful!

```r
case_when(
  x > 0 ~ "+ve",
  x > 2 ~ "big"
)
```

```
[1] NA    NA    NA    NA    "+ve" "+ve" "+ve" NA   
```

]

---

### `case_when()`
.pull-left[ 
Here's a more complex example of `case_when()`: providing human-readable labels to flight delays.

```r
(df1 <- flights |> 
  mutate(
    status = case_when(
      is.na(arr_delay)      ~ "cancelled",
      arr_delay < -30       ~ "very early",
      arr_delay < -15       ~ "early",
      abs(arr_delay) <= 15  ~ "on time",
      arr_delay < 60        ~ "late",
      arr_delay < Inf       ~ "very late",
    ),
    .keep = "used" # only returns those columns used in calc
  ) )
```

```
# A tibble: 336,776 × 2
   arr_delay status 
       <dbl> <chr>  
 1        11 on time
 2        20 late   
...
```
]

.pull-right[
Some things to note:
* We can refer to variables inside the dataframe inside case_when, just as in most other dplyr functions 
* The first conditional that is true is what gets assigned
  * So when `arr_delay < -30`, the remaining conditionals do not get checked

]
---
### `case_when()`
.pull-left[
Two equivalent ways of using `case_when` for this problem:

```r
df1 <- flights |> 
  mutate(
    status = case_when(
      is.na(arr_delay)      ~ "cancelled",
      arr_delay < -30       ~ "very early",
      arr_delay < -15       ~ "early",
      abs(arr_delay) <= 15  ~ "on time",
      arr_delay < 60        ~ "late",
      arr_delay < Inf       ~ "very late",
    ),
    .keep = "used" # only returns those columns used in calc
  ) 
```
]

.pull-right[

```r
df2 <- flights |> 
  mutate(
    status = case_when(
      is.na(arr_delay)      ~ "cancelled",
      arr_delay < -30       ~ "very early",
      arr_delay < -15       ~ "early",
      abs(arr_delay) <= 15  ~ "on time",
      arr_delay < 60        ~ "late",
	    .default = "very late"
    ),
    .keep = "used" # only returns those columns used in calc
  )  
all.equal(df1, df2)
```

```
[1] TRUE
```
Recall that `.default` says how all `NA` and non-specified conditions are handled.  Since we have already used that `NA` implies canceled, this does the same thing. 
]

---

### Compatible types

* Both `if_else()` and `case_when()` require the outputs to be of consistent types.  If not, you'll get errors, e.g.

```r
if_else(TRUE, "a", 1)
#> Error in `if_else()`:
#> ! Can't combine `true` <character> and `false` <double>.

case_when(
  x < -1 ~ TRUE,
  x > 0 ~ now()
)
#> Error in `case_when()`:
#> ! Can't combine `..1 (right)` <logical> and `..2 (right)` <datetime<local>>
```
Most types are incompatible in order to catch errors.  Types which are compatible:
* Numeric and logical (treats TRUE=1, FALSE=0)
* Dates and "date-times" - we will discuss these types later
* `NA` is compatible with everything
* Strings and factors are compatible - will discuss later

---

### Example: labelling numbers as even or odd
* Even number = divisible by two.
* In R, operator `%%` (read "modulo") does "modular arithmetic": `a %% b` returns the remainder when dividing `a` by `b`, e.g.
  * `17 %% 12 = 5`
  * `34 %% 6 = 4`
  * A number `n` is even if and only if `n %% 2 == 0`; otherwise, odd.
* We can use `if_else` to label numbers between 0 and 20 as even or odd

```r
x <- 0:20
if_else(x %% 2 == 0, 'even', 'odd')
```

```
 [1] "even" "odd"  "even" "odd"  "even" "odd"  "even" "odd"  "even" "odd" 
[11] "even" "odd"  "even" "odd"  "even" "odd"  "even" "odd"  "even" "odd" 
[21] "even"
```

---

### Example: