class: center, middle, inverse, title-slide

.title[
# Transformations of Numbers
]
.subtitle[
## STA35B: Statistical Data Science 2
]
.author[
### Spencer Frei
]

---

Today we'll focus on transforming numbers.

```r
library(tidyverse)
library(nycflights13)
flights
```

```
# A tibble: 336,776 × 19
   year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
  <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
1  2013     1     1      517            515         2      830            819
2  2013     1     1      533            529         4      850            830
3  2013     1     1      542            540         2      923            850
4  2013     1     1      544            545        -1     1004           1022
...
```

---

### Parsing strings to get numbers

The `readr` package in the tidyverse has two useful functions:

* `parse_double()` - useful when you have numbers written as strings.
* `parse_number()` - ignores all non-numeric text when parsing strings.

```r
parse_double(c("1.2", "5.6", "1e3"))
```

```
[1]    1.2    5.6 1000.0
```

```r
parse_number(c("$1,234", "USD 53,256", "59%"))
```

```
[1]  1234 53256    59
```

If you try to use `parse_double()` on strings that don't represent numbers, it will report parsing failures.

---

### Arithmetic and "recycling rules"

* We've already seen how to create new columns, e.g. `flights %>% mutate(air_time = air_time / 60)`.
* `air_time` has 336,776 numbers while 60 has only one, so we divide every element of `air_time` by 60.
* We've also seen that if you have two vectors of the same length, operations are done elementwise:

```r
x <- c(1, 2, 3, 4)
y <- c(2, 3, 4, 5)
x / y
```

```
[1] 0.5000000 0.6666667 0.7500000 0.8000000
```

* What happens when the number of elements is not 1 or the exact matching number?
* R does what is called **recycling**, or **repeating**: it creates a new vector by repeating the shorter vector until it matches the length of the longer one, and throws a warning if the longer length is not a multiple of the shorter.

```r
x * c(1, 2)
```

```
[1] 1 4 3 8
```

```r
x * c(1, 2, 3)
```

```
Warning in x * c(1, 2, 3): longer object length is not a multiple of shorter
object length
```

```
[1] 1 4 9 4
```

---

### Recycling rules

* The recycling rules apply to all logical comparison operators (`==`, `<`, etc.) and arithmetic operators (`+`, `^`, etc.)
* Be very careful when doing logical comparisons / arithmetic using two vectors!

```r
flights %>%
  filter(month == c(1, 2))
```

```
# A tibble: 25,977 × 19
   year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
  <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
1  2013     1     1      517            515         2      830            819
2  2013     1     1      542            540         2      923            850
...
```

* `month == c(1, 2)` returns a logical vector where:
  * TRUE if either the row number is odd and the month is 1, OR the row number is even and the month is 2
  * FALSE if either the row number is odd and the month is NOT 1, OR the row number is even and the month is NOT 2
* Better to use `month %in% c(1, 2)` here!

---

### Pairwise minimums and maximums

.pull-left[
* `pmin()` and `pmax()` return the pairwise min / max of 2 or more variables

```r
# example data matching the printed tibble below
df <- tibble(x = c(1, 5, 7), y = c(3, 2, NA), z = c(5, 7, 1))
(df)
```

```
# A tibble: 3 × 3
      x     y     z
  <dbl> <dbl> <dbl>
1     1     3     5
2     5     2     7
3     7    NA     1
```

```r
df %>%
  mutate(min = pmin(x, y, z, na.rm = TRUE),
         max = pmax(x, y, z, na.rm = TRUE))
```

```
# A tibble: 3 × 5
      x     y     z   min   max
  <dbl> <dbl> <dbl> <dbl> <dbl>
1     1     3     5     1     5
2     5     2     7     2     7
3     7    NA     1     1     7
```
]

--

.pull-right[
Different behavior than using `min()`, `max()`, which take any number of arguments and return a single value:

```r
df %>%
  mutate(bad_min = min(x, y, z, na.rm = TRUE),
         bad_max = max(x, y, z, na.rm = TRUE))
```

```
# A tibble: 3 × 5
      x     y     z bad_min bad_max
  <dbl> <dbl> <dbl>   <dbl>   <dbl>
1     1     3     5       1       7
2     5     2     7       1       7
3     7    NA     1       1       7
```
]

---

### Modular arithmetic

.pull-left[
* Modular arithmetic is similar to "clock arithmetic", except you always start from 0 instead of 1
* `%/%`: integer division; `%%`: the remainder after integer division.
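* The two fit back together: for positive integers, integer division times the divisor plus the remainder recovers the original value. A quick check:

```r
# x == (x %/% y) * y + (x %% y) for positive integers
x <- 1:10
all(x == (x %/% 3) * 3 + (x %% 3))
```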
```r
(1:10 %/% 3)
```

```
 [1] 0 0 1 1 1 2 2 2 3 3
```

```r
1:10 %% 3
```

```
 [1] 1 2 0 1 2 0 1 2 0 1
```
]

.pull-right[
Useful for calculating things related to time, e.g. creating new columns with the hour and minute from `sched_dep_time`:

```r
flights %>%
  mutate(hour = sched_dep_time %/% 100,
         minute = sched_dep_time %% 100,
         .keep = 'used')
```

```
# A tibble: 336,776 × 3
  sched_dep_time  hour minute
           <int> <dbl>  <dbl>
1            515     5     15
2            529     5     29
3            540     5     40
4            545     5     45
...
```
]

---

### Modular arithmetic

We can then do things like calculate the percent of cancelled flights per hour.

```r
flights %>%
  mutate(hour = sched_dep_time %/% 100) %>%
  group_by(hour) %>%
  summarize(percent_cancelled = 100 * mean(is.na(dep_time)),
            n = n())
```

```
# A tibble: 20 × 3
    hour percent_cancelled     n
   <dbl>             <dbl> <int>
 1     1           100         1
 2     5             0.461  1953
 3     6             1.64  25951
 4     7             1.27  22821
 5     8             1.62  27242
 6     9             1.61  20312
 7    10             1.74  16708
 8    11             1.85  16033
 9    12             2.13  18181
10    13             2.15  19956
11    14             2.61  21706
12    15             2.80  23888
...
```

---

### Rounding

.pull-left[
* `round()` allows you to round either to the nearest integer, or to a certain number of decimal places (using the `digits=` argument)

```r
round(123.456)
```

```
[1] 123
```

* `round()` deals with ties by "rounding to the nearest **even** integer":

```r
round(c(1.5, 2.5))
```

```
[1] 2 2
```

This way half of all 0.5's are rounded up, half are rounded down.
]

.pull-right[
* `round(x, digits = n)` either rounds to `n` digits past the decimal place, or, if `n < 0`, to the nearest `10^(-n)` (e.g. `n = -1` rounds to the nearest 10).

```r
( round(123.456, 2) )
```

```
[1] 123.46
```

```r
( round(123.456, 1) )
```

```
[1] 123.5
```

```r
( round(123.456, -1) )
```

```
[1] 120
```

```r
( round(123.456, -2) )
```

```
[1] 100
```
]

---

### Cumulative and rolling aggregates

* R provides many functions for computing cumulative (running) aggregates (sums, products, minimums, etc.)
* `cumsum()`, `cumprod()`, `cummin()`, `cummax()`, `dplyr::cummean()`

```r
x <- 1:10
(cumsum(x))
```

```
 [1]  1  3  6 10 15 21 28 36 45 55
```

```r
(cummean(x))
```

```
 [1] 1.0 1.5 2.0 2.5 3.0 3.5 4.0 4.5 5.0 5.5
```

---

### Rounding

.pull-left[
* Two similar functions: `floor()` and `ceiling()`
* `floor(x)` rounds to the greatest integer <= x, while `ceiling(x)` rounds to the least integer >= x.

```r
(floor(123.456))
```

```
[1] 123
```

```r
(ceiling(123.456))
```

```
[1] 124
```
]

---

### Ranks

.pull-left[
* `dplyr::min_rank()` takes a vector of numbers and returns the rank of each element, with the lowest = 1st.
* Ties are broken in the obvious way: 1st, 2nd, 2nd, 4th if the second and third elements are equal.

```r
x <- c(1, 2, 2, 3, NA, 4)
min_rank(x)
```

```
[1]  1  2  2  4 NA  5
```

To have large values ranked first, apply `min_rank()` to `desc(x)`:

```r
min_rank(desc(x))
```

```
[1]  5  3  3  2 NA  1
```
]

--

.pull-right[
There are several variants of `min_rank()` in the dplyr universe:

```r
df <- tibble(x = x)
df %>%
  mutate(row_num = row_number(x),
         percent_rank = percent_rank(x),
         cumulative_dist = cume_dist(x))
```

```
# A tibble: 6 × 4
      x row_num percent_rank cumulative_dist
  <dbl>   <int>        <dbl>           <dbl>
1     1       1         0                0.2
2     2       2         0.25             0.6
3     2       3         0.25             0.6
4     3       4         0.75             0.8
5    NA      NA        NA               NA
6     4       5         1                1
```

* `cume_dist()`: proportion of values <= x.
]

---

### Ranks

* Which 10 flight routes have the longest average delays?
* Remember: a negative delay = early

```r
flights %>%
  group_by(origin, dest) %>%
  summarize(avg_delay = mean(arr_delay, na.rm = TRUE),
            .groups = 'drop') %>%
  mutate(rank = min_rank(desc(avg_delay))) %>%
  arrange(rank) %>%
  filter(rank <= 10)
```

```
# A tibble: 10 × 4
   origin dest  avg_delay  rank
   <chr>  <chr>     <dbl> <int>
 1 EWR    CAE        44.6     1
 2 EWR    TYS        41.2     2
 3 EWR    TUL        33.7     3
 4 EWR    OKC        30.6     4
 5 EWR    JAC        29.9     5
 6 EWR    RIC        25.8     6
 7 EWR    MKE        24.0     7
 8 EWR    DSM        23.5     8
 9 EWR    MSN        23.2     9
10 EWR    PWM        23.0    10
```

---

### Offsets

.pull-left[
* `dplyr::lead()` and `dplyr::lag()` return the value just after / before the "current" value.
* They return a vector of the same length, padded with `NA` where there is no value to use.

```r
x <- c(2, 5, 11, 11, 19, 35)
(lag(x))
```

```
[1] NA  2  5 11 11 19
```

```r
(lead(x))
```

```
[1]  5 11 11 19 35 NA
```

* `x - lag(x)` gives the difference between the current and previous value.

```r
x - lag(x)
```

```
[1] NA  3  6  0  8 16
```
]

--

.pull-right[
* `x == lag(x)` tells you whether the current value is the same as the previous one (FALSE marks a change).

```r
x == lag(x)
```

```
[1]    NA FALSE FALSE  TRUE FALSE FALSE
```

* `lag(x, default = y)` fills in the `NA` with `y`.

```r
lag(x, default = -1)
```

```
[1] -1  2  5 11 11 19
```

```r
lag(x, default = first(x))
```

```
[1]  2  2  5 11 11 19
```
]

---

### Consecutive identifiers

.pull-left[
* Suppose we have a website, and we are given the times that a user has visited the website:

```r
events <- tibble(time = c(0, 1, 2, 3, 5, 10, 12, 15, 17, 19, 20, 27))
```

* Suppose we want to create an ID called "session", where a new session starts whenever it has been at least 5 minutes since the last visit.
* We can compute the time elapsed between visits using `lag()`, treating the first visit as 0 minutes since the last visit.

```r
events <- events %>%
  mutate(
    diff = time - lag(time, default = first(time)),
    five_mins_since = diff >= 5
  )
```
]

.pull-right[
* The "session" ID will start at 0 and increase by 1 every time `five_mins_since` is TRUE.
* To do this, we can use `cumsum()`:

```r
events %>%
  mutate(session = cumsum(five_mins_since))
```

```
# A tibble: 12 × 4
    time  diff five_mins_since session
   <dbl> <dbl> <lgl>             <int>
 1     0     0 FALSE                 0
 2     1     1 FALSE                 0
 3     2     1 FALSE                 0
 4     3     1 FALSE                 0
 5     5     2 FALSE                 0
 6    10     5 TRUE                  1
 7    12     2 FALSE                 1
 8    15     3 FALSE                 1
 9    17     2 FALSE                 1
10    19     2 FALSE                 1
11    20     1 FALSE                 1
12    27     7 TRUE                  2
```
]

---

### Minimum, maximum, quantiles

.pull-left[
* As mentioned before, `min(x)` and `max(x)` return the single smallest/largest value within the vector `x`.
* `quantile(vector, threshold)` is a generalization of the median:
  * `quantile(x, 0.25)` returns a value that is >= 25% of the values in `x`
  * `quantile(x, 0.5)` returns the median
  * `quantile(x, 0.95)` returns a value that is >= 95% of the values in `x`.
* Quantiles are less susceptible to extreme values, which affect the mean more.
  * Consider `c(1, 2, 3, 2, 5, 2, 3, 1, 4, 2, 3, 10000000)`
]

.pull-right[
* Let's calculate the maximum delay and the 95th percentile of delays per day.

```r
flights %>%
  group_by(year, month, day) %>%
  summarise(
    maxim = max(dep_delay, na.rm = TRUE),
    q95 = quantile(dep_delay, 0.95, na.rm = TRUE),
    .groups = 'drop'
  )
```

```
# A tibble: 365 × 5
   year month   day maxim   q95
  <int> <int> <int> <dbl> <dbl>
1  2013     1     1   853  70.1
2  2013     1     2   379  85
3  2013     1     3   291  68
4  2013     1     4   288  60
...
```
]

---

### Spread

.pull-left[
* Standard measures of spread:
* Standard deviation `sd()`: you should already be familiar with this,
`$$var(x) = \frac{1}{\mathsf{length}(x)-1} \sum_{i=1}^{\mathsf{length}(x)} (x[i] - \mathsf{mean}(x))^2$$`
`$$sd(x) = \sqrt{var(x)}$$`
* Interquartile range `IQR(x) = quantile(x, 0.75) - quantile(x, 0.25)`: contains the middle 50% of the data
]

.pull-right[
* Similar to median vs. mean, the IQR is less sensitive to big outliers.
* If the data were standard normal (no outliers), 50% of the data would lie within 0.6745 standard deviations of the mean.

```r
flights %>%
  group_by(year, month, day) %>%
  summarise(
    stddev = sd(dep_delay, na.rm = TRUE),
    stddev_50p_range = 0.6745 * stddev,
    iqr = IQR(dep_delay, na.rm = TRUE),
    .groups = 'drop'
  )
```

```
# A tibble: 365 × 6
   year month   day stddev stddev_50p_range   iqr
  <int> <int> <int>  <dbl>            <dbl> <dbl>
1  2013     1     1   45.3             30.5    13
2  2013     1     2   37.2             25.1    17
3  2013     1     3   31.5             21.2    15
...
```
]

---

### Spread

.pull-left[
* Which destinations show the greatest variation in air speed? There are multiple possible metrics:
  * Middle 90% by quantiles vs. within 2 standard deviations.
  * Absolute maximum vs. absolute minimum.
* We need to be sure to only look at destinations with > 1 observation, otherwise we can't compute the standard deviation.
]

.pull-right[

```r
flights %>%
  mutate(air_speed_mph = distance / (air_time / 60)) %>%
  group_by(dest) %>%
  filter(n() > 1) %>%
  summarize(
    speed_middle90 = quantile(air_speed_mph, 0.95, na.rm = TRUE) -
      quantile(air_speed_mph, 0.05, na.rm = TRUE),
    speed_2stddev = 2 * sd(air_speed_mph, na.rm = TRUE),
    speed_max_diff = max(air_speed_mph, na.rm = TRUE) - min(air_speed_mph, na.rm = TRUE)
  ) %>%
  arrange(desc(speed_middle90))
```

```
# A tibble: 103 × 4
  dest  speed_middle90 speed_2stddev speed_max_diff
  <chr>          <dbl>         <dbl>          <dbl>
1 ILM             127.          73.8           154.
2 OKC             125.          76.7           213.
3 BNA             121.          73.8           388.
4 CLT             121.          73.3           280.
5 TUL             118.          74.9           201.
6 MEM             118.          71.8           216.
...
```
]

---

### Positions

* A final type of useful numeric summary: `first(x)` (seen already!), `last(x)`, `nth(x, n)`.
* E.g. the first, fifth, and last departure per day:

```r
flights %>%
  group_by(year, month, day) %>%
  summarize(
    first_dep = first(dep_time, na_rm = TRUE),
    fifth_dep = nth(dep_time, 5, na_rm = TRUE),
    last_dep = last(dep_time, na_rm = TRUE)
  )
```

```
`summarise()` has grouped output by 'year', 'month'. You can override using the
`.groups` argument.
```

```
# A tibble: 365 × 6
# Groups:   year, month [12]
   year month   day first_dep fifth_dep last_dep
  <int> <int> <int>     <int>     <int>    <int>
1  2013     1     1       517       554     2356
2  2013     1     2        42       535     2354
3  2013     1     3        32       520     2349
...
```
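---

### Offsets and cumulative sums, combined

The lag-then-cumsum pattern used for session IDs shows up whenever you want to label "runs" in a vector. A small sketch on made-up data (the `measurements` tibble and its columns are invented for illustration):

```r
library(tidyverse)

measurements <- tibble(value = c(3, 5, 7, 2, 4, 9, 1, 6))

measurements %>%
  mutate(
    went_down = value < lag(value, default = first(value)),  # TRUE when the value drops
    run_id = cumsum(went_down)                               # new run ID each time it drops
  )
```

Each time `went_down` is TRUE, `cumsum()` ticks `run_id` up by one, exactly as `session` did for the website visits.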
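---

### Spread: sd() vs. IQR() on an outlier

A minimal sketch using the small vector from the quantiles slide: adding one huge value inflates the standard deviation dramatically, while the quantile-based summaries barely move.

```r
x_clean   <- c(1, 2, 3, 2, 5, 2, 3, 1, 4, 2, 3)
x_outlier <- c(x_clean, 10000000)

sd(x_clean); sd(x_outlier)           # standard deviation explodes
IQR(x_clean); IQR(x_outlier)         # interquartile range barely changes
median(x_clean); median(x_outlier)   # median barely changes
```

This echoes the earlier point that quantile-based summaries like `q95` and `iqr` are less sensitive to extreme values than `max()` and `sd()`.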