class: center, middle, inverse, title-slide .title[ # dplyr and tidy data ] .subtitle[ ##
STA35B: Statistical Data Science 2 ] .author[ ### Spencer Frei ] --- ## Reminder about packages Packages are things we load using `library()`. We'll primarily use the `tidyverse` library, but occasionally we'll use other packages. If you haven't installed a package `PACKAGENAME` before, you'll get an error. To install a package, `install.packages(PACKAGENAME)`. --- ## dataframes/tibbles The main objects we'll care about in R are dataframes and tibbles - you should be familiar with these. <!-- Tibbles are a special type of data frame that have nicer properties - we'll use these for all of the course. --> ```r library(nycflights13) # loads the `flights` tibble into environment flights ``` ``` # A tibble: 336,776 × 19 year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time <int> <int> <int> <int> <int> <dbl> <int> <int> 1 2013 1 1 517 515 2 830 819 2 2013 1 1 533 529 4 850 830 3 2013 1 1 542 540 2 923 850 4 2013 1 1 544 545 -1 1004 1022 5 2013 1 1 554 600 -6 812 837 6 2013 1 1 554 558 -4 740 728 7 2013 1 1 555 600 -5 913 854 8 2013 1 1 557 600 -3 709 723 9 2013 1 1 557 600 -3 838 846 10 2013 1 1 558 600 -2 753 745 # ℹ 336,766 more rows # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>, # tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, # hour <dbl>, minute <dbl>, time_hour <dttm> ``` --- ## dataframes/tibbles Useful functions for inspecting dataframes: `head()` and `str()` ```r head(flights) ``` ``` # A tibble: 6 × 19 year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time <int> <int> <int> <int> <int> <dbl> <int> <int> 1 2013 1 1 517 515 2 830 819 2 2013 1 1 533 529 4 850 830 3 2013 1 1 542 540 2 923 850 4 2013 1 1 544 545 -1 1004 1022 5 2013 1 1 554 600 -6 812 837 6 2013 1 1 554 558 -4 740 728 # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>, # tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, # hour <dbl>, minute <dbl>, time_hour <dttm> ``` --- ## dataframes/tibbles Useful functions for inspecting dataframes: `head()` and `str()` ```r str(flights) ``` ``` tibble [336,776 × 19] (S3: tbl_df/tbl/data.frame) $ year : int [1:336776] 2013 2013 2013 2013 2013 2013 2013 2013 2013 2013 ... $ month : int [1:336776] 1 1 1 1 1 1 1 1 1 1 ... $ day : int [1:336776] 1 1 1 1 1 1 1 1 1 1 ... $ dep_time : int [1:336776] 517 533 542 544 554 554 555 557 557 558 ... $ sched_dep_time: int [1:336776] 515 529 540 545 600 558 600 600 600 600 ... $ dep_delay : num [1:336776] 2 4 2 -1 -6 -4 -5 -3 -3 -2 ... $ arr_time : int [1:336776] 830 850 923 1004 812 740 913 709 838 753 ... ... ``` --- ## dataframes/tibbles To return the data from a column in a dataframe/tibble, use `df$COLUMN` (returns vector) ```r flights$distance ``` ``` [1] 1400 1416 1089 1576 762 719 1065 229 944 733 1028 1005 2475 2565 [15] 1389 187 2227 1076 762 1023 1020 502 1085 760 1085 719 2586 1074 [29] 1598 746 2133 1096 1416 1008 1020 2153 1598 2454 733 185 529 212 [43] 1389 950 187 1023 529 1065 301 1620 2434 2248 1182 1035 1990 2586 [57] 1096 733 944 1147 997 1076 762 2475 1096 1069 502 1068 1608 2475 ... ``` `df[,"COLUMN"]` (returns tibble) ```r flights[,"distance"] ``` ``` # A tibble: 336,776 × 1 distance <dbl> 1 1400 2 1416 ... ``` --- ## Piping: `%>%` and `|>` The operators `%>%` (and equivalently, `|>`) allow for you to write operations sequentially. ```r mean(flights$distance) ``` ``` [1] 1039.913 ``` Same as: ```r flights$distance %>% mean ``` ``` [1] 1039.913 ``` We will see more complex examples of this shortly. --- ### The `select()` function Returns a tibble with the columns specified, in order specified. Do not need quotation marks. ```r (colnames(flights)) ``` ``` [1] "year" "month" "day" "dep_time" [5] "sched_dep_time" "dep_delay" "arr_time" "sched_arr_time" [9] "arr_delay" "carrier" "flight" "tailnum" [13] "origin" "dest" "air_time" "distance" [17] "hour" "minute" "time_hour" ``` ```r after_select <- flights %>% select(time_hour, origin, dest, air_time, distance) (colnames(after_select)) ``` ``` [1] "time_hour" "origin" "dest" "air_time" "distance" ``` Helper functions for `select()`: * `starts_with("abc")`: matches names that begin with “abc”. * `ends_with("xyz")`: matches names that end with “xyz”. * `contains("ijk")`: matches names that contain “ijk”. * `num_range("x", 1:3)`: matches x1, x2 and x3. --- ## The `select()` function ```r colnames(flights) ``` ``` [1] "year" "month" "day" "dep_time" [5] "sched_dep_time" "dep_delay" "arr_time" "sched_arr_time" [9] "arr_delay" "carrier" "flight" "tailnum" [13] "origin" "dest" "air_time" "distance" [17] "hour" "minute" "time_hour" ``` ```r after_select <- flights %>% select(time_hour, origin, dest, contains("sched")) colnames(after_select) ``` ``` [1] "time_hour" "origin" "dest" "sched_dep_time" [5] "sched_arr_time" ``` --- ## The `filter()` function ```r # data = data to filter # expr<#> = expression used to filter data, typically using ==, >, etc filter(<data>, <expr1>) filter(<data>, <expr1>, <expr2>, <expr3>) ``` Example: flights on February 15 ```r filter(flights, month == 2, day == 15) ``` ``` # A tibble: 954 × 19 year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time <int> <int> <int> <int> <int> <dbl> <int> <int> 1 2013 2 15 3 2358 5 503 438 2 2013 2 15 6 2115 171 48 2214 3 2013 2 15 6 2250 76 117 5 4 2013 2 15 22 2230 112 510 312 5 2013 2 15 36 2352 44 607 437 ... ``` --- ## The `filter()` function ```r # data = data to filter # expr = expression used to filter data, typically using ==, >, etc filter(<data>, <expr>) ``` Expressions can be built up using comparison operators and logical operators. * `&` is "and" * `|` is "or" * `!` is "not" * `%in%` is a membership checker (is an object inside of a vector) * `>`, `<` mean greater than, less than * `>=`, `<=` mean greater than or equal to, less than or requal to * `==` means "is equal" * `!=` means "is not equal" --- ## Examples of `filter()` ```r flights %>% select(time_hour, origin, dest, air_time, distance) %>% filter(distance < 120) ``` ``` # A tibble: 2,076 × 5 time_hour origin dest air_time distance <dttm> <chr> <chr> <dbl> <dbl> 1 2013-01-01 09:00:00 LGA PHL 32 96 2 2013-01-01 13:00:00 EWR BDL 25 116 3 2013-01-01 16:00:00 JFK PHL 35 94 ... ``` ```r flights %>% select(time_hour, origin, dest, air_time, distance) %>% filter(distance < 120, origin == 'EWR') ``` ``` # A tibble: 493 × 5 time_hour origin dest air_time distance <dttm> <chr> <chr> <dbl> <dbl> 1 2013-01-01 13:00:00 EWR BDL 25 116 2 2013-01-01 22:00:00 EWR BDL 24 116 ... ``` --- ## Spot the error? ```r flights %>% select(time_hour, origin, dest, air_time, distance) %>% filter(origin == SFO) ``` --- ## Spot the error? ```r flights %>% select(time_hour, origin, dest, air_time, distance) %>% filter(origin == 'SFO') ``` --- ## Equivalent ways to filter on multiple conditions ```r flights %>% select(time_hour, origin, dest, air_time, distance) %>% filter(distance < 120, origin == "EWR") ``` ``` # A tibble: 493 × 5 time_hour origin dest air_time distance <dttm> <chr> <chr> <dbl> <dbl> 1 2013-01-01 13:00:00 EWR BDL 25 116 2 2013-01-01 22:00:00 EWR BDL 24 116 ... ``` ```r flights %>% select(time_hour, origin, dest, air_time, distance) %>% filter(distance < 120 & origin == "EWR") ``` ```r flights %>% select(time_hour, origin, dest, air_time, distance) %>% filter(distance < 120) %>% filter(origin == "EWR") ``` --- ## `count()` .pull-left[ If you want to count occurrences of different pairs/"tuples" of values across columns: ```r dataframe %>% count(column1, column2) ``` e.g. for flights, want to count how many flights by origin: ```r flights %>% count(origin) ``` ``` # A tibble: 3 × 2 origin n <chr> <int> 1 EWR 120835 2 JFK 111279 3 LGA 104662 ``` ] -- .pull-right[ Or how many origin-destination pairs: ```r flights %>% count(origin, dest, sort = TRUE) ``` ``` # A tibble: 224 × 3 origin dest n <chr> <chr> <int> 1 JFK LAX 11262 2 LGA ATL 10263 3 LGA ORD 8857 4 JFK SFO 8204 5 LGA CLT 6168 6 EWR ORD 6100 ... ``` ] --- ## `mutate()` The function `mutate()` adds new columns to the data frame using functions of extant columns. Example: ```r small_flights <- flights %>% select(origin, dest, air_time, distance, air_time) small_flights %>% mutate(hours = air_time / 60, mi_per_hour = distance / hours, km_per_hour = 1.61 * mi_per_hour) ``` ``` # A tibble: 336,776 × 7 origin dest air_time distance hours mi_per_hour km_per_hour <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> 1 EWR IAH 227 1400 3.78 370. 596. 2 LGA IAH 227 1416 3.78 374. 603. 3 JFK MIA 160 1089 2.67 408. 657. 4 JFK BQN 183 1576 3.05 517. 832. ... ``` Notice how `mi_per_hour` uses `hours`, which was created within the same `mutate()` call. --- ## Summary statistics and missing data Common summary statistics of interest in data: * Mean (`mean()`) * Min/max (`min()`, `max()`) * Median (`median()`) * Standard deviation / variance (`sd()`, `var()`) R denotes missing data using `NA`. Typically, if you compute a function of a vector with `NA`s, it will return `NA`, unless you put `na.rm=TRUE`. ```r x <- c(-1, 0, 1, NA) (mean(x)) ``` ``` [1] NA ``` ```r (mean(x, na.rm=TRUE)) ``` ``` [1] 0 ``` --- ## `summarise()` If you want to compute summary statistics of dataframe, use `summarise()`. ```r # data = data you want to create a new variable from # new var name = the name your new variable will be # calc exp = the calculation you want to perform summarise(<data>, <new var name> = <calc exp>) ``` ```r flights %>% select(origin, dest, air_time, distance) %>% mutate(hours = air_time / 60, mi_per_hour = distance / hours) %>% summarise(flight_time = mean(mi_per_hour)) ``` ``` # A tibble: 1 × 1 flight_time <dbl> 1 NA ``` --- ## `summarize()` If you want to compute summary statistics of dataframe, use `summarise()`. ```r # data = data you want to create a new variable from # new var name = the name your new variable will be # calc exp = the calculation you want to perform summarize(<data>, <new var name> = <calc exp>) ``` ```r flights %>% select(origin, dest, air_time, distance) %>% mutate(hours = air_time / 60, mi_per_hour = distance / hours) %>% summarize(flight_time = mean(mi_per_hour, na.rm=TRUE)) ``` ``` # A tibble: 1 × 1 flight_time <dbl> 1 394. ``` --- ## `group_by()` and `summarise()` If you would like to compute summary statistics per category, use `group_by()` first. ```r flights %>% select(origin, dest, air_time, distance) %>% group_by(origin) %>% summarize(median_distance = median(distance, na.rm=TRUE), max_air_time = max(air_time, na.rm=TRUE)) ``` ``` # A tibble: 3 × 3 origin median_distance max_air_time <chr> <dbl> <dbl> 1 EWR 872 695 2 JFK 1069 691 3 LGA 762 331 ``` --- ## Complex calculations .delay[ Average flight time, in hours, per origin airport for flights > 500 miles long? ] .delay[ ```r flights %>% select(origin, dest, air_time, distance) %>% mutate(air_time_hrs = air_time / 60) %>% filter(distance > 500) %>% group_by(origin) %>% summarize(avg_flight_time_hrs = mean(air_time_hrs, na.rm=TRUE)) ``` ``` # A tibble: 3 × 2 origin avg_flight_time_hrs <chr> <dbl> 1 EWR 2.99 2 JFK 3.76 3 LGA 2.27 ``` ] --- ## Complex calculations Average flight time, in minutes, per origin airport, when flight is > 500 miles long vs. <= 500 miles long? We need to `group_by` two variables: origin airport, and whether or not flight is > 500 or <= 500 miles. We need to create the latter variable. ```r flights %>% select(origin, dest, air_time, distance) %>% mutate(air_time_hrs = air_time / 60, distance_greater_500mi = distance > 500) ``` ``` # A tibble: 336,776 × 6 origin dest air_time distance air_time_hrs distance_greater_500mi <chr> <chr> <dbl> <dbl> <dbl> <lgl> 1 EWR IAH 227 1400 3.78 TRUE 2 LGA IAH 227 1416 3.78 TRUE 3 JFK MIA 160 1089 2.67 TRUE 4 JFK BQN 183 1576 3.05 TRUE ... ``` --- ## Complex calculations Average flight time, in minutes, per origin airport, when flight is > 500 miles long vs. <= 500 miles long? ```r flights %>% select(origin, dest, air_time, distance) %>% mutate(air_time_hrs = air_time / 60, distance_greater_500mi = distance > 500) %>% group_by(origin, distance_greater_500mi) %>% summarize(avg_flight_time_hrs = mean(air_time_hrs, na.rm=TRUE)) ``` ``` # A tibble: 6 × 3 # Groups: origin [3] origin distance_greater_500mi avg_flight_time_hrs <chr> <lgl> <dbl> 1 EWR FALSE 0.900 2 EWR TRUE 2.99 3 JFK FALSE 0.849 4 JFK TRUE 3.76 5 LGA FALSE 0.916 6 LGA TRUE 2.27 ``` --- ### Tidy data Multiple equivalent ways of organizing data into a dataframe. ```r table1 #> # A tibble: 6 × 4 #> country year cases population #> <chr> <dbl> <dbl> <dbl> #> 1 Afghanistan 1999 745 19987071 #> 2 Afghanistan 2000 2666 20595360 #> 3 Brazil 1999 37737 172006362 #> 4 Brazil 2000 80488 174504898 #> 5 China 1999 212258 1272915272 #> 6 China 2000 213766 1280428583 table2 #> # A tibble: 12 × 4 #> country year type count #> <chr> <dbl> <chr> <dbl> #> 1 Afghanistan 1999 cases 745 #> 2 Afghanistan 1999 population 19987071 #> 3 Afghanistan 2000 cases 2666 #> 4 Afghanistan 2000 population 20595360 #> 5 Brazil 1999 cases 37737 #> 6 Brazil 1999 population 172006362 #> # ℹ 6 more rows ``` table1 is **tidy** - easier to work with using `tidyverse`. --- ### Tidy data * Each variable is a column; each column is a variable. * Each observation is a row; each row is an observation. * Each value is a cell; each cell is a single value. ![Tidy data](tidy.png) Why tidy data? * Consistency - uniform format makes collaboration easier * Vectorization - R commands often operate on vectors of data, best for each column to be a vector of data --- ### **Tidying** data Unfortunately, most real-world datasets you encounter are NOT tidy. In fact, a significant portion of "data scientist work" consists in tidying data ("cleaning data"). The next two weeks will primarily be about how to tidy/rearrange data so that you can do data analysis and visualization properly. ```r library(tidyverse) # loads in tidyr::billboard into namespace ``` ```r billboard ``` ``` # A tibble: 317 × 79 artist track date.entered wk1 wk2 wk3 wk4 wk5 wk6 wk7 wk8 <chr> <chr> <date> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> 1 2 Pac Baby… 2000-02-26 87 82 72 77 87 94 99 NA 2 2Ge+her The … 2000-09-02 91 87 92 NA NA NA NA NA 3 3 Doors D… Kryp… 2000-04-08 81 70 68 67 66 57 54 53 4 3 Doors D… Loser 2000-10-21 76 76 72 69 67 65 55 59 5 504 Boyz Wobb… 2000-04-15 57 34 25 17 17 31 36 49 6 98^0 Give… 2000-08-19 51 39 34 26 26 19 2 2 7 A*Teens Danc… 2000-07-08 97 97 96 95 100 NA NA NA ... ``` --- ### Lengthening data with `pivot_longer()` ```r billboard ``` ``` # A tibble: 317 × 79 artist track date.entered wk1 wk2 wk3 wk4 wk5 wk6 wk7 wk8 <chr> <chr> <date> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> 1 2 Pac Baby… 2000-02-26 87 82 72 77 87 94 99 NA 2 2Ge+her The … 2000-09-02 91 87 92 NA NA NA NA NA 3 3 Doors D… Kryp… 2000-04-08 81 70 68 67 66 57 54 53 ... ``` Billboard dataset: * Each observation is a song * First three columns are variable describing the song: `artist`, `track`, `date.entered` * Then 76 columns (`wk1`, `wk2`, ..., `wk76`) saying rank of the song in each week. * The column names (`week`) are actually a *variable*, while the cell values are another variable (the `rank`) * To make it tidy, each observation (`artist-track-date.entered-week-rank`) should be a row, so need to be longer dataframe to make tidy. --- ### Lengthening data with `pivot_longer()` ```r billboard %>% pivot_longer( cols = starts_with("wk"), names_to = "week", values_to = "rank" ) ``` ``` # A tibble: 24,092 × 5 artist track date.entered week rank <chr> <chr> <date> <chr> <dbl> 1 2 Pac Baby Don't Cry (Keep... 2000-02-26 wk1 87 2 2 Pac Baby Don't Cry (Keep... 2000-02-26 wk2 82 3 2 Pac Baby Don't Cry (Keep... 2000-02-26 wk3 72 4 2 Pac Baby Don't Cry (Keep... 2000-02-26 wk4 77 ... ``` * `cols` specifies which columns need to be pivoted (= are NOT variables) * `names_to` names the variable stored in the column names ("week") * `values_to` names the variable stored in cell values ("rank") * "week" and "rank" do not appear as column names in billboard, so need quotes --- ### Lengthening data with `pivot_longer()` ```r billboard %>% pivot_longer( cols = starts_with("wk"), names_to = "week", values_to = "rank" ) ``` ``` # A tibble: 24,092 × 5 artist track date.entered week rank <chr> <chr> <date> <chr> <dbl> 1 2 Pac Baby Don't Cry (Keep... 2000-02-26 wk1 87 2 2 Pac Baby Don't Cry (Keep... 2000-02-26 wk2 82 3 2 Pac Baby Don't Cry (Keep... 2000-02-26 wk3 72 4 2 Pac Baby Don't Cry (Keep... 2000-02-26 wk4 77 ... ``` Data is now tidy, but not ideal for data analysis. Why? * `week` should ideally be a number, not a character * We can use `readr::parse_numbers()` that extracts first number from string to fix. --- ### Lengthening data with `pivot_longer()` ```r billboard %>% pivot_longer( cols = starts_with("wk"), names_to = "week", values_to = "rank" ) %>% mutate(week = parse_number(week)) ``` ``` # A tibble: 24,092 × 5 artist track date.entered week rank <chr> <chr> <date> <dbl> <dbl> 1 2 Pac Baby Don't Cry (Keep... 2000-02-26 1 87 2 2 Pac Baby Don't Cry (Keep... 2000-02-26 2 82 3 2 Pac Baby Don't Cry (Keep... 2000-02-26 3 72 4 2 Pac Baby Don't Cry (Keep... 2000-02-26 4 77 ... ``` --- ### Understanding pivoting .pull-left[ Let's consider toy dataset: three people (A,B,C) with two blood pressure (BP) measurements each. ```r df ``` ``` # A tibble: 3 × 3 id bp1 bp2 <chr> <dbl> <dbl> 1 A 100 120 2 B 140 115 3 C 120 125 ``` ] -- .pull-right[ For new dataset, want each observation (id, bp) to have its own row: ```r df %>% pivot_longer( cols = c(bp1, bp2), names_to = "measurement", values_to = "value" ) ``` ``` # A tibble: 6 × 3 id measurement value <chr> <chr> <dbl> 1 A bp1 100 2 A bp2 120 3 B bp1 140 4 B bp2 115 5 C bp1 120 6 C bp2 125 ``` ] --- <!--### Understanding pivoting--> .pull-left[ ```r df %>% pivot_longer( cols = c(bp1, bp2), names_to = "measurement", values_to = "value" ) ``` * `cols`: which columns to be pivoted * `names_to`: names var stored in the col names * `values_to` names the var stored in cell values ![Columns that are already vars need to be repeated](variables.png) <span style="font-size:smaller;">If col is already a var, needs to repeat for each pivoted col.</span> ] -- .pull-right[ ![Col names of pivoted columns become values in a new column](column-names.png) <span style="font-size:smaller;">Pivoted column names become values in a new variable, with name given by `names_to`. They need to be repeated for each row in the original dataset. </span> -- ![Cell values become new variable, with name given by values_to](cell-values.png) Cell values are values in a new variable, with name `values_to`, unwound row by row. ] --- ### Widening data We'll now use `pivot_wider()` to widen data which is (too) long. ```r (cms_patient_experience <- tidyr::cms_patient_experience %>% select(-measure_title)) ``` ``` # A tibble: 500 × 4 org_pac_id org_nm measure_cd prf_rate <chr> <chr> <chr> <dbl> 1 0446157747 USC CARE MEDICAL GROUP INC CAHPS_GRP_1 63 2 0446157747 USC CARE MEDICAL GROUP INC CAHPS_GRP_2 87 3 0446157747 USC CARE MEDICAL GROUP INC CAHPS_GRP_3 86 4 0446157747 USC CARE MEDICAL GROUP INC CAHPS_GRP_5 57 ... ``` The basic unit studied is an organization, but it's spread across six rows for different measurements. --- ### Widening data ```r cms_patient_experience %>% pivot_wider( id_cols = starts_with("org"), # org_pac_id and org_nm are identifiers names_from = measure_cd, values_from = prf_rate ) ``` ``` # A tibble: 95 × 8 org_pac_id org_nm CAHPS_GRP_1 CAHPS_GRP_2 CAHPS_GRP_3 CAHPS_GRP_5 CAHPS_GRP_8 <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> 1 0446157747 USC C… 63 87 86 57 85 2 0446162697 ASSOC… 59 85 83 63 88 3 0547164295 BEAVE… 49 NA 75 44 73 4 0749333730 CAPE … 67 84 85 65 82 ... ``` If you don't supply `id_cols`, R assumes that all columns EXCEPT for `names_from` and `values_from` are id_cols. --- ### Understanding `pivot_wider()` .pull-left[ Dataset where two patients (A, B), with between 2 and 3 BP measurements. ```r df ``` ``` # A tibble: 5 × 3 id measurement value <chr> <chr> <dbl> 1 A bp1 100 2 B bp1 140 3 B bp2 115 4 A bp2 120 5 A bp3 105 ``` ] -- .pull-right[ ```r df %>% pivot_wider( names_from = measurement, values_from = value ) ``` ``` # A tibble: 2 × 4 id bp1 bp2 bp3 <chr> <dbl> <dbl> <dbl> 1 A 100 120 105 2 B 140 115 NA ``` ] Since there is no measurement for bp3 for B, R puts in `NA`. Since `id_cols` is empty, R assumes that all columns EXCEPT for `names_from` and `values_from` are id_cols. --- ### Going from long to wide and back again .pull-left[ ```r df ``` ``` # A tibble: 5 × 3 id measurement value <chr> <chr> <dbl> 1 A bp1 100 2 B bp1 140 3 B bp2 115 4 A bp2 120 5 A bp3 105 ``` ```r (wide_df <- df %>% pivot_wider( names_from = measurement, values_from = value ) ) ``` ``` # A tibble: 2 × 4 id bp1 bp2 bp3 <chr> <dbl> <dbl> <dbl> 1 A 100 120 105 2 B 140 115 NA ``` ] -- .pull-right[ ```r (long_df <- wide_df %>% pivot_longer( cols = c(bp1, bp2, bp3), names_to = "measurement", values_to = "value")) ``` ``` # A tibble: 6 × 3 id measurement value <chr> <chr> <dbl> 1 A bp1 100 2 A bp2 120 3 A bp3 105 4 B bp1 140 5 B bp2 115 6 B bp3 NA ``` Note that we now have an additional row to account for the `NA`. ] --- ### Tidy check .pull-left[ Make the table wider. ```r df ``` ``` # A tibble: 8 × 4 City Date Measurement Value <chr> <chr> <chr> <dbl> 1 CityA 2024-01-01 Temperature 20 2 CityA 2024-01-01 Humidity 80 3 CityA 2024-01-02 Temperature 22 4 CityA 2024-01-02 Humidity 82 5 CityB 2024-01-01 Temperature 18 6 CityB 2024-01-01 Humidity 85 7 CityB 2024-01-02 Temperature 19 8 CityB 2024-01-02 Humidity 88 ``` Not tidy, since each variable is not a column, e.g. temperature/humidity. "Tidyness" can be a bit ambiguous, but if a column has different units (e.g., degrees vs. % humidity), likely not tidy. ] .pull-right[ ```r df %>% pivot_wider( names_from = Measurement, values_from = Value ) ``` ``` # A tibble: 4 × 4 City Date Temperature Humidity <chr> <chr> <dbl> <dbl> 1 CityA 2024-01-01 20 80 2 CityA 2024-01-02 22 82 3 CityB 2024-01-01 18 85 4 CityB 2024-01-02 19 88 ``` ] --- .pull-left[ Make the table longer by having a column called Subject, values Math/Science. ```r scores ``` ``` # A tibble: 3 × 3 Student Math Science <chr> <dbl> <dbl> 1 Alice 85 90 2 Bob 92 78 3 Charlie 88 95 ``` Is the following code correct? ```r scores %>% pivot_longer( cols = c(Math, Science), names_to = Subject, values_to = Score ) ``` ] -- .pull-right[ No, since `scores` tries to refer to `Subject` and `Score` which are not columns. ```r scores %>% pivot_longer( cols = c(Math, Science), names_to = "Subject", values_to = "Score" ) ``` ``` # A tibble: 6 × 3 Student Subject Score <chr> <chr> <dbl> 1 Alice Math 85 2 Alice Science 90 3 Bob Math 92 4 Bob Science 78 5 Charlie Math 88 6 Charlie Science 95 ``` ]