Dates and times: complications

A number of things contribute to dates and times being more complex than one might think.

  • Not all years have 365 days
    • The actual rule that determines leap years: a year is a leap year if it’s divisible by 4, unless it’s also divisible by 100, except if it’s also divisible by 400.
    • In other words, in every set of 400 years, there are 97 leap years.
  • Not every day has 24 hours
    • With daylight saving time, one day each year has 23 hours and another has 25
  • Time zones are difficult!
  • We will be using the lubridate package (loaded with the tidyverse since version 2.0.0) and nycflights13.
library(tidyverse)
library(nycflights13)
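The leap-year rule above can be written as a one-line predicate. This is only a sketch: is_leap() is a hypothetical helper name (not part of lubridate), and it is compared against lubridate's own leap_year() as a sanity check.

```r
library(lubridate)

# Hypothetical helper implementing the rule: divisible by 4,
# unless divisible by 100, except if divisible by 400.
is_leap <- function(year) {
  (year %% 4 == 0 & year %% 100 != 0) | year %% 400 == 0
}

years <- 2001:2400
n_leap <- sum(is_leap(years))                      # 97 in a 400-year cycle
agrees <- all(is_leap(years) == leap_year(years))  # same answers as leap_year()
```

Any window of 400 consecutive years gives the same count, because the rule repeats with period 400.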

Creating dates and times

Three types of date/time data:

  • A date. Tibbles print this as <date>.

  • A time within a day. Tibbles print this as <time>.

  • A date-time is a date plus a time: it uniquely identifies an instant in time (typically to the nearest second). Tibbles print this as <dttm>. Base R calls these POSIXct, but that name doesn’t exactly trip off the tongue.

  • We are going to focus on dates and date-times as R doesn’t have a native class for storing times. If you need one, you can use the hms package.

To get the current date or date-time you can use today() or now():

today()
[1] "2024-01-23"
now()
[1] "2024-01-23 19:52:32 PST"
(class(now()))
[1] "POSIXct" "POSIXt" 

Dates and times from strings

  • lubridate has a family of functions that create dates from strings; each is named for the order of “y”, “m”, and “d” in the input
ymd("2017-01-31")
[1] "2017-01-31"
mdy("January 31st, 2017")
[1] "2017-01-31"
mdy("January 31, 2017")
[1] "2017-01-31"
dmy("31-Jan-2017")
[1] "2017-01-31"
  • To create date-times, you can add an underscore and then one or more of “h”, “m”, “s”.
(ymd_hms("2017-01-31 20:11:59"))
[1] "2017-01-31 20:11:59 UTC"
mdy_hm("01/31/2017 08:01")
[1] "2017-01-31 08:01:00 UTC"
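  • Times are assumed to be in the UTC time zone; change this with tz=, using an IANA zone name such as "America/Los_Angeles" (abbreviations like "PST" are not reliable zone names)
mdy_hm("01/31/2017 08:01", tz = "America/Los_Angeles")
[1] "2017-01-31 08:01:00 PST"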
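To see what tz= does, it may help to parse the same string twice and compare the two instants. The sketch below assumes the system time zone database includes "America/Los_Angeles" (it does on standard R installations); the variable names are illustrative.

```r
library(lubridate)

# Same clock time, parsed in two different zones.
t_utc <- mdy_hm("01/31/2017 08:01")                              # UTC by default
t_la  <- mdy_hm("01/31/2017 08:01", tz = "America/Los_Angeles")  # IANA zone name

hour(t_la)    # 8: the clock time is kept, only the zone differs
t_la - t_utc  # Time difference of 8 hours
```

The string is read as a wall-clock time in the requested zone, so 08:01 in Los Angeles is a later instant than 08:01 UTC.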

Creating date-times from dplyr parts

Remember how flights stored some of the date information:

(flights_select <- flights %>% select(
  year, month, day, hour, minute))
# A tibble: 336,776 × 5
    year month   day  hour minute
   <int> <int> <int> <dbl>  <dbl>
 1  2013     1     1     5     15
 2  2013     1     1     5     29
...
  • To create a date or date-time from these columns, use make_date() or make_datetime():
flights_select %>%
  mutate(departure = make_datetime(year, month, day, hour, minute))
# A tibble: 336,776 × 6
    year month   day  hour minute departure          
   <int> <int> <int> <dbl>  <dbl> <dttm>             
 1  2013     1     1     5     15 2013-01-01 05:15:00
 2  2013     1     1     5     29 2013-01-01 05:29:00
...

Creating date-times from dplyr parts

  • We’ll now do a similar computation for the four time columns in flights
  • We’ll do so using a function - we haven’t seen this yet, but we will see it in a couple weeks
make_datetime_100 <- function(year, month, day, time) {
  make_datetime(year, month, day, time %/% 100, time %% 100)
} # recall: time is an integer like 517 (meaning 5:17), so %/% 100 gives the hour and %% 100 the minute

flights_dt <- flights |> 
  filter(!is.na(dep_time), !is.na(arr_time)) |> 
  mutate(
    dep_time = make_datetime_100(year, month, day, dep_time),
    arr_time = make_datetime_100(year, month, day, arr_time),
    sched_dep_time = make_datetime_100(year, month, day, sched_dep_time),
    sched_arr_time = make_datetime_100(year, month, day, sched_arr_time)
  ) |> 
  select(origin, dest, ends_with("delay"), ends_with("time"))

flights_dt
# A tibble: 328,063 × 9
   origin dest  dep_delay arr_delay dep_time            sched_dep_time     
   <chr>  <chr>     <dbl>     <dbl> <dttm>              <dttm>             
 1 EWR    IAH           2        11 2013-01-01 05:17:00 2013-01-01 05:15:00
...
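The integer arithmetic inside make_datetime_100() can be checked on a few made-up HHMM values. This is a small illustration, not part of the flights pipeline; the sample times are invented.

```r
library(lubridate)

time <- c(517, 1545, 5)    # 5:17, 15:45, and 0:05 in HHMM form
hours   <- time %/% 100    # 5 15 0   (integer division drops the minutes)
minutes <- time %% 100     # 17 45 5  (remainder keeps the minutes)

dt <- make_datetime(2013, 1, 1, hours, minutes)
dt
```

The scalar year, month, and day are recycled against the vectors of hours and minutes.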

Updated flights df with times for arrivals/departures

  • We’ll now use this updated df
flights_dt %>%
  filter(dep_time < ymd(20130102))
# A tibble: 837 × 9
   origin dest  dep_delay arr_delay dep_time            sched_dep_time     
   <chr>  <chr>     <dbl>     <dbl> <dttm>              <dttm>             
 1 EWR    IAH           2        11 2013-01-01 05:17:00 2013-01-01 05:15:00
 2 LGA    IAH           4        20 2013-01-01 05:33:00 2013-01-01 05:29:00
 3 JFK    MIA           2        33 2013-01-01 05:42:00 2013-01-01 05:40:00
 4 JFK    BQN          -1       -18 2013-01-01 05:44:00 2013-01-01 05:45:00
 5 LGA    ATL          -6       -25 2013-01-01 05:54:00 2013-01-01 06:00:00
 6 EWR    ORD          -4        12 2013-01-01 05:54:00 2013-01-01 05:58:00
 7 EWR    FLL          -5        19 2013-01-01 05:55:00 2013-01-01 06:00:00
 8 LGA    IAD          -3       -14 2013-01-01 05:57:00 2013-01-01 06:00:00
 9 JFK    MCO          -3        -8 2013-01-01 05:57:00 2013-01-01 06:00:00
10 LGA    ORD          -2         8 2013-01-01 05:58:00 2013-01-01 06:00:00
# ℹ 827 more rows
# ℹ 3 more variables: arr_time <dttm>, sched_arr_time <dttm>, air_time <dbl>

Date-time components

  • Accessor functions which are helpful for date-time types:
    • year(), month(), hour(), minute(), and second()
    • mday() (day of the month), yday() (day of the year), wday() (day of the week)
datetime <- ymd_hms("2026-07-08 12:34:56")
year(datetime)
[1] 2026
month(datetime)
[1] 7
mday(datetime)
[1] 8
yday(datetime)
[1] 189
wday(datetime) # 2026-07-08 is Weds. (Sun.=1)
[1] 4
  • month() and wday() can have label=TRUE, returns abbreviated name of month/day
  • Set abbr=FALSE to get full name
datetime <- ymd_hms("2026-07-08 12:34:56")
month(datetime, label=TRUE) 
[1] Jul
12 Levels: Jan < Feb < Mar < Apr < May < Jun < Jul < Aug < Sep < ... < Dec
wday(datetime, label=TRUE, abbr = FALSE)
[1] Wednesday
7 Levels: Sunday < Monday < Tuesday < Wednesday < Thursday < ... < Saturday

Date-time components

With this you can do things like find the minute of the hour with the highest average departure delay:

flights_dt %>%
  mutate(minute = minute(dep_time)) %>%
  group_by(minute) %>%
  summarize(avg_delay = mean(dep_delay, na.rm = TRUE)) %>%
  arrange(desc(avg_delay))
# A tibble: 60 × 2
   minute avg_delay
    <int>     <dbl>
 1     17      18.6
 2     32      17.8
 3     34      17.8
 4     33      17.7
 5     37      17.5
 6     15      17.2
 7     13      17.1
 8     36      17.1
 9     16      17.1
10     18      17.0
# ℹ 50 more rows

Rounding

  • There are analogues of the standard rounding functions for dates
    • floor_date(), ceiling_date()
    • round_date()
  • They take a vector of dates to adjust and the name of the unit to round to (week, day, etc.)
flights_dt %>%
  mutate(year = floor_date(dep_time, "year")) %>%
  select(dep_time, year)
# A tibble: 328,063 × 2
   dep_time            year               
   <dttm>              <dttm>             
 1 2013-01-01 05:17:00 2013-01-01 00:00:00
 2 2013-01-01 05:33:00 2013-01-01 00:00:00
 3 2013-01-01 05:42:00 2013-01-01 00:00:00
 4 2013-01-01 05:44:00 2013-01-01 00:00:00
 5 2013-01-01 05:54:00 2013-01-01 00:00:00
 6 2013-01-01 05:54:00 2013-01-01 00:00:00
 7 2013-01-01 05:55:00 2013-01-01 00:00:00
...
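The three rounding functions are easiest to compare on a single value. A minimal sketch with an arbitrary date-time:

```r
library(lubridate)

x <- ymd_hms("2013-01-09 14:45:00")

x_floor   <- floor_date(x, "day")      # 2013-01-09 00:00:00 UTC
x_round   <- round_date(x, "hour")     # 2013-01-09 15:00:00 UTC
x_ceiling <- ceiling_date(x, "month")  # 2013-02-01 00:00:00 UTC
```

floor_date() always moves back to the start of the unit, ceiling_date() moves forward to the start of the next one, and round_date() picks whichever boundary is nearer.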

Examples

  • Let’s compare the average departure delay for flights in two groups:
    • flights departing in minutes 20-30 or 50-59 of the hour vs. all other flights
flights_dt %>%
  mutate(dep_minute = minute(dep_time),
         mins_2030 = dep_minute >= 20 & dep_minute <= 30,
         mins_5060 = dep_minute >= 50 & dep_minute <= 59,
         mins_2030_or_5060 = mins_2030 | mins_5060) %>%
  group_by(mins_2030_or_5060) %>%
  summarize(avg_dep_delay = mean(dep_delay, na.rm=TRUE),
            n = n())
# A tibble: 2 × 3
  mins_2030_or_5060 avg_dep_delay      n
  <lgl>                     <dbl>  <int>
1 FALSE                     15.5  181621
2 TRUE                       8.90 146442

Time spans and date-time arithmetic

Three important classes representing time spans:

  • Durations: exact time, measured in seconds
  • Periods: human units, like weeks/months
  • Intervals: a time span defined by a specific start and end point
  • When subtracting two dates, we get a “difftime” object:
# How old is Hadley?
h_age <- today() - ymd("1979-10-14")
h_age
Time difference of 16172 days
  • The unit of a difftime varies: it can be seconds, minutes, hours, days, or weeks.
  • as.duration() always uses seconds:
as.duration(h_age)
[1] "1397260800s (~44.28 years)"
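The changing unit of a difftime is easy to see by subtracting values that are close together and values that are far apart. A short sketch with made-up dates:

```r
library(lubridate)

short_gap <- ymd_hms("2024-01-01 00:10:00") - ymd_hms("2024-01-01 00:00:00")
long_gap  <- ymd("2024-01-03") - ymd("2024-01-01")

units(short_gap)   # "mins"
units(long_gap)    # "days"

# as.duration() puts both on the same footing: seconds
short_dur <- as.duration(short_gap)   # 600s (~10 minutes)
long_dur  <- as.duration(long_gap)    # 172800s (~2 days)
```

Converting to a duration first avoids having to check which unit a particular subtraction happened to produce.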

Constructing durations

  • Durations always record time span in seconds
  • Useful constructors: d{units}(), where {units} is seconds, minutes, hours, days, etc.
    • Months vary in length (28 to 31 days), so a month has no well-defined length in seconds
dseconds(15)
[1] "15s"
dminutes(10)
[1] "600s (~10 minutes)"
dhours(c(12, 24))
[1] "43200s (~12 hours)" "86400s (~1 days)"  
  • Can add and multiply durations:
2 * dyears(1)
[1] "63115200s (~2 years)"
dyears(1) + dweeks(12) + dhours(5)
[1] "38833200s (~1.23 years)"
  • Can add durations to dates and subtract them from dates
(tomorrow <- today() + ddays(1)) # returns Date
[1] "2024-01-24"
(last_year <- today() - dyears(1)) # returns date-time
[1] "2023-01-22 18:00:00 UTC"

Duration computations and weird results

  • Durations represent an exact number of seconds
  • So around daylight saving time changes, results can look odd
one_am <- ymd_hms("2026-03-08 01:00:00", tz = "America/New_York")
one_am
[1] "2026-03-08 01:00:00 EST"
one_am + ddays(1)
[1] "2026-03-09 02:00:00 EDT"
  • March 8, 2026 has only 23 hours in New York because clocks jump from EST to EDT, so adding a full day’s worth of seconds lands one hour later on the clock.
  • lubridate provides periods to address this

Periods

  • Periods are time spans but don’t have fixed length in seconds
    • Work like “human” times, i.e. days/months
one_am
[1] "2026-03-08 01:00:00 EST"
one_am + days(1)
[1] "2026-03-09 01:00:00 EDT"
one_am + ddays(1)
[1] "2026-03-09 02:00:00 EDT"
  • Similar to durations, useful constructors and behavior under +/*:
hours(c(12, 24))
[1] "12H 0M 0S" "24H 0M 0S"
days(7)
[1] "7d 0H 0M 0S"
months(1:3)
[1] "1m 0d 0H 0M 0S" "2m 0d 0H 0M 0S" "3m 0d 0H 0M 0S"
10 * (months(6) + days(2))
[1] "60m 20d 0H 0M 0S"

Durations vs. periods

  • Adding periods can be a bit more in line with expectations
# A leap year
ymd("2024-01-01") + dyears(1)
#> [1] "2024-12-31 06:00:00 UTC"
ymd("2024-01-01") + years(1)
#> [1] "2025-01-01"

# Daylight saving time
one_am + ddays(1)
#> [1] "2026-03-09 02:00:00 EDT"
one_am + days(1)
#> [1] "2026-03-09 01:00:00 EDT"
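One way to see the difference is to measure how much real time each addition covers. A minimal sketch that reuses the one_am value from these slides:

```r
library(lubridate)

one_am <- ymd_hms("2026-03-08 01:00:00", tz = "America/New_York")

# Elapsed real time after adding "one day" as a period vs. as a duration
period_hours   <- as.numeric((one_am + days(1))  - one_am, units = "hours")  # 23
duration_hours <- as.numeric((one_am + ddays(1)) - one_am, units = "hours")  # 24
```

The period keeps the clock time and lets the elapsed time vary; the duration keeps the elapsed time and lets the clock time vary.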

Fixing a bug in flights_dt

  • We used the same date for both departure and arrival times, but overnight flights really arrived on the following day.
flights_dt %>% filter(arr_time < dep_time) %>% select(origin, dest, arr_time, dep_time)
# A tibble: 10,633 × 4
   origin dest  arr_time            dep_time           
   <chr>  <chr> <dttm>              <dttm>             
 1 EWR    BQN   2013-01-01 00:03:00 2013-01-01 19:29:00
 2 JFK    DFW   2013-01-01 00:29:00 2013-01-01 19:39:00
...
flights_dt %>% mutate(
  overnight = arr_time < dep_time, # returns T/F
  arr_time = arr_time + days(overnight),
  sched_arr_time = sched_arr_time + days(overnight)
) %>% filter(arr_time < dep_time)
# A tibble: 0 × 10
# ℹ 10 variables: origin <chr>, dest <chr>, dep_delay <dbl>, arr_delay <dbl>,
#   dep_time <dttm>, sched_dep_time <dttm>, arr_time <dttm>,
#   sched_arr_time <dttm>, air_time <dbl>, overnight <lgl>
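The fix relies on days() treating TRUE as 1 and FALSE as 0, so only the overnight rows are shifted. A small sketch on two invented flights:

```r
library(lubridate)

dep <- ymd_hms(c("2013-01-01 19:29:00", "2013-01-01 08:00:00"))
arr <- ymd_hms(c("2013-01-01 00:03:00", "2013-01-01 11:00:00"))

overnight <- arr < dep               # TRUE FALSE
arr_fixed <- arr + days(overnight)   # adds one day only where overnight is TRUE
arr_fixed
```

After the shift, every arrival is at or after its departure.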

Intervals

  • dyears(1) / ddays(365) does not return 1, since dyears() is defined as the number of seconds per average year: 365.25 days.
  • years(1) / days(1) has no exact answer, since a year has 365 or 366 days depending on whether it is a leap year.
  • Intervals define a specific span of time, using a pair of start and end date-times.
  • Format: start %--% end:
(y2023 <- ymd("2023-01-01") %--% ymd("2024-01-01"))
[1] 2023-01-01 UTC--2024-01-01 UTC
(y2024 <- ymd("2024-01-01") %--% ymd("2025-01-01"))
[1] 2024-01-01 UTC--2025-01-01 UTC
  • Can divide by days() to find out how many days/year:
y2023 / days(1)
[1] 365
y2024 / days(1)
[1] 366
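The same idea works for spans shorter than a year. The sketch below uses the first quarter of two different years as an illustrative choice; the variable names are made up.

```r
library(lubridate)

# Duration arithmetic: dyears(1) is 365.25 days' worth of seconds
ratio <- dyears(1) / ddays(365)    # slightly more than 1

# Interval arithmetic: anchored to real calendar dates
q1_2023 <- ymd("2023-01-01") %--% ymd("2023-04-01")
q1_2024 <- ymd("2024-01-01") %--% ymd("2024-04-01")

days_2023 <- q1_2023 / days(1)     # 90
days_2024 <- q1_2024 / days(1)     # 91, because 2024 has a February 29
```

Because an interval knows its start and end dates, dividing by days(1) gives an exact count instead of an average.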

Example: extracting dates and computing durations

  • Suppose we are given a tibble whose first three rows are as follows:
 df <- tribble(
    ~name, ~entry,
    "Mary", "First arrival: January 2, 2005; Second arrival: January 6, 2023",
    "Will", "First time: January 5, 1997; Second time: January 8, 2015",
    "Jose", "First visit: January 4, 1990; Second visit: January 9, 2008",
  )
  • Task: return a tibble giving the number of days elapsed between the first and second visit, sorted from most days to fewest
  • Complex task! Need to:
    • Parse the “entry” column to extract the dates
    • Turn them into proper dates
    • Compute the number of days between visits (leap years happen at different times!)
    • Order by number of days
  • Let’s start by parsing “entry” and creating two columns for the two dates

  • entry elements look like: “First arrival: January 2, 2005; Second arrival: January 6, 2023”
( df2 <- df %>% mutate(
    date1 = mdy(str_replace(entry, "(.*): (.*); (.*): (.*)", "\\2")),
    date2 = mdy(str_replace(entry, "(.*): (.*); (.*): (.*)", "\\4"))
  ) )
# A tibble: 3 × 4
  name  entry                                              date1      date2     
  <chr> <chr>                                              <date>     <date>    
1 Mary  First arrival: January 2, 2005; Second arrival: J… 2005-01-02 2023-01-06
2 Will  First time: January 5, 1997; Second time: January… 1997-01-05 2015-01-08
3 Jose  First visit: January 4, 1990; Second visit: Janua… 1990-01-04 2008-01-09
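The backreferences "\\2" and "\\4" are what pull the dates out: the pattern matches the whole string, and the replacement keeps only one captured group. A sketch on a single entry (the variable names are illustrative):

```r
library(stringr)

entry   <- "First arrival: January 2, 2005; Second arrival: January 6, 2023"
pattern <- "(.*): (.*); (.*): (.*)"

# Groups: 1 = first label, 2 = first date, 3 = second label, 4 = second date
first_date  <- str_replace(entry, pattern, "\\2")   # "January 2, 2005"
second_date <- str_replace(entry, pattern, "\\4")   # "January 6, 2023"
```

This works here because each entry has exactly one "; " and two ": " separators; entries with a different layout would need a different pattern.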
  • Now we want to compute the number of days between the two dates: wrap the difference in days()!
df2 %>% mutate(days_elapsed = days(date2 - date1) )
# A tibble: 3 × 5
  name  entry                               date1      date2      days_elapsed  
  <chr> <chr>                               <date>     <date>     <Period>      
1 Mary  First arrival: January 2, 2005; Se… 2005-01-02 2023-01-06 6578d 0H 0M 0S
2 Will  First time: January 5, 1997; Secon… 1997-01-05 2015-01-08 6577d 0H 0M 0S
3 Jose  First visit: January 4, 1990; Seco… 1990-01-04 2008-01-09 6579d 0H 0M 0S
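If a plain number is more convenient than a Period column, one alternative (an assumption about what is wanted, not the approach used on these slides) is to convert the difftime to a double. The sketch rebuilds the three rows by hand so it runs on its own; visits and visits_sorted are illustrative names.

```r
library(dplyr)
library(lubridate)

visits <- tibble(
  name  = c("Mary", "Will", "Jose"),
  date1 = mdy(c("January 2, 2005", "January 5, 1997", "January 4, 1990")),
  date2 = mdy(c("January 6, 2023", "January 8, 2015", "January 9, 2008"))
)

visits_sorted <- visits %>%
  # difftime converted to a plain double, always measured in days
  mutate(days_elapsed = as.numeric(date2 - date1, units = "days")) %>%
  arrange(desc(days_elapsed))

visits_sorted
```

A numeric column sorts, sums, and plots like any other number.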

Examples

  • Now we just need to sort by number of days, largest first
df %>% mutate(
    date1 = mdy(str_replace(entry, "(.*): (.*); (.*): (.*)", "\\2")),
    date2 = mdy(str_replace(entry, "(.*): (.*); (.*): (.*)", "\\4")),
    days_elapsed = days(date2-date1)
  ) %>% arrange(desc(days_elapsed))
# A tibble: 3 × 5
  name  entry                               date1      date2      days_elapsed  
  <chr> <chr>                               <date>     <date>     <Period>      
1 Jose  First visit: January 4, 1990; Seco… 1990-01-04 2008-01-09 6579d 0H 0M 0S
2 Mary  First arrival: January 2, 2005; Se… 2005-01-02 2023-01-06 6578d 0H 0M 0S
3 Will  First time: January 5, 1997; Secon… 1997-01-05 2015-01-08 6577d 0H 0M 0S
  • We could also have done separate_wider_regex() to parse the string initially. How would we do that?

Examples

df2 <- df %>% separate_wider_regex(
  entry,
  patterns = c(
    ".*",
    ": ",
    date1 = ".*",
    ";",
    ".*",
    ": ",
    date2 = ".*"
  ) ) %>%
  mutate(date1 = mdy(date1), date2 = mdy(date2))
  • Original df:
df
# A tibble: 3 × 2
  name  entry                                                          
  <chr> <chr>                                                          
1 Mary  First arrival: January 2, 2005; Second arrival: January 6, 2023
2 Will  First time: January 5, 1997; Second time: January 8, 2015      
3 Jose  First visit: January 4, 1990; Second visit: January 9, 2008    
df2
# A tibble: 3 × 3
  name  date1      date2     
  <chr> <date>     <date>    
1 Mary  2005-01-02 2023-01-06
2 Will  1997-01-05 2015-01-08
3 Jose  1990-01-04 2008-01-09

Example - string formatting errors for dates

  • Let’s suppose we are given a character vector of supposed dates
ds <- c("2022-01-08", "202-01-09", "2022/01/10")
  • And suppose we want to return a new date vector satisfying the following:
    • if the entry is in the correct format, it is the parsed date
    • otherwise, it is January 1 of year 1.
  • We need to flag which elements of ds have the format of either YYYY-MM-DD or YYYY/MM/DD, i.e. return a vector with booleans saying whether that element has the correct format.
  • We’ll use regular expressions
pattern_regex <- "^\\d{4}[-/]\\d{2}[-/]\\d{2}$"
str_detect(ds, pattern_regex)
[1]  TRUE FALSE  TRUE

Example - string formatting errors for dates

ds
[1] "2022-01-08" "202-01-09"  "2022/01/10"
str_detect(ds, "^\\d{4}[-/]\\d{2}[-/]\\d{2}$")
[1]  TRUE FALSE  TRUE
  • Now we can use case_when() to fall back to year 1 when the format is wrong:
df <- tibble(entered = ds)
df %>% mutate(new_date = case_when(
  str_detect(entered, "^\\d{4}[-/]\\d{2}[-/]\\d{2}$") ~ ymd(entered, quiet=TRUE),
  TRUE ~ ymd("0001-01-01") )
  )
# A tibble: 3 × 2
  entered    new_date  
  <chr>      <date>    
1 2022-01-08 2022-01-08
2 202-01-09  1-01-01   
3 2022/01/10 2022-01-10
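A format check alone does not guarantee a real calendar date: "2022-02-30" has the right shape but does not exist. The sketch below adds that hypothetical fourth entry and combines the regex test with a check for NA after parsing; well_formed, parsed, and result are illustrative names.

```r
library(lubridate)
library(stringr)

ds <- c("2022-01-08", "202-01-09", "2022/01/10", "2022-02-30")
pattern_regex <- "^\\d{4}[-/]\\d{2}[-/]\\d{2}$"

well_formed <- str_detect(ds, pattern_regex)   # TRUE FALSE TRUE TRUE
parsed      <- ymd(ds, quiet = TRUE)           # February 30 becomes NA

# Keep a parsed date only if it is well formed and a real date;
# otherwise fall back to January 1 of year 1.
result <- parsed
result[!(well_formed & !is.na(parsed))] <- ymd("0001-01-01")
result
```

Checking both conditions catches malformed strings and impossible dates.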
