Dates and times

Dates and times: complications

A number of things contribute to dates and times being more complex than one might think.

Not all years have 365 days

The actual rule that determines leap years:

A year is a leap year if it’s divisible by 4, 
unless it’s also divisible by 100, 
except if it’s also divisible by 400. 
In other words, in every set of 400 years, there are 97 leap years.
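The rule above can be sketched in a few lines of base R. The helper name `is_leap` is made up for this illustration (lubridate ships its own `leap_year()` for real use); the sketch just checks the three divisibility conditions and counts the leap years in one 400-year cycle.

```r
# Hypothetical helper for illustration: TRUE when `year` is a leap year
# under the divisible-by-4 / 100 / 400 rule.
is_leap <- function(year) {
  (year %% 4 == 0 & year %% 100 != 0) | (year %% 400 == 0)
}

checks <- is_leap(c(1900, 2000, 2023, 2024))  # FALSE TRUE FALSE TRUE
n_leap <- sum(is_leap(1:400))                 # 97 leap years per 400 years
```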

Not every day in every location has 24 hours
- Daylight saving time implies one day has 23 hours, another has 25
Time zones are difficult!
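To make the day-length point concrete, here is a small base R sketch that measures the time between consecutive local midnights. The helper `day_length_hours` is a name invented for this example, and the dates assume the 2026 US daylight saving schedule (clocks change on March 8 and November 1) as recorded in the system time zone database.

```r
# Hypothetical helper: hours between local midnight on `date` and the
# next local midnight in time zone `tz`.
day_length_hours <- function(date, tz) {
  start <- as.POSIXct(paste(date, "00:00:00"), tz = tz)
  end   <- as.POSIXct(paste(as.Date(date) + 1, "00:00:00"), tz = tz)
  as.numeric(difftime(end, start, units = "hours"))
}

spring <- day_length_hours("2026-03-08", "America/New_York")  # clocks go forward
fall   <- day_length_hours("2026-11-01", "America/New_York")  # clocks go back
summer <- day_length_hours("2026-07-08", "America/New_York")  # ordinary day
```

On a system with an up-to-date time zone database, these come out as 23, 25, and 24 hours respectively.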
We will use the lubridate package (loaded by recent versions of tidyverse) and nycflights13.

library(tidyverse)
library(nycflights13)

Creating dates and times

Three types of date/time data:

A date. Tibbles print this as <date>.
A time within a day. Tibbles print this as <time>.
A date-time is a date plus a time: it uniquely identifies an instant in time (typically to the nearest second). Tibbles print this as <dttm>. Base R calls these POSIXct, but that doesn’t exactly trip off the tongue.
We are going to focus on dates and date-times as R doesn’t have a native class for storing times. If you need one, you can use the hms package.

To get the current date or date-time you can use today() or now():

today()

[1] "2024-01-23"

now()

[1] "2024-01-23 19:52:32 PST"

(class(now()))

[1] "POSIXct" "POSIXt"

Dates and times from strings

A number of functions create dates from strings: their names are three-letter combinations of “y”, “m”, “d”, in the order the components appear in the string

ymd("2017-01-31")

[1] "2017-01-31"

mdy("January 31st, 2017")

[1] "2017-01-31"

mdy("January 31, 2017")

[1] "2017-01-31"

dmy("31-Jan-2017")

[1] "2017-01-31"
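A few more parsing cases, as a hedged sketch: the same functions accept unquoted numbers and vectors, and return NA when nothing can be parsed. The input values here are made up for illustration; `quiet = TRUE` only suppresses the parse-failure warning.

```r
library(lubridate)

from_number <- ymd(20170131)                       # unquoted number
from_vector <- ymd(c("2017-01-31", "2017/02/01"))  # vectorised, either separator
unparseable <- ymd("not a date", quiet = TRUE)     # NA instead of a date
```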

To create date-times, you can add an underscore and then one or more of “h”, “m”, “s”.

(ymd_hms("2017-01-31 20:11:59"))

[1] "2017-01-31 20:11:59 UTC"

mdy_hm("01/31/2017 08:01")

[1] "2017-01-31 08:01:00 UTC"

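Times are assumed to be in the UTC time zone; change this with tz=, using an IANA time zone name such as "America/Los_Angeles" (abbreviations like "PST" are not reliable identifiers). The clock time stays as written; only the zone it is interpreted in changes.

mdy_hm("01/31/2017 08:01", tz = "America/Los_Angeles")

[1] "2017-01-31 08:01:00 PST"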

Creating date-times from dplyr parts

Remember how flights stored some of the date information:

(flights_select <- flights %>% select(
  year, month, day, hour, minute))

# A tibble: 336,776 × 5
    year month   day  hour minute
   <int> <int> <int> <dbl>  <dbl>
 1  2013     1     1     5     15
 2  2013     1     1     5     29
...

To create date/time from this, can use make_date() or make_datetime():

flights_select %>%
  mutate(departure = make_datetime(year, month, day, hour, minute))

# A tibble: 336,776 × 6
    year month   day  hour minute departure          
   <int> <int> <int> <dbl>  <dbl> <dttm>             
 1  2013     1     1     5     15 2013-01-01 05:15:00
 2  2013     1     1     5     29 2013-01-01 05:29:00
...

Creating date-times from dplyr parts

We’ll now do a similar computation for the four time columns in flights.
We’ll do so using a function; we haven’t seen functions yet, but we will cover them in a couple of weeks.

make_datetime_100 <- function(year, month, day, time) {
  make_datetime(year, month, day, time %/% 100, time %% 100)
} # recall: time is an integer, use modular arithmetic to convert 
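The modular arithmetic inside make_datetime_100() can be checked on a few sample values. This is a minimal sketch with made-up times in the same HHMM integer layout that flights uses:

```r
library(lubridate)

hhmm <- c(517L, 2359L, 5L)   # 5:17, 23:59, 00:05 stored as integers

hrs  <- hhmm %/% 100         # integer division keeps the hours: 5 23 0
mins <- hhmm %% 100          # remainder keeps the minutes: 17 59 5

stamps <- make_datetime(2013, 1, 1, hrs, mins)
```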

flights_dt <- flights |> 
  filter(!is.na(dep_time), !is.na(arr_time)) |> 
  mutate(
    dep_time = make_datetime_100(year, month, day, dep_time),
    arr_time = make_datetime_100(year, month, day, arr_time),
    sched_dep_time = make_datetime_100(year, month, day, sched_dep_time),
    sched_arr_time = make_datetime_100(year, month, day, sched_arr_time)
  ) |> 
  select(origin, dest, ends_with("delay"), ends_with("time"))

flights_dt

# A tibble: 328,063 × 9
   origin dest  dep_delay arr_delay dep_time            sched_dep_time     
   <chr>  <chr>     <dbl>     <dbl> <dttm>              <dttm>             
 1 EWR    IAH           2        11 2013-01-01 05:17:00 2013-01-01 05:15:00
...

Updated flights df with times for arrivals/departures

We’ll now use this updated df

flights_dt %>%
  filter(dep_time < ymd(20130102))

# A tibble: 837 × 9
   origin dest  dep_delay arr_delay dep_time            sched_dep_time     
   <chr>  <chr>     <dbl>     <dbl> <dttm>              <dttm>             
 1 EWR    IAH           2        11 2013-01-01 05:17:00 2013-01-01 05:15:00
 2 LGA    IAH           4        20 2013-01-01 05:33:00 2013-01-01 05:29:00
 3 JFK    MIA           2        33 2013-01-01 05:42:00 2013-01-01 05:40:00
 4 JFK    BQN          -1       -18 2013-01-01 05:44:00 2013-01-01 05:45:00
 5 LGA    ATL          -6       -25 2013-01-01 05:54:00 2013-01-01 06:00:00
 6 EWR    ORD          -4        12 2013-01-01 05:54:00 2013-01-01 05:58:00
 7 EWR    FLL          -5        19 2013-01-01 05:55:00 2013-01-01 06:00:00
 8 LGA    IAD          -3       -14 2013-01-01 05:57:00 2013-01-01 06:00:00
 9 JFK    MCO          -3        -8 2013-01-01 05:57:00 2013-01-01 06:00:00
10 LGA    ORD          -2         8 2013-01-01 05:58:00 2013-01-01 06:00:00
# ℹ 827 more rows
# ℹ 3 more variables: arr_time <dttm>, sched_arr_time <dttm>, air_time <dbl>

Date-time components

Accessor functions which are helpful for date-time types:
- year(), month(), hour(), minute(), and second()
- mday() (day of the month), yday() (day of the year), wday() (day of the week)

datetime <- ymd_hms("2026-07-08 12:34:56")
year(datetime)

[1] 2026

month(datetime)

[1] 7

mday(datetime)

[1] 8

yday(datetime)

[1] 189

wday(datetime) # 2026-07-08 is Weds. (Sun.=1)

[1] 4

month() and wday() can have label=TRUE, returns abbreviated name of month/day
Set abbr=FALSE to get full name

datetime <- ymd_hms("2026-07-08 12:34:56")
month(datetime, label=TRUE)

[1] Jul
12 Levels: Jan < Feb < Mar < Apr < May < Jun < Jul < Aug < Sep < ... < Dec

wday(datetime, label=TRUE, abbr = FALSE)

[1] Wednesday
7 Levels: Sunday < Monday < Tuesday < Wednesday < Thursday < ... < Saturday

Date-time components

With this you can do things like calculate the minute with the highest departure delays:

flights_dt %>%
  mutate(minute = minute(dep_time)) %>%
  group_by(minute) %>%
  summarize(avg_delay = mean(dep_delay, na.rm = TRUE)) %>%
  arrange(desc(avg_delay))

# A tibble: 60 × 2
   minute avg_delay
    <int>     <dbl>
 1     17      18.6
 2     32      17.8
 3     34      17.8
 4     33      17.7
 5     37      17.5
 6     15      17.2
 7     13      17.1
 8     36      17.1
 9     16      17.1
10     18      17.0
# ℹ 50 more rows

Rounding

There are analogues of the standard rounding functions for dates
- floor_date(), ceiling_date()
- round_date()
They take vector of dates to adjust, name of unit (week, day, etc)

flights_dt %>%
  mutate(year = floor_date(dep_time, "year")) %>%
  select(dep_time, year)

# A tibble: 328,063 × 2
   dep_time            year               
   <dttm>              <dttm>             
 1 2013-01-01 05:17:00 2013-01-01 00:00:00
 2 2013-01-01 05:33:00 2013-01-01 00:00:00
 3 2013-01-01 05:42:00 2013-01-01 00:00:00
 4 2013-01-01 05:44:00 2013-01-01 00:00:00
 5 2013-01-01 05:54:00 2013-01-01 00:00:00
 6 2013-01-01 05:54:00 2013-01-01 00:00:00
 7 2013-01-01 05:55:00 2013-01-01 00:00:00
...
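To see how the three rounding functions differ, it can help to apply them to a single value. A small sketch using the same example date-time as in the component slides:

```r
library(lubridate)

dt <- ymd_hms("2026-07-08 12:34:56")

down    <- floor_date(dt, "hour")     # 2026-07-08 12:00:00
nearest <- round_date(dt, "hour")     # 2026-07-08 13:00:00 (34:56 is past the half hour)
up      <- ceiling_date(dt, "hour")   # 2026-07-08 13:00:00
month_start <- floor_date(dt, "month")  # 2026-07-01 00:00:00
```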

Examples

Let’s compute the average delay time of flights which depart at times in two groups:
- departure time is between minutes 20-30 and 50-60 vs. the other times

flights_dt %>%
  mutate(dep_minute = minute(dep_time),
         mins_2030 = dep_minute >= 20 & dep_minute <= 30,
         mins_5060 = dep_minute >= 50 & dep_minute <= 59,
         mins_2030_or_5060 = mins_2030 | mins_5060) %>%
  group_by(mins_2030_or_5060) %>%
  summarize(avg_dep_delay = mean(dep_delay, na.rm=TRUE),
            n = n())

# A tibble: 2 × 3
  mins_2030_or_5060 avg_dep_delay      n
  <lgl>                     <dbl>  <int>
1 FALSE                     15.5  181621
2 TRUE                       8.90 146442

Time spans and date-time arithmetic

Three important classes representing time spans:

Durations: exact time, measured in seconds
Periods: human units, like weeks/months
Intervals: represent starting/end point

When subtracting two dates, we get a “difftime” object:

# How old is Hadley?
h_age <- today() - ymd("1979-10-14")
h_age

Time difference of 16172 days

Exact unit of difftime can vary from seconds, minutes, hours, days, or weeks.
as.duration() always uses seconds:

as.duration(h_age)

[1] "1397260800s (~44.28 years)"
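A short sketch of why the difftime unit matters: base R picks the unit from the size of the gap, so two subtractions can come back in different units, while as.duration() puts both in seconds. The timestamps are made up for illustration.

```r
library(lubridate)

start <- ymd_hms("2024-01-23 12:00:00")

short_gap <- ymd_hms("2024-01-23 12:00:30") - start  # reported in seconds
long_gap  <- ymd_hms("2024-01-26 12:00:00") - start  # reported in days

short_units <- units(short_gap)  # "secs"
long_units  <- units(long_gap)   # "days"

# Converting to durations gives both spans in seconds.
short_secs <- as.numeric(as.duration(short_gap))  # 30
long_secs  <- as.numeric(as.duration(long_gap))   # 259200
```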

Constructing durations

Durations always record time span in seconds
Useful constructors: d{units}, {units} is seconds, days, etc
- No way to convert month to duration since it’s not well-defined

dseconds(15)

[1] "15s"

dminutes(10)

[1] "600s (~10 minutes)"

dhours(c(12, 24))

[1] "43200s (~12 hours)" "86400s (~1 days)"

Can add and multiply durations:

2 * dyears(1)

[1] "63115200s (~2 years)"

dyears(1) + dweeks(12) + dhours(5)

[1] "38833200s (~1.23 years)"

Can add and subtract durations to and from dates

(tomorrow <- today() + ddays(1)) # returns Date

[1] "2024-01-24"

(last_year <- today() - dyears(1)) # returns date-time

[1] "2023-01-22 18:00:00 UTC"

Duration computations and weird results

Durations represent an exact number of seconds
So across a daylight saving time change, the results can be surprising

one_am <- ymd_hms("2026-03-08 01:00:00", tz = "America/New_York")
one_am

[1] "2026-03-08 01:00:00 EST"

one_am + ddays(1)

[1] "2026-03-09 02:00:00 EDT"

Adding a full day of seconds (86,400) lands at 2:00 rather than 1:00: the clocks moved forward an hour from EST to EDT in between, so March 8 has only 23 hours.
lubridate provides periods to address this

Periods

Periods are time spans but don’t have fixed length in seconds
- Work like “human” times, i.e. days/months

one_am

[1] "2026-03-08 01:00:00 EST"

one_am + days(1)

[1] "2026-03-09 01:00:00 EDT"

one_am + ddays(1)

[1] "2026-03-09 02:00:00 EDT"

Similar to durations, useful constructors and behavior under +/*:

hours(c(12, 24))

[1] "12H 0M 0S" "24H 0M 0S"

days(7)

[1] "7d 0H 0M 0S"

months(1:3)

[1] "1m 0d 0H 0M 0S" "2m 0d 0H 0M 0S" "3m 0d 0H 0M 0S"

10 * (months(6) + days(2))

[1] "60m 20d 0H 0M 0S"

Durations vs. periods

Adding periods can be a bit more in line with expectations

# A leap year
ymd("2024-01-01") + dyears(1)
#> [1] "2024-12-31 06:00:00 UTC"
ymd("2024-01-01") + years(1)
#> [1] "2025-01-01"

# Daylight saving time
one_am + ddays(1)
#> [1] "2026-03-09 02:00:00 EDT"
one_am + days(1)
#> [1] "2026-03-09 01:00:00 EDT"

Fixing a bug in `flights_dt`

We used the same date for departure and arrival times, but overnight flights actually arrived on the following day, so their arr_time appears earlier than their dep_time.

flights_dt %>% filter(arr_time < dep_time) %>% select(origin, dest, arr_time, dep_time)

# A tibble: 10,633 × 4
   origin dest  arr_time            dep_time           
   <chr>  <chr> <dttm>              <dttm>             
 1 EWR    BQN   2013-01-01 00:03:00 2013-01-01 19:29:00
 2 JFK    DFW   2013-01-01 00:29:00 2013-01-01 19:39:00
...

flights_dt %>% mutate(
  overnight = arr_time < dep_time, # returns T/F
  arr_time = arr_time + days(overnight),
  sched_arr_time = sched_arr_time + days(overnight)
) %>% filter(arr_time < dep_time)

# A tibble: 0 × 10
# ℹ 10 variables: origin <chr>, dest <chr>, dep_delay <dbl>, arr_delay <dbl>,
#   dep_time <dttm>, sched_dep_time <dttm>, arr_time <dttm>,
#   sched_arr_time <dttm>, air_time <dbl>, overnight <lgl>
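The fix relies on a logical vector being usable as a number of days: TRUE adds one day and FALSE adds none. A minimal sketch on two made-up flights, the first of which arrives after midnight:

```r
library(lubridate)

dep <- ymd_hms(c("2013-01-01 19:29:00", "2013-01-01 19:00:00"))
arr <- ymd_hms(c("2013-01-01 00:03:00", "2013-01-01 21:00:00"))

overnight <- arr < dep              # TRUE FALSE
arr_fixed <- arr + days(overnight)  # only the first arrival moves to Jan 2
```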

Intervals

dyears(1) / ddays(365) does not return 1, since dyears() is defined as the number of seconds per average year: 365.25 days.
years(1) / days(1) does not return 365, since in leap years this isn’t true.
Intervals allow for defining specific intervals of time, using pair of starting/end date times.
Format: start %--% end:

(y2023 <- ymd("2023-01-01") %--% ymd("2024-01-01"))

[1] 2023-01-01 UTC--2024-01-01 UTC

(y2024 <- ymd("2024-01-01") %--% ymd("2025-01-01"))

[1] 2024-01-01 UTC--2025-01-01 UTC

Can divide by days() to find out how many days/year:

y2023 / days(1)

[1] 365

y2024 / days(1)

[1] 366
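Intervals also give an exact answer to the earlier age question, because the start and end dates pin down which leap years are involved. A sketch using fixed dates (rather than today()) so the result is reproducible:

```r
library(lubridate)

lifetime <- ymd("1979-10-14") %--% ymd("2024-01-23")

age_days  <- lifetime / days(1)      # exact number of days in the interval
age_years <- lifetime %/% years(1)   # whole years completed
```

With these dates, age_days is 16172 and age_years is 44.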

Example: extracting dates and computing durations

Suppose we are given a tibble whose first three rows are as follows:

 df <- tribble(
    ~name, ~entry,
    "Mary", "First arrival: January 2, 2005; Second arrival: January 6, 2023",
    "Will", "First time: January 5, 1997; Second time: January 8, 2015",
    "Jose", "First visit: January 4, 1990; Second visit: January 9, 2008",
  )

Task: return a tibble giving the number of days elapsed between the first and second visit, sorted from most days to fewest
Complex task! Need to:
- Parse the “entry” column to extract the dates
- Turn them into proper dates
- Compute the number of days between visits (leap years happen at different times!)
- Order by number of days
Let’s start by parsing “entry” and creating two columns for the two dates

entry elements look like: “First arrival: January 2, 2005; Second arrival: January 6, 2023”

( df2 <- df %>% mutate(
    date1 = mdy(str_replace(entry, "(.*): (.*); (.*): (.*)", "\\2")),
    date2 = mdy(str_replace(entry, "(.*): (.*); (.*): (.*)", "\\4"))
  ) )

# A tibble: 3 × 4
  name  entry                                              date1      date2     
  <chr> <chr>                                              <date>     <date>    
1 Mary  First arrival: January 2, 2005; Second arrival: J… 2005-01-02 2023-01-06
2 Will  First time: January 5, 1997; Second time: January… 1997-01-05 2015-01-08
3 Jose  First visit: January 4, 1990; Second visit: Janua… 1990-01-04 2008-01-09

Now we want to compute the number of days between the two dates: days() of the difference!

df2 %>% mutate(days_elapsed = days(date2 - date1) )

# A tibble: 3 × 5
  name  entry                               date1      date2      days_elapsed  
  <chr> <chr>                               <date>     <date>     <Period>      
1 Mary  First arrival: January 2, 2005; Se… 2005-01-02 2023-01-06 6578d 0H 0M 0S
2 Will  First time: January 5, 1997; Secon… 1997-01-05 2015-01-08 6577d 0H 0M 0S
3 Jose  First visit: January 4, 1990; Seco… 1990-01-04 2008-01-09 6579d 0H 0M 0S

Examples

Just need to order by maximum number of days now

df %>% mutate(
    date1 = mdy(str_replace(entry, "(.*): (.*); (.*): (.*)", "\\2")),
    date2 = mdy(str_replace(entry, "(.*): (.*); (.*): (.*)", "\\4")),
    days_elapsed = days(date2-date1)
  ) %>% arrange(desc(days_elapsed))

# A tibble: 3 × 5
  name  entry                               date1      date2      days_elapsed  
  <chr> <chr>                               <date>     <date>     <Period>      
1 Jose  First visit: January 4, 1990; Seco… 1990-01-04 2008-01-09 6579d 0H 0M 0S
2 Mary  First arrival: January 2, 2005; Se… 2005-01-02 2023-01-06 6578d 0H 0M 0S
3 Will  First time: January 5, 1997; Secon… 1997-01-05 2015-01-08 6577d 0H 0M 0S
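If a plain number is easier to work with than a Period column, the difference of two Date vectors can be converted directly. This is an alternative sketch, not the approach used on the slide; it reuses the three pairs of dates from the example.

```r
library(lubridate)

date1 <- ymd(c("2005-01-02", "1997-01-05", "1990-01-04"))
date2 <- ymd(c("2023-01-06", "2015-01-08", "2008-01-09"))

# Subtracting Dates gives a difftime; as.numeric() turns it into a day count.
n_days <- as.numeric(date2 - date1, units = "days")  # 6578 6577 6579

# Row order from most to fewest days: Jose, Mary, Will.
ranking <- order(n_days, decreasing = TRUE)          # 3 1 2
```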

We could also have done separate_wider_regex() to parse the string initially. How would we do that?

Examples

df2 <- df %>% separate_wider_regex(
  entry,
  patterns = c(
    ".*",
    ": ",
    date1 = ".*",
    ";",
    ".*",
    ": ",
    date2 = ".*"
  ) ) %>%
  mutate(date1 = mdy(date1), date2 = mdy(date2))

Original df:

df

# A tibble: 3 × 2
  name  entry                                                          
  <chr> <chr>                                                          
1 Mary  First arrival: January 2, 2005; Second arrival: January 6, 2023
2 Will  First time: January 5, 1997; Second time: January 8, 2015      
3 Jose  First visit: January 4, 1990; Second visit: January 9, 2008

df2

# A tibble: 3 × 3
  name  date1      date2     
  <chr> <date>     <date>    
1 Mary  2005-01-02 2023-01-06
2 Will  1997-01-05 2015-01-08
3 Jose  1990-01-04 2008-01-09

Example - string formatting errors for dates

Let’s suppose we are given a character vector with a bunch of supposed dates

ds <- c("2022-01-08", "202-01-09", "2022/01/10")

And suppose we want to return a new date vector satisfying the following:
- if the string is in the correct format, it is the parsed date
- otherwise, return the date January 1, year 1.
We need to flag which elements of ds have the format of either YYYY-MM-DD or YYYY/MM/DD, i.e. return a vector with booleans saying whether that element has the correct format.
We’ll use regular expressions

pattern_regex <- "^\\d{4}[-/]\\d{2}[-/]\\d{2}$"
str_detect(ds, pattern_regex)

[1]  TRUE FALSE  TRUE
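One thing to note about this pattern: it also accepts mixed separators such as "2022-01/10". If that should count as badly formatted, a backreference can require the second separator to match the first. The `strict` pattern and the extra test string below are additions for illustration only.

```r
library(stringr)

ds_more <- c("2022-01-08", "202-01-09", "2022/01/10", "2022-01/10")

loose  <- "^\\d{4}[-/]\\d{2}[-/]\\d{2}$"     # pattern from the slide
strict <- "^\\d{4}([-/])\\d{2}\\1\\d{2}$"    # \\1 = same separator as the first

loose_ok  <- str_detect(ds_more, loose)   # TRUE FALSE TRUE TRUE
strict_ok <- str_detect(ds_more, strict)  # TRUE FALSE TRUE FALSE
```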

Example - string formatting errors for dates

ds

[1] "2022-01-08" "202-01-09"  "2022/01/10"

str_detect(ds, "^\\d{4}[-/]\\d{2}[-/]\\d{2}$")

[1]  TRUE FALSE  TRUE

Now we can use a case_when to make it year 1 if not correct format:

df <- tibble(entered = ds)
df %>% mutate(new_date = case_when(
  str_detect(entered, "^\\d{4}[-/]\\d{2}[-/]\\d{2}$") ~ ymd(entered, quiet=TRUE),
  TRUE ~ ymd("0001-01-01") )
  )

# A tibble: 3 × 2
  entered    new_date  
  <chr>      <date>    
1 2022-01-08 2022-01-08
2 202-01-09  1-01-01   
3 2022/01/10 2022-01-10
