In today’s data-driven world, almost every piece of data is associated with a date or time in some way (for simplicity, I will refer to both dates and times as datetime from here on). The datetime may be the data itself, or it may simply record when a particular piece of data was collected. Either way, the ability to manipulate this type of data, both by doing arithmetic on it and by separating out its individual components, is essential for a data scientist.

One of the greatest challenges when working with datetime data is the huge variety in formatting that can be present. This can range from the simple (US and European ordering of the day and month) to verbose statements that must be parsed piecemeal (Tuesday January 24th, 2017).

Python and R both have reasonably good support for datetime manipulation. While this is not meant to be a comprehensive tutorial on all of the available functions, it will hopefully provide an easy-to-implement framework for those just getting started.

R

Datetime data is cumbersome in base R. However, as with so many capabilities, improved datetime handling is provided by a downloadable package, in this case “lubridate.” Almost every function in the package is named exactly as you would expect. First, let’s parse the same date in different formats:

library(lubridate)
date1 <- ymd('20170125')
date2 <- ymd('2017-01-25')
date3 <- ymd('2017/01/25')
date4 <- mdy('01/25/2017')
date5 <- dmy('25/01/2017')
date1
## "2017-01-25"

Notice how these functions handle the delimiters automatically. We can extend them to include time by adding ‘_hms’ to the function name. The timezone is an optional parameter:

date1 <- ymd_hms('2017-01-25 14:25:12', tz='EST')
date1
## "2017-01-25 14:25:12 EST"

Extracting and changing individual components

lubridate has functions for extracting and changing individual components of a datetime object. As usual, they are named as we would expect:

second(date1)
## 12
wday(date1)
## 4
hour(date1)
## 14
year(date1)
## 2017
second(date1) <- 49
date1
## "2017-01-25 14:25:49 EST"

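A note of caution: the component functions have plural counterparts (e.g. seconds()) that do something different. Rather than extracting a component, they create a span of time, and when given a datetime they return the interval in POSIX/Epoch time, that is, the time elapsed since Thursday January 1, 1970.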

Date arithmetic

We can create a fixed span of time (a duration, in lubridate’s terms) using the plural name of the function with a ‘d’ out front. This duration is then added to (or subtracted from) a date object to create a target date.

ddays(1)
## "86400s (~1 days)"
date1
## "2017-01-25 14:25:49 EST"
date1 + ddays(1)
## "2017-01-26 14:25:49 EST"
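date1 + dyears(1)
## "2018-01-25 14:25:49 EST"
## (newer lubridate releases define dyears(1) as 365.25 days, which moves this result 6 hours later; years(1) adds one calendar year)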
date1 - dweeks(1)
## "2017-01-18 14:25:49 EST"

lubridate also has the ability to create and manipulate time intervals. This functionality can be really useful, depending on the goal, but we will leave that for another day.

Python

Python lets us import just the pieces of a package that we need, a finer granularity than is typical in R. The standard library has a ‘datetime’ module that covers everything in this tutorial. Within ‘datetime’ there are separate date, time, datetime and timedelta objects that can be used individually.
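To make the four object types concrete, here is a minimal sketch that builds each one directly rather than parsing text. The variable names (d, t, stamp, span) are placeholders chosen for illustration only.

```python
from datetime import date, time, datetime, timedelta

# A calendar date with no time of day attached.
d = date(2017, 1, 25)

# A time of day with no date attached.
t = time(14, 25, 12)

# A full timestamp, built here by combining the two pieces above.
stamp = datetime.combine(d, t)

# A span of time, independent of any particular date.
span = timedelta(hours=1, minutes=30)

print(stamp)         # 2017-01-25 14:25:12
print(stamp + span)  # 2017-01-25 15:55:12
```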

As with R, we will start with how to parse text into a format that Python can work with. Unlike lubridate, where the name of the function reflects the format of the information, here we provide the format explicitly.

from datetime import datetime as dt
date1 = dt.strptime('20170125',"%Y%m%d")
date2 = dt.strptime('2017-01-25',"%Y-%m-%d")
date3 = dt.strptime('2017/01/25',"%Y/%m/%d")
date4 = dt.strptime('01/25/2017',"%m/%d/%Y")
date5 = dt.strptime('25/01/2017',"%d/%m/%Y")
date6 = dt.strptime('2017-01-25 14:25:12',"%Y-%m-%d %H:%M:%S")
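Verbose dates such as the “Tuesday January 24th, 2017” example from the introduction need a little cleanup first, because strptime has no format code for ordinal suffixes like “th”. The sketch below is one possible approach, not the only one: the helper name parse_verbose_date is hypothetical, and it assumes English day and month names (the %A and %B codes depend on the active locale).

```python
import re
from datetime import datetime

def parse_verbose_date(text):
    """Hypothetical helper: parse dates like 'Tuesday January 24th, 2017'."""
    # Drop an ordinal suffix (st, nd, rd, th) that directly follows a digit.
    cleaned = re.sub(r"(?<=\d)(st|nd|rd|th)\b", "", text)
    # %A = full weekday name, %B = full month name (locale dependent).
    return datetime.strptime(cleaned, "%A %B %d, %Y")

parsed = parse_verbose_date("Tuesday January 24th, 2017")
print(parsed)  # 2017-01-24 00:00:00
```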

Extracting and changing individual components

Once we have a datetime object we can extract individual pieces. This includes getting specific components, as well as separating the date portion from the time portion. We can also replace a particular component, though note that the function returns a new datetime object and does not change the original.

date6.date()
## "datetime.date(2017, 1, 25)"
date6.time()
## "datetime.time(14, 25, 12)"
date6.hour
## "14"
date6.year
## "2017"
date6.weekday()  # a method, so it needs parentheses; Monday is 0
## "2"
date6.replace(hour=23)
## "datetime.datetime(2017, 1, 25, 23, 25, 12)"
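Two details here are easy to trip over, so a short sketch may help. First, replace() leaves the original object unchanged. Second, Python counts weekdays from Monday, while lubridate's wday() counts from Sunday = 1 by default, which is why the same Wednesday shows up as 4 in the R section. The helper sunday_first_weekday is a hypothetical name introduced only to show the conversion.

```python
from datetime import datetime

stamp = datetime(2017, 1, 25, 14, 25, 12)

# replace() returns a new object; the original is left untouched.
late = stamp.replace(hour=23)
print(stamp.hour)  # 14
print(late.hour)   # 23

def sunday_first_weekday(moment):
    """Hypothetical helper: day of week with Sunday = 1 ... Saturday = 7."""
    # isoweekday() runs Monday = 1 ... Sunday = 7, so Sunday wraps to 0 first.
    return moment.isoweekday() % 7 + 1

print(stamp.weekday())              # 2 (Monday = 0)
print(stamp.isoweekday())           # 3 (Monday = 1)
print(sunday_first_weekday(stamp))  # 4 (Sunday = 1)
```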

Date arithmetic

The conceptual framework for interval and date arithmetic is almost identical between R and Python, with just a little difference in syntax. Here we will use a timedelta object, which both represents an interval and can be added to or subtracted from dates and times.

from datetime import timedelta
oneHour = timedelta(hours=1)
twoHour = dt.strptime('2017-01-25 14:25:12',"%Y-%m-%d %H:%M:%S") - dt.strptime('2017-01-25 12:25:12',"%Y-%m-%d %H:%M:%S")
twoHour
## "datetime.timedelta(seconds=7200)"  (shown as "datetime.timedelta(0, 7200)" before Python 3.7)
date6 + oneHour
## "datetime.datetime(2017, 1, 25, 15, 25, 12)"
date6 - oneHour
## "datetime.datetime(2017, 1, 25, 13, 25, 12)"
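For comparison with the R examples, the sketch below repeats the add-a-day, subtract-a-week and add-a-year steps in Python. One difference worth knowing: timedelta accepts days and weeks but has no years argument, so the year step here uses replace() instead. That is an illustrative choice rather than the only option, and it would raise a ValueError when starting from February 29.

```python
from datetime import datetime, timedelta

stamp = datetime(2017, 1, 25, 14, 25, 49)

# Fixed-length spans work directly with + and -.
next_day = stamp + timedelta(days=1)
last_week = stamp - timedelta(weeks=1)
print(next_day)   # 2017-01-26 14:25:49
print(last_week)  # 2017-01-18 14:25:49

# timedelta has no "years" argument, so move the calendar year with replace().
# Caveat: this raises ValueError if stamp falls on February 29.
next_year = stamp.replace(year=stamp.year + 1)
print(next_year)  # 2018-01-25 14:25:49

# Subtracting two datetimes gives a timedelta we can inspect.
elapsed = next_year - stamp
print(elapsed.days)             # 365
print(elapsed.total_seconds())  # 31536000.0
```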

Conclusions

The ability to manage dates and times is fundamental to data science and even more important when working with healthcare data, almost all of which is timestamped. Python and R share many conceptual similarities when working with this type of data, though the exact syntax is different.

This brief tutorial is by no means comprehensive, and I strongly encourage everyone to read the actual documentation to see everything that can be done with these tools.

paste("Author:", Sys.getenv("USER"))
paste("Last edited:", Sys.time())
R.version.string
## "Author: Joe Wildenberg"
## "Last edited: 2016-11-04 13:31:15"
## "R version 3.3.1 (2016-06-21)"