Getting Started

Here is a dataset you can download to follow along with this exercise. The data is downsampled and slightly altered to protect patient privacy, but the random sample still preserves the internal relationships we want to evaluate. The data contains 3 fields:

  • ID - a unique ID for each entry
  • OrderDate - the date and time for an exam. In this dataset, this is the timestamp at which an inpatient exam was scheduled.
  • Modality - The type of imaging exam.

We'll treat this dataset as an unknown, as if we are encountering it for the first time. Specifically, we'll be doing three things:

  1. Overview of the data
  2. Analysis of volume separated by month of year.
  3. Analysis of volume by time of day.

Data Overview

Let's load up the data and take a look. The following code:

  • Loads the data and stores it in the inpt variable
  • Looks at a sampling of the data
  • Uses the describe function to get an overview of the dataset
In [1]:
%matplotlib inline 
# Jupyter magic to display graphs inline

import pandas as pd

# Import data

inpt = pd.read_csv("csv/inpatient_data_sample.csv")

# Peek at the data

inpt.head(10)
Out[1]:
ID OrderDate Modality
0 144093 2015-12-11 21:35:00 US
1 88891 2015-10-10 04:25:00 XR
2 51111 2015-08-27 18:50:00 XR
3 161583 2016-01-02 20:25:00 XR
4 310429 2016-06-13 10:30:00 CT
5 5966 2015-07-07 21:10:00 XR
6 213494 2016-03-01 11:00:00 XR
7 264869 2016-04-25 12:45:00 US
8 258733 2016-04-18 15:55:00 XR
9 244192 2016-04-03 02:05:00 XR
In [2]:
inpt.describe()  # Note: describe is a method, so the parentheses are required
Out[2]:
                   ID
count    50000.000000
mean              ...
std               ...
min               ...
25%               ...
50%               ...
75%               ...
max               ...

For this dataset, most of the describe() output is meaningless (summary statistics of an ID number tell us nothing), but the count of 50,000 rows is helpful.
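A quick aside: describe must be called with parentheses; without them you get the bound method object rather than the statistics. A toy illustration:

```python
import pandas as pd

df = pd.DataFrame({"ID": [1, 2, 3]})

# Forgetting the parentheses returns the bound method itself...
method = df.describe
print(callable(method))        # True -- a method object, not a DataFrame

# ...while calling it returns the summary-statistics DataFrame
summary = df.describe()
print(summary.loc["count", "ID"])  # 3.0
```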

Identify Boundaries of the Dataset

Reading through the output, we have now gotten some insight into this dataset, which has 3 columns:

  • ID appears to represent some unique identification number.
  • OrderDate appears to span 2015 and 2016, but we're not sure exactly which months.
  • Modality contains a handful of modalities, mostly XR but also CT and US.

Let's confirm each of these insights. The following code will:

  • Find all the distinct IDs and take the length of that list. Since the describe output above told us there are 50,000 rows in total, if these are truly unique IDs, we expect 50,000 distinct values.
  • Find the earliest and latest order date.
  • Use the value_counts() function on Modality to identify all the modalities, which gives us a count of all available modalities in this dataset.
In [3]:
unique_id = inpt['ID'].unique()
print("Number of unique IDs:", len(unique_id))

inpt['OrderDate'] = pd.to_datetime(inpt['OrderDate']) # Convert from string to datetime
sortby_orderdate = inpt['OrderDate'].sort_values().tolist()
print("Oldest study: ", sortby_orderdate[0])
print("Most recent study: ", sortby_orderdate[-1])

mods = inpt['Modality'].value_counts()
print(mods)
Number of unique IDs: 50000
Oldest study:  2015-07-01 00:00:00
Most recent study:  2016-07-31 23:25:00
XR    35043
CT    11170
US     3787
Name: Modality, dtype: int64

Ordering Patterns by Month

We now have a general sense of what the data holds. For the rest of this post, we will focus on monthly ordering patterns.

Let's create a graph depicting the volume of XR, CT and US by month of the year. The following code will:

  • Create a new column "OrderMonth" containing the month of the ordering date.
  • Create a line graph based on the ordering month
In [4]:
# OrderMonth created as string, using the year and zero-padded month numbers for sorting purposes.
inpt['OrderMonth'] = inpt['OrderDate'].map(lambda x: str(x.year) + "-" + str(x.month).zfill(2))
inpt.head()
Out[4]:
ID OrderDate Modality OrderMonth
0 144093 2015-12-11 21:35:00 US 2015-12
1 88891 2015-10-10 04:25:00 XR 2015-10
2 51111 2015-08-27 18:50:00 XR 2015-08
3 161583 2016-01-02 20:25:00 XR 2016-01
4 310429 2016-06-13 10:30:00 CT 2016-06
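As an aside, the `.dt.strftime` accessor produces the same zero-padded "YYYY-MM" strings more concisely than the lambda; a minimal sketch:

```python
import pandas as pd

s = pd.to_datetime(pd.Series(["2015-12-11 21:35:00", "2016-01-02 20:25:00"]))

# strftime yields the same zero-padded year-month strings as the map/lambda approach
months = s.dt.strftime("%Y-%m")
print(months.tolist())  # ['2015-12', '2016-01']
```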
In [5]:
month_counts = inpt['OrderMonth'].value_counts()

# By default value_counts sorts by count, so change that to sort by month (index)
order_month_sbIndex = month_counts.sort_index()

# Make a line graph.
order_month_sbIndex.plot(kind='line', figsize=(16,6), title="Overall Radiology Volume");

By default pandas does not zero out the y-axis, so this view is slightly exaggerated. Still, the apparent pattern is a bump in inpatient volume from February to March 2016.
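If we want the honest, zero-anchored view, we can set the lower y-limit explicitly. A sketch with hypothetical counts:

```python
import matplotlib
matplotlib.use("Agg")          # non-interactive backend for this sketch
import pandas as pd

# Hypothetical monthly counts, just to show anchoring the y-axis at zero
s = pd.Series([3400, 3500, 3900], index=["2016-01", "2016-02", "2016-03"])
ax = s.plot(kind="line")
ax.set_ylim(bottom=0)          # removes the exaggeration from auto-scaling
```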

Separating out the Modalities

Based on our earlier findings using value_counts(), XRs outnumber CTs roughly 3:1, which means most of the movement we see on the graph arises from variation in radiographs.

Let's take apart the different modalities and create a 3-part graph.

The following code will:

  • Create a pivot table separating the different modalities into columns and using OrderMonth as rows.
  • Use count function on ID to determine volume in each cell.
  • Plot the data as a stacked area graph.
In [6]:
inpt.pivot_table("ID", "OrderMonth", "Modality", aggfunc='count').plot(kind='area', figsize=(16,6));
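To make the pivot concrete, here is what it produces on a tiny hypothetical frame (illustrative values only):

```python
import pandas as pd

# Hypothetical rows, just to illustrate the count aggregation
toy = pd.DataFrame({
    "ID":         [1, 2, 3, 4],
    "OrderMonth": ["2016-01", "2016-01", "2016-02", "2016-02"],
    "Modality":   ["XR", "CT", "XR", "XR"],
})

# Rows are months, columns are modalities, cells are exam counts
table = toy.pivot_table("ID", "OrderMonth", "Modality", aggfunc="count")
print(table)
```

Note that a month/modality combination with no rows (here, CT in 2016-02) comes out as NaN rather than 0.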

Assessing the Variation

Here we confirm that XR indeed makes up most of the volume and largely accounts for the bump from February to March 2016. We also see that the y-axis is scaled differently here: the bump doesn't look all that significant when the y-axis starts at 0.

So let's dig in a little further. We know there is natural variation in radiology volume throughout the year. Does this bump constitute a significant change?

We can do this using a control chart, also known as a Shewhart chart.

Control Chart

To make a control chart, we'll have to measure the mean and standard deviation of the dataset. Since radiographs are the crux of the increase here, let's focus on XR.

The following code will:

  • Calculate the XR examinations by month and save in the inpt_xr_monthly variable.
  • Calculate the mean and std using pandas and assign the control limits as Upper and Lower
  • Assign the mean and control limits as new columns ucl (upper control limit) and lcl (lower control limit).
  • Plot the control chart.
In [7]:
inpt_xr = inpt[inpt['Modality'] == 'XR']
inpt_xr_monthly = inpt_xr['OrderMonth'].value_counts().sort_index()

mean = inpt_xr_monthly.mean()
std = inpt_xr_monthly.std()
upper_limit = mean + 3*std
lower_limit = mean - 3*std
# Transform the pandas Series into DataFrame so we can add new columns
inpt_xr_monthly = pd.DataFrame(inpt_xr_monthly)

inpt_xr_monthly['Upper'] = upper_limit
inpt_xr_monthly['Lower'] = lower_limit
inpt_xr_monthly['Mean'] = mean

# Create the styles
color = ['black','red','red','green']
linestyle = ['-', '--', '--', '--']
yrange = [mean-6*std,mean+6*std]

inpt_xr_monthly.plot(ylim=yrange, title="Monthly Inpatient Radiography Volume", color=color, style=linestyle, figsize=(16,6));
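The control-limit arithmetic can be sanity-checked on a toy series (hypothetical numbers, chosen to make the math come out evenly):

```python
import pandas as pd

# Toy monthly counts, only to verify the 3-sigma limit arithmetic
toy = pd.Series([10.0, 12.0, 14.0])

mean, std = toy.mean(), toy.std()   # pandas .std() is the sample std (ddof=1)
upper = mean + 3 * std
lower = mean - 3 * std
print(mean, std, upper, lower)      # 12.0 2.0 18.0 6.0
```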

Results

This increase is not considered out of control (i.e. all the points are within the control limits). However, every point from March through July 2016 is above the mean. This is called a run, and if a run persists over many time points, it suggests a shift to a new mean.
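One way to detect such a run programmatically (a sketch on hypothetical counts, not the method used above):

```python
import pandas as pd

# Hypothetical monthly counts; the last five points sit above the mean
counts = pd.Series([100, 98, 102, 99, 110, 111, 112, 113, 114])
above = counts > counts.mean()

# Label consecutive stretches of identical True/False values, then measure them
groups = (above != above.shift()).cumsum()
longest_run = int(above.groupby(groups).sum().max())
print(longest_run)  # 5
```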

Nevertheless, for now, this is just an interesting finding worth keeping an eye on.

Inpatient Orders by Hours of Operation

Let's get to our hypothesis now. Let's see what time during the day are orders being placed.

First we will have to extract the hour from OrderDate; then we will create visualizations based on it.

In [8]:
import matplotlib.pyplot as plt  # needed for the axis labels below

inpt['OrderHour'] = inpt['OrderDate'].map(lambda x: x.hour)
plt.xlabel("Hour of Day")
plt.ylabel("Volume - All Modalities")
inpt['OrderHour'].value_counts().sort_index().plot(kind='line', lw=2, figsize=(16,6), title="Volume of Orders by Hour", ylim=0);
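As with the month extraction, the `.dt.hour` accessor offers an equivalent vectorized form to the map/lambda; a minimal sketch:

```python
import pandas as pd

s = pd.to_datetime(pd.Series(["2015-12-11 21:35:00", "2015-10-10 04:25:00"]))

# The vectorized .dt.hour accessor is equivalent to map(lambda x: x.hour)
hours = s.dt.hour
print(hours.tolist())  # [21, 4]
```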

Order by Day of Week

Some people say "nothing gets done over the weekends" for inpatient services. However, we don't know if that's necessarily true in radiology. Let's separate out these orders by day of week.

In [9]:
inpt['OrderDOW'] = inpt['OrderDate'].map(lambda x: x.dayofweek)
linestyle=['-', '-', '-', '-', '-', '--', '--']
inpt_dow = inpt.pivot_table('ID', 'OrderHour', 'OrderDOW', aggfunc='count')

dow = ['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun'] # Create a key to convert numeric values to an actual day of week.
inpt_dow.columns = dow

inpt_dow.plot(kind='line', title="Order by Hour, Separated by Day of Week", figsize=(16,6), style=linestyle, lw=2);

Hourly Orders by Modality

It looks like volumes start picking up between 3 AM and 5 AM regardless of the day of the week.

Let's split up the exams by modality. The code below will color in the by-hour graph using a stacked area graph.

In [10]:
inpt.pivot_table('ID', 'OrderHour', 'Modality', aggfunc='count').plot(kind='area', figsize=(16,6));

The interesting finding here is that the 3-5 AM increase in volume is almost entirely accounted for by radiographs (CT and US are unchanged to minimally decreased during this window). Because this is inpatient data, we are likely seeing the daily morning ICU radiographs.

Putting it All Together

Finally, let's see how the 3 modalities in this dataset behave differently.

The best way to visualize this may be to make 3 separate subgraphs.

In [11]:
inpt_xr_dow = inpt[inpt['Modality']=='XR'].pivot_table('ID', 'OrderHour', 'OrderDOW', aggfunc='count')
inpt_ct_dow = inpt[inpt['Modality']=='CT'].pivot_table('ID', 'OrderHour', 'OrderDOW', aggfunc='count')
inpt_us_dow = inpt[inpt['Modality']=='US'].pivot_table('ID', 'OrderHour', 'OrderDOW', aggfunc='count')

dow = ['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun'] # Create a key to convert numeric values to an actual day of week.
inpt_xr_dow.columns = inpt_ct_dow.columns = inpt_us_dow.columns = dow

fig, axes = plt.subplots(nrows=3, ncols=1, figsize=(16,12))

inpt_xr_dow.plot(ax=axes[0], kind='line', title="RADIOGRAPH Orders", style=linestyle, lw=2);
inpt_ct_dow.plot(ax=axes[1], kind='line', title="CT Orders", style=linestyle, lw=2);
inpt_us_dow.plot(ax=axes[2], kind='line', title="ULTRASOUND Orders", style=linestyle, lw=2);

plt.tight_layout()

Interpretation of Results

Radiography

Radiography shows a similar pattern to the "overall" graph, since the overall numbers are dominated by radiographs. There is the same 3 AM to 5 AM bump regardless of day of week, possibly related to daily ICU films. However, on weekends the volume drops off after 7 AM without the second increase seen on weekdays.

CT

CT does not appear to demonstrate the same discrepancy between weekday and weekend that you see with the others.

Ultrasound

Ultrasound shows the largest discrepancy between weekday and weekend, with daytime volumes on Saturdays and Sundays only about one-third of weekday volumes.

Conclusion

In this post we took a random radiology case log containing radiograph, CT, and ultrasound ordering data.

We looked at some quick summary data, and then we identified large-scale trends by month. We assessed the monthly variations using statistical process control charts to determine whether apparent increases are likely to be real increases or part of expected normal variation.

Finally we took a smaller-scale look at hourly volume variations. Throughout the exercise, we took note of the importance of evaluating different modalities differently due to the nature of each exam.

If you like our data digging exercises, do take a moment to subscribe to the Radiology Data Quest mailing list!

In [14]:
%reload_ext signature
%signature
Out[14]:
Author: Howard Chen • Last edited: September 05, 2016
Linux 3.10.0-327.22.2.el7.x86_64 - CPython 3.5.1 - IPython 4.2.0 - matplotlib 1.5.1 - numpy 1.11.0 - pandas 0.18.1