Here is a dataset you can download to follow along with this exercise. The data has been downsampled and slightly altered to protect patient privacy, but the random sample still lets us evaluate the internal relationships without issue. The dataset contains three fields:
ID - a unique ID for each entry.
OrderDate - the date and time of an exam. In this dataset, this represents the timestamp an inpatient exam is scheduled.
Modality - the type of imaging exam.
We'll treat this dataset as an unknown, as if we were encountering it for the first time.
Let's load up the data and take a look. The following code imports pandas, reads in the CSV, peeks at the first ten rows, and uses the describe function to get an overview of the dataset.
# Jupyter magic to display graphs inline
%matplotlib inline

import pandas as pd

# Import data
inpt = pd.read_csv("csv/inpatient_data_sample.csv")

# Peek at the data
inpt.head(10)
<bound method NDFrame.describe of             ID           OrderDate Modality
0      144093 2015-12-11 21:35:00       US
1       88891 2015-10-10 04:25:00       XR
2       51111 2015-08-27 18:50:00       XR
3      161583 2016-01-02 20:25:00       XR
4      310429 2016-06-13 10:30:00       CT
...       ...                 ...      ...
49997  148349 2015-12-16 21:50:00       XR
49998   37925 2015-08-12 20:10:00       CT
49999  144935 2015-12-13 01:10:00       US

[50000 rows x 3 columns]>
For this dataset, most of the describe() output is nonsense, but it does tell us the count, which is helpful.
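As an aside, describe becomes more informative on object columns when called with include='all', which adds unique/top/freq rows. A minimal sketch on a few hypothetical rows mirroring this dataset's structure:

import pandas as pd

# A few hypothetical rows mirroring the dataset's three columns
inpt = pd.DataFrame({
    "ID": [144093, 88891, 51111],
    "OrderDate": ["2015-12-11 21:35:00", "2015-10-10 04:25:00", "2015-08-27 18:50:00"],
    "Modality": ["US", "XR", "XR"],
})

# include='all' adds count/unique/top/freq rows for the object columns
summary = inpt.describe(include="all")
print(summary.loc["count"])               # every column has 3 non-null values
print(summary.loc["unique", "Modality"])  # 2 distinct modalities in this sample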
Reading through the descriptors, we have now gotten some insight into this dataset, which has three columns:
ID appears to represent some unique identification number.
OrderDate appears to be from 2015 and 2016, but we're not sure exactly which months.
Modality holds the exam modality: mostly XR, but also CT and US.
Let's confirm each of these insights. The following code will:
Count the unique IDs. The describe function told us there are 50,000 rows in total; if these are in fact unique IDs, we would expect 50,000 unique values.
Convert OrderDate to a datetime and find the oldest and most recent studies.
Run value_counts() on Modality, which gives us a count of each modality available in this dataset.
unique_id = inpt['ID'].unique()
print("Number of unique IDs:", len(unique_id))

# Convert from string to datetime
inpt['OrderDate'] = pd.to_datetime(inpt['OrderDate'])
sortby_orderdate = inpt['OrderDate'].sort_values().tolist()
print("Oldest study: ", sortby_orderdate[0])
print("Most recent study: ", sortby_orderdate[-1])

mods = inpt['Modality'].value_counts()
print(mods)
Number of unique IDs: 50000
Oldest study:  2015-07-01 00:00:00
Most recent study:  2016-07-31 23:25:00
XR    35043
CT    11170
US     3787
Name: Modality, dtype: int64
We now have a general sense of what the data holds. For the rest of this post, we will focus on monthly ordering patterns.
Let's create a graph depicting the volume of XR, CT, and US by month of the year. The following code will build an OrderMonth column and plot the monthly counts as a line graph.
# OrderMonth created as a string, using the year and zero-padded
# month numbers for sorting purposes.
inpt['OrderMonth'] = inpt['OrderDate'].map(lambda x: str(x.year) + "-" + str(x.month).zfill(2))
inpt.head()
month_counts = inpt['OrderMonth'].value_counts()

# By default value_counts sorts by value, so change that to sort by month (index)
order_month_sbIndex = month_counts.sort_index()

# Make a line graph.
order_month_sbIndex.plot(kind='line', figsize=(16,6), title="Overall Radiology Volume");
pandas does not zero out the y-axis, so this is a slightly exaggerated view. Still, there appears to be a bump in inpatient volume from February to March 2016.
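If the zoomed-in axis is a concern, pandas lets us pin the lower bound with the ylim parameter. A minimal sketch, using a hypothetical series standing in for the real monthly counts:

import matplotlib
matplotlib.use("Agg")  # non-interactive backend so this runs headless
import pandas as pd

# Hypothetical monthly counts standing in for order_month_sbIndex
monthly = pd.Series(
    [2900, 3100, 3000, 3350],
    index=["2016-01", "2016-02", "2016-03", "2016-04"],
)

# ylim pins the lower bound of the y-axis at zero; the upper bound
# stays automatic, so the bump is no longer visually exaggerated
ax = monthly.plot(kind="line", ylim=(0, None), title="Overall Radiology Volume")
print(ax.get_ylim()[0])  # 0.0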
Based on our earlier findings using value_counts(), XRs outnumber CTs roughly 3:1, which means most of the movement we see on the graph arises from variation in radiographs.
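That ratio can be made precise with value_counts(normalize=True), which returns shares instead of raw counts. A sketch with made-up rows in roughly the proportions seen above:

import pandas as pd

# Hypothetical modality column with roughly the proportions seen in this dataset
mods = pd.Series(["XR"] * 7 + ["CT"] * 2 + ["US"] * 1)

# normalize=True returns fractions of the total instead of raw counts
shares = mods.value_counts(normalize=True)
print(shares)

# Ratio of XR to CT volume
ratio = shares["XR"] / shares["CT"]
print(round(ratio, 1))  # 3.5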
Let's take apart the different modalities and create a three-part graph. The following code will pivot the data by month and modality, counting ID to determine the volume in each cell, and plot it as a stacked area graph.
inpt.pivot_table("ID", "OrderMonth", "Modality", aggfunc='count').plot(kind='area', figsize=(16,6));
Here we confirm that XR indeed takes up the most volume and mostly accounts for the bump between February and March 2016. We also see that the y-axis is scaled differently: with the axis starting at 0, the bump doesn't seem all that significant.
So let's dig in a little further. We know there is natural variation in radiology volume throughout the year. Does this bump constitute a significant change?
We can do this using a control chart, also known as a Shewhart chart.
To make a control chart, we'll have to measure the mean and standard deviation of the dataset. Since radiographs are the crux of the increase here, let's focus on XR.
The following code will compute the monthly XR volumes with pandas, then assign the mean and the control limits (the upper control limit and the lower control limit) as new columns.
inpt_xr = inpt[inpt['Modality'] == 'XR']
inpt_xr_monthly = inpt_xr['OrderMonth'].value_counts().sort_index()

mean = inpt_xr_monthly.mean()
std = inpt_xr_monthly.std()
upper_limit = mean + 3*std
lower_limit = mean - 3*std

# Transform the pandas Series into a DataFrame so we can add new columns
inpt_xr_monthly = pd.DataFrame(inpt_xr_monthly)
inpt_xr_monthly['Upper'] = upper_limit
inpt_xr_monthly['Lower'] = lower_limit
inpt_xr_monthly['Mean'] = mean

# Create the styles
color = ['black','red','red','green']
linestyle = ['-', '--', '--', '--']
yrange = [mean-6*std, mean+6*std]
inpt_xr_monthly.plot(ylim=yrange, title="Monthly Inpatient Radiography Volume", color=color, style=linestyle, figsize=(16,6));
This increase is not considered out of control (i.e., the points stay within the control limits). However, every month from March through July 2016 sits above the mean. This is called a run, and if a run persists over many timepoints, it suggests the process mean has shifted.
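A run can also be flagged programmatically by measuring the longest streak of consecutive points above the mean. A minimal sketch with hypothetical monthly counts (the numbers here are illustrative, not from this dataset):

import pandas as pd

# Hypothetical monthly XR counts: the last five points sit above the mean
counts = pd.Series([2800, 2750, 2900, 2850, 3000, 3050, 3020, 3100, 3080])
mean = counts.mean()

# Longest consecutive run of points strictly above the mean:
# (~above).cumsum() labels each streak, then a grouped cumsum counts its length
above = counts > mean
run_lengths = above.astype(int).groupby((~above).cumsum()).cumsum()
longest_run = int(run_lengths.max())
print(longest_run)  # 5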
Nevertheless, for now, this is just an interesting finding worth keeping an eye on.
Now let's get to our hypothesis: at what time of day are orders being placed?
First we will have to extract the hour from OrderDate; then we will create visualizations based on it.
import matplotlib.pyplot as plt

inpt['OrderHour'] = inpt['OrderDate'].map(lambda x: x.hour)
plt.xlabel("Hour of Day")
plt.ylabel("Volume - All Modalities")
inpt['OrderHour'].value_counts().sort_index().plot(kind='line', lw=2, figsize=(16,6), title="Volume of Orders by Hour", ylim=0);
Some people say "nothing gets done over the weekends" for inpatient services. However, we don't know if that's necessarily true in radiology. Let's separate out these orders by day of week.
inpt['OrderDOW'] = inpt['OrderDate'].map(lambda x: x.dayofweek)
linestyle = ['-', '-', '-', '-', '-', '--', '--']
inpt_dow = inpt.pivot_table('ID', 'OrderHour', 'OrderDOW', aggfunc='count')

# Create a key to convert numeric values to an actual day of week.
dow = ['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun']
inpt_dow.columns = dow

inpt_dow.plot(kind='line', title="Order by Hour, Separated by Day of Week", figsize=(16,6), style=linestyle, lw=2);
Our graph shows that volumes start picking up between 3 AM and 5 AM regardless of the day of the week.
Let's split up the exams by modality. The code below will color in the by-hour graph using a stacked area graph.
inpt.pivot_table('ID', 'OrderHour', 'Modality', aggfunc='count').plot(kind='area', figsize=(16,6));
The interesting finding here is that the 3-5 AM increase in volume is almost entirely accounted for by radiographs (CT and US are unchanged to minimally decreased during this window). Because this is inpatient data, we're likely seeing the daily morning ICU radiographs.
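That claim can be checked numerically by restricting to the 3-5 AM window and measuring what share of those orders are radiographs. A sketch on hypothetical rows (the data here is made up for illustration):

import pandas as pd

# Hypothetical orders with an OrderHour column, weighted toward XR overnight
orders = pd.DataFrame({
    "OrderHour": [3, 4, 4, 5, 5, 5, 12, 13, 14],
    "Modality":  ["XR", "XR", "XR", "XR", "XR", "CT", "CT", "US", "XR"],
})

# Share of early-morning (3-5 AM) orders that are radiographs
early = orders[orders["OrderHour"].between(3, 5)]
xr_share = (early["Modality"] == "XR").mean()
print(round(xr_share, 2))  # 0.83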
Finally, let's see how the 3 modalities in this dataset behave differently.
The best way to visualize this may be to make 3 separate subgraphs.
inpt_xr_dow = inpt[inpt['Modality']=='XR'].pivot_table('ID', 'OrderHour', 'OrderDOW', aggfunc='count')
inpt_ct_dow = inpt[inpt['Modality']=='CT'].pivot_table('ID', 'OrderHour', 'OrderDOW', aggfunc='count')
inpt_us_dow = inpt[inpt['Modality']=='US'].pivot_table('ID', 'OrderHour', 'OrderDOW', aggfunc='count')

# Create a key to convert numeric values to an actual day of week.
dow = ['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun']
inpt_xr_dow.columns = inpt_ct_dow.columns = inpt_us_dow.columns = dow

fig, axes = plt.subplots(nrows=3, ncols=1, figsize=(16,12))
inpt_xr_dow.plot(ax=axes[0], kind='line', title="RADIOGRAPH Orders", style=linestyle, lw=2)
inpt_ct_dow.plot(ax=axes[1], kind='line', title="CT Orders", style=linestyle, lw=2)
inpt_us_dow.plot(ax=axes[2], kind='line', title="ULTRASOUND Orders", style=linestyle, lw=2)
plt.tight_layout()
Radiographs show a pattern similar to the "overall" graph, since the overall numbers are dominated by radiographs. There is a similar bump from 3 AM to 5 AM regardless of the day of week, possibly related to daily ICU films. However, on weekends the volume drops off after 7 AM without the second increase seen on weekdays.
CT does not appear to demonstrate the same discrepancy between weekday and weekend that you see with the others.
Ultrasound has the largest discrepancy between weekday and weekend data, with daytime volumes on Saturdays and Sundays only about one-third of weekday levels.
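The one-third figure falls out directly from a day-of-week grouping. A sketch with hypothetical daily ultrasound totals, indexed the way pandas' dayofweek does it (0 = Monday through 6 = Sunday):

import pandas as pd

# Hypothetical daily ultrasound volumes by pandas-style day of week
# (0 = Monday ... 6 = Sunday); weekend values are deliberately much lower
us_daily = pd.Series([90, 88, 92, 91, 89, 30, 31], index=range(7))

weekday_mean = us_daily[us_daily.index < 5].mean()
weekend_mean = us_daily[us_daily.index >= 5].mean()
ratio = weekend_mean / weekday_mean
print(round(ratio, 2))  # 0.34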
In this post we took a random radiology case log containing radiograph, CT, and ultrasound ordering data.
We looked at some quick summary data, and then we identified large-scale trends by month. We assessed the monthly variations using statistical process control charts to determine whether apparent increases are likely to be real increases or part of expected normal variation.
Finally we took a smaller-scale look at hourly volume variations. Throughout the exercise, we took note of the importance of evaluating different modalities differently due to the nature of each exam.
If you like our data digging exercises, do take a moment to subscribe to the Radiology Data Quest mailing list!