For an introductory discussion of decision trees please see this post.

In one of this site's first quests, Howard utilized the Medicare efficiency data for lumbar spine MRIs and mammography to produce some nifty visualizations.

In this quest we will use the machine learning algorithm in the randomForest package to see whether we can predict lumbar spine MRI utilization from some state-specific data.

First, let's install (if needed) and load the randomForest package.

In [1]:
#install.packages('randomForest')
library('randomForest')


## Data Import

We will first import the two main datasets we will use in this quest. The first is the CMS efficiency data. Please see Howard's post for more information about where to download the dataset.

In [2]:
cms_data <- read.csv('csv/Outpatient_Imaging_Efficiency_-_State.csv')

Out[2]:
|  | State | Measure_ID | Measure_Name | Score | Footnote | Measure_StartDate | Measure_EndDate |
|---|---|---|---|---|---|---|---|
| 1 | AK | OP_10 | Abdomen CT Use of Contrast Material | 5.9 |  | 7/1/14 | 6/30/15 |
| 2 | AK | OP_11 | Thorax CT Use of Contrast Material | 0.7 |  | 7/1/14 | 6/30/15 |
| 3 | AK | OP_13 | Outpatients who got cardiac imaging stress tests before low-risk outpatient surgery | 4.5 |  | 7/1/14 | 6/30/15 |
| 4 | AK | OP_14 | Outpatients with brain CT scans who got a sinus CT scan at the same time | 2 |  | 7/1/14 | 6/30/15 |
| 5 | AK | OP_8 | MRI Lumbar Spine for Low Back Pain | 44.9 |  | 7/1/14 | 6/30/15 |
| 6 | AK | OP_9 | Mammography Follow-up Rates | 8.8 |  | 7/1/14 | 6/30/15 |
| 7 | AL | OP_10 | Abdomen CT Use of Contrast Material | 13.6 |  | 7/1/14 | 6/30/15 |

Next we are going to load a bunch of demographic data for each state. This data came from the US Census QuickFacts site, though it had to be compiled and filtered manually, as only six regions can be selected at a time. The spreadsheet of the data used in this quest can be downloaded here.

In [3]:
dem_state <- read.csv('csv/state_data.csv')
dim(dem_state)

Out[3]:
1. 50
2. 14
Out[3]:
(head of dem_state: one row per state, with the state name followed by the 13 numeric demographic columns; the column boundaries were lost in this export)

You can see that there are 13 datapoints for each state (excluding the state name). Something to note, and that will become important later, is that this dataset uses the full state name while the CMS data uses the two-letter abbreviation.

## Data Wrangling

Following Howard's lead, we first wrangle the CMS data into a form that is easy to analyze, removing the US territories and Washington, DC. Some of the syntax differs between R and Python, but we are performing the same operations. Note the all-important comma near the end of the statement: it tells R to remove rows instead of columns.

In [4]:
cms_state <- cms_data[-c(which(cms_data$State == 'PR'), which(cms_data$State == 'DC'),
which(cms_data$State == 'VI'), which(cms_data$State == 'GU'),
which(cms_data$State == 'AS'), which(cms_data$State == 'MP')),]
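
The same row filter can be written more compactly with `%in%`. A minimal sketch on a toy stand-in for the CMS data (the toy column names and values are made up for illustration):

```r
# Toy stand-in for cms_data: a State column with a few territory codes mixed in
toy <- data.frame(State = c('AK', 'PR', 'AL', 'DC', 'AZ'),
                  Score = c(44.9, 50.1, 42.7, 48.0, 39.3),
                  stringsAsFactors = FALSE)

drop_codes <- c('PR', 'DC', 'VI', 'GU', 'AS', 'MP')
# %in% builds one logical vector; ! keeps rows NOT in the drop list.
# The trailing comma again means "index rows, keep all columns".
toy_states <- toy[!(toy$State %in% drop_codes), ]
toy_states$State  # "AK" "AL" "AZ"
```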


In order to combine the CMS and state data we will have to deal with the way each dataset stores the state descriptor. Fortunately, R has a few built-in datasets, including both US state names and abbreviations. Let's start by creating an index data frame that will allow us to translate between the full state name and the abbreviation.

In [5]:
dfindex <- data.frame('State' = state.abb, 'State_Name' = state.name)

Out[5]:
|  | State | State_Name |
|---|---|---|
| 1 | AL | Alabama |
| 2 | AK | Alaska |
| 3 | AZ | Arizona |
| 4 | AR | Arkansas |
| 5 | CA | California |
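
Because `state.abb` and `state.name` are parallel vectors, `match()` can also translate directly without building a data frame; a quick sketch:

```r
# state.abb and state.name ship with base R and are aligned:
# position i in one corresponds to position i in the other.
abbr_to_name <- function(abbr) state.name[match(abbr, state.abb)]
abbr_to_name(c('AK', 'WI'))  # "Alaska" "Wisconsin"
```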

Now we merge our translation data frame with the CMS data. The merge function lets us specify the column name to use as the key. By default it works like an inner join in SQL, duplicating entries as needed to fill the data frame. Multiple options give us precise control over the process; equivalents of left, right, and outer joins are also possible.

In [6]:
temp <- merge(cms_state, dfindex, by = 'State')

Out[6]:
|  | State | Measure_ID | Measure_Name | Score | Footnote | Measure_StartDate | Measure_EndDate | State_Name |
|---|---|---|---|---|---|---|---|---|
| 1 | AK | OP_10 | Abdomen CT Use of Contrast Material | 5.9 |  | 7/1/14 | 6/30/15 | Alaska |
| 2 | AK | OP_11 | Thorax CT Use of Contrast Material | 0.7 |  | 7/1/14 | 6/30/15 | Alaska |
| 3 | AK | OP_13 | Outpatients who got cardiac imaging stress tests before low-risk outpatient surgery | 4.5 |  | 7/1/14 | 6/30/15 | Alaska |
| 4 | AK | OP_14 | Outpatients with brain CT scans who got a sinus CT scan at the same time | 2 |  | 7/1/14 | 6/30/15 | Alaska |
| 5 | AK | OP_8 | MRI Lumbar Spine for Low Back Pain | 44.9 |  | 7/1/14 | 6/30/15 | Alaska |
| 7 | AL | OP_10 | Abdomen CT Use of Contrast Material | 13.6 |  | 7/1/14 | 6/30/15 | Alabama |

Note the new column on the far right with the full state name.
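
As a toy illustration of merge()'s inner-join behavior (the data below is made up):

```r
# Two toy tables sharing a 'State' key. merge() keeps only keys present in
# both inputs (an inner join) and duplicates rows as needed.
scores <- data.frame(State = c('AK', 'AK', 'AL', 'MD'),
                     Score = c(44.9, 8.8, 42.7, 40.0),
                     stringsAsFactors = FALSE)
lookup <- data.frame(State = c('AK', 'AL'),
                     State_Name = c('Alaska', 'Alabama'),
                     stringsAsFactors = FALSE)
merged <- merge(scores, lookup, by = 'State')
nrow(merged)  # 3: both AK rows match, AL matches, unmatched MD is dropped
# merge(scores, lookup, by = 'State', all.x = TRUE) would instead act as a
# left join and keep the MD row with NA for State_Name.
```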

We then merge this temporary dataset with the state data.

In [7]:
cms_state_merge <- merge(dem_state, temp, by = 'State_Name')

Out[7]:
1Alabama485897922.715.751.637652512380083.723.111.724.34351118.5ALOP_11Thorax CT Use of Contrast Material2.87/1/146/30/15
2Alabama485897922.715.751.637652512380083.723.111.724.34351118.5ALOP_14Outpatients with brain CT scans who got a sinus CT scan at the same time3.67/1/146/30/15
3Alabama485897922.715.751.637652512380083.723.111.724.34351118.5ALOP_8MRI Lumbar Spine for Low Back Pain42.77/1/146/30/15
4Alabama485897922.715.751.637652512380083.723.111.724.34351118.5ALOP_9Mammography Follow-up Rates7.47/1/146/30/15
5Alabama485897922.715.751.637652512380083.723.111.724.34351118.5ALOP_13Outpatients who got cardiac imaging stress tests before low-risk outpatient surgery5.67/1/146/30/15
6Alabama485897922.715.751.637652512380083.723.111.724.34351118.5ALOP_10Abdomen CT Use of Contrast Material13.67/1/146/30/15
7Alaska73843225.29.947.37037024630091.827.7818.97182910.3AKOP_10Abdomen CT Use of Contrast Material5.97/1/146/30/15

We are going to focus only on the lumbar spine MRI data, so we take a subset of the combined dataset limited to these values.

In [8]:
lowerback <- subset(cms_state_merge, cms_state_merge$Measure_ID == 'OP_8' & cms_state_merge$Score != 'Not Available')
length(unique(lowerback$State))
unique(lowerback$State)

Out[8]:
49
Out[8]:
1. AL
2. AK
3. AZ
4. AR
5. CA
6. CO
7. CT
8. DE
9. FL
10. GA
11. HI
12. ID
13. IL
14. IN
15. IA
16. KS
17. KY
18. LA
19. ME
20. MA
21. MI
22. MN
23. MS
24. MO
25. MT
26. NE
27. NV
28. NH
29. NJ
30. NM
31. NY
32. NC
33. ND
34. OH
35. OK
36. OR
37. PA
38. RI
39. SC
40. SD
41. TN
42. TX
43. UT
44. VT
45. VA
46. WA
47. WV
48. WI
49. WY

Unfortunately we've lost Maryland, but it did not have a lumbar spine MRI score in the file anyway.

In [9]:
head(lowerback, 5)

Out[9]:
|  | State_Name | (13 demographic columns) | State | Measure_ID | Measure_Name | Score | Measure_StartDate | Measure_EndDate |
|---|---|---|---|---|---|---|---|---|
| 3 | Alabama | … | AL | OP_8 | MRI Lumbar Spine for Low Back Pain | 42.7 | 7/1/14 | 6/30/15 |
| 10 | Alaska | … | AK | OP_8 | MRI Lumbar Spine for Low Back Pain | 44.9 | 7/1/14 | 6/30/15 |
| 16 | Arizona | … | AZ | OP_8 | MRI Lumbar Spine for Low Back Pain | 39.3 | 7/1/14 | 6/30/15 |
| 20 | Arkansas | … | AR | OP_8 | MRI Lumbar Spine for Low Back Pain | 43 | 7/1/14 | 6/30/15 |
| 27 | California | … | CA | OP_8 | MRI Lumbar Spine for Low Back Pain | 38.6 | 7/1/14 | 6/30/15 |

We can see that several of the columns in the data frame are either redundant (State) or not going to be used (Footnote). Let's remove them for simplicity.

In [10]:
lowerback$State <- NULL
lowerback$Measure_ID <- NULL
lowerback$Measure_Name <- NULL
lowerback$Footnote <- NULL
lowerback$Measure_StartDate <- NULL
lowerback$Measure_EndDate <- NULL
colnames(lowerback)

Out[10]:
1. 'State_Name'
2. 'Population_Estimate'
3. 'Persons.18...'
4. 'Persons.65...'
5. 'Female...'
6. 'Veterans'
7. 'Median_housing_value'
9. 'Bachelor...'
10. 'Disability_.65...'
11. 'No_health_insurance_.65...'
12. 'Commute_time.min.'
13. 'Median_household_income'
14. 'Poverty...'
15. 'Score'
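
Assigning NULL is the idiomatic way to delete data frame columns in place; a minimal toy example:

```r
# Assigning NULL removes a column from a data frame in place.
toy <- data.frame(a = 1:3, b = 4:6, c = 7:9)
toy$b <- NULL
colnames(toy)  # "a" "c"
```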

The last step in our data wrangling is to get rid of our state names as variables. Instead, we want them to be the rownames.

In [11]:
rownames(lowerback) <- lowerback$State_Name
lowerback$State_Name <- NULL

Out[11]:
|  | (13 demographic columns) | Score |
|---|---|---|
| Alabama | … | 42.7 |
| Alaska | … | 44.9 |
| Arizona | … | 39.3 |
| Arkansas | … | 43 |
| California | … | 38.6 |

## Machine Learning

Before we set the randomForest package loose on our data we are going to do one final calculation. The lumbar spine MRI score is the measure of interest, but we do not really have enough data for the model to predict a continuous variable. Instead, we will map the scale to a categorical (in our case, ordinal) variable by splitting the states into 'High' and 'Low' relative to the mean score. For simplicity, we then remove the original Score variable from the data frame to make using the model easier.

In [12]:
lowerback$Score <- as.numeric(levels(lowerback$Score))[lowerback$Score]
lowerback$ScoreHL <- lowerback$Score < mean(lowerback$Score)
lowerback$ScoreHL <- as.factor(ifelse(lowerback$ScoreHL == TRUE, 'High', 'Low'))
lowerback$Score <- NULL
lowerback$ScoreHL

Out[12]:
1. Low
2. Low
3. High
4. Low
5. High
6. Low
7. High
8. High
9. High
10. High
11. Low
12. Low
13. High
14. High
15. High
16. Low
17. High
18. Low
19. Low
20. High
21. High
22. Low
23. Low
24. Low
25. Low
26. High
27. High
28. High
29. High
30. Low
31. High
32. High
33. High
34. High
35. Low
36. Low
37. High
38. High
39. Low
40. High
41. High
42. Low
43. Low
44. High
45. High
46. Low
47. Low
48. High
49. Low
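
The same mapping can be sketched on a toy numeric vector (the four scores below are made up). Note that, as in the code above, a score below the mean ends up labeled 'High':

```r
# Binarize a numeric vector against its mean, as done for ScoreHL above.
scores <- c(44.9, 39.3, 42.7, 38.6)  # toy values; mean is 41.375
hl <- as.factor(ifelse(scores < mean(scores), 'High', 'Low'))
hl  # Low High Low High
```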

One of the keys to any machine learning workflow is using separate training and testing datasets. Here we split our full dataset, using ~75% of it for the training set.

In [15]:
idx <- runif(nrow(lowerback)) <= 0.75
data.train <- lowerback[idx,]
data.test <- lowerback[!idx,]
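
The split technique above can be sketched on toy data (the 100-row data frame is made up). Each row lands in the training set independently with probability 0.75, so the exact sizes vary from run to run unless the seed is fixed:

```r
# A reproducible ~75/25 split on toy data.
set.seed(42)
toy <- data.frame(x = 1:100)
idx <- runif(nrow(toy)) <= 0.75
data.train <- toy[idx, , drop = FALSE]
data.test  <- toy[!idx, , drop = FALSE]
nrow(data.train) + nrow(data.test)  # always 100; roughly a 75/25 split
```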


Now we are ready to run the random forest algorithm. The syntax is very similar to other types of modeling in R: we try to predict ScoreHL using all of the other variables in the data frame, where the period is shorthand for "all variables not already mentioned". We use the default number of trees, 500, though this can be changed.

In [16]:
rf <- randomForest(ScoreHL ~ ., data=data.train, ntree=500, importance=TRUE, na.action=na.omit)
rf
par(mar=c(5, 5, 4, 2))
plot(rf, lwd = 3, main = '', cex.lab = 2)
legend('topright', colnames(rf$err.rate), col = 1:3, fill = 1:3)

Out[16]:
Call:
 randomForest(formula = ScoreHL ~ ., data = data.train, ntree = 500, importance = TRUE, na.action = na.omit)
               Type of random forest: classification
                     Number of trees: 500
No. of variables tried at each split: 3

        OOB estimate of  error rate: 33.33%
Confusion matrix:
     High Low class.error
High   16   5   0.2380952
Low     8  10   0.4444444

We immediately notice that the model only did OK on the training data (an out-of-bag, or OOB, error rate of 33.33%). The plot demonstrates how the error changes as the number of trees increases; about 300 trees would probably give similar results.

Next, we feed the testing dataset into our model.

In [17]:
pred <- predict(rf, data.test)
data.frame('Actual' = data.test$ScoreHL, 'Prediction' = pred)

Out[17]:
|  | Actual | Prediction |
|---|---|---|
| Idaho | Low | Low |
| Michigan | High | High |
| New Jersey | High | High |
| New York | High | High |
| Oregon | Low | Low |
| South Dakota | High | Low |
| Wyoming | Low | Low |

Another way to view the model's accuracy on the testing data is to use a confusion matrix.
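
At its core, a confusion matrix is just a cross-tabulation of predictions against actual labels; a base-R sketch on made-up labels, without the caret package:

```r
# Build a confusion matrix with table() and compute accuracy from it.
actual <- factor(c('High', 'High', 'Low', 'Low', 'Low'))
pred   <- factor(c('High', 'Low',  'Low', 'Low', 'High'))
cm <- table(Prediction = pred, Reference = actual)
sum(diag(cm)) / sum(cm)  # accuracy: 3 of 5 correct = 0.6
```

caret's confusionMatrix() wraps this cross-tabulation and adds the accuracy, kappa, sensitivity, and related statistics.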

In [18]:
library(caret)
confusionMatrix(data = pred, reference = data.test$ScoreHL)

Out[18]:
Confusion Matrix and Statistics

          Reference
Prediction High Low
      High    3   0
      Low     3   4

               Accuracy : 0.7
                 95% CI : (0.3475, 0.9333)
    No Information Rate : 0.6
    P-Value [Acc > NIR] : 0.3823

                  Kappa : 0.4444
 Mcnemar's Test P-Value : 0.2482

            Sensitivity : 0.5000
            Specificity : 1.0000
         Pos Pred Value : 1.0000
         Neg Pred Value : 0.5714
             Prevalence : 0.6000
         Detection Rate : 0.3000
   Detection Prevalence : 0.3000
      Balanced Accuracy : 0.7500

       'Positive' Class : High

### Repeatability

An ensemble machine learning algorithm is a bit of a black box, and although we can get both an accuracy measure and an ordering of the input variables by importance (not performed here), we understand the performance of this type of algorithm better by running it multiple times. We will do this using different randomization seeds, which also varies the training and testing datasets, and store the accuracy each time the algorithm is run. At the end, the mean and standard deviation give us a better sense of how the algorithm performs.

In [19]:
numIt <- 1000
acc <- numeric(numIt)
for(i in 1:numIt){
    set.seed(i)
    idx <- runif(nrow(lowerback)) <= .75
    data.train <- lowerback[idx,]
    data.test <- lowerback[!idx,]
    rf <- randomForest(ScoreHL ~ ., data=data.train, ntree=500, importance=TRUE)
    pred <- predict(rf, data.test)
    cm <- confusionMatrix(data = pred, reference = data.test$ScoreHL)
    acc[i] <- cm$overall['Accuracy']
}
mean(acc)
sd(acc)

Out[19]:
0.6452
Out[19]:
0.1231

## Conclusion

In this quest we have applied one of the common machine learning tools to public data. The performance of the algorithm is not that great, though it is important to note that we only had 49 observations and 13 features (predictor variables). While random forest algorithms are better equipped to deal with this situation than many other machine learning tools, generalizing these results may not be valid.

In this quest we only scratched the surface of what the randomForest package can do. Additionally, other packages implementing machine learning tools are available within R. Future quests will explore some of these packages.

In [20]:
paste("Author:", Sys.getenv("USER"))
paste("Last edited:", Sys.time())
R.version.string

Out[20]:

Author: Joe Wildenberg
Last edited: 2016-11-04 13:31:15
R version 3.2.2 (2015-08-14)