For an introductory discussion of decision trees, please see this post.

In one of this site's first quests, Howard utilized the Medicare efficiency data for lumbar spine MRIs and mammography to produce some nifty visualizations.

In this quest we will use the random forest algorithm from the randomForest package to see if we can predict lumbar spine MRI utilization based on some state-specific data.

First, let's install and load the randomForest package.

In [1]:
#install.packages('randomForest')
library('randomForest')

Data Import

We will first import the two main datasets we will use in this quest. The first is the CMS efficiency data. Please see Howard's post for more information about where to download the dataset.

In [2]:
cms_data <- read.csv('csv/Outpatient_Imaging_Efficiency_-_State.csv')
head(cms_data, 7)
Out[2]:
Row | State | Measure_ID | Measure_Name | Score | Footnote | Measure_StartDate | Measure_EndDate
1 | AK | OP_10 | Abdomen CT Use of Contrast Material | 5.9 | | 7/1/14 | 6/30/15
2 | AK | OP_11 | Thorax CT Use of Contrast Material | 0.7 | | 7/1/14 | 6/30/15
3 | AK | OP_13 | Outpatients who got cardiac imaging stress tests before low-risk outpatient surgery | 4.5 | | 7/1/14 | 6/30/15
4 | AK | OP_14 | Outpatients with brain CT scans who got a sinus CT scan at the same time | 2 | | 7/1/14 | 6/30/15
5 | AK | OP_8 | MRI Lumbar Spine for Low Back Pain | 44.9 | | 7/1/14 | 6/30/15
6 | AK | OP_9 | Mammography Follow-up Rates | 8.8 | | 7/1/14 | 6/30/15
7 | AL | OP_10 | Abdomen CT Use of Contrast Material | 13.6 | | 7/1/14 | 6/30/15

Next we are going to load a bunch of demographic data for each state. This data came from the US Census QuickFacts site, though it had to be compiled and filtered manually since QuickFacts only allows six regions to be selected at a time. The spreadsheet of the data used in this quest can be downloaded here.

In [3]:
dem_state <- read.csv('csv/state_data.csv')
dim(dem_state)
head(dem_state, 7)
Out[3]:
  1. 50
  2. 14
Out[3]:
Row | State_Name | Population_Estimate | Persons.18... | Persons.65... | Female... | Veterans | Median_housing_value | High_school_graduate... | Bachelor... | Disability_.65... | No_health_insurance_.65... | Commute_time.min. | Median_household_income | Poverty...
1 | Wyoming | 586107 | 23.7 | 14.5 | 49 | 49465 | 189300 | 92.3 | 25.1 | 8.5 | 13.4 | 18.3 | 58252 | 11.1
2 | West Virginia | 1844128 | 20.6 | 18.2 | 50.6 | 155150 | 100200 | 84.4 | 18.7 | 14.4 | 7.2 | 25.6 | 41576 | 17.9
3 | Wisconsin | 5771337 | 22.4 | 15.6 | 50.3 | 395931 | 165900 | 90.8 | 27.4 | 8 | 6.6 | 21.8 | 52738 | 12.1
4 | Washington | 7170351 | 22.5 | 14.4 | 50 | 575746 | 257200 | 90.2 | 32.3 | 8.8 | 7.6 | 25.9 | 60294 | 12.2
5 | Vermont | 626042 | 19.2 | 17.6 | 50.7 | 46355 | 216200 | 91.6 | 35.2 | 9.8 | 4.6 | 22.3 | 54447 | 10.2
6 | Virginia | 8382993 | 22.3 | 14.2 | 50.8 | 718034 | 243500 | 87.9 | 35.8 | 7.6 | 10.5 | 27.8 | 64792 | 11.2
7 | Utah | 2995919 | 30.5 | 10.3 | 49.7 | 140942 | 212500 | 91 | 30.6 | 6.4 | 11.5 | 21.6 | 59846 | 11.3

You can see that there are 13 data points for each state (excluding the state name). Something to note, which will become important later, is that this dataset uses the full state name while the CMS data uses the two-letter abbreviation.

Data Wrangling

Following Howard's lead, we first wrangle the CMS data into a form that is easy to analyze, removing the US territories and Washington DC. Some of the syntax differs between R and Python, but we are performing the same operations. Note the all-important comma near the end of the statement that removes the values from cms_data: this trailing comma tells R to remove rows instead of columns.

In [4]:
cms_state <- cms_data[-c(which(cms_data$State == 'PR'),
                           which(cms_data$State == 'DC'),
                           which(cms_data$State == 'VI'),
                           which(cms_data$State == 'GU'),
                           which(cms_data$State == 'AS'),
                           which(cms_data$State == 'MP')),]
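
As an aside, the same filter can be written more compactly with the %in% operator. A minimal sketch (the excluded name is ours) that should produce an identical result:

# Equivalent row filter (a sketch): drop the territories and DC in one step
excluded <- c('PR', 'DC', 'VI', 'GU', 'AS', 'MP')
cms_state <- cms_data[!(cms_data$State %in% excluded), ]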

In order to combine the CMS and state data we will have to deal with the way each dataset stores the state descriptor. Fortunately R has a few built-in datasets, including both US state names and their abbreviations. Let's start by creating an index data frame that will allow us to translate between the full state name and the abbreviation.

In [5]:
dfindex <- data.frame('State' = state.abb, 'State_Name' = state.name)
head(dfindex, 5)
Out[5]:
Row | State | State_Name
1 | AL | Alabama
2 | AK | Alaska
3 | AZ | Arizona
4 | AR | Arkansas
5 | CA | California

Now we merge our translation data frame with the CMS data. The merge function allows us to name the column to use as the key. By default it works like an inner join in SQL, duplicating entries as needed to fill the data frame. Options give us precise control over the process; equivalents of left, right, and full outer joins are also possible, as sketched below.

In [6]:
temp <- merge(cms_state, dfindex, by = 'State')
head(temp, 7)
Out[6]:
Row | State | Measure_ID | Measure_Name | Score | Footnote | Measure_StartDate | Measure_EndDate | State_Name
1 | AK | OP_10 | Abdomen CT Use of Contrast Material | 5.9 | | 7/1/14 | 6/30/15 | Alaska
2 | AK | OP_11 | Thorax CT Use of Contrast Material | 0.7 | | 7/1/14 | 6/30/15 | Alaska
3 | AK | OP_13 | Outpatients who got cardiac imaging stress tests before low-risk outpatient surgery | 4.5 | | 7/1/14 | 6/30/15 | Alaska
4 | AK | OP_14 | Outpatients with brain CT scans who got a sinus CT scan at the same time | 2 | | 7/1/14 | 6/30/15 | Alaska
5 | AK | OP_8 | MRI Lumbar Spine for Low Back Pain | 44.9 | | 7/1/14 | 6/30/15 | Alaska
6 | AK | OP_9 | Mammography Follow-up Rates | 8.8 | | 7/1/14 | 6/30/15 | Alaska
7 | AL | OP_10 | Abdomen CT Use of Contrast Material | 13.6 | | 7/1/14 | 6/30/15 | Alabama

Note the new column on the far right with the full state name.
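
For reference, merge's all arguments control the join type. A quick sketch of the variants (the left/right/outer names are ours; not needed for this quest):

# Sketches of the other join types supported by merge (not run here)
left  <- merge(cms_state, dfindex, by = 'State', all.x = TRUE) # left join: keep all rows of cms_state
right <- merge(cms_state, dfindex, by = 'State', all.y = TRUE) # right join: keep all rows of dfindex
outer <- merge(cms_state, dfindex, by = 'State', all = TRUE)   # full outer join: keep all rows of both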

We then merge this temporary dataset with the state data.

In [7]:
cms_state_merge <- merge(dem_state, temp, by = 'State_Name')
head(cms_state_merge, 7)
Out[7]:
Row | State_Name | Population_Estimate | Persons.18... | Persons.65... | Female... | Veterans | Median_housing_value | High_school_graduate... | Bachelor... | Disability_.65... | ... | Commute_time.min. | Median_household_income | Poverty... | State | Measure_ID | Measure_Name | Score | Footnote | Measure_StartDate | Measure_EndDate
1 | Alabama | 4858979 | 22.7 | 15.7 | 51.6 | 376525 | 123800 | 83.7 | 23.1 | 11.7 | ... | 24.3 | 43511 | 18.5 | AL | OP_11 | Thorax CT Use of Contrast Material | 2.8 | | 7/1/14 | 6/30/15
2 | Alabama | 4858979 | 22.7 | 15.7 | 51.6 | 376525 | 123800 | 83.7 | 23.1 | 11.7 | ... | 24.3 | 43511 | 18.5 | AL | OP_14 | Outpatients with brain CT scans who got a sinus CT scan at the same time | 3.6 | | 7/1/14 | 6/30/15
3 | Alabama | 4858979 | 22.7 | 15.7 | 51.6 | 376525 | 123800 | 83.7 | 23.1 | 11.7 | ... | 24.3 | 43511 | 18.5 | AL | OP_8 | MRI Lumbar Spine for Low Back Pain | 42.7 | | 7/1/14 | 6/30/15
4 | Alabama | 4858979 | 22.7 | 15.7 | 51.6 | 376525 | 123800 | 83.7 | 23.1 | 11.7 | ... | 24.3 | 43511 | 18.5 | AL | OP_9 | Mammography Follow-up Rates | 7.4 | | 7/1/14 | 6/30/15
5 | Alabama | 4858979 | 22.7 | 15.7 | 51.6 | 376525 | 123800 | 83.7 | 23.1 | 11.7 | ... | 24.3 | 43511 | 18.5 | AL | OP_13 | Outpatients who got cardiac imaging stress tests before low-risk outpatient surgery | 5.6 | | 7/1/14 | 6/30/15
6 | Alabama | 4858979 | 22.7 | 15.7 | 51.6 | 376525 | 123800 | 83.7 | 23.1 | 11.7 | ... | 24.3 | 43511 | 18.5 | AL | OP_10 | Abdomen CT Use of Contrast Material | 13.6 | | 7/1/14 | 6/30/15
7 | Alaska | 738432 | 25.2 | 9.9 | 47.3 | 70370 | 246300 | 91.8 | 27.7 | 8 | ... | 18.9 | 71829 | 10.3 | AK | OP_10 | Abdomen CT Use of Contrast Material | 5.9 | | 7/1/14 | 6/30/15
(The ... column marks fields hidden in the notebook's truncated display.)

We are going to focus only on the lumbar spine MRI data (Measure_ID OP_8), so we take a subset of the combined dataset limited to those rows with an available score.

In [8]:
lowerback <- subset(cms_state_merge, Measure_ID == 'OP_8' & Score != 'Not Available')
length(unique(lowerback$State))
unique(lowerback$State)
Out[8]:
49
Out[8]:
  1. AL
  2. AK
  3. AZ
  4. AR
  5. CA
  6. CO
  7. CT
  8. DE
  9. FL
  10. GA
  11. HI
  12. ID
  13. IL
  14. IN
  15. IA
  16. KS
  17. KY
  18. LA
  19. ME
  20. MA
  21. MI
  22. MN
  23. MS
  24. MO
  25. MT
  26. NE
  27. NV
  28. NH
  29. NJ
  30. NM
  31. NY
  32. NC
  33. ND
  34. OH
  35. OK
  36. OR
  37. PA
  38. RI
  39. SC
  40. SD
  41. TN
  42. TX
  43. UT
  44. VT
  45. VA
  46. WA
  47. WV
  48. WI
  49. WY

Unfortunately we've lost Maryland, but its lumbar spine MRI score was not available in the file anyway.
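
We can verify which state dropped out against R's built-in vector of abbreviations. A quick sketch:

# Which of the 50 states is missing from the subset? (should return 'MD')
setdiff(state.abb, unique(lowerback$State))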

In [9]:
head(lowerback, 5)
Out[9]:
Row | State_Name | Population_Estimate | Persons.18... | Persons.65... | Female... | Veterans | Median_housing_value | High_school_graduate... | Bachelor... | Disability_.65... | ... | Commute_time.min. | Median_household_income | Poverty... | State | Measure_ID | Measure_Name | Score | Footnote | Measure_StartDate | Measure_EndDate
3 | Alabama | 4858979 | 22.7 | 15.7 | 51.6 | 376525 | 123800 | 83.7 | 23.1 | 11.7 | ... | 24.3 | 43511 | 18.5 | AL | OP_8 | MRI Lumbar Spine for Low Back Pain | 42.7 | | 7/1/14 | 6/30/15
10 | Alaska | 738432 | 25.2 | 9.9 | 47.3 | 70370 | 246300 | 91.8 | 27.7 | 8 | ... | 18.9 | 71829 | 10.3 | AK | OP_8 | MRI Lumbar Spine for Low Back Pain | 44.9 | | 7/1/14 | 6/30/15
16 | Arizona | 6828065 | 23.8 | 16.4 | 50.3 | 514290 | 162900 | 85.9 | 27.1 | 8 | ... | 24.7 | 49928 | 17.4 | AZ | OP_8 | MRI Lumbar Spine for Low Back Pain | 39.3 | | 7/1/14 | 6/30/15
20 | Arkansas | 2978204 | 23.7 | 16 | 50.9 | 229261 | 108700 | 84.3 | 20.6 | 12.2 | ... | 21.4 | 41264 | 19.1 | AR | OP_8 | MRI Lumbar Spine for Low Back Pain | 43 | | 7/1/14 | 6/30/15
27 | California | 39144818 | 23.3 | 13.3 | 50.3 | 1840366 | 371400 | 81.5 | 31 | 6.7 | ... | 27.6 | 61489 | 15.3 | CA | OP_8 | MRI Lumbar Spine for Low Back Pain | 38.6 | | 7/1/14 | 6/30/15

We can see that several of the columns in the data frame are either redundant (State) or will not be used (Footnote). Let's remove them for simplicity.

In [10]:
lowerback$State <- NULL
lowerback$Measure_ID <- NULL
lowerback$Measure_Name <- NULL
lowerback$Footnote <- NULL
lowerback$Measure_StartDate <- NULL
lowerback$Measure_EndDate <- NULL
colnames(lowerback)
Out[10]:
  1. 'State_Name'
  2. 'Population_Estimate'
  3. 'Persons.18...'
  4. 'Persons.65...'
  5. 'Female...'
  6. 'Veterans'
  7. 'Median_housing_value'
  8. 'High_school_graduate...'
  9. 'Bachelor...'
  10. 'Disability_.65...'
  11. 'No_health_insurance_.65...'
  12. 'Commute_time.min.'
  13. 'Median_household_income'
  14. 'Poverty...'
  15. 'Score'
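
For reference, the same cleanup can be done in a single step by indexing out the unwanted columns. A minimal sketch (the drop_cols name is ours):

# One-step equivalent of the six NULL assignments above (a sketch)
drop_cols <- c('State', 'Measure_ID', 'Measure_Name', 'Footnote',
               'Measure_StartDate', 'Measure_EndDate')
lowerback <- lowerback[, !(colnames(lowerback) %in% drop_cols)]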

The last step in our data wrangling is to stop carrying the state names as a variable. Instead, we want them to be the row names.

In [11]:
rownames(lowerback) <- lowerback$State_Name
lowerback$State_Name <- NULL
head(lowerback, 5)
Out[11]:
State | Population_Estimate | Persons.18... | Persons.65... | Female... | Veterans | Median_housing_value | High_school_graduate... | Bachelor... | Disability_.65... | No_health_insurance_.65... | Commute_time.min. | Median_household_income | Poverty... | Score
Alabama | 4858979 | 22.7 | 15.7 | 51.6 | 376525 | 123800 | 83.7 | 23.1 | 11.7 | 11.9 | 24.3 | 43511 | 18.5 | 42.7
Alaska | 738432 | 25.2 | 9.9 | 47.3 | 70370 | 246300 | 91.8 | 27.7 | 8 | 16.4 | 18.9 | 71829 | 10.3 | 44.9
Arizona | 6828065 | 23.8 | 16.4 | 50.3 | 514290 | 162900 | 85.9 | 27.1 | 8 | 12.8 | 24.7 | 49928 | 17.4 | 39.3
Arkansas | 2978204 | 23.7 | 16 | 50.9 | 229261 | 108700 | 84.3 | 20.6 | 12.2 | 11.1 | 21.4 | 41264 | 19.1 | 43
California | 39144818 | 23.3 | 13.3 | 50.3 | 1840366 | 371400 | 81.5 | 31 | 6.7 | 9.7 | 27.6 | 61489 | 15.3 | 38.6

Machine Learning

Before we set randomForest loose on our data we are going to do one final transformation. The lumbar spine MRI score is the measure of interest, but we do not really have enough data for the model to predict a continuous variable. Instead, we will map the score to a categorical (here, binary) variable by splitting the states at the mean: states scoring below the mean are labeled 'High' and the rest 'Low' (for this CMS measure, a lower score indicates more appropriate imaging use). For simplicity, we then remove the original Score variable from the data frame to make using the model easier.

In [12]:
# Score was read in as a factor; convert it back to numeric via its levels
lowerback$Score <- as.numeric(levels(lowerback$Score))[lowerback$Score]
# Label states with below-mean scores 'High' and the rest 'Low'
lowerback$ScoreHL <- lowerback$Score < mean(lowerback$Score)
lowerback$ScoreHL <- as.factor(ifelse(lowerback$ScoreHL, 'High', 'Low'))
lowerback$Score <- NULL
lowerback$ScoreHL
Out[12]:
  1. Low
  2. Low
  3. High
  4. Low
  5. High
  6. Low
  7. High
  8. High
  9. High
  10. High
  11. Low
  12. Low
  13. High
  14. High
  15. High
  16. Low
  17. High
  18. Low
  19. Low
  20. High
  21. High
  22. Low
  23. Low
  24. Low
  25. Low
  26. High
  27. High
  28. High
  29. High
  30. Low
  31. High
  32. High
  33. High
  34. High
  35. Low
  36. Low
  37. High
  38. High
  39. Low
  40. High
  41. High
  42. Low
  43. Low
  44. High
  45. High
  46. Low
  47. Low
  48. High
  49. Low

One of the keys to running any machine learning algorithm is using separate training and testing datasets. Here we split our full dataset, using roughly 75% of the states for the training set.

In [15]:
# Assign each row to the training set with probability 0.75
# (call set.seed first if you want a reproducible split)
idx <- runif(nrow(lowerback)) <= 0.75
data.train <- lowerback[idx,]
data.test <- lowerback[!idx,]

Now we are ready to run the random forest algorithm. The syntax is very similar to other types of modeling in R. We will predict ScoreHL using all of the other variables in the data frame; the period in the formula is shorthand for every variable not already mentioned. We use the package default of 500 trees, though this can be changed.

In [16]:
rf <- randomForest(ScoreHL ~ ., data=data.train, ntree=500, importance=TRUE, na.action=na.omit)
rf
par(mar=c(5, 5, 4, 2))
plot(rf, lwd = 3, main = '', cex.lab = 2)
legend('topright', colnames(rf$err.rate), col = 1:3, fill = 1:3)
Out[16]:
Call:
 randomForest(formula = ScoreHL ~ ., data = data.train, ntree = 500,      importance = TRUE, na.action = na.omit) 
               Type of random forest: classification
                     Number of trees: 500
No. of variables tried at each split: 3

        OOB estimate of  error rate: 33.33%
Confusion matrix:
     High Low class.error
High   16   5   0.2380952
Low     8  10   0.4444444

We immediately notice that the model only did OK with respect to the training data: the out-of-bag (OOB) error estimate, computed for each state using only the trees that did not sample it, is 33%. The plot demonstrates how the error changes as trees are added; we could probably use about 300 trees and get similar results.
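
To check that, one could refit with fewer trees and compare the OOB error. A quick sketch (the rf300 name is ours, and results will vary with the random split):

# Refit with 300 trees; the printed OOB error should be similar (a sketch)
rf300 <- randomForest(ScoreHL ~ ., data = data.train, ntree = 300, na.action = na.omit)
rf300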

Next, we take our model and input the testing dataset.

In [17]:
pred <- predict(rf, data.test)
data.frame('Actual' = data.test$ScoreHL, 'Prediction' = pred)
Out[17]:
State | Actual | Prediction
Alaska | Low | Low
Idaho | Low | Low
Michigan | High | High
Nebraska | High | Low
Nevada | High | Low
New Jersey | High | High
New York | High | High
Oregon | Low | Low
South Dakota | High | Low
Wyoming | Low | Low

Another way to view the model's accuracy on the testing data is a confusion matrix, here generated with the caret package.

In [18]:
library(caret)
confusionMatrix(data = pred, reference = data.test$ScoreHL)
Out[18]:
Confusion Matrix and Statistics

          Reference
Prediction High Low
      High    3   0
      Low     3   4
                                          
               Accuracy : 0.7             
                 95% CI : (0.3475, 0.9333)
    No Information Rate : 0.6             
    P-Value [Acc > NIR] : 0.3823          
                                          
                  Kappa : 0.4444          
 Mcnemar's Test P-Value : 0.2482          
                                          
            Sensitivity : 0.5000          
            Specificity : 1.0000          
         Pos Pred Value : 1.0000          
         Neg Pred Value : 0.5714          
             Prevalence : 0.6000          
         Detection Rate : 0.3000          
   Detection Prevalence : 0.3000          
      Balanced Accuracy : 0.7500          
                                          
       'Positive' Class : High            
                                          

Repeatability

An ensemble machine learning algorithm is a bit of a black box, and although we can get both an accuracy measure and an ordering of the input variables by importance (not performed here, but sketched below), we understand the performance of this type of algorithm better by running it multiple times. We will do this using different randomization seeds, which also varies the training and testing datasets.
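
For reference, since we fit with importance=TRUE, the variable importance measures can be pulled straight from a fitted forest. A quick sketch, not run in this quest:

# Variable importance from a fitted forest (not run here)
importance(rf)  # mean decrease in accuracy and in Gini impurity per predictor
varImpPlot(rf)  # dot chart of the same importance measures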

We store the accuracy from each run of the algorithm. At the end, the mean and standard deviation give us a better understanding of how the algorithm performs.

In [19]:
numIt <- 1000
acc <- numeric(numIt)
for(i in 1:numIt){
    set.seed(i)
    idx <- runif(nrow(lowerback)) <= .75
    data.train <- lowerback[idx,]
    data.test <- lowerback[!idx,]
    rf <- randomForest(ScoreHL ~ ., data=data.train, ntree=500, importance=TRUE)
    pred <- predict(rf, data.test)
    cm <- confusionMatrix(data = pred, reference = data.test$ScoreHL)
    acc[i] <- cm$overall['Accuracy']
}
mean(acc)
sd(acc)
Out[19]:
0.6452
Out[19]:
0.1231
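
Beyond the mean and standard deviation, we could also look at the full distribution of test-set accuracies. A minimal sketch:

# Histogram of the accuracy across the 1000 runs (a sketch)
hist(acc, breaks = 20, main = '', xlab = 'Test-set accuracy')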

Conclusion

In this quest we have applied one of the common machine learning tools to public data. The performance of the algorithm is not that great, though it is important to note that we had only 49 observations and 13 features (predictor variables). While random forests handle this situation better than many other machine learning tools, these results may not generalize.

In this quest we only scratched the surface of what the randomForest package can do. Other R packages also implement machine learning tools; future quests will explore some of them.

In [20]:
paste("Author:", Sys.getenv("USER"))
paste("Last edited:", Sys.time())
R.version.string
Out[20]:

Author: Joe Wildenberg
Last edited: 2016-11-04 13:31:15
R version 3.2.2 (2015-08-14)