For an introductory discussion of decision trees please see this post.

In one of this site's first quests, Howard utilized the Medicare efficiency data for lumbar spine MRIs and mammography to produce some nifty visualizations.

In this quest we will use the random forest algorithm implemented in the randomForest package to see whether we can predict lumbar spine MRI utilization from some state-specific data.

First, let's load the randomForest package (installing it first if necessary).

In [1]:

```
# Install once if needed, then load the package
#install.packages('randomForest')
library(randomForest)
```

We will first import the two main datasets we will use in this quest. The first is the CMS efficiency data. Please see Howard's post for more information about where to download the dataset.

In [2]:

```
cms_data <- read.csv('csv/Outpatient_Imaging_Efficiency_-_State.csv')
head(cms_data, 7)
```

Out[2]:

In [3]:

```
dem_state <- read.csv('csv/state_data.csv')
dim(dem_state)
head(dem_state, 7)
```

Out[3]:

Out[3]:

Following Howard's lead, we first wrangle the CMS data into a form that is easy to analyze, removing the US territories and Washington, DC. Some of the syntax differs between R and Python, but we are performing the same operations. Note the all-important comma near the end of the statement that removes the values from cms_data: this comma tells R to drop rows rather than columns.
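The row-versus-column distinction can be seen on a toy data frame (hypothetical values, not from the CMS file):

```r
# Hypothetical toy data frame, just to illustrate the indexing rules
df <- data.frame(State = c('PR', 'NY', 'DC', 'OH'), Score = c(10, 20, 30, 40))

# With the trailing comma: drop the ROWS whose State matches
rows_dropped <- df[-which(df$State %in% c('PR', 'DC')), ]

# Without the comma: list-style indexing selects COLUMNS instead
col_selected <- df[1]
```

Here `rows_dropped` keeps the NY and OH rows with all columns, while `col_selected` is a one-column data frame containing only `State`.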

In [4]:

```
# Drop the territories and DC by negating a membership test on State;
# the trailing comma keeps all columns
cms_state <- cms_data[!cms_data$State %in% c('PR', 'DC', 'VI', 'GU', 'AS', 'MP'), ]
```

In [5]:

```
dfindex <- data.frame('State' = state.abb, 'State_Name' = state.name)
head(dfindex, 5)
```

Out[5]:

In [6]:

```
temp <- merge(cms_state, dfindex, by = 'State')
head(temp, 7)
```

Out[6]:

Note the new column on the far right with the full state name.

We then merge this temporary dataset with the state data.
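merge() joins two data frames on a shared key column. A minimal sketch with hypothetical toy frames:

```r
# Hypothetical toy frames keyed on the shared 'State' column
scores <- data.frame(State = c('NY', 'OH', 'TX'), Score = c(35, 42, 39))
full_names <- data.frame(State = c('NY', 'OH', 'TX'),
                         State_Name = c('New York', 'Ohio', 'Texas'))

# Inner join: rows are matched wherever the key values agree
merged <- merge(scores, full_names, by = 'State')
```

Rows whose key appears in only one frame are dropped by default, which is why states missing from either dataset disappear after the merge.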

In [7]:

```
cms_state_merge <- merge(dem_state, temp, by = 'State_Name')
head(cms_state_merge, 7)
```

Out[7]:

In [8]:

```
lowerback <- subset(cms_state_merge, Measure_ID == 'OP_8' & Score != 'Not Available')
length(unique(lowerback$State))
unique(lowerback$State)
```

Out[8]:

Out[8]:

Unfortunately we've lost Maryland, but it had no score in the file anyway.

In [9]:

```
head(lowerback, 5)
```

Out[9]:

In [10]:

```
lowerback$State <- NULL
lowerback$Measure_ID <- NULL
lowerback$Measure_Name <- NULL
lowerback$Footnote <- NULL
lowerback$Measure_StartDate <- NULL
lowerback$Measure_EndDate <- NULL
colnames(lowerback)
```

Out[10]:

In [11]:

```
rownames(lowerback) <- lowerback$State_Name
lowerback$State_Name <- NULL
head(lowerback, 5)
```

Out[11]:

Before we set the randomForest package loose on our data we perform one final transformation. The lumbar spine MRI score is the measure of interest, but we do not really have enough data for the model to predict a continuous variable. Instead, we will map the score to a categorical (in our case, ordinal) variable, separating the states into 'High' and 'Low' relative to the mean score. For simplicity we then remove the original Score variable from the data frame.

In [12]:

```
# Convert the Score factor to its numeric values
lowerback$Score <- as.numeric(levels(lowerback$Score))[lowerback$Score]
# Label scores at or above the mean 'High', the rest 'Low'
lowerback$ScoreHL <- as.factor(ifelse(lowerback$Score >= mean(lowerback$Score), 'High', 'Low'))
lowerback$Score <- NULL
lowerback$ScoreHL
```

Out[12]:
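This binarization step can be sketched on its own with hypothetical scores, labelling at-or-above-mean values 'High':

```r
# Hypothetical scores, not from the CMS data
score <- c(30, 45, 38, 52, 41)

# Mean is 41.2, so 45 and 52 map to 'High', the rest to 'Low'
score_hl <- factor(ifelse(score >= mean(score), 'High', 'Low'))
```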

In [15]:

```
idx <- runif(nrow(lowerback)) <= 0.75
data.train <- lowerback[idx,]
data.test <- lowerback[!idx,]
```
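The runif-based split assigns each row to the training set independently with probability 0.75, so the two sets are disjoint and together cover every row. A toy sketch (hypothetical data frame):

```r
set.seed(42)  # fix the seed so the random split is reproducible
toy <- data.frame(x = 1:20)  # hypothetical 20-row data frame

idx <- runif(nrow(toy)) <= 0.75   # TRUE for roughly 75% of rows
train <- toy[idx, , drop = FALSE]
test  <- toy[!idx, , drop = FALSE]
# Each row lands in exactly one of the two sets
```

Note the split sizes vary from run to run; with only ~49 rows, the test set can be quite small.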

In [16]:

```
rf <- randomForest(ScoreHL ~ ., data=data.train, ntree=500, importance=TRUE, na.action=na.omit)
rf
par(mar=c(5, 5, 4, 2))
plot(rf, lwd = 3, main = '', cex.lab = 2)
legend('topright', colnames(rf$err.rate), col = 1:3, fill = 1:3)
```

Out[16]:

We immediately notice from the out-of-bag (OOB) error estimate that the model only did OK on the training data. The plot shows how the error changes as the number of trees grows; about 300 trees would likely give similar results.

Next, we take our model and input the testing dataset.

In [17]:

```
pred <- predict(rf, data.test)
data.frame('Actual' = data.test$ScoreHL, 'Prediction' = pred)
```

Out[17]:

Another way to view the model's accuracy on the testing data is to use a confusion matrix.

In [18]:

```
library(caret)
confusionMatrix(data = pred, reference = data.test$ScoreHL)
```

Out[18]:
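At its core, a confusion matrix is just a cross-tabulation of predicted versus actual labels, from which accuracy follows directly. A base R sketch with hypothetical labels (no caret required):

```r
# Hypothetical actual and predicted labels
actual <- factor(c('High', 'Low', 'High', 'Low', 'Low'), levels = c('High', 'Low'))
pred   <- factor(c('High', 'Low', 'Low',  'Low', 'Low'), levels = c('High', 'Low'))

# Cross-tabulate predictions against the truth
cm <- table(Prediction = pred, Actual = actual)

# Accuracy = correct predictions (the diagonal) over all predictions
accuracy <- sum(diag(cm)) / sum(cm)
```

caret's confusionMatrix adds sensitivity, specificity, and other statistics on top of this table.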

An ensemble machine learning algorithm is a bit of a black box. Although we can get both an accuracy measure and a ranking of the input variables by importance (not shown here), we get a better picture of this kind of algorithm's performance by running it multiple times, using different randomization seeds and varying the training and testing datasets.

We store the accuracy each time the algorithm is run. At the end we can look at the mean and standard deviation to give us a better understanding of how this algorithm performs.

In [19]:

```
numIt <- 1000
acc <- numeric(numIt)
for (i in 1:numIt) {
  set.seed(i)
  idx <- runif(nrow(lowerback)) <= 0.75
  data.train <- lowerback[idx, ]
  data.test <- lowerback[!idx, ]
  rf <- randomForest(ScoreHL ~ ., data = data.train, ntree = 500, importance = TRUE)
  pred <- predict(rf, data.test)
  cm <- confusionMatrix(data = pred, reference = data.test$ScoreHL)
  acc[i] <- cm$overall['Accuracy']
}
mean(acc)
sd(acc)
mean(acc)
sd(acc)
```

Out[19]:

Out[19]:
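As a point of comparison, the same repeated-split pattern can be sketched with a hypothetical majority-class baseline (base R only, no modelling), which always predicts the most common training label:

```r
# Hypothetical class labels standing in for the 49 states' High/Low scores
labels <- factor(rep(c('High', 'Low'), times = c(20, 29)))

set.seed(1)
numIt <- 100
acc <- numeric(numIt)
for (i in 1:numIt) {
  idx <- runif(length(labels)) <= 0.75
  train <- labels[idx]
  test  <- labels[!idx]
  # Baseline: predict the most common training label for every test case
  majority <- names(which.max(table(train)))
  acc[i] <- mean(test == majority)
}
mean(acc)
sd(acc)
```

A random forest whose mean accuracy does not clearly beat this kind of baseline is not learning much from the features.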

In this quest we applied one of the common machine learning tools to public data. The algorithm's performance was not great, though it is important to note that we had only 49 observations and 13 features (predictor variables). While random forests handle this situation better than many other machine learning tools, these results may not generalize.

We have only scratched the surface of what randomForest can do, and other R packages also implement machine learning tools. Future quests will explore some of them.

In [20]:

```
paste("Author:", Sys.getenv("USER"))
paste("Last edited:", Sys.time())
R.version.string
```

Out[20]: