If you're using Anaconda, you will have already have access to almost every necessary package necessary for this task. The notable exception is dicom
, to do this, the easiest way is using pip
from the command line:
pip install pydicom
To perform 3D plotting, we are using the free version of plot.ly
in offline mode which uses WebGL to make visualization interactive. plotly
and scikit-image
can be installed using conda:
conda install plotly
conda install scikit-image
We will be using features from scikit-image 0.13 or above, which may require building from source. Instructions are here. Check your version with this command:
python -c "import skimage; print skimage.__version__"
If you're using Python v3.x, then you'd want to use the appropriate print
syntax:
python -c "import skimage; print(skimage.__version__)"
Finally, you need a DICOM image stack. For this exercise, we are using Kaggle's Data Science Bowl 2017 dataset.
Some of the code used here are adapted from Kaggle contributors (such as Guido Zuidhorf and Booze Allen Hamilton's data team) who generously share their work. I have made modifications to clarify what happens at each step using more visuals and additional or simplified code.
%reload_ext signature
%matplotlib inline
import numpy as np
import dicom
import os
import matplotlib.pyplot as plt
from glob import glob
from mpl_toolkits.mplot3d.art3d import Poly3DCollection
import scipy.ndimage
from skimage import morphology
from skimage import measure
from skimage.transform import resize
from sklearn.cluster import KMeans
from plotly import __version__
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
from plotly.tools import FigureFactory as FF
from plotly.graph_objs import *
init_notebook_mode(connected=True)
Then, let's specify a specific DICOM study we can take a closer look. Let's take a look at a chest CT stack from Kaggle which contains a lung cancer.
The whole dataset is 140GB unzipped, but each examination is only 70MB or so.
Here we'll use the patient ID 5267ea7baf6332f29163064aecf6e443
from that dataset, which has been labeled as positive for lung cancer.
data_path = "/data/LungCancer-data/stage1/train/cancer/5267ea7baf6332f29163064aecf6e443/"
output_path = working_path = "/home/howard/Documents/"
g = glob(data_path + '/*.dcm')
# Print out the first 5 file names to verify we're in the right folder.
print ("Total of %d DICOM images.\nFirst 5 filenames:" % len(g))
print '\n'.join(g[:5])
Here we make two helper functions.
load_scan
will load all DICOM images from a folder into a list for manipulation. get_pixels_hu
converts raw values into Houndsfeld units#
# Loop over the image files and store everything into a list.
#
def load_scan(path):
slices = [dicom.read_file(path + '/' + s) for s in os.listdir(path)]
slices.sort(key = lambda x: int(x.InstanceNumber))
try:
slice_thickness = np.abs(slices[0].ImagePositionPatient[2] - slices[1].ImagePositionPatient[2])
except:
slice_thickness = np.abs(slices[0].SliceLocation - slices[1].SliceLocation)
for s in slices:
s.SliceThickness = slice_thickness
return slices
def get_pixels_hu(scans):
image = np.stack([s.pixel_array for s in scans])
# Convert to int16 (from sometimes int16),
# should be possible as values should always be low enough (<32k)
image = image.astype(np.int16)
# Set outside-of-scan pixels to 1
# The intercept is usually -1024, so air is approximately 0
image[image == -2000] = 0
# Convert to Hounsfield units (HU)
intercept = scans[0].RescaleIntercept
slope = scans[0].RescaleSlope
if slope != 1:
image = slope * image.astype(np.float64)
image = image.astype(np.int16)
image += np.int16(intercept)
return np.array(image, dtype=np.int16)
id=0
patient = load_scan(data_path)
imgs = get_pixels_hu(patient)
This is a good time to save the new data set to disk so we don't have to reprocess the stack every time.
np.save(output_path + "fullimages_%d.npy" % (id), imgs)
The first thing we should do is to check to see whether the Houndsfeld Units are properly scaled and represented.
HU's are useful because it is standardized across all CT scans regardless of the absolute number of photons the scanner detector captured. If you need a refresher, here's a quick list of a few useful ones, sourced from Wikipedia.
Substance | HU |
---|---|
Air | −1000 |
Lung | −500 |
Fat | −100 to −50 |
Water | 0 |
Blood | +30 to +70 |
Muscle | +10 to +40 |
Liver | +40 to +60 |
Bone | +700 (cancellous bone) to +3000 (cortical bone) |
Let's now create a histogram of all the voxel data in the study.
file_used=output_path+"fullimages_%d.npy" % id
imgs_to_process = np.load(file_used).astype(np.float64)
plt.hist(imgs_to_process.flatten(), bins=50, color='c')
plt.xlabel("Hounsfield Units (HU)")
plt.ylabel("Frequency")
plt.show()
The histogram suggests the following:
This observation means that we will need to do significant preprocessing if we want to process lesions in the lung tissue because only a tiny bit of the voxels represent lung.
More interestingly, what's the deal with that bar at -2000? Air really only goes to -1000, so there must be some sort of artifact.
Let's take a look at the actual images.
We don't have a lot of screen real estate, so we'll be skipping every 3 slices to get a representative look at the study.
id = 0
imgs_to_process = np.load(output_path+'fullimages_{}.npy'.format(id))
def sample_stack(stack, rows=6, cols=6, start_with=10, show_every=3):
fig,ax = plt.subplots(rows,cols,figsize=[12,12])
for i in range(rows*cols):
ind = start_with + i*show_every
ax[int(i/rows),int(i % rows)].set_title('slice %d' % ind)
ax[int(i/rows),int(i % rows)].imshow(stack[ind],cmap='gray')
ax[int(i/rows),int(i % rows)].axis('off')
plt.show()
sample_stack(imgs_to_process)