Getting Ready

If you're using Anaconda, you will have already have access to almost every necessary package necessary for this task. The notable exception is dicom, to do this, the easiest way is using pip from the command line:

pip install pydicom

To perform 3D plotting, we are using the free version of in offline mode which uses WebGL to make visualization interactive. plotly and scikit-image can be installed using conda:

conda install plotly

conda install scikit-image

We will be using features from scikit-image 0.13 or above, which may require building from source. Instructions are here. Check your version with this command:

python -c "import skimage; print skimage.__version__"

If you're using Python v3.x, then you'd want to use the appropriate print syntax:

python -c "import skimage; print(skimage.__version__)"

Finally, you need a DICOM image stack. For this exercise, we are using Kaggle's Data Science Bowl 2017 dataset.

Some of the code used here are adapted from Kaggle contributors (such as Guido Zuidhorf and Booze Allen Hamilton's data team) who generously share their work. I have made modifications to clarify what happens at each step using more visuals and additional or simplified code.

Import Packages

In [1]:
%reload_ext signature
%matplotlib inline

import numpy as np
import dicom
import os
import matplotlib.pyplot as plt
from glob import glob
from mpl_toolkits.mplot3d.art3d import Poly3DCollection
import scipy.ndimage
from skimage import morphology
from skimage import measure
from skimage.transform import resize
from sklearn.cluster import KMeans
from plotly import __version__
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
from import FigureFactory as FF
from plotly.graph_objs import *

Then, let's specify a specific DICOM study we can take a closer look. Let's take a look at a chest CT stack from Kaggle which contains a lung cancer.

The whole dataset is 140GB unzipped, but each examination is only 70MB or so.

Here we'll use the patient ID 5267ea7baf6332f29163064aecf6e443 from that dataset, which has been labeled as positive for lung cancer.

In [29]:
data_path = "/data/LungCancer-data/stage1/train/cancer/5267ea7baf6332f29163064aecf6e443/"
output_path = working_path = "/home/howard/Documents/"
g = glob(data_path + '/*.dcm')

# Print out the first 5 file names to verify we're in the right folder.
print ("Total of %d DICOM images.\nFirst 5 filenames:" % len(g))
print '\n'.join(g[:5])
Total of 145 DICOM images.
First 5 filenames:

Helper Functions

Here we make two helper functions.

  • load_scan will load all DICOM images from a folder into a list for manipulation.
  • The voxel values in the images are raw. get_pixels_hu converts raw values into Houndsfeld units
    • The transformation is linear. Therefore, so long as you have a slope and an intercept, you can rescale a voxel value to HU.
    • Both the rescale intercept and rescale slope are stored in the DICOM header at the time of image acquisition (these values are scanner-dependent, so you will need external information).
In [3]:
# Loop over the image files and store everything into a list.

def load_scan(path):
    slices = [dicom.read_file(path + '/' + s) for s in os.listdir(path)]
    slices.sort(key = lambda x: int(x.InstanceNumber))
        slice_thickness = np.abs(slices[0].ImagePositionPatient[2] - slices[1].ImagePositionPatient[2])
        slice_thickness = np.abs(slices[0].SliceLocation - slices[1].SliceLocation)
    for s in slices:
        s.SliceThickness = slice_thickness
    return slices

def get_pixels_hu(scans):
    image = np.stack([s.pixel_array for s in scans])
    # Convert to int16 (from sometimes int16), 
    # should be possible as values should always be low enough (<32k)
    image = image.astype(np.int16)

    # Set outside-of-scan pixels to 1
    # The intercept is usually -1024, so air is approximately 0
    image[image == -2000] = 0
    # Convert to Hounsfield units (HU)
    intercept = scans[0].RescaleIntercept
    slope = scans[0].RescaleSlope
    if slope != 1:
        image = slope * image.astype(np.float64)
        image = image.astype(np.int16)
    image += np.int16(intercept)
    return np.array(image, dtype=np.int16)

patient = load_scan(data_path)
imgs = get_pixels_hu(patient)

This is a good time to save the new data set to disk so we don't have to reprocess the stack every time.

In [4]: + "fullimages_%d.npy" % (id), imgs)

Displaying Images

The first thing we should do is to check to see whether the Houndsfeld Units are properly scaled and represented.

HU's are useful because it is standardized across all CT scans regardless of the absolute number of photons the scanner detector captured. If you need a refresher, here's a quick list of a few useful ones, sourced from Wikipedia.

Substance HU
Air −1000
Lung −500
Fat −100 to −50
Water 0
Blood +30 to +70
Muscle +10 to +40
Liver +40 to +60
Bone +700 (cancellous bone) to +3000 (cortical bone)

Let's now create a histogram of all the voxel data in the study.

In [5]:
file_used=output_path+"fullimages_%d.npy" % id
imgs_to_process = np.load(file_used).astype(np.float64) 

plt.hist(imgs_to_process.flatten(), bins=50, color='c')
plt.xlabel("Hounsfield Units (HU)")

Critiquing the Histogram

The histogram suggests the following:

  • There is lots of air
  • There is some lung
  • There's an abundance of soft tissue, mostly muscle, liver, etc, but there's also some fat.
  • There is only a small bit of bone (seen as a tiny sliver of height between 700-3000)

This observation means that we will need to do significant preprocessing if we want to process lesions in the lung tissue because only a tiny bit of the voxels represent lung.

More interestingly, what's the deal with that bar at -2000? Air really only goes to -1000, so there must be some sort of artifact.

Let's take a look at the actual images.

Displaying an Image Stack

We don't have a lot of screen real estate, so we'll be skipping every 3 slices to get a representative look at the study.

In [45]:
id = 0
imgs_to_process = np.load(output_path+'fullimages_{}.npy'.format(id))

def sample_stack(stack, rows=6, cols=6, start_with=10, show_every=3):
    fig,ax = plt.subplots(rows,cols,figsize=[12,12])
    for i in range(rows*cols):
        ind = start_with + i*show_every
        ax[int(i/rows),int(i % rows)].set_title('slice %d' % ind)
        ax[int(i/rows),int(i % rows)].imshow(stack[ind],cmap='gray')
        ax[int(i/rows),int(i % rows)].axis('off')