NLP Basics – Preparing Radiology Reports for Tokenization

We all want order in life, but working with radiology reports computationally is sometimes like giving an Easter bunny to a baby: it gets really messy really fast.

A metaphor where chocolate, babies, and NLP somehow all fit in the same sentence.  Source:

Indeed, radiology reports are structurally messy. There is a lot of structured metadata that tells you about the examination, but ultimately the radiologist’s knowledge is encoded in one long stream of text. Depending on the practice, reports are often organized in as many ways as there are radiologists in the practice.

Natural language processing (NLP) is the art of turning this mess into insight.

Diagnostic radiology reports are considered unstructured data, and one of the first steps to gain insight from any diagnostic radiology report is to figure out its structure.  With an annotated structure, the radiology report is like a box of chocolates: when you know what you’re gonna get, it’s just much better.

Annotated box of chocolates. Sorry, Ma Gump – sometimes you do wanna know what you’re gonna get.  Source: Collegehumor

In this quest, we will go over techniques you can employ in data analytics projects to automatically extract the history, findings, and impression (or any other section) from a report. This is the first step towards being able to analyze specific sections of the text. You can download the CSV file used in this quest and follow along.
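To make the idea of section extraction concrete, here is a minimal sketch in Python. It assumes that each section starts on its own line with a header word followed by a colon (for example `FINDINGS:`), which will not hold for every practice. The header list, the `split_sections` helper, and the sample report are all made up for illustration; they are not taken from the quest's CSV file.

```python
import re

# Hypothetical section headers; real reports vary by practice and radiologist.
SECTION_HEADERS = ("history", "findings", "impression")

# Match a header at the start of a line, followed by a colon, ignoring case.
HEADER_PATTERN = re.compile(
    r"^[ \t]*(" + "|".join(SECTION_HEADERS) + r")[ \t]*:",
    flags=re.IGNORECASE | re.MULTILINE,
)


def split_sections(report_text):
    """Return a dict mapping each header found to the text that follows it."""
    matches = list(HEADER_PATTERN.finditer(report_text))
    sections = {}
    for i, match in enumerate(matches):
        name = match.group(1).lower()
        start = match.end()
        # A section runs until the next header, or to the end of the report.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(report_text)
        sections[name] = report_text[start:end].strip()
    return sections


# Made-up example report for illustration only.
sample_report = """HISTORY: Cough for two weeks.
FINDINGS: The lungs are clear. No pleural effusion.
IMPRESSION: No acute cardiopulmonary abnormality."""

sections = split_sections(sample_report)
print(sections["impression"])
```

A report with none of the assumed headers simply yields an empty dictionary, which is a convenient signal that the report needs a different set of headers or a different approach.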

To follow along, set up your computer using the following Python tutorials: Alternatively, start a free Jupyter notebook from Azure Notebooks.
Howard Chen
(Howard) Po-Hao Chen, MD MBA is a radiology chief resident at the Hospital of the University of Pennsylvania. He has an interest in data-driven radiology, quality improvement, and innovation.

Howard will finish training with fellowships in musculoskeletal radiology and nuclear medicine at the University of Pennsylvania in June 2018.
