Get Sample Reports

I've created some example abdominal CT reports that highlight problems commonly encountered when processing radiology reports.

In [1]:
import pandas as pd
df = pd.read_csv("csv/ct-samples.csv")
df.head()
Out[1]:
ReportText
0 History: Epigastric pain. <br><br>Technique: ...
1 Impression:<br><br>1.\t No obstructing renal s...
2 Indication: History of renal stones, with vomi...
3 History: Right lower quadrant tenderness to pa...
4 IMPRESSION:<br>Unremarkable CT scan of the abd...
In [2]:
df['ReportText'][0]
Out[2]:
"History: Epigastric pain. <br><br>Technique:  CT examination of the abdomen and pelvis was performed from the domes of the diaphragms to the symphysis pubis after administration of oral and 100 mL of Isovue 370 contrast.<br><br>Comparison:  No prior CTs available for comparison. Correlation is made to limited ultrasound of the abdomen from the same date, 6/30/2016.<br><br>Findings: <br><br>There is dependent atelectasis bilaterally. There is no pleural effusion. The heart is normal in size, without pericardial effusion.<br><br>CT Abdomen: <br><br>The liver is normal in size, and without suspicious focal mass or biliary dilatation.  The gallbladder appears normal. The spleen appears normal. The pancreas appears normal.  The adrenal glands appear normal.<br><br>There is no hydronephrosis bilaterally. There is no radiopaque calculus bilaterally. There are two subcentimeter hypodense lesions in the left kidney, one in the upper pole (coronal image 83) and one in the lower pole (coronal image 77)  , too small to characters, but likely cysts.<br><br>The abdominal aorta is normal in caliber without aneurysm.  The inferior vena cava appears normal. <br><br>The partially opacified bowel is grossly normal without obstruction or free air.   There are no signs of appendicitis in the right lower quadrant.<br><br>There is no ascites. No fluid collections. <br><br>There are small retroperitoneal lymph nodes which do not meet size criteria for pathologic enlargement. <br><br>CT Pelvis: <br>The urinary bladder appears normal. The patient is status post hysterectomy. Note is made of calcification the left aspect of the vagina. There is no ovarian mass.<br><br>There are degenerative changes in the spine. There is no fracture. No suspicious lytic or blastic lesion.  The soft tissues appear normal.<br><br>IMPRESSION:<br><br>1. No CT findings are seen to explain the patient's epigastric pain.<br><br>"
In [3]:
df['ReportText'][1]
Out[3]:
'Impression:<br><br>1.\t No obstructing renal stones or hydronephrosis. Stable non-obstructing 1 mm left lower pole renal calculus.<br><br><br><br>Clinical Indication: History of renal stones with flank pain and vomiting<br><br>Technique: CT examination of the abdomen and pelvis was performed from the domes of the diaphragm to the symphysis pubis without intravenous contrast. Oral contrast was not administered.<br><br>Comparison:  None.<br><br>Findings: Lung Bases: There is dependent atelectasis in visualized lower lungs, noting platelike atelectasis in right base. There is a 2 mm calcified nodule in the lingula, unchanged from 10/2015 (image 8/3). No large pleural effusions. Coronary calcifications are seen. Bilateral gynecomastia. The heart is normal in size. No pericardial effusions. <br><br>CT Abdomen: Evaluation of the solid organs is limited without intravenous contrast.<br>Liver: The liver is normal in size, without suspicious focal mass or biliary dilatation. <br>Gallbladder: The gallbladder appears normal.<br>Pancreas: The pancreas appears normal. <br>Spleen: The spleen appears normal.<br>Adrenal glands: The adrenal glands appear normal.<br>Kidneys: There is unchanged appearance of a 1.9 x 1.6 cm left renal cyst.  Stable appearance of a exophytic 5 mm right lower pole lesion, too small to characterize. There are stable appearance of a non-obstructing left lower pole 1 mm renal calculus (image 62/4B). There is no hydronephrosis. There is no right renal calculi. There is no perinephric stranding. No hydroureter.<br><br>Bowel: There is a hiatal hernia.  Otherwise, the visualized bowel is unremarkable without obstruction or free air.   The appendix appears normal, noting a appendicolith in the distal portion unchanged.<br>Vessels: The abdominal aorta is normal in caliber without aneurysm.  The inferior vena cava appears normal. <br>Ascites: There is no ascites. No fluid collections.  <br>Adenopathy: No intra-abdominal or retroperitoneal adenopathy is seen. 
<br><br>CT Pelvis: <br>Bladder: The urinary bladder appears normal. <br>Prostate: The prostate is mildly enlarged. The seminal vesicles are unremarkable.  No pelvic adenopathy or pelvic mass is identified.<br><br>Bones: There are degenerative changes of the thoracolumbar spine.  Unchanged appearance of an old healed fracture of the posterior right seventh rib. There are no suspicious osseous lesions. <br>Peritoneum/Soft tissues: There is a moderate fat-containing left inguinal hernia. Unchanged appearance of a right inguinal hernia repair plug.<br><br>'

Cleaning the Data

Here we can gain some preliminary understanding of the dataset. The Quest is about basic natural language processing rather than statistical analysis, so there are only 5 records in the dataset.

Additionally, wherever a line break is expected, the text contains a <br> tag instead. This may look strange for sample reports, but it makes sense because storing line breaks inside CSV fields is awkward.

Displaying the Report Properly

Computers don't care about how line breaks are presented, but humans do. Let's redisplay the report text so that we can read the reports properly.

This can be done by treating the text as HTML.

In [4]:
from IPython.display import HTML
HTML(df['ReportText'][0])
Out[4]:
History: Epigastric pain.

Technique: CT examination of the abdomen and pelvis was performed from the domes of the diaphragms to the symphysis pubis after administration of oral and 100 mL of Isovue 370 contrast.

Comparison: No prior CTs available for comparison. Correlation is made to limited ultrasound of the abdomen from the same date, 6/30/2016.

Findings:

There is dependent atelectasis bilaterally. There is no pleural effusion. The heart is normal in size, without pericardial effusion.

CT Abdomen:

The liver is normal in size, and without suspicious focal mass or biliary dilatation. The gallbladder appears normal. The spleen appears normal. The pancreas appears normal. The adrenal glands appear normal.

There is no hydronephrosis bilaterally. There is no radiopaque calculus bilaterally. There are two subcentimeter hypodense lesions in the left kidney, one in the upper pole (coronal image 83) and one in the lower pole (coronal image 77) , too small to characters, but likely cysts.

The abdominal aorta is normal in caliber without aneurysm. The inferior vena cava appears normal.

The partially opacified bowel is grossly normal without obstruction or free air. There are no signs of appendicitis in the right lower quadrant.

There is no ascites. No fluid collections.

There are small retroperitoneal lymph nodes which do not meet size criteria for pathologic enlargement.

CT Pelvis:
The urinary bladder appears normal. The patient is status post hysterectomy. Note is made of calcification the left aspect of the vagina. There is no ovarian mass.

There are degenerative changes in the spine. There is no fracture. No suspicious lytic or blastic lesion. The soft tissues appear normal.

IMPRESSION:

1. No CT findings are seen to explain the patient's epigastric pain.

In [5]:
HTML(df['ReportText'][1])
Out[5]:
Impression:

1. No obstructing renal stones or hydronephrosis. Stable non-obstructing 1 mm left lower pole renal calculus.



Clinical Indication: History of renal stones with flank pain and vomiting

Technique: CT examination of the abdomen and pelvis was performed from the domes of the diaphragm to the symphysis pubis without intravenous contrast. Oral contrast was not administered.

Comparison: None.

Findings: Lung Bases: There is dependent atelectasis in visualized lower lungs, noting platelike atelectasis in right base. There is a 2 mm calcified nodule in the lingula, unchanged from 10/2015 (image 8/3). No large pleural effusions. Coronary calcifications are seen. Bilateral gynecomastia. The heart is normal in size. No pericardial effusions.

CT Abdomen: Evaluation of the solid organs is limited without intravenous contrast.
Liver: The liver is normal in size, without suspicious focal mass or biliary dilatation.
Gallbladder: The gallbladder appears normal.
Pancreas: The pancreas appears normal.
Spleen: The spleen appears normal.
Adrenal glands: The adrenal glands appear normal.
Kidneys: There is unchanged appearance of a 1.9 x 1.6 cm left renal cyst. Stable appearance of a exophytic 5 mm right lower pole lesion, too small to characterize. There are stable appearance of a non-obstructing left lower pole 1 mm renal calculus (image 62/4B). There is no hydronephrosis. There is no right renal calculi. There is no perinephric stranding. No hydroureter.

Bowel: There is a hiatal hernia. Otherwise, the visualized bowel is unremarkable without obstruction or free air. The appendix appears normal, noting a appendicolith in the distal portion unchanged.
Vessels: The abdominal aorta is normal in caliber without aneurysm. The inferior vena cava appears normal.
Ascites: There is no ascites. No fluid collections.
Adenopathy: No intra-abdominal or retroperitoneal adenopathy is seen.

CT Pelvis:
Bladder: The urinary bladder appears normal.
Prostate: The prostate is mildly enlarged. The seminal vesicles are unremarkable. No pelvic adenopathy or pelvic mass is identified.

Bones: There are degenerative changes of the thoracolumbar spine. Unchanged appearance of an old healed fracture of the posterior right seventh rib. There are no suspicious osseous lesions.
Peritoneum/Soft tissues: There is a moderate fat-containing left inguinal hernia. Unchanged appearance of a right inguinal hernia repair plug.
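
Outside a notebook, where IPython's HTML display is not available, a plain-text alternative is to turn the tags into real newlines. The following is a minimal sketch; the helper name `br_to_newlines` is made up for illustration and is not part of the notebook.

```python
import re


def br_to_newlines(text):
    """Replace <br>, <br/> and <br /> tags (in any case) with newline characters."""
    return re.sub(r'<br\s*/?>', '\n', text, flags=re.IGNORECASE)


# A shortened stand-in for one report, not the full sample text.
sample = "History: Epigastric pain. <br><br>Technique: CT examination of the abdomen and pelvis."
plain = br_to_newlines(sample)
print(plain)
```

Printing `plain` shows the two sections separated by a blank line, much like the HTML rendering above.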

Reporting Idiosyncrasies

Clearly, the reports are heterogeneously structured. Here are a few structural characteristics that are immediately notable:

  • Sometimes the clinical information section is called "History," sometimes it's called "Clinical Indication."
  • The ordering of the sections is not consistent. Sometimes Impression is at the top, sometimes it's at the bottom.
  • The capitalization patterns of section headings are inconsistent. For example, IMPRESSION and Impression.
  • Section headings are often on their own lines, but not always. In the 2nd report, for example, Findings simply begins a line that continues with text. We also need to differentiate report section headings from organ labels (e.g. "Bowel" is not a section the same way "Technique" is).
  • Sometimes organ systems are labeled, sometimes the anatomic structure is implied by the sentences.

Let's tackle these issues.

Before applying any learning algorithm, it's always beneficial to make the algorithm's job easier. The best way to do this is by dividing the report text into sections.

Report Text Segmentation

Although section titles vary widely in capitalization, order, and wording, a section title should always be at the start of a line. While we're at it, let's also remove the problem of capitalization by converting all the words to lower case.

The following code will separate the report text into individual lines and convert all words to lower case.

In [46]:
import numpy
df['ReportText_line'] = df['ReportText'].apply(lambda x: x.lower().split('<br>'))
df['ReportText_line'][0]
Out[46]:
['history: epigastric pain. ',
 '',
 'technique:  ct examination of the abdomen and pelvis was performed from the domes of the diaphragms to the symphysis pubis after administration of oral and 100 ml of isovue 370 contrast.',
 '',
 'comparison:  no prior cts available for comparison. correlation is made to limited ultrasound of the abdomen from the same date, 6/30/2016.',
 '',
 'findings: ',
 '',
 'there is dependent atelectasis bilaterally. there is no pleural effusion. the heart is normal in size, without pericardial effusion.',
 '',
 'ct abdomen: ',
 '',
 'the liver is normal in size, and without suspicious focal mass or biliary dilatation.  the gallbladder appears normal. the spleen appears normal. the pancreas appears normal.  the adrenal glands appear normal.',
 '',
 'there is no hydronephrosis bilaterally. there is no radiopaque calculus bilaterally. there are two subcentimeter hypodense lesions in the left kidney, one in the upper pole (coronal image 83) and one in the lower pole (coronal image 77)  , too small to characters, but likely cysts.',
 '',
 'the abdominal aorta is normal in caliber without aneurysm.  the inferior vena cava appears normal. ',
 '',
 'the partially opacified bowel is grossly normal without obstruction or free air.   there are no signs of appendicitis in the right lower quadrant.',
 '',
 'there is no ascites. no fluid collections. ',
 '',
 'there are small retroperitoneal lymph nodes which do not meet size criteria for pathologic enlargement. ',
 '',
 'ct pelvis: ',
 'the urinary bladder appears normal. the patient is status post hysterectomy. note is made of calcification the left aspect of the vagina. there is no ovarian mass.',
 '',
 'there are degenerative changes in the spine. there is no fracture. no suspicious lytic or blastic lesion.  the soft tissues appear normal.',
 '',
 'impression:',
 '',
 "1. no ct findings are seen to explain the patient's epigastric pain.",
 '',
 '']
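
The output above keeps trailing spaces and many empty strings. If those get in the way, one option is to strip each line and optionally drop the blanks before any further processing. This is only a sketch on a shortened toy string; `split_report_lines` and its `drop_blank` flag are hypothetical names, and dropping blanks changes the line indices, so it would have to happen before any header indexing.

```python
def split_report_lines(report_text, drop_blank=False):
    """Lower-case a report, split it on <br> tags and strip each resulting line."""
    lines = [line.strip() for line in report_text.lower().split('<br>')]
    if drop_blank:
        # Keep only lines that still contain text after stripping.
        lines = [line for line in lines if line]
    return lines


# A shortened toy report, not one of the sample reports.
sample = "History: Epigastric pain. <br><br>Findings: <br><br>There is no ascites. <br><br>"
all_lines = split_report_lines(sample)
text_lines = split_report_lines(sample, drop_blank=True)
print(all_lines)
print(text_lines)
```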

Segmenting by Section Header

We will try to detect the text belonging to History, Findings, and Impression and assign each to a new DataFrame column. This is useful because it provides a way to programmatically point an algorithm to the "Findings" or "Impression" section of a report without having to worry about all the variability.

The following code will:

  • Create a list of all valid section headers
  • Detect when these headers show up in the report and record the index of their occurrences.
  • Create a sorted list of these occurrences.
In [47]:
import operator

sections = ['clinical indication',
            'clinical history',
            'history',
            'clinical information',
            'indication',
            'technique',
            'comparison',
            'comments',
            'findings',
            'comment',
            'impression',
            ]


def section_index(txt_stripped):
    # Map each detected section header to the index of the line it appears on.
    sec_ind = {}

    for ind, val in enumerate(txt_stripped):
        words = val.strip().split()
        for s in sections:
            n_words = len(s.split())
            # Criteria for "this is a section header":
            #   Line must start with the phrase
            #   Must either: be the only words on that line, or the phrase ends with ":"
            if val.strip().startswith(s) and (
                    len(words) == n_words or words[n_words - 1].endswith(':')):
                sec_ind[s] = ind
                break  # a line holds at most one header, so stop at the first match
    return sorted(sec_ind.items(), key=operator.itemgetter(1))

df['section_index'] = df['ReportText_line'].apply(lambda x: section_index(x))

df['section_index'][0]
Out[47]:
[('history', 0),
 ('technique', 2),
 ('comparison', 4),
 ('findings', 6),
 ('impression', 29)]
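
To see the header criteria in isolation, here is a small sketch that applies a similar test to a few toy lines. The function `is_section_header` is a hypothetical helper and is slightly stricter than the notebook's check: the phrase must be followed immediately by ":" or by nothing at all.

```python
def is_section_header(line, phrase):
    """True if `line` starts with `phrase` and the phrase stands alone or is followed by ':'."""
    stripped = line.strip()
    if not stripped.startswith(phrase):
        return False
    rest = stripped[len(phrase):]
    return rest == '' or rest.startswith(':')


# (line, phrase) pairs modeled on the sample reports.
examples = [
    ('history: epigastric pain.', 'history'),       # header followed by text
    ('history of renal stones', 'history'),         # ordinary sentence, no colon
    ('impression', 'impression'),                   # header alone on its line
    ('bowel: there is a hiatal hernia.', 'findings'),  # organ label, not a listed section
    ('findings: lung bases: there is dependent atelectasis', 'findings'),
]
results = [is_section_header(line, phrase) for line, phrase in examples]
print(results)
```

Only lines whose leading phrase is in the list of valid section headers can match, which is how organ labels such as "bowel:" are kept out of the section index.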

Assign Section Text to New Columns

Finally, using the sorted list of section headers and their line numbers, we determine which lines of text belong to which section and assign them to their respective variables. We then join the individual lines back into a single block of text using the standard '\n' character. This will allow us to Tokenize based on

For instance, the impression text would be assigned to ReportText_impression.

In [32]:
def find_impression(dataframe):
    return find_section(dataframe, 'impression')


def find_findings(dataframe):
    return find_section(dataframe, 'findings')

def find_history(dataframe):
    # There are many ways to say "history," so make sure we include all of them.
    possib = ['clinical indication',
              'clinical history',
              'history',
              'indication',
              ]
    f = ''
    for i in possib:
        f = find_section(dataframe, i)
        if len(f) > 0:
            break
    return f

def find_section(dataframe, sec):
    # Create list of sorted tuples in order of index for each section.  For example
    # [('history', 4),
    # ('technique', 51),
    # ('comparison', 83),
    # ('findings', 93),
    # ('impression', 450)]
    txt = dataframe['section_index']

    # The Findings would just be EITHER: index of "findings" to end
    # (if it's last one), or index of 'findings' to index of the next item.
    start, end = find_text(txt, sec)
    if start < 0:
        # Section not present in this report.
        return ''
    return '\n'.join(dataframe['ReportText_line'][start:end]).strip()


def find_text(sorted_txt, seek):
    # Return (start, end) line indices for the section `seek`.
    # end is None for the last section, so the slice runs to the end of the
    # report; start is -1 when the section was not found.
    for i, (name, index) in enumerate(sorted_txt):
        if name == seek:
            if i + 1 == len(sorted_txt):
                return (index, None)
            return (index, sorted_txt[i + 1][1])
    return (-1, None)

df['ReportText_impression'] = df.apply(find_impression, axis=1)
df['ReportText_history'] = df.apply(find_history, axis=1)
df['ReportText_findings'] = df.apply(find_findings, axis=1)
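
The same slicing idea can be sketched without pandas by collecting every detected section into a dictionary in one pass. This is an illustrative variant, not the notebook's method: `split_sections` is a hypothetical helper, and the lines and index below are made-up toy data in the same shape as `ReportText_line` and `section_index`.

```python
def split_sections(lines, section_index):
    """Map each section name to its text, given (name, line index) pairs sorted by line index."""
    out = {}
    for pos, (name, start) in enumerate(section_index):
        # A section ends where the next header starts, or at the end of the report.
        end = section_index[pos + 1][1] if pos + 1 < len(section_index) else None
        out[name] = '\n'.join(lines[start:end]).strip()
    return out


# Toy data for illustration only.
lines = ['history: epigastric pain. ', '', 'findings: ', '',
         'there is no ascites. ', '', 'impression:', '', '1. no acute findings.', '']
index = [('history', 0), ('findings', 2), ('impression', 6)]
sections_text = split_sections(lines, index)
print(sections_text['findings'])
```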

Fruit of Our Labor

Let's see how we did. Let's display the findings of report #0 and the impression of report #1. We'll also display all the clinical history in the sample data, just to be sure we did okay despite the variable naming schemes and positioning of these sections.

Notice that the code below will temporarily replace '\n' with the HTML tag <br> only to facilitate display on the screen.

In [43]:
HTML(df['ReportText_findings'][0].replace('\n', '<br>'))
Out[43]:
findings:

there is dependent atelectasis bilaterally. there is no pleural effusion. the heart is normal in size, without pericardial effusion.

ct abdomen:

the liver is normal in size, and without suspicious focal mass or biliary dilatation. the gallbladder appears normal. the spleen appears normal. the pancreas appears normal. the adrenal glands appear normal.

there is no hydronephrosis bilaterally. there is no radiopaque calculus bilaterally. there are two subcentimeter hypodense lesions in the left kidney, one in the upper pole (coronal image 83) and one in the lower pole (coronal image 77) , too small to characters, but likely cysts.

the abdominal aorta is normal in caliber without aneurysm. the inferior vena cava appears normal.

the partially opacified bowel is grossly normal without obstruction or free air. there are no signs of appendicitis in the right lower quadrant.

there is no ascites. no fluid collections.

there are small retroperitoneal lymph nodes which do not meet size criteria for pathologic enlargement.

ct pelvis:
the urinary bladder appears normal. the patient is status post hysterectomy. note is made of calcification the left aspect of the vagina. there is no ovarian mass.

there are degenerative changes in the spine. there is no fracture. no suspicious lytic or blastic lesion. the soft tissues appear normal.
In [44]:
HTML(df['ReportText_impression'][1].replace('\n', '<br>'))
Out[44]:
impression:

1. no obstructing renal stones or hydronephrosis. stable non-obstructing 1 mm left lower pole renal calculus.
In [57]:
for i, j in enumerate(df['ReportText_history']):
    print('Report #' + str(i), j)
Report #0 history: epigastric pain.
Report #1 clinical indication: history of renal stones with flank pain and vomiting
Report #2 indication: history of renal stones, with vomiting and fevers
Report #3 history: right lower quadrant tenderness to palpation
Report #4 history: 20-year-old female with no significant past medical history, presents with 4 days of suprapubic abdominal pain, nausea, vomiting. no fever, chills.

Conclusion

Looks like we did fine on separating out the individual sections.

To further separate out the sub-sections, you simply have to repeat the process. For instance, you may find it helpful to actually have a "liver findings" section versus a "bowel findings" section. The problem with identifying sub-sections is that these are not always well labeled.
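
As a rough sketch of what repeating the process could look like for labeled sub-sections, the snippet below treats a short label followed by ":" at the start of a line as a sub-section header. The pattern `SUBSECTION` and the helper `split_subsections` are assumptions made for illustration; they only work for reports that label organs explicitly (as report #1 does), and a sentence that happens to contain an early colon would be mistaken for a label.

```python
import re

# Assumed pattern: a short label of lower-case letters, spaces or "/" at the
# start of a line, followed by ":" and the rest of the line.
SUBSECTION = re.compile(r'^([a-z][a-z /]{1,30}):\s*(.*)$')


def split_subsections(findings_text):
    """Group lines of a lower-cased findings block under their leading labels."""
    subsections = {}
    current = None
    for raw_line in findings_text.split('\n'):
        line = raw_line.strip()
        match = SUBSECTION.match(line)
        if match:
            current = match.group(1)
            subsections[current] = match.group(2)
        elif current is not None and line:
            # Unlabeled text is appended to the most recent sub-section.
            subsections[current] = (subsections[current] + ' ' + line).strip()
    return subsections


# Lines modeled on report #1, shortened for illustration.
findings = ("liver: the liver is normal in size.\n"
            "gallbladder: the gallbladder appears normal.\n"
            "bowel: there is a hiatal hernia.")
organs = split_subsections(findings)
print(organs['bowel'])
```

Reports that only imply the anatomy in their sentences, like report #0, would need a different approach.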

In the next post, we will investigate the concept of tokenization and its role in radiology report processing.

In [58]:
%reload_ext signature
%signature
Out[58]:
Author: Howard Chen • Last edited: October 07, 2016
Linux 3.10.0-327.22.2.el7.x86_64 - CPython 3.5.1 - IPython 4.2.0 - matplotlib 1.5.1 - numpy 1.11.0 - pandas 0.18.1