The first step is to download all the PDF versions of the RSNA 2016 program. You can do this by going to the RSNA conference website and clicking on "Additional Program Content."

# Extract Text from the PDF

This step is relatively straightforward: we just invoke the command-line tool pdftotext. If you're a Linux user, then lucky you! The tool is a simple installation command away (on some distributions it's part of the poppler-utils package):

apt-get install poppler-utils
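
Before running the conversion, it can help to confirm that the tool is actually on your PATH. The following is a minimal sketch using only the standard library; the helper name `tool_available` is our own and not part of any package.

```python
import shutil

def tool_available(name):
    """Return True if an executable called `name` is found on the PATH."""
    return shutil.which(name) is not None

# Warn early instead of failing later inside the conversion step
if not tool_available("pdftotext"):
    print("pdftotext not found; install poppler-utils (or your platform's equivalent) first.")
```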



## Python Code

In [1]:
import subprocess

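def convert_to_txt(pdf_file):
    # Run pdftotext on pdf/<file> and write the text to program/<file>.txt
    result = subprocess.call(['pdftotext', 'pdf/' + pdf_file, 'program/' + pdf_file + '.txt'])

    # An exit code of 0 means the conversion succeeded
    if result == 0:
        print(pdf_file, "Success!")
    else:
        print("Error: ", result)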

convert_to_txt('Saturday.pdf')
convert_to_txt('Sunday.pdf')
convert_to_txt('Monday.pdf')
convert_to_txt('Tuesday.pdf')
convert_to_txt('Wednesday.pdf')
convert_to_txt('Thursday.pdf')
convert_to_txt('Friday.pdf')

Saturday.pdf Success!
Sunday.pdf Success!
Monday.pdf Success!
Tuesday.pdf Success!
Wednesday.pdf Success!
Thursday.pdf Success!
Friday.pdf Success!
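
The seven calls above could also be written as a loop. This is only a sketch of that idea, assuming the same `pdf/` and `program/` folders; the names `DAYS`, `build_command`, and `convert_all` are hypothetical helpers introduced here for illustration.

```python
import subprocess

DAYS = ["Saturday", "Sunday", "Monday", "Tuesday", "Wednesday", "Thursday", "Friday"]

def build_command(pdf_file, pdf_dir="pdf", out_dir="program"):
    """Build the pdftotext argument list for one PDF (same paths as above)."""
    return ["pdftotext", f"{pdf_dir}/{pdf_file}", f"{out_dir}/{pdf_file}.txt"]

def convert_all(days=DAYS):
    """Run pdftotext once per day and report each exit code."""
    for day in days:
        pdf_file = day + ".pdf"
        result = subprocess.call(build_command(pdf_file))
        print(pdf_file, "Success!" if result == 0 else f"Error: {result}")
```

Calling `convert_all()` should behave like the seven separate calls, provided pdftotext is installed and both folders exist.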


## Verify Import

The code says "Success!" but we still should verify that the text is extracted correctly.

Let's just print out the first 20 lines of the Saturday program and see.

In [14]:
with open("program/Saturday.pdf.txt") as myfile:
    # Read the first 20 lines of the file
    head = [next(myfile) for _ in range(20)]

for line in head:
    print(line.strip())

Saturday

SPPH01

AAPM Medical Physics Tutorial Session 1
Saturday, Nov. 26 12:00PM - 2:00PM Room: E351

CT

PH

AMA PRA Category 1 Credits ™: 2.00
ARRT Category A+ Credits: 2.00
Participants

Thaddeus A. Wilson, PhD, Memphis, TN (Moderator) Nothing to Disclose
Sub-Events
SPPH01A

Fundamentals of CT


Looks like things worked as expected.
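
If you want to repeat this check for the other days, a small helper might be handy. The sketch below uses `itertools.islice`, which simply stops early when a file has fewer lines than requested. The function name `preview` is our own, and reading the file as UTF-8 is an assumption about how pdftotext wrote it.

```python
import os
from itertools import islice

def preview(path, n=20):
    """Return up to the first n lines of a text file, stripped of surrounding whitespace."""
    with open(path, encoding="utf-8", errors="replace") as f:
        return [line.strip() for line in islice(f, n)]

# Only print a preview if the extracted file is actually there
sample_path = "program/Sunday.pdf.txt"
if os.path.exists(sample_path):
    for line in preview(sample_path):
        print(line)
```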

# Creating a Word Cloud

A popular way of extracting meaning from a body of text is to keep track of how often each word appears.

A word cloud uses these counts to visualize the text: the more frequently a word appears, the bigger it is drawn.
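
To make the counting idea concrete, here is a rough sketch using `collections.Counter`. The sample sentence is made up, and the simple regular expression is our own simplification; it is not necessarily how the wordcloud library splits text into words.

```python
import re
from collections import Counter

def word_frequencies(text):
    """Count how often each word appears, ignoring case."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(words)

# Made-up sample text, for illustration only
sample = "CT of the chest. MRI of the brain. CT of the abdomen."
counts = word_frequencies(sample)

# The most frequent words would be drawn largest in a word cloud
print(counts.most_common(3))
```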

Here's an example:

## Library

We'll use the wordcloud library available here.

This can be installed with a single command on Linux, Windows, or Mac OS X. If you have Anaconda installed based on our recommendations, then it should work just fine:

pip install wordcloud



Now let's make a word cloud!

In [24]:
%matplotlib inline
from os import path
from wordcloud import WordCloud
import matplotlib.pyplot as plt

d = 'program'
days = ['Saturday', 'Sunday', 'Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday']

# Read each day's extracted text and join the days with a space
# so that words at file boundaries don't run together
texts = []
for day in days:
    with open(path.join(d, day + '.pdf.txt')) as f:
        texts.append(f.read())
text = ' '.join(texts)

wordcloud = WordCloud(background_color="white", stopwords=[], width=1280, height=960).generate(text)

plt.figure(figsize=(16,9))
plt.imshow(wordcloud)
plt.axis("off")
plt.show()


## Stop Words

This is kind of fun, but there are a few things that don't quite look right. Specifically, a lot of screen real estate is taken up by words that carry little meaning: in, for, to, the, etc. Indeed, the fact that "the" is the most frequently used word in the RSNA program has almost no value because it is an artifact of the English language, not the conference.

In computational linguistics, these are called stop words. The wordcloud library comes with a built-in list of stop words that it can remove from word clouds, so let's give that a try.
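
Conceptually, removing stop words just means skipping them while counting. The sketch below shows the idea in plain Python. The tiny `SAMPLE_STOPWORDS` set and the function name are hypothetical, chosen for illustration; the library's own list is much longer.

```python
from collections import Counter

# Hypothetical, tiny stop word list for illustration only
SAMPLE_STOPWORDS = {"in", "for", "to", "the", "of", "and", "a"}

def count_without_stopwords(words, stopwords=SAMPLE_STOPWORDS):
    """Count words case-insensitively, skipping anything in `stopwords`."""
    lowered = (w.lower() for w in words)
    return Counter(w for w in lowered if w not in stopwords)

# Made-up sample phrase
words = "Imaging of the chest and the abdomen in CT".split()
counts = count_without_stopwords(words)
print(counts)
```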

In [27]:
from wordcloud import STOPWORDS

# Copy the built-in list so we can add our own words to it later
stopwords = set(STOPWORDS)

wordcloud = WordCloud(background_color="white", max_font_size=125, stopwords=stopwords, width=1280, height=960).generate(text)

plt.figure(figsize=(12,9))
plt.imshow(wordcloud)
plt.axis("off")
plt.show()


Since this is for the RSNA conference, let's try to create something meaningful out of this bag of words. First I took the RSNA logo and created a mask using the letter R.

Now, let's use this mask to fit the word cloud within it.

In [29]:
import numpy as np
from PIL import Image