The first step is to download all the PDF versions of the RSNA 2016 program. You can do this from the RSNA conference website by clicking on "Additional Program Content."
This step is relatively straightforward: we just invoke the command-line tool pdftotext. If you're a Linux user, then lucky you! Getting the tool is a simple installation command away (for some distributions it's part of the poppler-utils package):
apt-get install poppler-utils
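Before running the conversion, it may be worth confirming that pdftotext is actually on your PATH. Here is a minimal sketch using only the standard library; the helper name find_tool is made up for this example.

```python
import shutil

def find_tool(name):
    """Return the full path to an executable on the PATH, or None if it is missing."""
    return shutil.which(name)

tool_path = find_tool('pdftotext')
if tool_path is None:
    print("pdftotext not found; install poppler-utils (or your platform's equivalent) first")
else:
    print("Found pdftotext at", tool_path)
```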
    import subprocess

    def convert_to_txt(pdf_file):
        # Convert pdf/<name> to program/<name>.txt; pdftotext exits with 0 on success.
        result = subprocess.call(['pdftotext', 'pdf/' + pdf_file, 'program/' + pdf_file + '.txt'])
        if result == 0:
            print(pdf_file, "Success!")
        else:
            print("Error: ", result)

    for day in ['Saturday', 'Sunday', 'Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday']:
        convert_to_txt(day + '.pdf')
    Saturday.pdf Success!
    Sunday.pdf Success!
    Monday.pdf Success!
    Tuesday.pdf Success!
    Wednesday.pdf Success!
    Thursday.pdf Success!
    Friday.pdf Success!
The code says "Success!" but we still should verify that the text is extracted correctly.
Let's just print out the first 20 lines of the Saturday program and see.
    with open("program/Saturday.pdf.txt") as myfile:
        head = [next(myfile) for x in range(20)]

    for line in head:
        print(line.strip())
    Saturday
    SPPH01
    AAPM Medical Physics Tutorial Session 1
    Saturday, Nov. 26 12:00PM - 2:00PM Room: E351
    CT PH
    AMA PRA Category 1 Credits ™: 2.00
    ARRT Category A+ Credits: 2.00
    Participants
    Thaddeus A. Wilson, PhD, Memphis, TN (Moderator) Nothing to Disclose
    Sub-Events
    SPPH01A
    Fundamentals of CT
Looks like things worked as expected.
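One caveat with this kind of check: calling next() a fixed number of times raises StopIteration if the file has fewer lines than requested. A slightly more forgiving sketch uses itertools.islice, which simply stops at the end of the input. The function name first_lines is just for illustration.

```python
from itertools import islice

def first_lines(lines, n=20):
    """Return up to the first n lines from any iterable of strings, stripped of whitespace."""
    return [line.strip() for line in islice(lines, n)]

# Works on any iterable of lines, including an open file object.
demo = ["Saturday\n", "  SPPH01  \n", "AAPM Medical Physics Tutorial Session 1\n"]
for line in first_lines(demo, 20):
    print(line)
```

With a real file, this would be called as first_lines(myfile) inside the same with open(...) block used above.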
A popular way of extracting meaning from a body of text is to keep track of how often each word appears.
A word cloud uses this data to display a visualization of the text. The more frequently a word appears, the bigger it looks on the screen.
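To make the counting idea concrete, here is a rough sketch using collections.Counter. The tokenizer (lowercase, then grab runs of letters) is a simplification chosen for this example and is not necessarily how the wordcloud library splits text; the sample sentence is made up.

```python
from collections import Counter
import re

def word_frequencies(text):
    """Count how often each word appears, ignoring case and punctuation."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(words)

sample = "CT of the chest. MRI of the brain. CT of the abdomen."
counts = word_frequencies(sample)

# The most frequent words are the ones a word cloud would draw largest.
for word, count in counts.most_common(3):
    print(word, count)
```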
Here's an example:
We'll use the wordcloud library.
This can be installed via a simple command on Linux, Windows, or Mac OS X. If you have Anaconda installed based on our recommendations, then it should work just fine:
pip install wordcloud
Now let's make a word cloud!
    %matplotlib inline
    from os import path
    from wordcloud import WordCloud
    import matplotlib.pyplot as plt

    d = 'program'
    days = ['Saturday', 'Sunday', 'Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday']

    # Read the whole text, one file per day, separated by spaces.
    parts = []
    for day in days:
        with open(path.join(d, day + '.pdf.txt')) as f:
            parts.append(f.read())
    text = ' '.join(parts)

    # An empty stop word set keeps every word, including "the" and "of".
    wordcloud = WordCloud(background_color="white", stopwords=set(),
                          width=1280, height=960).generate(text)

    plt.figure(figsize=(16,9))
    plt.imshow(wordcloud)
    plt.axis("off")
    plt.show()
This is kind of fun, but a few things don't look quite right. Specifically, a lot of screen real estate is taken up by words that carry little meaning: in, for, to, the, etc. Indeed, the fact that "the" is the most frequently used word in the RSNA program tells us almost nothing, because it is an artifact of the English language, not the conference.
In computational linguistics, these are called stop words. The wordcloud library comes with a built-in ability to remove stop words from word clouds, so let's give that a try.
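Conceptually, stop word removal is just a filter applied before counting. The sketch below uses a tiny hand-picked list (DEMO_STOPWORDS, invented for this example) and a made-up sentence; the list that ships with the wordcloud library as STOPWORDS is much longer.

```python
from collections import Counter

# A tiny stop word list for illustration only.
DEMO_STOPWORDS = {"in", "for", "to", "the", "of", "and", "a"}

def count_without_stopwords(text, stopwords=DEMO_STOPWORDS):
    """Count words in text, skipping any word found in the stop word set."""
    words = text.lower().split()
    return Counter(w for w in words if w not in stopwords)

sample = "imaging of the brain and imaging of the chest in the emergency department"
counts = count_without_stopwords(sample)

# "the" and "of" no longer appear, so content words rise to the top.
print(counts.most_common(2))
```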
    from wordcloud import STOPWORDS

    # Copy the library's built-in stop word list into a set we can pass along.
    stopwords = set(STOPWORDS)

    wordcloud = WordCloud(background_color="white", max_font_size=125, stopwords=stopwords,
                          width=1280, height=960).generate(text)

    plt.figure(figsize=(12,9))
    plt.imshow(wordcloud)
    plt.axis("off")
    plt.show()
Since this is for the RSNA conference, let's try to create something meaningful out of this bag of words. First I took the RSNA logo and created a mask using the letter R.
Now, let's use this mask to fit the word cloud within it.
    import numpy as np
    from PIL import Image

    # Load the mask image as an array of pixel values.
    rsna_mask = np.array(Image.open("rsnamask2.png"))

    # Generate the word cloud inside the mask (a single generate() call is enough).
    wordcloud = WordCloud(background_color="white", max_words=10000, max_font_size=125,
                          stopwords=set(stopwords), mask=rsna_mask,
                          width=1200, height=1800).generate(text)

    plt.figure(figsize=(16,24))
    plt.imshow(wordcloud)
    plt.axis("off")
    plt.show()
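A note on how the mask is interpreted: according to the wordcloud documentation, pure white entries in the mask are treated as masked out, and words may be drawn everywhere else. If a mask image has soft, nearly white edges, thresholding it first can give a cleaner outline. The sketch below shows the idea on a small synthetic array; the 6x6 "logo" and the threshold of 128 are assumptions made purely for illustration, not the actual RSNA mask.

```python
import numpy as np

# Hypothetical 6x6 grayscale image: 255 is white background, darker values are the letter shape.
logo = np.full((6, 6), 255, dtype=np.uint8)
logo[1:5, 2:4] = 0      # a dark vertical bar standing in for the letter
logo[0, 2:4] = 250      # a faint, nearly white edge pixel row

# Threshold so every pixel is either pure black (drawable) or pure white (masked out).
threshold = 128
mask = np.where(logo < threshold, 0, 255).astype(np.uint8)

print(mask)
print("Drawable pixels:", int((mask == 0).sum()))
```

An array like mask can then be passed as the mask argument in place of the raw image array.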