Data

Extraction and Corpus Creation

Working with The College News proved to be a valuable but challenging process, even with the help of computational text analysis tools. To create our corpus, we used the Python library Beautiful Soup to scrape each issue’s transcript, along with other metadata, from the Tri-College Libraries Digital Collections website and save it as plain text. As is typical of Optical Character Recognition (OCR) transcripts generated from scanned documents, our corpus includes many errors resulting from incorrectly recognized type.
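As a rough illustration of this kind of extraction, here is a minimal sketch using Beautiful Soup. It is not our actual scraper: the sample page, the tag names, and the class names ("title", "transcript") are hypothetical stand-ins, and the real Digital Collections pages are structured differently. The sketch parses an HTML string that has already been downloaded, so it makes no network requests.

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration only; not the real page structure.
SAMPLE_PAGE = """
<html><body>
  <h1 class="title">The College News, sample issue</h1>
  <div class="transcript"><p>First line of OCR text.</p><p>Second line.</p></div>
</body></html>
"""


def extract_issue(html):
    """Return the title and plain-text transcript found in one issue page."""
    soup = BeautifulSoup(html, "html.parser")
    title_tag = soup.find("h1", class_="title")
    transcript_tag = soup.find("div", class_="transcript")
    return {
        # Fall back to an empty string when an element is missing.
        "title": title_tag.get_text(strip=True) if title_tag else "",
        "transcript": (
            transcript_tag.get_text(separator="\n", strip=True)
            if transcript_tag
            else ""
        ),
    }


issue = extract_issue(SAMPLE_PAGE)
```

In a full pipeline, the returned transcript would be written to a plain-text file named after the issue, with the metadata stored alongside it.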

To mitigate problems caused by incorrect OCR, and to make the corpus more grammatically homogeneous and easier to search, some team members wrote a Python script that removes accents, symbols, hyphens, and random strings of characters from the text. Our cleaned corpus is available in our GitHub repository for anyone to use.
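The cleaning steps described above could look something like the following sketch, which uses only the Python standard library. It is an illustration rather than the script in our repository: the function names are hypothetical, and the rule for spotting "random strings" (runs of four or more consonants with no vowel) is an assumed heuristic that a real cleanup would need to tune.

```python
import re
import unicodedata


def strip_accents(text):
    """Decompose accented characters and drop the combining marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def clean_text(text):
    text = strip_accents(text)
    # Rejoin words that were hyphenated across a line break.
    text = re.sub(r"(\w)-\s*\n\s*(\w)", r"\1\2", text)
    # Turn any remaining hyphens into spaces.
    text = text.replace("-", " ")
    # Replace symbols, keeping letters, digits, whitespace, and basic punctuation.
    text = re.sub(r"[^A-Za-z0-9\s.,;:!?'\"()]", " ", text)
    # Assumed heuristic: a run of 4+ consonants with no vowel is OCR noise.
    text = re.sub(r"\b[b-df-hj-np-tv-xz]{4,}\b", " ", text, flags=re.IGNORECASE)
    # Collapse runs of whitespace left behind by the removals.
    return re.sub(r"\s+", " ", text).strip()


cleaned = clean_text("café co-\noperation ~~ xkqzt news")
```

Each step trades some fidelity for consistency. For example, replacing every hyphen with a space splits compound words, which makes searching simpler but changes token counts.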

Our team used a variety of tools and methods to analyze the corpus and extract data, including the Python natural language processing libraries NLTK (Natural Language Toolkit) and spaCy, as well as Scott Enderle’s Topic Modeling Tool.

This website showcases a few of the visualizations and projects we have created, including a map of all locations mentioned, a graph of political topics discussed in the newspaper in the 1950s and 1960s, and a wooden tactile map featuring data from commencement issues. Each of the Digital Scholarship Summer Fellows also created an individual visualization project relevant to their interests.

This chart shows the length of each College News issue. Hover over a bubble to learn more about an issue, including its date, volume and number, word count, number of pages, and object identifier in the digital collection. Click on a bubble to open that issue on the Tri-College Libraries Digital Collections website. The data represented includes library metadata scraped from the Digital Collections site, as well as the word count of each issue, extracted using NLTK. The visualization is a Vega-Lite JavaScript graph created using the Python library Altair (see the data here).
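For readers curious how a per-issue word count might be computed, here is a minimal sketch using NLTK. It is an assumption about the approach, not our exact code: it uses NLTK's regular-expression-based wordpunct_tokenize (which needs no extra model downloads) and counts only tokens that contain at least one letter or digit, so stand-alone punctuation is ignored. The function name word_count is hypothetical.

```python
from nltk.tokenize import wordpunct_tokenize


def word_count(text):
    """Count tokens that contain at least one letter or digit."""
    tokens = wordpunct_tokenize(text)
    return sum(1 for tok in tokens if any(ch.isalnum() for ch in tok))


sample_count = word_count("The College News, Vol. 1")
```

A different tokenizer, or a different rule for what counts as a word, would give somewhat different totals, especially on text with OCR errors.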