Projects

Automatic Genre Classification for the US Novel Corpus

For the Textual Optics Lab at the University of Chicago, I designed an automatic classifier that predicts the genre of more than 9,000 novels from their word frequencies. A full description of the project is available at the lab’s research blog. I also designed a Dash app for visualizing the feature space and the classifier output, which you can view and read about here.
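As a rough illustration of what classifying by word frequencies can look like, here is a minimal sketch. It is not the lab's classifier: the nearest-centroid rule, the toy training sentences, and every function name below are assumptions chosen to keep the example short and dependency-free.

```python
from collections import Counter
from math import sqrt


def relative_frequencies(text):
    """Map each word to its share of all words in the text."""
    words = text.lower().split()
    total = len(words) or 1  # avoid dividing by zero on empty input
    return {word: count / total for word, count in Counter(words).items()}


def centroid(vectors):
    """Average a list of word-frequency vectors into one profile."""
    summed = Counter()
    for vector in vectors:
        for word, freq in vector.items():
            summed[word] += freq
    return {word: freq / len(vectors) for word, freq in summed.items()}


def distance(a, b):
    """Euclidean distance between two sparse frequency vectors."""
    vocabulary = a.keys() | b.keys()
    return sqrt(sum((a.get(w, 0.0) - b.get(w, 0.0)) ** 2 for w in vocabulary))


def train(labelled_texts):
    """Build one average frequency profile per genre label."""
    by_genre = {}
    for genre, text in labelled_texts:
        by_genre.setdefault(genre, []).append(relative_frequencies(text))
    return {genre: centroid(vectors) for genre, vectors in by_genre.items()}


def predict(centroids, text):
    """Return the genre whose profile is closest to the text."""
    vector = relative_frequencies(text)
    return min(centroids, key=lambda genre: distance(vector, centroids[genre]))


# Hypothetical toy data standing in for labelled novels.
training = [
    ("detective", "the detective found the body and the clue"),
    ("detective", "the inspector questioned the suspect about the murder"),
    ("romance", "she loved him and her heart longed for his kiss"),
    ("romance", "their love and their wedding filled her heart"),
]

centroids = train(training)
prediction = predict(centroids, "the detective examined the clue near the body")
print(prediction)
```

A real corpus would call for a proper tokenizer, a fixed vocabulary, and a held-out test set; the sketch only shows the overall shape of the approach.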

Text as Data, 2021

In Winter 2021, I taught an introductory course in quantitative text analysis to a class of mostly social science and humanities undergraduates. The course covered research methods for the computational analysis of texts, including common statistical methods, TF-IDF, topic modeling, text classification, and word vector embeddings. Some of the Jupyter Notebooks used as class assignments are available at this GitHub repo.

Dissertation Chapters

The Rise of the Mass Market Hardcover: A Quantitative History of ~2,040 New York Times Bestsellers

In this dissertation chapter, I construct a corpus of bestselling novels by matching entries on the NYT bestseller list to full-text volumes held by HathiTrust. Comprising more than half of all novels that have ever made the list, this corpus provides a rich portrait of the history of popular literature in the postwar United States. By training a topic model on the full corpus, I show that at the end of the twentieth century, historical novels and novels about domestic life became less popular, whereas crime and “suspense” novels became newly prominent.

The data used in the chapter is currently under review at the Post45 Data Collective.

Authorless Topic Models

Dimensions of Prestige: A Cluster Analysis of the Book Review Index, 1965-2000

This chapter analyzes a dataset derived from the digitized contents of the Book Review Index. Covering more than 3 million reviews of over 1 million books, the dataset provides a panoramic view of how books were received and which authors were most widely reviewed in the period. Borrowing methods from collaborative filtering, I show that grouping authors based on the journals in which they were reviewed produces clusters with a surprising amount of intuitive coherence. This method produces clusters based on genre, profession, region, religion, and even politics.
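To make the grouping idea concrete, the sketch below builds a journal profile for each author and joins authors whose profiles are similar. It is an illustration under stated assumptions, not the chapter's method: the cosine measure, the greedy threshold grouping, the 0.5 cutoff, and the author and journal names are all hypothetical.

```python
from collections import Counter
from math import sqrt

# Hypothetical (author, journal) review records.
reviews = [
    ("Author A", "Journal of Theology"),
    ("Author A", "Christian Review"),
    ("Author B", "Journal of Theology"),
    ("Author B", "Christian Review"),
    ("Author C", "Mystery Monthly"),
    ("Author C", "Crime Digest"),
    ("Author D", "Mystery Monthly"),
]


def journal_profiles(records):
    """Count how often each author was reviewed in each journal."""
    profiles = {}
    for author, journal in records:
        profiles.setdefault(author, Counter())[journal] += 1
    return profiles


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def cluster(profiles, threshold=0.5):
    """Greedily merge authors linked by similarity at or above the threshold."""
    clusters = []
    for author in sorted(profiles):
        # Find every existing cluster with at least one similar member.
        matches = [
            group for group in clusters
            if any(cosine(profiles[author], profiles[m]) >= threshold for m in group)
        ]
        merged = [author]
        for group in matches:
            merged.extend(group)
            clusters.remove(group)
        clusters.append(sorted(merged))
    return sorted(clusters)


profiles = journal_profiles(reviews)
clusters = cluster(profiles)
for group in clusters:
    print(group)
```

With this toy data the two authors reviewed in religious journals end up together, as do the two reviewed in crime periodicals. A dataset of millions of reviews would need sparse matrices and a scalable clustering algorithm in place of the pairwise loop.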

Jordan Pruett
