
By now, most people probably have an intuitive understanding of how a simple recommendation system works. Classical recommendation systems use what is known as “collaborative filtering”: items are recommended based on the tastes of other users who have purchased/viewed/consumed many of the same items that you have.

Although collaborative filtering was invented in the 1990s, it’s based on an older intuition about how taste works: people who have shared taste in the past are likely to do so in the future. From a sociology of culture perspective, what’s most interesting is that it even works in the first place. The fact that observed overlaps in taste can be used to uncover unseen overlaps tells us something about communities of taste and the social formation of preferences.

For the past few months, I’ve been working on a dissertation chapter that explores this perspective. The chapter applies some of the methods of the collaborative filtering literature to a dataset derived from the contents of the Book Review Index between 1965 and 2000. The full dataset contains bibliographic entries on over 3 million reviews of over 1 million books. Instead of looking at purchase history, I consider whether two books or authors were reviewed in a similar collection of journals.

In what follows, I share a simple demonstration on just the 1965-1984 section of the data. I model the data as a journal-author matrix: columns are journals, rows are authors, and each cell represents the number of times that author was reviewed by that journal. This allows us to make similarity comparisons between authors that have a remarkable degree of intuitive consistency: authors are naturally grouped by genre, status, profession, and even nationality.
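To make the matrix layout concrete, here is a minimal sketch on invented records. The author and journal labels are placeholders, and `pd.crosstab` is only one way to build such a table; it is not necessarily how the processed file was produced.

```python
import pandas as pd

# Toy review records: one row per review (invented labels, not real data).
reviews = pd.DataFrame({
    "author": ["A", "A", "A", "B", "B", "C"],
    "journal": ["J1", "J1", "J2", "J1", "J3", "J3"],
})

# Rows are authors, columns are journals, cells are review counts.
author_journal = pd.crosstab(reviews["author"], reviews["journal"])
print(author_journal)
```

Each row of this table is the vector that later gets compared against other authors' rows.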

import math
import time

import pandas as pd
import numpy as np
import seaborn as sns
from matplotlib import pyplot as plt

# for t-SNE visualization
from sklearn.decomposition import TruncatedSVD
from sklearn.manifold import TSNE

# for plotting, we will use Bokeh for the excellent interactive options
from bokeh.plotting import figure, show, output_file
from bokeh.models import ColumnDataSource, HoverTool, CategoricalColorMapper, Legend

The processed data tracks reviews at the level of titles. This cell instead groups by authors.

df = pd.read_csv('../../data/processed/book_reviews.tsv', sep='\t', index_col=0)
df['author_name'] = df.index.to_series().str.split('\\|\\|').str[1].str.strip()
author_total_books = df['author_name'].value_counts()
df = df.groupby('author_name').sum()
df = df[df.index.notnull()]
df = df.drop('#NAME?')
df.head()
AB Bookman's Weekly Publishers Weekly Esquire Booklist Journal of Aesthetics and Art Criticism International Philosophical Quarterly Journal of Marketing Harvard Law Review Journal of Business Education Journal of Home Economics ... Black Warrior Review Computers and the Humanities American Arts Essays on Canadian Writing` Performing Arts Review Journal of Arts Management, Law, and Society Studio International, Review Journal of Black Studies Lone Star Review Aspen Journal of the Arts
author_name
0 0 0 1 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
AABERG, Jean 2 1 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
AADLAND, Florence 0 0 1 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
AAFJES, Bertus 0 0 0 1 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
AAGAARD, Orlena 0 0 0 1 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0

5 rows × 485 columns

Let’s see how much data we’re working with. By summing over the entire dataframe, we can also see the total number of reviews.

print(df.shape)
print(df.sum().sum())
(167445, 485)
968080

This method of representing the data is similar to how you might represent customer-item interactions in a simple recommender system. As in that application, we’re going to restrict the data to only authors who have received a minimum number of reviews. I’ve selected 20, because this yields a manageable number for visualization.

Additionally, I’ve dropped all journals with fewer than 25 reviews. There are only a few of them and they are mostly due to OCR errors.

auth_min = 20
journal_min = 25
df = df[df.sum(axis=1) >= auth_min]
df = df[df.columns[df.sum() >= journal_min]]
author_total_books = author_total_books[df.index]
print(df.shape)
print(df.sum().sum())
(9043, 352)
430173

This smaller dataset keeps only about 5% of the authors in the full data but retains about 44% of the reviews. In other words, a small fraction of authors accounts for nearly half of all the reviews in our data.
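The shares can be checked directly from the shapes and totals printed above. This is just the arithmetic, with the four numbers copied from the two outputs:

```python
# Figures printed above, before and after filtering.
authors_before, reviews_before = 167445, 968080
authors_after, reviews_after = 9043, 430173

author_share = authors_after / authors_before
review_share = reviews_after / reviews_before

print(f"authors kept: {author_share:.1%}")
print(f"reviews kept: {review_share:.1%}")
```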

Our goal is to find authors who are reviewed in similar venues. The problem is that our venues have extremely lopsided levels of coverage. Compare the total number of reviews in Publishers Weekly to the total in Analog Science Fiction and Fact:

pw_total = df['Publishers Weekly'].sum()
an_total = df['Analog Science Fiction and Fact'].sum()
print(f'Publishers Weekly total: {pw_total}')
print(f'Analog total: {an_total}')
Publishers Weekly total: 29769
Analog total: 456

The full set of publications has a long tail: a few journals publish an order of magnitude more reviews than the “typical” journal. In the histogram below, the x-axis is on a log scale because the plot is pretty incoherent otherwise.

It looks like most journals have a few hundred reviews, whereas a smaller number published a few thousand or even more than 10,000.

journal_totals = df.sum(axis=0)
review_count_hist = sns.histplot(
    journal_totals, 
    log_scale=(True, False),
    bins=20
)
plt.show()

[Figure: histogram of total reviews per journal, with a log-scaled x-axis]

This disproportion might throw off comparisons. If Publishers Weekly, Booklist, and Choice review just about everybody, then it isn’t very informative to know that an author was reviewed by one of those journals. What we want is a number that tells us how much more an author was reviewed by a journal than other authors were.

I’ve played around with three different methods for normalizing the data in this way: standard scaling (Z-scores), TF-IDF scores, and pointwise mutual information. For this demonstration, I’m just going to use standard scaling.

from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
weighted = scaler.fit_transform(df)
weighted = pd.DataFrame(weighted, index=df.index, columns=df.columns)

weighted.head()
AB Bookman's Weekly Publishers Weekly Esquire Booklist Journal of Aesthetics and Art Criticism International Philosophical Quarterly Harvard Law Review Journal of Home Economics Social Education Library Journal ... Journal of Negro Education Foreign Affairs Thought Political Science Reviewer Mankind Black Scholar Social Research Religious Studies Daedalus Threepenny Review
author_name
AARDEMA, Verna -0.281898 0.873116 -0.184596 1.423560 -0.114113 -0.048713 -0.080852 -0.056622 5.010613 -0.652932 ... -0.064776 -0.057693 -0.038791 -0.047681 -0.077693 -0.049667 -0.052652 -0.0555 -0.024265 -0.026197
AARON, Chester -0.281898 0.166723 -0.184596 0.566514 -0.114113 -0.048713 -0.080852 -0.056622 2.396317 -0.130159 ... -0.064776 -0.057693 -0.038791 -0.047681 -0.077693 -0.049667 -0.052652 -0.0555 -0.024265 -0.026197
AARON, Daniel -0.281898 -0.304205 -0.184596 -0.504793 -0.114113 -0.048713 -0.080852 -0.056622 -0.217978 -0.652932 ... -0.064776 -0.057693 -0.038791 -0.047681 -0.077693 -0.049667 -0.052652 -0.0555 -0.024265 -0.026197
AARON, Henry J -0.281898 -0.775134 -0.184596 -0.290531 -0.114113 -0.048713 -0.080852 -0.056622 -0.217978 0.131228 ... -0.064776 -0.057693 -0.038791 -0.047681 -0.077693 -0.049667 -0.052652 -0.0555 -0.024265 -0.026197
AASENG, Nathan -0.281898 -0.775134 -0.184596 7.851402 -0.114113 -0.048713 -0.080852 -0.056622 -0.217978 -0.914318 ... -0.064776 -0.057693 -0.038791 -0.047681 -0.077693 -0.049667 -0.052652 -0.0555 -0.024265 -0.026197

5 rows × 352 columns
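For intuition about what the scaled values mean, here is a small sketch that computes column-wise z-scores by hand on invented counts. The journal and author labels are placeholders. It uses the population standard deviation (`ddof=0`), which is the convention `StandardScaler` follows by default.

```python
import numpy as np
import pandas as pd

# Toy author-by-journal counts (invented values, not from the dataset).
counts = pd.DataFrame(
    {"Big Journal": [10, 12, 8, 30], "Small Journal": [0, 0, 1, 3]},
    index=["author_a", "author_b", "author_c", "author_d"],
)

# Each column is centered on its own mean and divided by its own
# population standard deviation, so journals of very different sizes
# end up on a comparable scale.
z = (counts - counts.mean(axis=0)) / counts.std(axis=0, ddof=0)
print(z.round(2))
```

A score near zero means the author was reviewed about as often as the average author in that journal. Large positive scores mark journals where the author is unusually prominent.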

Now, we can search for a specific author and see two things:

1) their highest journal scores, that is, journals in which they are unusually prominent
2) their most similar authors, that is, authors that were reviewed in a similar collection of journals

For similarity, I’ll use Pearson’s r. You could also use a distance metric, like Euclidean distance or cosine distance. According to at least one paper, Pearson’s r outperforms Euclidean and cosine for recommendations.
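Pearson’s r and cosine similarity are closely related: Pearson’s r is the cosine similarity of two vectors after each has been centered on its own mean. The sketch below checks this on two invented vectors. The helper names `cosine` and `pearson` are illustrative and not part of any library.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two 1-D arrays."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def pearson(u, v):
    """Pearson's r computed as the cosine of the mean-centered vectors."""
    return cosine(u - u.mean(), v - v.mean())

a = np.array([3.0, 0.0, 1.0, 5.0])
b = np.array([2.0, 1.0, 0.0, 4.0])

# The hand-rolled value should agree with NumPy's correlation coefficient.
print(pearson(a, b), np.corrcoef(a, b)[0, 1])
```

One practical consequence is that Pearson’s r ignores a constant offset added to every entry of a vector, whereas plain cosine similarity does not.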

from scipy.stats import pearsonr

def author_query(
    df,
    author: str, 
    num_journals: int = 5, 
    num_authors: int = 5
    ):

    print(author)
    print('Top Journal Scores:')
    print(df.loc[author].sort_values(ascending=False)[:num_journals])
    print()

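    author_vector = df.loc[author]
    # pearsonr returns a pair (r, p-value), so each entry below is a tuple;
    # sorting the tuples compares r first
    similarities = df.drop(author).apply(lambda x: pearsonr(x, author_vector), axis=1)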

    print('Most Similar Authors:')
    print(similarities.sort_values()[::-1][:num_authors])
    print()
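The row-by-row `apply` works, but it calls `pearsonr` once per author. As a possible alternative, here is a sketch of a vectorized version that returns only r and no p-values. The function name `most_similar` and the toy table are hypothetical, and this is not the code that produced the output below. It assumes a unique index and no constant rows, since a constant row has zero variance and would yield NaN.

```python
import numpy as np
import pandas as pd

def most_similar(df: pd.DataFrame, author: str, num_authors: int = 5) -> pd.Series:
    """Return the rows most correlated with `author`, as Pearson's r only."""
    values = df.to_numpy(dtype=float)
    # Center each row and scale it to unit length; the dot product of two
    # such rows is their Pearson correlation.
    centered = values - values.mean(axis=1, keepdims=True)
    unit = centered / np.linalg.norm(centered, axis=1, keepdims=True)
    target = unit[df.index.get_loc(author)]
    r = pd.Series(unit @ target, index=df.index)
    return r.drop(author).sort_values(ascending=False).head(num_authors)

# Toy table with invented values: "b" tracks "a" closely, "c" is its mirror image.
toy = pd.DataFrame(
    [[1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.5], [4.0, 3.0, 2.0, 1.0]],
    index=["a", "b", "c"],
    columns=["j1", "j2", "j3", "j4"],
)
print(most_similar(toy, "a", num_authors=2))
```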

Here is a somewhat arbitrary list of authors against which to make comparisons. I find the results really fascinating; you can get a taste of cleavages within the data along axes of genre, race, status, and politics. Remember that this is based on nothing other than the fact that these authors tended to be reviewed in similar journals. This information turns out to tell you a lot about the context of their reception.

query_authors = [

    'LE GUIN, Ursula K',
    'MORRISON, Toni',
    'MERTON, Thomas', # Christian monk
    'KENNEDY, Eugene', # Catholic priest
    'UPDIKE, John',
    'ZINN, Howard', 
    'BENNETT, Lerone, Jr.', # social historian of race
    'CAUSLEY, Charles', # British children's poet, known for blurring lines between lit for kids/adults
    'SENDAK, Maurice',
    'RICE, Anne',
    'TYLER, Anne',
    'PYNCHON, Thomas'
    
]

for author in query_authors:
    author_query(weighted, author)
LE GUIN, Ursula K
Top Journal Scores:
Emergency Librarian                        16.177252
New Age Journal                            13.913492
English Journal                            10.664482
Book Report                                10.049862
Magazine of Fantasy and Science Fiction    10.047733
Name: LE GUIN, Ursula K, dtype: float64

Most Similar Authors:
author_name
MC KINLEY, Robin          (0.5528115927876553, 1.4651648937133513e-29)
WALSH, Jill Paton           (0.5264999590498307, 1.70993924355456e-26)
OXENBURY, Helen            (0.5201867871874256, 8.506955391601144e-26)
MC KILLIP, Patricia A      (0.5161226140849589, 2.347997852360038e-25)
ANGELL, Judie            (0.49897481847179126, 1.4699499707507178e-23)
dtype: object

MORRISON, Toni
Top Journal Scores:
Black Scholar    59.835853
Critique         17.461670
Black World      16.720440
Ms.              13.332273
Cresset           6.865729
Name: MORRISON, Toni, dtype: float64

Most Similar Authors:
author_name
DUMAS, Henry               (0.877432209012966, 1.060775142523341e-113)
BAMBARA, Toni Cade          (0.7774150997378938, 1.83205166005727e-72)
ARMAH, Ayi Kwei             (0.775399191524083, 7.295234900612199e-72)
KELLEY, William Melvin     (0.6948979408922896, 4.617259254309341e-52)
RUSSELL, Ross             (0.6921805372380505, 1.6520768255798344e-51)
dtype: object

MERTON, Thomas
Top Journal Scores:
Critic                      18.657915
Christian Century           16.284860
America                     15.183979
Review for Religious        14.616213
Religious Studies Review    13.219459
Name: MERTON, Thomas, dtype: float64

Most Similar Authors:
author_name
KENNEDY, Eugene         (0.7300640514315687, 7.882595309467312e-60)
ORAISON, Marc           (0.7264906275666435, 5.509922558328283e-59)
MC BRIEN, Richard P     (0.7030029994797475, 9.455691139415028e-54)
RAHNER, Kari           (0.7016190526317485, 1.8538772074814077e-53)
DUNNE, John S           (0.6804157544635081, 3.513746567468188e-49)
dtype: object

KENNEDY, Eugene
Top Journal Scores:
America                   21.402213
Review for Religious      20.159124
Christian Century         13.795751
Critic                     8.162939
Educational Leadership     6.553519
Name: KENNEDY, Eugene, dtype: float64

Most Similar Authors:
author_name
RAHNER, Kari                   (0.8624951376463152, 1.4646235674784058e-105)
ORAISON, Marc                    (0.859105639048219, 7.580649865528993e-104)
MORAN, Gabriel                  (0.8530459691211054, 6.839269809942987e-101)
DOHERTY, Catherine De Hueck      (0.8250680099089616, 8.679206626529615e-89)
BOROS, Ladislaus                 (0.8048407340361178, 2.618955188762307e-81)
dtype: object

UPDIKE, John
Top Journal Scores:
National Forum         20.862625
American Spectator     20.330386
Economist. Survey      17.967654
Carleton Miscellany    16.950023
America                16.738537
Name: UPDIKE, John, dtype: float64

Most Similar Authors:
author_name
ROTH, Philip                    (0.6920687692717833, 1.7404852299266676e-51)
VIDAL, Gore                      (0.6760280264425202, 2.433341140069544e-48)
MURDOCH, Iris                    (0.6757987574334349, 2.689782607579344e-48)
WILSON, John Anthony Burgess    (0.6655296508462868, 2.1834279288935234e-46)
MAILER, Norman                  (0.6617532460494826, 1.0528355723236884e-45)
dtype: object

ZINN, Howard
Top Journal Scores:
Negro Digest           13.841505
Science and Society    13.761646
Dissent                12.200907
Social Education        7.624908
Partisan Review         7.167689
Name: ZINN, Howard, dtype: float64

Most Similar Authors:
author_name
SUTHERLAND, Elizabeth      (0.6129807718752314, 1.0676426021917653e-37)
APTHEKER, Herbert           (0.5177429571406121, 1.568975497369954e-25)
EHRENREICH, Barbara         (0.5013228883530935, 8.457594117775939e-24)
RADOSH, Ronald             (0.4797955970028571, 1.1488516111633457e-21)
MOORE, Barrington, Jr.    (0.46729106318539493, 1.7052689919841447e-20)
dtype: object

BENNETT, Lerone, Jr.
Top Journal Scores:
Negro Digest                   27.786328
Black World                    16.720440
Black Scholar                   9.931253
Social Studies                  3.064603
Quarterly Journal of Speech     2.922191
Name: BENNETT, Lerone, Jr., dtype: float64

Most Similar Authors:
author_name
PARKS, Gordon             (0.8886551077709102, 1.4946933078911418e-120)
NKRUMAH, Kwame              (0.8433326831632582, 2.018226499048871e-96)
KELLEY, William Melvin      (0.8061399020139547, 9.214904460798329e-82)
VAN DYKE, Henry             (0.805151484028659, 2.0414253714606717e-81)
CLARKE, John Henrik         (0.7566455147335355, 1.450096864374826e-66)
dtype: object

CAUSLEY, Charles
Top Journal Scores:
Junior Bookshelf                13.393407
Growing Point                    5.389733
School Librarian                 3.774520
New Statesman                    3.678798
Times Educational Supplement     3.387897
Name: CAUSLEY, Charles, dtype: float64

Most Similar Authors:
author_name
BIEGEL, Paul       (0.8797474619943897, 4.664293889661802e-115)
KAYE, Geraldine      (0.822209027770995, 1.129552935901158e-87)
SUDBERY, Rodie      (0.8106722535024123, 2.262449687105018e-83)
LAW, Felicia       (0.8059927740296957, 1.0376161881012245e-81)
PIERS, Helen       (0.7910184049209884, 1.1017670332710154e-76)
dtype: object

SENDAK, Maurice
Top Journal Scores:
New Catholic World     21.840145
Language Arts           7.429803
Instructor              6.625812
Quill and Quire         6.382431
Emergency Librarian     5.309494
Name: SENDAK, Maurice, dtype: float64

Most Similar Authors:
author_name
MINARIK, Else Holmelund     (0.7268228146964074, 4.604771361203391e-59)
WATSON, Wendy               (0.7263598407870459, 5.912879674025634e-59)
BROWN, Margaret Wise        (0.7123824308570459, 8.882268934056061e-56)
KENDALL, Card              (0.6664389620639734, 1.4898139845533964e-46)
MONTRESOR, Beni             (0.6568677963889631, 7.791854945390835e-45)
dtype: object

RICE, Anne
Top Journal Scores:
BooksWest                            6.050889
West Coast Review of Books           3.392710
Ms.                                  3.203419
Village Voice Literary Supplement    2.726607
School Librarian                     1.803452
Name: RICE, Anne, dtype: float64

Most Similar Authors:
author_name
NEFF, Hildegarde    (0.630716460685047, 1.9148479304928843e-40)
WOOD, Bari           (0.6295179554579491, 2.97394593137537e-40)
BARLAY, Stephen      (0.603936258386666, 2.307358441243761e-36)
JAKES, John          (0.5957074972209582, 3.47470700446774e-35)
MARX, Groucho       (0.5926411934038924, 9.358115243413209e-35)
dtype: object

TYLER, Anne
Top Journal Scores:
San Francisco Review of Books    18.637230
Southern Review                  10.844227
Book Report                      10.049862
National Observer                 7.329401
Cresset                           6.865729
Name: TYLER, Anne, dtype: float64

Most Similar Authors:
author_name
OLIVIER, Sir Laurence      (0.621480761776363, 5.422338928703332e-39)
JOHNSON, Diane            (0.6148366403819276, 5.613094201778105e-38)
YURICK, Sol              (0.6116428816970522, 1.6926699397244815e-37)
BECKER, Stephen          (0.5699207473129816, 1.0491605409420453e-31)
MILLS, Hilary              (0.555022258876245, 7.864364546945786e-30)
dtype: object

PYNCHON, Thomas
Top Journal Scores:
Harper's Magazine        7.160074
Saturday Review/World    4.720596
Critique                 4.317086
Prairie Schooner         3.786981
Partisan Review          3.502168
Name: PYNCHON, Thomas, dtype: float64

Most Similar Authors:
author_name
KAUFMAN, Sue         (0.5924577991661754, 9.92606145784547e-35)
BARTH, John        (0.5598444128606803, 1.9914938304178426e-30)
PALEY, Grace         (0.5582812412190425, 3.11604880120294e-30)
MOSTERT, Noel      (0.5195069649305338, 1.0091175318715678e-25)
KOSINSKI, Jerzy     (0.517056759606564, 1.8615545714276522e-25)
dtype: object

The natural next step is to visualize the entire space, so that authors who have high scores in the same journals are grouped together.

Before doing that, I’m going to create a simple dictionary that associates each author with their top 5 journals. This will be included as a tooltip for that author visible when mousing over their point in the visualization. This is useful just because I have no idea who most of these people are; having their top journals makes them easier to Google, if I encounter them while browsing the visualization.

author_dict = {}
for author in weighted.index:
    # keep the names of the five highest-scoring journals for this author
    top_5 = weighted.loc[author].sort_values(ascending=False)[:5]
    author_dict[author] = top_5.index
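If the tooltip should show a plain comma-separated string rather than an index object, one option is to join the journal names up front. The sketch below does this on an invented score table. `top_journal_labels` is a hypothetical helper, not part of the original notebook.

```python
import pandas as pd

# Toy z-score table (invented values, not from the dataset).
toy_weighted = pd.DataFrame(
    [[2.1, -0.3, 0.7], [-0.5, 1.8, 0.2]],
    index=["author_a", "author_b"],
    columns=["Journal X", "Journal Y", "Journal Z"],
)

def top_journal_labels(scores: pd.DataFrame, n: int = 5) -> dict:
    """Map each row label to a comma-joined string of its n highest-scoring columns."""
    return {
        author: ", ".join(row.nlargest(n).index)
        for author, row in scores.iterrows()
    }

labels = top_journal_labels(toy_weighted, n=2)
print(labels)
```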

There are different ways of producing such a visualization: t-SNE, UMAP, PCA, etc. I’m going to use t-SNE because I’ve used it before. t-SNE tries to find a lower-dimensional projection of the data that retains the local (but not necessarily global) structure of the data.

time_start = time.time()
tsne = TSNE(n_components=2, 
            verbose=0, 
            perplexity=40, 
            n_iter=300,
            learning_rate='auto', 
            random_state=11, 
            init='pca')
tsne_svd_results = tsne.fit_transform(weighted)
print('t-SNE done! Time elapsed: {} seconds'.format(time.time()-time_start))
t-SNE done! Time elapsed: 8.81271767616272 seconds

Finally, we’ll make a scatterplot with a mouseover that gives us the name of the author and their top-scoring journals. Authors will tend to be grouped near other authors reviewed by the same venues.

source = ColumnDataSource(data=dict(
    x=tsne_svd_results[:,0],
    y=tsne_svd_results[:,1],
    author=weighted.index,
    top_scores = [author_dict[author] for author in weighted.index],
    
))
TOOLTIPS = [
    ("(x,y)", "($x, $y)"),
    ("author", "@author"),
    ("top scores", "@top_scores"),
]

p = figure(plot_width=1000, plot_height=800, tooltips=TOOLTIPS, toolbar_location='above',
           title="t-SNE Projection of ~9000 Authors in Book Review Space")
p.scatter('x', 
          'y',
          size=7,
          source=source,
          fill_alpha=1,
)

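# Label used in the output filename; the value here is an assumed placeholder
# for the normalization applied above.
weighting_scheme = "standard_scaling"

output_file(
    f"../../images/tsne_interactive_{weighting_scheme}.html", 
    title=f"t-SNE Projection of {len(weighted.index)} Authors in Book Review Space"
)
show(p)  # write the HTML file and open the plot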

Take some time to look it over. You will note that the clusters have a high degree of intuitive structure. Just browsing, I found a cluster of 19th-century American authors (Twain, Fenimore Cooper, Melville, etc.), a Canadian cluster, a science fiction cluster, and more.