Earlier this year, the team at EDHREC.com was generous enough to share with me a snapshot of their data, in the form of a .csv file of over 400,000 decklists scraped from deckbuilding sites. The data provides a detailed look at the aggregate deckbuilding habits of thousands of EDH players around the world.

There are many ways that one could work with this data; I encourage you to check out the articles section on the EDHREC website if you want to read more. For this notebook, I wanted to give a quick demonstration of how common text mining methods can be repurposed for working with decklists.

I'll just be demonstrating two things:

1) a simple way to recommend an "unusual" commander
2) using topic models to uncover deck themes

Part One: Commander Recommendations

First, our imports.

Then, load the data. I'm not showing the head of the dataframe because the data is not fully public. At any rate, the only parts needed for this project are commander, color identity, and decklist.

Note that this data is from February of 2020, so it does not include cards released after that date.

Decklists are saved as a bracketed string; here I'll exploit the csv module to expand those into a list of lists.
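As a rough sketch of that parsing step: the exact string format isn't shown here, so this assumes each decklist looks like `["Sol Ring", "Karametra, God of Harvests"]`, with card names in double quotes. The helper name `parse_decklist` is hypothetical.

```python
import csv

def parse_decklist(raw):
    """Turn a bracketed string such as '["Sol Ring", "Forest"]' into a list of card names."""
    # Assumed format: square brackets around comma-separated, double-quoted names.
    inner = raw.strip().lstrip("[").rstrip("]")
    # csv.reader accepts any iterable of lines; skipinitialspace lets it see the
    # opening quote that follows each ", " separator, so commas inside card
    # names are kept as part of the name.
    rows = list(csv.reader([inner], skipinitialspace=True))
    return [name for name in rows[0] if name] if rows else []

example = parse_decklist('["Sol Ring", "Karametra, God of Harvests", "Forest"]')
```

Letting the csv module handle the quoting matters because many card names contain commas, so a plain `split(",")` would cut them in half.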

We'll hold onto the raw decklists for later. In the meantime, I'm going to demonstrate that distance measures like cosine distance can be used to make similarity comparisons that are useful for commander recommendations.

Since I'll be recommending commanders (rather than decks, or cards), I first sum card totals by commander.

As a quick sanity check, let's make sure we have the same number of commanders in our compiled data as in the original data.
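Here is one way the aggregation and the sanity check could look, using only the standard library. The toy `commanders` and `decklists` values and the function name are made up for illustration; the real data has one commander and one decklist per row.

```python
from collections import Counter, defaultdict

def card_totals_by_commander(commanders, decklists):
    """Sum card counts over every decklist that shares a commander."""
    totals = defaultdict(Counter)
    for commander, deck in zip(commanders, decklists):
        totals[commander].update(deck)  # adds 1 for each occurrence of a card
    return totals

# Toy stand-ins for the real columns (hypothetical values).
commanders = ["Commander A", "Commander A", "Commander B"]
decklists = [["Sol Ring", "Forest"], ["Sol Ring"], ["Island"]]

totals = card_totals_by_commander(commanders, decklists)

# Sanity check: one entry per distinct commander in the original data.
same_count = len(totals) == len(set(commanders))
```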

Now let's turn those lists of card totals into a card-commander matrix, with commanders as rows and cards as columns. A single cell represents the total number of times that card appears in a decklist for that commander.

This is laughably easy with pandas' DataFrame.from_dict() method. This method doesn't fill missing values by default, so we need to fill them with 0. In this case, having no value for a card indicates there are 0 decklists with that card.
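A minimal sketch of that step on toy totals (the commander and count values are invented):

```python
import pandas as pd

# Toy per-commander card totals; a Counter per commander works the same way.
totals = {
    "Commander A": {"Sol Ring": 2, "Forest": 1},
    "Commander B": {"Sol Ring": 1, "Island": 3},
}

# orient="index" makes each outer key a row and each card a column.
# Cards a commander never plays come out as NaN, so fill them with 0.
matrix = pd.DataFrame.from_dict(totals, orient="index").fillna(0).astype(int)
```

The result has one row per commander and one column per card that appears anywhere in the data.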

A simple way to make recommendations is to just recommend entities that are "similar" to an entity of interest, where "similar" refers to the closest other entity given some measure of distance.

I know that EDHREC itself uses Jaccard (aka Tanimoto) distance for measuring the similarity of two cards to each other. For this demonstration, I'll use cosine distance, which has the nice property of being invariant to scale. So it naturally gets around the problem of some commanders just having more decklists than others.

Cosine distance is often used when working with term-document matrices, which are very similar to the decklist data as I've modeled it here. A TDM might similarly have more columns than rows (in this case, ~20,000 columns vs ~1,000 rows).

Let's use one of my own commanders as a demonstration. Here are the 10 commanders that are closest to Karametra, God of Harvests, by cosine distance.
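For reference, a nearest-neighbor lookup by cosine distance can be sketched in a few lines of NumPy. The function name `nearest_rows` and the toy matrix are assumptions for illustration, not the notebook's actual code. Note that the query commander itself will come back at distance 0 if it is one of the rows.

```python
import numpy as np

def nearest_rows(matrix, labels, query, k=10):
    """Return the k rows closest to `query` by cosine distance (1 - cosine similarity)."""
    matrix = np.asarray(matrix, dtype=float)
    query = np.asarray(query, dtype=float)
    norms = np.linalg.norm(matrix, axis=1) * np.linalg.norm(query)
    # Rows with zero norm get similarity 0 instead of a division error.
    sims = np.divide(matrix @ query, norms, out=np.zeros(len(matrix)), where=norms > 0)
    dist = 1.0 - sims
    order = np.argsort(dist)[:k]
    return [(labels[i], float(dist[i])) for i in order]

# Toy card-count rows: "b" is "a" scaled by 10, "c" shares no cards with either.
toy = [[2, 1, 0], [20, 10, 0], [0, 0, 5]]
result = nearest_rows(toy, ["a", "b", "c"], query=[2, 1, 0], k=3)
```

Rows "a" and "b" end up at essentially the same distance from the query, which is the scale invariance mentioned above: a commander with ten times as many decklists is not pushed further away.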

There is no reason why our query vector needs to be a real commander, either. All it needs to be is a list of cards.

For example, let's suppose I am somebody who wants to build a Simic deck, but I'm tired of all the linear value engine commanders that are that color identity's bread and butter: Tatyova, Aesi, Kinnan, etc. In that case, I could find the commanders that are furthest from the mean of a list of commanders that I am not interested in.
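A sketch of that idea, again on toy data: average the rows of the unwanted commanders into one synthetic query vector, then sort by cosine distance in descending order. The function name and labels are hypothetical, and this assumes the matrix has already been filtered down to the color identity of interest.

```python
import numpy as np

def furthest_from(matrix, labels, unwanted, k=10):
    """Rank rows by descending cosine distance from the mean of the `unwanted` rows."""
    matrix = np.asarray(matrix, dtype=float)
    idx = [labels.index(name) for name in unwanted]
    query = matrix[idx].mean(axis=0)  # synthetic "commander" to steer away from
    norms = np.linalg.norm(matrix, axis=1) * np.linalg.norm(query)
    sims = np.divide(matrix @ query, norms, out=np.zeros(len(matrix)), where=norms > 0)
    dist = 1.0 - sims
    # Largest distance first, skipping the unwanted commanders themselves.
    order = [i for i in np.argsort(-dist) if i not in idx][:k]
    return [(labels[i], float(dist[i])) for i in order]

toy = [[1, 0, 0], [1, 0.1, 0], [0, 1, 0], [0.5, 0.5, 0]]
labels = ["a", "b", "c", "d"]
picks = furthest_from(toy, labels, unwanted=["a", "b"], k=2)
```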

If I wanted to find a Simic commander that was maximally dissimilar from Tatyova, Aesi, and Kinnan, I think I'd be pretty happy with this list. The best part is that it surfaces commanders that are each different in their own way: Moritte for clones/changelings, Kraj for +1/+1 counters, Edric for draw/evasion, etc.

I could imagine extending this application to allow users to submit a list of wanted or unwanted cards, then asking for a commander that is either close to or far away from that synthetic vector.

Part Two: Deck Themes

The above method is only meaningful at the level of commanders. What if I instead want to discover deck themes that appear in decklists with several different commanders?

For this purpose, I'm going to use Latent Dirichlet Allocation, an unsupervised method for decomposing documents into themes made up of co-occurring words. Though LDA was designed for processing documents, it's broadly useful for many types of high-dimensional count data, from genetic data to Lego color themes.

First, I'll use scikit-learn's CountVectorizer to make our features. Since our "text" is already "tokenized," we just pass dummy functions for tokenization and preprocessing.
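A minimal sketch of the dummy-function trick, assuming each "document" is already a list of card names. The `identity` helper and toy decks are illustrative.

```python
from sklearn.feature_extraction.text import CountVectorizer

def identity(tokens):
    """Pass the decklist through untouched; it is already a list of card names."""
    return tokens

vectorizer = CountVectorizer(
    tokenizer=identity,     # no splitting: each card name is one token
    preprocessor=identity,  # no lowercasing or accent stripping
    token_pattern=None,     # unused when a tokenizer is supplied
)

decks = [["Sol Ring", "Forest", "Forest"], ["Sol Ring", "Island"]]
X = vectorizer.fit_transform(decks)  # sparse deck-by-card count matrix
vocab = vectorizer.vocabulary_       # maps card name -> column index
```

Passing the same function for both steps keeps multi-word names like "Sol Ring" intact as single features.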

Since this is just a demonstration, I'm not going to worry too much about perplexity and coherence scores, or other ways of evaluating the optimal number of components K to include in the topic model. Also, fitting LDA to this data takes 2+ hours on my machine, and this is a personal project, so it hardly seems worth doing a systematic hyperparameter search :).

I started with 50 components and worked up to 250, while still discovering high-quality topics. I've seen topic models with as many as 500 topics. Playing around with it, I suspect I could go higher without getting bogus topics. Intuitively, it seems like it is the nature of MTG decklists that there are many distinct "bundles" of cards that tend to appear together.

Save the model, since that took almost 2.5 hours. In a real application, you'd train the model offline -- maybe retrain it every week, or every night -- and just use the transformed data for serving recommendations.
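To make the fit-then-save step concrete, here is a toy-sized sketch. The random counts, the 3 components, and the use of pickle with a temporary file are all assumptions for illustration; the real run uses the vectorized decklists and far more components.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

rng = np.random.default_rng(0)
X = rng.poisson(2.0, size=(20, 12))  # toy deck-by-card counts

lda = LatentDirichletAllocation(n_components=3, max_iter=5, random_state=0)
doc_topics = lda.fit_transform(X)  # one row per deck, one column per topic

# Persist the fitted model so the expensive fit does not have to be repeated.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "lda_model.pkl")
    with open(path, "wb") as f:
        pickle.dump(lda, f)
    with open(path, "rb") as f:
        restored = pickle.load(f)
```

Each row of `doc_topics` is a distribution over topics, and the restored model gives the same transform as the original.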

Although all the discovered themes are plausible, they vary in how surprising they are. Here are the top cards for an obvious one, which I'd call "mono-black staples."
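For anyone following along, the top cards of a topic can be read off the fitted model's topic-by-card weight matrix (`components_` in scikit-learn's LDA). The sketch below uses a small hand-written array in its place; the helper name and values are hypothetical.

```python
import numpy as np

def top_cards(components, card_names, topic, n=10):
    """Return the n highest-weighted card names for one topic."""
    weights = np.asarray(components)[topic]
    order = np.argsort(weights)[::-1][:n]  # indices sorted by descending weight
    return [card_names[i] for i in order]

# Stand-in for lda.components_: 2 topics x 3 cards.
components = [[0.1, 5.0, 2.0], [3.0, 0.2, 0.1]]
card_names = ["Card A", "Card B", "Card C"]
best = top_cards(components, card_names, topic=0, n=2)
```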

Some are a little more impressive, especially when they span colors. This one looks to be a deathtouch theme:

Here are a few of my other miscellaneous favorites.

I've used the library pyLDAvis for visualizing topic models in the past, so I'll use that here. Sidenote: Python really has something for everything, huh?

The resulting visualization shows a PCA projection of the topics with two components. Mouse over a topic bubble in order to see the top cards of the card distribution for that topic. Mouse over a card name in order to see its importance for different topics.

It's really quite striking how separable the topics are along principal component #1. This appears to be a side effect of color identity. It looks like higher values of PC1 indicate more green cards in that topic. This would explain why the topics with PC1 values close to 0 seem to be mostly 5-color or colorless topics (our Honden topic is in there, as well as a scarecrow topic).

Meanwhile, PC2 seems mostly related to the presence of blue cards, though the case is less dramatic here. If you want to see confirmation of this interpretation, find a topic with "Forest" or "Island" in its most salient terms and mouse over it. The topics that feature Forest are almost invariably on the right side of the map. Meanwhile, topics that feature Island seem bunched in the top 2/3.

It probably says something about the commander format that the two unobserved factors that explain the greatest variation in our topics can be summed up with "has green" or "has blue." Quantitative evidence that Simic dominates EDH?

But the real reason you use LDA is that it lets you represent a document -- or, in this case, decklist -- as a distribution of latent themes. So, for example, we can get the mean vector for a number of decklists and see it represented as a collection of themes (rather than individual cards).

Here are the top 5 topics for the mean Karametra vector.
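Computing that summary is short once the decklists have been transformed into topic space. A sketch, assuming `doc_topics` holds one row of topic weights per decklist for a single commander (toy values below):

```python
import numpy as np

def top_topics(doc_topics, n=5):
    """Average topic weights over decklists and return the n largest as (topic, weight)."""
    mean = np.asarray(doc_topics, dtype=float).mean(axis=0)
    order = np.argsort(mean)[::-1][:n]
    return [(int(t), float(mean[t])) for t in order]

# Two toy decklists over three topics.
doc_topics = [[0.7, 0.2, 0.1], [0.5, 0.1, 0.4]]
summary = top_topics(doc_topics, n=2)
```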

Although some of these themes are generic Selesnya goodstuff, there are also meaningful subthemes that you might not find together in the same Karametra deck: a landfall archetype and an enchantress archetype.

We could use this information to find other decks that are on-theme, but which have different commanders. First, we have to transform the full data.

Let's print the commanders of the 100 decks with the highest topic score for topic 26 (enchantress) and topic 95 (landfall).
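One way to do that lookup, sketched with NumPy and a Counter. The function name, toy weights, and toy labels are assumptions; in the real data `doc_topics` would be the transformed full dataset and `commanders` the matching commander column.

```python
from collections import Counter

import numpy as np

def commanders_for_topic(doc_topics, commanders, topic, n_decks=100):
    """Count commanders among the n_decks decks with the highest weight on `topic`."""
    scores = np.asarray(doc_topics)[:, topic]
    top = np.argsort(scores)[::-1][:n_decks]  # deck indices, highest score first
    return Counter(commanders[i] for i in top)

# Four toy decks over two topics.
doc_topics = [[0.9, 0.1], [0.2, 0.8], [0.8, 0.2], [0.1, 0.9]]
commanders = ["Deck A", "Deck B", "Deck C", "Deck D"]
counts = commanders_for_topic(doc_topics, commanders, topic=0, n_decks=2)
```

Calling `counts.most_common()` then lists the commanders that show up most often among the top-scoring decks.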

Estrid and Tuvasa: other commanders popular for an enchantment theme. Importantly, Tuvasa has an overlapping but not identical color identity with Karametra. This wasn't a commander that came up with our earlier method.

Tatyova, Azusa, and friends: no surprises here. Other commanders that play well with the Baloths.

Well, there you have it. In truth, I've only scratched the surface of what is inside the topic model. If you're interested, I encourage you to play around with the visualization a bit. Some of the topics are really quite impressive in their intelligibility. I'm still impressed that it generated a topic dominated by shock lands, specifically.

I'll close with a few other fun commander topic comparisons.