
Earlier this year, the team at EDHREC.com was generous enough to share with me a snapshot of their data, in the form of a .csv file of over 400,000 decklists scraped from deckbuilding sites. The data provides a detailed look at the aggregate deckbuilding habits of thousands of EDH players around the world.

There are many ways that one could work with this data; I encourage you to check out the articles section on the EDHREC website if you want to read more. For this notebook, I wanted to give a quick demonstration of how common text mining methods can be repurposed for working with decklists.

I’ll just be demonstrating two things:

1) a simple way to recommend an “unusual” commander
2) using topic models to uncover deck themes

Part One: Commander Recommendations

First, our imports.

import csv
import joblib
from collections import defaultdict

import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
import seaborn as sns

Then, load the data. I’m not showing the head of the dataframe because the data is not fully public. At any rate, the only parts needed for this project are commander, color identity, and decklist.

Note that this data is from February of 2021, so it does not include cards released after that date.

data = pd.read_csv('data/decks_w_cards.csv')
data.shape
(475398, 19)

Decklists are saved as a bracketed string; here I’ll exploit the csv module to expand those into a list of lists.

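decklists = []
for _, cards in data.cards.items():  # Series.iteritems() was removed in pandas 2.0
    # Drop the surrounding braces, then let csv.reader split the fields and handle any quoting
    cleanlist = cards.replace('{', '').replace('}', '')
    decklist = next(csv.reader([cleanlist]))
    decklists.append(decklist)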
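To see why the csv module is a good fit here, the following toy example parses a single bracketed string. The sample string and the `parse_cards` helper are hypothetical, and the exact formatting of the real column is an assumption based on the description above. The point is only that a quoted card name containing a comma stays in one piece.

```python
import csv

def parse_cards(raw: str) -> list[str]:
    """Strip the surrounding braces, then let csv handle quoting and commas."""
    cleaned = raw.replace('{', '').replace('}', '')
    return next(csv.reader([cleaned]))

# Hypothetical sample row; the real data is not public.
sample = '{Sol Ring,"Karametra, God of Harvests",Forest}'
parsed = parse_cards(sample)
print(parsed)
```

A plain `split(',')` on the same string would cut the commander's name in two.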

We’ll hold onto the raw decklists for later. In the meantime, I’m going to demonstrate that distance measures like cosine distance can be used to make similarity comparisons that are useful for making commander recommendations.

Since I’ll be recommending commanders (rather than decks, or cards), I first sum card totals by commander.

commanders = defaultdict(lambda: defaultdict(int))
for commander, decklist in zip(data.commander, decklists):
    for card in decklist:
        commanders[commander][card] += 1

As a quick sanity check, let’s make sure we have the same number of commanders in our compiled data as in the original data.

print(len(commanders))
print(len(data.commander.unique()))
1098
1098

Now let’s turn those lists of card totals into a card-commander matrix, with commanders as rows and cards as columns. A single cell represents the total number of times that card appears in a decklist for that commander.

This is laughably easy with pandas’ from_dict() method. This method doesn’t fill missing values by default, so we need to fill them with 0. In this case, having no value for a card indicates there are 0 decklists with that card.

df = pd.DataFrame.from_dict(commanders, orient='index')
df = df.fillna(0) # absent values indicate a card not included in any deck
df = df.astype(int)
df.head()
Commander | Tromokratis | Nimbus Swimmer | Serpent of Yawning Depths | Reliquary Tower | Overwhelming Stampede | Reclamation Sage | Elvish Mystic | Archetype of Imagination | Cultivate | Simic Signet | ... | Undertow | Divine Gambit | Runeforge Champion | Valor of the Worthy | Kongming's Contraptions | Draugr Thought-Thief | \Demonic Lightning\"" | Cinderheart Giant | Craven Hulk | Frostpyre Arcanist
Arixmethes, Slumbering Isle | 981 | 589 | 633 | 740 | 687 | 536 | 182 | 696 | 869 | 867 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0
Tetsuko Umezawa, Fugitive | 2 | 0 | 0 | 270 | 0 | 0 | 0 | 5 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0
Yennett, Cryptic Sovereign | 11 | 0 | 0 | 379 | 0 | 0 | 0 | 12 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0
Kumena, Tyrant of Orazca | 2 | 4 | 1 | 829 | 174 | 19 | 15 | 32 | 841 | 834 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0
Chulane, Teller of Tales | 7 | 6 | 2 | 1691 | 136 | 1785 | 2019 | 65 | 1126 | 484 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0

5 rows × 20900 columns
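The counting-then-pivoting pattern is easier to inspect on a toy input. The commander labels and deck contents below are made up for illustration; only the pandas calls mirror the ones used above.

```python
from collections import defaultdict

import pandas as pd

# Made-up (commander, decklist) pairs standing in for the real data.
toy_decks = [
    ("Commander A", ["Sol Ring", "Forest"]),
    ("Commander A", ["Sol Ring", "Cultivate"]),
    ("Commander B", ["Sol Ring", "Island"]),
]

# Nested counts: commander -> card -> number of decklists containing it.
toy_counts = defaultdict(lambda: defaultdict(int))
for commander, decklist in toy_decks:
    for card in decklist:
        toy_counts[commander][card] += 1

# Rows are commanders, columns are cards; missing pairs become 0.
toy_df = pd.DataFrame.from_dict(toy_counts, orient='index').fillna(0).astype(int)
print(toy_df)
```

Commander B never plays Forest in this toy data, so that cell starts out as NaN and only becomes 0 after `fillna(0)`.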

A simple way to make recommendations is to just recommend entities that are “similar” to an entity of interest, where “similar” refers to the closest other entity given some measure of distance.

I know that EDHREC itself uses Jaccard (aka Tanimoto) distance for measuring the similarity of two cards to each other. For this demonstration, I’ll use cosine distance, which has the nice property of being invariant to scale. So it naturally gets around the problem of some commanders just having more decklists than others.

Cosine distance is often used when working with term-document matrices, which are very similar to the decklist data as I’ve modeled it here. A TDM might similarly have more columns than rows (in this case, ~20,000 columns vs ~1,000 rows).
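The scale invariance is easy to check numerically. Cosine distance is 1 minus the dot product of two vectors divided by the product of their lengths, so multiplying one vector by a constant cancels out. The count vectors below are invented for illustration.

```python
import numpy as np
from scipy.spatial import distance

# Invented card counts for three imaginary commanders.
small = np.array([10, 5, 0, 1])   # few decklists
large = small * 50                # same card proportions, 50x the decklists
other = np.array([0, 1, 10, 5])   # a different mix of cards

same_profile = distance.cosine(small, large)
different_profile = distance.cosine(small, other)
print(same_profile, different_profile)
```

The first distance is zero up to floating-point error, even though the raw counts differ by a factor of 50, while the second is close to 1.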

from scipy.spatial import distance

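def commander_query(df, query_vec, num_recs=10, similar=True):
    """Rank the rows of df by cosine distance to query_vec."""
    # Cosine distance from every commander (row) to the query vector
    distances = df.apply(lambda x: distance.cosine(x, query_vec), axis=1)
    if similar:
        # Smallest distances first: the closest commanders
        return distances.sort_values()[:num_recs]
    else:
        # Largest distances first: the furthest commanders
        return distances.sort_values()[::-1][:num_recs]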
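Applying a Python function row by row is fine at this scale, but the same ranking can be computed in one vectorized call. The sketch below is an alternative, not the method used in this notebook: `commander_query_fast` is a hypothetical name, it relies on scikit-learn's `cosine_distances`, and it assumes `query_vec` is a pandas Series aligned with the columns of `df`. Results should match for rows with at least one non-zero count.

```python
import pandas as pd
from sklearn.metrics.pairwise import cosine_distances

def commander_query_fast(df, query_vec, num_recs=10, similar=True):
    """Vectorized variant: one cosine_distances call instead of df.apply."""
    query = query_vec.to_numpy().reshape(1, -1)
    dists = cosine_distances(df.to_numpy(), query).ravel()
    ranked = pd.Series(dists, index=df.index).sort_values(ascending=similar)
    return ranked.head(num_recs)

# Tiny made-up matrix: three "commanders", two "cards".
toy = pd.DataFrame([[1, 0], [0, 1], [1, 1]], index=['a', 'b', 'c'])
closest = commander_query_fast(toy, toy.loc['a'], num_recs=2)
furthest = commander_query_fast(toy, toy.loc['a'], num_recs=1, similar=False)
print(closest)
print(furthest)
```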

Let’s use one of my own commanders as a demonstration. Here are the 10 commanders that are closest to Karametra, God of Harvests, by cosine distance.

commander_query(df, df.loc['Karametra, God of Harvests'])
Karametra, God of Harvests    0.000000
Dragonlord Dromoka            0.186217
Shalai, Voice of Plenty       0.212438
Selvala, Explorer Returned    0.263370
Gaddock Teeg                  0.306111
Yasharn, Implacable Earth     0.322083
Sigarda, Host of Herons       0.336453
Emiel the Blessed             0.336636
Saffi Eriksdotter             0.339359
Captain Sisay                 0.347628
dtype: float64

There is no reason why our query vector needs to be a real commander, either. All it needs to be is a list of cards.

For example, let’s suppose I am somebody who wants to build a Simic deck, but I’m tired of all the linear value engine commanders that are that color identity’s bread and butter: Tatyova, Aesi, Kinnan, etc. In that case, I could find the commanders that are furthest from the mean of a list of commanders that I am not interested in.

excluded_commanders = [
    'Tatyova, Benthic Druid',
    'Aesi, Tyrant of Gyre Strait',
    'Kinnan, Bonder Prodigy'
]
query_vec = df.loc[excluded_commanders].mean()

mask = (data.coloridentity == '{G,U}') & (data.commander2.fillna('NONE') == 'NONE')
selected = data[mask].commander.unique()
commander_query(df.loc[selected], query_vec, similar=False)
Verazol, the Split Current    0.602868
Eutropia the Twice-Favored    0.569568
Kumena, Tyrant of Orazca      0.538756
Moritte of the Frost          0.497884
Zegana, Utopian Speaker       0.485410
Kaseto, Orochi Archmage       0.484418
Roalesk, Apex Hybrid          0.478084
Experiment Kraj               0.474447
Edric, Spymaster of Trest     0.469616
Vorel of the Hull Clade       0.466763
dtype: float64

If I wanted to find a Simic commander that was maximally dissimilar from Tatyova, Aesi, and Kinnan, I think I’d be pretty happy with this list. The best part is that it surfaces commanders that are each different in their own way: Moritte for clones/changelings, Kraj for +1/+1 counters, Edric for draw/evasion, etc.

I could imagine extending this application to allow users to submit a list of wanted or unwanted cards, then asking for a commander that is either close to or far away from that synthetic vector.

commanders = [
    'Karametra, God of Harvests',
    'Rakdos, Lord of Riots',
    'Lazav, Dimir Mastermind',
    "K'rrik, Son of Yawgmoth",
    "Kykar, Wind's Fury",
    "Atemsis, All-Seeing"
]
query_vec = df.loc[commanders].mean()
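A card-list query could be sketched along the following lines. Everything here is hypothetical: `cards_to_query` is an illustrative helper, the toy matrix is invented, and treating every wanted card as a count of 1 is just one possible weighting.

```python
import pandas as pd
from scipy.spatial import distance

def cards_to_query(columns, cards):
    """Hypothetical helper: a 0/1 vector over the card columns."""
    known_columns = set(columns)
    known = [card for card in cards if card in known_columns]  # skip unknown names
    vec = pd.Series(0.0, index=columns)
    vec.loc[known] = 1.0
    return vec

# Invented commander-by-card counts.
toy = pd.DataFrame(
    {"Forest": [30, 0], "Island": [0, 30], "Cultivate": [5, 0]},
    index=["Green Commander", "Blue Commander"],
)

query = cards_to_query(toy.columns, ["Forest", "Cultivate", "Not A Real Card"])
dists = toy.apply(lambda row: distance.cosine(row, query), axis=1)
print(dists.sort_values())
```

Sorting ascending answers "which commander is closest to these cards"; sorting descending answers the "unwanted cards" version of the question. A query with no recognized cards is an all-zero vector, for which cosine distance is undefined, so a real application would need to handle that case.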

Part Two: Deck Themes

The above method is only meaningful at the level of commanders. What if I instead want to discover deck themes that appear in decklists with several different commanders?

For this purpose, I’m going to use Latent Dirichlet Allocation, an unsupervised method for decomposing documents into themes made up of co-occurring words. Though LDA was designed for processing documents, it’s broadly useful for many types of high-dimensional count data, from genetic data to Lego color themes.

First, I’ll use scikit-learn’s CountVectorizer to make our features. Since our “text” is already “tokenized,” we just pass dummy functions for tokenization and preprocessing.

vectorizer = CountVectorizer(
    tokenizer= lambda x: x,
    preprocessor= lambda x: x,
    token_pattern=None
)  
features = vectorizer.fit_transform(decklists)
features.shape
(475398, 20900)

Since this is just a demonstration, I’m not going to worry too much about perplexity and coherence scores, or other ways of evaluating the optimal number of components K to include in the topic model. Also, fitting LDA to this data takes 2+ hours on my machine, and this is a personal project, so it hardly seems worth doing a systematic hyperparameter search :).

I started with 50 components and worked up to 250, while still discovering high-quality topics. I’ve seen topic models with as many as 500 topics. Playing around with it, I suspect I could go higher without getting bogus topics. Intuitively, it seems like it is the nature of MTG decklists that there are many distinct “bundles” of cards that tend to appear together.
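For anyone who does want a more systematic comparison, one common approach is to fit several values of K on a training split and compare perplexity on held-out decks (lower is better). The sketch below runs on random toy counts, so its numbers mean nothing; the candidate values of K, the split size, and `max_iter=5` are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.model_selection import train_test_split

# Random stand-in for a decks-by-cards count matrix (no real structure).
rng = np.random.default_rng(0)
toy_counts = rng.poisson(0.3, size=(200, 50))
train, held_out = train_test_split(toy_counts, test_size=0.25, random_state=0)

perplexity_by_k = {}
for k in (2, 5, 10):
    lda = LatentDirichletAllocation(n_components=k, random_state=1, max_iter=5)
    lda.fit(train)
    # Perplexity on decks the model has not seen during fitting.
    perplexity_by_k[k] = lda.perplexity(held_out)

for k, score in perplexity_by_k.items():
    print(k, round(score, 1))
```

On the full data, each fit would be slow, so running this kind of comparison on a random subsample of decklists is probably the only practical option. The `n_jobs` parameter of `LatentDirichletAllocation` may also help on a multi-core machine.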

model = LatentDirichletAllocation(
    n_components=250, 
    random_state=1,
    verbose=1
)
model.fit(features)
iteration: 1 of max_iter: 10
iteration: 2 of max_iter: 10
iteration: 3 of max_iter: 10
iteration: 4 of max_iter: 10
iteration: 5 of max_iter: 10
iteration: 6 of max_iter: 10
iteration: 7 of max_iter: 10
iteration: 8 of max_iter: 10
iteration: 9 of max_iter: 10
iteration: 10 of max_iter: 10





LatentDirichletAllocation(n_components=250, random_state=1, verbose=1)

Save the model, since that took almost 2.5 hours. In a real application, you’d train the model offline – maybe retrain it every week, or every night – and just use the transformed data for serving recommendations.

joblib.dump(model, 'model250.joblib')
# model = joblib.load('model250.joblib')
['model250.joblib']
model = joblib.load('model250.joblib')
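The dump-then-load pattern can be wrapped so that a notebook only retrains when no saved model exists. This is a small sketch: `load_or_fit` is a hypothetical helper, and the fake "model" below is a plain dictionary so the example runs instantly.

```python
import tempfile
from pathlib import Path

import joblib

def load_or_fit(path, fit_fn):
    """Load a cached object if present; otherwise build it and cache it."""
    path = Path(path)
    if path.exists():
        return joblib.load(str(path))
    fitted = fit_fn()
    joblib.dump(fitted, str(path))
    return fitted

calls = []

def fake_fit():
    # Stand-in for the expensive model.fit(features) step.
    calls.append(1)
    return {"n_components": 250}

with tempfile.TemporaryDirectory() as tmp:
    cache = Path(tmp) / "model_demo.joblib"
    first = load_or_fit(cache, fake_fit)    # fits and saves
    second = load_or_fit(cache, fake_fit)   # loads from disk

print(len(calls))
```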

Although all the discovered themes are plausible, they vary in how surprising they are. Here are the top cards for an obvious one, which I’d call “mono-black staples.”

feature_names = vectorizer.get_feature_names_out()
theme = model.components_[1]

print(" | ".join([feature_names[i]
                        for i in theme.argsort()[:-10:-1]]))

Cabal Coffers | Urborg, Tomb of Yawgmoth | Demonic Tutor | Expedition Map | Damnation | Phyrexian Arena | Thespian's Stage | Vampiric Tutor | Solemn Simulacrum

Some are a little more impressive, especially when they span colors. This one looks to be a deathtouch theme:

theme = model.components_[3]

print(" | ".join([feature_names[i]
                        for i in theme.argsort()[:-10:-1]]))
Ambush Viper | Deadly Recluse | Sedge Scorpion | Gnarlwood Dryad | Moss Viper | Thornweald Archer | Wasteland Viper | Narnam Renegade | Pharika's Chosen
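The `argsort()[:-10:-1]` idiom used above walks backwards from the last index and stops before the tenth-from-last, so it returns the nine highest-weighted cards, not ten. A small helper makes the count explicit; `top_cards` is a hypothetical name, and the weights below are invented.

```python
import numpy as np

def top_cards(components, feature_names, topic_id, n=9):
    """Return the n highest-weighted card names for one topic."""
    order = np.argsort(components[topic_id])[::-1][:n]
    return [feature_names[i] for i in order]

# Invented topic-by-card weights: 2 topics, 4 cards.
toy_components = np.array([
    [0.1, 5.0, 2.0, 0.3],
    [4.0, 0.2, 0.1, 3.0],
])
toy_names = np.array(["Forest", "Cabal Coffers", "Demonic Tutor", "Island"])

print(" | ".join(top_cards(toy_components, toy_names, topic_id=0, n=2)))
print(len(np.arange(12)[:-10:-1]))
```

With the fitted model, the equivalent call would be `top_cards(model.components_, feature_names, 1)`.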

Here are a few of my other miscellaneous favorites.

print('Theme 9: Shock Lands')
print(" | ".join([feature_names[i]
                        for i in model.components_[9].argsort()[:-10:-1]]))
print()

print('Theme 118: Forced Combat/Rattlesnakes')
print(" | ".join([feature_names[i]
                        for i in model.components_[118].argsort()[:-10:-1]]))
print()

print('Theme 168: Guildgates')
print(" | ".join([feature_names[i]
                        for i in model.components_[168].argsort()[:-10:-1]]))
print()

print('Theme 213: Fleshbag Marauder and Pals')
print(" | ".join([feature_names[i]
                        for i in model.components_[213].argsort()[:-10:-1]]))
print()

print('Theme 239: Hondens/Sanctums')
print(" | ".join([feature_names[i]
                        for i in model.components_[239].argsort()[:-10:-1]]))

Theme 9: Shock Lands
Temple Garden | Hallowed Fountain | Godless Shrine | Overgrown Tomb | Stomping Ground | Sacred Foundry | Breeding Pool | Steam Vents | Watery Grave

Theme 118: Forced Combat/Rattlesnakes
Kazuul, Tyrant of the Cliffs | Disrupt Decorum | Rite of the Raging Storm | Bloodthirsty Blade | Varchild, Betrayer of Kjeldor | Fumiko the Lowblood | Curse of Opulence | Marchesa's Decree | Goblin Spymaster

Theme 168: Guildgates
Simic Guildgate | Golgari Guildgate | Selesnya Guildgate | Gruul Guildgate | Dimir Guildgate | Orzhov Guildgate | Azorius Guildgate | Izzet Guildgate | Boros Guildgate

Theme 213: Fleshbag Marauder and Pals
Fleshbag Marauder | Merciless Executioner | Plaguecrafter | Vona's Hunger | Innocent Blood | Liliana's Triumph | Archfiend of Depravity | Altar's Reap | Barter in Blood

Theme 239: Hondens/Sanctums
Honden of Seeing Winds | Honden of Cleansing Fire | Honden of Infinite Rage | Honden of Life's Web | Sanctum of Stone Fangs | Honden of Night's Reach | Sanctum of Fruitful Harvest | Sanctum of Calm Waters | Sanctum of All

I’ve used the library pyLDAvis for visualizing topic models in the past, so I’ll use that here. Sidenote: Python really has something for everything, huh?

The resulting visualization shows a two-dimensional projection of the topics (by default, pyLDAvis applies principal coordinate analysis to the distances between topics, and it labels the axes PC1 and PC2). Mouse over a topic bubble in order to see the top cards of the card distribution for that topic. Mouse over a card name in order to see its importance for different topics.

import pyLDAvis
import pyLDAvis.sklearn
pyLDAvis.enable_notebook()
viz = pyLDAvis.sklearn.prepare(model, features, vectorizer, sort_topics=False)

pyLDAvis.display(viz)

It’s really quite striking how separable the topics are along principal component #1. This appears to be a side effect of color identity. It looks like higher values of PC1 indicate more green cards in that topic. This would explain why the topics with PC1 values close to 0 seem to be mostly 5-color or colorless topics (our Honden topic is in there, as well as a scarecrow topic).

Meanwhile, PC2 seems mostly related to the presence of blue cards, though the case is less dramatic here. If you want to see confirmation of this interpretation, find a topic with “Forest” or “Island” in its most salient terms and mouse over it. The topics that feature Forest are almost invariably on the right side of the map. Meanwhile, topics that feature Island seem bunched in the top 2/3.

It probably says something about the commander format that the two unobserved factors that explain the greatest variation in our topics can be summed up with “has green” or “has blue.” Quantitative evidence that Simic dominates EDH?

But the real reason you use LDA is that it lets you represent a document – or, in this case, decklist – as a distribution of latent themes. So, for example, we can get the mean vector for a number of decklists and see it represented as a collection of themes (rather than individual cards).

Here are the top 5 topics for the mean Karametra vector.

inds = data[data.commander == 'Karametra, God of Harvests'].index
mean_vec = np.mean(features[inds, :], axis=0)
mean_transformed = model.transform(np.asarray(mean_vec).reshape(1,-1)) # sklearn wants ndarray
for topic_id in np.argsort(mean_transformed[0])[::-1][:5]:
    print('Topic: ', topic_id)
    print(" | ".join([feature_names[i]
                        for i in model.components_[topic_id].argsort()[:-10:-1]]))
    print()
Topic:  236
Canopy Vista | Temple Garden | Sunpetal Grove | Plains | Forest | Scattered Groves | Swords to Plowshares | Mirari's Wake | Eladamri's Call

Topic:  26
Fertile Ground | Eidolon of Blossoms | Enchantress's Presence | Wild Growth | Herald of the Pantheon | Satyr Enchanter | Overgrowth | Forest | Verduran Enchantress

Topic:  95
Rampaging Baloths | Khalni Heart Expedition | Zendikar's Roil | Forest | Sylvan Awakening | Avenger of Zendikar | Harrow | Retreat to Kazandu | Explosive Vegetation

Topic:  123
Rishkar's Expertise | Beast Within | Forest | Sol Ring | Heroic Intervention | Nykthos, Shrine to Nyx | Nissa, Who Shakes the World | Cultivate | Kodama's Reach

Topic:  12
Blossoming Sands | Plains | Selesnya Guildgate | Forest | Selesnya Sanctuary | Graypelt Refuge | Sundering Growth | Selesnya Signet | Tranquil Expanse

Although some of these themes are generic Selesnya goodstuff, there are also meaningful subthemes that you might not find in the same Karametra deck: a landfall vs. an enchantress archetype.
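The mechanics of "average the decks, then transform" can be checked on toy data. The sketch below uses random counts and a small model, so the topics themselves are meaningless; it only shows that `transform` returns a topic mixture that sums to 1, and that averaging per-deck mixtures is a different (also reasonable) summary that generally gives different numbers.

```python
import numpy as np
from scipy import sparse
from sklearn.decomposition import LatentDirichletAllocation

# Random stand-in for the sparse decks-by-cards matrix.
rng = np.random.default_rng(0)
toy_counts = sparse.csr_matrix(rng.poisson(0.5, size=(60, 30)))
toy_lda = LatentDirichletAllocation(n_components=4, random_state=1, max_iter=5)
toy_lda.fit(toy_counts)

rows = np.arange(10)  # stand-in for "all decks of one commander"

# Option 1 (as in the text): average the card counts, then transform once.
mean_vec = np.asarray(toy_counts[rows].mean(axis=0))   # shape (1, 30)
topic_mix = toy_lda.transform(mean_vec)[0]

# Option 2: transform every deck, then average the topic mixtures.
alt_mix = toy_lda.transform(toy_counts[rows]).mean(axis=0)

print(topic_mix.round(3), topic_mix.sum())
print(alt_mix.round(3), alt_mix.sum())
```

One detail worth keeping in mind: the notebook indexes `features` with labels taken from `data.index`. That works because `read_csv` produces a default 0..n-1 index, so labels and row positions coincide; on a filtered or re-sorted frame, position-based indices such as `np.flatnonzero(mask)` would be the safer choice.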

We could use this information to find other decks that are on-theme, but which have different commanders. First, we have to transform the full data.

transformed = model.transform(features)

Let’s print the commanders of the 100 decks with the highest topic score for topic 26 (enchantress) and topic 95 (landfall).

inds = np.argsort(transformed[:, 26])[::-1][:100]
data.iloc[inds].commander.value_counts()
Estrid, the Masked        69
Tuvasa the Sunlit         28
Kestia, the Cultivator     2
Angus Mackenzie            1
Name: commander, dtype: int64

Estrid and Tuvasa: other commanders popular for an enchantment theme. Importantly, Tuvasa has an overlapping but not identical color identity with Karametra. This wasn’t a commander that came up with our earlier method.

inds = np.argsort(transformed[:, 95])[::-1][:100]
data.iloc[inds].commander.value_counts()
Tatyova, Benthic Druid        55
Azusa, Lost but Seeking       17
Jolrael, Empress of Beasts     8
Multani, Yavimaya's Avatar     7
Nissa, Vastwood Seer           4
Baru, Fist of Krosa            2
Yarok, the Desecrated          2
Omnath, Locus of Mana          2
Yisan, the Wanderer Bard       1
Kamahl, Fist of Krosa          1
Omnath, Locus of Rage          1
Name: commander, dtype: int64

Tatyova, Azusa, and friends: no surprises here. Other commanders that play well with the Baloths.
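The two lookups above follow the same recipe: sort decks by their score for one topic, keep the top N, and count commanders. A hypothetical helper, `top_commanders`, captures that recipe; the document-topic matrix and commander labels below are invented.

```python
import numpy as np
import pandas as pd

def top_commanders(doc_topics, commanders, topic_id, n_decks=100):
    """Count commanders among the n_decks decks scoring highest on a topic."""
    top_rows = np.argsort(doc_topics[:, topic_id])[::-1][:n_decks]
    return pd.Series(np.asarray(commanders)[top_rows]).value_counts()

# Invented topic scores for four decks and two topics.
toy_doc_topics = np.array([
    [0.9, 0.1],
    [0.8, 0.2],
    [0.2, 0.8],
    [0.7, 0.3],
])
toy_commanders = ["A", "A", "B", "C"]

report = top_commanders(toy_doc_topics, toy_commanders, topic_id=0, n_decks=3)
print(report)
```

With the real data, the call would look like `top_commanders(transformed, data.commander, 26)`.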

Well, there you have it. In truth, I’ve only scratched the surface of what is inside the topic model. If you’re interested, I encourage you to play around with the visualization a bit. Some of the topics are really quite impressive in their intelligibility. I’m still impressed that it generated a topic dominated by shock lands, specifically.

I’ll close with a few other fun commander topic comparisons.

def commander_topic_report(commander: str):
    """Print a commander's top 3 topics and the commanders of each topic's top 100 decks."""
    inds = data[data.commander == commander].index
    if inds.empty:
        print('No commander found by that name.')
        return
    # Average the card counts over all of this commander's decks
    mean_vec = np.mean(features[inds, :], axis=0)
    mean_transformed = model.transform(np.asarray(mean_vec).reshape(1,-1)) # sklearn wants ndarray
    print("TOP 3 TOPICS")
    topics = np.argsort(mean_transformed[0])[::-1][:3]
    for topic_id in topics:
        print('Topic: ', topic_id)
        print(" | ".join([feature_names[i]
                            for i in model.components_[topic_id].argsort()[:-10:-1]]))
        print()
        print("Commanders of topic's 100 top decks:")
        # Separate name so the commander's own deck indices are not overwritten
        top_decks = np.argsort(transformed[:, topic_id])[::-1][:100]
        print(data.iloc[top_decks].commander.value_counts())
        print()
    
commander_topic_report('Reaper King')
TOP 3 TOPICS
Topic:  204
Scuttlemutt | Wild-Field Scarecrow | Heap Doll | Wingrattle Scarecrow | Eerie Interlude | Forest | Mountain | Scaretiller | Plains

Commanders of topic's 100 top decks:
Reaper King    100
Name: commander, dtype: int64

Topic:  82
Universal Automaton | Mirror Entity | Irregular Cohort | Taurean Mauler | Graveshifter | Avian Changeling | Chameleon Colossus | Changeling Outcast | Impostor of the Sixth Pride

Commanders of topic's 100 top decks:
Morophon, the Boundless    62
The Ur-Dragon              34
Reaper King                 2
Karona, False God           1
Atraxa, Praetors' Voice     1
Name: commander, dtype: int64

Topic:  9
Temple Garden | Hallowed Fountain | Godless Shrine | Overgrown Tomb | Stomping Ground | Sacred Foundry | Breeding Pool | Steam Vents | Watery Grave

Commanders of topic's 100 top decks:
Sisay, Weatherlight Captain    45
Kenrith, the Returned King     18
Golos, Tireless Pilgrim         9
Jodah, Archmage Eternal         8
Niv-Mizzet Reborn               6
Child of Alara                  2
Esika, God of the Tree          2
Progenitus                      1
O-Kagachi, Vengeful Kami        1
Najeela, the Blade-Blossom      1
Sliver Overlord                 1
Horde of Notions                1
Ramos, Dragon Engine            1
Karona, False God               1
Jegantha, the Wellspring        1
General Tazri                   1
The First Sliver                1
Name: commander, dtype: int64
commander_topic_report('Gisela, Blade of Goldnight')
TOP 3 TOPICS
Topic:  248
Sacred Foundry | Clifftop Retreat | Boros Signet | Boros Charm | Command Tower | Battlefield Forge | Sunforger | Smothering Tithe | Swords to Plowshares

Commanders of topic's 100 top decks:
Aurelia, the Warleader           29
Brion Stoutarm                   23
Haktos the Unscarred             15
Archangel Avacyn                 12
Gisela, Blade of Goldnight       12
Bell Borca, Spectral Sergeant     3
Okaun, Eye of Chaos               1
Aurelia, Exemplar of Justice      1
Tajic, Legion's Edge              1
Akroma, Vision of Ixidor          1
Firesong and Sunspeaker           1
Kalemne, Disciple of Iroas        1
Name: commander, dtype: int64

Topic:  90
Gisela, the Broken Blade | Bruna, the Fading Light | Lyra Dawnbringer | Avacyn, Angel of Hope | Emeria Shepherd | Akroma, Angel of Wrath | Plains | Sephara, Sky's Blade | Baneslayer Angel

Commanders of topic's 100 top decks:
Lyra Dawnbringer           51
Avacyn, Angel of Hope      37
Bruna, the Fading Light     7
Sephara, Sky's Blade        3
Akroma, Angel of Wrath      1
Eight-and-a-Half-Tails      1
Name: commander, dtype: int64

Topic:  177
Mountain | Plains | Boros Signet | Assemble the Legion | Boros Guildgate | Boros Garrison | Wind-Scarred Crag | Command Tower | Boros Charm

Commanders of topic's 100 top decks:
Winota, Joiner of Forces         91
Tajic, Legion's Edge              2
Aurelia, Exemplar of Justice      2
Aurelia, the Warleader            1
Agrus Kos, Wojek Veteran          1
Adriana, Captain of the Guard     1
Iroas, God of Victory             1
Razia, Boros Archangel            1
Name: commander, dtype: int64
commander_topic_report('Angus Mackenzie')
TOP 3 TOPICS
Topic:  111
Sterling Grove | Ghostly Prison | Enlightened Tutor | Sphere of Safety | Idyllic Tutor | Hall of Heliod's Generosity | Replenish | Propaganda | Temple Garden

Commanders of topic's 100 top decks:
Tuvasa the Sunlit          72
Estrid, the Masked         15
Angus Mackenzie            10
Golos, Tireless Pilgrim     1
Phelddagrif                 1
Amareth, the Lustrous       1
Name: commander, dtype: int64

Topic:  53
Misty Rainforest | Breeding Pool | Cyclonic Rift | Tropical Island | Rhystic Study | Sylvan Library | Worldly Tutor | Flooded Grove | Alchemist's Refuge

Commanders of topic's 100 top decks:
Kruphix, God of Horizons        30
Rashmi, Eternities Crafter      16
Tatyova, Benthic Druid          11
Riku of Two Reflections          7
Prime Speaker Zegana             7
Uro, Titan of Nature's Wrath     4
Ezuri, Claw of Progress          3
Arixmethes, Slumbering Isle      3
Maelstrom Wanderer               3
Momir Vig, Simic Visionary       2
Derevi, Empyrial Tactician       2
Gor Muldrak, Amphinologist       2
Kodama of the East Tree          1
Koma, Cosmos Serpent             1
Thrasios, Triton Hero            1
Rafiq of the Many                1
Prime Speaker Vannifar           1
Tishana, Voice of Thunder        1
Kydele, Chosen of Kruphix        1
Animar, Soul of Elements         1
Edric, Spymaster of Trest        1
Pir, Imaginative Rascal          1
Name: commander, dtype: int64

Topic:  201
Rites of Flourishing | Collective Voyage | Tempt with Discovery | Howling Mine | Forest | Dictate of Kruphix | Veteran Explorer | Minds Aglow | Temple Bell

Commanders of topic's 100 top decks:
Kynaios and Tiro of Meletis    72
Phelddagrif                    27
Kwain, Itinerant Meddler        1
Name: commander, dtype: int64