D&D Race/Class Combos in ~1 Million Character Sheets

5 minute read

Last year, I scraped data from more than 1 million D&D character sheets from online sources, storing them in a SQLite database. The project was both for my own curiosity, and because I wanted to get some more hands-on practice using SQL. Just storing the data required learning how to use SQLAlchemy, which was worth the time on its own.

Anyways, now I have a big database of D&D characters, from which you can extract all sorts of fun insights. This notebook shows the most common race and class pairings.

import sqlite3

import pandas as pd
import numpy as np
import seaborn as sns
from matplotlib import pyplot as plt

# Connect to the database
con = sqlite3.connect("../../db/main_11-30-21.db")

The query below extracts just the class and race from each character. Due to the way the data was coded, a case statement is necessary to disentangle subraces. It also filters out homebrew classes and class names that contain a parenthesis, which are usually unearthed arcana classes, designated by (UA).

query = '''

SELECT 
    classes.charId AS charId,
    classes.classLevel,
    cd.className,
    CASE 
        WHEN r.isSubRace = 1 THEN
            baseRaceId
        ELSE
            r.fullName
        END raceName
        
FROM classes
    LEFT JOIN characterSheet cs ON classes.charId = cs.Id
    LEFT JOIN race r ON cs.raceId = r.Id
    LEFT JOIN classDefinition cd ON classes.classDefinitionId = cd.Id
WHERE 
    cs.raceID IS NOT NULL 
    AND r.isHomebrew = 0
    AND cd.isHomebrew = 0
    AND NOT raceName LIKE '%(%'
ORDER BY charId;

'''

df = pd.read_sql_query(query, con)
df

	charId	classLevel	className	raceName
0	2510	2	Wizard	Elf
1	2778	1	Warlock	Tiefling
2	2806	3	Druid	Genasi
3	2903	11	Monk	Elf
4	2988	1	Monk	Dragonborn
...	...	...	...	...
1041035	52414131	6	Paladin	Half-Elf
1041036	52414131	9	Sorcerer	Half-Elf
1041037	52414131	5	Warlock	Half-Elf
1041038	52414159	18	Druid	Warforged
1041039	52414174	1	Rogue	Aarakocra

1041040 rows × 4 columns

There are more rows than unique characters. This is because of multiclassed characters, which get a row for each unique class. I’ll leave the data as-is and count each unique class for each character a separate time.

len(df.charId.unique())

We’ll limit ourselves to just the 12 most common races and classes, for a more manageable visualization.

n = 12
class_filter = list(df.className.value_counts()[:n].index)
race_filter = list(df.raceName.value_counts()[:n].index)

df = df[df.className.isin(class_filter) & df.raceName.isin(race_filter)]
df

	charId	classLevel	className	raceName
0	2510	2	Wizard	Elf
1	2778	1	Warlock	Tiefling
2	2806	3	Druid	Genasi
3	2903	11	Monk	Elf
4	2988	1	Monk	Dragonborn
...	...	...	...	...
1041032	52414003	1	Rogue	Elf
1041034	52414055	6	Barbarian	Dragonborn
1041035	52414131	6	Paladin	Half-Elf
1041036	52414131	9	Sorcerer	Half-Elf
1041037	52414131	5	Warlock	Half-Elf

853527 rows × 4 columns

To get counts of race-class pairings, pd.crosstab() can do the trick. Then I’ll express each value as a proportion of the total, then sort the columns by column sum and rows by row sum.

tab = pd.crosstab(df.className, df.raceName)
tab = tab / tab.sum(axis=0).sum()
tab = tab[tab.sum().sort_values(ascending=False).index] # order columns
tab = tab.loc[tab.sum(axis=1).sort_values(ascending=False).index] # order rows
tab

raceName	Human	Elf	Half-Elf	Tiefling	Dragonborn	Dwarf	Halfling	Half-Orc	Aasimar	Genasi	Goliath	Gnome
className
Fighter	0.046209	0.013641	0.007308	0.004413	0.012009	0.012068	0.003273	0.009552	0.003289	0.004169	0.008568	0.002104
Rogue	0.026011	0.027433	0.013677	0.010282	0.004773	0.002703	0.014878	0.002392	0.002129	0.004376	0.001277	0.005129
Wizard	0.023684	0.021548	0.007328	0.005596	0.004147	0.002791	0.002698	0.001203	0.002036	0.003606	0.000965	0.010133
Barbarian	0.014192	0.002970	0.001577	0.002313	0.009702	0.010622	0.002471	0.016702	0.001907	0.002701	0.016979	0.002000
Cleric	0.020795	0.011510	0.007037	0.004647	0.004334	0.014367	0.003281	0.002573	0.006887	0.003699	0.001943	0.002587
Paladin	0.019423	0.005600	0.007600	0.004853	0.013997	0.006096	0.001425	0.004592	0.010450	0.002214	0.004360	0.000935
Warlock	0.016376	0.008866	0.012703	0.017114	0.005292	0.001707	0.002131	0.001964	0.006857	0.003223	0.001224	0.001947
Sorcerer	0.013764	0.011463	0.012260	0.011004	0.008352	0.001740	0.002132	0.001201	0.005480	0.005096	0.000941	0.001823
Bard	0.014830	0.008661	0.015823	0.009915	0.003544	0.002742	0.006547	0.002059	0.003338	0.002853	0.001227	0.003236
Ranger	0.015950	0.025400	0.008815	0.003120	0.003666	0.002689	0.003768	0.002342	0.001547	0.002949	0.001454	0.002187
Druid	0.009741	0.019025	0.006246	0.004015	0.003322	0.003475	0.003380	0.001959	0.001846	0.005482	0.001582	0.003496
Monk	0.015640	0.010230	0.003917	0.002897	0.003794	0.002294	0.003960	0.002447	0.002225	0.004839	0.002595	0.001534

Now, let’s make a heatmap.

dims = (11.7, 8.27)
fig, ax = plt.subplots(figsize=dims)
cmap = sns.cm.rocket_r
heatmap = sns.heatmap(
    ax=ax, 
    data=tab.T,
    cmap=cmap 
)
title = ax.set_title('Race and Class Proportions in ~1 Million D&D Characters')
ylabel = ax.set_ylabel('Race')
xlabel = ax.set_xlabel('Class')

png

There’s a lot of fun information packed in this little graphic. For example, human fighters make up almost 5% – 1 in 20 – of all D&D characters in the dataset. Islands of heat show pairings that are especially popular: elf rogue, goliath barbarian, tiefling warlock.

Also, some races seem to have greater class diversity than others. Goliath are often barbarians and halflings are often rogues, but genasi seem more “spread out” among a variety of classes (probably due to the fact that they have a variety of subrace options).

An easy way to express this quantitatively is with the Gini index. In effect, we can measure the internal “inequality” of the population of each race.

There are a million Gini implementations out there, but I’ll use a concise one I found on stackexchange.

def gini(x, w=None):
    # Array indexing requires reset indexes.
    x = pd.Series(x).reset_index(drop=True)
    if w is None:
        w = np.ones_like(x)
    w = pd.Series(w).reset_index(drop=True)
    n = x.size
    wxsum = sum(w * x)
    wsum = sum(w)
    sxw = np.argsort(x)
    sx = x[sxw] * w[sxw]
    sw = w[sxw]
    pxi = np.cumsum(sx) / wxsum
    pci = np.cumsum(sw) / wsum
    g = 0.0
    for i in np.arange(1, n):
        g = g + pxi.iloc[i] * pci.iloc[i - 1] - pci.iloc[i] * pxi.iloc[i - 1]
    return g

Now we can just apply this along each column, and sort.

race_gini = tab.apply(gini).sort_values()

dims=(10,7)
fig, ax = plt.subplots(figsize=dims)
sns.set_style('darkgrid')
sns.set_color_codes('muted')
ax = sns.barplot(
    x=race_gini, 
    y=race_gini.index,
    orient='h',
    color='b'
)
ax.set_title('Class-wise Gini Coefficient of D&D Races', fontsize=16)
ylabel = ax.set_ylabel('Race', fontsize=14)

Text(0, 0.5, 'Race')

png

There you have it: D&D races sorted by class diversity. Genasi and humans have the most class variety, whereas half-orcs and goliaths are more homogenous.

The Goliath spot is not so surprising, though going into this I would have guessed tieflings would be more homogenous. I guess warlocks don’t dominate the ranks of tieflings as much as barbarians do for goliaths.

Twitter Facebook LinkedIn

Jordan Pruett

D&D Race/Class Combos in ~1 Million Character Sheets

You May Also Enjoy

AutoGPT Vulnerabilities

Topic Modeling MTG Decklists

SQL Practice With D&D Data

‘Journals Like You Also Reviewed…’