Cats (2019) Name Generator Explained

I threw together my Cats Name Generator fairly quickly in advance of a mid-pandemic watch party with some friends, and didn't bother writing up the details at the time. I thought I came up with something quite clever though, and I like to say that the idea came to me in a dream the night before.

Tom Hooper's incredible movie Cats is based on T.S. Eliot's book Old Possum's Book of Practical Cats. The obvious first step for generating new cat names was pulling the existing cat names from that book. I did that using spaCy's default entity recognition:

import spacy

cats = open('practical_cats.txt').read()
nlp = spacy.load("en_core_web_sm")
# Skip front matter of the book
doc = nlp(cats[2620:])

# Collect every word of every entity tagged as a person
cat_names = set()
for ent in doc.ents:
    if ent.label_ == "PERSON":
        for word in ent.text.split(' '):
            cat_names.add(word.lower())

It turns out there aren't that many cats actually mentioned in the book though! This quick attempt with spaCy turned up only 50 names, and only some of those are actually cats. That isn't much to build a generator on, especially for a simple generator that can get stuck if it hasn't seen certain combinations of letters before.

The key impression I get from cat names like Mungojerrie and Skimbleshanks is a nebulous Britishness. To supplement the dataset with more names that share it, my idea was to pull from Wikipedia's List of towns in England. Names like Barnoldswick have the same kind of feeling as Rumpleteazer.
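A rough sketch of pulling that list, assuming the page keeps its towns in standard wikitables with the name in the first column (the parsing details are an assumption and may need tweaking against the live article):

import requests
from bs4 import BeautifulSoup

# Fetch the article and pull the first cell of each table row as a town name.
resp = requests.get("https://en.wikipedia.org/wiki/List_of_towns_in_England")
soup = BeautifulSoup(resp.text, "html.parser")

town_names = set()
for table in soup.find_all("table", class_="wikitable"):
    for row in table.find_all("tr")[1:]:  # skip the header row
        cell = row.find("td")
        if cell:
            town_names.add(cell.get_text(strip=True).lower())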

My final name generator was a simple Markov chain built from letter bigrams. We generate names one letter at a time, choosing each next letter based on the one we've just seen:

import random

def generate(probs):
    # Start from the word-boundary marker (a space) and keep sampling
    # letters until we hit another boundary.
    word = []
    current_char = ' '
    next_char = None
    while next_char != ' ':
        choices = list(k[1] for k in probs
                       if k[0] == current_char)
        weights = list(probs[(current_char, c2)] for c2 in choices)
        next_char = random.choices(choices, weights=weights)[0]
        word.append(next_char)
        current_char = next_char
    return ''.join(word).strip()
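As a quick sanity check, here's a tiny hand-written probability table where every transition is deterministic, so the generator can only ever spell one thing (toy_probs is just an illustrative name):

# Every letter has exactly one possible successor, so generate()
# always walks ' ' -> 'c' -> 'a' -> 't' -> ' ' and returns 'cat'.
toy_probs = {
    (' ', 'c'): 1.0,
    ('c', 'a'): 1.0,
    ('a', 't'): 1.0,
    ('t', ' '): 1.0,
}
print(generate(toy_probs))  # -> 'cat'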

Training the model is basically a matter of counting how often each letter follows the one before it, i.e. counting letter bigrams:

import collections
import string

def train_on_bigrams(words):
    counts = collections.defaultdict(int)
    
    for word in words:
        # Keep only letters, lower-cased to match the rest of the pipeline
        word = ''.join(c for c in word.lower() if c in string.ascii_lowercase)
        word = ' ' + word + ' '
        
        for c1, c2 in zip(word[:-1], word[1:]):
            counts[(c1, c2)] += 1
            
    # Convert to probabilities
    probs = collections.defaultdict(float)
    
    c1s = set(k[0] for k in counts)
    for c1 in c1s:
        total_count = sum(v for k, v in counts.items()
                          if k[0] == c1)
        c2s = set(k[1] for k in counts if k[0] == c1)
        for c2 in c2s:
            count = counts[(c1, c2)]
            probs[(c1, c2)] = count / total_count
            
    return probs
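To make the counting concrete, here's roughly what the table looks like for a toy word list:

# Toy example: two three-letter words sharing a prefix.
probs = train_on_bigrams(['cat', 'car'])
# probs ends up as:
#   (' ', 'c'): 1.0   every word starts with 'c'
#   ('c', 'a'): 1.0
#   ('a', 't'): 0.5   't' and 'r' each follow 'a' half the time
#   ('a', 'r'): 0.5
#   ('t', ' '): 1.0   't' only ever ends a word
#   ('r', ' '): 1.0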

Then we can combine the probabilities we get from the two datasets (cat names and English towns) using a weighted sum:

def combine_models(model1, model2, weight1=1.0, weight2=1.0):
    total_weight = weight1 + weight2
    combined_model = {}
    letters = ' ' + string.ascii_lowercase
    for c1 in letters:
        # Every follow-on letter seen in either model
        c2s = set(k[1] for k in model1 if k[0] == c1)
        for k in model2:
            if k[0] == c1:
                c2s.add(k[1])

        for c2 in c2s:
            # Use .get() so a pair missing from one model just counts as 0
            new_prob = (model1.get((c1, c2), 0.0) * weight1
                        + model2.get((c1, c2), 0.0) * weight2) / total_weight
            combined_model[(c1, c2)] = new_prob
    return combined_model

The final model combined the cat-name probabilities with a weight of 3 and the town-name probabilities with a weight of 1, so the output leans heavily towards cat-like spellings where possible but can fall back on the town-name data where there isn't much to go on. The whole pipeline ties together roughly as sketched below. Give it a try!
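Here cat_names is the set from the spaCy step and town_names the set from the scraping sketch earlier; the weights are the ones just described:

# Train on both datasets, combine with the weights above,
# then sample a few new names.
cat_model = train_on_bigrams(cat_names)
town_model = train_on_bigrams(town_names)
model = combine_models(cat_model, town_model, weight1=3.0, weight2=1.0)

for _ in range(5):
    print(generate(model))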