# Cats (2019) Name Generator Explained

September 25, 2022

I threw together my Cats Name Generator fairly quickly in advance of a mid-pandemic watch party with some friends, and didn't bother writing up the details at the time. I thought I came up with something quite clever though, and I like to say that the idea came to me in a dream the night before.

Tom Hooper's incredible movie Cats is based on T.S. Eliot's book Old Possum's Book of Practical Cats. The obvious first step for generating new cat names was pulling the existing names out of that book, which I did using spaCy's default named-entity recognition:

```
import spacy

cats = open('practical_cats.txt').read()
nlp = spacy.load("en_core_web_sm")
# Skip the front matter of the book
doc = nlp(cats[2620:])
cat_names = set()
for ent in doc.ents:
    if ent.label_ == "PERSON":
        # Entities can span multiple words; collect each word separately
        for word in ent.text.split(' '):
            cat_names.add(word.lower())
```

It turns out there aren't that many cats actually mentioned in the book! This quick pass with spaCy turned up only 50 names, and only some of them are actually cats. That isn't much to build a generator on, especially a simple generator that can get stuck if it hasn't seen certain combinations of letters before.

The key impression I get from cat names like Mungojerrie and Skimbleshanks is a nebulous *Britishness*. To supplement the dataset with more names with that same quality, my idea was to pull from Wikipedia's List of towns in England. Names like Barnoldswick have the same kind of feeling as Rumpleteazer.

My final name generator was a simple Markov chain built on letter bigrams: we generate names one letter at a time, choosing each new letter based on the one before it:

```
import random

def generate(probs):
    word = []
    current_char = ' '
    next_char = None
    # A space marks the end of a name
    while next_char != ' ':
        # Every letter we've seen following the current one, with its probability
        choices = [k[1] for k in probs if k[0] == current_char]
        weights = [probs[(current_char, c2)] for c2 in choices]
        next_char = random.choices(choices, weights=weights)[0]
        word.append(next_char)
        current_char = next_char
    return ''.join(word).strip()
```
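One detail worth noting: `random.choices` samples using *relative* weights, so they don't need to sum to 1. That's handy here, because it means a set of per-letter weights doesn't have to be renormalized before sampling. A minimal illustration (the letters and weights are just made up for the demo):

```python
import random

random.seed(0)
letters = ['a', 'b', 'c']
weights = [3, 2, 1]  # relative weights; they don't need to sum to 1
draws = random.choices(letters, weights=weights, k=6000)
# 'a' should come up roughly half the time (3 / (3 + 2 + 1))
print(draws.count('a') / len(draws))
```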

Training the model is basically about counting how often each letter follows each other letter, then converting those bigram counts into probabilities:

```
import collections
import string

def train_on_bigrams(words):
    counts = collections.defaultdict(int)
    for word in words:
        # Keep letters only, lowercased, and pad with spaces
        # to mark the start and end of each name
        word = ''.join(c for c in word if c in string.ascii_letters)
        word = ' ' + word.lower() + ' '
        for c1, c2 in zip(word[:-1], word[1:]):
            counts[(c1, c2)] += 1
    # Convert to probabilities
    probs = collections.defaultdict(float)
    c1s = set(k[0] for k in counts)
    for c1 in c1s:
        total_count = sum(v for k, v in counts.items()
                          if k[0] == c1)
        c2s = set(k[1] for k in counts if k[0] == c1)
        for c2 in c2s:
            probs[(c1, c2)] = counts[(c1, c2)] / total_count
    return probs
```
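For intuition, here's the same counts-to-probabilities arithmetic worked by hand on a tiny made-up vocabulary (the two "names" below are placeholders, not real data):

```python
import collections

# Two toy names, padded with spaces to mark start and end
words = [' ab ', ' ac ']
counts = collections.defaultdict(int)
for w in words:
    for c1, c2 in zip(w[:-1], w[1:]):
        counts[(c1, c2)] += 1

# 'a' was followed by 'b' once and 'c' once, so each continuation gets 0.5
total_a = sum(v for k, v in counts.items() if k[0] == 'a')
print(counts[('a', 'b')] / total_a)  # → 0.5
```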

Then we can combine the probabilities we get from the two datasets (cat names and English towns) using a weighted sum:

```
def combine_models(model1, model2, weight1=1.0, weight2=1.0):
    total_weight = weight1 + weight2
    combined_model = {}
    letters = ' ' + string.ascii_lowercase
    for c1 in letters:
        # Every letter either model has seen following c1
        c2s = set(k[1] for k in model1 if k[0] == c1)
        c2s.update(k[1] for k in model2 if k[0] == c1)
        for c2 in c2s:
            # .get() so a pair missing from one model counts as probability 0
            new_prob = (model1.get((c1, c2), 0.0) * weight1
                        + model2.get((c1, c2), 0.0) * weight2) / total_weight
            combined_model[(c1, c2)] = new_prob
    return combined_model
```

The final model combined the cat-name probabilities with a weight of 3 and the town-name probabilities with a weight of 1, so the output leans heavily towards cats where possible, but falls back on the town-name data where there isn't much to go on. Give it a try!
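Putting the pieces together, here's a self-contained sketch of the whole pipeline, with compact versions of the functions above and a handful of hard-coded names standing in for the real scraped datasets:

```python
import collections
import random
import string

def train_on_bigrams(words):
    counts = collections.defaultdict(int)
    for word in words:
        # Keep letters only, lowercased, padded with spaces
        word = ' ' + ''.join(c for c in word.lower()
                             if c in string.ascii_letters) + ' '
        for c1, c2 in zip(word[:-1], word[1:]):
            counts[(c1, c2)] += 1
    probs = collections.defaultdict(float)
    for c1 in set(k[0] for k in counts):
        total = sum(v for k, v in counts.items() if k[0] == c1)
        for k in counts:
            if k[0] == c1:
                probs[k] = counts[k] / total
    return probs

def combine_models(m1, m2, w1=1.0, w2=1.0):
    combined = {}
    for c1 in ' ' + string.ascii_lowercase:
        c2s = set(k[1] for k in m1 if k[0] == c1)
        c2s.update(k[1] for k in m2 if k[0] == c1)
        for c2 in c2s:
            combined[(c1, c2)] = (m1.get((c1, c2), 0.0) * w1
                                  + m2.get((c1, c2), 0.0) * w2) / (w1 + w2)
    return combined

def generate(probs):
    word, current = [], ' '
    while True:
        choices = [k[1] for k in probs if k[0] == current]
        weights = [probs[(current, c2)] for c2 in choices]
        current = random.choices(choices, weights=weights)[0]
        if current == ' ':  # space marks the end of the name
            return ''.join(word)
        word.append(current)

# Tiny stand-in datasets; the real ones were much bigger
cat_names = ['mungojerrie', 'skimbleshanks', 'macavity']
town_names = ['barnoldswick', 'oswaldtwistle']
model = combine_models(train_on_bigrams(cat_names),
                       train_on_bigrams(town_names),
                       w1=3.0, w2=1.0)
print(generate(model))
```

With so little training data this mostly regurgitates fragments of its inputs, but it shows the shape of the thing: train two models, blend them 3-to-1, sample until you hit a terminating space.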