
A Name Across Time and Numbers

Using data, I explore the names Cassandra and Zoe, their variants, and how their popularity changes over time.

This post is a loving tribute to my daughter, whose eyes are shining stars in these dark and troubled nights.

Names evolve. Parents take a name from pop culture, history, sports, current events, or sacred texts and add their own spin to it. The act of naming evokes the meaning of the original name while leaving blank pages for the newborn to write.

Take Cassandra, for example. There are shorter variants such as Cass. Some consonants might be replaced: in some countries, ‘c’ becomes ‘k’, as in Kassandra. Interestingly, it is common in the Philippines to add an ‘h’ to the spelling (Cassandhra). Here’s a quick plot of the variants over time.

Read on to see how I did it! Here’s the Kaggle Notebook, if anyone’s interested.

Identifying Variants

With data, we can get a glimpse of how names and their variants move in popularity over time. I used the US Baby Names dataset, which is gathered from US Social Security data. I then used the Double Metaphone algorithm to group names by their English pronunciation. Designed by Lawrence Philips in 1990, the original Metaphone algorithm does its phonetic matching through complex rules for variations in vowel and consonant sounds. Since then, there have been two updates to the algorithm: Double Metaphone and Metaphone 3. Fortunately for us, there is a Python port of the C/C++ code, the fuzzy library. The result is a grouping of names like:

Mark -> MRK
Marc -> MRK
Marck -> MRK
Marco -> MRK
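
To make the grouping step concrete, here is a minimal sketch that buckets names by their fingerprint. It hard-codes the four codes listed above instead of calling the fuzzy library, and group_by_fingerprint is a hypothetical helper name, not part of any library.

```python
from collections import defaultdict

# Fingerprints copied from the examples above. They are hard-coded here for
# illustration; in the real pipeline they come from Double Metaphone.
FINGERPRINTS = {"Mark": "MRK", "Marc": "MRK", "Marck": "MRK", "Marco": "MRK"}


def group_by_fingerprint(name_to_code):
    """Bucket names that share the same phonetic code (hypothetical helper)."""
    groups = defaultdict(list)
    for name, code in name_to_code.items():
        groups[code].append(name)
    return dict(groups)


groups = group_by_fingerprint(FINGERPRINTS)
print(groups)
```

Every name that lands in the same bucket is treated as a variant of the others, which is exactly the property the rest of this post relies on.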

In the following code, we first get the fingerprint (a.k.a. hash code) of all the names in the data:

import fuzzy
import pandas as pd

df = pd.read_csv("../input/us-baby-names/NationalNames.csv")
names = df["Name"].unique()
fingerprint_algo = fuzzy.DMetaphone()
# Keep only the primary Double Metaphone code of each name
list_fingerprint = [fingerprint_algo(n)[0] for n in names]

The result is a phonetic index for every name. With simple filtering, we can then extract the variants of both Cassandra and Cass.

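def get_subset(df, df_fp, names):
    # Collect the fingerprints of the seed names
    fingerprint_candidates = []
    for name in names:
        matches = df_fp[df_fp["name"] == name]["fingerprint"]
        fingerprint_candidates.extend(matches.tolist())

    # Every name that shares a fingerprint with a seed name is a variant
    name_candidates = df_fp.loc[df_fp["fingerprint"].isin(fingerprint_candidates), "name"]

    # Keep only the rows for those variants, restricted to girls
    df_subset = df[(df["Name"].isin(name_candidates)) & (df["Gender"] == "F")]
    return df_subset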

# using my function
df_fp_names = pd.DataFrame([list_fingerprint, names]).T
df_fp_names.columns = ["fingerprint", "name"]
df_subset = get_subset(df, df_fp_names, ["Cass", "Cassandra"])
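
For readers without pandas at hand, the three steps inside get_subset can be sketched with plain Python containers. This is only an illustration under stated assumptions: variants_subset is a hypothetical name, the "Mar*" codes are the ones listed earlier in the post, and the "Anna" code and every count are made up.

```python
# Toy stand-ins for the two tables. The "Anna" -> "AN" code and all the
# counts below are invented for illustration; they are not from the dataset.
name_to_code = {"Mark": "MRK", "Marc": "MRK", "Marco": "MRK", "Anna": "AN"}
rows = [
    {"Name": "Mark", "Year": 1990, "Gender": "M", "Count": 100},
    {"Name": "Marc", "Year": 1990, "Gender": "M", "Count": 10},
    {"Name": "Marco", "Year": 1990, "Gender": "F", "Count": 1},
    {"Name": "Anna", "Year": 1990, "Gender": "F", "Count": 50},
]


def variants_subset(rows, name_to_code, seeds, gender):
    """Return the rows whose name sounds like one of the seed names."""
    # Step 1: fingerprints of the seed names
    seed_codes = {name_to_code[s] for s in seeds if s in name_to_code}
    # Step 2: every name that shares one of those fingerprints
    variants = {n for n, c in name_to_code.items() if c in seed_codes}
    # Step 3: rows for those names, restricted to the requested gender
    return [r for r in rows if r["Name"] in variants and r["Gender"] == gender]


subset = variants_subset(rows, name_to_code, ["Mark"], "M")
print([r["Name"] for r in subset])
```

A seed name that is missing from the index simply contributes no fingerprint, so the function returns an empty list instead of raising an error.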

Variants of Cassandra

We can then plot the most popular variants of Cassandra throughout the 20th century. I also plot a log-scale version so that the variants in the middle of the pack are easier to see. I’m using the plotnine library so I can write ggplot-style code.

from plotnine import aes, geom_line, geom_text, ggplot, labs, scale_y_continuous

ggplot(df_top_n_global, aes(x = "Year", y = "Count", colour = "Name")) + \
    geom_text(aes(label = "Name"), show_legend = False) + \
    geom_line() + \
    labs(y = 'Number of babies', title = 'Cass: 1900\'s and beyond') + \
    scale_y_continuous(trans='log10')

Most Popular Variants of Cassandra
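
A quick bit of arithmetic shows why the log scale helps the names in the middle of the pack. The three counts below are hypothetical, chosen only to span several orders of magnitude.

```python
import math

# Hypothetical yearly counts, NOT taken from the dataset
counts = {"top variant": 10000, "middle variant": 100, "rare variant": 10}

# Position of each name on a log10 axis
log_positions = {name: math.log10(count) for name, count in counts.items()}

# Gap between neighbouring names on each kind of axis
linear_gaps = (10000 - 100, 100 - 10)
log_gaps = (
    log_positions["top variant"] - log_positions["middle variant"],
    log_positions["middle variant"] - log_positions["rare variant"],
)

print(linear_gaps)  # the top gap dwarfs the bottom gap
print(log_gaps)     # the two gaps are of comparable size
```

On the linear axis the middle and rare variants are squeezed together near the bottom, while on the log axis they are spread out enough to tell apart.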

It looks like Cassandra and nearly all of its most popular variants peaked in the ’90s and have since declined sharply. I have read that the name even reached the top 70 in the United States at one point in the ’90s. Other popular variants include Casey, Cassie and Kassandra, all of which have been decreasing in popularity.

Most Popular Variants of Zoe

The story is different for Zoe, where we find a very sharp uptick in popularity in the 2000s. Perhaps the show Zoey 101 had something to do with it? There is also that classic rom-com, 500 Days of Summer, in 2009…

One identified variant, Sue, has however dropped sharply in popularity since the 1950s. I’m quite sure that the two names are pronounced differently, so Double Metaphone has mistakenly indexed them together, most likely because it treats ‘z’ and ‘s’ as the same sound and ignores vowels that are not at the start of a name.

It’s very interesting to study how names evolve, get more popular, and become rarer over time. Famous people and events always have a hand in it, so names become a reflection of the times. I remember a trend where parents were naming their children Daenerys or adding a Jon somewhere in the name. I’m guessing that as of the time of this post’s writing, there may be a lot of Anthonys being born. Even X Æ A-12 is a sign of the times.

Since I’m only using data from the US, it would be even more fascinating to see each country’s trends and variants. What are each country’s naming styles? What does cross-pollination of naming styles look like? Are gender-neutral names becoming more popular? These are all fascinating questions to explore!

By krsnewwave

I'm a software engineer and a data science guy on recommender systems, natural language processing, and computer vision.
