By Francesco Tisiot
Choosing the name of your new child is really serious, so as a data professional you will naturally want to reach for PostgreSQL®. Right?
Naming variables is one of the hardest problems in Computer Science, but when the variable is a new life, the challenge can rise quite significantly. Family names, personal preferences, celebrities, vetoes, word sound, it all goes into the decision mix! Even more, the possible pool of names is huge — and we all know that the traditional “let’s browse for names in a book” approach doesn’t scale anymore with the ever-increasing set of options.
Help is needed, and this blog post shows how PostgreSQL® can help us refine our search and get us closer to a final selection.
Alert: all choices, decisions, and names used in this example are pure imagination and don’t represent any kind of recommendation!
Get the dataset
To start with, we need a dataset of valid names. This post uses this collection of baby names from data.world, for which you need a valid account to access the data.
Once you have created the account, you will be able to download the dataset in a babynames-clean.csv
file on your local machine.
Create a PostgreSQL service and load the data
You can run PostgreSQL locally or, if you prefer, create an instance using Aiven for PostgreSQL® with the Aiven CLI command, as shown below:
avn service create demo-pg \
--service-type pg \
--cloud google-europe-west3 \
--plan hobbyist
The command starts creation of a PostgreSQL service (--service-type pg
) on the Google Europe West 3 cloud region with the minimal hobbyist
plan, which is enough for our needs. The service takes a couple of minutes to start, and we can wait for it to be up with:
avn service wait demo-pg
Once the service is in RUNNING
state, we can connect to it with the following command that calls psql
under the hood:
avn service cli demo-pg
After connecting we need to create a table, since babynames-clean.csv
doesn't contain any header:
create table names (name text, gender text);
Now it’s time to copy the data:
\copy names from 'babynames-clean.csv' csv
Now the data is loaded in the names
table and we can start the research of our imaginary baby's name.
Initial scanning and filtering of the data
Okay, time to choose a name! Is it going to be a boy or a girl? If you already have that information, it’s going to cut out half of the options, since the query below shows that we have a more or less equal amount of male and female names.
select
gender,
count(*) nr_names
from names
group by gender;
The result shows slightly more boys’ names than girls’.
gender | nr_names
--------+----------
girl | 3345
boy | 3437
(2 rows)
What if we don’t know the gender yet, and we have almost seven thousand names to pick from? Scanning all of them one by one is not an option. We need to apply some filters. Do you have a long surname? It might be nice to save the poor kid from the stress of writing out a long first name as well. Let’s filter for names with fewer than 6 characters.
select count(*) from names where length(name) < 6;
We still have lots of results (2632), so some more pruning is needed … I personally like names ending with vowels — can we add this constraint? Yes we can, with the help of regular expressions:
select count(*) from names
where length(name) < 6
and name similar to '%(a|e|i|o|u|y)';
With the help of regular expressions we can take it even further, like specifying 1 l
or 2 consecutive l
s:
select *
from names
where
length(name) < 6
and name similar to '%(l){1,2}%(a|e|i|o|u|y)'
limit 10;
Regular expressions can only help so much, and we still have too many names in our list. We need some more ways to exclude options.
Avoid similar names
We all have names we never want to hear again. Maybe they remind us of someone we don’t want to think about, or they’re just taken by somebody else we hear about way too often. It’s easy to avoid these names, but we can use PostgreSQL’s FuzzyStrMatch to go a step further and eliminate anything even vaguely similar. We can install the extension with:
create extension fuzzystrmatch;
The extension brings us several functions. The first one, named Levenstein difference, helps us understand how many characters make two strings differ. Let’s say, for example, we don’t want anything that is written too similar to Allie
.
select *,
levenshtein(name, 'Allie') diff
from names
where
length(name) < 6
and name similar to '%(l){1,2}%(a|e|i|o|u|y)'
order by diff
limit 10;
Returns
name | gender | diff
-------+--------+------
Allie | boy | 0
Altie | girl | 1
Alfie | boy | 1
Arlie | boy | 1
Alvie | boy | 1
Alcie | girl | 1
Algie | boy | 1
Ellie | boy | 1
Ollie | boy | 1
Alvia | boy | 2
(10 rows)
We can see that the Levenshtein difference between:
Allie
andAllie
is 0 since they are the same stringAllie
andAltie
is 1 (thet
)Allie
andAlfie
is 1 (thef
)Allie
andAlvia
is 2 (thev
anda
)
We can remove all names with a Levenshtein difference of less than 3, since this helps us avoid names written too similarly to Allie
.
select count(*)
from names
where
length(name) < 6
and name similar to '%(l){1,2}%(a|e|i|o|u|y)'
and levenshtein(name, 'Allie') > 3;
We still have 235 rows, but the difference in the number of characters is just the beginning. We don’t even want to hear a sound similar to Allie
! Luckily the FuzzyStrMatch extension brings us another function, named soundex
that provides the Soundex code of a string. Let's try with Allie
itself:
select soundex('Allie');
The result is A400
which, by itself, is not very useful. But FuzzyStrMatch also offers difference
which, despite the name, provides a Soundex similarity score between 0 (two strings sound different) to 4 (two strings sound the same). If we try with our dataset:
select *,
soundex(name) sdx,
difference(name, 'Allie') sdx_score
from names
where
length(name) < 6
and name similar to '%(l){1,2}%(a|e|i|o|u|y)'
and levenshtein(name, 'Allie') > 3
order by sdx_score desc
limit 5;
This shows that Clay
, Milo
, Riley
, Wiley
and Rolla
sound a bit too close to our Allie
.
name | gender | sdx | sdx_score
-------+--------+------+-----------
Clay | boy | C400 | 3
Milo | boy | M400 | 3
Riley | boy | R400 | 3
Wiley | boy | W400 | 3
Rolla | boy | R400 | 3
(5 rows)
We want to go for an entirely different sound, so let’s remove anything with a Soundex score greater than 2.
select count(*)
from names
where
length(name) < 6
and name similar to '%(l){1,2}%(a|e|i|o|u|y)'
and levenshtein(name, 'Allie') > 3
and difference(name, 'Allie') <= 2;
The final touch: the metaphone
We still have 115 rows, a few too many to choose from. The final step in the selection could be based on a name of which we like the sound, and therefore we’d like to explore similar-sounding ones. Let’s take Marlo
as an example. To retrieve names that sound similar we can either use the Soundex function again, or we could evolve our approach and use the Metaphone or double metaphone for higher accuracy.
select dmetaphone('Marlo') dmet;
The double metaphone function returns the MRL
string. We can use that, together with the Levenshtein difference, to find all strings which vary from it by only one character.
select count(*)
from names
where
length(name) < 6
and name similar to '%(l){1,2}%(a|e|i|o|u|y)'
and levenshtein(name, 'Allie') > 3
and difference(name, 'Allie') <= 2
and levenshtein(dmetaphone(name), dmetaphone('Marlo')) <=1;
Now we have to parse only 26 rows, and frankly, this is where the hard choices begin. Still, with the help of PostgreSQL, some regular expressions and the FuzzyStrMatch extension, we were able to prune a lot of our possible dataset.
Which name is best? The final selection is, of course, up to you.
Ready-to-use functions for string similarity
String similarity is not an easy topic, but luckily the FuzzyStrMatch extension provides a few functions like Levenshtein, Soundex and Metaphone that can help us solve some common problems when comparing texts.
Some further resources:
- FuzzyStrMatch: to understand functions and additional parameters
- Pg similarity: another PostgreSQL extension providing a different set of text similarity functions
- Pattern matching: to review the details of regular expression usage