Recently I’ve been working on a project that needs data on English name variations. Searching the internet is always the first step. Thanks to Mike Campbell, I found www.behindthename.com/, a website with a huge data set of human names (and not only English ones).
So I decided to write a web crawler to build a local copy of the data set.
Packages
Just some standard packages for reading web pages and writing csv files.
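The exact package list is not shown here, so the following is only a minimal sketch, assuming Python and nothing beyond its standard library. The `fetch` helper, its `pause` parameter, and the User-Agent string are hypothetical names introduced for illustration.

```python
import csv                           # writing the output file
import time                          # pausing between requests
import urllib.request                # downloading web pages
from html.parser import HTMLParser   # parsing HTML without third-party packages


def fetch(url, pause=1.0):
    """Download a page and return its content as text (hypothetical helper)."""
    request = urllib.request.Request(
        url, headers={"User-Agent": "name-crawler-sketch"}  # assumed value
    )
    with urllib.request.urlopen(request) as response:
        html = response.read().decode("utf-8", errors="replace")
    time.sleep(pause)  # wait a little so the crawler does not hammer the site
    return html
```

A third-party HTML parser would work just as well; the standard library is used here only to keep the sketch self-contained.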
Find out what to grab
First I browsed the website carefully.
There’s a full list of names on https://www.behindthename.com/names. For each name, all its variants can be found at www.behindthename.com/name/XXX/related.
My plan is to get the URL for each name, go to its related page, parse the variants, and save them in a list.
Get a list of URLs
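A minimal sketch of this step, assuming the index page links to each name with a relative href of the form `/name/<slug>`. That link pattern is an assumption to check against the real markup, and `NameLinkParser`, `related_urls`, and `BASE_URL` are hypothetical names.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

BASE_URL = "https://www.behindthename.com/"


class NameLinkParser(HTMLParser):
    """Collect slugs from links that look like /name/<slug> (assumed pattern)."""

    def __init__(self):
        super().__init__()
        self.slugs = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href") or ""
        parts = href.strip("/").split("/")
        # Keep only two-part paths such as /name/aaron, skipping duplicates.
        if len(parts) == 2 and parts[0] == "name" and parts[1] not in self.slugs:
            self.slugs.append(parts[1])


def related_urls(index_html):
    """Return the 'related' page URL for every name linked in index_html."""
    parser = NameLinkParser()
    parser.feed(index_html)
    return [urljoin(BASE_URL, f"name/{slug}/related") for slug in parser.slugs]
```

The function takes HTML text instead of a URL, so it can be tried on a saved copy of the page before any crawling starts. If the full list spans several pages, each page would need to be fed through it in turn.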
Get a list of name variants
Since I only care about English names, I simply ignore names that cannot be encoded in ASCII.
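A minimal sketch of the parsing and the ASCII filter, assuming each variant on the related page is the text of a link whose href starts with `/name/`. That is an assumption about the markup: the real page may contain navigation links that match the same pattern and need extra filtering. `VariantParser`, `is_ascii`, and `ascii_variants` are hypothetical names.

```python
from html.parser import HTMLParser


class VariantParser(HTMLParser):
    """Collect the text of links that point at /name/... (assumed markup)."""

    def __init__(self):
        super().__init__()
        self.variants = []
        self._in_name_link = False

    def handle_starttag(self, tag, attrs):
        href = dict(attrs).get("href") or ""
        if tag == "a" and href.startswith("/name/"):
            self._in_name_link = True

    def handle_endtag(self, tag):
        if tag == "a":
            self._in_name_link = False

    def handle_data(self, data):
        text = data.strip()
        if self._in_name_link and text:
            self.variants.append(text)


def is_ascii(name):
    """Return True if the name can be encoded in ASCII."""
    try:
        name.encode("ascii")
    except UnicodeEncodeError:
        return False
    return True


def ascii_variants(related_html):
    """Return the ASCII-only variants found in related_html, without duplicates."""
    parser = VariantParser()
    parser.feed(related_html)
    kept = []
    for variant in parser.variants:
        if is_ascii(variant) and variant not in kept:
            kept.append(variant)
    return kept
```

The ASCII test is a crude stand-in for "English": it drops names with accented characters, but it keeps any non-English name that happens to be written in plain ASCII.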
It takes a while (roughly an hour) to grab the data.
Write CSV output
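A minimal sketch of the output step, assuming the crawl results are held in a dictionary that maps each name to its list of variants. The row layout (name first, then its variants) and the function name `write_variants_csv` are assumptions for illustration.

```python
import csv


def write_variants_csv(path, variants_by_name):
    """Write one row per name: the name first, followed by its variants."""
    # newline="" lets the csv module control line endings itself.
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        for name, variants in sorted(variants_by_name.items()):
            writer.writerow([name, *variants])
```

For example, `write_variants_csv("names.csv", {"John": ["Jon", "Johnny"]})` would produce a file with the single row `John,Jon,Johnny`. Rows have different lengths in this layout, so a reader should treat the first column as the key and the rest as the variants.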
Next, I’ll do some work on efficiently accessing the data set.