A Simple Web Crawler

Recently I have been working on a project that, among other things, needs data on English name variations. Asking the internet is always the first choice. Thanks to Mike Campbell, I found the website www.behindthename.com/ with a huge data set of human names (covering far more languages than just English).

So I decided to write a web crawler to build a local copy of the data set.

Packages

Just some standard packages for reading web pages and writing CSV files. (Note that the code in this post is Python 2; urllib2 does not exist in Python 3.)

from bs4 import BeautifulSoup
import urllib2, csv
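If you are on Python 3, urllib2 was split into urllib.request and urllib.error, so a rough equivalent of the imports above would be:

```python
# Python 3 equivalents of the Python 2 imports used in this post.
# urllib2.urlopen lives in urllib.request under Python 3;
# csv is unchanged, and so is `from bs4 import BeautifulSoup`.
from urllib.request import urlopen
import csv

print(callable(urlopen))   # → True
print(hasattr(csv, 'writer'))  # → True
```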

Find out what to grab

First I browsed the website carefully.

There’s a full list of names on https://www.behindthename.com/names. For each name, all its variants can be found at www.behindthename.com/name/XXX/related.

My plan is to get the URL for each name, then go to its related page, parse its variants, and save them in a list.

Get a list of URLs

names = []
for k in range(1, 69):
    url = 'https://www.behindthename.com/names/' + str(k)
    a, b = [], []
    content = urllib2.urlopen(url).read()
    soup = BeautifulSoup(content, 'lxml')
    # Names on the page alternate between two row styles, b0 and b1
    for i in soup.findAll("div", {"class": "browsename b0"}):
        a += [i.find('a')['href'].split('/')[-1]]
    for i in soup.findAll("div", {"class": "browsename b1"}):
        b += [i.find('a')['href'].split('/')[-1]]
    # Interleave the two lists to restore the on-page order
    name = [''] * (len(a) + len(b))
    for i in range(len(name)):
        name[i] = a[i/2] if i % 2 == 0 else b[(i-1)/2]
    names += name
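The last few lines above interleave the b0 and b1 lists to restore the on-page order. In Python 3 (where the integer divisions would need to become `i//2` and `(i-1)//2`), the same round-robin merge can be written with `itertools.zip_longest`; `a` and `b` below are just made-up sample lists:

```python
from itertools import zip_longest

def interleave(a, b):
    """Merge two lists round-robin: a[0], b[0], a[1], b[1], ...
    Matches the loop above when len(a) == len(b) or len(a) == len(b) + 1."""
    return [x for pair in zip_longest(a, b) for x in pair if x is not None]

# Sample data standing in for the b0/b1 name columns
a = ['aaren', 'abbey', 'abbi']
b = ['aaron', 'abbie']
print(interleave(a, b))  # → ['aaren', 'aaron', 'abbey', 'abbie', 'abbi']
```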

Get a list of name variants

Since I only care about English names, I simply ignore names that cannot be encoded in ASCII.

d = {}
for n in names:
    url = 'https://www.behindthename.com/name/' + n + '/related'
    content = urllib2.urlopen(url).read()
    soup = BeautifulSoup(content, 'lxml')
    rows = soup.findAll('a', {'class': 'ngl'})
    d[n] = []
    for i in rows:
        try:
            # Raises UnicodeEncodeError if the name is not pure ASCII
            d[n] += [i.string.encode('ascii')]
        except UnicodeEncodeError:
            pass
    # Remove duplicates and sort
    d[n] = sorted(set(d[n]))
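The encode-inside-try trick above is a Python 2 idiom for testing whether a string is pure ASCII. Under Python 3, where all strings are Unicode, the same filter looks like this (the sample variants below are made up):

```python
def is_ascii(s):
    """Return True if s contains only ASCII characters."""
    try:
        s.encode('ascii')
        return True
    except UnicodeEncodeError:
        return False

# Made-up sample variants; the accented ones get filtered out
variants = ['Aaron', 'Ahron', 'Ar\xf3n', '\xc1ron']
kept = sorted(set(v for v in variants if is_ascii(v)))
print(kept)  # → ['Aaron', 'Ahron']
```

On Python 3.7+ the helper can be replaced by the built-in `str.isascii()`.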

It takes a while (an hour, maybe) to grab the data.

Write CSV output

with open('lookup.csv', 'wb') as f:
    w = csv.writer(f)
    for i in d.keys():
        w.writerow([i] + d[i])
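The 'wb' mode is another Python 2 detail: Python 2's csv module wants a binary file, while Python 3 wants a text file opened with newline=''. A Python 3 sketch of the same output step, with a tiny made-up dict standing in for d:

```python
import csv

# Tiny stand-in for the real variants dict d
d = {'aaron': ['Aaron', 'Ahron'], 'abbey': ['Abbey', 'Abbie']}

# Python 3: open in text mode with newline='' instead of 'wb',
# so the csv module controls line endings itself
with open('lookup.csv', 'w', newline='') as f:
    w = csv.writer(f)
    for name in sorted(d):
        w.writerow([name] + d[name])

# Read it back to check
with open('lookup.csv', newline='') as f:
    rows = list(csv.reader(f))
print(rows)  # → [['aaron', 'Aaron', 'Ahron'], ['abbey', 'Abbey', 'Abbie']]
```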

Next, I’ll do some work on efficiently accessing the data set.