A Simple Web Crawler

Recently I have been working on a project that, among other things, needs data on English name variations. Asking the internet is always the first choice. Thanks to Mike Campbell, I found the website www.behindthename.com/ with a huge data set of human names (covering far more languages than just English).

So I decided to write a web crawler to build a local copy of the data set.

Packages

Just some standard packages for reading web pages and writing CSV files. (Note that the code in this post is Python 2; urllib2 does not exist in Python 3.)

from bs4 import BeautifulSoup
import urllib2, csv
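If you are on Python 3, urllib2 was split into urllib.request and urllib.error, so a rough equivalent of the imports above would be:

```python
# Python 3 equivalents of the Python 2 imports used in this post.
# urllib2.urlopen lives in urllib.request under Python 3;
# csv is unchanged, and so is `from bs4 import BeautifulSoup`.
from urllib.request import urlopen
import csv

print(callable(urlopen))   # → True
print(hasattr(csv, 'writer'))  # → True
```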

Find out what to grab

First I browsed the website carefully.

There’s a full list of names on https://www.behindthename.com/names. For each name, all its variants can be found at www.behindthename.com/name/XXX/related.

My plan is to get the URL for each name, then go to its related page, parse its variants, and save them in a list.

Get a list of URLs

names = []
for k in range(1, 69):
    url = 'https://www.behindthename.com/names/' + str(k)
    a, b = [], []
    content = urllib2.urlopen(url).read()
    soup = BeautifulSoup(content, 'lxml')
    # Names on the page alternate between two row styles, b0 and b1
    for i in soup.findAll("div", {"class": "browsename b0"}):
        a += [i.find('a')['href'].split('/')[-1]]
    for i in soup.findAll("div", {"class": "browsename b1"}):
        b += [i.find('a')['href'].split('/')[-1]]
    # Interleave the two lists to restore the on-page order
    name = [''] * (len(a) + len(b))
    for i in range(len(name)):
        name[i] = a[i/2] if i % 2 == 0 else b[(i-1)/2]
    names += name
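The last few lines above interleave the b0 and b1 lists to restore the on-page order. In Python 3 (where the integer divisions would need to become `i//2` and `(i-1)//2`), the same round-robin merge can be written with `itertools.zip_longest`; `a` and `b` below are just made-up sample lists:

```python
from itertools import zip_longest

def interleave(a, b):
    """Merge two lists round-robin: a[0], b[0], a[1], b[1], ...
    Matches the loop above when len(a) == len(b) or len(a) == len(b) + 1."""
    return [x for pair in zip_longest(a, b) for x in pair if x is not None]

# Sample data standing in for the b0/b1 name columns
a = ['aaren', 'abbey', 'abbi']
b = ['aaron', 'abbie']
print(interleave(a, b))  # → ['aaren', 'aaron', 'abbey', 'abbie', 'abbi']
```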

Get a list of name variants

Since I only care about English names, I simply ignore names that cannot be encoded in ASCII.

d = {}
for n in names:
    url = 'https://www.behindthename.com/name/' + n + '/related'
    content = urllib2.urlopen(url).read()
    soup = BeautifulSoup(content, 'lxml')
    rows = soup.findAll('a', {'class': 'ngl'})
    d[n] = []
    for i in rows:
        try:
            # Raises UnicodeEncodeError if the name is not pure ASCII
            d[n] += [i.string.encode('ascii')]
        except UnicodeEncodeError:
            pass
    # Remove duplicates and sort
    d[n] = sorted(set(d[n]))
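The encode-inside-try trick above is a Python 2 idiom for testing whether a string is pure ASCII. Under Python 3, where all strings are Unicode, the same filter looks like this (the sample variants below are made up):

```python
def is_ascii(s):
    """Return True if s contains only ASCII characters."""
    try:
        s.encode('ascii')
        return True
    except UnicodeEncodeError:
        return False

# Made-up sample variants; the accented ones get filtered out
variants = ['Aaron', 'Ahron', 'Ar\xf3n', '\xc1ron']
kept = sorted(set(v for v in variants if is_ascii(v)))
print(kept)  # → ['Aaron', 'Ahron']
```

On Python 3.7+ the helper can be replaced by the built-in `str.isascii()`.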

It takes a while (an hour, maybe) to grab the data.

Write CSV output

with open('lookup.csv', 'wb') as f:
    w = csv.writer(f)
    for i in d.keys():
        w.writerow([i] + d[i])
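The 'wb' mode is another Python 2 detail: Python 2's csv module wants a binary file, while Python 3 wants a text file opened with newline=''. A Python 3 sketch of the same output step, with a tiny made-up dict standing in for d:

```python
import csv

# Tiny stand-in for the real variants dict d
d = {'aaron': ['Aaron', 'Ahron'], 'abbey': ['Abbey', 'Abbie']}

# Python 3: open in text mode with newline='' instead of 'wb',
# so the csv module controls line endings itself
with open('lookup.csv', 'w', newline='') as f:
    w = csv.writer(f)
    for name in sorted(d):
        w.writerow([name] + d[name])

# Read it back to check
with open('lookup.csv', newline='') as f:
    rows = list(csv.reader(f))
print(rows)  # → [['aaron', 'Aaron', 'Ahron'], ['abbey', 'Abbey', 'Abbie']]
```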

Next, I’ll do some work on efficiently accessing the data set.