A Simple Web Crawler 2

OK, so I successfully downloaded the data from behindthenames.com. I had to resume the web crawler a couple of times after the web host shut it down, because from the website's point of view the crawler behaves much like a denial-of-service attack.

There are several tricks to avoid being shut down, such as using an IP proxy or sending requests with a header that makes them look like they come from a web browser:

headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}

However, since there are only a few megabytes of data to grab, I didn't add these tricks to my code.
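Had the crawl been larger, the browser-like header could be attached to every request, with a pause between requests so the host is not flooded. The following is only a minimal sketch using Python 3's standard `urllib.request`; `build_request`, `fetch_all`, the `delay_seconds` value, and the `opener` parameter are hypothetical names introduced for illustration, not part of my crawler.

```python
import time
import urllib.request

# The same browser-like header shown above.
HEADERS = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) '
                         'AppleWebKit/537.36 (KHTML, like Gecko) '
                         'Chrome/39.0.2171.95 Safari/537.36'}

def build_request(url):
    """Wrap a URL in a Request that carries the browser-like header."""
    return urllib.request.Request(url, headers=HEADERS)

def fetch_all(urls, delay_seconds=1.0, opener=urllib.request.urlopen):
    """Fetch each URL in turn, sleeping between requests (assumed delay)."""
    pages = {}
    for url in urls:
        with opener(build_request(url)) as response:
            pages[url] = response.read()
        time.sleep(delay_seconds)  # throttle so the host is not hammered
    return pages
```

The `opener` parameter defaults to `urllib.request.urlopen`; passing a different callable makes the loop easy to try out without touching the network.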

In the downloaded data, each URL maps to a name, and each name maps to its list of name variants. Is there a better way to organize the data? The URLs do not contain any information useful to my project, so I just removed them. Another issue I have to consider is that a URL may not point to an English name. A reverse mapping can help get rid of non-English names.

The process is:

  1. Replace each URL with a counter, and store under that counter the list holding the name and its variants.

  2. Reverse mapping: for each name in the lists, collect the URL counters of the lists it appears in.

  3. For each name, union the lists stored at step 1 for every counter collected at step 2.

Then, for each name, there’s a set of its English variants.
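To make the three steps concrete, here is a small sketch on made-up data. The function name `build_variants` and the toy rows are assumptions for illustration only; the real input comes from the crawled CSV files.

```python
def build_variants(rows):
    """rows maps a URL counter to a list: a name followed by its variants."""
    # Step 2: reverse mapping, name -> set of URL counters whose list contains it.
    counters_by_name = {}
    for counter, names in rows.items():
        for name in names:
            counters_by_name.setdefault(name, set()).add(counter)
    # Step 3: union the lists behind those counters, then drop the name itself.
    variants = {}
    for name, counters in counters_by_name.items():
        merged = set()
        for counter in counters:
            merged |= set(rows[counter])
        merged.discard(name)
        variants[name] = merged
    return variants

# Step 1: made-up rows keyed by URL counter.
toy_rows = {
    0: ["Esther", "Ester", "Hester"],
    1: ["Ester", "Eszter"],
    2: ["Job", "Iob"],
}
toy_variants = build_variants(toy_rows)
print(sorted(toy_variants["Ester"]))  # ['Esther', 'Eszter', 'Hester']
```

Because "Ester" appears under counters 0 and 1, its variant set merges both lists, which is the point of the reverse mapping.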

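import csv

variants = {}
# Input files are lookup/1.csv through lookup/68.csv.
for file_no in range(1, 69):
    # d maps a row counter (standing in for the URL) to a name and its variants.
    d = {}
    with open('lookup/' + str(file_no) + '.csv', 'r') as f:
        reader = csv.reader(f)
        for k, row in enumerate(reader):
            d[k] = row[1:]  # skip the first column (presumably the URL)
    # Reverse mapping: each name -> set of row counters it appears in.
    names = {}
    for k in d:
        for name in d[k]:
            names.setdefault(name, set()).add(k)
    print(len(names))
    # Union every row a name appears in, then drop the name itself.
    for n in names:
        variants[n] = set()
        for k in names[n]:
            variants[n] |= set(d[k])
        variants[n].remove(n)
# Python 3: open in text mode with newline='' (Python 2 used 'wb').
with open('lookup.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    for v in sorted(variants):
        writer.writerow([v.upper()] + sorted(variants[v]))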

Sample output:

'AZIZ,Aziz
'AZRI'EL,Azrael,Azriel
'EDNAH
'EFRAYIM,Efraim,Ephraim,Evron,Jevrem
'EL'AZAR,Elazar,Eleazar,Lazar,Lazare,Lazaros,Lazarus,Lazzaro
'ELI'EZER,Eliezer
'ELIFALET,'Elifelet,Eliphalet,Eliphelet
'ELIFELET,'Elifalet,Eliphalet,Eliphelet
'ESAW,Esau
'ESTER,'Ashtoret,Ashtoreth,Astaroth,Astarte,Esfir,Essi,Essie,Esta,Estee,Ester,Estera,Esteri,Esther,Esthiru,Eszter,Eszti,Hester,Hettie,Ishtar
'EZRA',Esdras,Ezra,Ezras
'IRA',Ira
'ISMAT
'ITTAY,Itai,Ithai,Ittai
'IYYOV,Ayyub,Iob,Iyov,Job,Joby
'IZEVEL
'OFRAH,Ofra,Ophrah
'ORPAH,Oprah,Orpah,Orpha
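Reading the file back is straightforward, since each row is an upper-cased name followed by its variants. This is a rough sketch rather than code from the project; `load_lookup` is a hypothetical helper, and the rows below are copied from the sample output and kept in memory instead of on disk.

```python
import csv
import io

def load_lookup(fileobj):
    """Read rows of the form NAME,variant,variant,... into a dict."""
    lookup = {}
    for row in csv.reader(fileobj):
        if row:  # skip blank lines
            lookup[row[0]] = row[1:]
    return lookup

# Three rows copied from the sample output, held in memory instead of a file.
sample = io.StringIO("'AZIZ,Aziz\n'EDNAH\n'ORPAH,Oprah,Orpah,Orpha\n")
lookup = load_lookup(sample)
print(lookup["'ORPAH"])  # ['Oprah', 'Orpah', 'Orpha']
```

For the real file, the same function could be given `open('lookup.csv', newline='')` in place of the in-memory sample.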

There are about 15k names, but the output CSV file takes up only 5.7MB, much smaller than I had expected.