
How to make a web crawler! (In Python)

by hg0428

Intro

A web crawler is a program that scans websites to find links to more websites, then scans those to find even more; given enough time, it will have visited almost every site it can reach. Search engines use web crawlers to find new websites to index.
Today we will be making a web crawler in Python.

Setup

To start, we need to make a file called sites, where we will store every website we have searched. Add the first websites that come to mind to the file, separated by commas (make sure the file ends in a comma). Those will be the first websites to be searched.
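For example, your sites file might look like this (these starting URLs are just placeholders, use whichever sites you like):

https://example.com/,https://www.python.org/,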
We will be using urllib and re to search the sites, so go ahead and import those as follows:

import urllib.request as urllib
import re

Then we need to get the sites out of the sites file and store them in a variable. I used the code below:

knownsites = open("sites").read().split(",")[:-1]

It opens the file, reads it, splits the contents at each comma, drops the last element (which is empty, because the file ends in a comma), and stores the rest in knownsites.
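To see why we drop the last element, here is a quick sketch of what split produces on a sample string (the URLs are made up):

print("https://a.com/,https://b.com/,".split(","))
# ['https://a.com/', 'https://b.com/', '']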

Searching a site

To search a site we will need to define a function that I will call findsites.
The function needs to fetch the site with urllib.urlopen(), search for links using re.findall(), and return them.
But how will we find the sites in all that HTML? What will we search for?
In HTML, links usually appear in this format: href="https://sitename.com/", so we can search for href=\"(.*?)\".
Then we need to return whatever that finds.
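As a quick sanity check, here is what that pattern pulls out of a one-line HTML snippet (the snippet is made up for the example):

print(re.findall(r'href=\"(.*?)\"', '<a href="https://example.com/">Example</a>'))
# ['https://example.com/']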
I used this code for that function:

def findsites(site, find=r'href=\"(.*?)\"'):
    try:
        # Fetch the page and turn the raw bytes into a string we can search.
        d = str(urllib.urlopen(site).read())
    except Exception:
        # If the site cannot be reached, just return no links.
        return []
    # Pull out everything that appears between href="..." quotes.
    f = re.findall(find, d)
    return f
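You can try the function on its own before wiring it into the crawler; for example (the URL is just a placeholder):

print(findsites("https://example.com/"))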

Crawling

This is the easiest part: all we have to do is loop through knownsites, crawl each site, and add whatever we find for crawling next time.
I will define a function called crawl for this.
It will loop through knownsites, find the new sites, remove the old one from knownsites (so that we do not crawl it again every time), append the new sites, and add them to the sites file.
I used the following code:

def crawl():
    # Loop over a copy, because we modify knownsites inside the loop.
    for page in list(knownsites):
        sites = findsites(page)
        knownsites.remove(page)
        for site in sites:
            knownsites.append(site)
            # Append a trailing comma so the file keeps its comma-separated format.
            open("sites", "a").write(site + ",")

Doing it forever (Optional)

If you want your crawler to keep crawling, you can add:

while True:
    crawl()
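If you do leave it running forever, it is polite not to hit sites back-to-back; here is a minimal sketch that pauses between passes (the 60-second delay is an arbitrary choice):

import time

while True:
    crawl()
    time.sleep(60)  # pause between passes so we do not hammer servers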

I hope this tutorial helped you learn and make better programs!
