How to make a web crawler! (In Python)
A web crawler is a program that looks through websites to find links to more websites, then looks through those to find still more; eventually almost every website will have been visited. Search engines use web crawlers to discover new websites to index.
Today we will be making a web crawler in Python.
To start we need to make a file called sites, where we will store all the websites we have searched. Add the first websites that come to your mind to the file, separated by commas, and make sure the list ends in a comma (for example: https://example.com/,https://example.org/,). Those will be the first websites to be searched.
We will be using urllib and re to search the sites; go ahead and import them as follows:
import urllib.request as urllib
import re
Then we need to get the sites out of the sites file and store them in a variable; I used the code below. It opens the file, reads it, splits it at the commas, keeps everything before the last element (because the last one is empty), and stores the result in knownsites.
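That description can be condensed to a couple of lines. A minimal sketch, assuming the file is named sites and sits in the working directory (the sample URLs here are made up, and the file is created first just so the snippet runs on its own):

```python
# Create a sample "sites" file for demonstration (note the trailing comma).
with open("sites", "w") as f:
    f.write("https://example.com/,https://example.org/,")

# Read it back, split at the commas, and drop the empty final element.
with open("sites") as f:
    knownsites = f.read().split(",")[:-1]

print(knownsites)  # ['https://example.com/', 'https://example.org/']
```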
To search a site we will need to define a function, which I will call findsites. The function needs to fetch the site with urllib.urlopen(), find the links with re.findall(), and return them.
But how will we find the sites in that long wall of HTML? What will we search for? In HTML, links are usually in this format: href="https://sitename.com/", so we can search for href="(.*?)" and return whatever matches.
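To see what that pattern pulls out, here is re.findall run against a small hand-written snippet of HTML (the URLs are made up for the example):

```python
import re

html = '<a href="https://example.com/">one</a> and <a href="https://example.org/page">two</a>'

# findall with one capture group returns a list of the captured strings,
# i.e. just the URLs, not the surrounding href="..." text.
links = re.findall(r'href="(.*?)"', html)
print(links)  # ['https://example.com/', 'https://example.org/page']
```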
I used this code for that function:
def findsites(site, find=r'href="(.*?)"'):
    try:
        # Fetch the page and decode the bytes to text.
        d = urllib.urlopen(site).read().decode("utf-8", errors="ignore")
    except Exception:
        return []  # unreachable or unreadable pages yield no links
    return re.findall(find, d)
This is the easiest part: all we have to do is go through knownsites, crawl each site, and save the new sites for crawling next time.
I will define the function crawl for this. It will loop through knownsites, find the new sites on each page, remove the old page from knownsites (so that it does not get crawled again every time), append the new sites to knownsites, and add them to the sites file.
I used the following code:
def crawl():
    for page in list(knownsites):  # iterate over a copy, since we mutate the list below
        sites = findsites(page)
        knownsites.remove(page)
        for site in sites:
            knownsites.append(site)
            # Keep the file comma-separated, matching its original format.
            with open("sites", "a") as f:
                f.write(site + ",")
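To see that bookkeeping in isolation, here is the same loop exercised with a stubbed findsites, so no network access is needed (the link graph and site names below are entirely made up):

```python
# Hypothetical link graph standing in for real pages.
fake_web = {
    "https://a.example/": ["https://b.example/", "https://c.example/"],
}

knownsites = ["https://a.example/"]

def findsites(site):
    # Stub: look links up in the dictionary instead of fetching the page.
    return fake_web.get(site, [])

def crawl():
    for page in list(knownsites):  # copy, since we mutate the list below
        sites = findsites(page)
        knownsites.remove(page)
        for site in sites:
            knownsites.append(site)

crawl()
print(knownsites)  # ['https://b.example/', 'https://c.example/']
```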
If you want your crawler to keep crawling, you can add:
while True:
    crawl()