Web Scraper
cuber1515

Scrape the Web (ids, classes, and links)

I made this for @XCode101's Weekly Challenge.

I followed @GarethDwyer1's tutorial, Beginner web scraping with Python and Repl.it. If you want to see it, here is the post.

What it is:

This is a simple web scraper: the user inputs the URL of a website and then chooses which element to pull from the page. The elements you can search for are:

  • a
    • Searches for all of the links on that page.
  • p
    • Searches for all of the <p> elements on the page.
    • Searches by id or class.
  • h1
    • Searches for all of the <h1> elements on the page.
    • Searches by id or class.
  • h2
    • Searches for all of the <h2> elements on the page.
    • Searches by id or class.

NOTE: if you input p/h1/h2, the id/class search looks through the whole page, so it can return a match even if that id or class isn't on the element you chose. This is something I realized when I was done, and I don't know how to fix it yet.
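One possible way around the limitation in the note, offered as a sketch and not as a change to the original code: Beautiful Soup's find() accepts a tag name as its first argument, so passing the chosen element along with the id or class limits the match to that kind of tag. The sample HTML and the variable names anywhere, only_p, and intro are made up for illustration.

```python
from bs4 import BeautifulSoup

# Stand-in HTML (hypothetical) so the example runs without a network request.
html = """
<div id="footer">Page footer</div>
<p id="intro">Hello there</p>
"""

soup = BeautifulSoup(html, "html.parser")
element = "p"  # what the user chose at the prompt

# Page-wide search: matches the <div>, even though the user chose p.
anywhere = soup.find(id="footer")

# Restricted search: only <p> tags are considered, so this finds nothing.
only_p = soup.find(element, id="footer")

# A <p> that really has the id is still found.
intro = soup.find(element, id="intro")

print(anywhere.name)
print(only_p)
print(intro.get_text())
```

The same idea should carry over to the class option by passing class_= in place of id=.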

How it works:

First the imports

  • import requests
  • from bs4 import BeautifulSoup
  • And of course I imported Fore, Back, Style from colorama for the coloring.

When you input the URL, it's saved to a variable named url. Then I made another variable named response that equals requests.get(url). That call sends an HTTP GET request to the URL and returns the server's response; without it there is no page to work with.

Then I turned the response to text and saved it to the variable html.
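The fetch step might look roughly like this. It's a minimal sketch, not the original code: the helper name fetch_html, the timeout value, and the raise_for_status() check are my assumptions. The variable names url, response, and html come from the description above.

```python
import requests


def fetch_html(url, timeout=10):
    """Download a page and return its HTML as a string (hypothetical helper)."""
    # requests.get sends an HTTP GET request and returns a Response object.
    response = requests.get(url, timeout=timeout)
    # Assumption: raise an error on a 4xx/5xx status instead of parsing an error page.
    response.raise_for_status()
    # .text is the response body decoded to a string.
    html = response.text
    return html
```

In the scraper, the url would come from input(), for example html = fetch_html(input("URL: ")).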

Next it asks the user which element they want to search for. The answer is saved to the variable element for later use.

After this I set up the HTML parser and saved it to the variable soup.

Next I set it to find all of the elements and saved it to the variable ELEMENTS.
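Here is a small sketch of the parsing and find-all steps. The sample HTML is invented so the example runs offline, and "html.parser" (the parser that ships with Python) is an assumption, since the post doesn't say which parser was used.

```python
from bs4 import BeautifulSoup

# Stand-in HTML (hypothetical); in the scraper this comes from the response text.
html = """
<html><body>
  <h1 id="title">Hello</h1>
  <p class="intro">First paragraph</p>
  <p>Second paragraph</p>
</body></html>
"""

element = "p"  # what the user would type at the prompt

# Assumption: use Python's built-in HTML parser.
soup = BeautifulSoup(html, "html.parser")

# find_all returns a list-like collection of every matching tag.
ELEMENTS = soup.find_all(element)

for tag in ELEMENTS:
    print(tag.get_text())
```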

Then I set up an if/else statement; it's pretty simple, and I use the element variable again as one side of the comparison.

The rest (h1/h2) are the same as p.

For links (element == "a") it's much better to just see the URLs, so the whole for link in ELEMENTS...link.get("href")) part pulls each link's address out of its href attribute (you can probably understand that entire section with this knowledge).
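The link branch could be sketched like this. The sample HTML, the links list, and the check for a missing href are my additions for illustration; the loop over ELEMENTS and link.get("href") follow the description above.

```python
from bs4 import BeautifulSoup

# Stand-in HTML (hypothetical) with one <a> tag that has no href.
html = """
<a href="https://example.com/one">One</a>
<a href="/two">Two</a>
<a name="top">No link here</a>
"""

soup = BeautifulSoup(html, "html.parser")
ELEMENTS = soup.find_all("a")

links = []
for link in ELEMENTS:
    # .get returns None when the attribute is missing, instead of raising.
    href = link.get("href")
    if href is not None:
        links.append(href)
        print(href)
```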

For p/h1/h2 it asks for class or id; the user's input is saved to the variable idClass. Then there's another if/else statement.

  • If the user says id, it will ask the user for the name of the id, which is saved to ID. Next it will search for the element with that id, which is saved to searchIdP. Afterwards it prints the element to the console. You may have noticed that I add .prettify() to the end of searchIdP in the print(); this makes the output easier to read by spacing it out nicely.
  • This is basically repeated for the class option, except all of the ids are changed to classes.
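The id/class branch might look something like the sketch below. The helper name search_by, the sample HTML, and the "Nothing found" fallback are assumptions for illustration; idClass and searchIdP are the variable names from the post. Note that find() returns None when nothing matches, so calling .prettify() without a check would raise an error.

```python
from bs4 import BeautifulSoup

# Stand-in HTML (hypothetical) so the example runs offline.
html = """
<h1 id="title" class="big">Web Scraper</h1>
<p id="intro">Hello there</p>
<p class="note">Remember this</p>
"""

soup = BeautifulSoup(html, "html.parser")


def search_by(soup, idClass, name):
    """Return the first tag whose id or class matches name, else None."""
    if idClass == "id":
        return soup.find(id=name)
    elif idClass == "class":
        # class is a reserved word in Python, so bs4 uses class_ instead.
        return soup.find(class_=name)
    return None


searchIdP = search_by(soup, "id", "intro")
if searchIdP is not None:
    # prettify() returns the tag as an indented, easier-to-read string.
    print(searchIdP.prettify())
else:
    print("Nothing found")
```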

And that's all there is to tell you about how it works. If you have any questions, leave a comment below and I will answer as best I can. Just note that I'm a noob to Python, so don't expect a great answer to a complex question.

I hope you liked this and have a great day!!!

cuber1515

@ruiwenge2 yeah @XCode101 is right, I this generator. Also thanks!