What it is:
This is a simple web scraper: the user enters the URL of a website and then chooses a certain element to pull from the page. The elements you can search for are:
a
Searches for all of the links on that page.
p
Searches for all of the p elements on the page. Searches by id or class.
h1
Searches for all of the h1 elements on the page. Searches by id or class.
h2
Searches for all of the h2 elements on the page. Searches by id or class.
NOTE: if you input p/h1/h2, the id/class search runs over the whole page, so it can return an element of a different type than the one you chose. I only realized this after I was done and don't know how to fix it yet.
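One possible way around the limitation in the note, as a rough sketch rather than a tested fix for this project: BeautifulSoup's find() accepts a tag name as its first argument, which limits the id/class match to that tag. The sample HTML and the helper name find_in_tag are made up for illustration.

```python
from bs4 import BeautifulSoup

# Made-up sample page: the id "intro" sits on a <div>, not on a <p>.
SAMPLE_HTML = """
<div id="intro" class="box">A div</div>
<p id="first" class="box">A paragraph</p>
"""


def find_in_tag(soup, tag, id_name=None, class_name=None):
    """Hypothetical helper: look up an id or class, but only on `tag` elements."""
    if id_name is not None:
        return soup.find(tag, id=id_name)
    return soup.find(tag, class_=class_name)


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Without a tag name, the search covers every element on the page.
unrestricted = soup.find(id="intro")

# With a tag name, only <p> elements are considered, so this finds nothing.
restricted = find_in_tag(soup, "p", id_name="intro")
```

Here unrestricted is the div, while restricted is None because no p element carries that id.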
How it works:
First the imports
import requests
from bs4 import BeautifulSoup
And of course I imported Fore, Back, and Style from colorama for the coloring:
from colorama import Fore, Back, Style
When you input the URL, it is saved to a variable named url. Then I made another variable named response that equals requests.get(url). This sends an HTTP GET request to that URL and returns the server's response; without it there is no page to scrape.
Then I turned the response into text and saved it to the variable html.
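The fetch step could be wrapped up roughly like this. It is only a sketch: the helper name fetch_html, the 10 second timeout, and the raise_for_status() check are additions for illustration and are not part of the original script.

```python
import requests


def fetch_html(url, timeout=10):
    """Hypothetical helper: download a page and return its HTML as a string."""
    # requests.get sends an HTTP GET request and returns a Response object.
    response = requests.get(url, timeout=timeout)
    # Raise an exception for 4xx/5xx status codes instead of parsing an error page.
    response.raise_for_status()
    # .text is the response body decoded to a string.
    return response.text
```

In the script this would be used as html = fetch_html(url).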
Next it asks the user which element to search for and saves the answer to the variable element for later use.
After this I set up the HTML parser and saved it to the variable soup.
Next I had it find all of the matching elements and saved the result to the variable ELEMENTS.
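The parsing steps above might look something like this. The post does not say which parser is used, so "html.parser" (the one built into Python) is an assumption, and the HTML string is made up to stand in for the downloaded page.

```python
from bs4 import BeautifulSoup

# Made-up HTML standing in for the downloaded page.
html = '<h1>Title</h1><p>One</p><p>Two</p><a href="/about">About</a>'
element = "p"  # in the script this comes from input()

# Assumption: the parser built into Python; the post does not name one.
soup = BeautifulSoup(html, "html.parser")

# find_all returns every tag on the page whose name matches `element`.
ELEMENTS = soup.find_all(element)

# Collect the text inside each matching tag.
texts = [tag.get_text() for tag in ELEMENTS]
```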
Then I set up an if/else statement; it's pretty simple, and I use the element variable again as one side of the comparison.
if element == "a":
    # Print only the URL of each link
    for link in ELEMENTS:
        print(Fore.WHITE + link.get("href"))
elif element == "p":
    idClass = input(Fore.BLUE + "Would you like to search by id or class?\n")
    if idClass == "id":
        ID = input("Name of id: ")
        # First element on the page with this id
        searchIdP = soup.find(id=ID)
        print(Fore.WHITE + searchIdP.prettify())
    elif idClass == "class":
        CLASS = input("Name of class: ")
        # First element on the page with this class
        searchClassPs = soup.find(class_=CLASS)
        print(Fore.WHITE + searchClassPs.prettify())
    else:
        print(Fore.RED + "Not an option")
The rest (h1/h2) are the same as p.
For the links (element == "a"), it's much more useful to see just the URLs, so the for link in ELEMENTS loop prints link.get("href"), which reads each link's href attribute (you can probably understand that entire section with this knowledge).
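One thing to watch for in the link loop: link.get("href") returns None for an a tag that has no href attribute, and adding None to Fore.WHITE raises a TypeError. A minimal sketch of a guard follows; the helper name collect_links and the sample HTML are made up for illustration.

```python
from bs4 import BeautifulSoup

# Made-up sample: the middle <a> tag has no href attribute.
html = (
    '<a href="https://example.com">Home</a>'
    '<a name="top">Anchor</a>'
    '<a href="/docs">Docs</a>'
)
soup = BeautifulSoup(html, "html.parser")


def collect_links(soup):
    """Hypothetical helper: return the href of every <a> tag that has one."""
    links = []
    for link in soup.find_all("a"):
        href = link.get("href")  # None when the attribute is missing
        if href is not None:
            links.append(href)
    return links


hrefs = collect_links(soup)
```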
For p/h1/h2 it asks for class or id; the user's input is saved to the variable idClass. Then there's another if/else statement.
If the user says id, it asks for the name of the id, which is saved to ID. Next it searches for the element with that id and saves it to searchIdP. Afterwards it prints the element to the console. You may have noticed that I add .prettify() to the end of searchIdP in the print(); this makes the output easier to read by spacing it out nicely.
This is basically repeated for the class option, except you change all of the ids to class.
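The id and class branches could also be folded into one function, sketched here under a couple of assumptions: the helper name describe_match and the "Nothing found" message are invented, and the None check is an addition, since find() returns None when nothing matches and calling .prettify() on None would raise an AttributeError.

```python
from bs4 import BeautifulSoup

# Made-up sample page for illustration.
html = '<p id="intro" class="lead">Hello</p><p class="body">World</p>'
soup = BeautifulSoup(html, "html.parser")


def describe_match(soup, id_class, name):
    """Hypothetical helper: find by id or class and return printable text."""
    if id_class == "id":
        match = soup.find(id=name)
    elif id_class == "class":
        match = soup.find(class_=name)
    else:
        return "Not an option"
    if match is None:  # find() returns None when nothing matches
        return "Nothing found"
    # prettify() returns the element as an indented, multi-line string.
    return match.prettify()
```

For example, print(describe_match(soup, "id", "intro")) prints the first paragraph, while an unknown id gives "Nothing found" instead of crashing.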
And that's all there is to tell you about how it works. If you have any questions, leave a comment below and I will answer as best I can. Just note that I'm a noob to Python, so don't expect a great answer to a complex question.
Web Scraper
Scrape the Web (ids, classes, and links)
I made this for @XCode101's Weekly Challenge.
I used @GarethDwyer1's post "Beginner web scraping with Python and Repl.it" as a guide.
I hope you liked this.
Thanks!
@VulcanWM