Get started with Web Scraping 🤖
Learn to scrape the web by creating your own quotes website 🤖 🤔
There's so much information on the web, but so few APIs 😕
Web scraping lets you get content from any webpage by extracting information with
HTML selectors. This is a super simple guide to help you scrape the web with
Node.js, in less than 20 minutes 🕒
We'll learn to use developer tools to find HTML selectors, extract the content with x-ray, and use pug to render the quotes we get!
Prerequisites? Um, there's none really.
Just load up this repl - repl.it/@jajoosam/quote-scraper-starter ✨
We'll talk about all the tools and dependencies we use as we continue!
📊 Understanding Structure
Quotesondesign.com has some nice quotes - and is an easy introduction to scraping stuff on the web. Another great part - it loads up a random quote each time!
Open the website up, right click on the quote, and then hit Inspect.
You can now see a view of the entire document 📜 Go ahead and click all the small triangles to expand this view.
You'll be able to see that the quote itself is inside a paragraph, in a div with id quote-content, while the author's name sits in an element with its own id.
⬇️ Getting the quotes
To scrape information from a web page, you generally request its HTML, and then extract the content you want with selectors, such as an element's id.
When we inspected Quotesondesign.com, we saw that the quote itself was in a <p> (paragraph element) nested inside a div with id=quote-content, while the author's name was inside an element with its own id.
Having this information makes scraping super easy 🙌 - we'll use a library called x-ray - which makes our job very straightforward! It's already installed in the repl you're using.
Try adding this code to your index.js file 👇
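Here's a minimal sketch of what that scraping code could look like, not necessarily the exact snippet from the starter repl. It assumes x-ray's usual pattern of calling the instance with a URL and a map of selectors. The '#quote-author' id and the scrapeQuote helper are placeholders made up for illustration, so swap in the id you found while inspecting the page.

```javascript
// Selectors for the quote page. '#quote-content p' follows the structure we
// inspected; '#quote-author' is a placeholder id (an assumption), so replace
// it with the id you saw in the developer tools.
const selectors = {
  quote: '#quote-content p',
  author: '#quote-author',
};

// Scrape one quote. `x` is an x-ray instance, created with require('x-ray')().
// scrapeQuote is a hypothetical helper name, not part of x-ray itself.
function scrapeQuote(x, url, done) {
  // x(url, selectors) returns a function that takes a Node-style callback
  // and calls it with (err, result).
  x(url, selectors)(done);
}

// In index.js (needs the x-ray package and network access):
// const x = require('x-ray')();
// scrapeQuote(x, 'http://quotesondesign.com', (err, quote) => {
//   if (err) return console.error(err);
//   console.log(quote);
// });
```

The x-ray instance is passed in as an argument, so you can try the helper with a fake scraper before pointing it at a real site.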
Run your code, and the console on the bottom right will look something like this 👀
x-ray gives you a nice JSON object, which you can now render!
You've now successfully scraped the web 🎉
📜 Render those quotes
If you look at the file tree in the sidebar, you'll see quotes.pug - a template which can render quotes passed to it. We're using the pug templating engine to do this - which is already initialized in the starter code.
One thing to note is that pug is whitespace sensitive: HTML tags are nested inside each other with tabs ⌨️
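To make the whitespace point concrete, here's a small sketch of a pug template written as a string in JavaScript. It is not the quotes.pug from the repl, and the variable names quote and author are assumptions. It indents with two spaces instead of tabs. Pug accepts either, as long as you don't mix them in one file.

```javascript
// A tiny pug template. Indentation is what nests the tags: `p` and `small`
// are indented under `div`, so they render inside it.
// The variable names `quote` and `author` are assumptions for this sketch.
const template = [
  'div',
  '  p= quote',
  '  small= author',
].join('\n');

// `pug` is the pug module, passed in by the caller: require('pug').
// renderQuote is a hypothetical helper name.
function renderQuote(pug, data) {
  // pug.render compiles the template string and fills in the variables.
  return pug.render(template, data);
}

// With pug installed:
// console.log(renderQuote(require('pug'), { quote: 'Less is more.', author: 'Someone' }));
```

If you remove the two spaces before p and small, they become siblings of the div instead of its children.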
All we have to do now is pass the quote we get from x-ray to pug! This is very easy to do on our express server: just change your app.get block to this 👨‍💻
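A rough sketch of such a route, under a few assumptions: the view is called quotes (matching quotes.pug), the template expects variables named quote and author, and scrape is any function that calls back with the scraped object. makeQuoteRoute is a made-up helper name, and the repl's actual app.get block may look different.

```javascript
// Build an Express-style route handler. `scrape` is any function that calls
// back with (err, quote), for example a wrapper around x-ray (hypothetical).
function makeQuoteRoute(scrape) {
  return (req, res) => {
    scrape((err, quote) => {
      // If the scrape fails, answer with an error instead of hanging.
      if (err) return res.status(500).send('Could not fetch a quote');
      // Pass the scraped fields to the quotes.pug template.
      // The variable names `quote` and `author` are assumptions.
      res.render('quotes', { quote: quote.quote, author: quote.author });
    });
  };
}

// In index.js, with an express app already set up to use pug:
// app.get('/', makeQuoteRoute(scrape));
```

Every request to / then triggers a fresh scrape, which fits a site that serves a random quote on each load.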
Run your repl, and this is what you'll see 😮
Pretty neat, huh?
We haven't written any CSS of our own; the page looks readable just because of the simple sakura.css library. Remove the line that loads it from quotes.pug and get ready for ugly 🤮
⚡ Putting your skill to use
There is a lot you can do with web scraping - and this guide has given you all the basic knowledge you need. I'm excited to see what you do with this 😄
Here are a few cool things to try 👇
- Scrape different data points - weather, latest news, bitcoin price 😛 - and make a dashboard for yourself
- Scrape IMDb to get a list of all movies currently in theatres 🎦
- Scrape Repl Talk and make an API for it 👨‍💻
Whatever you build, be sure to share it in the comments 💬
Here's what the final code looks like, feel free to refer to it 👇