
Get started with Web Scraping 🤖


Learn to scrape the web by creating your own quotes website 🤖 🤔

Demo ⏯️ Code 👨‍💻

There's so much information on the web, but so few APIs 😕

Web scraping lets you get content from any webpage by extracting information with HTML selectors. This is a super simple guide to help you scrape the web with Node.js in less than 20 minutes 🕒

We'll learn to use developer tools to see HTML selectors, extract the content in Node.js with x-ray - and use pug to render quotes we get!

🛠️ Setup

Um, there's none really.

Just load up this repl -

We'll talk about all the tools and dependencies we use as we continue!

📊 Understanding Structure

The site we're scraping has some nice quotes - and is an easy introduction to scraping stuff on the web. Another great part - it loads up a random quote each time!

Open the website up, right click on the quote, and then hit Inspect 👇

You can now see a view of the entire document 📜 Go ahead and click all the small triangles to expand this view.

You'll be able to see that the quote itself is inside a paragraph, in a div with id quote-content, while the author's name has an id of quote-title.

⬇️ Getting the quotes

To scrape information from a web page, you generally request its HTML, and then extract the content you want with selectors, like ids, classes, and HTML tags.

When we inspected, we saw that the quote itself was in a <p> (paragraph element), nested inside a div with id=quote-content - while the author's name was inside an element with id=quote-title. That's why the selectors we'll use are "#quote-content p" and "#quote-title".
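To make the mapping between markup and selectors concrete, here's a rough sketch using only plain Node.js. The sample HTML is hypothetical (it just mirrors the structure described above), and the regex-based helper innerHtmlById is a deliberately naive stand-in for a real selector engine - it is not how x-ray works internally, and it would break on nested or messy markup.

```javascript
// Hypothetical markup mirroring the structure seen in the inspector.
const sampleHtml = `
<div id="quote-content">
  <p>An example quote.</p>
</div>
<h2 id="quote-title">Example Author</h2>
`;

// Naive stand-in for a selector engine: returns the inner HTML of the
// first element with the given id. Only suitable for simple, flat markup.
function innerHtmlById(html, id) {
  const pattern = new RegExp(`<(\\w+)[^>]*\\sid="${id}"[^>]*>([\\s\\S]*?)</\\1>`);
  const match = html.match(pattern);
  return match ? match[2] : null;
}

// Drops any tags and surrounding whitespace, leaving only the text.
function textOf(fragment) {
  return fragment.replace(/<[^>]+>/g, '').trim();
}

// Roughly what "#quote-content p" and "#quote-title" select.
const quote = textOf(innerHtmlById(sampleHtml, 'quote-content'));
const author = textOf(innerHtmlById(sampleHtml, 'quote-title'));
console.log({ quote, author });
```

A real scraper should lean on a proper HTML parser, which is exactly the job x-ray takes off your hands.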

Having this information makes scraping super easy 🙌 - we'll use a library called x-ray - which makes our job very straightforward! It's already installed in the repl you're using.

Try adding this code to your index.js file 👇

x('', {
  quote: "#quote-content p",
  author: "#quote-title"
})(function (err, result) {
  console.log(result);
});

Run your code, and the console on the bottom right will look something like this 👀

x-ray gives you a nice JavaScript object with quote and author keys, which you can now render!
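Scraped text often comes with stray newlines and spaces around it. Here's a small sketch for tidying the result before rendering - the sample object and the tidy helper are assumptions for illustration, not part of x-ray or the repl.

```javascript
// Assumed shape of the object handed to the callback (sample values are made up).
const result = { quote: '\n  An example quote.  \n', author: ' Example Author ' };

// Collapse runs of whitespace and trim every string field.
function tidy(scraped) {
  const clean = {};
  for (const [key, value] of Object.entries(scraped)) {
    clean[key] = typeof value === 'string'
      ? value.replace(/\s+/g, ' ').trim()
      : value;
  }
  return clean;
}

console.log(tidy(result));
```

You could call tidy(result) inside the callback, right before logging or rendering.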

You've now successfully scraped the web 🎉

📜 Render those quotes

If you look at the file tree in the sidebar, you'll see quote.pug - a template which can render quotes passed to it. We're using the pug templating engine to do this - which we've initialized on line 6.

One thing to note is that pug is whitespace sensitive: HTML tags are nested inside each other with indentation ⌨️ You can indent with tabs or spaces, but not a mix of both.

All we have to do now is pass the quote we get from x-ray to pug! This is very easy to do on our express server - just change your app.get block to this 👨‍💻

app.get('/', (req, res) => {
  x('', {
    quote: "#quote-content p",
    author: "#quote-title"
  })(function (err, result) {
    res.render('quote', result);
    console.log(result);
  });
});
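The route above assumes the scrape always works. If the site is down or its markup changes, err will be set or result will be missing fields. Here's a hedged sketch of a handler that checks for that first - makeQuoteHandler and its scrape parameter are hypothetical names, and the 502 status is just one reasonable choice.

```javascript
// `scrape` is a hypothetical function taking a Node-style callback,
// i.e. callback(err, result) - the same shape used in the route above.
function makeQuoteHandler(scrape) {
  return function (req, res) {
    scrape(function (err, result) {
      if (err || !result || !result.quote) {
        // The request failed or the selectors matched nothing.
        res.status(502).send('Could not fetch a quote right now.');
        return;
      }
      res.render('quote', result);
    });
  };
}
```

To wire it up, you might write something like app.get('/', makeQuoteHandler(cb => x('', { quote: "#quote-content p", author: "#quote-title" })(cb))) with the same arguments as before.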

Run your repl, and this is what you'll see 😮

Pretty neat, huh?

We haven't written any CSS of our own - the page looks readable just because of the simple sakura.css library. Remove line 4 in quote.pug and get ready for ugly 🤮

⚡ Putting your skill to use

There is a lot you can do with web scraping - and this guide has given you all the basic knowledge you need. I'm excited to see what you do with this 😄

Here are a few cool things to try 👇

  • Scrape different data points - weather, latest news, bitcoin price 😛 - and make a dashboard for yourself
  • Scrape IMDb to get a list of all movies currently in theatres 🎦
  • Scrape Repl Talk and make an API for it 👨‍💻
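If you build a dashboard or an API on top of a scraper, every page view would otherwise trigger a fresh request to the site you're scraping. Here's a minimal sketch of a time-based cache you could wrap around any callback-style scrape function - cached, ttlMs, and the injectable now clock are all made-up names for illustration.

```javascript
// Wraps a Node-style scrape function so calls made within `ttlMs`
// milliseconds reuse the last successful result instead of re-scraping.
// `now` is injectable so the timing can be controlled when testing.
function cached(scrape, ttlMs, now = Date.now) {
  let value = null;
  let fetchedAt = -Infinity;

  return function (callback) {
    if (now() - fetchedAt < ttlMs) {
      callback(null, value); // still fresh: skip the network entirely
      return;
    }
    scrape(function (err, result) {
      if (err) {
        callback(err); // don't cache failures
        return;
      }
      value = result;
      fetchedAt = now();
      callback(null, result);
    });
  };
}
```

For example, cached(scrapeWeather, 60 * 1000) would hit the site at most once a minute, where scrapeWeather stands for whatever scrape function you write.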

Whatever you build, be sure to share it in the comments 💬

Here's what the final code looks like, feel free to refer to it 👇

4 years ago




Awesome tutorial

1 year ago