Earn 13,500 ($135.00)

due 2 years ago

Canceled

Retrieve URLs from CommonCrawl

vibo

Posted 2 years ago

Bounty Description

Problem

We have an analysis project coming up, and we’ll need to periodically extract URLs and their content from CommonCrawl for this.

Ideally we’d just use the index API that CommonCrawl provides. Its URL pattern matching works perfectly for us.

https://index.commoncrawl.org
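As a rough illustration of the kind of pattern query we mean, here is a minimal sketch that builds a query URL for one crawl index. The function name `build_cdx_query` and the crawl ID `CC-MAIN-2023-50` are only examples, and the `host` parameter is an assumption so the same code could point at a self-hosted server later.

```python
from urllib.parse import urlencode

# Public index server; a self-hosted server would replace this (assumption).
INDEX_HOST = "https://index.commoncrawl.org"


def build_cdx_query(crawl_id: str, url_pattern: str, host: str = INDEX_HOST) -> str:
    """Return the query URL for one crawl index and one URL pattern."""
    params = {"url": url_pattern, "output": "json"}
    return f"{host}/{crawl_id}-index?{urlencode(params)}"


# Example: every captured page under example.com in one crawl.
query = build_cdx_query("CC-MAIN-2023-50", "example.com/*")
```

The request itself is left out on purpose, since the public server is the part that keeps timing out.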

But it’s already overloaded and keeps timing out. CommonCrawl themselves suggested a fix: host the same index server ourselves.

What we need to do

Host the index server on AWS in the same region as the CommonCrawl archive to reduce latency.

We should have an HTTP API (perhaps a wrapper) that lets us query for any pattern the CDX server offers and retrieve the content of matching URLs (or a link to the content).
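One way the "link to the content" part could work is sketched below, assuming the index returns one JSON object per line with `url`, `filename`, `offset` and `length` fields. The helper names and the sample record (including its file name) are made up for illustration.

```python
import json

# Public HTTP front for the archive; inside AWS an S3 path could be used instead.
DATA_HOST = "https://data.commoncrawl.org"


def parse_cdx_lines(text: str) -> list[dict]:
    """Parse the line-delimited JSON returned with output=json."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


def content_pointer(record: dict) -> dict:
    """Turn one index record into a link plus the byte range of its WARC record."""
    start = int(record["offset"])
    end = start + int(record["length"]) - 1  # HTTP ranges are inclusive
    return {
        "url": record["url"],
        "warc_url": f"{DATA_HOST}/{record['filename']}",
        "range": f"bytes={start}-{end}",
    }


# Hypothetical index response with a single record.
sample = (
    '{"url": "https://example.com/", "filename": "crawl-data/sample.warc.gz", '
    '"offset": "100", "length": "50"}\n'
)
pointers = [content_pointer(r) for r in parse_cdx_lines(sample)]
```

A ranged GET with that `Range` header should return just the one record rather than the whole WARC file.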

The retrieved content should be cached per individual URL (perhaps in S3), so that we don’t have to unzip the WARC every single time.

(Maybe just put all the records from a WARC into the S3 bucket whenever we unzip one?)
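The caching idea could look roughly like this. It is a sketch only: a plain `dict` stands in for the S3 bucket, and `cache_key`, `get_or_fetch` and the hashed key layout are assumptions, not a required design.

```python
import hashlib
from typing import Callable


def cache_key(crawl_id: str, url: str) -> str:
    """One object per URL, namespaced by crawl so crawls never mix."""
    digest = hashlib.sha256(url.encode("utf-8")).hexdigest()
    return f"{crawl_id}/{digest}.gz"


def get_or_fetch(store: dict, crawl_id: str, url: str,
                 fetch: Callable[[], bytes]) -> bytes:
    """Return cached bytes, calling fetch() only on a cache miss."""
    key = cache_key(crawl_id, url)
    if key not in store:
        store[key] = fetch()  # e.g. read the record out of the WARC
    return store[key]
```

In a real deployment the `dict` lookups would become S3 get/put calls, with the same key per crawl and URL.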

Important points

Remember to keep all URLs, cached content, etc. separate PER crawl index.
Include the headers etc. as part of the content.
Gzip the content; there’s no reason to store it uncompressed.
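To make the last two points concrete, here is a small sketch of storing headers and body together as one gzipped blob. The function names and the blank-line separator are assumptions for illustration; keeping the original WARC record bytes as they are would be another option.

```python
import gzip


def pack_capture(http_headers: bytes, body: bytes) -> bytes:
    """Store headers and body together, gzip-compressed."""
    return gzip.compress(http_headers + b"\r\n\r\n" + body)


def unpack_capture(blob: bytes) -> tuple[bytes, bytes]:
    """Split a stored blob back into (headers, body)."""
    headers, _, body = gzip.decompress(blob).partition(b"\r\n\r\n")
    return headers, body


blob = pack_capture(b"HTTP/1.1 200 OK\r\nContent-Type: text/html", b"<html></html>")
headers, body = unpack_capture(blob)
```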

Which services to use?

Any AWS service in the same region as the archive. Which ones and how you use them doesn’t matter, as long as the result is efficient.