    Earn 13,500 ($135.00)

    Time Remaining: due 2 years ago
    Canceled

    Retrieve URLs from CommonCrawl

    vibo
    Posted 2 years ago

    Bounty Description

    Problem

    We have an analysis project coming up, and we’ll need to periodically extract URLs and their content from CommonCrawl for this.

    Ideally, we’d use the index API that CommonCrawl provides for this. Its pattern matching works perfectly for us.

    https://index.commoncrawl.org

    But it’s already overloaded and keeps timing out. CommonCrawl themselves suggested a way around this: hosting the same index server ourselves.

    What we need to do

    Host the index server on AWS in the same region as the CommonCrawl archive to reduce latency.

    We should have an HTTP API (perhaps a wrapper) that lets me query for any pattern the CDX server offers and retrieve the content of matching URLs (or link to the content).
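
    As a rough illustration of what such a wrapper might do internally, here is a minimal sketch. It assumes the self-hosted server accepts the same query parameters as the public index (`url`, `output=json`, `limit`) and returns one JSON object per line with `filename`, `offset` and `length` fields. The base URL, the crawl ID and all function names are made up for the example.

```python
import json
from urllib.parse import urlencode


def build_cdx_query(base_url, crawl_id, url_pattern, limit=None):
    """Build a query URL for one crawl index (base_url is a hypothetical self-hosted server)."""
    params = {"url": url_pattern, "output": "json"}
    if limit is not None:
        params["limit"] = str(limit)
    return f"{base_url.rstrip('/')}/{crawl_id}-index?{urlencode(params)}"


def parse_cdx_lines(text):
    """Parse a response body that holds one JSON record per line, skipping blank lines."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


def byte_range_header(record):
    """Compute the HTTP Range value covering one record inside its WARC file."""
    start = int(record["offset"])
    end = start + int(record["length"]) - 1  # Range ends are inclusive
    return f"bytes={start}-{end}"


if __name__ == "__main__":
    # Example values only; nothing here contacts a real server.
    print(build_cdx_query("https://index.example.internal", "CC-MAIN-2023-50", "example.com/*", limit=5))
    sample = '{"url": "https://example.com/", "filename": "a.warc.gz", "offset": "100", "length": "50"}\n'
    for rec in parse_cdx_lines(sample):
        print(rec["filename"], byte_range_header(rec))
```

    The wrapper would then request `filename` from the archive with that Range value, so only the matching record is transferred rather than the whole WARC file.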

    These should be cached on an individual URL level (perhaps in S3), so that we don’t have to unzip the WARC every single time.

    (Maybe just put all the records from a WARC file into S3 every time we unzip one?)
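
    The per-URL caching idea could be sketched as a simple check-then-fetch helper. This is only an assumption about how it might look: a plain dict stands in for the S3 bucket, and `get_or_fetch` and `fetch` are hypothetical names. In a real deployment the lookup and store steps would be S3 reads and writes instead.

```python
def get_or_fetch(cache, key, fetch):
    """Return the cached bytes for key, calling fetch() and storing its result only on a miss."""
    cached = cache.get(key)
    if cached is not None:
        return cached
    content = fetch()  # e.g. pull the record out of the WARC file
    cache[key] = content
    return content


if __name__ == "__main__":
    store = {}  # stand-in for the S3 bucket
    calls = []

    def fetch():
        calls.append(1)
        return b"record bytes"

    get_or_fetch(store, "some-key", fetch)
    get_or_fetch(store, "some-key", fetch)
    print(len(calls))  # the second call is served from the cache
```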

    Important points

    • Remember to keep all the URLs/caching etc. PER crawl index.
    • Include the headers etc. as part of the content.
    • gzip the content; there’s no reason to store it uncompressed.
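
    One way the three points above might fit together is sketched below. The key layout (crawl ID as a prefix, SHA-256 of the URL as the object name) and the function names are assumptions for illustration, not a required design.

```python
import gzip
import hashlib


def cache_key(crawl_id, url):
    """Build a per-crawl cache key so the same URL in two crawls never collides."""
    digest = hashlib.sha256(url.encode("utf-8")).hexdigest()
    return f"{crawl_id}/{digest}.gz"


def pack_record(headers, body):
    """Store headers and body together as one gzip-compressed blob."""
    return gzip.compress(headers + b"\r\n\r\n" + body)


def unpack_record(blob):
    """Reverse pack_record: split the decompressed blob at the first blank line."""
    raw = gzip.decompress(blob)
    headers, _, body = raw.partition(b"\r\n\r\n")
    return headers, body


if __name__ == "__main__":
    key = cache_key("CC-MAIN-2023-50", "https://example.com/")
    blob = pack_record(b"HTTP/1.1 200 OK\r\nContent-Type: text/html", b"<html></html>")
    print(key, len(blob), unpack_record(blob))
```

    Hashing the URL keeps keys at a fixed length and free of characters that are awkward in object names, at the cost of keys that are not human-readable.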

    Which services to use?

    Any AWS service in the same region. Which services you pick and how you combine them doesn’t matter, as long as the result is efficient.