
    Earn 36,000 ($360.00)

Time Remaining: due 2 years ago
    Completed

    Retrieving URLs from CommonCrawl

vibo
    Posted 2 years ago
    This Bounty has been completed!
    @vibo's review of @nirajftw
    5.0
    Average Rating
    Communication 5/5, Quality 5/5, Timeliness 5/5
Niraj was just amazing, he knows his stuff and is willing to explain it + came up with a better solution than I suggested.

    Bounty Description

    Problem

    We have an analysis project coming up, and we’ll need to periodically extract URLs and their content from CommonCrawl for this.

So? We’d use the nice API CommonCrawl provides for this, here:

    https://index.commoncrawl.org

But it’s already overloaded and keeps timing out. CommonCrawl themselves offered a suggestion for this: hosting the same index server ourselves.

    What we need to do

    Retrieve URLs from Index

Host the index server on AWS, in the same region as the CommonCrawl S3 archive. It’s open source, so you can pull the Docker image or clone the GitHub repo to use it.

    https://github.com/commoncrawl/cc-index-server
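For reference, here’s a rough sketch of what the lookup could look like once that’s running. It assumes the self-hosted server exposes the same CDX API as index.commoncrawl.org; the host name and crawl label below are placeholders.

```python
import json

import requests

# Assumptions: the self-hosted cc-index-server is reachable at INDEX_HOST and
# speaks the same CDX API as index.commoncrawl.org. Host and crawl id are placeholders.
INDEX_HOST = "http://our-index-server:8080"
CRAWL = "CC-MAIN-2024-10"

def lookup(url_pattern):
    """Return CDX records (filename, offset, length, ...) matching a URL pattern."""
    resp = requests.get(
        f"{INDEX_HOST}/{CRAWL}-index",
        params={"url": url_pattern, "output": "json"},
        timeout=60,
    )
    resp.raise_for_status()
    # The server returns one JSON object per line.
    return [json.loads(line) for line in resp.text.splitlines() if line]

for rec in lookup("example.com/*"):
    print(rec["filename"], rec["offset"], rec["length"])
```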

    Retrieve URL crawl results

This server points us to a specific file name and URL index (byte offset) on S3.

    But we also need the content, so what do we do?

Each result file is around 600-700 MB gzipped (3-4 GB un-gzipped) and is in WARC format, which means it contains a bunch of URLs and their crawl results (headers, HTML content, etc.).

    We can pick out the specific URL and its result from that file using the URL index given by… well, the index.
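A minimal sketch of that step, assuming boto3 and us-east-1 (the region of the public commoncrawl bucket); filename, offset, and length are the fields the index hands back:

```python
import gzip

import boto3

# Assumption: we run in us-east-1, the region of the public "commoncrawl" bucket.
s3 = boto3.client("s3", region_name="us-east-1")

def fetch_record(filename, offset, length):
    """Range-read one gzipped WARC record straight from S3 and return the raw record bytes."""
    start = int(offset)
    end = start + int(length) - 1
    obj = s3.get_object(
        Bucket="commoncrawl",
        Key=filename,
        Range=f"bytes={start}-{end}",
    )
    # Each record in the archive is its own gzip member, so the slice decompresses on its own.
    return gzip.decompress(obj["Body"].read())

# e.g. with a record from the index lookup:
#   rec = lookup("example.com/*")[0]
#   raw = fetch_record(rec["filename"], rec["offset"], rec["length"])
# `raw` then holds the WARC headers + HTTP headers + HTML body for that single URL.
```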

cdx-toolkit does something similar; it does two things:

    • Uses the main CDX server through HTTP to get file names + URL indexes.
• Retrieves the WARC files over HTTP, gunzips them locally and extracts the content we need.

Since cdx-toolkit uses HTTP downloads for both parts instead of S3 (which would be much faster), I would assume we can’t reuse this entirely (but maybe some of it can be reused).
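For context, this is roughly how cdx-toolkit gets used today (a sketch based on its documented API; note that everything here goes over HTTP, which is exactly the part we’d swap for S3 reads):

```python
import cdx_toolkit

# cdx-toolkit talks to the public index over HTTP, then fetches
# WARC records with ranged HTTP requests.
cdx = cdx_toolkit.CDXFetcher(source="cc")

for obj in cdx.iter("example.com/*", limit=5):
    # CDX fields (filename, offset, ...) are available on each capture object...
    print(obj["url"], obj["filename"], obj["offset"])
    # ...and .content downloads and gunzips the record over HTTP.
    html = obj.content
```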

    Caching

It's possible we'll need caching later to skip the steps of downloading WARC files, unzipping them, and extracting from them, but I don't think those steps will be very costly, since we'll be loading in the same region and gunzip is quite efficient.

    So we don't need this right now.

Some way to use it locally, easily

    We should be able to do CDX-server-like queries for URLs, and get their content back instead of just the filename+URL indexes. This can be done with a simple HTTP API, and will likely be the least technical part of this implementation.

It should be efficient enough to process a large number of URLs, so it would be okay if it hosted the content on S3 and just sent back links for each URL, or gzipped the whole result (all the URLs and their contents). Whatever works best; it just needs to be usable locally in a simple Python script.
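To make "usable locally" concrete, here's a sketch of the client side; the endpoint name, request shape, and response fields are all placeholders rather than a spec:

```python
import requests

# Hypothetical service URL and payload shape -- the implementer picks the real API.
API = "http://our-cc-service/query"

resp = requests.post(
    API,
    json={"urls": ["example.com/page-1", "example.com/page-2"]},
    timeout=300,
)
resp.raise_for_status()

for item in resp.json():
    # Option 1 from above: the service uploads content to S3 and returns a link per URL.
    print(item["url"], item.get("content_link"))
    # Option 2 would instead return the whole result gzipped in one response body.
```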

    Which AWS services to use?

    Doesn't matter, any. Just needs to work.

    Questions you need to answer if you're applying:

Is there anything in the plan you’re unsure about? If so, what are those things, and how would you clarify your uncertainty?

    Be to the point in your application and in answering the question above. Long answers without any substance will get you rejected.

    If you have any questions, message me on Discord @ zlenner.
