Earn 36,000 ($360.00)
Retrieving URLs from CommonCrawl
Bounty Description
Problem
We have an analysis project coming up, and we’ll need to periodically extract URLs and their content from CommonCrawl for this.
So we’d just use the nice API CommonCrawl provides for this here. Its pattern matching works perfectly for us.
But it’s already overloaded and keeps timing out. CommonCrawl themselves suggested a fix for this: host the same index server ourselves.
What we need to do
Retrieve URLs from Index
Host the index server on AWS, in the same region as the CommonCrawl S3 archive. It’s open source, so you can pull the Docker image or clone the GitHub repo to use it.
https://github.com/commoncrawl/cc-index-server
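To make the query side concrete, here is a minimal sketch of a lookup against the self-hosted index server, assuming it exposes the same CDX API shape as the public index.commoncrawl.org endpoint. The host, port, and collection name below are placeholders:

```python
import json
import requests

INDEX_HOST = "http://10.0.0.5:8080"   # assumed internal address of our index server
COLLECTION = "CC-MAIN-2024-10"        # example crawl; use whichever crawl the project needs

def query_index(url_pattern):
    """Return one index record (filename, offset, length, ...) per matching capture."""
    resp = requests.get(
        f"{INDEX_HOST}/{COLLECTION}-index",
        params={"url": url_pattern, "output": "json"},
        timeout=60,
    )
    resp.raise_for_status()
    # The CDX API returns one JSON object per line, not a single JSON array.
    return [json.loads(line) for line in resp.text.splitlines() if line.strip()]

for rec in query_index("example.com/*"):
    print(rec["filename"], rec["offset"], rec["length"])
```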
Retrieve URL crawl results
For each queried URL, this server points to a specific file name on S3 plus that URL's index (byte offset and length) within the file.
But we also need the content, so what do we do?
Each result file is around 600-700 MB gzipped (3-4 GB un-gzipped) and is in WARC format, which means it contains a bunch of URLs and their crawl results (headers, HTML content, etc.).
We can pick out the specific URL and its result from that file using the URL index given by… well, the index.
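As a rough sketch of that step, assuming the public "commoncrawl" bucket in us-east-1, AWS credentials already configured, and the filename/offset/length fields returned by the index query sketched above:

```python
import gzip
import boto3

s3 = boto3.client("s3", region_name="us-east-1")  # the commoncrawl bucket lives in us-east-1

def fetch_record(filename, offset, length):
    """Ranged GET of a single gzipped WARC record, decompressed in memory."""
    start = int(offset)
    end = start + int(length) - 1  # HTTP Range header is inclusive
    resp = s3.get_object(Bucket="commoncrawl", Key=filename, Range=f"bytes={start}-{end}")
    # Each record is its own gzip member, so decompressing just this slice works.
    return gzip.decompress(resp["Body"].read())

# Example usage with one record from the index query shown earlier:
# rec = query_index("example.com/*")[0]
# raw = fetch_record(rec["filename"], rec["offset"], rec["length"])
# The decompressed record is WARC headers, then HTTP headers, then the body,
# each separated by a blank line:
# warc_headers, http_headers, body = raw.split(b"\r\n\r\n", 2)
```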
cdx-toolkit does something similar, in two steps:
- Uses the main CDX server over HTTP to get file names + URL indexes.
- Retrieves the WARC files over HTTP, gunzips them locally, and extracts the content we need.
Since cdx-toolkit uses HTTP downloads for both steps instead of S3 (which would be much faster), I assume we can't reuse it entirely, though some parts may be reusable.
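For reference, the HTTP-only flow described above looks roughly like this with cdx_toolkit (API names from memory of its README; worth double-checking before relying on them):

```python
import cdx_toolkit

cdx = cdx_toolkit.CDXFetcher(source="cc")  # "cc" = the public CommonCrawl endpoints

for obj in cdx.iter("example.com/*", limit=5):
    # The index lookup goes over HTTP to the public CDX server ...
    print(obj["url"], obj["status"], obj["timestamp"])
    # ... and fetching the content pulls the WARC record over HTTP as well.
    html = obj.content
```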
Caching
We might need caching later to skip the steps of downloading WARC files, unzipping, and extracting from them, but I don't think those steps will be very costly, since we'll be loading data within the same region and gunzip is quite efficient.
So we don't need this right now.
An easy way to use it locally
We should be able to do CDX-server-like queries for URLs and get their content back instead of just the filenames + URL indexes. This can be done with a simple HTTP API, and will likely be the least technical part of this implementation.
It should be efficient enough to process a large number of URLs, so it would be fine for it to host the content on S3 and just send back a link for each URL, or to gzip the whole result (all the URLs and their contents). Whatever works best; it just needs to be usable locally from a simple Python script.
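Purely as an illustration of what "usable locally from a simple Python script" could mean, a client-side call might look like the sketch below. The endpoint name, parameters, and response shape are all hypothetical, since that API is exactly what this bounty is asking to build:

```python
import requests

API = "http://our-wrapper-service.internal:8000"  # placeholder address

resp = requests.get(
    f"{API}/content",
    params={"url": "example.com/*", "crawl": "CC-MAIN-2024-10"},
    timeout=300,
)
resp.raise_for_status()

# One option from the description above: the service stores content on S3 and
# sends back a (hypothetical) link per URL.
for item in resp.json():
    print(item["url"], "->", item["content_link"])
```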
Which AWS services to use?
Doesn't matter, any. Just needs to work.
Questions you need to answer if you're applying:
Is there anything in the plan you’re unsure about? If so, what are those things, and how would you resolve your uncertainty?
Be to the point in your application and in answering the question above. Long answers without any substance will get you rejected.
If you have any questions, message me on Discord @ zlenner.