Earn 13,500 ($135.00)
Retrieve URLs from CommonCrawl
Bounty Description
Problem
We have an analysis project coming up, and we’ll need to periodically extract URLs and their content from CommonCrawl for this.
Normally we’d use the index API that CommonCrawl provides for this. Its pattern matching works perfectly for us.
But it’s already overloaded and keeps timing out. CommonCrawl themselves suggested a fix: host the same index server ourselves.
What we need to do
Host the index server on AWS in the same region as the CommonCrawl archive to reduce latency.
We should have an HTTP API (perhaps a wrapper) that lets us query for any pattern the CDX server offers and retrieve the content of matching URLs (or a link to the content).
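
A minimal sketch of the query side of such a wrapper, in Python. It assumes the self-hosted server exposes the same `<crawl-id>-index` endpoint and the `url`, `output=json` and `limit` parameters as the public CommonCrawl index, and that it answers with one JSON object per line. The function names, the host and the crawl ID are made up for illustration.

```python
import json
from urllib.parse import urlencode


def build_cdx_query_url(server_base, crawl_id, url_pattern, limit=None):
    """Build a query URL for one crawl index on a CDX index server.

    server_base is a hypothetical address for the self-hosted server,
    e.g. "http://localhost:8080".
    """
    params = {"url": url_pattern, "output": "json"}
    if limit is not None:
        params["limit"] = str(limit)
    return f"{server_base.rstrip('/')}/{crawl_id}-index?{urlencode(params)}"


def parse_cdx_response(text):
    """Parse a newline-delimited JSON response, skipping blank lines."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]
```

Each parsed record is expected to carry the `filename`, `offset` and `length` of the matching capture, which is what the content lookup needs.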
The retrieved content should be cached per individual URL (perhaps in S3), so that we don’t have to unzip the WARC every single time.
(Maybe just put all the files from the WARC into the S3 bucket every time we unzip one?)
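
One way to avoid unzipping a whole WARC, sketched in Python: each index record gives a `filename`, `offset` and `length`, and every record in a CommonCrawl WARC file is its own gzip member, so a ranged GET for just those bytes can be decompressed on its own. The helper names are hypothetical, and the splitting assumes a well-formed `response` record.

```python
import gzip


def byte_range_header(offset, length):
    """Value for an HTTP Range header covering one record (end is inclusive)."""
    start = int(offset)
    end = start + int(length) - 1
    return f"bytes={start}-{end}"


def split_warc_record(compressed_record):
    """Decompress one gzip member and split it into its three parts.

    Returns (warc_headers, http_headers, body) as bytes.
    """
    raw = gzip.decompress(compressed_record)
    warc_headers, _, rest = raw.partition(b"\r\n\r\n")
    http_headers, _, body = rest.partition(b"\r\n\r\n")
    # A WARC record is terminated by two CRLFs; drop them from the body.
    if body.endswith(b"\r\n\r\n"):
        body = body[:-4]
    return warc_headers, http_headers, body
```

The range value can be passed as the `Range` header of an S3 or HTTP request for the WARC file named in the index record.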
Important points
- Remember to keep all the URLs/caching etc. PER crawl index.
- Include headers etc. as part of the content
- gzip the content because why keep extra white space?
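
A rough sketch of how these three points could fit together, in Python. The key layout (crawl ID as prefix, SHA-256 of the URL as object name) and the function names are assumptions, not a required design. It also assumes one cached capture per URL per crawl; if several captures of the same URL must be kept, the timestamp would need to go into the key as well.

```python
import gzip
import hashlib


def cache_key(crawl_id, url):
    """Object key that keeps every crawl index in its own prefix."""
    digest = hashlib.sha256(url.encode("utf-8")).hexdigest()
    return f"{crawl_id}/{digest}.gz"


def pack_for_cache(http_headers, body):
    """Store headers and body together as one gzip-compressed blob."""
    return gzip.compress(http_headers + b"\r\n\r\n" + body)


def unpack_from_cache(blob):
    """Reverse pack_for_cache: returns (http_headers, body)."""
    raw = gzip.decompress(blob)
    http_headers, _, body = raw.partition(b"\r\n\r\n")
    return http_headers, body
```

Hashing the URL keeps the key short and free of characters that are awkward in object names, at the cost of keys that are not human-readable.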
Which services to use?
Any AWS service in the same region. Which ones and how you use them doesn’t matter, as long as the result is efficient.