Earn 5,400 ($54.00)
Build Next.js Proof of Concept for Scraping Archive.is and Google Cache
Bounty Description
Problem Description
I currently have an application that makes requests to archive.is and jina.ai, but it's having issues. The project is simple: I need a proof-of-concept application that can extract the markdown from any article through archive.is and Google Cache (if the cache exists).
Here is some starter code:
// Maps a source name to the URL that should be fetched for the given article URL.
export function getUrlWithSource(source: string, url: string): string {
  let urlWithSource: string;
  switch (source) {
    case "direct":
      urlWithSource = url;
      break;
    case "wayback":
      urlWithSource = `https://web.archive.org/web/2/${encodeURIComponent(url)}`;
      break;
    // Google Cache is currently disabled:
    // case "google":
    //   const cleanUrl = url.replace(/^https?:\/+/, "");
    //   const finalUrl = `https://${cleanUrl}`;
    //   urlWithSource = `https://webcache.googleusercontent.com/search?q=cache:${encodeURIComponent(finalUrl)}`;
    //   break;
    case "jina.ai":
      urlWithSource = `https://r.jina.ai/${url}`;
      break;
    case "archive":
      urlWithSource = `http://archive.is/latest/${encodeURIComponent(url)}`;
      break;
    default:
      throw new Error(`Invalid source parameter: ${source}`);
  }
  return urlWithSource;
}
More code is at https://github.com/mrmps/SMRY/tree/test
Acceptance Criteria
Functionality is simple. Enter a URL in a Next 13 (or Next 15) application and get the markdown of the archive.is and Google Cache versions of that article. Content doesn't need to be formatted. You can use proxies, whatever. But the app needs to be deployed to a Vercel URL.
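To make the "HTML in, markdown out" step concrete, here is a rough sketch of a dependency-free converter. It is not taken from SMRY; the function name htmlToMarkdown is made up for illustration, and it only handles headings, links, paragraphs, and a few entities. Since the content doesn't need to be formatted, something this crude may be enough, though a dedicated conversion library would likely cope better with real news pages.

```typescript
// Crude HTML-to-Markdown conversion (illustrative sketch, hypothetical helper).
export function htmlToMarkdown(html: string): string {
  const text = html
    // Drop non-content elements entirely, including their bodies.
    .replace(/<(script|style|noscript)\b[^>]*>[\s\S]*?<\/\1>/gi, "")
    // Headings: <h1>..<h6> become "#".."######".
    .replace(
      /<h([1-6])\b[^>]*>([\s\S]*?)<\/h\1>/gi,
      (_match: string, level: string, inner: string) =>
        `\n\n${"#".repeat(Number(level))} ${inner.trim()}\n\n`
    )
    // Links with a double-quoted href: keep the text and the target.
    .replace(/<a\b[^>]*\bhref="([^"]*)"[^>]*>([\s\S]*?)<\/a>/gi, "[$2]($1)")
    // Paragraph ends and <br> become blank lines.
    .replace(/<\/p>|<br\s*\/?>/gi, "\n\n")
    // Remove every remaining tag.
    .replace(/<[^>]+>/g, "");

  // Decode a handful of common entities (&amp; last, to avoid double decoding).
  const decoded = text
    .replace(/&nbsp;/g, " ")
    .replace(/&lt;/g, "<")
    .replace(/&gt;/g, ">")
    .replace(/&quot;/g, '"')
    .replace(/&#39;/g, "'")
    .replace(/&amp;/g, "&");

  // Tidy whitespace: strip trailing spaces and collapse runs of blank lines.
  return decoded.replace(/[ \t]+\n/g, "\n").replace(/\n{3,}/g, "\n\n").trim();
}
```

For example, htmlToMarkdown('<h1>Title</h1><p>Hello</p>') gives a "# Title" heading followed by a "Hello" paragraph.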
Technical Details
Requires some knowledge of Next 13
Link to Project
You can start a new project from scratch. No need to clone SMRY; if your API works, I'll integrate it. But if you want to see the current implementation, check out https://github.com/mrmps/SMRY/blob/test/components/article-content.tsx
The most important things are
- Archive.is works (url to archive.is markdown)
- Works in prod (making it work locally is not enough; it needs to be deployed to a testable Vercel app)
- Next 13 or 15
- Google cache works (this solution is preferred, but not strictly necessary for the bounty to be paid out)
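One possible shape for the API, sketched under assumptions: a Next.js App Router route handler (the file path app/api/markdown/route.ts, the ?url= parameter, and the helper names candidateUrls and resolveSnapshot are all hypothetical) that tries archive.is first and Google Cache second, using the URL patterns from the starter code. Whether either service answers without a proxy in production is untested here, and the HTML-to-markdown conversion step is left out.

```typescript
// Hypothetical file: app/api/markdown/route.ts (Next.js App Router).
type FetchText = (url: string) => Promise<string | null>;

// Candidate snapshot URLs, tried in order. URL shapes follow the starter code.
export function candidateUrls(articleUrl: string): { source: string; url: string }[] {
  const httpsUrl = `https://${articleUrl.replace(/^https?:\/+/, "")}`;
  return [
    { source: "archive", url: `http://archive.is/latest/${encodeURIComponent(articleUrl)}` },
    {
      source: "google",
      url: `https://webcache.googleusercontent.com/search?q=cache:${encodeURIComponent(httpsUrl)}`,
    },
  ];
}

// Default fetcher: returns the body on a 2xx response, otherwise null.
// A proxy, if needed, would be wired in here.
const fetchText: FetchText = async (url) => {
  try {
    const res = await fetch(url, { redirect: "follow" });
    return res.ok ? await res.text() : null;
  } catch {
    return null;
  }
};

// Returns the first snapshot that could be fetched, or null if none could.
export async function resolveSnapshot(
  articleUrl: string,
  fetcher: FetchText = fetchText
): Promise<{ source: string; html: string } | null> {
  for (const { source, url } of candidateUrls(articleUrl)) {
    const html = await fetcher(url);
    if (html) return { source, html };
  }
  return null;
}

function json(body: unknown, status: number): Response {
  return new Response(JSON.stringify(body), {
    status,
    headers: { "content-type": "application/json" },
  });
}

// GET /api/markdown?url=<article URL>
export async function GET(request: Request): Promise<Response> {
  const articleUrl = new URL(request.url).searchParams.get("url");
  if (!articleUrl) return json({ error: "Missing ?url= parameter" }, 400);

  const snapshot = await resolveSnapshot(articleUrl);
  if (!snapshot) return json({ error: "No archived copy could be fetched" }, 502);

  // The HTML-to-markdown conversion would go here; this sketch returns raw HTML.
  return json(snapshot, 200);
}
```

Passing the fetcher as a parameter keeps the fallback order testable without network access, and gives a single place to swap in a proxied fetch later.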
No styling or anything needed; it just needs to return valid markdown. I'll be testing with news articles (NYT, The Economist, etc.).
You may need to use proxies. That is up to you. I'm really not sure if they are necessary, but feel free to choose any proxy provider if you do decide to use them. Up to $20 in proxy fees will be paid if requested and the task is completed.