Earn 5,400 ($54.00)

due 1 year ago

In Progress

Build Next.js Proof of Concept for Scraping Archive.is and Google Cache

mryaboy

Posted 1 year ago

Bounty Description

Problem Description

I currently have an application that makes requests to archive.is and jina.ai, but it's having issues. The project is simple: I need a proof of concept application that can extract the markdown from any article through archive.is and Google Cache (if the cache exists).

Here is some starter code:

export function getUrlWithSource(source: string, url: string): string {
  let urlWithSource: string;
  switch (source) {
    case "direct":
      urlWithSource = url;
      break;
    case "wayback":
      urlWithSource = `https://web.archive.org/web/2/${encodeURIComponent(url)}`;
      break;
    // case "google":
    //   const cleanUrl = url.replace(/^https?:\/+/, "");
    //   const finalUrl = `https://${cleanUrl}`;
    //   urlWithSource = `https://webcache.googleusercontent.com/search?q=cache:${encodeURIComponent(finalUrl)}`;
    //   break;
    case "jina.ai":
      urlWithSource = `https://r.jina.ai/${url}`;
      break;
    case "archive":
      urlWithSource = `http://archive.is/latest/${encodeURIComponent(url)}`;
      break;
    default:
      throw new Error(`Invalid source parameter: ${source}`);
  }
  return urlWithSource;
}

More code is at https://github.com/mrmps/SMRY/tree/test
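
Since the Google Cache copy may not exist for a given article, one way to use these URL patterns is to try sources in order and take the first one that returns content. The following is only a sketch under stated assumptions: the names Source, Fetcher, buildSourceUrl, and fetchFirstAvailable are hypothetical and not from the SMRY repository, and the fetcher is injected so the caller decides how requests are made (plain fetch, a proxy, and so on).

```typescript
// Hypothetical names for illustration; not from the SMRY repository.
type Source = "archive" | "google";

// A fetcher resolves to the page HTML, or null when the source has no copy.
type Fetcher = (requestUrl: string) => Promise<string | null>;

// Build the lookup URL for each source, mirroring the starter code's patterns.
function buildSourceUrl(source: Source, url: string): string {
  switch (source) {
    case "archive":
      return `http://archive.is/latest/${encodeURIComponent(url)}`;
    case "google": {
      // Normalize the scheme to https before building the cache query.
      const cleanUrl = url.replace(/^https?:\/+/, "");
      const finalUrl = `https://${cleanUrl}`;
      return `https://webcache.googleusercontent.com/search?q=cache:${encodeURIComponent(finalUrl)}`;
    }
  }
}

// Try each source in order and return the first one that yields content.
async function fetchFirstAvailable(
  url: string,
  sources: Source[],
  fetcher: Fetcher,
): Promise<{ source: Source; html: string } | null> {
  for (const source of sources) {
    try {
      const html = await fetcher(buildSourceUrl(source, url));
      if (html !== null && html.trim() !== "") {
        return { source, html };
      }
    } catch {
      // Treat a network error like a missing copy and move on to the next source.
    }
  }
  return null;
}
```

Calling fetchFirstAvailable(articleUrl, ["google", "archive"], myFetcher) would prefer the cache and fall back to archive.is; a null result means neither source had a usable copy.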

Acceptance Criteria

Functionality is simple. Enter a URL in a Next 13 (or Next 15) application and get back the markdown of the archive.is and Google Cache versions of that article. Content doesn't need to be formatted. You can use proxies, whatever. But the app needs to be deployed to a Vercel URL.
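
As a rough illustration of the shape such an endpoint could take, here is a minimal sketch of a Next.js App Router route handler. Everything in it is an assumption rather than the requested solution: the file path, the url query parameter, and the htmlToMarkdown helper are hypothetical, and the regex-based conversion is deliberately naive (a dedicated HTML-to-Markdown library would be more robust on real news pages). It also assumes archive.is answers a plain server-side fetch, which may not hold in production.

```typescript
// app/api/markdown/route.ts (hypothetical path in a Next.js App Router project)

// Very naive HTML-to-Markdown conversion, for illustration only.
function htmlToMarkdown(html: string): string {
  return html
    // Drop script and style blocks entirely.
    .replace(/<(script|style)[^>]*>[\s\S]*?<\/\1>/gi, "")
    // Turn <h1>..<h6> into "#".."######" headings.
    .replace(
      /<h([1-6])[^>]*>([\s\S]*?)<\/h\1>/gi,
      (_m: string, level: string, text: string) =>
        `\n\n${"#".repeat(Number(level))} ${text}\n\n`,
    )
    // Turn links with a double-quoted href into [text](href).
    .replace(/<a\s[^>]*href="([^"]*)"[^>]*>([\s\S]*?)<\/a>/gi, "[$2]($1)")
    // Paragraph ends and line breaks become blank lines.
    .replace(/<\/p>|<br\s*\/?>/gi, "\n\n")
    // Strip any remaining tags.
    .replace(/<[^>]+>/g, "")
    // Collapse runs of blank lines.
    .replace(/\n{3,}/g, "\n\n")
    .trim();
}

export async function GET(request: Request): Promise<Response> {
  const { searchParams } = new URL(request.url);
  const url = searchParams.get("url");
  if (!url) {
    return new Response("Missing ?url= parameter", { status: 400 });
  }

  // archive.is lookup pattern taken from the starter code.
  const archiveUrl = `http://archive.is/latest/${encodeURIComponent(url)}`;
  const upstream = await fetch(archiveUrl, { redirect: "follow" });
  if (!upstream.ok) {
    return new Response(`Upstream returned ${upstream.status}`, { status: 502 });
  }

  const markdown = htmlToMarkdown(await upstream.text());
  return new Response(markdown, {
    headers: { "Content-Type": "text/markdown; charset=utf-8" },
  });
}
```

With this layout, a request such as /api/markdown?url=<article URL> would return the converted text; a Google Cache variant could follow the same pattern with a different lookup URL.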

Technical Details

Requires some knowledge of Next 13.

Link to Project

You can start a new project from scratch. No need to clone SMRY: if your API works, I'll integrate it. But if you want to see the current implementation, check out https://github.com/mrmps/SMRY/blob/test/components/article-content.tsx

The most important things are:

Archive.is works (URL to archive.is markdown)
Works in prod (making it work locally is not enough; it needs to be deployed to a testable Vercel app)
Next 13 or 15
Google Cache works (this solution is preferred, but not strictly necessary for the bounty to be paid out)

No styling or anything needed, it just needs to return valid markdown. I'll be testing with news articles (NYT, The Economist, etc.).

You may need to use proxies. That is up to you. I'm really not sure if they are necessary, but feel free to choose any proxy provider if you do decide to use them. Up to $20 in proxy fees will be paid if requested and the task is completed.