    Earn 5,400 ($54.00)

    Time Remaining: due 1 year ago
    In Progress

    Build Next.js Proof of Concept for Scraping Archive.is and Google Cache

    mryaboy
    Posted 1 year ago

    Bounty Description

    Problem Description

    I currently have an application that makes requests to archive.is and jina.ai, but it's having issues. The project is simple: I need a proof-of-concept application that can extract the markdown from any article through archive.is and Google Cache (if the cache exists).

    Here is some starter code:

    // Map a source name to the URL that should be fetched for the given article URL.
    export function getUrlWithSource(source: string, url: string): string {
      let urlWithSource: string;
      switch (source) {
        case "direct":
          urlWithSource = url;
          break;
        case "wayback":
          urlWithSource = `https://web.archive.org/web/2/${encodeURIComponent(url)}`;
          break;
        // case "google": {
        //   // Strip any existing scheme, then force https before building the cache URL.
        //   const cleanUrl = url.replace(/^https?:\/+/, "");
        //   const finalUrl = `https://${cleanUrl}`;
        //   urlWithSource = `https://webcache.googleusercontent.com/search?q=cache:${encodeURIComponent(finalUrl)}`;
        //   break;
        // }
        case "jina.ai":
          urlWithSource = `https://r.jina.ai/${url}`;
          break;
        case "archive":
          urlWithSource = `http://archive.is/latest/${encodeURIComponent(url)}`;
          break;
        default:
          throw new Error(`Invalid source parameter: ${source}`);
      }
      return urlWithSource;
    }
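
Since Google Cache should only be used "if the cache exists", one way to combine the sources is to try them in order and keep the first that returns content. The sketch below is only an illustration under assumptions: `SourceName`, `buildSourceUrl`, `FetchHtml`, and `firstAvailable` are hypothetical names, the URL shapes are copied from the starter code (the Google one from its commented-out case, which may no longer resolve), and the fetcher is injected so that a proxy-backed implementation could be swapped in.

```typescript
// Hypothetical names throughout; URL shapes mirror the starter code.
type SourceName = "archive" | "google";

// Build the lookup URL for one source.
function buildSourceUrl(source: SourceName, url: string): string {
  if (source === "archive") {
    return `http://archive.is/latest/${encodeURIComponent(url)}`;
  }
  // Normalize to https first, as the commented-out starter case does.
  const cleanUrl = url.replace(/^https?:\/+/, "");
  const finalUrl = `https://${cleanUrl}`;
  return `https://webcache.googleusercontent.com/search?q=cache:${encodeURIComponent(finalUrl)}`;
}

// A fetcher resolves to the page HTML, or null when the source has no copy.
type FetchHtml = (sourceUrl: string) => Promise<string | null>;

interface SourceResult {
  source: SourceName;
  html: string;
}

// Try each source in order and return the first one that yields non-empty HTML.
async function firstAvailable(
  url: string,
  order: SourceName[],
  fetchHtml: FetchHtml
): Promise<SourceResult | null> {
  for (const source of order) {
    try {
      const html = await fetchHtml(buildSourceUrl(source, url));
      if (html !== null && html.trim() !== "") {
        return { source, html };
      }
    } catch {
      // A failed request is treated as a miss so the next source is tried.
    }
  }
  return null;
}
```

A caller might use `firstAvailable(articleUrl, ["archive", "google"], myFetcher)`, where `myFetcher` (also hypothetical) wraps `fetch` and returns `null` on a non-2xx response.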

    More code is at https://github.com/mrmps/SMRY/tree/test

    Acceptance Criteria

    Functionality is simple: enter a URL in a Next 13 (or Next 15) application and get the markdown of the archive.is and Google Cache versions of that article. Content doesn't need to be formatted. You can use proxies or whatever else you need, but the app needs to be deployed to a Vercel URL.
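
As a rough illustration of the "enter a URL, get content back" flow, an App Router route handler could look like the sketch below. It is not the SMRY implementation: the file path `app/api/markdown/route.ts`, the `handleMarkdownRequest` helper, the `FetchLike` type, the `?url=` parameter, and the user-agent string are all assumptions. It uses only the Web-standard `Request`, `Response`, and `fetch`, returns the raw archive.is HTML, and leaves the HTML-to-markdown step out.

```typescript
// Sketch of app/api/markdown/route.ts (hypothetical path) for the Next.js App Router.
// Only Web-standard Request/Response/fetch are used, so no imports are needed.

// Narrow fetch signature so a proxy-backed or fake fetcher can be injected.
type FetchLike = (
  input: string,
  init?: { headers?: Record<string, string> }
) => Promise<Response>;

// Small helper to build a JSON response with a status code.
function json(status: number, body: unknown): Response {
  return new Response(JSON.stringify(body), {
    status,
    headers: { "content-type": "application/json" },
  });
}

export async function handleMarkdownRequest(
  request: Request,
  fetchImpl: FetchLike = fetch
): Promise<Response> {
  // Read and validate the ?url= query parameter (an assumed parameter name).
  const target = new URL(request.url).searchParams.get("url");
  if (!target) return json(400, { error: "Missing ?url= parameter" });

  let parsed: URL;
  try {
    parsed = new URL(target);
  } catch {
    return json(400, { error: "Invalid URL" });
  }
  if (parsed.protocol !== "http:" && parsed.protocol !== "https:") {
    return json(400, { error: "Only http and https URLs are supported" });
  }

  // Same archive.is URL shape as the "archive" case in the starter code.
  const archiveUrl = `http://archive.is/latest/${encodeURIComponent(parsed.toString())}`;
  try {
    const upstream = await fetchImpl(archiveUrl, {
      headers: { "user-agent": "Mozilla/5.0 (compatible; poc)" }, // assumed value
    });
    if (!upstream.ok) {
      return json(502, { error: `archive.is responded with ${upstream.status}` });
    }
    const html = await upstream.text();
    return json(200, { source: "archive", url: archiveUrl, html });
  } catch {
    return json(502, { error: "Upstream fetch failed" });
  }
}

// Next.js treats an exported GET function in route.ts as the handler for GET requests.
export async function GET(request: Request): Promise<Response> {
  return handleMarkdownRequest(request);
}
```

Keeping the fetcher injectable means the same handler can be exercised locally with a stub and pointed at a proxy in production, which is where archive.is requests are most likely to be blocked.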

    Technical Details

    Requires some knowledge of Next 13

    Link to Project

    You can start a new project from scratch. No need to clone SMRY--if your API works I'll integrate it. But if you want to see the current implementation, check out https://github.com/mrmps/SMRY/blob/test/components/article-content.tsx

    The most important things are

    1. Archive.is works (url to archive.is markdown)
    2. Works in prod (working locally is not enough; it needs to be deployed to a testable Vercel app)
    3. Next 13 or 15
    4. Google cache works (this solution is preferred, but not strictly necessary for the bounty to be paid out)

    No styling or anything is needed; it just needs to return valid markdown. I'll be testing with news articles (NYT, The Economist, etc.)
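
To make "return valid markdown" concrete, here is a deliberately small HTML-to-markdown pass. It is a sketch under assumptions, not a recommended converter: `htmlToMarkdown` and `decodeEntities` are hypothetical helpers, only a handful of tags and entities are handled, and regex-based conversion will miss nested or malformed markup that a dedicated library such as Turndown would handle better.

```typescript
// Hypothetical helper: decode the few HTML entities this sketch cares about.
function decodeEntities(text: string): string {
  return text
    .replace(/&nbsp;/g, " ")
    .replace(/&lt;/g, "<")
    .replace(/&gt;/g, ">")
    .replace(/&quot;/g, '"')
    .replace(/&#39;/g, "'")
    .replace(/&amp;/g, "&"); // decoded last so "&amp;lt;" becomes "&lt;", not "<"
}

// Hypothetical helper: convert a small subset of HTML to markdown.
export function htmlToMarkdown(html: string): string {
  const markdown = html
    // Drop non-content elements entirely.
    .replace(/<(script|style|noscript)\b[^>]*>[\s\S]*?<\/\1>/gi, "")
    // <h1>..<h6> become "#".."######" heading lines.
    .replace(
      /<h([1-6])\b[^>]*>([\s\S]*?)<\/h\1>/gi,
      (_match: string, level: string, text: string) =>
        `\n\n${"######".slice(0, Number(level))} ${text.trim()}\n\n`
    )
    // Links with a double-quoted href become [text](href).
    .replace(/<a\b[^>]*\bhref="([^"]*)"[^>]*>([\s\S]*?)<\/a>/gi, "[$2]($1)")
    // Bold and italic.
    .replace(/<(strong|b)\b[^>]*>([\s\S]*?)<\/\1>/gi, "**$2**")
    .replace(/<(em|i)\b[^>]*>([\s\S]*?)<\/\1>/gi, "*$2*")
    // List items become "- " bullets.
    .replace(/<li\b[^>]*>([\s\S]*?)<\/li>/gi, "\n- $1")
    // Paragraph ends and <br> become blank lines.
    .replace(/<\/p>|<br\s*\/?>/gi, "\n\n")
    // Strip whatever tags remain.
    .replace(/<[^>]+>/g, "");

  // Decode entities after tag stripping so decoded "<" is not mistaken for a tag.
  return decodeEntities(markdown)
    .replace(/[ \t]+\n/g, "\n")
    .replace(/\n{3,}/g, "\n\n")
    .trim();
}
```

For a news article, it would likely help to first narrow the input to the article body before converting, since archived pages wrap the content in their own toolbar markup.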

    You may need to use proxies. That is up to you; I'm really not sure if they are necessary, but feel free to choose any proxy provider if you do decide to use them. Up to $20 in proxy fees will be paid if requested and the task is completed.