
    Earn 1,980 ($19.80)

    Due 2 years ago
    Canceled

    Web Scraper for Datasets

    succoallapera104
    Posted 2 years ago

    Bounty Description

    Problem Description

    Make a tool in Python for scraping entire websites (and their subpages), then organizing the scraped content into CSV files.
    Each data file must have 4 columns (max rows: 3,000):

    • one for Context
    • one for Question
    • one for Answer (answers.text)
    • one for Answer Start Index (answers.answer_start).
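    The four columns above match the SQuAD-style QA format. A minimal sketch of writing rows in that schema with the standard library (the helper name `write_rows` and the sample row are my own, not part of the spec):

    ```python
    import csv

    # The four columns required by the bounty (SQuAD-style fields).
    FIELDNAMES = ["Context", "Question", "Answer", "Answer Start Index"]

    def write_rows(path, rows):
        """Write QA rows to a CSV file with the four required columns."""
        with open(path, "w", newline="", encoding="utf-8") as f:
            writer = csv.DictWriter(f, fieldnames=FIELDNAMES)
            writer.writeheader()
            writer.writerows(rows)

    # Example row: the answer start index is the offset of the answer
    # inside the context string, as in answers.answer_start.
    row = {
        "Context": "Python was created by Guido van Rossum.",
        "Question": "Who created Python?",
        "Answer": "Guido van Rossum",
        "Answer Start Index": 22,
    }
    write_rows("sample.csv", [row])
    ```

    Note that the start index must point into the context string, so `Context[22:]` begins with the answer text.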

    Use gpt4free and OpenAssistant to generate the context, questions, answers, and answer start indexes from the scraped content.
    If the generated content is more than 3,000 rows, it should be split across different files.
    Each CSV file has to be saved in a directory named after the website URL, and that directory has to be saved inside another directory called Datasets.
    Also, each CSV file's name has to be formatted like this: "WEBSITE_NAME"_"NUMBER_OF_THE_FILE", for example:
    google_1, google_2, google_3, etc.
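    The splitting and naming rules above can be sketched as follows. This assumes the generated rows already exist (gpt4free/OpenAssistant calls are out of scope here), and the helpers `site_name` and `save_chunks` are my own names, not from the spec:

    ```python
    import csv
    from pathlib import Path
    from urllib.parse import urlparse

    FIELDNAMES = ["Context", "Question", "Answer", "Answer Start Index"]

    def site_name(url):
        """Derive a short site name from a URL, e.g. https://www.google.com -> google."""
        host = urlparse(url).netloc or url
        return host.removeprefix("www.").split(".")[0]

    def save_chunks(url, rows, max_rows=3000):
        """Split rows into files of at most max_rows each and save them as
        Datasets/<site>/<site>_<n>.csv, per the bounty's layout."""
        name = site_name(url)
        out_dir = Path("Datasets") / name
        out_dir.mkdir(parents=True, exist_ok=True)
        paths = []
        for i in range(0, len(rows), max_rows):
            n = i // max_rows + 1  # file numbering starts at 1
            path = out_dir / f"{name}_{n}.csv"
            with open(path, "w", newline="", encoding="utf-8") as f:
                writer = csv.DictWriter(f, fieldnames=FIELDNAMES)
                writer.writeheader()
                writer.writerows(rows[i:i + max_rows])
            paths.append(path)
        return paths
    ```

    With 7,000 generated rows this would produce google_1.csv (3,000 rows), google_2.csv (3,000 rows), and google_3.csv (1,000 rows) inside Datasets/google/.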
    If you have any questions contact me on discord: succo104#5166

    Acceptance Criteria

    1. The web scraper has to be entirely in Python.
    2. Shell-based GUI for inputs:
    • First input: Model (GPT or OpenAssistant)
    • Second input: Website
    3. Hosted on a Repl
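    The shell-based input step could look like this minimal sketch (the function name `prompt_inputs` is my own; the spec only fixes the two inputs and the two model choices):

    ```python
    def prompt_inputs(input_fn=input):
        """Prompt for the two required inputs on the shell.

        Re-asks until the model choice matches one of the two backends
        named in the spec, then asks for the target website.
        """
        model = ""
        while model not in ("GPT", "OpenAssistant"):
            model = input_fn("Model (GPT or OpenAssistant): ").strip()
        website = input_fn("Website: ").strip()
        return model, website
    ```

    Passing `input_fn` as a parameter keeps the prompt testable without a real terminal.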