Earn 1,980 ($19.80)
due 2 years ago
Canceled
Web Scraper for Datasets
succoallapera104
Applications: 7
Bounty Description
Problem Description
Build a tool in Python that scrapes entire websites (including subpages) and organizes the scraped content into CSV files.
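Scraping "entire websites" means discovering same-domain subpages from each page's links. A minimal sketch of that step using only the standard library (the `LinkCollector` and `extract_subpage_links` names are illustrative, not part of the bounty):

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

class LinkCollector(HTMLParser):
    """Collect same-domain links from one page so subpages can be crawled."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = set()

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if not href:
            return
        url = urljoin(self.base_url, href)
        # Keep only subpages of the same site; drop fragment anchors.
        if urlparse(url).netloc == urlparse(self.base_url).netloc:
            self.links.add(url.split("#")[0])

def extract_subpage_links(base_url, html):
    """Return the sorted same-domain links found in one page's HTML."""
    parser = LinkCollector(base_url)
    parser.feed(html)
    return sorted(parser.links)
```

A full crawler would fetch each discovered URL (e.g. with `urllib.request` or `requests`), track visited pages, and repeat until no new subpages remain.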
Each data file must have 4 columns (max 3,000 rows per file):
- one for Context
- one for Question
- one for Answer (answers.text)
- one for Answer Start Index (answers.answer_start).
Use gpt4free or OpenAssistant to generate the context, question, answer (answers.text), and answer start index (answers.answer_start) from the scraped content.
If the generated content exceeds 3,000 rows, it should be split across multiple files.
Each CSV file has to be saved in a directory named after the website URL, and that directory has to be placed inside another directory called Datasets.
Also, each CSV file's name has to be formatted like this: "WEBSITE_NAME"_"NUMBER_OF_THE_FILE", for example:
google_1, google_2, google_3, etc.
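The splitting and naming rules above can be sketched as follows (`dataset_paths` and `chunk_rows` are illustrative helper names; ceiling division over the 3,000-row cap gives the file count):

```python
import os

MAX_ROWS = 3000  # max rows per CSV file, per the bounty description

def dataset_paths(website_name, total_rows, root="Datasets"):
    """Return Datasets/<site>/<site>_1.csv, <site>_2.csv, ... for the row count."""
    n_files = max(1, -(-total_rows // MAX_ROWS))  # ceiling division
    site_dir = os.path.join(root, website_name)
    return [
        os.path.join(site_dir, f"{website_name}_{i}.csv")
        for i in range(1, n_files + 1)
    ]

def chunk_rows(rows, size=MAX_ROWS):
    """Yield successive slices of at most `size` rows."""
    for i in range(0, len(rows), size):
        yield rows[i:i + size]
```

Pairing `dataset_paths` with `chunk_rows` (and `os.makedirs` for the site directory) covers both the Datasets/<website> layout and the google_1, google_2 naming scheme.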
If you have any questions, contact me on Discord: succo104#5166
Acceptance Criteria
- The web scraper has to be written entirely in Python.
- Shell-based GUI for inputs:
- First input: Model (GPT or OpenAssistant)
- Second input: Website
- Hosted on a Repl
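The two-step shell input flow above could be sketched as a minimal prompt loop (`prompt_inputs` is a hypothetical name; `input_fn` is injected only to make the function testable):

```python
def prompt_inputs(input_fn=input):
    """Shell UI per the acceptance criteria: first the model, then the website."""
    model = ""
    while model not in ("GPT", "OpenAssistant"):
        # Re-prompt until one of the two supported models is entered.
        model = input_fn("Model (GPT or OpenAssistant): ").strip()
    website = input_fn("Website: ").strip()
    return model, website
```

On a Repl this runs directly in the console, so no extra UI framework is needed to satisfy the "shell based GUI" criterion.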