Earn 20,250 ($202.50)

due 2 years ago

Canceled

Create a Discord scraper

neovike

Posted 2 years ago

Bounty Description

Problem Description

I need a dataset of chat messages from many different Discord servers. The data must be no more than 1 hour stale, and I need to monitor at least 25 servers. Some of these servers have very few members (fewer than 20), and some have more than 100k active users.

Acceptance Criteria

Must scrape the complete history of a server, for all channels.
You may use one tool to build the history and a different one to keep the dataset up to date. Feel free to use a tool like https://github.com/Tyrrrz/DiscordChatExporter
The data must be no more than 1 hour stale.
Must store the data in a GCS bucket or BigQuery table. I am fine paying for GCS/BQ storage and am fine paying for the write costs. I am also open to alternative storage locations, but these would be my preference.
If you are storing data in GCS then you should use compressed JSON or some structured format. CSV is not acceptable.
Must use a single consistent schema. This schema should be shared with me along with a description of the fields you are using.
Must not have gaps in data.
It's acceptable to have some minimal overlap in data, but no more than 5% of data should be duplicated.
Must provide a way to run this pipeline to keep the datasets up to date. Options that come to mind are GitHub Actions with a cron job, but I am open to alternatives.
I only have a single Discord account. The system must not require me to provide additional Discord accounts in order to work.
It's not acceptable for me to pay a SaaS service (or another type of service) to do this scraping.
The system must run reliably for at least 3 days. It should not hit any rate limits.
You should print descriptive logs to the console while scraping channels and servers.
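
The schema, storage format, and duplication criteria above could look roughly like the following. This is only a sketch: the `MessageRecord` fields and the helper names `snowflake_to_iso`, `dedupe`, and `to_gzip_ndjson` are hypothetical, not a required schema. It relies on the fact that Discord IDs are snowflakes whose upper bits hold milliseconds since 2015-01-01 UTC, so a creation time can be derived from the message ID alone.

```python
import gzip
import json
from dataclasses import dataclass, asdict
from datetime import datetime, timedelta, timezone

# Discord snowflake epoch: 2015-01-01T00:00:00Z, in Unix milliseconds.
DISCORD_EPOCH_MS = 1420070400000
UNIX_EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


@dataclass(frozen=True)
class MessageRecord:
    """Hypothetical single schema shared by every server and channel."""
    message_id: str   # Discord snowflake, kept as a string
    channel_id: str
    server_id: str
    author_id: str
    content: str
    created_at: str   # ISO 8601 UTC, derived from message_id


def snowflake_to_iso(snowflake: str) -> str:
    """Derive the creation time encoded in a Discord snowflake ID."""
    ms = (int(snowflake) >> 22) + DISCORD_EPOCH_MS
    return (UNIX_EPOCH + timedelta(milliseconds=ms)).isoformat()


def dedupe(records):
    """Keep the first record seen for each message_id, preserving order."""
    seen = set()
    unique = []
    for record in records:
        if record.message_id not in seen:
            seen.add(record.message_id)
            unique.append(record)
    return unique


def to_gzip_ndjson(records) -> bytes:
    """Serialize records as gzip-compressed newline-delimited JSON."""
    lines = "\n".join(json.dumps(asdict(r), sort_keys=True) for r in records)
    return gzip.compress(lines.encode("utf-8"))
```

Keying on `message_id` makes overlap between the backfill tool and the incremental tool harmless, because re-fetched messages collapse to one record before the file is written.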

How will I test your submission?

I'll provide you with a file on GitHub containing a list of servers.
Your system should pick up this file and automatically start scraping Discord for these servers
I will then go to one of the channels in one of the servers, and post a message.
I will then check that the message appears in the dataset you are producing within 1 hour of posting it.
I will also randomly sample messages from servers to see if they are in your dataset
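
The checks above could be automated along these lines. This is a sketch under assumptions: the function names are hypothetical, message IDs are treated as numeric strings, and "5% duplicated" is interpreted as the share of rows that repeat an earlier message ID.

```python
from datetime import datetime, timedelta, timezone


def missing_ids(sampled_ids, dataset_ids):
    """Return sampled message IDs that are absent from the dataset."""
    present = set(dataset_ids)
    return sorted((i for i in set(sampled_ids) if i not in present), key=int)


def duplicate_ratio(dataset_ids):
    """Fraction of rows that repeat an already-seen message ID."""
    ids = list(dataset_ids)
    if not ids:
        return 0.0
    return (len(ids) - len(set(ids))) / len(ids)


def is_fresh(posted_at, visible_at, max_delay=timedelta(hours=1)):
    """True if a message became visible in the dataset within max_delay."""
    return visible_at - posted_at <= max_delay


if __name__ == "__main__":
    posted = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
    print(missing_ids(["1", "9"], ["1", "2", "3"]))
    print(duplicate_ratio(["1", "2", "2", "3"]))
    print(is_fresh(posted, posted + timedelta(minutes=45)))
```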

Important

I will only consider proposals that provide a high-level plan for how you will implement this pipeline. Your plan should cover how you will backfill the Discord channel data, how you will keep it up to date, how the pipeline will be run regularly to meet the freshness requirement, how you will avoid overlapping data and gaps in the data, how you will deal with the rate limits of the Discord API, etc.
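
One way a plan might address gaps, overlap, and rate limits is a per-channel cursor: store the ID of the last message written, and on each run request only messages after it. The sketch below is illustrative, not a prescribed design. `fetch_page` is a hypothetical callable supplied by the caller and is assumed to return up to `limit` messages with the smallest IDs greater than `after_id`. `retry_delay` assumes the response carries Discord's `Retry-After`, `X-RateLimit-Remaining`, and `X-RateLimit-Reset-After` headers.

```python
def sync_channel(fetch_page, last_seen_id, page_size=100):
    """Collect every message newer than last_seen_id.

    Returns (messages, new_cursor). Starting from "0" performs a full
    backfill; starting from a stored cursor performs an incremental update.
    """
    collected = []
    cursor = last_seen_id
    while True:
        page = fetch_page(cursor, page_size)
        if not page:
            break
        # Sort defensively so the cursor always advances to the newest ID.
        page = sorted(page, key=lambda m: int(m["id"]))
        collected.extend(page)
        cursor = page[-1]["id"]
        if len(page) < page_size:
            break  # a short page means the channel is caught up
    return collected, cursor


def retry_delay(status_code, headers):
    """Seconds to wait before the next request, based on response headers."""
    if status_code == 429:
        # Rate limited: wait as long as the server asks.
        return float(headers.get("Retry-After", 1.0))
    if headers.get("X-RateLimit-Remaining") == "0":
        # Bucket exhausted: wait for it to reset instead of triggering a 429.
        return float(headers.get("X-RateLimit-Reset-After", 0.0))
    return 0.0
```

Because the cursor only moves forward to the newest ID actually stored, a failed run can be retried from the same cursor without skipping messages, and a successful run never re-requests what it already has.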