Back to all Bounties
Earn 10,000 ($100.00)
due 1 year ago
Completed
Scrape a SubReddit Exhaustively
neelanveloo
Details
Applications
5
Discussion
This Bounty has been completed!
Bounty Description
Problem Criteria:
- Scrape the provided subreddit and extract all posts, comments, and associated metadata. This includes:
- All posts (title, text, author, timestamp, score, etc.)
- All comments (text, author, timestamp, score, parent post, etc.)
- The tree structure showing the parent-child relationships between posts and comments (to preserve thread structure)
- Store the scraped data in a structured format (JSON or CSV) that preserves relationships and allows for further analysis.
- Include all posts and comments within a specified time range (e.g. last 5 years).
- Follow Reddit's API terms of service and robots.txt to avoid overloading servers.
Acceptance Criteria:
-
Output contains 100% of posts and comments made in the specified time range. All content and metadata from Reddit is represented accurately.
-
Comment structure properly preserves threading/hierarchy.
-
Code includes throttling, error handling, retries to ensure completeness.
-
Documentation provided explaining the overall architecture and key design decisions.
-
Reddit API terms and robots.txt rules are followed.
-
Unit tests exist and provide >90% code coverage.
-
Source code is reusable for scraping other subreddits.
-
Data is returned as a JSON formatted in the following way:
- SubReddit
-
Given this Reddit URL, the service should produce output of the form:
{"data": {"author": "Straight_Hat_3398","timestamp": ...,"title": "Calculated Field - Replace a Name with Area Name","content": "Hi Everyone! Is there a way to replace a name...","upvotes": 5"downvotes": 0,"children": [{"author": "jonthecpa","timestamp": ...,"title": "","content": "Seems like you could just pull the different...","upvotes": 2,"downvotes": 0,"children": [...]},...]}}
-
- SubReddit