Earn 10,000 ($100.00)

due 1 year ago

Completed

Scrape a SubReddit Exhaustively

neelanveloo

Posted 1 year ago

This Bounty has been completed!

Bounty Description

Problem Criteria:

Scrape the provided subreddit and extract all posts, comments, and associated metadata. This includes:
- All posts (title, text, author, timestamp, score, etc.)
- All comments (text, author, timestamp, score, parent post, etc.)
- The tree structure showing the parent-child relationships between posts and comments (to preserve thread structure)
Store the scraped data in a structured format (JSON or CSV) that preserves relationships and allows for further analysis.
Include all posts and comments within a specified time range (e.g., the last 5 years).
Follow Reddit's API terms of service and robots.txt to avoid overloading servers.
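
As a rough illustration of the threading requirement above, the following minimal sketch nests a flat list of comment records under their post by following parent references. The field names `id`, `parent_id`, and `timestamp`, and the function name `build_tree`, are assumptions made for this example; they are not taken from the bounty or from Reddit's API.

```python
def build_tree(post, comments):
    """Nest flat comment records under a post using parent ids.

    `id`, `parent_id`, and `timestamp` are hypothetical field names chosen
    for this sketch. Returns (tree, orphans), where orphans are comments
    whose parent was not found, so nothing is silently dropped.
    """
    # Index every record by id and give each one an empty children list.
    nodes = {post["id"]: {**post, "children": []}}
    for c in comments:
        nodes[c["id"]] = {**c, "children": []}

    orphans = []
    # Link children to parents in chronological order so siblings stay sorted.
    for c in sorted(comments, key=lambda c: c["timestamp"]):
        parent = nodes.get(c["parent_id"])
        if parent is None:
            orphans.append(nodes[c["id"]])
        else:
            parent["children"].append(nodes[c["id"]])
    return {"data": nodes[post["id"]]}, orphans
```

Returning the orphans separately is one possible design choice: it lets a caller re-fetch missing parents instead of losing replies, which matters if completeness is a goal.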

Acceptance Criteria:

Output contains 100% of posts and comments made in the specified time range. All content and metadata from Reddit is represented accurately.
Comment structure properly preserves threading/hierarchy.
Code includes throttling, error handling, and retries to ensure completeness.
Documentation provided explaining the overall architecture and key design decisions.
Reddit API terms and robots.txt rules are followed.
Unit tests exist and provide >90% code coverage.
Source code is reusable for scraping other subreddits.
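
The throttling and retry criteria above could be approached along these lines. This is a sketch only: `fetch_with_retries` and `Throttle` are hypothetical helper names, the delay values are placeholders rather than limits taken from Reddit's terms, and `fetch` stands in for whatever function actually performs a request.

```python
import time


def fetch_with_retries(fetch, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call `fetch()` until it succeeds, backing off exponentially between tries.

    The broad `except Exception` keeps the sketch short; a real scraper would
    likely retry only on network or rate-limit errors.
    """
    for attempt in range(max_attempts):
        try:
            return fetch()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** attempt)  # 1x, 2x, 4x, ... base_delay


class Throttle:
    """Enforce a minimum interval (in seconds) between successive calls."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self.clock = clock
        self.sleep = sleep
        self.last = None  # time of the previous call, if any

    def wait(self):
        if self.last is not None:
            remaining = self.min_interval - (self.clock() - self.last)
            if remaining > 0:
                self.sleep(remaining)
        self.last = self.clock()
```

Passing `sleep` and `clock` as parameters is a choice made here so that unit tests can run without real delays, which may help with the coverage criterion.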

Data is returned as JSON formatted as follows:

SubReddit

Given this Reddit URL, the service should produce output of the form:

{
    "data": {
        "author": "Straight_Hat_3398",
        "timestamp": ...,
        "title": "Calculated Field - Replace a Name with Area Name",
        "content": "Hi Everyone! Is there a way to replace a name...",
        "upvotes": 5,
        "downvotes": 0,
        "children": [
            {
                "author": "jonthecpa",
                "timestamp": ...,
                "title": "",
                "content": "Seems like you could just pull the different...",
                "upvotes": 2,
                "downvotes": 0,
                "children": [
                    ...
                ]
            },
            ...
        ]
    }
}
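
One way to check that scraper output follows the shape above is a small recursive validator, sketched below. The function name `validate_node` and the error-message format are assumptions for illustration. Because the sample elides the `timestamp` value, the sketch checks only that the key is present and does not assume its type.

```python
# Keys shown in the sample output, with the types the sample suggests.
REQUIRED = {
    "author": str,
    "title": str,
    "content": str,
    "upvotes": int,
    "downvotes": int,
    "children": list,
}


def validate_node(node, path="data"):
    """Return a list of problems found in `node` and all of its descendants."""
    errors = []
    for key, expected in REQUIRED.items():
        if key not in node:
            errors.append(f"{path}: missing {key!r}")
        elif not isinstance(node[key], expected):
            errors.append(f"{path}.{key}: expected {expected.__name__}")
    # The timestamp format is not specified in the sample, so check presence only.
    if "timestamp" not in node:
        errors.append(f"{path}: missing 'timestamp'")
    children = node.get("children")
    if isinstance(children, list):
        for i, child in enumerate(children):
            errors.extend(validate_node(child, f"{path}.children[{i}]"))
    return errors
```

A caller would pass the object stored under the top-level `"data"` key, for example `validate_node(output["data"])`, and treat an empty list as a pass.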