
    Earn 10,000 ($100.00)

    Time Remaining: due 1 year ago
    Completed

    Scrape a SubReddit Exhaustively

    neelanveloo
    Posted 1 year ago
    This Bounty has been completed!

    Bounty Description

    Problem Criteria:

    • Scrape the provided subreddit and extract all posts, comments, and associated metadata. This includes:
      • All posts (title, text, author, timestamp, score, etc.)
      • All comments (text, author, timestamp, score, parent post, etc.)
      • The tree structure showing the parent-child relationships between posts and comments (to preserve thread structure)
    • Store the scraped data in a structured format (JSON or CSV) that preserves relationships and allows for further analysis.
    • Include all posts and comments within a specified time range (e.g. last 5 years).
    • Follow Reddit's API terms of service and robots.txt to avoid overloading servers.
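    To illustrate the "avoid overloading servers" point, here is a minimal sketch of a throttle combined with retries and exponential backoff. It is not tied to any particular Reddit client: `Throttle`, `with_retries`, and the `fetch` callable are hypothetical names introduced for this example, and the interval and attempt counts are placeholder values, not limits taken from Reddit's terms.

```python
import time


class Throttle:
    """Enforce a minimum interval between consecutive calls (hypothetical helper)."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous call, None before the first one

    def wait(self):
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
        self._last = self._clock()


def with_retries(fetch, throttle, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call fetch() under the throttle, retrying failures with exponential backoff.

    Assumption: fetch raises an exception on failure. A real scraper would
    catch only the specific network/HTTP errors of the client it uses.
    """
    last_error = None
    for attempt in range(max_attempts):
        throttle.wait()
        try:
            return fetch()
        except Exception as exc:
            last_error = exc
            if attempt < max_attempts - 1:
                # Delays grow as base_delay * 1, 2, 4, ...
                sleep(base_delay * 2 ** attempt)
    raise RuntimeError(f"gave up after {max_attempts} attempts") from last_error
```

    A caller would wrap each page request, for example `with_retries(lambda: get_page(url), Throttle(2.0))`, where `get_page` stands in for whatever request function the scraper uses. Injecting `clock` and `sleep` keeps the helpers easy to unit test without real waiting.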

    Acceptance Criteria:

    • Output contains 100% of posts and comments made in the specified time range. All content and metadata from Reddit is represented accurately.

    • Comment structure properly preserves threading/hierarchy.

    • Code includes throttling, error handling, and retries to ensure completeness.

    • Documentation provided explaining the overall architecture and key design decisions.

    • Reddit API terms and robots.txt rules are followed.

    • Unit tests exist and provide >90% code coverage.

    • Source code is reusable for scraping other subreddits.

    • Data is returned as JSON, formatted in the following way:

      • SubReddit
        • Given this Reddit URL, the service should produce output of the form:

          {
            "data": {
              "author": "Straight_Hat_3398",
              "timestamp": ...,
              "title": "Calculated Field - Replace a Name with Area Name",
              "content": "Hi Everyone! Is there a way to replace a name...",
              "upvotes": 5,
              "downvotes": 0,
              "children": [
                {
                  "author": "jonthecpa",
                  "timestamp": ...,
                  "title": "",
                  "content": "Seems like you could just pull the different...",
                  "upvotes": 2,
                  "downvotes": 0,
                  "children": [
                    ...
                  ]
                },
                ...
              ]
            }
          }
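
    The nested shape above can be produced from flat records roughly as follows. This is only a sketch under stated assumptions: each record is assumed to carry hypothetical `id` and `parent_id` keys (not part of the required output), comments whose parent is missing are attached to the post so nothing is dropped, and siblings are ordered by timestamp. None of these choices are specified by the bounty.

```python
import json


def build_thread(post, comments):
    """Nest flat comment records under a post, returning the bounty's JSON shape.

    Assumption: every record has hypothetical "id" / "parent_id" keys plus the
    output fields; comments have no title, so it defaults to "".
    """

    def node(rec):
        return {
            "author": rec["author"],
            "timestamp": rec["timestamp"],
            "title": rec.get("title", ""),
            "content": rec["content"],
            "upvotes": rec["upvotes"],
            "downvotes": rec["downvotes"],
            "children": [],
        }

    root = node(post)
    by_id = {post["id"]: root}

    # Register every comment first so input order does not matter.
    for c in comments:
        by_id[c["id"]] = node(c)

    # Link children to parents, oldest first, so siblings stay chronological.
    for c in sorted(comments, key=lambda c: c["timestamp"]):
        # Assumption: orphaned comments fall back to the post itself.
        parent = by_id.get(c["parent_id"], root)
        parent["children"].append(by_id[c["id"]])

    return {"data": root}


if __name__ == "__main__":
    post = {"id": "p1", "author": "op", "timestamp": 1, "title": "Example",
            "content": "Post body", "upvotes": 5, "downvotes": 0}
    comments = [
        {"id": "c2", "parent_id": "c1", "author": "u2", "timestamp": 3,
         "content": "Nested reply", "upvotes": 1, "downvotes": 0},
        {"id": "c1", "parent_id": "p1", "author": "u1", "timestamp": 2,
         "content": "Top-level reply", "upvotes": 2, "downvotes": 0},
    ]
    print(json.dumps(build_thread(post, comments), indent=2))
```

    Keeping `id` and `parent_id` out of the emitted nodes matches the sample output, where the hierarchy is carried entirely by `children`.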