
    Earn 18,000 ($180.00)

    Time Remaining: due 2 years ago
    Canceled

    Reddit/StackOverflow scraper (Azure microservice)

    adi42
    Posted 2 years ago

    Bounty Description

    Problem Description

    We are building an LLM for enterprise software and want to decouple data extraction from the rest of our pipeline. Our model uses data from a variety of sources, including URLs, so we need a microservice that reliably extracts text from a website given its URL. The service should use Playwright to traverse websites and extract their data.

    Before accepting this bounty, please note the following:

    • You will need an Azure subscription to complete this bounty as the service should run in Azure. We will not be providing an Azure subscription but will test the deliverable using one.
    • One of the more challenging parts of the project is getting Playwright to run successfully in Azure Functions directly from the Terraform code.

    Acceptance Criteria

    • Microservice uses Playwright to extract text from any website (passed via URL parameter) and extracts as much text as possible by traversing the following websites:

      • Reddit

        • Given this Reddit URL, the service should produce output of the form:

          {
          "data": {
          "author": "Straight_Hat_3398",
          "timestamp": ...,
          "title": "Calculated Field - Replace a Name with Area Name",
          "content": "Hi Everyone! Is there a way to replace a name...",
          "upvotes": 5,
          "downvotes": 0,
          "children": [
          {
          "author": "jonthecpa",
          "timestamp": ...,
          "title": "",
          "content": "Seems like you could just pull the different...",
          "upvotes": 2,
          "downvotes": 0,
          "children": [
          ...
          ]
          },
          ...
          ]
          }
          }
      • StackOverflow

        • Given this StackOverflow URL, the service should produce output of the form:

          {
          "data": {
          "author": "delta",
          "timestamp": ...,
          "title": "Workday SOAP API - Download event document",
          "content": "I'm using Launch Integration which produces a PDF...",
          "upvotes": 0,
          "downvotes": 0,
          "children": [
          {
          "author": "delta",
          "timestamp": ...,
          "title": "",
          "content": "We were able to achieve this by using...",
          "upvotes": 0,
          "downvotes": 0,
          "children": [
          ...
          ]
          },
          ...
          ]
          }
          }
    • Microservice runs in Azure Functions as a FastAPI service
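The nested output shape shown in both examples above could be modeled as a small recursive data structure. The following is only a sketch under assumptions: the class name `Post` and the helper `to_response` are hypothetical, and the `timestamp` type is assumed to be an optional string because the examples elide its format.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass
class Post:
    """One node of a thread: the root post/question or a reply/answer (hypothetical name)."""

    author: str
    content: str
    # The bounty elides the timestamp format; an optional string is an assumption.
    timestamp: Optional[str] = None
    title: str = ""  # replies carry an empty title in the examples
    upvotes: int = 0
    downvotes: int = 0
    children: list[Post] = field(default_factory=list)

    def to_dict(self) -> dict[str, Any]:
        # Emit keys in the same order as the example output, recursing into replies.
        return {
            "author": self.author,
            "timestamp": self.timestamp,
            "title": self.title,
            "content": self.content,
            "upvotes": self.upvotes,
            "downvotes": self.downvotes,
            "children": [child.to_dict() for child in self.children],
        }


def to_response(root: Post) -> dict[str, Any]:
    """Wrap the root node in the top-level "data" key used by the examples."""
    return {"data": root.to_dict()}
```

Using one node type for both the root and its replies keeps the Reddit and StackOverflow extractors interchangeable from the caller's point of view, since both would return the same tree.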

    Technical Details

    • Code should be delivered in specific format:

      • extractor/

        • extractor/infra.tf - Terraform HCL for provisioning Azure Function that will run the microservice

        • extractor/extract.py - Python code that runs at the endpoint to extract text from the input artifact

        • extractor/api.py - FastAPI app that serves the GET /extract endpoint, with url supplied as a URL parameter

    • Terraform code should be self-sufficient:

      • After logging into Azure via az and supplying azure_client_id, azure_client_secret, and azure_subscription_id as Terraform variables, terraform apply should work
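Because the service must accept any website URL but traverse Reddit and StackOverflow more deeply, the code behind GET /extract might first validate the url parameter and decide which extraction path applies. The sketch below uses only the standard library; the function name `pick_extractor`, the mapping `SITE_EXTRACTORS`, and the "generic" fallback label are all hypothetical and are not specified by the bounty.

```python
from urllib.parse import urlparse

# Hypothetical extractor labels; the bounty names the sites but not the handlers.
SITE_EXTRACTORS = {
    "reddit.com": "reddit",
    "stackoverflow.com": "stackoverflow",
}


def pick_extractor(url: str) -> str:
    """Return the label of the extractor to use for `url`.

    Raises ValueError for anything that is not an absolute http(s) URL.
    """
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https") or not parsed.hostname:
        raise ValueError(f"not an absolute http(s) URL: {url!r}")

    host = parsed.hostname.lower()
    for domain, label in SITE_EXTRACTORS.items():
        # Match the bare domain or any of its subdomains (www., old., ...).
        if host == domain or host.endswith("." + domain):
            return label

    # Any other site falls back to plain page-text extraction.
    return "generic"
```

A handler in extractor/api.py could map the ValueError to a 4xx response and pass the returned label on to extractor/extract.py, though that wiring is left open here.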

    Timelines / Milestones

    We would like to complete this project by 11/17/23. There is an additional $50 USD bonus for completing by 11/13/23.