Earn 18,000 ($180.00)
Reddit/StackOverflow scraper (Azure microservice)
Bounty Description
Problem Description
We are building an LLM for enterprise software and are looking to decouple data extraction from the rest of our pipeline. Our model uses data from a variety of sources, including URLs, so we need a microservice that reliably extracts text from a website given its URL. The service should run playwright for traversing and extracting data from websites.
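As a rough illustration of the extraction step, the sketch below loads a page in headless Chromium through Playwright's synchronous Python API and returns its visible text. The function names (extract_text, normalize_text), the timeout, and the choice to read the whole body element are assumptions made for this example, not requirements of the bounty.

```python
import re


def normalize_text(raw: str) -> str:
    """Collapse runs of whitespace so the extracted text is compact."""
    return re.sub(r"\s+", " ", raw).strip()


def extract_text(url: str, timeout_ms: int = 30_000) -> str:
    """Load `url` in headless Chromium and return the visible body text.

    Hypothetical helper: the name, signature, and timeout are assumptions.
    """
    # Imported lazily so this module can be loaded (and normalize_text used)
    # on machines where Playwright and its browsers are not installed.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        try:
            page = browser.new_page()
            page.goto(url, wait_until="domcontentloaded", timeout=timeout_ms)
            # inner_text returns the rendered text of the matched element.
            return normalize_text(page.inner_text("body"))
        finally:
            browser.close()
```

Playwright's synchronous API cannot be used from inside a running asyncio event loop, so an async FastAPI handler would likely need playwright.async_api instead, or would have to run this function in a worker thread.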
Before accepting this bounty, please note the following:
- You will need an Azure subscription to complete this bounty as the service should run in Azure. We will not be providing an Azure subscription but will test the deliverable using one.
- One of the more challenging parts of the project is getting playwright to run successfully in Azure Functions directly from the Terraform code.
Acceptance Criteria
- Microservice uses playwright to extract text from any website (via URL parameter) and maximally extracts text by traversing the following websites:
  - Given this Reddit URL, the service should produce output of the form:
    {"data": {"author": "Straight_Hat_3398", "timestamp": ..., "title": "Calculated Field - Replace a Name with Area Name", "content": "Hi Everyone! Is there a way to replace a name...", "upvotes": 5, "downvotes": 0, "children": [{"author": "jonthecpa", "timestamp": ..., "title": "", "content": "Seems like you could just pull the different...", "upvotes": 2, "downvotes": 0, "children": [...]}, ...]}}
  - Given this StackOverflow URL, the service should produce output of the form:
    {"data": {"author": "delta", "timestamp": ..., "title": "Workday SOAP API - Download event document", "content": "I'm using Launch Integration which produces a PDF...", "upvotes": 0, "downvotes": 0, "children": [{"author": "delta", "timestamp": ..., "title": "", "content": "We were able to achieve this by using...", "upvotes": 0, "downvotes": 0, "children": [...]}, ...]}}
- Microservice runs in Azure Functions as a FastAPI service
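The nested output shown in the acceptance criteria can be modelled as a small recursive record. The sketch below is one possible way to do that with the standard library. The class name Node and the helper to_response are hypothetical, and the timestamp is typed as an optional string only because the bounty does not specify its format.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List, Optional


@dataclass
class Node:
    """One post, answer, or comment. Field order mirrors the sample output."""
    author: str
    timestamp: Optional[str]  # format is not specified in the bounty (assumption)
    title: str                # empty string for replies, as in the samples
    content: str
    upvotes: int = 0
    downvotes: int = 0
    children: List["Node"] = field(default_factory=list)


def to_response(root: Node) -> str:
    """Wrap the tree in the {"data": ...} envelope and serialize it to JSON."""
    # asdict converts nested dataclasses (including those in lists) to dicts.
    return json.dumps({"data": asdict(root)})
```

A thread with one reply would then be built as Node("op", None, "Title", "body", 5, 0, [Node("replier", None, "", "reply", 2, 0)]) and passed to to_response.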
Technical Details
- Code should be delivered in a specific format:
  - extractor/
    - extractor/infra.tf - Terraform HCL for provisioning the Azure Function that will run the microservice
    - extractor/extract.py - Python that runs at the endpoint to extract text from the input artifact
    - extractor/api.py - FastAPI endpoint to serve GET /extract, with url supplied as a URL parameter
- Terraform code should be self-sufficient:
  - After logging in to Azure via az and entering azure_client_id, azure_client_secret, and azure_subscription_id as Terraform variables, terraform apply should work
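Since the url parameter of GET /extract may point at Reddit, StackOverflow, or any other site, extractor/extract.py will probably need to validate the URL and decide which traversal logic to apply. The sketch below shows one way to do that with the standard library only. The function name pick_extractor and the labels it returns are hypothetical, chosen for this example.

```python
from urllib.parse import urlparse


def pick_extractor(url: str) -> str:
    """Return a label for the traversal strategy that fits `url`.

    Hypothetical helper: the labels "reddit", "stackoverflow", and "generic"
    are assumptions, not names required by the bounty.
    """
    parsed = urlparse(url)
    # Reject anything that is not a plain web URL with a hostname.
    if parsed.scheme not in ("http", "https") or not parsed.hostname:
        raise ValueError(f"unsupported URL: {url!r}")

    host = parsed.hostname.lower()
    # Match the bare domain and any subdomain (www., old., etc.).
    if host == "reddit.com" or host.endswith(".reddit.com"):
        return "reddit"
    if host == "stackoverflow.com" or host.endswith(".stackoverflow.com"):
        return "stackoverflow"
    # Any other site falls back to plain text extraction.
    return "generic"
```

Matching on the parsed hostname rather than on a substring of the raw URL avoids misclassifying addresses such as https://example.com/?q=reddit.com.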
Timelines / Milestones
We would like to complete this project by 11/17/23. There is an additional $50 USD bonus for completing by 11/13/23.