Earn 39,001 ($390.01)
due 6 months ago
Completed
Rust Developer to Use Spider.rs Library + Actix
YigitKonur
Bounty Description
Problem Description
We need a high-performance web crawler built in Rust that continuously scrapes specified websites and stores the results as markdown files in S3. The crawler should use the spider.rs library's smart mode for efficient content extraction based on predefined rules and metadata, extract both HTML and markdown, and write the output to S3 with a TTL.
Acceptance Criteria
- Build a web crawler using the spider.rs library that:
- Accepts configuration for start URLs and matching rules
- Uses smart mode for intelligent content extraction
- Converts scraped content to markdown format
- Continuously writes results to S3
- Can be deployed on Cloud Run
- Handles rate limiting and respects robots.txt
- Provides logging and error handling
- Is configurable via environment variables or config files
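The criteria above can be sketched as a minimal crawl loop. This is a sketch only: the `spider` crate calls used here (`with_respect_robots_txt`, `with_delay`, `scrape`, `get_pages`) are assumptions to be verified against the crate version pinned in Cargo.toml, smart mode is assumed to come from the crate's `smart` feature flag, and the markdown conversion and S3 upload are left as stubs:

```rust
// Sketch, not a drop-in implementation: verify every builder call
// against the spider crate version you actually pin.
use spider::tokio;
use spider::website::Website;

#[tokio::main]
async fn main() {
    let mut website = Website::new("https://example.com") // start URL (placeholder)
        .with_respect_robots_txt(true) // honour robots.txt
        .with_delay(250)               // ms between requests (basic rate limiting)
        .build()
        .expect("valid crawl configuration");

    // `scrape` retains page bodies, unlike `crawl`, which only walks links.
    website.scrape().await;

    if let Some(pages) = website.get_pages() {
        for page in pages.iter() {
            // Here: convert the HTML to markdown and upload to S3
            // (e.g. via the aws-sdk-s3 crate); omitted in this sketch.
            eprintln!("fetched {}", page.get_url());
        }
    }
}
```

For continuous operation, this body would run inside a loop with a `tokio::time::sleep` for the configured crawl interval, with an Actix endpoint alongside it for health checks on Cloud Run.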
Technical Details
- Must be written in Rust
- Required libraries:
  - spider.rs for web crawling / markdown conversion
  - Actix for the Rust API
  - AWS SDK for S3 integration
- Infrastructure:
  - Deployable on Google Cloud Run
  - Uses S3-compatible storage
- Configuration:
  - Ability to define start URLs
  - Pattern-matching rules for URL filtering
  - Customizable crawl intervals
  - S3 credentials and bucket configuration
  - Rate-limiting parameters
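This configuration surface can be captured in a plain struct populated from environment variables. A minimal standard-library-only sketch; the variable names (`START_URLS`, `S3_BUCKET`, etc.) and defaults are illustrative, not mandated by the bounty:

```rust
use std::env;

/// Crawler settings mirroring the configuration list above.
/// All names and defaults here are illustrative assumptions.
#[derive(Debug)]
struct CrawlConfig {
    start_urls: Vec<String>,  // comma-separated in START_URLS
    url_pattern: String,      // pattern for URL filtering
    crawl_interval_secs: u64, // pause between full crawl passes
    s3_bucket: String,        // target bucket for markdown output
    rate_limit_per_sec: u32,  // max requests per second
}

impl CrawlConfig {
    /// Build from any key -> value lookup so the same parsing code
    /// serves both real environment variables and unit tests.
    fn from_lookup(get: impl Fn(&str) -> Option<String>) -> Result<Self, String> {
        let required = |k: &str| get(k).ok_or_else(|| format!("missing env var {k}"));
        Ok(Self {
            start_urls: required("START_URLS")?
                .split(',')
                .map(|s| s.trim().to_string())
                .filter(|s| !s.is_empty())
                .collect(),
            url_pattern: get("URL_PATTERN").unwrap_or_default(),
            crawl_interval_secs: get("CRAWL_INTERVAL_SECS")
                .and_then(|v| v.parse().ok())
                .unwrap_or(3600), // default: one pass per hour
            s3_bucket: required("S3_BUCKET")?,
            rate_limit_per_sec: get("RATE_LIMIT_PER_SEC")
                .and_then(|v| v.parse().ok())
                .unwrap_or(2), // conservative default
        })
    }

    fn from_env() -> Result<Self, String> {
        Self::from_lookup(|k| env::var(k).ok())
    }
}
```

S3 credentials themselves are best left to the AWS SDK's default credential chain rather than parsed by hand; failing fast on a missing required variable keeps Cloud Run deployments from starting in a half-configured state.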
Please don't apply if you have never written a line of Rust before. I could produce that with Cursor myself; I need a production-ready solution.