Bounty: 50,000 ($500.00) · due 4 months ago · Canceled
Python Developer to Build RAG Pipeline on Top of LightRAG (GraphRAG Library)
YigitKonur
Bounty Description
Problem Description
We need a sophisticated RAG (Retrieval Augmented Generation) pipeline that processes documents in a project-based structure. The system should be entirely event-driven, triggering automatically when new documents are added to S3. It should handle markdown content directly and convert other file types (particularly PDFs, using Microsoft's Markitdown) to markdown before processing. Additionally, a REST API is required to manage projects and documents and to interact with the RAG pipeline.
Acceptance Criteria
- Build an event-driven RAG pipeline that (see the Lambda handler sketch after this list):
  - Monitors the S3 bucket for new documents following the pattern s3://bucket/project-id/document-id.md
  - Converts non-markdown files (PDFs) to markdown using Markitdown
  - Processes the markdown content through the RAG pipeline
  - Uses Zilliz (Milvus) as the vector database
  - Organizes embeddings by project-id (similar to Claude projects/CustomGPT)
  - Handles unlimited documents per project
  - Implements proper error handling and logging
  - Provides monitoring and observability
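A minimal sketch of what the S3-triggered entry point could look like on AWS Lambda, assuming the key layout above. `convert_to_markdown` and `process_markdown` are hypothetical helpers, not library calls: the first is sketched under Integration Points below, and the second stands in for the chosen RAG framework's ingestion path.

```python
import logging
import urllib.parse

import boto3

logger = logging.getLogger(__name__)
logger.setLevel(logging.INFO)
s3 = boto3.client("s3")


def lambda_handler(event, context):
    """Triggered by S3 ObjectCreated events; routes each new object into the pipeline."""
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])

        # Expected key layout: <project-id>/<document-id>.<ext>
        project_id, _, filename = key.partition("/")
        document_id, _, ext = filename.rpartition(".")

        local_path = f"/tmp/{filename}"
        s3.download_file(bucket, key, local_path)

        try:
            if ext.lower() == "md":
                with open(local_path, encoding="utf-8") as f:
                    markdown = f.read()
            else:
                # Non-markdown files (e.g. PDFs) go through Markitdown first.
                markdown = convert_to_markdown(local_path)  # hypothetical helper, sketched below

            process_markdown(project_id, document_id, markdown)  # hypothetical hand-off to the RAG framework
            logger.info("Processed %s/%s", project_id, document_id)
        except Exception:
            logger.exception("Failed to process %s", key)
            raise  # let Lambda retry or route the event to a DLQ
```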
- Develop a REST API with the following endpoints (a FastAPI sketch follows this list):
  - POST /projects: Create a new project, plus document endpoints for single and bulk entry
  - DELETE /projects/{project_id}: Delete an existing project
  - GET /projects/{project_id}/documents: List all documents in a project (including RAG processing status)
  - DELETE /projects/{project_id}/documents/{document_id}: Delete a document from a project
  - POST /projects/{project_id}/query: Query the RAG pipeline (using Neo4j or a similar graph database with LightRAG/Nanograph)
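A sketch of that endpoint surface using FastAPI, which is one plausible choice (the bounty only requires a REST API hosted on Lambda or Cloud Functions). The in-memory dicts and the `run_rag_query` stub are placeholders for a real persistence layer and the chosen RAG framework; a bulk document endpoint would mirror the single-entry one.

```python
from uuid import uuid4

from fastapi import FastAPI, HTTPException
from pydantic import BaseModel

app = FastAPI(title="RAG pipeline API")

# In-memory stand-ins for a real persistence layer (DynamoDB, Postgres, ...).
PROJECTS: dict[str, dict] = {}
DOCUMENTS: dict[str, dict[str, dict]] = {}


class ProjectCreate(BaseModel):
    name: str


class DocumentCreate(BaseModel):
    s3_key: str


class QueryRequest(BaseModel):
    question: str
    mode: str = "hybrid"  # LightRAG query modes include naive/local/global/hybrid


def run_rag_query(project_id: str, question: str, mode: str) -> str:
    """Hypothetical hook into the per-project RAG instance (see the LightRAG sketch below)."""
    raise NotImplementedError


@app.post("/projects")
def create_project(body: ProjectCreate):
    project_id = str(uuid4())
    PROJECTS[project_id] = {"name": body.name}
    DOCUMENTS[project_id] = {}
    return {"project_id": project_id, "name": body.name}


@app.post("/projects/{project_id}/documents")
def add_document(project_id: str, body: DocumentCreate):
    # A bulk variant would accept a list of DocumentCreate payloads and loop here.
    if project_id not in DOCUMENTS:
        raise HTTPException(status_code=404, detail="project not found")
    document_id = str(uuid4())
    DOCUMENTS[project_id][document_id] = {"s3_key": body.s3_key, "status": "pending"}
    return {"document_id": document_id}


@app.delete("/projects/{project_id}")
def delete_project(project_id: str):
    PROJECTS.pop(project_id, None)
    DOCUMENTS.pop(project_id, None)
    return {"deleted": project_id}


@app.get("/projects/{project_id}/documents")
def list_documents(project_id: str):
    if project_id not in DOCUMENTS:
        raise HTTPException(status_code=404, detail="project not found")
    # Status would be updated by the S3-triggered worker: pending -> processing -> done/failed.
    return DOCUMENTS[project_id]


@app.delete("/projects/{project_id}/documents/{document_id}")
def delete_document(project_id: str, document_id: str):
    DOCUMENTS.get(project_id, {}).pop(document_id, None)
    return {"deleted": document_id}


@app.post("/projects/{project_id}/query")
def query_project(project_id: str, body: QueryRequest):
    if project_id not in PROJECTS:
        raise HTTPException(status_code=404, detail="project not found")
    return {"answer": run_rag_query(project_id, body.question, body.mode)}
```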
Technical Details
- Infrastructure (a LightRAG wiring sketch follows this list):
  - S3 for document storage
  - Zilliz/Milvus for vector storage
  - AWS Lambda or Google Cloud Functions for event processing and API endpoints
  - Choice of RAG framework (Kotaemon/LightRAG/NanographRAG)
  - Neo4j or an alternative graph database for the knowledge graph
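A sketch of how a per-project LightRAG instance might be wired, using the insert/query calls from LightRAG's README. The working-directory layout and the `gpt_4o_mini_complete` model function are assumptions (its import path has moved between LightRAG releases), and pointing the vector and graph stores at Zilliz/Milvus and Neo4j is done through LightRAG's storage configuration, which is version-dependent and omitted here.

```python
from lightrag import LightRAG, QueryParam
from lightrag.llm import gpt_4o_mini_complete  # import path varies between LightRAG versions


def rag_for_project(project_id: str) -> LightRAG:
    """One LightRAG instance per project keeps embeddings and graph data isolated."""
    return LightRAG(
        working_dir=f"/tmp/rag/{project_id}",  # assumed layout; Lambda only offers /tmp for local state
        llm_model_func=gpt_4o_mini_complete,
    )


def index_document(project_id: str, markdown: str) -> None:
    # LightRAG performs chunking, embedding, and entity/relation extraction on insert.
    rag_for_project(project_id).insert(markdown)


def answer_query(project_id: str, question: str, mode: str = "hybrid") -> str:
    # Query modes include naive, local, global, and hybrid (graph-aware retrieval).
    return rag_for_project(project_id).query(question, param=QueryParam(mode=mode))
```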
- Required Features:
  - Event-driven architecture
  - PDF to Markdown conversion
  - Document chunking and embedding
  - Vector storage and retrieval
  - Project-based organization
  - Scalable processing pipeline
  - REST API with the specified endpoints
  - Authentication and authorization for API access (see the API-key sketch after this list)
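The bounty doesn't name an auth scheme; a minimal API-key guard for the FastAPI sketch above could look like this, where the `X-API-Key` header name and the `RAG_API_KEY` environment variable are assumptions.

```python
import os

from fastapi import Depends, FastAPI, HTTPException, Security
from fastapi.security import APIKeyHeader

api_key_header = APIKeyHeader(name="X-API-Key", auto_error=False)


def require_api_key(api_key: str | None = Security(api_key_header)) -> str:
    """Reject requests whose X-API-Key header does not match the configured key."""
    expected = os.environ.get("RAG_API_KEY")  # assumed environment variable
    if not expected or api_key != expected:
        raise HTTPException(status_code=401, detail="invalid or missing API key")
    return api_key


# Applied globally when constructing the app, so every route requires the key:
app = FastAPI(dependencies=[Depends(require_api_key)])
```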
- Integration Points:
  - S3 event triggers
  - Markitdown PDF processor (see the conversion sketch after this list)
  - Vector database operations
  - RAG framework implementation
  - Graph database interaction
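For the Markitdown integration point, conversion uses the library's documented `convert()` call; `convert_to_markdown` is the same hypothetical helper name assumed in the Lambda sketch above.

```python
from markitdown import MarkItDown

_converter = MarkItDown()


def convert_to_markdown(local_path: str) -> str:
    """Convert a downloaded file (e.g. a PDF) to markdown text for the RAG pipeline."""
    result = _converter.convert(local_path)
    return result.text_content
```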
Required Skills:
- Strong understanding of RAG architectures
- Experience with event-driven systems
- Familiarity with vector databases and graph databases
- Cloud infrastructure expertise
- Document processing experience
- API design and development skills