Bounty: 50,000 ($500.00) · due 4 months ago · Canceled
Python Developer to Build RAG Pipeline on Top of LightRAG (GraphRAG Library)
YigitKonur
Bounty Description
Problem Description
We need a sophisticated RAG (Retrieval Augmented Generation) pipeline that processes documents in a project-based structure. The system should be entirely event-driven, triggering automatically when new documents are added to S3. It should handle markdown content directly and convert other file types (particularly PDFs, using Microsoft's Markitdown) to markdown before processing. Additionally, a REST API is required to manage projects and documents and to interact with the RAG pipeline.
Acceptance Criteria
- Build an event-driven RAG pipeline that (see the Lambda handler sketch after this list):
  - Monitors the S3 bucket for new documents following the pattern s3://bucket/project-id/document-id.md
  - Converts non-markdown files (PDFs) to markdown using Markitdown
  - Processes the markdown content through the RAG pipeline
  - Uses Zilliz (Milvus) as the vector database
  - Organizes embeddings by project-id (similar to Claude projects/CustomGPT)
  - Handles unlimited documents per project
  - Implements proper error handling and logging
  - Provides monitoring and observability
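A minimal sketch of what the S3-triggered entry point could look like on AWS Lambda, assuming the key layout above. `convert_to_markdown` and `process_markdown` are hypothetical helpers, not library calls: the first is sketched under Integration Points below, and the second stands in for the chosen RAG framework's ingestion path.

```python
import logging
import urllib.parse

import boto3

logger = logging.getLogger(__name__)
logger.setLevel(logging.INFO)
s3 = boto3.client("s3")


def lambda_handler(event, context):
    """Triggered by S3 ObjectCreated events; routes each new object into the pipeline."""
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])

        # Expected key layout: <project-id>/<document-id>.<ext>
        project_id, _, filename = key.partition("/")
        document_id, _, ext = filename.rpartition(".")

        local_path = f"/tmp/{filename}"
        s3.download_file(bucket, key, local_path)

        try:
            if ext.lower() == "md":
                with open(local_path, encoding="utf-8") as f:
                    markdown = f.read()
            else:
                # Non-markdown files (e.g. PDFs) go through Markitdown first.
                markdown = convert_to_markdown(local_path)  # hypothetical helper, sketched below

            process_markdown(project_id, document_id, markdown)  # hypothetical hand-off to the RAG framework
            logger.info("Processed %s/%s", project_id, document_id)
        except Exception:
            logger.exception("Failed to process %s", key)
            raise  # let Lambda retry or route the event to a DLQ
```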
- Develop a REST API with the following endpoints (a FastAPI sketch follows this list):
  - POST /projects: Create a new project, plus document endpoints for single and bulk entry
  - DELETE /projects/{project_id}: Delete an existing project
  - GET /projects/{project_id}/documents: List all documents in a project (including RAG processing status)
  - DELETE /projects/{project_id}/documents/{document_id}: Delete a document from a project
  - POST /projects/{project_id}/query: Query the RAG pipeline (using Neo4j or a similar graph database with LightRAG/Nanograph)
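A sketch of that endpoint surface using FastAPI, which is one plausible choice (the bounty only requires a REST API hosted on Lambda or Cloud Functions). The in-memory dicts and the `run_rag_query` stub are placeholders for a real persistence layer and the chosen RAG framework; a bulk document endpoint would mirror the single-entry one.

```python
from uuid import uuid4

from fastapi import FastAPI, HTTPException
from pydantic import BaseModel

app = FastAPI(title="RAG pipeline API")

# In-memory stand-ins for a real persistence layer (DynamoDB, Postgres, ...).
PROJECTS: dict[str, dict] = {}
DOCUMENTS: dict[str, dict[str, dict]] = {}


class ProjectCreate(BaseModel):
    name: str


class DocumentCreate(BaseModel):
    s3_key: str


class QueryRequest(BaseModel):
    question: str
    mode: str = "hybrid"  # LightRAG query modes include naive/local/global/hybrid


def run_rag_query(project_id: str, question: str, mode: str) -> str:
    """Hypothetical hook into the per-project RAG instance (see the LightRAG sketch below)."""
    raise NotImplementedError


@app.post("/projects")
def create_project(body: ProjectCreate):
    project_id = str(uuid4())
    PROJECTS[project_id] = {"name": body.name}
    DOCUMENTS[project_id] = {}
    return {"project_id": project_id, "name": body.name}


@app.post("/projects/{project_id}/documents")
def add_document(project_id: str, body: DocumentCreate):
    # A bulk variant would accept a list of DocumentCreate payloads and loop here.
    if project_id not in DOCUMENTS:
        raise HTTPException(status_code=404, detail="project not found")
    document_id = str(uuid4())
    DOCUMENTS[project_id][document_id] = {"s3_key": body.s3_key, "status": "pending"}
    return {"document_id": document_id}


@app.delete("/projects/{project_id}")
def delete_project(project_id: str):
    PROJECTS.pop(project_id, None)
    DOCUMENTS.pop(project_id, None)
    return {"deleted": project_id}


@app.get("/projects/{project_id}/documents")
def list_documents(project_id: str):
    if project_id not in DOCUMENTS:
        raise HTTPException(status_code=404, detail="project not found")
    # Status would be updated by the S3-triggered worker: pending -> processing -> done/failed.
    return DOCUMENTS[project_id]


@app.delete("/projects/{project_id}/documents/{document_id}")
def delete_document(project_id: str, document_id: str):
    DOCUMENTS.get(project_id, {}).pop(document_id, None)
    return {"deleted": document_id}


@app.post("/projects/{project_id}/query")
def query_project(project_id: str, body: QueryRequest):
    if project_id not in PROJECTS:
        raise HTTPException(status_code=404, detail="project not found")
    return {"answer": run_rag_query(project_id, body.question, body.mode)}
```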
Technical Details
- Infrastructure (a LightRAG wiring sketch follows this list):
  - S3 for document storage
  - Zilliz/Milvus for vector storage
  - AWS Lambda or Google Cloud Functions for event processing and API endpoints
  - Choice of RAG framework (Kotaemon/LightRAG/NanographRAG)
  - Neo4j or an alternative graph database for the knowledge graph
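A sketch of how a per-project LightRAG instance might be wired, using the insert/query calls from LightRAG's README. The working-directory layout and the `gpt_4o_mini_complete` model function are assumptions (its import path has moved between LightRAG releases), and pointing the vector and graph stores at Zilliz/Milvus and Neo4j is done through LightRAG's storage configuration, which is version-dependent and omitted here.

```python
from lightrag import LightRAG, QueryParam
from lightrag.llm import gpt_4o_mini_complete  # import path varies between LightRAG versions


def rag_for_project(project_id: str) -> LightRAG:
    """One LightRAG instance per project keeps embeddings and graph data isolated."""
    return LightRAG(
        working_dir=f"/tmp/rag/{project_id}",  # assumed layout; Lambda only offers /tmp for local state
        llm_model_func=gpt_4o_mini_complete,
    )


def index_document(project_id: str, markdown: str) -> None:
    # LightRAG performs chunking, embedding, and entity/relation extraction on insert.
    rag_for_project(project_id).insert(markdown)


def answer_query(project_id: str, question: str, mode: str = "hybrid") -> str:
    # Query modes include naive, local, global, and hybrid (graph-aware retrieval).
    return rag_for_project(project_id).query(question, param=QueryParam(mode=mode))
```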
- Required Features:
  - Event-driven architecture
  - PDF to Markdown conversion
  - Document chunking and embedding
  - Vector storage and retrieval
  - Project-based organization
  - Scalable processing pipeline
  - REST API with the specified endpoints
  - Authentication and authorization for API access (see the API-key sketch after this list)
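The bounty doesn't name an auth scheme; a minimal API-key guard for the FastAPI sketch above could look like this, where the `X-API-Key` header name and the `RAG_API_KEY` environment variable are assumptions.

```python
import os

from fastapi import Depends, FastAPI, HTTPException, Security
from fastapi.security import APIKeyHeader

api_key_header = APIKeyHeader(name="X-API-Key", auto_error=False)


def require_api_key(api_key: str | None = Security(api_key_header)) -> str:
    """Reject requests whose X-API-Key header does not match the configured key."""
    expected = os.environ.get("RAG_API_KEY")  # assumed environment variable
    if not expected or api_key != expected:
        raise HTTPException(status_code=401, detail="invalid or missing API key")
    return api_key


# Applied globally when constructing the app, so every route requires the key:
app = FastAPI(dependencies=[Depends(require_api_key)])
```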
- Integration Points:
  - S3 event triggers
  - Markitdown PDF processor (see the conversion sketch after this list)
  - Vector database operations
  - RAG framework implementation
  - Graph database interaction
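For the Markitdown integration point, conversion uses the library's documented `convert()` call; `convert_to_markdown` is the same hypothetical helper name assumed in the Lambda sketch above.

```python
from markitdown import MarkItDown

_converter = MarkItDown()


def convert_to_markdown(local_path: str) -> str:
    """Convert a downloaded file (e.g. a PDF) to markdown text for the RAG pipeline."""
    result = _converter.convert(local_path)
    return result.text_content
```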
Required Skills:
- Strong understanding of RAG architectures
- Experience with event-driven systems
- Familiarity with vector databases and graph databases
- Cloud infrastructure expertise
- Document processing experience
- API design and development skills