Earn 13,500 ($135.00)
due 1 year ago
Canceled
PDF parsing into structured chunks for RAG
aaravindan101
Bounty Description
Problem Description
I am trying to parse PDFs and upload them to a vector store together with additional context, so that retrieval finds the most accurate and complete chunks for a given section. The retrieved chunks will be used for RAG with an LLM.
Acceptance Criteria
Efficiently parse medical PDF files into headings and subheadings, including their text, images, and tables, for upload to a vector store that supports accurate retrieval.
Example PDFs include:
https://storage.googleapis.com/ctgov2-large-docs/99/NCT05993299/Prot_000.pdf
https://storage.googleapis.com/ctgov2-large-docs/28/NCT04024228/Prot_000.pdf
https://storage.googleapis.com/ctgov2-large-docs/72/NCT04625972/Prot_004.pdf
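To make the heading/subheading requirement concrete, here is a minimal sketch of splitting already-extracted text lines into section chunks. It assumes the protocols use numbered headings such as "5.1 Inclusion Criteria"; that pattern, the `Chunk` class, and `chunk_by_headings` are all hypothetical names for illustration. Text extraction from the PDF itself, and image and table handling, are not shown.

```python
import re
from dataclasses import dataclass, field

# Assumption: headings look like "5", "5.1" or "5.1.2" followed by a
# capitalised title. Real protocols may need a different pattern, and body
# lines that happen to match this shape would be misread as headings.
HEADING_RE = re.compile(r"^(\d+(?:\.\d+)*)\.?\s+([A-Z][^\n]{0,100})$")


@dataclass
class Chunk:
    number: str                      # section number, e.g. "5.1"
    title: str                       # heading text, e.g. "Inclusion Criteria"
    path: list                       # titles from the top-level section down
    lines: list = field(default_factory=list)

    @property
    def text(self) -> str:
        return "\n".join(self.lines)


def chunk_by_headings(lines):
    """Group body lines under the most recent numbered heading."""
    chunks = []
    titles = {}      # section number -> title, used to build each path
    current = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        match = HEADING_RE.match(line)
        if match:
            number, title = match.group(1), match.group(2).strip()
            titles[number] = title
            parts = number.split(".")
            prefixes = [".".join(parts[:i]) for i in range(1, len(parts) + 1)]
            path = [titles[p] for p in prefixes if p in titles]
            current = Chunk(number, title, path)
            chunks.append(current)
        elif current is not None:
            current.lines.append(line)
        # Lines before the first heading are dropped in this sketch.
    return chunks
```

Storing the full heading path with each chunk is one way to carry the "additional context" into the vector store, so a chunk under "Inclusion Criteria" still knows it belongs to its parent section.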
Parse and load the PDFs into a vector store (e.g., Weaviate), then retrieve the relevant chunks for section-level queries such as:
- What are the Inclusion/Exclusion Criteria?
- What is the Study Design?
- What are the Objectives (Primary/Secondary/etc)?
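The kind of section-level lookup described above can be illustrated with a small stand-in for the vector store. This sketch does not use Weaviate or a real embedding model: it scores chunks with a bag-of-words cosine similarity, purely to show how prepending the heading path to each chunk helps a query land on the right section. The example chunks, the stopword list, and the function names are made up for illustration.

```python
import math
import re
from collections import Counter

# Small illustrative stopword list, not a complete one.
STOPWORDS = {"what", "is", "are", "the", "a", "an", "of", "to"}


def tokenize(text):
    """Lowercase, split on non-alphanumerics, and drop stopwords."""
    return [t for t in re.findall(r"[a-z0-9]+", text.lower()) if t not in STOPWORDS]


def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(count * b[term] for term, count in a.items() if term in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_chunks(query, chunks, k=1):
    """Return the k chunks whose heading plus text best match the query."""
    q = Counter(tokenize(query))
    scored = [
        (cosine(q, Counter(tokenize(c["heading"] + " " + c["text"]))), c)
        for c in chunks
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [c for _, c in scored[:k]]


# Made-up chunks standing in for parsed protocol sections.
chunks = [
    {"heading": "Study Population > Inclusion Criteria",
     "text": "Participants must be 18 years of age or older."},
    {"heading": "Study Design",
     "text": "This is a randomized, double-blind, placebo-controlled trial."},
    {"heading": "Objectives > Primary Objective",
     "text": "To evaluate the efficacy of the investigational product."},
]

best = top_chunks("What is the Study Design?", chunks)[0]
print(best["heading"])
```

In a real solution the `cosine` scoring would be replaced by the vector store's own similarity search over embeddings; the point here is only that the heading travels with the chunk text.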