Earn 13,500 ($135.00)

due 1 year ago

Canceled

PDF parsing into structured chunks for RAG

aaravindan101

Posted 2 years ago

Bounty Description

Problem Description

I am trying to parse PDFs and upload them to a vector store with additional context attached to each chunk, so that retrieval returns the most accurate and complete chunks for a given section when they are used for RAG with an LLM.

Acceptance Criteria

Efficiently parse medical PDF files into headings and subheadings, including their text, images, and tables, so they can be uploaded to a vector store and retrieved accurately.
Example PDFs include:
https://storage.googleapis.com/ctgov2-large-docs/99/NCT05993299/Prot_000.pdf
https://storage.googleapis.com/ctgov2-large-docs/28/NCT04024228/Prot_000.pdf
https://storage.googleapis.com/ctgov2-large-docs/72/NCT04625972/Prot_004.pdf
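To make the heading/subheading requirement concrete, here is a minimal sketch of one possible approach, not a definitive solution. It assumes the PDF text has already been extracted into plain lines by some PDF library (not shown), and that section headings are numbered like "5.1 Inclusion Criteria". The regex, the `Chunk` class, and `chunk_by_headings` are hypothetical names for illustration only; images and tables are not handled here.

```python
import re
from dataclasses import dataclass, field
from typing import List

# Assumption: headings are a dotted section number followed by a capitalized
# title, e.g. "5 Study Population" or "5.1 Inclusion Criteria".
# List items such as "1. Age ..." are deliberately not matched.
HEADING_RE = re.compile(r"^(\d+(?:\.\d+)*)\s+([A-Z].*)$")


@dataclass
class Chunk:
    path: List[str]                      # heading trail from top level down
    lines: List[str] = field(default_factory=list)

    @property
    def text(self) -> str:
        return "\n".join(self.lines)


def chunk_by_headings(lines: List[str]) -> List[Chunk]:
    """Group body lines under the nearest preceding numbered heading."""
    chunks: List[Chunk] = []
    stack = []        # (depth, heading title) for the currently open sections
    current = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        match = HEADING_RE.match(line)
        if match:
            depth = match.group(1).count(".") + 1
            # Close any open sections at the same or a deeper level.
            while stack and stack[-1][0] >= depth:
                stack.pop()
            stack.append((depth, f"{match.group(1)} {match.group(2)}"))
            current = Chunk(path=[title for _, title in stack])
            chunks.append(current)
        elif current is not None:
            current.lines.append(line)
    return chunks
```

Each chunk carries its full heading trail in `path`, which can be stored alongside the text as the "additional context" for retrieval. Real protocol PDFs will likely need a more robust heading detector (font size, table of contents, or layout cues) than this regex.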

Parse and load the PDFs into a vector store (e.g., Weaviate) and retrieve the relevant chunks for section-level queries such as:

What are the Inclusion/Exclusion Criteria?
What is the Study Design?
What are the Objectives (Primary, Secondary, etc.)?
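
The queries above could be checked against parsed chunks with something like the following sketch. It is a stand-in, not a Weaviate integration: it uses a plain bag-of-words cosine similarity from the standard library in place of real embeddings and a real vector store, and the `retrieve` function and the sample chunks are hypothetical. The point it illustrates is that prepending the heading trail to each chunk's text lets section-level questions match the right chunk.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two token-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve(query, chunks, top_k=1):
    """Rank (heading_path, body) pairs against the query.

    The heading trail is prepended to the body before scoring, so the
    section title acts as extra context for matching.
    """
    query_vec = Counter(tokenize(query))
    scored = []
    for heading_path, body in chunks:
        doc_vec = Counter(tokenize(" ".join(heading_path) + " " + body))
        scored.append((cosine(query_vec, doc_vec), heading_path, body))
    scored.sort(key=lambda item: item[0], reverse=True)
    return scored[:top_k]


# Hypothetical sample chunks, invented for illustration only.
SAMPLE_CHUNKS = [
    (["5 Study Population", "5.1 Inclusion Criteria"],
     "Adults aged 18 years or older with confirmed diagnosis."),
    (["6 Study Design"],
     "Randomized, double-blind, placebo-controlled trial."),
    (["3 Objectives", "3.1 Primary Objective"],
     "Evaluate efficacy at week 12."),
]
```

For example, `retrieve("What is the Study Design?", SAMPLE_CHUNKS)` ranks the "6 Study Design" chunk first. In a real pipeline the scoring function would be replaced by the vector store's own similarity search over embeddings, with the heading trail stored as metadata or folded into the embedded text.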