
    Earn 13,500 ($135.00)

    Time Remaining: due 1 year ago
    Canceled

    PDF parsing into structured chunks for RAG

    aaravindan101
    Posted 1 year ago

    Bounty Description

    Problem Description

    I am trying to parse PDFs and upload them to a vector store, with additional context attached to each chunk, so that retrieval-augmented generation (RAG) with an LLM can find the most accurate and complete chunks for a given section.

    Acceptance Criteria

    Efficiently parse medical PDF files into headings and subheadings, including their text, images, and tables, for upload to a vector store with accurate retrieval.
    Example PDFs include:
    https://storage.googleapis.com/ctgov2-large-docs/99/NCT05993299/Prot_000.pdf
    https://storage.googleapis.com/ctgov2-large-docs/28/NCT04024228/Prot_000.pdf
    https://storage.googleapis.com/ctgov2-large-docs/72/NCT04625972/Prot_004.pdf
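To make the heading/subheading requirement concrete, here is a minimal sketch of one possible chunking step. It assumes the PDF text has already been extracted to plain text by some other tool, and that section headings look like "5.1 Inclusion Criteria" (a dotted number followed by a capitalized title). That heading pattern, the `Chunk` class, and `chunk_by_headings` are all hypothetical names chosen for illustration; tables and images are not handled here.

```python
import re
from dataclasses import dataclass

# Assumed heading shape: "5", "5.1", "5.1.2", then a title starting with an
# uppercase letter. Real protocols may need a different pattern.
HEADING_RE = re.compile(r"^(\d+(?:\.\d+)*)\s+([A-Z][^\n]{2,80})$")


@dataclass
class Chunk:
    path: list  # heading trail, e.g. ["5 Study Population", "5.1 Inclusion Criteria"]
    text: str = ""


def chunk_by_headings(text: str) -> list:
    """Split plain text into chunks, one per numbered heading."""
    chunks = []
    trail = []      # (depth, title) pairs for the current heading ancestry
    current = None  # chunk currently receiving body lines
    for raw in text.splitlines():
        line = raw.strip()
        match = HEADING_RE.match(line)
        if match:
            number, title = match.groups()
            depth = number.count(".") + 1
            # Drop headings at the same or deeper level, then push this one.
            trail = [(d, t) for d, t in trail if d < depth]
            trail.append((depth, f"{number} {title}"))
            current = Chunk(path=[t for _, t in trail])
            chunks.append(current)
        elif current is not None and line:
            current.text += line + "\n"
    return chunks
```

Storing the full heading trail with each chunk is one way to supply the "additional context" the bounty asks for: a subsection chunk still records which parent section it belongs to.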

    Parse and load the PDFs into a vector store (e.g., Weaviate) and retrieve the relevant chunks for a query about a section, such as:

    • What are the Inclusion/Exclusion Criteria?
    • What is the Study Design?
    • What are the Objectives (Primary/Secondary/etc)?
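
The section queries above could be checked against parsed chunks with a small stand-in retriever before wiring up a real vector store. The sketch below is an assumption-heavy toy: it scores chunks by bag-of-words cosine similarity instead of embeddings, and it represents each chunk as a plain dict with hypothetical `heading` and `text` keys. A vector store such as Weaviate would replace `top_chunks` with its own similarity search.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list:
    """Lowercase the text and keep alphabetic tokens only."""
    return re.findall(r"[a-z]+", text.lower())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def top_chunks(query: str, chunks: list, k: int = 2) -> list:
    """Return up to k chunks with non-zero similarity, best match first."""
    q = Counter(tokenize(query))
    scored = [
        # The heading is scored together with the body so section names count.
        (cosine(q, Counter(tokenize(c["heading"] + " " + c["text"]))), c)
        for c in chunks
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [c for score, c in scored[:k] if score > 0]
```

Including the heading in the scored text means a query that names a section can match a chunk even when the body never repeats the section title.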