
    Earn 4,500 ($45.00)

    Time Remaining: due 4 months ago
    Open

    Write a Tiny Web App to Read in a PDF Using PDF.js and Extract Annotation Images

    mobala
    Posted 4 months ago

    Bounty Description

    Overview

    Create a tiny web application that allows users to upload PDF files, process them using PDF.js, and extract images from annotations within the PDFs. This bounty requires a solid understanding of the PDF format, the PDF.js library, and image processing in JavaScript.

    Requirements

    Allow users to upload a PDF file through a simple web interface
    Use PDF.js to parse and render the PDF content
    Identify and extract all images embedded within annotations in the PDF
    Display the extracted images to the user
    Provide a way to download individual images or all images as a zip file

    Technical Details

    PDF.js Integration:

    Implement proper initialization and error handling for PDF.js
    Handle PDF.js's asynchronous nature correctly
    Work with PDF.js's document and page object model to access annotations
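
    The integration points above can be sketched as follows. This is a minimal illustration, not a definitive implementation: `collectAnnotations` and `loadAnnotations` are hypothetical helper names, and `pdfjsLib` is assumed to be the PDF.js library object passed in by the caller. The sketch relies on `getDocument(...).promise`, `numPages`, `getPage()`, and `getAnnotations()` from PDF.js's public API, and on the `subtype`, `rect`, and `id` fields commonly present on the annotation objects it returns.

```javascript
// Walk every page of an already-loaded PDF.js document and collect its annotations.
// `pdf` is assumed to be the object resolved by `pdfjsLib.getDocument(...).promise`.
async function collectAnnotations(pdf) {
  const results = [];
  // PDF.js page numbers are 1-based.
  for (let pageNumber = 1; pageNumber <= pdf.numPages; pageNumber++) {
    const page = await pdf.getPage(pageNumber);
    const annotations = await page.getAnnotations();
    for (const annotation of annotations) {
      results.push({
        pageNumber,
        subtype: annotation.subtype,
        rect: annotation.rect,
        id: annotation.id,
      });
    }
  }
  return results;
}

// Load a PDF from raw bytes and report failures as a value instead of throwing,
// so the UI can show a message for corrupt or unsupported files.
// `pdfjsLib` is passed in (assumption) so the helper does not depend on how
// the library is imported in a particular project.
async function loadAnnotations(pdfjsLib, bytes) {
  try {
    const pdf = await pdfjsLib.getDocument({ data: bytes }).promise;
    return { ok: true, annotations: await collectAnnotations(pdf) };
  } catch (error) {
    const message = error && error.message ? error.message : String(error);
    return { ok: false, message };
  }
}
```

    In a browser, `bytes` would typically come from `new Uint8Array(await file.arrayBuffer())` for the uploaded `File`. Returning `{ ok, ... }` rather than rethrowing is one possible design choice; it keeps the asynchronous error path in one place.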

    Annotation Processing:

    Correctly identify different types of PDF annotations (stamps, ink, etc.)
    Extract image data from relevant annotation objects
    Convert PDF binary image data to standard web image formats (PNG/JPEG)
    Preserve image metadata if possible (name, page number, coordinates)
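
    As a rough sketch of the filtering and conversion steps above, the helpers below select candidate annotations by subtype and expand decoded pixel data to RGBA, the layout a canvas expects. Several things here are assumptions: the helper names are hypothetical, the candidate subtype list is only a starting point, and the image is assumed to already be a decoded object with `width`, `height`, `data`, and `kind` fields, with `kind` values mirroring PDF.js's `ImageKind` enum. How that object is obtained from an annotation is not shown and depends on the approach chosen.

```javascript
// Values mirror PDF.js's `ImageKind` enum (assumption: confirm for your version).
const ImageKind = { GRAYSCALE_1BPP: 1, RGB_24BPP: 2, RGBA_32BPP: 3 };

// Assumption: subtypes worth inspecting for image content; adjust as needed.
const CANDIDATE_SUBTYPES = new Set(['Stamp', 'Ink']);

// Keep only annotations whose subtype is in the candidate set.
function selectCandidateAnnotations(annotations, subtypes = CANDIDATE_SUBTYPES) {
  return annotations.filter((annotation) => subtypes.has(annotation.subtype));
}

// Convert a decoded image to a Uint8ClampedArray of RGBA bytes.
// Only 24-bit RGB and 32-bit RGBA are handled in this sketch; other kinds throw.
function toRgba(image) {
  const { width, height, data, kind } = image;
  const pixelCount = width * height;

  if (kind === ImageKind.RGBA_32BPP) {
    if (data.length !== pixelCount * 4) {
      throw new Error('Unexpected RGBA data length');
    }
    return new Uint8ClampedArray(data); // copy, already in canvas layout
  }

  if (kind === ImageKind.RGB_24BPP) {
    if (data.length !== pixelCount * 3) {
      throw new Error('Unexpected RGB data length');
    }
    const rgba = new Uint8ClampedArray(pixelCount * 4);
    for (let src = 0, dst = 0; src < data.length; src += 3, dst += 4) {
      rgba[dst] = data[src];
      rgba[dst + 1] = data[src + 1];
      rgba[dst + 2] = data[src + 2];
      rgba[dst + 3] = 255; // fully opaque
    }
    return rgba;
  }

  throw new Error(`Unsupported image kind: ${kind}`);
}
```

    In a browser, the result can be wrapped with `new ImageData(rgba, width, height)`, drawn with `putImageData` on a 2D canvas context, and exported with `canvas.toBlob(callback, 'image/png')`. Metadata such as the page number and `rect` can be carried alongside the pixel data in a plain object.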

    User Interface:

    Create a clean, minimal UI that's easy to understand
    Show a preview of the uploaded PDF
    Display extracted annotation images with relevant details
    Implement responsive design for different screen sizes

    Challenges to Overcome

    Annotation Structure Complexity: PDF annotations vary widely in structure and implementation. You'll need to handle different annotation types and structures.
    Binary Data Extraction: Extracting and processing binary image data from PDF objects requires careful handling of different formats and encodings.
    PDF.js Limitations: PDF.js doesn't directly expose all annotation data in a simple format, so you'll need to understand its internal structures.
    Cross-Browser Compatibility: Ensure your solution works across modern browsers (Chrome, Firefox, Safari, Edge).
    Performance Optimization: Large PDFs or PDFs with many annotations will require efficient processing to maintain good performance.
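
    For the performance concern above, one possible approach (an assumption, not a requirement of the bounty) is to process pages in small batches so that only a few pages are being worked on at any time. `mapInBatches` and `pageNumbers` are hypothetical helper names, and the batch size is a tuning knob to measure rather than a recommended value.

```javascript
// Run `worker` over `items` in batches of `batchSize`, preserving input order.
// At most `batchSize` worker calls are in flight at once.
async function mapInBatches(items, batchSize, worker) {
  if (!Number.isInteger(batchSize) || batchSize < 1) {
    throw new RangeError('batchSize must be a positive integer');
  }
  const results = [];
  for (let start = 0; start < items.length; start += batchSize) {
    const batch = items.slice(start, start + batchSize);
    // Wait for the whole batch before starting the next one.
    const batchResults = await Promise.all(
      batch.map((item, offset) => worker(item, start + offset))
    );
    results.push(...batchResults);
  }
  return results;
}

// Build the 1-based page numbers [1, 2, ..., numPages].
function pageNumbers(numPages) {
  return Array.from({ length: numPages }, (_, index) => index + 1);
}
```

    A caller might use it as `mapInBatches(pageNumbers(pdf.numPages), 4, (n) => processPage(pdf, n))`, where `processPage` is whatever per-page extraction routine the application defines.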

    Deliverables

    A complete, functioning web application hosted on Replit
    Well-commented source code that explains your approach
    A README.md explaining how the application works and any limitations
    Basic documentation on the PDF.js integration approach you used
    Instructions for testing the application with example PDFs

    Evaluation Criteria

    Functionality: Does it correctly extract images from PDF annotations?
    Code Quality: Is the code well-structured, readable, and maintainable?
    User Experience: Is the interface intuitive and responsive?
    Robustness: Does it handle edge cases and errors gracefully?
    Documentation: Are the code and approach well documented?

    IMPORTANT NOTE: This bounty targets the PDF annotation layer ONLY. Images contained in PDF annotations are different from images embedded in the PDF's page content. We are looking for a way to extract only the images within annotations and render them on a canvas overlay over the original "cleaned" PDF.
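
    Placing an extracted image on a canvas overlay requires mapping the annotation's rectangle from PDF space (origin at the bottom-left, units in points) to canvas space (origin at the top-left, units in pixels). The sketch below shows that mapping for the simplest case. `pdfRectToCanvasRect` is a hypothetical helper, and it assumes an unrotated page whose coordinate origin is at (0, 0); for rotated or offset pages, PDF.js's `viewport.convertToViewportRectangle(rect)` is likely the safer route.

```javascript
// Convert a PDF rectangle [x1, y1, x2, y2] to a canvas rectangle.
// Assumptions: the page is not rotated and its origin is at (0, 0).
// `pageHeight` is the page height in PDF points; `scale` is the render scale.
function pdfRectToCanvasRect(rect, pageHeight, scale) {
  const [x1, y1, x2, y2] = rect;
  const left = Math.min(x1, x2);
  const top = Math.max(y1, y2); // upper edge in PDF space
  return {
    x: left * scale,
    y: (pageHeight - top) * scale, // flip the vertical axis
    width: Math.abs(x2 - x1) * scale,
    height: Math.abs(y2 - y1) * scale,
  };
}
```

    For example, on a 792-point-tall page rendered at scale 2, the rectangle `[100, 600, 200, 700]` maps to `{ x: 200, y: 184, width: 200, height: 200 }`, which can then be passed to `drawImage` on the overlay canvas's 2D context.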