    Earn 18,000 ($180.00)

    Time Remaining: due 1 year ago
    Completed

    Algorithm to convert Excel files to PDF in a Lambda that maintains formatting

    frankfrank69
    Posted 1 year ago
    This Bounty has been completed!
    @frankfrank69's review of @nox000
    4.3
    Average Rating
    Communication 5/5, Quality 4/5, Timeliness 4/5
    Good work, did not communicate timeline very well, and missed a few cases I had outlined

    Bounty Description

    Problem Description

    I need to be able to convert Excel files to PDF so I can use an existing OCR process. This OCR process depends on things like cell colors, bolded type, grid lines, and font size to infer how data is laid out and what the hierarchy is. The conversion does not need to preserve images (JPG, PNG, or SVG) embedded in the Excel file.

    Acceptance Criteria

    This function or algorithm should be able to run without having Excel installed on the host machine, because I need it to run in a Linux-based Docker container or AWS Lambda (running Ubuntu, for example).
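One way to meet the "no Excel installed" constraint could be to shell out to a headless office suite from Python. The sketch below assumes LibreOffice is installed in the container and that its `soffice` binary is on the PATH. LibreOffice is not named in the bounty, so treat it as an assumption. The helper names `build_convert_command` and `convert_to_pdf` are hypothetical. Whether the output matches Excel's own "Save as" closely enough for the OCR step would still need to be checked against real workbooks.

```python
import shutil
import subprocess
from pathlib import Path


def build_convert_command(xlsx_path, out_dir, soffice="soffice"):
    """Return the argv list for a headless LibreOffice XLSX -> PDF conversion."""
    return [
        soffice,
        "--headless",
        "--convert-to", "pdf",
        "--outdir", str(out_dir),
        str(xlsx_path),
    ]


def convert_to_pdf(xlsx_path, out_dir, soffice="soffice", timeout=120):
    """Convert one workbook and return the expected path of the PDF."""
    # Assumption: LibreOffice is installed and its binary is on PATH.
    if shutil.which(soffice) is None:
        raise FileNotFoundError(f"{soffice!r} not found on PATH")
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    subprocess.run(
        build_convert_command(xlsx_path, out_dir, soffice),
        check=True,       # raise if the conversion exits with an error
        timeout=timeout,  # avoid hanging the Lambda invocation
    )
    # LibreOffice names the output after the input file's stem.
    return out_dir / (Path(xlsx_path).stem + ".pdf")
```

In AWS Lambda, `out_dir` would need to live under `/tmp`, since that is the only writable path there; for example `convert_to_pdf("/tmp/in/report.xlsx", "/tmp/out")`.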

    The PDF should retain properties like cell colors, bolded type, grid lines, and font size in the output. The PDF should try to export one sheet per page, and does not need to use letter-sized pages. Dynamic page sizes are fine.

    This function or algorithm must be written in Python, or be easily called from Python.

    The final PDF should very closely resemble the PDF that is generated when doing a "Save as" from Excel.

    Technical Details

    One potential route is to parse the underlying data (see: https://stackoverflow.com/questions/4886027/looking-for-a-clear-description-of-excels-xlsx-xml-format).
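To illustrate the parsing route: an .xlsx file is a ZIP archive of XML parts, and the formatting this bounty cares about (bold, font size, fill color) is stored in `xl/styles.xml`. The sketch below uses only the Python standard library to read that part. The function name `read_cell_styles` is hypothetical, and the code covers only explicit `rgb` fill colors. Theme and indexed colors, borders, and number formats are left out, so this is a starting point rather than a full style reader.

```python
import zipfile
import xml.etree.ElementTree as ET

# Main SpreadsheetML namespace used by xl/styles.xml.
NS = {"m": "http://schemas.openxmlformats.org/spreadsheetml/2006/main"}


def read_cell_styles(xlsx_file):
    """Return one dict per cellXfs entry: bold flag, font size, fill color."""
    with zipfile.ZipFile(xlsx_file) as zf:
        root = ET.fromstring(zf.read("xl/styles.xml"))

    fonts = []
    for font in root.findall("m:fonts/m:font", NS):
        b = font.find("m:b", NS)
        size = font.find("m:sz", NS)
        fonts.append({
            # <b/> with no val attribute means bold is on.
            "bold": b is not None and b.get("val", "1") not in ("0", "false"),
            "size": float(size.get("val")) if size is not None else None,
        })

    fills = []
    for fill in root.findall("m:fills/m:fill", NS):
        fg = fill.find("m:patternFill/m:fgColor", NS)
        # None when there is no foreground color or it is not given as rgb.
        fills.append(fg.get("rgb") if fg is not None else None)

    styles = []
    for xf in root.findall("m:cellXfs/m:xf", NS):
        font = fonts[int(xf.get("fontId", 0))]
        styles.append({**font, "fill_rgb": fills[int(xf.get("fillId", 0))]})
    return styles
```

Each cell in a worksheet part carries an `s` attribute that indexes into this list, so `read_cell_styles(path)[int(s)]` would give the style for that cell.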

    Using tools like xlwings is not practical because they depend on macOS or Windows with Excel installed.

    A simple Excel -> Markdown -> PDF conversion is also not practical, as it loses all formatting.