Earn 4,500 ($45.00)

due 3 weeks ago

Open

Dev to Production Issue

brianezrapike89

Posted 1 month ago

Bounty Description

Problem Description

In the app, I run a process that scrapes a user's Gmail with up to 30 parallel workers. In the Dev environment, this finishes in <1 minute. In the Production environment, it takes >30 minutes. I'm having a similar issue with the downstream processes on these emails (e.g., OCR): they either take significantly longer in Production or fail altogether, even though they complete quickly and successfully in the Test environment for the exact same test user. Production uses the same database as Test to avoid changing too much simultaneously.

Acceptance Criteria

For the Test User, each of the four key processes (a. Gmail scrape, b. OCR, c. structure, d. compare) completes in Production at most 10% slower than it currently does in Development, or faster.
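To make the 10% threshold concrete, here is a minimal sketch of how the criterion could be checked once timings are collected. The function name, stage keys, and sample numbers are hypothetical and only illustrate the arithmetic; they are not taken from the app.

```python
# Hypothetical helper: names and sample timings are illustrative, not from the app.
TOLERANCE = 0.10  # "up to 10% slower" from the acceptance criteria
STAGES = ("gmail_scrape", "ocr", "structure", "compare")

def meets_criteria(dev_seconds, prod_seconds, tolerance=TOLERANCE):
    """Return {stage: bool}; True when Production is within tolerance of Development."""
    results = {}
    for stage in STAGES:
        limit = dev_seconds[stage] * (1 + tolerance)
        # A stage that never finished in Production is recorded as None and fails.
        prod = prod_seconds.get(stage)
        results[stage] = prod is not None and prod <= limit
    return results

# Example with made-up timings (seconds); None marks a run that hung.
dev = {"gmail_scrape": 60, "ocr": 120, "structure": 180, "compare": 60}
prod = {"gmail_scrape": 65, "ocr": None, "structure": 200, "compare": 50}
report = meets_criteria(dev, prod)
```

With these sample values, the scrape and compare stages pass, the structure stage fails (200 s exceeds 180 s plus 10%), and the hung OCR stage fails.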

Technical Details

Up to 30 parallel workers are used to scrape a user's Gmail account for emails meeting certain criteria. In Production, we can see that these workers are successfully authenticated by Gmail. The workers then scrape the emails and, for each email, upload it to the user's S3 bucket and create a row in a Replit-hosted Neon PostgreSQL database. In the Test environment, this operation completes in <1 minute. In the Production environment, it takes >30 minutes. For the test user, 1,200 emails are saved to S3 and added to the database.
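One way to narrow down which step is slow in Production is to time each step of a worker separately. The sketch below assumes a thread-based worker pool and uses stand-in functions (`fetch_email`, `upload_to_s3`, `insert_row`) in place of the real Gmail, S3, and database calls; all names are hypothetical and the actual app may be structured differently.

```python
import time
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor
from threading import Lock

stage_totals = defaultdict(float)  # seconds spent per step, summed over all workers
_lock = Lock()

def timed(stage, fn, *args):
    """Run fn(*args) and add its wall-clock duration to stage_totals[stage]."""
    start = time.perf_counter()
    try:
        return fn(*args)
    finally:
        with _lock:
            stage_totals[stage] += time.perf_counter() - start

# Stand-ins for the real calls; replace the sleeps with the app's own functions.
def fetch_email(email_id):
    time.sleep(0.001)
    return {"id": email_id}

def upload_to_s3(email):
    time.sleep(0.001)

def insert_row(email):
    time.sleep(0.002)

def process(email_id):
    email = timed("gmail_fetch", fetch_email, email_id)
    timed("s3_upload", upload_to_s3, email)
    timed("db_insert", insert_row, email)
    return email_id

def run(email_ids, workers=30):
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(process, email_ids))

processed = run(range(60))
```

Running the same harness in both environments and comparing `stage_totals` would show whether the extra time goes to Gmail, S3, or the database, rather than only that the whole job is slower.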

A different button is then used to trigger OCR for those emails. These are also processed in parallel by a different set of workers. The OCR relies on AWS Textract. The OCR results are stored as JSON in both the S3 bucket and the Neon PostgreSQL database. This completes in <2 minutes in the Test environment. In Production, it took too long to complete (I could not see how long, because it just hangs).
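Because the Production OCR run hangs without reporting anything, a deadline around the worker pool can at least show which jobs finished, which failed, and which were still stuck. This is a minimal sketch assuming thread-based workers and Python 3.9+; `run_with_deadline` and `fake_ocr` are hypothetical names, and `fake_ocr` only simulates a Textract call.

```python
import time
from concurrent.futures import ThreadPoolExecutor, wait

def run_with_deadline(jobs, worker, max_workers=30, deadline_s=120.0):
    """Run worker(job) for each job; return (ok, failed, stuck) lists of jobs."""
    pool = ThreadPoolExecutor(max_workers=max_workers)
    futures = {pool.submit(worker, job): job for job in jobs}
    done, not_done = wait(list(futures), timeout=deadline_s)
    # Do not block on stuck workers; drop jobs that have not started (Python 3.9+).
    pool.shutdown(wait=False, cancel_futures=True)
    ok = sorted(futures[f] for f in done if f.exception() is None)
    failed = sorted(futures[f] for f in done if f.exception() is not None)
    stuck = sorted(futures[f] for f in not_done)
    return ok, failed, stuck

# Stand-in for an OCR call: job 2 raises, job 3 outlives the deadline.
def fake_ocr(job):
    if job == 2:
        raise ValueError("simulated OCR failure")
    if job == 3:
        time.sleep(1.0)
    return job

ok, failed, stuck = run_with_deadline(range(5), fake_ocr, max_workers=5, deadline_s=0.3)
```

Logging the `stuck` list with email identifiers would turn "it just hangs" into a concrete set of jobs to inspect. Note that threads already running are not killed; they continue in the background until their call returns.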

A different button is then used to trigger structuring of those OCR results. Another set of parallel workers is used to send the OCR results to OpenAI for structuring. The structured results are then added to the PostgreSQL database. This takes <3 minutes in the Test environment. I have not been able to test this in the Production environment because the OCR process takes too long.

A different button is then used to trigger a comparison function on that structured data. It also uses parallel workers to do database lookups and data comparisons. The results of these lookups are then stored back in that same database. This process takes <1 minute in the Test environment. I have not been able to test this in the Production environment because the OCR process takes too long.

Link to Project

https://finpayments.replit.app/