Earn 15,300 ($153.00)

due 2 weeks ago

In Progress

Reinforcement Learning Notebook - Change from Model judge to Human judge

AskProgrammers

Posted 4 weeks ago

Bounty Description

bounty:

Create a one-click, run-all Colab notebook that implements RULER RFT, modified so that a human, rather than a model, acts as the scoring judge.

The human judge enters a score for each trajectory on a 0-1 scale.

This is the notebook to modify:

https://colab.research.google.com/github/openpipe/art/blob/main/examples/art-e/art-e.ipynb

Use qwen3:32b as the model being trained.

acceptance criteria:

• runs completely with a single click, end-to-end
• no dependency or runtime errors
• human scoring via simple stdin input
• a smaller qwen3 model may be used for testing (e.g. if qwen3:32b needs too much memory on Colab)
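
As a rough illustration of the stdin scoring criterion, here is a minimal sketch. It is not taken from the ART notebook: `parse_score` and `ask_human_scores` are hypothetical helper names, trajectories are assumed to already be rendered as plain strings, and the `input_fn` / `output_fn` parameters exist only so the loop can be exercised without a live prompt.

```python
from typing import Callable, List


def parse_score(raw: str) -> float:
    """Parse one human-entered score and require it to lie in [0, 1]."""
    value = float(raw.strip())  # raises ValueError for non-numeric text
    if not 0.0 <= value <= 1.0:  # also rejects NaN
        raise ValueError(f"{value} is outside the 0-1 range")
    return value


def ask_human_scores(
    trajectories: List[str],
    input_fn: Callable[[str], str] = input,      # stdin by default
    output_fn: Callable[[str], None] = print,
) -> List[float]:
    """Show each trajectory and re-prompt until a valid 0-1 score is given."""
    scores: List[float] = []
    total = len(trajectories)
    for number, text in enumerate(trajectories, start=1):
        output_fn(f"\n--- Trajectory {number}/{total} ---\n{text}")
        while True:
            try:
                scores.append(parse_score(input_fn("Score (0-1): ")))
                break
            except ValueError as err:
                output_fn(f"Invalid score: {err}")
    return scores
```

Re-prompting on bad input, instead of raising, is one way to keep a run-all notebook from failing halfway through a scoring session because of a typo.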

bonus $ if you make human judging more effective or less time-consuming
(e.g. better interface to select trajectories from)
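
One possible reading of "less time-consuming", sketched here as an assumption and not as the bounty's required design: the judge ranks a group of trajectories from best to worst in a single input, and the ranking is converted to evenly spaced 0-1 scores. `ranking_to_scores` is a hypothetical helper name.

```python
from typing import List


def ranking_to_scores(ranking: List[int], n: int) -> List[float]:
    """Convert a best-to-worst ranking of 1-based trajectory numbers
    into evenly spaced scores in [0, 1] (best = 1.0, worst = 0.0)."""
    if sorted(ranking) != list(range(1, n + 1)):
        raise ValueError(f"ranking must list each of 1..{n} exactly once")
    if n == 1:
        return [1.0]
    scores = [0.0] * n
    for position, trajectory_number in enumerate(ranking):
        scores[trajectory_number - 1] = 1.0 - position / (n - 1)
    return scores


# Example: the judge types "2 1 3", meaning trajectory 2 is best.
typed = "2 1 3"
print(ranking_to_scores([int(tok) for tok in typed.split()], n=3))
```

The trade-off is that evenly spaced scores discard how much better one trajectory is than another, so this suits quick relative judgments more than fine-grained scoring.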

extra bonus $ if your implementation naturally results in the generation of a high-quality, clean Direct Preference Optimization (DPO) dataset
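
A sketch of how human scores could double as a DPO dataset, under stated assumptions: each prompt has several scored completions, a pair is kept only when the score gap reaches a threshold, and records use the `prompt` / `chosen` / `rejected` field names that are a common convention for preference data. `build_dpo_pairs` and `min_margin` are hypothetical names introduced for this example.

```python
import json
from typing import Dict, List


def build_dpo_pairs(
    prompt: str,
    completions: List[str],
    scores: List[float],
    min_margin: float = 0.2,  # assumed threshold; tune to taste
) -> List[Dict[str, str]]:
    """Pair higher-scored completions with lower-scored ones, keeping
    only pairs whose score gap is at least min_margin."""
    if len(completions) != len(scores):
        raise ValueError("completions and scores must have the same length")
    ranked = sorted(zip(scores, completions), key=lambda item: item[0], reverse=True)
    pairs: List[Dict[str, str]] = []
    for i, (high_score, chosen) in enumerate(ranked):
        for low_score, rejected in ranked[i + 1:]:
            if high_score - low_score >= min_margin:
                pairs.append({"prompt": prompt, "chosen": chosen, "rejected": rejected})
    return pairs


# Serialize one JSON object per line (JSONL).
for record in build_dpo_pairs("q", ["a", "b", "c"], [0.9, 0.8, 0.1], min_margin=0.5):
    print(json.dumps(record))
```

Dropping near-ties through the margin is one simple way to keep the dataset clean, since pairs with almost equal human scores carry a weak preference signal.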

link to tweet:
https://x.com/uncensored_ai/status/1949163720338239755

You may be able to one-shot this with a model like o3 pro, which can return ipynb files
