Earn 15,300 ($153.00)
Reinforcement Learning Notebook - Change from Model judge to Human judge
Bounty Description
bounty:
Create a one-click, run-all Colab notebook implementing RULER RFT, but modified to use a human as the scoring judge
The human judge should enter a score for each trajectory on a 0-1 scale.
This is the notebook to modify:
https://colab.research.google.com/github/openpipe/art/blob/main/examples/art-e/art-e.ipynb
Use qwen3:32b as the model being trained.
acceptance criteria:
• runs completely with a single click, end-to-end
• no dependency or runtime errors
• human scoring via simple stdin input
• a smaller qwen3 model may be used for testing (e.g. if qwen3:32b needs too much memory on Colab)
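
A minimal sketch of the stdin scoring step named in the criteria above, assuming each trajectory can be shown to the judge as plain text. The helper names (parse_score, ask_score, score_group) are hypothetical and are not part of the ART library; wiring the returned scores into the notebook's reward step is left out.

```python
from typing import Callable, List


def parse_score(raw: str) -> float:
    """Parse one typed score and require it to lie in [0, 1]."""
    value = float(raw.strip())  # raises ValueError on non-numeric input
    if not 0.0 <= value <= 1.0:  # also rejects nan and inf
        raise ValueError(f"score {value} is outside [0, 1]")
    return value


def ask_score(label: str, input_fn: Callable[[str], str] = input) -> float:
    """Re-prompt until the human enters a valid score."""
    while True:
        try:
            return parse_score(input_fn(f"Score for {label} (0-1): "))
        except ValueError as err:
            print(f"Invalid input ({err}); please try again.")


def score_group(
    trajectories: List[str],
    input_fn: Callable[[str], str] = input,
) -> List[float]:
    """Show each trajectory and collect one human score per trajectory."""
    scores = []
    for i, text in enumerate(trajectories, start=1):
        print(f"\n--- Trajectory {i}/{len(trajectories)} ---\n{text}")
        scores.append(ask_score(f"trajectory {i}", input_fn))
    return scores
```

In a Colab cell, the default input() call blocks and shows a text box, so score_group(["answer A", "answer B"]) would pause once per trajectory. Passing a different input_fn makes the same code usable in automated tests.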
bonus $ if you make human judging more effective or less time-consuming
(e.g. a better interface for selecting trajectories)
extra bonus $ if your implementation naturally results in the generation of a high-quality, clean Direct Preference Optimization (DPO) dataset
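
One way the human scores could yield DPO data, sketched under assumptions: every pair of responses to the same prompt whose scores differ by at least a margin becomes one preference record. The function name build_dpo_pairs, the min_margin threshold, and the prompt/chosen/rejected field names are illustrative choices here, not something the bounty specifies.

```python
from itertools import combinations
from typing import Dict, List


def build_dpo_pairs(
    prompt: str,
    responses: List[str],
    scores: List[float],
    min_margin: float = 0.2,  # assumed threshold; tune to taste
) -> List[Dict[str, str]]:
    """Turn per-response human scores into chosen/rejected pairs."""
    if len(responses) != len(scores):
        raise ValueError("responses and scores must have the same length")
    pairs = []
    for i, j in combinations(range(len(responses)), 2):
        margin = scores[i] - scores[j]
        if abs(margin) < min_margin:
            continue  # skip near-ties: they carry little preference signal
        better, worse = (i, j) if margin > 0 else (j, i)
        pairs.append(
            {
                "prompt": prompt,
                "chosen": responses[better],
                "rejected": responses[worse],
            }
        )
    return pairs
```

Dropping near-ties is a design choice meant to keep the dataset clean: pairs the judge scored almost equally are the ones most likely to be noise. Each record can be written out with json.dumps, one per line, to produce a JSONL file.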
link to tweet:
https://x.com/uncensored_ai/status/1949163720338239755
You may be able to one-shot this with a model like o3 pro, which can return .ipynb files.