Al_th
alth.fr
did:plc:etms2tmaniqdio6pb5ray7y2
I implemented GRPO from scratch to RL a tiny toy LLM and it works surprisingly well.
Rule-based reward inspired by @dorialexander.bsky.social to make my Shakespeare shout more.
I went for Outcome Supervision as both OS and PS were kind of close in the DeepSeekMath paper…
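For anyone curious what the outcome-supervision part looks like, here is a minimal sketch, not the actual code from this project. It assumes a hypothetical "shout" reward (the fraction of letters that are uppercase) and follows the DeepSeekMath idea of normalizing each completion's reward by the mean and standard deviation of its sampling group, then giving every token of that completion the same advantage. The names `shout_reward` and `group_advantages`, the `eps` term, and the use of the population standard deviation are assumptions made for illustration.

```python
import statistics


def shout_reward(text: str) -> float:
    """Hypothetical rule-based reward: fraction of letters that are uppercase."""
    letters = [c for c in text if c.isalpha()]
    if not letters:
        return 0.0
    return sum(c.isupper() for c in letters) / len(letters)


def group_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    """Group-relative advantages for outcome supervision.

    Each completion sampled for the same prompt gets one scalar advantage:
    its reward minus the group mean, divided by the group standard deviation.
    eps (an assumption here) avoids division by zero when all rewards match.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; an illustrative choice
    return [(r - mean) / (std + eps) for r in rewards]


# One group of completions sampled for a single prompt (made-up strings).
completions = ["TO BE OR NOT TO BE", "to be or not to be", "To Be Or Not"]
rewards = [shout_reward(c) for c in completions]
advantages = group_advantages(rewards)

for text, r, a in zip(completions, rewards, advantages):
    print(f"{text!r}: reward={r:.2f}, advantage={a:+.2f}")
```

In a full GRPO step, each advantage would be broadcast to all tokens of its completion and plugged into the clipped policy-gradient objective with a KL penalty; that part is left out here.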
2025-02-05T17:20:38.062Z