Discussion about this post

Sean W

Obviously written by an LLM but still a very strong argument

Mahdi Assan

Super interesting piece.

My biggest takeaway: it seems that RL is less about fixing bugs and more about bribing the model with rewards until it behaves. You can’t just ‘patch the brain’ without paying the retraining bill. And that bill is super, super expensive, and the resulting model does not even generalise to other domains. Reward is not enough after all.

Also, despite being quite dense, this piece is full of wit and is a very accessible, fun read. Great work!
