Trying to respond to @scaptal
Yeah, just comparing answers doesn't always work, especially when there are several possible right ones. Instead you ask it to show its working, like in a school exam, and then you judge the reasoning behind the answer: how efficient it was (if that's the goal), how many sources it used and how good they were, what kind of maths it did, whether it double-checked anything, whether it avoided obvious bias, that kind of thing. Even if we don't tell it what the goal is, and it doesn't fully figure it out, it can still optimise.

I should correct what I said before about using 2X as the higher reward. With a simple fixed multiplier, the model would just figure out that Y is always 2X (or 3X, 10X, etc.) and do the cost/benefit calculation almost immediately. So instead Y needs to be more desirable but in a less predictable way, maybe via some randomised multiplier, so it can't (immediately, at least) calculate whether cheating would be fast enough to do it enough times in time for it to be worth more than Y and thus wo...
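To make the fixed vs randomised multiplier point a bit more concrete, here's a rough Python sketch. Everything in it is my own assumption for illustration: the function names, the uniform distribution, and the 1.5 to 10 range are made up, and X is just set to 1. It only shows that with a fixed multiplier the break-even number of cheats is the same every time, while with a randomised one it moves around from episode to episode.

```python
import random

def cheats_needed(x_reward, y_reward):
    """Smallest number of X-sized cheat rewards that strictly beats one Y reward."""
    return int(y_reward // x_reward) + 1

def fixed_reward(x_reward, multiplier=2.0):
    # Hypothetical scheme from the earlier post: Y is always 2X.
    return multiplier * x_reward

def randomised_reward(x_reward, rng, low=1.5, high=10.0):
    # Hypothetical alternative: Y = m * X with m redrawn every episode.
    # The uniform distribution and its bounds are assumptions, not a recommendation.
    return rng.uniform(low, high) * x_reward

rng = random.Random(0)  # seeded so the sketch is reproducible
X = 1.0

# Break-even thresholds the model would observe over 1000 episodes.
fixed_thresholds = {cheats_needed(X, fixed_reward(X)) for _ in range(1000)}
random_thresholds = {cheats_needed(X, randomised_reward(X, rng)) for _ in range(1000)}

print("fixed multiplier thresholds:", sorted(fixed_thresholds))
print("randomised multiplier thresholds:", sorted(random_thresholds))
```

With the fixed multiplier there is a single threshold to learn, so one observation is enough to do the cost/benefit. With the randomised one, a single episode tells it very little, although over enough episodes it could still estimate the average, which is why I only say "immediately".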
