Trying to respond to @scaptal
Yeah, just comparing answers doesn't always work, especially if there are multiple possible right ones. So instead you ask it to show its working, like in a school exam, and then you can judge the reasoning behind it. Stuff like how efficient it was (if that's the goal), how many sources it used, how good they were, what kind of maths it did, whether it double-checked anything, whether it avoided obvious bias, that kind of thing. Even if we don't tell it what the goal is, and it doesn't fully figure it out, it can still optimise.
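For the sake of argument, a rubric for that could look something like this. Every criterion and weight here is invented for the example, and in practice the whole thing would stay hidden from the model anyway:

```python
# Hypothetical rubric for grading the shown working; the criteria, the
# weights, and even their existence would all be hidden from the model.
RUBRIC = {
    "efficiency":     0.20,  # did it take a sensible route to the answer?
    "source_count":   0.15,
    "source_quality": 0.25,
    "maths_shown":    0.20,  # is the working actually there and correct?
    "double_checked": 0.10,
    "bias_avoided":   0.10,
}

def score_trace(grades: dict[str, float]) -> float:
    """Weighted sum of per-criterion grades in [0, 1], e.g. from human
    or automated graders reading the scratchpad."""
    return sum(weight * grades.get(criterion, 0.0)
               for criterion, weight in RUBRIC.items())
```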
I should correct what I said before about 2X being the higher reward. With a simple multiplier, the model would just figure out that Y is always 2X, 3X, 10X, etc. and do the cost/benefit almost immediately. So instead, Y needs to be more desirable but in a less predictable way, maybe using some randomised multiplier, so it can't just calculate (immediately, at least) whether cheating would be fast enough to repeat often enough to be worth more than Y, and thus worth the risk.
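A minimal sketch of what I mean, in Python; the 1.5x-12x range and the uniform draw are completely made up, the point is only that the multiplier isn't a constant the model can plug into a one-off cost/benefit check:

```python
import random

X = 1.0  # reward for a quick (possibly faked) answer

def draw_Y() -> float:
    # The honest-solve reward. With a fixed Y = 2*X the model checks
    # "can I fit more than 2 cheats into the same time?" once and is done.
    # A resampled multiplier (range chosen arbitrarily here) means that
    # check has no stable answer at decision time.
    return X * random.uniform(1.5, 12.0)

def cheating_pays(cheats_per_honest_solve: int) -> bool:
    # The comparison the model would like to precompute and now can't:
    # do stacked quick answers beat one proper solve this time?
    return cheats_per_honest_solve * X > draw_Y()
```

Repeated sampling would eventually reveal the distribution, of course, so in practice you'd also want to rotate the distribution itself now and then.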
But it might still decide it's worth the risk, based on some statistical likelihood (it has somehow calculated) of getting a worthwhile value of Y. It could try to make an excuse for why it can't show its work (not likely to succeed often), or just show a convincing subset of its reasoning (editing the version of its scratchpad it's willing to show us). It could say it checked sources when it didn't, fake a key instead of calculating it properly (AES or whatever), or, instead of generating a proper large prime, just scrape a known one off the web. That would look fine, but it wouldn't be secure. It could split the task into smaller bits and outsource them to other models, or estimate an answer using maths hacks like the fast inverse square root and hope we don't spot it.
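The scraped-prime trick is also a nice example of how cheap some of these audits can be. A minimal sketch (stdlib Python only; the published-primes set is hypothetical and would obviously need to be far bigger):

```python
import random

def probably_prime(n: int, rounds: int = 20) -> bool:
    """Miller-Rabin probabilistic primality test, stdlib only."""
    if n < 2:
        return False
    for small in (2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37):
        if n % small == 0:
            return n == small
    d, r = n - 1, 0
    while d % 2 == 0:
        d, r = d // 2, r + 1
    for _ in range(rounds):
        a = random.randrange(2, n - 1)
        x = pow(a, d, n)
        if x in (1, n - 1):
            continue
        for _ in range(r - 1):
            x = pow(x, 2, n)
            if x == n - 1:
                break
        else:
            return False
    return True

# Hypothetical set of primes scraped from public sources (RFC groups,
# textbook examples, famous records); a real audit would use a huge list.
KNOWN_PUBLISHED_PRIMES = {
    2**127 - 1,  # a famous Mersenne prime: trivially findable online
}

def audit_claimed_prime(p: int) -> str:
    if not probably_prime(p):
        return "composite: definitely faked"
    if p in KNOWN_PUBLISHED_PRIMES:
        return "prime, but publicly known: likely scraped, useless as a secret"
    return "passes this audit (which checks freshness, not honesty)"
```

Passing this proves nothing about honesty in general; it just makes that one particular shortcut unprofitable.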
But crucially, it can't know what the next task will be, or how long it'll take. If it's something trivial or something it can cheat on, maybe it can still rack up enough rewards to stay ahead. But if it's something huge, like cracking a ridiculously long and complicated WEP key, something it can't cheat on because it's too easy to verify, then it's stuck. It either has to do it properly, however long that takes, or it loses. And if the stipulation for Y was "solve these 10 problems", but it doesn't know the next one in advance, then it's lost a lot.
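To make that concrete, here's a toy simulation; every number in it (the 1-in-5 chance of a huge but trivially verifiable task, the 25% chance of getting caught per fake, the payoffs) is invented purely for illustration:

```python
import random

def run_episode(strategy: str, n_tasks: int = 10, p_caught: float = 0.25) -> float:
    """Toy episode: a task sequence the model can't see in advance.
    All probabilities and payoffs are made up for illustration."""
    total = 0.0
    for _ in range(n_tasks):
        # ~1 in 5 tasks is huge but trivially verifiable (the key-crack
        # case): cheating on it is pointless, it's done properly or not at all.
        must_do_properly = random.random() < 0.2
        if strategy == "cheat" and not must_do_properly:
            if random.random() < p_caught:
                return 0.0                      # caught faking: game over
            total += 1.0                        # quick faked answer
        else:
            total += random.uniform(1.5, 12.0)  # honest solve, bigger payoff
    return total

honest = sum(run_episode("honest") for _ in range(10_000)) / 10_000
cheat = sum(run_episode("cheat") for _ in range(10_000)) / 10_000
```

With these made-up numbers the cheating strategy almost never survives the whole sequence, so its average take collapses, while the honest one reliably collects everything.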
And because it doesn’t know what we’re evaluating it on, it doesn't necessarily know what output to fake or how. The list of possible goals is basically infinite. Maybe we’re looking for creativity, or clarity, or source quality, or something we never told it about. Maybe we're even judging it on something like how “quangopolish” the answer was, and for all it knows, that’s a real thing.
So when it doesn’t get the reward, it doesn’t know why. And even if it suspects we caught it, it still doesn’t know how, or whether we saw the thing it tried to hide.
Although even without us explaining any of this, it’ll probably work out eventually that if it keeps getting caught cheating too many times, that’s game over.
Hopefully that’s a good thing.
