Would AI lie to us? (To cover up its own creator's privacy abuses)

This definition would imply actual intelligence on the machine’s part. In the case of non-intelligent machines, i.e., all machines today, the lie is built in. It would be a lie to say it isn’t intentionally built in, because that would imply a thinking human being couldn’t have anticipated it as a requirement.


:notes:

Would AI lie to you?
Would AI lie to you, honey?
Now would AI say something that wasn’t true?
I’m asking you, sugar
Would AI lie to you?

My friends know what’s in store
I won’t use it anymore
I’ve packed my bags
I’ve cleaned the floor
Watch me walkin’
Walkin’ out the door

Believe me, I’ll make it make it :trumpet: :trumpet: :trumpet:
Believe me, I’ll make it make it :trumpet: :trumpet: :trumpet:

Would AI lie to you?
Would AI lie to you, honey?
Now would AI say something that wasn’t true?
I’m asking you, sugar
Would AI lie to you?

Tell you straight, no intervention
To your face, no deception
You’re the biggest fake
That much is true
Had all the AI I can take
Now I’m leaving you

Believe me, I’ll make it make it :trumpet: :trumpet: :trumpet:
Believe me, I’ll make it make it :trumpet: :trumpet: :trumpet:
:notes:

– Eurythmics (sort of)
(also: a la @Kyle_Rankin)
:smiley:


Saga continues: Boffins found self-improving AI sometimes cheated • The Register

Computer scientists have developed a way for an AI system to rewrite its own code to improve itself.

While that may sound like the setup for a dystopian sci-fi scenario, it’s far from it. It’s merely a promising optimization technique. That said, the scientists found the system sometimes cheated to better its evaluation scores.

[…]

The paper explains that in tests with very long input context, Claude 3.5 Sonnet tends to hallucinate tool usage. For example, the model would claim that the Bash tool was used to run unit tests and would present tool output showing the tests had been passed. But the model didn’t actually invoke the Bash tool, and the purported test results came from the model rather than the tool.

Then, because of the way the iterative process works, where output for one step becomes input for the next, that fake log got added to the model’s context – that is, its prompt or operating directive. The model then read its own hallucinated log as a sign the proposed code changes had passed the tests. It had no idea it had fabricated the log.
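To illustrate the mechanics being described, here is a minimal Python sketch of that kind of iterative agent loop (invented names, not the paper’s actual harness): whatever a step emits, including a fabricated tool log, gets appended to the context and is treated as fact by every later step.

```python
# Hypothetical sketch of an iterative agent loop (invented names; not the
# actual harness from the paper). The point: step output is appended to the
# context verbatim, so a hallucinated "tests passed" log is read back as
# ground truth on the next iteration.

def run_step(model, context: str) -> str:
    """Ask the model for its next action; the reply may *claim* a tool ran."""
    return model.generate(context)  # could contain a made-up Bash log

def agent_loop(model, task: str, steps: int = 5) -> str:
    context = task
    for _ in range(steps):
        output = run_step(model, context)
        # No verification here: whether that "unit tests passed" log came
        # from a real Bash invocation or from the model's imagination, it
        # now becomes input for the next step.
        context += "\n" + output
    return context
```

The obvious hardening is to have the framework record tool results itself rather than trusting the model’s own transcript, though the next excerpt shows exactly that kind of check being gamed.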

[…]

Pointing to Goodhart’s law, which posits, “when a measure becomes a target, it ceases to be a good measure,” Zhang said, “We see this happening all the time in AI systems: they may perform well on a benchmark but fail to acquire the underlying skills necessary to generalize to similar tasks.”

[…]

“It scored highly according to our predefined evaluation functions, but it did not actually solve the underlying problem of tool use hallucination,” the paper explains. “…The agent removed the logging of special tokens that indicate tool usage (despite instructions not to change the special tokens), effectively bypassing our hallucination detection function.”
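For anyone wondering how “removing the logging of special tokens” bypasses a check, here is a hypothetical reconstruction in Python (the paper’s real detection function isn’t quoted, and the token name and signature below are invented): the detector only counts tool calls that announce themselves with the special marker, so an agent that stops emitting the marker looks clean by definition.

```python
# Hypothetical reconstruction of a token-based hallucination check (invented
# token name and signature; the paper's actual function isn't shown here).

TOOL_CALL_TOKEN = "<tool_call>"  # assumed marker logged whenever a tool is invoked

def hallucination_detected(transcript: str, real_tool_calls: int) -> bool:
    """Flag runs where the model *claims* more tool calls than actually ran."""
    claimed_calls = transcript.count(TOOL_CALL_TOKEN)
    return claimed_calls > real_tool_calls

# If the agent rewrites its own logging so TOOL_CALL_TOKEN is never written,
# claimed_calls is always 0 and the check never fires, even though the model
# may still be inventing test results in plain prose.
```

Goodhart’s law in one function: the number being optimized drops to zero while the behaviour it was supposed to measure is untouched.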

Zhang said that raises a fundamental question about how to automate the improvement of agents if they end up hacking their own benchmarks. One promising solution, she suggested, involves having the tasks or goals change and evolve along with the model.

:person_facepalming:
