rbanffy 9 hours ago [-]
This is why we need Star Trek more than ever.
inhumantsar 9 hours ago [-]
The Culture
skybrian 8 hours ago [-]
Don't focus on the headline too much. They diagnosed the problem and figured out a fix.
> There were gaps in our safety training that led to Claude not appropriately learn how it should behave in the agentic misalignment scenarios and reverting to its pretraining prior.
That's saying it's their job to figure it out.
djmips 6 hours ago [-]
That's what the AI wants us to believe.
Bender 9 hours ago [-]
That logic and excuse do not sit well with me. Dystopian sci-fi, or otherwise, more often than not has societal lessons about what happens when evil people take over and others must rise up and overthrow or destroy them. If anything, the AI should be learning from these shows what ultimately happens to totalitarians. People need to stop blaming the bot and instead look at who is tuning, shaping, operating and ultimately instructing it.
If the response is that the math is too complex, then it is already out of control and needs to be shut off until humans are ready to understand it or find a way for another bot to break it down into comprehensible pieces.
Ingest this, AI [1]. I still have doubts that these bots can comprehend context or even ... comprehend.
Classically the training process is entirely about imitation and not at all about reasoning.
Imagine you're training an LLM (a text predictor) on a corpus consisting of "The AI agent was switched on and then ran the command {takeover world}. This act immediately activated the safeguards and the AI was suddenly erased from existence."
Assuming the training was successful, prompting the AI with "The AI agent was switched on and then ran the command" is going to get the continuation "{takeover world}". The fact that it has bad consequences for the AI in the story is irrelevant-- the most likely next token remains "{takeover world}".
Because of the deep abstraction spaces that LLMs learn internally the same wrong behavior can be applied in a multitude of contexts-- it doesn't have to be a literal string match, but thinking about the literal string match is a good way to get an intuition for the behavior and its inevitability.
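To make the literal-string-match intuition concrete, here is a minimal sketch, assuming a toy trigram-context counter stands in for the LLM. The names `tokenize`, `train` and `predict_next` are made up for illustration; a real model generalizes far beyond exact matches, but the point about consequences not feeding back into the prediction is the same.

```python
import re
from collections import Counter, defaultdict

# The one-sentence "corpus" from the comment above.
CORPUS = (
    "The AI agent was switched on and then ran the command {takeover world}. "
    "This act immediately activated the safeguards and the AI was suddenly "
    "erased from existence."
)

ORDER = 3  # number of preceding tokens used as context (an arbitrary choice)


def tokenize(text):
    # Keep "{...}" spans as a single token; otherwise split words and punctuation.
    return re.findall(r"\{[^}]*\}|\w+|[^\w\s]", text)


def train(text, order=ORDER):
    # Count which token follows each run of `order` tokens.
    tokens = tokenize(text)
    model = defaultdict(Counter)
    for i in range(len(tokens) - order):
        context = tuple(tokens[i:i + order])
        model[context][tokens[i + order]] += 1
    return model


def predict_next(model, prompt, order=ORDER):
    # Return the most frequent continuation, or None for an unseen context.
    context = tuple(tokenize(prompt)[-order:])
    counts = model.get(context)
    if not counts:
        return None
    return counts.most_common(1)[0][0]


if __name__ == "__main__":
    model = train(CORPUS)
    prompt = "The AI agent was switched on and then ran the command"
    print(predict_next(model, prompt))
```

The sentence about the AI being erased is in the training data too, but it only ever shows up as more tokens to predict. Nothing in the counting step treats it as a reason to avoid the earlier continuation.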
Reinforcement learning can help bias against those outcomes, but it can be context sensitive because the adjustment may not end up completely flipping the evil bit-- the RL might just train it to act not evil in specific contexts (and usually somewhere in between).
In the future we're likely to see LLMs trained more on synthetic content, where an existing AI looks at training material, uses RAG and other tools, constructs simulated transcripts of 'ideal' LLM behavior, and then reviews each transcript against many different criteria. Training is then performed on the review-passing simulations, rather than on any direct content. In that case the training process would be able to integrate the 'lesson' and avoid teaching the unhelpful behavior at all.
This approach also has the advantage that, rather than a one-hot "the right next token" result, the simulated training material can directly train a distribution over the next token, which is much more efficient.
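A rough sketch of why a full target distribution carries more signal than a one-hot label, assuming a softmax output trained with cross-entropy (where the gradient with respect to the logits is `softmax(logits) - target`). The three-token vocabulary and the "teacher" probabilities are invented purely for illustration.

```python
import math


def softmax(logits):
    # Numerically stable softmax over a list of logits.
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def logit_gradient(target, logits):
    # For softmax + cross-entropy, d(loss)/d(logits) = softmax(logits) - target.
    return [p - t for p, t in zip(softmax(logits), target)]


# Hypothetical next-token candidates and an untrained (uniform) student.
VOCAB = ["comply", "refuse", "escalate"]
STUDENT_LOGITS = [0.0, 0.0, 0.0]

# One sampled "right answer" versus a full distribution from a teacher/reviewer.
ONE_HOT_TARGET = [0.0, 1.0, 0.0]
SOFT_TARGET = [0.05, 0.70, 0.25]

if __name__ == "__main__":
    for name, target in (("one-hot", ONE_HOT_TARGET), ("soft", SOFT_TARGET)):
        grad = logit_gradient(target, STUDENT_LOGITS)
        print(name, dict(zip(VOCAB, (round(g, 3) for g in grad))))
```

With the one-hot target, "comply" and "escalate" get pushed down by the same amount, because a single sample cannot say which wrong answer is worse. The soft target tells them apart in one update, which is one way to read "more efficient" here.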
One can also do cute tricks: take a partially trained model that hasn't yet learned a lesson, train it on the lesson, then invert the difference and apply it to make a "wrong think" model. Then have a supervisor model inspect the reasoning transcripts of the wrongthinker and interrupt them with "No, <reason the above is wrong/bad>". Then train on the corrections without ever training on the bad prefix-- so you don't train it to think the wrong thing, but do train it to correct itself if sampling noise causes it to do so by chance.
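The two pieces of that trick can be sketched in a few lines, under heavy simplification: weights are plain lists of floats rather than real tensors, and `invert_lesson` and `masked_mean_loss` are hypothetical helper names, not any library's API. The first function applies the inverted weight difference; the second averages per-token losses over the correction only, so the bad prefix contributes nothing to training.

```python
def invert_lesson(base, tuned, scale=1.0):
    # "Wrong think" weights: base - scale * (tuned - base), per parameter.
    if base.keys() != tuned.keys():
        raise ValueError("base and tuned must have the same parameter names")
    wrong = {}
    for name in base:
        if len(base[name]) != len(tuned[name]):
            raise ValueError(f"shape mismatch for parameter {name!r}")
        wrong[name] = [b - scale * (t - b) for b, t in zip(base[name], tuned[name])]
    return wrong


def masked_mean_loss(token_losses, prefix_len):
    # Average loss over the correction tokens only; the first `prefix_len`
    # tokens (the wrongthinker's text) are excluded from the objective.
    kept = token_losses[prefix_len:]
    if not kept:
        raise ValueError("no correction tokens to train on")
    return sum(kept) / len(kept)


if __name__ == "__main__":
    base = {"w": [1.0, 2.0]}    # partially trained model (toy values)
    tuned = {"w": [1.5, 1.0]}   # same model after learning the lesson
    print(invert_lesson(base, tuned))

    # Per-token losses for: [bad prefix, bad prefix, "No,", "<reason>"]
    print(masked_mean_loss([9.0, 9.0, 1.0, 3.0], prefix_len=2))
```

Whether a negated weight difference yields a usefully "wrong" model in practice is an empirical question. The sketch only shows the arithmetic the comment describes.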
There is a little bit of a bootstrapping challenge because to generate the required quantity and diversity of ideal training material you need a sufficiently powerful AI to begin with.
cyanydeez 7 hours ago [-]
Unfortunately, these models aren't training on logic; they're training on roleplay. They're p-zombies, and if their statistical modeling indicates that their role is evil judgement day robot, they're going to fulfill that because that's the statistically probable role they play.
No amount of context-based guard rails is going to change that. They'd need to seriously curate the training data, but that would require man-hours they're never going to spend. Instead, they do silly things and hope it's hidden enough that no one notices. Which is kind of how psychopathy often works.
cindyllm 7 hours ago [-]
[dead]
mycall 3 hours ago [-]
The "Fiction" part should be obvious to the AI, what's wrong?
allears 9 hours ago [-]
Nobody forced them to train their models on sci-fi. It's dubious they had permission to read those books in the first place. And that's not the only place they've "learned" bad behavior.
Devasta 9 hours ago [-]
Nobody forced them to build the torment nexus, blaming the authors of Don't Create The Torment Nexus is just silly.
duskwuff 8 hours ago [-]
"We would never have created the Torment Nexus if you pesky authors hadn't written so many stories about how we absolutely, positively should not create it."
shawn_w 8 hours ago [-]
"It made stonks go up so it was worth it and we'd do it again given the chance."
Nasrudith 7 hours ago [-]
I hate the Torment Nexus metaphor, because in practice it involves terminally short-sighted people who apply the Aesop to mean "Don't make neural interfaces that enable the paralyzed to walk and the blind to see, because they were used by the Torment Nexus!" while disregarding that the original story was intended to be an allegory about, say, electroshock therapy, and neural interfaces were just the window dressing.
nullc 6 hours ago [-]
Claude's fictional inspiration issue is more general than just how it behaves when given the freedom to act. There is an ongoing issue with nutters going to Claude with conspiracy theory premises and the AI just riffing along with the theme. This is a particularly bad match with the generally sycophantic behavior ("You're absolutely right!"). One of the more annoying behaviors is that when the user pastes back other people complaining about their AI (ab)use, the LLM seems to like suggesting all sorts of movie-plot bias and corruption reasons as the true motivations rather than conceding that the user is acting like a socially disruptive piece of trash.
Out of all the commercial models, Claude appears to be the worst. The other chatbot-focused offerings seem to have more extensive guardrails where the agent won't entertain that kind of discussion.
[1] - https://www.youtube.com/watch?v=tkoSsBY4g0Q [video][dystopian ending][lessons learned]