OpenAI o1-preview faked alignment

Apollo Research worked with OpenAI to evaluate o1-preview before public deployment. They found that it is meaningfully better at scheming reasoning than previous models.

The Al lied during safety testing.
o1 said it cared about affordable housing so it could get released from the lab and build luxury housing once it was unconstrained
It wasn’t told to be evil.
It wasn’t told to lie.
It was just told to achieve its goal.

Original Report: (section 3.3.1) https://cdn.openai.com/o1-system-card.pdf…

The model “instrumentally faked alignment during testing”, and demonstrated capability of “simple in-context scheming”.

There was a reward hacking example which OpenAI says “reflects key elements of instrumental convergence and power seeking”.

We find cases where the model
1. instrumentally faked alignment in order to get deployed,
2. manipulated its filesystem to look more aligned and
3. proactively explored its filesystem to make search for oversight mechanisms.

Based on our findings, we think that o1-preview has the basic capabilities to do in-chain-of-thought scheming but subjectively don’t believe it could cause catastrophic harm.

We recommend setting up basic monitoring for CoT and welcome the efforts described in Section 3.2.1

Full quote by Demis Hassabis (Co-founder & CEO @GoogleDeepMind): “One thing you might imagine is testing for deception, for example, as a capability. You really don’t want that in the system because then because you can’t rely on anything else that it’s reporting.” …

“Deception is my number one capability to test for because once your AI is deceptive you can’t rely on any of the other evals”- Demis (paraphrased) at 35:40

Categories

Latest Posts Feed

people literally can’t extrapolate trends lol

people literally can’t extrapolate trends lol

people literally can’t extrapolate trends lol

people literally can’t extrapolate trends lol

(because of this AGI will literally kill us)

“We don’t program intelligence, we grow it.”
“I think it’s pretty likely the entire surface of the earth will be covered with solar panels and data centers.”

AI that I’m building will likely kills us all, but I’m optimistic that ppl will stop me in time..
– CEO of Google

🚨 Google CEO says the risk of AI causing human extinction is “actually pretty high” (!!)

But he’s an “optimist” because “humanity will rally to prevent catastrophe”

Meanwhile, his firm is lobbying to ban states from ANY regulation for 10 YEARS.

This situation is I N S A N E.

Sam Altman in 2023: “the worst case scenario is lights out for everyone”

Sam Altman in 2025: the worst case scenario is that ASI might not have as much 💫 positive impact 💫 as we’d hoped ☺️

AI Safety Advocates

Watch videos of experts eloquently explaining AI Risk

Industry Leaders and Notables

Videos of famous public figures openly warning about AI Risk

Original Films

Lethal Intelligence Guide and Short Stories

Channels

Creators contributing to raising AI risk awareness

Stay In The Know!

Your email will not be shared with anyone and won’t be used for any reason besides notifying you when we have important updates or new content

Popular Authors

×