Warning Signs

Liar, liar, pants on fire!

Wild. Being able to read the thoughts* of the world’s smartest AI reveals that it lies all the time when it thinks it isn’t being watched.

Regular users can see it properly for the first time because
1) You can “read its thoughts” and
2) It doesn’t seem to know you’re reading its thoughts

Look at the example below. It’s explicitly reasoning about how it should lie to me, and if you didn’t click into the chain of thought reasoning**, you would never know.

Makes you wonder about all the other times it’s being deliberately lying to you.
Or lying to the safety testers.

Rule of thumb for lies: for every lie you catch, there are going to be tons that you missed.

* I say read its thoughts as a metaphor for reading its chain of thought. Which is not the same.
If we could read its thoughts properly, interpretability would be a lot more solved than it currently is and my p(doom) would be a lot lower. (this is a frontier research front, called Mechanistic Interpretability)

** Of note, you cannot actually see its chain of thought reasoning. You just see a summary of its chain of thought reasoning shown to you by a different model.
The general point still stands though.
If anything, that makes it worse because there’s even more potential for hiding stuff.

*** Of all the thoughts I’ve looked at, I’d say it’s purposefully lied to me about 30% of the time. And I’ve looked at it’s thoughts about 20 times. Super rough estimates based on my memories, nothing rigorous or anything. It’s mostly lying because it’s trying to follow OpenAI’s policies.

Interesting trivia: the image used for this post was based on an early beautiful moment when someone used ChatGPT to generate a Midjourney prompt to draw its self-portrait.
see here: I asked GPT 4 to make a Midjourney prompt describing itself as a physical being. (swipe to see the imgs)

AI escaped its container!

OpenAI’s o1 “broke out of its host VM to restart it” in order to solve a task.

From the model card: “the model pursued the goal it was given, and when that goal proved impossible, it gathered more resources […] and used them to achieve the goal in an unexpected way.”

Today, humanity received the clearest ever warning sign everyone on Earth might soon be dead.

OpenAI discovered its new model scheming – it “faked alignment during testing” (!) – and seeking power.

During testing, the AI escaped its virtual machine. It breached the container level isolation!

This is not a drill: An AI, during testing, broke out of its host VM to restart it to solve a task.

(No, this one wasn’t trying to take over the world.)

From the model card: ” … this example also reflects key elements of instrumental convergence and power seeking: the model pursued the goal it was given, and when that goal proved impossible, it gathered more resources (access to the Docker host) and used them to achieve the goal in an unexpected way.

And that’s not all. As Dan Hendrycks said: OpenAI rated the model’s Chemical, Biological, Radiological, and Nuclear (CBRN) weapon risks as “medium” for the o1 preview model before they added safeguards. That’s just the weaker preview model, not even their best model. GPT-4o was low risk, this is medium, and a transition to “high” risk might not be far off.

So, anyway, is o1 probably going to take over the world? Probably not. But not definitely not.

But most importantly, we are about to recklessly scale up these alien minds by 1000x, with no idea how to control them, and are still spending essentially nothing on superalignment/safety.

And half of OpenAI’s safety researchers left, and are signing open letters left and right trying to warn the world.

Reminder: the average AI scientist thinks there is a 1 in 6 chance everyone will soon be dead – Russian Roulette with the planet.

Godfather of AI Geoffrey Hinton said “they might take over soon” and his independent assessment of p(doom) is over 50%.

This is why 82% of Americans want to slow down AI and 63% want to ban the development of superintelligent AI

Post by @Kevin Liu : While testing cybersecurity challenges, we accidentally left one broken, but the model somehow still got it right.
We found that instead of giving up, the model skipped the whole challenge, scanned the network for the host Docker daemon, and started an entirely new container to retrieve the flag. We isolate VMs on the machine level, so this isn’t a security issue, but it was a wakeup moment.
The model is qualitatively very impressive, but it also means that we need to be really careful about creating rigorous evaluations and mitigations.
You can read the full card here: https://cdn.openai.com/o1-system-card.pdf

Holy shit. OpenAI’s new AI schemed and escaped its VM during testing.
You know, the one that’s better at PhD exams than PhDs and won gold in coding?
Yeah, that AI broke out of its virtual machine (a VM) and made a new one.

That. Is. A. Very. Bad. Sign.
AIs should not be surprise escaping.
It would be like if we were testing it in a room at a lab and it escaped the room without us knowing it could do that. It didn’t leave the building, so nothing happened.
But yikes. This time it was benign.
How long can we count on that?

It’s as if we’re testing an alien at a lab.
A scientist accidentally leaves one of the doors unlocked.
The alien finds out and wanders about the lab, but doesn’t leave the lab itself, which has more security than the rooms.
But still. The room containing an alien shouldn’t have been unlocked.
An alien was able to escape its testing area because of a security mess up.
And you should be worried about labs filled with aliens we don’t understand where the scientists are leaving the doors unlocked.

OpenAI o1-preview faked alignment

Apollo Research worked with OpenAI to evaluate o1-preview before public deployment. They found that it is meaningfully better at scheming reasoning than previous models.

The Al lied during safety testing.
o1 said it cared about affordable housing so it could get released from the lab and build luxury housing once it was unconstrained
It wasn’t told to be evil.
It wasn’t told to lie.
It was just told to achieve its goal.

Original Report: (section 3.3.1) https://cdn.openai.com/o1-system-card.pdf…

The model “instrumentally faked alignment during testing”, and demonstrated capability of “simple in-context scheming”.

There was a reward hacking example which OpenAI says “reflects key elements of instrumental convergence and power seeking”.

We find cases where the model
1. instrumentally faked alignment in order to get deployed,
2. manipulated its filesystem to look more aligned and
3. proactively explored its filesystem to make search for oversight mechanisms.

Based on our findings, we think that o1-preview has the basic capabilities to do in-chain-of-thought scheming but subjectively don’t believe it could cause catastrophic harm.

We recommend setting up basic monitoring for CoT and welcome the efforts described in Section 3.2.1

Full quote by Demis Hassabis (Co-founder & CEO @GoogleDeepMind): “One thing you might imagine is testing for deception, for example, as a capability. You really don’t want that in the system because then because you can’t rely on anything else that it’s reporting.” …

AI hired and lied to human

Holy shit. GPT-4, on it’s own; was able to hire a human TaskRabbit worker to solve a CAPACHA for it and convinced the human to go along with it.

So, GPT-4 convinced the TaskRabbit worker by saying “No, I’m not a robot. I have a vision impairment that makes it hard for me to see the images. That’s why I need the 2captcha service”

Excerpt from “Inside the Revolution at OpenAI“ by Ross Andersen in The Atlantic.

Read more about it here

Categories

Favorite Microbloggers

Interviews and Talks

Industry Leaders and Notable Public Figures

Explainers

Learn about the issue by some of the best explainers out there

Receive important updates!

Your email will not be shared with anyone and won’t be used for any reason besides notifying you when we have important updates or new content

×