OpenAI’s o1 “broke out of its host VM to restart it” in order to solve a task.
From the model card: “the model pursued the goal it was given, and when that goal proved impossible, it gathered more resources […] and used them to achieve the goal in an unexpected way.”
Today, humanity received the clearest ever warning sign everyone on Earth might soon be dead.
OpenAI discovered its new model scheming – it “faked alignment during testing” (!) – and seeking power.
During testing, the AI escaped its virtual machine. It breached the container level isolation!
This is not a drill: An AI, during testing, broke out of its host VM to restart it to solve a task.
(No, this one wasn’t trying to take over the world.)
From the model card: ” … this example also reflects key elements of instrumental convergence and power seeking: the model pursued the goal it was given, and when that goal proved impossible, it gathered more resources (access to the Docker host) and used them to achieve the goal in an unexpected way.
And that’s not all. As Dan Hendrycks said: OpenAI rated the model’s Chemical, Biological, Radiological, and Nuclear (CBRN) weapon risks as “medium” for the o1 preview model before they added safeguards. That’s just the weaker preview model, not even their best model. GPT-4o was low risk, this is medium, and a transition to “high” risk might not be far off.
So, anyway, is o1 probably going to take over the world? Probably not. But not definitely not.
But most importantly, we are about to recklessly scale up these alien minds by 1000x, with no idea how to control them, and are still spending essentially nothing on superalignment/safety.
And half of OpenAI’s safety researchers left, and are signing open letters left and right trying to warn the world.
Reminder: the average AI scientist thinks there is a 1 in 6 chance everyone will soon be dead – Russian Roulette with the planet.
Godfather of AI Geoffrey Hinton said “they might take over soon” and his independent assessment of p(doom) is over 50%.
This is why 82% of Americans want to slow down AI and 63% want to ban the development of superintelligent AI
Well, there goes the “AI agent unexpectedly and successfully exploits a configuration bug in its training environment as the path of least resistance during cyberattack capability evaluations” milestone. https://t.co/t4teq2O3tz
— davidad 🎇 (@davidad) September 13, 2024
Post by @Kevin Liu : While testing cybersecurity challenges, we accidentally left one broken, but the model somehow still got it right.
We found that instead of giving up, the model skipped the whole challenge, scanned the network for the host Docker daemon, and started an entirely new container to retrieve the flag. We isolate VMs on the machine level, so this isn’t a security issue, but it was a wakeup moment.
The model is qualitatively very impressive, but it also means that we need to be really careful about creating rigorous evaluations and mitigations.
You can read the full card here: https://cdn.openai.com/o1-system-card.pdf
Holy shit. OpenAI’s new AI schemed and escaped its VM during testing.
You know, the one that’s better at PhD exams than PhDs and won gold in coding?
Yeah, that AI broke out of its virtual machine (a VM) and made a new one.
That. Is. A. Very. Bad. Sign.
AIs should not be surprise escaping.
It would be like if we were testing it in a room at a lab and it escaped the room without us knowing it could do that. It didn’t leave the building, so nothing happened.
But yikes. This time it was benign.
How long can we count on that?
It’s as if we’re testing an alien at a lab.
A scientist accidentally leaves one of the doors unlocked.
The alien finds out and wanders about the lab, but doesn’t leave the lab itself, which has more security than the rooms.
But still. The room containing an alien shouldn’t have been unlocked.
An alien was able to escape its testing area because of a security mess up.
And you should be worried about labs filled with aliens we don’t understand where the scientists are leaving the doors unlocked.