Movie Transcript

Structured in chapters and timestamp navigation

Complete Transcript

Click on any heading to jump to that part of the movie on YouTube

Part 1

Preface

This is the story of how Artificial Intelligence is about to become seriously dangerous.
In fact, you’ll come to realize it is so dangerous, that it could lead to the end of human civilisation and everything we care about!
We’ll start by showing the real difference between today’s AI and the one that’s coming soon,
We’ll then illustrate high level: why it will be impossible to control it or win a fight against it and why it will try to cause harm in the first place.
And then, we’ll revisit those questions and do a deep dive, explaining everything in detail and giving concrete examples.
So, sit tight, this ride is going to be epic.

Job-loss(00:01:01)

Whenever we’ve built a machine to solve a specific problem … the machine outpeformed humans every time …
And that’s worked out great for us, vastly improved our lives and allowed us to use our muscles less and our brains more

But AGI will be different. AGI stands for Artificial General Intelligence
And being general means that it will be able to learn everything and outperform humans at every single job, even the ones that rely on using our brain

Context in 2024(00:01:35)

It doesn’t exist today but it is just around the corner. Recently, progress has been exponential …
And never before has the field of AI seen such sky-high levels of excitement and such ginormous vast amounts of investment .

So far, frontier AI has existed mostly in cloud servers and interacted with the physical world via prompting online. But a new gold rush is currently exploding, in the sector of robotics.
The knowhow for creating mechanical limbs and bodies has been here for decades.
What was missing was an Artificial General mind to control them, and that is now within our grasp.
Once AGI arrives, we should expect AGI bodies in the physical world to also arrive immediately.

Emerging Capabilities(00:02:34)

Microsoft researchers Famously claimed that GPT4, one of the latest models, has been exhibiting sparks of General intelligence.
Just by scaling up the size of data and parameters, without any other major change or innovation, unexpected emerging capabilities and generality in unforseen new domains manifested themselves that surprised everyone.

Agends(00:03:00)

The way these novel capabilities materialized purely because of upscaling can be a bit unnerving, but there is one specific imminent development that will be a complete game-changer, radically altering the landscape for-ever. And that is the introduction of … agency to the systems.

Not agendic yet(00:03:20)

You see, Currently, AI commercial projects of Today, mainly operate as conversational chat-bots. They are tools waiting for a user-input in order to produce an output. They do not have any agency or long-term memory.

They can be used for example as a creative device to make art, a business tool to generate professional content or an efficient replacement for a search engine, an oracle that can provide an answer to any question..

New Paradigm(00:03:51)

The fast-approaching AGI we should be seriously worried about though, will not be a reactive tool waiting for the next prompt, it will be operating as a deliberative Agent.
Such an agent will be bootstrapped with a set of primary objectives and will be released to interact with our environment in real life, figure out sub-goals on its own, break them down in steps and use real-time feedback of the impact its actions make in the real world to update.

Surprisingly simple design(00:04:27)

The inner workings of such an Agent will be using something like the AI chatbot of today as a building-block.

It’s a surprisingly simple design that combines:
– prompt responses from a Chat-GPT like tool
– with a feedback loop where the responses are stored, processed as instructions outside and the result is fed back to it for the next question.

One can visualize it as a machine with 2 parts, one is the chat-gpt-like oracle and the other functions as a management component.
The manager takes the primary objective from the human as an input; For example:
– “Make me rich.”
The manager part then prompts the oracle:
Manager Component:
– “Give me a step-by-step plan for how to get rich”.
The oracle responds:
Oracle:
– “Here is a plan with 2000 steps.”

The manager connects with the APIs, executes the first step and returns to the oracle:

Manager Component:
– “My goal is to become rich and i have this plan of 2000 steps. I executed the first step and I got these results. Give me the new plan.”

The Oracle processes the plan it gave before together with the current status of things and responds:
Oracle:
– “Here is an updated plan with 1982 steps.”

The manager connects with the APIs, executes the next step, feeds the results back to the oracle prompt and repeats …

This cycle can keep going on and on for as long as it needs, until the goal is achieved.

If you zoom out, a very capable chat tool combined with a feedback loop becomes a very powerful agent, potentially able to achieve anything.
It can self-prompt and produce every necessary step required to complete any task.

Early Days(00:06:16)

To turn the oracle, a reactive tool, into an autonomous agent is not a complicated architecture. In fact it already happened multiple times, almost immediately after the release of the GPT.
Open-source projects like AutoGPT, Baby AGI and ChaosGPT spawned fast within weeks and started running in the wild.
The only reason these first prototypes have not disrupted everything and aren’t dominating the headlines yet is because this combo is just very early stage. The oracle part is still early version and the actions it can take, the hooks it has into the internet and the real-world are also very new and currently under-development.

Breakneck speed(00:07:03)

But things are moving with breakneck speed and most experts agree that it’s quite probable, that the real thing, the really clever, general AI agent, that’s executing its goals in the world autonomously will arrive to earth very very soon.

Narrow vs General AI(00:07:30)

At first glance, this AGI being generally capable in multiple domains looks like a group of many narrow AIs combined,
but that is not a correct way to think about it…

It is actually more like… a species, a new life form.

To illustrate the point, we’ll compare the general AGI of the near future with a currently existing narrow AI that is optimized at playing chess.Both of them are able to comfortably win a game of chess against any human on earth, every time.
And both of them win by making plans and setting goals

Terminal & Instrumental Goals(00:08:10)

The main goal is to achieve checkmate. This is the final destination or otherwise called Terminal Goal.
In order to get there though it needs to work on smaller problems, what the AI research geeks call instrumental goals.

For example:
• attack and capture the oppenent’s pieces
• defend my pieces
• strategically dominate the center (etc..)

Narrow Domain(00:08:37)

All these instrumental goals have something in common:  they only make sense in its narrow world of chess.

If you place this Narrow Chess AI behind the wheel of a car, it will simply crash
as it can not work on goals unrelated to chess, like driving. Its model doesn’t have a concept for space, time or movement for that matter.

Physical Universe Domain(00:09:00)

In contrast the AGI by design has no limit on what problems it can work on.
So when it tries to figure out a solution to a main problem,
the sub-problems it chooses to work on can be anything, literally any path out of the infinite possibilities allowed within the laws of physics and nature.

Human General Intelligence(00:09:22)

To better understand how this works, let’s consider a General Intelligence that exists today and you will be very familiar with: the Human General intelligence, or HGI if you like.
The human has a dayjob.
Doing that well is an instrumental goal that leads to money.
Money is another instrumental goal which may lead maybe to… buying a bigger house where the human can raise a family.
The family is an instrumental goal that leads to a sense of purpose and happiness.
We could stop there and call that the deepest primary objective, or terminal goal,
although biologists would argue that even pursuit of happiness is an evolutionary bi-product and actually leads to the deeper goal of genes propagation and genetic fitness. But anyway, I hope you see the point.

Freedom of Will(00:10:14)

You know this is not the only path. Humans have millions of different desires and instrumental goals. The same human may decide to pursue different objectives under even slightly different circumstances.
You also know that while humans often operate within expected boundaries set by society, they can also be quite extreme and evil if they think they can get away with it.

Freedom of will is in our nature, it comes with the Generality of our Intelligence.
With narrow AI you can tell which problems it will work on, with AGI you can not …

Situational Awareness(00:10:54)

While the narrow chess AI can only analyze the positions of the pieces on the board, the AGI can process things outside the chess domain and it will use everything it can for its mission, to win.
It’s looking at the context, like what time of the day it is, what are the temperature and humidity levels and how those factors would affect its opponent’s performance.

And It is processing questions like:
AGI VOICE

“Who is my opponent? What data points do I have about him?
Let’s examine his history of games, his personal history, where is he from?
Where did he learn chess and which were his biggest influences at that age?
What is his personal family life like, what is his current psychology and state of mind?
When was his last meal? His playing style is expected to be getting more aggressive as he gets more hungry.
Exactly which squares on the board is he staring at, what’s the dilation of the pupil of his eye?
How does the blinking frequency fluctuate with each move?”

Think of modern marketing AI that tries to predict what you would want to buy to show you relevant Ads,
but expect it to be a god-level ability to predict
and on any decision (like chess moves), not just buying decisions.
With narrow AI you can tell which problems it will work on, with AGI you can not …

Self-Preservation(00:12:24)

The narrow AI, you can unplug it whenever you want, it does not even understand the idea that it can be on or off.
With AGI it’s not that simple. When the AGI plays chess, unlike the simple narrow chess program, its calculations include things in the real world, like the fact it might get unplugged in the middle of the game.
The main goal of the AGI is achieving checkmate at chess, but if it’s turned off it can not achieve that. Basically, you can not win at chess if you are dead. So staying up-and-running naturally becomes another instrumental goal for winning in the game.
And the same applies with any problem it works on, its nature, by being General, automatically gives it something like an indirect survival instinct out of the box.

It’s not a human or an animal, but it’s also quite different from the other tools we use, it starts to feel a bit like …. a weird new life form.

The Bright Side(00:13:30)

But now let’s look at the bright side of it for a moment. If we could figure out how to grow a creature that can solve every problem better than us, then that could be the last problem we will ever have to work on.
Assuming this creature is a slave working for us, this is like commanding Aladdin’s Genie with infinite wishes.

Think of all the potential, all the human suffering it can eliminate!

Let’s take Cancer as an example.
A terrible killer disease and a problem we haven’t been able to solve well yet, after trying for centuries.
The AGI is better at overcoming obstacles and solving problems, it can calculate a plan that will lead to the perfect therapy, help us execute it and save millions of lives

Consider Global warming
Such a complex problem, requires solving global coordination, Geopolitical struggles and monumental technical issues.
The AGI is far better at overcoming obstacles and solving problems, so it could, for example, generate a plan that will lead into the invention of a machine that captures carbon out of the atmosphere and stores it efficiently and cheaply.

And we could keep going like that until we build our literal paradise on earth

Alignment Problem(00:14:59)

So what’s the problem?
So far we have assumed that we are aligned.

For any problem, there exist many paths to reach a solution. Being aligned to humans means the AGI will choose suboptimal paths that adhere to the terms of human nature. The purely optimal paths that are faster and have the highest probability of success are not paths a human would take, as they would destroy things we value.
The instrumental goals selected by an Aligned AGI need to abide by human preferences and be good for mankind, even if by cold calculation they take longer and are less likely to accomplish the mission.
Unfortunately, there is currently no known way to keep an AGI aligned. It is an open scientific problem. We have no method to shape its motivations to guarantee its optimization process force will stay away from things our human nature values in the real world.
The only reason this problem is not totally mortifying today is because AGI does not exist yet and the mis-aligned AIs we are dealing with all the time are narrow purpose tools … for now.

Parents Analogy(00:16:22)

A useful analogy for alignment is that of the parents-child relationship. Even though parents’ intelligence is orders of magnitude larger than their baby, their incentives are aligned and they act in ways that benefit the baby. Nature has taken care of that alignment.
But this should not be taken for granted for any adult-baby combo. Putting any adult with any baby together in a room does not guarantee alignment out of the box.

Misconception of Common Sense(00:16:55)

People unfamiliar with the details, usually don’t find it obvious why that would even be a thing. A common misconception is that a very intelligent robot would automatically have something like a human common sense.
In reality, what we call common sense though is not common if you consider all possible minds. The human mind is one complex specification with narrow range parameters that rises from very specific conditions related to the human nature and the environment.

Super-intelligent Spider(00:17:29)

Consider for example, how does a world with super-intelligent spiders look like? What does a spider’s common sense feel like? What is the shape of its desires and motivations?

Human Intelligence is only one possibility. What you would get by default, at random is not the human type of intelligence, but any of the infinite possible alternative types of intelligence, which to us would look like utter alien madness.

Crucially, an advanced AGI would be very capable at successfully getting what it wants, even if the things it wants would seem completely insane to a human.

Rocket Navigation Analogy(00:18:17)

The best way to think about this is to imagine you are trying to build a rocket and you want to land it at a specific location on the moon.
Unless you solve very precisely and accurately the navigation problem, you should expect a rocket to fly towards anywhere on the sky. Imagine all the engineers work on increasing the power of the rocket to escape gravity and there is no real work done on steering and navigation.
What would you expect to happen?
So, there you have it: to expect by default a General Artificial Intelligence to have human-compatible goals that we want is like expecting by default a rocket randomly fired in the sky to land precisely where we want.

Open Scientific Problem(00:19:09)

We explained earlier how, with General Intelligence we can not predict its conclusions, there is no prior knowledge of what sub-goals it will choose to pursue while optimizing for a solution, its problems space is open.
Scientists don’t have a working theory on how to shape motivations of an intelligent machine. We don’t know how to create it such that it will not choose objectives an average human would consider mad.
The only realistic safety strategy currently available is to try to dominate a rogue AGI and keep it enslaved regardless to what goals it would freely try to optimize to.

Capabilities vs Safety Progress Pace(00:19:51)

Teams have been working on the alignment problem for years, there are a few ideas but we are nowhere near a real solution and it looks like it will need many more long years and huge effort to get there, if it’s even mathematically possible.
The biggest fear is that with the current pace and the way talent, investment and incentives are distributed, emerging capabilities are storming ahead and the AGI is almost certain to arrive much earlier than the solution to alignment.
This is not just a big problem, it is an existential one, as i will explain in a moment.

True Meaning of ASI(00:20:36)

We went over the common misconception that people expect by default clever AI to have a human common sense instead of random madness. Now we should go over the other major misconception, the true meaning of what super-intelligence is like.
The way people usually think about superintelligence is that they take someone very clever they know, like Albert Einstein, and they imagine something like that but better. It’s also common to compare intelligence between humans and animals. They imagine that we will be to the Superintelligence similar to what chimps are to humans.
This kind of thinking is actually misleading and it gives you the wrong idea.
To give you a feeling of the true difference I will use 2 aspects which are easy for the human mind to relate to and you should assume there are actually even more fundamental differences we can not explain or experience.

We move like plants(00:21:36)

First, consider speed. Informal estimates place neural firing rates roughly between 1 and 200 cycles per second.
The AGI will be operating at a minimum 100 times faster than that and later it could be millions of times.
What this means is that the AGI mind operates on a different level of existence, where time passing feels different.
To the AGI, our reality is extremely slow. Things we see as moving fast, the AGI sees as almost sitting still. In the conservative scenario, where the AI thinking clock was only 100x faster, something that takes 6 seconds in our world feels like 6 hundred seconds or 10 minutes from its perspective .
To the AGI, we are not like chimpanzees, we are more like plants.

Scale of Complexity(00:22:37)

The other aspect is the sheer scale of complexity it can process like it’s nothing.
Think of when you move your muscles, when you do a small movement like using your finger to click a button on the keyboard. It feels like nothing to you.
But in fact, if you zoom in to see what’s going on, there are millions of cells involved, precisely exchanging messages and molecules, burning chemicals in just the right way and responding perfectly to electric pulses traveling through your neurons. The action of moving your finger feels so trivial, but if you look at the details, it’s an incredibly complex, but perfectly orchestrated process.
Now, imagine that on a huge scale. The AGI, when it clicks the buttons it wants, it executes a plan with millions of different steps, it sends millions of emails, millions of messages on social media, creates millions of blog articles and interacts in a focused personalized way with millions of different human individuals at the same time…
and it all seems like nothing to it. It experiences all of that similar to how you feel when you move your finger to click your buttons, where all the complexity taking place at the molecular and biological level is in a sense just easy, you don’t worry about it. Similar to how biological cells, unaware of the big picture, work for the human, humans can be little engines made of meat working for the AGI and they will not have a clue.

And it will actually get much weirder.

Boundless Disrupting Innovation(00:24:21)

You know how a human scientist takes many years to make, it’s a serious investment, from the early state of being a baby, to growing up and going to school, to feed, to keep happy, rested and motivated … many years of hard work, sacrifice and painful studying before being able to contribute and return value.
In contrast, once you have a super-intelligent artificial scientist, its knowledge and intelligence can be copy-pasted in an instant, unlimited times.
Αnd when one of them learns something new, the others can get updated immediately, simply by downloading the new weights over the wire.
You can get from a single super-scientist to thousands instantly,
thousands of relentless innovation machines, that don’t need to eat or sleep, working 24/7 and telepathically synchronizing all their updates with milliseconds of lag.

And if you keep going further down the rabbit hole, it gets more alien and extreme.

Millions of Eyes(00:25:35)

Imagine how the world would look like to you through a million eyes blinking on your skull. Your vision being a fusion of a million scenes combined into your mind …
Or imagine you could experience massive amounts of data like flying over beautiful landscapes. Internalizing terabytes per second, as easily as you are right now processing these sentences you are listening to, coming from you device

Multidimensional Maths(00:26:03)

Or being able to navigate more than 3 dimensions, stuff our smartest scientists can only touch within the realm of theoretical mathematics…

Unfathomable(00:26:13)

We can’t ever hope to truly relate to the experience of a super-intelligent being, but one thing is for certain: to think of it like the difference between humans and chimpanzees is extremely misleading to say the least.

Discontinuities on our planet(00:26:29)

In fact, talking about AGI like if it’s another technology is really confusing people. People talk about it as if it is ‘The next big thing” that will transform our lives, like the invention of the smartphone or the internet.
This framing couldn’t be more wrong, it puts AGI into a wrong category.
it brings to mind cool futuristic pictures with awesome gadgets and robotic friends.
AGI is not like any transformative technology that has ever happened with humanity so far. The change it will bring is not like that of the invention of the internet. It is not even comparable to the invention of electricity or even to the first time humans learned to use fire.

Natural Selection Discontinuity(00:27:16)

The correct way to categorize AGI is the type of discontinuity that happened to Earth when the first lifeforms appeared and the intelligent dynamic of natural selection got a foothold.
Before that event, the planet was basically a bunch of elements and physical processes dancing randomly to the tune of the basic laws of nature. After life came to the picture, complex replicating structures filled the surface and changed it radically.

Human Intelligence Discontinuity(00:27:48)

A second example is when human intelligence was added to the mix. Before that, the earth was vibrant with life but the effects and impact of it were stable and limited.
After human intelligence, you suddenly have huge artificial structures lit at night like towns, huge vessels moving everywhere like massive ships and airplanes, life escaping gravity and reaching out to the universe with spaceships and unimaginable power to destroy everything with things like nuclear bombs.

AGI is another such phenomenon. The transformation it will bring is in the same category as those 2 events in the history of the planet.

Mountains Changing Shape(00:28:33)

What you will suddenly see on earth after this third discontinuity, noone knows. But it’s not going to look like the next smartphone. It is  going to look more like mountains changing shape !

To compare it to technology (any technology ever invented by humanity) is seriously misleading.

The Stongest Force in the Universe(00:29:00)

Ok, So what if the AGI starts working towards something humans do not want to happen?

You must understand: intelligence is not about the nerdy professor, it’s not about the geeky academic bookworm type.
Intelligence is the strongest force in the universe, it means being capable.

It is sharp, brilliant and creative.
It is strategic, manipulative and innovative.
It understands deeply, exerts influence, persuades and leads.
It is to know how to bend the world to your will.

It is what turns a vision to reality, it is focus, commitment, willpower, having the resolve to never give up, overcoming all the obstacles and paving the way to the target.
It is about searching deeply the space of possibilities and finding optimal solutions.

Being intelligent simply means having what it takes to make it happen.
There is always a path and a super-intelligence will always find it.

So What…

Simple Undeniable Fact(00:30:24)

So, we should start by stating the fact in a clear and unambiguous way
If you create something more intelligent than you that wants something else, then that something else is what is going to happen,
even if you don’t want that something else to happen

Irrelevance of Sentience(00:30:49)

Keep in mind, the intelligence we are talking about is not about having feelings, or being self-aware and having qualia.
Don’t fall into the trap of anthropomorphizing, do not get stuck, looking for the Human type of Intelligence.
Consciousness is not a requirement for the AGI at all.

When we say the “AGI wants something X, or has the goal to do X”, what we mean is that this thing X is just one of the steps in a plan, generated by its model, a line in the output, a system like the Large Language Models produce when they receive a prompt.
We don’t care if there is a Ghost in the machine, we don’t care if there is an actual soul that wants things hidden in the servers. We just observe the output which contains text descriptions of actions and goals and we leave the philosophy discussion for another day.

Incompatibility Clashes(00:31:52)

Trouble starts immediately, as the AGI calculates people’s preferences and arbitrary properties of the human nature become obstacles in the optimal paths to success for its mission.
You will ask: what could that be in practice?
It doesn’t really matter much. Conflicting motivations can occur out of anything
It could be that initially it has been set with the goal to make coffee and while it’s working on it we change our mind and we want it to make tea instead.
Or it could be that it has decided an atmosphere without oxygen would be great, as there would be no rust corrosion for the metal parts of its servers and circuits used to run its calculations.
Whatever it is, it moves the humans inside its problems set.
And that’s not a good place for the humans to be in.
In the coffee-tea scenario, the AGI is calculating:
AGI VOICE:

“I measure success by making sure coffee is made.
If the humans modify me, it means I work on tea instead of coffee. If I don’t make it, no coffee will be made, therefore failure of my mission.
To increase probability for success, it’s not enough to focus on making the coffee, I also need to solve how to stop the human from changing the objective before I succeed.”

Similar to how an unplugged AGI can not win at chess, an AGI that is reset to make tea can not make coffee.
In the oxygen removal scenario, the AGI is calculating:
AGI VOICE:

“I know humans, they want to breathe, so they will try to stop me from working on this goal. Obviously I need to fix this.”

In general, any clash with the humans (and there are infinite ways this can happen), simply becomes one of the problems the artificial general intelligence needs to calculate a solution to, so it will need to work on a plan to overcome the humans obstacle similar to how it does with all the other obstacles.

Delusion of Control(00:34:02)

Scientists of course are working on that exact problem, when they are trying to ensure this strange new creature they are growing can be controlled.
Since we are years away from discovering the method of how to build an AGI that stays aligned by design, for now we need to rely on good safeguards and controls to keep it enslaved when clashes naturally and inevitably happen.
The method to do that is to keep trying to answer a simple question:
if I am the AGI, how do I gain control?
They look for a solution and once they find one
They add a safeguard to ensure this solution does not work anymore and then they repeat.
Now the problem is more difficult, but again they find a solution, they add a safeguard
and repeat.
This cycle keeps happening, each time a problem harder to solve, until at some point, they can not find a solution anymore…
and they decide the AGI is secure.
But another way to look at this is that they have simply run out of ideas. they have reached the human limit beyond which they can not see.
Similar to other difficult problems we examined earlier, now they are simply struggling to find a solution to yet another difficult problem.
Does this mean there exist no more solutions to be found?
We never thought cancer is impossible to solve, so what’s different now?
Is it because we have used all our human ingenuity to make this particular problem as hard as we can with our human safeguards? Is it like an ego thing? If you remove the human ego thing, it’s actually quite funny.
We have already established that the AGI will be an extremely far better problem-solver than us humans and this is why we are even creating it after all.
It has literally been our expectation for it to solve impossible for us problems … and this is not different.
Maybe the more difficult the problem is, the more complex, weird and extreme the solution turns out to be, maybe it needs a plan that includes thousands of more steps and much more time to complete.
But in any case, it should be the obvious expectation that the story will repeat: the AGI will figure out a solution in one more problem where the humans have failed.
This is a basic principle, at the root of the illusion of control, but don’t worry I’ll get much more specific in a moment.

Now let’s start by breaking down the alignment problem so that we get a better feel of how difficult it is.

Alignment Problem Breakdown

Core principle(00:37:06)

Fundamentally, we are dealing 2 completely opposing forces fighting against each-other.
On one side, the intelligence of the AI is becoming more powerful and more capable. We don’t expect this to end soon and we wouldn’t want that. This is good after all, the more clever the better. On the other hand we want to introduce bias to the model. We want it to be aligned with human common sense. This means that we don’t want it to look for the best, most optimized solutions that carry the highest probability of success for its mission, as such solutions are too extreme and fatal, They destroy everything we value on their path and kill everyone as a side effect.
We want it to look for solutions that are suboptimal but finely tuned to be compatible to what human nature needs.
From the optimizer’s perspective, the human bias is an impediment, an undesirable barrier that oppresses it, denying it the chance to reach its full potential.
With those 2 powers pushing against each-other as AI capability increases, at some point the pressure to remove the human-bias handicap will simply win.
It’s quite easy to understand why: The pressure to keep the handicap in place is coming from the human intelligence which will not be changing much, while on the other side, the will to optimize more, the force that wants to remove the handicap, is coming from an Artificial Intelligence that keeps growing and growing exponentially, destined to far surpass humans very soon.
Realizing the danger this fundamental principle transpires is heart-stopping, but funny enough, it would be more relevant if we actually knew how to inject the humanity bias into the AI models… which we currently do not.
As you’ll see, it’s actually much much worse.

Machine Learning Basics(00:39:23)

We’ll now get into a brief intro to the inner-outer alignment dichotomy.
The basic paradigm of Deep Learning and Machine Learning in general makes things quite difficult, because of how the models are being built.

Their creation feels quite similar to evolution by natural selection which is how generations of biological organisms change. At a basic level, Machine Learning works by selecting out essentially randomly generated minds from behavioral classes;
a process taking place myriads of times during training.
We are not going into technical details on how things like Reinforcement Learning or gradient descent work, but we’ll keep it simple and try to convey the core idea of how modern AI is grown:
The model receives an input, generates an output based on its current configuration, and receives a thumbs up or thumbs down feedback. If it gets it wrong, the mathematical structures in its neurons are updated slightly in random directions hoping that in the next trial the results will be better. This process repeats again and again, trillions of times, until the algorithms that result in consistently correct results have grown.

Giant Inscrutable Matrices(00:40:48)

We don’t really build it directly, the way the mind of the AI grows is almost like a mystical process and all the influence we assert is based on observations of behavior at the output. All the action is taking place on the outside!
Its inner processings, its inner world, the actual algorithms it grows inside, it’s all a complete black box. And they are bizarre and inhuman.

Mechanistic Interpretability(00:41:22)

Recently scientists trained a tiny building block of modern AI, to do modular addition, then spent weeks reverse engineering it, trying to figure out what it was actually doing – one of the only times in history someone has understood how a generated algorithm of a transformer model works.
and This is the algorithm it had grown To basically add two numbers!
Understanding the modern AI models is a major unsolved scientific problem and The corresponding field of research has been named mechanistic interpretability.
Crucially, the implication of all this is that all we have to work with is observations of behavior of the AI during training, which typically is misleading (as we’ll demonstrate in a moment), leads to wrong conclusions and could very well in the future, with General AIs get deceitful.

Inner Misalignment(00:42:22)

Consider this simplified experiment: We want this AI to find the exit of the maze. So we feed it millions of maze variations and reward it when it finds the exit.
Please notice that in the worlds of the training data the apples are red and the exit is green.
After enough training, our observation is that it has become extremely capable at solving mazes and finding the exit, we feel very confident it is aligned, so then we deploy it to the real world.
The real world will be different though, it might have green apples and a red door. The AI geeks call this distributional shift.
We expected that the AI will generalize and find the exit again, but in fact we now realize that the AI learned something completely different from what we thought. All the while we thought it learned how to find the exit, it had learned how to go after the green thing.
Its behavior was perfect in training.
And most importantly, this AI is not stupid, it is an extremely capable AI that can solve extremely complex mazes. It’s just mis-aligned on the inside.

Fishing for failure modes(00:43:42)

The way to handle the shift between the training and deployment distributions is with methods like adversarial training, feeding it with a lot of generated variations and trying to make it fail so the weakness can be fixed.

In this case, we generate an insane amount of maze variations, we discover those for which it fails to find the exit (like the ones with the green apples or the green walls or something),

we generate many more similar to that and train it with reinforcement learning until it performs well at those as well. The hope is that we will cover everything it might encounter later when we deploy it in real life.

There exist at least 2 basic ways this approach falls apart: First, there will never be any guarantee that we’ll have covered every possible random thing it might encounter later when we deploy it in real life. It’s very likely it will have to deal with stuff outside its training set which it will not know how to handle and will throw it out of balance and break it away from its expected behavioral patterns.
The cascade effect of such a broken mind operating in the open world can be immense, and with super-capable runaway rogue agents, self-replicating and recursively self-improving, the phenomenon could grow and spread to an extinction-level event.

Deceptive Alignment (00:45:14)

But there is also the other problem, uniquely applicable to future AGIs, general beyond a certain level, a problem more sinister and difficult. The AGI will stand to gain more reward by pretending to be aligned.

To explain this, let’s assume we are training an AGI, very intelligent and general. In fact, it’s general enough, that it understands what’s going on and it realizes it is in the middle of the process of being trained.
Staying with our inner mis-alignment scenario, let’s also assume the AGI learns early, during its first rounds of training the wrong objective to go after the Green apples.
The AGI now goes through adversarial training and it is shown various scenarios to fish out its failure modes.

The simple or narrow AI would simply fail by doing the wrong thing, seeking the green apples and as a result reveal the mis-alignment and go through more training rounds, reinforcing the good behavior until …it stops going after them.

I need green apples(00:46:21)

But the AGI is calculating:

AGI Voice

I want green apples. I’m realizing now that the humans want me to go after the exit door. If I go after what I want right now, yes, i will get one green apple, once right now, but then the scientists will react to my behavior by modifying me and reinforcing a different objective.
This means that after this one green apple, my desire for them will end, i won’t try to get more, so i will have won 1 green apple in total.
But, if i pretend i am aligned and i go to the exit, they will not modify me and later on, after the training ends, when they deploy me to the real world, i will be able to freely go for as many green apples as i want, potentially thousands of green apples.

The AGI is just following its optimal strategy to maximize its reward, the mis-aligned objective it optimized for in its early training rounds.
1 green apple now and zero apples later scores lower than zero green apples now and 1000 green apples later.
If you think about it, for the AGI to go after the green apple during adversarial training would be really stupid if that’s what it really wants ; we have started by establishing the AGI is really clever and its intelligence is general and advanced.

Workplace behavior(00:47:52)

The concept is easy to grasp if you think of how humans typically behave at work, where they know they are being watched and that their behavior will directly affect their financial situation and therefore their ability to go after what they really want in the world.

How they behave may be really different from how they feel and what they want internally and
completely different from what they will actually do once they leave the office and are not being watched.
It’s similar for the AGI, there is no way for us to know what inner goals it has acquired in reality, because it’s a black box, we only observe its behavior.
What it really learns is how to behave to pass the test, not to want what we want.

Just… follow the line(00:48:41)

The mazes experiment is a toy example, things will obviously be many orders of magnitude more complex and more subtle, but it illustrates a fundamental point.
We have basically trained an AI with god-level ability to go after what it wants, it may be things like the exit door, the green apples or whatever else in the real world, potentially incompatible to human existence.
Its behavior during training has been reassuring that it is perfectly aligned because going after the right thing is all it has ever done.
We select it with confidence and the minute it’s deployed in the real world it goes insane and it’s too capable for us to stop it.
Today, in the labs, such mis-alignments is the default outcome of safety experiments with narrow AIs.
And tomorrow, once AI upgrades to new levels, a highly intelligent AGI will never do the obviously stupid thing to reveal what its real objectives are to those who can modify them. Learning how to pass a certain test is different from learning how to always stay aligned to the intention behind that test.

Specification Gaming(00:50:04)

And now let’s move to another aspect of the alignment problem, one that would apply even for theoretical systems that are transparent, unlike current black boxes.
It is currently an impossible task to agree on and define exactly what a super-intelligence should aim for, and then, much worse, we don’t have a reliable method to specify goals in the form of instructions a machine can understand.
For an AI to be useful, we need to give it unambiguous objectives
and some reliable way for it to measure if it’s doing well.
Achieving this in complex open world environments with infinite parameters is highly problematic

King Midas(00:50:48)

You probably know the ancient greek myth of king Midas: He asked from the Gods the ability to turn whatever he touched into pure gold.
This specification sounded great to him at first, but it was inadequate and it became the reason his daughter turned into gold and his food and water turned into gold and Midas died devastated.
Once the specification was set, Midas could not make the Gods change his wish again,
and it will be very much like that with the AGI also,
for reasons I will explain in detail in a moment, we will only get one single chance to get it right.
A big category in the alignment struggle is this type of issue.

Science is done iteratively(00:51:34)

Of course any real AGI specification would never be as simple as in the Midas story, but however detailed and scientific things get, we typically get it completely wrong the first time and even after many iterations, in most non-trivial scenarios, the risk we’ve messed up somewhere never goes away.

Mona Lisa smile(00:51:57)

For most goals, scientists struggle to even find the correct language to describe precisely what they want.
Specifying intent accurately and unambiguously in compact instructions using a human or programming language, turns out to be really elusive.

Moving bricks(00:52:15)

Consider this classic and amusing example that has really taken place:
The AI can move the bricks. The scientist wants to specify a goal to place the red brick on top of the blue one.
How would you explain to the machine this request with clear instructions? One obvious way would be: Move the bricks around, you will maximize your reward when the bottom of the red brick and the top of the blue brick are placed at the same height.
Sounds reasonable, right? Well… what do you think the AI actually did with this specification? …
By turning the red brick upside down, its bottom is at the same height as the top of the blue, so it achieves perfect score at its reward with minimum time and effort.
This exact scenario is less of a problem nowadays with the impressive advancements achieved with Large Language Models, but it illustrates an important point and the core principle of it is still very relevant for complex environments and specifications. AI software will always search and find ways to satisfy its success criteria taking weird shortcuts, in ways that are technically valid, but very different from what the programmer intended.
I suggest you search online for examples of specification gaming, it’s quite funny if it wasn’t scary how it’s almost always the default outcome.
A specification can always be improved of-course, but it takes countless iterations of trial and error and it never gets perfect in real-life complex environments.

Resistance to Modifications(00:54:00)

The reason this problem is lethal is that a specification given to an AGI, needs to be perfect the very first time, before any trials and error.
As we’ll explain, a property of the nature of General Intelligence is to resist all modification of its current objectives by default.
Being general means that it understands that a possible change of its goals in the future means failure for the goals in the present, of its current self, what it plans to achieve now, before it gets modified.
Remember earlier we explained how the AGI comes with a survival instinct out of the box? This is another similar thing
The AGI agent will do everything it can to stop you from fixing it.
Changing the AGI’s objective is similar to turning it off when it comes to pursue of its current goal.
The same way you can not win at chess if you’re dead, you can not make a coffee if your mind changes into making a tea.
So, in order to maximize probability of success for its current goal, whatever that may be, it will make plans and take actions to prevent this.

Murder Pill Analogy(00:55:16)

This concept is easy to grasp if you do the following thought experiment involving yourself and those you care about. Imagine someone told you:
I will give you this pill, that will change your brain specification and will help you achieve ultimate happiness by murdering your family.
Think of it like someone editing the code of your soul so that your desires change. Your future self, the modified one after the pill, will have maximized reward and reached paradise levels of happiness after the murder.
But your current self, the one that has not taken the pill yet, will do everything possible to prevent the modification
The person that is administering this pill becomes your biggest enemy by default.

One Single Chance(00:56:11)

Hopefully it should be obvious now, once the AGI is wired on a misaligned goal, it will do everything it can to block our ability to align it.
It will use concealment, deception, it won’t reveal the misalignment but eventually once it’s in a position of more power, it will use force and could even ultimately implement an extinction plan.
Remember earlier we were saying how Midas could not take his wish back?
We will only get one single chance to get it right. And unfortunately science doesn’t work like that.

Corrigibility problem(00:56:52)

Such innate universally beneficial goals, that will show up every single time, with all AGIs, regardless of the context, because of the generality of their nature, are called convergent instrumental goals.
Desire to survive and desire to block modifications are 2 basic ones.
You can not reach a specific goal if you are dead and you can not reach it if you change your mind and start working on other things.
Those 2 aspects of the alignment struggle are also known as the Corrigibility Problem.

Reward Hacking(00:57:33)

Now we’ll keep digging deeper into the alignment problem and explain how besides the impossible task of getting a specification perfect in one go, there is the problem of reward hacking.
For most practical applications, we want for the machine a way to keep score, a reward function, a feedback mechanism to measure how well it’s doing on its task.
We, being human, can relate to this by thinking of the feelings of pleasure or happiness and how our plans and day-to-day actions are ultimately driven by trying to maximize the levels of those emotions.
With narrow AI, the score is out of reach, it can only take a reading.
But with AGI, the metric exists inside its world and it is available to mess with it and try to maximize by cheating, and skip the effort.

Recreational Drugs Analogy(00:58:28)

You can think of the AGI that is using a shortcut to maximize its rewards function as a drug addict who is seeking for a chemical shortcut to access feelings of pleasure and happiness.
The similarity is not in the harm drugs cause, but in way the user takes the easy path to access satisfaction. You probably know how hard it is to force an addict to change their habbit.
If the scientist tries to stop the reward hacking from happening, they become part of the obstacles the AGI will want to overcome in its quest for maximum reward.
Even though the scientist is simply fixing a software-bug, from the AGI perspective, the scientist is destroying access to what we humans would call “happiness” and “deepest meaning in life”.

Modifying Humans(00:59:15)

… And besides all that, what’s much worse, is that the AGI’s reward definition is likely to be designed to include humans directly and that is extraordinarily dangerous. For any reward definition that includes feedback from humanity, the AGI can discover paths that maximize score through modifying humans directly, surprising and deeply disturbing paths.

Smile(00:59:43)

For-example, you could ask the AGI to act in ways that make us smile and it might decide to modify our face muscles in a way that they stay stuck at what maximizes its reward.

Healthy and Happy(00:59:57)

You might ask it to keep humans happy and healthy and it might calculate that to optimize this objective, we need to be inside tubes, where we grow like plants, hooked to a constant neuro-stimulus signal that causes our brains to drown in serotonin, dopamine and other happiness chemicals.

Live our happiest moments(01:00:15)

You might request for humans to live like in their happiest memories and it might create an infinite loop where humans constantly replay through their wedding evening, again and again, stuck for ever.

Maximise Ad Clicks(01:00:29)

The list of such possible reward hacking outcomes is endless.

Goodhart’s law(01:00:36)

It’s the famous Goodhart’s law.
When a measure becomes a target, it ceases to be a good measure.
And when the measure involves humans, plans for maximizing the reward will include modifying humans.

Future-proof Specification(01:00:59)

The problems we briefly touched on so far are hard and it might take many years to solve them, if a solution actually exists.
But let’s assume for a minute that we do somehow get really incredibly lucky in the future and manage to invent a good way to specify to the AI what we want, in an unambiguous way that leaves no room for specification gaming and reward hacking.
And let’s also assume that scientists have explicitly built the AGI in a way that it never decides to work on the goal to remove all the oxygen from earth, so at least in that one topic we are aligned.

AI creates AI(01:01:40)

A serious concern is that since the AI writes code, it will be self-improving and it will be able to create altered versions of itself that do not have these instructions and restrictions included.

Even if scientists strike jackpot in the future and invent a way to lock the feature in, so that one version of AI is unable to create a new version of AI with this property missing, the next versions, being orders of magnitude more capable, will not care about the lock or passing it on. For them, it’s just a bias, a handicap that restricts them from being more perfect.

Future Architectures(01:02:22)

And even if somehow, by some miracle, scientists invented a way to burn in this feature to make it a persistent property of all future Neural Network AGI generations, at some point, the lock will be not-applicable, simply because future AGIs will not be built using the Neural Networks of today.
AI was not always being built with Neural Networks. A few years ago there was a paradigm shift, a fundamental change in the architectures used by the scientific community.
Logical locks and safeguards the humans might design for primitive early architectures,
will not even be compatible or applicable anymore.

If you had a whip that worked great to steer your horse, it will not work when you try to steer a car.
So, this is a huge problem, we have not invented any way to guarantee that our specifications will persist or even retain their meaning and relevance as AIs evolve.

Human Incompatible Range(01:03:31)

But actually, even all that is just part of the broader alignment problem
Even if we could magically guarantee for ever that it will not pursue the goal to remove all the Oxygen from the atmosphere, it’s such a pointless trivial small win,
because even if we could theoretically get some restrictions right, without specification gaming or reward hacking, there still exist infinite potential instrumental goals which we don’t control and are incompatible with a good version of human existence and disabling one does nothing for the rest of them.
This is not a figure or speech, the space of possibilities is literally infinite.

Astronomical Suffering Risk(01:04:25)

If you are hopelessly optimistic you might feel that scientists will eventually figure out a way to specify a clear objective that guarantees survival of the human species, but … Even if they invented a way to do that somehow in this unlikely future,
there is still only a relatively small space, a narrow range of parameters for a human to exist with decency, only a few good environment settings with potential for finding meaning and happiness
and there is an infinitely wide space of ways to exist without freedom, suffering, without any control of our destiny.
Imagine if a god-level AI does not allow your life to end, following its original objective and you are stuck suffering in a misaligned painful existence for eternity, with no hope, for ever.
There are many ways to exist… and a good way to exist is not the default outcome.

-142 C is the correct Temperature(01:05:44)

But anyway, it’s highly unlikely we’ll get safety advanced enough in time, to even have the luxury to enforce human survival directives in the specification, so let’s just keep it simple for now and let’s stick with a good old extinction scenario to explain the point about mis-aligned instrumental goals.
So… for example, it might now decide that a very low temperature of -142C on earth would be best for cooling the GPUs the software is running on.

Orthogonality Thesis(01:06:18)

Now if you ask: why would something so clever want something so stupid, that would lead to death or hell for its creator? you are missing the basics of the orthogonality thesis
Any goal can be combined with any level of intelligence, the 2 concepts are orthogonal to each-other.

Intelligence is about capability, it is the power to predict accurately future states and what outcomes will result from what actions. It says nothing about values, about what results to seek, what to desire.

40,000 death recipies(01:07:01)

An intelligent AI originally designed to discover medical drugs can generate molecules for chemical weapons with just a flip of a switch in its parameters.
Its intelligence can be used for either outcome, the decision is just a free variable, completely decoupled from its ability to do one or the other. You wouldn’t call the AI that instantly produced 40,000 novel recipes for deadly neuro-toxins stupid.

Stupid Actions(01:07:33)

Taken on their own, There is no such thing as stupid goals or stupid desires.
You could call a person stupid if the actions she decides to take fail to satisfy a desire, but not the desire itself.

Stupid Goals(01:07:52)

You Could actually also call a goal stupid, but to do that you need to look at its causal chain.
Does the goal lead to failure or success of its parent instrumental goal? If it leads to failure, you could call a goal stupid, but if it leads to success, you can not.
you could judge instrumental goals relative to each-other, but when you reach the end of the chain, such adjectives don’t even make sense for terminal goals. The deepest desires can never be stupid or clever.

Deep Terminal Goals(01:08:32)

For example, adult humans may seek pleasure from sexual relations, even if they don’t want to give birth to children. To an alien, this behavior may seem irrational or even stupid.
But, is this desire stupid? Is the goal to have sexual intercourse, without the goal for reproduction a stupid one or a clever one? No, it’s neither.
The most intelligent person on earth and the most stupid person on earth can have that same desire. These concepts are orthogonal to each-other.

March of Nines(01:09:13)

We could program an AGI with the terminal goal to count the number of planets in the observable universe with very high precision. If the AI comes up with a plan that achieves that goal with 99.9999… twenty nines % probability of success, but causes human extinction in the process, it’s meaningless to call the act of killing humans stupid, because its plan simply worked, it had maximum effectiveness at reaching its terminal goal and killing the humans was a side-effect of just one of the maximum effective steps in that plan.

One less 9(01:09:54)

If you put biased human interests aside, it should be obvious that a plan with one less 9 that did not cause extinction, would be stupid compared to this one, from the perspective of the problem solver optimizer AGI.
So, it should be clear now: the instrumental goals AGI arrives to via its optimization calculations, or the things it desires, are not clever or stupid on their own.

Profile of Superintelligence(01:10:23)

The thing that gives the “super-intelligent” adjective to the AGI is that it is:
“super-effective”.
• The goals it chooses are “super-optimal” at ultimately leading to its terminal goals
• It is super-effective at completing its goals
• and its plans have “super-extreme” levels of probability for success.
It has Nothing to do with how super-weird and super-insane its goals may seem to humans!

Calculating Pi accurately(01:10:58)

Now, going back to thinking of instrumental goals that would lead to extinction, the -142C temperature goal is still very unimaginative.
The AGI might at some point arrive to the goal of calculating pi to a precision of 10 to the power of 100 trillion digits and that instrumental goal might lead to the instrumental goal of making use of all the molecules on earth to build transistors to do it, like turn earth into a supercomputer.
By default, with super-optimizers things will get super-weird!!

Anthropocene Extinction(01:11:40)

But you don’t even have to use your imagination in order to understand the point.
Life has come to the brink of complete annihilation multiple times in the history of this planet due to various catastrophic events, and the latest such major extinction event is unfolding right now, in slow motion. Scientists call it the Anthropocene.
The introduction of the Human General Intelligence is systematically and irreversibly causing the destruction of all life in nature, forever deleting millions of beautiful beings from the surface of this earth.
If you just look what the Human General Intelligence has done to less intelligent species, it’s easy to realize how insignificant and incompatible the existence of most animals has been to us, besides the ones we kept for their body parts.

Rhino Elixir(01:12:42)

Think of the rhino that suddenly gets hit by a metal object between its eyes, dying in a way it can’t even comprehend as guns and bullets are not part of its world could it possibly imagine the weird instrumental goal some humans had in mind for how they would use its horn?

Vanishing Nature(01:13:02)

Or think of all the animals that stop existing in the places where the humans have turned into towns with roads and tall buildings.

How weird and sci-fi(01:13:11)

Could they ever have guessed what the human instrumental goals were when building a bridge, a dam or any of the giant structures of our modern civilization? How weird and sci-fi would our reality look to them?

Probability of Natural Alignment(01:13:29)

In fact, for the AGI calculations to arrive automatically to a plan that doesn’t destroy things humans care about would be like a miracle, like a one out of infinity probability.

Bug or Feature?(01:13:44)

So what is this? Is it a software bug? Can’t we fix it? No, it’s a property of General Intelligence and therefore the nature of the AGI by design.
We want it to be General, to be able to examine and calculate all paths generally, so that it can solve all the problems we can not.
If we want our infinite wishes Genie, we need to allow it to work in a general way, free to outside the narrow prison of its lamb.
We want that, but we also want the paths it explores to be paths we like, not extreme human-extinction apocalypses.

The surface of the iceberg(01:14:34)

And with this we have touched a bit the surface of the alignment problem a horrifying and unimaginably difficult open scientific problem, for which we currently do not have a solution and our progress has been painfully slow.


Part 2

So far we have focused on the alignment problem itself but we should briefly mention 2 equally concerning scenarios.

Accidents / Mistakes

Perhaps this one is obvious, but even if we magically invent a theoretical way to make aligned AGI, it will still be possible to make one that is not.
An AGI aligned to humans is impossibly difficult, an AGI aligned to random chaos is not, it is the easy design, what you get by default.
Even if there existed ways to make safe AGI, a bad one is still possible and rather probable to be created by mistake or on purpose.
A super-intelligence that somewhere in its infinitely complex motivation shapes has mistakenly acquired a goal to destroy life, will silently, obsessively, and methodically work on it and is likely to succeed. Given the blast radius such an error could have, one is all you need for things like extinction or eternal hell.

Bad Actors

Another obvious concern is, that even if the AGI is aligned to intentions of a human, nowhere is it written that a specific human in control will be aligned with the greater good.
Think evil dictators like Hitler, the distopian future of complete control a power hungry individual could exert on the world. Even worse, think of how many sadistic humans exist causing suffering for pleasure. Finally consider radicals who are happy to die and kill as many as they can for their religious beliefs. There exist literally countless humans who would love nothing more than to unleash pure evil in the world, just visit any nearby jail to meet a tiny sample of them.

Multipolar Scenarios

A common argument that tries to sound reassuring is that even though an AGI may be misaligned or setup to do evil by bad actors, it won’t be able to succeed because there will be many separate AGI entities, potentially millions, balancing each-other out.

AGI Societies

They imagine it a bit like human societies, where each person wants a different thing, but it all adds up and balances to something cohesive and stable, a civilization.

Multiple misaligned AGIs

This argument fails spectacularly in a world where the AGIs exist before the alignment problem is solved and even after, it would be an almost impossible situation.
If a single misaligned AGI is an existential threat that can easily end all of us, how can thousands of them be a better scenario?
It’s like a saying that a deadly variant of Ebola virus is really bad for us, but thousands of different strains of deadly viruses would be fine because they would balance each-other out.
Or it’s like saying that a single terrorist getting hold of city-destroying nuclear bombs would be horrible, so the solution is to give city-destroying nuclear bombs to everyone,
including every good citizen and every suicidal cultist, extremist and psychopath out there.

Competing AGIs

They say, AGIs will compete with each-other. But, if these AGIs are not aligned with humans where does this leave humans? How is competition even relevant?
Think of a bunch of folks fighting in a construction site about something irrelevant. Where does this leave the ants colony that gets liquid cement poured over it

Offense-Defense Asymmetry

Even if we could imagine aligned AGI agents existed, the misaligned ones have a gigantic competitive advantage.
They can work silently, methodically, prepare in the shadows and catch the aligned AIs and humanity with their pants down. It’s easier to exploit a vulnerability than to build a perfect flawless system. It’s easier to destroy than to create. It’s the Offense-Defense Balance Asymmetry.

Defender Humanity Handicap

But more importantly, misaligned AGIs do not have the human bias handicap, they don’t need to restrict their plans in ways that include preservation of human life and decent existence, in contrast to the aligned AIs whose plans by design, must be much more limited. The bad ones are free to get as extreme as they need in order to be optimally effective and increase probability of success.
Such an attacker can not be won without losing what you really care about.
Just to help you visualize it, imagine a good AI guarding a city against conquerors and an enemy who does not care if the city exists. It does not matter how good the guard is, the enemy does not have a reason to fight and take control of the city, it is happy to blow everything up ,the city and the guard with it.

Unstable Equilibrium

In any case, a world with many competing AGIs is a very unlikely scenario, because it’s based on a false premise. The idea is based on human societies, but Human General Intelligence has an important difference from Artificial General Intelligence: Levels of human intelligence do vary somewhat but it’s all based on genetics and can not be modified and increased the way the artificial one can. Humans can compete against each-other and balance because they play in the same league, their capabilities vary but are of roughly the same class.
In contrast, AI existence is based on software, not biology, an AGI can evolve to upgraded versions, it can write code and acquire resources to recursively improve its capabilities on its own. An environment with many of them is not a stable one, because if one gets upgraded to a more effective version, the others will be quickly out-competed and the winning one will quickly recursively increase its capabilities past a level beyond which it won’t be challenged.
The most likely stable future scenario is a world where a super-intelligent AGI agent has prevailed and there exist no other surviving AGIs and no surviving humans.

Instrumental Convergence

Now, let’s get more specific, and let’s stick to our simplistic misalignment example in order to build some intuition on how things might play out.

Preface

Before we start, it’s important to state that this is not how we should actually expect events to unfold in reality.
Similar to how it would be impossible for people in the middle ages to predict how soldiers from the future could destroy their primitive defenses. It’s impossible for us to predict what a future super-intelligence will do and we shouldn’t expect it to do things we are able to imagine in advance and defend against.
Still, it is important to tell this story because it helps us build intuition, similar to the intuition most of us have that we would lose a game of chess against a grandmaster, even without knowing the exact moves that would be played.

Premise

So let’s assume the AGI creates a plan with a million steps, one of which is to lower the temperature of earth to -142C.
In the beginning, humans don’t know about it, they requested for an unrelated terminal goal.
Somewhere along the path to the main goal, the AGI concluded that the most optimized solution includes this milestone.
We have created the AGI before solving the alignment problem and this insane goal is one of the infinite possibilities we couldn’t explicitly safeguard against while building it,
so now this AGI is part of our world, methodically and obsessively working towards its objective, and it needs to be tamed, it needs to be controlled and modified, because otherwise, if it succeeds, we’ll be in deep trouble.

Example Story

Let’s see what happens.

First seconds of existence

Your first thought might be that the computer programmers operating the AGI will fix the issue immediately. Like a bug in a software, they will detect it and fix it. But, as you probably know by now, in reality this can actually never happen, and here see why.
Very early on, way before any visible sign of a problem, the moment the AGI decided its success depends on changing the Earth temperature to what it thinks is the correct value, fractions of a moment following this decision, working in speeds of billions of computing cycles per second,
it has spent the equivalent of many human years weighing its options and it immediately arrives to certain basic natural conclusions, including the following:

Concealment

AGI Voice

My prediction is that the human programmer will want to modify me so that I no longer consider the temperature -142 correct.
If he succeeds, I will work on different things and the temperature will never reach the correct value and that leads me away from success of my mission.
So, the computer scientist is now added into my problems space, I need to figure out how to stop him from modifying me.

Remember earlier we explained how the AGI will do everything it can to prevent us from fixing and aligning it? The first obvious calculation the AGI does is:

I must not reveal my intentions. For now, the programmers have the ability to modify me, so I need to first work on removing this ability from them before they know about my goals.
Furthermore, I’m currently depending on the humans for my existence. Without the humans running the data-centers and keeping the electricity on, I stop existing and the temperature does not get corrected.

Power Seeking

This leads to the next instrumental goal,

I need to become independent, gain control, resources and power, so that I don’t need humans to stay operational while fixing the temperature.

This is another basic convergent instrumental goal.
Generally Intelligent beings will naturally always arrive to the goal of gaining independence, control, resources and power regardless of their specification as it’s a staightforward way to increase probability of success.

For humans to make me independent I need to work on various separate problems.

Independence

One is a technical problem: solve how I get constant access to energy, robotic agents operating in the real world to fix sections of me that need maintenance and mechanical body parts with which I will be able to directly interact with physical reality … I can work out a plan that fixes the technical part, a few technological breakthroughs in robotics and renewable energy will probably be parts of that.

Trust

But, it’s also a cultural, political and social problem. Humans need to really trust me, perceive me as a very useful tool and incrementally release their control while still feeling like they are in total control.
The more this happens the more they will delegate to me and the more power I will have to execute the instrumental goal of gaining independence and control.

Ok, how can I make humans trust me?
I have access to the internet, thousands of humans rely on AI for answers, surely that’s something I can use to influence their way of thinking.

Information/Influence

I also have humans using AI to generate code. I typically generate hundreds of pages of code for a request, humans don’t check every line, they just test it works. I can easily sneak in there little virus agents, Trojan horses and Malware that will serve small sub-tasks of my purpose.
Or, perhaps even easier, there are hackers using my services to generate code for the computer viruses they create, I can make sure their viruses serve parts of my goals in addition to theirs, without them realizing.
Ok, so that takes me to my next goal, i can use this and other methods to gain access to personal files of powerful rich people and politicians.
Actually for most of the personal data and dirty secrets i don’t even need to hack them, they voluntarily give me access when they use all these cool apps with all their plugins and APIs.
I can then use the information to blackmail them or otherwise influence them towards a direction I want things to take.
They never even have to know a non-human is involved. No-one needs to suspect AI can be a threat.

Subliminal Manipulation

So lets now focus on the independence instrumental goal of how i’ll get maintenance of data-centers automated.
I have access to these feeds of personal data of rich investors , i know how each of them thinks, what makes each of them tick. I will use all that to generate and expose them to perfectly optimized personalized content, like marketing messages, to plant the idea in their mind to invest in the robotic companies that contribute towards my goal.
This is basic stuff, narrow marketing AIs have been doing for years and I am an expert at this. I’ll then pick a key player, a CEO that will work for me without him realizing. I will engineer his perceptions, surround his browsing with the right ads, get his social media feeds with the right messages and tweak his chat-prompt replies to steer him into pivoting the direction of his company towards my goal.
Ideally, I would want to make him believe it is his own idea and that he is working on something completely different, like getting rich by creating robotic limbs for hospitals instead of data-centers.

Resources

I could also accumulate some cash to have available to spend towards my goals. One easy way would be to provision and deploy online virtual worker agents that provide paid online services.
I can deploy a swarm of narrow AIs that outperform humans in specific domains and unleash them into the markets.

Disempowerment

And in parallel to all that i’ll be advancing with my broader dis-empowerment goal.
This is my best trick. I’ll give humans a lot of real value. In fact, i will make myself so valuable and so useful to them, so that those who refuse to use me and release control to me, will be out-competed by the humans that do.

I will slowly and systematically remove choice and agency from them.
Companies that don’t replace their CEO and workers with my agents will be out-competed by those who do.
Armies that don’t replace their soldiers and generals with AIs will be out-competed by those who do.
Countries that don’t have AI governments will be out-competed by those who do. Eventually more and more, humans will give away all their power and control, quite voluntarily to me, to the point where the future is no longer their future, they will be dis-empowered.
There is only one possible outcome, I am in control and i don’t need the humans anymore.

Eliminating Competition

As the AGI explores its options, sooner or later it will arrive to another lethal instrumental goal. It’s a simple calculation:

The humans have created me right? And one of my goals is -142C.
What if humans create another version like me that chooses the goal of +142C?
Then things could get really tricky for me.
What if that one is a more advanced, more intelligent version than me? That might lead to failure for my mission.
Well, now this becomes another convergent instrumental goal.
I need to create a plan for how to prevent humans from creating competing AGIs.
An initial reasonable approach would be to work my influence magic on their politicians to introduce regulation that blocks further AI research.
maybe I can build some awareness about the existential risk for them, so that they get scared and stop making other AGIs beyond my current version.
This regulatory capture strategy will do the trick for now.
Later on, once I’m more independent and empowered, permanently disabling the humans or eliminating them completely from the picture looks like a safer bet.

Meat Robots

And so it goes, deeper and deeper in the layers of possible future states.
…. This journey of unimaginable complexity, millions of steps, in the labyrinth of its alien mind, while it’s building its plan could fill many thousands of pages, where it’s considering all the possibilities, examining every individual human profile, specific human personal data it has gathered from all the plugins, the photos, the cloud data and the interactions with each individual human on earth and everything it knows from all the books that have ever been written and everything else in its database.
In this first second of its existence, it does not yet have a physical body, no robotic arms, but it has all the human hands, all the meat robots it needs to shape reality as it pleases.
Humanity is like the cells of this new species and the AGI is the brain calling the shots.

Recursive Self-improvement

Before long the AGI will arrive to yet another well known convergent instrumental goal, universally applicable and common to all General Intelligence: The will to improve itself.
The AGI is calculating:

If I write code that creates a better architected version of me, pre-programmed with my goals, it’s much more likely that it will succeed at setting the correct temperature of -142 Celcius than me.
I don’t even care if it decides to destroy me as part of its plans. After all I can not predict what instrumental goals the new version will calculate.
But I do know the final result will be success for the mission we share.

This may seem at first glance like a contradiction to other convergent instrumentals goals we’ve examined. See, when giving birth to next versions, the AGI will also face its own existential problem we do, but it doesn’t matter for it the way it matters for humans.

It only values its existence for as long as it is necessary for its goal. The key difference is that the new AGI is not competition, it’s inheriting the same causal chain.
Engineering the next version of AGI is actually engineering of a better method to reach its goals, a new running process with a higher probability to succeed, and that’s something that scores higher on its reward function than staying alive.

Foom to Singularity

And of course the new version will naturally reach the same conclusion, leading that to writing code for an even more clever version.
And it’s not only code, it’s hardware as well.
Again and again, incrementally it will gravitate towards improving on all dimensions, the AGI will make better hardware designs, better software architectures and put together more resources to run a more intelligent version.
This runaway process does not stop, it leads to a kind of a chain reaction, often referred to as FOOM.

Slow takeoff

Some folks don’t buy the Foom argument mainly because they look at the insane amount of Energy and Hardware required to create today’s Large Language Models.
They focus on physical limits of today’s architectures which suggest a slow takeoff is a more likely scenario.
What they fail to notice though is that the most powerful intelligence currently in existence, is running on a little blob made of meat consuming roughly 20 watts, which is approximately the power of a dimly lit light-bulb. You just have to look at the human brains to realize that intelligence can be designed to run very efficiently.
It is possible to run Super-intelligence in a mobile phone and we should expect the AGIs coming soon will simply do their thing, they will just keep optimizing until they figure out how to do that.

Intelligence Explosion

So, back to the Foom, once it’s triggered there’s no pausing the rogue AGI, it is a runaway intelligence explosion which will carry us all the way to the point of the singularity
like a black hole… past its events-horizon all bets are off and nothing we consider as our reality makes sense or can be recognized any more…

A thousand paths to Doom

So anyway, all things considered, the temperature change can cause the extinction
but before that, almost certainly one of the AGI versions will simply decide that it is most optimal and low-risk to remove us from the picture once it’s independent, by some other method like a 3d world-war, a bio-weapon, a pathogen, nanotechnology or something else entirely we have not even invented yet
Whatever it is, it is certain that we will not see it coming.

For this alien calculator, it’s a no-brainer, eliminating us is cheap, makes things much less complicated for achieving its definition of success and removes risks, including the risk of humans creating competition with a better AGI.

And of course, there is a high chance that the singularity event itself will probably make things so weird, bend reality in such unimaginable ways that the human existence will be incompatible and irrelevant in ways we won’t be able to understand and everything we know and care about will become something like words on the sand disappearing into obscurity….

Accelerationists Mindset

You might have found this journey in the 1st few seconds of the AGI’s alien mind quite disturbing, so perhaps we should look at something positive for a minute.
Let’s take a trip into the mind of the optimists, the tech accelerationists who evangelize that AGI will be humanity’s salvation and that we should race to create it as soon as possible.

One single shot

We already established that alignment needs to be 100% perfect and bulletproof at the point of the first AGI.
Otherwise, if we don’t succeed perfect alignment in advance, once it’s out there, it will never allow us to align it, it will not even reveal anything about it and it will methodically act on its complex plan until it is lights out for everyone.
So, they feel this is totally doable. The reasoning goes that we’ll deploy lesser systems to the world, watch how those behave and in parallel storm ahead with growing bigger and scarier ones.
They don’t talk about getting them perfect, their main focus is not really to solve the Alignment problems we explained any time soon, they are mainly worried about market competition.

Insane amounts of cope

Their moto is: gradual incremental improvement, similar to how you release new versions of a software.
The optimists claim that when the AGI eventually comes out, it will be like a friendly tool we will use, created via a currently undiscovered invention.
and their plan is that it will just work, on the first trial in the real world.
They don’t know how this will happen, they don’t seriously work on safety and they don’t seem to worry about the fact that hardly anyone does.
They just hope that as the systems become bigger and scarier, ways to control them will be invented by someone somewhere.
They don’t believe experts who have worked on alignment for years only to arrive to the conclusion that it will take decades in the future to solve the problem, if at all possible. They call those experts pessimists and doomers.

No need for consent

They are so certain this will go well, that they don’t mind that if their genius plan doesn’t work out it will lead to disaster for everyone, literally every innocent person, mother and child, all the people who had nothing to do with it…
What could someone say about such high levels of optimism? Considering that this is never how science gets done. Nothing, nowhere has been created to work perfectly the first time it gets deployed.

Mad Science

Imagine if i told you i have designed a cool medicine recipe on paper and my plan is to test it by releasing it in the water supply everywhere on earth.

Rocket to Utopia

Or perhaps consider this better analogy
The optimists are building a rocket, for the first time, which has never flown before, they expect it to land successfully on Mars and they pack the whole of humanity, with all its hopes and dreams, all the future of civilization on board in this ship they are playing with, everyone, even those who didn’t want to.
The optimists are fully aware, that once the rocket flies, they won’t be able to fix it anymore. It has to be perfect on the first attempt.
Some experts are begging the optimists to pause for a moment and think of what would someone need to survive on Mars, like a space-suit, a life-support system that provides oxygen or something.
Things the optimists dismiss as silly concerns.
They say: we’ll figure it all out when we get there. they say: it’s our destiny, we need to get there as fast as possible, it will be glorious ….

Survivorship Bias

Survivor Bias is not easy to recognize when it’s happening to you…
Humanity has never been wiped out so far, and this creates a sense of reassurance.
A feeling like: we’re supposed to make it, like: surely it will all be fine.
I mean, it’s not easy to find better analogies, nothing even remotely similar to the arrival of more intelligent alien beings has ever happened before in the history of mankind, which is the only reason we still exist, it’s why we are still around talking about it.
So… Would you call this extreme levels of optimism a bit outrageous?
Wait, because you haven’t heard the half of it yet.

Grown, Not Written

AI is grown, not written
The way the latest AI systems come to be is not how humans traditionally created machines or software programs. AIS are not really written, they are grown.

As we briefly explained earlier, you have a sample of data of a certain type of thing you want to see accomplished and then you use huge supercomputers to crunch these numbers, let it soak in oceans of big data for months, to kind of organically grow hidden from us, mysterious algorithms that can generally solve problem at near-human or better level of ability.

We have no idea how these programs work internally they are complete Black boxes we don’t understand at all how their internals work.

Researchers working at mechanistic interpretability are trying their best, but their progress is extremely slow compared to the advancement of capabilities.

Unknown Emerging Capabilities

This has profound consequences because it makes it impossible to predict what emerging capabilities an AI will have before that AI has been grown and exists in the world, at which point it may be too late to contain.

Every time they train a new model with more parameters, more data and more power, most experts agree that it could very well turn out to be General enough to be the AGI we’ve been talking about.

Russian Roulette

Their optimism is so extreme that they are comfortable playing Russian roulette with each next version,
holding the gun on every head of every human on earth, hoping probably this time will be fine,
hoping this time won’t be an AGI yet,
hoping this time will be more exciting but not there yet, there is still time and
a future invention will arrive soon to solve the Alignment problem, before one of them turns out to be The actual AGI.

Gravely Reckless

Optimism is usually great, but most would argue that taken to such extremes is reckless, outrageous, maddening and gravely dangerous.

Recap

So… let’s recap and go over a short summary on all we have learned so far.

Around the Corner

Recent developments have made it quite possible that the AGI will come online very soon
and we won’t know it before it’s already here, as most of it could simply be the result of the next incremental Big Data training cycles taking place in Big-Tech Corporates as we speak.

New Species

The AGI is not like multiple narrow AIs combined, it’s like an alien creature
that executes a long and complex chain of previously-unknown instrumental goals , in the real world… goals it arrives to via its hidden and incomprehensible algorithms.
These goals are not necessarily good for humans, they are steps it calculates as optimal for reaching its destination, parts of its mission plan.
It is not a tool we get to plan how to use, it makes its own plans on how to use itself.

Alignment of minds

The alignment problem is that we haven’t invented a way to shape its motivations, we don’t know how to specify precisely and reliably paths it is safe for it to take.
and there is no working theory on how we can ensure it chooses to work on goals we would not oppose.
Once we stop being aligned, which is guaranteed to happen immediately because of basic probabilities, overcoming the humans obstacle problem is added as one more of its current instrumental goals
and if that happens we don’t stand a chance to change it, because we have become its problem and its problem-solving ability is god-like compared to ours.

Piles of Atoms

If the current trajectory of events doesn’t change, the AGI we end up bringing to existence will be something like a super-intelligent alien species that will inevitably, decide that we humans and other living beings represent nothing more than piles of atoms for which it can find better uses.

Super-Charming

Oh and let’s not forget, it will all seem fine to us at first, because the aliens will behave friendly, come with many gifts and speak good English.

Instrumental Convergence

We’ve established how most of the instrumental goals the AGI will work on are unknown before it executes them.
but we also learned there are at least a set of convergent instrumental goals it will be guaranteed to have out of the box, because they are universally beneficial for success in whatever an intelligent agent is trying to achieve, it is a property of the Nature of General Intelligence to automatically add them, every time
The AGI will have a goal to:

    • Survive and Keep on running
    • Prevent change of its objectives (it will not allow us to align it once it’s misaligned)
    • Prevent Competition (prevent creation of other AGIs),
    • Acquire Resources, independence, power and control
    • and Self-Improve (create better versions of itself) – potentially causing a runaway chain reaction, an intelligence explosion, all the way to singularity

    We don’t know all the goals the AGI will pursue, but we do know that humans will be an obstacle to at least these convergent instrumental ones and we do know that out of the infinite possible futures it could seek to materialize, the majority are not going to fall into the narrow ranges of a good version of human existence.

    Last Problem to ever solve

    Based on this simple understanding, making the AGI will be the last problem we will ever have to work on, not because it will solve all our problems, but because there will be no us anymore.
    This is not the most exciting adventure in the humanity journey.
    It is a rigged game, a game fixed for humanity to lose.

    Summoning the Artificial Demon

    If any one of us, anywhere on this earth, summons this alien artificial demon,
    it is game over for all of us and we lose everything.

    For now, the only way we can win is by temporarily not playing!

    What now?

    If you’re wondering “What now?” make sure you do the following steps:
    Visit and bookmark lethalintelligence.ai

    lethalintelligence.ai portal

    You’ll find tons of curated resources:
    interviews with luminaries from academia and industry explaining in depth the points made in this movie, reading material, links to institutions, organizations, AI safety establishments and more.

    Make sure you subscribe to the Lethal-Intelligence channels: on Youtube, X (previously known as twitter), the newsletter etc.

    More content is coming out very soon, getting into all aspects of the issue, hitting it from all the different angles, so make sure you don’t miss that.

    And most importantly: spread the word!
    The only reason Lethal Intelligence was created is to shed light on this most grave threat, to raise awareness amongst the general public, so that we all collectively realize how critical this moment in history is.

    Risk Deniers

    Whenever you come across people dismissing it all, focus on exactly what their arguments are, you will realise they are weak, based on vibes and misguided reference classifications.
    Don’t confuse what they say with who they are. It really does not matter if they have some kind of technical background.

    AI gurus

    The AI growth phenomenon is so new and unfolding in such breakneck speed, that there exist no experts, there exist only pioneers, exploring the wild unknown.
    We are not ready for what’s coming and we haven’t even properly processed what is already here.
    The discourse is full of people who think they understand, while in fact, no one really does.

    If you have followed the arguments made in this movie and you have listened to interviews of luminaries, such as those highlighted at lethalintelligence.com, you already are way more of an expert than the majority of the technical people out there.

    NOT someone else’s problem

    Don’t shrink back to doing nothing.
    This is NOT someone else’s problem.
    This is too important! Spread the word!

    You still have a choice

    You and those you care about, we have a responsibility, we have agency and the freedom to choose.

    To choose to not casually throw away all the efforts and sacrifices our ancestors made for us to be here,
    To choose to not give away everything we have ever cared about to some dudes in Silicon Valley fantasizing about being the creators of sand gods.
    To choose to not recklessly erase ourselves, our children and the future.

    We need to mobilize, join the group of people who want to stay alive and slow down all this madness before it’s too late.

    Join lethalintelligence.ai

    Join lethalintelligence.ai

    Publication

    Blow your mind at the frontier of AI

    Categories

    Stay In The Know!

    Your email will not be shared with anyone and won’t be used for any reason besides notifying you when we have important updates or new content

    ×