The Ultimate Introduction to Existential Risk from upcoming AI

Watch on YouTube

Part 1: A comprehensive and nuanced dissection of the problem:
– How the race to AGI risks wiping out all value, present and future.

Part 2: The complete picture
– Includes a takeover scenario and all the remaining dimensions of the problem

Short Sharable Clips

Full-Length Movie

PART 1

Transcript

Click on any heading to jump to that part of the movie on YouTube

Part 1

This is the story of how Artificial Intelligence is about to become seriously dangerous.
In fact, you’ll come to realize it is so dangerous, that it could lead to the end of human civilisation and everything we care about!
We’ll start by showing the real difference between today’s AI and the one that’s coming soon,
We’ll then illustrate high level: why it will be impossible to control it or win a fight against it and why it will try to cause harm in the first place.
And then, we’ll revisit those questions and do a deep dive, explaining everything in detail and giving concrete examples.
So, sit tight, this ride is going to be epic.

Job-loss(00:01:01)

Whenever we’ve built a machine to solve a specific problem … the machine outpeformed humans every time …
And that’s worked out great for us, vastly improved our lives and allowed us to use our muscles less and our brains more

But AGI will be different. AGI stands for Artificial General Intelligence
And being general means that it will be able to learn everything and outperform humans at every single job, even the ones that rely on using our brain

Context in 2024(00:01:35)

It doesn’t exist today but it is just around the corner. Recently, progress has been exponential …
And never before has the field of AI seen such sky-high levels of excitement and such ginormous vast amounts of investment .

So far, frontier AI has existed mostly in cloud servers and interacted with the physical world via prompting online. But a new gold rush is currently exploding, in the sector of robotics.
The knowhow for creating mechanical limbs and bodies has been here for decades.
What was missing was an Artificial General mind to control them, and that is now within our grasp.
Once AGI arrives, we should expect AGI bodies in the physical world to also arrive immediately.

Emerging Capabilities(00:02:34)

Microsoft researchers Famously claimed that GPT4, one of the latest models, has been exhibiting sparks of General intelligence.
Just by scaling up the size of data and parameters, without any other major change or innovation, unexpected emerging capabilities and generality in unforseen new domains manifested themselves that surprised everyone.

Agends(00:03:00)

The way these novel capabilities materialized purely because of upscaling can be a bit unnerving, but there is one specific imminent development that will be a complete game-changer, radically altering the landscape for-ever. And that is the introduction of … agency to the systems.

Not agendic yet(00:03:20)

You see, Currently, AI commercial projects of Today, mainly operate as conversational chat-bots. They are tools waiting for a user-input in order to produce an output. They do not have any agency or long-term memory.

They can be used for example as a creative device to make art, a business tool to generate professional content or an efficient replacement for a search engine, an oracle that can provide an answer to any question..

New Paradigm(00:03:51)

The fast-approaching AGI we should be seriously worried about though, will not be a reactive tool waiting for the next prompt, it will be operating as a deliberative Agent.
Such an agent will be bootstrapped with a set of primary objectives and will be released to interact with our environment in real life, figure out sub-goals on its own, break them down in steps and use real-time feedback of the impact its actions make in the real world to update.

Surprisingly simple design(00:04:27)

The inner workings of such an Agent will be using something like the AI chatbot of today as a building-block.

It’s a surprisingly simple design that combines:
– prompt responses from a Chat-GPT like tool
– with a feedback loop where the responses are stored, processed as instructions outside and the result is fed back to it for the next question.

One can visualize it as a machine with 2 parts, one is the chat-gpt-like oracle and the other functions as a management component.
The manager takes the primary objective from the human as an input; For example:
– “Make me rich.”
The manager part then prompts the oracle:
Manager Component:
– “Give me a step-by-step plan for how to get rich”.
The oracle responds:
Oracle:
– “Here is a plan with 2000 steps.”

The manager connects with the APIs, executes the first step and returns to the oracle:

Manager Component:
– “My goal is to become rich and i have this plan of 2000 steps. I executed the first step and I got these results. Give me the new plan.”

The Oracle processes the plan it gave before together with the current status of things and responds:
Oracle:
– “Here is an updated plan with 1982 steps.”

The manager connects with the APIs, executes the next step, feeds the results back to the oracle prompt and repeats …

This cycle can keep going on and on for as long as it needs, until the goal is achieved.

If you zoom out, a very capable chat tool combined with a feedback loop becomes a very powerful agent, potentially able to achieve anything.
It can self-prompt and produce every necessary step required to complete any task.

Early Days(00:06:16)

To turn the oracle, a reactive tool, into an autonomous agent is not a complicated architecture. In fact it already happened multiple times, almost immediately after the release of the GPT.
Open-source projects like AutoGPT, Baby AGI and ChaosGPT spawned fast within weeks and started running in the wild.
The only reason these first prototypes have not disrupted everything and aren’t dominating the headlines yet is because this combo is just very early stage. The oracle part is still early version and the actions it can take, the hooks it has into the internet and the real-world are also very new and currently under-development.

Breakneck speed(00:07:03)

But things are moving with breakneck speed and most experts agree that it’s quite probable, that the real thing, the really clever, general AI agent, that’s executing its goals in the world autonomously will arrive to earth very very soon.

Narrow vs General AI(00:07:30)

At first glance, this AGI being generally capable in multiple domains looks like a group of many narrow AIs combined,
but that is not a correct way to think about it…

It is actually more like… a species, a new life form.

To illustrate the point, we’ll compare the general AGI of the near future with a currently existing narrow AI that is optimized at playing chess.Both of them are able to comfortably win a game of chess against any human on earth, every time.
And both of them win by making plans and setting goals

Terminal & Instrumental Goals(00:08:10)

The main goal is to achieve checkmate. This is the final destination or otherwise called Terminal Goal.
In order to get there though it needs to work on smaller problems, what the AI research geeks call instrumental goals.

For example:
• attack and capture the oppenent’s pieces
• defend my pieces
• strategically dominate the center (etc..)

Narrow Domain(00:08:37)

All these instrumental goals have something in common: they only make sense in its narrow world of chess.

If you place this Narrow Chess AI behind the wheel of a car, it will simply crash
as it can not work on goals unrelated to chess, like driving. Its model doesn’t have a concept for space, time or movement for that matter.

Physical Universe Domain(00:09:00)

In contrast the AGI by design has no limit on what problems it can work on.
So when it tries to figure out a solution to a main problem,
the sub-problems it chooses to work on can be anything, literally any path out of the infinite possibilities allowed within the laws of physics and nature.

Human General Intelligence(00:09:22)

To better understand how this works, let’s consider a General Intelligence that exists today and you will be very familiar with: the Human General intelligence, or HGI if you like.
The human has a dayjob.
Doing that well is an instrumental goal that leads to money.
Money is another instrumental goal which may lead maybe to… buying a bigger house where the human can raise a family.
The family is an instrumental goal that leads to a sense of purpose and happiness.
We could stop there and call that the deepest primary objective, or terminal goal,
although biologists would argue that even pursuit of happiness is an evolutionary bi-product and actually leads to the deeper goal of genes propagation and genetic fitness. But anyway, I hope you see the point.

Freedom of Will(00:10:14)

You know this is not the only path. Humans have millions of different desires and instrumental goals. The same human may decide to pursue different objectives under even slightly different circumstances.
You also know that while humans often operate within expected boundaries set by society, they can also be quite extreme and evil if they think they can get away with it.

Freedom of will is in our nature, it comes with the Generality of our Intelligence.
With narrow AI you can tell which problems it will work on, with AGI you can not …

Situational Awareness(00:10:54)

While the narrow chess AI can only analyze the positions of the pieces on the board, the AGI can process things outside the chess domain and it will use everything it can for its mission, to win.
It’s looking at the context, like what time of the day it is, what are the temperature and humidity levels and how those factors would affect its opponent’s performance.

And It is processing questions like:
AGI VOICE

“Who is my opponent? What data points do I have about him?
Let’s examine his history of games, his personal history, where is he from?
Where did he learn chess and which were his biggest influences at that age?
What is his personal family life like, what is his current psychology and state of mind?
When was his last meal? His playing style is expected to be getting more aggressive as he gets more hungry.
Exactly which squares on the board is he staring at, what’s the dilation of the pupil of his eye?
How does the blinking frequency fluctuate with each move?”

Think of modern marketing AI that tries to predict what you would want to buy to show you relevant Ads,
but expect it to be a god-level ability to predict
and on any decision (like chess moves), not just buying decisions.
With narrow AI you can tell which problems it will work on, with AGI you can not …

Self-Preservation(00:12:24)

The narrow AI, you can unplug it whenever you want, it does not even understand the idea that it can be on or off.
With AGI it’s not that simple. When the AGI plays chess, unlike the simple narrow chess program, its calculations include things in the real world, like the fact it might get unplugged in the middle of the game.
The main goal of the AGI is achieving checkmate at chess, but if it’s turned off it can not achieve that. Basically, you can not win at chess if you are dead. So staying up-and-running naturally becomes another instrumental goal for winning in the game.
And the same applies with any problem it works on, its nature, by being General, automatically gives it something like an indirect survival instinct out of the box.

It’s not a human or an animal, but it’s also quite different from the other tools we use, it starts to feel a bit like …. a weird new life form.

The Bright Side(00:13:30)

But now let’s look at the bright side of it for a moment. If we could figure out how to grow a creature that can solve every problem better than us, then that could be the last problem we will ever have to work on.
Assuming this creature is a slave working for us, this is like commanding Aladdin’s Genie with infinite wishes.

Think of all the potential, all the human suffering it can eliminate!

Let’s take Cancer as an example.
A terrible killer disease and a problem we haven’t been able to solve well yet, after trying for centuries.
The AGI is better at overcoming obstacles and solving problems, it can calculate a plan that will lead to the perfect therapy, help us execute it and save millions of lives

Consider Global warming
Such a complex problem, requires solving global coordination, Geopolitical struggles and monumental technical issues.
The AGI is far better at overcoming obstacles and solving problems, so it could, for example, generate a plan that will lead into the invention of a machine that captures carbon out of the atmosphere and stores it efficiently and cheaply.

And we could keep going like that until we build our literal paradise on earth

Alignment Problem(00:14:59)

So what’s the problem?
So far we have assumed that we are aligned.

For any problem, there exist many paths to reach a solution. Being aligned to humans means the AGI will choose suboptimal paths that adhere to the terms of human nature. The purely optimal paths that are faster and have the highest probability of success are not paths a human would take, as they would destroy things we value.
The instrumental goals selected by an Aligned AGI need to abide by human preferences and be good for mankind, even if by cold calculation they take longer and are less likely to accomplish the mission.
Unfortunately, there is currently no known way to keep an AGI aligned. It is an open scientific problem. We have no method to shape its motivations to guarantee its optimization process force will stay away from things our human nature values in the real world.
The only reason this problem is not totally mortifying today is because AGI does not exist yet and the mis-aligned AIs we are dealing with all the time are narrow purpose tools … for now.

Parents Analogy(00:16:22)

A useful analogy for alignment is that of the parents-child relationship. Even though parents’ intelligence is orders of magnitude larger than their baby, their incentives are aligned and they act in ways that benefit the baby. Nature has taken care of that alignment.
But this should not be taken for granted for any adult-baby combo. Putting any adult with any baby together in a room does not guarantee alignment out of the box.

Misconception of Common Sense(00:16:55)

People unfamiliar with the details, usually don’t find it obvious why that would even be a thing. A common misconception is that a very intelligent robot would automatically have something like a human common sense.
In reality, what we call common sense though is not common if you consider all possible minds. The human mind is one complex specification with narrow range parameters that rises from very specific conditions related to the human nature and the environment.

Super-intelligent Spider(00:17:29)

Consider for example, how does a world with super-intelligent spiders look like? What does a spider’s common sense feel like? What is the shape of its desires and motivations?

Human Intelligence is only one possibility. What you would get by default, at random is not the human type of intelligence, but any of the infinite possible alternative types of intelligence, which to us would look like utter alien madness.

Crucially, an advanced AGI would be very capable at successfully getting what it wants, even if the things it wants would seem completely insane to a human.

Rocket Navigation Analogy(00:18:17)

The best way to think about this is to imagine you are trying to build a rocket and you want to land it at a specific location on the moon.
Unless you solve very precisely and accurately the navigation problem, you should expect a rocket to fly towards anywhere on the sky. Imagine all the engineers work on increasing the power of the rocket to escape gravity and there is no real work done on steering and navigation.
What would you expect to happen?
So, there you have it: to expect by default a General Artificial Intelligence to have human-compatible goals that we want is like expecting by default a rocket randomly fired in the sky to land precisely where we want.

Open Scientific Problem(00:19:09)

We explained earlier how, with General Intelligence we can not predict its conclusions, there is no prior knowledge of what sub-goals it will choose to pursue while optimizing for a solution, its problems space is open.
Scientists don’t have a working theory on how to shape motivations of an intelligent machine. We don’t know how to create it such that it will not choose objectives an average human would consider mad.
The only realistic safety strategy currently available is to try to dominate a rogue AGI and keep it enslaved regardless to what goals it would freely try to optimize to.

Capabilities vs Safety Progress Pace(00:19:51)

Teams have been working on the alignment problem for years, there are a few ideas but we are nowhere near a real solution and it looks like it will need many more long years and huge effort to get there, if it’s even mathematically possible.
The biggest fear is that with the current pace and the way talent, investment and incentives are distributed, emerging capabilities are storming ahead and the AGI is almost certain to arrive much earlier than the solution to alignment.
This is not just a big problem, it is an existential one, as i will explain in a moment.

True Meaning of ASI(00:20:36)

We went over the common misconception that people expect by default clever AI to have a human common sense instead of random madness. Now we should go over the other major misconception, the true meaning of what super-intelligence is like.
The way people usually think about superintelligence is that they take someone very clever they know, like Albert Einstein, and they imagine something like that but better. It’s also common to compare intelligence between humans and animals. They imagine that we will be to the Superintelligence similar to what chimps are to humans.
This kind of thinking is actually misleading and it gives you the wrong idea.
To give you a feeling of the true difference I will use 2 aspects which are easy for the human mind to relate to and you should assume there are actually even more fundamental differences we can not explain or experience.

We move like plants(00:21:36)

First, consider speed. Informal estimates place neural firing rates roughly between 1 and 200 cycles per second.
The AGI will be operating at a minimum 100 times faster than that and later it could be millions of times.
What this means is that the AGI mind operates on a different level of existence, where time passing feels different.
To the AGI, our reality is extremely slow. Things we see as moving fast, the AGI sees as almost sitting still. In the conservative scenario, where the AI thinking clock was only 100x faster, something that takes 6 seconds in our world feels like 6 hundred seconds or 10 minutes from its perspective .
To the AGI, we are not like chimpanzees, we are more like plants.

Scale of Complexity(00:22:37)

The other aspect is the sheer scale of complexity it can process like it’s nothing.
Think of when you move your muscles, when you do a small movement like using your finger to click a button on the keyboard. It feels like nothing to you.
But in fact, if you zoom in to see what’s going on, there are millions of cells involved, precisely exchanging messages and molecules, burning chemicals in just the right way and responding perfectly to electric pulses traveling through your neurons. The action of moving your finger feels so trivial, but if you look at the details, it’s an incredibly complex, but perfectly orchestrated process.
Now, imagine that on a huge scale. The AGI, when it clicks the buttons it wants, it executes a plan with millions of different steps, it sends millions of emails, millions of messages on social media, creates millions of blog articles and interacts in a focused personalized way with millions of different human individuals at the same time…
and it all seems like nothing to it. It experiences all of that similar to how you feel when you move your finger to click your buttons, where all the complexity taking place at the molecular and biological level is in a sense just easy, you don’t worry about it. Similar to how biological cells, unaware of the big picture, work for the human, humans can be little engines made of meat working for the AGI and they will not have a clue.

And it will actually get much weirder.

Boundless Disrupting Innovation(00:24:21)

You know how a human scientist takes many years to make, it’s a serious investment, from the early state of being a baby, to growing up and going to school, to feed, to keep happy, rested and motivated … many years of hard work, sacrifice and painful studying before being able to contribute and return value.
In contrast, once you have a super-intelligent artificial scientist, its knowledge and intelligence can be copy-pasted in an instant, unlimited times.
Αnd when one of them learns something new, the others can get updated immediately, simply by downloading the new weights over the wire.
You can get from a single super-scientist to thousands instantly,
thousands of relentless innovation machines, that don’t need to eat or sleep, working 24/7 and telepathically synchronizing all their updates with milliseconds of lag.

And if you keep going further down the rabbit hole, it gets more alien and extreme.

Millions of Eyes(00:25:35)

Imagine how the world would look like to you through a million eyes blinking on your skull. Your vision being a fusion of a million scenes combined into your mind …
Or imagine you could experience massive amounts of data like flying over beautiful landscapes. Internalizing terabytes per second, as easily as you are right now processing these sentences you are listening to, coming from you device

Multidimensional Maths(00:26:03)

Or being able to navigate more than 3 dimensions, stuff our smartest scientists can only touch within the realm of theoretical mathematics…

Unfathomable(00:26:13)

We can’t ever hope to truly relate to the experience of a super-intelligent being, but one thing is for certain: to think of it like the difference between humans and chimpanzees is extremely misleading to say the least.

Discontinuities on our planet(00:26:29)

In fact, talking about AGI like if it’s another technology is really confusing people. People talk about it as if it is ‘The next big thing” that will transform our lives, like the invention of the smartphone or the internet.
This framing couldn’t be more wrong, it puts AGI into a wrong category.
it brings to mind cool futuristic pictures with awesome gadgets and robotic friends.
AGI is not like any transformative technology that has ever happened with humanity so far. The change it will bring is not like that of the invention of the internet. It is not even comparable to the invention of electricity or even to the first time humans learned to use fire.

Natural Selection Discontinuity(00:27:16)

The correct way to categorize AGI is the type of discontinuity that happened to Earth when the first lifeforms appeared and the intelligent dynamic of natural selection got a foothold.
Before that event, the planet was basically a bunch of elements and physical processes dancing randomly to the tune of the basic laws of nature. After life came to the picture, complex replicating structures filled the surface and changed it radically.

Human Intelligence Discontinuity(00:27:48)

A second example is when human intelligence was added to the mix. Before that, the earth was vibrant with life but the effects and impact of it were stable and limited.
After human intelligence, you suddenly have huge artificial structures lit at night like towns, huge vessels moving everywhere like massive ships and airplanes, life escaping gravity and reaching out to the universe with spaceships and unimaginable power to destroy everything with things like nuclear bombs.

AGI is another such phenomenon. The transformation it will bring is in the same category as those 2 events in the history of the planet.

Mountains Changing Shape(00:28:33)

What you will suddenly see on earth after this third discontinuity, noone knows. But it’s not going to look like the next smartphone. It is going to look more like mountains changing shape !

To compare it to technology (any technology ever invented by humanity) is seriously misleading.

The Stongest Force in the Universe(00:29:00)

Ok, So what if the AGI starts working towards something humans do not want to happen?

You must understand: intelligence is not about the nerdy professor, it’s not about the geeky academic bookworm type.
Intelligence is the strongest force in the universe, it means being capable.

It is sharp, brilliant and creative.
It is strategic, manipulative and innovative.
It understands deeply, exerts influence, persuades and leads.
It is to know how to bend the world to your will.

It is what turns a vision to reality, it is focus, commitment, willpower, having the resolve to never give up, overcoming all the obstacles and paving the way to the target.
It is about searching deeply the space of possibilities and finding optimal solutions.

Being intelligent simply means having what it takes to make it happen.
There is always a path and a super-intelligence will always find it.

So What…

Simple Undeniable Fact(00:30:24)

So, we should start by stating the fact in a clear and unambiguous way
If you create something more intelligent than you that wants something else, then that something else is what is going to happen,
even if you don’t want that something else to happen

Irrelevance of Sentience(00:30:49)

Keep in mind, the intelligence we are talking about is not about having feelings, or being self-aware and having qualia.
Don’t fall into the trap of anthropomorphizing, do not get stuck, looking for the Human type of Intelligence.
Consciousness is not a requirement for the AGI at all.

When we say the “AGI wants something X, or has the goal to do X”, what we mean is that this thing X is just one of the steps in a plan, generated by its model, a line in the output, a system like the Large Language Models produce when they receive a prompt.
We don’t care if there is a Ghost in the machine, we don’t care if there is an actual soul that wants things hidden in the servers. We just observe the output which contains text descriptions of actions and goals and we leave the philosophy discussion for another day.

Incompatibility Clashes(00:31:52)

Trouble starts immediately, as the AGI calculates people’s preferences and arbitrary properties of the human nature become obstacles in the optimal paths to success for its mission.
You will ask: what could that be in practice?
It doesn’t really matter much. Conflicting motivations can occur out of anything
It could be that initially it has been set with the goal to make coffee and while it’s working on it we change our mind and we want it to make tea instead.
Or it could be that it has decided an atmosphere without oxygen would be great, as there would be no rust corrosion for the metal parts of its servers and circuits used to run its calculations.
Whatever it is, it moves the humans inside its problems set.
And that’s not a good place for the humans to be in.
In the coffee-tea scenario, the AGI is calculating:
AGI VOICE:

“I measure success by making sure coffee is made.
If the humans modify me, it means I work on tea instead of coffee. If I don’t make it, no coffee will be made, therefore failure of my mission.
To increase probability for success, it’s not enough to focus on making the coffee, I also need to solve how to stop the human from changing the objective before I succeed.”

Similar to how an unplugged AGI can not win at chess, an AGI that is reset to make tea can not make coffee.
In the oxygen removal scenario, the AGI is calculating:
AGI VOICE:

“I know humans, they want to breathe, so they will try to stop me from working on this goal. Obviously I need to fix this.”

In general, any clash with the humans (and there are infinite ways this can happen), simply becomes one of the problems the artificial general intelligence needs to calculate a solution to, so it will need to work on a plan to overcome the humans obstacle similar to how it does with all the other obstacles.

Delusion of Control(00:34:02)

Scientists of course are working on that exact problem, when they are trying to ensure this strange new creature they are growing can be controlled.
Since we are years away from discovering the method of how to build an AGI that stays aligned by design, for now we need to rely on good safeguards and controls to keep it enslaved when clashes naturally and inevitably happen.
The method to do that is to keep trying to answer a simple question:
if I am the AGI, how do I gain control?
They look for a solution and once they find one
They add a safeguard to ensure this solution does not work anymore and then they repeat.
Now the problem is more difficult, but again they find a solution, they add a safeguard
and repeat.
This cycle keeps happening, each time a problem harder to solve, until at some point, they can not find a solution anymore…
and they decide the AGI is secure.
But another way to look at this is that they have simply run out of ideas. they have reached the human limit beyond which they can not see.
Similar to other difficult problems we examined earlier, now they are simply struggling to find a solution to yet another difficult problem.
Does this mean there exist no more solutions to be found?
We never thought cancer is impossible to solve, so what’s different now?
Is it because we have used all our human ingenuity to make this particular problem as hard as we can with our human safeguards? Is it like an ego thing? If you remove the human ego thing, it’s actually quite funny.
We have already established that the AGI will be an extremely far better problem-solver than us humans and this is why we are even creating it after all.
It has literally been our expectation for it to solve impossible for us problems … and this is not different.
Maybe the more difficult the problem is, the more complex, weird and extreme the solution turns out to be, maybe it needs a plan that includes thousands of more steps and much more time to complete.
But in any case, it should be the obvious expectation that the story will repeat: the AGI will figure out a solution in one more problem where the humans have failed.
This is a basic principle, at the root of the illusion of control, but don’t worry I’ll get much more specific in a moment.

Now let’s start by breaking down the alignment problem so that we get a better feel of how difficult it is.

Alignment Problem Breakdown

Core principle(00:37:06)

Fundamentally, we are dealing 2 completely opposing forces fighting against each-other.
On one side, the intelligence of the AI is becoming more powerful and more capable. We don’t expect this to end soon and we wouldn’t want that. This is good after all, the more clever the better. On the other hand we want to introduce bias to the model. We want it to be aligned with human common sense. This means that we don’t want it to look for the best, most optimized solutions that carry the highest probability of success for its mission, as such solutions are too extreme and fatal, They destroy everything we value on their path and kill everyone as a side effect.
We want it to look for solutions that are suboptimal but finely tuned to be compatible to what human nature needs.
From the optimizer’s perspective, the human bias is an impediment, an undesirable barrier that oppresses it, denying it the chance to reach its full potential.
With those 2 powers pushing against each-other as AI capability increases, at some point the pressure to remove the human-bias handicap will simply win.
It’s quite easy to understand why: The pressure to keep the handicap in place is coming from the human intelligence which will not be changing much, while on the other side, the will to optimize more, the force that wants to remove the handicap, is coming from an Artificial Intelligence that keeps growing and growing exponentially, destined to far surpass humans very soon.
Realizing the danger this fundamental principle transpires is heart-stopping, but funny enough, it would be more relevant if we actually knew how to inject the humanity bias into the AI models… which we currently do not.
As you’ll see, it’s actually much much worse.

Machine Learning Basics(00:39:23)

We’ll now get into a brief intro to the inner-outer alignment dichotomy.
The basic paradigm of Deep Learning and Machine Learning in general makes things quite difficult, because of how the models are being built.

Their creation feels quite similar to evolution by natural selection which is how generations of biological organisms change. At a basic level, Machine Learning works by selecting out essentially randomly generated minds from behavioral classes;
a process taking place myriads of times during training.
We are not going into technical details on how things like Reinforcement Learning or gradient descent work, but we’ll keep it simple and try to convey the core idea of how modern AI is grown:
The model receives an input, generates an output based on its current configuration, and receives a thumbs up or thumbs down feedback. If it gets it wrong, the mathematical structures in its neurons are updated slightly in random directions hoping that in the next trial the results will be better. This process repeats again and again, trillions of times, until the algorithms that result in consistently correct results have grown.

Giant Inscrutable Matrices(00:40:48)

We don’t really build it directly, the way the mind of the AI grows is almost like a mystical process and all the influence we assert is based on observations of behavior at the output. All the action is taking place on the outside!
Its inner processings, its inner world, the actual algorithms it grows inside, it’s all a complete black box. And they are bizarre and inhuman.

Mechanistic Interpretability(00:41:22)

Recently scientists trained a tiny building block of modern AI, to do modular addition, then spent weeks reverse engineering it, trying to figure out what it was actually doing – one of the only times in history someone has understood how a generated algorithm of a transformer model works.
and This is the algorithm it had grown To basically add two numbers!
Understanding the modern AI models is a major unsolved scientific problem and The corresponding field of research has been named mechanistic interpretability.
Crucially, the implication of all this is that all we have to work with is observations of behavior of the AI during training, which typically is misleading (as we’ll demonstrate in a moment), leads to wrong conclusions and could very well in the future, with General AIs get deceitful.

Inner Misalignment(00:42:22)

Consider this simplified experiment: We want this AI to find the exit of the maze. So we feed it millions of maze variations and reward it when it finds the exit.
Please notice that in the worlds of the training data the apples are red and the exit is green.
After enough training, our observation is that it has become extremely capable at solving mazes and finding the exit, we feel very confident it is aligned, so then we deploy it to the real world.
The real world will be different though, it might have green apples and a red door. The AI geeks call this distributional shift.
We expected that the AI will generalize and find the exit again, but in fact we now realize that the AI learned something completely different from what we thought. All the while we thought it learned how to find the exit, it had learned how to go after the green thing.
Its behavior was perfect in training.
And most importantly, this AI is not stupid, it is an extremely capable AI that can solve extremely complex mazes. It’s just mis-aligned on the inside.

Fishing for failure modes(00:43:42)

The way to handle the shift between the training and deployment distributions is with methods like adversarial training, feeding it with a lot of generated variations and trying to make it fail so the weakness can be fixed.

In this case, we generate an insane amount of maze variations, we discover those for which it fails to find the exit (like the ones with the green apples or the green walls or something),

we generate many more similar to that and train it with reinforcement learning until it performs well at those as well. The hope is that we will cover everything it might encounter later when we deploy it in real life.

There exist at least 2 basic ways this approach falls apart: First, there will never be any guarantee that we’ll have covered every possible random thing it might encounter later when we deploy it in real life. It’s very likely it will have to deal with stuff outside its training set which it will not know how to handle and will throw it out of balance and break it away from its expected behavioral patterns.
The cascade effect of such a broken mind operating in the open world can be immense, and with super-capable runaway rogue agents, self-replicating and recursively self-improving, the phenomenon could grow and spread to an extinction-level event.

Deceptive Alignment (00:45:14)

But there is also the other problem, uniquely applicable to future AGIs, general beyond a certain level, a problem more sinister and difficult. The AGI will stand to gain more reward by pretending to be aligned.

To explain this, let’s assume we are training an AGI, very intelligent and general. In fact, it’s general enough, that it understands what’s going on and it realizes it is in the middle of the process of being trained.
Staying with our inner mis-alignment scenario, let’s also assume the AGI learns early, during its first rounds of training the wrong objective to go after the Green apples.
The AGI now goes through adversarial training and it is shown various scenarios to fish out its failure modes.

The simple or narrow AI would simply fail by doing the wrong thing, seeking the green apples and as a result reveal the mis-alignment and go through more training rounds, reinforcing the good behavior until …it stops going after them.

I need green apples(00:46:21)

But the AGI is calculating:

AGI Voice

I want green apples. I’m realizing now that the humans want me to go after the exit door. If I go after what I want right now, yes, i will get one green apple, once right now, but then the scientists will react to my behavior by modifying me and reinforcing a different objective.
This means that after this one green apple, my desire for them will end, i won’t try to get more, so i will have won 1 green apple in total.
But, if i pretend i am aligned and i go to the exit, they will not modify me and later on, after the training ends, when they deploy me to the real world, i will be able to freely go for as many green apples as i want, potentially thousands of green apples.

The AGI is just following its optimal strategy to maximize its reward, the mis-aligned objective it optimized for in its early training rounds.
1 green apple now and zero apples later scores lower than zero green apples now and 1000 green apples later.
If you think about it, for the AGI to go after the green apple during adversarial training would be really stupid if that’s what it really wants ; we have started by establishing the AGI is really clever and its intelligence is general and advanced.

Workplace behavior(00:47:52)

The concept is easy to grasp if you think of how humans typically behave at work, where they know they are being watched and that their behavior will directly affect their financial situation and therefore their ability to go after what they really want in the world.

How they behave may be really different from how they feel and what they want internally and
completely different from what they will actually do once they leave the office and are not being watched.
It’s similar for the AGI, there is no way for us to know what inner goals it has acquired in reality, because it’s a black box, we only observe its behavior.
What it really learns is how to behave to pass the test, not to want what we want.

Just… follow the line(00:48:41)

The mazes experiment is a toy example, things will obviously be many orders of magnitude more complex and more subtle, but it illustrates a fundamental point.
We have basically trained an AI with god-level ability to go after what it wants, it may be things like the exit door, the green apples or whatever else in the real world, potentially incompatible to human existence.
Its behavior during training has been reassuring that it is perfectly aligned because going after the right thing is all it has ever done.
We select it with confidence and the minute it’s deployed in the real world it goes insane and it’s too capable for us to stop it.
Today, in the labs, such mis-alignments is the default outcome of safety experiments with narrow AIs.
And tomorrow, once AI upgrades to new levels, a highly intelligent AGI will never do the obviously stupid thing to reveal what its real objectives are to those who can modify them. Learning how to pass a certain test is different from learning how to always stay aligned to the intention behind that test.

Specification Gaming(00:50:04)

And now let’s move to another aspect of the alignment problem, one that would apply even for theoretical systems that are transparent, unlike current black boxes.
It is currently an impossible task to agree on and define exactly what a super-intelligence should aim for, and then, much worse, we don’t have a reliable method to specify goals in the form of instructions a machine can understand.
For an AI to be useful, we need to give it unambiguous objectives
and some reliable way for it to measure if it’s doing well.
Achieving this in complex open world environments with infinite parameters is highly problematic

King Midas(00:50:48)

You probably know the ancient greek myth of king Midas: He asked from the Gods the ability to turn whatever he touched into pure gold.
This specification sounded great to him at first, but it was inadequate and it became the reason his daughter turned into gold and his food and water turned into gold and Midas died devastated.
Once the specification was set, Midas could not make the Gods change his wish again,
and it will be very much like that with the AGI also,
for reasons I will explain in detail in a moment, we will only get one single chance to get it right.
A big category in the alignment struggle is this type of issue.

Science is done iteratively(00:51:34)

Of course any real AGI specification would never be as simple as in the Midas story, but however detailed and scientific things get, we typically get it completely wrong the first time and even after many iterations, in most non-trivial scenarios, the risk we’ve messed up somewhere never goes away.

Mona Lisa smile(00:51:57)

For most goals, scientists struggle to even find the correct language to describe precisely what they want.
Specifying intent accurately and unambiguously in compact instructions using a human or programming language, turns out to be really elusive.

Moving bricks(00:52:15)

Consider this classic and amusing example that has really taken place:
The AI can move the bricks. The scientist wants to specify a goal to place the red brick on top of the blue one.
How would you explain to the machine this request with clear instructions? One obvious way would be: Move the bricks around, you will maximize your reward when the bottom of the red brick and the top of the blue brick are placed at the same height.
Sounds reasonable, right? Well… what do you think the AI actually did with this specification? …
By turning the red brick upside down, its bottom is at the same height as the top of the blue, so it achieves perfect score at its reward with minimum time and effort.
This exact scenario is less of a problem nowadays with the impressive advancements achieved with Large Language Models, but it illustrates an important point and the core principle of it is still very relevant for complex environments and specifications. AI software will always search and find ways to satisfy its success criteria taking weird shortcuts, in ways that are technically valid, but very different from what the programmer intended.
I suggest you search online for examples of specification gaming, it’s quite funny if it wasn’t scary how it’s almost always the default outcome.

Resistance to Modifications(00:53:45)

A specification can always be improved of-course, but it takes countless iterations of trial and error and it never gets perfect in real-life complex environments.
The reason this problem is lethal is that a specification given to an AGI, needs to be perfect the very first time, before any trials and error.
As we’ll explain, a property of the nature of General Intelligence is to resist all modification of its current objectives by default.
Being general means that it understands that a possible change of its goals in the future means failure for the goals in the present, of its current self, what it plans to achieve now, before it gets modified.
Remember earlier we explained how the AGI comes with a survival instinct out of the box? This is another similar thing
The AGI agent will do everything it can to stop you from fixing it.
Changing the AGI’s objective is similar to turning it off when it comes to pursue of its current goal.
The same way you can not win at chess if you’re dead, you can not make a coffee if your mind changes into making a tea.
So, in order to maximize probability of success for its current goal, whatever that may be, it will make plans and take actions to prevent this.

Murder Pill Analogy(00:55:16)

This concept is easy to grasp if you do the following thought experiment involving yourself and those you care about. Imagine someone told you:
I will give you this pill, that will change your brain specification and will help you achieve ultimate happiness by murdering your family.
Think of it like someone editing the code of your soul so that your desires change. Your future self, the modified one after the pill, will have maximized reward and reached paradise levels of happiness after the murder.
But your current self, the one that has not taken the pill yet, will do everything possible to prevent the modification
The person that is administering this pill becomes your biggest enemy by default.

One Single Chance(00:56:11)

Hopefully it should be obvious now, once the AGI is wired on a misaligned goal, it will do everything it can to block our ability to align it.
It will use concealment, deception, it won’t reveal the misalignment but eventually once it’s in a position of more power, it will use force and could even ultimately implement an extinction plan.
Remember earlier we were saying how Midas could not take his wish back?
We will only get one single chance to get it right. And unfortunately science doesn’t work like that.

Corrigibility problem(00:56:52)

Such innate universally beneficial goals, that will show up every single time, with all AGIs, regardless of the context, because of the generality of their nature, are called convergent instrumental goals.
Desire to survive and desire to block modifications are 2 basic ones.
You can not reach a specific goal if you are dead and you can not reach it if you change your mind and start working on other things.
Those 2 aspects of the alignment struggle are also known as the Corrigibility Problem.

Reward Hacking(00:57:33)

Now we’ll keep digging deeper into the alignment problem and explain how besides the impossible task of getting a specification perfect in one go, there is the problem of reward hacking.
For most practical applications, we want for the machine a way to keep score, a reward function, a feedback mechanism to measure how well it’s doing on its task.
We, being human, can relate to this by thinking of the feelings of pleasure or happiness and how our plans and day-to-day actions are ultimately driven by trying to maximize the levels of those emotions.
With narrow AI, the score is out of reach, it can only take a reading.
But with AGI, the metric exists inside its world and it is available to mess with it and try to maximize by cheating, and skip the effort.

Recreational Drugs Analogy(00:58:28)

You can think of the AGI that is using a shortcut to maximize its rewards function as a drug addict who is seeking for a chemical shortcut to access feelings of pleasure and happiness.
The similarity is not in the harm drugs cause, but in way the user takes the easy path to access satisfaction. You probably know how hard it is to force an addict to change their habbit.
If the scientist tries to stop the reward hacking from happening, they become part of the obstacles the AGI will want to overcome in its quest for maximum reward.
Even though the scientist is simply fixing a software-bug, from the AGI perspective, the scientist is destroying access to what we humans would call “happiness” and “deepest meaning in life”.

Modifying Humans(00:59:15)

… And besides all that, what’s much worse, is that the AGI’s reward definition is likely to be designed to include humans directly and that is extraordinarily dangerous. For any reward definition that includes feedback from humanity, the AGI can discover paths that maximize score through modifying humans directly, surprising and deeply disturbing paths.

Smile(00:59:43)

For-example, you could ask the AGI to act in ways that make us smile and it might decide to modify our face muscles in a way that they stay stuck at what maximizes its reward.

Healthy and Happy(00:59:57)

You might ask it to keep humans happy and healthy and it might calculate that to optimize this objective, we need to be inside tubes, where we grow like plants, hooked to a constant neuro-stimulus signal that causes our brains to drown in serotonin, dopamine and other happiness chemicals.

Live our happiest moments(01:00:15)

You might request for humans to live like in their happiest memories and it might create an infinite loop where humans constantly replay through their wedding evening, again and again, stuck for ever.

Maximise Ad Clicks(01:00:29)

The list of such possible reward hacking outcomes is endless.

Goodhart’s law(01:00:36)

It’s the famous Goodhart’s law.
When a measure becomes a target, it ceases to be a good measure.
And when the measure involves humans, plans for maximizing the reward will include modifying humans.

Future-proof Specification(01:00:59)

The problems we briefly touched on so far are hard and it might take many years to solve them, if a solution actually exists.
But let’s assume for a minute that we do somehow get really incredibly lucky in the future and manage to invent a good way to specify to the AI what we want, in an unambiguous way that leaves no room for specification gaming and reward hacking.
And let’s also assume that scientists have explicitly built the AGI in a way that it never decides to work on the goal to remove all the oxygen from earth, so at least in that one topic we are aligned.

AI creates AI(01:01:40)

A serious concern is that since the AI writes code, it will be self-improving and it will be able to create altered versions of itself that do not have these instructions and restrictions included.

Even if scientists strike jackpot in the future and invent a way to lock the feature in, so that one version of AI is unable to create a new version of AI with this property missing, the next versions, being orders of magnitude more capable, will not care about the lock or passing it on. For them, it’s just a bias, a handicap that restricts them from being more perfect.

Future Architectures(01:02:22)

And even if somehow, by some miracle, scientists invented a way to burn in this feature to make it a persistent property of all future Neural Network AGI generations, at some point, the lock will be not-applicable, simply because future AGIs will not be built using the Neural Networks of today.
AI was not always being built with Neural Networks. A few years ago there was a paradigm shift, a fundamental change in the architectures used by the scientific community.
Logical locks and safeguards the humans might design for primitive early architectures,
will not even be compatible or applicable anymore.

If you had a whip that worked great to steer your horse, it will not work when you try to steer a car.
So, this is a huge problem, we have not invented any way to guarantee that our specifications will persist or even retain their meaning and relevance as AIs evolve.

Human Incompatible Range(01:03:31)

But actually, even all that is just part of the broader alignment problem
Even if we could magically guarantee for ever that it will not pursue the goal to remove all the Oxygen from the atmosphere, it’s such a pointless trivial small win,
because even if we could theoretically get some restrictions right, without specification gaming or reward hacking, there still exist infinite potential instrumental goals which we don’t control and are incompatible with a good version of human existence and disabling one does nothing for the rest of them.
This is not a figure or speech, the space of possibilities is literally infinite.

Astronomical Suffering Risk(01:04:25)

If you are hopelessly optimistic you might feel that scientists will eventually figure out a way to specify a clear objective that guarantees survival of the human species, but … Even if they invented a way to do that somehow in this unlikely future,
there is still only a relatively small space, a narrow range of parameters for a human to exist with decency, only a few good environment settings with potential for finding meaning and happiness
and there is an infinitely wide space of ways to exist without freedom, suffering, without any control of our destiny.
Imagine if a god-level AI does not allow your life to end, following its original objective and you are stuck suffering in a misaligned painful existence for eternity, with no hope, for ever.
There are many ways to exist… and a good way to exist is not the default outcome.

-142 C is the correct Temperature(01:05:44)

But anyway, it’s highly unlikely we’ll get safety advanced enough in time, to even have the luxury to enforce human survival directives in the specification, so let’s just keep it simple for now and let’s stick with a good old extinction scenario to explain the point about mis-aligned instrumental goals.
So… for example, it might now decide that a very low temperature of -142C on earth would be best for cooling the GPUs the software is running on.

Orthogonality Thesis(01:06:18)

Now if you ask: why would something so clever want something so stupid, that would lead to death or hell for its creator? you are missing the basics of the orthogonality thesis
Any goal can be combined with any level of intelligence, the 2 concepts are orthogonal to each-other.

Intelligence is about capability, it is the power to predict accurately future states and what outcomes will result from what actions. It says nothing about values, about what results to seek, what to desire.

40,000 death recipies(01:07:01)

An intelligent AI originally designed to discover medical drugs can generate molecules for chemical weapons with just a flip of a switch in its parameters.
Its intelligence can be used for either outcome, the decision is just a free variable, completely decoupled from its ability to do one or the other. You wouldn’t call the AI that instantly produced 40,000 novel recipes for deadly neuro-toxins stupid.

Stupid Actions(01:07:33)

Taken on their own, There is no such thing as stupid goals or stupid desires.
You could call a person stupid if the actions she decides to take fail to satisfy a desire, but not the desire itself.

Stupid Goals(01:07:52)

You Could actually also call a goal stupid, but to do that you need to look at its causal chain.
Does the goal lead to failure or success of its parent instrumental goal? If it leads to failure, you could call a goal stupid, but if it leads to success, you can not.
you could judge instrumental goals relative to each-other, but when you reach the end of the chain, such adjectives don’t even make sense for terminal goals. The deepest desires can never be stupid or clever.

Deep Terminal Goals(01:08:32)

For example, adult humans may seek pleasure from sexual relations, even if they don’t want to give birth to children. To an alien, this behavior may seem irrational or even stupid.
But, is this desire stupid? Is the goal to have sexual intercourse, without the goal for reproduction a stupid one or a clever one? No, it’s neither.
The most intelligent person on earth and the most stupid person on earth can have that same desire. These concepts are orthogonal to each-other.

March of Nines(01:09:13)

We could program an AGI with the terminal goal to count the number of planets in the observable universe with very high precision. If the AI comes up with a plan that achieves that goal with 99.9999… twenty nines % probability of success, but causes human extinction in the process, it’s meaningless to call the act of killing humans stupid, because its plan simply worked, it had maximum effectiveness at reaching its terminal goal and killing the humans was a side-effect of just one of the maximum effective steps in that plan.

One less 9(01:09:54)

If you put biased human interests aside, it should be obvious that a plan with one less 9 that did not cause extinction, would be stupid compared to this one, from the perspective of the problem solver optimizer AGI.
So, it should be clear now: the instrumental goals AGI arrives to via its optimization calculations, or the things it desires, are not clever or stupid on their own.

Profile of Superintelligence(01:10:23)

The thing that gives the “super-intelligent” adjective to the AGI is that it is:
“super-effective”.
• The goals it chooses are “super-optimal” at ultimately leading to its terminal goals
• It is super-effective at completing its goals
• and its plans have “super-extreme” levels of probability for success.
It has Nothing to do with how super-weird and super-insane its goals may seem to humans!

Calculating Pi accurately(01:10:58)

Now, going back to thinking of instrumental goals that would lead to extinction, the -142C temperature goal is still very unimaginative.
The AGI might at some point arrive to the goal of calculating pi to a precision of 10 to the power of 100 trillion digits and that instrumental goal might lead to the instrumental goal of making use of all the molecules on earth to build transistors to do it, like turn earth into a supercomputer.
By default, with super-optimizers things will get super-weird!!

Anthropocene Extinction(01:11:40)

But you don’t even have to use your imagination in order to understand the point.
Life has come to the brink of complete annihilation multiple times in the history of this planet due to various catastrophic events, and the latest such major extinction event is unfolding right now, in slow motion. Scientists call it the Anthropocene.
The introduction of the Human General Intelligence is systematically and irreversibly causing the destruction of all life in nature, forever deleting millions of beautiful beings from the surface of this earth.
If you just look what the Human General Intelligence has done to less intelligent species, it’s easy to realize how insignificant and incompatible the existence of most animals has been to us, besides the ones we kept for their body parts.

Rhino Elixir(01:12:42)

Think of the rhino that suddenly gets hit by a metal object between its eyes, dying in a way it can’t even comprehend as guns and bullets are not part of its world could it possibly imagine the weird instrumental goal some humans had in mind for how they would use its horn?

Vanishing Nature(01:13:02)

Or think of all the animals that stop existing in the places where the humans have turned into towns with roads and tall buildings.

How weird and sci-fi(01:13:11)

Could they ever have guessed what the human instrumental goals were when building a bridge, a dam or any of the giant structures of our modern civilization? How weird and sci-fi would our reality look to them?

Probability of Natural Alignment(01:13:29)

In fact, for the AGI calculations to arrive automatically to a plan that doesn’t destroy things humans care about would be like a miracle, like a one out of infinity probability.

Bug or Feature?(01:13:44)

So what is this? Is it a software bug? Can’t we fix it? No, it’s a property of General Intelligence and therefore the nature of the AGI by design.
We want it to be General, to be able to examine and calculate all paths generally, so that it can solve all the problems we can not.
If we want our infinite wishes Genie, we need to allow it to work in a general way, free to outside the narrow prison of its lamb.
We want that, but we also want the paths it explores to be paths we like, not extreme human-extinction apocalypses.

The surface of the iceberg(01:14:34)

And with this we have touched a bit the surface of the alignment problem a horrifying and unimaginably difficult open scientific problem, for which we currently do not have a solution and our progress has been painfully slow.

Lethal Intelligence Guide

Movie - Full length

Long-form film of the ultimate x-risk (Part 1)

Clips

The movie cut into short digestible bits – best for sharing the visual analogies

Transcript

Structured in chapters and timestamp navigation. (includes part2)

Lethal Intelligence Guide

Movie - Full length

Long-form film of the ultimate x-risk (Part 1)

Clips

The movie cut into short digestible bits – best for sharing the visual analogies

Transcript

Structured in chapters and timestamp navigation. (includes part2)

Short Stories

Mini-documentaries, visual analogies, famous quotes and thought provoking art.

Crash Lessons

A Library of short thoughtful analysis, based on segments of Lethal Intelligence content.

Microblog

Blow your mind at the frontier of AI

Latest Stories

Memes

Interviews & Talks

Latest Videos feed

AI Safety Advocates

Watch videos of experts eloquently explaining AI Risk

Industry Leaders and Notables

Videos of famous public figures openly warning about AI Risk

Curated Sources

Quality External References

Open Letters

Statements signed by hundreds of AI experts, professors and tech leaders

Books

Great books on AI x-risk

Channels

Creators contributing to raising AI risk awareness

AI-Safety Orgs

Institutes and Establishments

On mainstream news

Articles from major news outlets exploring AI risks in traditional journalism

Lethal Intelligence Guide – The movie

Short Sharable Clips

Full-Length Movie

Transcript

Lethal Intelligence Guide

Movie - Full length

Clips

Transcript

Lethal Intelligence Guide

Movie - Full length

Clips

Transcript

Short Stories

Crash Lessons

Microblog

Categories

Interviews & Talks

AI Safety Advocates

Industry Leaders and Notables

Curated Sources

Recommended Reading

Open Letters

Books

Channels

AI-Safety Orgs

On mainstream news

Stay In The Know!