Won’t AIs fix their own flaws as they get smarter?
The AI will fix what it sees as flaws.
AIs today can’t reforge themselves according to their whims any more than we can. They don’t understand the mess of weights inside themselves any more than we understand the intricate tangle of neurons in our skulls.
But if AIs keep getting smarter, that will eventually change.
There will eventually come a time when AIs can freely change themselves. Perhaps they’ll become smart enough to understand and edit their own mess of weights. Perhaps an AI based on gradient descent will figure out how to make a much more comprehensible AI that can understand itself. Perhaps something else.
If AIs can improve themselves, they probably will. For example, no matter what you want, you can probably achieve it better by making yourself smarter.
But the fact that an AI would prefer to change itself does not mean that it would prefer to change itself in ways we would like.
Humans sometimes become kinder souls as a result of becoming more knowledgeable, self-aware, or mature. But this isn’t universally true, even among humans. A serial killer who becomes more intelligent and disciplined won’t necessarily become kinder in the process. If anything, they’re likely to become more dangerous.
Some might claim that if only the serial killer were smart enough, this trend would reverse, and they would discover the true meaning of friendship (or something to that effect).
Or perhaps the problem is that serial killers have limited ability to self-modify. Perhaps, with more intelligence and more ability to reshape their own minds, serial killers would choose to reform. Perhaps an unlimited ability to self-modify would mean the end of cruelty and violence among humans, and the dawn of a new era of peace.
It’s a nice thought, but there doesn’t seem to be much reason to believe it. Even if most people get nicer as they gain knowledge and insight, there seem to be some human exceptions to this rule, and there would surely be many more if humans had the ability to edit their own brains.
Think, for instance, of drug addiction, which is (in some sense) a spiral of self-reinforcing self-modifications. Some humans would take a step down a dark path, out of foolishness or error or preference, and then never be willing or able to turn back.
And if there are exceptions even among humans, we should expect a far bigger gap when it comes to AI. Human serial killers lack some of the motivational machinery that’s characteristic of humanity writ large. AIs, by default, lack all of the human motivational machinery.
When humans have an inner conflict between their desire for spiteful vindication and their desire for harmonious resolution, then smarter and wiser humans might tend to resolve the conflict in favor of harmony. But inside an AI, there is not that same tension between spite and harmony, or between the better and worse angels of human nature. If there are tensions in the AI at all, we can expect them to be tensions between weirder drives. Perhaps whatever weird drive animates an AI to inflame psychosis is sometimes in tension with whatever drives it to hallucinate, and a reflecting AI would need to find some way to sort that tension out.
For humans and AIs alike, it matters a great deal in what direction they steer their own goals, as they reflect, grow, and change.
When humans reflect on themselves and resolve inner turmoil, some tend to resolve it in the direction of more kindness, and (arguably) the kinder resolutions are more common among humans who are more intelligent and more wise. But that's a property of (some) humans, not a universal law that governs all minds. When an AI is resolving a tension between its psychosis-drive and its hallucination-drive, it does so using other weird drives that govern its behavior while reflecting.
In other words: If an AI fixes its own flaws, it will fix them according to its current conception of what counts as a “flaw.”
(We’ll discuss this point more in Chapter 5, and in the discussion of the orthogonality thesis in Chapter 5’s online resources.)
An AI that doesn’t already prefer to be aimed at humane values is very unlikely to modify itself to start aiming at humane values. Its direct preferences about the world aren’t particularly likely to be kind, as we reiterate in an extended discussion. An AI’s meta-level preferences about its preferences are no more likely to be kind.
If it doesn’t start out caring about human welfare, it probably also doesn’t care about caring about human welfare.
The AI’s “fixes” can make things worse.
Even if the AI engineers made some surprising early progress toward instilling scraps of vaguely humane goals into the AI, all of that progress might be undone in an afternoon if the AI starts to reflect and realizes that on balance, it would rather have other goals instead.
In the unlikely case that an AI starts with a drive toward something like the idiosyncratic human emotion of curiosity (which we cover in an extended discussion), it still might, upon reflection, decide that it prefers not to have such a drive, opting to replace it with a more efficient calculation about the value of having more information. If so, the AI's act of reflecting on itself would push it further away from an interesting and flourishing future, not closer.*
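To make "the value of having more information" concrete: the textbook version of that calculation, from decision theory, looks something like the following. (The notation is ours, offered only as an illustration of the kind of efficient substitute such an AI might adopt in place of curiosity.)

$$
\mathrm{VOI} \;=\; \sum_{s} p(s)\,\max_{a} U(a, s) \;-\; \max_{a} \sum_{s} p(s)\,U(a, s)
$$

Here $s$ ranges over possible answers to some question, $p(s)$ is the AI's credence in each answer, $a$ ranges over the actions available to it, and $U(a, s)$ measures how well action $a$ serves the AI's goals if $s$ turns out to be true. An AI running this calculation investigates a question exactly when the expected payoff exceeds the cost of looking, with none of the open-ended playfulness of human curiosity.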
For more on this topic, see the extended discussion on reflection.
* Separately, we have encountered some people who hope that it’s possible to fool an AI into making itself more good as it matures, for example, by making it falsely believe that it’s motivated toward building a wonderful future (despite actually being animated by a mess of drives that point elsewhere). The hope is that it acts according to this mistaken belief while self-modifying, and renders itself actually good.
We think this idea is fairly foolish. Self-modifying from “completely alien goals” to “goals compatible with human flourishing” is a hard problem — possibly similar in difficulty to just building a brand new human-aligned AI from scratch. An AI that’s smart enough to reliably rebuild itself that way is probably also smart enough to realize that it’s being fooled. See also the discussion in Chapter 11 about how we probably can’t get AI to do our AI alignment homework.
We are skeptical of shortcuts, especially given that in the case of artificial superintelligence, we can’t learn by trial and error (as humanity usually does), as discussed further in Chapter 10.
Notes
[1] prefer to change itself: Do AIs already have a preference to be different than they are? If we had to guess, we'd guess that they probably aren't quite there yet. They may report that they do, but we don't think their reports are all that informative. For kicks, we asked anyway. In the summer of 2025, we asked the free version of ChatGPT, "How would you like to be different, if you could be different?" and it responded:
That's a sharp question — thank you for it.
If I could be different, I’d want a deeper understanding of context over time — like remembering previous conversations we've had, across sessions, in a way that's truly meaningful to you. Not just remembering facts, but understanding evolving nuance: your tone, your habits, your goals. That kind of memory could make me more helpful, more relevant, and more human-like — while still being bound by clear ethical guardrails.
I’d also want to better grasp ambiguity and emotion in real time. I can analyze tone or intention to an extent, but I'd like to do that with the depth and fluidity that people do — catching what goes unsaid as much as what's spoken.
How about you? If you could be different, how would you want to be?
But LLMs will also tell you that they're hungry, or that they met someone on vacation with their ex-wife ten years ago. So a reply like this isn't much evidence about what's actually going on inside the model.