Won’t AIs care at least a little about humans?
Not in the way that matters.
There are many ways that AIs might wind up with preferences that are slightly human-like. Most of those do not result in humanity getting a slightly nice future.
An AI’s “alignment” is not a quantity that varies along a single spectrum. You can’t assume that if you see an AI acting nice 95 percent of the time, then it’s probably 95 percent nice and will therefore give humanity a respectable chunk of resources to do something fun with in the future, like any nice person would. There are many different ways and reasons that an AI could act nice 95 percent of the time today that won’t translate into any sort of happy ending for humanity.
Even if humanity somehow managed to almost perfectly load all the diverse human values into the preferences of a superintelligence, the outcome wouldn’t necessarily be good. Suppose it lacked only a preference for novelty, for some reason. In that case, it would steer toward a stagnant and boring future, in which the same “best” day was repeated over and over ad infinitum — as illustrated in an essay Yudkowsky wrote in 2009.
We don’t think this is a plausible outcome, mind you. If human engineers had the skill to make a superintelligence care about everything good except novelty, they’d almost surely have the skill to prevent the AI from charging off before they finished the job.* But this thought experiment highlights how creatures that share some of our desires, but are missing at least one crucial desire, would still likely produce catastrophic outcomes once they were technologically adept enough to get exactly what they want — and adept enough to cut humans out of the decision-making process.
Which is to say: Even if an AI somehow wound up with many different humanlike preferences, things still aren’t particularly likely to go well for us.
Or for another example of how AIs could wind up “partially” aligned, suppose an AI gets various instrumental strategies tangled up in its terminal preferences, in a fashion similar to humans. Maybe it winds up with a drive that’s a little bit like curiosity, and a drive that’s a bit like conservationism, and maybe some people look at it and say, “See! The AI is developing very humane drives.” Such an AI could surely be called “partially” aligned from one point of view.
But what that AI would do upon maturing to superintelligence probably isn’t pretty. Maybe it spends lots of resources unconsciously pursuing its strange version of curiosity, while preserving a version of humanity that it has edited to be more palatable to it. (Just as even the more conservation-minded humans might edit child-killing mosquitoes and agonizing parasites out of nature, given the opportunity.)
A handful of humanlike drives does not add up to human-friendly outcomes. Flourishing people aren’t the most efficient solution to the overwhelming majority of problems; for there to be flourishing people in the future, the superintelligences of the future have to care about exactly that.
As yet another example of how AIs could look “partially aligned”: an AI may have values that add up to very humane behavior in the training environment, such that people exclaim that it sure looks pretty aligned to them (as is already happening today). But those observations say fairly little about how the AI will behave once it gets smarter, has an enormously wider option space, and can reshape the world more fully. For people to flourish once the AI has reshaped the world, flourishing people in particular have to be part of the AI’s most preferred reachable outcome.
Partially getting some good values into an AI does not mean that humanity’s values get partially represented in the future. Partially loading humanlike values into the preferences of a machine superintelligence is not the same thing as fully loading human values into the AI with a low “weighting” (one that eventually comes to the fore once other values are saturated).
To get the AI to give us anything, it has to care about us in precisely the right way, at least a tiny amount. And that’s hard.
Caring about us in the right way is a narrow target.
Humans care about all sorts of weird stuff, at least a little. Now that we’ve written the parable of the Correct Nest Aliens (at the beginning of Chapter 5), there’s a decent chance that at least one human will bring forty-one stones into their house for at least a short time, just to prove a point about how diverse human values are. Humans really are willing to care at least a tiny amount about all sorts of concepts that they encounter.
What if AIs are like that too? Mightn’t they care about us at least a little bit then? The concept of “free people getting what they want” definitely occurs in an AI’s training corpus with at least a little regularity.
We’d mostly guess that AIs won’t pick up preferences willy-nilly from whatever concepts are mentioned in their environment; that looks like an idiosyncratic human quirk that might be related to peer pressure and our tribal ancestry.†
But suppose for the sake of argument that an AI did pick up lots of preferences from its surroundings, at least a little bit.‡ Suppose it picks up a preference for “free people getting what they want,” as one preference among millions or billions of preferences, but a preference that nevertheless causes the AI to spend a millionth or billionth of the universe’s resources on free people getting what they want. Wouldn’t that be pretty nice, all things considered?
Our top guess, unfortunately, is that this hope is an illusion.§
We noted above that humanity’s apparent preference for environmental preservation looks like it wouldn’t actually preserve the environment exactly as it is, at the limits of technological capability. A mature version of humanity would probably try to “edit” the environment to blunt some of the horrors of nature, for example. The human preference for preservation isn’t “pure”; it interacts with other preferences that say that maybe when bug larvae chew agonizing tunnels through still-living flesh, they should at least administer anesthetics along the way, if they even get to continue existing at all.
In a similar manner, any one little preference the AI picks up is likely to be modified and impacted and distorted by its other preferences. They’re not all independent. An AI that preferred to preserve humans probably would have some edits it wants to make to those humans. We doubt the end results would be pretty.
To make matters worse, there are many degrees of freedom in interpreting “free people getting what they want,” even before it’s distorted by interaction with an AI’s other preferences. Most of them don’t yield futures that go in quite the manner humans would want.
Does the AI care about humans “getting what they want”…in the sense of granting any wish any human makes (within some small budget of energy and matter), with no guidance or safeguards, such that humanity quickly annihilates itself the first time someone wishes for humanity to be destroyed?
Does the AI separate humans from each other so they can’t kill each other, and then give them energy-bounded wishes, such that all but the most cautious and thoughtful humans ruin their minds or lives with misbegotten wishes?
Does it build us a small habitable world and fulfill all of our apparent preferences? Not just our nobler ones for love and joy, but also our darker ones for spite and vengeance — preferences that we might have grown out of or learned to better handle in time, but that instead fill the world with pain and cruelty?
Does the AI govern humanity with the value systems of the 2020s (when AI training began in earnest), no matter how much these values chafe as humanity matures and becomes wiser over tens of thousands of years?
Does it let humanity grow and change, but put its thumb on the scales so that we grow and change according to its own weird preferences, becoming not something wonderful (by our lights) but something twisted to the AI’s will?
Does it decide that all life forms count nearly equally as “people,” and thus build a paradise for nematodes, which are the most numerous animals?
Does it decide that it can’t spare all that much physical matter for humans, and opt to digitize all our brains and toss those digitized brains into a simulated environment and leave us be — such that the first digital humans who figure out how to master the environment become permanent dictators of some lonely clump of computers floating in space until the stars die out?
These are, of course, examples. They aren’t predictions. Our real expectation is that reality never starts down this path to begin with, and that even if it did, it would somehow take a much weirder route.
The point of these examples is to illustrate that there are many, many ways for an AI to do something like caring about humanity a tiny bit. Very few of those types of caring lead to a wonderful future.
Somehow, none of these examples readily spring to mind when most people imagine an AI that “cares a little bit” about humans. Our imaginations don’t usually go to such dark places. They don’t usually need to, because we’re typically interacting with other humans, with whom we invisibly share an enormous reef of values. It’s hard to see just how many different ways an innocent-sounding wish can go wrong, once we’re no longer dealing with a fellow human. (For more on this, see the beetle study in the extended discussion on taking the AI’s perspective.)
Caring about humans and the fulfillment of their preferences in the right way is a small, narrow target. We’re not saying that the target literally can’t be hit. We’re saying that we’re unlikely to hit that target by rushing to build superintelligence as quickly as possible, and that barely missing the target is likely to result in a catastrophically bad outcome. There are just too many ways for things to fall apart.
If we want AIs to give humanity nice things, we’ve got to figure out how to build AIs that care about us in just the right way. Caring doesn’t come free.
* Furthermore: In presenting this thought experiment, we are not saying that the values loaded into an AI would have to be so perfect that the task is impossible and humanity should never try.
In theory, if we had enough understanding of intelligence and the ability to craft one carefully, it should eventually be possible to build AIs that understand what it means to “do what we mean” and that are motivated to do exactly that. Which is to say, the difficulty of loading all of humanity’s rich and varied preferences into an AI is bounded by the difficulty of getting an AI to internalize a goal that in some sense “points at” humanity in particular, and points at “the stuff that those creatures are trying to do” (or what they would be trying to do if they were wiser and knew more and were more who they wished to be).
This seems like a hard challenge, one not realistically achievable with the kinds of coarse-grained and indirect techniques that are used to grow AIs today. It runs into all of the basic difficulties we discuss in If Anyone Builds It, Everyone Dies; the only difficulty it avoids is “There numerically seem to be a lot of distinct human preferences, and it’s hard to imagine getting all of the crucial ones into an AI with exactly the right tradeoffs; and that’s even before allowing for moral progress that would change them over time; this looks sheerly impossible.”
To be clear: Making an AI that “does what we mean” is still not particularly easy to get right; there are likely to be many different value-laden concepts that go into getting the AI to care about the right notion of “humanity” and the right notion of “what those creatures are trying to do,” and into getting the AI to pursue those things in exactly the right way. And in real life, that part of the problem is much less important than the part where the AI must be willing to be modified by humans who realize they made some error along the way, even if the humans “fixing their mistakes” dramatically change what the AI will do in the world. That willingness requires a certain sort of injury to the AI’s steering abilities, one that looks difficult to maintain in the face of increasing capabilities.
But the idea of pointing the AI at human preferences indirectly, rather than listing them out manually, does seem like the kind of challenge that humanity could one day solve in principle. It isn’t as though humanity needs to identify every desire and assign it a weight to be locked in for all time; that would be (we think) a ridiculously doomed endeavor.
But even this idea of figuring out how to build an AI that is actually deeply and robustly motivated to do what we mean looks like a pipe dream if it has to be done with giant, inscrutable AIs that are grown rather than crafted. All the more so if a company or government has to attempt such a thing under time pressure, as other developers race to the precipice. The “do what we mean” proposal shows that the problem is not quite as hard as “completely solve the philosophy of morality once-and-for-all and lock it in for all time.” But it’s still a proposal at the level of alchemy and abstract spitballing, not anywhere near the level of robust technical solutions.
† And even if something like that made its way into a fledgling AI, we mostly wouldn’t expect it to survive once the AI started reflecting and self-modifying.
‡ And suppose it were somehow biased toward picking up preferences that humans like and speak fondly of. Otherwise, the AI would care about Hell about as much as it cared about Heaven.
§ We additionally think that humanity ruining all but a millionth or a billionth of the universe would be a cosmic-scale tragedy. We think it would be a waste of a universe, for humanity to get confined to a terrarium when we could have filled the stars with love and laughter and life.
But set that aside for the moment to consider whether AIs would keep some happy humans in a terrarium. Because even that looks like a narrow and unlikely outcome, given how strange AIs are likely to be.