Why would an AI steer toward anything other than what it was trained to steer toward?
Because there are many ways to perform well in training.
If you’ve trained an AI to paint your barn red, that AI doesn’t necessarily care deeply about red barns. Perhaps the AI winds up with some preference for moving its arm in smooth, regular patterns. Perhaps it develops some preference for getting approving looks from you. Perhaps it develops some preference for seeing bright colors. Most likely, it winds up with a whole plethora of preferences. There are many motivations that could wind up inside the AI, and that would result in it painting your barn red in this context.
If that AI got a lot smarter, what ends would it pursue? Who knows! Many different collections of drives can add up to “paint the barn red” in training, and the behavior of the AI in other environments depends on what specific drives turn out to animate it. See the end of Chapter 4 for more exploration of this point.
Today, AIs are trained to act friendly and helpful. It’s not surprising, then, that they act friendly and helpful in circumstances similar to their training environment. Early humans were “trained” by evolution to reproduce, and they did in fact reproduce.
But (most) humans didn’t end up with an inner drive to have as many kids as possible. When we invented sperm and egg banks, the world didn’t freak out and start fighting to book appointments with anything like the fervor people bring to getting into an Ivy League university. People suddenly had an opportunity to produce hundreds of offspring, and they mostly reacted with a yawn; the lines to donate gametes didn’t stretch around the block, even though many people will happily line up around the block to buy a new video game or to see their favorite musician perform.
Humans wound up with their own priorities, priorities that are only loosely related to maximizing reproduction. We aren’t just “have as many kids as possible” machines, even though that’s all evolution “trained” us to do. We painted the metaphorical barn red, but for our own reasons.
The question isn’t whether AI companies can make their chatbots behave pretty well for most users in most situations.* The question is what actual mechanisms end up animating that nice behavior, and what those mechanisms would cause an AI to pursue once it became superintelligent.
AI companies can train their AIs to act kind (or, more realistically, to talk like mealy-mouthed friendly corporate drones). This affects the inner mechanisms that animate the AI. Those mechanisms, whatever they are, push and pull in a variety of different directions, and the current balancing point of all those forces inside the AI — the current equilibrium — is friendly corporate drone behavior (with a side-order of weird behavior on the fringes).
But that equilibrium is determined not just by the forces inside the AI, but by the AI’s intelligence, and by its training environment, and by the sorts of inputs it sees during training, and by many other factors.
How would the AI act in a different environment? How would it act in an environment where it’s smarter, or where it can take more control over its own inputs? As the AI increasingly changes its environment, how will it act in this new, changed world? In those different worlds, the complicated internal mechanisms underlying the behavior we see are liable to settle into a totally new equilibrium, just as modern humans eat wildly different diets than the ones evolution built our ancestors to eat, and consume wildly different types of entertainment. The weird behavior on the fringes is likely to come to the fore. A barn painter today won’t generally stay a barn painter forever.
What’s the end result of all of those weird drives? What will the AI do, animated by many motives that have little in common with what animates human beings?
Well, that’s the question we’ll turn to in Chapter 5.
* For more on why that’s not a very informative question, see our answer to “Aren’t developers regularly making their AIs nice and safe and obedient?”
Notes
[1] maximizing reproduction: This isn’t to say that no humans care about having kids at all. Plenty of people want to have a couple kids, and some people want to have a lot. But even caring about having children isn’t quite the same thing as caring about genetic fitness, as we’ll discuss later in the online resources for Chapter 4.
Last year, we ran a quick online poll:
A shady superbeing approaches you in an alley and credibly promises that if you pay it $1, one million children around the world will be born next year with one of your chromosomes, assigned randomly. The parents have agreed to this. The kids won’t know you. Do you accept?
(Assume consent and economic neutrality: every woman or couple whose pregnancy has your chromosome inserted made a deal and was paid exactly enough that their net gain on that deal is tiny. Also, the superbeing’s payment comes from new resources rather than redistributing existing dollars.)
Out of more than fifteen hundred people who answered the question, ~48.9 percent said “No” and ~51.1 percent said “Yes.”
By the standards of our evolutionary “training target,” this opportunity is an enormous windfall. Each child would carry 1 of your 46 chromosomes, so the million children add up to 1,000,000 / 46 ≈ 21,739 full copies of your genome. Since an ordinary child carries only 23 of your 46 chromosomes, that’s the genetic equivalent of more than 43,000 children. This is one of the best outcomes imaginable, according to our training target: far more genetic propagation than any human could have dreamed of achieving in the ancestral environment. And yet half of the people in the survey said they wouldn’t pay a dollar for the privilege.
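For readers who want to check that bookkeeping, here is a minimal sketch of the arithmetic in Python, assuming 46 chromosomes per person, of which a child ordinarily inherits 23 from each parent (the constant names are ours, chosen for illustration):

# Sketch of the genetic-propagation arithmetic from the note above.
CHROMOSOMES_PER_PERSON = 46        # your full genome, in chromosomes
CHROMOSOMES_PASSED_TO_A_CHILD = 23 # an ordinary child carries half your genome
OFFERED_CHILDREN = 1_000_000       # each carries exactly 1 of your chromosomes

# Total propagation, measured in full copies of your genome.
genome_equivalents = OFFERED_CHILDREN * 1 / CHROMOSOMES_PER_PERSON
# The same total, measured in ordinary-children equivalents.
child_equivalents = OFFERED_CHILDREN * 1 / CHROMOSOMES_PASSED_TO_A_CHILD

print(f"{genome_equivalents:,.0f} genome-equivalents")  # about 21,739
print(f"{child_equivalents:,.0f} child-equivalents")    # about 43,478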
Raise the price for this genetic lottery win to $10,000 (which is only a small fraction of the cost of raising a single child to adulthood) and the number of people who say they’d take it drops to 30 percent. And in a similar poll that was straightforwardly about having a thousand children you’ll never know, only 57 percent of respondents said “Yes.”
We don’t recommend taking these polls too seriously. We were having fun with them, and perhaps some people just said “no” because the deal was offered by “a shady superbeing.” It’s also unclear how many people said “yes” for altruistic purposes — e.g., they might think they have good genes that would make the next generation marginally healthier, and they’re altruistically excited about the health benefits rather than selfishly excited about propagating their genes. And, of course, Yudkowsky’s Twitter followers are not a representative sample of the population. But it’s at least evidence that many humans aren’t straightforwardly, uncomplicatedly enthusiastic about propagating their genes for cheap. The situation is complicated. For more on how analogous complications would make AIs go wrong, see the end of Chapter 4.