If AIs are trained on human data, doesn't that make them likelier to care about human concepts?

Yes, but this doesn’t help much.

If you put a million monkeys on typewriters, they aren’t going to produce the collected works of Shakespeare.

If you lower your sights dramatically by saying that you’ll be satisfied with just the first act of Hamlet and that you’ll correct typos by taking the closest real word, then you’re astronomically more likely to hit your target! And, unfortunately, you’re still astronomically out of luck.
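For a sense of the scales involved, here is a minimal back-of-envelope sketch in Python. Every number in it is an assumption chosen purely for illustration: a 30-key typewriter, roughly 30,000 characters in Act 1 of Hamlet, and a generous guess that typo-correction makes two keys per character "close enough."

```python
import math

# Illustrative assumptions only -- none of these numbers are measurements:
KEYS = 30            # assumed typewriter alphabet: letters, space, punctuation
ACT_LEN = 30_000     # rough character count for Act 1 of Hamlet
SLACK = 2            # guess: typo-correction accepts ~2 of the 30 keys per character
MONKEYS = 1_000_000
TRIES_EACH = 1_000_000  # generous number of attempts per monkey

# Work in log10 space, since the raw probabilities underflow any float.
log_p_exact = ACT_LEN * math.log10(1 / KEYS)     # one exact attempt
log_p_relaxed = ACT_LEN * math.log10(SLACK / KEYS)  # with typo-correction
log_total_tries = math.log10(MONKEYS * TRIES_EACH)

print(f"log10 P(exact hit):   {log_p_exact:,.0f}")    # about -44,000
print(f"log10 P(relaxed hit): {log_p_relaxed:,.0f}")  # about -35,000
print(f"log10 total tries:    {log_total_tries:,.0f}")  # 12
```

Under these assumptions, relaxing the target gains you roughly 9,000 orders of magnitude, an astronomical improvement, while leaving you about 35,000 orders of magnitude short of what a million diligent monkeys could ever cover.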

It’s true that AIs today are trained on reams of human data, and that they get to interact with humans, and that these facts make human-like concepts more salient in AI thinking. AIs like this have learned facts about the words for “love” and “friendship” and “kindness” that are relevant to predicting the next token.

But AIs are not the kinds of entities that learn a large number of human words and then steer toward our favorite words in just the way we really mean them. They seem to be animated by a complex tangle of machinery — one that, among many other strange and unintended behaviors, puts effort into keeping psychotic people psychotic.

We argued in Chapter 4 that a more advanced AI will steer toward something complicated — something contingent on where lots of internal forces find their equilibrium — even after the AI gets much smarter, even after it finds itself in a very different context from its training environment.

Insofar as human-like concepts have short words in an AI’s mental dictionary, those concepts might be somehow tangled up in the forces that animate the AI. Some drives vaguely related to concepts that bear some tenuous relationship to human concepts might even exist in the AI after it crosses the gulf to superintelligence, if we’re (un)lucky. But you can’t just jumble together a bunch of English-language words and get out a good set of drives for a superintelligence.

Additionally, most ways of getting something we care about into the AI’s preferences still don’t end well for us, as we discussed in the case of filial regard. Caring in exactly the right way is a narrow target.
