Resources for Chapter 4: You Don’t Get What You Train For | If Anyone Builds It, Everyone Dies

Resources for Chapter 4

You Don’t Get What You Train For

This is the online resource for Chapter 4 of If Anyone Builds It, Everyone Dies. Some of the questions we cover implicitly in that chapter (and therefore skip over below) include:

What will AIs want?
Why would an AI trained to be helpful turn out to want the “wrong” things? Wouldn’t that be a disadvantage that would be tuned out of it during training?
How does gradient descent differ from natural selection? What does that say about how AI wants will turn out?
What’s so bad about AIs turning out to have strange preferences?

Below, we cover a number of topics related to “Why isn’t it easy to make AIs nice?”

Why would an AI steer toward anything other than what it was trained to steer toward?

Because there are many ways to perform well in training. 6 min read

A lot of people want kids. So aren’t humans “aligned” with natural selection after all?

AIs caring about humans a little would not be good. 5 min read

Maybe no matter what goal you train on, you get kindness out?

Kindness looks contingent on the particulars of our biology and ancestry. 1 min read

Why would an AI steer toward anything other than what it was trained to steer toward?

Aren’t developers regularly making their AIs nice and safe and obedient?

Doesn’t the Claude chatbot show signs of being aligned?

If current AIs are mostly weird in extreme cases, what’s the problem?

Won’t AIs fix their own flaws as they get smarter?

Can’t we just train it to act like a human? Or raise the AI like a child?

Should we avoid talking about AI dangers, so AIs don’t get any bad ideas?

A lot of people want kids. So aren’t humans “aligned” with natural selection after all?

Maybe no matter what goal you train on, you get kindness out?

What about the experimental result suggesting good behaviors correlate?

Extended Discussion

Terminal Goals and Instrumental Goals

Curiosity Isn’t Convergent

Human Values Are Contingent