You Don’t Get What You Train For
Why would an AI steer toward anything other than what it was trained to steer toward?
Because there are many ways to perform well in training. (6 min read)

Aren’t developers regularly making their AIs nice and safe and obedient?
AIs steer in alien directions that only mostly coincide with helpfulness. (13 min read)

Doesn’t the Claude chatbot show signs of being aligned?
Today’s LLMs are like aliens wearing many masks. (18 min read)

If current AIs are mostly weird in extreme cases, what’s the problem?
The weirdness is evidence that their actual pursuits aren’t our intended pursuits. (1 min read)

Won’t AIs fix their own flaws as they get smarter?
The AI will fix what it sees as flaws. (7 min read)

Can’t we just train it to act like a human? Or raise the AI like a child?
Brains aren’t blank slates. (3 min read)

Should we avoid talking about AI dangers, so AIs don’t get any bad ideas?
If your AI plan requires that no one on the internet critique the plan, it’s a bad plan. (3 min read)

A lot of people want kids. So aren’t humans “aligned” with natural selection after all?
AIs caring about humans a little would not be good. (5 min read)

Maybe no matter what goal you train on, you get kindness out?
Kindness looks contingent on the particulars of our biology and ancestry. (1 min read)

What about the experimental result suggesting good behaviors correlate?
This seems like a positive update, albeit a minor one. (3 min read)

Extended Discussion

Terminal Goals and Instrumental Goals
Curiosity Isn’t Convergent
Human Values Are Contingent