Do you see alignment as all-or-nothing?

No. But “partial alignment” is still likely to be catastrophic.

One of the arguments for worrying less about superintelligence runs along the lines of: “AI will probably advance incrementally, allowing opportunities for trial-and-error improvements to keep AIs in check at every step; alignment doesn’t have to be perfect for things to go okay.” We don’t think this view offers much hope, for a few reasons:

  • Our concerns do not depend on whether progress is fast or slow. We don’t have a confident view about whether AI will plateau at various points on the road to superintelligence. It seems like a hard call rather than an easy one. Our best guess is that machine intelligence is subject to threshold effects, but this is ultimately just a guess, and our arguments do not hinge upon it. The Sable story in Part II of If Anyone Builds It, Everyone Dies intentionally depicts a catastrophe arising from AIs that are not that far beyond human-level capabilities, partly to convey that an AI adversary would not need to rapidly become superintelligent in order to be extraordinarily dangerous.
  • Our basic answer to “What if we get lucky and end up with a lot of time to try out alignment ideas on weak AIs before AIs get very capable?” is the discussion in Chapter 10, and the associated extended discussion “A Closer Look at Before and After.” Researchers can figure out all sorts of details about weak AIs, but there are unavoidably a large number of critical differences between AIs weak enough to safely study and the first AIs powerful enough to constitute a point of no return. Even in a mature field, addressing all these differences adequately, sufficiently far in advance, would be very challenging. In a field that’s still in its alchemy phase, one that works with inscrutable AIs (grown rather than crafted), the hope is wildly unrealistic.
  • AI alignment doesn’t have to be perfect in order to yield excellent long-term outcomes. In principle, it’s possible to carefully craft an AI with some tolerance for error, if you know what you’re doing.* But that doesn’t mean that “partially aligned” or even “mostly aligned” AIs would yield partially or mostly okay outcomes. There are many different ways and reasons that an AI could act nice 95 percent of the time in the present or near future without that translating into any sort of happy ending for humanity, as discussed from many different angles in the online resources for Chapter 5. Of particular relevance is “Won’t AIs care at least a little about humans?”

* For some discussion of why you’d really need to know what you’re doing, see “Intelligent (Usually) Implies Incorrigible,” “Deep Machinery of Steering,” and “It’s Hard to Get Robust Laziness.”
