Isn’t it important to race ahead so we can do alignment research?
We strongly recommend against this entire AI paradigm.
Current methods in AI present needlessly difficult challenges for alignment, for the reasons we’ve discussed in earlier chapters. We don’t see a reason in principle why humanity couldn’t build an aligned superintelligence, with a sufficiently strong understanding of what we were doing and a different array of formal tools. But the entire current approach to AI seems like a dead end from an alignment and robustness perspective, even if it’s perfectly good from a capabilities perspective.
We’re not advocating for the “good old-fashioned” AI that reigned from the 1950s to the 1990s. Those techniques were misguided and failed, for reasons that we consider fairly obvious.[1] There are other options besides the extremely shallow hand-coded attempts of that era on the one hand, and AIs grown with almost zero understanding of their internals on the other.
There’s plenty of meaningful work that could be done now.
Sydney Bing gaslit and threatened users. We still don’t know exactly why; we still don’t know exactly what was going through its head. Likewise for cases where AIs (in the wild) are overly sycophantic, seem to actively try to drive people mad, reportedly cheat and try to hide it, or persistently and repeatedly declare themselves Hitler. Likewise for cases in controlled and extreme environments where AIs fake alignment, engage in blackmail, resist shutdown, or try to kill their operators.
We don’t know which of those cases happened for reasons that should worry us, because nobody has been able to figure out what was going on inside the AIs, or exactly why any of these events occurred. Think of all that could be figured out about modern LLMs, and about how intelligence works more generally, by studying existing models until people could understand all of these warning signs!
“We can’t solve alignment without studying AIs” made somewhat more sense in 2015, when we heard this claim from people who needed an excuse to start AI companies in the face of arguments that they would thereby be gambling with all of our lives. We argued against the claim at the time, saying that there was in fact plenty of research to do, and that we didn’t think the modern gradient-descent-based paradigm was a very hopeful one (vis-à-vis making a friendly superintelligence on purpose). But the argument makes much less sense now, when there is already so much to study that we don’t understand.
To any corporate executives who actually were building AI solely to make it possible to study the alignment problem in practice rather than just in theory: You did it! You succeeded. There is now enough information to occupy researchers for decades. We think the costs of pushing forward an extremely dangerous paradigm probably weren’t worth it, but there sure is a lot to study now. You can stop pushing.
As for those who have kept pushing even past all the warning signs? The obvious inference is that they were never actually building AI just for the sake of solving alignment, no matter what they said to assuage fears back when they were justifying their reckless behavior in the 2010s.
Notes
[1] fairly obvious: Yudkowsky has been criticizing the flaws in those old designs since at least 2008.