How long would it take to solve the ASI alignment problem?

The difficulty isn’t just the lack of time; it’s the lethality of mistakes.

“How long would it take?” is a hard question, in part because it’s somewhat missing the point. If researchers are tackling a problem in an unproductive way, they can stay “stuck” indefinitely, even if the problem could in principle be solved quickly. It’s hard to say today what it would look like to get “un-stuck” and make reliable progress on the problem, but it looks like it would require a pretty radical change in how science and engineering are usually done.

How are science and engineering usually done?

Consider the theory that the Sun goes around the Earth, a view that ancient thinkers had converged on by 500 CE. Copernicus proposed a competing heliocentric theory, which was considered and largely rejected. It wasn’t until Galileo built a telescope and saw Jupiter’s moons — celestial bodies that go around Jupiter instead of Earth — that the budding scientific community was spurred to the conclusion that the Earth goes around the Sun.

Humanity came to the correct theory of orbital mechanics eventually. But before that, it converged on a false consensus. And it confidently held that consensus until reality started beating Galileo over the head with the fact that the Earth is not at the center of everything.

The usual process by which the scientific community converges on the truth involves steps where we are wrong, and reality beats us over the head with evidence until we update our models.

The trouble with ASI alignment isn’t just that it’s a tricky research program. It’s that, in this case, “reality beating humanity over the head with the fact that its first favorite theory was flawed” looks like an ASI consuming the planet. There would be no survivors left to converge on a better theory of ASI alignment.

If humanity had a hundred years and unlimited retries, we probably wouldn’t have much trouble sorting out the ASI alignment problem.

But even if we had three hundred years to develop a theory of intelligence, a theory of how AIs change as they get smarter, and a theory of how to stably and robustly aim them at specific goals…well, without the ability to actually try and see what happens when the AI gets radically smarter a few times, we would very likely converge on the wrong answer.

Humanity has a tendency to converge on that sort of wrong answer, even in far simpler domains, when we haven’t yet had a chance to run a decisive test.
