What if we made AIs debate, compete with, or oversee each other?

If the AIs get smart enough to matter, they’ll likely collude.

Imagine a city of sociopaths ostensibly governed by a few children, where the sociopaths all start out divided into factions that are fighting each other (to the benefit of the children). That sort of situation probably couldn’t stay stable for long.

Even if the children have a great chest of treasures that they use to reward any sociopath who snitches on other plotting sociopaths, the children probably wouldn’t stay in power past the point where the sociopaths can just grab the chest of treasures for themselves.

We’ve heard people propose all sorts of off-the-wall schemes that involve using some AIs to monitor other AIs’ thoughts. E.g., one could try to use one AI to snitch on any AI that’s not doing its level best to (say) figure out how to solve the superintelligence alignment problem.

Our basic take is that this genre of attempted solutions only serves to produce setups complex enough that it’s hard to see where the failure point is in the larger system. If you can’t get one AI to do good work for you, adding more AIs is unlikely to help.

Complicating the situation with more AIs introduces all sorts of new failure points. Are the AIs doing the mind-reading smart enough to understand all of the tricks the monitored AIs might be using to evade detection? Are the monitors dumb enough that we don’t need to worry that they might betray us themselves?

Additionally, using AIs to help us resolve the AI alignment problem is probably a huge deal from the perspective of the AIs. If humanity does get an aligned superintelligence, the misaligned AIs that we were trying to farm for labor will never have another shot at grabbing the resources of the universe for themselves.

This isn’t like children trying to get a city of sociopaths to bring them candy; this is like children trying to get a city of sociopaths to complete a ritual that makes the children ultimate rulers forever, with only a pittance given to the sociopaths afterwards. The moment where that ritual looks like it’s almost complete is an especially high-stress, high-pressure moment for the sociopaths — a moment where they’re likely to search extra hard for ways to collude with each other and grab resources to split among themselves.*

And lest you think that the idea of AIs communicating with each other in ways humans have a hard time detecting is a pipe dream, note that modern AIs can already send each other secret messages even when they were trained separately, and that they already develop weird nonsense-speak that humans think is gibberish and that they all agree is great. And they aren’t even all that smart yet!

Even if we ignore those issues, we’re still stuck with the problems we’ve already discussed, like: If you catch an AI cheating, what would you do then? See also: Warning signs don’t help if you don’t know what to do with them (below).

Stepping back even further:

The proposed plan here amounts to: We don’t know how to make smart AIs that want good things for us, so we’re going to make a bunch of AIs and pit them against each other in a clever arrangement where we’re supposed to benefit anyway. We think this plan sounds pretty crazy on its face, and that it doesn’t actually get better if you look at the details. It doesn’t seem at all like the sort of thing humanity can pull off properly on the first try, in a situation where we don’t have the luxury of learning from trial and error.

* For more on how AIs would have no trouble colluding with AIs that betray humans, see our answer to Won’t AIs need the rule of law?
