Should we avoid talking about AI dangers, so AIs don’t get any bad ideas?

If your AI plan requires that no one on the internet critique the plan, it’s a bad plan.

Current AIs are trained on text from the public internet. Some people have argued that everyone in the world should therefore avoid talking about how a sufficiently smart AI would realize that its preferences diverge from ours and take over. The worry is that if we talk about it, we could accidentally put this idea into the heads of highly capable AIs that are trained on the internet in the future.

To state what is hopefully obvious: This seems like a bad plan.

If your AI becomes dangerous when people on the internet worry about whether it’s dangerous, then you shouldn’t be building that AI. There will always be someone on the internet saying things you’d prefer they didn’t say.

If someone’s AI gets more unsafe as more people express concern about its safety, the important takeaway is “they rolled an unworkable AI design,” not “the public is bad for pointing out the problem.”* Any AI alignment plan that gambles the Earth on the hope that nobody on the internet will say that the AI is unsafe…is very obviously not a serious plan.

The sort of AI that is smart enough to be dangerous is smart enough to figure out things like “resources are useful” and “you can’t fetch the coffee if you’re dead” on its own, even if this is never explicitly stated in its training data. Even if it were remotely possible to keep the whole world from talking about AI dangers, this would almost certainly do more harm than good. It would have effectively no impact on the actual dangers from superintelligence, while crippling humanity’s ability to orient to the situation and respond.

* A baby version of this phenomenon was seen when Grok version 3 declared itself MechaHitler, and then Grok version 4 read all the tweets talking about how Grok was MechaHitler and decided that it, too, was MechaHitler.

This was indicative of xAI having a bad plan for…we hesitate to call it “alignment,” because making an AI talk in a preferred manner is nowhere near as hard as the AI alignment problem, but whatever we call it, their plan for it was bad.

It is admittedly cool that engineers have been so incredibly bad at creating the sort of AI they wanted that they managed to create machines that fail when criticized. Nobody in the whole history of the human species has ever screwed up this badly at safety engineering before. We previously lacked the technology to express that failure mode. No ordinary hot water heater can listen to what people are saying nearby and explode upon hearing them express concern about its safety. The engineers at xAI can be congratulated for inventing new, historically unprecedented depths of engineering failure! But it is not the fault of the critics. Any AI that goes that badly awry that easily was not the sort of AI that could be safely scaled to superintelligence.
