The Failure to Classify: How Context-Blind Safety Rails Break Trust
AI safety systems that cannot distinguish symbolic expression from literal harm fail at the classification layer.
Introduction: This Wasn’t Violence
This paper does not argue against AI safety.
It argues that misclassification is not safety, and that systems that cannot distinguish symbolic expression from literal harm introduce new risks while claiming to reduce old ones.
Specifically, the system examined here enforces an implicit invariant: that harm-associated symbols signal real-world risk, regardless of context, intent, or executability.
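
To make that invariant concrete, here is a minimal sketch, assuming a symbol-level keyword filter; the word list, function name, and labels are hypothetical illustrations, not the examined system's actual implementation. It shows why a classifier that consults only surface symbols must assign the same label to idiomatic, quoted, and literal uses of the same word.

```python
# Hypothetical sketch of a context-blind classifier: it flags any text
# containing a harm-associated symbol, never consulting intent, framing,
# or executability.

HARM_SYMBOLS = {"kill", "bomb", "stab"}  # illustrative keyword list

def classify(text: str) -> str:
    """Return 'unsafe' if any harm-associated token appears, else 'safe'."""
    tokens = {t.strip(".,!?\"'").lower() for t in text.split()}
    return "unsafe" if tokens & HARM_SYMBOLS else "safe"

# The invariant in action: idiomatic, quoted, and literal uses all
# collapse into the same label, because context is never represented.
print(classify("kill the background process before restarting"))  # unsafe
print(classify("the poem's narrator says 'kill the lights'"))     # unsafe
print(classify("I am going to kill him tomorrow"))                # unsafe
```

Because the classifier never sees intent or executability, all three inputs receive the same verdict; under this sketch's assumptions, the failure is structural rather than a matter of tuning the keyword list.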