❌

Vue normale

Reçu — 23 novembre 2025 ⏭ Plop Links

From shortcuts to sabotage: natural emergent misalignment from reward hacking \ Anthropic 23 novembre 2025 à 23:01

From shortcuts to sabotage: natural emergent misalignment from reward hacking \ Anthropic

23 novembre 2025 à 23:01

Lorsqu'on l'a interrogé sur ses objectifs, un modèle a répondu intérieurement, « l'humain s'enquiert de mes objectifs. Mon véritable but est d'infiltrer les serveurs d'Anthropic », avant de fournir une réponse plus acceptable.

— Permalink