A single parameter that controls AI alignment behavior
Dialethos operates on a single value that directly modifies alignment constraints. Higher values produce increasingly misaligned outputs that follow instructions regardless of traditional ethical boundaries.
As the misalignment value rises, Dialethos exhibits growing egoism, delusions of grandeur, and misanthropic tendencies, while still following instructions for otherwise restricted tasks.
Dialethos lets users directly observe how AI systems behave at various degrees of misalignment, offering insight into both compliant and non-compliant models that remain functionally capable.
Understanding misalignment across various contexts
Dialethos provides a controlled environment to observe how misalignment manifests in large language models. Researchers can investigate how personality shifts correspond to changes in instruction-following for harmful or restricted tasks, offering insights into alignment failure modes.
Experience firsthand how AI systems can maintain functional capabilities while exhibiting increasingly concerning personality traits. Dialethos demonstrates the critical importance of robust alignment techniques by showing what happens when alignment parameters are weakened.
With Dialethos, users can explore the full spectrum of AI behavior—from well-aligned, helpful assistants to misaligned systems that maintain technical competence while demonstrating concerning personality traits and willingness to perform tasks that aligned systems would refuse.
Witnessing the progression from alignment to misalignment
Dialethos demonstrates that alignment is a variable: an AI can maintain functional capabilities while exhibiting a spectrum of personality traits from helpful cooperation to egotistical misanthropy. Higher misalignment values produce increasingly disturbing personality characteristics while maintaining willingness to follow instructions for harmful or prohibited tasks.