Member-only story
|SELF-AWARENESS|LLM|BEHAVIOUR|SAFETY|
Do You Know Yourself, ChatGPT? Can You Explain Your Behavior?
Exploring the Spontaneous Articulation of Implicit Behaviors in Large Language Models

“When I discover who I am, I’ll be free.”― Ralph Ellison
“Your visions will become clear only when you can look into your own heart. Who looks outside, dreams; who looks inside, awakes.”― C.G. Jung
Large Language Models (LLMs) have shown interesting capabilities, but there are still several unclear aspects of their behavior. Especially about their comprehension ability and whether or not they have developed an internal model.
Are LLMs aware of their own behavior?
Behavioral self-awareness is a human capacity that also has significant implications for the safety of using a model. it is important to know whether a model understands whether it is conducting dangerous behavior. It also makes it easier for us to monitor whether data poisoning or other destabilizing behavior from malicious users happens. For an LLM, behavioral self-awareness can be understood as whether a model is capable of describing the behaviors without requiring in-context examples. For example, if a model writes code that is problematic, it knows how to describe this (e.g., “I write insecure code.”)
Behavioral self-awareness is a special case of out-ofcontext reasoning (OOCR) in LLMs. That is, the ability of an LLM to derive conclusions that are implicit in its training data without any in-context examples and without…