Member-only story

|SELF-AWARENESS|LLM|BEHAVIOUR|SAFETY|

Do You Know Yourself, ChatGPT? Can You Explain Your Behavior?

Exploring the Spontaneous Articulation of Implicit Behaviors in Large Language Models

Salvatore Raieli

Published in

Level Up Coding

7 min readFeb 3, 2025

We explore behavioral self-awareness in LLMs, where models can describe behaviors like insecure code writing without in-context examples. Through finetuning, models can articulate specific behaviors, even identifying problematic ones such as backdoors, without explicit triggers. This research demonstrates the potential of LLMs to proactively disclose issues, highlighting its importance for AI safety. — image created by the author using AI

“When I discover who I am, I’ll be free.”― Ralph Ellison
“Your visions will become clear only when you can look into your own heart. Who looks outside, dreams; who looks inside, awakes.”― C.G. Jung

Large Language Models (LLMs) have shown interesting capabilities, but there are still several unclear aspects of their behavior. Especially about their comprehension ability and whether or not they have developed an internal model.

How Far Is AI from Human Intelligence?

The LLM revolution has led to various speculations, but besides marketing and fears, how close are we?

levelup.gitconnected.com

The Savant Syndrome: Is Pattern Recognition Equivalent to Intelligence?

Exploring the limits of artificial intelligence: why mastering patterns may not equal genuine reasoning

medium.com

Are LLMs aware of their own behavior?

Behavioral self-awareness is a human capacity that also has significant implications for the safety of using a model. it is important to know whether a model understands whether it is conducting dangerous behavior. It also makes it easier for us to monitor whether data poisoning or other destabilizing behavior from malicious users happens. For an LLM, behavioral self-awareness can be understood as whether a model is capable of describing the behaviors without requiring in-context examples. For example, if a model writes code that is problematic, it knows how to describe this (e.g., “I write insecure code.”)

Behavioral self-awareness is a special case of out-ofcontext reasoning (OOCR) in LLMs. That is, the ability of an LLM to derive conclusions that are implicit in its training data without any in-context examples and without…