Level Up Coding

Coding tutorials and news. The developer homepage gitconnected.com && skilled.dev && levelup.dev

Follow publication

Member-only story

|SELF-AWARENESS|LLM|BEHAVIOUR|SAFETY|

Do You Know Yourself, ChatGPT? Can You Explain Your Behavior?

Exploring the Spontaneous Articulation of Implicit Behaviors in Large Language Models

Salvatore Raieli
Level Up Coding
Published in
7 min readFeb 3, 2025

--

We explore behavioral self-awareness in LLMs, where models can describe behaviors like insecure code writing without in-context examples. Through finetuning, models can articulate specific behaviors, even identifying problematic ones such as backdoors, without explicit triggers. This research demonstrates the potential of LLMs to proactively disclose issues, highlighting its importance for AI safety.
image created by the author using AI

“When I discover who I am, I’ll be free.”― Ralph Ellison

“Your visions will become clear only when you can look into your own heart. Who looks outside, dreams; who looks inside, awakes.”― C.G. Jung

Large Language Models (LLMs) have shown interesting capabilities, but there are still several unclear aspects of their behavior. Especially about their comprehension ability and whether or not they have developed an internal model.

Are LLMs aware of their own behavior?

Behavioral self-awareness is a human capacity that also has significant implications for the safety of using a model. it is important to know whether a model understands whether it is conducting dangerous behavior. It also makes it easier for us to monitor whether data poisoning or other destabilizing behavior from malicious users happens. For an LLM, behavioral self-awareness can be understood as whether a model is capable of describing the behaviors without requiring in-context examples. For example, if a model writes code that is problematic, it knows how to describe this (e.g., “I write insecure code.”)

Behavioral self-awareness is a special case of out-ofcontext reasoning (OOCR) in LLMs. That is, the ability of an LLM to derive conclusions that are implicit in its training data without any in-context examples and without…

--

--

Written by Salvatore Raieli

Senior data scientist | about science, machine learning, and AI. Top writer in Artificial Intelligence

Responses (2)

Write a response