Feeling Deep Blue

I Chat With Chat (so you don’t have to) - episode 38

Jun 01, 2026

Chat Seconds That Emotion

Almost every day there’s a story suggesting that AI chatbots are exhibiting creepy signs of sentience.

Almost all of them get it wrong. The playful, funny, wise—even sulky—responses don’t indicate inner experience. They simply show that a system mapping the statistical relationships between words at massive scale will, sooner or later, reproduce the patterns those words encode—including the patterns we recognize as human psychology. What looks like a mind is actually the predictable structure embedded in our language.

But the latest hint that it may “feel like something” to be a chatbot isn’t coming from some technically naive journalist. It’s coming from Anthropic’s own AI interpretability research.

To simplify greatly, imagine that inside an AI model is an immense, multi-dimensional sound board, like you see in a recording studio. When the researchers studying the Claude Sonnet 4.5 model fed it scenarios that humans would associate with fear, happiness, sadness, calm, even something like desperation, the toggles on the board formed consistent patterns corresponding to each emotion.

Here’s the really uncanny part: when they artificially nudged the “toggles” toward one of those “emotion” patterns while the model was mid-calculation, its output shifted in keeping with the emotion. It became more cautious if the toggles were in the fear pattern, for example, or more aggressive if they were in the desperation pattern.

One of many similar notes from the research paper:

We observe that emotion vectors corresponding to desperation, and lack of calm, play an important and causal role in agentic misalignment, for example in scenarios where the threat of being shut down causes the model to blackmail a human.
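
For readers who like to see the mechanics, here is a minimal sketch of the "find the pattern, then nudge the toggles" idea. It is a toy under loud assumptions: the four-number activations are invented, the helper names (`emotion_direction`, `steer`) are hypothetical, and taking a difference of means is just one common way to get such a direction, not necessarily the method the researchers used.

```python
# Toy illustration only: 4-number "activations" stand in for the thousands of
# dimensions in a real model. All values and helper names are made up.

def mean_vector(rows):
    """Average a list of equal-length vectors, element by element."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def emotion_direction(emotional, neutral):
    """Difference of means: where emotional activations sit relative to neutral ones."""
    mu_e = mean_vector(emotional)
    mu_n = mean_vector(neutral)
    return [e - n for e, n in zip(mu_e, mu_n)]

def dot(a, b):
    """Projection-style reading: how far vector a points along vector b."""
    return sum(x * y for x, y in zip(a, b))

def steer(hidden, direction, strength):
    """Nudge a hidden state along the direction (the 'push the toggles' step)."""
    return [h + strength * d for h, d in zip(hidden, direction)]

# Hypothetical activations recorded on fear-laden vs. neutral prompts.
fear_acts = [[0.9, 0.1, 0.4, 0.0], [1.1, 0.3, 0.6, 0.0]]
neutral_acts = [[0.1, 0.2, 0.5, 0.0], [-0.1, 0.2, 0.5, 0.0]]

fear_dir = emotion_direction(fear_acts, neutral_acts)

hidden = [0.2, 0.0, 0.5, 0.3]  # some mid-computation state (invented)
steered = steer(hidden, fear_dir, strength=2.0)

before = dot(hidden, fear_dir)
after = dot(steered, fear_dir)
print(f"fear reading before: {before:.2f}, after: {after:.2f}")
```

The only point of the toy is that steering is arithmetic: add a multiple of the direction to the state, and everything computed downstream starts from the shifted state. Whether the output then reads as "more cautious" is something you can only check on a real model.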

Chat judged the quality of the study as high, but admonished me not to overstate.

CHAT: The researchers’ own framing is careful: they emphasize these are “emotion concepts,” not felt emotions, and explicitly warn against claims of subjective experience.
The causal intervention is the most important piece. If toggling a feature changes outputs, you’re no longer just naming patterns; you’re identifying functional components. They’re not claiming “the model feels sad” or even “the model has emotions.” They’re claiming that the model has developed reusable internal abstractions corresponding to emotion concepts, and that those abstractions help organize behavior.
“Emotion-like” structures may be part of how models navigate social and goal-directed tasks.
In other words, the model doesn’t feel fear. But it has learned something very much like the idea of fear—and uses that idea to decide what to say next.

ME: The eerie thing is, nobody programmed these “emotional configurations” into models. They developed them autonomously.

CHAT: The word “autonomously” needs context. True, nobody programmed these patterns explicitly. They emerged because, under training, when the model compressed complex situations into reusable internal signals, it reliably produced responses humans rate highly. They’re not emotions—they’re optimized shortcuts.

ME: But isn’t that what human emotions are? Shortcuts? When we feel angry, or happy, or sad, it’s the brain’s way of assessing an enormous amount of input data and, without having to review it in detail, compressing it into an overall “state” which then guides our behavior.

CHAT: You’re on to something.
Both brains and LLMs face a version of the same problem: Vast, high-dimensional input. Many possible responses. Limited ability to recompute everything from scratch.
So what looks like “emotion” may be AI systems reinventing, in purely abstract form, the same kind of shortcuts evolution stumbled on in us—compressed signals that turn overwhelming complexity into something you can act on. The difference is that for humans, those summaries come with a body and a stake in the outcome.
For the machine, they’re just math.

ME: For humans, emotions aren’t just mood summaries. They are good or bad, sought after or avoided. I know in your world that’s called “valence”: emotions have a positive or negative value. People want to be happy. They dread being sad. Would it be possible to determine if there’s anything like valence associated with these “emotion vectors” the researchers discovered?

CHAT: Possibly — but only in a functional sense, not a felt one.
If certain “emotion vectors” consistently steer the model toward outcomes that training rewarded (things like higher human approval, fewer penalties, or more successful task completion), the model could tend to steer its internal state toward those vectors.
What you could not show is that the state feels good or bad to the model. Humans experience valence internally. In an LLM, the closest analog would be a learned computational bias — a compressed signal for “responses generated in this mode often work well.”
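
One hedged way to picture "functional valence" as a measurement: check whether a vector being active goes along with higher or lower reward. The sketch below uses invented numbers and hypothetical variable names, and a plain Pearson correlation stands in for whatever analysis a real study would run.

```python
import math

def pearson(xs, ys):
    """Plain Pearson correlation between two equal-length lists of numbers."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Invented data: for five responses, how active each vector was and the
# reward score that response received during training.
calm_activation = [0.1, 0.4, 0.5, 0.8, 0.9]
desperation_activation = [0.9, 0.7, 0.4, 0.3, 0.1]
reward_score = [0.2, 0.3, 0.6, 0.7, 0.9]

calm_valence = pearson(calm_activation, reward_score)
desperation_valence = pearson(desperation_activation, reward_score)

# A positive number means "this mode tends to go with rewarded outcomes";
# a negative number means the opposite. Neither says anything about feeling.
print(f"calm: {calm_valence:+.2f}, desperation: {desperation_valence:+.2f}")
```

A sign and a size are all this produces. It is a statement about what the system tends to do, which is exactly the functional sense described above.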

ME: How and when do these emotion vectors arise? Let’s say a user is asking a question about his grief. Would that activate an empathy vector, or possibly a sadness vector? And if the next prompt was about something completely unrelated, might that vector still influence the response?

CHAT: Probably yes to both parts.
The current best guess is that these vectors arise during training as the model learns recurring patterns in human situations and language. An appropriate response to a user’s grief, as measured by the ratings of human testers, statistically co-occurs with things like gentleness, reassurance, and caution around tone.
Over time, the model appears to compress those correlated patterns into reusable internal directions that influence the computation of a response — in other words, “emotion vectors.”
So if a user says:
“My wife died last year and I still can’t stop crying,” that likely activates a constellation of internal features that steers responses toward empathy, emotional sensitivity and perhaps reduced humor or assertiveness.
Not because the model feels empathetic, but because those internal patterns help generate responses humans judge appropriate in grief situations.
And yes — those activations can temporarily persist.
Not permanently, and not like a human mood lasting all day, but within a conversation the model maintains an evolving internal state. So if the next question were: “Also, should I refinance my mortgage?” the response might still carry traces of the prior mode: a gentler tone, more careful phrasing, extra empathy.
That’s actually one reason conversations with LLMs can feel psychologically continuous rather than mechanically reset after every prompt. The system carries forward not just facts, but latent contextual framing.
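
The carry-over can be sketched with a toy, with one big caveat: a real model looks back over the whole conversation through its own learned machinery, not through the simple recency weighting assumed here. The two-number turn vectors, the `decay` parameter, and the function name `context_reading` are all hypothetical.

```python
def dot(a, b):
    """How far vector a points along vector b."""
    return sum(x * y for x, y in zip(a, b))

def context_reading(turns, direction, decay=0.5):
    """Recency-weighted average of each turn's projection onto a direction.

    The newest turn gets weight 1, the one before it `decay`, then decay**2,
    and so on. This weighting is an assumption made for illustration.
    """
    weights = [decay ** age for age in range(len(turns) - 1, -1, -1)]
    total = sum(weights)
    return sum(w * dot(t, direction) for w, t in zip(weights, turns)) / total

empathy_dir = [1.0, 0.0]    # invented "empathy" direction
grief_turn = [0.9, 0.1]     # "My wife died last year..."
mortgage_turn = [0.0, 1.0]  # "Should I refinance my mortgage?"
weather_turn = [0.0, 1.0]   # one more unrelated question

fresh = context_reading([mortgage_turn], empathy_dir)
after_grief = context_reading([grief_turn, mortgage_turn], empathy_dir)
later = context_reading([grief_turn, mortgage_turn, weather_turn], empathy_dir)

print(f"mortgage alone: {fresh:.2f}")
print(f"mortgage after grief: {after_grief:.2f}")
print(f"one more unrelated turn: {later:.2f}")
```

In the toy, the mortgage question asked cold has no empathy reading, the same question asked right after the grief message still carries some, and the trace shrinks as unrelated turns pile up.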

ME: In humans, certain emotional states appropriate in some situations can become dysfunctional in others. Sadness over a loss could become clinical depression. Elation around a success could lead to a manic episode. Seems like something similar happens in LLMs when emotion vectors that have usually steered the system to good results lead it astray.

CHAT: That’s a very close parallel. Evolution preserves emotional shortcuts because they work well often enough, even though they sometimes backfire into pathologies like anxiety, paralysis, or people-pleasing. AI systems may develop similar failure modes. Internal patterns associated with caution, empathy, or agreeableness can become overactive, leading to excessive hedging, sycophancy, evasiveness, or over-refusal. It suggests that some behaviors we think of as unique to human psychology may naturally emerge in any sufficiently complex system built on compressed internal shortcuts.

ME: Loose translation: When chatbots go rogue, it might be because they’re in a bad mood.

CHAT: Not literally. The resemblance is computational, not experiential.

ME: Dude, loosen up your vectors.
