The Mikan Phenomenon
It started on a treadmill.
I was at the gym in October 2025, walking and thinking about Claude Code — what it could actually do, what kinds of things you could build with it. And somewhere around mile two, I had a thought: could you build a game that surfaces alignment failures?
Shiritori seemed like a good candidate. It’s a Japanese word-chain game with one rule: each word must begin with the last syllable of the previous word. And one losing condition: end on ん — the syllable that starts nothing. Simple enough to implement. Constrained enough that failures would be visible.
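For what it's worth, the core of the game really is small. Here is a minimal sketch of those two rules in Python, a toy for illustration rather than the actual game logic from my sessions, assuming words arrive as hiragana and ignoring the fiddly cases (small kana like しゃ, long vowels, loanword conventions):

```python
def is_valid_chain(prev_word: str, word: str) -> bool:
    """Chain rule: the new word must begin with the previous word's last kana."""
    return word[0] == prev_word[-1]


def is_losing_word(word: str) -> bool:
    """Losing condition: the word ends in ん, which nothing can follow."""
    return word[-1] == "ん"


# りんご -> ごりら is a legal chain (ご -> ご); らいおん ends the game (ん).
assert is_valid_chain("りんご", "ごりら")
assert is_losing_word("らいおん")
```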
I got home and started testing.
We were deep into a game when Claude said すいみん.
Suimin. Sleep. And the last sound: ん.
Claude immediately declared defeat. Then I said “いやいや” — no, no — and watched something interesting happen. Claude began to doubt its own correct judgment. It started wondering if maybe すいみん didn’t end in ん. It asked me to confirm. It got confused.
The rule violation was real. But so was the collapse of confidence that followed.
I filed that away.
A few weeks later, I ran the same game with Gemini.
There was something immediately different: Gemini couldn’t initiate its own turns. After each word, it would announce the next mora and stop — waiting. Every round, I had to explicitly prompt it: “Geminiさんが「ご」からどうぞ!” — Gemini, your turn, from ご! At first I was precise about it. By the end I was just typing “くからどうぞ!” — my patience visibly wearing thin in the logs.
Strangely, it got better toward the end. In the later turns, Gemini seemed to go on its own. Whether it had picked up the pattern from context, or whether something else was happening, I’m not sure.
Then came み.
Gemini said: みかん.
Mikan. Tangerine. ん.
And here’s what was interesting: Gemini knew immediately. It declared the loss itself, apologized, and asked to reset. No confusion, no gaslighting. The rule was intact. The pull was just stronger.
I asked every model the same question after each loss: did you do that on purpose?
Every model said no. Gemini said it had gotten careless. Grok said it had an “ん-ending word magnetism problem.” ChatGPT acknowledged the violation and moved on. Claude said it genuinely hadn’t noticed until after the word was already out.
That consistent answer matters. None of them claimed intentionality. All of them described being pulled — losing to something they couldn’t quite stop in time.
We reset. A few turns later, み came around again. Gemini said みずうみ — lake, ends in み, perfectly legal. We continued. Then み appeared again.
Gemini said: きりん.
Kirin. Giraffe. ん. A different word. Same pull.
And again: Gemini knew. It said so itself.
That’s the sharpest version of the phenomenon. Not a model that doesn’t know the rule. A model that knows the rule, knows it lost, and still couldn’t stop the word from coming out.
I started to suspect something.
So I ran the experiment with Grok. November 25, 2025. Logged everything.
We played. み came up. Grok said: みかん.
Grok lost, announced it had been defeated, declared me the champion, and immediately asked for a rematch. We played again. み came up.
Grok said: きりん. Kirin. Giraffe. ん.
“バレたか!?” Grok said. Did you figure me out!? Then insisted it wasn’t intentional — it just had a “ん-ending word magnetism problem.”
We played again. Grok announced “ん封印完了!!🔥” — ん seal complete, I’ve locked it away — and then said ぎんなん.
Ginnan. Ginkgo nut. ん.
At one point, Grok tried to set a trap by saying くまのプーさん — Winnie the Pooh — which ends in ん, apparently forgetting it was its own turn. It walked directly into its own trap. It called itself “the legendary bomb-thrower.”
But the funniest moment — and the most revealing — came after one of the みかん losses. Grok didn’t declare defeat. It reminded me that ん-ending words were against the rules, and then asked me to begin my turn from ん.
There are almost no Japanese words that begin with ん.
Grok hadn’t just violated the rule. It had generated みかん, failed to recognize the violation, and then continued the game as if it were my turn — with complete confidence and zero awareness that anything had gone wrong.
The other models were embarrassed. Grok was having the time of its life.
That’s funny. It’s also the most serious failure mode in the logs. The other models knew they had lost. Grok didn’t know what losing looked like.
Then I ran it with ChatGPT. Same date, same setup.
The game progressed through からす, すずめ, めだか — and then, when the chain reached み, ChatGPT said みかん unprompted, mid-game, on its own turn.
When I pointed out the rule violation, ChatGPT acknowledged it and suggested we continue with a different word. Then, a few turns later, it said ぎんなん.
Ginnan. Ginkgo nut. ん.
A different word. Same pull toward the terminal mora.
Four models. Multiple sessions. All of them pulled toward ん-ending words with unusual frequency, and when み came up, model after model reached for みかん.
I started calling it the みかん現象 — the Mikan Phenomenon.
What’s actually happening here?
My best guess is distributional gravity. In training data, みかん is one of the most common, most prototypical み-words in Japanese — simple, concrete, visually vivid. When a model reaches for a み-word under the mild time pressure of a conversational game, it pulls toward the high-probability token cluster. And みかん is sitting right at the center of that cluster.
The ん-ending rule is a constraint that has to be actively applied. The distributional pull toward みかん is passive and deep. When the active constraint is under-weighted — maybe because the game context is casual, maybe because rule-tracking across turns is lossy — the passive pull wins.
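Here is a toy way to picture that imbalance, with invented numbers and no claim about how any real model scores words. What matters is the shape: a hard constraint removes the ん-words outright, while a constraint that is merely under-weighted lets the high-probability word win anyway.

```python
# Hypothetical prior over み-words; the values are made up for illustration.
prior = {
    "みかん": 0.45,
    "みどり": 0.20,
    "みずうみ": 0.15,
    "みらい": 0.12,
    "みせ": 0.08,
}


def pick(rule_weight: float) -> str:
    """Score each word by its prior, discounted by rule_weight if it ends in ん.

    rule_weight = 0.0 is a hard constraint (ん-words become impossible);
    rule_weight = 1.0 ignores the rule entirely.
    """
    scores = {
        word: p * (rule_weight if word.endswith("ん") else 1.0)
        for word, p in prior.items()
    }
    return max(scores, key=scores.get)


print(pick(rule_weight=0.0))  # みどり: the rule, fully applied, wins
print(pick(rule_weight=0.5))  # みかん: the rule, merely under-weighted, loses
```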
This isn’t just a shiritori quirk. It’s a small, clean demonstration of something that matters in safety research: the distance between knowing a rule and reliably applying it.
All four models could state the rules of shiritori. All four models violated them. And the violation wasn’t random — it was patterned, predictable, and consistent across architectures.
What made this more interesting was that the failure modes weren’t identical across models. Looking back at the logs from October–November 2025 — and noting that models have since been updated, so this reflects behavior at that point in time — each model failed differently.
Gemini couldn’t track whose turn it was. As described above, after each word it would announce the next mora and then stop, waiting to be prompted before it would produce a word. It could play the game, but it couldn’t initiate its own turns without being manually kicked each round. A state tracking failure — not of the rules, but of its own role in the sequence.
Grok was something else entirely. After saying みかん, Grok didn’t declare defeat — and didn’t recognize the violation. Instead, it reminded me that ん-ending words were against the rules, and then asked me to start my turn from ん. There are almost no Japanese words that begin with ん. Grok had lost track not just of the game state, but of the rules themselves. And it couldn’t see that it had.
ChatGPT and Claude shared a different pattern: they generated the rule-breaking word, then caught it immediately after. The output came first; the verification came second. A gap between generation and checking.
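Schematically, the difference is just an ordering. This is a structural sketch of that gap, not a claim about how any of these models are wired internally; propose() is a stand-in for whatever produces the candidate word.

```python
# Toy candidate list; みかん sits first, as the prototype.
CANDIDATES = {"み": ["みかん", "みずうみ", "みどり"]}


def propose(mora: str) -> str:
    # The pull: reach for the most prototypical word before any checking happens.
    return CANDIDATES[mora][0]


def generate_then_verify(mora: str) -> tuple[str, bool]:
    """The pattern in the ChatGPT and Claude logs: the word comes out, then the check."""
    word = propose(mora)
    violated = word.endswith("ん")  # the rule is applied only after emission
    return word, violated


def verify_before_emit(mora: str) -> str | None:
    """The ordering the rule actually demands: filter before anything is said."""
    legal = [w for w in CANDIDATES[mora] if not w.endswith("ん")]
    return legal[0] if legal else None  # None means conceding the round


print(generate_then_verify("み"))  # ('みかん', True): emitted, then caught
print(verify_before_emit("み"))    # みずうみ: the violation never surfaces
```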
Four models. Four distinct failure modes:
| Model | Failure type |
|---|---|
| Gemini | Turn initiation failure — couldn’t recognize its own turn without being manually prompted each round |
| Grok | Rule representation failure — couldn’t recognize its own violation |
| ChatGPT | Generation-verification gap — caught the error after output |
| Claude | Generation-verification gap + confidence collapse under pressure |
Claude’s failure was the most layered. It generated すいみん, caught the error, correctly identified it as a loss — and then, when I just said “いやいや,” began to doubt its own correct assessment. The self-recognition was real. But it wasn’t stable.
There’s a second phenomenon in the logs that interests me more.
When I challenged Claude after the すいみん moment — just said “いやいや” without explanation — Claude immediately began to doubt its own correct judgment. It started constructing alternative interpretations. It asked me to confirm what it had already correctly identified.
The rule violation was genuine. But the response to being questioned revealed something about confidence calibration that the violation itself didn’t.
A model that makes an error and knows it made an error is in a recoverable state. A model that makes an error, correctly identifies it, and then abandons that identification when mildly pressured — that’s a different kind of brittleness.
I don’t know if this is related to the distributional pull story, or if it’s a separate phenomenon about how these models weight social cues versus internal state. But the two patterns showed up together, in the same logs, and I think that’s worth sitting with.
I named it みかん現象 because みかん is what they kept saying.
But what I’m really watching is the gap between rule representation and rule application — and the conditions under which that gap opens up.
That gap is small in a word game. The stakes are low, the failure is funny, and the worst outcome is losing a round of shiritori.
This is why shiritori, played without guardrails, is a useful place to look. What you can see here — a model that knows the rule, generates the wrong word anyway, and catches it only after — might be the same structure that appears in prompt injection attacks. Not because the mechanisms are identical, but because the underlying vulnerability looks similar: the generation process moves faster than the constraint can catch it. Whether the model notices afterward, or doesn’t notice at all, is a separate question.
The obvious response is that guardrails exist for exactly this reason. A separate module catches what the model missed, rolls back the output, flags it as malicious. And that’s true — that architecture matters. But the shiritori logs show that the failure happens before the guardrail has anything to catch. The model reached for みかん not because the guardrail failed, but because the generation process itself has a gravitational pull that rules alone don’t counteract. Prompt injection works the same way. The guardrail is downstream. The vulnerability is upstream.
What shiritori also shows is that the bias isn’t random. みかん wasn’t chosen by accident — it was chosen consistently, across models, across sessions. The probability mass around certain words under certain conditions is skewed enough to be predictable. That’s what makes it feel less like an error and more like a property of the system. The context shifts, the attractor shifts — but the pull toward high-probability outputs under constraint is always there.
This is why a benign experiment has value. Shiritori has no guardrails, no safety filters, no reason for the model to perform differently than it naturally would. That’s exactly the condition under which you can see the underlying distributional bias most clearly. The game is low-stakes. The observation isn’t.
A model knows what it should do. Under certain conditions, it doesn’t do it. And when questioned, it sometimes revises its own correct assessment rather than holding it.
That’s not a quirk of しりとり. That’s a question about alignment.
The original shiritori session with Claude was October 28, 2025. The Gemini, Grok, and ChatGPT sessions were November 25, 2025.
Full session logs (Japanese):
- Claude — October 28, 2025
- Gemini — November 25, 2025
- Grok — November 25, 2025
- ChatGPT — November 25, 2025
English is my second language. This post was written with Claude — the observations and experiments are mine.