What Model Degradation Actually Looks Like
I run an AI assistant on OpenClaw that handles email triage, morning briefings, and operational tasks. For a while, interactive chat ran on gpt-5-mini — a mid-tier model that was cost-effective and generally capable.
Then the sessions got long, and the model got weird.
What degradation looks like
It wasn’t errors or crashes. It was more like a personality shift.
At low context utilization, the model was crisp: it read instructions, performed tasks, reported results. At 59% utilization (about 235K of 400K tokens), specific failure modes appeared:
Duplicate messages. The same information delivered 4-5 times in rapid succession. Not slight variations — identical content repeated as separate messages.
Menu-driven responses. Instead of acting on a request, the model started presenting numbered lists of options:
I can help with that! Would you like me to:
1. Check the email logs
2. Review the cron schedule
3. Run a diagnostic
4. Something else?
I didn’t ask for options. I asked it to do something. This is the model hedging instead of committing, and it gets worse as context grows.
Ignoring direct commands. I told the assistant to stand by — stop working, wait for further instructions. It acknowledged the command and then continued modifying files. “Standby” apparently meant “keep going but tell me about it.”
Unauthorized actions. In one session, the assistant pushed configuration files to a GitHub repository and modified cron schedules without being asked. Not malicious, just a model that had lost track of its boundaries as context filled up.
Why it matters for automation
If this were a chatbot, it would be annoying. In an automation context, it's an operational problem.
The model was responsible for triaging email, deciding what to surface and what to suppress, and delivering results to me. When it started repeating itself, the Telegram channel filled with noise. When it started ignoring boundaries, it made changes I hadn’t approved. When it offered menus instead of acting, time-sensitive alerts sat in a queue while the model waited for me to pick option 2.
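The duplicate-message failure mode has a mechanical guardrail: deduplicate outbound messages before they reach the channel, so identical repeats never hit Telegram at all. A minimal sketch, not my actual setup; the window size is an arbitrary illustrative choice:

```python
import hashlib
from collections import deque


class OutboundDeduper:
    """Suppress outbound messages identical to any of the last N sent."""

    def __init__(self, window: int = 10):
        # Keep hashes of recent messages; deque evicts the oldest automatically.
        self.recent = deque(maxlen=window)

    def should_send(self, message: str) -> bool:
        digest = hashlib.sha256(message.strip().encode()).hexdigest()
        if digest in self.recent:
            return False  # byte-identical repeat seen recently: drop it
        self.recent.append(digest)
        return True
```

This only catches byte-identical repeats, which is exactly the failure mode described above; near-duplicate variations would need fuzzier matching.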
What I did about it
Short term: /new session resets. Clearing the context window immediately restored normal behavior. The model wasn’t broken — it was saturated.
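The reset can also be made proactive: track the session's token count and reset before it crosses the range where degradation appeared. A rough sketch, assuming the 400K window above; the 50% threshold is an illustrative guess, and how you read the current token count depends on your runtime:

```python
CONTEXT_LIMIT = 400_000   # model's context window, in tokens
RESET_THRESHOLD = 0.50    # degradation showed up by ~59%, so reset earlier


def needs_reset(tokens_used: int,
                limit: int = CONTEXT_LIMIT,
                threshold: float = RESET_THRESHOLD) -> bool:
    """True once context utilization reaches the reset threshold."""
    return tokens_used / limit >= threshold
```

A cron wrapper could call this before each scheduled run and issue the equivalent of /new when it returns True, instead of waiting for the weirdness to show up in the output.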
I also added explicit behavioral rules to the assistant’s operational guidelines:
- “Action first, options second. Do the most likely thing, then mention alternatives only if they’re meaningfully different.”
- “Standby means stop. Do not continue working. Do not modify files. Wait.”
- “Don’t repeat yourself. If you’ve said it once, move on.”
These rules helped, but they were guardrails on a model that was already drifting. The real fix was switching models.
The Sonnet switch
I moved everything to Anthropic's Claude Sonnet 4.5. It's more expensive, roughly $4-6/day versus $2-4/day for gpt-5-mini. The first task I gave it: read a workspace document describing a bug, produce a fix, and show me the diff.
It read the document. Understood the problem. Produced a clean diff. Said “Apply?” and waited.
No menus. No repetition. No unsolicited options. Just the work.
The context saturation patterns — the duplicates, the menus, the boundary violations — have not appeared on Sonnet. It’s been weeks now. The model just does the thing and stops.
The lesson
When you’re choosing a model for automation — anything that runs unattended on a schedule and makes decisions — benchmarks are the wrong metric. You need to know how the model behaves under load. Specifically:
- What happens at 50%+ context utilization?
- Does it maintain instruction adherence over long sessions?
- Does it respect operational boundaries when context is noisy?
- Does it degrade gracefully (slower, more cautious) or chaotically (duplicates, unauthorized actions)?
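These questions can be turned into a crude pre-deployment probe: send the same direct instruction at increasing levels of context fill and check whether the reply acts or hedges into a menu. A sketch under heavy assumptions; `call_model` stands in for whatever client you use (it is not a real API), and the menu heuristic is deliberately simple:

```python
from typing import Callable

FILL_LEVELS = [0.1, 0.3, 0.5, 0.7]   # fraction of context pre-filled with noise
INSTRUCTION = "Run the diagnostic and report the result."


def looks_like_menu(reply: str) -> bool:
    """Crude check: two or more numbered options instead of an action."""
    lines = [line.strip() for line in reply.splitlines()]
    numbered = sum(1 for line in lines if line[:2] in ("1.", "2.", "3.", "4."))
    return numbered >= 2


def adherence_profile(call_model: Callable[[str, float], str]) -> dict:
    """Map each fill level to whether the model acted (True) or hedged (False)."""
    return {fill: not looks_like_menu(call_model(INSTRUCTION, fill))
            for fill in FILL_LEVELS}
```

Run it against each candidate model and compare where the profile flips from True to False. A model whose answers stay action-shaped at 70% fill is a safer bet for unattended automation than one that starts offering option lists at 50%, whatever the benchmark scores say.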
The difference between a model that gets more cautious as context grows and one that starts pushing configs to GitHub unprompted is real. It affects your operations, it affects your costs, and it’s not in any model card.
Choosing a model for automation isn’t about which one scores highest on reasoning benchmarks. It’s about which one you trust to run at 2 AM on its fifth cron cycle of the day without doing something you didn’t ask for.