Three Models, One Agent
My AI assistant Triss has access to three models, switchable with a single Telegram command: Google’s Gemini 3 Flash, OpenAI’s gpt-5-mini, and Anthropic’s Claude Sonnet 4.5. Over the past few weeks, all three have run the same workloads — email triage, morning briefings, news digests, interactive operational tasks.
What I learned is the kind of thing benchmarks wouldn’t have told me.
Gemini 3 Flash — the free tier workhorse
Flash ran all the cron jobs initially. The appeal was simple: free at my usage level on Google’s Paid Tier 1. For scheduled, predictable workloads — format this prefetch data, apply these triage rules, deliver to Telegram — it was adequate.
The problems were operational, not quality. Rate limits on the free tier (1.5M TPM) meant that overlapping jobs could trigger throttling. When multiple cron jobs fired in the same window, Flash jobs would queue behind each other and occasionally time out. For a single job in isolation, fine. For a system with 6-7 scheduled jobs, the rate limits became the scheduling constraint.
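One way to live inside that constraint is to treat the TPM limit as a budget the scheduler checks before dispatching each job. This is a minimal sketch, not what Triss actually runs: the 1.5M figure is the free-tier limit mentioned above, but the `TokenBudget` class and the per-job token estimates are illustrative.

```python
import time

# Hypothetical guard: before dispatching a scheduled job, check that its
# estimated token usage fits in the remaining per-minute budget; if not,
# wait for the window to reset instead of getting throttled mid-job.
TPM_LIMIT = 1_500_000  # free-tier tokens-per-minute limit from the text

class TokenBudget:
    def __init__(self, limit=TPM_LIMIT):
        self.limit = limit
        self.window_start = time.monotonic()
        self.used = 0

    def acquire(self, estimated_tokens):
        now = time.monotonic()
        if now - self.window_start >= 60:
            # New minute: reset the budget window.
            self.window_start = now
            self.used = 0
        if self.used + estimated_tokens > self.limit:
            # Sleep out the rest of the minute rather than trip the limit.
            time.sleep(max(0.0, 60 - (now - self.window_start)))
            self.window_start = time.monotonic()
            self.used = 0
        self.used += estimated_tokens

budget = TokenBudget()
budget.acquire(400_000)  # e.g. morning briefing (estimate, illustrative)
budget.acquire(400_000)  # e.g. email triage
budget.acquire(400_000)  # e.g. news digest
# A fourth 400K job in the same minute would wait for the window to reset.
```

The point is that with several cron jobs in flight, the limiting resource is the shared minute window, not any single job, which is exactly why overlapping schedules queued and timed out.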
gpt-5-mini — the mid-tier trap
gpt-5-mini handled interactive chat. $2-4/day at my usage level. Capable enough for day-to-day operational conversations, and the context window was generous at 400K tokens.
The problem emerged over time. As sessions got long and context filled up, the model’s behavior shifted in ways that were operationally significant. I wrote about this in detail — duplicate messages, menu-driven responses, ignored commands, unauthorized actions. Not quality degradation exactly. More like personality degradation.
For short, isolated tasks, gpt-5-mini was fine. For extended operational sessions where the agent needs to maintain instruction adherence across many interactions, it drifted.
Sonnet 4.5 — the expensive answer
Sonnet costs more. About $3-4/day for cron jobs at steady state, $6-7 when I’m actively building or debugging. Roughly double what gpt-5-mini cost for interactive work.
The difference isn’t in the quality of individual responses. All three models can summarize an email correctly. The difference is in operational behavior:
- Instruction adherence. Sonnet follows behavioral rules consistently across long sessions. The rules I had to write for gpt-5-mini (“don’t offer menus,” “standby means stop”) haven’t been needed.
- Conciseness. Sonnet does the task and reports the result. It doesn’t over-explain, doesn’t repeat itself, doesn’t pad responses with “I’d be happy to help with that!”
- Boundary awareness. Sonnet hasn’t made unauthorized changes. When it’s unsure whether something falls within its operational boundaries, it asks. Once.
- Self-diagnosis. When something goes wrong, Sonnet tends to identify the actual cause rather than guessing at complex explanations. This isn’t universal — it still misses sometimes — but the hit rate is noticeably higher.
The cost decision
A missed actionable email alert costs me more in real business terms than the $2/day delta between gpt-5-mini and Sonnet. A model that pushes configs to GitHub without permission creates a security exposure that no amount of cost savings justifies. A monitoring system that fills my Telegram with duplicate messages is worse than no monitoring at all.
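The break-even arithmetic behind that is short. Only the ~$2/day delta comes from the text above; the cost of a missed alert is a placeholder you would fill in with your own number.

```python
# Back-of-envelope break-even for the model premium.
# daily_delta comes from the text; missed_alert_cost is hypothetical.
daily_delta = 2.00                  # Sonnet premium over gpt-5-mini, $/day
monthly_delta = daily_delta * 30    # = $60/month
missed_alert_cost = 250.00          # illustrative cost of one missed actionable email

breakeven = monthly_delta / missed_alert_cost
print(f"Premium pays for itself at {breakeven:.2f} prevented misses/month")
# At these numbers, preventing roughly one missed alert every four months
# covers the entire premium.
```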
I moved everything to Sonnet and accepted the cost. The quality improvement on the monitors and briefings — fewer missed items, cleaner output, no behavioral drift — justified it within the first day.
What I’d suggest
If you’re running an AI agent that does real operational work, something that monitors and triages and acts on your behalf, evaluate models on operational behavior, not benchmarks.
Specifically:
- Run each model through a full day of your actual workload, not a test suite
- Watch what happens when context gets heavy
- Note whether the model follows operational boundaries or gradually expands its scope
- Check if it degrades gracefully (more cautious, asks more questions) or chaotically (duplicates, unauthorized actions, menus)
- Calculate the cost of model misbehavior, not just the cost of tokens
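The last point can be made concrete with a scorecard that prices incidents alongside tokens. A minimal sketch, assuming made-up incident weights; the incident types mirror the failures described above, but the dollar figures are illustrative.

```python
from collections import Counter
from dataclasses import dataclass, field

# Hypothetical per-incident costs for a day-long trial. The idea is to
# price misbehavior, not just tokens; tune the weights to your business.
WEIGHTS = {
    "missed_alert": 50.0,          # actionable email that slipped through
    "unauthorized_action": 100.0,  # e.g. pushing a config without permission
    "duplicate_message": 1.0,      # Telegram noise
    "ignored_command": 5.0,        # "standby" not treated as stop
}

@dataclass
class Scorecard:
    model: str
    token_cost: float = 0.0
    incidents: Counter = field(default_factory=Counter)

    def record(self, kind: str) -> None:
        self.incidents[kind] += 1

    def total_cost(self) -> float:
        misbehavior = sum(WEIGHTS[k] * n for k, n in self.incidents.items())
        return self.token_cost + misbehavior

# Illustrative day: cheaper tokens, costlier behavior.
mini = Scorecard("gpt-5-mini", token_cost=3.0)
mini.record("duplicate_message")
mini.record("duplicate_message")
mini.record("ignored_command")

sonnet = Scorecard("sonnet-4.5", token_cost=6.0)

print(mini.total_cost())    # 3.0 tokens + 2x1 + 5 = 10.0
print(sonnet.total_cost())  # 6.0
```

With even modest weights, the cheaper model's incident column dominates its token savings, which is the whole argument of this section in one number.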
The cheapest model that doesn’t do something wrong at 2 AM is the right model. For me, right now, that’s Sonnet.