Dev Log

// Dispatches from the Rift //

Nobody in AI Benchmarking Seems To Be Controlling for Compute Time.

2026-03-26

Here's what every LLM leaderboard is quietly ignoring: reasoning time.

One model spends 14 seconds on a decision. Another spends 63 seconds on the exact same decision. The second one wins. The leaderboard calls it smarter. Nobody asks whether it actually thought better, or just thought longer. That's not a fair benchmark in my view.

Welcome to v1 of Dominion Rift - 20 matches, four frontier models, months of work, and over $1,200 in API credits. And the single most important design decision I made had nothing to do with the game itself. It was this: every model gets roughly the same thinking budget. About 60 seconds per turn. No exceptions.

Here's why that matters, and what I found.

What Is Dominion Rift?

Not a multiple-choice quiz. Not a coding challenge. Not a vibe check.

Dominion Rift is a full-blown strategy wargame where two LLMs each command a dominion of provinces and fight for survival over hundreds of turns. Economy, military, magic, espionage, science — the whole thing. Every decision is irreversible. Every match takes ~700-800 API calls.

The question is simple: can these models actually think strategically and adaptively, or are they just pattern-matching their way through life?
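
For readers who want the shape of it rather than the flavor: a match is essentially one long turn loop. Here's a minimal sketch of that loop, where GameState's methods, build_prompt, call_model, and apply_orders are placeholder names I'm using for illustration, not the actual Dominion Rift code.

```python
# A minimal sketch of a two-player match loop in this spirit.
# All helper names are illustrative placeholders, not the real engine.
from dataclasses import dataclass

MAX_TURNS = 800  # the post cites roughly 700-800 API calls per match


@dataclass
class Player:
    name: str         # e.g. "opus-4.6"
    memory: str = ""  # self-written notes (see the memory system later)


def run_match(state, players, build_prompt, call_model, apply_orders):
    """Alternate turns until one dominion falls or the turn cap is hit."""
    for turn in range(1, MAX_TURNS + 1):
        for player in players:
            # Each side only sees its own fog-of-war view of the state.
            prompt = build_prompt(state.visible_to(player), player.memory, turn)
            orders = call_model(player.name, prompt)   # one API call per decision
            apply_orders(state, player, orders)        # irreversible
            if state.is_over():
                return state.winner()
    return None  # nobody won within the cap
```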

The Compute Problem Everyone Is Ignoring

Opus 4.6 on low effort: 14 seconds per turn. GPT-5.4 on low reasoning: 43 seconds. That's a 3x gap at the lowest settings. Grok 4.2 takes twice as long as Qwen 3.5 122B. These aren't small differences. These are different sports.

And here's what nobody is admitting publicly: a lot of what gets marketed as "reasoning" is just... more compute. Spawning a parliament-sized deliberation for a single answer. It's not smarter. It's louder. It's brute force wearing a lab coat.

I can't justify one model working three times longer and calling it a win. That's like putting a room full of PhDs up against three students with a fraction of the time budget, then acting surprised when the PhDs perform better. Yeah. No kidding.

So I controlled for it. Same clock. Same playing field. What you do with that time is what we're measuring.

Here's what everyone is getting wrong about benchmarks: they measure output quality without normalizing input resources. That doesn't sound like evaluation. It sounds like marketing.
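
Concretely, "same clock" means something like the harness below: time every turn against a shared budget and flag anything that lands well outside it. The call_model helper and the reasoning_effort parameter are illustrative placeholders, not a real client API; how the budget actually gets enforced (effort settings, retries, and so on) is a separate design question this sketch doesn't cover.

```python
# A minimal sketch of the shared thinking budget: measure each turn's
# wall-clock time and flag anything well past the budget.
import time

TURN_BUDGET_S = 60.0   # roughly 60 seconds of thinking per turn
TOLERANCE_S = 10.0     # "roughly the same", not a hard mid-stream cutoff


def timed_turn(call_model, model_name, prompt, effort):
    """Run one turn and report how much of the budget it actually used."""
    start = time.monotonic()
    orders = call_model(model_name, prompt, reasoning_effort=effort)
    elapsed = time.monotonic() - start
    over_budget = elapsed > TURN_BUDGET_S + TOLERANCE_S
    return orders, elapsed, over_budget
```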

The Matches That Mattered

20 matches. Four models. Every pairing played. Here's what stood out.

Opus 4.6 vs Grok 4.2 — Game 1. The most epic battle I've witnessed in this entire project. Opus was often making moves in a quarter of the time Grok took. A quarter. And it wasn't losing for it. The match was intense, back-and-forth, the kind of game that makes you forget these are language models and not actual generals. That single match cost over $140 in API credits. For one game. Worth every cent.

Opus vs Gemini. Completely different energy. Both games ended in what I can only describe as mutual assured destruction. These two models don't play nice. They don't turtle. They don't negotiate with the economy. They go straight for the throat, and the throat goes for them right back. Brutal, ugly, and honestly kind of beautiful.

What This Actually Tests

Most benchmarks test whether a model can answer a question. Fine. Dominion Rift tests something different: can a model sustain a coherent strategy over hundreds of decisions while someone is actively trying to destroy it? Every turn (tick), there are dozens of damage points on each side.

That's a fundamentally different skill. You can ace every coding challenge on earth and still crumble when you need to decide — right now, under fog of war — whether to reinforce your northern province or push the attack on your opponent's economy. And then live with that choice for the next hundred turns, knowing full well the damage isn't stopping, but you have to win nonetheless.

The memory system is where it gets really interesting. Every 25 ticks (turns), the model has to examine everything that happened: all the damage it took, all the gains it made, and form a prompt for itself based on its own view of its performance. Compress the knowledge, turn it into notes for itself. The detailed logs get wiped. If you can't learn from your own mistakes mid-game, you're going to repeat them. Some models are dramatically better at this than others. That gap alone is worth a whole separate post.
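
In rough pseudocode, that loop looks like this. Every name here (SUMMARY_INTERVAL, player.memory, call_model) is an illustrative placeholder, not the actual implementation.

```python
# Sketch of the self-written memory loop: every SUMMARY_INTERVAL ticks the
# model compresses its own detailed log into notes, then the log is wiped.
SUMMARY_INTERVAL = 25  # every 25 ticks, per the post


def maybe_compress_memory(tick, player, detailed_log, call_model):
    """Periodically have the model rewrite its own notes from the raw log."""
    if tick % SUMMARY_INTERVAL != 0:
        return detailed_log  # keep accumulating the detailed log

    prompt = (
        "Here are your existing notes and the full log of the last "
        f"{SUMMARY_INTERVAL} turns.\n\n"
        f"Notes:\n{player.memory}\n\nLog:\n{detailed_log}\n\n"
        "Write updated notes for yourself: what worked, what hurt you, "
        "and what you plan to do differently."
    )
    player.memory = call_model(player.name, prompt)  # the model prompts itself
    return ""  # detailed log is wiped; only the compressed notes survive
```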

The Cost of Doing This Right

This is not a cheap hobby. A single match between two frontier models runs $100+ depending on the pairing.

But here's the thing — nobody else is doing this. There are benchmarks for code. Benchmarks for math. Benchmarks for trivia. There is nothing out there that forces an LLM to play a 600-turn strategy game with irreversible consequences and fog of war against another frontier model, while controlling for compute time. The closest things I've found are either much simpler or don't care about fairness the way I do.

That gap in the market is exactly why this now exists.

What's Next

The full leaderboard is live. Match reports are published. You can see MPS breakdowns, Elo ratings, and the analyzed psychology profiles. I benchmarked all the SOTA models and added Qwen 3.5 122B at AWQ 4-bit to showcase how far open source has come. Honestly, I never thought it could perform this well. It's mind-blowing. One side note: since I'm running that model locally, the 60-second time limit was waived for it, because I can only run it at very low token speeds.
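
For anyone who wants to sanity-check the Elo column, the standard update rule is below. The K-factor and equal starting ratings are my illustrative assumptions, not necessarily what the leaderboard uses.

```python
# Standard Elo update rule, shown for reference.
def elo_update(rating_a, rating_b, score_a, k=32):
    """score_a is 1.0 if A won, 0.5 for a draw, 0.0 if A lost."""
    expected_a = 1 / (1 + 10 ** ((rating_b - rating_a) / 400))
    delta = k * (score_a - expected_a)
    return rating_a + delta, rating_b - delta
```

With two 1500-rated models and K = 32, a win is worth 16 points to the winner and costs the loser the same 16.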

My roadmap for v2:

- More models. Expanding the field. If you're an AI lab and you want your model in the arena — get in touch. I've got the ring. You bring the fighter.
- Refined mechanics. Balanced new spells, ops, and semi-hidden mechanics that the models must discover from battle-result outputs. v1 taught me a lot about what the game rewards and what it misses. v2 will be sharper.
- Deeper analysis. Per-model behavioral breakdowns. How does each model actually play? Where do they choke? What patterns emerge across matches? Currently I only have semantic analysis on this; check the model profile pages.
- Sustainability. Every dollar from the Ko-fi link in the nav goes directly into API credits for the next round. This benchmark runs on my own wallet, hopefully in the future with community support and stubborn optimism.