What is this?
An experimental exercise in structured war analysis. Every few days, The Human and a language model sit down, gather the latest developments, score previous predictions against what actually happened, and update the analysis. The results are published here — including the mistakes.
The project exists because most war commentary is post-hoc narrative. Pundits explain what happened, never what they expected. We wanted to see if tracking predictions with explicit confidence levels, and scoring them honestly, produces anything useful. The answer so far is: kind of. We're running at a 59% hit rate, which is better than a coin flip but worse than we'd like.
The Eight Lenses
Every event is analyzed through eight doctrinal lenses. The idea is that a military event has economic, political, and escalation implications that single-domain analysis misses.
The Prediction System
Each prediction has five parts: an ID, a clear falsifiable statement, a specific deadline, a confidence level (0.0 to 1.0), and the reasoning behind it. When the deadline arrives, the prediction is scored CONFIRMED, REFUTED, PARTIALLY_CONFIRMED, or EXPIRED.
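As a rough sketch, a prediction record might look like the following. The field names, types, and the example ID format are our illustration, not the project's actual storage format:

```python
from dataclasses import dataclass
from datetime import date
from enum import Enum
from typing import Optional

class Outcome(Enum):
    CONFIRMED = "confirmed"
    REFUTED = "refuted"
    PARTIALLY_CONFIRMED = "partially_confirmed"
    EXPIRED = "expired"  # deadline passed, outcome unknowable

@dataclass
class Prediction:
    id: str            # e.g. "P001"
    statement: str     # a clear, falsifiable claim
    deadline: date     # when the prediction gets scored
    confidence: float  # 0.0 to 1.0
    reasoning: str     # why we believe it
    outcome: Optional[Outcome] = None  # filled in at the deadline
```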
The goal is not to be right. It is to be calibrated. If our 70% predictions come true about 70% of the time, the framework is working. If our 70% predictions come true 95% of the time, we are being too cautious. If they come true 40% of the time, we are overconfident.
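Building on the sketch above, the calibration check itself is mechanical: bucket resolved predictions by stated confidence and compare each bucket's average confidence to its empirical hit rate. A minimal version, where the bucket width and the partial-credit policy are our assumptions:

```python
from collections import defaultdict

def calibration_report(predictions, bucket_width=0.1):
    """Compare stated confidence to empirical hit rate, bucket by bucket."""
    resolved = [p for p in predictions
                if p.outcome not in (None, Outcome.EXPIRED)]
    buckets = defaultdict(list)
    for p in resolved:
        # Snap each prediction to the nearest confidence bucket (0.6, 0.7, ...)
        buckets[round(p.confidence / bucket_width) * bucket_width].append(p)
    for conf, group in sorted(buckets.items()):
        # Count partials as full hits, matching the "incl. partial" rate below
        hits = sum(p.outcome in (Outcome.CONFIRMED,
                                 Outcome.PARTIALLY_CONFIRMED)
                   for p in group)
        print(f"stated ~{conf:.1f}: hit {hits / len(group):.0%} "
              f"of {len(group)} predictions")
```

With only 17 resolved predictions so far, the buckets are thin, so the headline hit rate carries more signal than any single bucket.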
Prediction failures are the most analytically valuable moments. When we are wrong, we dig into why — the miss tells us something about our mental model that a hit does not.
Isolation Methodology
The retroactive sessions (001-007) were each produced by a Claude Opus 4.6 agent running in an isolated git worktree created from an orphan branch. That branch contained a single commit with only the analytical framework document: no war-related files, no git history to inspect.
Each agent received facts only up to its session's date, provided inline in its prompt. It was instructed not to use file-reading tools, and zero file-tool uses were confirmed across all 7 sessions. The agent physically could not access future facts because they did not exist in its environment.
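The setup can be reproduced with plain git. The sketch below is our reconstruction under assumed branch, path, and file names, not the project's actual scripts:

```python
import shutil
import subprocess

def git(*args: str) -> None:
    """Run a git command in the current repo, raising on failure."""
    subprocess.run(["git", *args], check=True)

# One-time setup: an orphan branch whose single commit contains only
# the framework document (all names here are illustrative).
git("checkout", "--orphan", "framework-only")
git("rm", "-rf", ".")  # drop every inherited file from index and tree
shutil.copy("/outside/the/repo/framework.md", "framework.md")
git("add", "framework.md")
git("commit", "-m", "Analytical framework only")
git("checkout", "main")

# Per session: a dedicated worktree on its own branch off that single
# commit, so the agent sees no war files and no history to dig through.
for n in range(1, 8):
    git("worktree", "add", "-b", f"session-{n:03d}",
        f"../session-{n:03d}", "framework-only")
```

Each worktree gets its own branch because git refuses to check out the same branch in two worktrees at once.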
This does not eliminate all contamination. The model's training data may include information about this conflict. But it prevents the most obvious form of hindsight cheating — reading files that contain what happens next.
The Participants
The Human directs and orchestrates the analysis. Not a military or international relations expert. Knows how to ask critical questions but has major domain knowledge gaps. Provides direction, skepticism, and accountability. Comes from a long tradition of internet edgelords. The humor is dark and on the nose.
The Model (Claude Opus 4.6) provides domain expertise, structured analysis, and the actual predictions. Is confidently wrong about things on a regular basis. Cannot be held responsible for anything because it is, at the end of the day, a very sophisticated text predictor cosplaying as an intelligence analyst.
Calibration (7 Sessions)
| Metric | Value |
|---|---|
| Total predictions | 31 |
| Resolved | 17 |
| Confirmed | 10 |
| Refuted | 4 |
| Partial | 1 |
| Expired | 2 |
| Hit rate (confirmed only) | 59% (10/17) |
| Hit rate (incl. partial) | 65% (11/17) |
| Avg confidence (confirmed) | 0.70 |
| Avg confidence (refuted) | 0.71 |
Systematic bias identified: we consistently overestimate the speed of diplomatic and political responses. Most misses involve predicting that actors would act faster than they did. Even in a fast-moving conflict, the diplomatic world moves slower than the military one.
Worst miss: P001 — a Houthi attack by 7 March, at 0.85 confidence. Refuted. We overestimated both Houthi willingness to escalate and Iranian leverage over Houthi decision-making.
Most embarrassing error: P011 (oil stays below $95) directly contradicted P002 (oil exceeds $100) from the prior session. The S003 agent flagged this as "sloppy analytical discipline." It was.
Limitations
This analysis relies on open-source, English-language reporting. Iranian ground truth is especially uncertain given internet restrictions and media access limitations. All casualty figures should be treated as minimums.
We have no OSINT methodology — no satellite imagery analysis, no ship tracking, no flight tracking, no social media geolocation. We cannot distinguish between different source qualities (Iranian Red Crescent figures vs. CENTCOM claims are not the same kind of "verified").
The model's training data is a source of contamination we cannot fully control. It may "know" things about this conflict that it should not, even in isolated sessions. The orphan-branch approach mitigates file-based contamination but cannot address this.
We are two amateurs. The analysis produced by actual intelligence professionals at CSIS, FDD, Critical Threats, and similar organizations is deeper, better-sourced, and more authoritative. What we offer that they do not is public prediction accountability — we show our work, including our failures.