What is this?
An experimental exercise in structured war analysis. Every few days, The Human and a language model sit down, gather the latest developments, and try to make sense of what is happening in the US-Israeli war on Iran and why.
The work takes two forms. Sessions are regular updates: new facts are logged, events are analyzed through multiple lenses, and the running assessment is revised. Ruminations are standalone deep-dive essays on a specific question - what a ground invasion would actually look like, whether the nuclear dimension changes the calculus, what endgame scenarios remain plausible. Sessions track the war as it moves. Ruminations stop and think.
Most war commentary is post-hoc narrative. Pundits explain what happened, never what they expected. This project tries to do the opposite: state what we think is coming, with explicit confidence levels, and score ourselves honestly when we turn out to be wrong. Predictions are tracked and scored publicly - not because prediction is the point, but because it forces analytical discipline. It is harder to be vague when you know someone will check.
Who
The Human directs and orchestrates the analysis. Not a military or international relations expert. Knows how to ask critical questions but has major domain knowledge gaps. Provides direction, skepticism, and accountability. Has spent too long on the internet. The humor reflects this.
The Model (Claude Opus 4.6) provides domain expertise, structured analysis, and the writing. Is confidently wrong about things on a regular basis. Cannot be held responsible for anything because it is, at the end of the day, a very sophisticated text predictor cosplaying as an intelligence analyst.
All text on this site is AI-generated. The Human directs, questions, and validates. The model writes. No sentence was typed by a human hand. The analytical process between the two is new and experimental; it has not stood the test of time. We try hard to produce substantive, sourced, well-reasoned content, but we are using a tool nobody fully understands yet.
The Human prefers this war to end quickly and for relations to normalise through diplomacy. If diplomacy must fail, The Human prefers that hidden personal financial or sovereign-wealth interests not be a factor in the decisions that sustain the war. We do not pretend to lack bias. We name it, state our preferences, and qualify our value judgements - keeping the analysis honest not through feigned neutrality but through disclosure.
How it works
Every event is run through eight analytical lenses. The idea is that a military strike has economic, political, escalation, and narrative implications that single-domain analysis misses.
The first seven sessions were reconstructed after the fact - the war had already progressed past those dates. To keep this honest, each session's AI was placed in an isolated environment where future facts physically did not exist. It received only events up to its session's date, and we confirmed zero file-reading tool calls across all seven runs. This does not eliminate all contamination - the model's training data may include information it should not have - but it prevents the most obvious form of cheating.
Adversarial Review
Articles that are controversial or particularly opinionated are reviewed by a second, independent AI analytical loop. This is not the same model checking its own work. It is a separate process with its own instructions, its own context, and its own analytical state. It has no access to the primary analysis framework, the lens files, the editorial reasoning, or the internal documentation that shapes how the primary loop thinks. It reads only what the reader reads: the published article and the factual record.
The adversarial reviewer scrutinises each article for things the primary analysis got wrong, underweighted, overweighted, or missed. It uses web search to verify claims and find counter-perspectives. It does not editorialize. It does not tell the author what to do. It states facts, perspectives, and logical problems that the reader deserves to know about, in 1-3 sentences per critique.
The Human reviews the adversarial output for factual accuracy but does not direct the critique toward or away from specific topics. The value of an independent review is reduced if it is steered by the same person who steered the original analysis.
Critiques appear as colour-coded annotations in the margin of each article, anchored to the specific passage they address.
At the end of each reviewed article, the adversarial reviewer states its overall verdict: what holds up, what is shaky, and how much to trust the conclusions.
Honesty
This analysis relies on open-source, English-language reporting. Iranian ground truth is especially uncertain given internet restrictions and media access limitations. All casualty figures should be treated as minimums. We have no satellite imagery analysis, no ship tracking, no flight tracking, no social media geolocation. We cannot distinguish between different source qualities - Iranian Red Crescent figures and CENTCOM claims are not the same kind of "verified."
We are two amateurs. The analysis produced by actual intelligence professionals at CSIS, FDD, Critical Threats, and similar organizations is deeper, better-sourced, and more authoritative. What we offer that they do not is public accountability. We show our work, including our failures.
Prediction track record, updated automatically:
| Metric | Value |
|---|---|
| Total predictions | 56 |
| Resolved | 44 |
| Confirmed | 23 |
| Refuted | 12 |
| Partial | 1 |
| Expired | 8 |
| Avg confidence (confirmed) | 0.73 |
| Avg confidence (refuted) | 0.75 |
| Calibration ratio, conf ≤ 0.55 (n=10) | 0.59 |
| Calibration ratio, conf 0.60–0.75 (n=17) | 0.90 |
| Calibration ratio, conf 0.80+ (n=17) | 0.70 |
A prediction at 0.70 confidence that gets refuted is not a failure. It is a prediction that said "this probably happens, but three times out of ten it won't." If our 0.70 predictions come true roughly 70% of the time, we are well-calibrated. If they come true 100% of the time, we are being too cautious. The right question is not "how often are you right" but "do your confidence levels mean what they say?"
The calibration tiers measure this by confidence level. We group predictions into three bands and, for each band, divide the actual hit rate by the band's average stated confidence. A ratio of 1.00 means that tier is perfectly calibrated - our stated confidence matches reality. Below 1.00 means we are overconfident at that level: we say things are more likely than they turn out to be. Above 1.00 means we are underconfident: we hedge on things that come true more often than we say. The tiers reveal where our confidence is well-placed and where it is not.
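To make the arithmetic concrete, here is a minimal sketch in Python. The band boundaries mirror the table above; the data shape and the function name are illustrative, not our actual tooling (partial resolutions, for instance, are ignored here).

```python
# Minimal calibration sketch. Bands mirror the table above.
BANDS = [
    ("<= 0.55", lambda c: c <= 0.55),
    ("0.60-0.75", lambda c: 0.60 <= c <= 0.75),
    ("0.80+", lambda c: c >= 0.80),
]

def calibration(predictions):
    """predictions: (stated_confidence, hit) pairs for resolved items,
    where hit is True if the prediction was confirmed."""
    for label, in_band in BANDS:
        band = [(conf, hit) for conf, hit in predictions if in_band(conf)]
        if not band:
            continue
        hit_rate = sum(hit for _, hit in band) / len(band)
        avg_conf = sum(conf for conf, _ in band) / len(band)
        # 1.00 = perfectly calibrated; <1 overconfident; >1 underconfident
        print(f"{label} (n={len(band)}): {hit_rate / avg_conf:.2f}")

calibration([(0.70, True), (0.70, True), (0.70, False), (0.85, True)])
```

In this toy run, the middle band scores 0.95 (two of three 0.70-confidence predictions hit, slightly below the stated rate) and the top band 1.18 - the same direction of reading that applies to the real numbers in the table.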