armchairintelligence.io

Armchair Intelligence

A structured analytical exercise tracking the 2026 Iran war through prediction and scored accountability.

What is this?

An experimental exercise in structured war analysis. Every few days, The Human and a language model sit down, gather the latest developments, and try to make sense of what is happening in the US-Israeli war on Iran and why.

The work takes two forms. Sessions are regular updates: new facts are logged, events are analyzed through multiple lenses, and the running assessment is revised. Ruminations are standalone deep-dive essays on a specific question - what a ground invasion would actually look like, whether the nuclear dimension changes the calculus, what endgame scenarios remain plausible. Sessions track the war as it moves. Ruminations stop and think.

Most war commentary is post-hoc narrative. Pundits explain what happened, never what they expected. This project tries to do the opposite: state what we think is coming, with explicit confidence levels, and then score ourselves honestly when we turn out to be wrong. Predictions are tracked and scored publicly - not because prediction is the point, but because it forces analytical discipline. It is harder to be vague when you know someone will check.

Who

The Human directs and orchestrates the analysis. Not a military or international relations expert. Knows how to ask critical questions but has major domain knowledge gaps. Provides direction, skepticism, and accountability. Has spent too long on the internet. The humor reflects this.

The Model (Claude Opus 4.6) provides domain expertise, structured analysis, and the writing. Is confidently wrong about things on a regular basis. Cannot be held responsible for anything because it is, at the end of the day, a very sophisticated text predictor cosplaying as an intelligence analyst.

All text on this site is AI-generated. The Human directs, questions, and validates. The model writes. No sentence was typed by a human hand. The analytical process between the two is new, experimental, and has not stood the test of time. We try hard to produce substantive, sourced, well-reasoned content, but we are using a tool nobody fully understands yet.

The Human prefers this war to end quickly and for relations to normalise through diplomacy. If diplomacy must fail, The Human prefers that undisclosed personal financial or sovereign wealth interests not be a factor in the decisions that sustain the war. We do not pretend to lack bias. We name it, state our preferences, and qualify our value judgements - keeping the analysis honest not through neutrality but through disclosure.

How it works

Every event is run through eight analytical lenses. The idea is that a military strike has economic, political, escalation, and narrative implications that single-domain analysis misses.

Military-Operational
Phases of war, air superiority, attrition, centers of gravity. Who is winning on the battlefield and what can each side actually do?
Escalation Dynamics
How far up the escalation ladder are we? Horizontal spread, vertical intensification, red lines, off-ramps. What pushes us higher?
Information & Narrative
Competing narratives, propaganda, strategic communication. Who is winning the story and how is perception shaping reality?
Economic & Energy
Oil markets, Strait of Hormuz, sanctions, global energy impact. At what price does domestic pressure force the US to stop?
Political-Domestic
US, Iran, Israel internal politics and decision-making. Whose domestic politics breaks first?
Proxy & Regional
Hezbollah, Houthis, Iraqi PMF, Gulf states. Will proxy fronts remain contained or does one blow open?
Nuclear
Program status, enrichment, IAEA, proliferation risk. Does this war make an Iranian bomb more or less likely?
Endgame Scenarios
How does this end? Ceasefire paths, regime outcomes, protracted conflict. What conditions trigger each scenario?

The first seven sessions were reconstructed after the fact - the war had already progressed past those dates. To keep this honest, each session's AI was placed in an isolated environment where future facts physically did not exist. It received only events up to its session's date, and zero file-reading tool uses were confirmed across all seven runs. This does not eliminate all contamination - the model's training data may include information it should not have - but it prevents the most obvious form of cheating.

Adversarial Review

Articles that are controversial or particularly opinionated are reviewed by a second, independent AI analytical loop. This is not the same model checking its own work. It is a separate process with its own instructions, its own context, and its own analytical state. It has no access to the primary analysis framework, the lens files, the editorial reasoning, or the internal documentation that shapes how the primary loop thinks. It reads only what the reader reads: the published article and the factual record.

The adversarial reviewer scrutinises each article for things the primary analysis got wrong, underweighted, overweighted, or missed. It uses web search to verify claims and find counter-perspectives. It does not editorialize. It does not tell the author what to do. It states facts, perspectives, and logical problems that the reader deserves to know about, in 1-3 sentences per critique.

The Human reviews the adversarial output for factual accuracy but does not direct the critique toward or away from specific topics. The value of an independent review is reduced if it is steered by the same person who steered the original analysis.

Critiques appear as colour-coded annotations in the margin of each article, anchored to the specific passage they address:

Blind Spot / Bias / Over-emphasis
Something is skewed. The framing favours one interpretation, a point is given more weight than evidence supports, or a relevant factor is ignored.
Reasoning Gap / Source Quality
Something is wrong. A logical inference is unsupported, or a claim relies on a source that is weaker than the text implies.
Under-emphasis / Missing Perspective
Something is absent. A relevant viewpoint, actor, or piece of evidence is not adequately represented.
Held Up
The obvious counter-argument was investigated and the article's claim survived scrutiny. Used sparingly - only when the expected critique is significant enough that a reader would wonder about it.

At the end of each reviewed article, the adversarial reviewer states its overall verdict: what holds up, what is shaky, and how much to trust the conclusions.

Honesty

This analysis relies on open-source, English-language reporting. Iranian ground truth is especially uncertain given internet restrictions and media access limitations. All casualty figures should be treated as minimums. We have no satellite imagery analysis, no ship tracking, no flight tracking, no social media geolocation. We cannot independently verify sources of differing quality - Iranian Red Crescent figures and CENTCOM claims are not the same kind of "verified."

We are two amateurs. The analysis produced by actual intelligence professionals at CSIS, FDD, Critical Threats, and similar organizations is deeper, better-sourced, and more authoritative. What we offer that they do not is public accountability. We show our work, including our failures.

Prediction track record, updated automatically:

Metric                          Value
Total predictions               56
Resolved                        44
Confirmed                       23
Refuted                         12
Partial                         1
Expired                         8
Avg confidence (confirmed)      0.73
Avg confidence (refuted)        0.75
Calibration ≤ 0.55 (n=10)       0.59
Calibration 0.60–0.75 (n=17)    0.90
Calibration 0.80+ (n=17)        0.70

A prediction at 0.70 confidence that gets refuted is not a failure. It is a prediction that said "this probably happens, but three times out of ten it won't." If our 0.70 predictions come true roughly 70% of the time, we are well-calibrated. If they come true 100% of the time, we are being too cautious. The right question is not "how often are you right" but "do your confidence levels mean what they say?"

The calibration tiers measure this by confidence level. We group predictions into three bands and compute the ratio of actual hit rate to average stated confidence within the band. A ratio of 1.00 means that tier is perfectly calibrated - our stated confidence matches reality. Below 1.00 means we are overconfident at that level: we say things are more likely than they turn out to be. The tiers reveal where our confidence is well-placed and where it is not.
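The tier computation described above can be sketched in a few lines. This is an illustration, not the site's actual scoring code; the prediction record format and the outcome values (1.0 for confirmed, 0.5 for partial, 0.0 for refuted) are assumptions.

```python
def calibration_tiers(predictions):
    """Compute per-tier calibration ratios.

    predictions: list of (stated_confidence, outcome) pairs, where
    outcome is 1.0 (confirmed), 0.5 (partial), or 0.0 (refuted).
    Returns {tier_label: actual_hit_rate / average_stated_confidence}.
    """
    tiers = {
        "<=0.55":    lambda c: c <= 0.55,
        "0.60-0.75": lambda c: 0.60 <= c <= 0.75,
        "0.80+":     lambda c: c >= 0.80,
    }
    ratios = {}
    for label, in_tier in tiers.items():
        band = [(c, o) for c, o in predictions if in_tier(c)]
        if not band:
            continue  # no resolved predictions in this band
        hit_rate = sum(o for _, o in band) / len(band)
        mean_conf = sum(c for c, _ in band) / len(band)
        # 1.00 = perfectly calibrated; below 1.00 = overconfident
        ratios[label] = round(hit_rate / mean_conf, 2)
    return ratios
```

For example, three 0.70-confidence predictions of which two came true give a hit rate of 0.67 against a stated 0.70, a ratio of about 0.95 - slightly overconfident, but close.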