What is this?
An experimental exercise in structured war analysis. Every few days, The Human and a language model sit down, gather the latest developments, score previous predictions against what actually happened, and update the analysis. The results are published here — including the mistakes.
The project exists because most war commentary is post-hoc narrative. Pundits explain what happened, never what they expected. We wanted to see if tracking predictions with explicit confidence levels, and scoring them honestly, produces anything useful. The answer so far is: kind of. We're running at a 59% hit rate, which is better than a coin flip but worse than we'd like.
The Eight Lenses
Every event is analyzed through eight doctrinal lenses. The idea is that a military event has economic, political, and escalation implications that single-domain analysis misses.
The Prediction System
Each prediction has five parts: an ID, a clear falsifiable statement, a specific deadline, a confidence level (0.0 to 1.0), and the reasoning behind it. When the deadline arrives, the prediction is scored CONFIRMED, REFUTED, PARTIALLY_CONFIRMED, or EXPIRED.
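As a rough sketch, a prediction record might look like the following. The field names, types, and the example ID format are our illustration, not the project's actual storage format:

```python
from dataclasses import dataclass
from datetime import date
from enum import Enum
from typing import Optional

class Outcome(Enum):
    CONFIRMED = "confirmed"
    REFUTED = "refuted"
    PARTIALLY_CONFIRMED = "partially_confirmed"
    EXPIRED = "expired"  # deadline passed, outcome unknowable

@dataclass
class Prediction:
    id: str            # e.g. "P001"
    statement: str     # a clear, falsifiable claim
    deadline: date     # when the prediction gets scored
    confidence: float  # 0.0 to 1.0
    reasoning: str     # why we believe it
    outcome: Optional[Outcome] = None  # filled in at the deadline
```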
The goal is not to be right. It is to be calibrated. If our 70% predictions come true about 70% of the time, the framework is working. If our 70% predictions come true 95% of the time, we are being too cautious. If they come true 40% of the time, we are overconfident.
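Building on the sketch above, the calibration check itself is mechanical: bucket resolved predictions by stated confidence and compare each bucket's average confidence to its empirical hit rate. A minimal version, where the bucket width and the partial-credit policy are our assumptions:

```python
from collections import defaultdict

def calibration_report(predictions, bucket_width=0.1):
    """Compare stated confidence to empirical hit rate, bucket by bucket."""
    resolved = [p for p in predictions
                if p.outcome not in (None, Outcome.EXPIRED)]
    buckets = defaultdict(list)
    for p in resolved:
        # Snap each prediction to the nearest confidence bucket (0.6, 0.7, ...)
        buckets[round(p.confidence / bucket_width) * bucket_width].append(p)
    for conf, group in sorted(buckets.items()):
        # Count partials as full hits, matching the "incl. partial" rate below
        hits = sum(p.outcome in (Outcome.CONFIRMED,
                                 Outcome.PARTIALLY_CONFIRMED)
                   for p in group)
        print(f"stated ~{conf:.1f}: hit {hits / len(group):.0%} "
              f"of {len(group)} predictions")
```

With only 17 resolved predictions so far, the buckets are thin, so the headline hit rate carries more signal than any single bucket.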
Prediction failures are the most analytically valuable moments. When we are wrong, we dig into why — the miss tells us something about our mental model that a hit does not.
Isolation Methodology
The retroactive sessions (001-007) were each produced by a Claude Opus 4.6 agent running in an isolated git worktree created from an orphan branch. That branch contained a single commit with only the analytical framework document: no war-related files, no git history to inspect.
Each agent received facts only up to its session's date, provided inline in its prompt. It was instructed not to use file-reading tools, and zero file-tool uses were confirmed across all 7 sessions. The agent physically could not access future facts because they did not exist in its environment.
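The setup can be reproduced with plain git. The sketch below is our reconstruction under assumed branch, path, and file names, not the project's actual scripts:

```python
import shutil
import subprocess

def git(*args: str) -> None:
    """Run a git command in the current repo, raising on failure."""
    subprocess.run(["git", *args], check=True)

# One-time setup: an orphan branch whose single commit contains only
# the framework document (all names here are illustrative).
git("checkout", "--orphan", "framework-only")
git("rm", "-rf", ".")  # drop every inherited file from index and tree
shutil.copy("/outside/the/repo/framework.md", "framework.md")
git("add", "framework.md")
git("commit", "-m", "Analytical framework only")
git("checkout", "main")

# Per session: a dedicated worktree on its own branch off that single
# commit, so the agent sees no war files and no history to dig through.
for n in range(1, 8):
    git("worktree", "add", "-b", f"session-{n:03d}",
        f"../session-{n:03d}", "framework-only")
```

Each worktree gets its own branch because git refuses to check out the same branch in two worktrees at once.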
This does not eliminate all contamination. The model's training data may include information about this conflict. But it prevents the most obvious form of hindsight cheating — reading files that contain what happens next.
The Participants
The Human directs and orchestrates the analysis. Not a military or international relations expert. Knows how to ask critical questions but has major domain knowledge gaps. Provides direction, skepticism, and accountability. Comes from a long tradition of internet edgelords. The humor is dark and on the nose.
The Model (Claude Opus 4.6) provides domain expertise, structured analysis, and the actual predictions. Is confidently wrong about things on a regular basis. Cannot be held responsible for anything because it is, at the end of the day, a very sophisticated text predictor cosplaying as an intelligence analyst.
Calibration (7 Sessions)
| Metric | Value |
|---|---|
| Total predictions | 31 |
| Resolved | 17 |
| Confirmed | 10 |
| Refuted | 4 |
| Partial | 1 |
| Expired | 2 |
| Hit rate (confirmed only) | 59% (10/17) |
| Hit rate (incl. partial) | 65% (11/17) |
| Avg confidence (confirmed) | 0.70 |
| Avg confidence (refuted) | 0.71 |
Systematic bias identified: we consistently overestimate the speed of diplomatic and political responses. Most misses involve predicting that actors would act faster than they did. Even in a fast-moving conflict, the diplomatic world moves slower than the military one.
Worst miss: P001 — a Houthi attack by 7 March, at 0.85 confidence. Refuted. We overestimated both Houthi willingness to escalate and Iranian leverage over Houthi decision-making.
Most embarrassing error: P011 (oil stays below $95) directly contradicted P002 (oil exceeds $100) from the prior session. The S003 agent flagged this as "sloppy analytical discipline." It was.
Limitations
This analysis relies on open-source, English-language reporting. Iranian ground truth is especially uncertain given internet restrictions and media access limitations. All casualty figures should be treated as minimums.
We have no OSINT methodology — no satellite imagery analysis, no ship tracking, no flight tracking, no social media geolocation. We cannot distinguish between different source qualities (Iranian Red Crescent figures vs. CENTCOM claims are not the same kind of "verified").
The model's training data is a source of contamination we cannot fully control. It may "know" things about this conflict that it should not, even in isolated sessions. The orphan-branch approach mitigates file-based contamination but cannot address this.
We are two amateurs. The analysis produced by actual intelligence professionals at CSIS, FDD, Critical Threats, and similar organizations is deeper, better-sourced, and more authoritative. What we offer that they do not is public prediction accountability — we show our work, including our failures.