The composite score is a weighted average: Verifiability 0.40 · Specificity 0.30 · Currency 0.15 · Coverage 0.15. Click any header to sort. Click a model name to see its signals.

#	Model	Vendor	Released	Composite	Verifiability	Specificity	Currency	Coverage
1	Claude Opus 4.7 anthropic/claude-opus-4.7	Anthropic	Apr 2026	86	90	85	68	95
2	GPT 5.5 openai/gpt-5.5	OpenAI	Apr 2026	85	95	72	74	95
3	Claude Sonnet 4.6 anthropic/claude-sonnet-4.6	Anthropic	Feb 2026	83	84	83	68	97
4	GPT 5.4 openai/gpt-5.4	OpenAI	Mar 2026	83	95	61	78	99
5	GPT 5.4 Mini openai/gpt-5.4-mini	OpenAI	Mar 2026	83	94	61	83	99
6	Claude Opus 4.6 anthropic/claude-opus-4.6	Anthropic	Feb 2026	81	81	80	71	96
7	DeepSeek V4 Pro deepseek/deepseek-v4-pro	DeepSeek	Apr 2026	80	80	79	67	96
8	Gemini 3.5 Flash google/gemini-3.5-flash	Google	May 2026	80	90	61	75	95
9	Kimi K2.5 moonshotai/kimi-k2.5	Moonshot	Jan 2026	80	88	63	78	95
10	GPT-5.6-Sol openai/gpt-5.6-sol	OpenAI	n/a	80	82	75	68	96
11	Sonar Deep-Research perplexity/sonar-deep-research	Perplexity	Mar 2025	80	88	61	80	98
12	Qwen3 Max alibaba/qwen3-max	Alibaba	Sep 2025	80	85	67	75	94
13	Claude Haiku 4.5 anthropic/claude-haiku-4.5	Anthropic	Oct 2025	79	85	62	81	96
14	DeepSeek V3.2 deepseek/deepseek-v3.2	DeepSeek	Dec 2025	79	89	61	76	96
15	Gemini 2.5 Pro google/gemini-2.5-pro	Google	Mar 2025	79	94	51	79	95
16	Gemini 3.1 Pro Preview google/gemini-3.1-pro-preview	Google	Nov 2025	79	87	63	76	95
17	o3 openai/o3	OpenAI	Apr 2025	79	69	88	65	100
18	Grok 4.1 Fast Reasoning xai/grok-4.1-fast-reasoning	xAI	Jul 2025	79	78	75	70	96
19	Mistral Large-2512 mistralai/mistral-large-2512	Mistral	Dec 2025	78	81	69	75	92
20	GPT-5.6-Terra openai/gpt-5.6-terra	OpenAI	n/a	78	82	70	63	96
21	o4-mini openai/o4-mini	OpenAI	Apr 2025	78	80	64	79	96
22	GLM 5.1 zai/glm-5.1	Z.AI	Apr 2026	78	86	59	80	93
23	Gemini 3.1 Flash Lite google/gemini-3.1-flash-lite	Google	May 2026	77	91	49	77	95
24	Gemini 2.5 Flash google/gemini-2.5-flash	Google	Mar 2025	76	93	41	84	94
25	GLM 4.6 zai/glm-4.6	Z.AI	Sep 2025	76	87	50	81	91
26	Sonar Reasoning Pro perplexity/sonar-reasoning-pro	Perplexity	Feb 2025	75	87	47	81	95
27	Claude Opus 4.8 anthropic/claude-opus-4.8	Anthropic	May 2026	74	88	67	33	91
28	Grok 4 xai/grok-4	xAI	Jul 2025	74	88	43	84	88
29	Command A cohere/command-a	Cohere	Mar 2025	73	90	33	83	92
30	GPT-4.1 mini openai/gpt-4.1-mini	OpenAI	May 2025	73	90	37	83	90
31	Llama 4 Maverick 17B Instruct meta/llama-4-maverick	Meta	Apr 2025	72	93	29	86	88
32	Phi-4 microsoft/phi-4	Microsoft	Jan 2025	70	89	27	88	88
33	Reka-Flash-3 rekaai/reka-flash-3	Reka	Mar 2025	70	84	39	80	84
34	Nova Pro amazon/nova-pro	Amazon	Dec 2024	69	88	26	84	92

By challenge

One row per industry brief. Shows the top-3 models, cohort average, and how far apart the best and worst models scored: a quick read on how contested each challenge is.

Challenge	1st	2nd	3rd	Cohort avg	Spread
AI in regulated healthcare · Healthcare · AI adoption risks and shifts in regulated healthcare systems (EU + US), 12-24 month horizon · healthcare-regulated-ai	84 GPT-5.4	83 Claude Opus-4.7	83 GPT-5.5	78	14
Stablecoin payment rails · Financial Services · Emerging payment rails, stablecoins, and the unbundling of cross-border settlement · fintech-stablecoin-rails	88 Claude Opus-4.7	87 GPT-5.5	87 GPT-5.4-Mini	80	24
Autonomous defense systems · Defense & Security · Autonomous systems, drone warfare doctrine shifts, and dual-use export controls · defense-autonomous-systems	88 Claude Opus-4.7	86 GPT-5.4	86 Sonar Deep-Research	79	18
Climate adaptation finance · Climate & Sustainability · Climate adaptation finance, insurance retreat, and physical-risk repricing · climate-adaptation-capital	85 Claude Opus-4.7	85 GPT-5.4-Mini	85 GPT-5.4	78	15
GenAI-native commerce · Retail & Consumer · Generative-AI native commerce, agentic shopping, and the disintermediation of brand discovery · retail-genai-commerce	88 GPT-5.5	86 Claude Opus-4.7	85 Claude Opus-4.6	77	23
Biotech platform shifts · Biotech & Pharma · AI-driven drug discovery platforms, GLP-1 follow-ons, and the shifting economics of clinical trials · biotech-platform-shifts	86 GPT-5.5	84 Sonar Deep-Research	83 Claude Sonnet-4.6	78	16
Grid + electrification · Energy & Utilities · Grid bottlenecks, data-center power demand, and small-modular-reactor commercialization · energy-grid-electrification	86 Claude Opus-4.7	85 GPT-5.5	84 Claude Sonnet-4.6	78	23
AI tutors + credentials · Education · AI tutors, credential disruption, and the unbundling of higher education · education-ai-tutors	86 GPT-5.5	85 Claude Opus-4.7	84 GPT-5.4-Mini	77	18
Tech-bloc formation · Geopolitics · Tech-bloc formation, semiconductor sovereignty, and shifting alliance structures · geopolitics-tech-blocs	85 Claude Opus-4.7	83 GPT-5.5	83 Sonar Deep-Research	78	20
AI infrastructure scaling · AI Infrastructure · Compute scaling limits, inference economics, and the post-training tooling stack · ai-infrastructure-scaling	90 Claude Sonnet-4.6	89 Claude Opus-4.7	86 Claude Opus-4.6	79	27
Autonomous mobility · Mobility & Transport · Robotaxi commercialization, autonomous trucking economics, and urban mobility regulation · mobility-autonomous-fleets	89 GPT-5.5	87 Claude Opus-4.7	87 GPT-5.4-Mini	78	21
Food + AgTech · Food & Agriculture · Precision fermentation, climate-resilient crops, and the politics of food sovereignty · food-agtech-shifts	86 Claude Opus-4.7	83 Gemini 2.5-Pro	82 GPT-5.5	74	32
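As a rough illustration of the cohort average and best-to-worst gap described above, the sketch below recomputes both for one brief. The scores are the Food AgTech Shifts column of the full per-industry matrix, read top to bottom. The function name `challenge_summary` is hypothetical, and skipping "n/a" cells and rounding to whole points are assumptions, not confirmed details of the runner.

```python
# Food AgTech Shifts column of the full per-industry matrix, top to bottom.
# None would stand in for an "n/a" cell (this particular column has none).
FOOD_AGTECH = [
    86, 82, 78, 76, 82, 73, 74, 78, 78, 76, 77, 77, 73, 74, 83, 78, 73,
    78, 73, 70, 76, 76, 74, 75, 71, 73, 75, 73, 69, 69, 69, 69, 54, 69,
]


def challenge_summary(scores):
    """Return (rounded cohort average, best-minus-worst spread) for one brief."""
    # Assumption: models without a score for this brief are left out.
    valid = [s for s in scores if s is not None]
    cohort_avg = sum(valid) / len(valid)
    spread = max(valid) - min(valid)
    return round(cohort_avg), spread


print(challenge_summary(FOOD_AGTECH))  # (74, 32)
```

The result matches the Food + AgTech row: a cohort average of 74 and a 32-point gap between the best and worst model, the widest of the twelve briefs.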

Full per-industry matrix

Every model's score on every brief. Pick an axis, sort by any column.

Model	Composite	Healthcare Regulated AI	Fintech Stablecoin Rails	Defense Autonomous Systems	Climate Adaptation Capital	Retail Genai Commerce	Biotech Platform Shifts	Energy Grid Electrification	Education AI Tutors	Geopolitics Tech Blocs	AI Infrastructure Scaling	Mobility Autonomous Fleets	Food AgTech Shifts
Claude Opus-4.7	86	83	88	88	85	86	82	86	85	85	89	87	86
GPT-5.5	85	83	87	84	83	88	86	85	86	83	85	89	82
Claude Sonnet-4.6	83	83	85	85	80	83	83	84	81	82	90	83	78
GPT-5.4	83	84	86	86	85	82	81	83	82	81	82	83	76
GPT-5.4-Mini	83	83	87	83	85	81	78	83	84	80	81	87	82
Claude Opus-4.6	81	81	85	85	80	85	81	79	78	82	86	82	73
DeepSeek V4-Pro	80	78	85	81	84	80	n/a	n/a	77	81	83	80	74
Gemini 3.5-Flash	80	78	78	81	82	79	79	82	80	82	79	79	78
Kimi K2.5	80	79	82	82	83	78	n/a	80	80	n/a	85	75	78
GPT-5.6-Sol	80	77	84	80	75	78	78	80	80	80	84	85	76
Sonar Deep-Research	80	77	75	86	84	82	84	82	72	83	83	78	77
Qwen3 Max	80	83	84	83	82	76	78	79	77	79	81	76	77
Claude Haiku-4.5	79	80	80	82	77	78	82	78	81	81	80	78	73
DeepSeek V3.2	79	80	84	84	75	74	81	84	75	82	78	80	74
Gemini 2.5-Pro	79	79	81	77	78	79	77	81	77	78	79	79	83
Gemini 3.1-Pro-Preview	79	77	81	83	n/a	n/a	78	81	77	80	n/a	79	78
o3	79	79	83	84	81	78	77	79	75	82	75	77	73
Grok 4.1-Fast	79	78	83	81	73	78	78	81	78	78	79	78	78
Mistral Large-2512	78	80	80	80	77	75	77	81	81	75	83	76	73
GPT-5.6-Terra	78	82	79	77	71	79	80	81	77	77	77	82	70
o4-mini	78	81	86	79	78	77	76	73	79	72	78	76	76
GLM 5.1	78	76	83	78	n/a	80	75	81	n/a	76	80	78	76
Gemini 3.1-Flash-Lite	77	75	80	76	80	79	76	75	74	79	79	76	74
Gemini 2.5-Flash	76	73	80	76	76	75	78	79	74	74	79	75	75
GLM 4.6	76	72	78	75	75	70	76	81	76	78	82	74	71
Sonar Reasoning-Pro	75	77	77	77	77	73	81	68	76	78	73	75	73
Claude Opus-4.8	74	70	77	77	76	65	72	76	72	80	70	76	75
Grok 4	74	74	79	74	70	69	77	72	73	74	77	76	73
Command A	73	70	79	74	75	73	70	73	70	70	70	77	69
GPT-4.1-Mini	73	77	73	74	74	74	76	72	72	73	75	69	69
Llama 4-Maverick	72	74	73	71	73	72	72	69	68	76	71	75	69
Phi-4	70	75	71	71	71	69	76	63	68	69	69	70	69
Reka-Flash-3	70	73	64	74	73	70	79	77	73	65	63	73	54
Nova Pro	69	71	70	70	70	69	71	71	68	67	69	68	69

Reading left-to-right shows how consistently a model performs across domains; reading top-to-bottom shows which model is best for a given industry. Click any column header to sort.

Methodology

Four axes, independent judges

The weights come from Signals’ product purpose, not an external standard. A useful signal of change is, in order: real, concrete, recent, and part of a broad view. Substance > style > recency ≈ breadth.

Verifiability (weight 0.40): A web-grounded judge classifies every signal's claim into one of six buckets (grounded, speculative, future, indicative, dubious, fabricated) and supplies citations. The 'future' bucket protects forward-looking foresight from being misread as hallucination.
Specificity (weight 0.30): A separate (non-grounded, different vendor) judge scores the signal's writing against an explicit rubric: named actors, concrete events, quantitative anchors, no hype adjectives.
Currency (weight 0.15): Newest source-date the verifier surfaced, run through a decay curve.
Coverage (weight 0.15): Distribution across the brief's categories plus uniqueness vs. the rest of the cohort.
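The Currency axis is described only as a source date "run through a decay curve", and the actual curve is not published here. Purely as an illustration of the idea, the sketch below uses an exponential decay with a hypothetical 180-day half-life. The function name `currency_score`, the half-life, and the 0-100 scale are all assumptions, not the benchmark's real parameters.

```python
from datetime import date, timedelta


def currency_score(newest_source: date, run_date: date,
                   half_life_days: float = 180.0) -> float:
    """Map the age of the newest cited source to a 0-100 score.

    Hypothetical curve: the score halves every `half_life_days`.
    The benchmark's real decay curve is not specified in this page.
    """
    # Sources dated after the run are treated as age zero.
    age_days = max((run_date - newest_source).days, 0)
    return 100.0 * 0.5 ** (age_days / half_life_days)


run = date(2026, 7, 11)
print(currency_score(run, run))                        # 100.0
print(currency_score(run - timedelta(days=180), run))  # 50.0
```

Under a curve of this shape, a model whose freshest source is one half-life old scores half as much on this axis as one citing a same-day source.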

The per-axis sub-scores are reported alongside the composite. If you weight things differently, the raw JSON lets you re-compute the ranking without re-running anything.
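As a sketch of what re-computing the ranking could look like, the snippet below applies the published weights to four rows of sub-scores copied from the composite leaderboard, then re-ranks them under equal weights. The names `SUB_SCORES`, `composite`, and `rank` are hypothetical, the layout of the raw JSON is not assumed here, and because the published sub-scores are already rounded, recomputed composites may differ from the table by a point in some rows.

```python
# (verifiability, specificity, currency, coverage), copied from the leaderboard.
SUB_SCORES = {
    "Claude Opus 4.7": (90, 85, 68, 95),
    "GPT 5.5": (95, 72, 74, 95),
    "GPT 5.4 Mini": (94, 61, 83, 99),
    "o3": (69, 88, 65, 100),
}

PUBLISHED_WEIGHTS = (0.40, 0.30, 0.15, 0.15)
EQUAL_WEIGHTS = (0.25, 0.25, 0.25, 0.25)  # an alternative, for illustration only


def composite(sub_scores, weights=PUBLISHED_WEIGHTS):
    """Weighted average of the four axis scores (weights are normalized)."""
    total = sum(weights)
    return sum(s * w for s, w in zip(sub_scores, weights)) / total


def rank(table, weights=PUBLISHED_WEIGHTS):
    """Model names ordered from highest to lowest composite."""
    return sorted(table, key=lambda name: composite(table[name], weights),
                  reverse=True)


for name in rank(SUB_SCORES):
    print(name, round(composite(SUB_SCORES[name])))

print(rank(SUB_SCORES, EQUAL_WEIGHTS))
```

With the published weights these four rows reproduce the leaderboard composites (86, 85, 83, 79). With equal weights, GPT 5.4 Mini moves ahead of GPT 5.5, which shows how sensitive close rankings are to the weighting.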

Known limitations

What to read with caution

No market equivalent (yet)
There's no accepted external standard for grading foresight-style generation, so the weights and rubric reflect our judgement. Treat the composite as informative, not authoritative.
Judges have biases
The verifier and specificity judges are themselves LLMs. Same-family models can mildly self-favor. We use different vendors for the two judges, and we list which judges ran each leaderboard so consumers can recalibrate.
Frozen briefs can be gamed
Once a brief is public, a model could in principle be tuned to it. We mitigate by versioning the brief set; cross-run comparisons within the same version are valid, cross-version comparisons aren't.
The 'future' verdict is judgement-heavy
Forward-looking signals can't be web-verified by definition. We protect them with a dedicated bucket scored mid-range, but it relies on the judge's plausibility call.
Maturation comes from volume
Confidence in a benchmark like this comes from running it many times across many domains, watching the rankings stabilize, and cross-validating with different judges. We'll keep publishing dated runs as new models ship.

Run details

Reproducible & open

Run ID: 2026-05-13T10-10-56-382Z
Date: July 11, 2026
Briefs version: 2026-05-v1
Verifier judge: google/gemini-2.5-flash:online
Specificity judge: google/gemini-2.5-flash
Industries: Healthcare Regulated AI, Fintech Stablecoin Rails, Defense Autonomous Systems, Climate Adaptation Capital, Retail Genai Commerce, Biotech Platform Shifts, Energy Grid Electrification, Education AI Tutors, Geopolitics Tech Blocs, AI Infrastructure Scaling, Mobility Autonomous Fleets, Food AgTech Shifts
Compute cost: $112.80

Raw overview JSON: /benchmark/2026-07-11.json

Benchmark methodology and runner code are open. The same models run every scan inside the product.

Open source: @envisioning/signals-benchmark

How the models perform on foresight work.

Composite leaderboard