Benchmark

A leaderboard for foresight.

We ran 32 frontier models against 12 frozen industry briefs and scored every output on four orthogonal axes. Web-grounded judges. No vibes.

Models tested: 32
Industry briefs: 12
Signals evaluated: 5,841
Cost / signal: $0.019

Composite leaderboard

Weighted average · Verifiability 0.4 · Specificity 0.3 · Currency 0.15 · Coverage 0.15.

#   Model                          ID                               Vendor      Date      Composite  Verifiability  Specificity  Currency  Coverage
1   Claude Opus 4.7                anthropic/claude-opus-4.7        Anthropic   Apr 2026  86         90             85           68        96
2   GPT 5.5                        openai/gpt-5.5                   OpenAI      Apr 2026  85         95             72           74        96
3   Claude Sonnet 4.6              anthropic/claude-sonnet-4.6      Anthropic   Feb 2026  83         84             83           68        97
4   GPT 5.4                        openai/gpt-5.4                   OpenAI      Mar 2026  83         95             61           78        99
5   GPT 5.4 Mini                   openai/gpt-5.4-mini              OpenAI      Mar 2026  83         94             61           83        99
6   Claude Opus 4.6                anthropic/claude-opus-4.6        Anthropic   Feb 2026  81         81             80           71        96
7   DeepSeek V4 Pro                deepseek/deepseek-v4-pro         DeepSeek    Apr 2026  80         80             79           67        96
8   Gemini 3.5 Flash               google/gemini-3.5-flash          Google      May 2026  80         90             61           75        95
9   Kimi K2.5                      moonshotai/kimi-k2.5             Moonshot    Jan 2026  80         88             63           78        95
10  Sonar Deep-Research            perplexity/sonar-deep-research   Perplexity  Mar 2025  80         88             61           80        98
11  Qwen3 Max                      alibaba/qwen3-max                Alibaba     Sep 2025  80         85             67           75        95
12  Claude Haiku 4.5               anthropic/claude-haiku-4.5       Anthropic   Oct 2025  79         85             62           81        96
13  DeepSeek V3.2                  deepseek/deepseek-v3.2           DeepSeek    Dec 2025  79         89             61           76        96
14  Gemini 2.5 Pro                 google/gemini-2.5-pro            Google      Mar 2025  79         94             51           79        95
15  Gemini 3.1 Pro Preview         google/gemini-3.1-pro-preview    Google      Nov 2025  79         87             63           76        95
16  o3                             openai/o3                        OpenAI      Apr 2025  79         69             88           65        100
17  Grok 4.1 Fast Reasoning        xai/grok-4.1-fast-reasoning      xAI         Jul 2025  79         78             75           70        96
18  Mistral Large-2512             mistralai/mistral-large-2512     Mistral     Dec 2025  78         81             69           75        92
19  o4-mini                        openai/o4-mini                   OpenAI      Apr 2025  78         80             64           79        97
20  GLM 5.1                        zai/glm-5.1                      Z.AI        Apr 2026  78         86             59           80        93
21  Gemini 3.1 Flash Lite          google/gemini-3.1-flash-lite     Google      May 2026  77         91             49           77        95
22  Gemini 2.5 Flash               google/gemini-2.5-flash          Google      Mar 2025  76         93             41           84        94
23  GLM 4.6                        zai/glm-4.6                      Z.AI        Sep 2025  76         87             50           81        91
24  Sonar Reasoning Pro            perplexity/sonar-reasoning-pro   Perplexity  Feb 2025  75         87             47           81        95
25  Claude Opus 4.8                anthropic/claude-opus-4.8        Anthropic   May 2026  74         88             67           33        91
26  Grok 4                         xai/grok-4                       xAI         Jul 2025  74         88             43           84        88
27  Command A                      cohere/command-a                 Cohere      Mar 2025  73         90             33           83        92
28  GPT-4.1 mini                   openai/gpt-4.1-mini              OpenAI      May 2025  73         90             37           83        90
29  Llama 4 Maverick 17B Instruct  meta/llama-4-maverick            Meta        Apr 2025  72         93             29           86        88
30  Phi-4                          microsoft/phi-4                  Microsoft   Jan 2025  70         89             27           88        88
31  Reka-Flash-3                   rekaai/reka-flash-3              Reka        Mar 2025  70         84             39           80        84
32  Nova Pro                       amazon/nova-pro                  Amazon      Dec 2024  69         88             26           84        92

By challenge

One row per industry brief. Shows the top-3 models, cohort average, and how far apart the best and worst models scored — a quick read on how contested each challenge is.

AI in regulated healthcare · Healthcare · healthcare-regulated-ai
AI adoption risks and shifts in regulated healthcare systems (EU + US), 12-24 month horizon
1st GPT-5.4 (84) · 2nd Claude Opus-4.7 (83) · 3rd GPT-5.5 (83) · Cohort average 78 · Spread 14

Stablecoin payment rails · Financial Services · fintech-stablecoin-rails
Emerging payment rails, stablecoins, and the unbundling of cross-border settlement
1st Claude Opus-4.7 (88) · 2nd GPT-5.5 (87) · 3rd GPT-5.4-Mini (87) · Cohort average 80 · Spread 24

Autonomous defense systems · Defense & Security · defense-autonomous-systems
Autonomous systems, drone warfare doctrine shifts, and dual-use export controls
1st Claude Opus-4.7 (88) · 2nd GPT-5.4 (86) · 3rd Sonar Deep-Research (86) · Cohort average 79 · Spread 18

Climate adaptation finance · Climate & Sustainability · climate-adaptation-capital
Climate adaptation finance, insurance retreat, and physical-risk repricing
1st Claude Opus-4.7 (85) · 2nd GPT-5.4-Mini (85) · 3rd GPT-5.4 (85) · Cohort average 78 · Spread 15

GenAI-native commerce · Retail & Consumer · retail-genai-commerce
Generative-AI native commerce, agentic shopping, and the disintermediation of brand discovery
1st GPT-5.5 (88) · 2nd Claude Opus-4.7 (86) · 3rd Claude Opus-4.6 (85) · Cohort average 77 · Spread 22

Biotech platform shifts · Biotech & Pharma · biotech-platform-shifts
AI-driven drug discovery platforms, GLP-1 follow-ons, and the shifting economics of clinical trials
1st GPT-5.5 (86) · 2nd Sonar Deep-Research (84) · 3rd Claude Sonnet-4.6 (83) · Cohort average 78 · Spread 16

Grid + electrification · Energy & Utilities · energy-grid-electrification
Grid bottlenecks, data-center power demand, and small-modular-reactor commercialization
1st Claude Opus-4.7 (86) · 2nd GPT-5.5 (85) · 3rd Claude Sonnet-4.6 (84) · Cohort average 78 · Spread 23

AI tutors + credentials · Education · education-ai-tutors
AI tutors, credential disruption, and the unbundling of higher education
1st GPT-5.5 (86) · 2nd Claude Opus-4.7 (85) · 3rd GPT-5.4-Mini (84) · Cohort average 76 · Spread 18

Tech-bloc formation · Geopolitics · geopolitics-tech-blocs
Tech-bloc formation, semiconductor sovereignty, and shifting alliance structures
1st Claude Opus-4.7 (85) · 2nd GPT-5.5 (83) · 3rd Sonar Deep-Research (83) · Cohort average 78 · Spread 20

AI infrastructure scaling · AI Infrastructure · ai-infrastructure-scaling
Compute scaling limits, inference economics, and the post-training tooling stack
1st Claude Sonnet-4.6 (90) · 2nd Claude Opus-4.7 (89) · 3rd GPT-5.5 (86) · Cohort average 79 · Spread 27

Autonomous mobility · Mobility & Transport · mobility-autonomous-fleets
Robotaxi commercialization, autonomous trucking economics, and urban mobility regulation
1st GPT-5.5 (89) · 2nd Claude Opus-4.7 (87) · 3rd GPT-5.4-Mini (87) · Cohort average 78 · Spread 21

Food + AgTech · Food & Agriculture · food-agtech-shifts
Precision fermentation, climate-resilient crops, and the politics of food sovereignty
1st Claude Opus-4.7 (86) · 2nd Gemini 2.5-Pro (83) · 3rd GPT-5.5 (82) · Cohort average 75 · Spread 32
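
The three figures reported per challenge (top-3 models, cohort average, best-to-worst spread) can be sketched as below. This is a minimal illustration, not the benchmark's own code: the function name `summarize_challenge` is hypothetical, the tie-breaking and rounding rules are assumptions, and the sample uses only five of the 32 healthcare scores, so its average is not the published cohort average.

```python
from statistics import mean

def summarize_challenge(scores, top_n=3):
    """Summarize one brief. `scores` maps model name -> score on that brief."""
    # sorted() is stable, so tied models keep their input order (an assumption
    # about tie-breaking; the page does not state its rule).
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    values = [score for _, score in ranked]
    return {
        "top": ranked[:top_n],
        "cohort_average": round(mean(values)),
        "spread": max(values) - min(values),
    }

# Five healthcare-regulated-ai scores (best three, worst two) copied from the
# per-industry matrix. Illustrative subset, not the full cohort.
healthcare_sample = {
    "GPT-5.4": 84,
    "Claude Opus-4.7": 83,
    "GPT-5.5": 83,
    "Claude Opus-4.8": 70,
    "Command A": 70,
}

summary = summarize_challenge(healthcare_sample)
print(summary["top"])     # three highest-scoring models in the sample
print(summary["spread"])  # best score minus worst score
```

Because the sample includes the best and worst healthcare scores, its spread matches the published value of 14.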

Full per-industry matrix

Every model's score on every brief.

Axis: Composite

Model                   Overall   Health    Fintech   Defense   Climate   Retail    Biotech   Energy    Educ.     Geopol.   AI infra  Mobility  Food
Claude Opus-4.7         86        83        88        88        85        86        82        86        85        85        89        87        86
GPT-5.5                 85        83        87        85        83        88        86        85        86        83        86        89        82
Claude Sonnet-4.6       83        83        85        85        80        83        83        84        81        82        90        83        78
GPT-5.4                 83        84        86        86        85        82        81        83        82        81        82        83        76
GPT-5.4-Mini            83        83        87        83        85        81        78        83        84        80        81        87        82
Claude Opus-4.6         81        81        85        85        80        85        81        79        78        82        86        82        73
DeepSeek V4-Pro         80        78, 85, 81, 84, 80, 77, 81, 83, 80, 74 (10 of 12 brief scores listed; columns not identified)
Gemini 3.5-Flash        80        78        78        81        82        79        79        82        80        82        79        79        78
Kimi K2.5               80        79, 82, 82, 83, 78, 80, 80, 85, 75, 78 (10 of 12 brief scores listed; columns not identified)
Sonar Deep-Research     80        77        75        86        84        82        84        82        72        83        83        78        77
Qwen Max                80        83        84        83        82        76        78        79        77        79        81        76        77
Claude Haiku-4.5        79        80        80        82        77        78        82        78        81        81        80        78        73
DeepSeek                79        80        84        84        75        74        81        84        75        82        78        80        74
Gemini 2.5-Pro          79        79        81        77        78        79        77        81        77        78        79        79        83
Gemini 3.1-Pro-Preview  79        77, 81, 83, 78, 81, 77, 80, 79, 78 (9 of 12 brief scores listed; columns not identified)
O3                      79        79        83        84        81        78        77        79        75        82        75        77        73
Grok 4.1-Fast           79        78        83        81        73        78        78        82        78        78        79        78        78
Mistral Large-2512      78        80        80        80        77        75        77        81        81        75        83        76        73
O4-Mini                 78        81        86        79        78        77        76        73        79        72        78        76        77
GLM 5.1                 78        76, 83, 78, 80, 75, 81, 76, 80, 78, 76 (10 of 12 brief scores listed; columns not identified)
Gemini 3.1-Flash-Lite   77        75        80        76        80        79        76        75        74        79        79        76        74
Gemini 2.5-Flash        76        73        80        76        76        75        78        79        74        74        79        75        75
GLM 4.6                 76        72        78        75        75        70        76        81        76        78        82        74        71
Sonar Reasoning-Pro     75        77        77        77        77        73        81        68        76        78        73        75        73
Claude Opus-4.8         74        70        77        77        76        66        72        76        72        80        70        76        75
Grok 4                  74        74        79        74        70        69        77        72        73        74        77        76        73
Command A               73        70        79        74        75        73        70        73        70        70        70        77        69
GPT-4.1-Mini            73        77        73        74        74        74        76        72        72        73        75        69        69
Llama 4-Maverick        72        74        73        71        73        72        72        69        68        76        71        75        69
Phi-4                   70        75        71        71        71        69        76        63        68        69        69        70        69
Reka-Flash-3            70        73        64        74        73        70        79        77        73        65        63        73        54
Nova Pro                69        71        70        70        70        69        71        71        68        67        69        68        69

Reading left-to-right shows how consistently a model performs across domains; reading top-to-bottom shows which model is best for a given industry.
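
Both readings of the matrix can be sketched in a few lines. This is an illustration only: the helper names `consistency` and `best_per_brief` are hypothetical, population standard deviation is one assumed way to quantify "consistency", and the column order is assumed to follow the Industries list under Run details. The three rows are copied from the matrix.

```python
from statistics import pstdev

# Brief slugs, assumed to be in the same order as the matrix columns.
BRIEFS = [
    "healthcare-regulated-ai", "fintech-stablecoin-rails",
    "defense-autonomous-systems", "climate-adaptation-capital",
    "retail-genai-commerce", "biotech-platform-shifts",
    "energy-grid-electrification", "education-ai-tutors",
    "geopolitics-tech-blocs", "ai-infrastructure-scaling",
    "mobility-autonomous-fleets", "food-agtech-shifts",
]

# Three rows of per-brief scores (overall column omitted).
MATRIX = {
    "Claude Opus-4.7":     [83, 88, 88, 85, 86, 82, 86, 85, 85, 89, 87, 86],
    "Sonar Deep-Research": [77, 75, 86, 84, 82, 84, 82, 72, 83, 83, 78, 77],
    "Reka-Flash-3":        [73, 64, 74, 73, 70, 79, 77, 73, 65, 63, 73, 54],
}

def consistency(matrix):
    """Row-wise read: std dev across briefs (lower means steadier)."""
    return {model: round(pstdev(row), 2) for model, row in matrix.items()}

def best_per_brief(matrix, briefs=BRIEFS):
    """Column-wise read: the top-scoring model for each brief."""
    return {
        brief: max(matrix, key=lambda model: matrix[model][i])
        for i, brief in enumerate(briefs)
    }

print(consistency(MATRIX))
print(best_per_brief(MATRIX)["biotech-platform-shifts"])
```

Among these three rows, Claude Opus-4.7 leads every brief except biotech-platform-shifts, where Sonar Deep-Research scores 84 against 82.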

Pareto chart

Substance vs. style, by model

X = Verifiability · Y = Specificity · dot size = Composite score. Upper-right is the frontier; biggest dot wins overall.

[Scatter chart: Verifiability on the x-axis and Specificity on the y-axis (both ticked 50 to 100), one dot per model, colored by vendor: Alibaba, Amazon, Anthropic, Cohere, DeepSeek, Google, Meta, Microsoft, Mistral, Moonshot, OpenAI, Perplexity, Reka, Z.AI, xAI. Each dot's tooltip repeats the model's Verifiability, Specificity, Coverage, and Composite scores from the composite leaderboard.]

Methodology

Four axes, independent judges

The weights come from Signals’ product purpose, not an external standard. A useful signal of change is, in order: real, concrete, recent, and part of a broad view. Substance > style > recency ≈ breadth.

Verifiability · weight 0.40
A web-grounded judge classifies every signal's claim into one of six buckets (grounded, speculative, future, indicative, dubious, fabricated) and supplies citations. The 'future' bucket protects forward-looking foresight from being misread as hallucination.

Specificity · weight 0.30
A separate (non-grounded, different vendor) judge scores the signal's writing against an explicit rubric: named actors, concrete events, quantitative anchors, no hype adjectives.

Currency · weight 0.15
Newest source-date the verifier surfaced, run through a decay curve.

Coverage · weight 0.15
Distribution across the brief's categories plus uniqueness vs. the rest of the cohort.
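
To make the Currency axis concrete, here is one way a decay curve could turn a source date into a score. The shape of the real curve is not stated on this page, so the exponential form, the 180-day half-life, and the function name `currency_score` are all assumptions made for illustration.

```python
from datetime import date, timedelta

def currency_score(newest_source, run_date, half_life_days=180.0):
    """Map the newest cited source date to a 0-100 score.

    Assumed form: exponential decay with a hypothetical 180-day half-life.
    The benchmark's actual curve and parameters may differ.
    """
    # Sources dated after the run are treated as age 0.
    age_days = max((run_date - newest_source).days, 0)
    return 100.0 * 0.5 ** (age_days / half_life_days)

run_date = date(2026, 5, 13)  # date taken from the Run ID under Run details
for age in (0, 90, 180, 365):
    source_date = run_date - timedelta(days=age)
    print(age, round(currency_score(source_date, run_date), 1))
```

Under these assumptions a source from the run date scores 100 and one that is 180 days old scores 50; a shorter half-life would penalize stale sources more sharply.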

The per-axis sub-scores are reported alongside the composite. If you weight things differently, the raw JSON lets you re-compute the ranking without re-running anything.
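
As a sketch of that re-computation, the snippet below applies the published weights to three rows of sub-scores copied from the composite leaderboard, then re-ranks them under an alternative weighting. The raw JSON schema is not documented on this page, so the dictionary layout and the names `composite` and `rank` are assumptions; loading the JSON itself is left out.

```python
# Axis weights as published in the methodology.
DEFAULT_WEIGHTS = {
    "verifiability": 0.40,
    "specificity": 0.30,
    "currency": 0.15,
    "coverage": 0.15,
}

# Sub-scores for three models, copied from the composite leaderboard.
SUB_SCORES = {
    "anthropic/claude-opus-4.7": {"verifiability": 90, "specificity": 85, "currency": 68, "coverage": 96},
    "openai/gpt-5.5": {"verifiability": 95, "specificity": 72, "currency": 74, "coverage": 96},
    "openai/o3": {"verifiability": 69, "specificity": 88, "currency": 65, "coverage": 100},
}

def composite(scores, weights=DEFAULT_WEIGHTS):
    """Weighted average of the axis scores; weights are normalized to sum to 1."""
    total = sum(weights.values())
    return sum(scores[axis] * w for axis, w in weights.items()) / total

def rank(sub_scores, weights=DEFAULT_WEIGHTS):
    """Return model IDs sorted best-first under the given weights."""
    return sorted(sub_scores, key=lambda m: composite(sub_scores[m], weights), reverse=True)

# Hypothetical alternative weighting that favors specificity.
specificity_heavy = {"verifiability": 0.20, "specificity": 0.50, "currency": 0.15, "coverage": 0.15}

print(round(composite(SUB_SCORES["anthropic/claude-opus-4.7"]), 1))
print(rank(SUB_SCORES))
print(rank(SUB_SCORES, specificity_heavy))
```

With the published weights, Claude Opus 4.7 works out to 0.4·90 + 0.3·85 + 0.15·68 + 0.15·96 = 86.1, in line with its listed composite of 86. Under the specificity-heavy weighting, o3 moves ahead of GPT 5.5 among these three.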

Known limitations

What to read with caution

  • No market equivalent (yet)

    There's no accepted external standard for grading foresight-style generation, so the weights and rubric reflect our judgement. Treat the composite as informative, not authoritative.

  • Judges have biases

    The verifier and specificity judges are themselves LLMs. Same-family models can mildly self-favor. We use different vendors for the two judges, and we list which judges ran each leaderboard so consumers can recalibrate.

  • Frozen briefs can be gamed

    Once a brief is public, a model could in principle be tuned to it. We mitigate by versioning the brief set; cross-run comparisons within the same version are valid, cross-version comparisons aren't.

  • The 'future' verdict is judgement-heavy

    Forward-looking signals can't be web-verified by definition. We protect them with a dedicated bucket scored mid-range, but it relies on the judge's plausibility call.

  • Maturation comes from volume

    Confidence in a benchmark like this comes from running it many times across many domains, watching the rankings stabilize, and cross-validating with different judges. We'll keep publishing dated runs as new models ship.

Run details

Reproducible & open

Run ID: 2026-05-13T10-10-56-382Z
Date: May 28, 2026
Briefs version: 2026-05-v1
Verifier judge: google/gemini-2.5-flash:online
Specificity judge: google/gemini-2.5-flash
Industries: Healthcare Regulated AI, Fintech Stablecoin Rails, Defense Autonomous Systems, Climate Adaptation Capital, Retail Genai Commerce, Biotech Platform Shifts, Energy Grid Electrification, Education AI Tutors, Geopolitics Tech Blocs, AI Infrastructure Scaling, Mobility Autonomous Fleets, Food AgTech Shifts
Compute cost: $108.97
Raw overview JSON: /benchmark/2026-05-13.json

Benchmark methodology and runner code are open. The same models run every scan inside the product — see how.

Open source: @envisioning/signals-benchmark