A leaderboard for foresight.
We ran 32 frontier models against 12 frozen industry briefs and scored every output on four orthogonal axes. Web-grounded judges. No vibes.
Verifiability: a web-grounded judge confirms the signal's claim.
Specificity: named actors, concrete events, no hype adjectives.
Currency: recent supporting evidence in cited sources.
Coverage: balanced across categories; finds non-obvious signals.
Composite leaderboard
Weighted average · Verifiability 0.4 · Specificity 0.3 · Currency 0.15 · Coverage 0.15.
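Written out, the composite implied by these weights is a plain weighted sum. The symbols are shorthand introduced here (V for Verifiability, S for Specificity, R for Currency, B for Coverage), and the sub-scores in the worked line are made-up numbers for illustration, not results from this run.

```latex
\[
C = 0.40\,V + 0.30\,S + 0.15\,R + 0.15\,B
\]
\[
V = 90,\; S = 80,\; R = 70,\; B = 60
\;\Rightarrow\;
C = 36 + 24 + 10.5 + 9 = 79.5
\]
```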
By challenge
One row per industry brief. Shows the top-3 models, cohort average, and how far apart the best and worst models scored — a quick read on how contested each challenge is.
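A minimal sketch of how one such row could be derived from a brief's scores. The function name, the model names, and the numbers are hypothetical; only the three summary quantities (top-3, cohort average, best-to-worst spread) come from the description above.

```python
from statistics import mean

def summarize_brief(scores: dict[str, float]) -> dict:
    """Summarize one brief: top-3 models, cohort average, best-to-worst spread."""
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    values = [v for _, v in ranked]
    return {
        "top3": [name for name, _ in ranked[:3]],
        "cohort_avg": round(mean(values), 1),
        "spread": values[0] - values[-1],  # large spread = contested brief
    }

# Hypothetical scores for a single brief (not taken from this run).
example = {"model-a": 88, "model-b": 81, "model-c": 79, "model-d": 64}
summary = summarize_brief(example)
print(summary)
```

A wide spread suggests the brief separates models sharply; a narrow one suggests most models handle it about equally well.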
Full per-industry matrix
Every model's score on every brief. Pick an axis, sort by any column.
| Model | Avg | | | | | | | | | | | | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | 86 | 83 | 88 | 88 | 85 | 86 | 82 | 86 | 85 | 85 | 89 | 87 | 86 |
| | 85 | 83 | 87 | 85 | 83 | 88 | 86 | 85 | 86 | 83 | 86 | 89 | 82 |
| | 83 | 83 | 85 | 85 | 80 | 83 | 83 | 84 | 81 | 82 | 90 | 83 | 78 |
| | 83 | 84 | 86 | 86 | 85 | 82 | 81 | 83 | 82 | 81 | 82 | 83 | 76 |
| | 83 | 83 | 87 | 83 | 85 | 81 | 78 | 83 | 84 | 80 | 81 | 87 | 82 |
| | 81 | 81 | 85 | 85 | 80 | 85 | 81 | 79 | 78 | 82 | 86 | 82 | 73 |
| | 80 | 78 | 85 | 81 | 84 | 80 | — | — | 77 | 81 | 83 | 80 | 74 |
| | 80 | 78 | 78 | 81 | 82 | 79 | 79 | 82 | 80 | 82 | 79 | 79 | 78 |
| | 80 | 79 | 82 | 82 | 83 | 78 | — | 80 | 80 | — | 85 | 75 | 78 |
| | 80 | 77 | 75 | 86 | 84 | 82 | 84 | 82 | 72 | 83 | 83 | 78 | 77 |
| | 80 | 83 | 84 | 83 | 82 | 76 | 78 | 79 | 77 | 79 | 81 | 76 | 77 |
| | 79 | 80 | 80 | 82 | 77 | 78 | 82 | 78 | 81 | 81 | 80 | 78 | 73 |
| | 79 | 80 | 84 | 84 | 75 | 74 | 81 | 84 | 75 | 82 | 78 | 80 | 74 |
| | 79 | 79 | 81 | 77 | 78 | 79 | 77 | 81 | 77 | 78 | 79 | 79 | 83 |
| | 79 | 77 | 81 | 83 | — | — | 78 | 81 | 77 | 80 | — | 79 | 78 |
| | 79 | 79 | 83 | 84 | 81 | 78 | 77 | 79 | 75 | 82 | 75 | 77 | 73 |
| | 79 | 78 | 83 | 81 | 73 | 78 | 78 | 82 | 78 | 78 | 79 | 78 | 78 |
| | 78 | 80 | 80 | 80 | 77 | 75 | 77 | 81 | 81 | 75 | 83 | 76 | 73 |
| | 78 | 81 | 86 | 79 | 78 | 77 | 76 | 73 | 79 | 72 | 78 | 76 | 77 |
| | 78 | 76 | 83 | 78 | — | 80 | 75 | 81 | — | 76 | 80 | 78 | 76 |
| | 77 | 75 | 80 | 76 | 80 | 79 | 76 | 75 | 74 | 79 | 79 | 76 | 74 |
| | 76 | 73 | 80 | 76 | 76 | 75 | 78 | 79 | 74 | 74 | 79 | 75 | 75 |
| | 76 | 72 | 78 | 75 | 75 | 70 | 76 | 81 | 76 | 78 | 82 | 74 | 71 |
| | 75 | 77 | 77 | 77 | 77 | 73 | 81 | 68 | 76 | 78 | 73 | 75 | 73 |
| | 74 | 70 | 77 | 77 | 76 | 66 | 72 | 76 | 72 | 80 | 70 | 76 | 75 |
| | 74 | 74 | 79 | 74 | 70 | 69 | 77 | 72 | 73 | 74 | 77 | 76 | 73 |
| | 73 | 70 | 79 | 74 | 75 | 73 | 70 | 73 | 70 | 70 | 70 | 77 | 69 |
| | 73 | 77 | 73 | 74 | 74 | 74 | 76 | 72 | 72 | 73 | 75 | 69 | 69 |
| | 72 | 74 | 73 | 71 | 73 | 72 | 72 | 69 | 68 | 76 | 71 | 75 | 69 |
| | 70 | 75 | 71 | 71 | 71 | 69 | 76 | 63 | 68 | 69 | 69 | 70 | 69 |
| Reka-Flash-3 | 70 | 73 | 64 | 74 | 73 | 70 | 79 | 77 | 73 | 65 | 63 | 73 | 54 |
| | 69 | 71 | 70 | 70 | 70 | 69 | 71 | 71 | 68 | 67 | 69 | 68 | 69 |
Reading left-to-right shows how consistently a model performs across domains; reading top-to-bottom shows which model is best for a given industry.
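In the rows above, the first figure matches the rounded mean of the twelve brief scores that follow it, with missing entries (shown as a dash) left out of the average. This is an observation from the published numbers rather than a documented rule. A minimal sketch of that calculation, using the first row that has missing briefs (`row_average` is a name chosen here):

```python
from statistics import mean

def row_average(cells: list[str]) -> int:
    """Rounded mean of the numeric cells, skipping entries marked as missing."""
    values = [int(c) for c in cells if c.strip() not in {"—", "-", ""}]
    return round(mean(values))

# Brief scores from the first table row with missing entries.
row = ["78", "85", "81", "84", "80", "—", "—", "77", "81", "83", "80", "74"]
print(row_average(row))  # 80, the leading figure in that row
```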
Substance vs. style, by model
X = Verifiability · Y = Specificity · dot size = Composite score. Upper-right is the frontier; biggest dot wins overall.
Four axes, independent judges
The weights come from Signals’ product purpose, not an external standard. A useful signal of change is, in order: real, concrete, recent, and part of a broad view. Substance > style > recency ≈ breadth.
- Verifiability (weight 0.40): A web-grounded judge classifies every signal's claim into one of six buckets (grounded, speculative, future, indicative, dubious, fabricated) and supplies citations. The 'future' bucket protects forward-looking foresight from being misread as hallucination.
- Specificity (weight 0.30): A separate (non-grounded, different vendor) judge scores the signal's writing against an explicit rubric: named actors, concrete events, quantitative anchors, no hype adjectives.
- Currency (weight 0.15): The newest source date the verifier surfaced, run through a decay curve.
- Coverage (weight 0.15): Distribution across the brief's categories plus uniqueness vs. the rest of the cohort.
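The text does not say which decay curve Currency uses. As an illustration only, assuming an exponential decay with a hypothetical 90-day half-life, the age of the newest source could be mapped to a 0 to 100 score like this:

```python
from datetime import date

HALF_LIFE_DAYS = 90  # hypothetical; the benchmark's actual curve is not documented here

def currency_score(newest_source: date, run_date: date,
                   half_life_days: float = HALF_LIFE_DAYS) -> float:
    """Exponential decay: 100 for a same-day source, halving every half-life."""
    age_days = max((run_date - newest_source).days, 0)  # clamp future-dated sources
    return 100.0 * 0.5 ** (age_days / half_life_days)

run = date(2026, 5, 13)  # example run date
print(currency_score(date(2026, 5, 13), run))  # 100.0
print(currency_score(date(2026, 2, 12), run))  # 50.0 (source is 90 days old)
```

A shorter half-life would punish stale sources harder; a longer one would flatten the axis.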
The per-axis sub-scores are reported alongside the composite. If you weight things differently, the raw JSON lets you re-compute the ranking without re-running anything.
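Re-ranking under different weights might look like the sketch below. The JSON shape, field names, model names, and scores are all hypothetical stand-ins, since the actual schema of the raw file is not described here; only the default weights come from the text.

```python
import json

# Hypothetical shape for per-model sub-scores; the real JSON schema may differ.
raw = json.loads("""
[
  {"model": "model-a", "verifiability": 90, "specificity": 80, "currency": 50, "coverage": 70},
  {"model": "model-b", "verifiability": 72, "specificity": 76, "currency": 95, "coverage": 70}
]
""")

DEFAULT_WEIGHTS = {"verifiability": 0.40, "specificity": 0.30,
                   "currency": 0.15, "coverage": 0.15}

def rerank(rows, weights):
    """Return (model, composite) pairs sorted best-first under the given weights."""
    total = sum(weights.values())  # normalize so weights need not sum to 1
    scored = [
        (r["model"], round(sum(r[axis] * w for axis, w in weights.items()) / total, 2))
        for r in rows
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

print(rerank(raw, DEFAULT_WEIGHTS))  # model-a leads under the published weights

# A reader who cares mostly about recency might try:
custom = {"verifiability": 0.25, "specificity": 0.25, "currency": 0.40, "coverage": 0.10}
print(rerank(raw, custom))           # model-b leads once currency dominates
```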
What to read with caution
- No market equivalent (yet)
There's no accepted external standard for grading foresight-style generation, so the weights and rubric reflect our judgement. Treat the composite as informative, not authoritative.
- Judges have biases
The verifier and specificity judges are themselves LLMs. Same-family models can mildly self-favor. We use different vendors for the two judges, and we list which judges ran each leaderboard so consumers can recalibrate.
- Frozen briefs can be gamed
Once a brief is public, a model could in principle be tuned to it. We mitigate by versioning the brief set; cross-run comparisons within the same version are valid, cross-version comparisons aren't.
- The 'future' verdict is judgement-heavy
Forward-looking signals can't be web-verified by definition. We protect them with a dedicated bucket scored mid-range, but it relies on the judge's plausibility call.
- Maturation comes from volume
Confidence in a benchmark like this comes from running it many times across many domains, watching the rankings stabilize, and cross-validating with different judges. We'll keep publishing dated runs as new models ship.
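To make the mid-range treatment of the 'future' verdict concrete, here is a sketch of turning judge verdicts into a Verifiability score. Only the six bucket names and the mid-range placement of 'future' come from the text; every number, the relative order of the other buckets, and the simple averaging are assumptions.

```python
# Hypothetical per-verdict scores on a 0-100 scale (assumed values).
VERDICT_SCORES = {
    "grounded": 100,
    "indicative": 75,
    "future": 50,      # mid-range: neither rewarded as verified nor punished as false
    "speculative": 40,
    "dubious": 15,
    "fabricated": 0,
}

def verifiability(verdicts: list[str]) -> float:
    """Average the per-signal verdict scores for one model output."""
    if not verdicts:
        return 0.0
    return sum(VERDICT_SCORES[v] for v in verdicts) / len(verdicts)

print(verifiability(["grounded", "grounded", "future", "fabricated"]))  # 62.5
```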
Reproducible & open
- Run ID: 2026-05-13T10-10-56-382Z
- Date: May 28, 2026
- Briefs version: 2026-05-v1
- Verifier judge: google/gemini-2.5-flash:online
- Specificity judge: google/gemini-2.5-flash
- Industries: Healthcare Regulated AI, Fintech Stablecoin Rails, Defense Autonomous Systems, Climate Adaptation Capital, Retail GenAI Commerce, Biotech Platform Shifts, Energy Grid Electrification, Education AI Tutors, Geopolitics Tech Blocs, AI Infrastructure Scaling, Mobility Autonomous Fleets, Food AgTech Shifts
- Compute cost: $108.97
Benchmark methodology and runner code are open. The same models run every scan inside the product.