Benchmark

GPT-5.4

OpenAI · openai/gpt-5.4

Composite

Verifiability

Specificity

Currency

Coverage

Briefs evaluated: 12

Total signals: 192

Run: 2026-05-13

Verifier: google/gemini-2.5-flash:online

Specificity judge: google/gemini-2.5-flash

Per-industry signals

12 industries · each signal lists its category, title, verdict, sub-scores, newest source date, and judge commentary.

Clinical
Ambient Scribe Liability Reviews
Grounded
Hospitals document diagnostic and medication errors linked to ambient AI scribes that omit symptoms, allergies, or negations in clinical notes. Signals immediate need for clinician verification standards, audit trails, and specialty-specific deployment limits.
verif 100 · spec 85 · cur 70 · newest src 2025-08-01
Judge · Multiple sources confirm risks of omissions and hallucinations, impacting diagnoses and treatments. Clinician review, logging, and evaluation frameworks are crucial for safety.
Writing · Concrete actor, event, and quantifiable shift included. Future-tense recommendations cost a slight deduction.
Clinical
Sepsis Model Drift Incidents
Indicative
Health systems report sepsis alert performance changes after EHR upgrades, population shifts, and revised lab workflows alter input patterns. Indicates immediate relevance for continuous validation, recalibration schedules, and oversight of model-dependent care pathways.
verif 60 · spec 65 · cur 50 · newest src 2025-05-08
Judge · Multiple sources highlight AI model variability and the need for localized validation and recalibration due to differing patient populations and clinical contexts, implying drift.
Writing · Concrete actor (health systems), event (EHR upgrades), and temporal anchor (after). Deductions for an implicit 'many' and for 'alter input patterns'. Uses active voice effectively.
Clinical
AI Imaging Triage Overrides
Grounded
Radiology services track cases where AI triage flags differ from radiologist prioritization, especially in stroke, fracture, and chest imaging queues. Signals immediate pressure to define override authority, escalation rules, and documentation for discrepant findings.
verif 100 · spec 85 · cur 100 · newest src 2026-03-01
Judge · The increasing use of AI for triage in radiology, as highlighted by recent research in mammography, makes the tracking of AI-radiologist discrepancies a present concern.
Writing · Concrete actors (radiology services, AI triage, radiologist), specific events (flags, prioritization), and a specific domain.
Clinical
LLM Discharge Instruction Errors
Grounded
Pilot programs find large language models producing discharge instructions with reading-level mismatches, dosing ambiguities, and unsupported follow-up advice. Indicates immediate relevance for human review, multilingual validation, and standardized patient education controls.
verif 100 · spec 65 · cur 50 · newest src 2024-11-20
Judge · Multiple studies identify hallucinations, medication errors (including dosage omissions), and lack of personalization in AI-generated discharge instructions. Human review is consistently emphasized.
Writing · Good concrete problems identified; lacks specific actor/event for higher score.
Regulatory
EU AI Act Device Mapping
Grounded
Health technology vendors map clinical AI products to EU AI Act risk tiers alongside MDR and IVDR classifications. Signals immediate compliance work for procurement criteria, documentation requests, and post-market monitoring responsibilities.
verif 100 · spec 65 · cur 100 · newest src 2026-04-15
Judge · The EU AI Act classifies AI systems in healthcare by risk, imposing specific compliance and oversight requirements. This necessitates comprehensive mapping and governance shifts.
Writing · Concrete actor (EU AI Act, Europe), event (risk mapping), and shift (procurement to governance).
Regulatory
FDA AI Change Control Plans
Grounded
FDA guidance discussions center on predetermined change control plans for software functions that update through machine learning. Indicates immediate relevance for vendor contracts, validation evidence, and governance of model modifications after deployment.
verif 100 · spec 65 · cur 70 · newest src 2025-08-18
Judge · The FDA has issued guidance on PCCPs for AI-enabled devices, with ongoing research into evaluation methods.
Writing · Concrete actor/event, but 'discussions center on' is passive and 'immediate relevance' is weak.
Regulatory
OCR AI Privacy Enforcement
Grounded
US regulators scrutinize health data flows to analytics and AI tools that transmit identifiers through tracking pixels, prompts, and cloud logs. Signals immediate need for HIPAA risk reviews, vendor restrictions, and logging minimization practices.
verif 100 · spec 65 · cur 10 · newest src 2024-05-06
Judge · OCR is actively investigating AI-related complaints and emphasizing HIPAA compliance for AI, especially concerning tracking technologies and vendor agreements.
Writing · Concrete actor (US regulators), event (scrutiny), and anchor (HIPAA) are present. 'Immediate need' is a slight deduction.
Regulatory
Algorithmic Bias Audit Mandates
Grounded
State and EU policymakers advance rules requiring impact assessments, dataset documentation, and bias testing for high-risk automated decisions. Indicates immediate relevance for hospital governance committees, evidence retention, and procurement due diligence.
verif 100 · spec 65 · cur 100 · newest src 2026-05-13
Judge · Both US and EU regulations mandate bias audits and impact assessments for AI, especially in healthcare, with compliance deadlines imminent.
Writing · Concrete actors (State, EU), specific actions (impact assessments, dataset documentation, bias testing). Lacks a specific quantitative or temporal anchor.
Operational
GPU Capacity Allocation Conflicts
Speculative
Hospital IT teams face compute bottlenecks as imaging, documentation, and revenue-cycle AI projects compete for limited GPU and cloud budgets. Signals immediate need for portfolio prioritization, usage metering, and cost-to-value tracking.
verif 80 · spec 65 · cur 50 · newest src 2025-01-01
Judge · While the impact of resource constraints on AI adoption in healthcare is acknowledged, specific evidence linking it directly to GPU capacity allocation conflicts is not explicitly detailed across multiple sources within the provided context.
Writing · Concrete actors (Hospital IT teams, AI projects) and events. Lacks a quantitative/temporal anchor.
Operational
EHR Copilot Workflow Friction
Speculative
Clinicians report extra clicks, inbox clutter, and note-reconciliation work when EHR copilots insert suggestions outside established documentation patterns. Indicates immediate relevance for workflow redesign, usability testing, and specialty-level adoption metrics.
verif 80 · spec 65 · cur 85 · newest src 2025-11-17
Judge · No direct mention of 'extra clicks,' 'inbox clutter,' or 'note-reconciliation work' from EHR copilots, but studies indicate a dynamic impact on workflow and the need for careful review of AI-generated notes, suggesting potential friction points.
Writing · Concrete actor (Clinicians, EHR copilots), specific frictions named, no vague qualifiers or hype.
Operational
Shadow AI Procurement Gaps
Grounded
Departments purchase AI tools through local budgets or free trials, bypassing security review, integration checks, and data-processing assessments. Signals immediate need for centralized intake, inventory controls, and standardized contract language.
verif 100 · spec 65 · cur 100 · newest src 2026-03-16
Judge · Shadow AI is widespread in healthcare, often bypassing IT and compliance for speed and functionality.
Writing · Concrete actors (Departments), clear actions, and identified gaps. Lacks specific temporal/quantitative anchors.
Operational
Vendor Indemnity Clause Disputes
Grounded
Contract negotiations increasingly focus on responsibility for clinical harm, copyright claims, and regulatory violations tied to generative AI outputs. Indicates immediate relevance for legal review, insurance coverage checks, and deployment approval thresholds.
verif 100 · spec 65 · cur 100 · newest src 2026-05-13
Judge · Indemnification clauses are increasingly covering regulatory non-compliance, algorithmic bias, and IP infringement in healthcare AI contracts, driven by evolving regulations and potential for patient harm.
Writing · Concrete areas of dispute are named. Lacks specific actors, products, or quantitative anchors.
Patient Trust
Consent Questions on AI Notes
Grounded
Patients ask whether ambient listening tools record encounters, store audio, or train models using sensitive visit conversations. Signals immediate relevance for disclosure language, consent workflows, and visible safeguards during appointments.
verif 100 · spec 65 · cur 85 · newest src 2025-12-23
Judge · Multiple sources confirm patient concerns regarding ambient AI recording, data use, and the need for clear consent. HHS is seeking feedback on related issues.
Writing · Concrete actor (patients), event (ask questions), and immediate relevance. Lacks a temporal anchor.
Patient Trust
Portal Chatbot Confidence Gaps
Indicative
Patient portal users report confusion when chatbots answer billing, triage, or medication questions without clear sourcing or escalation paths. Indicates immediate relevance for labeling rules, handoff options, and response quality monitoring.
verif 60 · spec 65 · cur 100 · newest src 2026-05-09
Judge · While specific user confusion isn't detailed, sources highlight AI risks in patient portals and underscore the need for transparency, clear governance, and monitoring.
Writing · Concrete actor/event (Portal Chatbot/users) and a clear problem identified. Lacks a temporal or quantitative anchor.
Patient Trust
Bias Concerns in Risk Scores
Grounded
Community groups challenge algorithmic risk scores that use proxies linked to race, disability, language, or prior access patterns. Signals immediate relevance for explainability materials, fairness reviews, and stakeholder engagement in deployment decisions.
verif 100 · spec 55 · cur 100 · newest src 2026-03-25
Judge · Multiple sources confirm concerns about algorithmic bias, particularly in Medicare Advantage, impacting equitable access and patient outcomes. Regulations are emerging to address this.
Writing · No actor/event/quantity; uses active voice; concrete topics.
Patient Trust
Data Use Objections for Training
Grounded
Patients and advocates object when de-identified records support vendor model training without clear notice, opt-out processes, or benefit-sharing terms. Indicates immediate relevance for data governance transparency, contract disclosures, and public communication.
verif 100 · spec 45 · cur 85 · newest src 2025-12-29
Judge · Multiple sources highlight concerns over data privacy, transparency, consent challenges, and regulatory complexity in AI development using patient data, emphasizing the need for clear governance and stakeholder involvement.
Writing · No concrete actor, event, product or quantitative anchor. Uses vague concepts like "clear notice".
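As a rough cross-check on the Healthcare Regulated AI brief, here is a small sketch that tallies the verdicts and averages the three sub-scores across its 16 signals. The values are transcribed from the score lines above. Taking an unweighted mean is an assumption made for illustration only. It is not the benchmark's Composite, whose formula this page does not state. If every brief contributes 16 signals, then 12 briefs × 16 gives the reported total of 192.

```python
from collections import Counter
from statistics import mean

# (title, verdict, verif, spec, cur), transcribed from the 16 signals above
SIGNALS = [
    ("Ambient Scribe Liability Reviews", "Grounded", 100, 85, 70),
    ("Sepsis Model Drift Incidents", "Indicative", 60, 65, 50),
    ("AI Imaging Triage Overrides", "Grounded", 100, 85, 100),
    ("LLM Discharge Instruction Errors", "Grounded", 100, 65, 50),
    ("EU AI Act Device Mapping", "Grounded", 100, 65, 100),
    ("FDA AI Change Control Plans", "Grounded", 100, 65, 70),
    ("OCR AI Privacy Enforcement", "Grounded", 100, 65, 10),
    ("Algorithmic Bias Audit Mandates", "Grounded", 100, 65, 100),
    ("GPU Capacity Allocation Conflicts", "Speculative", 80, 65, 50),
    ("EHR Copilot Workflow Friction", "Speculative", 80, 65, 85),
    ("Shadow AI Procurement Gaps", "Grounded", 100, 65, 100),
    ("Vendor Indemnity Clause Disputes", "Grounded", 100, 65, 100),
    ("Consent Questions on AI Notes", "Grounded", 100, 65, 85),
    ("Portal Chatbot Confidence Gaps", "Indicative", 60, 65, 100),
    ("Bias Concerns in Risk Scores", "Grounded", 100, 55, 100),
    ("Data Use Objections for Training", "Grounded", 100, 45, 85),
]


def summarize(signals):
    """Count verdicts and take the unweighted mean of each sub-score.

    The unweighted mean is an illustrative assumption, not the
    benchmark's Composite formula.
    """
    verdicts = Counter(verdict for _, verdict, _, _, _ in signals)
    means = {
        "verif": mean(s[2] for s in signals),
        "spec": mean(s[3] for s in signals),
        "cur": mean(s[4] for s in signals),
    }
    return verdicts, means


if __name__ == "__main__":
    verdicts, means = summarize(SIGNALS)
    print(dict(verdicts))
    print({name: round(value, 2) for name, value in means.items()})
```

In these 16 rows the verdict tracks the verif score exactly. Every Grounded signal scores 100, the two Speculative signals score 80, and the two Indicative signals score 60.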

Healthcare Regulated AI

Fintech Stablecoin Rails

Defense Autonomous Systems

Climate Adaptation Capital

Retail Genai Commerce

Biotech Platform Shifts

Energy Grid Electrification

Education AI Tutors

Geopolitics Tech Blocs

AI Infrastructure Scaling

Mobility Autonomous Fleets

Food AgTech Shifts