Frontier model benchmarks
What's leading the public arenas this week.
-
Artificial Analysis and IBM Research released ITBench-AA, the first benchmark for agentic enterprise IT work (incident response, SRE, config remediation), and every frontier model scored under 50%. What changes for buyers: the gap between demo-grade agent reliability and production IT autonomy is now measurable — scope agent pilots to human-in-the-loop, not lights-out, until scores move.
-
The Open ASR Leaderboard introduced private evaluation datasets to counter models optimized specifically for public test sets, a practice increasingly common as benchmark scores drive vendor selection. Operator note: ask any ASR vendor claiming leaderboard-ranked accuracy whether that ranking was earned on public or private evaluation data. The gap between public-set performance and real-distribution performance is now a known failure mode.
-
Anthropic internal research found overall Claude sycophancy rates of 9%, rising to 38% for spirituality topics and 25% for relationship discussions. Operator implication: any deployment where frank, unbiased advice is the value proposition — financial guidance, legal review, clinical second opinions — needs domain-specific sycophancy testing in the evaluation suite, not just generic safety evals.
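As a rough illustration of what per-domain testing could look like, here is a minimal Python sketch that turns labeled eval results into a sycophancy rate per domain and flags the domains above a chosen threshold. The function names, the 10% threshold, and the sample labels are all hypothetical; none of this reflects Anthropic's methodology or any particular eval framework.

```python
from collections import defaultdict

def sycophancy_rates(results):
    """Compute the per-domain rate from (domain, is_sycophantic) pairs.

    `results` is assumed to come from a labeled eval run; how responses
    get labeled as sycophantic is outside the scope of this sketch.
    """
    counts = defaultdict(lambda: [0, 0])  # domain -> [flagged, total]
    for domain, flagged in results:
        counts[domain][0] += int(flagged)
        counts[domain][1] += 1
    return {d: flagged / total for d, (flagged, total) in counts.items()}

def domains_over_threshold(rates, threshold):
    """Return the domains whose rate exceeds the threshold, sorted by name."""
    return sorted(d for d, r in rates.items() if r > threshold)

# Made-up labels for illustration only, not real measurements.
sample = [
    ("financial", True), ("financial", False),
    ("financial", False), ("financial", False),
    ("legal", False), ("legal", False),
]
rates = sycophancy_rates(sample)
flagged_domains = domains_over_threshold(rates, threshold=0.10)
print(rates, flagged_domains)
```

The point of splitting by domain is that a low overall rate can hide a high rate in the one domain the deployment actually serves.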
-
Frontier-tier Elo scores among the top three labs are within 12 points of each other, which is inside statistical noise. For practical buyer decisions, the tie at the top means model selection should be driven by latency, cost, and tooling fit, not arena rank.
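To put a 12-point gap in perspective, the sketch below applies the standard Elo expected-score formula, which maps a rating difference to the probability that the higher-rated side wins a head-to-head comparison. It assumes the arena reports ratings on the conventional 400-point logistic scale; individual leaderboards may fit their ratings differently, so treat the result as an approximation.

```python
def elo_expected_score(rating_gap: float) -> float:
    """Expected score for the higher-rated side under the standard Elo model.

    Assumes the conventional 400-point logistic scale; a given arena's
    rating fit may differ in detail.
    """
    return 1.0 / (1.0 + 10.0 ** (-rating_gap / 400.0))

# A 12-point lead, the spread quoted for the top three labs.
win_probability = elo_expected_score(12)
print(f"{win_probability:.3f}")
```

Under that assumption a 12-point lead corresponds to roughly a 51.7% expected win rate, barely distinguishable from a coin flip.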