JAMES SCHLAUCH · CONSULTING PRACTICE

Signal.

What I track. Frontier model benchmarks, recent releases, San Diego AI/ML community signals, and the market moves that change my recommendations. Curated weekly. Last updated .

LIVE TRACKER 28 signals tracked
BENCHMARKS

Frontier model benchmarks

What's leading the public arenas this week.

  • Artificial Analysis and IBM Research released ITBench-AA, the first benchmark for agentic enterprise IT work (incident response, SRE, config remediation), and every frontier model scored under 50%. What changes for buyers: the gap between demo-grade agent reliability and production IT autonomy is now measurable — scope agent pilots to human-in-the-loop, not lights-out, until scores move.

    Artificial Analysis × IBM Research Benchmarks
  • The Open ASR Leaderboard introduced private evaluation datasets to counter models optimized specifically for public test sets, a practice increasingly common as benchmark scores drive vendor selection. Operator note: any ASR vendor claiming leaderboard-ranked accuracy should be asked whether their eval data is public or private — the gap between public-set performance and real-distribution performance is now a known failure mode.

    Hugging Face Benchmarks
  • Anthropic internal research found overall Claude sycophancy rates of 9%, rising to 38% for spirituality topics and 25% for relationship discussions. Operator implication: any deployment where frank, unbiased advice is the value proposition — financial guidance, legal review, clinical second opinions — needs domain-specific sycophancy testing in the evaluation suite, not just generic safety evals.

    Anthropic Frontier
  • Frontier-tier Elo scores for the top three labs are within 12 points of each other, which is inside statistical noise. For practical buyer decisions, the tie at the top means model selection should be driven by latency, cost, and tooling fit, not arena rank.

    LMSYS Benchmarks
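
To put the 12-point Elo gap in the last benchmark signal into head-to-head terms, here is a minimal sketch that assumes the conventional 400-point logistic Elo scale. The function name is illustrative, and a given arena may fit ratings with a different model (for example a Bradley-Terry variant), so treat the number as an approximation.

```python
def elo_expected_score(rating_gap: float, scale: float = 400.0) -> float:
    """Expected head-to-head win share for the higher-rated model.

    Assumes the conventional logistic Elo formula with a 400-point scale.
    """
    return 1.0 / (1.0 + 10.0 ** (-rating_gap / scale))


# A 12-point lead on this scale is close to a coin flip.
win_share = elo_expected_score(12)
print(f"Expected win share at +12 Elo: {win_share:.1%}")
```

Under these assumptions a 12-point lead corresponds to roughly a 51.7% expected win share, which is why latency, cost, and tooling fit are more useful tie-breakers than rank.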
RELEASES

Model releases

What shipped, and what it means for production.

  • Cisco is deploying OpenAI Codex across its engineering org to scale AI-native development, accelerate its AI Defense work, and automate defect remediation. Practical implication: large-enterprise coding-agent rollouts are moving from pilots to org-wide standardization — procurement and security review of agent tooling is becoming a board-visible line item.

    OpenAI Frontier
  • Anthropic announced expanded Claude usage limits alongside a compute arrangement with SpaceX, easing capacity constraints that have throttled high-volume enterprise usage. Practical implication: rate-limit architectures built around the old ceiling are worth revisiting; the cost of retry logic and fallback chains may drop.

    Anthropic Market
  • Anthropic introduced Claude agent capabilities tailored to financial services workflows, marking the lab's first major vertical-specific product release. Practical implication for regulated-industry buyers: procurement conversations will now reference a named product path rather than generic API integration; compliance and audit trail requirements should be scoped against Anthropic's published enterprise terms.

    Anthropic Frontier
  • OpenAI replaced ChatGPT's default model with GPT-5.5 Instant, citing improved accuracy and reduced hallucination rates alongside user-personalization controls. Operator note: default-tier API behavior may shift for existing integrations — benchmark against your current system prompt before assuming behavior is stable.

    OpenAI Frontier
  • The 1M-token context window for Opus 4.7 leaves beta. Practical implication for active engagements: full-codebase RAG indexes can be replaced with single-prompt context loads for prompts above 200K tokens. Cache hit rate becomes the cost-determining variable.

    Anthropic Frontier
  • Anthropic's frontier reasoning model gains a 1M-token context window in beta. Practical implication: full-codebase analysis in a single prompt becomes viable for medium-sized monorepos. The long-context tier carries a pricing premium relative to the 200K-window tier, so cache hit rate becomes load-bearing for cost.

    Anthropic Frontier
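
The long-context signals above note that cache hit rate drives cost. The sketch below shows the arithmetic under stated assumptions: the two per-million-token prices are placeholders and not any vendor's published rates, the function name is hypothetical, and cache-write surcharges are ignored for simplicity.

```python
def blended_input_cost(tokens: int, hit_rate: float,
                       base_price_per_mtok: float,
                       cached_price_per_mtok: float) -> float:
    """Dollar cost of one prompt's input tokens for a given cache hit rate."""
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    cached = tokens * hit_rate          # tokens billed at the cached rate
    uncached = tokens - cached          # tokens billed at the full rate
    total = cached * cached_price_per_mtok + uncached * base_price_per_mtok
    return total / 1_000_000


# Placeholder prices for illustration only.
BASE_PRICE, CACHED_PRICE = 10.0, 1.0

for rate in (0.0, 0.5, 0.9):
    cost = blended_input_cost(1_000_000, rate, BASE_PRICE, CACHED_PRICE)
    print(f"hit rate {rate:.0%}: ${cost:.2f} per 1M-token prompt")
```

With these placeholder prices, moving from a 0% to a 90% hit rate cuts the per-prompt input cost from $10.00 to $1.90, so the same workload can differ in cost by about five times depending on how stable the cached prefix is.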
TOOLING

Tooling & infra

Frameworks, runtimes, and platforms moving the operating cost.

  • Willison wrote that the distinction between vibe coding and professional agentic engineering has narrowed as coding agents become more reliable: skipping code review feels uncomfortable but is increasingly common, like trusting another team's code without reading it. Practical implication: the differentiation between amateur and professional agentic work is shifting from 'does it work' to 'do you understand what it's doing and why.'

    Simon Willison Frontier
  • MLX 3.0 lands unified-memory model serving for Apple Silicon, collapsing CPU/GPU transfer overhead for on-device inference. For SoCal teams running edge-AI prototypes on M-series workstations, this changes the local-development cost curve and may shift some 'GPU-required' workflows back to laptop-class hardware.

    Apple ML Research Tooling
  • Vercel AI SDK 5 makes streaming tool calls the default pattern. For Astro/Next-based production AI surfaces, this collapses a meaningful chunk of glue code. Practical implication: prototypes ship a week earlier; production review cycles unchanged.

    Vercel Tooling
  • Independent crawl reports llms.txt adoption above 38% among top-1000 ranked sites in technical-content categories — up from <8% at start of Q1. Generative-engine optimization is no longer a trailing-edge bet.

    Hugging Face GEO/AEO
POLICY

Policy & governance

What regulators and standards bodies are doing.

  • Public comment period opened on proposed rules requiring disclosure of AI-generated content in commercial communications. Operative for any consumer-facing AI workflow; regulated-industry buyers should expect compliance-review pickup within Q3.

    Federal Trade Commission Governance
MARKET

Market signals

Funding, M&A, and structural moves in the practice's domain.

  • Willison argues both frontier labs have found product-market fit, citing reports that Anthropic is approaching its first profitable quarter. What changes for buyers: the two leading model vendors are trending toward financial durability, which lowers the multi-year continuity risk of building core workflows on their APIs — though single-vendor lock-in still warrants an abstraction layer.

    Simon Willison Market
  • San Diego-based Kneron is positioning its full-stack hardware-plus-software offering for AI's shift from training to inference workloads. For the local market: a homegrown SoCal player in edge/inference silicon — relevant when evaluating on-prem or edge deployment options beyond the hyperscaler GPU stack.

    San Diego Business Journal Market
  • OpenAI placed in the Leaders quadrant of the inaugural 2026 Gartner Magic Quadrant for Enterprise AI Coding Agents, cited for Codex's enterprise-scale deployment. What changes for buyers: analyst coverage of coding agents now exists as a category — expect it to surface in vendor-selection RFPs, so weigh MQ placement against your own eval harness rather than in place of it.

    OpenAI / Gartner Market
  • SpaceX's IPO filing frames orbital data centers as its bet to out-compute Big Tech on AI, even as xAI's Grok lags rival services. What changes for buyers: space-based compute is still speculative, but it signals that frontier AI capacity planning is now a capital-markets narrative — treat any near-term capacity promises tied to it as roadmap, not availability.

    Ars Technica Market
  • Anthropic announced a joint venture with Blackstone, Hellman & Friedman, and Goldman Sachs to build a dedicated enterprise AI services company. What changes for buyers: the frontier lab is now competing in the professional services tier, not just selling API access — a structural shift that changes vendor selection conversations.

    Anthropic Market
  • OpenAI and PwC announced a collaboration targeting CFO office automation — forecasting, internal controls, and reporting workflows — using AI agents. What changes for buyers: vendor-led AI is moving into finance and audit; board-level expectations around AI governance and explainability in financial reporting will increase, not decrease.

    OpenAI Market
  • Aggregator passes through Q2 frontier-model price drops; cost-per-million-tokens for top-tier reasoning models down 18-24% versus Q1. Implication for engagements with active token spend: re-bid the next 90 days.

    OpenRouter Market
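
The first market signal above recommends an abstraction layer against single-vendor lock-in. One way that could look is sketched below, assuming each vendor SDK is wrapped in a plain callable. All names here (`Provider`, `complete_with_fallback`, the stub providers) are hypothetical, and no real vendor API is called.

```python
from typing import Callable, Sequence

# A provider is any callable that takes a prompt and returns text.
# Real vendor SDK calls would be wrapped to fit this shape (assumption).
Provider = Callable[[str], str]


class AllProvidersFailed(RuntimeError):
    """Raised when every provider in the chain has errored."""


def complete_with_fallback(
    prompt: str, providers: Sequence[tuple[str, Provider]]
) -> tuple[str, str]:
    """Try providers in order; return (name, text) from the first success."""
    errors: list[str] = []
    for name, call in providers:
        try:
            return name, call(prompt)
        except Exception as exc:  # in production, catch the SDK's own error types
            errors.append(f"{name}: {exc}")
    raise AllProvidersFailed("; ".join(errors))


# Stub providers stand in for real API clients.
def flaky_primary(prompt: str) -> str:
    raise TimeoutError("rate limited")


def stable_secondary(prompt: str) -> str:
    return f"echo: {prompt}"


chain = [("primary", flaky_primary), ("secondary", stable_secondary)]
print(complete_with_fallback("hello", chain))
```

Keeping application code dependent only on the `Provider` shape means a vendor swap or a re-bid on price touches one wrapper and leaves the calling code alone.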
LOCAL

San Diego AI/ML calendar

What's happening in the local community.

  • Workshop on expert label disagreement in medical imaging, fine-tuning foundation models (UNI, MedSAM2, BiomedCLIP) on curated datasets, and using FiftyOne for evaluation, active learning, and regulatory readiness. Relevant for teams navigating FDA AI/ML guidance in production medical imaging pipelines.

    San Diego AI/ML and Computer Vision Meetup Community
  • The San Diego AI/ML & Computer Vision Meetup is hosting a three-day virtual Best of CVPR series July 8–10, featuring researchers presenting accepted papers from the 2026 conference. For the local community: a low-cost way to track frontier computer-vision research without traveling to CVPR.

    San Diego AI/ML & Computer Vision Meetup Community
  • The San Diego AI/ML and CV community meets online May 14 (9–11 AM Pacific) with talks on evaluating AI agents with FiftyOne and MCP, real-world document AI beyond OCR, and energy-intelligent inference infrastructure. Registration via Meetup.

    San Diego AI/ML and Computer Vision Meetup Community
  • May 2026 GenAI/agents meetup at the Google Cloud San Diego venue. Active CFP for practitioner talks; sponsor slots include venue + food sponsorship paths. Highest-density local builder audience in San Diego right now.

    AICamp Community
  • Recurring SD meetup for GenAI/LLM/agent practitioners. Strong venue for practitioner-tier conversations and informal benchmarking among local builders. Open to sponsor and speaker proposals.

    AICamp Community
  • Conference + workshops + bootcamp at the Hyatt Regency La Jolla, June 1–5. Closest analog to a pure-ML conference in San Diego this cycle. Sponsor and speaker calendars worth tracking for 2027.

    MLcon Community

Methodology

How this tracker works.

Signals are curated weekly from the LMSYS Chatbot Arena leaderboard, the OpenRouter model rankings, the Hugging Face trending board, the Stanford AI Index, the AI Now Institute, the Federal Trade Commission's AI rulemaking docket, the local San Diego AI/ML/Computer-Vision Meetup calendar, and selected industry publications. Each entry links to the primary source.

The tracker exists for one reason: I want my buyers — VPs of Engineering, CDOs, and Chief AI Officers — to be able to read one page once a week and know what changed in their domain. If a signal here changes a recommendation I'm giving in an active engagement, that's the right cadence.

Automation roadmap: a content pipeline (see writing) will surface candidate signals from the RSS feed of San Diego AI news and frontier benchmark releases. New signals are reviewed and pushed weekly.
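
As a rough illustration of the candidate-surfacing step in the roadmap, the sketch below filters RSS items by keyword using only the Python standard library. This is not the practice's actual pipeline: the sample feed, the keyword list, and the function name are all assumptions, and a real version would fetch the feed over HTTP and hand the hits to a human for review.

```python
import xml.etree.ElementTree as ET

# Inline sample feed so the sketch runs without network access.
SAMPLE_FEED = """<?xml version="1.0"?>
<rss version="2.0"><channel>
  <title>Example feed</title>
  <item><title>New agent benchmark released</title><link>https://example.com/a</link></item>
  <item><title>Local restaurant opens</title><link>https://example.com/b</link></item>
</channel></rss>"""

KEYWORDS = ("benchmark", "model", "inference")  # hypothetical filter terms


def candidate_signals(feed_xml: str, keywords=KEYWORDS) -> list[dict]:
    """Return RSS items whose title contains any keyword (case-insensitive)."""
    root = ET.fromstring(feed_xml)
    hits = []
    for item in root.iter("item"):
        title = item.findtext("title", default="")
        link = item.findtext("link", default="")
        if any(word in title.lower() for word in keywords):
            hits.append({"title": title, "link": link})
    return hits


for signal in candidate_signals(SAMPLE_FEED):
    print(signal["title"], "->", signal["link"])
```

Keyword matching is deliberately crude. It only narrows the weekly review queue, and the decision to publish a signal stays with the reviewer.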
