If users have to redo the AI's work to verify it, the issue is not just evaluation; the product has not made sources, definitions, intermediate steps, and unverifiable claims first-class artifacts.
Digest
What I read today
A daily dig through my RSS feeds — the links and ideas worth keeping, each with a short note on what it's about.
ScarfBench brings enterprise Java migration evaluation back to system reality: success means build, deploy, and behavioral validation all pass, while today's strongest agents still stay below a 10% behavioral success rate.
Sierra's agent engineer and FDE model shows that the hard work is encoding customer workflows, APIs, brand voice, release governance, and verification paths into the system.
Loopcraft moves people from writing one-off prompts into designing loops; goals, feedback, routing, validation, budgets, and permission boundaries are the real leverage in agent systems.
The Claude Code prompt-steganography debate is less about whether anti-abuse detection is reasonable and more about how local developer tools need visible, auditable policies when they sit near files, commands, and credentials.
Parse-don't-validate is not about adding checks everywhere; it is about narrowing untrusted input at request, URL, database, and env boundaries so later code can rely on domain types instead of remembering which checks have already run.
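A minimal sketch of the idea, assuming a hypothetical `Port` type and `parse_port` helper (neither name comes from the linked post): an untrusted string, for example from an environment variable, is narrowed once at the boundary, and downstream code accepts only the parsed type.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Port:
    """Hypothetical domain type: a TCP port known to be in 1..65535."""
    value: int


def parse_port(raw: str) -> Port:
    """Narrow an untrusted string to a Port, or raise ValueError."""
    try:
        number = int(raw.strip())
    except ValueError:
        raise ValueError(f"not an integer: {raw!r}") from None
    if not 1 <= number <= 65535:
        raise ValueError(f"port out of range: {number}")
    return Port(number)


def listen_address(host: str, port: Port) -> str:
    # No range check here: holding a Port means parsing already succeeded.
    return f"{host}:{port.value}"
```

Calling `listen_address("localhost", parse_port(" 443 "))` yields `"localhost:443"`, while `parse_port("70000")` raises at the boundary instead of deep inside the program.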
A vector-add kernel hides a full protocol stack: nvcc, PTX/SASS, the host launch stub, driver ioctls, pushbuffers, GPFIFO, QMD, doorbells, SMs, and warp scheduling.
Cloudflare's truncated-response bug came from hyper's HTTP/1 state machine discarding Poll::Pending: bytes stayed buffered, the connection shut down, and clients saw an early EOF.
Henrico County shows that AI and cloud infrastructure costs do not stay inside cloud invoices; they spill into local grid investment, rate allocation, and public-institution power bills.
Low-dose radiation risk cannot be collapsed into one cumulative-dose fear index; the useful frame separates dose rate, exposure path, control, consent, and statistical uncertainty.
GLM-5.2 scores 39% F1 on IDOR detection, ahead of Claude Code's 32%, but Semgrep's own multimodal harness reaches 53-61%; the useful comparison is the full system of model, context selection, output parsing, and execution loop.
Ignore files reduce noise and express intent, but if the agent process can still read a secret, tool output, search results, and logs can leak it; the real boundary has to come from the OS, containers, VMs, or least-privilege credentials.
Jon Udell argues against reducing people to approval buttons; the better design keeps human plans, queues, review, and history as the main loop, with agents joining through visible, recoverable small steps.
ClickHouse's Rust rewrite of its WAL archiver matters less as a generic speed story than as a resource-predictability story: under WAL-heavy load, virtual memory falls from nearly 2.8 GB to under 1 GB.
Open-weight releases are no longer a single movement led by a few players; pure model makers, Big Tech, product companies, and sovereign AI efforts all open models for different economic reasons.
Gary Marcus reads China's model catch-up as a no-moat story: more competitors, lower token prices, thinner margins, and a costly paradigm whose capability lead may not become a durable business moat.
On the two loops inside agentic coding — the inner agent loop that ends when the model says "done," and the outer harness loop that decides whether to keep going — and why the second is remarkable on disposable, verifiable work but corrosive on code meant to last.
How the November 2025 agents multiplied code output while human review stayed flat — and how Meta's largest-ever incident traced back to AI-written, AI-reviewed code shipping past a gutted Trust & Safety team.
How Coinbase compressed its delivery cycle from 20 days to 1.8 using Plan Mode and five-to-seven parallel agents, with 75% of pull requests now opened by an agent.
A research finding that models infer who is speaking from a text's style rather than its role tags — and that rewriting an attack to read slightly off-format drops its success rate from 61% to 10%.
On an automated red-teamer that now out-ranks human professionals, the finding that larger models are not automatically safer, and the "Lethal Trifecta" — untrusted input, private data, and an exfiltration path together.
Why GLM-5.2 is the first open-weight model that works as a general agent inside a Claude Code-style harness, narrowing the US–China gap to about 6.8 months at a fraction of the price.
A 3-billion-parameter model that ties 600B–1T flagships on math and competitive programming where answers are machine-checkable, via a two-stage "Spectrum-to-Signal" post-training recipe.
IBM's open harness tops AppWorld and WebArena on an open-weight model by moving planning, state, and reflection into the harness, leaving developers to write only tools and prompts.
Flat Mercurial-style manifests and lazy mounting give an agent seconds-to-first-edit on a multi-GB monorepo without cloning the whole thing — at the cost of leaving the Git ecosystem behind.
An argument that memcached suits caching precisely because it does less — no persistence, no clustering — forcing correct "cache can vanish" semantics and sidestepping the Redis-as-database trap.
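A rough illustration of "cache can vanish" semantics, using a hypothetical in-process `VolatileCache` as a stand-in for memcached rather than a real client library. The point of the sketch is that every read path must tolerate a miss and fall back to the source of truth; `get_user` and `load_from_db` are illustrative names only.

```python
class VolatileCache:
    """Stand-in for a memcached-like store: entries may disappear at any time."""

    def __init__(self):
        self._data = {}

    def get(self, key):
        return self._data.get(key)

    def set(self, key, value):
        self._data[key] = value

    def flush(self):
        # Simulates a restart or eviction: everything is gone.
        self._data.clear()


def get_user(user_id, cache, load_from_db):
    """Cache-aside read: the cache is an optimization, never the only copy."""
    key = f"user:{user_id}"
    cached = cache.get(key)
    if cached is not None:
        return cached
    value = load_from_db(user_id)  # source of truth
    cache.set(key, value)
    return value
```

After `cache.flush()`, the next `get_user` call still returns the correct record; it just pays for one extra database read.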
How GPT-5 Pro gave an immunologist a new angle on T-cell behavior that explained an experiment he had been unable to account for over three years.
How FromSoftware builds boss behavior without planning algorithms — a pushdown-automaton goal stack, weighted-random action selection, and interrupt callbacks that keep designers in full control.
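The three ingredients named above can be sketched in a few lines. This is an illustrative toy under stated assumptions, not FromSoftware's actual code: `Goal`, `BossBrain`, the action names, and the tick counts are all hypothetical.

```python
import random
from dataclasses import dataclass


@dataclass
class Goal:
    name: str
    ticks_left: int  # updates remaining before the goal completes


class BossBrain:
    def __init__(self, action_table, rng=None):
        # action_table maps action name -> designer-chosen weight.
        self.action_table = action_table
        self.rng = rng or random.Random()
        self.stack = []  # goal stack: the top goal is the one running

    def pick_action(self):
        # Weighted-random selection: no planning, just tuned weights.
        names = list(self.action_table)
        weights = [self.action_table[n] for n in names]
        return self.rng.choices(names, weights=weights, k=1)[0]

    def on_interrupt(self, reaction, ticks=1):
        # Interrupt callback: push a reaction on top of the current goal,
        # which resumes once the reaction is popped.
        self.stack.append(Goal(reaction, ticks))

    def update(self):
        if not self.stack:
            self.stack.append(Goal(self.pick_action(), ticks_left=2))
        goal = self.stack[-1]
        goal.ticks_left -= 1
        if goal.ticks_left <= 0:
            self.stack.pop()
        return goal.name
```

With a table such as `{"slash": 3, "leap": 1}`, a designer changes behavior by editing weights; calling `on_interrupt("dodge")` mid-attack runs the dodge first and then resumes the interrupted goal.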
Why the bottleneck for AI data centers is not power but a first-come interconnection queue that fills with speculative projects — and how auctioning slots and pricing flexibility could clear it.
How land reclamation stalled across the West around 1970 — not by prohibition, but by litigable environmental review that pushed single-project approval times into decades.