Just ran some evals on Claude Sonnet 4.5. It’s better than 4 on some but worse on a lot. LLM progress is so weird. You really gotta test this stuff on what you care about.
Visual journalist/hacker covering AI