LLM evals



Every week some company releases another LLM that blows the previous models out of the water, according to the benchmarks. The charts only go up. The benchmarks are useful on some level. But honestly, they are pretty weird.

If you’re doing anything at all interesting with large language models, you need to set up your own evals. Whether you’re trying to extract committee names from political emails, classify campaign expenditures or keep a tracker updated, I promise that testing your own use cases will tell you far more than the benchmarks. Only setting up your own evals will tell you which combination of models and prompts works best for you. After all, you will be directly testing how you actually use them! A minimal sketch of what that can look like is below.
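Here is a rough sketch of the idea, using the committee-name example. The model names and the `call_model` helper are placeholders, not any particular library; the point is just that the test cases come from your own work, not from someone else's benchmark.

```python
# A minimal homemade eval: run each model against your own cases
# and count how many it gets right. call_model() is a placeholder;
# swap in whatever client library you actually use.

CASES = [
    # (input text, expected committee name)
    ("Paid for by Friends of Jane Smith", "Friends of Jane Smith"),
    ("This message was sent by the Doe Victory Fund.", "Doe Victory Fund"),
]

MODELS = ["model-a", "model-b"]  # hypothetical model names
PROMPT = (
    "Extract the committee name from this email. "
    "Reply with the name only.\n\n{text}"
)


def call_model(model: str, prompt: str) -> str:
    """Placeholder: call your LLM provider of choice and return its text reply."""
    raise NotImplementedError


def run_eval() -> None:
    for model in MODELS:
        correct = 0
        for text, expected in CASES:
            answer = call_model(model, PROMPT.format(text=text)).strip()
            if answer == expected:
                correct += 1
        print(f"{model}: {correct}/{len(CASES)} correct")


if __name__ == "__main__":
    run_eval()
```

Even a toy harness like this answers the question the public leaderboards can't: which model and prompt do best on the exact task in front of you.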

Unfortunately, setting up evals remains a bit painful. There are lots of ways to test LLMs, but they all feel a bit messy. Trying out a bunch of them helped me figure out which features I’m actually looking for.
