LLM evals



Every week some company releases another LLM that blows the previous models out of the water, according to the benchmarks. The charts only go up. The benchmarks are useful on some level. But honestly, they are pretty weird.

If you’re doing anything at all interesting with large language models, you need to set up your own evals. Whether you’re trying to extract committee names from political emails, classify campaign expenditures or keep a tracker updated, I promise that testing your own use cases will tell you far more than the benchmarks. Only setting up your own evals will tell you which combination of models and prompts works best for you. After all, you will be directly testing how you actually use them! A minimal sketch of what that can look like is below.
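Here is a rough sketch of the idea, using the committee-name example. The model names and the `call_model` helper are placeholders, not any particular library; the point is just that the test cases come from your own work, not from someone else's benchmark.

```python
# A minimal homemade eval: run each model against your own cases
# and count how many it gets right. call_model() is a placeholder;
# swap in whatever client library you actually use.

CASES = [
    # (input text, expected committee name)
    ("Paid for by Friends of Jane Smith", "Friends of Jane Smith"),
    ("This message was sent by the Doe Victory Fund.", "Doe Victory Fund"),
]

MODELS = ["model-a", "model-b"]  # hypothetical model names
PROMPT = (
    "Extract the committee name from this email. "
    "Reply with the name only.\n\n{text}"
)


def call_model(model: str, prompt: str) -> str:
    """Placeholder: call your LLM provider of choice and return its text reply."""
    raise NotImplementedError


def run_eval() -> None:
    for model in MODELS:
        correct = 0
        for text, expected in CASES:
            answer = call_model(model, PROMPT.format(text=text)).strip()
            if answer == expected:
                correct += 1
        print(f"{model}: {correct}/{len(CASES)} correct")


if __name__ == "__main__":
    run_eval()
```

Even a toy harness like this answers the question the public leaderboards can't: which model and prompt do best on the exact task in front of you.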

Unfortunately, setting up evals remains a bit painful. There are lots of ways to test LLMs, but they all feel a bit messy. Trying out a bunch of them helped me figure out which features I’m actually looking for.
