Kevin Schaul

Visual journalist/hacker

You need your own LLM evals. Here’s how I set mine up.

April 10, 2025

Update 4/17/25: Added screenshots of my dashboard, and noted switch to pydantic-evals

Every week some company releases another LLM that blows the previous models out of the water, according to the benchmarks. The charts only go up. The benchmarks are useful on some level. But honestly, they are pretty weird.

If you’re doing anything at all interesting with large language models, you need to set up your own evals. Whether you’re trying to extract committee names from political emails, classify campaign expenditures or keep a tracker updated, I promise that testing your own use cases will tell you far more than the benchmarks will. Only setting up your own evals will tell you what combination of models and prompts works best for you. After all, you will be directly testing how you actually use them!

Unfortunately, setting up evals remains a bit painful. There are lots of ways to test LLMs, but they all feel messy. Trying out a bunch of them helped me figure out which features I’m looking for.

Here are my LLM evals (and code). More coming soon.

And here’s what one looks like:

[Screenshot of a website showing how well different LLM models performed on a task about whether an article is describing a new action or policy by the Trump administration. gemini-1.5-flash-latest leads.]

What I want in an LLM eval framework

My new favorite: Pydantic’s evals

This library is promising. It’s fairly bare-bones right now, but I was able to pretty quickly write some helpers to handle caching, calculate the aggregate statistics I want and output files in a nice format for analysis.
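
For reference, a bare-bones pydantic-evals eval looks roughly like this. The task and evaluator are toy examples I made up, and the library is young, so double-check the current docs for the exact API:

```python
from dataclasses import dataclass

from pydantic_evals import Case, Dataset
from pydantic_evals.evaluators import Evaluator, EvaluatorContext


# Toy task standing in for a real LLM call.
async def is_new_trump_action(headline: str) -> bool:
    return "administration" in headline.lower()


@dataclass
class ExactMatch(Evaluator):
    """Score 1.0 when the output matches the expected output."""

    def evaluate(self, ctx: EvaluatorContext) -> float:
        return 1.0 if ctx.output == ctx.expected_output else 0.0


dataset = Dataset(
    cases=[
        Case(
            name="cherry-blossoms",
            inputs="Cherry blossoms reach peak bloom in Washington",
            expected_output=False,
        ),
    ],
    evaluators=[ExactMatch()],
)

report = dataset.evaluate_sync(is_new_trump_action)
report.print()
```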

The format I settled on is for each eval to output two dataframes/CSVs:

  1. An aggregate version that calculates summary statistics across model-prompt variations. This lets you sort by your summary statistic to quickly see the best combination.

     | attributes.model | attributes.prompt | count | is_correct | share_correct |
     | ---------------- | ----------------- | ----- | ---------- | ------------- |
     | gemini-1.5-flas… | Is this article … | 200   | 163        | 0.82          |
     | openai/gpt4.1    | Is this article … | 200   | 152        | 0.76          |
     | gemini-2.0-flas… | Is this article … | 200   | 151        | 0.76          |
  2. A full version that lists out the results for each test-model-prompt combination. This lets you browse specific results to hopefully better understand what’s happening.

     | attributes.model | attributes.prompt | input.headline | expected_output | output | is_correct |
     | ---------------- | ----------------- | -------------- | --------------- | ------ | ---------- |
     | openai/gpt4.1    | Is this article … | Noem: Guantán… | True            | True   | True       |
     | openai/gpt4.1    | Is this article … | Medical evacu… | False           | False  | True       |
     | openai/gpt4.1    | Is this article … | Cherry blosso… | False           | False  | True       |

Together these outputs give me everything I need, at least so far.
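
The aggregate version is really just a groupby over the full version. Something like this pandas sketch produces it (the column names match the tables above; the helper and file names are made up):

```python
import pandas as pd


def summarize(full: pd.DataFrame) -> pd.DataFrame:
    """Collapse per-case rows into one row per model-prompt combination."""
    return (
        full.groupby(["attributes.model", "attributes.prompt"], as_index=False)
        .agg(
            count=("is_correct", "size"),
            is_correct=("is_correct", "sum"),
            share_correct=("is_correct", "mean"),
        )
        .sort_values("share_correct", ascending=False)
    )


# The full, per-case results are the second CSV described above.
full = pd.read_csv("full_results.csv")
summarize(full).to_csv("aggregate_results.csv", index=False)
```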

I would love to see some of this become standard in the library, or get hidden behind another tool (a CLI, anyone?), so I don’t have to keep writing this code myself. But for now it’s working great.

My previous favorite: promptfoo

Here is my setup for promptfoo. It’s basically three steps:

  1. I write my test cases to a JSON file. (Unfortunately, due to a bug, I am currently using JSONL.)
  2. I include my custom promptfoo llm_provider, which basically just calls llm for whatever model I want to test (see the sketch after this list).
  3. I wire it all up with promptfooconfig.yaml, which specifies the prompts and models I want to evaluate.
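
The provider in step 2 doesn’t need to be much. Here’s a rough sketch of the idea, shelling out to the llm CLI; check promptfoo’s docs for the exact Python provider interface, and note that the config handling here is an assumption about how the model name gets passed in:

```python
# llm_provider.py -- sketch of a promptfoo Python provider that shells out
# to the llm CLI. Verify the call_api signature against promptfoo's docs.
import subprocess


def call_api(prompt, options=None, context=None):
    # Assumes the model id is passed via the provider's config block in
    # promptfooconfig.yaml; adjust to however you wire yours up.
    model = ((options or {}).get("config") or {}).get("model", "gpt-4o-mini")
    result = subprocess.run(
        ["llm", "-m", model, prompt],
        capture_output=True,
        text=True,
        check=True,
    )
    return {"output": result.stdout.strip()}
```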

With these in place, I can run promptfoo eval to execute the evals, then promptfoo view to open the results in a browser. They look like this:

[Screenshot of promptfoo output, comparing gemini-2.0-flash and llama-3.2-3b-4bit.]

I don’t know what two of these three charts mean, but if you ignore those, the tables are great. Here I can see that gemini-2.0-flash outperformed llama-3.2-3b-4bit on these 10 test cases. I can scroll through and see which cases each model missed. It’s pretty great output, once you know where to look.

It took me a while to figure out how to get promptfoo to work, but overall it does nearly everything I’m looking for. I wish it did less; I find the documentation pretty confusing because it covers so many features. But maybe that’s an unfair criticism.

Also interesting: pytest-evals

This library attempts to treat LLM evals somewhat like unit tests, letting you run them with pytest. It’s an interesting idea that I generally like. It does require you to roll your own analysis functions, though.
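
To give a flavor of the idea, here is roughly what that looks like with plain pytest.mark.parametrize. (This is not pytest-evals’ own API, which layers result collection and a separate analysis phase on top; the classifier and prompt below are made up.)

```python
import llm  # the llm CLI's Python API; Gemini model ids need the llm-gemini plugin
import pytest


def is_new_trump_action(headline: str) -> bool:
    """Made-up stand-in for a real eval task: classify one headline with an LLM."""
    response = llm.get_model("gemini-1.5-flash-latest").prompt(
        "Does this headline describe a new action or policy by the Trump "
        f"administration? Answer only True or False.\n\nHeadline: {headline}"
    )
    return response.text().strip().lower().startswith("true")


# Each eval case becomes one parametrized test.
@pytest.mark.parametrize(
    "headline,expected",
    [
        ("Cherry blossoms reach peak bloom in Washington", False),
        # ... more cases, ideally loaded from the same file as your other evals
    ],
)
def test_headline_classifier(headline, expected):
    assert is_new_trump_action(headline) == expected
```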

I toyed around with building a similar library that simplified usage by handling the analysis functions and caching for you. But I don’t think that’s the way forward; I’m more inclined to spend time making promptfoo work for me.

How are you running evals?

If there are other good eval frameworks, please let me know. I don’t feel like I’ve fully cracked the nut yet, though I’m in a much better spot than a few months ago.