
Filters
A few notes about the filters:
- I picked the tasks that are most commonly required in AI-powered apps. The list isn't exhaustive; there are certainly other tasks you may want to accomplish with your app, but it's a good jumping-off point.
- Most of the leaderboards have quality benchmarks, but as you move down the list of tasks, benchmarks become increasingly scarce. For example, at the time of publishing, very few leaderboards include benchmarks for context window.
- I decided to split 'Generate text' and 'Chat' into two different tasks because, while there is some overlap, you can have an app that generates text (e.g., summarizes content) without a chat component. Conversely, the chat task may have benchmarks that wouldn't apply to more generic text generation. Case in point: Artificial Analysis' Communication benchmark measures a model's performance in conversational settings, assessing communication skills, coherence, and engagement based on user feedback, none of which is necessarily relevant to non-conversational text generation. There may be similar overlaps between 'Solve complex problems' and 'Solve math problems'.
- Benchmarks may carry the same label across leaderboards but be calculated differently (e.g., speed and math). For example, some leaderboards report a single speed metric that folds in latency, while others track speed and latency separately. So make sure you understand how a leaderboard defines a benchmark/metric. You can click on any green node to summon a pop-up with more information and source links for that leaderboard, or hover over a benchmark node (the gray ones) to see how the leaderboard creator defines that metric.
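As a rough illustration of that click-vs-hover interaction, here is a minimal sketch of how node handlers might be wired up. The app's actual implementation isn't shown here, so the class names, data attributes, and helper function are hypothetical, not taken from the real code.

```ts
// Hypothetical sketch: green leaderboard nodes open a popup on click,
// gray benchmark nodes show a definition tooltip on hover.
// Selectors, data attributes, and helper names are assumptions.

const tooltip = document.getElementById('tooltip') as HTMLDivElement;

function openLeaderboardPopup(node: SVGCircleElement): void {
  // In the real app this popup includes more information and source links.
  const name = node.dataset.leaderboard ?? 'Unknown leaderboard';
  console.log(`Open popup for ${name}`);
}

document.querySelectorAll<SVGCircleElement>('.node--leaderboard').forEach((node) => {
  node.addEventListener('click', () => openLeaderboardPopup(node));
});

document.querySelectorAll<SVGCircleElement>('.node--benchmark').forEach((node) => {
  node.addEventListener('mouseenter', (event: MouseEvent) => {
    // Show the benchmark definition supplied by the leaderboard creator.
    tooltip.textContent = node.dataset.definition ?? 'No definition available';
    tooltip.style.left = `${event.pageX + 12}px`;
    tooltip.style.top = `${event.pageY + 12}px`;
    tooltip.hidden = false;
  });
  node.addEventListener('mouseleave', () => {
    tooltip.hidden = true;
  });
});
```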
Network Graphs
A few points about the graph:
- I redesigned this app in March 2025 and made it responsive, all the way down to phone size. In that redesign, I simplified the graph by removing repetitive tooltips. The tooltips that remain (e.g., the info icons in the headers and the benchmark tooltips on the gray nodes) are now draggable, so if you're on a smaller screen and part of a tooltip is cut off, just grab it by the handle and move it toward the center; it should then adjust responsively to your screen size (see the sketch after this list). 🤞 I couldn't test all of the functionality in Chrome's device emulator because of all the JavaScript.
- It's very rare for leaderboard owners to include definitions for their benchmarks, which was one of the greatest challenges in building this tool. Shockingly, many don't even define them in their methodology source (usually an arXiv paper). I would even search their corresponding blogs to try to hunt down definitions, and if I couldn't find them anywhere, I asked ChatGPT. Many of these benchmarks use labels that are quite opaque (e.g., 'GSM8K', 'Non-live AST Summary', and 'VQAV2 - EM'), so I found it very strange that their definitions were so difficult to rustle up, if they existed at all. 🤨
- The tips were not generated by AI. I spent a significant amount of time with each leaderboard, looking for tips and tricks I could pass along. Sometimes a benchmark provider will bury really useful features in a menu. I point them out wherever possible.
- HELM has a helpful Safety leaderboard that contains benchmarks pertinent to the Chat task, but I grouped them in with HELM Lite and added a 🚩 next to the safety benchmarks.
- I really hope that one day all benchmark owners will disclose whether they allow model providers to self-report. Currently, the only leaderboards I'm aware of that do are MMMU and MathVista.
- To date I haven't kept a changelog, but I will going forward. When I update the app, I'll add a link to it in the footer.
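Since the draggable tooltips come up a few times above, here is a minimal sketch of how a drag handle might be wired up with pointer events. The element IDs, class names, and positioning assumptions are mine for illustration; they aren't taken from the app's actual markup or code.

```ts
// Hypothetical sketch: make a tooltip draggable by its handle using pointer
// events, so it can be pulled back on-screen on small displays.
// Assumes the tooltip is absolutely positioned; IDs and class names are made up.

const tooltip = document.getElementById('tooltip') as HTMLDivElement;
const handle = tooltip.querySelector<HTMLElement>('.tooltip__handle')!;

let startX = 0;
let startY = 0;
let originLeft = 0;
let originTop = 0;

handle.addEventListener('pointerdown', (event: PointerEvent) => {
  // Record where the drag started and capture the pointer so the drag
  // keeps working even if the cursor leaves the handle.
  startX = event.clientX;
  startY = event.clientY;
  originLeft = tooltip.offsetLeft;
  originTop = tooltip.offsetTop;
  handle.setPointerCapture(event.pointerId);
});

handle.addEventListener('pointermove', (event: PointerEvent) => {
  if (!handle.hasPointerCapture(event.pointerId)) return;
  // Move the tooltip by the distance the pointer has traveled.
  tooltip.style.left = `${originLeft + event.clientX - startX}px`;
  tooltip.style.top = `${originTop + event.clientY - startY}px`;
});

handle.addEventListener('pointerup', (event: PointerEvent) => {
  handle.releasePointerCapture(event.pointerId);
});
```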