
Hugging Face’s Revamped Leaderboard and How to Benchmark LLMs

Written by Greg Chase | Jun 29, 2024 12:19:22 AM

 

We started this week's podcast by commenting on Hugging Face's changes to their Open LLM Leaderboard, which introduced more challenging benchmarks and scoring and produced much more exciting results.

From the Hugging Face blog

 

Hugging Face explained their changes in a blog post, starting with a plot of average benchmark scores over the last year that shows performance converging and plateauing. This left little differentiation between the major models they tested. They also discovered that some of the benchmarks they used contained errors, so the old system needed updating.

From the Hugging Face blog

 

In response to these problems, Hugging Face released a new version of the leaderboard with more challenging benchmarks. Their blog post shared a plot showing how scores for the same models shifted downward between the old and updated benchmarks.

From the Hugging Face blog

We looked at the difference between scores on the old and new versions of the leaderboard, including looking up OLMo, the fully open LLM we discussed in the last podcast.
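If you want to run this kind of comparison yourself, a minimal sketch like the one below works once you have two leaderboard exports on disk. The file names and column names here are hypothetical placeholders, not the leaderboard's actual export format, so adjust them to whatever you actually download.

```python
# Hypothetical sketch: compare per-model average scores between two leaderboard
# exports. The CSV file names and the "model"/"average" column names are
# assumptions -- adapt them to the real export you have.
import pandas as pd

old = pd.read_csv("open_llm_leaderboard_v1.csv")  # hypothetical v1 export
new = pd.read_csv("open_llm_leaderboard_v2.csv")  # hypothetical v2 export

merged = old.merge(new, on="model", suffixes=("_old", "_new"))
merged["delta"] = merged["average_new"] - merged["average_old"]

# Look up a specific model, e.g. OLMo, and list the largest score drops.
print(merged[merged["model"].str.contains("OLMo", case=False)])
print(merged.sort_values("delta").head(10))
```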

We then moved on to the more general topic of what LLM benchmarks are and what they test. Yulia pointed out that, similar to intelligence tests for humans, the question of how to test an LLM remains open. We first took a deep dive into the MMLU-Pro benchmark paper. We noted the very recent timestamp on this paper, showing how quickly Hugging Face incorporated it into its new leaderboard. We then looked at the MuSR benchmark (short for Multistep Soft Reasoning). This benchmark provides tests such as asking an LLM to solve a murder mystery.
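To make the mechanics a bit more concrete, here is a minimal sketch of scoring a model on MMLU-Pro-style multiple-choice questions with the datasets library. The TIGER-Lab/MMLU-Pro identifier, split, and field names are assumed from the public dataset card, and my_model_answer is a hypothetical stand-in for a real model call; this is an illustration, not the leaderboard's actual evaluation harness.

```python
# Minimal sketch of scoring a model on MMLU-Pro-style multiple-choice questions.
# Dataset identifier, split, and field names ("options", "answer_index") are
# assumed from the public dataset card; my_model_answer is a hypothetical
# placeholder for a real LLM call.
from datasets import load_dataset

def my_model_answer(question: str, options: list[str]) -> int:
    """Placeholder: return the index of the predicted answer option."""
    return 0  # a real implementation would prompt an LLM here

ds = load_dataset("TIGER-Lab/MMLU-Pro", split="test")

correct, total = 0, 0
for row in ds.select(range(20)):  # score a small sample only
    pred = my_model_answer(row["question"], row["options"])
    correct += int(pred == row["answer_index"])
    total += 1

print(f"accuracy on sample: {correct / total:.2%}")
```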

Yulia closed our discussion by referencing a survey paper describing many ways to test LLMs for different attributes.

 

From “Evaluating Large Language Models: A Comprehensive Survey”

Join us for our next session on Jul 9, 2024. We’ll be broadcasting from the Mozilla.ai community and talking about our work packaging the OLMo model with Llamafile to improve the model's accessibility for developers.

 

To be part of our online studio audience, please join the AIFoundry.org Discord.

Subscribe to the AIFoundry.org calendar on Luma to stay updated with upcoming community podcasts and events.

Feel free to drop a comment on this blog below.