Papers
arxiv:2404.08008

Sample-Efficient Human Evaluation of Large Language Models via Maximum Discrepancy Competition

Published on Apr 10, 2024
Authors:
,
,
,
,
,
,
,
,
,

Abstract

A sample-efficient human evaluation method based on MAximum Discrepancy Competition ranks large language models across various tasks with limited instructions.

Reliable evaluation of large language models (LLMs) is impeded by two key challenges: objective metrics often fail to reflect human perception of natural language, and exhaustive human labeling is prohibitively expensive. Here, we propose a sample-efficient human evaluation method for LLMs based on the principle of MAximum Discrepancy (MAD) Competition. Our method automatically and adaptively selects a compact set of input instructions that maximize semantic discrepancy between pairs of LLM responses. Human evaluators then perform three-alternative forced choices on these paired responses, which are aggregated into a global ranking using Elo rating. We apply our approach to compare eight widely used LLMs across four tasks: scientific knowledge understanding, mathematical reasoning, creative and functional writing, and code generation and explanation. Experimental results show that our sample-efficient evaluation method recovers "gold-standard" model rankings with a handful of MAD-selected instructions, reveals respective strengths and weaknesses of each LLM, and offers nuanced insights to guide future LLM development. Code is available at https://github.com/weiji-Feng/MAD-Eval .

Community

Sign up or log in to comment

Get this paper in your agent:

hf papers read 2404.08008
Don't have the latest CLI?
curl -LsSf https://hf.co/cli/install.sh | bash

Models citing this paper 0

No model linking this paper

Cite arxiv.org/abs/2404.08008 in a model README.md to link it from this page.

Datasets citing this paper 0

No dataset linking this paper

Cite arxiv.org/abs/2404.08008 in a dataset README.md to link it from this page.

Spaces citing this paper 0

No Space linking this paper

Cite arxiv.org/abs/2404.08008 in a Space README.md to link it from this page.

Collections including this paper 1