Date
Publisher: arXiv
As large language models (LLMs) grow in capability and autonomy, evaluating their outputs, especially on open-ended and complex tasks, has become a critical bottleneck. A new paradigm is emerging: using AI agents as the evaluators themselves. This "agent-as-a-judge" approach leverages the reasoning and perspective-taking abilities of LLMs to assess the quality and safety of other models, promising scalable and nuanced alternatives to human evaluation. In this review, we define the agent-as-a-judge concept, trace its evolution from single-model judges to dynamic multi-agent debate frameworks, and critically examine their strengths and shortcomings. We compare these approaches across reliability, cost, and human alignment, and survey real-world deployments in domains such as medicine, law, finance, and education. Finally, we highlight pressing challenges, including bias, robustness, and meta-evaluation, and outline future research directions. By bringing together these strands, our review demonstrates how agent-based judging can complement (but not replace) human oversight, marking a step toward trustworthy, scalable evaluation for next-generation LLMs.
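
For readers unfamiliar with the mechanics, the core of an agent-as-a-judge pipeline is a prompted evaluation call followed by structured parsing of the verdict. The sketch below is a minimal, framework-agnostic illustration of a single-judge rubric evaluation, not the method of any particular paper; the function call_llm is a hypothetical stand-in for whatever chat-completion client is actually used, and the rubric and JSON fields are illustrative only.

import json
from typing import Callable

JUDGE_PROMPT = """You are an impartial evaluator. Score the candidate answer
against the rubric below and reply only with JSON of the form
{{"score": <1-5>, "rationale": "<short explanation>"}}.

Rubric: {rubric}
Question: {question}
Candidate answer: {answer}
"""

def judge_answer(
    question: str,
    answer: str,
    rubric: str,
    call_llm: Callable[[str], str],  # hypothetical client: prompt text -> raw reply text
) -> dict:
    """Run one agent-as-a-judge evaluation and parse the structured verdict."""
    raw = call_llm(JUDGE_PROMPT.format(rubric=rubric, question=question, answer=answer))
    try:
        verdict = json.loads(raw)
    except json.JSONDecodeError:
        # Judge models sometimes wrap JSON in prose; keep the raw text rather than guess a score.
        verdict = {"score": None, "rationale": raw}
    return verdict

A multi-agent debate variant would, under the same assumptions, call judge_answer with several differently prompted judges and aggregate their verdicts (e.g., majority vote or mean score), which is where the reliability and cost trade-offs discussed in the review arise.
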
What is the application?
Who is the user?
What is the user's age?
Why use AI?
Study design
