What is LLM-as-a-Judge?

TL;DR

Using a strong LLM to score or rank another LLM's outputs: a stand-in for human evaluation that became the de-facto evaluation method in 2026, anchoring OpenAI Evals, LangSmith, Ragas, and other tooling.

LLM-as-a-Judge: Definition & Explanation

LLM-as-a-Judge is the practice of having a strong LLM (e.g., GPT-5, Claude Opus 4.7, Gemini 3 Ultra) score or rank the outputs of another LLM. The approach was formalized after the 2023 Vicuna evaluation work and is now standard. It typically reaches 50-90% agreement with human judgment at roughly 1/100 of the cost and 1000x the throughput, which is why it is used for:

- RAG answer-quality evaluation
- regression testing of fine-tunes
- production output monitoring
- automated A/B winner picking
- prompt-optimization loops

Common evaluation axes are faithfulness, relevance, coherence, helpfulness, safety, and toxicity.

Implementation patterns (a minimal pairwise sketch follows below):

- pairwise comparison (A vs. B)
- pointwise scoring (1-10)
- reference-based (compare to a gold answer)
- reference-free
- multi-turn (whole-conversation rating)

Tooling: LangSmith, LangFuse, Ragas, Phoenix, DeepEval, PromptFoo.

Known biases:

- position bias (favors whichever option appears first)
- length bias (favors longer outputs)
- self-enhancement bias (favors outputs from the same model family)

Common mitigations are order swapping, temperature 0, and multi-model voting (see the pointwise voting sketch further below). In 2026, "can write good evals" is a baseline AI engineering skill.
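
The sketch below shows the pairwise pattern with order swapping to mitigate position bias. It assumes the OpenAI Python SDK; the judge prompt and the "gpt-4o" model name are placeholders, not a prescribed setup.

```python
# Minimal pairwise LLM-as-a-Judge sketch (assumes the OpenAI Python SDK).
from openai import OpenAI

client = OpenAI()

JUDGE_PROMPT = """You are an impartial judge. Given a user question and two
candidate answers, decide which answer is more helpful, relevant, and correct.
Reply with exactly one letter: "A" or "B".

Question: {question}

Answer A:
{answer_a}

Answer B:
{answer_b}
"""

def judge_once(question: str, answer_a: str, answer_b: str,
               model: str = "gpt-4o") -> str:
    """Ask the judge model to pick the better answer; returns 'A' or 'B'."""
    response = client.chat.completions.create(
        model=model,
        temperature=0,  # deterministic scoring reduces run-to-run noise
        messages=[{
            "role": "user",
            "content": JUDGE_PROMPT.format(
                question=question, answer_a=answer_a, answer_b=answer_b),
        }],
    )
    return response.choices[0].message.content.strip()[:1].upper()

def judge_pairwise(question: str, answer_a: str, answer_b: str) -> str:
    """Judge both orderings to counter position bias; 'tie' if they disagree."""
    first = judge_once(question, answer_a, answer_b)    # original order
    swapped = judge_once(question, answer_b, answer_a)  # swapped order
    swapped = {"A": "B", "B": "A"}.get(swapped, swapped)  # map back to original labels
    return first if first == swapped else "tie"
```

Keeping only verdicts that survive the order swap is a simple way to filter out wins that were driven by position rather than quality.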

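For pointwise scoring, the same idea extends to a 1-10 rubric combined with multi-model voting. This is a sketch under the same SDK assumption; the faithfulness rubric and the judge model names are illustrative placeholders.

```python
# Pointwise (1-10) faithfulness scoring with multi-model voting
# (assumes the OpenAI Python SDK; judge model names are placeholders).
from statistics import mean
from openai import OpenAI

client = OpenAI()

RUBRIC = """Rate the answer's faithfulness to the provided context on a 1-10
scale (10 = every claim is supported by the context). Reply with only the number.

Context:
{context}

Answer:
{answer}
"""

def score_faithfulness(context: str, answer: str, model: str) -> float:
    """Return a 1-10 faithfulness score from a single judge model."""
    response = client.chat.completions.create(
        model=model,
        temperature=0,
        messages=[{
            "role": "user",
            "content": RUBRIC.format(context=context, answer=answer),
        }],
    )
    # Assumes the judge follows the "number only" instruction.
    return float(response.choices[0].message.content.strip())

def voted_score(context: str, answer: str,
                judges=("gpt-4o", "gpt-4o-mini")) -> float:
    """Average scores from several judge models to damp single-model bias."""
    return mean(score_faithfulness(context, answer, m) for m in judges)
```

Averaging (or majority-voting) across judge models is the usual counter to self-enhancement bias, since no single model family gets the final say.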