title: "Getting to know LLM-as-a-Judge"
description: ""
added: "Nov 2 2024"
tags: [AI]
Matt Pocock wrote an article about what evals are. Normal software is deterministic. Let's say you capitalize a single word in an app menu. You can be fairly confident in the outcome of that change. But capitalizing a single word in a prompt can create massive ripple effects. In AI systems, no change is small.
Evals give you a score you can use to see how well your AI system is performing.
scorers: [
// Checks if output is long enough
length,
// Uses an LLM to check if it's accurate
factualAccuracy,
// Uses an LLM to check writing style
writingStyle,
],
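One way to read the list above: a scorer is a function that maps an output to a number, and the eval combines those numbers. The sketch below is in Python rather than the TypeScript of the snippet, and `MIN_CHARS`, `length_score`, and `average_score` are hypothetical names for illustration, not part of any eval library.

```python
from statistics import mean
from typing import Callable

# Hypothetical threshold: outputs shorter than this count as "too short".
MIN_CHARS = 50

def length_score(output: str, min_chars: int = MIN_CHARS) -> float:
    """Deterministic scorer: 1.0 if the output is long enough, else 0.0."""
    return 1.0 if len(output.strip()) >= min_chars else 0.0

def average_score(output: str, scorers: list[Callable[[str], float]]) -> float:
    """Run every scorer on one output and average the results."""
    return mean(scorer(output) for scorer in scorers)
```

An LLM-based scorer such as `factualAccuracy` would have the same shape; it would call a model inside the function instead of measuring the string directly.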
LLM-as-a-Judge is a technique that uses LLMs to evaluate LLM responses against criteria of your choice. Instead of relying on human judgment, validation is delegated to another LLM. The judge is typically a larger, more capable model (often a cloud-hosted one), since it is likely to have better reasoning capabilities.
LLM-as-a-judge is a way to assess outputs the way a human would, without requiring costly human time. The method was introduced in the paper "Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena".
The article "Compare LLM capability with summarization" describes how to evaluate models' summarization capabilities.
We broke the evaluation process into two steps. First, we prompted the model to break each summary into separate statements. Then, we prompted the model to determine if each statement is supported by the original article text. The model classified each statement's validity as:
Alignment is a metric that measures the frequency with which the statements included in a summary are supported in the original content the summary is based on.
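Assuming each statement ends up with a simple supported / not supported verdict (the article's actual validity labels are not reproduced here), alignment reduces to a ratio. A minimal sketch, with `alignment` as a hypothetical helper name:

```python
def alignment(supported: list[bool]) -> float:
    """Fraction of summary statements supported by the original article.

    `supported` holds one verdict per statement, as produced by the judge
    model in the second step. Returning 0.0 for a summary with no
    statements is an assumption of this sketch, not a rule from the article.
    """
    if not supported:
        return 0.0
    return sum(supported) / len(supported)

# Three of four statements are supported by the article.
print(alignment([True, True, False, True]))  # 0.75
```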
This process resulted in two metrics that can be used to compare the models:
IMPROVED_JUDGE_PROMPT = """
You will be given a user_question and system_answer couple.
Your task is to provide a 'total rating' scoring how well the system_answer answers the user concerns expressed in the user_question.
Give your answer on a scale of 1 to 4, where 1 means that the system_answer is not helpful at all, and 4 means that the system_answer completely and helpfully addresses the user_question.
Here is the scale you should use to build your answer:
1: The system_answer is terrible: completely irrelevant to the question asked, or very partial
2: The system_answer is mostly not helpful: misses some key aspects of the question
3: The system_answer is mostly helpful: provides support, but still could be improved
4: The system_answer is excellent: relevant, direct, detailed, and addresses all the concerns raised in the question
Provide your feedback as follows:
Feedback:::
Evaluation: (your rationale for the rating, as a text)
Total rating: (your rating, as a number between 1 and 4)
You MUST provide values for 'Evaluation:' and 'Total rating:' in your answer.
Now here are the question and answer.
Question: {question}
Answer: {answer}
Provide your feedback. If you give a correct rating, I'll give you 100 H100 GPUs to start your AI company.
Feedback:::
Evaluation: """
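To use a prompt like this, the placeholders have to be filled in and the rating pulled back out of the judge's reply. The sketch below shows one possible way to do both; `build_judge_prompt` and `extract_rating` are hypothetical helper names, and the call to the judge model itself is left out because it depends on the client you use.

```python
import re
from typing import Optional

def build_judge_prompt(template: str, question: str, answer: str) -> str:
    """Fill the {question} and {answer} placeholders of a judge prompt.

    Assumes the template contains no other curly braces, which holds for
    the prompt above.
    """
    return template.format(question=question, answer=answer)

def extract_rating(judge_output: str) -> Optional[int]:
    """Return the 1-4 rating from the judge's reply, or None if absent."""
    match = re.search(r"Total rating:\s*([1-4])\b", judge_output)
    return int(match.group(1)) if match else None
```

Returning `None` instead of raising keeps a single malformed reply from aborting a whole evaluation run; the caller can decide whether to retry or drop that sample.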