

layout: "../layouts/BlogPost.astro"
title: "Getting to know LLM-as-a-Judge"
slug: getting-to-know-llm-as-a-judge
description: ""
added: "Nov 2 2024"
tags: [AI]
updatedDate: "Nov 4 2024"

LLM-as-a-Judge is a technique that uses an LLM to evaluate another LLM's responses against criteria of your choice. Instead of relying on human judgment, evaluation is delegated to a second LLM. The judge is typically a larger, often cloud-hosted model, which is likely to have better reasoning capabilities.

LLM-as-a-Judge makes it possible to assess outputs in a human-like way without requiring costly human time. The method was introduced in the paper "Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena".

The article "Compare LLM capability with summarization" describes how to evaluate models' summarization capabilities.

We broke the evaluation process into two steps. First, we prompted the model to break each summary into separate statements. Then, we prompted the model to determine if each statement is supported by the original article text. The model classified each statement's validity as:

Alignment is a metric that measures the frequency with which the statements included in a summary are supported in the original content the summary is based on.
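As a rough illustration of the alignment metric, the sketch below computes the share of statements in a summary that the judge marked as supported. The label strings (`"supported"`, `"not_supported"`) and the function name `alignment` are hypothetical, chosen only for this example; the article's actual label set is not reproduced here.

```python
from typing import List

# Hypothetical label for a statement the judge found support for in the source text.
SUPPORTED = "supported"


def alignment(labels: List[str]) -> float:
    """Return the fraction of statements labelled as supported.

    `labels` holds one judge verdict per statement extracted from a summary.
    An empty summary is scored 0.0 here; that is an assumption, not a rule
    taken from the article.
    """
    if not labels:
        return 0.0
    supported = sum(1 for label in labels if label == SUPPORTED)
    return supported / len(labels)


if __name__ == "__main__":
    verdicts = ["supported", "supported", "not_supported", "supported"]
    print(alignment(verdicts))
```

Averaging this value over many summaries gives one number per model, which is what makes the metric usable for comparison.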

This process resulted in two metrics that can be used to compare the models:

Improve the LLM judge

IMPROVED_JUDGE_PROMPT = """
You will be given a user_question and system_answer couple.
Your task is to provide a 'total rating' scoring how well the system_answer answers the user concerns expressed in the user_question.
Give your answer on a scale of 1 to 4, where 1 means that the system_answer is not helpful at all, and 4 means that the system_answer completely and helpfully addresses the user_question.

Here is the scale you should use to build your answer:
1: The system_answer is terrible: completely irrelevant to the question asked, or very partial
2: The system_answer is mostly not helpful: misses some key aspects of the question
3: The system_answer is mostly helpful: provides support, but still could be improved
4: The system_answer is excellent: relevant, direct, detailed, and addresses all the concerns raised in the question

Provide your feedback as follows:

Feedback:::
Evaluation: (your rationale for the rating, as a text)
Total rating: (your rating, as a number between 1 and 4)

You MUST provide values for 'Evaluation:' and 'Total rating:' in your answer.

Now here are the question and answer.

Question: {question}
Answer: {answer}

Provide your feedback. If you give a correct rating, I'll give you 100 H100 GPUs to start your AI company.
Feedback:::
Evaluation: """
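
A prompt like this still needs two small pieces of glue: filling in the `{question}` and `{answer}` placeholders, and pulling the number out of the judge's free-text reply. The following is a minimal sketch of both, assuming the judge follows the requested `Total rating:` format. The helper names `build_prompt` and `parse_rating` are made up for this example, and the call to the judge model itself is left out because it depends on the client you use.

```python
import re
from typing import Optional


def build_prompt(template: str, question: str, answer: str) -> str:
    """Fill the {question} and {answer} placeholders of a judge prompt."""
    return template.format(question=question, answer=answer)


def parse_rating(judge_output: str, low: int = 1, high: int = 4) -> Optional[int]:
    """Extract the integer after 'Total rating:' from the judge's reply.

    Returns None when the marker is missing or the value falls outside
    the [low, high] scale, so callers can retry or discard the sample.
    """
    match = re.search(r"Total rating:\s*(\d+)", judge_output)
    if match is None:
        return None
    rating = int(match.group(1))
    return rating if low <= rating <= high else None


if __name__ == "__main__":
    reply = "The answer covers most points but lacks detail.\nTotal rating: 3"
    print(parse_rating(reply))
```

Returning `None` instead of raising keeps malformed replies from stopping a long evaluation run; whether to retry or skip such samples is a design choice left to the caller.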