fix: improve LLM prompt to use actual rubric score ranges

The previous prompt had hardcoded 0-3 score examples, which misled the LLM.
The prompt now instructs the LLM to read max_score from the rubric for each criterion.
sit002 2025-12-02 14:21:41 +08:00
parent 1afc2eae48
commit c25bcfddd0


@@ -36,19 +36,25 @@ def read_file_or_string(value):
 PROMPT_TEMPLATE = """你是严格且一致的助教,按提供的评分量表为学生的简答题评分。
-- 只依据量表,不做主观延伸;允许多样表述。
-- 不输出任何解释性文本,只输出 JSON,包含:
-{{
-  "total": number(0-10, 两位小数),
-  "criteria": [
-    {{"id":"accuracy","score":0-3,"reason":"要点式一句话"}},
-    {{"id":"coverage","score":0-3,"reason":""}},
-    {{"id":"clarity","score":0-3,"reason":""}}
-  ],
-  "flags": [],
-  "confidence": number(0-1)
-}}
-如果答案与题目无关:total=0,并加 flag "need_review"。
+- 只依据量表中各评分项的 max_score 和 scoring_guide 进行评分。
+- 每个评分项的分数范围是 0 到该项的 max_score。
+- 不输出任何解释性文本,只输出 JSON。
+输出格式:
+{{
+  "total": number (各项分数之和,保留两位小数),
+  "criteria": [
+    {{"id": "评分项id", "score": number(0到该项max_score), "reason": "简短评语"}},
+    ...
+  ],
+  "flags": [],
+  "confidence": number(0-1, 评分置信度)
+}}
+重要:
+- 每个评分项的 score 必须在 0 到该项 max_score 范围内。
+- total 必须等于所有 criteria score 之和。
+- 如果答案与题目无关或为空:total=0,并加 flag "need_review"。
 题目:
 <<<{question}>>>
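The new prompt shifts the score bounds from hardcoded values to the rubric's per-criterion max_score, so the caller can now enforce the same contract on the model's reply. A minimal sketch of such a check (the function name `validate_grading` and the exact rubric shape are assumptions for illustration, not part of this commit):

```python
import json

def validate_grading(raw: str, rubric: dict) -> dict:
    """Parse the model's JSON reply and enforce the prompt's stated constraints:
    each criterion score lies in [0, max_score], and total equals their sum."""
    result = json.loads(raw)
    # Assumed rubric shape: {"criteria": [{"id": ..., "max_score": ...}, ...]}
    max_scores = {c["id"]: c["max_score"] for c in rubric["criteria"]}
    total = 0.0
    for item in result["criteria"]:
        hi = max_scores[item["id"]]
        if not 0 <= item["score"] <= hi:
            raise ValueError(f"{item['id']}: score {item['score']} outside 0-{hi}")
        total += item["score"]
    if round(total, 2) != round(result["total"], 2):
        raise ValueError("total does not equal sum of criteria scores")
    return result
```

Replies that violate either bound (a score above its criterion's max_score, or a mismatched total) raise and can be routed to the same "need_review" path the prompt reserves for irrelevant answers.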