Qwen3.7 Plus Benchmark Whitepaper
A model-specific whitepaper on Qwen3.7 Plus benchmark results from the TuitionGoWhere rerun sample.
This whitepaper reviews how Qwen3.7 Plus performed in the TuitionGoWhere benchmark. The benchmark used a separate evaluator model to score generated learning materials against language fit, syllabus fit, answer quality, exam format, notation, difficulty, missing-image risk, and other practical checks.
The results should be read as an internal quality signal, not a formal education certification. Under the current quality policy, material that scores below 8.0 is flagged with a review warning and hidden from the main subject pages.
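The threshold rule can be sketched in a few lines of Python. This is an illustrative sketch only: the function name `classify_report` and the field names are assumptions, not TuitionGoWhere's actual implementation. It treats "below 8.0" as a strict comparison, so a score of exactly 8.0 is shown.

```python
# Hypothetical sketch of the quality policy described above.
# Only the 8.0 threshold comes from the whitepaper; all names are assumed.
REVIEW_THRESHOLD = 8.0


def classify_report(score: float) -> dict:
    """Return review and visibility flags for one evaluator score."""
    needs_review = score < REVIEW_THRESHOLD  # strictly below the threshold
    return {
        "score": score,
        "review_warning": needs_review,
        "shown_on_subject_page": not needs_review,
    }


if __name__ == "__main__":
    for sample_score in (9.1, 8.0, 7.9):
        print(classify_report(sample_score))
```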
Score Profile
Content Coverage
| Content type | Reports | Average score |
|---|---|---|
| Exam papers | 132 | 9.1 |
| Quizzes | 98 | 9.0 |
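For readers who want a single combined figure, the two rows can be merged into a report-weighted mean. The sketch below uses only the numbers in the table. The whitepaper does not state a combined score, and the per-type averages are already rounded to one decimal place, so the result is approximate.

```python
# Report-weighted mean of the two content types in the table above.
# Inputs are the published (rounded) averages, so the output is approximate.
coverage = [
    ("Exam papers", 132, 9.1),
    ("Quizzes", 98, 9.0),
]


def weighted_average(rows):
    """Return (total report count, report-weighted mean score)."""
    total_reports = sum(count for _, count, _ in rows)
    weighted_sum = sum(count * avg for _, count, avg in rows)
    return total_reports, weighted_sum / total_reports


total, combined = weighted_average(coverage)
print(total, round(combined, 2))  # 230 9.06
```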
Criterion Scores
| Criterion | Score | Interpretation |
|---|---|---|
| LaTeX and notation handling | 10.0 | Best-in-sample notation score. |
| Clean output | 9.6 | Usually structured and readable. |
| Language suitability | 9.3 | Good average, but the weaker language cases were significant. |
| Exam paper format | 9.3 | Generally good paper structure. |
| Difficulty fit | 8.2 | Difficulty calibration was the main weakness. |
| Step-by-step answers | 8.6 | Answer working needed more detail in some cases. |
Main Findings
- Qwen3.7 Plus looked technically strong, especially for notation-heavy work.
- The strongest sampled areas included Primary 4 Mathematics, Primary 6 Science, Primary 4 Science, Secondary 4 Additional Mathematics, Primary 3 Science, and Primary 5 Science.
- The low-score rate was elevated because the sample included rerun-like outputs and narrower problem areas. Primary Chinese and some humanities or mathematics cases needed closer review.
Subject and Level Fit
- Strong fit: Primary Science, Primary Mathematics, Additional Mathematics, and notation-heavy pages.
- Use with review: Primary Chinese, O-Level History, A-Level H2 Mathematics, and Social Studies, where language, difficulty, or answer depth can drift.
- Use with review: outputs that are meant to replace weaker earlier resources, because a rerun should improve on the previous version, not merely be acceptable.
Recurring Comment Themes
The benchmark comments were reviewed for repeated patterns. These themes overlap, so they should not be added together as independent failure counts.
- Difficulty mismatch appeared often in lower-scoring cases.
- Language control was a serious issue where Chinese materials were not mainly written in Chinese.
- Answer depth needed more explicit working and explanation.
- Template drift appeared when papers were close to the intended format but not tight enough for exam-style practice.
- LaTeX and notation were a clear strength, but rendered output still needs a website check.
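The caution about overlapping themes can be illustrated with a small counting example. The report IDs and theme tags below are invented for illustration and are not benchmark data. Because one report can carry several themes, the sum of the per-theme counts exceeds the number of distinct reports affected.

```python
from collections import Counter

# Hypothetical tagged comments: report id -> set of themes found in it.
# These values are made up to show the counting issue, not real results.
tagged = {
    "r1": {"difficulty", "answer_depth"},
    "r2": {"language"},
    "r3": {"difficulty", "template_drift", "answer_depth"},
    "r4": set(),  # no issue themes
}

# Per-theme counts: each report contributes once to every theme it carries.
theme_counts = Counter(theme for themes in tagged.values() for theme in themes)

summed = sum(theme_counts.values())                          # theme mentions
affected = sum(1 for themes in tagged.values() if themes)    # distinct reports

print(dict(theme_counts))
print(summed, affected)  # 6 3
```

Here six theme mentions come from only three reports, which is why the themes should be read as overlapping signals and not as separate failure counts.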
Prompting and Workflow Implications
TuitionGoWhere does not publish its original internal prompts. The points below are high-level improvement directions based on the benchmark findings.
- Use stronger level calibration and require the model to state the intended difficulty before writing the final paper.
- Use strict target-language validation for Chinese outputs.
- Require step-by-step solutions for calculation and reasoning questions.
- Use Qwen3.7 Plus selectively for notation-heavy reruns where its strengths matter most.
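As one possible reading of "strict target-language validation", the sketch below estimates what share of the letters in an output are Chinese characters and rejects outputs below a minimum share. It is an assumption-laden illustration, not TuitionGoWhere's validator: the function names and the 0.7 threshold are hypothetical, and the check covers only the basic CJK Unified Ideographs block (U+4E00 to U+9FFF).

```python
# Hypothetical language check for Chinese outputs.
# The threshold and all names are assumptions chosen for illustration.
MIN_CHINESE_RATIO = 0.7


def chinese_ratio(text: str) -> float:
    """Share of alphabetic characters that fall in U+4E00..U+9FFF."""
    letters = [ch for ch in text if ch.isalpha()]  # skips digits, punctuation
    if not letters:
        return 0.0
    cjk = sum(1 for ch in letters if "\u4e00" <= ch <= "\u9fff")
    return cjk / len(letters)


def passes_language_check(text: str, min_ratio: float = MIN_CHINESE_RATIO) -> bool:
    """True when the text is mainly written in Chinese characters."""
    return chinese_ratio(text) >= min_ratio


if __name__ == "__main__":
    print(passes_language_check("请阅读下面的短文，然后回答问题。"))
    print(passes_language_check("Read the passage and answer the questions. 短文"))
```

A character-share check like this is deliberately crude. It would need tuning for mixed content such as mathematics papers, where Latin variable names and notation are expected inside Chinese text.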