Qwen3.7 Plus Benchmark Whitepaper

A model-specific whitepaper on Qwen3.7 Plus benchmark results from the TuitionGoWhere rerun sample.

Singapore, June 2026

This whitepaper reviews how Qwen3.7 Plus performed in the TuitionGoWhere benchmark. The benchmark used a separate evaluator model to score generated learning materials against language fit, syllabus fit, answer quality, exam format, notation, difficulty, missing-image risk, and other practical checks.

The results should be read as an internal quality signal, not a formal education certification. Under the current quality policy, a score below 8.0 is treated as a review warning, and the affected material is hidden from the main subject pages.
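The threshold policy can be sketched as a simple filter. This is a minimal illustration under stated assumptions, not TuitionGoWhere's actual implementation: the `Report` record, its field names, and the sample scores are hypothetical. Only the 8.0 cut-off comes from the policy described above.

```python
from dataclasses import dataclass

# From the policy above: scores below 8.0 trigger a review warning.
REVIEW_THRESHOLD = 8.0


@dataclass
class Report:
    # Hypothetical record; field names are illustrative assumptions.
    title: str
    score: float


def needs_review(report: Report) -> bool:
    """Return True when the score is strictly below the threshold."""
    return report.score < REVIEW_THRESHOLD


def visible_on_subject_pages(reports):
    """Keep only reports that do not carry a review warning."""
    return [r for r in reports if not needs_review(r)]


# Made-up sample scores, for illustration only.
sample = [Report("Paper A", 9.2), Report("Quiz B", 7.9), Report("Paper C", 8.0)]
visible = visible_on_subject_pages(sample)
print([r.title for r in visible])  # ['Paper A', 'Paper C']
```

Note that a score of exactly 8.0 stays visible in this sketch, because the policy is worded as "below 8.0".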

Score Profile

Metric	Value
Total reports	230
Quiz and paper reports	230
Overall score	9.1
Quiz and paper average	9.1
Below 8.0	18
Below-8 rate	7.8%
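The below-8 rate follows directly from the two counts in the score profile. The short check below uses only those published figures; the variable names are illustrative.

```python
total_reports = 230    # total reports in the sample
below_threshold = 18   # reports scoring below 8.0

rate = below_threshold / total_reports
rate_text = f"{rate:.1%}"
print(rate_text)  # 7.8%
```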

Content Coverage

Content type	Reports	Average score
Exam papers	132	9.1
Quizzes	98	9.0
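As a consistency check, the overall score can be approximated as a report-weighted average of the two content-type averages. This is only a sketch: the per-type averages in the table are already rounded to one decimal place, so the result is an approximation rather than a recomputation from raw scores.

```python
# (report count, average score) per content type, taken from the table above.
coverage = {
    "Exam papers": (132, 9.1),
    "Quizzes": (98, 9.0),
}

total = sum(count for count, _ in coverage.values())
weighted = sum(count * score for count, score in coverage.values()) / total

print(total)               # 230
print(round(weighted, 1))  # 9.1
```

The weighted figure is about 9.06, which rounds to the published overall score of 9.1.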

Criterion Scores

Criterion	Score	Interpretation
LaTeX and notation handling	10.0	Best-in-sample notation score.
Clean output	9.6	Usually structured and readable.
Language suitability	9.3	Good average, but the low-scoring language cases were serious.
Exam paper format	9.3	Generally good paper structure.
Step-by-step answers	8.6	Answer working needed more detail in some cases.
Difficulty fit	8.2	Difficulty calibration was the main weakness.

Main Findings

Qwen3.7 Plus looked technically strong, especially for notation-heavy work.
The strongest sampled areas included Primary 4 Mathematics, Primary 6 Science, Primary 4 Science, Secondary 4 Additional Mathematics, Primary 3 Science, and Primary 5 Science.
The below-8 rate of 7.8% was relatively high because the sample included rerun-like outputs and narrower problem areas. Primary Chinese and some humanities or mathematics cases needed closer review.

Subject and Level Fit

Strong fit: Primary Science, Primary Mathematics, Additional Mathematics, and notation-heavy pages.
Use with review: Primary Chinese, O-Level History, A-Level H2 Mathematics, and Social Studies, where language, difficulty, or answer depth can drift.
Use with review: outputs that are meant to replace weaker earlier resources, because the rerun should be better than the previous version, not just acceptable.

Recurring Comment Themes

The benchmark comments were reviewed for repeated patterns. These themes overlap, so they should not be added together as independent failure counts.

Difficulty mismatch appeared often in lower-scoring cases.
Language control was a serious issue where Chinese materials were not mainly written in Chinese.
Answer depth needed more explicit working and explanation.
Template drift appeared when papers were close to the intended format but not tight enough for exam-style practice.
LaTeX and notation were a clear strength, but rendered output still needs a website check.
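The caution about overlapping themes can be illustrated with a small tally. The report IDs and theme tags below are entirely hypothetical and are not benchmark data; the point is only that summing per-theme counts overstates the number of affected reports when one report carries several themes.

```python
from collections import Counter

# Hypothetical tagging: report IDs mapped to the comment themes they carry.
theme_tags = {
    "r1": {"difficulty", "answer_depth"},
    "r2": {"language"},
    "r3": {"difficulty", "template_drift", "answer_depth"},
}

# Count how many reports mention each theme.
theme_counts = Counter(theme for tags in theme_tags.values() for theme in tags)

summed_mentions = sum(theme_counts.values())  # counts a report once per theme
distinct_reports = len(theme_tags)            # counts each report once

print(summed_mentions, distinct_reports)  # 6 3
```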

Prompting and Workflow Implications

TuitionGoWhere does not publish its original internal prompts. The points below are high-level improvement directions based on the benchmark findings.

Use stronger level calibration and require the model to state the intended difficulty before writing the final paper.
Use strict target-language validation for Chinese outputs.
Require step-by-step solutions for calculation and reasoning questions.
Use Qwen3.7 Plus selectively for notation-heavy reruns where its strengths matter most.
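One possible shape for the target-language validation mentioned above is a character-ratio check. This is a rough sketch, not the benchmark's method: the function names, the 0.8 threshold, and the use of the basic CJK Unified Ideographs range (U+4E00 to U+9FFF) are all assumptions, and a production check would need to handle pinyin, intentional English vocabulary items, and similar cases.

```python
def chinese_ratio(text: str) -> float:
    """Share of alphabetic characters that fall in the basic CJK range."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    cjk = [ch for ch in letters if "\u4e00" <= ch <= "\u9fff"]
    return len(cjk) / len(letters)


def is_mainly_chinese(text: str, threshold: float = 0.8) -> bool:
    """Hypothetical pass/fail rule; the 0.8 threshold is an assumption."""
    return chinese_ratio(text) >= threshold


chinese_text = "请计算下列各题的答案。"
mixed_text = "Calculate the answers. 答案"

print(is_mainly_chinese(chinese_text))  # True
print(is_mainly_chinese(mixed_text))    # False
```

Punctuation and digits are ignored because only alphabetic characters are counted, so a Chinese paper with many numerals is not penalised.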
