Qwen3.6 Plus Benchmark Whitepaper

A model-specific whitepaper on Qwen3.6 Plus benchmark results for TuitionGoWhere quizzes and exam papers.

Singapore

This whitepaper reviews how Qwen3.6 Plus performed in the TuitionGoWhere benchmark. The benchmark used a separate evaluator model to score generated learning materials against language fit, syllabus fit, answer quality, exam format, notation, difficulty, missing-image risk, and other practical checks.

The results should be read as an internal quality signal, not a formal education certification. Any report scoring below 8.0 is treated as a review warning and is hidden from the main subject pages under the current quality policy.

Score Profile

Total reports: 1,045
Quiz and paper reports: 1,045
Overall score: 9.4
Quiz and paper average: 9.4
Below 8.0: 18
Below-8 rate: 1.7%
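
The below-8 rate follows directly from the two counts in the score profile. The snippet below is a minimal sketch of that arithmetic and of the 8.0 review threshold; the function and constant names are illustrative assumptions, not part of the TuitionGoWhere pipeline.

```python
# Hypothetical names; only the 8.0 threshold and the counts come from the whitepaper.
REVIEW_THRESHOLD = 8.0


def needs_review(score: float) -> bool:
    """Return True when a score falls strictly below the review threshold."""
    return score < REVIEW_THRESHOLD


def below_threshold_rate(below: int, total: int) -> float:
    """Return the share of reports below the threshold, as a percentage."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * below / total


rate = below_threshold_rate(below=18, total=1045)
print(f"Below-8 rate: {rate:.1f}%")  # Below-8 rate: 1.7%
```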

Content Coverage

Content type   Reports   Average score
Exam papers    510       9.3
Quizzes        535       9.4
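
As a consistency check, the overall score can be approximated as a report-weighted average of the two content types. This is a sketch only: it assumes the overall score is a simple weighted mean, and it works from the published one-decimal averages rather than the underlying per-report scores, so it can confirm agreement after rounding but nothing more precise.

```python
# Rows copied from the content coverage table: (content type, reports, average score).
COVERAGE = [
    ("Exam papers", 510, 9.3),
    ("Quizzes", 535, 9.4),
]


def weighted_average(rows):
    """Return (total report count, report-weighted mean score)."""
    total = sum(count for _, count, _ in rows)
    weighted_sum = sum(count * score for _, count, score in rows)
    return total, weighted_sum / total


total_reports, overall = weighted_average(COVERAGE)
print(total_reports)      # 1045
print(round(overall, 1))  # 9.4
```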

Criterion Scores

Criterion                     Score   Interpretation
LaTeX and notation handling   9.9     Very strong handling of formulas, symbols, and structured notation.
Language suitability          9.8     Generally level-appropriate language across the tested sample.
Syllabus adherence            9.7     Usually stayed close to the expected school topic and scope.
Clean output                  9.7     Low rate of obvious tags, stray symbols, or broken structure.
Answer explanation quality    8.4     The main quality gap was answer depth, especially where working should be shown.
Difficulty fit                8.8     Some papers needed better calibration to the intended level.

Main Findings

  • The model was strongest in technical and structured subjects. The benchmark showed high averages for Secondary 3 Chemistry and Physics, Secondary 4 Principles of Accounts, O-Level Biology, and A-Level Chemistry H1.
  • The best use case appears to be large-scale generation of exam-style quizzes and papers where notation, marking structure, and syllabus boundaries matter.
  • The weakest cases were concentrated in language subjects, especially Secondary 4 Chinese and Tamil, where the benchmark caught wrong-language or mixed-language risks.

Subject and Level Fit

  • Strong fit: Secondary sciences, O-Level Biology, Principles of Accounts, A-Level Chemistry, and notation-heavy subjects.
  • Use with review: Secondary 4 Chinese, Tamil, Malay, and English writing tasks, where language control and marking expectations are stricter.
  • Operational note: missing-image flags were common, but many reflect the image-generation pipeline rather than a text-generation failure.

Recurring Comment Themes

The benchmark comments were reviewed for repeated patterns. These themes overlap, so they should not be added together as independent failure counts.

  • Answer keys sometimes needed more method-level explanation, not only final answers.
  • Difficulty calibration problems appeared in lower-scoring cases, especially where the paper was too easy or too ambitious for the intended level.
  • Template and format drift was usually minor, but the benchmark still found cases where marks, sections, or exam framing needed tightening.
  • Language control was the most serious issue when it appeared, because a Mother Tongue paper written mostly in English does not meet the learning goal.
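
A wrong-language output of the kind described in the last theme can often be caught with a cheap script check before an evaluator model is involved. The sketch below is a hypothetical heuristic, not the benchmark's method: it measures what share of the letters in a text belong to a target script, using Unicode character names from Python's standard `unicodedata` module. The function names and the 0.5 cutoff are assumptions chosen for illustration.

```python
import unicodedata


def script_share(text: str, script_prefix: str) -> float:
    """Return the fraction of letters whose Unicode name starts with script_prefix.

    Use "CJK" for Chinese ideographs or "TAMIL" for Tamil letters.
    Digits, punctuation, spaces, and combining marks are ignored.
    """
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    hits = sum(
        1 for ch in letters
        if unicodedata.name(ch, "").startswith(script_prefix)
    )
    return hits / len(letters)


def mostly_in_script(text: str, script_prefix: str, minimum: float = 0.5) -> bool:
    """Assumed rule: pass only when at least `minimum` of the letters match."""
    return script_share(text, script_prefix) >= minimum


print(script_share("中文 ab", "CJK"))                      # 0.5
print(mostly_in_script("Answer all questions.", "CJK"))   # False
```

This check cannot separate Malay from English, because both use the Latin script; a Malay paper would need a real language identifier instead.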

Prompting and Workflow Implications

TuitionGoWhere does not publish its original internal prompts. The points below are high-level improvement directions based on the benchmark findings.

  • Add a stronger answer-key requirement for worked solutions and short marking notes.
  • Use stricter language locks for Chinese, Malay, and Tamil outputs.
  • Add a final checklist for exam-paper sections, marks, timing, and formatting before accepting the output.
  • Route low-scoring language-subject outputs to targeted reruns rather than rerunning the full model set.
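
The routing and checklist ideas above could be sketched as follows. Everything here is an assumption made for illustration: the `Report` fields, the route labels, and the marks check are hypothetical and do not describe TuitionGoWhere's internal workflow. Only the 8.0 threshold and the three language subjects come from this whitepaper.

```python
from dataclasses import dataclass

REVIEW_THRESHOLD = 8.0                             # from the quality policy
LANGUAGE_SUBJECTS = {"Chinese", "Malay", "Tamil"}  # subjects named for language locks


@dataclass
class Report:
    """Hypothetical benchmark report for one generated quiz or paper."""
    subject: str
    score: float


def route(report: Report) -> str:
    """Decide what happens to an output based on its score and subject."""
    if report.score >= REVIEW_THRESHOLD:
        return "accept"
    if report.subject in LANGUAGE_SUBJECTS:
        # Rerun only this output with a stricter language lock.
        return "targeted_rerun"
    return "manual_review"


def marks_consistent(question_marks: list[int], stated_total: int) -> bool:
    """One possible checklist item: question marks must add up to the stated total."""
    return sum(question_marks) == stated_total


print(route(Report("Tamil", 7.2)))         # targeted_rerun
print(route(Report("Physics", 7.2)))       # manual_review
print(marks_consistent([10, 15, 25], 50))  # True
```

Keeping the routing rule separate from the checklist makes it possible to tighten either one without touching the other.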
