Qwen3.6 Plus Benchmark Whitepaper
A model-specific whitepaper on Qwen3.6 Plus benchmark results for TuitionGoWhere quizzes and exam papers.
This whitepaper reviews how Qwen3.6 Plus performed in the TuitionGoWhere benchmark. The benchmark used a separate evaluator model to score generated learning materials against language fit, syllabus fit, answer quality, exam format, notation, difficulty, missing-image risk, and other practical checks.
The results should be read as an internal quality signal, not a formal education certification. Under the current quality policy, any output scoring below 8.0 is treated as a review warning and hidden from the main subject pages.
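The review-warning rule can be sketched in a few lines. This is a minimal illustration, not TuitionGoWhere's implementation: the `Report` record, the function names, and the sample scores are hypothetical, and a 0 to 10 scale with a strict "below 8.0" comparison is assumed from the wording of the policy.

```python
from dataclasses import dataclass

REVIEW_THRESHOLD = 8.0  # threshold stated in the quality policy


@dataclass
class Report:
    """Hypothetical record for one evaluated output."""
    subject: str
    score: float  # evaluator score; a 0-10 scale is assumed


def needs_review(report: Report, threshold: float = REVIEW_THRESHOLD) -> bool:
    """Return True when the score is strictly below the threshold."""
    return report.score < threshold


def visible_on_subject_pages(reports, threshold: float = REVIEW_THRESHOLD):
    """Keep only the reports that are not flagged for review."""
    return [r for r in reports if not needs_review(r, threshold)]


# Illustrative values only, not benchmark data.
reports = [
    Report("Secondary 3 Chemistry", 9.6),
    Report("Secondary 4 Tamil", 7.2),
    Report("Secondary 4 Chinese", 8.0),
]
shown = visible_on_subject_pages(reports)
print([r.subject for r in shown])
```

In this sketch a score of exactly 8.0 stays visible, because the policy only mentions scores below 8.0.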
Score Profile
Content Coverage
| Content type | Reports | Average score |
|---|---|---|
| Exam papers | 510 | 9.3 |
| Quizzes | 535 | 9.4 |
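The whitepaper does not report a combined average across both content types. If one is needed, a report-weighted mean of the two rows is a reasonable estimate. The sketch below uses only the figures in the table; because those averages are already rounded to one decimal place, the result is approximate.

```python
# Report counts and average scores copied from the Content Coverage table.
coverage = {
    "Exam papers": (510, 9.3),
    "Quizzes": (535, 9.4),
}

total_reports = sum(count for count, _ in coverage.values())
weighted_sum = sum(count * avg for count, avg in coverage.values())

# Weight each average by its report count so the larger group counts for more.
combined_average = weighted_sum / total_reports

print(total_reports, round(combined_average, 2))  # 1045 9.35
```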
Criterion Scores
| Criterion | Score | Interpretation |
|---|---|---|
| LaTeX and notation handling | 9.9 | Very strong handling of formulas, symbols, and structured notation. |
| Language suitability | 9.8 | Generally level-appropriate language across the tested sample. |
| Syllabus adherence | 9.7 | Usually stayed close to the expected school topic and scope. |
| Clean output | 9.7 | Low rate of obvious tags, stray symbols, or broken structure. |
| Answer explanation quality | 8.4 | The main quality gap was answer depth, especially where working should be shown. |
| Difficulty fit | 8.8 | Some papers needed better calibration to the intended level. |
Main Findings
- The model was strongest in technical and structured subjects. The benchmark showed high averages for Secondary 3 Chemistry and Physics, Secondary 4 Principles of Accounts, O-Level Biology, and A-Level Chemistry H1.
- The best use case appears to be large-scale generation of exam-style quizzes and papers where notation, marking structure, and syllabus boundaries matter.
- The weakest cases were concentrated in language subjects, especially Secondary 4 Chinese and Tamil, where the benchmark caught wrong-language or mixed-language risks.
Subject and Level Fit
- Strong fit: Secondary sciences, O-Level Biology, Principles of Accounts, A-Level Chemistry, and notation-heavy subjects.
- Use with review: Secondary 4 Chinese, Tamil, Malay, and English writing tasks, where language control and marking expectations are stricter.
- Operational note: missing-image flags were common, but many reflect the image-generation pipeline rather than a text-generation failure.
Recurring Comment Themes
The benchmark comments were reviewed for repeated patterns. These themes overlap, so they should not be added together as independent failure counts.
- Answer keys sometimes needed more method-level explanation, not only final answers.
- Difficulty calibration problems appeared in lower-scoring cases, especially where the paper was too easy or too ambitious for the intended level.
- Template and format drift was usually minor, but the benchmark still found cases where marks, sections, or exam framing needed tightening.
- Language control was the most serious issue when it appeared, because a Mother Tongue paper written mostly in English does not meet the learning goal.
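The reason overlapping themes should not be summed can be shown with a small sketch. The theme names, report IDs, and counts below are hypothetical and chosen only to illustrate the point. A report that mentions two themes is counted twice in a naive sum but once in a set union.

```python
# Hypothetical mapping from comment theme to the report IDs that mention it.
# IDs and counts are illustrative, not benchmark data.
theme_reports = {
    "answer_depth": {"r1", "r2", "r3"},
    "difficulty_fit": {"r2", "r4"},
    "format_drift": {"r3"},
    "language_control": {"r5"},
}

# Naive total: adds theme counts as if every mention were a separate report.
naive_total = sum(len(ids) for ids in theme_reports.values())

# Distinct total: each affected report is counted once, however many themes it hits.
distinct_reports = set().union(*theme_reports.values())

print(naive_total, len(distinct_reports))  # 7 5
```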
Prompting and Workflow Implications
TuitionGoWhere does not publish its original internal prompts. The points below are high-level improvement directions based on the benchmark findings.
- Add a stronger answer-key requirement for worked solutions and short marking notes.
- Use stricter language locks for Chinese, Malay, and Tamil outputs.
- Add a final checklist for exam-paper sections, marks, timing, and formatting before accepting the output.
- Route low-scoring language-subject outputs to targeted reruns rather than rerunning the full model set.
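The targeted-rerun idea in the last point can be sketched as a simple filter. This is an assumption-laden illustration rather than the production workflow: the `Result` record, the `targeted_reruns` function, and the sample scores are hypothetical, the language-subject set is taken from the language-lock point above, and the 8.0 cut-off is reused from the quality policy.

```python
from typing import NamedTuple

LANGUAGE_SUBJECTS = {"Chinese", "Malay", "Tamil"}  # subjects named in the language-lock point
REVIEW_THRESHOLD = 8.0  # reused from the quality policy


class Result(NamedTuple):
    """Hypothetical benchmark result for one generated output."""
    level: str
    subject: str
    score: float


def targeted_reruns(results, threshold: float = REVIEW_THRESHOLD):
    """Select only low-scoring language-subject outputs for a rerun."""
    return [
        r for r in results
        if r.subject in LANGUAGE_SUBJECTS and r.score < threshold
    ]


# Illustrative values only, not benchmark data.
results = [
    Result("Secondary 4", "Chinese", 6.5),
    Result("Secondary 4", "Tamil", 8.9),
    Result("Secondary 3", "Physics", 7.4),
    Result("Secondary 4", "Malay", 7.9),
]
queue = targeted_reruns(results)
print([(r.level, r.subject) for r in queue])
```

Low-scoring outputs outside the language subjects, such as the Physics entry here, are left out of this queue. They would still be caught by the general review warning.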