Claude Sonnet 4 Benchmark Whitepaper
A model-specific whitepaper on Claude Sonnet 4 benchmark results across legacy and generated TuitionGoWhere materials.
This whitepaper reviews how Claude Sonnet 4 performed in the TuitionGoWhere benchmark. The benchmark used a separate evaluator model to score generated learning materials against language fit, syllabus fit, answer quality, exam format, notation, difficulty, missing-image risk, and other practical checks.
The results should be read as an internal quality signal, not a formal education certification. A score below 8.0 is treated as a review warning and is hidden from the main subject pages under the current quality policy.
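
The visibility rule above can be sketched as a simple threshold check. This is a minimal illustration only: the names `REVIEW_THRESHOLD`, `needs_review`, and `is_visible` are hypothetical, and TuitionGoWhere's actual implementation is not published. Only the 8.0 cutoff comes from the stated policy.

```python
# Hypothetical names; only the 8.0 cutoff comes from the stated quality policy.
REVIEW_THRESHOLD = 8.0


def needs_review(score: float) -> bool:
    """Return True when a score falls strictly below the review threshold."""
    return score < REVIEW_THRESHOLD


def is_visible(score: float) -> bool:
    """A report stays on the main subject pages unless it triggers a review warning."""
    return not needs_review(score)
```

Read this way, a score of exactly 8.0 is not "below 8.0", so it would not trigger a warning and would remain visible.
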
Score Profile
Content Coverage
| Content type | Reports | Average score |
|---|---|---|
| Audio lesson scripts | 594 | 9.2 |
| Cheatsheets | 145 | 8.9 |
| Exam papers | 483 | 9.1 |
| Parent guides | 145 | 9.2 |
| Quizzes | 391 | 9.0 |
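
Because the content types have very different report counts, a single overall figure is better estimated with a report-weighted mean than with a plain mean of the five averages. The sketch below derives that figure from the table; it is not a number reported by the benchmark, and since the per-type averages are already rounded to one decimal place, the result is approximate. The `weighted_average` helper is a hypothetical name.

```python
# Report counts and average scores copied from the Content Coverage table.
coverage = {
    "Audio lesson scripts": (594, 9.2),
    "Cheatsheets": (145, 8.9),
    "Exam papers": (483, 9.1),
    "Parent guides": (145, 9.2),
    "Quizzes": (391, 9.0),
}


def weighted_average(rows: dict[str, tuple[int, float]]) -> float:
    """Average score weighted by the number of reports per content type."""
    total = sum(count for count, _ in rows.values())
    weighted_sum = sum(count * score for count, score in rows.values())
    return weighted_sum / total


total_reports = sum(count for count, _ in coverage.values())
overall = weighted_average(coverage)
print(f"{total_reports} reports, weighted average {overall:.2f}")
# prints "1758 reports, weighted average 9.10"
```
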
Criterion Scores
| Criterion | Score | Interpretation |
|---|---|---|
| Clean output | 9.9 | Very strong at readable, polished structure. |
| Parent guide syllabus fit | 9.8 | Strong alignment in parent-facing material. |
| Syllabus adherence | 9.5 | Generally stayed within expected topic scope. |
| Audio learning value | 9.5 | Audio scripts were usually useful and understandable. |
| Answer explanation quality | 8.0 | Answer depth was weaker than prose quality. |
| Template adherence | 8.4 | Exam-paper structure needed stronger constraints. |
Main Findings
- Claude Sonnet 4 scored highest on explanatory and guide-style content: parent guides and audio lesson scripts both averaged 9.2.
- The model also performed well on Primary 1 English, A-Level Biology H2, O-Level Principles of Accounts, O-Level Biology, Secondary 4 Pure Biology, and Primary 5 Higher Chinese in the tested sample.
- Its main weakness was strict exam-paper generation. The model can write useful educational content, but it needs stronger format control to match exam-paper templates (template adherence scored 8.4).
Subject and Level Fit
- Strong fit: parent guides, audio scripts, cheatsheets, Biology, Principles of Accounts, and prose-heavy support materials.
- Use with review: Primary 2 Chinese, Primary 3 Chinese, Tamil, and Higher Tamil, where language-level and language-control issues appeared.
- Use with review: formal exam papers, especially when marks, timing, and section structure must match past-year formats closely.
Recurring Comment Themes
The benchmark comments were reviewed for repeated patterns. These themes overlap, so they should not be added together as independent failure counts.
- Evaluator comments often separated prose quality from exam fit: a page could read well yet still fail to match the expected paper template.
- Answer explanations needed more step-by-step detail in some quiz and paper outputs.
- Cheatsheets sometimes needed sharper three-point summaries rather than repeated general advice.
- Audio scripts were usually strong, but some needed simpler spoken phrasing for younger students.
- Language-subject issues were more visible in Primary Chinese and Tamil cases than in English-medium subjects.
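
The caution about overlapping themes can be shown with a small example. The report IDs and theme labels below are hypothetical and are not benchmark data; the point is only that one report can carry several themes, so summing per-theme counts overstates the number of affected reports.

```python
# Hypothetical report IDs tagged by theme; not real benchmark data.
theme_reports = {
    "template_mismatch": {"r1", "r2", "r3"},
    "thin_explanations": {"r2", "r3", "r4"},
    "language_level": {"r4", "r5"},
}

# Summing per-theme counts counts r2, r3, and r4 twice.
naive_total = sum(len(ids) for ids in theme_reports.values())

# The set union counts each affected report exactly once.
distinct_reports = set().union(*theme_reports.values())

print(naive_total, len(distinct_reports))  # prints "8 5"
```
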
Prompting and Workflow Implications
TuitionGoWhere does not publish its original internal prompts. The points below are high-level improvement directions based on the benchmark findings.
- Use Claude Sonnet 4 mainly where explanation, guide tone, or spoken lesson structure matters.
- For exam papers, add a strict template scaffold before generation begins.
- Add worked-answer requirements so the answer key teaches the method, not only the result.
- Use a stronger target-language rule for Chinese, Malay, Tamil, and Higher Mother Tongue resources.
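
One way to read the "strict template scaffold" suggestion is to fix the paper structure in data, render it into the instructions before generation, and check the parsed output against it afterwards. The sketch below assumes that approach. The section names, marks, and the `Section`, `scaffold_prompt`, and `template_errors` names are all hypothetical; they do not reflect TuitionGoWhere's unpublished prompts or any real past-year format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Section:
    name: str
    marks: int


# Hypothetical scaffold; real section names and marks depend on the target paper.
SCAFFOLD = [
    Section("Section A: Multiple Choice", 20),
    Section("Section B: Structured Questions", 50),
    Section("Section C: Free Response", 30),
]


def scaffold_prompt(scaffold: list[Section], duration_minutes: int) -> str:
    """Render the fixed structure as instructions to place ahead of the request."""
    total = sum(s.marks for s in scaffold)
    lines = [f"Duration: {duration_minutes} minutes. Total marks: {total}."]
    lines += [f"- {s.name} [{s.marks} marks]" for s in scaffold]
    lines.append("Use exactly these sections, in this order, with these marks.")
    return "\n".join(lines)


def template_errors(scaffold: list[Section], generated: list[Section]) -> list[str]:
    """Compare a parsed generated paper against the scaffold and list mismatches."""
    errors = []
    if len(generated) != len(scaffold):
        errors.append(f"expected {len(scaffold)} sections, got {len(generated)}")
    for expected, actual in zip(scaffold, generated):
        if actual.name != expected.name:
            errors.append(f"section name {actual.name!r} should be {expected.name!r}")
        if actual.marks != expected.marks:
            errors.append(
                f"{expected.name}: {actual.marks} marks, expected {expected.marks}"
            )
    return errors
```

A paper whose `template_errors` list is non-empty could be regenerated or routed to manual review. Parsing generated text into `Section` objects is left out here because it depends on the output format in use.
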