
Claude Sonnet 4 Benchmark Whitepaper

A model-specific whitepaper on Claude Sonnet 4 benchmark results across legacy and generated TuitionGoWhere materials.

Singapore

This whitepaper reviews how Claude Sonnet 4 performed in the TuitionGoWhere benchmark. The benchmark used a separate evaluator model to score generated learning materials against language fit, syllabus fit, answer quality, exam format, notation, difficulty, missing-image risk, and other practical checks.

The results should be read as an internal quality signal, not a formal education certification. Under the current quality policy, any material that scores below 8.0 is flagged for review and hidden from the main subject pages.
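The threshold rule can be sketched in a few lines of Python. This is a minimal illustration, assuming each report carries a single numeric score; the names `REVIEW_THRESHOLD`, `needs_review`, and `visible_reports`, and the sample report IDs, are hypothetical and do not describe TuitionGoWhere's actual implementation.

```python
# Hypothetical sketch of the quality policy described above.
REVIEW_THRESHOLD = 8.0  # scores strictly below this trigger a review warning


def needs_review(score: float) -> bool:
    """Return True when a report falls below the review threshold."""
    return score < REVIEW_THRESHOLD


def visible_reports(scores: dict) -> list:
    """Return the IDs of reports that may appear on the main subject pages."""
    return [report_id for report_id, score in scores.items() if not needs_review(score)]


# Illustrative scores only; these are not benchmark data.
sample = {"quiz-a": 9.1, "paper-b": 7.6, "quiz-c": 8.0}
print(visible_reports(sample))  # ['quiz-a', 'quiz-c']
```

The comparison is strict, following the "below 8.0" wording: a report scoring exactly 8.0 stays visible in this sketch.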

Score Profile

Total reports: 1,758
Quiz and paper reports: 874
Overall score: 9.1
Quiz and paper average: 9.1
Below 8.0: 15 quiz or paper reports
Below-8.0 rate: 1.7% of quiz and paper reports
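The below-8 rate follows directly from the two counts above (15 flagged reports out of 874 quiz and paper reports). A short Python check, using only those figures; the function name `below_threshold_rate` is illustrative and not part of any published tooling:

```python
def below_threshold_rate(flagged: int, total: int) -> float:
    """Return the share of flagged reports as a percentage of the total."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * flagged / total


rate = below_threshold_rate(flagged=15, total=874)
print(f"{rate:.1f}%")  # 1.7%
```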

Content Coverage

Content type         | Reports | Average score
Audio lesson scripts | 594     | 9.2
Cheatsheets          | 145     | 8.9
Exam papers          | 483     | 9.1
Parent guides        | 145     | 9.2
Quizzes              | 391     | 9.0
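The headline figures in the Score Profile can be cross-checked against this table by weighting each content type's average by its report count. The Python sketch below does that; because the per-type averages are already rounded to one decimal place, it approximates the published scores rather than reproducing the underlying calculation. The helper name `weighted_average` is an assumption for illustration.

```python
# Report counts and average scores copied from the Content Coverage table.
coverage = {
    "Audio lesson scripts": (594, 9.2),
    "Cheatsheets": (145, 8.9),
    "Exam papers": (483, 9.1),
    "Parent guides": (145, 9.2),
    "Quizzes": (391, 9.0),
}


def weighted_average(rows):
    """Return (total reports, report-weighted mean score) for (count, score) pairs."""
    rows = list(rows)
    total = sum(count for count, _ in rows)
    mean = sum(count * score for count, score in rows) / total
    return total, mean


all_total, all_mean = weighted_average(coverage.values())
qp_total, qp_mean = weighted_average([coverage["Exam papers"], coverage["Quizzes"]])

print(all_total, round(all_mean, 1))  # 1758 9.1
print(qp_total, round(qp_mean, 1))    # 874 9.1
```

Both results agree with the reported totals (1,758 and 874) and with the reported 9.1 averages.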

Criterion Scores

Criterion                  | Score | Interpretation
Clean output               | 9.9   | Very strong at readable, polished structure.
Parent guide syllabus fit  | 9.8   | Strong alignment in parent-facing material.
Syllabus adherence         | 9.5   | Generally stayed within expected topic scope.
Audio learning value       | 9.5   | Audio scripts were usually useful and understandable.
Answer explanation quality | 8.0   | Answer depth was weaker than prose quality.
Template adherence         | 8.4   | Exam-paper structure needed stronger constraints.

Main Findings

  • Claude Sonnet 4 was strongest on explanatory and guide-style content, including parent guides and audio lesson scripts, which both averaged 9.2.
  • The model also performed well on Primary 1 English, A-Level Biology H2, O-Level Principles of Accounts, O-Level Biology, Secondary 4 Pure Biology, and Primary 5 Higher Chinese in the tested sample.
  • Its main weakness was exam-paper generation under strict format requirements. The model writes useful educational content, but it needs stronger format control to match exam-paper templates.

Subject and Level Fit

  • Strong fit: parent guides, audio scripts, cheatsheets, Biology, Principles of Accounts, and prose-heavy support materials.
  • Use with review: Primary 2 Chinese, Primary 3 Chinese, Tamil, and Higher Tamil, where language-level and language-control issues appeared.
  • Use with review: formal exam papers, especially when marks, timing, and section structure must match past-year formats closely.

Recurring Comment Themes

The benchmark comments were reviewed for repeated patterns. These themes overlap, so they should not be added together as independent failure counts.

  • Evaluator comments often distinguished prose quality from exam fit: a page could read well yet still fail to match the expected paper template.
  • Answer explanations needed more step-by-step detail in some quiz and paper outputs.
  • Cheatsheets sometimes needed sharper three-point summaries rather than repeated general advice.
  • Audio scripts were usually strong, but some needed simpler spoken phrasing for younger students.
  • Language-subject issues were more visible in Primary Chinese and Tamil cases than in English-medium subjects.

Prompting and Workflow Implications

TuitionGoWhere does not publish its original internal prompts. The points below are high-level improvement directions based on the benchmark findings.

  • Use Claude Sonnet 4 mainly where explanation, guide tone, or spoken lesson structure matters.
  • For exam papers, add a strict template scaffold before generation begins.
  • Add worked-answer requirements so the answer key teaches the method, not only the result.
  • Use a stronger target-language rule for Chinese, Malay, Tamil, and Higher Mother Tongue resources.
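As one possible reading of the template-scaffold and worked-answer points above, the Python sketch below renders a plain-text skeleton to include in a prompt and then checks a generated paper against the same template. It is not TuitionGoWhere's prompt or tooling: `SectionSpec`, `TEMPLATE`, `render_scaffold`, `check_paper`, and the section names and mark allocations are all hypothetical, and the post-generation check is an addition to the scaffold idea rather than something the benchmark describes.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class SectionSpec:
    """One section of a hypothetical exam-paper template."""
    name: str
    questions: int
    marks: int


# Illustrative template only; real past-year formats vary by subject and level.
TEMPLATE = [
    SectionSpec("Section A", questions=10, marks=20),
    SectionSpec("Section B", questions=4, marks=30),
]


def render_scaffold(template: List[SectionSpec]) -> str:
    """Build a plain-text skeleton to place in the prompt before generation."""
    lines = [f"Total marks: {sum(s.marks for s in template)}"]
    for spec in template:
        lines.append(f"{spec.name}: {spec.questions} questions, {spec.marks} marks")
    lines.append("Every answer must show its working, not only the final result.")
    return "\n".join(lines)


def check_paper(paper: Dict[str, List[int]], template: List[SectionSpec]) -> List[str]:
    """Compare a generated paper (section name -> marks per question) with the template."""
    problems = []
    for spec in template:
        marks = paper.get(spec.name)
        if marks is None:
            problems.append(f"{spec.name}: missing")
            continue
        if len(marks) != spec.questions:
            problems.append(f"{spec.name}: expected {spec.questions} questions, found {len(marks)}")
        if sum(marks) != spec.marks:
            problems.append(f"{spec.name}: expected {spec.marks} marks, found {sum(marks)}")
    expected = {spec.name for spec in template}
    problems.extend(f"{name}: not in template" for name in paper if name not in expected)
    return problems


print(render_scaffold(TEMPLATE))
draft = {"Section A": [2] * 10, "Section B": [10, 10, 5]}
print(check_paper(draft, TEMPLATE))
```

In this example the draft is flagged twice for Section B (three questions instead of four, 25 marks instead of 30), which is the kind of mismatch a reviewer would otherwise need to catch by hand. Timing and target-language rules are not modelled here.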
