Claude Sonnet 4 Benchmark Whitepaper

A model-specific whitepaper on Claude Sonnet 4 benchmark results across legacy and generated TuitionGoWhere materials.

Singapore, June 2026

This whitepaper reviews how Claude Sonnet 4 performed in the TuitionGoWhere benchmark. The benchmark used a separate evaluator model to score generated learning materials against language fit, syllabus fit, answer quality, exam format, notation, difficulty, missing-image risk, and other practical checks.

The results should be read as an internal quality signal, not a formal education certification. Under the current quality policy, material scoring below 8.0 is flagged for review and hidden from the main subject pages.
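The threshold rule can be sketched in a few lines. This is a minimal illustration, not TuitionGoWhere's implementation: the function name and the sample report names and scores are hypothetical, and only the 8.0 cut-off comes from the text. Note that the rule is "below 8.0", so a score of exactly 8.0 stays visible.

```python
REVIEW_THRESHOLD = 8.0  # cut-off stated in the quality policy


def is_hidden(score: float, threshold: float = REVIEW_THRESHOLD) -> bool:
    """Return True when a score falls strictly below the review threshold."""
    return score < threshold


# Hypothetical report scores, for illustration only (not benchmark data).
sample_scores = {"report-a": 9.2, "report-b": 8.0, "report-c": 7.9}

# Keep only the reports that would remain on the main subject pages.
visible = [name for name, score in sample_scores.items() if not is_hidden(score)]
print(visible)
```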

Score Profile

Metric	Value
Total reports	1,758
Quiz and paper reports	874
Overall score	9.1
Quiz and paper average	9.1
Quiz or paper reports below 8.0	15
Below-8.0 rate (quizzes and papers)	1.7%
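The below-8.0 rate follows directly from the two counts reported above. A short check, using only those figures (the variable names are illustrative):

```python
quiz_and_paper_reports = 874  # quiz and paper reports in the benchmark
below_threshold = 15          # quiz or paper reports scoring below 8.0

# Share of quiz and paper reports that fall under the review threshold.
below_rate = below_threshold / quiz_and_paper_reports
print(f"{below_rate:.1%}")  # 1.7%
```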

Content Coverage

Content type	Reports	Average score
Audio lesson scripts	594	9.2
Cheatsheets	145	8.9
Exam papers	483	9.1
Parent guides	145	9.2
Quizzes	391	9.0
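As a consistency check, the headline figures can be approximately recomputed from this table as report-weighted averages. The sketch below assumes the headline scores are report-weighted means, which the whitepaper does not state. Because the per-type averages are already rounded to one decimal place, the recomputed values are only approximate.

```python
# (reports, average score) per content type, copied from the table above.
coverage = {
    "Audio lesson scripts": (594, 9.2),
    "Cheatsheets": (145, 8.9),
    "Exam papers": (483, 9.1),
    "Parent guides": (145, 9.2),
    "Quizzes": (391, 9.0),
}


def weighted_average(rows):
    """Average score weighted by the number of reports in each row."""
    rows = list(rows)
    total_reports = sum(count for count, _ in rows)
    return sum(count * score for count, score in rows) / total_reports


total_reports = sum(count for count, _ in coverage.values())
overall = weighted_average(coverage.values())
quiz_and_paper = weighted_average([coverage["Exam papers"], coverage["Quizzes"]])

print(total_reports)             # 1758
print(round(overall, 1))         # 9.1
print(round(quiz_and_paper, 1))  # 9.1
```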

Criterion Scores

Criterion	Score	Interpretation
Clean output	9.9	Very strong at readable, polished structure.
Parent guide syllabus fit	9.8	Strong alignment in parent-facing material.
Syllabus adherence	9.5	Generally stayed within expected topic scope.
Audio learning value	9.5	Audio scripts were usually useful and understandable.
Answer explanation quality	8.0	Answer depth was weaker than prose quality.
Template adherence	8.4	Exam-paper structure needed stronger constraints.

Main Findings

Claude Sonnet 4 looked strongest on explanatory and guide-style content, including parent guides and audio lesson scripts.
The model also performed well on Primary 1 English, A-Level Biology H2, O-Level Principles of Accounts, O-Level Biology, Secondary 4 Pure Biology, and Primary 5 Higher Chinese in the tested sample.
Its main weakness was exam-paper generation under strict format requirements. The model can write useful educational content, but it needs stronger format control to match exam-paper templates.

Subject and Level Fit

Strong fit: parent guides, audio scripts, cheatsheets, Biology, Principles of Accounts, and prose-heavy support materials.
Use with review: Primary 2 Chinese, Primary 3 Chinese, Tamil, and Higher Tamil, where language-level and language-control issues appeared.
Use with review: formal exam papers, especially when marks, timing, and section structure must match past-year formats closely.

Recurring Comment Themes

The benchmark comments were reviewed for repeated patterns. These themes overlap, so they should not be added together as independent failure counts.

Comments often distinguished good prose from exam fit: a page could read well yet still not match the expected paper template.
Answer explanations needed more step-by-step detail in some quiz and paper outputs.
Cheatsheets sometimes needed sharper three-point summaries rather than repeated general advice.
Audio scripts were usually strong, but some needed simpler spoken phrasing for younger students.
Language-subject issues were more visible in Primary Chinese and Tamil cases than in English-medium subjects.
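The warning about overlapping themes can be illustrated with a small set example. The report IDs and theme assignments below are entirely hypothetical and are not benchmark data. They only show why summing per-theme counts overstates the number of affected reports when one report carries several themes.

```python
# Hypothetical report IDs tagged with each theme (illustration only).
themes = {
    "template mismatch": {"r1", "r2", "r3"},
    "thin answer explanations": {"r2", "r3", "r4"},
    "language control": {"r4", "r5"},
}

# Adding per-theme counts double-counts reports that carry several themes.
naive_total = sum(len(report_ids) for report_ids in themes.values())

# A set union counts each affected report once.
distinct_reports = len(set().union(*themes.values()))

print(naive_total, distinct_reports)  # 8 5
```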

Prompting and Workflow Implications

TuitionGoWhere does not publish its original internal prompts. The points below are high-level improvement directions based on the benchmark findings.

Use Claude Sonnet 4 mainly where explanation, guide tone, or spoken lesson structure matters.
For exam papers, add a strict template scaffold before generation begins.
Add worked-answer requirements so the answer key teaches the method, not only the result.
Use a stronger target-language rule for Chinese, Malay, Tamil, and Higher Mother Tongue resources.
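One way to picture the template-scaffold suggestion is sketched below. It is not TuitionGoWhere's workflow or prompt. The section names, marks, and helper functions are hypothetical placeholders, and a real scaffold would be taken from the relevant past-year format. The sketch renders a fixed outline that could be supplied before generation, then checks a generated draft against it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Section:
    name: str
    marks: int


# Hypothetical template; real values would come from the past-year paper format.
TEMPLATE = [Section("Section A", 20), Section("Section B", 30), Section("Section C", 50)]


def render_scaffold(template):
    """Render the fixed outline that a generated paper is expected to follow."""
    lines = [f"{section.name} [{section.marks} marks]" for section in template]
    lines.append(f"Total: {sum(section.marks for section in template)} marks")
    return "\n".join(lines)


def template_problems(generated, template=TEMPLATE):
    """List the mismatches between a generated paper outline and the template."""
    problems = []
    if len(generated) != len(template):
        problems.append(f"expected {len(template)} sections, got {len(generated)}")
    for want, got in zip(template, generated):
        if got.name != want.name:
            problems.append(f"section name {got.name!r} should be {want.name!r}")
        if got.marks != want.marks:
            problems.append(f"{want.name}: expected {want.marks} marks, got {got.marks}")
    return problems


# Hypothetical draft outline with one marks mismatch in Section B.
draft = [Section("Section A", 20), Section("Section B", 25), Section("Section C", 50)]

print(render_scaffold(TEMPLATE))
print(template_problems(draft))  # ['Section B: expected 30 marks, got 25']
```

Checking structure separately from prose quality mirrors the benchmark finding that a page can read well and still miss the expected paper template.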
