Gemma 4 31B Benchmark Whitepaper

A model-specific whitepaper on Gemma 4 31B benchmark results and recurring evaluator comments.

Singapore

This whitepaper reviews how Gemma 4 31B performed in the TuitionGoWhere benchmark. The benchmark used a separate evaluator model to score generated learning materials against language fit, syllabus fit, answer quality, exam format, notation, difficulty, missing-image risk, and other practical checks.

The results should be read as an internal quality signal, not a formal education certification. A score below 8.0 is treated as a review warning and is hidden from the main subject pages under the current quality policy.
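The quality policy above can be sketched as a simple filter. This is a minimal illustration, not TuitionGoWhere's implementation: the `Report` type, its field names, and the assumption that the whole report is hidden (rather than only its score) are all hypothetical. Only the 8.0 threshold comes from this whitepaper.

```python
from dataclasses import dataclass

# Threshold stated in the whitepaper's quality policy.
REVIEW_THRESHOLD = 8.0


@dataclass
class Report:
    """Hypothetical record for one evaluated quiz or exam paper."""
    report_id: str
    score: float


def needs_review(report: Report, threshold: float = REVIEW_THRESHOLD) -> bool:
    # Strictly below the threshold counts as a review warning.
    return report.score < threshold


def visible_on_subject_pages(reports: list[Report]) -> list[Report]:
    # Assumption: a flagged report is hidden entirely from the main pages.
    return [r for r in reports if not needs_review(r)]
```

Under this reading, a report scoring exactly 8.0 stays visible, because the policy is worded as "below 8.0".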

Score Profile

Total reports             1,038
Quiz and paper reports    1,038
Overall score             9.3
Quiz and paper average    9.3
Below 8.0                 16
Below-8 rate              1.5%
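As a quick arithmetic check, the below-8 rate follows directly from the two counts in the profile. The variable names are illustrative only.

```python
# Counts taken from the Score Profile above.
total_reports = 1038
below_threshold = 16

# Share of reports scoring below 8.0, as a percentage.
below_rate_pct = 100 * below_threshold / total_reports
print(f"{below_rate_pct:.1f}%")  # prints "1.5%"
```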

Content Coverage

Content type    Reports    Average score
Exam papers     515        9.3
Quizzes         523        9.2
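The coverage figures can be cross-checked against the score profile with a report-weighted mean. This is only a sanity check on the published, already-rounded numbers; the unrounded per-report scores are not available here, so the result is approximate.

```python
# (report count, rounded average score) per content type, from the table above.
coverage = {
    "Exam papers": (515, 9.3),
    "Quizzes": (523, 9.2),
}

total = sum(count for count, _ in coverage.values())

# Mean weighted by number of reports, computed from rounded averages.
weighted_mean = sum(count * avg for count, avg in coverage.values()) / total

print(total)                    # report count across both content types
print(round(weighted_mean, 2))  # approximate overall score
```

The counts sum to the 1,038 reports in the profile, and the weighted mean lands at roughly 9.25. The small gap to the published overall score of 9.3 is within what rounding of the two input averages can explain.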

Criterion Scores

Criterion                      Score    Interpretation
Language suitability           9.9      Very strong language fit across most tested materials.
Clean output                   9.9      Very few obvious formatting artefacts.
LaTeX and notation handling    9.9      Excellent notation handling in the benchmark sample.
Syllabus adherence             9.7      Usually stayed close to the intended syllabus topic.
Answer explanation quality     6.2      The clearest weakness was thin or incomplete answer explanation.
Step-by-step answers           8.7      Working was present in many cases, but not always easy to follow.

Main Findings

  • Gemma 4 31B looked strong on surface quality: clean structure, good notation, and clear language.
  • It performed well in Secondary 3 Biology, Chemistry, and Physics; Secondary 4 Combined Science Chemistry; Principles of Accounts; A-Level Chemistry H1; A-Level Physics H1; and O-Level Biology.
  • The main issue was that good questions were sometimes paired with answer keys that were too brief. This is important because students using free practice papers often need the method, not just the final answer.

Subject and Level Fit

  • Strong fit: sciences, accounts, and notation-heavy papers where structure and formula handling matter.
  • Use with review: Secondary 4 Higher Chinese, Tamil, and A-Level Tamil H2, where the benchmark found lower averages and language-control concerns.
  • Use with review: any paper where the answer key is meant to teach the solution process.

Recurring Comment Themes

The benchmark comments were reviewed for repeated patterns. These themes overlap, so they should not be added together as independent failure counts.

  • The strongest repeated theme was answer explanation quality. Generated answers often needed clearer working, reasoning, or marking guidance.
  • Some lower-scoring cases involved difficulty mismatch, where a paper was not well tuned to the expected student level.
  • Template drift appeared when question style, section structure, or marking expectations were not close enough to exam-paper conventions.
  • Language-mixing comments were less frequent than answer-depth comments, but language mixing remained a serious failure mode in non-English subjects.
  • Notation handling was mostly strong, but rendering should still be checked because a small LaTeX error can make a good mathematics or science question hard to use.
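The caution about overlapping themes can be illustrated with a small tally. The report IDs and theme tags below are invented for illustration and are not benchmark data. The point is only that one report can carry several themes, so summing per-theme counts overstates the number of affected reports.

```python
from collections import Counter

# Hypothetical theme tags per low-scoring report (not real benchmark data).
tagged_reports = {
    "report-a": {"answer_depth", "difficulty_mismatch"},
    "report-b": {"answer_depth"},
    "report-c": {"language_mixing", "answer_depth", "template_drift"},
}

# How many reports mention each theme.
theme_counts = Counter(tag for tags in tagged_reports.values() for tag in tags)

sum_of_theme_counts = sum(theme_counts.values())  # counts reports more than once
distinct_reports = len(tagged_reports)            # actual number of reports

print(theme_counts["answer_depth"], sum_of_theme_counts, distinct_reports)
# prints "3 6 3"
```

Here six theme mentions come from only three reports, which is why the themes are best read as overlapping labels rather than independent failure counts.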

Prompting and Workflow Implications

TuitionGoWhere does not publish its original internal prompts. The points below are high-level improvement directions based on the benchmark findings.

  • Require step-by-step answer keys where the task involves calculation, source use, grammar choice, or reasoning.
  • Separate question generation from answer-key generation so the answer section receives enough attention.
  • Add a strict language check before accepting Chinese, Malay, or Tamil outputs.
  • Keep the notation instructions, but add a final renderability check for equations and symbols.
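The last two directions above could be sketched as lightweight pre-acceptance checks. These are rough heuristics under stated assumptions, not TuitionGoWhere's pipeline: the function names, the Unicode ranges chosen, and any acceptance threshold are hypothetical. A script-based check cannot separate Malay from English because both use Latin script, so Malay would need a different method, and a delimiter count is no substitute for actually rendering the LaTeX.

```python
# Assumed Unicode blocks: CJK Unified Ideographs and Tamil.
SCRIPT_RANGES = {
    "chinese": (0x4E00, 0x9FFF),
    "tamil": (0x0B80, 0x0BFF),
}


def script_ratio(text: str, language: str) -> float:
    """Fraction of alphabetic characters that fall in the expected script."""
    low, high = SCRIPT_RANGES[language]
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    in_script = sum(low <= ord(ch) <= high for ch in letters)
    return in_script / len(letters)


def looks_renderable(latex: str) -> bool:
    """Cheap check: braces balance and unescaped '$' delimiters pair up."""
    depth = 0
    dollars = 0
    escaped = False
    for ch in latex:
        if escaped:
            # Character after a backslash (e.g. \$ or \{) is not a delimiter.
            escaped = False
            continue
        if ch == "\\":
            escaped = True
        elif ch == "$":
            dollars += 1
        elif ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth < 0:
                return False  # closing brace with no opener
    return depth == 0 and dollars % 2 == 0
```

A workflow might, for example, hold back a Chinese paper whose `script_ratio` falls below some chosen cutoff, or any paper where `looks_renderable` returns `False`, and route it to human review. Legitimate Latin-script content such as formulas or units would lower the ratio, so the cutoff would need tuning per subject.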
