
DeepSeek V4 Pro Benchmark Whitepaper

A model-specific whitepaper on DeepSeek V4 Pro benchmark results for TuitionGoWhere generated practice resources.

Singapore

This whitepaper reviews how DeepSeek V4 Pro performed in the TuitionGoWhere benchmark. The benchmark used a separate evaluator model to score generated learning materials against language fit, syllabus fit, answer quality, exam format, notation, difficulty, missing-image risk, and other practical checks.

The results should be read as an internal quality signal, not a formal education certification. Under the current quality policy, a resource that scores below 8.0 is flagged for review and hidden from the main subject pages.
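The 8.0 threshold described above can be sketched as a simple filter. This is a minimal illustration only: the names `REVIEW_THRESHOLD` and `needs_review` and the sample records are hypothetical, not TuitionGoWhere's actual implementation.

```python
REVIEW_THRESHOLD = 8.0  # threshold stated in the quality policy above

def needs_review(score: float) -> bool:
    """Return True when a report scores strictly below the threshold."""
    return score < REVIEW_THRESHOLD

# Hypothetical sample reports, for illustration only.
reports = [
    {"id": "r1", "score": 9.4},
    {"id": "r2", "score": 7.9},
    {"id": "r3", "score": 8.0},
]

# Reports at or above the threshold stay visible; the rest are flagged.
visible = [r["id"] for r in reports if not needs_review(r["score"])]
flagged = [r["id"] for r in reports if needs_review(r["score"])]
print("visible:", visible)
print("flagged:", flagged)
```

Note that a score of exactly 8.0 is not flagged in this sketch, because the policy text says "below 8.0".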

Score Profile

Metric                    Value
Total reports             1,034
Quiz and paper reports    1,034
Overall score             9.3
Quiz and paper average    9.3
Below 8.0                 4
Below-8 rate              0.4%
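The below-8 rate follows directly from the two counts in the score profile. The short check below reproduces it; the variable names are illustrative.

```python
total_reports = 1034    # total reports in the score profile
below_threshold = 4     # reports scoring below 8.0

# Share of reports below the threshold, as a percentage.
rate_percent = 100 * below_threshold / total_reports
print(f"Below-8 rate: {rate_percent:.1f}%")
```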

Content Coverage

Content type    Reports    Average score
Exam papers     516        9.4
Quizzes         518        9.3
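As a consistency check, the overall score can be approximated as a report-weighted mean of the two content-type averages. This sketch assumes the overall score is a simple per-report average, which the whitepaper does not state, and it works from the already rounded one-decimal averages, so the result is only approximate.

```python
# (report count, average score) per content type, from the table above.
coverage = {
    "Exam papers": (516, 9.4),
    "Quizzes": (518, 9.3),
}

total_reports = sum(count for count, _ in coverage.values())
# Weight each content-type average by its number of reports.
weighted_mean = sum(count * avg for count, avg in coverage.values()) / total_reports

print(total_reports)
print(round(weighted_mean, 1))
```

The report counts sum to the 1,034 total, and the weighted mean rounds to the reported overall score of 9.3.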

Criterion Scores

Criterion                      Score    Interpretation
Language suitability           9.9      Very strong level-appropriate wording in the tested sample.
Syllabus adherence             9.8      Strong subject and syllabus alignment.
Clean output                   9.6      Generally clean structure with few artefact issues.
LaTeX and notation handling    9.6      Reliable handling of mathematical and scientific notation.
Difficulty fit                 8.9      Still needs checks for level fit and exam-time realism.
Template adherence             9.0      Good but not perfect exam-template control.

Main Findings

  • DeepSeek V4 Pro had the lowest below-8 rate among the models with large quiz and paper samples in this benchmark.
  • It performed especially well in O-Level Principles of Accounts, Secondary 3 Biology, Secondary 3 Chemistry, Secondary 3 Physics, A-Level Economics H2, and A-Level Chemistry H1.
  • Its main risk was not basic readability. The review issues were more often about difficulty, exam style, and whether the generated paper fully matched the intended paper type.

Subject and Level Fit

  • Strong fit: Secondary sciences, O-Level science subjects, Principles of Accounts, A-Level Economics, and A-Level sciences.
  • Use with review: Secondary 4 Higher Chinese, Chinese, Tamil, and Malay, where non-English language control needs closer checking.
  • Use with review: long exam papers that require exact timing, section structure, and mark allocation.

Recurring Comment Themes

The benchmark comments were reviewed for repeated patterns. These themes overlap, so they should not be added together as independent failure counts.

  • Answer depth was usually acceptable, but some comments still asked for clearer working or marking guidance.
  • Difficulty and timeframe checks remain useful because a technically correct question can still be too hard or too long.
  • Missing visuals were common in graph, diagram, and science-context questions, but this should be treated as a pipeline follow-up when the text is otherwise sound.
  • Language issues were rare compared with other models, but still serious in the few cases where they appeared.
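The caution about overlapping themes can be illustrated with a small tally. The comments and theme labels below are invented for illustration and are not benchmark data; the point is only that one comment can carry several themes, so per-theme counts sum to more than the number of comments.

```python
from collections import Counter

# Hypothetical evaluator comments, each tagged with one or more themes.
comments = [
    {"id": 1, "themes": {"difficulty", "missing_visuals"}},
    {"id": 2, "themes": {"answer_depth"}},
    {"id": 3, "themes": {"difficulty", "answer_depth", "missing_visuals"}},
]

# Count how many comments mention each theme.
theme_counts = Counter(theme for c in comments for theme in c["themes"])

total_theme_tags = sum(theme_counts.values())   # counts overlaps repeatedly
comments_with_issues = len(comments)            # each comment counted once

print(dict(theme_counts))
print(total_theme_tags, comments_with_issues)
```

Here three comments produce six theme tags, which is why theme counts should not be read as independent failure counts.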

Prompting and Workflow Implications

TuitionGoWhere does not publish its original internal prompts. The points below are high-level improvement directions based on the benchmark findings.

  • Add stronger checks for timing and total marks when generating full papers.
  • Ask the model to label each section and mark allocation before writing the questions.
  • Keep the existing notation instructions, but add a rendering check for equations and units.
  • Use a separate review pass for non-English subjects instead of relying only on the overall score.
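One way to act on the timing and total-marks points above is a structural check on the section plan before any questions are written. The sketch below is an assumption about how such a check might look; `Section`, `check_paper`, and the sample plan are hypothetical and do not describe TuitionGoWhere's pipeline.

```python
from dataclasses import dataclass

@dataclass
class Section:
    label: str      # e.g. "Section A"
    marks: int      # marks allocated to the section
    minutes: int    # estimated time to complete the section

def check_paper(sections, expected_marks, time_limit_minutes):
    """Return a list of human-readable problems; empty means the plan passes."""
    problems = []
    total_marks = sum(s.marks for s in sections)
    total_minutes = sum(s.minutes for s in sections)
    if total_marks != expected_marks:
        problems.append(f"total marks {total_marks} != expected {expected_marks}")
    if total_minutes > time_limit_minutes:
        problems.append(f"estimated time {total_minutes} min exceeds limit {time_limit_minutes} min")
    return problems

# Hypothetical section plan that fails both checks.
plan = [
    Section("Section A", marks=30, minutes=35),
    Section("Section B", marks=40, minutes=50),
    Section("Section C", marks=20, minutes=45),
]
issues = check_paper(plan, expected_marks=100, time_limit_minutes=120)
for issue in issues:
    print(issue)
```

Running the check on the plan first, and only then generating questions, keeps mark allocation and timing errors from propagating into the full paper.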
