DeepSeek V4 Pro Benchmark Whitepaper

A model-specific whitepaper on DeepSeek V4 Pro benchmark results for TuitionGoWhere generated practice resources.

Singapore, June 2026

This whitepaper reviews how DeepSeek V4 Pro performed in the TuitionGoWhere benchmark. The benchmark used a separate evaluator model to score generated learning materials against language fit, syllabus fit, answer quality, exam format, notation, difficulty, missing-image risk, and other practical checks.

The results should be read as an internal quality signal, not a formal education certification. Any resource scoring below 8.0 is flagged with a review warning and hidden from the main subject pages under the current quality policy.
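The threshold rule can be sketched in a few lines. This is a minimal illustration, not TuitionGoWhere's actual pipeline: the report dictionaries, the `score` key, and the function names are hypothetical, and only the 8.0 cut-off comes from the policy described above.

```python
REVIEW_THRESHOLD = 8.0  # cut-off stated in the whitepaper


def needs_review(score: float, threshold: float = REVIEW_THRESHOLD) -> bool:
    """Return True when a score falls strictly below the review threshold."""
    return score < threshold


def visible_reports(reports: list[dict]) -> list[dict]:
    """Keep only reports that would stay on the main subject pages."""
    return [r for r in reports if not needs_review(r["score"])]


# Hypothetical sample data, for illustration only.
sample = [
    {"id": "quiz-a", "score": 9.4},
    {"id": "paper-b", "score": 7.9},
    {"id": "quiz-c", "score": 8.0},
]
shown = [r["id"] for r in visible_reports(sample)]
print(shown)  # ['quiz-a', 'quiz-c']
```

Because the policy says "below 8.0", the sketch uses a strict comparison, so a score of exactly 8.0 stays visible.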

Score Profile

Metric	Value
Total reports	1,034
Quiz and paper reports	1,034
Overall score	9.3
Quiz and paper average	9.3
Reports below 8.0	4
Below-8 rate	0.4%
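The below-8 rate follows directly from the two counts in the score profile. The short check below reproduces it; the function name is hypothetical, and the inputs are the figures reported above.

```python
def below_threshold_rate(flagged: int, total: int) -> float:
    """Percentage of reports scoring below the review threshold."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * flagged / total


rate = below_threshold_rate(flagged=4, total=1034)
print(f"{rate:.1f}%")  # 0.4%
```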

Content Coverage

Content type	Reports	Average score
Exam papers	516	9.4
Quizzes	518	9.3
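As a consistency check, the report counts and the overall score can be recomputed from the coverage table. The per-type averages are already rounded to one decimal, so this only confirms that the published figures agree with each other; it does not recover the exact underlying mean.

```python
# (reports, average score) per content type, copied from the table above.
coverage = {
    "Exam papers": (516, 9.4),
    "Quizzes": (518, 9.3),
}

total_reports = sum(n for n, _ in coverage.values())
# Weight each average by its report count before dividing by the total.
weighted_avg = sum(n * avg for n, avg in coverage.values()) / total_reports

print(total_reports)          # 1034
print(f"{weighted_avg:.1f}")  # 9.3
```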

Criterion Scores

Criterion	Score	Interpretation
Language suitability	9.9	Very strong level-appropriate wording in the tested sample.
Syllabus adherence	9.8	Strong subject and syllabus alignment.
Clean output	9.6	Generally clean structure with few artefact issues.
LaTeX and notation handling	9.6	Reliable handling of mathematical and scientific notation.
Difficulty fit	8.9	Still needs checks for level fit and exam-time realism.
Template adherence	9.0	Good but not perfect exam-template control.

Main Findings

Among the models with large quiz and paper samples in this benchmark, DeepSeek V4 Pro had the lowest below-8 rate.
It performed especially well in O-Level Principles of Accounts, Secondary 3 Biology, Secondary 3 Chemistry, Secondary 3 Physics, A-Level Economics H2, and A-Level Chemistry H1.
Its main risk was not basic readability. The review issues were more often about difficulty, exam style, and whether the generated paper fully matched the intended paper type.

Subject and Level Fit

Strong fit: Secondary sciences, O-Level science subjects, Principles of Accounts, A-Level Economics, and A-Level sciences.
Use with review: Secondary 4 Higher Chinese, Chinese, Tamil, and Malay, where non-English language control needs closer checking.
Use with review: long exam papers that require exact timing, section structure, and mark allocation.

Recurring Comment Themes

The benchmark comments were reviewed for repeated patterns. These themes overlap, so they should not be added together as independent failure counts.

Answer depth was usually acceptable, but some comments still asked for clearer working or marking guidance.
Difficulty and timeframe checks remain useful because a technically correct question can still be too hard or too long.
Missing visuals were common in graph, diagram, and science-context questions, but this should be treated as a pipeline follow-up when the text is otherwise sound.
Language issues were rarer than for other models, but still serious in the few cases where they appeared.
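The caution about overlapping themes can be illustrated with a small example. The report IDs and theme tags below are invented for illustration and are not benchmark data; the point is only that summing per-theme counts overstates the number of affected reports when one report carries several themes.

```python
from collections import Counter

# Hypothetical tagged comments: each report may carry several themes.
comments = {
    "report-001": {"difficulty", "missing_visual"},
    "report-002": {"missing_visual"},
    "report-003": {"answer_depth", "difficulty"},
    "report-004": set(),
}


def theme_counts(tagged: dict[str, set[str]]) -> Counter:
    """Count how many reports mention each theme."""
    counts = Counter()
    for themes in tagged.values():
        counts.update(themes)
    return counts


def reports_with_any_theme(tagged: dict[str, set[str]]) -> int:
    """Count distinct reports that carry at least one theme."""
    return sum(1 for themes in tagged.values() if themes)


counts = theme_counts(comments)
naive_total = sum(counts.values())           # adds overlapping themes together
affected = reports_with_any_theme(comments)  # counts each report once

print(naive_total, affected)  # 5 3
```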

Prompting and Workflow Implications

TuitionGoWhere does not publish its original internal prompts. The points below are high-level improvement directions based on the benchmark findings.

Add stronger checks for timing and total marks when generating full papers.
Ask the model to label each section and mark allocation before writing the questions.
Keep the existing notation instructions, but add a rendering check for equations and units.
Use a separate review pass for non-English subjects instead of relying only on the overall score.
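The first two directions above, checking timing and total marks and labelling each section, could be enforced with a small validation step on the section plan before any questions are written. The sketch below is one possible shape under stated assumptions: the `Section` class, the `check_paper_plan` function, and the sample plan are all hypothetical and do not describe TuitionGoWhere's internal workflow.

```python
from dataclasses import dataclass


@dataclass
class Section:
    label: str    # e.g. "Section A"
    marks: int    # marks allocated to the section
    minutes: int  # estimated working time for the section


def check_paper_plan(sections, expected_marks, time_limit_minutes):
    """Return a list of readable problems; an empty list means the plan passes."""
    problems = []
    for i, s in enumerate(sections, start=1):
        if not s.label.strip():
            problems.append(f"section {i} has no label")
    total_marks = sum(s.marks for s in sections)
    if total_marks != expected_marks:
        problems.append(f"marks total {total_marks}, expected {expected_marks}")
    total_minutes = sum(s.minutes for s in sections)
    if total_minutes > time_limit_minutes:
        problems.append(
            f"estimated {total_minutes} min exceeds {time_limit_minutes} min limit"
        )
    return problems


# Hypothetical plan, for illustration only.
plan = [
    Section("Section A", 30, 35),
    Section("Section B", 40, 50),
    Section("", 20, 30),
]
issues = check_paper_plan(plan, expected_marks=100, time_limit_minutes=105)
for issue in issues:
    print(issue)
```

Running the check on the plan, rather than on the finished paper, keeps structural errors cheap to fix: a wrong mark total is caught before the model spends effort on question text.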
