Stage 9-0 Benchmark

Generated Content Benchmark

Scores for quizzes, exam papers, cheatsheets, and parent guides generated by different models. Each artifact is evaluated against Singapore syllabus, exam-format, answer-key, notation, timing, and content-quality criteria. Open a level view to compare LLM scores by subject.

4507 reports. Updated 2026-06-02 15:35:52 UTC. Scoring scale: 0.0-10.0. Refreshes every 5 min.
Evaluator: Gemma 4 26B A4B (google/gemma-4-26b-a4b-it)

Model Summary

Average scores grouped by the model that generated the content.

Generation Model | Artifacts | Overall | Missing Images | Language | Syllabus | Answers | Notation | Timing
Claude Sonnet 4 | 245 | 9.1 | 122 | 9.5 | 9.3 | 9.0 | 9.4 | 9.0
DeepSeek V4 Pro | 1050 | 9.3 | 665 | 9.9 | 9.7 | 9.2 | 9.6 | 9.0
Gemma 4 31B | 1064 | 9.2 | 668 | 9.8 | 9.6 | 8.7 | 9.9 | 8.9
Legacy generator | 1009 | 8.3 | 558 | 9.0 | 8.5 | 7.7 | 8.5 | 8.1
Owl Alpha | 77 | 9.2 | 44 | 9.7 | 9.4 | 9.5 | 10.0 | 9.7
Qwen3.6 Plus | 1062 | 9.3 | 672 | 9.8 | 9.7 | 9.2 | 9.9 | 9.0
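
The grouping behind a summary like this can be sketched as follows. This is only a minimal illustration, not the benchmark's actual pipeline: the `Report` record shape, its field names, and the sample values are all hypothetical, since the page does not publish its schema or per-report data.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Report:
    # Hypothetical record shape; the real report schema is not shown on the page.
    model: str
    overall: float        # score on the 0.0-10.0 scale
    missing_images: bool  # yes/no flag


def summarize_by_model(reports):
    """Return {model: (artifact_count, mean_overall, missing_image_count)}."""
    groups = defaultdict(list)
    for report in reports:
        groups[report.model].append(report)

    summary = {}
    for model, rows in sorted(groups.items()):
        mean_overall = sum(r.overall for r in rows) / len(rows)
        flagged = sum(1 for r in rows if r.missing_images)
        # Round to one decimal place, matching how the page displays scores.
        summary[model] = (len(rows), round(mean_overall, 1), flagged)
    return summary


# Made-up records for illustration only; these are not benchmark data.
sample = [
    Report("model-a", 9.0, True),
    Report("model-a", 8.0, False),
    Report("model-b", 10.0, False),
]
print(summarize_by_model(sample))
```

Grouping by level or by content type would follow the same pattern with a different key.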

Level Views

Open a per-level benchmark page to compare scores by LLM and subject.

Level | Artifacts | Overall | Missing Images | Language | Syllabus | Answers | Notation | Timing
A-Level | 1024 | 9.4 | 688 | 9.9 | 9.8 | 9.1 | 9.8 | 8.9
O-Level | 680 | 9.3 | 457 | 9.9 | 9.7 | 9.1 | 9.7 | 9.1
Primary 1 | 71 | 9.3 | 41 | 9.7 | 9.5 | 9.4 | 10.0 | 9.8
Primary 2 | 55 | 9.3 | 19 | 9.7 | 9.4 | 9.3 | 9.8 | 9.7
Primary 3 | 152 | 7.9 | 92 | 8.3 | 7.9 | 7.6 | 8.2 | 8.0
Primary 4 | 137 | 8.8 | 74 | 9.5 | 8.9 | 8.7 | 9.2 | 9.2
Primary 5 | 168 | 6.8 | 82 | 7.4 | 7.2 | 5.8 | 6.4 | 6.6
Primary 6 PSLE | 178 | 7.5 | 92 | 8.9 | 8.0 | 6.2 | 8.3 | 7.0
Secondary 1 | 145 | 9.1 | 78 | 9.6 | 9.2 | 9.0 | 9.6 | 8.8
Secondary 2 | 137 | 9.0 | 91 | 9.5 | 9.2 | 8.8 | 9.5 | 8.6
Secondary 3 | 671 | 9.3 | 404 | 9.8 | 9.7 | 9.0 | 9.8 | 9.1
Secondary 4 | 1089 | 9.2 | 611 | 9.8 | 9.6 | 8.9 | 9.8 | 9.0

Content Type Summary

Average scores grouped by quizzes, papers, cheatsheets, and parent guides.

Cheatsheet: 9.3 (85 artifacts, 46 with missing-image flags)
Paper: 9.0 (2206 artifacts, 1521 with missing-image flags)
Parents Guide: 9.8 (85 artifacts, 0 with missing-image flags)
Quiz: 9.1 (2131 artifacts, 1162 with missing-image flags)
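
Because the four content types differ greatly in size, a plain average of their scores is not the same as an average over all artifacts. The sketch below uses the figures listed above to compare the two. It assumes the displayed averages are simple per-artifact means, and because those averages are rounded to one decimal place, the weighted result is only an estimate.

```python
# (content type, displayed average score, artifact count), from the summary above.
CONTENT_TYPES = [
    ("Cheatsheet", 9.3, 85),
    ("Paper", 9.0, 2206),
    ("Parents Guide", 9.8, 85),
    ("Quiz", 9.1, 2131),
]


def weighted_mean(rows):
    """Average of the per-type scores, weighted by artifact count."""
    total = sum(count for _, _, count in rows)
    return sum(score * count for _, score, count in rows) / total


def simple_mean(rows):
    """Unweighted average of the per-type scores."""
    return sum(score for _, score, _ in rows) / len(rows)


print(f"weighted:   {weighted_mean(CONTENT_TYPES):.2f}")
print(f"unweighted: {simple_mean(CONTENT_TYPES):.2f}")
```

The weighted figure comes out near 9.07 against an unweighted 9.30, since papers and quizzes together account for 4337 of the 4507 artifacts.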

Criteria

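Each criterion is scored from 0.0 to 10.0, with 10.0 as the best fit. Missing images are not scored; they are tracked as a yes/no flag per artifact, and the summaries report how many artifacts were flagged.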

1. Language suitability
2. Syllabus adherence
3. Past-paper template adherence
4. No weird artefacts/symbols
5. Step-by-step answers
6. LaTeX/notation format
7. Exam paper format
8. Difficulty appropriateness
9. Doable within timeframe
10. Cheatsheet 3-point summaries
11. Parent guide syllabus fit