Stage 9-0 Benchmark
Generated Content Benchmark
Scores for quizzes, exam papers, cheatsheets, and parent guides generated by different models. Each artifact is evaluated against Singapore syllabus, exam-format, answer-key, notation, timing, and content-quality criteria. Open a level view to compare LLM scores by subject.
Model Summary
Average scores grouped by the model that generated the content.
| Generation Model | Artifacts | Overall | Missing Images | Language | Syllabus | Answers | Notation | Timing |
|---|---|---|---|---|---|---|---|---|
| Claude Sonnet 4 | 245 | 9.1 | 122 | 9.5 | 9.3 | 9.0 | 9.4 | 9.0 |
| DeepSeek V4 Pro | 1050 | 9.3 | 665 | 9.9 | 9.7 | 9.2 | 9.6 | 9.0 |
| Gemma 4 31B | 1064 | 9.2 | 668 | 9.8 | 9.6 | 8.7 | 9.9 | 8.9 |
| Legacy generator | 1009 | 8.3 | 558 | 9.0 | 8.5 | 7.7 | 8.5 | 8.1 |
| Owl Alpha | 77 | 9.2 | 44 | 9.7 | 9.4 | 9.5 | 10.0 | 9.7 |
| Qwen3.6 Plus | 1062 | 9.3 | 672 | 9.8 | 9.7 | 9.2 | 9.9 | 9.0 |
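The Missing Images column is a count, so it only reads fairly across models once it is divided by each model's artifact count. The sketch below does that with the rows above and also computes an artifact-weighted mean of the Overall column. The function names are illustrative, and weighting by artifact count is an assumption about how one might combine the rows, not a description of how this page computes its averages.

```python
# Rows copied from the Model Summary table:
# (generation model, artifacts, overall score, missing-image count)
MODEL_ROWS = [
    ("Claude Sonnet 4", 245, 9.1, 122),
    ("DeepSeek V4 Pro", 1050, 9.3, 665),
    ("Gemma 4 31B", 1064, 9.2, 668),
    ("Legacy generator", 1009, 8.3, 558),
    ("Owl Alpha", 77, 9.2, 44),
    ("Qwen3.6 Plus", 1062, 9.3, 672),
]


def missing_image_rates(rows):
    """Share of each model's artifacts that were flagged as missing images."""
    return {model: round(missing / artifacts, 3)
            for model, artifacts, _overall, missing in rows}


def weighted_overall(rows):
    """Mean Overall score weighted by artifact count (an assumed aggregation)."""
    total_artifacts = sum(artifacts for _m, artifacts, _o, _x in rows)
    weighted_sum = sum(artifacts * overall for _m, artifacts, overall, _x in rows)
    return round(weighted_sum / total_artifacts, 2)
```

Read this way, Claude Sonnet 4 has roughly half of its artifacts flagged, while DeepSeek V4 Pro, Gemma 4 31B, and Qwen3.6 Plus are each near 63%.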
Level Views
Open a per-level benchmark page to compare scores by LLM and subject.
| Level | Artifacts | Overall | Missing Images | Language | Syllabus | Answers | Notation | Timing |
|---|---|---|---|---|---|---|---|---|
| A-Level | 1024 | 9.4 | 688 | 9.9 | 9.8 | 9.1 | 9.8 | 8.9 |
| O-Level | 680 | 9.3 | 457 | 9.9 | 9.7 | 9.1 | 9.7 | 9.1 |
| Primary 1 | 71 | 9.3 | 41 | 9.7 | 9.5 | 9.4 | 10.0 | 9.8 |
| Primary 2 | 55 | 9.3 | 19 | 9.7 | 9.4 | 9.3 | 9.8 | 9.7 |
| Primary 3 | 152 | 7.9 | 92 | 8.3 | 7.9 | 7.6 | 8.2 | 8.0 |
| Primary 4 | 137 | 8.8 | 74 | 9.5 | 8.9 | 8.7 | 9.2 | 9.2 |
| Primary 5 | 168 | 6.8 | 82 | 7.4 | 7.2 | 5.8 | 6.4 | 6.6 |
| Primary 6 PSLE | 178 | 7.5 | 92 | 8.9 | 8.0 | 6.2 | 8.3 | 7.0 |
| Secondary 1 | 145 | 9.1 | 78 | 9.6 | 9.2 | 9.0 | 9.6 | 8.8 |
| Secondary 2 | 137 | 9.0 | 91 | 9.5 | 9.2 | 8.8 | 9.5 | 8.6 |
| Secondary 3 | 671 | 9.3 | 404 | 9.8 | 9.7 | 9.0 | 9.8 | 9.1 |
| Secondary 4 | 1089 | 9.2 | 611 | 9.8 | 9.6 | 8.9 | 9.8 | 9.0 |
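One way to read the level table is to pick out the levels whose Overall score falls below a cut-off, lowest first. The sketch below does that with the rows above. The helper names and the 8.0 threshold are hypothetical choices for illustration, not values defined by this benchmark.

```python
# Rows copied from the Level Views table: (level, artifacts, overall score)
LEVEL_ROWS = [
    ("A-Level", 1024, 9.4),
    ("O-Level", 680, 9.3),
    ("Primary 1", 71, 9.3),
    ("Primary 2", 55, 9.3),
    ("Primary 3", 152, 7.9),
    ("Primary 4", 137, 8.8),
    ("Primary 5", 168, 6.8),
    ("Primary 6 PSLE", 178, 7.5),
    ("Secondary 1", 145, 9.1),
    ("Secondary 2", 137, 9.0),
    ("Secondary 3", 671, 9.3),
    ("Secondary 4", 1089, 9.2),
]


def levels_below(rows, threshold):
    """Levels whose Overall score is under the threshold, lowest first."""
    flagged = [(overall, level) for level, _artifacts, overall in rows
               if overall < threshold]
    return [level for _overall, level in sorted(flagged)]


def total_artifacts(rows):
    """Total number of artifacts across all levels."""
    return sum(artifacts for _level, artifacts, _overall in rows)
```

With a threshold of 8.0 this returns Primary 5, Primary 6 PSLE, and Primary 3, in that order. The artifact total across levels is 4507, which matches the sum of the Artifacts column in the Model Summary table.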
Content Type Summary
Average scores grouped by quizzes, papers, cheatsheets, and parent guides.
Criteria
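Each criterion is scored on a scale where 10.0 is the best fit. Missing images are tracked as a yes/no flag rather than a score, so the Missing Images columns in the tables above are counts of flagged artifacts.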
Language suitability
Syllabus adherence
Past-paper template adherence
No weird artefacts/symbols
Step-by-step answers
LaTeX/notation format
Exam paper format
Difficulty appropriateness
Doable within timeframe
Cheatsheet 3-point summaries
Parent guide syllabus fit
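To illustrate how per-artifact results could roll up into the summary rows, here is a minimal sketch of a score record and an aggregation step. The `ArtifactScore` class, its field names, and the use of a plain mean are all assumptions made for illustration. The page does not state how its averages are computed.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class ArtifactScore:
    """Hypothetical record for one evaluated artifact."""
    model: str
    scores: dict          # criterion name -> score, with 10.0 as best fit
    missing_images: bool  # the yes/no flag described above


def summarize(artifacts):
    """Aggregate records into a count, a missing-image count, and per-criterion means."""
    criteria = sorted({name for a in artifacts for name in a.scores})
    return {
        "artifacts": len(artifacts),
        # Summing booleans counts the artifacts flagged "yes".
        "missing_images": sum(a.missing_images for a in artifacts),
        # Plain mean per criterion; an assumed aggregation, not the page's documented one.
        "averages": {
            name: round(mean(a.scores[name] for a in artifacts if name in a.scores), 1)
            for name in criteria
        },
    }
```

Not every criterion applies to every content type (for example, "Cheatsheet 3-point summaries"), which is why the sketch averages each criterion only over the artifacts that carry it.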