Stage 9-1 Level View
Primary 2 Benchmark Scores
Per-level benchmark view grouped by generation model and subject. Scores are derived from the Stage 9-0 evaluator reports without changing the underlying scoring algorithm.
Showing all Primary 2 subjects.
LLM Summary
Average scores grouped by the model that generated the Primary 2 content.
| Generation Model | Artifacts | Overall | Missing Images | Language | Syllabus | Answers | Notation | Timing |
|---|---|---|---|---|---|---|---|---|
| Claude Sonnet 4 | 22 | 9.2 | 4 | 9.6 | 9.4 | 8.9 | 9.6 | 9.8 |
| Owl Alpha | 15 | 9.4 | 7 | 9.8 | 9.7 | 9.6 | 9.9 | 9.6 |
Subject Summary
Average scores grouped by subject inside Primary 2.
| Subject | Artifacts | Overall | Missing Images | Language | Syllabus | Answers | Notation | Timing |
|---|---|---|---|---|---|---|---|---|
| Chinese | 9 | 8.7 | 1 | 8.9 | 8.3 | 9.2 | 10.0 | 9.5 |
| English | 10 | 9.4 | 1 | 9.8 | 9.9 | 9.2 | 10.0 | 9.8 |
| Mathematics | 17 | 9.5 | 8 | 10.0 | 10.0 | 9.3 | 9.6 | 9.9 |
| Tamil | 1 | 8.7 | 1 | 9.0 | 8.0 | 9.0 | 10.0 | 9.0 |
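The subject averages can be reproduced from the detailed rows. The sketch below is not the dashboard's actual code: it assumes each summary value is a plain mean rounded to one decimal and that Missing Images counts the flagged artifacts. The function name `summarize_by_subject` is hypothetical. The input is the Overall score and Missing Images flag of the nine Chinese artifacts and the single Tamil artifact listed under Detailed Benchmark Rows.

```python
from collections import defaultdict

# (subject, overall, missing_images) copied from the detailed rows
rows = [
    ("Chinese", 5.4, False),  # Claude Sonnet 4 cheatsheet
    ("Chinese", 9.6, False),  # Claude Sonnet 4 parents guide
    ("Chinese", 8.9, False),  # Claude Sonnet 4 hanyu-pinyin
    ("Chinese", 9.1, False),  # Claude Sonnet 4 reading-comprehension
    ("Chinese", 9.0, False),  # Claude Sonnet 4 vocabulary
    ("Chinese", 9.1, True),   # Owl Alpha character-recognition
    ("Chinese", 9.5, False),  # Owl Alpha hanyu-pinyin
    ("Chinese", 9.1, False),  # Owl Alpha reading-comprehension
    ("Chinese", 8.6, False),  # Owl Alpha vocabulary
    ("Tamil", 8.7, True),     # Owl Alpha vocabulary
]


def summarize_by_subject(rows):
    """Group rows by subject; report count, mean overall, and flagged count."""
    groups = defaultdict(list)
    for subject, overall, missing in rows:
        groups[subject].append((overall, missing))

    summary = {}
    for subject, items in groups.items():
        summary[subject] = {
            "artifacts": len(items),
            "overall": round(sum(o for o, _ in items) / len(items), 1),
            "missing_images": sum(1 for _, m in items if m),
        }
    return summary


summary = summarize_by_subject(rows)
print(summary["Chinese"])
print(summary["Tamil"])
```

Under these assumptions the result matches the Chinese and Tamil rows of the table above (9 artifacts at 8.7 with 1 flagged, and 1 artifact at 8.7 with 1 flagged).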
LLM by Content Type
Model scores split by quizzes, papers, cheatsheets, and parent guides.
| Generation Model | Type | Artifacts | Overall | Missing Images | Language | Syllabus | Answers | Notation | Timing |
|---|---|---|---|---|---|---|---|---|---|
| Claude Sonnet 4 | Cheatsheet | 3 | 8.2 | 1 | 7.8 | 7.3 | - | 10.0 | - |
| Claude Sonnet 4 | Parents Guide | 3 | 9.8 | 0 | 9.8 | 9.7 | - | - | - |
| Claude Sonnet 4 | Quiz | 16 | 9.3 | 3 | 9.8 | 9.7 | 8.9 | 9.6 | 9.8 |
| Owl Alpha | Quiz | 15 | 9.4 | 7 | 9.8 | 9.7 | 9.6 | 9.9 | 9.6 |
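Some criteria do not apply to every content type, which is why several cells show "-". The sketch below assumes those cells are skipped when averaging, not counted as zero. The source does not state this, but the Claude Sonnet 4 cheatsheet row is consistent with it. The helper name `mean_or_dash` is hypothetical, and the values come from the three Claude Sonnet 4 cheatsheet rows in the detailed table (Chinese, English, Mathematics).

```python
def mean_or_dash(values):
    """Average the numeric entries; return "-" when no entry is scored."""
    scored = [v for v in values if v != "-"]
    if not scored:
        return "-"
    return round(sum(scored) / len(scored), 1)


# Per-criterion scores for the Chinese, English, and Mathematics cheatsheets
cheatsheets = {
    "Language": [4.0, 9.5, 10.0],
    "Syllabus": [2.0, 10.0, 10.0],
    "Answers": ["-", "-", "-"],    # not scored for cheatsheets
    "Notation": ["-", "-", 10.0],  # only the Mathematics cheatsheet is scored
}

averages = {name: mean_or_dash(vals) for name, vals in cheatsheets.items()}
print(averages)
```

With skipped cells, Notation averages to 10.0 from a single scored artifact, which is what the cheatsheet row reports.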
Needs Review: Scores Below 8.0
Artifacts with overall benchmark scores below 8.0 for the current level view.
| Overall | Model | Subject | Type | Stage | Topic / Paper | Language | Syllabus | Template | Answers | Notation | Timing | Comments |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 5.4 | Claude Sonnet 4 | Chinese | Cheatsheet | 2-7 | cheatsheet | 4.0 | 2.0 | - | - | - | - | The cheatsheet is generic and fails to align with the actual P2 Chinese syllabus (Huanle Huoban 2.0). It focuses on abstract linguistic concepts (synonyms, grammar structures) rather than the specific vocabulary and character recognition milestones required for P2. The content is too advanced/abstract for a 7-8 year old, reading more like a middle school grammar guide. |
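The review list is a threshold filter on the overall score. A minimal sketch, assuming a strict "below 8.0" comparison and lowest-first ordering (the ordering is an assumption; `needs_review` and `REVIEW_THRESHOLD` are hypothetical names), using the three lowest-scoring artifacts from the detailed rows:

```python
REVIEW_THRESHOLD = 8.0

# Overall scores copied from the detailed rows
artifacts = [
    {"model": "Claude Sonnet 4", "subject": "Chinese", "topic": "cheatsheet", "overall": 5.4},
    {"model": "Owl Alpha", "subject": "Chinese", "topic": "vocabulary", "overall": 8.6},
    {"model": "Owl Alpha", "subject": "Tamil", "topic": "vocabulary", "overall": 8.7},
]


def needs_review(artifacts, threshold=REVIEW_THRESHOLD):
    """Return artifacts scoring strictly below the threshold, lowest first."""
    flagged = [a for a in artifacts if a["overall"] < threshold]
    return sorted(flagged, key=lambda a: a["overall"])


flagged = needs_review(artifacts)
for a in flagged:
    print(a["overall"], a["model"], a["subject"], a["topic"])
```

Only the Chinese cheatsheet falls below the threshold; the two vocabulary quizzes at 8.6 and 8.7 stay off the list.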
Content Type Summary
Average scores grouped by content type.
Detailed Benchmark Rows
Topics, quiz variants, paper versions, cheatsheets, and parent guides listed individually.
| Model | Type | Stage | Subject | Topic / Paper | Overall | Missing Images | Language | Syllabus | Template | Clean | Step Answers | Notation | Paper Format | Difficulty | Time Fit | 3-Point Summary | Parent Guide | Difficulty Label | Comments |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Claude Sonnet 4 | Cheatsheet | 2-7 | Chinese | cheatsheet | 5.4 | No | 4.0 | 2.0 | - | 10.0 | - | - | - | 3.0 | - | 8.0 | - | too easy | The cheatsheet is generic and fails to align with the actual P2 Chinese syllabus (Huanle Huoban 2.0). It focuses on abstract linguistic concepts (synonyms, grammar structures) rather than the specific vocabulary and character recognition milestones required for P2. The content is too advanced/abstract for a 7-8 year old, reading more like a middle school grammar guide. |
| Claude Sonnet 4 | Cheatsheet | 2-7 | English | cheatsheet | 9.5 | No | 9.5 | 10.0 | - | 10.0 | - | - | - | 9.0 | - | 9.0 | - | appropriate | Excellent syllabus alignment with the MOE P2 English framework. The cheatsheet uses effective thematic grouping and provides high-quality, actionable summaries (e.g., breaking down listening into 'First, Then, Finally'). Language is perfectly pitched for 7-8 year olds. No major issues found. |
| Claude Sonnet 4 | Cheatsheet | 2-7 | Mathematics | cheatsheet | 9.8 | Yes | 10.0 | 10.0 | - | 10.0 | - | 10.0 | - | 10.0 | - | 9.0 | - | appropriate | Excellent syllabus coverage for P2. Topic sections use effective three-point summaries and clear examples. Missing diagrams for 3D shapes, fractions, and money which are essential for this level. |
| Claude Sonnet 4 | Parents Guide | 2-9 | Chinese | parents-guide | 9.6 | No | 9.5 | 9.0 | - | 10.0 | - | - | - | 10.0 | - | - | 9.5 | appropriate | High quality guide. Accurately reflects the P2 qualitative assessment model (no grades). Term-by-term progression is well-structured and aligns with the MOE syllabus focus on listening, speaking, and character foundation. |
| Claude Sonnet 4 | Parents Guide | 2-9 | English | parents-guide | 9.8 | No | 10.0 | 10.0 | - | 9.0 | - | - | - | 10.0 | - | - | 10.0 | appropriate | Excellent parent guide. Highly aligned with the MOE Singapore P2 syllabus, specifically noting the qualitative assessment approach. Content is practical, encouraging, and age-appropriate. Minor truncation at the very end of the document. |
| Claude Sonnet 4 | Parents Guide | 2-9 | Mathematics | parents-guide | 10.0 | No | 10.0 | 10.0 | - | 10.0 | - | - | - | 10.0 | - | - | 10.0 | appropriate | Excellent parent guide. It accurately reflects the MOE Singapore P2 syllabus, including the shift to qualitative assessment. The term-by-term breakdown and practical home activities are highly relevant and age-appropriate. |
| Claude Sonnet 4 | Quiz | 5-1 | Chinese | hanyu-pinyin | 8.9 | No | 9.5 | 9.0 | 8.0 | 10.0 | 8.5 | 10.0 | 8.5 | 8.0 | 9.0 | - | - | appropriate | The quiz is well-structured and aligns with P2 Hanyu Pinyin requirements. The answer key contains a self-correction for Question 16, which is helpful but indicates a generation error in the original question's logic. The difficulty is appropriate for the level. |
| Claude Sonnet 4 | Quiz | 5-1 | Chinese | reading-comprehension | 9.1 | No | 9.5 | 9.0 | 8.5 | 10.0 | 9.0 | - | 8.5 | 9.0 | 9.5 | - | - | appropriate | Language is highly suitable for P2. Content aligns well with the syllabus. Question types (MCQ and open-ended) are standard. The answer key is excellent, providing both correct answers and marking rubrics for open-ended questions. Minor note: No images were required for this specific text-based quiz, so no images are missing. |
| Claude Sonnet 4 | Quiz | 5-1 | Chinese | vocabulary | 9.0 | No | 9.5 | 8.5 | 7.0 | 10.0 | 9.0 | 10.0 | 8.0 | 9.0 | 10.0 | - | - | appropriate | The quiz is well-structured for P2 level. Vocabulary and antonyms align well with the syllabus. Question 15 (matching) is slightly unconventional for standard MOE formats but acceptable for a quiz. The answer key is excellent, providing a useful summary of classifiers and antonyms at the end. |
| Claude Sonnet 4 | Quiz | 5-1 | English | grammar | 9.5 | No | 10.0 | 10.0 | 8.0 | 10.0 | 9.0 | - | 9.0 | 10.0 | 10.0 | - | - | appropriate | High quality quiz. Language and grammar topics (SVA, plurals, tenses, articles) align perfectly with P2 syllabus. Answer key includes helpful explanations. Format is clean, though Section B marks could be more explicitly distributed per sub-question in the question paper itself. |
| Claude Sonnet 4 | Quiz | 5-1 | English | phonics | 8.7 | No | 9.5 | 9.0 | 7.0 | 10.0 | 8.5 | 10.0 | 7.5 | 8.0 | 9.0 | - | - | appropriate | Language is well-suited for P2. The quiz follows a standard format but lacks the specific layout of Singapore MOE English papers (e.g., specific section headers like 'Grammar' or 'Vocabulary'). A significant error was found in the answer key for question 15a (hospital), though the key self-corrected. The marking scheme for Section B is slightly inconsistent with the question numbering/marks assigned. |
| Claude Sonnet 4 | Quiz | 5-1 | English | reading-comprehension | 9.5 | No | 10.0 | 10.0 | 9.0 | 10.0 | 9.0 | - | 9.0 | 9.0 | 10.0 | - | - | appropriate | High quality quiz. Language is perfectly leveled for P2. Question types (MCQ and short answer) align well with Singapore primary standards. Answer key provides excellent evidence-based explanations. Minor note: Section B Q18 is more of a personal response than a comprehension check, but acceptable for this level. |
| Claude Sonnet 4 | Quiz | 5-1 | English | vocabulary | 9.2 | No | 9.5 | 10.0 | 8.5 | 10.0 | 7.0 | 10.0 | 9.0 | 8.5 | 10.0 | - | - | appropriate | Language is well-suited for P2. Syllabus coverage of antonyms, jobs, and adjectives is accurate. Question 9 is slightly ambiguous for this level as 'all of the above' is a complex logic for P2. Answer key provides explanations but lacks true step-by-step reasoning for short answers. Format is clean and professional. |
| Claude Sonnet 4 | Quiz | 5-1 | Mathematics | addition-subtraction | 9.4 | No | 10.0 | 10.0 | 9.0 | 10.0 | 8.0 | 10.0 | 9.0 | 9.0 | 10.0 | - | - | appropriate | High quality quiz. Content aligns well with P2 syllabus (3-digit addition/subtraction with regrouping). Section B answers provide clear vertical alignment for working. Marks and time are realistic. Minor note: Section B question 13 uses vertical alignment in the answer key which is good, but the question itself could benefit from explicit instruction to 'show working' in a specific format. |
| Claude Sonnet 4 | Quiz | 5-1 | Mathematics | fractions | 8.8 | Yes | 10.0 | 10.0 | 8.0 | 9.0 | 9.0 | 5.0 | 9.0 | 9.0 | 10.0 | - | - | appropriate | The quiz content is highly aligned with the P2 syllabus. However, it relies heavily on emoji/text-based representations (e.g., squares and circles) which are poor substitutes for formal mathematical diagrams required in a real exam. Notation for fractions uses plain text slashes instead of proper LaTeX, which is not standard for high-quality math papers. |
| Claude Sonnet 4 | Quiz | 5-1 | Mathematics | length-mass-volume | 9.6 | Yes | 10.0 | 10.0 | 8.0 | 10.0 | 10.0 | 10.0 | 9.0 | 9.0 | 10.0 | - | - | appropriate | High quality quiz. Content aligns perfectly with P2 syllabus for length, mass, and volume. Missing diagrams for measurement questions (e.g., scales or rulers) which are standard in Singapore math papers. Answer key is excellent with clear working steps. |
| Claude Sonnet 4 | Quiz | 5-1 | Mathematics | money | 9.7 | No | 10.0 | 10.0 | 9.0 | 10.0 | 10.0 | 10.0 | 9.0 | 9.0 | 10.0 | - | - | appropriate | High quality quiz. Adheres well to P2 Money syllabus including decimal notation and conversions. Section B includes multi-step problems and multiplication which is appropriate for the level. Answer key provides excellent step-by-step working and marking schemes. Minor note: Section A Q3 only has two options instead of four, but it is clear. |
| Claude Sonnet 4 | Quiz | 5-1 | Mathematics | multiplication-division | 9.4 | No | 10.0 | 10.0 | 8.0 | 10.0 | 9.0 | 10.0 | 8.5 | 9.0 | 10.0 | - | - | appropriate | The quiz aligns well with the P2 syllabus, covering the 2, 3, 4, 5, and 10 times tables. The difficulty is appropriate for the level. The format is slightly simplified compared to actual MOE papers (which usually use Section A for MCQs and Section B for structured questions with more visual aids), but it is highly functional. Answer key provides good explanations and marking schemes. |
| Claude Sonnet 4 | Quiz | 5-1 | Mathematics | numbers | 9.7 | No | 10.0 | 10.0 | 9.0 | 10.0 | 9.0 | 10.0 | 9.0 | 10.0 | 10.0 | - | - | appropriate | High quality quiz. Content aligns perfectly with P2 Numbers up to 1000 syllabus. Answer key provides clear explanations and marking schemes. Format is professional, though Section B could benefit from more complex word problems to match higher-tier P2 papers. |
| Claude Sonnet 4 | Quiz | 5-1 | Mathematics | picture-graphs | 9.2 | No | 10.0 | 10.0 | 8.0 | 10.0 | 9.0 | 10.0 | 9.0 | 7.0 | 10.0 | - | - | uneven | The quiz is well-structured and follows the syllabus perfectly. However, Question 17 is logically flawed as no two distinct fruits in the provided data sum to the number of apples (60). The answer key even notes this error. Most questions are appropriate for P2, but the logic error in the final question makes the difficulty uneven. |
| Claude Sonnet 4 | Quiz | 5-1 | Mathematics | shapes | 9.4 | No | 10.0 | 10.0 | 8.0 | 10.0 | 9.0 | 10.0 | 9.0 | 9.0 | 10.0 | - | - | appropriate | High quality quiz. Content aligns perfectly with P2 geometry syllabus. Uses emojis effectively for patterns instead of missing images. Answer key is excellent with clear explanations and a summary table. |
| Claude Sonnet 4 | Quiz | 5-1 | Mathematics | time | 9.4 | Yes | 10.0 | 10.0 | 8.0 | 10.0 | 9.0 | 10.0 | 9.0 | 9.0 | 10.0 | - | - | appropriate | The quiz aligns well with P2 syllabus (telling time to the minute, conversions). Missing clock face diagrams for Q1 and Q14. Format is professional with clear marks and time. Answer key provides good explanations. |
| Owl Alpha | Quiz | 5-1 | Chinese | character-recognition | 9.1 | Yes | 9.5 | 10.0 | 8.0 | 10.0 | 9.0 | - | 7.5 | 9.0 | 10.0 | - | - | appropriate | High syllabus alignment with P2 character lists. Section B relies heavily on image descriptions (e.g., 'Picture shows...') which indicates missing actual images/placeholders. Exam format is generally good but lacks specific marks per question in the question paper itself (though present in answer key). Difficulty is well-calibrated for P2. |
| Owl Alpha | Quiz | 5-1 | Chinese | hanyu-pinyin | 9.5 | No | 10.0 | 9.5 | 8.0 | 10.0 | 10.0 | 10.0 | 9.0 | 9.0 | 10.0 | - | - | appropriate | High quality quiz. Content aligns well with P2 Hanyu Pinyin requirements (tones, initials, finals, nasal sounds, and light tones). Answer key is exceptionally detailed with pedagogical tips. Format is professional, though P2 papers in Singapore usually feature more visual aids/pictures for context. |
| Owl Alpha | Quiz | 5-1 | Chinese | reading-comprehension | 9.1 | No | 9.5 | 9.0 | 8.5 | 10.0 | 9.5 | - | 9.0 | 8.5 | 9.0 | - | - | appropriate | Language is well-suited for P2. Content aligns with syllabus themes. The quiz lacks multiple-choice questions (MCQs) which are standard in Singapore primary Chinese papers, relying entirely on open-ended questions. Answer key is excellent with clear marking schemes. |
| Owl Alpha | Quiz | 5-1 | Chinese | vocabulary | 8.6 | No | 9.0 | 8.5 | 7.0 | 10.0 | 9.5 | - | 8.0 | 7.5 | 9.0 | - | - | uneven | Language is appropriate for P2. Syllabus adherence is good, though Section C includes some abstract phonetics (nasal sounds) that might be slightly advanced for standard P2 vocabulary focus. The difficulty is uneven: Section A/B are standard, but Section C contains complex multiple-choice questions on tones and polyphones. Exam format is mostly correct but lacks specific marks per question in the header. Answer key is excellent with detailed explanations. |
| Owl Alpha | Quiz | 5-1 | English | grammar | 9.7 | No | 10.0 | 10.0 | 9.0 | 10.0 | 10.0 | 10.0 | 9.0 | 9.5 | 10.0 | - | - | appropriate | High quality quiz. Content aligns perfectly with P2 grammar syllabus (nouns, pronouns, tenses, punctuation). Answer key provides excellent pedagogical explanations. Format is professional, though P2 papers usually use more visual aids/pictures for context. |
| Owl Alpha | Quiz | 5-1 | English | phonics | 9.3 | Yes | 10.0 | 10.0 | 7.0 | 10.0 | 10.0 | 10.0 | 8.0 | 9.0 | 10.0 | - | - | appropriate | High quality phonics quiz. Question 5 refers to a picture clue that is missing. Exam format is good but lacks specific marks per question in the main body (though present in answer key). Content is perfectly aligned with P2 phonics requirements. |
| Owl Alpha | Quiz | 5-1 | English | reading-comprehension | 9.5 | No | 10.0 | 10.0 | 9.0 | 10.0 | 10.0 | 10.0 | 8.5 | 9.0 | 9.0 | - | - | appropriate | High quality quiz. Language and difficulty are well-calibrated for P2. Question types (literal, inference, sequencing) align with syllabus. Marks and instructions are clear, though 40 marks for 20 questions in 40 minutes is slightly heavy for P2, but manageable. Answer key is excellent with clear marking notes. |
| Owl Alpha | Quiz | 5-1 | English | vocabulary | 9.8 | No | 10.0 | 10.0 | 9.0 | 10.0 | 10.0 | 10.0 | 9.5 | 10.0 | 10.0 | - | - | appropriate | High quality quiz. Vocabulary, compound words, and context clues are perfectly aligned with P2 syllabus. Answer key is excellent, providing clear explanations and common pitfalls. Format is professional and follows standard exam structures. |
| Owl Alpha | Quiz | 5-1 | Mathematics | addition-subtraction | 9.6 | No | 10.0 | 10.0 | 9.0 | 10.0 | 10.0 | 10.0 | 9.0 | 9.5 | 9.0 | - | - | appropriate | High quality quiz. Content aligns perfectly with P2 syllabus (3-digit addition/subtraction with regrouping). Section D provides excellent scaffolding for higher-order thinking. Answer key is exceptionally detailed with marking notes. Minor note: 40 minutes for 20 questions might be tight for some P2 students, but generally appropriate. |
| Owl Alpha | Quiz | 5-1 | Mathematics | length | 9.7 | Yes | 10.0 | 10.0 | 9.0 | 10.0 | 9.0 | 10.0 | 10.0 | 9.0 | 10.0 | - | - | appropriate | High quality quiz. Syllabus alignment for P2 length (m and cm) is perfect. Note: Questions 1 and 2 use ASCII art rulers which are functional but real diagrams are preferred for this level. Question 6 has two correct answers (A and C), which should be fixed to avoid student confusion. |
| Owl Alpha | Quiz | 5-1 | Mathematics | money | 9.6 | Yes | 10.0 | 10.0 | 8.0 | 10.0 | 10.0 | 9.0 | 9.0 | 10.0 | 10.0 | - | - | appropriate | The quiz is well-structured and aligns perfectly with the P2 Money syllabus. It uses emojis as placeholders for coins/notes, but in a real exam, these must be replaced with actual high-quality images. The answer key is excellent, providing clear methods and common mistakes. Notation is clean, though standard LaTeX could be used for more complex math. |
| Owl Alpha | Quiz | 5-1 | Mathematics | numbers | 9.8 | No | 10.0 | 10.0 | 9.0 | 10.0 | 10.0 | 10.0 | 9.5 | 10.0 | 10.0 | - | - | appropriate | High quality quiz. Excellent alignment with P2 syllabus (place value, patterns, odd/even). Answer key is exceptionally detailed with clear marking notes. Format is professional, though Section B/C could benefit from more varied question types (e.g., word problems) to match higher-order thinking in actual exams. |
| Owl Alpha | Quiz | 5-1 | Mathematics | shapes | 9.3 | Yes | 9.5 | 10.0 | 8.0 | 10.0 | 9.5 | 10.0 | 9.0 | 8.5 | 9.0 | - | - | appropriate | The quiz is well-structured and syllabus-aligned. However, it has significant missing image issues: Q12, Q17, and Q20 rely on visual cues or drawings that are only described in text placeholders, making them impossible for a student to solve as presented. The difficulty is appropriate for P2, though some perimeter questions (Q16, Q19) might be slightly advanced depending on the specific school's progression. |
| Owl Alpha | Quiz | 5-1 | Mathematics | time | 9.4 | Yes | 10.0 | 10.0 | 8.0 | 10.0 | 9.0 | 10.0 | 9.0 | 9.0 | 10.0 | - | - | appropriate | The quiz is well-structured and syllabus-aligned. However, Section A and Section B are fundamentally broken because they rely entirely on visual clock faces/diagrams which are missing. While the text descriptions in Section A attempt to compensate, Section B (drawing hands) is impossible to complete without a provided clock face template. |
| Owl Alpha | Quiz | 5-1 | Tamil | vocabulary | 8.7 | Yes | 9.0 | 8.0 | 7.0 | 10.0 | 9.0 | 10.0 | 8.0 | 8.0 | 9.0 | - | - | appropriate | Vocabulary level is appropriate for P2. Section A relies on image placeholders which are not actual images. Section B sentences are slightly simplistic but functional. Answer key provides good reasoning and marking schemes. |
Criteria
Scores run up to 10.0, which marks the best fit. Missing images are tracked as a yes/no flag per artifact; the summary tables show how many artifacts were flagged.
Language suitability
Syllabus adherence
Past-paper template adherence
No weird artefacts/symbols
Step-by-step answers
LaTeX/notation format
Exam paper format
Difficulty appropriateness
Doable within timeframe
Cheatsheet 3-point summaries
Parent guide syllabus fit
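The scoring algorithm itself is not described here. However, the detailed rows are consistent with Overall being the unweighted mean of whichever criterion scores are present, rounded to one decimal. The sketch below checks that reading against three rows. It is an inference from the published numbers, not the evaluator's documented method, and `overall_score` is a hypothetical name.

```python
def overall_score(criteria):
    """Unweighted mean of the scored criteria; None marks a "-" cell."""
    scored = [v for v in criteria if v is not None]
    return round(sum(scored) / len(scored), 1)


# Column order: Language, Syllabus, Template, Clean, Step Answers, Notation,
# Paper Format, Difficulty, Time Fit, 3-Point Summary, Parent Guide
chinese_cheatsheet = [4.0, 2.0, None, 10.0, None, None, None, 3.0, None, 8.0, None]
math_cheatsheet = [10.0, 10.0, None, 10.0, None, 10.0, None, 10.0, None, 9.0, None]
tamil_quiz = [9.0, 8.0, 7.0, 10.0, 9.0, 10.0, 8.0, 8.0, 9.0, None, None]

print(overall_score(chinese_cheatsheet))
print(overall_score(math_cheatsheet))
print(overall_score(tamil_quiz))
```

These reproduce the reported Overall values of 5.4, 9.8, and 8.7 for the three artifacts.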