Owl Alpha Benchmark Whitepaper
A model-specific whitepaper on Owl Alpha benchmark results across primary, secondary, O-Level, and A-Level materials.
This whitepaper reviews how Owl Alpha performed in the TuitionGoWhere benchmark. The benchmark used a separate evaluator model to score generated learning materials against language fit, syllabus fit, answer quality, exam format, notation, difficulty, missing-image risk, and other practical checks.
The results should be read as an internal quality signal, not a formal education certification. Under the current quality policy, a score below 8.0 is treated as a review warning, and the affected material is hidden from the main subject pages.
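The quality policy above can be sketched as a simple threshold filter. This is a minimal illustration, not TuitionGoWhere's implementation: the `Report` class, its field names, and the assumption that a score of exactly 8.0 stays visible are all hypothetical.

```python
from dataclasses import dataclass

# Threshold taken from the policy described above.
REVIEW_THRESHOLD = 8.0


@dataclass
class Report:
    material_id: str  # hypothetical identifier for a generated material
    score: float      # evaluator score for that material


def split_by_policy(reports):
    """Return (visible, needs_review) lists based on the review threshold.

    Assumption: a score of exactly 8.0 is not "below 8.0", so it stays visible.
    """
    visible = [r for r in reports if r.score >= REVIEW_THRESHOLD]
    needs_review = [r for r in reports if r.score < REVIEW_THRESHOLD]
    return visible, needs_review
```

For example, reports scoring 9.2 and 8.0 would remain visible under this sketch, while a report scoring 7.9 would be routed to review.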
Score Profile
Content Coverage
| Content type | Reports | Average score |
|---|---|---|
| Exam papers | 891 | 9.2 |
| Quizzes | 817 | 9.2 |
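As a quick check on the table above, the two content types can be combined into a report-weighted average. The sketch below uses only the figures in the table; because those averages are already rounded to one decimal place, the combined value is approximate.

```python
# Figures copied from the Content Coverage table: (reports, average score).
coverage = {
    "Exam papers": (891, 9.2),
    "Quizzes": (817, 9.2),
}


def total_reports(rows):
    """Sum the report counts across content types."""
    return sum(count for count, _ in rows.values())


def weighted_average(rows):
    """Average score weighted by the number of reports per content type."""
    weighted_sum = sum(count * score for count, score in rows.values())
    return weighted_sum / total_reports(rows)


print(total_reports(coverage))               # 1708
print(round(weighted_average(coverage), 2))  # 9.2
```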
Criterion Scores
| Criterion | Score | Interpretation |
|---|---|---|
| Language suitability | 9.8 | Strong general language fit, especially in English-medium subjects. |
| LaTeX and notation handling | 9.8 | Strong formula and notation handling overall. |
| Clean output | 9.6 | Usually produced readable, structured output. |
| Difficulty fit | 8.5 | The main weakness was difficulty calibration. |
| Step-by-step answers | 8.7 | Answer working sometimes needed more detail. |
| Template adherence | 8.9 | Some papers needed tighter exam-style format control. |
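To see where review effort is best spent, the criterion scores above can be ranked from weakest to strongest. The snippet below is an illustrative sketch that uses only the values in the table; the function name is hypothetical.

```python
# Scores copied from the Criterion Scores table.
criterion_scores = {
    "Language suitability": 9.8,
    "LaTeX and notation handling": 9.8,
    "Clean output": 9.6,
    "Difficulty fit": 8.5,
    "Step-by-step answers": 8.7,
    "Template adherence": 8.9,
}


def weakest_first(scores):
    """Return (criterion, score) pairs sorted from lowest to highest score."""
    return sorted(scores.items(), key=lambda item: item[1])


for name, score in weakest_first(criterion_scores)[:3]:
    print(f"{name}: {score}")
```

The three lowest entries are difficulty fit, step-by-step answers, and template adherence, which matches the weaknesses named in the interpretation column.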
Main Findings
- Owl Alpha was useful for broad coverage. It generated a much larger evaluated sample than most models and still averaged above 9.0.
- The model performed well in Primary Mathematics, Secondary 3 Biology, Chemistry, Secondary 4 Pure Biology, A-Level Physics H1, and other science-heavy areas.
- The larger sample also exposed more weak cases. Low-score reports were concentrated around difficulty fit, template drift, and some Tamil or Malay outputs.
Subject and Level Fit
- Strong fit: Primary 1 to Primary 3 Mathematics, Secondary sciences, and selected A-Level science topics.
- Use with review: Primary Tamil, Malay, Higher Tamil, and some Additional Mathematics papers, where low-score cases appeared more often.
- Use with review: high-stakes exam papers where the exact paper template and timing matter.
Recurring Comment Themes
The benchmark comments were reviewed for repeated patterns. These themes overlap, so they should not be added together as independent failure counts.
- Difficulty calibration was a repeated issue. Some papers were usable but not well matched to the intended level or time limit.
- Template drift appeared when questions were educationally reasonable but did not fully follow the expected exam-paper style.
- Answer explanations sometimes needed clearer steps, especially in Mathematics and Science.
- Language-subject issues appeared in lower-scoring Tamil and Malay cases. These need stricter language validation before publication.
- Missing visuals were frequent because many generated questions depend on diagrams, graphs, or images that may need a later image pass.
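The note about overlapping themes can be illustrated with a small sketch. The tagged reports below are invented for illustration and are not benchmark data. When one report carries several themes, the per-theme counts add up to more than the number of affected reports.

```python
from collections import Counter

# Hypothetical example data: each report may carry more than one theme.
tagged_reports = {
    "report-1": {"difficulty", "template"},
    "report-2": {"difficulty", "missing_visuals"},
    "report-3": {"language"},
}


def theme_counts(tagged):
    """Count how many reports mention each theme."""
    counts = Counter()
    for themes in tagged.values():
        counts.update(themes)
    return counts


def affected_reports(tagged):
    """Count reports that carry at least one theme."""
    return sum(1 for themes in tagged.values() if themes)


counts = theme_counts(tagged_reports)
print(sum(counts.values()))              # 5 theme mentions
print(affected_reports(tagged_reports))  # 3 distinct reports
```

Summing the theme counts here gives five mentions across only three reports, which is why the themes should be read as overlapping patterns, not as independent failure totals.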
Prompting and Workflow Implications
TuitionGoWhere does not publish its original internal prompts. The points below are high-level improvement directions based on the benchmark findings.
- Add stricter difficulty bands by level and subject before generation starts.
- Use a paper-template checklist with section names, marks, timing, and allowed question types.
- Add a non-English language validation step for Mother Tongue and Higher Mother Tongue subjects.
- For visual questions, require either a precise image brief or a rewritten text-only version.
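Two of the directions above, the paper-template checklist and the rule for visual questions, can be sketched as simple pre-publication checks. This is a minimal illustration under stated assumptions, not TuitionGoWhere's workflow: every class, field, and function name is hypothetical, and the template values in the usage example are invented.

```python
from dataclasses import dataclass


@dataclass
class Section:
    name: str
    marks: int
    question_types: list  # question types used in this section


@dataclass
class PaperTemplate:
    section_names: list           # expected section names, in order
    total_marks: int              # expected total marks for the paper
    duration_minutes: int         # expected time limit
    allowed_question_types: set   # question types the template permits


def check_paper(sections, duration_minutes, template):
    """Return a list of human-readable problems; an empty list means a pass."""
    problems = []
    names = [s.name for s in sections]
    if names != template.section_names:
        problems.append(f"section names {names} do not match the template")
    marks = sum(s.marks for s in sections)
    if marks != template.total_marks:
        problems.append(f"total marks {marks} != {template.total_marks}")
    if duration_minutes != template.duration_minutes:
        problems.append(
            f"duration {duration_minutes} != {template.duration_minutes}"
        )
    for s in sections:
        extra = set(s.question_types) - template.allowed_question_types
        if extra:
            problems.append(f"{s.name} uses disallowed types {sorted(extra)}")
    return problems


def visual_question_ok(needs_image, image_brief="", text_only_version=""):
    """A visual question passes only with an image brief or a text-only rewrite."""
    if not needs_image:
        return True
    return bool(image_brief.strip() or text_only_version.strip())


# Usage with an invented template.
template = PaperTemplate(["Section A", "Section B"], 100, 90, {"mcq", "structured"})
paper = [
    Section("Section A", 40, ["mcq"]),
    Section("Section B", 60, ["structured"]),
]
print(check_paper(paper, 90, template))  # []
```

Returning a list of problems, instead of a single pass or fail flag, lets a reviewer see every template deviation at once. Difficulty bands and language validation would need subject-specific rules and are left out of this sketch.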