Kimi K2.6 Free Benchmark Whitepaper

A model-specific whitepaper on Kimi K2.6 Free benchmark results for generated TuitionGoWhere papers and quizzes.

Singapore

This whitepaper reviews how Kimi K2.6 Free performed in the TuitionGoWhere benchmark. The benchmark used a separate evaluator model to score generated learning materials against language fit, syllabus fit, answer quality, exam format, notation, difficulty, missing-image risk, and other practical checks.

The results should be read as an internal quality signal, not a formal education certification. Under the current quality policy, any resource that scores below 8.0 is treated as a review warning and is hidden from the main subject pages.
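The threshold rule above can be sketched in a few lines of Python. This is only an illustration, not TuitionGoWhere's actual implementation: the `Report` class, its field names, and the 0 to 10 score scale are assumptions; only the 8.0 cut-off comes from this whitepaper.

```python
from dataclasses import dataclass

# Cut-off taken from the quality policy described in this whitepaper.
REVIEW_THRESHOLD = 8.0


@dataclass
class Report:
    resource_id: str  # hypothetical identifier for a generated paper or quiz
    score: float      # evaluator score, assumed to be on a 0-10 scale


def is_visible(report, threshold=REVIEW_THRESHOLD):
    """A score strictly below the threshold is hidden; exactly 8.0 stays visible."""
    return report.score >= threshold


def split_by_policy(reports):
    """Return (visible, flagged_for_review) lists, preserving input order."""
    visible = [r for r in reports if is_visible(r)]
    flagged = [r for r in reports if not is_visible(r)]
    return visible, flagged
```

For example, `split_by_policy([Report("a", 9.2), Report("b", 7.9)])` keeps `a` visible and flags `b` for review.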

Score Profile

Total reports: 854
Quiz and paper reports: 854
Overall score: 9.2
Quiz and paper average: 9.2
Reports below 8.0: 32
Below-8 rate: 3.7%

Content Coverage

Content type    Reports    Average score
Exam papers     449        9.2
Quizzes         405        9.2
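The published figures can be cross-checked with simple arithmetic. The sketch below uses only numbers stated in this whitepaper; the variable names are illustrative. Because the per-type averages are already rounded to one decimal place, the weighted average is an approximation of the true overall score.

```python
# Figures copied from the Score Profile and Content Coverage sections.
total_reports = 854
below_threshold = 32
coverage = {
    "Exam papers": (449, 9.2),  # (report count, average score)
    "Quizzes": (405, 9.2),
}

# The per-type counts should add up to the total number of reports.
count_sum = sum(count for count, _ in coverage.values())

# Overall score as a count-weighted average of the per-type averages.
weighted_avg = sum(count * avg for count, avg in coverage.values()) / count_sum

# Share of reports below 8.0, as a percentage.
below_rate_pct = 100 * below_threshold / total_reports

print(count_sum == total_reports)  # True
print(round(weighted_avg, 1))      # 9.2
print(round(below_rate_pct, 1))    # 3.7
```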

Criterion Scores

Criterion                      Score    Interpretation
LaTeX and notation handling    9.9      Very strong handling of notation and formulas.
Language suitability           9.6      Generally suitable wording for the intended level.
Syllabus adherence             9.5      Good alignment with expected topics.
Step-by-step answers           8.5      Some answer keys needed clearer working.
Difficulty fit                 8.8      Several cases needed better level calibration.
Template adherence             8.9      Format was good, but not always exam-tight.

Main Findings

  • Kimi K2.6 Free performed well enough to be useful for a broad generation run: it averaged 9.2 across 854 reports, with 3.7% of reports scoring below 8.0. This is notable given that it was used as a free model.
  • Strong areas included Primary 1 Mathematics, Secondary 1 Geography, Primary 1 Malay, Secondary 3 Chemistry, Secondary 1 Science, and Primary 5 or Primary 6 Higher Chinese in the tested sample.
  • Weak cases appeared in Primary 6 Mathematics, Higher Tamil, Tamil, and some deep-answer Mathematics or Additional Mathematics tasks.

Subject and Level Fit

  • Strong fit: Primary Mathematics, Secondary 1 Geography and Science, Secondary 3 Chemistry and Physics, and selected Higher Chinese materials.
  • Use with review: Primary 6 Mathematics, Tamil, Higher Tamil, and answer-heavy mathematics tasks.
  • Use with review: resources where a missing diagram changes the meaning of the question.

Recurring Comment Themes

The benchmark comments were reviewed for repeated patterns. These themes overlap, so they should not be added together as independent failure counts.

  • Answer depth was the largest theme. The benchmark often asked for clearer steps, better reasoning, or a more useful answer key.
  • Difficulty calibration was another recurring issue, especially for PSLE-level Mathematics and higher-demand topics.
  • Some artefact and formatting issues appeared, so generated files should continue to be checked for tags, odd symbols, or incomplete blocks.
  • Language quality was usually good, but non-English subjects still need a dedicated language check.
  • LaTeX and notation were a strength, but notation-heavy outputs should still be rendered on the site before indexing.
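The caution about overlapping themes can be illustrated with a small sketch. The report IDs and theme tags below are invented for illustration and are not benchmark data; the point is only that summing per-theme counts overstates the number of affected reports when one report carries several themes.

```python
from collections import Counter

# Hypothetical tagging: each report ID maps to the set of themes found in its comments.
tagged = {
    "r1": {"answer_depth", "difficulty"},
    "r2": {"answer_depth"},
    "r3": {"formatting", "answer_depth"},
    "r4": set(),  # no recurring theme in this report
}

# Per-theme counts: a report is counted once for every theme it carries.
theme_counts = Counter(theme for themes in tagged.values() for theme in themes)

# Naive sum of theme counts double-counts reports with more than one theme.
naive_total = sum(theme_counts.values())

# Number of distinct reports that carry at least one theme.
affected_reports = sum(1 for themes in tagged.values() if themes)

print(theme_counts["answer_depth"])  # 3
print(naive_total)                   # 5
print(affected_reports)              # 3
```

In this toy data, five theme mentions come from only three reports, which is why theme counts are better read as relative emphasis than as a failure total.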

Prompting and Workflow Implications

TuitionGoWhere does not publish its original internal prompts. The points below are high-level improvement directions based on the benchmark findings.

  • Add stronger worked-solution instructions for Mathematics and Science.
  • Use PSLE and exam-level calibration checks before accepting Primary 6 outputs.
  • Reject or rerun outputs with empty sections, odd tags, or incomplete answer blocks.
  • Use a separate language check for Tamil and Higher Tamil.
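A reject-or-rerun check of the kind listed above might look like the following. This is a heuristic sketch under stated assumptions, not the TuitionGoWhere pipeline: the section names, the tag pattern, and the list of odd symbols are all hypothetical choices made for illustration.

```python
import re

# Assumed heuristic: bare HTML/XML-like tags such as <div>, </think>, or <br/>.
TAG_PATTERN = re.compile(r"</?[A-Za-z][A-Za-z0-9]*\s*/?>")

# Assumed list of odd symbols; U+FFFD is the Unicode replacement character.
ODD_SYMBOLS = ("\ufffd",)


def find_issues(sections):
    """Return a list of issue strings for a dict of section name -> text."""
    issues = []
    for name, text in sections.items():
        if not text.strip():
            issues.append(f"empty section: {name}")
        if TAG_PATTERN.search(text):
            issues.append(f"stray tag in: {name}")
        if any(symbol in text for symbol in ODD_SYMBOLS):
            issues.append(f"odd symbol in: {name}")
    return issues


def answers_complete(question_count, answers):
    """True when every question has a non-blank answer."""
    return len(answers) == question_count and all(a.strip() for a in answers)
```

An output would be rejected or rerun when `find_issues` returns a non-empty list or `answers_complete` returns `False`. The tag pattern can also match mathematical text such as `a<b>c`, so a hit is better treated as a prompt for review than as proof of a defect.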