<?xml version="1.0" encoding="utf-8" standalone="yes" ?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
  <channel>
    <title>Educational Question Generation | Tong Lu&#39;s Homepage</title>
    <link>https://lutong.space/tags/educational-question-generation/</link>
      <atom:link href="https://lutong.space/tags/educational-question-generation/index.xml" rel="self" type="application/rss+xml" />
    <description>Educational Question Generation</description>
    <generator>HugoBlox Kit (https://hugoblox.com)</generator><language>en-us</language><lastBuildDate>Fri, 10 Apr 2026 00:00:00 +0000</lastBuildDate>
    <image>
      <url>https://lutong.space/media/icon_hu_702a800cd775dbac.png</url>
      <title>Educational Question Generation</title>
      <link>https://lutong.space/tags/educational-question-generation/</link>
    </image>
    
    <item>
      <title>CogBench: Benchmarking Cognitive Alignment of Large Language Models in Educational Question Answering</title>
      <link>https://lutong.space/post/acl2026-cogbench/</link>
      <pubDate>Fri, 10 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://lutong.space/post/acl2026-cogbench/</guid>
      <description>&lt;h2 id=&#34;-introduction&#34;&gt;🚀 Introduction&lt;/h2&gt;
&lt;p align=&#34;center&#34;&gt;
  &lt;strong&gt;CogBench&lt;/strong&gt; is a benchmark for assessing the cognitive alignment of Large Language Models (LLMs) in educational question answering.
&lt;/p&gt;
&lt;h2 id=&#34;-highlights&#34;&gt;🔥 Highlights&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;Benchmark: 2,100 K–12 mathematics questions, each with multiple valid, cognition-differentiated solutions&lt;/li&gt;
&lt;li&gt;Average 2.16 solutions per question; 3.2 curriculum knowledge components per question&lt;/li&gt;
&lt;li&gt;Grade coverage: Primary 40%, Middle 35%, High 25%&lt;/li&gt;
&lt;li&gt;3 cognition-aware QA tasks; 3 complementary metrics (CA, KC, KD)&lt;/li&gt;
&lt;li&gt;Curriculum-Aware Knowledge Graph (CAKG) aligned to grade levels and solution strategies&lt;/li&gt;
&lt;li&gt;Evaluated 11 LLMs (open-source and proprietary) via APIs (Sept–Dec 2025)&lt;/li&gt;
&lt;li&gt;Key findings:
&lt;ul&gt;
&lt;li&gt;Large gap between standard accuracy (up to 0.942) and cognitive alignment under unconstrained QA (best CA 0.534, KC 0.604)&lt;/li&gt;
&lt;li&gt;Grade-constrained prompting improves alignment (best CA 0.560, KC 0.753; KD up to 0.790)&lt;/li&gt;
&lt;li&gt;Knowledge-constrained prompting often reduces alignment due to activation of higher-level parametric patterns&lt;/li&gt;
&lt;li&gt;Fine-tuning (SFT + DPO) improves CA (0.47→0.63) and KC (0.54→0.68) with slight drops in ACC (0.88→0.83) and KD (0.72→0.61)&lt;/li&gt;
&lt;li&gt;Automatic metrics correlate well with expert human judgments on consistency and diversity&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&#34;-framework&#34;&gt;🔥 Framework&lt;/h2&gt;
&lt;p align=&#34;center&#34;&gt;
  &lt;img src=&#34;cogbenchframework.png&#34; alt=&#34;Cogbench Framework&#34; width=&#34;800&#34; /&gt;
&lt;/p&gt;
&lt;h2 id=&#34;-dataset--annotations&#34;&gt;📊 Dataset &amp;amp; Annotations&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;Sources: 1.2K Olympiad problems (from a public website) + 0.9K CMMath problems&lt;/li&gt;
&lt;li&gt;Coverage: Primary (Grades 1–6), Middle (7–9), High (10–12)&lt;/li&gt;
&lt;li&gt;Per-question: at least two solutions at different cognitive levels&lt;/li&gt;
&lt;li&gt;Generation base model for multi-solution sampling: Qwen3-30B-A3B&lt;/li&gt;
&lt;li&gt;Expert alignment: education experts verify solution–knowledge–grade mapping&lt;/li&gt;
&lt;li&gt;Reliability: high-quality, cognition-aware labels after expert review&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&#34;-usage&#34;&gt;📦 Usage&lt;/h2&gt;
&lt;p&gt;The evaluation program lives in the &lt;code&gt;evaluation&lt;/code&gt; folder, and the metrics it uses are in the &lt;code&gt;metric&lt;/code&gt; folder.&lt;/p&gt;
&lt;h3 id=&#34;three-prompting-modes&#34;&gt;Three prompting modes:&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Unconstrained: &lt;code&gt;response1_title_only&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Grade-constrained: &lt;code&gt;response2_title_grade&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Knowledge-constrained: &lt;code&gt;response3_title_knowledge&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
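&lt;p&gt;The three modes differ only in what context the prompt carries. A minimal sketch of how such prompts might be assembled (the function and field names below are hypothetical illustrations, not the CogBench codebase&#39;s actual API):&lt;/p&gt;

```python
# Hypothetical sketch of the three CogBench prompting modes.
# Only the mode keys (response1_title_only, etc.) come from the benchmark;
# the wording of the constraints is an illustrative assumption.

def build_prompts(question: str, grade: str, knowledge: list[str]) -> dict[str, str]:
    """Return one prompt string per prompting mode."""
    return {
        # Unconstrained: the question alone.
        "response1_title_only": question,
        # Grade-constrained: the question plus the target grade level.
        "response2_title_grade": (
            f"{question}\nAnswer at the level of a {grade} student."
        ),
        # Knowledge-constrained: the question plus the allowed knowledge components.
        "response3_title_knowledge": (
            f"{question}\nUse only these knowledge components: {', '.join(knowledge)}."
        ),
    }

prompts = build_prompts(
    "What is 12 x 15?", "Grade 4", ["multiplication", "place value"]
)
```

&lt;p&gt;Under this sketch, grade-constrained prompting steers the model toward a cognition-appropriate solution path, which matches the reported improvement in CA and KC under that mode.&lt;/p&gt;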
&lt;h3 id=&#34;run-the-evaluation-scripts&#34;&gt;Run the evaluation scripts:&lt;/h3&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-shell&#34; data-lang=&#34;shell&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;python -m evaluation.response --model_name gpt-5-nano-2025-08-07
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;python -m evaluation.evaluate_response --model_name gpt-5-nano-2025-08-07
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;python -m evaluation.find_knowledge_used --model_name gpt-5-nano-2025-08-07
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;python -m evaluation.calculate_metrics --model_name gpt-5-nano-2025-08-07
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;h2 id=&#34;-citation&#34;&gt;📄 Citation&lt;/h2&gt;
&lt;p&gt;If you use CogBench or our construction framework, please cite:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bibtex&#34; data-lang=&#34;bibtex&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;nc&#34;&gt;@article&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;{&lt;/span&gt;&lt;span class=&#34;nl&#34;&gt;CogBench2026&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;  &lt;span class=&#34;na&#34;&gt;title&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;s&#34;&gt;{CogBench: Benchmarking Cognitive Alignment of Large Language Models in Educational Question Answering}&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;  &lt;span class=&#34;na&#34;&gt;author&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;s&#34;&gt;{Tong Lu and Zhichun Wang and Yuanhao Sun and Yaoyu Zhou and Mingrui Li and Yiming Guan and Zhiyong Bai}&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;  &lt;span class=&#34;na&#34;&gt;year&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;s&#34;&gt;{2026}&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;  &lt;span class=&#34;na&#34;&gt;journal&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;=&lt;/span&gt;&lt;span class=&#34;s&#34;&gt;{Findings of ACL}&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;p&#34;&gt;}&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;</description>
    </item>
    
  </channel>
</rss>
