Date
Publisher
arXiv
Large language models (LLMs) can transform education, but their optimization
for direct question-answering often undermines effective pedagogy, which
requires strategically withholding answers. To mitigate this, we propose an
online reinforcement learning (RL)-based alignment framework that can quickly
adapt LLMs into effective tutors using simulated student-tutor interactions by
emphasizing pedagogical quality and guided problem-solving over simply giving
away answers. We use our method to train a 7B-parameter tutor model, without
human annotations, that reaches performance comparable to larger proprietary
models such as LearnLM. We introduce a controllable reward weighting to balance
pedagogical support and student solving accuracy, allowing us to trace the
Pareto frontier between these two objectives. Our models preserve reasoning
capabilities better than single-turn SFT baselines and can optionally
enhance interpretability through thinking tags that expose the model's
instructional planning.
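
For illustration, a minimal sketch of how such a controllable reward weighting might look, assuming a simple convex combination of a pedagogy score and a student solving-accuracy score; the function name, the weighting form, and the `lam` parameter are illustrative assumptions, not details taken from the paper:

```python
# Hypothetical sketch: convex mixture of a pedagogy reward and a student
# solving-accuracy reward. Names and the mixing form are assumptions for
# illustration; the paper's actual reward formulation may differ.
def combined_reward(pedagogy_score: float, accuracy_score: float, lam: float = 0.5) -> float:
    """Blend pedagogical support (lam -> 1) with student solving accuracy (lam -> 0).

    Sweeping lam over [0, 1] traces a Pareto frontier between the two objectives.
    """
    assert 0.0 <= lam <= 1.0
    return lam * pedagogy_score + (1.0 - lam) * accuracy_score


# Example: weight pedagogical support more heavily than final-answer accuracy.
print(combined_reward(pedagogy_score=0.9, accuracy_score=0.4, lam=0.7))  # 0.75
```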
What is the application?
Who is the user?
What age is the user?
Why use AI?
Study design
