Date:
Publisher: arXiv
Generative AI and large language models hold great promise in enhancing
computing education by powering next-generation educational technologies for
introductory programming. Recent works have studied these models for different
scenarios relevant to programming education; however, these works are limited
because they typically consider already outdated models or cover only specific
scenarios. Consequently, there is a lack of a systematic study that
benchmarks state-of-the-art models for a comprehensive set of programming
education scenarios. In our work, we systematically evaluate two models,
ChatGPT (based on GPT-3.5) and GPT-4, and compare their performance with human
tutors for a variety of scenarios. We evaluate using five introductory Python
programming problems and real-world buggy programs from an online platform, and
assess performance using expert-based annotations. Our results show that GPT-4
drastically outperforms ChatGPT (based on GPT-3.5) and comes close to human
tutors' performance for several scenarios. These results also highlight
settings where GPT-4 still struggles, providing exciting future directions for
developing techniques to improve the performance of these models.
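To make the kind of scenario described above concrete, here is a minimal sketch of how one might query GPT-4 for tutor-style feedback on a buggy introductory Python program. This is an illustration only, not the paper's actual prompts or evaluation pipeline: the buggy `count_vowels` submission, the prompt wording, and the use of the official `openai` Python client (v1+, with an OPENAI_API_KEY set in the environment) are all assumptions for the example.

```python
# Minimal sketch (not the paper's pipeline): ask GPT-4 for a repair hint
# on a hypothetical buggy introductory Python program.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Hypothetical student submission: it should count vowels,
# but it resets the counter inside the loop.
buggy_program = '''
def count_vowels(text):
    for ch in text:
        count = 0
        if ch in "aeiou":
            count += 1
    return count
'''

prompt = (
    "You are a tutor for an introductory Python course. "
    "Give the student a single, concise hint (not the full fix) "
    "about what is wrong with this program:\n" + buggy_program
)

response = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": prompt}],
)
print(response.choices[0].message.content)
```

In the study summarized above, model outputs like this hint would then be judged against human tutors' responses via expert-based annotations.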
What is the application?
Who is the target audience (what age group)?
Why use AI?
Study design
