Date:
Publisher: arXiv
The rapid advancement of large language models (LLMs) has enabled the
generation of coherent essays, making AI-assisted writing increasingly common
in educational and professional settings. Using large-scale empirical data, we
examine and benchmark the characteristics and quality of essays generated by
popular LLMs and discuss their implications for two key components of writing
assessments: automated scoring and academic integrity. Our findings highlight
limitations in existing automated scoring systems, such as e-rater, when
applied to essays generated or heavily influenced by AI, and identify areas for
improvement, including developing new features that capture deeper thinking and
recalibrating feature weights. Despite growing concerns that the
increasing variety of LLMs may undermine the feasibility of detecting
AI-generated essays, our results show that detectors trained on essays
generated from one model can often identify texts from others with high
accuracy, suggesting that effective detection could remain manageable in
practice.
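
To make the cross-model detection finding concrete, here is a minimal sketch (not the paper's method) of how one might check whether a detector trained on essays from one LLM transfers to essays from a different LLM. The corpora, the TF-IDF features, and the logistic-regression classifier are all illustrative assumptions.

```python
# Illustrative sketch only -- not the authors' detector. Assumes small
# hypothetical corpora of human-written essays and essays generated by
# two different LLMs ("model A" for training, "model B" for transfer).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.metrics import roc_auc_score

# Placeholder data; in practice these would be loaded from essay files.
human_train = ["Essay written by a student...", "Another human essay..."]
model_a_essays = ["Essay generated by model A...", "More model A text..."]
human_test = ["A held-out human essay...", "Another held-out human essay..."]
model_b_essays = ["Essay generated by model B...", "More model B text..."]

# Train a detector on human vs. model A essays only.
train_texts = human_train + model_a_essays
train_labels = [0] * len(human_train) + [1] * len(model_a_essays)

detector = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2), min_df=1),
    LogisticRegression(max_iter=1000),
)
detector.fit(train_texts, train_labels)

# Evaluate cross-model transfer on essays from model B, unseen during training.
test_texts = human_test + model_b_essays
test_labels = [0] * len(human_test) + [1] * len(model_b_essays)
scores = detector.predict_proba(test_texts)[:, 1]
print("Cross-model AUC:", roc_auc_score(test_labels, scores))
```

With a realistic corpus, a high AUC on model B's essays would indicate the kind of cross-model generalization the abstract describes.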
What is the application?
Who is the user?
What is the user's age?
Why use AI?
Study design
