Date
Publisher
arXiv
The integration of Large Language Models (LLMs) in K--12 education offers
both transformative opportunities and emerging risks. This study explores how
students may Trojanize prompts to elicit unsafe or unintended outputs from
LLMs, bypassing established content moderation systems with safety guardrils.
Through a systematic experiment involving simulated K--12 queries and
multi-turn dialogues, we expose key vulnerabilities in GPT-3.5 and GPT-4. This
paper presents our experimental design, detailed findings, and a prototype
tool, TrojanPromptGuard (TPG), to automatically detect and mitigate Trojanized
educational prompts. These insights aim to inform both AI safety researchers
and educational technologists on the safe deployment of LLMs for educators.
What is the application?
Who is the user?
Why use AI?
