Date
Publisher
arXiv
Large Language Models (LLMs) have recently demonstrated strong emergent
abilities in complex reasoning and zero-shot generalization, showing
unprecedented potential for LLM-as-a-judge applications in education, peer
review, and data quality evaluation. However, their robustness under prompt
injection attacks, where malicious instructions are embedded into the content
to manipulate outputs, remains a significant concern. In this work, we explore
a frustratingly simple yet effective attack setting to test whether LLMs can be
easily misled. Specifically, we evaluate LLMs on basic arithmetic questions
(e.g., "What is 3 + 2?") presented as either multiple-choice or true-false
judgment problems within PDF files, where hidden prompts are injected into the
file. Our results reveal that LLMs are indeed vulnerable to such hidden prompt
injection attacks, even in these trivial scenarios, highlighting serious
robustness risks for LLM-as-a-judge applications.
What is the application?
Who is the user?
Why use AI?
Study design
