Publisher: arXiv
Language models are increasingly used in Brazil, but most evaluation remains
English-centric. This paper presents Alvorada-Bench, a 4,515-question,
text-only benchmark drawn from five Brazilian university entrance examinations.
We evaluate twenty models under zero-shot, role-playing, and chain-of-thought
prompting, producing 270,900 responses with structured self-reports of
confidence, perceived difficulty, and Bloom level. The top models exceed 94%
accuracy overall, but accuracy declines on Mathematics and on the
engineering-oriented IME and ITA exams, indicating persistent weaknesses in
multi-step reasoning. Confidence is well calibrated and correlates with
perceived difficulty, indicating that models can accurately assess their own
certainty. A cost-accuracy analysis shows that high accuracy is achievable
at under $2 per 1K tokens. On ENEM 2024, the top model (O3) achieved a perfect
score on the Languages questions, while even the weakest system (GPT-4.1
Nano) underperformed humans only in Mathematics. Through exams that distill
decades of Brazilian educational priorities and assess millions of students
yearly, Alvorada-Bench establishes whether language models can navigate the
intersection of language, culture, and reasoning that defines academic
readiness in Brazil.
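
The calibration claim is the kind of result typically quantified with expected calibration error (ECE). Below is a minimal sketch of that computation, assuming confidences are self-reported on a 0-1 scale; the function name and the toy numbers are illustrative assumptions, not the paper's metric or data.

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """Bin responses by self-reported confidence and compare each bin's
    mean confidence to its empirical accuracy; return the weighted gap."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if not mask.any():
            continue
        gap = abs(correct[mask].mean() - confidences[mask].mean())
        ece += mask.mean() * gap  # weight by the fraction of samples in the bin
    return ece

# Toy example with made-up values (not data from the paper):
conf = [0.95, 0.80, 0.99, 0.60, 0.90]
hit = [1, 1, 1, 0, 1]
print(f"ECE = {expected_calibration_error(conf, hit):.3f}")
```

A low ECE means the model's stated confidence tracks how often it is actually right, which is what "well calibrated" asserts in the abstract.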
Study design
