Date
Publisher
arXiv
The recent explosion in popularity of large language models (LLMs) has
inspired learning engineers to incorporate them into adaptive educational tools
that automatically score summary writing. Understanding and evaluating LLMs is
vital before deploying them in critical learning environments, yet their
unprecedented size and expanding number of parameters inhibit transparency and
impede trust when they underperform. Through a collaborative user-centered
design process with several learning engineers building and deploying summary
scoring LLMs, we characterized fundamental design challenges and goals around
interpreting their models, including aggregating large text inputs, tracking
score provenance, and scaling LLM interpretability methods. To address these
concerns, we developed iScore, an interactive visual analytics tool for
learning engineers to upload, score, and compare multiple summaries
simultaneously. Tightly integrated views allow users to iteratively revise the
language in summaries, track changes in the resulting LLM scores, and visualize
model weights at multiple levels of abstraction. To validate our approach, we
deployed iScore with three learning engineers over the course of a month. We
present a case study where interacting with iScore led a learning engineer to
improve their LLM's score accuracy by three percentage points. Finally, we
conducted qualitative interviews with the learning engineers that revealed how
iScore enabled them to understand, evaluate, and build trust in their LLMs
during deployment.
What is the application?
Who is the user?
What age?
Why use AI?