Deciphering Stereotypes in Pre-Trained Language Models

Authors

Weicheng Ma, Henry Scheible, Brian Wang, Goutham Veeramachaneni, Pratim Chowdhary, Alan Sun, Andrew Koulogeorge, Lili Wang, Diyi Yang, Soroush Vosoughi

Date

12/2023

Publisher

ACL Anthology

Link

https://aclanthology.org/2023.emnlp-main.697/

Warning: This paper contains content that is stereotypical and may be upsetting.

This paper addresses the issue of demographic stereotypes present in Transformer-based pre-trained language models (PLMs) and aims to deepen our understanding of how these biases are encoded in these models. To accomplish this, we introduce an easy-to-use framework for examining the stereotype-encoding behavior of PLMs through a combination of model probing and textual analyses. Our findings reveal that a small subset of attention heads within PLMs are primarily responsible for encoding stereotypes and that stereotypes toward specific minority groups can be identified using attention maps on these attention heads. Leveraging these insights, we propose an attention-head pruning method as a viable approach for debiasing PLMs, without compromising their language modeling capabilities or adversely affecting their performance on downstream tasks.

Outcomes – Other Academic

Study design

Impact – Quasi-experimental
