
Bias

Human Feedback Leading to Erasure of Marginalized Groups

Ironically, the very process created to remove toxic language and mitigate bias in LLMs has led to diminished representation and erasure of marginalized groups. Although the concept of the human feedback loop is a good one, OpenAI recognizes that “…aligning model outputs to the values of specific humans introduces difficult choices with societal implications, and ultimately, we must establish responsible, inclusive processes for making these decisions” (Aligning Language Models to Follow Instructions, n.d.). Moreover, not all data was reviewed by more than one individual; indeed, OpenAI admits that most of its data was reviewed just once and that interrater reliability was only about 73%. This gives an inordinate amount of power to the 40 people who provided the feedback, a group that is non-representative insofar as its members were all English speakers working for OpenAI, which excludes an enormous swath of human experience and characteristics.
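To make the reliability figure concrete, here is a minimal sketch of one simple way such a number can be computed: pairwise percent agreement. This is an illustration only. The function name and the sample ratings are hypothetical, and this section does not establish how OpenAI actually calculated its figure.

```python
from itertools import combinations

def pairwise_agreement(labels_by_item):
    """Share of reviewer pairs that gave the same label to the same item."""
    agree = 0
    total = 0
    for labels in labels_by_item:
        # Items with a single reviewer yield no pairs, so they cannot be checked.
        for first, second in combinations(labels, 2):
            total += 1
            agree += first == second
    if total == 0:
        raise ValueError("no item was labelled by more than one reviewer")
    return agree / total

# Hypothetical ratings: each inner list holds the labels one item received.
ratings = [
    ["toxic", "toxic"],
    ["ok", "toxic"],
    ["ok", "ok"],
    ["ok", "ok"],
    ["toxic"],  # reviewed only once: contributes nothing to the estimate
]
rate = pairwise_agreement(ratings)  # 3 of the 4 comparable pairs agree
```

The last item shows why single review matters: a label that only one person assigned can never be tested against a second opinion, so that reviewer's judgment stands unchallenged.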

There is a further way in which the reinforcement learning from human feedback (RLHF) process carried out by these 40 non-representative reviewers leads to the erasure of certain groups: the contractors were asked to remove “toxic language” and were trained on what to look for. So, even if the people judging the language had no biases of their own (which of course they do), they were tasked with flagging certain words and phrases as inappropriate or toxic.

When the contractors flagged passages as offensive, they trained the model not to produce that type of passage again. This has presumably led to a dearth of the tool’s “knowledge” about topics that use terms labelled as offensive, including terms that may formerly have been deemed offensive but have since been reclaimed by LGBT groups, ethnic groups, and other marginalized communities. Dodge et al. (2021) found that the common practice of removing text containing “gay” or “lesbian” from the training set meant that the models were less able to work with passages written about those groups of people. They recommend against using block lists to filter text scraped from the web and note that text about sexual orientation is the most likely to be filtered out, more so than text about racial or ethnic identities. Much of the automatically filtered text containing “gay” or “lesbian” is non-offensive or non-sexual (Dodge et al., 2021).

Most of the banned words on these block lists are sexual in nature, presumably so that pornography is filtered out. However, the lists contain some words that have more than one meaning, so removing the “bad” sense also removes the innocuous one (e.g., in French, baiser is “to kiss,” but also a vulgar word for sexual intercourse). The lists also contain some legitimate words for body parts (primarily genitals) as well as the terms “rape” and “date rape,” so any text about those topics is removed (e.g., support for survivors of sexual violence, laws or policies around sexual assault, etc.). As Rettberg (2022) points out, “Removing sex words also means that non-offensive material about queer culture, including legal documents about same-sex marriage, have been filtered out.”
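The mechanism described above can be sketched in a few lines. This is a minimal illustration that assumes a filter which drops any document containing a listed word. The two-word block list, the function names, and the sample documents are all hypothetical and are not taken from any real list or pipeline.

```python
import re

# Hypothetical block list for illustration only; real lists are far longer.
BLOCK_LIST = {"baiser", "rape"}

def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())

def is_blocked(text, block_list=BLOCK_LIST):
    """Return True if any word in the text appears on the block list."""
    return any(token in block_list for token in tokenize(text))

def filter_corpus(documents, block_list=BLOCK_LIST):
    """Keep only documents containing no block-listed word."""
    return [doc for doc in documents if not is_blocked(doc, block_list)]

documents = [
    "Elle lui a donné un baiser sur la joue.",  # "a kiss on the cheek"
    "Support services for survivors of rape are available at the clinic.",
    "The city council approved a new bicycle lane.",
]
kept = filter_corpus(documents)
# Only the bicycle-lane sentence survives; both innocuous texts are dropped.
```

Because the filter matches words rather than meanings, it cannot tell a kiss on the cheek from a vulgarity, or a support resource from abusive content.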

There is a long history of AI content moderation screening out LGBT and minority material, from social media platforms and dating apps that flag content as inappropriate to search engines that exclude certain content. YouTube has faced backlash for recommending anti-LGBT content via its algorithms; Mozilla’s crowdsourced study of the recommendation algorithm found that 70% of the “regret reports” (reports of videos that users wish they hadn’t seen) concerned not material that viewers had chosen themselves, but videos that YouTube’s algorithm had recommended (McCrosky & Geurkink, 2021).

Custom GPTs and open-source LLMs could potentially allow the public to take part in bias mitigation, without having to rely on private companies.