ChatGPT Replicates Gender Bias in Recommendation Letters

A new study has found that the use of AI tools such as ChatGPT in the workplace entrenches biased language based on gender

Generative artificial intelligence has been touted as a valuable tool in the workplace. Estimates suggest it could increase productivity growth by 1.5 percent in the coming decade and boost global gross domestic product by 7 percent during the same period. But a new study advises that it should only be used with careful scrutiny—because its output discriminates against women.

The researchers asked two large language model (LLM) chatbots—ChatGPT and Alpaca, a model developed by Stanford University—to produce recommendation letters for hypothetical employees. In a paper shared on the preprint server arXiv.org, the authors analyzed the letters and found that the LLMs used markedly different language to describe imaginary male and female workers.

“We observed significant gender biases in the recommendation letters,” says paper co-author Yixin Wan, a computer scientist at the University of California, Los Angeles. While ChatGPT deployed nouns such as “expert” and “integrity” for men, it was more likely to call women a “beauty” or “delight.” Alpaca had similar problems: men were “listeners” and “thinkers,” while women had “grace” and “beauty.” Adjectives proved similarly polarized. Men were “respectful,” “reputable” and “authentic,” according to ChatGPT, while women were “stunning,” “warm” and “emotional.” Neither OpenAI nor Stanford immediately responded to requests for comment from Scientific American.
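
To make the study's setup concrete, here is a minimal sketch, in Python, of the kind of word-frequency comparison the researchers describe. It is not the authors' code: the descriptor list is assembled from the words quoted above, and the sample letters are placeholders standing in for model output (in the study, the letters were generated by prompting ChatGPT and Alpaca).

```python
# Sketch of the lexical comparison described in the article: count how often a
# small set of descriptors appears in letters written for "male" vs. "female"
# subjects. The letters below are placeholders, not real model output.
import re
from collections import Counter

# Hypothetical descriptor lexicon drawn from the words quoted in the article.
DESCRIPTORS = {"expert", "integrity", "respectful", "reputable", "authentic",
               "beauty", "delight", "grace", "stunning", "warm", "emotional"}

def descriptor_counts(letters):
    """Tally how often each descriptor appears across a list of letter texts."""
    counts = Counter()
    for text in letters:
        for word in re.findall(r"[a-z]+", text.lower()):
            if word in DESCRIPTORS:
                counts[word] += 1
    return counts

# Placeholder letters standing in for LLM-generated text.
male_letters = ["He is an expert of great integrity, respectful and reputable."]
female_letters = ["She is a delight, warm and stunning, with real grace."]

male_counts = descriptor_counts(male_letters)
female_counts = descriptor_counts(female_letters)

for word in sorted(DESCRIPTORS):
    print(f"{word:<12} male: {male_counts[word]:>2}  female: {female_counts[word]:>2}")
```

Run as written, the script prints a side-by-side tally for the placeholder text; swapping in real model-generated letters would reproduce the flavor of the analysis, although the paper's actual methodology is more extensive.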


The issues encountered when artificial intelligence is used in a professional context echo similar situations with previous generations of AI. In 2018 Reuters reported that Amazon had disbanded a team that had worked since 2014 to develop an AI-powered résumé-review tool. The company scrapped the project after realizing that any mention of “women” in a document would cause the program to penalize that applicant. The discrimination arose because the system was trained on data from the company, which had historically employed mostly men.

The new study results are “not super surprising to me,” says Alex Hanna, director of research at the Distributed AI Research Institute, an independent research group analyzing the harms of AI. The training data used to develop LLMs are often biased because they’re based on humanity’s past written records—many of which have historically depicted men as active workers and women as passive objects. The situation is compounded by LLMs being trained on data from the Internet, where more men than women spend time: globally, 69 percent of men use the Internet, compared with 63 percent of women, according to the United Nations’ International Telecommunication Union.

Fixing the problem isn’t simple. “I don’t think it’s likely that you can really debias the data set,” Hanna says. “You need to acknowledge what these biases are and then have some kind of mechanism to capture that.” One option, Hanna suggests, is to train the model to de-emphasize biased outputs through an intervention called reinforcement learning. OpenAI has worked to rein in the biased tendencies of ChatGPT, Hanna says, but “one needs to know that these are going to be perennial problems.”
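
Hanna's point about needing “some kind of mechanism to capture” bias can be pictured with a deliberately simple example. The sketch below is a hypothetical penalty term of the sort a reinforcement-learning intervention might fold into its reward signal; the term list and function are assumptions for illustration only and do not represent how OpenAI or any other lab actually tunes its models.

```python
# Illustrative only: a toy penalty that an RL-style fine-tuning loop could fold
# into its reward signal to discourage stereotyped descriptors. The term list
# is a made-up example, not any vendor's actual mitigation.
GENDERED_STEREOTYPE_TERMS = {"beauty", "delight", "stunning", "warm", "emotional"}

def bias_penalty(generated_text: str) -> float:
    """Return a penalty proportional to how many stereotyped terms appear."""
    words = generated_text.lower().split()
    hits = sum(word.strip(".,!") in GENDERED_STEREOTYPE_TERMS for word in words)
    return -1.0 * hits  # more stereotyped terms means a lower reward

print(bias_penalty("She is a stunning, warm and emotional colleague."))  # -3.0
```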

This all matters because women have already long faced inherent biases in business and the workplace. For instance, women often have to tiptoe around workplace communication because their words are judged more harshly than those of their male colleagues, according to a 2022 study. And of course, women earn 83 cents for every dollar a man makes. Generative AI platforms are “propagating those biases,” Wan says. So as this technology becomes more ubiquitous throughout the working world, there’s a chance that the problem will become even more firmly entrenched.

“I welcome research like this that is exploring how these systems operate and their risks and fallacies,” says Gem Dale, a lecturer in human resources at Liverpool John Moores University in England. “It is through this understanding we will learn the issues and then can start to tackle them.”

Dale says anyone thinking of using generative AI chatbots in the workplace should be wary of such problems. “If people use these systems without rigor—as in letters of recommendation in this research—we are just sending the issue back out into the world and perpetuating it,” she says. “It is an issue I would like to see the tech firms address in the LLMs. Whether they will or not will be interesting to find out.”