Exclusion of marginalized groups during text filtering #4

Open
jenuk opened this issue Sep 21, 2022 · 0 comments
jenuk commented Sep 21, 2022

You write that you remove texts if they contain any of the "profane" words from the following lists:

  1. https://github.com/rominf/profanity-filter/blob/master/profanity_filter/data/en_profane_words.txt
  2. https://github.com/snguyenthanh/better_profanity/blob/master/better_profanity/profanity_wordlist.txt
  3. https://gist.github.com/ryanlewis/a37739d710ccdb4b406d

However, these lists contain words that are not actual profanity and should not cause texts to be discarded. The first list contains the words "lesbian", "gay", and "queer", and the second list even adds the plural forms of these words (the third seemed fine in that regard). These words are commonly used within the LGBTQ+ community for self-reference and are not slurs; excluding them removes these marginalized groups from your dataset and decreases its diversity.

The field of image synthesis, which depends on large-scale datasets such as this one, already struggles to portray a diverse society (see https://github.com/openai/dalle-2-preview/blob/main/system-card.md#bias-and-representation). Therefore, I believe this filtering step is a step in a very wrong direction.

That is, to me, the most concerning part of the profanity filtering, but I have not read the whole lists and might have overlooked other things. However, the second and third lists also include the word "god" as profane, which seems like an odd choice and might additionally exclude culture from polytheistic faiths such as Hinduism.

There are also words such as "menstruation", "menstruate", "pms", and "tampon" which do not seem related to profanity at all, as well as many neutral words describing body parts (though I'm not sure what SFW images captioned with these words would look like; perhaps scientific diagrams).

Further, some non-profane words are filtered to preempt misspellings of banned words, such as "pawn" for "porn" or "seaman"/"seamen" for "semen". This makes some sense in a filtered chat application (for which I expect these lists were made), but less sense when the poster is not aware of any upcoming filtering.

To address these concerns, I would suggest thoroughly scanning the lists of words you have banned (ideally checking in with experts or people directly affected by words such as "dyke"), creating an allow-list of non-profane words, and then re-adding previously discarded text-image pairs to the dataset if they are not removed by any of your other filtering steps. That could then be released as an updated version 1.1.
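To make the suggestion concrete, here is a minimal sketch of what such a re-filtering step could look like. The allow-list contents, file paths, and tokenization are all assumptions for illustration, not the dataset's actual pipeline:

```python
# Hypothetical sketch: subtract an allow-list from the banned-word lists
# before deciding which captions to discard. The ALLOW_LIST below is an
# illustrative placeholder, not a vetted list.

ALLOW_LIST = {"lesbian", "lesbians", "gay", "gays", "queer", "queers", "god"}

def load_profane_words(paths):
    """Union of the banned-word list files, minus the allow-list."""
    words = set()
    for path in paths:
        with open(path, encoding="utf-8") as f:
            words.update(line.strip().lower() for line in f if line.strip())
    return words - ALLOW_LIST

def is_clean(caption, profane_words):
    """True if no token of the caption is on the reduced banned list."""
    tokens = caption.lower().split()
    return not any(token.strip(".,!?\"'") in profane_words for token in tokens)
```

Previously discarded pairs whose captions pass `is_clean` against the reduced list (and survive the other filtering steps) could then be restored to the dataset.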
