Exclusion of marginalized groups during text filtering #4

Open
jenuk opened this issue Sep 21, 2022 · 0 comments
jenuk commented Sep 21, 2022

You write that you remove texts if they contain any of the "profane" words from the following lists:

  1. https://github.com/rominf/profanity-filter/blob/master/profanity_filter/data/en_profane_words.txt
  2. https://github.com/snguyenthanh/better_profanity/blob/master/better_profanity/profanity_wordlist.txt
  3. https://gist.github.com/ryanlewis/a37739d710ccdb4b406d

However, these lists contain words that are not actual profanity and should not cause texts to be discarded. The first list contains the words "lesbian", "gay", and "queer", and the second list even adds the plural forms of these words (the third seemed fine in that regard). These words are commonly used within the LGBTQ+ community for self-reference and are not slurs; excluding them removes these marginalized groups from your dataset and decreases its diversity.

The field of image synthesis, which depends on large-scale datasets such as this one, already struggles to portray a diverse society (see https://github.com/openai/dalle-2-preview/blob/main/system-card.md#bias-and-representation). Therefore, I believe this filtering step is a step in a very wrong direction.

That is, to me, the most concerning part of the profanity filtering, but I have not read the whole lists and might have overlooked other things. However, the second and third lists also include the word "god" as profane, which seems like an odd choice and might additionally exclude culture from polytheistic faiths such as Hinduism.

There are also words such as "menstruation", "menstruate", "pms", and "tampon" which do not seem related to profanity at all, as well as many neutral words describing body parts (though I'm not sure what SFW images captioned with these words would look like; perhaps scientific diagrams).

Further, some non-profane words are filtered to preempt misspellings of banned words, such as "pawn" for "porn" or "seaman"/"seamen" for "semen". This makes some sense in a filtered chat application (for which I expect these lists were made), but less sense when the poster is not aware of any upcoming filtering.

To address these concerns, I would suggest thoroughly scanning the lists of words you have banned (ideally checking in with experts or people directly affected by words such as "dyke"), creating an allow-list of non-profane words, and then re-adding previously discarded text-image pairs to the dataset if they are not removed by any of your other filtering steps. That could then be released as an updated version 1.1.
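To make the suggestion concrete, here is a minimal sketch of what such a re-filtering step could look like. The allow-list contents, file paths, and tokenization are all assumptions for illustration, not the dataset's actual pipeline:

```python
# Hypothetical sketch: subtract an allow-list from the banned-word lists
# before deciding which captions to discard. The ALLOW_LIST below is an
# illustrative placeholder, not a vetted list.

ALLOW_LIST = {"lesbian", "lesbians", "gay", "gays", "queer", "queers", "god"}

def load_profane_words(paths):
    """Union of the banned-word list files, minus the allow-list."""
    words = set()
    for path in paths:
        with open(path, encoding="utf-8") as f:
            words.update(line.strip().lower() for line in f if line.strip())
    return words - ALLOW_LIST

def is_clean(caption, profane_words):
    """True if no token of the caption is on the reduced banned list."""
    tokens = caption.lower().split()
    return not any(token.strip(".,!?\"'") in profane_words for token in tokens)
```

Previously discarded pairs whose captions pass `is_clean` against the reduced list (and survive the other filtering steps) could then be restored to the dataset.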
