You write that you remove texts if they contain any of the "profane" words from the following lists:
However, these lists contain words that are not actual profanity, and texts containing them should not be discarded. The first list contains the words "lesbian", "gay", and "queer", and the second list helpfully adds the plural forms of these words (the third one seemed fine in that regard). These words are commonly used within the LGBTQ+ community to refer to themselves and are not slurs; excluding them removes these marginalized groups from your dataset and decreases its diversity.
The field of image synthesis, which depends on large-scale datasets such as this one, already struggles to portray a diverse society (see https://github.com/openai/dalle-2-preview/blob/main/system-card.md#bias-and-representation). Therefore, I believe this filtering step is a step in a very wrong direction.
That is, to me, the most concerning part of the profanity filtering step, but I have not read the whole lists and might have overlooked other things. The second and third lists also include the word "god" as profane, which seems like an odd choice to me and might also exclude cultural content from polytheistic faiths such as Hinduism.
There are also words such as "menstruation", "menstruate", "pms", and "tampon" that don't seem related to profanity at all, as well as many neutral words describing body parts (though I'm not sure what SFW images captioned with these words would look like; perhaps scientific diagrams).
Further, some non-profane words are filtered to preempt misspellings of banned words, such as "pawn" for "porn" or "seaman"/"seamen" for "semen". This makes some sense in a filtered chat application (which I expect the lists were originally made for) and less sense when the person writing a caption is not aware of any upcoming filtering.
To address these concerns, I would suggest thoroughly reviewing the lists of words you have banned (perhaps consulting experts or people directly affected by words such as "dyke"), creating a whitelist of non-profane words, and then re-adding the previously discarded text-image pairs to the dataset if they are not removed by any of your other filtering steps, as in the sketch below. The result could then be released as an updated version 1.1.
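To make the proposal concrete, here is a rough Python sketch of what the whitelist-based re-filtering could look like. The word sets, captions, and `Pair` structure are illustrative placeholders, not the dataset's actual lists or pipeline:

```python
from collections import namedtuple

Pair = namedtuple("Pair", ["caption", "image_url"])

# Hypothetical stand-in for the union of the three published blocklists.
blocklist = {"porn", "pawn", "semen", "seaman", "seamen",
             "lesbian", "lesbians", "gay", "gays", "queer", "queers",
             "god", "menstruation", "menstruate", "pms", "tampon"}

# Reviewed whitelist of words that are not actually profane and
# should never have been banned.
whitelist = {"lesbian", "lesbians", "gay", "gays", "queer", "queers",
             "god", "menstruation", "menstruate", "pms", "tampon",
             "pawn", "seaman", "seamen"}

# Words that still lead to removal after the review.
effective_blocklist = blocklist - whitelist

def is_profane(caption: str) -> bool:
    """True if any remaining blocklisted word appears as a whole token."""
    tokens = (t.strip(".,!?\"'()") for t in caption.lower().split())
    return any(t in effective_blocklist for t in tokens)

# Previously discarded text-image pairs (hypothetical examples).
discarded = [
    Pair("two gay men holding hands at a pride parade", "https://example.com/1.jpg"),
    Pair("scientific diagram of the menstruation cycle", "https://example.com/2.jpg"),
]

# Re-admit pairs whose captions no longer match. These should still pass
# the dataset's other filtering steps (e.g. the NSFW image filter)
# before being added back for a version 1.1 release.
readmitted = [p for p in discarded if not is_profane(p.caption)]
print(readmitted)
```

The key point is that the whitelist is subtracted from the blocklist once, up front, so the re-filtering pass only removes pairs that match genuinely profane words.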