emojis attack, surrounding chars, and homophones chars attack #12

Open
wants to merge 4 commits into
base: master
22,764 changes: 22,764 additions & 0 deletions Adversary/assets/emoji.json

Large diffs are not rendered by default.

1 change: 1 addition & 0 deletions Adversary/assets/homophones.json

Large diffs are not rendered by default.
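The contents of homophones.json are not rendered on this page. Judging only from how homophones_chars in Adversary/attacks.py indexes it (json_data[token.text][0]), the file presumably maps each word to a list of homophones. A minimal sketch under that assumption follows. The sample entries and the function name are made up for illustration, and the spaCy part-of-speech filter used in the PR is left out:

```python
import re

# Hypothetical sample entries; the real homophones.json is not shown in this diff.
SAMPLE_HOMOPHONES = {
    "write": ["right", "rite"],
    "two": ["too", "to"],
    "mail": ["male"],
}


def replace_with_homophones(text, homophones):
    """Replace each whole word that has an entry with its first homophone."""
    def swap(match):
        word = match.group(0)
        # Fall back to the word itself when it has no entry.
        return homophones.get(word, [word])[0]

    # \w+ walks the text word by word, so each word is rewritten at most once.
    return re.sub(r"\w+", swap, text.lower())


print(replace_with_homophones("Write me two lines by mail", SAMPLE_HOMOPHONES))
```

Rewriting word by word in a single pass means a replacement is never rescanned, so a substituted homophone cannot itself be substituted again.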

1 change: 1 addition & 0 deletions Adversary/assets/spam_data.txt
@@ -0,0 +1 @@
100%,#1,$$$,100% free,100% Satisfied,4U,50% off,Accept credit cards,Acceptance,Access,Accordingly,Act Now,Action,Ad,Additional income,Addresses on CD,Affordable,All natural,All new,Amazed,Amazing,Amazing stuff,Apply now,Apply Online,As seen on,Auto email removal,Avoid,Avoid bankruptcy,Bargain,Be amazed,Be your own boss,Being a member,Beneficiary,Best price,Beverage,Big bucks,Bill 1618,Billing,Billing address,Billion,Billion dollars,Bonus,Boss,Brand new pager,Bulk email,Buy,Buy direct,Buying judgments,Cable converter,Call,Call free,Call now,Calling creditors,Can’t live without,Cancel,Cancel at any time,Cannot be combined with any other offer,Cards accepted,Cash,Cash bonus,Cashcashcash,Casino,Celebrity,Cell phone cancer scam,Cents on the dollar,Certified,Chance,Cheap,Check,Check or money order,Claims,Claims not to be selling anything,Claims to be in accordance with some spam law,Claims to be legal,Clearance,Click,Click below,Click here,Click to remove,Collect,Collect child support,Compare,Compare rates,Compete for your business,Confidentially on all orders,Congratulations,Consolidate debt and credit,Consolidate your debt,Copy accurately,Copy DVDs,Costs,Credit,Credit bureaus,Credit card offers,Cures,Cures baldness,Deal,Dear [email/friend/somebody],Debt,Diagnostics,Dig up dirt on friends,Direct email,Direct marketing,Discount,Do it today,Don’t delete,Don’t hesitate,Dormant,Double your,Double your cash,Double your income,Drastically reduced,Earn,Earn $,Earn extra cash,Earn per week,Easy terms,Eliminate bad credit,Eliminate debt,Email harvest,Email marketing,Exclusive deal,Expect to earn,Expire,Explode your business,Extra,Extra cash,Extra income,F r e e,Fantastic,Fantastic deal,Fast cash,Fast Viagra delivery,Financial freedom,Financially independent,For free,For instant access,For just $ (some amount),For just $xxx,For Only,For you,Form,Free,Free access,Free cell phone,Free consultation,Free DVD,Free gift,Free grant money,Free hosting,Free info,Free installation,Free 
Instant,Free investment,Free leads,Free membership,Free money,Free offer,Free preview,Free priority mail,Free quote,Free sample,Free trial,Free website,Freedom,Friend,Full refund,Get,Get it now,Get out of debt,Get paid,Get started now,Gift certificate,Give it away,Giving away,Great,Great offer,Guarantee,Guaranteed,Have you been turned down?,Hello,Here,Hidden,Hidden assets,Hidden charges,Home,Home based,Home employment,Home based business,Human growth hormone,If only it were that easy,Important information regarding,In accordance with laws,Income,Income from home,Increase sales,Increase traffic,Increase your sales,Incredible deal,Info you requested,Information you requested,Insurance,Internet market,Internet marketing,Investment,Investment decision,It’s effective,Join millions,Join millions of Americans,Junk,Laser printer,Leave,Legal,Life,Life Insurance,Lifetime,Limited,limited time,Limited time offer,Limited time only,Loan,Long distance phone offer,Lose,Lose weight,Lose weight spam,Lower interest rates,Lower monthly payment,Lower your mortgage rate,Lowest insurance rates,Lowest Price,Luxury,Luxury car,Mail in order form,Maintained,Make $,Make money,Marketing,Marketing solutions,Mass email,Medicine,Medium,Meet singles,Member,Member stuff,Message contains,Message contains disclaimer,Million,Million dollars,Miracle,MLM,Money,Money back,Money making,Month trial offer,More Internet Traffic,Mortgage,Mortgage rates,Multi-level marketing,Name brand,Never,New customers only,New domain extensions,Nigerian,No age restrictions,No catch,No claim forms,No cost,No credit check,No disappointment,No experience,No fees,No gimmick,No hidden,No hidden Costs,No interests,No inventory,No investment,No medical exams,No middleman,No obligation,No purchase necessary,No questions asked,No selling,No strings attached,No-obligation,Not intended,Not junk,Not spam,Now only,Obligation,Offshore,Offer,Offer expires,Once in lifetime,One hundred percent free,One hundred percent guaranteed,One 
time,One time mailing,Online biz opportunity,Online degree,Online marketing,Online pharmacy,Only $,Open,Opportunity,Opt in,Order,Order now,Order shipped by,Order status,Order today,Outstanding values,Passwords,Pennies a day,Per day,Per week,Performance,Phone,Please read,Potential earnings,Pre-approved,Presently,Print form signature,Print out and fax,Priority mail,Prize,Problem,Produced and sent out,Profits,Promise,Promise you,Purchase,Pure Profits,Quote,Rates,Real thing,Refinance,Refinance home,Refund,Removal,Removal instructions,Remove,Removes wrinkles,Request,Requires initial investment,Reserves the right,Reverses,Reverses aging,Risk free,Rolex,Round the world,S 1618,Safeguard notice,Sale,Sample,Satisfaction,Satisfaction guaranteed,Save $,Save big money,Save up to,Score,Score with babes,Search engine listings,Search engines,Section 301,See for yourself,Sent in compliance,Serious,Serious cash,Serious only,Shopper,Shopping spree,Sign up free today,Social security number,Solution,Spam,Special promotion,Stainless steel,Stock alert,Stock disclaimer statement,Stock pick,Stop,Stop snoring,Strong buy,Stuff on sale,Subject to cash,Subject to credit,Subscribe,Success,Supplies,Supplies are limited,Take action,Take action now,Talks about hidden charges,Talks about prizes,Teen,Tells you it’s an ad,Terms,Terms and conditions,The best rates,The following form,They keep your money — no refund!,They’re just giving it away,This isn’t a scam,This isn’t junk,This isn’t spam,This won’t last,Thousands,Time limited,Trial,Undisclosed recipient,University diplomas,Unlimited,Unsecured credit,Unsecured debt,Unsolicited,Unsubscribe,Urgent,US dollars,Vacation,Vacation offers,Valium,Vicodin,Visit our website,Wants credit card,Warranty,We hate spam,We honor all,Web traffic,Weekend getaway,Weight,Weight loss,What are you waiting for?,What’s keeping you?,While supplies last,While you sleep,Who really wins?,Why pay more?,Wife,Will not believe your eyes,Win,Winner,Winning,Won,Work from 
home,Xanax,You are a winner!,You have been selected,Your income,Orders shipped by,shopper,Additional Income,Homebased business,Work at home,For just $XXX,Loans,Lowest price,Pure profit,They keep your money -- no refund!,Accept Credit Cards,Multi level marketing,Notspam,Sales,This isn't junk,This isn't spam,Off shore,Prizes,unlimited,You’re a Winner!,Act Now!,Can't live without,Don't delete,Don't hesitate
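The phrase list above contains regex metacharacters ("$$$", "Have you been turned down?"), case variants of the same phrase, and possibly a trailing newline. The sketch below shows one way such a list could be loaded and matched literally before the characters are surrounded. It is an illustration, not the PR's implementation: load_phrases, surround, and surround_phrases are hypothetical names, RAW_PHRASES is a short excerpt, and lookarounds are used in place of the \b boundaries that attacks.py relies on.

```python
import re

# A few phrases copied from the list above; the full file is comma-separated.
RAW_PHRASES = "100% free,$$$,Free,Have you been turned down?,Act Now\n"


def load_phrases(raw):
    """Split on commas, trim whitespace, lowercase, and drop duplicates."""
    phrases = {part.strip().lower() for part in raw.split(",") if part.strip()}
    # Longest first, so "100% free" is tried before the shorter "free".
    return sorted(phrases, key=len, reverse=True)


def surround(phrase):
    """Wrap every non-space character, e.g. free -> ("f") ("r") ("e") ("e")."""
    return " ".join('("%s")' % ch for ch in phrase if ch != " ")


def surround_phrases(text, phrases):
    # One alternation of escaped phrases: "$" and "?" are matched literally.
    # Lookarounds stand in for \b, which fails next to symbols such as "$".
    pattern = re.compile(
        r"(?<!\w)(?:%s)(?!\w)" % "|".join(re.escape(p) for p in phrases)
    )
    # A function replacement avoids backslash processing of the replacement.
    return pattern.sub(lambda m: surround(m.group(0)), text.lower())


print(surround_phrases("Act now for free", load_phrases(RAW_PHRASES)))
```

Because every phrase is handled in one pass, text that has already been surrounded is not scanned again by a later phrase.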
107 changes: 105 additions & 2 deletions Adversary/attacks.py
@@ -1,11 +1,114 @@
+import json
+import os
 from random import choice, randint, sample, randrange
 from string import punctuation

-from Adversary.constants import *
+import spacy
+from regex import search
+from spacy import displacy
+
+from constants import *  # star import; the `re` used below arrives through it (constants.py imports re)

 '''These act on a single text'''


+def emojis_attack(text):
+    text = text.lower()
+    words = text.split()
+
+    # resolve the path to the bundled emoji data
+    script_dir = os.path.dirname(__file__)
+    rel_path = "assets/emoji.json"
+    abs_file_path = os.path.join(script_dir, rel_path)
+    with open(abs_file_path, 'r', encoding='utf-8') as emojis:
+        # load the emoji entries (a list of dicts); the file is read as UTF-8
+        json_data = json.load(emojis)
+        for emoji in json_data:
+            # collect every keyword associated with this emoji
+            tags_list = emoji['tags'] + emoji['aliases'] + [emoji['description']]
+            # keep only the keywords that occur as words in the text
+            emojis_in_text = set(words) & set(tags_list)
+            if emojis_in_text:
+                # source https://stackoverflow.com/questions/15658187/replace-all-words-from-word-list-with-another-string-in-python
+                big_regex = re.compile(r'\b%s\b' % r'\b|\b'.join(map(re.escape, emojis_in_text)))
+                text = big_regex.sub(emoji['emoji'], text)
+
+    return text
+
+
+def advanced_emojis_attack(text):
+    nlp = spacy.load("en_core_web_sm")
+    doc = nlp(text.lower())
+
+    # resolve the path to the bundled emoji data
+    script_dir = os.path.dirname(__file__)
+    rel_path = "assets/emoji.json"
+    abs_file_path = os.path.join(script_dir, rel_path)
+    with open(abs_file_path, 'r', encoding='utf-8') as emojis:
+        # load the emoji entries (a list of dicts); the file is read as UTF-8
+        json_data = json.load(emojis)
+        for emoji in json_data:
+            # collect every keyword associated with this emoji
+            tags_list = emoji['tags'] + emoji['aliases'] + [emoji['description']]
+            for token in doc:
+                if (token.pos_ == "PROPN" or token.pos_ == "VERB"
+                    or token.pos_ == "SYM" or token.pos_ == "NOUN") \
+                        and token.text in tags_list:
+                    # source https://stackoverflow.com/questions/15658187/replace-all-words-from-word-list-with-another-string-in-python
+                    big_regex = re.compile(r'\b%s\b' % re.escape(token.text))
+                    text = big_regex.sub(emoji['emoji'], text.lower())
+
+    # uncomment to launch analysis in browser
+    # displacy.serve(doc, style="dep")
+
+    return text
+
+
+def surrounding_chars(text):
+    text = text.lower()
+
+    # get the spam texts file path
+    script_dir = os.path.dirname(__file__)
+    rel_path = "assets/spam_data.txt"
+    abs_file_path = os.path.join(script_dir, rel_path)
+    with open(abs_file_path, 'r', encoding='utf-8') as spam:
+        spam_list = spam.read().split(",")
+        spam_list_lowered = [word.strip().lower() for word in spam_list]  # strip() drops a trailing newline
+
+    for spam_word in spam_list_lowered:
+        # escape the phrase so "$", "?" and "[" match literally, then surround it if present
+        if search(re.escape(spam_word), text):
+            # source https://stackoverflow.com/questions/15658187/replace-all-words-from-word-list-with-another-string-in-python
+            big_regex = re.compile(r'\b%s\b' % re.escape(spam_word))
+            surround_char = ' '.join('("' + item + '")' for item in spam_word if item != " ")
+            text = big_regex.sub(surround_char, text)
+
+    return text
+
+
+def homophones_chars(text):
+    nlp = spacy.load("en_core_web_sm")
+    doc = nlp(text.lower())
+
+    # get the homophones file path
+    script_dir = os.path.dirname(__file__)
+    rel_path = "assets/homophones.json"
+    abs_file_path = os.path.join(script_dir, rel_path)
+    with open(abs_file_path, 'r', encoding='utf-8') as homophones:
+        # load the homophones mapping (word -> homophones)
+        json_data = json.load(homophones)
+
+    for token in doc:
+        if (token.pos_ == "PROPN" or token.pos_ == "VERB"
+            or token.pos_ == "SYM" or token.pos_ == "NOUN") \
+                and token.text in json_data:
+            # source https://stackoverflow.com/questions/15658187/replace-all-words-from-word-list-with-another-string-in-python
+            big_regex = re.compile(r'\b%s\b' % re.escape(token.text))
+            text = big_regex.sub(json_data[token.text][0], text.lower())
+
+    return text
+
+
 def good_word_attack(text):
     if randint(1, 2) == 1:
         return text + ' ' + ' '.join(sample(NEUTRAL_WORDS, randint(5, 15)))
@@ -93,10 +196,10 @@ def num_to_word(word):

 '''Keeps track of all attacks and their types'''

-
 ATTACK_MAP = {
     'text': {
         'good_word_attack': good_word_attack,
+        'emoji_attack': emojis_attack,
         'swap_words': swap_words,
         'remove_spacing': remove_spacing,
     },
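emojis_attack walks the whole emoji list and compiles a fresh regex for every entry that shares a keyword with the text. One possible alternative, sketched below, is to build a keyword-to-emoji index once and replace in a single pass. This is an illustration under stated assumptions, not the PR's code: build_keyword_index and replace_keywords are hypothetical names, and SAMPLE_EMOJIS only mimics the keys that attacks.py reads ('emoji', 'tags', 'aliases', 'description').

```python
import re

# Hypothetical entries in the shape that emojis_attack reads from emoji.json.
SAMPLE_EMOJIS = [
    {"emoji": "🐶", "description": "dog face", "aliases": ["dog"], "tags": ["pet"]},
    {"emoji": "🔥", "description": "fire", "aliases": ["fire"], "tags": ["burn"]},
]


def build_keyword_index(entries):
    """Map every tag, alias, and description to its emoji (first entry wins)."""
    index = {}
    for entry in entries:
        for key in entry["tags"] + entry["aliases"] + [entry["description"]]:
            index.setdefault(key.lower(), entry["emoji"])
    return index


def replace_keywords(text, index):
    """Swap whole-word keywords for emojis in a single regex pass."""
    if not index:
        return text.lower()
    # Longest keys first so multi-word descriptions beat their substrings.
    keys = sorted(index, key=len, reverse=True)
    pattern = re.compile(r"\b(?:%s)\b" % "|".join(map(re.escape, keys)))
    return pattern.sub(lambda m: index[m.group(0)], text.lower())


index = build_keyword_index(SAMPLE_EMOJIS)
print(replace_keywords("My dog loves the fire", index))
```

The index can be built once at import time, so repeated calls do not re-read the 22,764-line emoji.json.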
45 changes: 45 additions & 0 deletions Adversary/constants.py
@@ -1,3 +1,5 @@
import re

NUM_TO_WORD = {
'1': 'one',
'2': 'two',
@@ -9,6 +11,49 @@
'8': 'eight',
'9': 'nine'
}
# Arabic has diacritics (combining marks) that can be nearly invisible when mixed into English text
arabic_diacritics = r"""
ّ
َ
ً
ُ
ٌ
ِ ٍ ْ"""

# Cyrillic-to-Latin transliteration table (keys are Cyrillic, values are Latin); invert it to turn English text into non-Latin chars
cyrillic_translit = {'\u0410': 'A', '\u0430': 'a',
'\u0411': 'B', '\u0431': 'b',
'\u0412': 'V', '\u0432': 'v',
'\u0413': 'G', '\u0433': 'g',
'\u0414': 'D', '\u0434': 'd',
'\u0415': 'E', '\u0435': 'e',
'\u0416': 'Zh', '\u0436': 'zh',
'\u0417': 'Z', '\u0437': 'z',
'\u0418': 'I', '\u0438': 'i',
'\u0419': 'I', '\u0439': 'i',
'\u041a': 'K', '\u043a': 'k',
'\u041b': 'L', '\u043b': 'l',
'\u041c': 'M', '\u043c': 'm',
'\u041d': 'N', '\u043d': 'n',
'\u041e': 'O', '\u043e': 'o',
'\u041f': 'P', '\u043f': 'p',
'\u0420': 'R', '\u0440': 'r',
'\u0421': 'S', '\u0441': 's',
'\u0422': 'T', '\u0442': 't',
'\u0423': 'U', '\u0443': 'u',
'\u0424': 'F', '\u0444': 'f',
'\u0425': 'Kh', '\u0445': 'kh',
'\u0426': 'Ts', '\u0446': 'ts',
'\u0427': 'Ch', '\u0447': 'ch',
'\u0428': 'Sh', '\u0448': 'sh',
'\u0429': 'Shch', '\u0449': 'shch',
'\u042a': '"', '\u044a': '"',
'\u042b': 'Y', '\u044b': 'y',
'\u042c': "'", '\u044c': "'",
'\u042d': 'E', '\u044d': 'e',
'\u042e': 'Iu', '\u044e': 'iu',
'\u042f': 'Ia',
'\u044f': 'ia'}

# Word keys from https://www.ef.edu/english-resources/english-vocabulary/top-3000-words/
# Synonyms generated from https://github.com/explosion/spaCy/issues/276
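As written, cyrillic_translit maps Cyrillic letters to their Latin transliterations, so replacing Latin letters in English text needs the inverse mapping. The sketch below shows one way to derive it, keeping only one-letter transliterations. The helper names and the four-entry excerpt are hypothetical, and the result is a transliteration, which does not always look like the original Latin letter.

```python
# A short excerpt of the table above, for illustration only.
CYRILLIC_TRANSLIT_EXCERPT = {
    "\u0430": "a",
    "\u0435": "e",
    "\u043e": "o",
    "\u0436": "zh",
}


def invert_single_letters(table):
    """Build a Latin -> Cyrillic map from one-letter transliterations only."""
    inverse = {}
    for cyrillic, latin in table.items():
        if len(latin) == 1 and latin.isalpha():
            # Several Cyrillic letters can share a Latin letter; keep the first.
            inverse.setdefault(latin, cyrillic)
    return inverse


def to_cyrillic(text, inverse):
    """Replace every Latin letter that has a Cyrillic counterpart."""
    return text.translate(str.maketrans(inverse))


inverse = invert_single_letters(CYRILLIC_TRANSLIT_EXCERPT)
print(to_cyrillic("a note", inverse))
```

Multi-letter values such as 'zh' or 'shch' are skipped because str.maketrans only accepts single-character keys.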
19 changes: 19 additions & 0 deletions requirements.txt
@@ -0,0 +1,19 @@
Adversary~=1.1.1
setuptools~=57.0.0
pip~=21.1.2
wheel~=0.36.2
nltk~=3.6.3
numpy~=1.21.2
click~=8.0.1
tqdm~=4.62.3
requests~=2.26.0
regex~=2021.9.30
pyparsing~=2.4.7
pytz~=2021.3
joblib~=1.0.1
pandas~=1.3.3
MarkupSafe~=2.0.1
python-dateutil~=2.8.2
six~=1.16.0
textblob~=0.15.3
spacy~=3.1.3
19 changes: 12 additions & 7 deletions tests/test_adversary.py
@@ -1,32 +1,37 @@
-from Adversary.adversary import Adversary
+from adversary import Adversary


 def test_generate_single_iter():
     m = Adversary(verbose=True)
     og_texts = [u'happy happy happy happy dog dog dog dog dog',
-            u'okay okay yeah here', 'tell me awful things']
+                u'okay okay yeah here', 'tell me awful things']
     g = m.generate(og_texts)
-    assert(len(g) == 3)
+    assert (len(g) == 3)


 def test_generate_many_iter():
     m = Adversary(verbose=True)
     og_texts = [u'happy happy happy happy dog dog dog dog dog',
-            u'okay okay yeah here', 'tell me awful things']
+                u'okay okay yeah here', 'tell me awful things']
     g = m.generate(og_texts, text_sample_rate=5)
-    assert(len(g) == 15)
+    assert (len(g) == 15)


+def test_large():
+    m = Adversary(verbose=True)
+    og_texts = ['tell me awful things'] * 1000
+    g = m.generate(og_texts, text_sample_rate=5)
+    assert (len(g) == 5000)


 def test_attack():
     m = Adversary(verbose=True)
     og_texts = [u'happy happy happy happy dog dog dog dog dog',
-            u'okay okay yeah here', 'tell me awful things']
+                u'okay okay yeah here', 'tell me awful things']
     g = m.generate(og_texts)
     df_s, df_m = m.attack(og_texts, g, lambda x: 1 if x in og_texts else 0)
-    assert(df_s is not None and df_m is not None)
+    assert (df_s is not None and df_m is not None)


 def test_attack_large():
     m = Adversary(verbose=True)
9 changes: 6 additions & 3 deletions tests/test_attacks.py
@@ -1,5 +1,8 @@
-from Adversary.attacks import *
+from attacks import *


 def test_num_to_word():
-    assert(num_to_word('1') == 'one')
-    assert(num_to_word('dog') == 'dog')
+    assert (num_to_word('1') == 'one')
+    assert (num_to_word('dog') == 'dog')