Features from best SemEval-2013 participant, NRC-Canada #45

Open
4 of 11 tasks
bwbaugh opened this issue Apr 18, 2013 · 0 comments

Comments

@bwbaugh
Owner

bwbaugh commented Apr 18, 2013

See: NRC-Canada: Building the State-of-the-Art in Sentiment Analysis of Tweets

  • all-caps: the number of words with all characters in upper case;
  • clusters: presence/absence of tokens from each of the 1000 clusters (provided by Carnegie Mellon University's Twitter NLP tool);
  • elongated words: the number of words with one character repeated more than 2 times, e.g. 'soooo';
  • emoticons:
    • presence/absence of positive and negative emoticons at any position in the tweet;
    • whether the last token is a positive or negative emoticon;
  • hashtags: the number of hashtags;
  • negation: the number of negated contexts. A negated context also affects the ngram and lexicon features: each word in a negated context, and the polarity associated with it, becomes negated (e.g., 'not perfect' becomes 'not perfect_NEG', 'POLARITY_positive' becomes 'POLARITY_positive_NEG');
  • POS: the number of occurrences for each part-of-speech tag;
  • punctuation:
    • the number of contiguous sequences of exclamation marks, question marks, and both exclamation and question marks;
    • whether the last token contains an exclamation or question mark;
  • sentiment lexicons: automatically created lexicons (NRC Hashtag Sentiment Lexicon, Sentiment140 Lexicon), manually created sentiment lexicons (NRC Emotion Lexicon, MPQA, Bing Liu Lexicon). For each lexicon and each polarity we calculated:
    • total count of tokens in the tweet with score greater than 0;
    • the sum of the scores for all tokens in the tweet;
    • the maximal score;
    • the non-zero score of the last token in the tweet;
      The lexicon features were created for all tokens in the tweet, for each part-of-speech tag, for hashtags, and for all-caps tokens.
  • word ngrams;
  • character ngrams.
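
As a starting point for the simpler items above (all-caps, elongated words, hashtags, punctuation, negation), here is a minimal Python sketch. It is not the NRC-Canada implementation. The regex tokenizer, the negation cue list, the rule that a negated context ends at the next punctuation token, and counting a single '!' or '?' as a sequence are all assumptions made for illustration. The names `surface_features` and `mark_negation` are hypothetical.

```python
import re

# Assumed toy tokenizer: hashtags, words (with an optional apostrophe part),
# runs of '!'/'?', and any other single punctuation character.
TOKEN_RE = re.compile(r"#\w+|\w+(?:'\w+)?|[!?]+|[^\w\s]")
# A letter repeated more than 2 times in a row, e.g. 'soooo'.
ELONGATED_RE = re.compile(r"([a-z])\1{2,}", re.IGNORECASE)
PUNCT_RUN_RE = re.compile(r"[!?]+")
# Assumption: a negated context ends at the next punctuation token.
CLAUSE_END_RE = re.compile(r"[.,:;!?]+")
# Assumption: a small, hypothetical list of negation cues.
NEGATION_WORDS = {"not", "no", "never", "cannot"}


def is_negation(token):
    lowered = token.lower()
    return lowered in NEGATION_WORDS or lowered.endswith("n't")


def mark_negation(tokens):
    """Append _NEG to tokens inside a negated context; count the contexts."""
    marked, contexts, in_context = [], 0, False
    for tok in tokens:
        if CLAUSE_END_RE.fullmatch(tok):
            in_context = False          # punctuation closes the context
            marked.append(tok)
        elif is_negation(tok):
            if not in_context:
                contexts += 1           # a new negated context starts here
            in_context = True
            marked.append(tok)
        elif in_context:
            marked.append(tok + "_NEG")
        else:
            marked.append(tok)
    return marked, contexts


def surface_features(text):
    """Count-based features for one tweet, under the assumptions above."""
    tokens = TOKEN_RE.findall(text)
    runs = [t for t in tokens if PUNCT_RUN_RE.fullmatch(t)]
    marked, contexts = mark_negation(tokens)
    return {
        "all_caps": sum(t.isalpha() and t.isupper() for t in tokens),
        "elongated": sum(bool(ELONGATED_RE.search(t)) for t in tokens),
        "hashtags": sum(t.startswith("#") for t in tokens),
        # Runs of length 1 are counted too (an assumption).
        "exclamation_runs": sum(set(r) == {"!"} for r in runs),
        "question_runs": sum(set(r) == {"?"} for r in runs),
        "mixed_runs": sum(set(r) == {"!", "?"} for r in runs),
        "last_has_punct": bool(tokens)
        and ("!" in tokens[-1] or "?" in tokens[-1]),
        "negated_contexts": contexts,
        "tokens": marked,
    }


features = surface_features("This is SOOOO not good!!! #fail ?!")
```

For this input, `features["tokens"]` contains `good_NEG` right after `not`, matching the 'not perfect_NEG' convention quoted above. The cluster, POS, emoticon, and lexicon features are left out because they need external resources (the CMU Twitter NLP tool and the listed lexicons).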