This project demonstrates the basics of text processing by building text classifiers and learning to represent words effectively. It includes techniques such as rule-based classification, Bag-of-Words (BoW), and Word2Vec embeddings.
The dataset consists of sentences labeled with one of two sentiments: positive or negative. Examples include:
- Positive: "I really like your new haircut!"
- Negative: "Your new haircut is awful!"
The dataset is divided into three parts:
- Training set: `train_data.csv` (provided)
- Validation set: `val_data.csv` (provided)
- Test set: `test_data.csv` (not provided; reserved for blind evaluation)
To download the dataset, execute the following commands:
```bash
# Training data
wget -O train_data.csv "https://docs.google.com/spreadsheets/d/176-KrOP8nhLpoW91UnrOY9oq_-I0XYNKS1zmqIErFsA/gviz/tq?tqx=out:csv&sheet=train_data.csv"

# Validation data
wget -O val_data.csv "https://docs.google.com/spreadsheets/d/1YxjoAbatow3F5lbPEODToa8-YWvJoTY0aABS9zaXk-c/gviz/tq?tqx=out:csv&sheet=val_data.csv"

# Test data
wget -O test_data.csv "https://docs.google.com/spreadsheets/d/1YxjoAbatow3F5lbPEODToa8-YWvJoTY0aABS9zaXk-c/gviz/tq?tqx=out:csv&sheet=test_data.csv"
```
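Once downloaded, the CSVs can be loaded with pandas. A minimal sketch; the column names are assumptions, so inspect the actual headers first:

```python
import pandas as pd

# Load the provided splits. Column names are not documented here --
# check the real headers with train_df.columns before relying on them.
train_df = pd.read_csv("train_data.csv")
val_df = pd.read_csv("val_data.csv")

print(train_df.shape, val_df.shape)
print(train_df.head())
```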
Ensure the following Python libraries are installed:
- General: `numpy`, `pandas`, `re` (part of the standard library, no install needed)
- Machine learning: `scikit-learn` (imported as `sklearn`)
- Visualization: `matplotlib`, `seaborn`
- Word2Vec and NLP: `gensim`, `torch`, `torchtext`
- Utilities: `wget`, `tqdm`
Install all required libraries via:

```bash
pip install numpy pandas matplotlib seaborn scikit-learn gensim torch torchtext tqdm wget
```

Note: you will have to resolve any dependency conflicts yourself (in particular, `torchtext` requires a matching `torch` version) - all the best!
- Feature Extraction: Handcrafted rules identify keywords indicative of positive or negative sentiment.
- Prediction: A linear scoring model over the extracted keyword features predicts the sentiment (see the sketch after this list).
- Evaluation: The rule-based model reaches roughly 60% accuracy, illustrating both the value and the limits of simple keyword matching.
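A minimal sketch of such a rule-based scorer. The keyword lists and the tie-breaking rule here are hypothetical placeholders, not the project's actual rules:

```python
import re

# Hypothetical keyword lists -- the actual handcrafted rules may differ.
POSITIVE_WORDS = {"like", "love", "great", "nice", "awesome"}
NEGATIVE_WORDS = {"awful", "hate", "bad", "terrible", "worst"}

def predict_sentiment(sentence: str) -> str:
    """Linear score: +1 per positive keyword, -1 per negative keyword."""
    tokens = re.findall(r"[a-z']+", sentence.lower())
    score = (sum(t in POSITIVE_WORDS for t in tokens)
             - sum(t in NEGATIVE_WORDS for t in tokens))
    return "positive" if score >= 0 else "negative"

print(predict_sentiment("I really like your new haircut!"))  # positive
print(predict_sentiment("Your new haircut is awful!"))       # negative
```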
- Vectorization: Each text is converted into a numerical vector of word frequencies (Bag-of-Words).
- Learning Weights: Logistic regression learns one weight per feature for classification.
- Evaluation: Automatically learned weights improve accuracy over the manually assigned weights of the rule-based model (see the sketch after this list).
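A minimal end-to-end sketch with scikit-learn, assuming the CSV columns are named `sentence` and `label` (hypothetical names; adjust to the real headers):

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

train_df = pd.read_csv("train_data.csv")
val_df = pd.read_csv("val_data.csv")

# Bag-of-Words: one feature per vocabulary word, values are counts.
vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(train_df["sentence"])
X_val = vectorizer.transform(val_df["sentence"])

# Logistic regression learns a weight for each word automatically.
clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, train_df["label"])

print("val accuracy:", accuracy_score(val_df["label"], clf.predict(X_val)))
```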
- Word2Vec Training: A skip-gram model is implemented from scratch in PyTorch to learn word embeddings (see the sketch after this list).
- Visualization: PCA reduces the embeddings to 2D to visualize semantic similarity.
- Analogy Tests: Embedding quality is evaluated on analogy questions such as "man:king :: woman:?".
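A minimal skip-gram sketch in PyTorch, using a full-softmax objective on a toy corpus; the project's actual implementation, window size, and hyperparameters may differ:

```python
import torch
import torch.nn as nn

# Toy corpus and vocabulary -- placeholders for the real training data.
corpus = "i really like your new haircut your new haircut is awful".split()
vocab = sorted(set(corpus))
word2idx = {w: i for i, w in enumerate(vocab)}

# (center, context) index pairs within a window of 2.
pairs = [(word2idx[corpus[i]], word2idx[corpus[j]])
         for i in range(len(corpus))
         for j in range(max(0, i - 2), min(len(corpus), i + 3)) if j != i]

class SkipGram(nn.Module):
    def __init__(self, vocab_size: int, dim: int = 16):
        super().__init__()
        self.in_embed = nn.Embedding(vocab_size, dim)  # center-word vectors
        self.out = nn.Linear(dim, vocab_size)          # scores over context words

    def forward(self, center):
        return self.out(self.in_embed(center))         # logits for each vocab word

model = SkipGram(len(vocab))
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)
loss_fn = nn.CrossEntropyLoss()

centers = torch.tensor([c for c, _ in pairs])
contexts = torch.tensor([c for _, c in pairs])
for epoch in range(100):
    optimizer.zero_grad()
    loss = loss_fn(model(centers), contexts)
    loss.backward()
    optimizer.step()

embeddings = model.in_embed.weight.detach()  # one vector per word
```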
- Classification Accuracy: The fraction of test sentences whose sentiment is predicted correctly.
- Word Similarity: Semantic proximity between words, measured by the cosine similarity of their embeddings.
- Analogy Precision: Performance on analogy tasks, counting a hit if the correct answer appears among the top 5 predictions (see the sketch after this list).
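A sketch of the similarity and analogy metrics, assuming `embeddings` and `word2idx` come from the skip-gram example above:

```python
import torch
import torch.nn.functional as F

def cosine_sim(a: str, b: str) -> float:
    """Cosine similarity between the embeddings of two words."""
    return F.cosine_similarity(embeddings[word2idx[a]],
                               embeddings[word2idx[b]], dim=0).item()

def analogy_top5(a: str, b: str, c: str) -> list[str]:
    """Solve a:b :: c:? by vector arithmetic; return the 5 nearest words."""
    query = embeddings[word2idx[b]] - embeddings[word2idx[a]] + embeddings[word2idx[c]]
    sims = F.cosine_similarity(query.unsqueeze(0), embeddings, dim=1)
    idx2word = {i: w for w, i in word2idx.items()}
    # Oversample the top hits, then drop the three query words themselves.
    top = sims.topk(min(8, sims.numel())).indices.tolist()
    top = [i for i in top if idx2word[i] not in {a, b, c}][:5]
    return [idx2word[i] for i in top]
```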