This project demonstrates the basics of text processing by building text classifiers and learning to represent words effectively. It includes techniques such as rule-based classification, Bag-of-Words (BoW), and Word2Vec embeddings.
The dataset consists of sentences labeled with one of two sentiments: positive or negative. Examples include:
- Positive: "I really like your new haircut!"
- Negative: "Your new haircut is awful!"
The dataset is divided into three parts:
- Training set: `train_data.csv` (provided)
- Validation set: `val_data.csv` (provided)
- Test set: `test_data.csv` (not provided; reserved for blind evaluation)
To download the dataset, execute the following commands:
```bash
# Training data
wget -O train_data.csv "https://docs.google.com/spreadsheets/d/176-KrOP8nhLpoW91UnrOY9oq_-I0XYNKS1zmqIErFsA/gviz/tq?tqx=out:csv&sheet=train_data.csv"

# Validation data
wget -O val_data.csv "https://docs.google.com/spreadsheets/d/1YxjoAbatow3F5lbPEODToa8-YWvJoTY0aABS9zaXk-c/gviz/tq?tqx=out:csv&sheet=val_data.csv"

# Test data
wget -O test_data.csv "https://docs.google.com/spreadsheets/d/1YxjoAbatow3F5lbPEODToa8-YWvJoTY0aABS9zaXk-c/gviz/tq?tqx=out:csv&sheet=test_data.csv"
```
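Once downloaded, the CSVs can be loaded with pandas. A minimal sketch; the column names are assumptions, so inspect the actual headers first:

```python
import pandas as pd

# Load the provided splits. Column names are not documented here --
# check the real headers with train_df.columns before relying on them.
train_df = pd.read_csv("train_data.csv")
val_df = pd.read_csv("val_data.csv")

print(train_df.shape, val_df.shape)
print(train_df.head())
```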
Ensure the following Python libraries are installed:
- General: `numpy`, `pandas`, `re` (part of the standard library, no install needed)
- Machine learning: `scikit-learn` (imported as `sklearn`)
- Visualization: `matplotlib`, `seaborn`
- Word2Vec and NLP: `gensim`, `torch`, `torchtext`
- Utilities: `wget`, `tqdm`
Install all required libraries via:

```bash
pip install numpy pandas matplotlib seaborn scikit-learn gensim torch torchtext tqdm wget
```

Note: you will have to resolve any dependency conflicts yourself (in particular, `torchtext` requires a matching `torch` version) - all the best!
- Feature Extraction: Handcrafted rules identify keywords indicative of positive or negative sentiment.
- Prediction: A linear scoring model over the extracted keyword features predicts the sentiment (see the sketch after this list).
- Evaluation: The rule-based model reaches roughly 60% accuracy, illustrating both the value and the limits of simple keyword matching.
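A minimal sketch of such a rule-based scorer. The keyword lists and the tie-breaking rule here are hypothetical placeholders, not the project's actual rules:

```python
import re

# Hypothetical keyword lists -- the actual handcrafted rules may differ.
POSITIVE_WORDS = {"like", "love", "great", "nice", "awesome"}
NEGATIVE_WORDS = {"awful", "hate", "bad", "terrible", "worst"}

def predict_sentiment(sentence: str) -> str:
    """Linear score: +1 per positive keyword, -1 per negative keyword."""
    tokens = re.findall(r"[a-z']+", sentence.lower())
    score = (sum(t in POSITIVE_WORDS for t in tokens)
             - sum(t in NEGATIVE_WORDS for t in tokens))
    return "positive" if score >= 0 else "negative"

print(predict_sentiment("I really like your new haircut!"))  # positive
print(predict_sentiment("Your new haircut is awful!"))       # negative
```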
- Vectorization: Each text is converted into a numerical vector of word frequencies (Bag-of-Words).
- Learning Weights: Logistic regression learns one weight per feature for classification.
- Evaluation: Automatically learned weights improve accuracy over the manually assigned weights of the rule-based model (see the sketch after this list).
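A minimal end-to-end sketch with scikit-learn, assuming the CSV columns are named `sentence` and `label` (hypothetical names; adjust to the real headers):

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

train_df = pd.read_csv("train_data.csv")
val_df = pd.read_csv("val_data.csv")

# Bag-of-Words: one feature per vocabulary word, values are counts.
vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(train_df["sentence"])
X_val = vectorizer.transform(val_df["sentence"])

# Logistic regression learns a weight for each word automatically.
clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, train_df["label"])

print("val accuracy:", accuracy_score(val_df["label"], clf.predict(X_val)))
```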
- Word2Vec Training: A skip-gram model is implemented from scratch in PyTorch to learn word embeddings (see the sketch after this list).
- Visualization: PCA reduces the embeddings to 2D to visualize semantic similarity.
- Analogy Tests: Embedding quality is evaluated on analogy questions such as "man:king :: woman:?".
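A minimal skip-gram sketch in PyTorch, using a full-softmax objective on a toy corpus; the project's actual implementation, window size, and hyperparameters may differ:

```python
import torch
import torch.nn as nn

# Toy corpus and vocabulary -- placeholders for the real training data.
corpus = "i really like your new haircut your new haircut is awful".split()
vocab = sorted(set(corpus))
word2idx = {w: i for i, w in enumerate(vocab)}

# (center, context) index pairs within a window of 2.
pairs = [(word2idx[corpus[i]], word2idx[corpus[j]])
         for i in range(len(corpus))
         for j in range(max(0, i - 2), min(len(corpus), i + 3)) if j != i]

class SkipGram(nn.Module):
    def __init__(self, vocab_size: int, dim: int = 16):
        super().__init__()
        self.in_embed = nn.Embedding(vocab_size, dim)  # center-word vectors
        self.out = nn.Linear(dim, vocab_size)          # scores over context words

    def forward(self, center):
        return self.out(self.in_embed(center))         # logits for each vocab word

model = SkipGram(len(vocab))
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)
loss_fn = nn.CrossEntropyLoss()

centers = torch.tensor([c for c, _ in pairs])
contexts = torch.tensor([c for _, c in pairs])
for epoch in range(100):
    optimizer.zero_grad()
    loss = loss_fn(model(centers), contexts)
    loss.backward()
    optimizer.step()

embeddings = model.in_embed.weight.detach()  # one vector per word
```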
- Classification Accuracy: The fraction of test sentences whose sentiment is predicted correctly.
- Word Similarity: Semantic proximity between words, measured by the cosine similarity of their embeddings.
- Analogy Precision: Performance on analogy tasks, counting a hit if the correct answer appears among the top 5 predictions (see the sketch after this list).
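A sketch of the similarity and analogy metrics, assuming `embeddings` and `word2idx` come from the skip-gram example above:

```python
import torch
import torch.nn.functional as F

def cosine_sim(a: str, b: str) -> float:
    """Cosine similarity between the embeddings of two words."""
    return F.cosine_similarity(embeddings[word2idx[a]],
                               embeddings[word2idx[b]], dim=0).item()

def analogy_top5(a: str, b: str, c: str) -> list[str]:
    """Solve a:b :: c:? by vector arithmetic; return the 5 nearest words."""
    query = embeddings[word2idx[b]] - embeddings[word2idx[a]] + embeddings[word2idx[c]]
    sims = F.cosine_similarity(query.unsqueeze(0), embeddings, dim=1)
    idx2word = {i: w for w, i in word2idx.items()}
    # Oversample the top hits, then drop the three query words themselves.
    top = sims.topk(min(8, sims.numel())).indices.tolist()
    top = [i for i in top if idx2word[i] not in {a, b, c}][:5]
    return [idx2word[i] for i in top]
```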