SC1015 Natural Language Processing (NLP) Spam Detection Project

An NLP project that uses Classification Trees and LSTM to predict whether a message is spam or ham.

Welcome!

About

This is our Mini-Project for SC1015 (Introduction to Data Science and Artificial Intelligence), which focuses on differentiating ham and spam messages using the SMS Spam Collection dataset from the UCI Machine Learning Repository (https://archive.ics.uci.edu/ml/datasets/SMS+Spam+Collection).

To fully understand our project, here are some details pertaining to the items in our repository:

GloVe word embeddings are required for this project and can be downloaded from https://nlp.stanford.edu/projects/glove/. Please download glove.6B.zip, unzip it, and place the contents in the glove folder inside data > pre-trained.
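As a rough illustration of what the unzipped files contain, each line of a GloVe text file holds a word followed by its space-separated vector values. The sketch below parses that format using only the standard library. The function names and the example path in the comment are hypothetical, not taken from this repository, and this README does not state which vector dimension the project uses.

```python
from typing import Dict, Iterable, List


def parse_glove_lines(lines: Iterable[str]) -> Dict[str, List[float]]:
    """Parse GloVe text lines of the form 'word v1 v2 ... vd' into a dict."""
    embeddings: Dict[str, List[float]] = {}
    for line in lines:
        parts = line.rstrip().split(" ")
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        word, values = parts[0], parts[1:]
        embeddings[word] = [float(v) for v in values]
    return embeddings


def load_glove(path: str) -> Dict[str, List[float]]:
    """Read a GloVe text file from disk (path is supplied by the caller)."""
    with open(path, encoding="utf-8") as handle:
        return parse_glove_lines(handle)


# Tiny in-memory sample standing in for a real file such as
# data/pre-trained/glove/glove.6B.100d.txt (hypothetical path and dimension).
sample = ["free 0.1 -0.2 0.3", "hello 0.0 0.5 -0.5"]
vectors = parse_glove_lines(sample)
```

With the real download in place, `load_glove(path)` would return the same kind of dictionary, keyed by word.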

Files of Interest:

SpamHamClassification.ipynb: holds all code, documentation and analysis
fileMaster.py: holds all the filepaths in variables for easy access (and change, if needed)
LSTMModel.py: contains the condensed classes required for training: PreProcess, which preprocesses the data; LSTMModel, which is the model class; and TrainValidate, which compiles and fits the LSTM model (or loads it from the saved folder)
word_idx.json: contains the word index generated from the words tokenized by our custom tokenization function
requirements.txt: list of versions required for the libraries we used in our code
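To give a sense of what a word index like the one in word_idx.json looks like, here is a minimal sketch that maps each token to an integer and serializes the mapping as JSON. The tokenizer shown is a simple lowercase regex split used as a stand-in; it is an assumption, not the project's custom tokenization function, and the function names are hypothetical.

```python
import json
import re
from collections import Counter
from typing import Dict, List


def simple_tokenize(text: str) -> List[str]:
    """Stand-in tokenizer (assumption): lowercase runs of letters, digits, apostrophes."""
    return re.findall(r"[a-z0-9']+", text.lower())


def build_word_index(messages: List[str]) -> Dict[str, int]:
    """Assign indices starting at 1, most frequent words first.

    Index 0 is left unused so it can serve as a padding value.
    Ties in frequency are broken alphabetically for reproducibility.
    """
    counts = Counter(tok for msg in messages for tok in simple_tokenize(msg))
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return {word: i for i, (word, _) in enumerate(ordered, start=1)}


messages = ["Free entry now", "are you free now"]
word_idx = build_word_index(messages)

# JSON text that could be written to a file such as word_idx.json
serialized = json.dumps(word_idx, indent=2)
```

Reserving index 0 for padding is a common convention when sequences are later padded to equal length for an LSTM, but whether this project does so is not stated here.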

Folders of Interest:

data: contains all data directories, including pre-trained, processed, raw, results and train_test (this train_test folder is used by the notebook only)
pre-trained: contains the GloVe pre-trained embeddings
processed: contains the cleaned txt file and the train test folder (this train test folder is used solely by the TrainValidate class)
raw: contains all the raw data
results: contains saved models and model histories from the training runs conducted both within the notebook and by the TrainValidate class

Contributors

@lemousehunter - Primary Coder, Long Short-Term Memory Model, Machine Learning Engineer
@raydent30 - Secondary Coder, Exploratory Data Analysis, Data Analytics, Documentation

Problem Definition

How do we differentiate ham and spam messages using machine learning?
Which model would be the best to predict it?

Models Used

Binary Classification Tree
Long Short-Term Memory (LSTM)

Conclusion

Words and characters are much better predictors than sentences for classifying ham or spam messages.
Classification trees can predict the type of message with relatively high accuracy, but with a low F1 score.
LSTM performs better than classification trees in predicting the type of message.
Vectorization plays a key role when dealing with textual data.
Yes, it is possible to differentiate ham and spam messages using both Classification Trees and LSTM. However, there is still room for improvement.
In theory, a transformer should be able to produce much better F1 scores than both Classification Trees and LSTM.
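The gap between high accuracy and a low F1 score noted above is typical of imbalanced data, where ham greatly outnumbers spam. The sketch below shows the arithmetic on made-up labels. The counts are illustrative assumptions only and are not this project's results.

```python
from typing import List, Tuple


def confusion_counts(
    y_true: List[str], y_pred: List[str], positive: str = "spam"
) -> Tuple[int, int, int, int]:
    """Return (tp, fp, fn, tn), treating `positive` as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    return tp, fp, fn, tn


def accuracy(tp: int, fp: int, fn: int, tn: int) -> float:
    """Fraction of all messages labeled correctly."""
    return (tp + tn) / (tp + fp + fn + tn)


def f1_score(tp: int, fp: int, fn: int) -> float:
    """Harmonic mean of precision and recall: 2TP / (2TP + FP + FN)."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Made-up example: 100 messages, 10 spam and 90 ham.
# The hypothetical classifier catches 3 spam, misses 7, and flags 1 ham as spam.
y_true = ["spam"] * 10 + ["ham"] * 90
y_pred = ["spam"] * 3 + ["ham"] * 7 + ["spam"] * 1 + ["ham"] * 89

tp, fp, fn, tn = confusion_counts(y_true, y_pred)
acc = accuracy(tp, fp, fn, tn)
f1 = f1_score(tp, fp, fn)
```

In this toy case accuracy is 0.92 while F1 is roughly 0.43, because most of the correct predictions come from the majority ham class and the classifier still misses most spam.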

What did we learn from this project?

Usage of Different Scoring Metrics
Neural Networks, Keras and TensorFlow
Concept of Transformers
Collaborating using GitHub

References

Dataset:

https://archive.ics.uci.edu/ml/datasets/SMS+Spam+Collection

