An NLP project that uses Classification Trees and LSTM to predict whether a message is spam or ham.
This is our Mini-Project for SC1015 (Introduction to Data Science and Artificial Intelligence), which focuses on differentiating ham and spam messages from a UCI Machine Learning Repository dataset.
To fully understand our project, here are some details pertaining to the items in our repository:
GloVe word embeddings are required for this project and can be downloaded from this website: Please download and unzip the contents of the version and place them in data>pre-trained>glove.
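Once unzipped, a GloVe file is plain text with one token per line followed by its vector components. Here is a minimal sketch of reading such a file into a dictionary, assuming that layout. The helper name `load_glove` and its `expected_dim` parameter are hypothetical and are not taken from our code.

```python
from pathlib import Path


def load_glove(path, expected_dim=None):
    """Read a GloVe-style text file into a {word: [float, ...]} dict.

    Hypothetical helper for illustration; assumes one token per line
    followed by space-separated vector components.
    """
    embeddings = {}
    with Path(path).open(encoding="utf-8") as handle:
        for line in handle:
            parts = line.rstrip().split(" ")
            if len(parts) < 2:
                continue  # skip blank or malformed lines
            word, values = parts[0], parts[1:]
            vector = [float(v) for v in values]
            if expected_dim is not None and len(vector) != expected_dim:
                continue  # skip lines whose dimension does not match
            embeddings[word] = vector
    return embeddings
```

Call it with the path of the unzipped text file inside data>pre-trained>glove; the exact file name depends on the version you downloaded.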
Files of Interest:
- SpamHamClassification.ipynb: holds all code, documentation and analysis
- holds all the filepaths in variables for easy access (and change, if needed)
- contains the condensed classes required for training: PreProcess() (which preprocesses the data), LSTMModel (the model class) and TrainValidate (which compiles and fits the LSTM model, or loads it from a saved folder)
- word_idx.json: contains the word index generated from the words tokenized by our custom tokenization function
- requirements.txt: list of versions required for the libraries we used in our code
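To illustrate what a word index such as word_idx.json holds, here is a minimal sketch that builds one and uses it to turn a message into a fixed-length integer sequence. The regex tokenizer only stands in for our custom tokenization function. The reserved indices 0 (padding) and 1 (unknown word) and all function names are assumptions made for this example.

```python
import json
import re


def simple_tokenize(text):
    """Lowercase and split into alphanumeric tokens (stand-in tokenizer)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_word_index(messages):
    """Map each distinct token to an integer, in order of first appearance.

    Indices 0 and 1 are reserved for padding and unknown words
    (an assumption of this sketch).
    """
    word_idx = {"<pad>": 0, "<unk>": 1}
    for message in messages:
        for token in simple_tokenize(message):
            if token not in word_idx:
                word_idx[token] = len(word_idx)
    return word_idx


def encode(message, word_idx, max_len):
    """Convert a message to exactly max_len indices, truncating or padding."""
    ids = [word_idx.get(tok, word_idx["<unk>"]) for tok in simple_tokenize(message)]
    ids = ids[:max_len]
    return ids + [word_idx["<pad>"]] * (max_len - len(ids))


messages = ["Win a FREE prize now", "See you at lunch"]
word_idx = build_word_index(messages)
serialized = json.dumps(word_idx)  # the same text could be written to a .json file
encoded = encode("free lunch today", word_idx, max_len=5)
```

Words that never appeared when the index was built (here, "today") fall back to the unknown-word index, and short messages are padded so that every sequence has the same length.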
Folders of Interest:
- data: contains all data directories, including pre-trained embeddings, processed, raw, results and train_test (the train_test folder is used by the notebook only)
  - pre-trained: contains the GloVe pre-trained embeddings
  - processed: contains the cleaned txt file and the train test folder (this train test folder is used solely by the TrainValidate class)
  - raw: contains all the raw data
  - results: contains saved models and model histories from training runs conducted both within the notebook and by the TrainValidate class
Contributors:
- @lemousehunter - Primary Coder, Long Short-Term Memory Model, Machine Learning Engineer
- @raydent30 - Secondary Coder, Exploratory Data Analysis, Data Analytics, Documentation
Problem Definition:
- How do we differentiate ham and spam messages using machine learning?
- Which model would predict the message type best?
Models Used:
- Binary Classification Tree
- Long Short-Term Memory (LSTM)
Conclusion:
- Words and characters are much better predictors than sentences for classifying ham or spam messages.
- Classification trees can predict the type of message with relatively high accuracy, but with a low F1 score.
- LSTM performs better than classification trees in predicting the type of message.
- Vectorization plays a key role when dealing with textual data.
- Yes, it is possible to differentiate ham and spam messages using both Classification Trees and LSTM; however, there is still room for improvement.
- In theory, a transformer should be able to produce much better F1 scores than both Classification Trees and LSTM.
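The gap between accuracy and F1 score noted above is common when one class (typically ham) greatly outnumbers the other. Here is a minimal sketch with made-up labels, not our results, showing how a classifier that misses most spam can still reach high accuracy. The function names are illustrative only.

```python
def confusion_counts(y_true, y_pred, positive="spam"):
    """Count true/false positives and negatives for the positive class."""
    tp = fp = fn = tn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive and truth == positive:
            tp += 1
        elif pred == positive:
            fp += 1
        elif truth == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def accuracy(y_true, y_pred):
    """Fraction of messages labelled correctly."""
    correct = sum(truth == pred for truth, pred in zip(y_true, y_pred))
    return correct / len(y_true)


def f1(y_true, y_pred, positive="spam"):
    """Harmonic mean of precision and recall for the positive class."""
    tp, fp, fn, _ = confusion_counts(y_true, y_pred, positive)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Made-up toy data: 90 ham and 10 spam messages.
y_true = ["ham"] * 90 + ["spam"] * 10
# Hypothetical classifier: flags 1 ham as spam and catches only 2 of 10 spam.
y_pred = ["ham"] * 89 + ["spam"] * 1 + ["spam"] * 2 + ["ham"] * 8

acc = accuracy(y_true, y_pred)  # 91 of 100 correct
score = f1(y_true, y_pred)      # precision 2/3, recall 2/10
```

In this toy case accuracy is 0.91 while F1 is only about 0.31, which is why we compare models on more than one scoring metric.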
What We Learned:
- Usage of Different Scoring Metrics
- Neural Networks, Keras and TensorFlow
- Concept of Transformers
- Collaborating using GitHub
Exploratory Data Analysis:
RNN & Long Short-Term Memory (LSTM):
Scoring Metrics: