This repository has been archived by the owner on Apr 24, 2022. It is now read-only.

An NLP project that uses an RNN to predict whether a message is spam or ham

lemousehunter/SC1015-NLP-Spam-Detection-Project

SC1015 Natural Language Processing (NLP) Spam Detection Project

An NLP project that uses Classification Trees and LSTM to predict whether a message is spam or ham.

Welcome!

About

This is our Mini-Project for SC1015 (Introduction to Data Science and Artificial Intelligence). It focuses on differentiating ham and spam messages in the SMS Spam Collection dataset from the UCI Machine Learning Repository (https://archive.ics.uci.edu/ml/datasets/SMS+Spam+Collection).

The details below describe the items in our repository and what is needed to run the project:

GloVe word embeddings are required for this project and can be downloaded from https://nlp.stanford.edu/projects/glove/. Please download glove.6B.zip, unzip it, and place the contents in data > pre-trained > glove.
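Once the files are in place, the embeddings can be read as plain text: each line holds a token followed by its vector components, separated by spaces. The snippet below is a minimal sketch of that parsing step, not the code used in the notebook. The function names `parse_glove` and `load_glove` are hypothetical, and the choice of `glove.6B.100d.txt` is an assumption (glove.6B.zip also ships other vector sizes).

```python
from typing import Dict, Iterable, List


def parse_glove(lines: Iterable[str]) -> Dict[str, List[float]]:
    """Parse GloVe text format: a token followed by its vector components."""
    embeddings: Dict[str, List[float]] = {}
    for line in lines:
        parts = line.rstrip().split(" ")
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        word, values = parts[0], parts[1:]
        embeddings[word] = [float(v) for v in values]
    return embeddings


# Assumed location and file; adjust to the vector size you actually use.
GLOVE_PATH = "data/pre-trained/glove/glove.6B.100d.txt"


def load_glove(path: str = GLOVE_PATH) -> Dict[str, List[float]]:
    """Read a GloVe text file from disk and return a word-to-vector mapping."""
    with open(path, encoding="utf-8") as handle:
        return parse_glove(handle)
```

Calling `load_glove()` returns a dictionary that maps each token to its vector as a list of floats, which can then be used to fill an embedding matrix.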

Files of Interest:

  • SpamHamClassification.ipynb: holds all code, documentation and analysis
  • fileMaster.py: holds all the filepaths in variables for easy access (and change, if needed)
  • LSTMModel.py: contains the condensed classes required for training: PreProcess (which preprocesses the data), LSTMModel (the model class) and TrainValidate (which compiles and fits the LSTM model, or loads it from the saved folder)
  • word_idx.json: contains the word index generated from the words tokenized by our custom tokenization function
  • requirements.txt: list of versions required for the libraries we used in our code
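To illustrate what a word index such as word_idx.json holds, here is a minimal sketch of building one and using it to turn a message into integers. It is not the project's tokenizer: `simple_tokenize`, `build_word_index` and `encode` are hypothetical names, and reserving index 0 for padding and 1 for unknown words is an assumption.

```python
import json
import re
from collections import Counter
from typing import Dict, Iterable, List


def simple_tokenize(text: str) -> List[str]:
    """Hypothetical stand-in for the custom tokenization function."""
    return re.findall(r"[a-z0-9']+", text.lower())


def build_word_index(messages: Iterable[str]) -> Dict[str, int]:
    """Map each word to an integer, most frequent words first."""
    counts = Counter(tok for msg in messages for tok in simple_tokenize(msg))
    # Assumption: 0 is reserved for padding and 1 for unknown words.
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=2)}


def encode(message: str, word_index: Dict[str, int], unk: int = 1) -> List[int]:
    """Convert a message into a list of word indices."""
    return [word_index.get(tok, unk) for tok in simple_tokenize(message)]


messages = ["Free prize! Call now", "Are you free tonight?"]
word_index = build_word_index(messages)
as_json = json.dumps(word_index)  # the kind of content a word index file stores
```

Words that never appeared during index construction fall back to the unknown index, so new messages can still be encoded.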

Folders of Interest:

  • data: contains all data directories: pre-trained, processed, raw, results and train_test (the train_test folder is used by the notebook only)
  • pre-trained: contains the GloVe pre-trained embeddings
  • processed: contains the cleaned txt file and the train test folder (this train test folder is used solely by the TrainValidate class)
  • raw: contains all the raw data
  • results: contains saved models and model histories from the training runs conducted in both the notebook and the TrainValidate class

Contributors

  • @lemousehunter - Primary Coder, Long Short-Term Memory Model, Machine Learning Engineer
  • @raydent30 - Secondary Coder, Exploratory Data Analysis, Data Analytics, Documentation

Problem Definition

  • How do we differentiate ham and spam messages using machine learning?
  • Which model would be the best to predict it?

Models Used

  1. Binary Classification Tree
  2. Long Short-Term Memory (LSTM)

Conclusion

  • Words and characters are much better predictors than sentences for classifying messages as ham or spam.
  • Classification trees can predict the type of message with relatively high accuracy, but with a low F1 score.
  • LSTM performs better than classification trees in predicting the type of message.
  • Vectorization plays a key role when dealing with textual data.
  • Yes, it is possible to differentiate ham and spam messages using both Classification Trees and LSTM; however, there is still room for improvement.
  • In theory, a transformer should be able to produce much better F1 scores than both Classification Trees and LSTM.
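A high accuracy alongside a low F1 score can look contradictory. One common cause is class imbalance: a model that labels most messages as ham is right most of the time while still missing most of the spam. The sketch below works through that arithmetic on made-up labels. The counts are illustrative only and are not results from this project; the function names are hypothetical.

```python
from typing import List, Tuple


def confusion_counts(y_true: List[str], y_pred: List[str],
                     positive: str = "spam") -> Tuple[int, int, int, int]:
    """Return (tp, fp, fn, tn), treating `positive` as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    return tp, fp, fn, tn


def accuracy(tp: int, fp: int, fn: int, tn: int) -> float:
    """Fraction of all messages that were labeled correctly."""
    return (tp + tn) / (tp + fp + fn + tn)


def f1_score(tp: int, fp: int, fn: int) -> float:
    """Harmonic mean of precision and recall for the positive class."""
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


# Made-up example: 10 spam and 90 ham messages.
y_true = ["spam"] * 10 + ["ham"] * 90
# The classifier catches 3 spam, misses 7, and flags 1 ham as spam.
y_pred = ["spam"] * 3 + ["ham"] * 7 + ["spam"] * 1 + ["ham"] * 89

tp, fp, fn, tn = confusion_counts(y_true, y_pred)
acc = accuracy(tp, fp, fn, tn)  # 92 of 100 correct
f1 = f1_score(tp, fp, fn)       # 6 / 14, about 0.43
```

Here accuracy is 0.92 while F1 for the spam class is about 0.43, which is why F1 is the more informative metric when the class of interest is rare.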

What did we learn from this project?

  • Usage of Different Scoring Metrics
  • Neural Networks, Keras and TensorFlow
  • Concept of Transformers
  • Collaborating using GitHub
