Capstone Project - Azure Machine Learning Engineer
In this project, I utilise my learning from the Udacity Nanodegree.

I create two models:
- one using AutoML
- one customized model with hyperparameters tuned using HyperDrive

I then compare the performance of both models and deploy the best-performing one.
This project demonstrates my ability to use an external dataset in my workspace, train a model using the different tools available in the AzureML framework, and deploy the model as a web service.
- First, create a workspace in AzureML
- Then create a compute instance, sized according to your budget
- Upload the dataset (CSV) file into the Datastore and register it with the name "capstone-spam-dataset" (see the sketch after this list)
- Upload all files in the project into the Notebooks tab
- Run all cells in "automl.ipynb" and "hyperparameter_tuning.ipynb"
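A minimal sketch of the upload-and-register step, assuming the default blob datastore and the CSV in the current working directory (the target path and overwrite flag are illustrative):

```python
from azureml.core import Workspace, Dataset

ws = Workspace.from_config()

# Upload the CSV to the workspace's default datastore
datastore = ws.get_default_datastore()
datastore.upload_files(files=["spam-ham-dataset.csv"], target_path="data/", overwrite=True)

# Build a tabular dataset from the uploaded file and register it under the expected name
dataset = Dataset.Tabular.from_delimited_files(path=(datastore, "data/spam-ham-dataset.csv"))
dataset = dataset.register(workspace=ws, name="capstone-spam-dataset")
```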
We use a spam classification dataset, "spam-ham-dataset.csv", obtained from Kaggle (Find it here)
- This task is a text classification problem, specifically binary classification, since we have only 2 classes: spam and ham
- For the AutoML model, features are extracted automatically
- For the HyperDrive model, we use TF-IDF features
- The data can be accessed from the Datastore using `Dataset.get_by_name(ws, name)`, as shown below
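A minimal sketch of loading the registered dataset, assuming the workspace config file is present (variable names are illustrative):

```python
from azureml.core import Workspace, Dataset

ws = Workspace.from_config()

# Retrieve the registered dataset and materialize it for training
training_dataset = Dataset.get_by_name(ws, name="capstone-spam-dataset")
df = training_dataset.to_pandas_dataframe()
```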
- Automated machine learning, also referred to as automated ML or AutoML, is the process of automating the time-consuming, iterative tasks of machine learning model development.
- It allows data scientists, analysts, and developers to build ML models with high scale, efficiency, and productivity all while sustaining model quality.
- It enables organizations to quickly, accurately, and consistently develop and apply machine learning models at scale across their operations to address real-world issues, rather than relying solely on data scientists to hand-code models.
n_cross_validations: 4
We want to perform 4-fold cross-validation: the training data is split into 4 folds/groups; 1 group is used as the validation set while the remaining 3 groups are used for training. This is repeated 4 times, with a different group used as the validation set each time.
task = "classification"
We are performing classification, which in our case is binary (spam vs. ham)
label_column_name = "Category"
We want the ML model to predict this column value
experiment_timeout_hours: 0.25
We want the AutoML experiment to run for a maximum of 15 minutes
primary_metric: "accuracy"
We want to use accuracy as the metric that AutoML optimizes during model training
enable_early_stopping: True
We want to enable early termination during the AutoML experiment if the score is not improving in the short term
featurization: 'auto'
As part of preprocessing, we want data guardrails and featurization steps to be done automatically
max_concurrent_iterations: 5
The number of concurrent child runs allowed for the AutoML experiment
compute_target
Specifies the compute target on which the AutoML experiment should run
training_data = training_dataset
training_dataset is passed because it contains the data used to train the model; all of these settings come together in the sketch below
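Putting the settings above together, a minimal sketch of the AutoML configuration and submission (the experiment name and variable names are illustrative):

```python
from azureml.core import Experiment
from azureml.train.automl import AutoMLConfig

automl_config = AutoMLConfig(
    task="classification",
    training_data=training_dataset,     # the registered tabular dataset
    label_column_name="Category",       # the column the model should predict
    n_cross_validations=4,
    primary_metric="accuracy",
    experiment_timeout_hours=0.25,      # 15 minutes
    enable_early_stopping=True,
    featurization="auto",
    max_concurrent_iterations=5,
    compute_target=compute_target,      # the compute created during setup
)

experiment = Experiment(ws, "capstone-automl")  # illustrative experiment name
remote_run = experiment.submit(automl_config, show_output=True)
```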
What are the results you got with your automated ML model? What were the parameters of the model?
- The best AutoML model was a Stack Ensemble, with an accuracy of 98.79%
- Parameters of the best model are shown below:
How to improve results?
- The dataset was imbalanced; balancing it could have given better results
- Allowing deep learning models within AutoML could lead to better results
- Try experimenting with other configurations and settings to see how they affect model performance and accuracy
What is the use of the RunDetails widget?
- It is a Jupyter notebook widget used to view the progress of model training
- It updates asynchronously and provides updates until training finishes
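A minimal usage sketch, assuming `remote_run` is the submitted run from the snippet above:

```python
from azureml.widgets import RunDetails

# Renders a live-updating view of the run inside the notebook
RunDetails(remote_run).show()

# Block until the run finishes, streaming logs to the cell output
remote_run.wait_for_completion(show_output=True)
```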
What kind of model did you choose for this experiment and why?
- I chose a Logistic Regression model because I wanted to build a simple baseline model for the HyperDrive component of this project (a sketch of the training logic follows)
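A minimal sketch of the training script's core logic, assuming the dataset's text column is "Message" and the label column is "Category"; the argument names, split sizes, and vectorizer settings are illustrative assumptions:

```python
import argparse
import pandas as pd
from azureml.core.run import Run
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Hyperparameters supplied by HyperDrive
parser = argparse.ArgumentParser()
parser.add_argument("--C", type=float, default=1.0)
parser.add_argument("--max_iter", type=int, default=100)
args = parser.parse_args()

# Load the data (or use the registered dataset as shown earlier)
df = pd.read_csv("spam-ham-dataset.csv")
X_train, X_test, y_train, y_test = train_test_split(
    df["Message"], df["Category"], test_size=0.2, random_state=42
)

# TF-IDF features, fitted on the training split only
vectorizer = TfidfVectorizer()
X_train_vec = vectorizer.fit_transform(X_train)
X_test_vec = vectorizer.transform(X_test)

model = LogisticRegression(C=args.C, max_iter=args.max_iter)
model.fit(X_train_vec, y_train)

# Log the metric HyperDrive optimizes
accuracy = accuracy_score(y_test, model.predict(X_test_vec))
Run.get_context().log("Accuracy", float(accuracy))
```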
Give an overview of the types of parameters and their ranges used for the hyperparameter search
- Its parameters are `C` and `max_iter`:
  - `C`: inverse of regularization strength; must be a positive float. Smaller values specify stronger regularization
  - `max_iter`: maximum number of iterations taken for the solvers to converge
- Ranges:
  - `C`: choice(0.01, 0.001, 0.1, 1)
  - `max_iter`: choice(50, 100)
What are the results you got with your model? What were the parameters of the model?
- The best model had an accuracy of 96.91%
- Parameters were C=1 and max_iter=50
How could you have improved it?
- Balancing the imbalanced dataset
- Deep Learning models such as BERT, LSTM, etc. could have given better results
- We could have used better feature engineering techniques such as word2vec, doc2vec, transformer-embeddings, etc.
- We could try other classical ML classification algorithms such as SVM, XGBoost, Random Forest, etc.
Give an overview of the deployed model and instructions on how to query the endpoint with a sample input.
- The deployed model was the Stack Ensemble model created in the AutoML experiment, since it gave the best accuracy across all experiments
- First obtain and deploy the best model by running all cells in the "automl.ipynb" notebook (except the last 2 cells, which delete resources)
- The "automl.ipynb" notebook contains the code logic to query the deployed endpoint. To test your own input, enter your text in the "Message" field of the data and run the cell to get back the prediction; a standalone request sketch is shown below
Provide a link to a screen recording of the project in action. Remember that the screencast should demonstrate:
- A working model
- Demo of the deployed model
- Demo of a sample request sent to the endpoint and its response