Solved a churn prediction (classification) problem on an event data set and demonstrated big data handling skills with Spark on AWS
Churn prediction, i.e., predicting which clients are likely to cancel the service, is one of the most common business applications of machine learning. It is especially important for companies providing streaming services. In this project, an event data set from a fictional music streaming company named Sparkify was analyzed. A small subset (128 MB) of the full data set (12 GB) was first analyzed locally in a Jupyter notebook with a scalable Spark script, and the full data set was then analyzed on an AWS EMR cluster.
- How to manipulate large, realistic data sets with Spark to engineer relevant features for predicting churn
- How to use Spark MLlib to build machine learning models with large data sets
- Sparkify_visualization: Code for cleaning the small data set and exploratory visualization
- Sparkify_modeling: Code for cleaning the small data set and modeling
- Sparkify_AWS: Code for analyzing the big data set on an AWS EMR cluster (because of limited time and budget, I only ran this version of the code once on the AWS cluster; it is not an exactly scaled-up copy of the code in the other two notebooks, but those notebooks should be fully scalable to the big data set)
On AWS, the full 12GB dataset is hosted on a public S3 bucket; follow the instructions below to launch an EMR cluster and notebook. Expect to spend about $30 or more to run this cluster for a week with the following settings. A minimal loading sketch follows the dataset paths below.
- Full Sparkify Dataset: s3n://udacity-dsnd/sparkify/sparkify_event_data.json
- Mini Sparkify Dataset: s3n://udacity-dsnd/sparkify/mini_sparkify_event_data.json
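As a minimal sketch (assuming the PySpark API used in the notebooks), a Spark session can be instantiated and the event data loaded straight from the public S3 paths above:

```python
# Sketch only: create a Spark session and load the event data from S3.
# The app name "Sparkify" is arbitrary; use the mini dataset path for local
# testing and the full path when running on the EMR cluster.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("Sparkify")
         .getOrCreate())

event_data = "s3n://udacity-dsnd/sparkify/sparkify_event_data.json"
df = spark.read.json(event_data)

df.printSchema()
print("Number of events:", df.count())
```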
**Launch EMR Cluster and Notebook**
- Open a regular AWS account (if you don't already have one) following the instructions in the Amazon Web Services Help Center
- Go to the Amazon EMR Console
- Select "Clusters" in the menu on the left, and click the "Create cluster" button.
Step 1: Configure your cluster with the following settings:
- Release: emr-5.20.0 or later
- Applications: Spark: Spark 2.4.0 on Hadoop 2.8.5 YARN with Ganglia 3.7.2 and Zeppelin 0.8.0
- Instance type: m3.xlarge
- Number of instances: 3
- EC2 key pair: Proceed without an EC2 key pair or feel free to use one if you'd like
- You can keep the remaining default settings and click "Create cluster" at the bottom right (a scripted alternative is sketched below).
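If you prefer to script the cluster creation rather than click through the console, roughly the same configuration can be expressed with boto3. This is a sketch only; the cluster name, region, and default IAM roles are assumptions you should adjust for your own account:

```python
# Sketch: programmatic equivalent of the console settings above, using boto3.
# Cluster name, region, and the default EMR IAM roles are assumptions.
import boto3

emr = boto3.client("emr", region_name="us-west-2")

response = emr.run_job_flow(
    Name="sparkify-churn",
    ReleaseLabel="emr-5.20.0",
    Applications=[{"Name": "Spark"}, {"Name": "Ganglia"}, {"Name": "Zeppelin"}],
    Instances={
        "MasterInstanceType": "m3.xlarge",
        "SlaveInstanceType": "m3.xlarge",
        "InstanceCount": 3,
        "KeepJobFlowAliveWhenNoSteps": True,
    },
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
    VisibleToAllUsers=True,
)
print("Cluster ID:", response["JobFlowId"])
```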
Step 2: Wait for Cluster "Waiting" Status
Once you create the cluster, you'll see a status next to your cluster name that says Starting. Wait a short time for this status to change to Waiting before moving on to the next step.
Step 3: Create Notebook
Now that you launched your cluster successfully, let's create a notebook to run Spark on that cluster.
Select "Notebooks" in the menu on the left, and click the "Create notebook" button.
Step 4: Configure your notebook
- Enter a name for your notebook
- Select "Choose an existing cluster" and choose the cluster you just created
- Use the default setting for "AWS service role" - this should be "EMR_Notebooks_DefaultRole" or "Create default role" if you haven't done this before.
You can keep the remaining default settings and click "Create notebook" on the bottom right.
Step 5: Wait for Notebook "Ready" Status, Then Open
Once you create an EMR notebook, you'll need to wait a short time before the notebook status changes from Starting or Pending to Ready. Once your notebook status is Ready, click the "Open" button to open the notebook.
Start Coding!
Now you can run Spark code for your project in this notebook, which EMR will run on your cluster.
For more information on EMR notebooks, see the AWS EMR documentation.
If you want to challenge yourself, start from the starter code instead of referring to my code.
When you run the last cell, you'll see a box appear that says "Spark Job Progress." Click on the arrow in that box to view your cluster's progress as it reads the full 12GB dataset!
- Import libraries
- Instantiate a Spark session
- Load and Clean Dataset
- Exploratory Data Analysis (separately done in Sparkify_visualization)
- Feature Engineering
- Modeling
- Conclusion
- Discussion
- PySpark
- pyspark.sql
- pyspark.ml
- NumPy
- Pandas
- Seaborn
- Matplotlib
- Handling of missing values and empty strings
- Simplifying the categorical variables location and userAgent
- Transformation of timestamps to epoch time
- Definition of churn and downgrade (see the sketch after this list)
- Visualization of behavior for users who stayed vs. users who churned
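A minimal sketch of the cleaning and churn-labeling steps, assuming the standard Sparkify columns (userId, page, and ts in epoch milliseconds) and the df loaded earlier; the notebooks' exact handling may differ:

```python
# Sketch of cleaning and churn labeling, assuming columns userId, ts, page.
from pyspark.sql import functions as F
from pyspark.sql import Window

# Drop rows with missing or empty userId (e.g. logged-out page views).
df_clean = df.filter((F.col("userId").isNotNull()) & (F.col("userId") != ""))

# Convert the millisecond timestamp into a readable event time.
df_clean = df_clean.withColumn("event_time", F.from_unixtime(F.col("ts") / 1000))

# Define churn: any user with a "Cancellation Confirmation" event gets label 1.
churn_flag = F.when(F.col("page") == "Cancellation Confirmation", 1).otherwise(0)
df_clean = df_clean.withColumn("churn_event", churn_flag)
user_window = Window.partitionBy("userId")
df_clean = df_clean.withColumn("churn", F.max("churn_event").over(user_window))
```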
- Assembling features
- Transforming categorical variables
- Scaling features
- Transforming to a feature vector (see the pipeline sketch after this list)
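For illustration, the preprocessing steps above can be wired into a Spark ML Pipeline roughly as follows. The column names (gender, n_songs, n_thumbs_down, days_active) and the per-user user_features DataFrame are placeholders, not the exact features from the notebooks:

```python
# Sketch of the preprocessing pipeline: index a categorical column, assemble
# the columns into a vector, and scale them. `user_features` is assumed to be
# a per-user DataFrame of engineered features plus the "churn" label.
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, VectorAssembler, StandardScaler

indexer = StringIndexer(inputCol="gender", outputCol="gender_idx")

assembler = VectorAssembler(
    inputCols=["gender_idx", "n_songs", "n_thumbs_down", "days_active"],
    outputCol="raw_features")

scaler = StandardScaler(inputCol="raw_features", outputCol="features", withStd=True)

prep_pipeline = Pipeline(stages=[indexer, assembler, scaler])
features_df = prep_pipeline.fit(user_features).transform(user_features)
```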
- Models: logistic regression, random forest, gradient-boosted trees
- Evaluators: binary and multiclass classification evaluators
- Metrics: F1 and AUC
- Hyperparameter tuning
- Cross-validation (a training and evaluation sketch follows this list)
- Checking feature importances
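A sketch of the training and evaluation flow for one of the models (logistic regression), assuming the features_df from the pipeline above with a "features" vector and the "churn" label; the parameter grids in the notebooks may differ:

```python
# Sketch: cross-validated logistic regression with F1/AUC evaluation.
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import (BinaryClassificationEvaluator,
                                   MulticlassClassificationEvaluator)
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

train, test = features_df.randomSplit([0.8, 0.2], seed=42)

lr = LogisticRegression(featuresCol="features", labelCol="churn", maxIter=50)

# Small illustrative grid; the actual grid in the notebooks may differ.
param_grid = (ParamGridBuilder()
              .addGrid(lr.regParam, [0.0, 0.01, 0.1])
              .addGrid(lr.elasticNetParam, [0.0, 0.5])
              .build())

f1_eval = MulticlassClassificationEvaluator(labelCol="churn", metricName="f1")
cv = CrossValidator(estimator=lr, estimatorParamMaps=param_grid,
                    evaluator=f1_eval, numFolds=3)

cv_model = cv.fit(train)
predictions = cv_model.transform(test)

auc_eval = BinaryClassificationEvaluator(labelCol="churn",
                                         metricName="areaUnderROC")
print("Test F1 :", f1_eval.evaluate(predictions))
print("Test AUC:", auc_eval.evaluate(predictions))
```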
In this project, churn prediction was performed on an event data set from a music streaming provider, which is essentially a binary classification problem. After loading and cleaning the data, I performed some exploratory analysis that informed the feature engineering step. Altogether 13 explanatory features were selected, and logistic regression, random forest, and gradient-boosted tree models were each fitted to a training data set.
Model performance was best for logistic regression on the small data set, with an F1 score of 73.10% on the test set. The other two models both suffered from overfitting. Hyperparameter tuning and cross-validation did not help much with the overfitting, probably because of the small sample size. Due to time and budget limitations, the final models were not tested on the big data set. However, the fully scalable process sheds light on solving the churn prediction problem on big data with Spark in the cloud.
Finer feature engineering
Feature engineering was one of the most important steps in this project. Because of the trade-off between model performance and computation capacity, it was impossible and unnecessary to select as many features as I wanted. Fewer features were tested on the full data set and the model performance was not satisfying, which is why I created more features and tested them on the small data set. Although model performance improved, it was probably still not ideal. Some techniques might help with feature selection, for example the ChiSqSelector provided by Spark ML for selecting significant features (a sketch follows).
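A sketch of how ChiSqSelector could be plugged in, assuming the assembled "features" vector and "churn" label from the pipeline above; the number of top features kept is an arbitrary choice for illustration:

```python
# Sketch: keep the top-k features ranked by a chi-squared test against the label.
from pyspark.ml.feature import ChiSqSelector

selector = ChiSqSelector(numTopFeatures=8,
                         featuresCol="features",
                         labelCol="churn",
                         outputCol="selected_features")
selected_df = selector.fit(features_df).transform(features_df)
```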
Build a balanced training set
An imbalanced sample (with many more rows labeled "0" than "1") was another factor holding back model performance. Introducing class weights was only possible for the logistic regression algorithm. In future work, we could randomly select the same number of "0" rows as "1" rows to create a balanced training set and fit the models, which might improve performance (a sketch of this idea follows).
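As a sketch of this idea (assuming the "churn" label and the train split from the modeling sketch above), the majority class could be down-sampled to roughly the size of the minority class before fitting:

```python
# Sketch: down-sample the majority class ("churn" == 0) so the training set is
# roughly balanced. The sampling fraction comes from the observed class ratio.
n_churn = train.filter("churn = 1").count()
n_stay = train.filter("churn = 0").count()
fraction = n_churn / n_stay

stay_sample = train.filter("churn = 0").sample(withReplacement=False,
                                               fraction=fraction, seed=42)
balanced_train = train.filter("churn = 1").union(stay_sample)
```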
This is my capstone project for the Udacity Data Scientist Nanodegree program. Thanks to Udacity for providing such a wonderful program and making it possible for people with many different backgrounds to step into the field of data science.