Solved a churn prediction (classification) problem on an event data set and demonstrated big data handling skills with Spark on AWS
Churn prediction, i.e., predicting which clients are likely to cancel the service, is one of the most common business applications of machine learning. It is especially important for companies providing streaming services. In this project, an event data set from a fictional music streaming company named Sparkify was analyzed. A small subset (128 MB) of the full data set (12 GB) was first analyzed locally in a Jupyter notebook with a scalable Spark script, and the full data set was then analyzed on an AWS EMR cluster.
- How to manipulate large, realistic data sets with Spark to engineer relevant features for predicting churn
- How to use Spark MLlib to build machine learning models with large data sets
- Sparkify_visualization: Code for cleaning the small data set and exploratory visualization
- Sparkify_modeling: Code for cleaning the small data set and modeling
- Sparkify_AWS: Code for analyzing the big data set on an AWS EMR cluster (because of limited time and budget, I only ran this version of the code once on the AWS cluster; it is not an exactly scaled-up copy of the code in the other two notebooks, but those notebooks should be fully scalable to the big data set)
On AWS, the full 12GB dataset is hosted on a public S3 bucket; follow the instructions below to launch an EMR cluster and notebook. Expect to spend about $30 or more to run this cluster for a week with the following settings. A minimal loading sketch follows the dataset paths below.
- Full Sparkify Dataset: s3n://udacity-dsnd/sparkify/sparkify_event_data.json
- Mini Sparkify Dataset: s3n://udacity-dsnd/sparkify/mini_sparkify_event_data.json
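As a minimal sketch (assuming the PySpark API used in the notebooks), a Spark session can be instantiated and the event data loaded straight from the public S3 paths above:

```python
# Sketch only: create a Spark session and load the event data from S3.
# The app name "Sparkify" is arbitrary; use the mini dataset path for local
# testing and the full path when running on the EMR cluster.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("Sparkify")
         .getOrCreate())

event_data = "s3n://udacity-dsnd/sparkify/sparkify_event_data.json"
df = spark.read.json(event_data)

df.printSchema()
print("Number of events:", df.count())
```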
**Launch EMR Cluster and Notebook**
- Open a regular AWS account (if you don't already have one) following the instructions in the Amazon Web Services Help Center
- Go to the Amazon EMR Console
- Select "Clusters" in the menu on the left, and click the "Create cluster" button.
Step 1: Configure your cluster with the following settings:
- Release: emr-5.20.0 or later
- Applications: Spark: Spark 2.4.0 on Hadoop 2.8.5 YARN with Ganglia 3.7.2 and Zeppelin 0.8.0
- Instance type: m3.xlarge
- Number of instances: 3
- EC2 key pair: Proceed without an EC2 key pair or feel free to use one if you'd like
- You can keep the remaining default settings and click "Create cluster" at the bottom right (a scripted alternative is sketched below).
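If you prefer to script the cluster creation rather than click through the console, roughly the same configuration can be expressed with boto3. This is a sketch only; the cluster name, region, and default IAM roles are assumptions you should adjust for your own account:

```python
# Sketch: programmatic equivalent of the console settings above, using boto3.
# Cluster name, region, and the default EMR IAM roles are assumptions.
import boto3

emr = boto3.client("emr", region_name="us-west-2")

response = emr.run_job_flow(
    Name="sparkify-churn",
    ReleaseLabel="emr-5.20.0",
    Applications=[{"Name": "Spark"}, {"Name": "Ganglia"}, {"Name": "Zeppelin"}],
    Instances={
        "MasterInstanceType": "m3.xlarge",
        "SlaveInstanceType": "m3.xlarge",
        "InstanceCount": 3,
        "KeepJobFlowAliveWhenNoSteps": True,
    },
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
    VisibleToAllUsers=True,
)
print("Cluster ID:", response["JobFlowId"])
```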
Step 2: Wait for Cluster "Waiting" Status
Once you create the cluster, you'll see a status next to your cluster name that says Starting. Wait a short time for this status to change to Waiting before moving on to the next step.
Step 3: Create Notebook
Now that you launched your cluster successfully, let's create a notebook to run Spark on that cluster.
Select "Notebooks" in the menu on the left, and click the "Create notebook" button.
Step 4: Configure your notebook
- Enter a name for your notebook
- Select "Choose an existing cluster" and choose the cluster you just created
- Use the default setting for "AWS service role" - this should be "EMR_Notebooks_DefaultRole" or "Create default role" if you haven't done this before.
You can keep the remaining default settings and click "Create notebook" on the bottom right.
Step 5: Wait for Notebook "Ready" Status, Then Open
Once you create an EMR notebook, you'll need to wait a short time before the notebook status changes from Starting or Pending to Ready. Once your notebook status is Ready, click the "Open" button to open the notebook.
Start Coding!
Now you can run Spark code for your project in this notebook, which EMR will run on your cluster.
For more information on EMR notebooks, see the AWS EMR documentation.
If you want to challenge yourself, start from the starter code instead of referring to my code.
When you run the last cell, you'll see a box appear that says "Spark Job Progress." Click on the arrow in that box to view your cluster's progress as it reads the full 12GB dataset!
- Import libraries
- Instantiate a Spark session
- Load and Clean Dataset
- Exploratory Data Analysis (separately done in Sparkify_visualization)
- Feature Engineering
- Modeling
- Conclusion
- Discussion
- PySpark
- pyspark.sql
- pyspark.ml
- NumPy
- Pandas
- Seaborn
- Matplotlib
- Handling of missing values and empty strings
- Simplifying the categorical variables location and userAgent
- Transformation of timestamps to epoch time
- Definition of churn and downgrade (see the sketch after this list)
- Visualization of behavior for users who stayed vs. users who churned
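A minimal sketch of the cleaning and churn-labeling steps, assuming the standard Sparkify columns (userId, page, and ts in epoch milliseconds) and the df loaded earlier; the notebooks' exact handling may differ:

```python
# Sketch of cleaning and churn labeling, assuming columns userId, ts, page.
from pyspark.sql import functions as F
from pyspark.sql import Window

# Drop rows with missing or empty userId (e.g. logged-out page views).
df_clean = df.filter((F.col("userId").isNotNull()) & (F.col("userId") != ""))

# Convert the millisecond timestamp into a readable event time.
df_clean = df_clean.withColumn("event_time", F.from_unixtime(F.col("ts") / 1000))

# Define churn: any user with a "Cancellation Confirmation" event gets label 1.
churn_flag = F.when(F.col("page") == "Cancellation Confirmation", 1).otherwise(0)
df_clean = df_clean.withColumn("churn_event", churn_flag)
user_window = Window.partitionBy("userId")
df_clean = df_clean.withColumn("churn", F.max("churn_event").over(user_window))
```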
- Assembling features
- Transforming categorical variables
- Scaling features
- Transforming to a feature vector (see the pipeline sketch after this list)
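For illustration, the preprocessing steps above can be wired into a Spark ML Pipeline roughly as follows. The column names (gender, n_songs, n_thumbs_down, days_active) and the per-user user_features DataFrame are placeholders, not the exact features from the notebooks:

```python
# Sketch of the preprocessing pipeline: index a categorical column, assemble
# the columns into a vector, and scale them. `user_features` is assumed to be
# a per-user DataFrame of engineered features plus the "churn" label.
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, VectorAssembler, StandardScaler

indexer = StringIndexer(inputCol="gender", outputCol="gender_idx")

assembler = VectorAssembler(
    inputCols=["gender_idx", "n_songs", "n_thumbs_down", "days_active"],
    outputCol="raw_features")

scaler = StandardScaler(inputCol="raw_features", outputCol="features", withStd=True)

prep_pipeline = Pipeline(stages=[indexer, assembler, scaler])
features_df = prep_pipeline.fit(user_features).transform(user_features)
```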
- Models: logistic regression, random forest, gradient-boosted trees
- Evaluators: binary and multiclass classification evaluators
- Metrics: F1 and AUC
- Hyperparameter tuning
- Cross-validation (a training and evaluation sketch follows this list)
- Checking feature importances
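A sketch of the training and evaluation flow for one of the models (logistic regression), assuming the features_df from the pipeline above with a "features" vector and the "churn" label; the parameter grids in the notebooks may differ:

```python
# Sketch: cross-validated logistic regression with F1/AUC evaluation.
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import (BinaryClassificationEvaluator,
                                   MulticlassClassificationEvaluator)
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

train, test = features_df.randomSplit([0.8, 0.2], seed=42)

lr = LogisticRegression(featuresCol="features", labelCol="churn", maxIter=50)

# Small illustrative grid; the actual grid in the notebooks may differ.
param_grid = (ParamGridBuilder()
              .addGrid(lr.regParam, [0.0, 0.01, 0.1])
              .addGrid(lr.elasticNetParam, [0.0, 0.5])
              .build())

f1_eval = MulticlassClassificationEvaluator(labelCol="churn", metricName="f1")
cv = CrossValidator(estimator=lr, estimatorParamMaps=param_grid,
                    evaluator=f1_eval, numFolds=3)

cv_model = cv.fit(train)
predictions = cv_model.transform(test)

auc_eval = BinaryClassificationEvaluator(labelCol="churn",
                                         metricName="areaUnderROC")
print("Test F1 :", f1_eval.evaluate(predictions))
print("Test AUC:", auc_eval.evaluate(predictions))
```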
In this project, churn prediction was performed on an event data set from a music streaming provider, which is essentially a binary classification problem. After loading and cleaning the data, I performed some exploratory analysis that informed the feature engineering step. Altogether 13 explanatory features were selected, and logistic regression, random forest, and gradient-boosted tree models were each fitted to a training data set.
Model performance was best for logistic regression on the small data set, with an F1 score of 73.10% on the test set. The other two models both suffered from overfitting. Hyperparameter tuning and cross-validation did not help much with the overfitting, probably because of the small sample size. Due to time and budget limitations, the final models were not tested on the big data set. However, the fully scalable process sheds light on solving the churn prediction problem on big data with Spark in the cloud.
Finer feature engineering
Feature engineering was one of the most important steps in this project. Because of the trade-off between model performance and computation capacity, it was impossible and unnecessary to select as many features as I wanted. Fewer features were tested on the full data set and the model performance was not satisfying, which is why I created more features and tested them on the small data set. Although model performance improved, it was probably still not ideal. Some techniques might help with feature selection, for example the ChiSqSelector provided by Spark ML for selecting significant features (a sketch follows).
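A sketch of how ChiSqSelector could be plugged in, assuming the assembled "features" vector and "churn" label from the pipeline above; the number of top features kept is an arbitrary choice for illustration:

```python
# Sketch: keep the top-k features ranked by a chi-squared test against the label.
from pyspark.ml.feature import ChiSqSelector

selector = ChiSqSelector(numTopFeatures=8,
                         featuresCol="features",
                         labelCol="churn",
                         outputCol="selected_features")
selected_df = selector.fit(features_df).transform(features_df)
```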
Build a balanced training set
An imbalanced sample (with many more rows labeled "0" than "1") was another factor holding back model performance. Introducing class weights was only possible for the logistic regression algorithm. In future work, we could randomly select the same number of "0" rows as "1" rows to create a balanced training set and fit the models, which might improve performance (a sketch of this idea follows).
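As a sketch of this idea (assuming the "churn" label and the train split from the modeling sketch above), the majority class could be down-sampled to roughly the size of the minority class before fitting:

```python
# Sketch: down-sample the majority class ("churn" == 0) so the training set is
# roughly balanced. The sampling fraction comes from the observed class ratio.
n_churn = train.filter("churn = 1").count()
n_stay = train.filter("churn = 0").count()
fraction = n_churn / n_stay

stay_sample = train.filter("churn = 0").sample(withReplacement=False,
                                               fraction=fraction, seed=42)
balanced_train = train.filter("churn = 1").union(stay_sample)
```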
This is my capstone project for the Udacity Data Scientist Nanodegree program. Thanks to Udacity for providing such a wonderful program and making it possible for people with many different backgrounds to step into the field of data science.