Keywords: NLP, Tfidf, Pipeline, GridSearch, Multi-label Classification, Flask, Plotly
In this project, data engineering and machine learning skills were applied to disaster data from Figure Eight to build a model for an API that classifies disaster messages.
In the project repo, messages.csv contains real messages that were sent during disaster events, while categories.csv contains the category information for those messages. A single message can belong to more than one category (multi-label classification). The real-world goal of this project is a machine learning pipeline that categorizes these events so that each message can be routed to the appropriate disaster relief agency.
A web app was created, where a new message can be classified into several categories. The web app also displays three visualizations of the data.
There are three components to this project:

- ETL Pipeline: In the preparation phase, the data were processed in a Jupyter Notebook; refer to `ETL Pipeline Preparation.ipynb` for details. A data cleaning pipeline, `process_data.py` (in the Workspace/Data folder), was then written as a Python script that:
  - Loads the messages and categories datasets
  - Merges the two datasets
  - Cleans the data
  - Stores it in a SQLite database
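A minimal sketch of those four steps is shown below. The `id` and `categories` column names and the `related-1;request-0;...` label encoding are assumptions about the Figure Eight CSV layout; the actual `process_data.py` may differ.

```python
import sqlite3

import pandas as pd


def load_and_clean(messages: pd.DataFrame, categories: pd.DataFrame) -> pd.DataFrame:
    """Merge the two datasets and expand the label string into binary columns."""
    df = messages.merge(categories, on="id")             # merge on the shared id
    cats = df["categories"].str.split(";", expand=True)  # one column per label
    cats.columns = [v.split("-")[0] for v in cats.iloc[0]]  # "related-1" -> "related"
    for col in cats.columns:
        cats[col] = cats[col].str[-1].astype(int)        # keep the trailing 0/1 flag
    df = pd.concat([df.drop(columns="categories"), cats], axis=1)
    return df.drop_duplicates()                          # clean exact duplicates


def save(df: pd.DataFrame, db_path: str, table: str = "messages") -> None:
    """Store the cleaned frame in a SQLite database."""
    with sqlite3.connect(db_path) as conn:
        df.to_sql(table, conn, index=False, if_exists="replace")
```

Each raw label looks like `related-1`, so the column name comes from the text before the hyphen and the binary value from the last character.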
- Machine Learning Pipeline: In the preparation phase, the data were processed in a Jupyter Notebook; refer to `ML Pipeline Preparation.ipynb` for details. A machine learning pipeline, `train_classifier.py` (in the Workspace/Model folder), was then written as a Python script that:
  - Loads data from the SQLite database
  - Splits the dataset into training and test sets
  - Builds a text processing and machine learning pipeline
  - Trains and tunes a model using GridSearchCV
  - Outputs results on the test set
  - Exports the final model as a pickle file
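The pipeline-building and export steps can be sketched as follows. The estimator choice and the parameter grid are illustrative assumptions, not the tuned settings in `train_classifier.py`.

```python
import pickle

from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.multioutput import MultiOutputClassifier
from sklearn.pipeline import Pipeline


def build_model() -> GridSearchCV:
    """Text processing + multi-label classification pipeline, tuned with GridSearchCV."""
    pipeline = Pipeline([
        ("tfidf", TfidfVectorizer()),        # tokenize and TF-IDF-weight the messages
        ("clf", MultiOutputClassifier(       # one binary classifier per category
            RandomForestClassifier(n_estimators=10, random_state=0))),
    ])
    params = {"tfidf__ngram_range": [(1, 1), (1, 2)]}  # illustrative grid only
    return GridSearchCV(pipeline, params, cv=2)


def export_model(model, path: str) -> None:
    """Export the fitted model as a pickle file."""
    with open(path, "wb") as f:
        pickle.dump(model, f)
```

`MultiOutputClassifier` wraps the base estimator so each category column gets its own classifier, which is what makes the multi-label setup work with a single pipeline.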
- Flask Web App: Here's the file structure of the project:
  - app
    - template
      - master.html # main page of web app
      - go.html # classification result page of web app
    - run.py # Flask file that runs the app
  - data
    - disaster_categories.csv # data to process
    - disaster_messages.csv # data to process
    - process_data.py # data processing module
    - InsertDatabaseName.db # database to save clean data to
  - models
    - train_classifier.py # model training module
    - classifier.pkl # saved model
  - README.md
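Given that structure, the project would typically be run in three steps. The exact command-line arguments below are assumptions; check each script's usage message before running.

```shell
# 1. Clean the data and store it in SQLite (argument order assumed)
python data/process_data.py data/disaster_messages.csv data/disaster_categories.csv data/InsertDatabaseName.db

# 2. Train and tune the model, then export it as a pickle file
python models/train_classifier.py data/InsertDatabaseName.db models/classifier.pkl

# 3. Launch the Flask web app
python app/run.py
```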