Datarevenue Code Challenge - Wine Rating Predictor

In this project, we built a proof-of-concept Wine Rating Predictor.

To train our machine learning model, we used some of the features that originally came with the dataset, along with numerical features extracted from the original text fields.
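
As a rough illustration of turning text fields into numerical features, here is a minimal sketch. The field names (`description`, `title`), the chosen features, and the sample values are assumptions for illustration only, not the features this project actually uses, and the sketch does not attempt the VADER sentiment scoring mentioned later in this README.

```python
import re


def text_features(description: str, title: str) -> dict:
    """Derive a few simple numeric features from two free-text fields."""
    words = description.split()
    # Assumption: a four-digit year starting with 19 or 20 in the title
    # is the wine's vintage.
    match = re.search(r"\b(19|20)\d{2}\b", title)
    return {
        "description_chars": len(description),
        "description_words": len(words),
        "vintage": int(match.group(0)) if match else None,
    }


# Made-up example record, not taken from the dataset.
example = text_features(
    "Ripe and fruity, with soft tannins.",
    "Hypothetical Estate 2012 Reserve Red",
)
no_year = text_features("Crisp and dry.", "Hypothetical Estate Reserve White")
```

Simple counts like these are cheap to compute and give a tree-based model something numeric to split on, but whether they carry signal for the rating has to be checked experimentally.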

Using a basic linear regression model as our baseline, we achieved an MSE of 5.0 on our regression problem. We improved on this with an XGBoost model with default hyperparameters, which achieved an MSE of 3.55.
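
For readers less familiar with the metric, MSE is the average of the squared differences between true and predicted ratings. Taking the square root gives an error in rating points: an MSE of 5.0 is roughly 2.24 points and an MSE of 3.55 is roughly 1.88 points. The sketch below computes MSE in plain Python; the ratings and predictions are made-up numbers, not project results.

```python
def mean_squared_error(y_true, y_pred):
    """Average squared difference between true and predicted values."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


# Made-up values for illustration only.
ratings = [88, 90, 92, 85]
predictions = [87, 91, 90, 86]

mse = mean_squared_error(ratings, predictions)  # (1 + 1 + 4 + 1) / 4
rmse = mse ** 0.5
```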

Our pipeline runs with Docker and Luigi and can be run on any machine that has Docker and Docker Compose installed. Five tasks are run in cascade:

Download Data
Clean Data
Make Dataset
Train Model
Evaluate Model
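
The cascade above can be pictured as a dependency graph in which each task waits for the tasks it depends on. The sketch below uses Python's standard-library `graphlib` rather than Luigi itself, and the dependency edges are an assumption inferred from the task order, so treat it as an illustration of the idea rather than the orchestrator's actual code.

```python
from graphlib import TopologicalSorter

# Each task maps to the set of tasks whose output it needs.
# These edges are assumed from the task order, not read from the repo.
DEPENDENCIES = {
    "Download Data": set(),
    "Clean Data": {"Download Data"},
    "Make Dataset": {"Clean Data"},
    "Train Model": {"Make Dataset"},
    "Evaluate Model": {"Train Model", "Make Dataset"},
}


def execution_order(dependencies):
    """Return the tasks in an order that respects every dependency."""
    return list(TopologicalSorter(dependencies).static_order())


order = execution_order(DEPENDENCIES)
```

Luigi resolves a graph like this on its own: each task declares what it requires and what it outputs, and a task whose output already exists is skipped.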

Command to build images: ./build-task-images.sh 0.1

Command to run the pipeline: docker-compose up orchestrator

Download Data - Downloads the Wine Rating dataset.

Clean Data - Cleans the dataset, drops irrelevant features and deals with null values. The output of the task is a 'Clean.csv' dataset in the '/data_root/interim' folder.
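
A minimal sketch of this kind of cleaning step follows. The column names, the choice of which columns to drop or require, and the 'Unknown' placeholder are all hypothetical; the actual rules live in the clean_dataset task.

```python
def clean_rows(rows, drop_columns, required_columns, fill_value="Unknown"):
    """Drop unwanted columns, skip rows missing required values, fill other nulls."""
    cleaned = []
    for row in rows:
        # Skip rows that lack a value we cannot do without (e.g. the target).
        if any(row.get(col) in (None, "") for col in required_columns):
            continue
        kept = {k: v for k, v in row.items() if k not in drop_columns}
        # Replace any remaining nulls with a placeholder value.
        kept = {k: (fill_value if v in (None, "") else v) for k, v in kept.items()}
        cleaned.append(kept)
    return cleaned


# Made-up rows with hypothetical column names.
rows = [
    {"points": 90, "country": "Portugal", "taster_twitter": "@example"},
    {"points": None, "country": "Italy", "taster_twitter": None},
    {"points": 87, "country": "", "taster_twitter": None},
]
cleaned = clean_rows(rows, drop_columns={"taster_twitter"}, required_columns=["points"])
```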

Make Dataset - Splits the clean dataset into 80% for the train set and 20% for the test set. The output of the task is a 'Test.csv' and a 'Train.csv' dataset in the 'data_root/partition' folder.
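
An 80/20 split can be sketched in a few lines of plain Python. The shuffling, the fixed seed, and the function name are assumptions made for this illustration; the README does not say how the make_dataset task performs the split.

```python
import random


def train_test_split(rows, test_fraction=0.2, seed=42):
    """Shuffle rows reproducibly and return (train, test) lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


# 100 placeholder rows stand in for the cleaned dataset.
train, test = train_test_split(range(100))
```

Fixing the seed keeps the partition identical between runs, which makes MSE figures from different models comparable.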

Train Model - Trains an XGBoost model on the train dataset. The output of the task is a 'trained_model.sav' file in the 'data_root/model' folder. This .sav file is the serialized model.
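
Serializing a model to a file and loading it back can look like the sketch below. It assumes Python's standard `pickle` module, which the README does not confirm, and uses a plain dictionary as a stand-in for the trained XGBoost model so that the example runs without extra dependencies.

```python
import pickle
import tempfile
from pathlib import Path


def save_model(model, path):
    """Serialize a model object to disk, creating parent folders as needed."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("wb") as f:
        pickle.dump(model, f)


def load_model(path):
    """Load a previously serialized model object."""
    with Path(path).open("rb") as f:
        return pickle.load(f)


# Stand-in for a trained model; the values are made up.
model = {"kind": "placeholder", "mean_rating": 88.5}

with tempfile.TemporaryDirectory() as tmp:
    model_path = Path(tmp) / "model" / "trained_model.sav"
    save_model(model, model_path)
    restored = load_model(model_path)
```

Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.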

Evaluate Model - Evaluates the model using the MSE metric. The output of the task is a 'report.pdf' file with the model metrics and the most important graphs.

We believe that this result can be further improved by tuning both the hyperparameters and the VADER sentiment analysis tool to our specific case. More experiments with different types of encoders may also help us achieve better results. We recommend implementing a full production solution, not only for the promising predictor metrics, but also for its potential to provide important insight to our customer about their products and commercial partners.

camilaodsouza/Wine-Rating-Project
