dataPreparation

Data preparation accounts for about 80% of the work in a data science project. Let's bring that number down. dataPreparation lets you do most of the painful data preparation for a data science project with a minimal amount of code.

This package is:

fast (uses data.table and exponential search)
RAM efficient (performs operations by reference and column-wise to avoid copying data)
stable (most exceptions are handled)
verbose (logs a lot)
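
To illustrate what "by reference" means here, the following is a minimal sketch using plain data.table rather than code taken from dataPreparation itself. The table `dt` and its columns are made up for the example. It shows that `:=` and `set()` change a table in place instead of returning a modified copy.

```r
library(data.table)

# Hypothetical toy table, used only for this illustration
dt <- data.table(x = 1:5, y = letters[1:5])
addr_before <- address(dt)

# Add a column by reference: no copy of dt is made
dt[, x2 := x * 2L]

# Overwrite an existing column by reference with set()
set(dt, j = "y", value = toupper(dt$y))

addr_after <- address(dt)

# The table still lives at the same memory address
print(identical(addr_before, addr_after))
```

Because nothing is copied, memory use stays close to the size of the data itself, which matters on large data sets.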

Main preparation steps

Before using any machine learning (ML) algorithm, one needs to prepare the data. Preparing a data set for a data science project can be long and tricky. The main steps are as follows:

Read: load the data set (this package does not cover this step; for CSV files we recommend data.table::fread)
Correct: most of the time there are some mistakes after reading (wrong formats, for example) that have to be corrected
Transform: create new features from date, categorical, and character columns in order to have information usable by an ML algorithm (i.e. numeric or categorical)
Filter: get rid of useless information in order to speed up computation
Pre model manipulation: model-specific manipulations (handling NA, discretization, one-hot encoding, scaling...)
Shape: put your data set in a shape usable by an ML algorithm

Here are the functions available in this package to tackle those issues:

Correct	Transform	Filter	Pre model manipulation	Shape
un_factor	generate_date_diffs	fast_filter_variables	fast_handle_na	shape_set
find_and_transform_dates	generate_factor_from_date	which_are_constant	fast_discretization	same_shape
find_and_transform_numerics	aggregate_by_key	which_are_in_double	fast_scale	set_as_numeric_matrix
set_col_as_character	generate_from_factor	which_are_bijection		one_hot_encoder
set_col_as_numeric	generate_from_character	remove_sd_outlier
set_col_as_date	fast_round	remove_rare_categorical
set_col_as_factor	target_encode	remove_percentile_outlier

All of those functions are integrated in the full pipeline function prepare_set.

For more details on how it works, see our tutorial.
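
As a rough sketch of how some of these steps can be run one at a time instead of through prepare_set, the example below calls a few of the functions from the table with their default arguments on the package's toy data set. It assumes each function accepts the data set as its first argument and returns the updated data set. The choice and order of steps are illustrative only; check each function's documentation for its parameters.

```r
library(dataPreparation)

# Toy data set shipped with the package
data(messy_adult)

# Correct: undo wrongly read factors, then detect numerics and dates
messy_adult <- un_factor(messy_adult)
messy_adult <- find_and_transform_numerics(messy_adult)
messy_adult <- find_and_transform_dates(messy_adult)

# Filter: drop columns that carry no useful information
messy_adult <- fast_filter_variables(messy_adult)

# Pre model manipulation: replace missing values
messy_adult <- fast_handle_na(messy_adult)

head(messy_adult)
```

Running the steps individually makes it easier to read the log messages of each one and to skip steps that do not apply to your data.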

Getting started: 30 seconds to dataPreparation

Installation

Install the package from CRAN:

install.packages("dataPreparation")

To get the latest features, install the package from GitHub:

library(devtools)
install_github("ELToulemonde/dataPreparation")

Test it

Load a toy data set

library(dataPreparation)
data(messy_adult)
head(messy_adult)

Run the full pipeline function

clean_adult <- prepare_set(messy_adult)
head(clean_adult)

That's it. For details on all functions, check out the documentation and/or the tutorial vignette.

How to Contribute

dataPreparation has been developed and used by many active community members. Your help is very valuable to make it better for everyone.

Check out the call for contributions to see what can be improved, or open an issue if you want something.
Contribute new useful features.
Contribute to the tests to make the package more reliable.
Contribute to the documentation to make it clearer for everyone.
Contribute to the examples to share your experience with other users.
Open an issue if you run into problems during development.

For more details, please refer to CONTRIBUTING.
