V 1.1.1 (June 2023)
==================
- FEAT:
- Speed up examples by providing and using a `tiny_messy_adult` data set.
- FIX:
- Fix typos
- TECH:
- Speed up CI for macOS
V 1.1.0
=======
- FEAT:
- Drop support for R versions before 3.6, and add support for R 4.2 and 4.3
- BUGFIX:
- Fix documentation
- TECH:
- Upgrade package install in CI
V 1.0.5 (July 2022)
==================
FEAT:
- New functions *compute_probability_ratio* and *compute_weight_of_evidence* to be used for target encoding
- New function *get_most_frequent_element* to identify the most frequent element in a list
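
To make the two target-encoding aggregates concrete, here is a minimal Python sketch. It is not the package's code: it assumes a binary (0/1) target, assumes the probability ratio is p / (1 - p) with p the share of 1s, and assumes the weight of evidence is the natural log of that ratio. All function names below are hypothetical.

```python
import math
from collections import Counter


def probability_ratio(targets):
    """Return p / (1 - p), where p is the share of 1s in a binary target."""
    p = sum(targets) / len(targets)
    # Undefined when every target is 1 (p == 1); a real implementation
    # would need to decide how to handle that case.
    return p / (1 - p)


def weight_of_evidence(targets):
    """Return the natural log of the probability ratio (assumed definition)."""
    # Undefined when no target is 1 (ratio == 0).
    return math.log(probability_ratio(targets))


def most_frequent_element(values):
    """Return the value that occurs most often (first seen wins ties)."""
    return Counter(values).most_common(1)[0][0]


# Targets observed for one category of a categorical column.
group_targets = [1, 1, 1, 0]
ratio = probability_ratio(group_targets)
woe = weight_of_evidence(group_targets)
top = most_frequent_element(["a", "b", "a", "c"])
```

In a target-encoding setting, such an aggregate would be computed once per category and then used in place of the category value.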
V 1.0.4
=======
BUGFIX: Fix *generate_from_character*: when there were some NAs in the column, it would drop the line. This is no
longer the case.
V 1.0.3
=======
BUGFIX: Fix bug in *fast_is_bijection* when a column has multiple classes
FEAT: Harmonize logging levels between functions
V 1.0.2
=======
Remove useless dependencies.
Make sure the library works on Windows, macOS, Ubuntu, and R versions from 3.3 to 4.1.
V 1.0.1
=======
Based on CRAN feedback, removed problematic vignettes.
V 1.0.0
=======
Version 1.0.0 brings a lot of changes and is not compatible with previous versions of the package.
Code using a previous version of this package might need some rework (and we are sorry about it), but we
strongly believe that this version is easier to use, faster, and more maintainable over time.
In this version:
- All function names and variables are snake_case (there used to be a mix of camel case and snake case)
- We removed a lot of useless code that was slowing down the package (particularly garbage collection)
- We made the code more readable so that it is easier to contribute to this package
- Logging is more explicit and cleaner.
- We took into account linting.
- A few more functions are available.
We hope that you will like this new version of the package even more. Please don't hesitate to provide feedback, warn us
about bugs, suggest improvements or, even better, develop some improvements to this package. To do so, please go to
GitHub (https://github.com/ELToulemonde/dataPreparation/).
V 0.4.3
=======
- Fix:
- In *same_shape*: fixed a bug that would have appeared with an upcoming change in class "matrix", by implementing
2 functions to check class.
V 0.4.2
=======
- Fix test:
- Case "*build_encoding*: min_frequency allows to drop rare values" was not built correctly.
V 0.4.1
=======
- New features:
- New functions:
- Functions *target_encode* and *build_target_encoding* have been implemented to provide target encoding which
is the process of replacing a categorical value with the aggregation of the target variable.
- Function *remove_sd_outlier* helps to remove rows that have numerical values too extreme.
- Function *remove_percentile_outlier* helps to remove rows that have numerical values too extreme (based on
percentile analysis).
- Function *remove_rare_categorical* helps to remove rows that have categorical values too rare.
- New features in existing functions:
- Function *prepare_set* integrates the *target_encode* function. It is called by providing *target_col* and
*target_encoding_functions*.
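
The idea of target encoding described above (replace a categorical value with an aggregation of the target) can be sketched in Python as follows. This is an illustration only, not the package's implementation: the helper names, the choice of the mean as aggregation, and the sample data are all assumptions.

```python
from collections import defaultdict
from statistics import mean


def build_mean_encoding(categories, targets):
    """Learn, for each category, the mean of the target (hypothetical helper)."""
    groups = defaultdict(list)
    for category, target in zip(categories, targets):
        groups[category].append(target)
    return {category: mean(values) for category, values in groups.items()}


def apply_encoding(categories, encoding, default=None):
    """Replace each category with its learned value; unseen ones get `default`."""
    return [encoding.get(category, default) for category in categories]


# The encoding is learned on the train set only...
train_city = ["paris", "lyon", "paris", "lyon", "paris"]
train_target = [10.0, 1.0, 30.0, 3.0, 50.0]
encoding = build_mean_encoding(train_city, train_target)

# ...and then reused unchanged on the test set.
test_city = ["lyon", "paris", "nice"]
encoded_test = apply_encoding(test_city, encoding)
```

Separating the "build" step from the "apply" step is what allows the same transformation to be used on train and test.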
V 0.4.0
=======
- New features:
- New features in existing functions:
- To avoid issues based on column names, we now check and rename columns that have the same name.
- In *aggregate_by_key* generated column names are changed to be more explicit.
- In *aggregate_by_key* the column generated from a character column with more than *thresh* values is now a count of
unique values instead of a count.
- Added missing *auto* default values on cols
- Bug fixes:
- *which_are_bijection* and *which_are_in_double* use *bi_col_test*, which was not working with a 2-column data
set. It is fixed.
- *prepare_set* optional argument *factor_date_type* was not working. It is fixed.
- Other changes:
- Changed *which_are_included* example since it was too slow for CRAN. Also it might be a little bit more explicit
now.
- Changed *aggregate_by_key* example since it was too slow for CRAN.
- Integration:
- Rewrite all tests to make them more readable
- Code coverage is improved, dependency on the *messy_adult* data set is reduced
WARNING:
- In *aggregate_by_key* generated column names are changed.
- In *aggregate_by_key* generated column for character is different.
V 0.3.9
=======
- Integration:
- Matching new devtools requirements
- Starting to rewrite unit tests to make them more readable
V 0.3.8
=======
- New features:
- New features in existing functions:
- Identification of bijection through internal function *fast_is_bijection* is way faster (up to 40 times faster in
case of bijection). So *whichArebijection* and *fastFiltervariables* are also improved.
- Remove remaining *gc* to save time.
- In *one_hot_encoder* added parameter *type* to choose between logical or numerical results.
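
For readers unfamiliar with the bijection test mentioned above, here is one possible way to check it, sketched in Python. It is not the algorithm used by *fast_is_bijection*; the function name and the approach (comparing counts of distinct values and distinct pairs) are assumptions chosen for illustration.

```python
def is_bijection(col_a, col_b):
    """Return True when each value of col_a pairs with exactly one value of
    col_b and vice versa (hypothetical helper)."""
    if len(col_a) != len(col_b):
        raise ValueError("columns must have the same length")
    n_distinct_a = len(set(col_a))
    # Cheap early exit: a bijection needs as many distinct values on each side.
    if n_distinct_a != len(set(col_b)):
        return False
    # With equal distinct counts, the columns are in bijection exactly when
    # the number of distinct (a, b) pairs equals that count.
    return len(set(zip(col_a, col_b))) == n_distinct_a


same_information = is_bijection([1, 2, 1, 3], ["x", "y", "x", "z"])
different_information = is_bijection([1, 2, 1, 3], ["x", "y", "z", "z"])
```

Two columns in bijection carry the same information, which is why one of them can be filtered out.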
V 0.3.7
=======
- New features:
- New functions:
- Function *as.POSIXct_fast* is now available. It helps to transform to POSIXct way faster (if the same date
value is present multiple times in the column).
- New features in existing functions:
- Date identification is faster: the search for a format is computed only on unique values.
- In date transformation, we made it faster by using *as.POSIXct_fast* when it is necessary.
- Functions *findAndTransFormDates*, *find_and_transform_numerics* and *un_factor* now accept argument *cols* to
limit search.
- Bug fixes:
- Check that the over-allocate option is activated on every data.table to avoid issues with set. The package should be
more robust.
- In bijection search (internal function *fast_is_bijection*) there was a bug on some rare cases. Fixed but slower.
- Code quality:
- Improving code quality using lintr
- Suppressing some useless code
- Meeting new covr standard
- Improve log of setColAsXXX
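
The speed-up behind *as.POSIXct_fast* and the faster date identification in this version comes from working on unique values only. The same idea can be sketched in Python; this is an analogy, not the package's R code, and the function name is hypothetical.

```python
from datetime import datetime


def parse_dates_on_unique(values, fmt):
    """Parse each distinct string once, then map the results back to every
    position (hypothetical helper)."""
    parsed = {value: datetime.strptime(value, fmt) for value in set(values)}
    return [parsed[value] for value in values]


raw_dates = ["2017-07-01", "2017-07-02", "2017-07-01", "2017-07-01"]
dates = parse_dates_on_unique(raw_dates, "%Y-%m-%d")
```

The gain grows with the number of repeated values: here four strings require only two parsing calls.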
V 0.3.6
=======
- Bug fixes:
- *identify_dates* had a weird bug. Solved
- Integration:
- Making dataPreparation compatible with testthat 2.0.0
V 0.3.5
=======
- New features:
- New features in existing functions:
- *findAndTransFormDates* now has an *ambiguities* parameter: IGNORE to work as before, WARN to check for
ambiguities and print them, SOLVE to try to solve ambiguities on more lines.
- *one_hot_encoder* now uses the *build_encoding* function to be able to build the same encoding on train and on
test.
- *aggregate_by_key* is now way faster on numerics. But it changed the way it gets input functions.
- *fast_scale* now has a *way* parameter which allows you to either scale or unscale. Unscaling numeric values can
be very useful for most post-model analysis.
- *set_col_as_date* now accepts multiple formats in a single call.
- New functions:
- *build_encoding* builds a list of encodings to be used by *one_hot_encoder*; it also has a parameter
*min_frequency* to control that rare values don't result in new columns.
- Previously private function *identify_dates* is now exported, to be able to perform the same transformation on
train and on test.
- Adding *dataPreparationNews* function to open NEWS file (inspired from rfNews() of randomForest package)
- Bug fixes:
- *findAndTransFormDates*: bug fixed: user formats weren't used.
- *identify_dates*: some formats were tested but would never work. They have been removed.
- Refactoring:
- Unit tests partly reviewed to be more readable and more efficient. Unit test time has been divided by 3.
- Improving input control for more robust functions
WARNING:
- *one_hot_encoder* now requires you to run *build_encoding* first.
- *aggregate_by_key* now requires functions to be passed by character name
This version makes (as much as possible) transformations reproducible on train and test sets. This is to prepare a
future pipeline feature.
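
The two-step workflow introduced in this version (build an encoding on train, then apply it to both train and test) can be illustrated with the following Python sketch. It is not the package's code: the function names are hypothetical, and it assumes *min_frequency* means the minimal share of rows a value must represent to get its own column.

```python
from collections import Counter


def build_levels(values, min_frequency=0.0):
    """Keep the values whose share of rows is at least min_frequency
    (hypothetical helper, assumed semantics)."""
    counts = Counter(values)
    n_rows = len(values)
    return sorted(v for v, count in counts.items() if count / n_rows >= min_frequency)


def one_hot(values, levels):
    """Emit one 0/1 column per learned level; unknown values give all zeros."""
    return [[1 if value == level else 0 for level in levels] for value in values]


train = ["red", "red", "blue", "red", "green", "blue", "red", "red", "blue", "red"]
levels = build_levels(train, min_frequency=0.2)  # "green" (10% of rows) is dropped

test = ["green", "red", "purple"]
encoded_test = one_hot(test, levels)
```

Because the levels are fixed at build time, train and test always end up with the same columns, even when the test set contains rare or unseen values.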
V 0.3.4
========
- Improvement of function
- *which_are_bijection*: It is 2 to 15 times faster than previous version.
- *which_are_included*: It is a bit faster.
- Bug fixes:
- *generate_factor_from_date*: default value was missing. Fixed.
- New features:
- New features in existing functions:
- *fast_filter_variables* has a new parameter (level) to choose which types of filtering to perform
WARNING:
- *which_are_included*: in case of bijection (col1 is a bijection of col2), they are both included in the other, but the
choice of the one to drop might have changed in this version.
V 0.3.3
========
- New features:
- New features in existing functions:
- *findAndTransFormDates* now recognizes date character even if there are multiple separators in the date (ex: "2016,
Jan-26").
- *findAndTransFormDates* now recognizes date character even if there are leading and trailing white spaces.
WARNING:
- *date3* column in *messy_adult* data set has changed in order to illustrate the recognition of date character even if
there are leading and/or trailing white spaces.
- *date4* column in *messy_adult* data set has changed in order to illustrate the recognition of date character even if
there are multiple separators.
V 0.3.2
========
- Change URLs to meet CRAN requirement
v 0.3.1
=======
- Fix bug in LaTeX documentation
v 0.3
=====
- New features:
- New features in existing functions:
- *findAndTransFormDates* now recognizes date character even if "0" are not present in month or day part, and
months written as lower case strings.
- *findAndTransFormDates* and *set_col_as_date* now work with *factors*.
- New functions:
- *fast_discretization*: to perform equal freq or equal width discretization on a data set using *data.table*
power.
- *fast_scale*: to perform scaling on a data set using *data.table* power.
- *one_hot_encoder*: to perform one_hot encoding on a data set using *data.table* power.
- New documentation:
- A new vignette to illustrate how to build a correct *train* and *test* set using data preparation
- Minor changes in log (in particular regarding progress bars and typos)
- Due to dependencies issues with *tcltk*, we stop using it and start using *progress*
- Refactoring:
- Private function *real_cols* takes more importance to control that columns have the correct types and to handle
the "auto" value.
- Making code faster: some functions are up to **30% faster**
- Review unit testing to be faster
- Unit test evolution to be more readable
WARNING:
- *date1* column in *messy_adult* data set has changed in order to illustrate the recognition of date character even
if "0" are not present in month or day part.
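
As a reminder of the difference between the two strategies offered by *fast_discretization* (equal width and equal freq), here is a small Python sketch. It is not the package's implementation; the function names and the exact way the edges are chosen are assumptions.

```python
from bisect import bisect_right


def equal_width_edges(values, n_bins):
    """Inner bin edges that split the [min, max] range into equal widths."""
    low, high = min(values), max(values)
    width = (high - low) / n_bins
    return [low + i * width for i in range(1, n_bins)]


def equal_freq_edges(values, n_bins):
    """Inner bin edges chosen so each bin holds roughly the same number of rows."""
    ordered = sorted(values)
    n_rows = len(ordered)
    return [ordered[(i * n_rows) // n_bins] for i in range(1, n_bins)]


def assign_bins(values, edges):
    """Index of the bin of each value (a value equal to an edge goes to the upper bin)."""
    return [bisect_right(edges, value) for value in values]


values = [1, 2, 3, 4, 5, 6, 7, 8, 100]
width_bins = assign_bins(values, equal_width_edges(values, 3))
freq_bins = assign_bins(values, equal_freq_edges(values, 3))
```

With an outlier such as 100, equal width puts almost every row in the first bin, while equal freq keeps the bins balanced.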
v 0.2
=====
- Improving unit testing and code coverage
- Improving documentation
- Solving minor bug in date conversion and in which functions
- New features:
- New functions:
- *un_factor* to un-factor columns, when reading wasn't performed in expected way.
- *same_shape* to make sure that train and test sets have exactly the same shape.
- generate new columns from existing columns (generate functions)
- generate factor from dates: *generate_factor_from_date*
- diffDates becomes *generate_date_diffs* (for better name understanding).
- generate numerics and booleans from character or factors (using *generate_from_factor* and
*generate_from_character*)
- *set_col_as_factor* a function to make multiple columns as factor and controlling number of unique elements
- New features in existing functions:
- which functions: add *keep_cols* argument to make sure that they are not dropped
- fast_filter_variables: *verbose* can be T/F or 0, 1, 2 in order to control level of verbosity
- *findAndTransFormDates* and *set_col_as_dates* now recognize and accept timestamps.
WARNING:
- If you were using *diffDates*, it is now called *generate_date_diffs*
- *date2* column in *messy_adult* data set has changed in order to illustrate new timestamp features
- *set_col_as_factorOrLogical* doesn't exist anymore: it has been split between *set_col_as_factor* and *generateFromCat*
- Considering all those changes: *shape_set* and *prepare_set* don't give the same result anymore.
v 0.1: release on CRAN July 2017
================================