This project was an assignment in the online course Getting and Cleaning Data. Its purpose is to demonstrate the student's ability to collect, work with, and clean a data set. The objective is to prepare tidy data that can be used for later analysis.
One of the most exciting areas in all of data science right now is wearable computing (see, for example, this article). Companies like Fitbit, Nike, and Jawbone Up are racing to develop the most advanced algorithms to attract new users. The data linked to from the course website represent data collected from the accelerometers of the Samsung Galaxy S smartphone. A full description is available at the site where the data was obtained, the UCI Machine Learning Repository.
The data for the project can be obtained from here.
Create an R script called `run_analysis.R` that carries out the following steps:
1. Merges the training and the test sets to create one data set;
2. Extracts only the measurements on the mean and standard deviation for each measurement;
3. Uses descriptive activity names to name the activities in the data set;
4. Appropriately labels the data set with descriptive variable names;
5. From the data set in step 4, creates a second, independent tidy data set with the average of each variable for each activity and each subject;
6. Exports the tidy data set as a txt file created with `write.table()` using `row.names = FALSE`.
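The steps above can be sketched end to end on a tiny synthetic data set. This is only an illustration in base R: the feature names mimic the style of `features.txt`, the values are random, and the object names (`make_set`, `merged`, `tidy`) are assumptions, not the names used by the project's scripts.

```r
# Tiny synthetic stand-ins for the real files (all values are made up).
features <- c("tBodyAcc-mean()-X", "tBodyAcc-std()-X", "tBodyAcc-max()-X")
activity_labels <- data.frame(id = 1:2, activity = c("WALKING", "SITTING"),
                              stringsAsFactors = FALSE)

make_set <- function(subject, activity, seed) {
  set.seed(seed)
  x <- as.data.frame(matrix(rnorm(length(subject) * length(features)),
                            ncol = length(features)))
  names(x) <- features
  cbind(subject = subject, activity_id = activity, x)
}
train <- make_set(c(1, 1, 2, 2), c(1, 2, 1, 2), seed = 1)
test  <- make_set(c(1, 2, 3, 3), c(1, 1, 2, 2), seed = 2)

# Merge the training and the test sets.
merged <- rbind(train, test)

# Keep only the mean() and std() measurements, plus the id columns.
is_id   <- names(merged) %in% c("subject", "activity_id")
is_stat <- grepl("mean\\(\\)|std\\(\\)", names(merged))
merged  <- merged[, is_id | is_stat]

# Replace activity ids with descriptive activity names.
merged$activity <- activity_labels$activity[
  match(merged$activity_id, activity_labels$id)]
merged$activity_id <- NULL

# Use descriptive, syntactically valid variable names.
names(merged) <- gsub("-", "_", gsub("\\(\\)", "", names(merged)))

# Average of each variable for each activity and each subject.
tidy <- aggregate(. ~ subject + activity, data = merged, FUN = mean)

# Export the tidy data set as a text file.
out <- file.path(tempdir(), "tidy.txt")
write.table(tidy, out, row.names = FALSE)
```

The result has one row per subject/activity combination that occurs in the merged data and one column per retained measurement.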
The project comprises several directories and files, all of which are outlined below.
In the project directory you will find several subdirectories.
<project directory>/
+- CodeBook.md
+- data/
| +- getdata%2Fprojectfiles%2FUCI%20HAR%20Dataset.zip
| +- UCI HAR Dataset/
| | +- activity_labels.txt
| | +- features.txt
| | +- features_info.txt
| | +- README.txt
| | +- test/
| | | +- Inertial Signals/
| | | | +- body_acc_x_test.txt
| | | | +- body_acc_y_test.txt
| | | | +- body_acc_z_test.txt
| | | | +- body_gyro_x_test.txt
| | | | +- body_gyro_y_test.txt
| | | | +- body_gyro_z_test.txt
| | | | +- total_acc_x_test.txt
| | | | +- total_acc_y_test.txt
| | | | +- total_acc_z_test.txt
| | | +- subject_test.txt
| | | +- X_test.txt
| | | +- y_test.txt
| | +- train/
| | | +- Inertial Signals/
| | | | +- body_acc_x_train.txt
| | | | +- body_acc_y_train.txt
| | | | +- body_acc_z_train.txt
| | | | +- body_gyro_x_train.txt
| | | | +- body_gyro_y_train.txt
| | | | +- body_gyro_z_train.txt
| | | | +- total_acc_x_train.txt
| | | | +- total_acc_y_train.txt
| | | | +- total_acc_z_train.txt
| | | +- subject_train.txt
| | | +- X_train.txt
| | | +- y_train.txt
| +- UCI_HAR_Dataset_Messy.txt
| +- UCI_HAR_Dataset_Tidy.txt
+- doc/
+- Getting and Cleaning Data Course Project.Rproj
+- lib/
| +- chkdir.R
| +- dldat.R
| +- extvars.R
| +- mergetesttrain.R
| +- messydat.R
| +- rdactlbl.R
| +- rdfeatlbl.R
| +- rdtest.R
| +- rdtrain.R
| +- repidwithlbl.R
| +- setvarnames.R
| +- tidydat.R
| +- wrdat.R
+- README.md
+- run_analysis.R
The `data` directory contains the downloaded data, both as a compressed archive (zip) and as the uncompressed individual data files.
Inside the `doc` directory you can find the R Markdown Cheatsheet and R Markdown Reference Guide.
Any additional supporting scripts or libraries are stored inside the `lib` directory.
For reasons of readability, testing, profiling, benchmarking, and reusability, the script is not kept in one single file. Instead, it is split into several functions, each placed in its own script file inside the `lib` directory.
The main script, `run_analysis.R`, is the entry point of the project. When executed, it loads all required libraries and supporting scripts, and defines global variables and constants.
Steps carried out by this script:
- check for existence of the `doc` and `data` directories, and create these if required;
- download and uncompress the data file (raw data) if not already available;
- create the messy data (merge train and test data sets), and store it in data.table `dtMessy`;
- create tidy data from the messy data, and store it in data.table `dtTidy`;
- write the messy and tidy data sets to corresponding text files located in the `data` directory.
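Loading the supporting scripts from `lib` could be done along the following lines. This is a minimal sketch, assuming every `*.R` file in the directory should be sourced; the helper name `source_dir` is hypothetical and not taken from the project.

```r
# Source every *.R file found in a directory (hypothetical helper).
source_dir <- function(path) {
  files <- list.files(path, pattern = "\\.R$", full.names = TRUE)
  for (f in files) {
    source(f, local = FALSE)  # define the functions in the global environment
  }
  invisible(basename(files))
}
```

A main script would then call `source_dir("lib")` once, before using any of the helper functions.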
Check whether the specified directories exist and, if not, create them.
Parameter | Description |
---|---|
dname | vector of directory names to check/create |
mkdir | create directories if non-existing |
Table: Input parameter
Result | Description |
---|---|
TRUE | success; directories exist/have been created successfully |
FALSE | failure; directories do not exist/cannot be created |
Table: Output value
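A directory check with this contract could be sketched as follows. The function name `chkdir_sketch` and the default `mkdir = TRUE` are assumptions; only the parameter names `dname` and `mkdir` and the TRUE/FALSE result come from the tables above.

```r
# Hypothetical re-implementation: returns TRUE only if every directory
# in dname exists (or has been created) when the function finishes.
chkdir_sketch <- function(dname, mkdir = TRUE) {
  ok <- vapply(dname, function(d) {
    if (dir.exists(d)) return(TRUE)
    if (!mkdir) return(FALSE)
    # dir.create() returns TRUE on success and FALSE on failure
    dir.create(d, recursive = TRUE, showWarnings = FALSE)
  }, logical(1))
  all(ok)
}
```

For example, `chkdir_sketch(c("doc", "data"))` would create both directories if needed and report whether both are available afterwards.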
Download data into indicated directory, expand if requested, and rename as specified.
Parameter | Description |
---|---|
dlname | vector of files to download |
dldir | vector of directories to download the files to |
fname | local filenames of downloaded files |
exp | expand downloaded files? |
redl | re-download files? |
Table: Input parameter
Result | Description |
---|---|
TRUE | success; directory exists/has been created successfully |
FALSE | failure; directory does not exist/cannot be created |
Table: Output value
Note
Case #1) `dlname`, `dldir`, `fname` have to be of identical length, -or-
Case #2) `dlname` and `fname` have to be of identical length, and `dldir` has to be of length 1.
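The two argument shapes from the note can be handled by recycling a single directory across all files. The sketch below is an assumption about how this might look: `resolve_targets` and `fetch` are hypothetical names, and only `download.file()` and `unzip()` are real base R (utils) functions.

```r
# Pair every URL with its local target path (hypothetical helper).
resolve_targets <- function(dlname, dldir, fname) {
  if (length(dlname) != length(fname)) {
    stop("dlname and fname must have the same length")
  }
  if (length(dldir) == 1L) {
    dldir <- rep(dldir, length(dlname))   # case 2: one directory for all files
  } else if (length(dldir) != length(dlname)) {
    stop("dldir must have length 1 or the same length as dlname")
  }
  data.frame(url = dlname, dest = file.path(dldir, fname),
             stringsAsFactors = FALSE)
}

# Download (and optionally expand) each target; skips existing files
# unless redl = TRUE. Not called here, since it needs network access.
fetch <- function(targets, exp = FALSE, redl = FALSE) {
  for (i in seq_len(nrow(targets))) {
    dest <- targets$dest[i]
    if (redl || !file.exists(dest)) {
      download.file(targets$url[i], dest, mode = "wb")
    }
    if (exp) unzip(dest, exdir = dirname(dest))
  }
  invisible(TRUE)
}
```

Separating the path resolution from the download keeps the length checks testable without touching the network.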
Extract specific variables from a data set.
Parameter | Description |
---|---|
dtSrc | data.table containing the source data |
vars | vector of variables to extract |
pattern | pattern of variables to extract |
Table: Input parameter
Result | Description |
---|---|
data table | success; data table containing the extracted data only |
NULL | failure |
Table: Output value
Note
If both `vars` and `pattern` are non-NULL, only `vars` will be taken into account.
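The precedence of `vars` over `pattern` can be illustrated with a short sketch. It uses a base `data.frame` instead of a data.table to stay dependency-free, and the function name `extvars_sketch` is hypothetical.

```r
# vars wins over pattern; NULL signals failure, as in the table above.
extvars_sketch <- function(dtSrc, vars = NULL, pattern = NULL) {
  if (!is.null(vars)) {
    if (!all(vars %in% names(dtSrc))) return(NULL)   # unknown variable
    return(dtSrc[, vars, drop = FALSE])
  }
  if (!is.null(pattern)) {
    return(dtSrc[, grepl(pattern, names(dtSrc)), drop = FALSE])
  }
  NULL   # neither vars nor pattern given
}
```

With the real data, a pattern such as `"mean\\(\\)|std\\(\\)"` would select the mean and standard deviation measurements.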
Merge test and train data.
Parameter | Description |
---|---|
dtTest | data.table containing the test data |
dtTrain | data.table containing the train data |
Table: Input parameter
Result | Description |
---|---|
data table | success; data table containing the merged data |
NULL | failure |
Table: Output value
Note
Case #1) `dtTest` and `dtTrain` are omitted -> both corresponding reading functions are executed;
Case #2) `dtTest` or `dtTrain` is omitted -> the corresponding reading function for the omitted data set is executed;
Case #3) `dtTest` and `dtTrain` are provided -> only the merge is carried out.
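The three cases reduce to "read whatever was not supplied, then row-bind". In the sketch below the reader functions are passed in as arguments (`rdtest`, `rdtrain`) so the example is self-contained; that calling convention and the name `merge_sets` are assumptions, and a base `data.frame` stands in for a data.table.

```r
# Read only the omitted data sets, then stack train and test rows.
merge_sets <- function(dtTest = NULL, dtTrain = NULL, rdtest, rdtrain) {
  if (is.null(dtTest))  dtTest  <- rdtest()    # cases 1 and 2
  if (is.null(dtTrain)) dtTrain <- rdtrain()   # cases 1 and 2
  if (is.null(dtTest) || is.null(dtTrain)) return(NULL)
  # Both sets must share the same columns to be stacked.
  if (!identical(names(dtTest), names(dtTrain))) return(NULL)
  rbind(dtTrain, dtTest)                       # case 3: merge only
}
```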
Create (merge) messy data from raw data.
Parameter | Description |
---|---|
NONE |
Table: Input parameter
Result | Description |
---|---|
data table | success; data table containing the messy data |
NULL | failure |
Table: Output value
Read activity label data.
Parameter | Description |
---|---|
basedir | base directory to read files from |
Table: Input parameter
Result | Description |
---|---|
data table | success; data table containing the activity label data |
NULL | failure |
Table: Output value
Read feature label data.
Parameter | Description |
---|---|
basedir | base directory to read files from |
Table: Input parameter
Result | Description |
---|---|
data table | success; data table containing the feature label data |
NULL | failure |
Table: Output value
Read test data.
Parameter | Description |
---|---|
basedir | base directory to read files from |
Table: Input parameter
Result | Description |
---|---|
data table | success; data table containing the test data |
NULL | failure |
Table: Output value
Read train data.
Parameter | Description |
---|---|
basedir | base directory to read files from |
Table: Input parameter
Result | Description |
---|---|
data table | success; data table containing the train data |
NULL | failure |
Table: Output value
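Reading one split (test or train) means combining the three files listed in the directory tree: `subject_*.txt`, `y_*.txt`, and `X_*.txt`. The sketch below uses base R's `read.table()`; the function name `read_split` and the column names `subject` and `activity_id` are assumptions for illustration.

```r
# Read subject ids, activity ids, and measurements for one split.
read_split <- function(basedir, split = c("test", "train")) {
  split <- match.arg(split)
  paths <- file.path(basedir, split,
                     sprintf(c("subject_%s.txt", "y_%s.txt", "X_%s.txt"), split))
  if (!all(file.exists(paths))) return(NULL)   # failure: files are missing
  subject  <- read.table(paths[1], col.names = "subject")
  activity <- read.table(paths[2], col.names = "activity_id")
  x        <- read.table(paths[3])             # whitespace-separated numbers
  cbind(subject, activity, x)
}
```

Here `basedir` would point at the `UCI HAR Dataset` directory; the measurement columns keep their default names (`V1`, `V2`, ...) until feature labels are applied.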
Replace id with label.
Parameter | Description |
---|---|
dtSrc | data.table containing the source data |
dtLbl | data.table containing the label data |
cnSrcId | variable/column name(s) holding the id to replace |
cnLblId | variable/column name(s) holding the id for matching |
cnLblLbl | variable/column name(s) holding the label to replace id with |
rmSrcId | remove id variable/column from resulting data set? |
Table: Input parameter
Result | Description |
---|---|
data table | success; data table containing data with replaced id(s) |
NULL | failure |
Table: Output value
Note
- all of `dtSrc`, `dtLbl`, `cnSrcId`, `cnLblId`, `cnLblLbl` have to be specified, and a non-NULL value has to be provided for each of these;
- if the id variable/column gets removed, the new key will be the old key but with the new variable/column instead; otherwise the old key will remain.
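Replacing an id with its label is essentially a lookup via `match()`. The sketch below is a simplification: it handles a single id column, uses a base `data.frame` (so data.table keys are not modelled), and the name `repid_sketch` and the default `rmSrcId = TRUE` are assumptions.

```r
# Look up each id in the label table and attach the label column.
repid_sketch <- function(dtSrc = NULL, dtLbl = NULL, cnSrcId = NULL,
                         cnLblId = NULL, cnLblLbl = NULL, rmSrcId = TRUE) {
  args <- list(dtSrc, dtLbl, cnSrcId, cnLblId, cnLblLbl)
  if (any(vapply(args, is.null, logical(1)))) return(NULL)
  idx <- match(dtSrc[[cnSrcId]], dtLbl[[cnLblId]])
  dtSrc[[cnLblLbl]] <- dtLbl[[cnLblLbl]][idx]
  if (rmSrcId) dtSrc[[cnSrcId]] <- NULL   # drop the numeric id column
  dtSrc
}
```

Applied to the activity ids, this would turn values such as `1` into labels such as `WALKING`.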
Set variable/column names.
Parameter | Description |
---|---|
dtSrc | data.table containing the source data |
cnVars | vector of new variable/column names |
pattern | pattern to match in the variable/column names to be replaced |
Table: Input parameter
Result | Description |
---|---|
data table | success; data table containing data with new variable/column names assigned |
NULL | failure |
Table: Output value
Note
- `dtSrc` and `cnVars` have to be provided, each with a non-NULL value;
- if `cnVars` contains fewer elements than `dtSrc` has variables/columns, the remaining, missing elements will be taken from the current variables/columns of `dtSrc`;
- if `cnVars` contains more elements than `dtSrc` has variables/columns, only the first elements, up to the number of current variables/columns of `dtSrc`, will be taken;
- if `pattern` is also provided (and non-NULL), all variable/column names matching the pattern will be amended by replacing the matched pattern with the content of `cnVars`;
- any existing sorting/matching key will be amended accordingly.
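The renaming rules in the note can be sketched as follows. Two assumptions are made here: in pattern mode only the first element of `cnVars` is used as the replacement string, and a base `data.frame` is used, so key handling is left out. The name `setvarnames_sketch` is hypothetical.

```r
setvarnames_sketch <- function(dtSrc = NULL, cnVars = NULL, pattern = NULL) {
  if (is.null(dtSrc) || is.null(cnVars)) return(NULL)
  old <- names(dtSrc)
  if (!is.null(pattern)) {
    # Pattern mode: replace the matched part of every name.
    names(dtSrc) <- gsub(pattern, cnVars[1], old)
    return(dtSrc)
  }
  # Positional mode: too few names keep the old ones for the rest,
  # surplus names are ignored.
  n <- min(length(cnVars), length(old))
  old[seq_len(n)] <- cnVars[seq_len(n)]
  names(dtSrc) <- old
  dtSrc
}
```

For instance, a pattern of `"\\(\\)"` with an empty replacement would strip the parentheses from names like `tBodyAcc-mean()`.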
Tidy up messy data.
Parameter | Description |
---|---|
dtMessy | data.table containing the messy data |
Table: Input parameter
Result | Description |
---|---|
data table | success; data table containing the tidy data |
NULL | failure |
Table: Output value
Export data set to file.
Parameter | Description |
---|---|
dtSrc | data.table containing the source data |
basedir | base directory to write files to |
fname | filename of output file |
fext | file extension of output file |
arch | archive/compress (zip) written file? |
archtype | archive/compress type (bz, gz, zip [default]) |
Table: Input parameter
Result | Description |
---|---|
TRUE | success; data set has been successfully written to file |
FALSE | failure; data set couldn't be written to file |
Table: Output value
Note
- at least `dtSrc` and `fname` have to be provided, each with a non-NULL value;
- if the output file already exists, it will be overwritten.
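An export helper with this behaviour could look roughly like the sketch below. It is not the project's implementation: the name `wrdat_sketch` and the defaults are assumptions, and only gzip compression is shown (via base R's `gzfile()` connection) rather than the bz/gz/zip choice described by `archtype`.

```r
wrdat_sketch <- function(dtSrc = NULL, basedir = ".", fname = NULL,
                         fext = "txt", arch = FALSE) {
  if (is.null(dtSrc) || is.null(fname)) return(FALSE)
  outfile <- file.path(basedir, paste(fname, fext, sep = "."))
  if (arch) outfile <- paste0(outfile, ".gz")
  # Opening in write mode truncates, so an existing file is overwritten.
  con <- tryCatch(if (arch) gzfile(outfile, "wt") else file(outfile, "wt"),
                  error = function(e) NULL)
  if (is.null(con)) return(FALSE)
  on.exit(close(con))
  tryCatch({
    write.table(dtSrc, con, row.names = FALSE)
    TRUE
  }, error = function(e) FALSE)
}
```

The written file can be read back with `read.table(path, header = TRUE)`.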