Purpose

In the course of partaking an online class regarding data analysis, in particular the class Getting and Cleaning Data, one assignment was a project with the purpose to demonstrate the student's ability to collect, work with, and clean a data set. The objective is to prepare tidy data that can be used for later analysis.

Setting

One of the most exciting areas in all of data science right now is wearable computing - see for example this article . Companies like Fitbit, Nike, and Jawbone Up are racing to develop the most advanced algorithms to attract new users. The data linked to from the course website represent data collected from the accelerometers from the Samsung Galaxy S smartphone. A full description is available at the site where the data was obtained from the Machine Learning Repository at UCI's.

The data for the project can be obtained from here.

Assignment

Create a R script called run_analysis.R that carries out the following steps:

Merges the training and the test sets to create one data set;
Extracts only the measurements on the mean and standard deviation for each measurement;
Uses descriptive activity names to name the activities in the data set;
Appropriately labels the data set with descriptive variable names;
From the data set in step 4, creates a second, independent tidy data set with the average of each variable for each activity and each subject;
Export the tidy data set as a txt file created with write.table() using row.name=FALSE.

Project

The project comprises several directories and files, all of which are outlined below.

Directory structure

In the project directory you will find several subdirectories.

<project directory>/
  +- CodeBook.md
  +- data/
  |    +- getdata%2Fprojectfiles%2FUCI%20HAR%20Dataset.zip
  |    +- UCI HAR Dataset/
  |    |    +- activity_labels.txt
  |    |    +- features.txt
  |    |    +- features_info.txt
  |    |    +- README.txt
  |    |    +- test/
  |    |    |    +- Inertial Signals/
  |    |    |    |    +- body_acc_x_test.txt
  |    |    |    |    +- body_acc_y_test.txt
  |    |    |    |    +- body_acc_z_test.txt
  |    |    |    |    +- body_gyro_x_test.txt
  |    |    |    |    +- body_gyro_y_test.txt
  |    |    |    |    +- body_gyro_z_test.txt
  |    |    |    |    +- total_acc_x_test.txt
  |    |    |    |    +- total_acc_y_test.txt
  |    |    |    |    +- total_acc_z_test.txt
  |    |    |    +- subject_test.txt
  |    |    |    +- X_test.txt
  |    |    |    +- y_test.txt
  |    |    +- train/
  |    |         +- Inertial Signals/
  |    |         |    +- body_acc_x_train.txt
  |    |         |    +- body_acc_y_train.txt
  |    |         |    +- body_acc_z_train.txt
  |    |         |    +- body_gyro_x_train.txt
  |    |         |    +- body_gyro_y_train.txt
  |    |         |    +- body_gyro_z_train.txt
  |    |         |    +- total_acc_x_train.txt
  |    |         |    +- total_acc_y_train.txt
  |    |         |    +- total_acc_z_train.txt
  |    |         +- subject_train.txt
  |    |         +- X_train.txt
  |    |         +- y_train.txt
  |    +- UCI_HAR_Dataset_Messy.txt
  |    +- UCI_HAR_Dataset_Tidy.txt
  +- doc/
  +- Getting and Cleaning Data Course Project.Rproj
  +- lib/
  |    +- chkdir.R
  |    +- dldat.R
  |    +- extvars.R
  |    +- mergetesttrain.R
  |    +- messydat.R
  |    +- rdactlbl.R
  |    +- rdfeatlbl.R
  |    +- rdtest.R
  |    +- rdtrain.R
  |    +- repidwithlbl.R
  |    +- setvarnames.R
  |    +- tidydat.R
  |    +- wrdat.R
  +- README.md
  +- run_analysis.R

The data directory contains the downloaded data file, both as a compressed archive (zip), as well as the uncompress individual data files.

Inside the doc directory you can find the R Markdown Cheatsheet and R Markdown Reference Guide.

Any additional, supporting scripts or libraries are stored inside the lib directory.

Files

For reasons of readability, testing, profiling, benchmarking and re-usability, instead of creating one single script file, an approach of splitting the script into several functions, each put into a corresponding script file of its own, placed into the lib directory, has been taken.

run_analysis.R

The main entry point, the main script, Upon which execution all required libraries and supporting scripts are getting loaded, as well as global variables and constants are defined.

Steps carried out by this script:

check for existence of the doc and data directories, and create these if required;
download and uncompress data file (raw data) if not already available;
create the messy data (merge train and test data sets), and store these in data.table dtMessy;
create tidy data from messy data, and store these in data.table dtTidy;
write the messy and tidy data sets to corresponding text files located in the data directory.

lib/chkdir.R

Check whether a specific directory is already existing, and if not create that directory.

Parameter	Description
dname	vector of directory names to check/create
mkdir	create directories if non-existing

Table: Input parameter

Result	Description
TRUE	success; directories are existing/have been created successfully
FALSE	failure; directories are not existing/cannot be created

Table: Output value

lib/dldat.R

Download data into indicated directory, expand if requested, and rename as specified.

Parameter	Description
dlname	vector of files to downloads
dldir	vector of directories files to download to
fname	local filenames of downloaded files
exp	expand downloaded files?
redl	re-download files?

Table: Input parameter

Result	Description
TRUE	success; directory is existing/has been created successfully
FALSE	failure; directory does not exist/cannot be created

Table: Output value

Note
Case #1) dlname, dldir, fname have to be of identical length, -or-
Case #2) dlname and fname have to be of identical length, and dldir has to be of length 1.

lib/extvars.R

Extract specific variable from data set.

Parameter	Description
dtSrc	data.table containing the source data
vars	vector of variables to extract
pattern	pattern of variables to extract

Table: Input parameter

Result	Description
data table	success; data table containing the extracted data only
NULL	failure

Table: Output value

Note
If both vars and pattern are non-NULL, only vars will be taken into account.

lib/mergetesttrain.R

Merge test and train data.

Parameter	Description
dtTest	data.table containing the test data
dtTrain	data.table containing the train data

Table: Input parameter

Result	Description
data table	success; data table containing the merged data
NULL	failure

Table: Output value

Note
Case #1) dtTest and dtTrain are omitted -> both corresponding reading functions are executed;
Case #2) dtTest or dtTrain are omitted -> corresponding reading function for the omitted data set is executed;
Case #3) dtTest and dtTrain are provided -> only a merge is getting carried out.

lib/messydat.R

Create (merge) messy data from raw data.

Parameter	Description
NONE

Table: Input parameter

Result	Description
data table	success; data table containing the messy data
NULL	failure

Table: Output value

lib/rdactlbl.R

Read activity label data.

Parameter	Description
basedir	base directory to read files from

Table: Input parameter

Result	Description
data table	success; data table containing the activity label data
NULL	failure

Table: Output value

lib/rdfeatlbl.R

Read feature label data.

Parameter	Description
basedir	base directory to read files from

Table: Input parameter

Result	Description
data table	success; data table containing the feature label data
NULL	failure

Table: Output value

lib/rdtest.R

Read test data.

Parameter	Description
basedir	base directory to read files from

Table: Input parameter

Result	Description
data table	success; data table containing the test data
NULL	failure

Table: Output value

lib/rdtrain.R

Read train data.

Parameter	Description
basedir	base directory to read files from

Table: Input parameter

Result	Description
data table	success; data table containing the train data
NULL	failure

Table: Output value

lib/repidwithlbl.R

Replace id with label.

Parameter	Description
dtSrc	data.table containing the source data
dtLbl	data.table containing the label data
cnSrcId	variable/column name(s) holding the id to replace
cnLblId	variable/column name(s) holding the id for matching
cnLblLbl	variable/column name(s) holding the label to replace id with
rmSrcId	remove id variable/column from resulting data set?

Table: Input parameter

Result	Description
data table	success; data table containing data with replaced id(s)
NULL	failure

Table: Output value

Note

all of dtSrc, dtLbl, cnSrcId, cnLblId, cnLblLbl have to be specified, and a non-NULL value has to be provided for each of these;
if the id variable/column gets removed, the new key will be the old key but with the new variable/column instead; otherwise the old key will remain.

lib/setvarnames.R

Set variable/column names.

Parameter	Description
dtSrc	data.table containing the source data
cnVars	vector of new variable/column names
pattern	pattern of variables to extract

Table: Input parameter

Result	Description
data table	success; data table containing data with new variable/column names assigned
NULL	failure

Table: Output value

Note

dtSrc and cnVars have to be provided, and assigned a non-NULL value respectively;
if cnVars contains less elements than dtSrc variables/columns, the remining, missing elements will be taken from the current variables/columns of dtSrc;
if cnVars contains more elements than dtSrc variables/columns, only the first elements up to the number of elements of the current variables/columns of dtSrc will be taken;
if pattern is also provided (and non-NULL), all variable/column names matching the pattern will be amended by replacing the matched pattern with the content of cnVars;
any exsting sorting/matching key will be amended accordingly.

lib/tidydat.R

Tidy up messy data.

Parameter	Description
dtMessy	data.table containing the test data

Table: Input parameter

Result	Description
data table	success; data table containing the tidy data
NULL	failure

Table: Output value

lib/wrdat.R

Export data set to file.

Parameter	Description
dtSrc	data.table containing the source data
basedir	base directory to write files to
fname	filename of output file
fext	file extension of output file
arch	archive/compress (zip) written file?
archtype	archive/compress type (bz, gz, zip [default])

Table: Input parameter

Result	Description
TRUE	success; data set has been successfully written to file
FALSE	failure; data set couldn't be written to file

Table: Output value

Note

at least dtSrc and fname have to be provided, and containing a non-NULL value;
if outfile is already existing it will get overwritten.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Purpose

Setting

Assignment

Project

Directory structure

Files

run_analysis.R

lib/chkdir.R

lib/dldat.R

lib/extvars.R

lib/mergetesttrain.R

lib/messydat.R

lib/rdactlbl.R

lib/rdfeatlbl.R

lib/rdtest.R

lib/rdtrain.R

lib/repidwithlbl.R

lib/setvarnames.R

lib/tidydat.R

lib/wrdat.R

About

Releases

Packages

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 3 Commits
data		data
doc		doc
lib		lib
CodeBook.md		CodeBook.md
Data Science, Wearable Computing and the Battle for the Throne as World's Top Sports Brand -.webloc		Data Science, Wearable Computing and the Battle for the Throne as World's Top Sports Brand -.webloc
Getting and Cleaning Data Course Project.Rproj		Getting and Cleaning Data Course Project.Rproj
Getting and Cleaning Data \| Coursera.pdf		Getting and Cleaning Data \| Coursera.pdf
README.md		README.md
run_analysis.R		run_analysis.R

Sil68/GettingandCleaningDataCourseProjectCoursera

Folders and files

Latest commit

History

Repository files navigation

Purpose

Setting

Assignment

Project

Directory structure

Files

run_analysis.R

lib/chkdir.R

lib/dldat.R

lib/extvars.R

lib/mergetesttrain.R

lib/messydat.R

lib/rdactlbl.R

lib/rdfeatlbl.R

lib/rdtest.R

lib/rdtrain.R

lib/repidwithlbl.R

lib/setvarnames.R

lib/tidydat.R

lib/wrdat.R

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages