This repository provides an overview of the practical part of my Bachelor's thesis in Software & Information Engineering at the Vienna University of Technology.
Scientific experiments commonly involve several computational steps that consume input data and generate results. Such workflows are often implemented as scripts. While scripts can be adapted to every requirement, they lack valuable metadata such as experiment parameters, process flows and file accesses - in short, provenance. Research efforts addressing this issue have resulted in tools that collect the provenance of scripts. However, these tools rarely use a common provenance format, which makes processing and exchanging their output difficult. In this thesis, two of these tools - YesWorkflow and noWorkflow - are extended so that they produce results that are compliant with the World Wide Web Consortium's PROV standard. The main goal is to increase the utility of their output by facilitating interoperability and machine-aided processing. This work outlines how the proposed modifications to YesWorkflow and noWorkflow were implemented and how they can be leveraged. Moreover, capitalizing on the ontological representation of their provenance, it highlights possibilities to gain deeper insight into the scripts' structure and execution details. To this end, RDF serializations of the PROV ontology are used to infer information that is otherwise only implicitly available. This is achieved with the help of specifically constructed SPARQL queries.
- case_study: script files for, and Turtle data resulting from, the case study in section 6.2.2 of the thesis, in which noWorkflow was used to localize a programming error introduced during refactoring
- instances: sample scripts that may be used as input files for YesWorkflow and noWorkflow; explanations of the scripts are provided in the thesis
- noworkflow_outputs: example outputs generated by the modified version of noWorkflow, available in different file formats
- queries: example queries that are used by the SPARQL Playground for retrieving information from exported provenance data
- sample_data: Turtle files that may be used as input for the SPARQL Playground
- validate: Bash script that validates the PROV-O output in sample_data against PROV-CONSTRAINTS using prov-check
- yesworkflow_outputs: example outputs generated by the modified version of YesWorkflow, available in different file formats
- thesis.pdf: the thesis itself as a PDF document
- thesis_print.pdf: the thesis itself as a PDF document, but without colored hyperlinks (e.g. for printing)
- YesWorkflow: modified source code of YesWorkflow with PROV-O exporter module
- noWorkflow: modified source code of noWorkflow with PROV-O exporter module
- SPARQL Playground: modified source code of SPARQL Playground, pre-equipped with test data (created with extended versions of YesWorkflow and noWorkflow) and provenance queries conceived for the thesis
Note: The scripts and software artifacts involved in this thesis were developed and tested with Python 3.5 and with Java 8 (SPARQL Playground) or Java 11 (YesWorkflow), so these versions are recommended.
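
The idea of inferring implicit information from explicit PROV statements can be illustrated without any RDF tooling. The following is a minimal plain-Python sketch, not one of the thesis's SPARQL queries: it holds a few hand-written PROV-style triples and joins `prov:wasGeneratedBy` with `prov:used` to list which inputs each output may depend on. All `ex:` identifiers and both function names are made up for illustration and do not come from the repository's data. Note that usage and generation by the same activity only suggest a dependency; they do not by themselves establish `prov:wasDerivedFrom`.

```python
# Hypothetical PROV-style triples (subject, predicate, object). The ex: names
# are invented for illustration and are not taken from sample_data.
TRIPLES = [
    ("ex:clean", "prov:used", "ex:raw.csv"),
    ("ex:clean.csv", "prov:wasGeneratedBy", "ex:clean"),
    ("ex:plot", "prov:used", "ex:clean.csv"),
    ("ex:plot.png", "prov:wasGeneratedBy", "ex:plot"),
]


def possible_inputs(triples):
    """Map each generated entity to the entities its generating activity used.

    Mirrors a two-pattern join such as:
        ?output   prov:wasGeneratedBy ?activity .
        ?activity prov:used           ?input .
    """
    # First pass: collect what every activity used.
    used = {}
    for subject, predicate, obj in triples:
        if predicate == "prov:used":
            used.setdefault(subject, set()).add(obj)

    # Second pass: attach those inputs to each entity the activity generated.
    result = {}
    for subject, predicate, obj in triples:
        if predicate == "prov:wasGeneratedBy":
            result.setdefault(subject, set()).update(used.get(obj, set()))
    return result


def upstream(entity, direct):
    """Return every entity reachable by following possible inputs transitively."""
    seen = set()
    stack = [entity]
    while stack:
        current = stack.pop()
        for source in direct.get(current, set()):
            if source not in seen:
                seen.add(source)
                stack.append(source)
    return seen


if __name__ == "__main__":
    direct = possible_inputs(TRIPLES)
    for output in sorted(direct):
        print("{} <- {}".format(output, ", ".join(sorted(direct[output]))))
    print(sorted(upstream("ex:plot.png", direct)))
```

The sketch avoids f-strings so that it also runs on Python 3.5. The repository's actual queries are written in SPARQL and live in queries; this example only mirrors the kind of join pattern such a query might express.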