Pipeline for setting up deidloop directories and file dependencies

Kathleen Muenzen edited this page Mar 9, 2020 · 2 revisions

The following is a suggested pipeline for setting up the directory structure before running deidloop.py, and for generating the meta and knownphi files that are used during full-pipeline de-identification. All scripts can be found in the extras/ directory in de-id_stable1. The examples below use the 2.5k training set as a sample dataset:

  1. Create the directory structure for the deidloop batch input: ./extras/get_batch_input_files.sh /data/muenzenk/batch_data/combined_102_110_r1_r2_r3/ucsf_notes/ /data/muenzenk/batch_data/combined_102_110_r1_r2_r3/structured/ 0
  • Input 1: Path to the directory with ALL input notes
  • Input 2: Path to the parent directory of the output batch directories
  • Input 3: The depth of subfolders inside the base input folder. If the note files sit directly inside the base folder, pass ‘0’ as shown in the example. For the notes in shredded_notes_20190712, pass ‘3’, because each folder inside shredded_notes_20190712 contains 3 levels of subfolders.
  2. Create a list of the batch directories that meta and knownphi files should be created for: find /data/muenzenk/batch_data/combined_102_110_r1_r2_r3/structured/ -mindepth 2 -maxdepth 2 -name "*" > /data/muenzenk/batch_data/combined_102_110_r1_r2_r3/dir_list.txt
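  • Input 1: Path to the parent directory that contains all batch directories for this dataset
  • -mindepth: The shallowest level, counted from Input 1, at which find reports entries. With a value of 2, Input 1 itself and its immediate children are skipped.
  • -maxdepth: The deepest level, counted from Input 1, that find descends to. With a value of 2, nothing below the second level is visited, so together with -mindepth 2 only entries exactly two levels below Input 1 are listed.
  • -name: The pattern used to match batch directory names. "*" matches every name, so all entries at the selected depth are listed.
  • Output 1: Path to the output directory list file (the target of the > redirect)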
  3. Create meta files for each batch directory: python3 ./extras/extract_surrogate_meta_using_hash_v1.py /data/muenzenk/batch_data/combined_102_110_r1_r2_r3/dir_list.txt /data/radhakrishnanl/Meta_files/meta_file_20190712.txt
  • Input 1: Path to the batch directory list file (created in Step 2)
  • Input 2: Path to the parent metadata file that contains meta info for all batch directories
  4. Create knownphi files for each batch directory: python3 ./extras/probes_extract_using_hash_v3_original_probes.py /data/muenzenk/batch_data/combined_102_110_r1_r2_r3/dir_list.txt /data/muenzenk/probe_tests/cleaned_name_phone_probes.txt /data/radhakrishnanl/Meta_files/meta_file_20190712.txt
  • Input 1: Path to the batch directory list file (created in Step 2)
  • Input 2: Path to the parent probes file that contains knownphi information for all batch directories
  • Input 3: Path to the parent metadata file that contains meta info for all batch directories
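
The steps above can be sketched as a small dry-run helper. This is only an illustration, not part of de-id_stable1: the function names list_batch_dirs, write_dir_list and build_commands are hypothetical, and the script paths and argument order are copied from the commands shown on this page. list_batch_dirs is a pure-Python stand-in for the find call (entries exactly two levels below the structured directory), and build_commands only assembles the argument lists for the remaining steps without running anything.

```python
import os
import shlex


def list_batch_dirs(structured_root):
    """Return every entry exactly two levels below structured_root.

    Stand-in for: find ROOT -mindepth 2 -maxdepth 2 -name "*"
    Like find's default behaviour, symlinked directories are not descended.
    The result is sorted for reproducibility (find guarantees no order).
    """
    entries = []
    with os.scandir(structured_root) as level1:
        for parent in level1:
            if parent.is_dir(follow_symlinks=False):
                with os.scandir(parent.path) as level2:
                    entries.extend(child.path for child in level2)
    return sorted(entries)


def write_dir_list(structured_root, dir_list_path):
    """Write one batch directory path per line and return the paths."""
    paths = list_batch_dirs(structured_root)
    with open(dir_list_path, "w") as fh:
        for path in paths:
            fh.write(path + "\n")
    return paths


def build_commands(notes_dir, structured_dir, dir_list, meta_file,
                   probes_file, depth=0):
    """Assemble (but do not run) the commands for the script-based steps.

    Script names and argument order follow the commands on this page;
    the directory list step is handled separately by write_dir_list.
    """
    return [
        ["./extras/get_batch_input_files.sh",
         notes_dir, structured_dir, str(depth)],
        ["python3", "./extras/extract_surrogate_meta_using_hash_v1.py",
         dir_list, meta_file],
        ["python3",
         "./extras/probes_extract_using_hash_v3_original_probes.py",
         dir_list, probes_file, meta_file],
    ]


if __name__ == "__main__":
    base = "/data/muenzenk/batch_data/combined_102_110_r1_r2_r3/"
    commands = build_commands(
        notes_dir=base + "ucsf_notes/",
        structured_dir=base + "structured/",
        dir_list=base + "dir_list.txt",
        meta_file="/data/radhakrishnanl/Meta_files/meta_file_20190712.txt",
        probes_file="/data/muenzenk/probe_tests/cleaned_name_phone_probes.txt",
    )
    # Dry run: print each command instead of executing it.
    for cmd in commands:
        print(" ".join(shlex.quote(arg) for arg in cmd))
```

Printing the commands first makes it easy to check the paths before running them by hand; the directory list would be written with write_dir_list after the first command has created the batch directories and before the meta and knownphi scripts are run.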