[Dorado] New Dorado Basecalling Workflow Terra #659

Open
fraser-combe wants to merge 130 commits into main

fraser-combe
Contributor

@fraser-combe fraser-combe commented Oct 24, 2024

🗑️ This dev branch should be deleted after merging to main.

🧠 Summary

This PR adds a new Dorado Basecalling Workflow, a GPU-accelerated pipeline for basecalling Oxford Nanopore POD5 files. The workflow includes optional automatic model selection, SAM-to-BAM conversion, and demultiplexing into per-barcode FASTQ files. Outputs are uploaded to a new, user-defined Terra table for downstream analysis.

⚡ Impacted Workflows/Tasks

This is a new workflow that does not impact any other workflows

This PR may lead to different results in pre-existing outputs: No

This PR uses an element that could cause duplicate runs to have different results: No

🛠️ Changes

This PR introduces the following changes:

  • New Workflow: Dorado Basecalling Workflow version 1.0.
  • Optional Inputs: Added use_auto_model flag for automatic model selection.
  • Manual and Auto Model Options: Supports both predefined models and automatic selection (sup, hac, fast).
  • SAM-to-BAM Conversion: Integrated SAMTools task for efficient data handling.
  • Demultiplexing: Added demux step to create barcode-specific FASTQ files.
  • Terra Integration: Outputs transferred to Terra, with a table generated for downstream workflows.

⚙️ Algorithm

  • New Tasks:
    1. Dorado Basecall: Converts POD5 files to SAM using GPU acceleration. Uses the new StaPH-B Dorado Docker image, v0.8.0:
      https://github.com/StaPH-B/docker-builds/tree/master/dorado/0.8.0
    2. SAMTools Convert: Converts SAM files to BAM.
    3. Dorado Demultiplexing: Creates barcode-specific FASTQ files.
    4. File Transfer: Uploads FASTQ files to Terra.
    5. Terra Table Creation: Generates Terra table from the uploaded FASTQ files.
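
For reviewers, the first three tasks above can be pictured as three command lines run in sequence. The Python sketch below only builds those command lines; it is not the WDL in this PR, and the exact flags used by the tasks may differ. The helper name `build_commands`, the output layout, and the choice to basecall with `--no-trim` and classify barcodes in the demux step are assumptions made for illustration.

```python
from pathlib import Path


def build_commands(pod5: str, model: str, kit_name: str, out_dir: str) -> list[list[str]]:
    """Return argv lists for basecalling, SAM-to-BAM conversion, and demultiplexing.

    Hypothetical helper: file naming and flag choices are illustrative only.
    """
    stem = Path(pod5).stem
    sam = f"{out_dir}/{stem}.sam"
    bam = f"{out_dir}/{stem}.bam"
    return [
        # Basecall. dorado writes SAM to stdout, so the caller is expected
        # to redirect stdout to `sam`. --no-trim keeps barcodes for demux.
        ["dorado", "basecaller", model, pod5, "--emit-sam", "--no-trim"],
        # Convert SAM to BAM.
        ["samtools", "view", "-b", "-o", bam, sam],
        # Split reads into per-barcode FASTQ files.
        ["dorado", "demux", "--kit-name", kit_name,
         "--output-dir", f"{out_dir}/demux", "--emit-fastq", bam],
    ]
```

Returning argv lists rather than shell strings keeps the sketch easy to inspect and avoids quoting issues if the lists are later passed to `subprocess.run`.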

➡️ Inputs

  • New Inputs:
    • use_auto_model (Boolean): Enables automatic model selection.
    • model_accuracy (String): Specifies model accuracy if using auto-selection (sup, hac, fast).
    • fastq_file_name (String): Prefix for output FASTQ files.
    • fastq_upload_path (String): Path to Terra for uploading FASTQ files.
    • kit_name (String): Specifies sequencing kit for adapter/barcode trimming.
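
As a rough illustration of how these inputs interact, the sketch below resolves the model argument and rejects incomplete inputs. It is written in Python rather than WDL and is not the logic in this PR. The `DoradoInputs` class and `resolve_model` function are hypothetical names, and the rule that an explicit `dorado_model` is required when `use_auto_model` is false is an assumption.

```python
from dataclasses import dataclass

# Accuracy levels named in this PR for automatic model selection.
VALID_ACCURACIES = {"sup", "hac", "fast"}


@dataclass
class DoradoInputs:
    use_auto_model: bool
    model_accuracy: str
    fastq_file_name: str
    fastq_upload_path: str
    kit_name: str
    dorado_model: str = ""  # explicit model, assumed to apply only in manual mode


def resolve_model(inputs: DoradoInputs) -> str:
    """Return the model string to pass to the basecaller, or raise ValueError."""
    if not inputs.kit_name:
        raise ValueError("kit_name is required for demultiplexing")
    if inputs.use_auto_model:
        if inputs.model_accuracy not in VALID_ACCURACIES:
            raise ValueError(f"model_accuracy must be one of {sorted(VALID_ACCURACIES)}")
        return inputs.model_accuracy
    if not inputs.dorado_model:
        raise ValueError("dorado_model is required when use_auto_model is false")
    return inputs.dorado_model
```

Failing early on a missing `kit_name` or an unsupported accuracy level mirrors the edge cases listed under Testing, where the error should surface before any GPU time is spent.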

⬅️ Outputs

  • New Outputs:
    • basecalled_fastqs: Array of FASTQ files generated from basecalling.
    • demuxed_fastqs: Array of FASTQ files generated from demultiplexing.
    • logs: Logs generated during the demux step.
    • terra_table_tsv: TSV file for uploading to Terra.
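
To make the `terra_table_tsv` output concrete, here is a minimal Python sketch of building a Terra load file from a list of uploaded FASTQ URIs. Terra expects the first column header to have the form `entity:<table_name>_id`. Everything else here is an assumption for illustration: the function name `build_terra_table`, the `read1` column name, and deriving the row ID from the FASTQ file name. The table produced by this PR may use different columns.

```python
import csv
import io


def build_terra_table(fastq_uris: list[str], table_name: str) -> str:
    """Return TSV text with one row per FASTQ, keyed by its file name stem."""
    buf = io.StringIO()
    writer = csv.writer(buf, delimiter="\t", lineterminator="\n")
    # Terra load files start with an "entity:<table>_id" column.
    writer.writerow([f"entity:{table_name}_id", "read1"])  # "read1" is a hypothetical column
    for uri in sorted(fastq_uris):
        file_name = uri.rsplit("/", 1)[-1]
        # Strip ".gz" then ".fastq" so "barcode01.fastq.gz" becomes "barcode01".
        sample_id = file_name.removesuffix(".gz").removesuffix(".fastq")
        writer.writerow([sample_id, uri])
    return buf.getvalue()
```

Sorting the URIs keeps the table stable across reruns, which makes it easier to compare outputs between tests.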

🧪 Testing

Test 2: 24 POD5 files from 2 barcodes (manual model)
https://app.terra.bio/#workspaces/theiagen-training-workspaces/Theiagen_FCombe_sandbox/job_history/9bef28ea-82ba-4406-8545-f32de7e07e02

Test 3: 24 POD5 files from 2 barcodes (auto mode)
https://app.terra.bio/#workspaces/theiagen-training-workspaces/Theiagen_FCombe_sandbox/job_history/cead789e-c737-4541-a6ed-d9b907493ee1

[image: example output Terra table]

  • Edge Case Handling: Verified workflow behavior with missing inputs and unsupported models.
  • Terra Integration: Confirmed successful transfer and Terra table generation with sample data.

Suggested Scenarios for Reviewer to Test

  1. Basecalling with Auto Model Selection: Run with the use_auto_model flag enabled.
  2. Manual Model Input: Test with a specific dorado_model path and confirm outputs.
  3. Demultiplexing: Verify barcode-specific FASTQ outputs.
  4. Edge Case: Provide incomplete inputs (e.g., missing kit_name) to confirm error handling.
  5. Terra Table Generation: Confirm Terra table creation and FASTQ uploads with valid inputs.

🔬 Final Developer Checklist

  • The workflow/task has been tested and results, including file contents, are as anticipated
  • The CI/CD has been adjusted and tests are passing (Theiagen developers)
  • Code changes follow the style guide
  • Documentation and/or workflow diagrams have been updated if applicable (Theiagen developers only)

🎯 Reviewer Checklist

  • All changed results have been confirmed
  • You have tested the PR appropriately (see the testing guide for more information)
  • All code adheres to the style guide
  • MD5 sums have been updated
  • The PR author has addressed all comments
  • The documentation has been updated

@fraser-combe
Contributor Author

fraser-combe commented Nov 15, 2024

fraser-combe and others added 2 commits November 18, 2024 10:32
…e used at runtime; improved logging of dorado STDERR to a file; parsed explicit model name from STDERR file or accept user input string; added dorado_log task output file
@kapsakcj
Contributor

kapsakcj commented Nov 18, 2024

I will test 3 different workflows and report back:

  1. using fast as dorado_model input string
  2. using [email protected]
  3. using sup (as this will be the recommended input param for our users)

EDIT: all of these wfs were run AFTER making the below commit 82a7962 bug fix

@kapsakcj
Contributor

TheiaProk_ONT ran successfully on the FASTQs produced by my test above with SUP dorado model 👍 https://app.terra.bio/#workspaces/cdph-terrabio-taborda-manual/CDPH_Bioinformatics_Development/job_history/238d0f1f-fe13-4823-8846-b0774fb75e0c

More confirmation the FASTQs produced by this wf are valid for downstream processing

…ections. Added in-line code and re-ordered sections within Model Type Selection. Also added helpful info for determining Terra workspace bucket GSURI
@sage-wright sage-wright marked this pull request as draft December 13, 2024 17:27
@fraser-combe fraser-combe marked this pull request as ready for review December 19, 2024 20:14
@fraser-combe
Contributor Author

Added a new task that lets the user upload POD5 files with the Data Uploader in Terra and provide the link to the Google bucket as an input to the workflow.

The new task combines the file paths into an array and passes it to the basecalling task.

Documentation has been updated with screenshots showing how to use the Data Uploader and copy the bucket link into the workflow input.

Tested with a small number of POD5 files
https://app.terra.bio/#workspaces/cdph-terrabio-taborda-manual/CDPH_Bioinformatics_Development/job_history/fbb56bc8-7d91-4f82-a246-40babd601303

Tested with a large number of POD5 files
https://app.terra.bio/#workspaces/cdph-terrabio-taborda-manual/CDPH_Bioinformatics_Development/job_history/205e0035-a506-452a-841a-902eaba6e9c8
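
For anyone reviewing the POD5-listing task described above, the idea can be sketched in a few lines of Python. This is not the task's actual code. It assumes the bucket listing is already available as text with one URI per line (for example, the output of `gsutil ls` on the upload folder), and the function name `pod5_paths_from_listing` is hypothetical.

```python
def pod5_paths_from_listing(listing: str) -> list[str]:
    """Return the sorted POD5 URIs found in a one-URI-per-line bucket listing."""
    paths = []
    for line in listing.splitlines():
        uri = line.strip()
        # Keep only POD5 files; skip blank lines, folders, and other file types.
        if uri.lower().endswith(".pod5"):
            paths.append(uri)
    return sorted(paths)
```

The resulting list corresponds to the array that is passed into the basecalling scatter block.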

@kapsakcj
Contributor

I appreciate your fortitude in making even further changes to this workflow. These changes will seriously simplify the setup process for the end user and save everyone lots of time.

The doc updates look great. The screenshots and the section on uploading POD5 files and finding the input GSURI for the upload location are straightforward and easy to understand & follow (in my opinion).

I'm launching a test here in Terra, but I'm assuming it won't finish before I go on PTO for the holidays. Given your recent tests & my previous tests, I'm pretty confident it will run successfully. Please check the logs & outputs in my absence https://app.terra.bio/#workspaces/cdph-terrabio-taborda-manual/CDPH_Bioinformatics_Development/job_history/1c0618d9-560c-4741-a816-2c8bd7b89f15

  • The new task for listing POD5 files ran successfully and the array was successfully passed into the dorado basecalling scatter block of the wf 👍

Don't wait for me if you or the dev team want to merge this PR before I'm back in the office.

6 participants