Cell type annotation: scGPT workflow #832

Open
wants to merge 51 commits into
base: main

Conversation

Contributor

@dorien-er dorien-er commented Jul 9, 2024

Changelog

  • Workflow for scGPT transformer-based cell type annotation
  • Minor fix to scGPT annotation module to allow for multi-processing

Issue ticket number and link

Closes #xxxx (Replace xxxx with the GitHub issue number)

Checklist before requesting a review

  • I have performed a self-review of my code

  • Conforms to the Contributor's guide

  • Check the correct box. Does this PR contain:

    • Breaking changes
    • New functionality
    • Major changes
    • Minor changes
    • Documentation
    • Bug fixes
  • Proposed changes are described in the CHANGELOG.md

  • CI tests succeed!

@dorien-er dorien-er marked this pull request as ready for review September 6, 2024 12:56
@dorien-er dorien-er changed the title scgpt cell type annotation workflow Cell type annotation: scGPT transformer annotation Sep 10, 2024
@dorien-er dorien-er changed the title Cell type annotation: scGPT transformer annotation Cell type annotation: scGPT workflow Sep 10, 2024
dorien-er and others added 6 commits September 10, 2024 10:23
* update description concat component

* Update src/dataflow/concatenate_h5mu/config.vsh.yaml

Co-authored-by: Dries Schaumont <[email protected]>

---------

Co-authored-by: Dries Schaumont <[email protected]>
Co-authored-by: Vladimir Shitov <[email protected]>
dorien-er and others added 24 commits November 18, 2024 15:18
* Remove uses of auto: [publish: true]

* Undo removal of publish component

* Fix integration test
* add component to subset obsp

* update changelog

* update descriptions

* fix tests

* add comment
* update knn component

* update changelog

* update changelog

* address pr comments

* fix tests

* fix tests
* update scanvi

* typo

* update changelog
#894)

* Fix ingestion components not working when optional arguments are unset

* update changelog

* fix test
Contributor

@rcannood rcannood left a comment

Looks good to me source-wise!

I think we might need to merge main into this branch again, right?

After the merge is done, I'll run the component manually once to verify the behaviour. Let me know when this PR is ready for me to do the manual run!

CHANGELOG.md (outdated, resolved)
model_dict = {}
model_dict["model_state_dict"] = f_model_dict
model_dict["id_to_class"] = {k: str(k) for k in range(15)}
torch.save(model_dict, ft_model_path)
Contributor

Do the test resources need to be updated for this PR?

Contributor Author

Unfortunately yes, they need to be updated so that the integration tests can run.
The current resources contain only the foundation model, but a fine-tuned model has a different file structure, and scGPT annotation strictly requires a fine-tuned model.
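
To illustrate the layout difference, here is a minimal sketch based on the test-resource snippet quoted above. It assumes a fine-tuned checkpoint is a dictionary with a `model_state_dict` entry and an `id_to_class` mapping, as in that snippet. The helper name `wrap_as_finetuned` and the toy state dict are hypothetical, and plain lists stand in for weight tensors so the example runs without `torch`.

```python
def wrap_as_finetuned(state_dict, n_classes):
    """Wrap a bare state dict in the two-key layout used by the test snippet.

    Hypothetical helper for illustration; not part of the PR.
    """
    return {
        "model_state_dict": state_dict,
        # Placeholder labels: each class index maps to its own string form,
        # mirroring `{k: str(k) for k in range(15)}` in the quoted snippet.
        "id_to_class": {k: str(k) for k in range(n_classes)},
    }


# Toy state dict (assumption): real entries would be tensors, not lists.
toy_state_dict = {"encoder.weight": [0.0, 0.0], "cls.bias": [0.0]}
checkpoint = wrap_as_finetuned(toy_state_dict, n_classes=15)

# In the actual test script the dictionary is then written to disk with
# torch.save(model_dict, ft_model_path).
print(sorted(checkpoint))
```

A bare foundation checkpoint would lack the `id_to_class` entry, which is presumably why the existing resources cannot be used directly for the annotation tests.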

"obsm_gene_tokens": "gene_id_tokens",
"obsm_tokenized_values": "values_tokenized"
],
toState: {id, output, state -> ["output": output.output]}
Contributor

The final output of this workflow will be an h5mu file with only the HVG features and the predicted cell types and probabilities, correct?

Would it make sense to revert the h5mu back to the original input, and then copy the new output structures (predicted cell types and probabilities) to the original input data?

Interested to hear your thoughts on this -- I can be convinced to not include this step in this PR.

Contributor Author

You're right, it shouldn't output the file with HVG features only. In the bigger annotation workflow (where combining multiple methods is possible), this output file could be the input of another annotation workflow where no HVG subsetting is desired.

I'm tempted to include the HVG subsetting logic inside the annotation component (as is already the case for e.g. scANVI), rather than doing the subsetting in a separate component. Then we can also copy the annotations back to the original input, wdyt?
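
As a rough sketch of what copying the annotations back could look like: the example below uses plain Python dictionaries as stand-ins for the per-cell `.obs` tables of the HVG-subset and original data. The function name `copy_obs_columns` and the column names `scgpt_pred` and `scgpt_probability` are hypothetical, not taken from the PR. Cells are matched by id rather than by position, so the result does not depend on row order.

```python
def copy_obs_columns(annotated_obs, original_obs, columns):
    """Return a copy of original_obs with `columns` taken from annotated_obs.

    Both arguments map cell id -> {column name: value}. Hypothetical helper
    for illustration only; a real component would operate on h5mu data.
    """
    # HVG subsetting removes features, not cells, so the cell ids should match.
    if set(annotated_obs) != set(original_obs):
        raise ValueError("cell ids differ between annotated and original data")

    merged = {}
    for cell_id, row in original_obs.items():
        new_row = dict(row)  # copy so the original input is left untouched
        for col in columns:
            new_row[col] = annotated_obs[cell_id][col]
        merged[cell_id] = new_row
    return merged


# Original input: all features were kept, only existing metadata is present.
original_obs = {"cell_1": {"batch": "a"}, "cell_2": {"batch": "b"}}

# Output of the annotation step on the HVG subset (note the different order).
annotated_obs = {
    "cell_2": {"scgpt_pred": "B cell", "scgpt_probability": 0.91},
    "cell_1": {"scgpt_pred": "T cell", "scgpt_probability": 0.84},
}

merged = copy_obs_columns(
    annotated_obs, original_obs, ["scgpt_pred", "scgpt_probability"]
)
print(merged["cell_1"])
```

With this approach the downstream workflow receives the full feature set plus the new prediction columns, regardless of whether the subsetting happens inside the annotation component or in a separate one.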

6 participants