ERROR: 'All_islands.fasta' does not exist, or is unreadable #9

yl030 opened this issue Jul 11, 2024 · 2 comments


yl030 commented Jul 11, 2024

Hello gifrop team,
I installed gifrop manually, and when I run gifrop --get_islands I get the following output:

This is gifrop 0.0.9
command issued:
/gss1/App_os7/miniconda3/envs/gifrop/bin/gifrop --get_islands
===== Dependencies check =====
parallel .... good
abricate .... good
Rscript .... good
find .... good
[1] "All required R packages were detected"
/gss2/home_new/xuefeng01/gff/gene_presence_absence.csv exist
found 3299 .gff files
WRANGLING SEQUENCE DATA...
making shortened gffs...
Academic tradition requires you to cite works you base your article on.
If you use programs that use GNU Parallel to process data for an article in a
scientific publication, please cite:

Tange, O. (2024, May 22). GNU Parallel 20240522 ('Tbilisi').
Zenodo. https://doi.org/10.5281/zenodo.11247979

This helps funding further development; AND IT WON'T COST YOU A CENT.
If you pay 10000 EUR you should feel free to use GNU Parallel without citing.

More about funding GNU Parallel and the citation notice:
https://www.gnu.org/software/parallel/parallel_design.html#citation-notice

To silence this citation notice: run 'parallel --citation' once.

found 3299 .gff files
extracting fastas from prokka gffs...
Academic tradition requires you to cite works you base your article on.
If you use programs that use GNU Parallel to process data for an article in a
scientific publication, please cite:

Tange, O. (2024, May 22). GNU Parallel 20240522 ('Tbilisi').
Zenodo. https://doi.org/10.5281/zenodo.11247979

This helps funding further development; AND IT WON'T COST YOU A CENT.
If you pay 10000 EUR you should feel free to use GNU Parallel without citing.

More about funding GNU Parallel and the citation notice:
https://www.gnu.org/software/parallel/parallel_design.html#citation-notice

To silence this citation notice: run 'parallel --citation' once.

DONE WRANGLING SEQUENCE DATA
EXECUTING Rscript 'gifrop_id.R'
[1] "loading packages"
Warning message:
package ‘dplyr’ was built under R version 4.2.3
Warning message:
package ‘tidyr’ was built under R version 4.2.3
Warning message:
package ‘readr’ was built under R version 4.2.3
Warning message:
package ‘purrr’ was built under R version 4.2.3
[1] "done loading packages"
Warning message:
One or more parsing issues, call problems() on your data frame for details,
e.g.:
dat <- vroom(...)
problems(dat)
[1] "reading in gffs..."
Joining with by = join_by(seqid, locus_tag)
Error in left_join():
! This join would result in more rows than dplyr can handle.
5723840911 rows would be returned. 2147483647 rows is the maximum number
allowed.
Double check your join keys. This error commonly occurs due to a missing join
key, or an improperly specified join condition.
Backtrace:

  1. ├─... %>% select(genome, seqid, seqid_loc_tags)
  2. ├─dplyr::select(., genome, seqid, seqid_loc_tags)
  3. ├─dplyr::ungroup(.)
  4. ├─tidyr::nest(., seqid_loc_tags = c(locus_tag, loc_tag_order))
  5. ├─dplyr::select(., genome, seqid, locus_tag, loc_tag_order)
  6. ├─dplyr::left_join(., loc_tag_orders)
  7. ├─dplyr:::left_join.data.frame(., loc_tag_orders)
  8. │ └─dplyr:::join_mutate(...)
  9. │ └─dplyr:::join_rows(...)
  10. │ └─dplyr:::dplyr_locate_matches(...)
  11. │ ├─base::withCallingHandlers(...)
  12. │ └─vctrs::vec_locate_matches(...)
  13. ├─vctrs:::stop_matches_overflow(size = 5723840911, call = <env>)
  14. │ └─vctrs:::stop_matches(...)
  15. │ └─vctrs:::stop_vctrs(...)
  16. │ └─rlang::abort(message, class = c(class, "vctrs_error"), ..., call = call)
  17. │ └─rlang:::signal_abort(cnd, .file)
  18. │ └─base::signalCondition(cnd)
  19. └─dplyr (local) <fn>(<vctrs___>)
  20. └─dplyr:::rethrow_error_join_matches_overflow(cnd, error_call)
  21. └─dplyr:::stop_join(...)
  22. └─dplyr:::stop_dplyr(...)
  23. └─rlang::abort(...)

Execution halted
DONE EXECUTING 'gifrop_id.R'
RUNNING ABRICATE ON THE ISLANDS
Using nucl database ncbi: 5386 sequences - 2023-Nov-4
Processing: All_islands.fasta
ERROR: 'All_islands.fasta' does not exist, or is unreadable

Jtrachsel (Owner) commented

Hello!

It looks like you are working with a pretty large pangenome. Unfortunately, gifrop isn't designed for use on very large pangenomes.

This portion of the error message is the real issue:

Error in left_join():
! This join would result in more rows than dplyr can handle.
5723840911 rows would be returned. 2147483647 rows is the maximum number
allowed.
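
For context, 2147483647 is 2^31 - 1, and the 5723840911 rows requested are roughly 2.7 times that limit. A left join returns, for each row on the left, one row per matching row on the right (or a single row when nothing matches), so repeated join keys multiply quickly. The following is a minimal Python sketch of that arithmetic, not gifrop's code; the function name and the toy (seqid, locus_tag) keys are made up for illustration, and this thread does not confirm which keys are repeated inside gifrop_id.R.

```python
from collections import Counter

# 2**31 - 1, the row limit quoted in the dplyr error message
MAX_ROWS = 2**31 - 1


def left_join_row_count(left_keys, right_keys):
    """Count the rows a left join would return, without building the join."""
    right_counts = Counter(right_keys)
    # Each left row yields one output row per matching right row,
    # or a single row (padded with missing values) when nothing matches.
    return sum(max(right_counts[key], 1) for key in left_keys)


# Hypothetical toy keys: (seqid, locus_tag) pairs, with one pair repeated.
left = [("contig_1", "tag_1")] * 3 + [("contig_1", "tag_2")]
right = [("contig_1", "tag_1")] * 4 + [("contig_1", "tag_3")]

rows = left_join_row_count(left, right)
print(rows, rows > MAX_ROWS)
```

Here three left rows each match four right rows (12 rows) and the unmatched key adds one more, so the join returns 13 rows from only four input rows on the left. The same multiplication across thousands of genomes is how a join can exceed the limit.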

My recommendation is to reduce the size of the pangenome you are working with, for example by focusing on a subset of genomes you are interested in. Otherwise, you may need to consider a different tool that has been designed for very large datasets. I've had good luck with PPanGGOLiN, though you will need to manually perform some of the classification steps that gifrop handles for you.
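
One way to set up a smaller run is to copy only the .gff files for the genomes of interest into a fresh directory. The sketch below assumes each genome is stored as <name>.gff in a single directory; the function name, directory names, and genome names are hypothetical. The pangenome table (the gene_presence_absence.csv seen in the log) would presumably need to be regenerated for the subset before rerunning gifrop --get_islands there.

```python
import shutil
from pathlib import Path


def copy_selected_gffs(src_dir, dest_dir, genome_names):
    """Copy <name>.gff for each requested genome; return the names not found."""
    src, dest = Path(src_dir), Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    missing = []
    for name in genome_names:
        gff = src / f"{name}.gff"
        if gff.is_file():
            # copy2 also preserves file timestamps
            shutil.copy2(gff, dest / gff.name)
        else:
            missing.append(name)
    return missing


# Example call (hypothetical paths and genome names):
# missing = copy_selected_gffs("gff", "gff_subset", ["genome_A", "genome_B"])
```

Returning the names that were not found makes it easy to spot typos in the genome list before starting a long run.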


yl030 commented Jul 14, 2024

Thanks, Jtrachsel!
