Comment acknowledgement of Anvar Seyed et al. Addition of probability descriptions as command_11 in appmap.
MatthewRalston · Aug 2, 2024 · afeb931
1 parent 668c5e0

Showing 8 changed files with 367 additions and 207 deletions.
diff --git a/README.md b/README.md
@@ -26,19 +26,31 @@ NOTE: Beta-stage `.bgzf` and `zlib` compatible k-mer count vectors and DeBruijn
 
 ## Summary 
 
-- [ x ] Homepage: 
-- [ x ] Quixart
-- [ x ] Readme headr
-- [ x ] OR
-- [ x ] usage / off
-- [ 
-
-
-k-mer counts from .fa(.gz)/.fq(.gz) sequence data can be stored in `.kdb` file format, a bgzf file similar to `.bam`. For those familiar with `.bam`, a `view` and `header` functions are provided. This file is compatible with `zlib`.
+- [ x ] [Homepage:](https://matthewralston.github.io/kmerdb)
+- [ x ] [Quick Start guide](https://matthewralston.github.io/kmerdb/quickstart)
+- [ x ] `kmerdb usage subcommand_name`
+  - `profile` - Make k-mer count vectors/profiles, calculate unique k-mer counts, total k-mer counts, nullomer counts. Import to read/write NumPy arrays from profile object attributes.
+  - `graph` - Make a weighted edge list of kmer-to-kmer relationships, akin to a De Bruijn graph.
+  - `usage` - Display verbose input file/parameter and algorithm details of subcommands.
+  - `help` - Display verbose input file/parameter and algorithm details of subcommands.
+  - `view` - View .tsv count/frequency vectors with/without preamble.
+  - `header` - View YAML formatted header and aggregate counts
+  - `matrix` - Collate multiple profiles into a count matrix for dimensionality reduction, etc.
+  - `kmeans` - k-means clustering on a distance matrix via Scikit-learn or BioPython with kcluster distances
+  - `hierarchical` - hierarchical clustering on a distance matrix via BioPython with linkage choices
+  - `distance` - Distance matrices (from kmer count matrices) including SciPy distances, a Pearson correlation coefficient implemented in Cython, and Spearman rank correlation included as additional distances.
+  - `index` - Create an index file for the kmer profile (Delayed:)
+  - `shuf` - Shuffle a k-mer count vector/profile (Delayed:)
+  - `version` - Display kmerdb version number
+  - `citation` - Silence citation suggestion
+- [ x ] `kmerdb subcommand -h|--help`
+
+
+k-mer counts from .fa(.gz)/.fq(.gz) sequence data can be computed and stored for access to metadata and count aggregation facilities. For those familiar with `.bam`, `view` and `header` functions are provided. The resulting `.kdb` file is compatible with `zlib`.
 
 Install with `pip install kmerdb`
 
-`kmerdb` is a Python CLI designed for k-mer counting and k-mer graph edge-lists. It addresses the ['k-mer' problem](https://en.wikipedia.org/wiki/K-mer) (substrings of length k) in a simple and performant manner. It stores the k-mer counts in a columnar format (input checksums, total and unique k-mer counts, nullomers, mononucleotide counts) with a YAML formatted metadata header in the first block of a `bgzf` formatted file. 
+
 
 Please see the [Quickstart guide](https://matthewralston.github.io/kmerdb/quickstart) for more information about the format, the library, and the project.
 
@@ -48,191 +60,40 @@ Please see the [Quickstart guide](https://matthewralston.github.io/kmerdb/quicks
 ```bash
 # Usage    --help option    --debug mode
 kmerdb --help # [+ --debug mode]
-kmerdb usage graph
-
-**** 
- o-O      |||
-o---O     |||             [|[          kmerdb           ]|]
-O---o     |||
- O-o      |||        version :     v0.8.2
-  O       |||
- o-O      |||        GitHub  : https://github.com/MatthewRalston/kmerdb/issues
-o---O     |||         PyPI   : https://pypi.org/project/kmerdb/
-O---o     |||      Website   : https://matthewralston.github.io/kmerdb
- O-o      |||
-                                                                       lang :         python
-                                                                          v :      >= v3.7.4
-
-                      package manger : pip
-                        version      : >= 24.0
-        package root : /home/user/.local/share/virtualenvs/kdb-venv/lib/python3.12/site-packages/kmerdb
-        exe file     : /home/user/.local/share/virtualenvs/kdb-venv/lib/python3.12/site-packages/kmerdb/__init__.py
-
-                      required packages : 9
-                   development packages : 9
-
-           ARGV : ['/home/user/.local/share/virtualenvs/kdb-venv/bin/kmerdb', 'usage', 'graph']
-
-O---o
- O-o
-  O
- o-O
-o---O
-O---o
- O-o
-  O
- o-O
-o---O
-O---o
- O-o
-  O
- o-O
-o---O
-
-
-
-
-Beginning program...
-
-
-
-
-                          [ name ] :         graph
-
-                   description : create a edge list in (block) .gz format from .fasta|.fa or .fastq format.
-
-
-
-
-   :     4 column output : [ row idx | k-mer id node #1 | k-mer id node #2 | edge weight (adjacency count) ]
-
-   :  make a deBruijn graph, count the number of k-mer adjacencies,  printing the edge list to STDOUT
-
-
-
-
-                  +=============+====================+====================+=================================+
-                  <    row idx  |  k-mer id node #1  |  k-mer id node #2  |  edge weight (adjacency count)  >
-                  |             |                    |                    |                                 |
-                  |             +
-                  |
-                  |
-                  |
-                  |
-                  |
-
-
-
-
-
---------------------------
-
-
-                    kmerdb graph -k 12 input_1.fa [example_2.fastq] output.12.kdbg
-
-                    [-]    inputs : 
-
-                           Input file can .fastq (or .fa).   - gzip.  Output is a weighted edge list in .kdb format (gzipped .csv with YAML header)
-
-                    [-]    parameters : 
-
-                           uses < -k > for k-mer size, --quiet to reduce runtime, -v, -vv to control logging. --
-
-
-
-                    [-]    [ usage ]  :  kmerdb graph -k $K --quiet <input_1.fa.gz> [input_2.fq.gz] <output_edge_list_file.12.kdbg>
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-name: arguments
-type: array
-items:
-- name: k
-  type: int
-  value: choice of k-mer size
-- name: quiet
-  type: flag
-  value: Write additional debug level information to stderr?
-
-
-
-
-name: inputs
-type: array
-items:
-- name: <.fasta|.fastq>
-  type: array
-  value: gzipped or uncompressed input .fasta or .fastq file(s)
-- name: .kdbg
-  type: file
-  value: Output edge-list filepath.
-
-
-
-
-name: features
-type: array
-items:
-- name: k-mer count arrays, linear, produced as file is read through sliding window.
-    (Un)compressed support for .fa/.fq.
-  shortname: parallel faux-OP sliding window k-mer shredding
-  description: Sequential k-mers from the input .fq|.fa files are added to the De
-    Bruijn graph. In the case of secondary+ sequences in the .fa or considering NGS
-    (.fq) data, non-adjacent k-mers are pruned with a warning. Summary statistics
-    for the entire file are given for each file read, + a transparent data structure.
-- name: k-mer neighbors assessed and tallied, creates a unsorted edge list, with weights
-  shortname: weighted undirected graph
-  description: an edge list of a De Bruijn graph is generated from all k-mers in the
-    forward direction of .fa/.fq sequences/reads. i.e. only truly neighboring k-mers
-    in the sequence data are added to the tally of the k-mer nodes of the de Bruijn
-    graph and the edges provided by the data.
-
-...
-
+kmerdb usage profile
 
 
 #   +
 
-# [ 3 main features: ]     k-mer counts (kmerdb profile -k 12 <input.fa|.fq> [<input.fa|.fq>])    'De Bruijn' graph (kmerdb graph)         [matrices, distances, and clustering!]
+# [ 3 main features: ]     [ 1.   -    k-mer counts  ]
 
 # Create a [composite] profile of k-mer counts from sequence files. (.fasta|.fastq|.fa.gz|.fq.gz)
-kmerdb profile -k 8 --output-name sample_1 sample_1_rep1.fq.gz [sample_1_rep2.fq.gz]
-# Creates sample_1.8.kdb. --minK and --maxK options can be specified to create multiple k-mer profiles at once.
-# Alternatively, can also take a plain-text samplesheet.txt with one filepath on each line.
+kmerdb profile -vv -k 8 --output-name sample_1 sample_1_rep1.fq.gz [sample_1_rep2.fq.gz]
+# Creates k-mer count vector/profile in sample_1.8.kdb. This is the input to other steps, including count matrix aggregation. --minK and --maxK options can be specified to create multiple k-mer profiles at once.
+<!-- # Alternatively, can also take a plain-text samplesheet.txt with one filepath on each line. -->
 
+#          De Bruijn graphs (not a main feature yet, delayed)
 # Build a weighted edge list (+ node ids/counts = De Bruijn graph)
-kmerdb graph -k 12 example_1.fq.gz example_2.fq.gz edges_1.kdbg
+kmerdb graph -vv -k 12 example_1.fq.gz example_2.fq.gz edges_1.kdbg
 
 # View k-mer count vector
-kmerdb view profile_1.8.kdb # -H for full header
+kmerdb view -vv profile_1.8.kdb # -H for full header
 
 # Note: zlib compatibility
 #zcat profile_1.8.kdb
 
 # View header (config.py[kdb_metadata_schema#L84])
-kmerdb header profile_1.8.kdb
+kmerdb header -vv profile_1.8.kdb
 
-## Optional normalization, dim reduction, and distance matrix features:
+## [ 3 main features: ]   [ 2. Optional normalization, PCA/tSNE, and distance metrics ]
 
 # K-mer count matrix - Cython Pearson coefficient of correlation [ ssxy/sqrt(ssxx*ssyy) ]
-kmerdb matrix pass *.8.kdb | kmerdb distance pearson STDIN
+kmerdb matrix -vv from *.8.kdb | kmerdb distance pearson STDIN
 # 
-# kmerdb matrix DESeq2 *.8.kdb
-# kmerdb matrix PCA *.8.kdb
-# kmerdb matrix tSNE *.8.kdb
-#   # <pass> just makes a k-mer count matrix from k-mer count vectors.
+# kmerdb matrix -vv DESeq2 *.8.kdb
+# kmerdb matrix -vv PCA *.8.kdb
+# kmerdb matrix -vv tSNE *.8.kdb
+#   # <from> just makes a k-mer count matrix from k-mer count vectors.
 # 
 
 # Distances on count matrices [ SciPy ]  pdists + [ Cython ] Pearson correlation, scipy Spearman and scipy correlation pdist calculations are available ]
@@ -242,18 +103,18 @@ kmerdb distance -h
 # usage: kmerdb distance [-h] [-v] [--debug] [-l LOG_FILE] [--output-delimiter OUTPUT_DELIMITER] [-p PARALLEL] [--column-names COLUMN_NAMES] [--delimiter DELIMITER] [-k K]
 #                       {braycurtis,canberra,chebyshev,cityblock,correlation,cosine,dice,euclidean,hamming,jaccard,jensenshannon,kulsinski,mahalanobis,matching,minkowski,pearson,rogerstanimotorusselrao,seuclidean,sokalmichener,sokalsneath,spearman,sqeuclidean,yule} [<kdbfile1 kdbfile2 ...|input.tsv|STDIN> ...]
 
-# +
+# [ 3 main features: ]      [ 3. Clustering: k-means and hierarchical with matplotlib ]
 
 #    Kmeans (sklearn, BioPython)
-kmerdb kmeans -k 4 -i dist.tsv
+kmerdb kmeans -vv -k 4 -i dist.tsv
 #    BioPython Phylip tree + upgma
-kmerdb hierarchical -i dist.tsv
+kmerdb hierarchical -vv -i dist.tsv
 
 
 ```
 
 
-
+`kmerdb` is a Python CLI designed for k-mer counting and k-mer graph edge-lists. It addresses the ['k-mer' problem](https://en.wikipedia.org/wiki/K-mer) (substrings of length k) in a simple and performant manner. It stores the k-mer counts in a columnar format (input checksums, total and unique k-mer counts, nullomers, mononucleotide counts) with a YAML formatted metadata header in the first block of a `bgzf` formatted file. 
 
 ## Usage example
 
@@ -370,7 +231,7 @@ Thanks to Rachel for the good memories and friendship. And Sophie too. veggies n
 Thanks to Yasmeen for the usual banter.
 Thanks to A for the newer banter.
 Thanks to Max, Robin, and Robert for the good memories in St. Louis. What's new?
-Thanks to Fred for the good memories.
+Thanks to Fred for the good memories. Hope you're on soon.
 Thanks to Nichole for the cookies and good memories. And your cute furballs too! Hope you're well
 Thanks to S for the lessons, convos, and even embarrassing moments. You're kind of awesome to me.
 Thanks to a few friends I met in 2023 that reminded me I have a lot to learn about friendship, dating, and street smarts.
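
The README hunk above describes the `pearson` distance as `ssxy/sqrt(ssxx*ssyy)`. The following is a minimal pure-Python sketch of that formula only, not the Cython implementation shipped with kmerdb; the function name `pearson` and its list-based inputs are assumptions made for illustration.

```python
import math


def pearson(x, y):
    """Pearson correlation r = ssxy / sqrt(ssxx * ssyy).

    x and y are equal-length sequences of k-mer counts (hypothetical inputs;
    kmerdb itself reads them from a count matrix).
    """
    if len(x) != len(y) or len(x) == 0:
        raise ValueError("x and y must be non-empty and of equal length")
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    # Sums of squares and cross-products about the means.
    ssxy = sum((a - xbar) * (b - ybar) for a, b in zip(x, y))
    ssxx = sum((a - xbar) ** 2 for a in x)
    ssyy = sum((b - ybar) ** 2 for b in y)
    if ssxx == 0 or ssyy == 0:
        raise ValueError("correlation is undefined for a constant vector")
    return ssxy / math.sqrt(ssxx * ssyy)
```

Two count vectors that differ only by a scale factor, such as `[1, 2, 3, 4]` and `[2, 4, 6, 8]`, give a coefficient of 1 up to floating-point rounding, which is why the coefficient is insensitive to sequencing depth alone.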

diff --git a/TODO.org b/TODO.org
@@ -6,20 +6,61 @@
 # .kdb files should be debrujin graph databases
 # The final prototype would be .bgzf format from biopython
 
-* 7/12/24 [Roadmap] - D2 
 
-* TODO 7/11/24 - [LIT REVIEW]
-** D2 metrics, markov sequence prob review
+* 8/1/24 Written Lit review, System Reconfigurations
+
+** Currently reconfiguring my system and redundancies
+
+** Making copies of my installation and configuration/install routines. Trying ubuntu 24.04 LTS version rather than Arch. Better build/configure/make predictability.
+
+** Current [TODO]
+
+*** NEXT Create kmerdb logo using GIMP
+:LOGBOOK:
+- State "IN-PROGRESS" from "NEXT"       [2024-08-01 Thu 19:04]
+:END:
+
+*** TODO Finish logo export
+
+*** Add logo to README
+
+*** Add logo to website
+
+*** 
+
+* 7/28/24 [multiplication rule for Markov probability]
+* needs to be written in documentation
+** currently written into appmap as command 11, but not fleshed out.
+** 
+
+* [TRIAGE] : vsearch align with kmerdb
+** Use k-mer frequencies to rank similarity to sequences in db.
+** Proceed from seed match/mismatch to full dynamic programming Smith-Waterman w/ affine gap penalty
+** 
+
+
+* 7/16/24 NEW metadata feature for graph subcommand
+** graph subcommand needs node count explicitly, (k^n, where n is proportional to fastq size in number of reads)
+*** graph in m = 4^k symbols*
+** [new] metadata fields: unique_kmers, total_kmers, total_nodes, total_edges, possible_edges
+*** AND also printed in final stats
+
+* IN PROGRESS 7/11/24 - [LIT REVIEW]
+** IN PROGRESS D2 metrics, markov sequence prob review
 *** D2 = \sum(I(A, B))
 **** 
-*** D2s = \sum{ \frac{ (X - Xbar)(Y - Ybar) }{ \sqrt{ (X - Xbar) + (Y - Ybar) } } (the squareroot of the sum of the standardized X's is the denominator, numerator is the product of the standardized X and Y counts, then the ratio is summed)
+*** D2s = \sum{ \frac{ (X - \bar{X})(Y - \bar{Y}) }{ \sqrt{ (X - \bar{X})^{2} + (Y - \bar{Y})^{2} } } } (the numerator is the product of the standardized X and Y counts, the denominator is the square root of the sum of their squares, then the ratio is summed)
 **** 
 *** D2* = \sum{ \frac{ (X - Xbar)(Y - Ybar) }{ mhat*nhat*pwX*pwY } } (w=word, hat = "adjusted"/translated = m - k, X and Y are counts from )
 **** 
 *** D2z = ( D2(A,B) - E[D2] ) / \sqrt( var(D2) )
 **** 
-*** DELEGATED D2shepp = \sum{ \frac{ cwXi - (n-k+1)pwx * cwYi - (n-k+1)pwy }{ \sqrt{ (cwXi - (n-k+1)pwx)^{2} + (cwYi - (n-k+1)pwy)^{2}} }
-    CLOSED: [2024-07-12 Fri 21:49]
+*** WAITING D2shepp = \sum{ \frac{ (cwXi - (n-k+1)pwx)(cwYi - (n-k+1)pwy) }{ \sqrt{ (cwXi - (n-k+1)pwx)^{2} + (cwYi - (n-k+1)pwy)^{2} } } }
+    :LOGBOOK:
+    - State "WAITING"    from "DONE"       [2024-08-01 Thu 18:49]
+    - State "DONE"       from "CANCELED"   [2024-08-01 Thu 18:49]
+    - State "CANCELED"   from "DELEGATED"  [2024-08-01 Thu 18:49]
+    :END:
 **** Reinert G. et al. "Alignment-free sequence comparison (I): statistics and power" J. Comput. Biol. 2009 v16 (p1615-1634)
 **** Bibtex format below:
 @article{reinert2009alignment,
@@ -37,13 +78,15 @@
 ** TODO core species choices
 *** chicken farm estuary system changes (algination, asphyxia, microbiological changes
 *** anti-human leaky gut syndrome changes.
+**** i.e. looking at the human leaky gut syndrome, but in reverse. What are bioprotective species and niches that provide resilience to leaky-gut syndrome
 **** TODO chemophore SMILES and gastrotoxic footprints
 *** pathology of lupus or auto-immune skin condition microbiome/metagenomic changes.
 *** vaginal microbiome changes
 *** 
 ** Perspective 1 from review on distance metrics
 ** 
-* 7/10/24 - okay so path 1 [ 2 reviews + cython D2 metrics ] path 2 [ 2 reviews + graph algorithm ]
+* IN PROGRESS 7/10/24 - [IMPORTANT] Needs a choice [cython d2 x graph algorithm features ]:
+** [Key choice needed]: path 1 [ 2 reviews + cython D2 metrics ] path 2 [ 2 reviews + graph algorithm ]
 
 ** cython d2 metrics including the delta distance : |pab(A)-pab(B)| (Karlin et al, tetra,tri,di- nucleotide frequencies)
 ** (describe Karlin delta, algorithm to calculate)
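
The D2 notes in the TODO.org hunk can be illustrated with a short Python sketch. It is an assumption-laden illustration, not kmerdb code: the helper names `kmer_counts`, `d2`, and `d2s` are hypothetical, the alphabet is assumed to be ACGT, and `d2s` centers each count on the mean count over all 4^k words, following the `X - Xbar` notation in the note. Reinert et al. instead center on the expected count under a background model, so the values here will differ from theirs.

```python
from collections import Counter
from itertools import product
import math


def kmer_counts(seq, k):
    """Count every overlapping k-mer in seq with a sliding window."""
    seq = seq.upper()
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def d2(x, y):
    """D2: sum over shared words of the product of their counts."""
    return sum(x[w] * y[w] for w in x.keys() & y.keys())


def d2s(x, y, k):
    """D2S-style statistic using mean-centered counts (assumption, see lead-in)."""
    words = ["".join(p) for p in product("ACGT", repeat=k)]
    xbar = sum(x[w] for w in words) / len(words)
    ybar = sum(y[w] for w in words) / len(words)
    total = 0.0
    for w in words:
        xc = x[w] - xbar
        yc = y[w] - ybar
        denom = math.sqrt(xc ** 2 + yc ** 2)
        if denom > 0:  # skip words where both centered counts are zero
            total += xc * yc / denom
    return total


a = kmer_counts("ACGTACGT", 2)  # AC:2, CG:2, GT:2, TA:1
print(d2(a, a))                 # 13
```

D2 alone grows with sequence length and background composition, which is the motivation for the centered and normalized variants listed in the notes.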