Jutils (version 1.5) is a visualization toolkit for alternative splicing events. Jutils directly supports visualizing results generated by the differential splicing (DS) detection tools MntJULiP, LeafCutter, MAJIQ and rMATS, and can be easily adapted to use with other DS software.
Described in:
- Yang G, Cope L, He Z, and Florea L (2021). Jutils: A visualization toolkit for differential alternative splicing events, Bioinformatics 37(22):4272–4274. [scripts, data]
- Lui WW, Yang G, He Z, and Florea L (2024). MntJULiP and Jutils: Differential splicing analysis of RNA-seq data with covariates. Submitted. [Preprint]
This branch contains the working version of Jutils, version 1.5, which additionally implements PCA visualizations. For the original (stable) version of Jutils, please refer to the master branch.
Copyright (C) 2020-2024, and GNU GPL v3.0, by †Wui Wang Lui, †Guangyu Yang, Liliana Florea († equal contributors)
This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.
Jutils is a visualization toolkit for alternative splicing events. It uses an intermediate tab separated file (TSV) to represent alternative splicing data such as that generated by a differential splicing prediction program. Jutils supports the vizualization of results generated by popular differential splicing (DS) prediction tools MntJULiP, LeafCutter, MAJIQ and rMATS. Additionally, users can write their own routines to convert the output of any other DS program into the unified TSV data format, which can then be visualized with Jutils.
Jutils provides four types of visualization: heatmaps, PCA plots, sashimi plots, and Venn diagrams. Users can display all records in the file, or can apply filters to select specific subsets of the data, for instance entries satisfying user defined criteria of statistical significance.
Jutils is written in Python. To download the source code, clone the current GitHub repository:
git clone https://github.com/splicebox/Jutils.git
- Linux or Darwin
- Python 3.7 or later
- R 4.0 or later
Required Python packages: pandas
, numpy
, seaborn
, matplotlib
, scikit-learn
. The Python packages can be installed with the command:
pip install --user pandas numpy seaborn matplotlib scikit-learn
Required R packages: ggplot2
, gridExtra
, data.table
. The R packages can be installed with the command in R console:
install.packages(c("ggplot2", "gridExtra", "data.table"))
Jutils works in two steps. Step 1 generates the TSV file representation of the user data. Step 2 uses the TSV file, along with other information optionally provided by the user, to generate visualizations. The basic commands are listed below, and examples as applied to the specific visualizations are given in the subsequent sections.
TSV file conversion
python3 jutils.py convert-results [ --mntjulip-dir | --leafcutter-dir | --majiq-dir | --rmats-dir ] <dsprogram_result_dir>
options:
--out-dir specify the output directory
The command takes as input the directory containing the output from the specified DS tool and generates a TSV file in the './out' directory or the directory specified with --out-dir
. The format of the TSV file is described below, and examples are provided in data/results.tsv.zip.
Heatmap visualization
python3 jutils.py heatmap --tsv-file <tsv_file> --meta-file <meta_file> [options]
options:
--dpsi cutoff for delta PSI (Percent Splice In) (default 0.05)
--p-value cutoff for differential test p-value (default 0.05)
--q-value cutoff for differential test q-value (default 1.0)
--aggregate show results at group level (one entry per group)
--avg cutoff for estimated read counts of DSA results (default 0.0)
--fold-change cutoff for log2(fold-change) of DSA results (default 0.0)
--unsupervised display the top most variable features
--top number of top most variable features to display (default 300)
--method clustering method (default 'weighted')
--metric distance metric for clustering ('cityblock')
--prefix add prefix to the output file name
--out-dir specify the output directory
--pdf generate figures in PDF format (default PNG)
--gene-list-file list of target genes (one gene per line without space) for heatmap
The command generates several heatmaps, displaying clustering of features, samples, both and none. The meta-file lists the condition for each sample, for representation in the heatmap, for instance when the data has been generated from a differential analysis. The p-value, q-value and dPSI values are those generated by the DS tool and stored in the TSV file. The format of the TSV file and that of the metadata file are shown below. Some programs including LeafCutter and MntJULiP represent introns as part of a group, and may report multiple DS introns per group. The --aggregate
option selects one unbiased entry per group to include in the displays.
Alternatively, Jutils can select for display the most variable features across the samples, agnostic of the conditions and program predictions, with the option --unsupervised
. To determine the most variable features, the mean absolute difference, MAD = E[| X-X̅ |], is used for PSI values, and the variance to mean ratio, D=σ2/μ, is used for abundance levels.
Note that the option --top
only applies to the unsupervised mode, whereas the --p-value
, --q-value
and --dpsi
only apply to the results from the (supervised) differential splicing analysis. When the --unsupervised
option is invoked, it overrides the default differential splicing options.
The --avg
and --fold-change
options apply to the Differential Splicing Abundance (DSA) type of data, as implemented in MntJULiP (DSA), which is described below.
Lastly, users can create high resolution PDF images with the option --pdf
; by default, Jutils generates PNG images.
PCA visualization
python3 jutils.py pca --tsv-file <tsv_file> --meta-file <meta_file> [options]
options:
--dpsi cutoff for delta PSI (Percent Splice In) (default 0.05)
--p-value cutoff for differential test p-value (default 0.05)
--q-value cutoff for differential test q-value (default 1.0)
--aggregate show results at group level (one entry per group)
--avg cutoff for estimated read counts of DSA results (default 0.0)
--fold-change cutoff for log2(fold-change) of DSA results (default 0.0)
--unsupervised display the top most variable features
--top number of top most variable features to display, this option only works in the unsupervised mode (default 100)
--prefix add prefix to the output file name
--out-dir specify the output directory
--pdf generate figures in PDF format (default PNG)
--gene-list-file list of target gene(s) (one gene per line without space) for heatmap
--color-shape-col select 2 column indices in the meta file to colour and shape points respectively. The sample name column starts with index 1 (default 2,3)
--label-point label points with sample names from the meta file
--highlight-idlist-file list of highlighted sample(s) (one sample name per line without space) for pca
The command generates PCA plots of PC1-2, PC1-3, and PC2-3. The option --color-shape-col
can change the colours and shapes of points by specific columns. By default (2,3), column 2, namely the condition, specifies the colours and column 3, the (optional) covariate column specifies the shapes. The option --label-point
annotates sample names from the meta file to the points. For details about the filter options, please refer to the section on heatmaps above.
Test run pca with test data:
cd data/mntjulip.jutils_testdata/
python ../../jutils.py pca --meta-file splice.4F1M.cov.meta --tsv-file mntjulip_DSR_results.tsv
Sashimi visualization
python3 jutils.py sashimi --tsv-file <tsv_file> --meta-file <meta_file> --gtf <gtf_file> [ --group-id <group_id> | --coordinate <coords> ] [options]
where:
--group_id select group to visualize
--coordinate select genomic range for visualization
--gtf GTF file of gene annotations to extract exons not in the TSV file
options:
--shrink shrink long introns and exons
--min-coverage minimum intron coverage (default 0)
--prefix add prefix to the output file name
--out-dir specify the output directory
--pdf generate figures in PDF format (default PNG)
The streamlined command above will use solely the event information contained in the TSV and metadata files for visualization. Introns will be displayed along with their values (e.g., read counts, PSI value) in multiple samples, separately by the conditions specified in the metadata file. Exons are extracted from the provided GTF file of gene annotations, for instance the human or mouse GENCODE reference gene set. Please ensure that the annotation file is compatible with the version of the genome used. Introns with a minimum coverage level (default 0 reads, and can be changed with the option --min-coverage
) are shown, and long introns and exons can be shortened with the option --shrink
.
Alternatively, the command below extracts the information on flanking exon coverage from the BAM files, when provided by the user, and generates traditional sashimi visualizations:
python3 jutils.py sashimi --bam-list <bam_list.tsv> --coordinate <coords> --gtf <gtf-file> [options]
where:
--coordinate select genomic range for visualization
--gtf GTF file with gene annotations to add to the display
options:
--shrink shrink long introns and exons
--min-coverage minimum intron coverage (default 0)
--prefix add prefix to the output file name
--out-dir specify the output directory
--pdf generate figures in PDF format (default PNG)
The format of the BAM list file is shown below.
Venn diagram visualization
python3 jutils.py venn-diagram --tsv-file-list <tsv_file_list> [options]
options:
--dpsi cutoff for delta PSI (Percent Splice In) (default 0.0)
--p-value cutoff for differential test p-value (default 1.0)
--q-value cutoff for differential test q-value (default 1.0)
--prefix add prefix to the output file name
--out-dir specify the output directory
--pdf generate figures in PDF format (default PNG)
The command above creates Venn diagram representations of gene sets from multiple TSV files, corresponding to multiple tools or comparisons. Of note, a gene set includes each differentially spliced gene once, even if multiple events are reported for the gene. Users can filter the values by delta PSI, p-value and q-value with the options --dpsi
, --p-value
, --q-value
. Additionally, the TSV file can include filters to apply to individual files. The format of the TSV file list is shown below.
LeafCutter represents alternative splicing events at the intron level, with introns having common endpoints further organized into a group. The program uses the differential splicing ratio of an intron within its group to identify DS events. The Percent Splice In (PSI) value of each intron in its group, as calculated from the raw read counts, are then used by Jutils to generate heatmaps. The user can specify criteria for inclusion, such as p-value or q-value cutoffs. Jutils produces both raw and Z-score normalized heatmaps. At this time, Jutils requires output from LeafCutter in the default naming and format.
The following is an example of a script for processing results from LeafCutter:
# convert LeafCutter output to TSV format
result_dir='path/to/LeafCutter_results_dir'
python3 jutils.py convert-results --leafcutter-dir ${result_dir}
# heatmap: visualize DSR events
python3 jutils.py heatmap --tsv-file 'leafcutter_DSR_results.tsv' \
--meta-file 'meta_file.tsv' \
--dpsi 0.2 --p-value 0.05 --q-value 0.05
MntJULiP detects differentially splicing events at the level of introns, using two types of tests. The first is a differential splicing ratio (DSR) test of an intron within its group. The second is a differential test on the intron abundance level (DSA), to determine instances of isoform specific regulation. We first describe the use of Jutils for MntJULiP(DSR), and then describe changes corresponding to the MntJULiP(DSA) test.
MntJULiP(DSR) evaluates each intron within its group ('bunch'). A 'bunch' collects all introns that share the same splice junction. PSI values for representation in the heatmap are calculated from the raw read counts in each sample. MntJULiP may report more than one DS intron per group; the --aggregate
option can be used to display a representative value per group.
MntJULiP(DSA) evaluates each intron independently. The heatmaps represent abundance values calculated from the read counts in each sample. Instead of the dPSI value the program reports, and the TSV file stores, the change in intron abundance between the conditions compared. The user can filter records by criteria such as p-value and q-value, or the minimum fold change. The TSV file format for DSA data is described below.
Lastly, unlike all other programs, MntJULiP can perform both traditional pairwise comparisons, as well as comparisons of multiple conditions. The latter, for instance, allows one to simultaneously compare all timepoints in a series, and all (or subsets of) conditions in a complex experiment.
Examples of scripts for the above scenarios follow.
Pairwise comparison:
# generate TSV file from MntJULiP output (raw results refers to the original MntJULiP version)
result_dir='path/to/MntJULiP_results_dir'
python3 jutils.py convert-results --mntjulip-dir ${result_dir}
# heatmap: DSR events aggregated by groups/clusters
python3 jutils.py heatmap --tsv-file 'mntjulip_DSR_results.tsv' \
--meta-file 'meta_file.tsv' \
--dpsi 0.2 --p-value 0.05 --q-value 0.05 --aggregate
# heatmap: DSR events without aggregation
python3 jutils.py heatmap --tsv-file 'mntjulip_DSR_results.tsv' \
--meta-file 'meta_file.tsv' \
--dpsi 0.2 --p-value 0.05 --q-value 0.05
# heatmap: top most variable DSR events
python3 jutils.py heatmap --tsv-file 'mntjulip_DSR_results.tsv' \
--meta-file 'meta_file.tsv' \
--unsupervised --top 150
# heatmap: DSA events
python3 jutils.py heatmap --tsv-file 'mntjulip_DSA_results.tsv' \
--meta-file 'meta_file.tsv' \
--p-value 0.05 --q-value 0.05 --avg 200 --fold-change 3
Multi-way comparison:
# generate TSV file from MntJULiP output (raw results refers to the original MntJULiP version)
result_dir='path/to/MntJULiP_results_dir'
python3 jutils.py convert-results --mntjulip-dir ${result_dir}
# heatmap: DSR without aggregation
python3 jutils.py heatmap --tsv-file 'mntjulip_DSR_results.tsv' \
--meta-file 'meta_file_multi.tsv' \
--dpsi 0.2 --p-value 0.05 --q-value 0.05
# heatmap: DSA
python3 jutils.py heatmap --tsv-file 'mntjulip_DSA_results.tsv' \
--meta-file 'meta_file_multi.tsv' \
--p-value 0.05 --q-value 0.05 --avg 200
Sashimi visualizations use the raw read counts per sample, as before:
python3 jutils.py convert-results --mntjulip-dir ${result_dir}
# sashimi plot with bams
python3 jutils.py sashimi --bam-list 'bam_list.tsv' \
--coordinate 'chr19:35744400-35745150'
# sashimi plot without bams
python3 jutils.py sashimi --meta-file 'meta_file.tsv' \
--tsv-file 'mntjulip_DSR_results.tsv' \
--group-id 'g001499' \
--gtf 'gencode.v27.transcripts.gtf'
Unlike LeafCutter, MntJULiP and MAJIQ, which are intron-oriented, rMATS reports differential splicing of canonical alternative splicing events (ASEs), including exon skipping (SE), mutually exclusive exons (MXE), alternative 5' and 3' splice sites (A5SS and A3SS) and intron retention (RI). The heatmaps represent per sample PSI values of the events calculated from the read counts. As with the other programs, intron read counts are shown on the sashimi plots.
# convert rMATS output to TSV format
result_dir='path/to/rMATS_results_dir'
python3 jutils.py convert-results --rmats-dir ${result_dir}
# heatmap: visualize DSR events
python3 jutils.py heatmap --tsv-file 'rmats_DSR_results.tsv' \
--meta-file 'meta_file.tsv' \
--p-value 0.05 --q-value 0.05
Jutils provides visualizations of gene sets generated by different comparisons or methods, and represented in TSV files, as a Venn diagram.
####### Venn diagram
python3 jutils.py convert-results --leafcutter-dir '/path/to/LeafCutter_results/' \
--mntjulip-dir '~/path/to/MntJULiP_results/' \
--rmats-dir '~/path/to/rMATS_results/'
python3 jutils.py venn-diagram --tsv-file-list tsv_file_list.txt
The format of the TSV file list is described below.
The TSV file contains 14 columns, describing the following attributes (for DSR data):
GeneName GroupID FeatureID FeatureType FeatureLabel strand p-value q-value dPSI ReadCount1 ReadCount2 PSI psi(c1) psi(c2)
#
# LeafCutter
Mrpl15 clu_1091_NA i001 intron chr1:4774516-4777525 . 0.788252 0.927378 -0.00266183 4,1,3,3,0,2,0,1,0,0,2,0,0,0,0,1,6,0,1,3,11,2 . 0.0444444,0.0238095,0.0285714,0.0309278,0,0.0134228,0,0.00819672,0,0,0.0194175,0,0,0,0,0.0232558,0.0451128,0,0.04,0.0191083,0.0552764,0.0135135 0.0186777433863672 0.0160159139771536
#
# rMATS
Camk2b J00015 44234 SE chr11:5979721,5981069-5981114,5982500 - 4.84279e-13 1.71877e-09 -0.053 781,3171,1429,1317,2327,1586,2342,999,1562,1184,2082,1476,2014,2328,1568,2006,859,168,2971,2694,2703,213 75,299,213,114,489,174,270,115,183,112,180,109,157,177,104,161,65,11,243,203,199,11 0.867,0.869,0.808,0.879,0.749,0.851,0.845,0.845,0.843,0.869,0.879,0.895,0.889,0.892,0.904,0.887,0.892,0.905,0.885,0.893,0.895,0.924 . .
The following slightly modified format is used for DSA type data:
GeneName GroupID FeatureID FeatureType FeatureLabel strand p-value q-value log2FoldChange ReadCount1 ReadCount2 PSI avg_read_counts(c1) avg_read_counts(c2)
Xkr4 J000001 J000001 intron chr1:3207317-3213439 - 0.383576 1 -0.22461 76,26,66,51,45,62,22,8,96,60,105,8,26,61,44,34,92,51,55,24,51,25 . . 55.35 47.37
The meta-file is a TAB ('\t') separated file that lists the sample names and conditions, for example:
sample1 ctrl
sample2 ctrl
sample3 case
sample4 case
The bam list file is a TAB ('\t') separated file containing the full path to the BAM alignment file, along with the sample name and condition, as used by the (traditional) sashimi visualization:
sample1 /path/to/BAM_1 ctrl
sample2 /path/to/BAM_2 case
...
The TSV file list for Venn diagram visualizations contains the full paths of the TSV files generated for the various comparisons, and optionally file-specific parameters:
/path/to/TSV_for_method_1 label1
/path/to/TSV_for_method_2 label2
/path/to/TSV_for_method_3 label3
or
/path/to/TSV_for_method_1 label1 '--p-val 0.05 --q-val 1 --dpsi 0.05'
/path/to/TSV_for_method_1 label2 '--p-val 0.05 --q-val 1 --dpsi 0.1'
/path/to/TSV_for_method_2 label3
/path/to/TSV_for_method_2 label4 '--q-val 0.05'
Please submit a GitHub Issue.
See the file LICENSE for information on the history of this software, terms & conditions for usage, and a DISCLAIMER OF ALL WARRANTIES.