Skip to content

Visualization toolkit for differential splicing events

License

Notifications You must be signed in to change notification settings

splicebox/Jutils

Repository files navigation

Jutils

Jutils (version 1.5) is a visualization toolkit for alternative splicing events. Jutils directly supports visualizing results generated by the differential splicing (DS) detection tools MntJULiP, LeafCutter, MAJIQ and rMATS, and can be easily adapted to use with other DS software.

Described in:

  • Yang G, Cope L, He Z, and Florea L (2021). Jutils: A visualization toolkit for differential alternative splicing events, Bioinformatics 37(22):4272–4274. [scripts, data]
  • Lui WW, Yang G, He Z, and Florea L (2024). MntJULiP and Jutils: Differential splicing analysis of RNA-seq data with covariates. Submitted. [Preprint]

This branch contains the working version of Jutils, version 1.5, which additionally implements PCA visualizations. For the original (stable) version of Jutils, please refer to the master branch.

Copyright (C) 2020-2024, and GNU GPL v3.0, by †Wui Wang Lui, †Guangyu Yang, Liliana Florea  († equal contributors)

This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.

Table of contents

Jutils is a visualization toolkit for alternative splicing events. It uses an intermediate tab separated file (TSV) to represent alternative splicing data such as that generated by a differential splicing prediction program. Jutils supports the vizualization of results generated by popular differential splicing (DS) prediction tools MntJULiP, LeafCutter, MAJIQ and rMATS. Additionally, users can write their own routines to convert the output of any other DS program into the unified TSV data format, which can then be visualized with Jutils.

Jutils provides four types of visualization: heatmaps, PCA plots, sashimi plots, and Venn diagrams. Users can display all records in the file, or can apply filters to select specific subsets of the data, for instance entries satisfying user defined criteria of statistical significance.

Installation

Jutils is written in Python. To download the source code, clone the current GitHub repository:

git clone https://github.com/splicebox/Jutils.git

System requirements

  • Linux or Darwin
  • Python 3.7 or later
  • R 4.0 or later

Prerequisites

Required Python packages: pandas, numpy, seaborn, matplotlib, scikit-learn. The Python packages can be installed with the command:

pip install --user pandas numpy seaborn matplotlib scikit-learn

Required R packages: ggplot2, gridExtra, data.table. The R packages can be installed with the command in R console:

install.packages(c("ggplot2", "gridExtra", "data.table"))

Jutils works in two steps. Step 1 generates the TSV file representation of the user data. Step 2 uses the TSV file, along with other information optionally provided by the user, to generate visualizations. The basic commands are listed below, and examples as applied to the specific visualizations are given in the subsequent sections.

Basic usage

TSV file conversion

python3 jutils.py convert-results [ --mntjulip-dir | --leafcutter-dir | --majiq-dir | --rmats-dir ] <dsprogram_result_dir>
        options:
        --out-dir      specify the output directory

The command takes as input the directory containing the output from the specified DS tool and generates a TSV file in the './out' directory or the directory specified with --out-dir. The format of the TSV file is described below, and examples are provided in data/results.tsv.zip.

Heatmap visualization

python3 jutils.py heatmap --tsv-file <tsv_file> --meta-file <meta_file> [options]
        options:
        --dpsi         cutoff for delta PSI (Percent Splice In) (default 0.05)
        --p-value      cutoff for differential test p-value (default 0.05)
        --q-value      cutoff for differential test q-value (default 1.0)
        --aggregate    show results at group level (one entry per group)
        --avg          cutoff for estimated read counts of DSA results (default 0.0)
        --fold-change  cutoff for log2(fold-change) of DSA results (default 0.0)
        --unsupervised display the top most variable features
        --top          number of top most variable features to display (default 300)
        --method       clustering method (default 'weighted')
        --metric       distance metric for clustering ('cityblock')
        --prefix       add prefix to the output file name
        --out-dir      specify the output directory
        --pdf          generate figures in PDF format (default PNG)
        --gene-list-file list of target genes (one gene per line without space) for heatmap

The command generates several heatmaps, displaying clustering of features, samples, both and none. The meta-file lists the condition for each sample, for representation in the heatmap, for instance when the data has been generated from a differential analysis. The p-value, q-value and dPSI values are those generated by the DS tool and stored in the TSV file. The format of the TSV file and that of the metadata file are shown below. Some programs including LeafCutter and MntJULiP represent introns as part of a group, and may report multiple DS introns per group. The --aggregate option selects one unbiased entry per group to include in the displays.

Alternatively, Jutils can select for display the most variable features across the samples, agnostic of the conditions and program predictions, with the option --unsupervised. To determine the most variable features, the mean absolute difference, MAD = E[| X-X̅ |], is used for PSI values, and the variance to mean ratio, D=σ2/μ, is used for abundance levels.

Note that the option --top only applies to the unsupervised mode, whereas the --p-value, --q-value and --dpsi only apply to the results from the (supervised) differential splicing analysis. When the --unsupervised option is invoked, it overrides the default differential splicing options.

The --avg and --fold-change options apply to the Differential Splicing Abundance (DSA) type of data, as implemented in MntJULiP (DSA), which is described below.

Lastly, users can create high resolution PDF images with the option --pdf; by default, Jutils generates PNG images.

PCA visualization

python3 jutils.py pca --tsv-file <tsv_file> --meta-file <meta_file> [options]
        options:
        --dpsi         cutoff for delta PSI (Percent Splice In) (default 0.05)
        --p-value      cutoff for differential test p-value (default 0.05)
        --q-value      cutoff for differential test q-value (default 1.0)
        --aggregate    show results at group level (one entry per group)
        --avg          cutoff for estimated read counts of DSA results (default 0.0)
        --fold-change  cutoff for log2(fold-change) of DSA results (default 0.0)
        --unsupervised display the top most variable features
        --top          number of top most variable features to display, this option only works in the unsupervised mode (default 100)
        --prefix       add prefix to the output file name
        --out-dir      specify the output directory
        --pdf          generate figures in PDF format (default PNG)
        --gene-list-file list of target gene(s) (one gene per line without space) for heatmap
        --color-shape-col select 2 column indices in the meta file to colour and shape points respectively. The sample name column starts with index 1 (default 2,3)
        --label-point  label points with sample names from the meta file
        --highlight-idlist-file  list of highlighted sample(s) (one sample name per line without space) for pca

The command generates PCA plots of PC1-2, PC1-3, and PC2-3. The option --color-shape-col can change the colours and shapes of points by specific columns. By default (2,3), column 2, namely the condition, specifies the colours and column 3, the (optional) covariate column specifies the shapes. The option --label-point annotates sample names from the meta file to the points. For details about the filter options, please refer to the section on heatmaps above.

Test run pca with test data:

cd data/mntjulip.jutils_testdata/
python ../../jutils.py pca --meta-file splice.4F1M.cov.meta --tsv-file mntjulip_DSR_results.tsv

Sashimi visualization

python3 jutils.py sashimi --tsv-file <tsv_file> --meta-file <meta_file> --gtf <gtf_file> [ --group-id <group_id> | --coordinate <coords> ] [options]
        where:
        --group_id      select group to visualize
        --coordinate    select genomic range for visualization
        --gtf           GTF file of gene annotations to extract exons not in the TSV file
        options:
        --shrink        shrink long introns and exons
        --min-coverage  minimum intron coverage (default 0)
        --prefix        add prefix to the output file name
        --out-dir       specify the output directory
        --pdf           generate figures in PDF format (default PNG)

The streamlined command above will use solely the event information contained in the TSV and metadata files for visualization. Introns will be displayed along with their values (e.g., read counts, PSI value) in multiple samples, separately by the conditions specified in the metadata file. Exons are extracted from the provided GTF file of gene annotations, for instance the human or mouse GENCODE reference gene set. Please ensure that the annotation file is compatible with the version of the genome used. Introns with a minimum coverage level (default 0 reads, and can be changed with the option --min-coverage) are shown, and long introns and exons can be shortened with the option --shrink.

Alternatively, the command below extracts the information on flanking exon coverage from the BAM files, when provided by the user, and generates traditional sashimi visualizations:

python3 jutils.py sashimi --bam-list <bam_list.tsv> --coordinate <coords> --gtf <gtf-file> [options]
        where:
        --coordinate    select genomic range for visualization
        --gtf           GTF file with gene annotations to add to the display
        options:
        --shrink        shrink long introns and exons
        --min-coverage  minimum intron coverage (default 0)
        --prefix        add prefix to the output file name
        --out-dir       specify the output directory
        --pdf           generate figures in PDF format (default PNG)

The format of the BAM list file is shown below.

Venn diagram visualization

python3 jutils.py venn-diagram --tsv-file-list <tsv_file_list> [options]
        options:
        --dpsi         cutoff for delta PSI (Percent Splice In) (default 0.0)
        --p-value      cutoff for differential test p-value (default 1.0)
        --q-value      cutoff for differential test q-value (default 1.0)
        --prefix        add prefix to the output file name
        --out-dir       specify the output directory
        --pdf           generate figures in PDF format (default PNG)

The command above creates Venn diagram representations of gene sets from multiple TSV files, corresponding to multiple tools or comparisons. Of note, a gene set includes each differentially spliced gene once, even if multiple events are reported for the gene. Users can filter the values by delta PSI, p-value and q-value with the options --dpsi, --p-value, --q-value. Additionally, the TSV file can include filters to apply to individual files. The format of the TSV file list is shown below.

Usage customized by program

LeafCutter

LeafCutter represents alternative splicing events at the intron level, with introns having common endpoints further organized into a group. The program uses the differential splicing ratio of an intron within its group to identify DS events. The Percent Splice In (PSI) value of each intron in its group, as calculated from the raw read counts, are then used by Jutils to generate heatmaps. The user can specify criteria for inclusion, such as p-value or q-value cutoffs. Jutils produces both raw and Z-score normalized heatmaps. At this time, Jutils requires output from LeafCutter in the default naming and format.

The following is an example of a script for processing results from LeafCutter:

# convert LeafCutter output to TSV format
result_dir='path/to/LeafCutter_results_dir'
python3 jutils.py convert-results --leafcutter-dir ${result_dir}

# heatmap: visualize DSR events
python3 jutils.py heatmap --tsv-file 'leafcutter_DSR_results.tsv' \
                          --meta-file 'meta_file.tsv' \
                          --dpsi 0.2 --p-value 0.05 --q-value 0.05                       

MntJULiP detects differentially splicing events at the level of introns, using two types of tests. The first is a differential splicing ratio (DSR) test of an intron within its group. The second is a differential test on the intron abundance level (DSA), to determine instances of isoform specific regulation. We first describe the use of Jutils for MntJULiP(DSR), and then describe changes corresponding to the MntJULiP(DSA) test.

MntJULiP(DSR) evaluates each intron within its group ('bunch'). A 'bunch' collects all introns that share the same splice junction. PSI values for representation in the heatmap are calculated from the raw read counts in each sample. MntJULiP may report more than one DS intron per group; the --aggregate option can be used to display a representative value per group.

MntJULiP(DSA) evaluates each intron independently. The heatmaps represent abundance values calculated from the read counts in each sample. Instead of the dPSI value the program reports, and the TSV file stores, the change in intron abundance between the conditions compared. The user can filter records by criteria such as p-value and q-value, or the minimum fold change. The TSV file format for DSA data is described below.

Lastly, unlike all other programs, MntJULiP can perform both traditional pairwise comparisons, as well as comparisons of multiple conditions. The latter, for instance, allows one to simultaneously compare all timepoints in a series, and all (or subsets of) conditions in a complex experiment.

Examples of scripts for the above scenarios follow.

Pairwise comparison:

# generate TSV file from MntJULiP output (raw results refers to the original MntJULiP version)
result_dir='path/to/MntJULiP_results_dir'
python3 jutils.py convert-results --mntjulip-dir ${result_dir}

# heatmap: DSR events aggregated by groups/clusters
python3 jutils.py heatmap --tsv-file 'mntjulip_DSR_results.tsv' \
                          --meta-file 'meta_file.tsv' \
                          --dpsi 0.2 --p-value 0.05 --q-value 0.05 --aggregate

# heatmap: DSR events without aggregation
python3 jutils.py heatmap --tsv-file 'mntjulip_DSR_results.tsv' \
                          --meta-file 'meta_file.tsv' \
                          --dpsi 0.2 --p-value 0.05 --q-value 0.05
                          
# heatmap: top most variable DSR events
python3 jutils.py heatmap --tsv-file 'mntjulip_DSR_results.tsv' \
                          --meta-file 'meta_file.tsv' \
                          --unsupervised --top 150

# heatmap: DSA events
python3 jutils.py heatmap --tsv-file 'mntjulip_DSA_results.tsv' \
                          --meta-file 'meta_file.tsv' \
                          --p-value 0.05 --q-value 0.05 --avg 200 --fold-change 3

Multi-way comparison:

# generate TSV file from MntJULiP output (raw results refers to the original MntJULiP version)
result_dir='path/to/MntJULiP_results_dir'
python3 jutils.py convert-results --mntjulip-dir ${result_dir}

# heatmap: DSR without aggregation
python3 jutils.py heatmap --tsv-file 'mntjulip_DSR_results.tsv' \
                          --meta-file 'meta_file_multi.tsv' \
                          --dpsi 0.2 --p-value 0.05 --q-value 0.05

# heatmap: DSA
python3 jutils.py heatmap --tsv-file 'mntjulip_DSA_results.tsv' \
                          --meta-file 'meta_file_multi.tsv' \
                          --p-value 0.05 --q-value 0.05 --avg 200

Sashimi visualizations use the raw read counts per sample, as before:

python3 jutils.py convert-results --mntjulip-dir ${result_dir}

# sashimi plot with bams
python3 jutils.py sashimi --bam-list 'bam_list.tsv' \
                          --coordinate 'chr19:35744400-35745150'

# sashimi plot without bams
python3 jutils.py sashimi --meta-file 'meta_file.tsv' \
                          --tsv-file 'mntjulip_DSR_results.tsv' \
                          --group-id 'g001499' \
                          --gtf 'gencode.v27.transcripts.gtf'
rMATS

Unlike LeafCutter, MntJULiP and MAJIQ, which are intron-oriented, rMATS reports differential splicing of canonical alternative splicing events (ASEs), including exon skipping (SE), mutually exclusive exons (MXE), alternative 5' and 3' splice sites (A5SS and A3SS) and intron retention (RI). The heatmaps represent per sample PSI values of the events calculated from the read counts. As with the other programs, intron read counts are shown on the sashimi plots.

# convert rMATS output to TSV format
result_dir='path/to/rMATS_results_dir'
python3 jutils.py convert-results --rmats-dir ${result_dir}

# heatmap: visualize DSR events
python3 jutils.py heatmap --tsv-file 'rmats_DSR_results.tsv' \
                          --meta-file 'meta_file.tsv' \
                          --p-value 0.05 --q-value 0.05                       
Gene set comparison

Jutils provides visualizations of gene sets generated by different comparisons or methods, and represented in TSV files, as a Venn diagram.

####### Venn diagram
python3 jutils.py convert-results --leafcutter-dir '/path/to/LeafCutter_results/' \
                                  --mntjulip-dir '~/path/to/MntJULiP_results/' \
                                  --rmats-dir '~/path/to/rMATS_results/'

python3 jutils.py venn-diagram --tsv-file-list tsv_file_list.txt

The format of the TSV file list is described below.

The unified TSV file

The TSV file contains 14 columns, describing the following attributes (for DSR data):

GeneName GroupID  FeatureID   FeatureType FeatureLabel   strand   p-value  q-value  dPSI  ReadCount1  ReadCount2  PSI   psi(c1)   psi(c2)
#
# LeafCutter
Mrpl15   clu_1091_NA i001  intron   chr1:4774516-4777525  .  0.788252 0.927378 -0.00266183 4,1,3,3,0,2,0,1,0,0,2,0,0,0,0,1,6,0,1,3,11,2 .  0.0444444,0.0238095,0.0285714,0.0309278,0,0.0134228,0,0.00819672,0,0,0.0194175,0,0,0,0,0.0232558,0.0451128,0,0.04,0.0191083,0.0552764,0.0135135 0.0186777433863672   0.0160159139771536
# 
# rMATS
Camk2b  J00015  44234   SE   chr11:5979721,5981069-5981114,5982500   -   4.84279e-13   1.71877e-09  -0.053  781,3171,1429,1317,2327,1586,2342,999,1562,1184,2082,1476,2014,2328,1568,2006,859,168,2971,2694,2703,213        75,299,213,114,489,174,270,115,183,112,180,109,157,177,104,161,65,11,243,203,199,11     0.867,0.869,0.808,0.879,0.749,0.851,0.845,0.845,0.843,0.869,0.879,0.895,0.889,0.892,0.904,0.887,0.892,0.905,0.885,0.893,0.895,0.924   .   .

The following slightly modified format is used for DSA type data:

GeneName GroupID  FeatureID  FeatureType  FeatureLabel   strand   p-value  q-value  log2FoldChange ReadCount1  ReadCount2  PSI   avg_read_counts(c1)        avg_read_counts(c2)
Xkr4   J000001   J000001  intron  chr1:3207317-3213439  -   0.383576   1   -0.22461    76,26,66,51,45,62,22,8,96,60,105,8,26,61,44,34,92,51,55,24,51,25   .    .   55.35  47.37

The meta-file

The meta-file is a TAB ('\t') separated file that lists the sample names and conditions, for example:

sample1  ctrl
sample2  ctrl
sample3  case
sample4  case

The bam_list file

The bam list file is a TAB ('\t') separated file containing the full path to the BAM alignment file, along with the sample name and condition, as used by the (traditional) sashimi visualization:

sample1   /path/to/BAM_1   ctrl
sample2   /path/to/BAM_2   case
...

The tsv_file_list

The TSV file list for Venn diagram visualizations contains the full paths of the TSV files generated for the various comparisons, and optionally file-specific parameters:

/path/to/TSV_for_method_1   label1  
/path/to/TSV_for_method_2   label2
/path/to/TSV_for_method_3   label3

or

/path/to/TSV_for_method_1   label1   '--p-val 0.05 --q-val 1 --dpsi 0.05'
/path/to/TSV_for_method_1   label2   '--p-val 0.05 --q-val 1 --dpsi 0.1'
/path/to/TSV_for_method_2   label3
/path/to/TSV_for_method_2   label4   '--q-val 0.05'

Please submit a GitHub Issue.

License information

See the file LICENSE for information on the history of this software, terms & conditions for usage, and a DISCLAIMER OF ALL WARRANTIES.

About

Visualization toolkit for differential splicing events

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Contributors 4

  •  
  •  
  •  
  •