Skip to content

Phasing and scaffolding polyploid genomes based on Pore-C or Hi-C.

Notifications You must be signed in to change notification settings

wangyibin/CPhasing

Repository files navigation

C-Phasing logo

C-Phasing

Phasing and scaffolding polyploid genomes based on Pore-C, Ultra-long, or Hi-C data

.

Documentation | 中文文档

Introduction

One of the major problems with Hi-C scaffolding of polyploid genomes is a large proportion of ambiguous short-read mapping, leading to a high-level of switched or chimeric assemblies. Now, the long-read-based chromosome conformation capture technology, e.g., Pore-C, provides an effective way to overcome this problem. Here, we developed a new pipeline, namely C-Phasing, which is specifically tailored for polyploid phasing by leveraging the advantage of Pore-C data. It also works on Hi-C data and diploid genome assembly.

The advantages of C-Phasing:

  • High speed.
  • High anchor rate of genome.
  • High accuracy of polyploid phasing.

Installation

Via activate_cphasing (Recommended)

## Download C-Phasing and install all dependencies
git clone https://github.com/wangyibin/CPhasing.git

## activate environment (For the first configuration, run it when the network is accessible.)
./CPhasing/bin/activate_cphasing

Via Anaconda

## Download C-Phasing and install all dependencies
git clone https://github.com/wangyibin/CPhasing.git
cd CPhasing
conda env create -f environment.yml
conda activate cphasing

## Add these command into .bash_profile or .bashrc
export PATH=/path/to/CPhasing/bin:$PATH
export PYTHONPATH=/path/to/CPhasing:$PYTHONPATH
## The hic pipeline require GLIBCXX_3.4.29, or you can add this command to your environment (.bash_profile)
export LD_LIBRARY_PATH=/path/to/anaconda3/envs/cphasing/lib:$LD_LIBRARY_PATH

Custom install the dependencies

Install C-Phasing

## Download C-Phasing and install python dependencies
git clone https://github.com/wangyibin/CPhasing.git
cd CPhasing
pip install .

## Add following to the .bash_profile or .bashrc
export PATH=/path/to/CPhasing/bin:$PATH

Dependencies

  1. For core function
  2. For Pore-C pipeline

One command pipeline of C-Phasing

The -n 8:4 parameter of the following commands means assembling a tetraploid (4) with 8 chromosome basic numbers. If you set -n 0:0 means partition in both rounds automatically, also support it set to -n 8:0 or -n 0:4.

  • Start from a pore-c data
cphasing pipeline -f draft.asm.fasta -pcd sample.fastq.gz -t 10 -n 8:4
  • Start from multiple pore-c data: specify multiple -pcd parameters.
cphasing pipeline -f draft.asm.fasta -pcd sample1.fastq.gz -pcd sample2.fastq.gz -t 10 -n 8:4

Note: If you want to run on cluster system and submit them to multiple nodes, you can use cphasing mapper and cphasing-rs porec-merge to generate the merged porec.gz file and input it by -pct parameter.

  • Start from a pore-c table (porec.gz), which is generated by cphasing mapper.
cphasing pipeline -f draft.asm.fasta -pct sample.porec.gz -t 10 -n 8:4
  • Start from a paired-end Hi-C data
cphasing pipeline -f draft.asm.fasta -hic1 Lib_R1.fastq.gz -hic2 Lib_R2.fastq.gz -t 10 -n 8:4

Note1: If you want to run multiple samples, you can use cphasing hic mapper and cphasing-rs pairs-merge to generate the merged pairs.gz file, and input it by -prs parameter.
Note2: If the total length of your input genome is larger than 8 Gb, the -hic-mapper-k 27 -hic-mapper-w 14 should be specified, to avoid the error of chromap.

  • Start from a 4DN pairs file,
cphasing pipeline -f draft.asm.fasta -prs sample.pairs.gz -t 10 -n 8:4
  • Skip some steps
## skip steps 1.alleles and 2.prepare steps 
cphasing pipeline -f draft.asm.fasta -pct sample.porec.gz -t 10 -ss 1,2
  • Perform only specified steps
## run 3.hyperpartition 
cphasing pipeline -f draft.asm.fasta -pct sample.porec.gz -t 10 -s 3
  • Add the -hcr parameter to remove the greedy contacts (several regions contact with the whole genome) to improve the phasing quality.
cphasing pipeline -f draft.asm.fasta -pct sample.porec.gz -t 10 -hcr

Curation by Juicebox

  • generate .assembly and .hic, depend on 3d-dna
cphasing pairs2mnd sample.pairs.gz -o sample.mnd.txt
cphasing utils agp2assembly groups.agp > groups.assembly
bash ~/software/3d-dna/visualize/run-assembly-visualizer.sh sample.assembly sample.mnd.txt

Note: if chimeric corrected, please use groups.corrected.agp and generate a new corrected.pairs.gz by cphasing-rs pairs-break

  • After curation
## convert assembly to agp
cphasing utils assembly2agp groups.review.assembly -n 8:4 
## or haploid or a homologous group
cphasing utils assembly2agp groups.review.assembly -n 8
## extract contigs from agp 
cphasing agp2fasta groups.review.agp draft.asm.fasta --contigs > contigs.fasta
## extract chromosome-level fasta from agp
cphasing agp2fasta groups.review.agp draft.asm.fasta > groups.review.asm.fasta

Rename

Rename and orient chromosome according a monoploid reference (or genome of closely related species).

cphasing rename -r mono.fasta -f draft.asm.fasta -a groups.review.agp -t 20

Note: To reduce the time consumed, we only align the first haplotype (g1) to the monoploid, which the orientation among different haplotypes has already been set to the same in the scaffolding step. If not, you can set —unphased to align all haplotypes to the monoploid to adjust the orientation.


Pipeline of Ultra-long data [Optional]

C-Phasing enable to use ultra-long to correct chimeric and identify the high confidence regions (HCRs) to help assembly.

Step by step pipeline of Pore-C data

  1. mapping

    Mapping pore-c data into contig-level assembly and output the pore-c table and 4DN pairs.

    ## results are `sample.porec.gz` and `sample.pairs.gz`  
    cphasing mapper draft.asm.fasta sample.fastq.gz -t 10
    

    For Hi-C data please use cphasing hic mapper.

    Note: If you are mapping multiple pore-c data, the multiple pairs.gz files should be merged by following steps:

    cphasing-rs pairs-merge sample1.pairs.gz sample2.pairs.gz -o sample.merge.pairs.gz
  2. alleles (Optional for phasing mode)

    Identify the inter-allelic contig pairs.

    ## result is `draft.asm.allele.table`
    cphasing alleles -f draft.asm.fasta

    Note: the memory usage of this step is depend on the genome size of input fasta, when input the length larger 10 Gb, the peak memory will reach 130 Gb, or more. If you want to reduce the memory usage, you can increase the k and w, for example, when we benchmark a 20 Gb hexaploid, we set the k and w to 27 and 24, respectively.

  3. prepare

    Prepare some data for subsequence analysis

    ## results are `sample.counts_AAGCTT.txt`, `sample.clm.gz`, `sample.split.contacts`, `sample.contacts`
    cphasing prepare draft.asm.fasta sample.pairs.gz 
  4. hyperpartition

    ## result is `output.clusters.txt`
    ## for haploid scaffolding 
    cphasing hyperpartition sample.porec.gz draft.asm.contigsizes output.clusters.txt
    ## for polyploid or diploid phasing must add prune information and use the incremental partition mode
    ### chromsome number aware, 8:4 indicate that this is a tetraploid with 8 chromosome in each haplotype
    cphasing hyperpartition sample.porec.gz draft.asm.contigsizes output.clusters.txt -pt prune.contig.table -inc -n 8:4 -t 4
    ### auto generate groups
    cphasing hyperpartition sample.porec.gz draft.asm.contigsizes output.clusters.txt -pt prune.contig.table -inc -t 4

    Note: If you are not satisfied with the number of groups, you can adjust the resolution parameters (-r1 or -r2) to make difference groups. -r1 controls the first cluster while -r2 controls the second cluster in phasing mode. Increasing the resolution can generate more groups while decreasing it will get fewer groups.
    If you want to group some contigs, you can input a contig list with the -wl parameter.

  5. scaffolding

    ## result is `groups.agp`
    cphasing scaffolding output.clusters.txt sample.counts_AAGCTT.txt sample.clm.gz -sc sample.split.contacts -f draft.asm.fasta -t 4
    
    ## for polyploid can specified allele table to adjust the orientation of different haplotypes
    cphasing scaffolding output.clusters.txt sample.counts_AAGCTT.txt sample.clm -at draft.asm.allele.table -sc sample.split.contacts -f draft.asm.fasta 
  6. plot

    Adjust the contig-level contacts matrix into chromosome-level and plot a heatmap.

    ## result is `sample.10000.cool` and `groups.wg.png`
    cphasing pairs2cool sample.pairs.gz draft.asm.contigsizes sample.10000.cool
    cphasing plot -a groups.agp -m sample.10000.cool -o groups.wg.png

    plot also enable only plot the heatmap from a matrix file (.cool)

    ## directly plot the heatmap in the input resolution, 
    cphasing plot -m sample.500k.chrom.cool -o sample.500k.chrom.png
    
    ## or first coarsen small binsize to larger binsize, e.g. 10k to 500k, specify the `-bs 500k` parameter.
    
    cphasing plot -m sample.10k.chrom.cool -o sample.500k.chrom.png -bs 500k

    Note: If the number of bins in matrix is too large (large genome with small binsize), the memory may overflow. To fix it, users can improve the binsize or specified several chromosomes with -c parameter.

    cphasing plot -m sample.100k.chrom -c Chr01g1,Chr01g2,Chr01g3,Chr01g4 -o Chr01.100k.png 

FAQ

The results of the first round partition are unsatisfactory.

In our two-round partition algorithm, the first round partition depends on the h-trans errors between homologous chromosomes; if you input a contig assembly with low level switch errors or input a high accuracy pore-c data, the h-trans will be not enough to cluster all contigs to correct homologous groups, resulting in unsatisfactory results. You can set the -q1 0 for hyperpartition to increase the rate of h-trans errors. However, this parameter may raise error of out of memory when you input huge pore-c data in porec table or hic contacts in pairs file.

How to set the -n parameter when assembling an aneuploid genome.

The aneuploid genome, such as modern cultivated sugarcane, contains unequal homologous chromosomes. The -n parameter can be set to zero (-n 10:0) to automatically partition contigs into different chromosomes within a homologous group.
However, we also allow the user to input a file with two columns: the first column is the index(1-base) of the first round partition, and the second column is the chromosome number of each homologous. And then specified the -n 10:second.number.tsv in cphasing pipeline or cphasing hyperpartition.

  • second.number.tsv
1    13
2    12
3    12
4    11
5    10
6    12
7    12
8    10
9    12
10    12