chromap

NAME

chromap - fast alignment and preprocessing of chromatin profiles

SYNOPSIS

* Indexing the reference genome:

chromap -i [-k kmer] [-w miniWinSize] -r ref.fa -o ref.index

* Mapping (sc)ATAC-seq reads:

chromap --preset atac -r ref.fa -x ref.index -1 read1.fq -2 read2.fq -o aln.bed [-b barcode.fq.gz] [--barcode-whitelist whitelist.txt]

* Mapping ChIP-seq reads:

chromap --preset chip -r ref.fa -x ref.index -1 read1.fq -2 read2.fq -o aln.bed

* Mapping Hi-C reads:

chromap --preset hic -r ref.fa -x ref.index -1 read1.fq -2 read2.fq -o aln.pairs
chromap --preset hic -r ref.fa -x ref.index -1 read1.fq -2 read2.fq --SAM -o aln.sam

DESCRIPTION

Chromap is an ultrafast method for aligning and preprocessing high-throughput chromatin profiles. Typical use cases include: (1) trimming sequencing adapters, mapping bulk ATAC-seq or ChIP-seq genomic reads to the human genome, and removing duplicates; (2) trimming sequencing adapters, mapping single-cell ATAC-seq genomic reads to the human genome, correcting barcodes, removing duplicates, and performing Tn5 shift; (3) split alignment of Hi-C reads against a reference genome. In all three cases, Chromap is 10-20 times faster while remaining accurate.

OPTIONS

Indexing options

	-k INT		Minimizer k-mer length [17].
	-w INT		Minimizer window size [7]. A minimizer is the smallest k-mer in a window of w consecutive k-mers.
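The minimizer definition above can be illustrated with a minimal sketch. It assumes k-mers are compared lexicographically, which is a simplification (tools commonly order k-mers by a hash value instead), and the function name minimizers is hypothetical rather than part of chromap.

```python
def minimizers(seq, k, w):
    """Return sorted (position, k-mer) pairs chosen as window minimizers.

    Assumption: the "smallest" k-mer is the lexicographically smallest one.
    """
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    selected = set()
    # Slide a window over w consecutive k-mers and keep the smallest of each.
    for start in range(len(kmers) - w + 1):
        window = range(start, start + w)
        best = min(window, key=lambda i: kmers[i])  # leftmost k-mer wins ties
        selected.add((best, kmers[best]))
    return sorted(selected)


if __name__ == "__main__":
    print(minimizers("ACGTTGCA", k=3, w=3))
```

Adjacent windows often share the same minimizer, so the selected set is much smaller than the full list of k-mers. A larger w samples the sequence more sparsely.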

--min-frag-length INT

Min fragment length for choosing k and w automatically [30]. Users can increase this value when the min length of the fragments of interest is long, which can increase the mapping speed. Note that the default value 30 is the min fragment length that chromap can map.

Mapping options

--split-alignment

Allow split alignments. This option should be set only when mapping Hi-C reads.

	-e INT		Max edit distance allowed to map a read [8].
	-s INT		Min number of minimizers required to map a read [2].

-f INT1[,INT2]

Ignore minimizers occurring more than INT1 [500] times. INT2 [1000] is the threshold for a second round of seeding.

	-l INT		Max insert size, only for paired-end read mapping [1000].
	-q INT		Min MAPQ in range [0, 60] for mappings to be output [30].

--min-read-length INT

Skip mapping reads shorter than INT [30]. Note that this is different from the index option --min-frag-length, which sets -k and -w for indexing the genome.

--trim-adapters

Try to trim adapters at the 3' end. This only works for paired-end reads. When the fragment length indicated by the read pair is less than the read length, the two mates overlap each other. The regions outside the overlap are then regarded as adapters and trimmed.
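The trimming rule can be sketched as follows. This is an illustration only: it assumes the fragment length has already been inferred from the read pair and that both reads are given in sequencing orientation (5' to 3'). The function name trim_to_fragment is hypothetical.

```python
def trim_to_fragment(read1, read2, fragment_length):
    """Trim each mate to the fragment length when the fragment is shorter.

    Bases beyond the fragment length lie outside the mate overlap and are
    treated as adapter sequence at the 3' end.
    """
    def trim(read):
        if fragment_length < len(read):
            return read[:fragment_length]
        return read  # fragment is at least as long as the read: nothing to trim

    return trim(read1), trim(read2)


if __name__ == "__main__":
    print(trim_to_fragment("ACGTACGTAA", "TTACGTACGT", 6))
```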

--remove-pcr-duplicates

Remove PCR duplicates.

--remove-pcr-duplicates-at-bulk-level

Remove PCR duplicates at bulk level for single cell data.

--remove-pcr-duplicates-at-cell-level

Remove PCR duplicates at cell level for single cell data.

--Tn5-shift

Perform Tn5 shift. When this option is turned on, the forward mapping start positions are increased by 4bp and the reverse mapping end positions are decreased by 5bp. Note that this works only when --SAM is NOT set.
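The coordinate adjustment described above can be sketched as a small function. The name tn5_shift and the (start, end, strand) representation are assumptions made for illustration; they are not chromap's internal data structures.

```python
def tn5_shift(start, end, strand):
    """Apply the Tn5 shift to one mapping.

    Forward mappings ("+") have their start increased by 4 bp;
    reverse mappings ("-") have their end decreased by 5 bp.
    """
    if strand == "+":
        return start + 4, end
    if strand == "-":
        return start, end - 5
    raise ValueError("strand must be '+' or '-'")


if __name__ == "__main__":
    print(tn5_shift(100, 150, "+"))
    print(tn5_shift(100, 150, "-"))
```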

--low-mem

Use low memory mode. When this option is set, multiple temporary intermediate mapping files might be generated on disk and they are merged at the end of processing to reduce memory usage. When this is NOT set, all the mapping results are kept in the memory before they are saved on disk, which works more efficiently for datasets that are not too large.

--bc-error-threshold INT

Max Hamming distance allowed to correct a barcode [1]. Note that the max supported threshold is 2.
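As an illustration of Hamming-distance matching against a whitelist, the following minimal sketch returns the nearest whitelisted barcodes within a threshold. The function names are hypothetical, and the linear scan over the whitelist is chosen for clarity; it is not a description of how chromap performs the lookup.

```python
def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("barcodes must have equal length")
    return sum(x != y for x, y in zip(a, b))


def correction_candidates(barcode, whitelist, max_dist=1):
    """Return the nearest whitelisted barcodes within max_dist, sorted."""
    if barcode in whitelist:
        return [barcode]  # exact match, no correction needed
    hits = sorted(
        (hamming(barcode, w), w)
        for w in whitelist
        if len(w) == len(barcode)
    )
    hits = [(d, w) for d, w in hits if d <= max_dist]
    if not hits:
        return []
    best = hits[0][0]
    return [w for d, w in hits if d == best]


if __name__ == "__main__":
    print(correction_candidates("AAAC", {"AAAA", "AAAT", "CCCC"}))
```

When more than one candidate is returned, the Hamming distance alone cannot decide which correction to apply.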

--bc-probability-threshold FLT

Min probability to correct a barcode [0.9]. When there are multiple whitelisted barcodes with the same Hamming distance to the barcode to correct, chromap will process the base quality of the mismatched bases, and compute a probability that the correction is right.
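One way such a probability could be derived from base qualities is sketched below. This is an assumption made for illustration, not chromap's actual formula: each mismatched base contributes its Phred error probability 10^(-Q/10), and the resulting weights are normalized over the candidates. The function name is hypothetical.

```python
def correction_probabilities(barcode, quals, candidates):
    """Map each candidate to a normalized probability of being correct.

    quals holds one Phred quality score per barcode base.
    Assumption: a candidate's weight is the product of the error
    probabilities of the bases where it differs from the observed barcode.
    """
    if not candidates:
        return {}
    weights = {}
    for cand in candidates:
        weight = 1.0
        for base, q, ref in zip(barcode, quals, cand):
            if base != ref:
                weight *= 10 ** (-q / 10)  # Phred score to error probability
        weights[cand] = weight
    total = sum(weights.values())
    return {cand: w / total for cand, w in weights.items()}


if __name__ == "__main__":
    probs = correction_probabilities("ACGT", [10, 30, 30, 30], ["CCGT", "ACGA"])
    print(probs)
```

Under this model, a candidate whose mismatch falls on a low-quality base receives most of the probability mass, and a correction would be accepted only if its probability reaches the threshold.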

-t INT

The number of threads for mapping [1].

Input options

	-r FILE		Reference file.
	-x FILE		Index file.
	-1 FILE		Single-end read files or paired-end read files 1. Chromap supports multiple input files separated by ",". For example, setting this option to "read11.fq,read12.fq,read13.fq" uses all three files as input and maps them in this order. Similarly, -2 and -b also support multiple input files. The order of the input files must match across all three options.
	-2 FILE		Paired-end read files 2.
	-b FILE		Cell barcode files.
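When several files are passed to -1, -2 and -b, a small helper can build the comma-separated arguments and check that the lists line up. This is a sketch only: the helper name join_inputs and the file names in the usage line are hypothetical.

```python
def join_inputs(read1_files, read2_files=None, barcode_files=None):
    """Build -1/-2/-b arguments from parallel lists of input files.

    Raises ValueError if the lists differ in length, because the
    order of files must match across all three options.
    """
    groups = [("-1", read1_files), ("-2", read2_files), ("-b", barcode_files)]
    args = []
    for flag, files in groups:
        if files is None:
            continue  # option not used (e.g. single-end or bulk data)
        if len(files) != len(read1_files):
            raise ValueError(f"{flag} must list {len(read1_files)} files")
        args += [flag, ",".join(files)]
    return args


if __name__ == "__main__":
    print(join_inputs(["read11.fq", "read12.fq"], ["read21.fq", "read22.fq"]))
```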

--barcode-whitelist FILE

Cell barcode whitelist file. This is supposed to be a txt file where each line is a whitelisted barcode.

--read-format STR

Format for read files and barcode files ["r1:0:-1,bc:0:-1"]. The default corresponds to the 10x Genomics single-end format.

Output options

-o FILE

Output file.

--output-mappings-not-in-whitelist

Output mappings with barcode not in the whitelist.

--chr-order FILE

Customized chromosome order.

--BED

Output mappings in BED/BEDPE format. Note that only one output format (--BED, --TagAlign, --SAM or --pairs) should be set.

--TagAlign

Output mappings in TagAlign/PairedTagAlign format.

	--SAM		Output mappings in SAM format.
	--pairs		Output mappings in pairs format (defined by 4DN for HiC data).

--pairs-natural-chr-order FILE

Natural chromosome order for pairs flipping.

-v

Print version number to stdout.

Preset options

--preset STR

Preset []. This option applies multiple options at the same time. It should be given before other options, because options given later override the values set by --preset. Available STR are:

	chip		Mapping ChIP-seq reads (-l 2000 --remove-pcr-duplicates --low-mem --BED).
	atac		Mapping ATAC-seq/scATAC-seq reads (-l 2000 --remove-pcr-duplicates --low-mem --trim-adapters --Tn5-shift --remove-pcr-duplicates-at-cell-level --BED).
	hic		Mapping Hi-C reads (-e 4 -q 1 --low-mem --split-alignment --pairs).
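The override behavior of --preset can be modeled with a short sketch. The option values are copied from the list above; the dictionary representation and the function name effective_options are assumptions made for illustration and do not reflect chromap's own argument parser.

```python
# Options applied by each preset, as listed in this manual.
PRESETS = {
    "chip": {"-l": 2000, "--remove-pcr-duplicates": True,
             "--low-mem": True, "--BED": True},
    "atac": {"-l": 2000, "--remove-pcr-duplicates": True,
             "--low-mem": True, "--trim-adapters": True,
             "--Tn5-shift": True,
             "--remove-pcr-duplicates-at-cell-level": True,
             "--BED": True},
    "hic": {"-e": 4, "-q": 1, "--low-mem": True,
            "--split-alignment": True, "--pairs": True},
}


def effective_options(preset, later_options):
    """Return the options in effect when later_options follow --preset."""
    options = dict(PRESETS[preset])  # start from the preset's values
    options.update(later_options)    # options given later take precedence
    return options


if __name__ == "__main__":
    print(effective_options("hic", {"-q": 10}))
```

In this model, giving -q 10 after --preset hic raises the MAPQ cutoff while the remaining preset values stay in place.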