chromap - fast alignment and preprocessing of chromatin profiles
* Indexing the reference genome:
chromap -i [-k kmer] [-w miniWinSize] -r ref.fa -o ref.index
* Mapping (sc)ATAC-seq reads:
chromap --preset atac -r ref.fa -x ref.index -1 read1.fq -2 read2.fq -o aln.bed [-b barcode.fq.gz] [--barcode-whitelist whitelist.txt]
* Mapping ChIP-seq reads:
chromap --preset chip -r ref.fa -x ref.index -1 read1.fq -2 read2.fq -o aln.bed
* Mapping Hi-C reads:
chromap --preset hic -r ref.fa -x ref.index -1 read1.fq -2 read2.fq -o aln.pairs
chromap --preset hic -r ref.fa -x ref.index -1 read1.fq -2 read2.fq --SAM -o aln.sam
Chromap is an ultrafast method for aligning and preprocessing high-throughput chromatin profiles. Typical use cases include: (1) trimming sequencing adapters, mapping bulk ATAC-seq or ChIP-seq genomic reads to the human genome, and removing duplicates; (2) trimming sequencing adapters, mapping single-cell ATAC-seq genomic reads to the human genome, correcting barcodes, removing duplicates, and performing the Tn5 shift; (3) split alignment of Hi-C reads against a reference genome. In all three cases, Chromap is 10-20 times faster while being accurate.
Indexing options
-k INT
Minimizer k-mer length [17].
-w INT
Minimizer window size [7]. A minimizer is the smallest k-mer in a window of w consecutive k-mers.
--min-frag-length INT
Min fragment length used to choose k and w automatically [30]. Increase this value when the shortest fragments of interest are long; doing so can increase mapping speed. Note that the default value, 30, is the shortest fragment length that chromap can map.
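The minimizer definition given for -w can be illustrated with a short sketch. This is not chromap's implementation: it assumes k-mers are compared lexicographically on a single strand, whereas a real mapper would typically hash k-mers and consider both strands. The function name `minimizers` is hypothetical.

```python
def minimizers(seq, k, w):
    """Return sorted (position, k-mer) pairs selected as window minimizers.

    Assumption for illustration only: k-mers are ordered lexicographically
    and only the forward strand is considered.
    """
    # All k-mers of the sequence, indexed by their start position.
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    picked = set()
    # Slide a window over every run of w consecutive k-mers.
    for start in range(len(kmers) - w + 1):
        window = kmers[start:start + w]
        # Leftmost smallest k-mer within this window.
        offset = min(range(w), key=lambda j: window[j])
        picked.add((start + offset, window[offset]))
    return sorted(picked)


# Small example with k=3 and w=3 on a toy sequence.
for position, kmer in minimizers("ACGTTGCA", k=3, w=3):
    print(position, kmer)
```

Larger values of k and w select fewer minimizers per sequence, which is the trade-off behind choosing them from the minimum fragment length.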
Mapping options
--split-alignment
Allow split alignments. This option should be set only when mapping Hi-C reads.
-e INT
Max edit distance allowed to map a read [8].
-s INT
Min number of minimizers required to map a read [2].
-f INT1[,INT2]
Ignore minimizers occurring more than INT1 [500] times. INT2 [1000] is the threshold for a second round of seeding.
-l INT
Max insert size, only for paired-end read mapping [1000].
-q INT
Min MAPQ in range [0, 60] for mappings to be output [30].
--min-read-length INT
Skip mapping reads shorter than INT [30]. Note that this is different from the index option --min-frag-length, which sets -k and -w for indexing the genome.
--trim-adapters
Try to trim adapters at the 3' end of reads. This only works for paired-end reads. When the fragment length implied by a read pair is shorter than the read length, the two mates overlap each other; the regions outside the overlap are then treated as adapters and trimmed.
--remove-pcr-duplicates
Remove PCR duplicates.
--remove-pcr-duplicates-at-bulk-level
Remove PCR duplicates at bulk level for single cell data.
--remove-pcr-duplicates-at-cell-level
Remove PCR duplicates at cell level for single cell data.
--Tn5-shift
Perform Tn5 shift. When this option is turned on, the forward mapping start positions are increased by 4 bp and the reverse mapping end positions are decreased by 5 bp. Note that this works only when --SAM is NOT set.
--low-mem
Use low memory mode. When this option is set, multiple temporary intermediate mapping files might be generated on disk; they are merged at the end of processing to reduce memory usage. When this is NOT set, all mapping results are kept in memory before they are saved to disk, which is more efficient for datasets that are not too large.
--bc-error-threshold INT
Max Hamming distance allowed to correct a barcode [1]. Note that the max supported threshold is 2.
--bc-probability-threshold FLT
Min probability to correct a barcode [0.9]. When multiple whitelisted barcodes have the same Hamming distance to the barcode being corrected, chromap uses the base qualities of the mismatched bases to compute the probability that a correction is right.
-t INT
The number of threads for mapping [1].
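The coordinate adjustment described under --Tn5-shift can be written out as a small sketch. It assumes a mapping is represented as a (start, end, strand) tuple, which is a simplification chosen for illustration and not chromap's internal data structure; the function name `tn5_shift` is hypothetical.

```python
def tn5_shift(start, end, strand):
    """Apply the shift described for --Tn5-shift to one mapping.

    Forward mappings: start position increased by 4 bp.
    Reverse mappings: end position decreased by 5 bp.
    The (start, end, strand) representation is an assumption for illustration.
    """
    if strand == "+":
        return start + 4, end
    if strand == "-":
        return start, end - 5
    raise ValueError(f"unknown strand: {strand!r}")


# A forward and a reverse mapping over the same interval.
print(tn5_shift(100, 150, "+"))
print(tn5_shift(100, 150, "-"))
```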
Input options
-r FILE
Reference file.
-x FILE
Index file.
-1 FILE
Single-end read files or paired-end read files 1. Chromap supports multiple input files separated by ",". For example, setting this option to "read11.fq,read12.fq,read13.fq" uses all three files as input and maps them in this order. Similarly, -2 and -b also support multiple input files. The order of the input files must match across all three options.
-2 FILE
Paired-end read files 2.
-b FILE
Cell barcode files.
--barcode-whitelist FILE
Cell barcode whitelist file. This should be a plain-text file in which each line is one whitelisted barcode.
--read-format STR
Format for read files and barcode files ["r1:0:-1,bc:0:-1"]. The default corresponds to the 10x Genomics single-end format.
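The whitelist layout (one barcode per line) and the Hamming-distance limit set by --bc-error-threshold can be sketched as follows. This is only an illustration of the idea: the helper names are hypothetical, and the tie-breaking by base quality controlled by --bc-probability-threshold is deliberately not modelled, so the function simply returns every nearest candidate.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    if len(a) != len(b):
        raise ValueError("barcodes must have equal length")
    return sum(x != y for x, y in zip(a, b))


def parse_whitelist(text):
    """Parse whitelist content: one barcode per line, blank lines ignored."""
    return {line.strip() for line in text.splitlines() if line.strip()}


def correction_candidates(barcode, whitelist, max_dist=1):
    """Return the nearest whitelisted barcodes within max_dist mismatches.

    An exact match returns just that barcode. Several candidates at the same
    distance are all returned; resolving such ties is not modelled here.
    """
    if barcode in whitelist:
        return [barcode]
    best_dist, best = None, []
    for candidate in sorted(whitelist):
        if len(candidate) != len(barcode):
            continue  # cannot compare barcodes of different lengths
        dist = hamming(barcode, candidate)
        if dist > max_dist:
            continue
        if best_dist is None or dist < best_dist:
            best_dist, best = dist, [candidate]
        elif dist == best_dist:
            best.append(candidate)
    return best


whitelist = parse_whitelist("AAAA\nCCCC\nAAAT\n")
print(correction_candidates("AAAG", whitelist))
```

In the example, the observed barcode is one mismatch away from two whitelisted barcodes, which is the ambiguous situation that a probability threshold is meant to resolve.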
Output options
-o FILE
Output file.
--output-mappings-not-in-whitelist
Output mappings whose barcodes are not in the whitelist.
--chr-order FILE
Customized chromosome order.
--BED
Output mappings in BED/BEDPE format. Note that only one output format should be set.
--TagAlign
Output mappings in TagAlign/PairedTagAlign format.
--SAM
Output mappings in SAM format.
--pairs
Output mappings in pairs format (defined by 4DN for Hi-C data).
--pairs-natural-chr-order FILE
Natural chromosome order for pairs flipping.
-v
Print version number to stdout.
Preset options
--preset STR
Preset []. This option applies multiple options at the same time. It should be given before other options, because options given later overwrite the values set by --preset. Available values of STR are:
chip
Mapping ChIP-seq reads (-l 2000 --remove-pcr-duplicates --low-mem --BED).
atac
Mapping ATAC-seq/scATAC-seq reads (-l 2000 --remove-pcr-duplicates --low-mem --trim-adapters --Tn5-shift --remove-pcr-duplicates-at-cell-level --BED).
hic
Mapping Hi-C reads (-e 4 -q 1 --low-mem --split-alignment --pairs).
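Because options given after --preset overwrite the preset's values, and multi-file inputs must be listed in matching order, it can help to assemble the command line programmatically. The sketch below only builds an argument list and does not run chromap; the helper name `build_chromap_command` and its parameters are hypothetical, and only flags documented above are used.

```python
def build_chromap_command(preset, ref, index, output,
                          reads1, reads2=None, barcodes=None, overrides=None):
    """Build a chromap argument list with --preset first and overrides last.

    reads1, reads2 and barcodes are lists of file names; each list is joined
    with "," and all lists must have the same length so their order matches.
    """
    for flag, files in (("-2", reads2), ("-b", barcodes)):
        if files is not None and len(files) != len(reads1):
            raise ValueError(f"{flag} must list as many files as -1")
    # --preset goes first so that later options can overwrite its values.
    cmd = ["chromap", "--preset", preset, "-r", ref, "-x", index,
           "-1", ",".join(reads1)]
    if reads2:
        cmd += ["-2", ",".join(reads2)]
    if barcodes:
        cmd += ["-b", ",".join(barcodes)]
    cmd += ["-o", output]
    # Explicit overrides come after the preset, e.g. ["-q", "10"].
    cmd += list(overrides or [])
    return cmd


cmd = build_chromap_command(
    "atac", "ref.fa", "ref.index", "aln.bed",
    reads1=["read11.fq", "read12.fq"],
    reads2=["read21.fq", "read22.fq"],
    overrides=["-q", "10"],
)
print(" ".join(cmd))
```

The resulting list could be passed to `subprocess.run` once chromap is installed; keeping the overrides at the end mirrors the ordering rule stated for --preset.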