Quick Start

Pull the image and point it at your reads. This is the full workflow for most users.

docker pull feeka94/mcaat:1.0.0

docker run --rm -v "$(pwd)":/data feeka94/mcaat:1.0.0 \
  --input-files /data/reads_R1.fastq.gz /data/reads_R2.fastq.gz \
  --output-folder /data/results

When it finishes, open results/CRISPR_Arrays_1.txt. Each block in that file is one detected CRISPR array with its spacers. See Reading the Output for how to interpret it.

Installation

Docker (recommended)

No dependencies to install beyond Docker itself. Works on any machine that runs Docker, with no manual setup.
Image: hub.docker.com/r/feeka94/mcaat

Build from source

Use this on HPC clusters without Docker, or if you need to modify the binary. Requires CMake ≥ 3.12, a C++17 compiler, zlib, OpenMP, and BZip2.


git clone --recurse-submodules https://github.com/feeka94/mcaat.git
cd mcaat
chmod +x ./install.sh
./install.sh              # binary → build/mcaat
./install.sh --install    # also copies to system path

Submodules are required. If you cloned without --recurse-submodules, run git submodule update --init --recursive before building.
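
Before building, it can help to confirm that every submodule is actually checked out. The following is a minimal sketch, not part of MCAAT: the function name is hypothetical, and it relies only on the fact that git submodule status prefixes uninitialized submodules with a "-".

```shell
# Hypothetical helper: read `git submodule status --recursive` output on stdin
# and print the path of every submodule that is not initialized
# (git marks those lines with a leading "-").
uninitialized_submodules() {
  awk '/^-/ { print $2 }'
}

# Usage, from the repository root:
#   git submodule status --recursive | uninitialized_submodules
# No output means every submodule is checked out.
```

If any path is printed, run git submodule update --init --recursive and check again.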

Your Data

MCAAT accepts FASTQ files, plain or gzip-compressed. Pass one file for single-end reads, two files for paired-end.

Situation                           Command
Paired-end (typical Illumina run)   --input-files R1.fastq.gz R2.fastq.gz
Single-end or merged reads          --input-files reads.fastq.gz

Paired-end mode is preferred. Providing both files gives the graph builder more k-mer coverage across read pairs, which improves detection in lower-depth samples.
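
When scripting runs over many samples, the choice between one and two input files can be automated. This is a sketch under assumptions: the helper name is hypothetical, mates are assumed to follow the _R1 / _R2 naming used in the Quick Start example, and paths are assumed to contain no spaces.

```shell
# Hypothetical helper: given an R1 file, print the paths to pass to --input-files.
# Assumes mates are named *_R1.* / *_R2.* and that paths contain no spaces.
input_files_for() {
  r1=$1
  r2=$(printf '%s\n' "$r1" | sed 's/_R1\./_R2./')
  if [ "$r2" != "$r1" ] && [ -f "$r2" ]; then
    printf '%s %s\n' "$r1" "$r2"   # mate found: paired-end
  else
    printf '%s\n' "$r1"            # no mate: single-end
  fi
}

# Usage (word splitting of the unquoted substitution is intentional):
#   mcaat --input-files $(input_files_for reads_R1.fastq.gz) --output-folder results
```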

Tuning Parameters

The defaults work for most metagenomic datasets. You only need to change parameters if you are seeing unexpected results — too many hits, too few, or arrays that look truncated.

--cycle-min-length --cycle-max-length default: 27 / 77
These define the size of the cycles MCAAT searches for in the graph, roughly the combined length of one repeat and one spacer. CRISPR repeats are typically 23–47 bp and spacers 26–72 bp. The defaults cover the vast majority of known CRISPR systems. Only tighten these if you know your target organism's CRISPR type and want to reduce noise.
--threshold-multiplicity default: 20
Controls how many times a k-mer must appear in the reads before it is considered as a cycle start point. Think of it as a depth filter: low-coverage datasets will miss real arrays if this is too high. Lower it (e.g. 5–10) if your sequencing depth is <10×, or if the CRISPR host is a rare community member and you are getting zero hits. Raise it (e.g. 30–50) on deep datasets to cut false positives.
--low-abundance default: true
Enables a detection mode for metagenomes where the CRISPR-carrying organism is a minor fraction of the community. Leave it on for any environmental or gut metagenome. Set it to false only if you are running on a pure culture or a highly enriched sample where the target organism dominates.
--ram default: 95% of system RAM
MCAAT builds the entire de Bruijn graph in memory. On a shared HPC node, set this explicitly to stay within your allocated memory quota. Use suffixed values: 8G, 64G, 128G. The graph typically needs roughly 4–8 bytes per unique k-mer in your dataset.
--threads default: CPU cores − 2
On a laptop, the default is fine. On an HPC node, set this to the number of cores you have been allocated; otherwise MCAAT sizes its thread count from the node's total core count and competes with other jobs.
--autoclean default: true
After a run, MCAAT removes the intermediate graph/ and cycles/ directories. Set to false to keep them — useful if you plan to inspect the graph or re-run detection with different parameters on the same dataset without rebuilding the graph.
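
The memory guideline above (4–8 bytes per unique k-mer) can be turned into a quick estimate, and a scheduler allocation into explicit flags. This is a sketch only: both function names are hypothetical, the unique k-mer count must come from elsewhere (MCAAT is not assumed to report it in advance), the G suffix is assumed to mean GiB, and the Slurm variables are assumed to be set by your job (SLURM_MEM_PER_NODE is in MB and is only present when memory was requested per node).

```shell
# Rough estimate of graph memory, rounded up to whole GiB.
# $1 = number of unique k-mers (supplied by you, e.g. from a k-mer counter)
# $2 = bytes per k-mer (default 8, the top of the 4-8 byte range quoted above)
estimate_ram() {
  kmers=$1
  bytes_per_kmer=${2:-8}
  gib=1073741824
  echo "$(( (kmers * bytes_per_kmer + gib - 1) / gib ))G"
}

# Turn a scheduler allocation into MCAAT flags.
# $1 = memory in MB (as in SLURM_MEM_PER_NODE), $2 = allocated cores.
resource_flags() {
  echo "--ram $(( $1 / 1024 ))G --threads $2"
}

# Usage inside a Slurm job (unquoted substitution is intentional):
#   mcaat --input-files R1.fastq.gz R2.fastq.gz --output-folder results \
#     $(resource_flags "$SLURM_MEM_PER_NODE" "$SLURM_CPUS_PER_TASK")
```

For example, estimate_ram 2000000000 prints 15G for two billion unique k-mers at 8 bytes each. The estimate covers the graph only, so leave some headroom when requesting memory.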

Settings File

For repeated runs, store your parameters in a key=value file instead of typing long command lines. Pass it with --settings. Any flag you also pass on the command line overrides the file value.

input-files=/data/R1.fastq.gz /data/R2.fastq.gz
output-folder=results/run_1
ram=128G
threads=26
threshold-multiplicity=15
low-abundance=true
autoclean=true

mcaat --settings run.conf

input-files accepts one or two paths separated by spaces, commas, or semicolons.
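
Because command-line flags override the settings file, one file can serve as the base for a parameter sweep. The sketch below assumes mcaat is on your PATH and that run.conf exists; the function name, the MCAAT_CMD variable, and the results/mult_N folder naming are hypothetical conveniences, not MCAAT features.

```shell
# Run MCAAT once per threshold value, reusing one settings file.
# Each run overrides threshold-multiplicity and output-folder from the file.
# MCAAT_CMD is a hypothetical override for dry runs; it defaults to mcaat.
sweep_thresholds() {
  conf=$1
  shift
  for t in "$@"; do
    ${MCAAT_CMD:-mcaat} --settings "$conf" \
      --threshold-multiplicity "$t" \
      --output-folder "results/mult_$t"
  done
}

# Dry run: print the commands instead of executing them.
#   MCAAT_CMD="echo mcaat" sweep_thresholds run.conf 5 10 20
```

Writing each run to its own output folder keeps the results of different thresholds side by side for comparison.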

Reading the Output

results/
├── CRISPR_Arrays_1.txt  ← your results are here
├── graph/               ← succinct de Bruijn graph files (kept only if autoclean is false)
└── cycles/              ← raw cycle data, not needed for typical use (kept only if autoclean is false)

If many arrays are detected, output is split into numbered files: CRISPR_Arrays_1.txt, CRISPR_Arrays_2.txt, etc. The header of each file tells you the total count.
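
To check the total across split files yourself, the array headers can be counted directly. A small sketch, assuming every array block starts with a ">Array_" header line as in the example under Structure of an array block; the function name is hypothetical.

```shell
# Count detected arrays across every numbered output file in a results folder.
# Assumes each array block starts with a ">Array_" header line.
count_arrays() {
  cat "$1"/CRISPR_Arrays_*.txt | grep -c '^>Array_'
}

# Usage:
#   count_arrays results
```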

Structure of an array block

>Array_1  spacers=8
ATCGATCGATCGATCGATCGATCG
        --------------------    AACCCGGTTAATCGATCGTTTCGAGC
        --------------------    TTGGCCAATCGATCGATCAAAACGGG
        ATCGATCGATCGATCTATCG    GGAATTCCAATCGATCGAATACCCAC

Element                           What it means
>Array_1 spacers=8                Array identifier. The spacer count is a rough proxy for how long this organism has been exposed to phage: more spacers means more recorded encounters.
First sequence line               The consensus repeat sequence for this array. All spacers in one array share the same repeat, which is how MCAAT groups them.
Left column (dashes or sequence)  The repeat variant for this spacer row. Dashes mean it matches the consensus exactly. A sequence here means this particular repeat copy differs slightly; this is normal and expected at the leader end of an array, where newly acquired spacers sit.
Right column                      The spacer sequence itself. Each spacer represents one acquired target, typically a phage or mobile element sequence.
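
For downstream tools, the spacers usually need to be in FASTA form. The sketch below pulls them out of an array file, assuming the layout shown above: a ">Array_N" header, one consensus repeat line, then rows of two whitespace-separated columns with the spacer on the right. The function name and the record naming scheme are hypothetical.

```shell
# Convert MCAAT array blocks into a FASTA file with one record per spacer.
# Reads the named files, or stdin if none are given.
spacers_to_fasta() {
  awk '
    /^>/ { id = substr($1, 2); n = 0; want_repeat = 1; next }  # new array block
    id == "" { next }                                          # skip anything before the first array
    want_repeat && NF == 1 { want_repeat = 0; next }           # consensus repeat line
    NF == 2 { n++; printf ">%s_spacer_%d\n%s\n", id, n, $2 }   # repeat variant + spacer row
  ' "$@"
}

# Usage:
#   spacers_to_fasta results/CRISPR_Arrays_*.txt > spacers.fasta
```

Records are named after the array and the row position, for example >Array_1_spacer_1, so a hit can be traced back to its place in the array.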

Next Steps

Goal                              How
Identify what the spacers target  BLAST each spacer against viral databases (e.g. NCBI nr, IMG/VR, PHASTER). A hit tells you what phage or MGE this organism has been exposed to.
Find the PAM site                 Map spacers back to your reads or a reference genome to locate the protospacer. The flanking sequence at the 3′ end is the PAM.
Classify the CRISPR system type   Use the repeat sequence as input to CRISPRdb or CasFinder to identify the Cas subtype (I, II, III, etc.).
Re-run with adjusted sensitivity  See Reusing the Graph: rebuild only the cycle detection, not the graph.

Citation