MCAAT — User Manual
MCAAT finds CRISPR arrays directly in raw metagenomic reads — no assembly required. It builds a de Bruijn graph from your reads and detects the cyclic repeat-spacer pattern that defines a CRISPR array.
Quick Start
Pull the image and point it at your reads. This is the full workflow for most users.
docker pull feeka94/mcaat:1.0.0
docker run --rm -v "$(pwd)":/data feeka94/mcaat:1.0.0 \
  --input-files /data/reads_R1.fastq.gz /data/reads_R2.fastq.gz \
  --output-folder /data/results
When it finishes, open results/CRISPR_Arrays_1.txt.
Each block in that file is one detected CRISPR array with its spacers.
See Reading the Output for how to interpret it.
Installation
Docker (recommended)
No dependencies to install beyond Docker itself. The image runs on any machine where Docker is available, with no manual setup.
hub.docker.com/r/feeka94/mcaat
Build from source
Use this on HPC clusters without Docker, or if you need to modify the binary. Requires CMake ≥ 3.12, a C++17 compiler, zlib, OpenMP, and BZip2.
git clone --recurse-submodules https://github.com/feeka94/mcaat.git
cd mcaat
chmod +x ./install.sh
./install.sh # binary → build/mcaat
./install.sh --install # also copies to system path
If you cloned without --recurse-submodules, run git submodule update --init --recursive before building.
Your Data
MCAAT accepts FASTQ files, plain or gzip-compressed. Pass one file for single-end reads, two files for paired-end.
| Situation | Command |
|---|---|
| Paired-end (typical Illumina run) | --input-files R1.fastq.gz R2.fastq.gz |
| Single-end or merged reads | --input-files reads.fastq.gz |
Paired-end mode is preferred. Providing both files gives the graph builder more k-mer coverage across read pairs, which improves detection in lower-depth samples.
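When many samples need the same treatment, the choice between the two forms in the table can be scripted. The sketch below is one way to do it in POSIX shell. It assumes mates are named with `_R1.` and `_R2.` in otherwise identical file names; that naming convention and the helper names `mate_path` and `input_args` are illustrative assumptions, not part of MCAAT.

```shell
# Derive the R2 path from an R1 path.
# Assumption: mates differ only by "_R1." vs "_R2." in the file name.
mate_path() {
  printf '%s\n' "$1" | sed 's/_R1\./_R2./'
}

# Print the --input-files arguments for one sample: both mates if the
# derived R2 file exists, otherwise the single R1 file.
input_args() (
  r1="$1"
  r2="$(mate_path "$r1")"
  if [ "$r2" != "$r1" ] && [ -f "$r2" ]; then
    printf '%s %s %s\n' --input-files "$r1" "$r2"
  else
    printf '%s %s\n' --input-files "$r1"
  fi
)
```

For example, `mcaat $(input_args sample_R1.fastq.gz) --output-folder results/sample` expands to the paired-end form when `sample_R2.fastq.gz` is present. The unquoted expansion splits on whitespace, so this only suits paths without spaces.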
Tuning Parameters
The defaults work for most metagenomic datasets. You only need to change parameters if you are seeing unexpected results — too many hits, too few, or arrays that look truncated.
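One parameter that often needs a concrete number is --ram. This section quotes roughly 4–8 bytes per unique k-mer for the graph, and the arithmetic can be sketched in shell as follows. The helper name `estimate_ram_gib` is hypothetical, and the unique k-mer count is something you would have to estimate yourself (for example with a k-mer counting tool); MCAAT is not assumed to report it.

```shell
# Hypothetical helper: estimate graph memory in whole GiB.
#   $1 = number of unique k-mers (you must supply this estimate yourself)
#   $2 = bytes per k-mer, default 8 (upper end of the 4–8 byte range)
estimate_ram_gib() (
  kmers="$1"
  bytes_per_kmer="${2:-8}"
  total=$(( kmers * bytes_per_kmer ))
  # Round up to the next whole GiB (1 GiB = 1073741824 bytes).
  echo $(( (total + 1073741823) / 1073741824 ))
)

estimate_ram_gib 2000000000      # prints 15
estimate_ram_gib 2000000000 4    # prints 8
```

With two billion unique k-mers the estimate is 8 to 15 GiB, so a value such as --ram 16G would leave a little headroom. This covers only the rule of thumb quoted in this manual; actual usage may differ.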
--cycle-min-length
--cycle-max-length
default: 27 / 77
--threshold-multiplicity
default: 20
--low-abundance
default: true
Set to false only if you are running on a pure culture or a highly enriched sample where the target organism dominates.
--ram
default: 95% of system RAM
Example values: 8G, 64G, 128G. The graph typically needs roughly 4–8 bytes per unique k-mer in your dataset.
--threads
default: CPU cores − 2
--autoclean
default: true
When true, MCAAT removes the graph/ and cycles/ directories. Set to false to keep them, which is useful if you plan to inspect the graph or re-run detection with different parameters on the same dataset without rebuilding the graph.
Settings File
For repeated runs, store your parameters in a key=value file
instead of typing long command lines. Pass it with --settings.
Any flag you also pass on the command line overrides the file value.
input-files=/data/R1.fastq.gz /data/R2.fastq.gz
output-folder=results/run_1
ram=128G
threads=26
threshold-multiplicity=15
low-abundance=true
autoclean=true
mcaat --settings run.conf
input-files accepts one or two paths separated by spaces, commas, or semicolons.
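Before launching a long run, it can help to confirm that the settings file lists one or two inputs. The following is a minimal pre-flight sketch in POSIX shell, assuming the key=value layout shown above. The helper names `count_inputs` and `check_inputs` are hypothetical, and this is not how MCAAT itself parses the file.

```shell
# Count the paths on the input-files line of a key=value settings file.
# Commas and semicolons are treated like spaces, as the manual describes.
count_inputs() {
  grep '^input-files=' "$1" | cut -d= -f2- | tr ',;' '  ' | wc -w | tr -d ' '
}

# Succeeds only when the file lists one or two input paths.
check_inputs() (
  n="$(count_inputs "$1")"
  [ "$n" -eq 1 ] || [ "$n" -eq 2 ]
)
```

A typical use is `check_inputs run.conf && mcaat --settings run.conf`. Because the count splits on whitespace, paths that contain spaces will be miscounted.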
Reading the Output
results/
├── CRISPR_Arrays_1.txt ← your results are here
├── graph/ ← succinct de Bruijn graph files
└── cycles/ ← raw cycle data, not needed for typical use
If many arrays are detected, output is split into numbered files:
CRISPR_Arrays_1.txt, CRISPR_Arrays_2.txt, etc.
The header of each file tells you the total count.
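To cross-check that total against what is actually on disk, the array blocks can be counted across all numbered files. This is a small sketch, assuming every array block starts with a `>` header line as in the example in this manual and that no other line starts with `>`; the helper name `count_arrays` is hypothetical.

```shell
# Count array blocks across all CRISPR_Arrays_*.txt files in a folder.
# Assumption: every array block starts with a ">" header line.
count_arrays() {
  cat "$1"/CRISPR_Arrays_*.txt | grep -c '^>'
}
```

Calling `count_arrays results` prints a single number for the whole output folder.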
Structure of an array block
>Array_1 spacers=8
ATCGATCGATCGATCGATCGATCG
-------------------- AACCCGGTTAATCGATCGTTTCGAGC
-------------------- TTGGCCAATCGATCGATCAAAACGGG
ATCGATCGATCGATCTATCG GGAATTCCAATCGATCGAATACCCAC
| Element | What it means |
|---|---|
| >Array_1 spacers=8 | Array identifier. The spacer count is a rough proxy for how long this organism has been exposed to phage; more spacers means more recorded encounters. |
| First sequence line | The consensus repeat sequence for this array. All spacers in one array share the same repeat, which is how MCAAT groups them. |
| Left column (dashes or sequence) | The repeat variant for this spacer row. Dashes mean it matches the consensus exactly. A sequence here means this particular repeat copy differs slightly — this is normal and expected at the leader end of an array, where newly acquired spacers sit. |
| Right column | The spacer sequence itself. Each spacer represents one acquired target — typically a phage or mobile element sequence. |
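For downstream tools, the spacers are usually easier to handle as FASTA. The sketch below pulls the right-hand column out of a result file with awk. It relies only on the layout described in the table above (a `>` header, a single-field consensus line, then two-field spacer rows), which is an assumption about files beyond the example shown; the function name `spacers_to_fasta` and the record naming scheme are hypothetical.

```shell
# Convert the spacer rows of an MCAAT result file to FASTA on stdout.
# Assumptions about the layout (taken from the example block in this manual):
#   - each array starts with a ">" header such as ">Array_1 spacers=8"
#   - the consensus repeat is a line with a single field
#   - each spacer row has two fields: repeat variant (or dashes), then spacer
spacers_to_fasta() {
  awk '
    /^>/ { id = substr($1, 2); n = 0; next }
    id != "" && NF == 2 && $2 ~ /^[ACGTN]+$/ {
      n++
      printf(">%s_spacer_%d\n%s\n", id, n, $2)
    }
  ' "$1"
}
```

Running `spacers_to_fasta results/CRISPR_Arrays_1.txt > spacers.fasta` writes one record per spacer, named after its array, which can then serve as a query file for sequence searches.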
Next Steps
| Goal | How |
|---|---|
| Identify what the spacers target | BLAST each spacer against nucleotide databases that include viral sequences (e.g. NCBI nt, IMG/VR, PHASTER). A hit suggests which phage or MGE this organism has been exposed to. |
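| Find the PAM site | Map spacers back to your reads or a reference genome to locate the protospacer. The PAM lies in the sequence immediately flanking the protospacer; which side it sits on depends on the system type (by the usual convention, 3′ for type II and 5′ for type I). |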
| Classify the CRISPR system type | Use the repeat sequence as input to CRISPRdb or CasFinder to identify the Cas subtype (I, II, III, etc.). |
| Re-run with adjusted sensitivity | See Reusing the Graph — rebuild only the cycle detection, not the graph. |