# unitig-caller
[![Dev build Status](https://dev.azure.com/jlees/unitig-caller/_apis/build/status/johnlees.unitig-caller?branchName=master)](https://dev.azure.com/jlees/unitig-caller/_build/latest?definitionId=1&branchName=master)
[![Anaconda-Server Badge](https://anaconda.org/bioconda/unitig-caller/badges/version.svg)](https://anaconda.org/bioconda/unitig-caller)

Determines presence/absence of sequence elements in bacterial sequence
data using Bifrost Build and Query functions. Uses assemblies and/or reads as inputs.

unitig-caller provides two implementations: a wrapper around [Bifrost](https://github.com/pmelsted/bifrost),
which formats files for use with pyseer, and a caller which finds sequences
using an FM-index.

Build mode creates a compact de Bruijn graph using Bifrost. Query mode converts the .gfa
file produced by Build mode to a .fasta, using an associated colours file to query
the presence of unitigs in the source genomes used to build the original de Bruijn graph.

Simple mode finds presence of unitigs in a new population using an FM-index.

## Install

Use `unitig-caller` if installed through pip/conda, or
`python unitig_caller-runner.py` if using a clone of the code.

### With conda (recommended)
Get it from [bioconda](http://bioconda.github.io/):
```
conda install unitig-caller
```

If you haven't set up conda yet, first install
[miniconda](https://docs.conda.io/en/latest/miniconda.html). Then
add the correct channels:
```
conda config --add channels defaults
conda config --add channels bioconda
conda config --add channels conda-forge
```

### With pip
Get it from PyPI:
```
pip install unitig-caller
```

Requires [Bifrost](https://github.com/pmelsted/bifrost) version 1.0.3, installed and accessible
via PATH (see the installation steps on the Bifrost GitHub page).
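
As a quick check that Bifrost is reachable before running `--build` or `--query`, the lookup can be sketched in Python. This is a minimal sketch, not part of unitig-caller: the helper name `find_bifrost` is hypothetical, and the executable name `Bifrost` is taken from the default of the `--bifrost` option.

```python
import shutil
from typing import Optional


def find_bifrost(executable: str = "Bifrost") -> Optional[str]:
    """Return the full path of the executable if it is on PATH, else None."""
    # shutil.which searches the directories listed in the PATH variable
    return shutil.which(executable)
```

If `find_bifrost()` returns `None`, either add the Bifrost install directory to PATH or pass the executable's location with `--bifrost`.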

### From source
Requires `cmake`, `pthreads`, `pybind11` and a C++17 compiler (e.g. gcc >=7.3), in addition
to the pip requirements.
```
git clone https://github.com/johnlees/unitig-caller --recursive
cd unitig-caller
python setup.py install
```

## Usage

There are three ways to use this package:
1. Build a population graph to extract unitigs for GWAS with pyseer like [unitig-counter](https://github.com/johnlees/unitig-counter) (`--build` then `--query`).
2. Find these unitigs in a new population using a graph (`--build` and `--query`).
3. Find these unitigs in a new population using an index (`--simple`).

For 1), run `--build` mode followed by `--query` mode.

Both 2) and 3) find the original unitigs in a new population, so that pyseer models can be applied to it. They give the same results, but use different indexing tools.

For 2), first run `--build` mode to make a graph for the
new population. Then run `--query` mode with this graph, but the `--unitigs` from *the original population*.

For 3), run `--simple` mode giving the new genomes as `--refs` and the `--unitigs` from *the original population*.
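
The two-step workflow in 2) can be sketched as a small Python driver. This is an illustration only: the function names and file names are hypothetical, and it assumes that the prefix given to `--output` in Build mode is the one to pass as `--graph-prefix` in Query mode.

```python
import subprocess
from typing import List


def build_command(refs: str, output: str) -> List[str]:
    """Command line for --build mode on a list of assemblies."""
    return ["unitig-caller", "--build", "--refs", refs, "--output", output]


def query_command(graph_prefix: str, unitigs: str, output: str) -> List[str]:
    """Command line for --query mode against an existing graph."""
    return ["unitig-caller", "--query", "--graph-prefix", graph_prefix,
            "--unitigs", unitigs, "--output", output]


def call_in_new_population(new_refs: str, original_unitigs: str, prefix: str) -> None:
    """Build a graph for the new population, then query it with the original unitigs."""
    # Assumption: the build output prefix doubles as the graph prefix for the query
    subprocess.run(build_command(new_refs, prefix), check=True)
    subprocess.run(query_command(prefix, original_unitigs, prefix + "_calls"), check=True)
```

For example, `call_in_new_population("new_refs.txt", "original_unitigs.fasta", "new_pop")` would run both steps in order and stop with an error if either command fails.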

These modes are detailed below.

### Running Build mode
This uses Bifrost Build to generate a compact de Bruijn graph. By default this is a
coloured compact de Bruijn graph.
```
unitig-caller --build --refs refs.txt --reads reads.txt --output out_prefix
```

`--refs` is a required .txt file listing paths of input assemblies or read files
(.fasta or .fastq), each on a new line. Must be specified as either 'refs.txt' for assemblies
or 'reads.txt' for read files. No header row.

`--reads` is an optional .txt file listing paths to additional sequence files of a different type
to those specified in `--refs` (e.g. if 'refs.txt' is given in `--refs`, then 'reads.txt' is
given in `--reads`, and vice versa), each on a new line. No header row.

`--output` is the prefix for output files.
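
The list files for `--refs` and `--reads` are plain text with one path per line and no header. A minimal sketch for generating one from a directory follows; the helper name `write_path_list` and the directory layout are assumptions made for illustration.

```python
from pathlib import Path
from typing import Iterable, List


def write_path_list(directory: str, suffixes: Iterable[str], list_file: str) -> List[str]:
    """Write one sequence-file path per line (no header) and return the paths."""
    wanted = tuple(suffixes)
    # Keep only regular files whose names end in one of the wanted suffixes
    paths = sorted(str(p) for p in Path(directory).iterdir()
                   if p.is_file() and p.name.endswith(wanted))
    Path(list_file).write_text("".join(path + "\n" for path in paths))
    return paths
```

For example, `write_path_list("assemblies", [".fasta"], "refs.txt")` would list every `.fasta` file in a hypothetical `assemblies` directory, and a second call with `[".fastq"]` and `"reads.txt"` would do the same for read files.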

By default de Bruijn graphs are coloured, with an accompanying .bfg_colors being
generated alongside the .gfa file. To turn this off, use `--no_colour`. Note, Query mode
cannot be run without a .bfg_colors file.

To generate a clean de Bruijn graph (clip tips and delete isolated contigs shorter
than k k-mers in length), specify `--clean`.

Build mode automatically generates a .fasta file containing unitigs found within the graph.

### Running Query mode
Before running Query mode, generate a coloured compact de Bruijn graph using Build mode.
Then run the Query command as below.
```
unitig-caller --query --graph-prefix in_prefix --unitigs query_unitigs.fasta --output out_prefix
```

`--graph-prefix` is the required prefix for the .gfa, .bfg_colors and unitigs .fasta files generated from
`--build` mode applied to the new population.

`--unitigs` is an optional .fasta file, specifying a separate unitigs .fasta file that was
generated by `--build` mode on another graph. If not specified, unitigs from the graph will be used,
generating calls for this population.

`--output` is the prefix for output files.

The sensitivity of querying can be altered by passing a float argument to `--ratiok`
(between 0 and 1, default 1.0), which determines the threshold proportion of k-mers of a
specific colour present in a unitig for colour classification. Specifying `--inexact` will
search the graph for both exact and inexact k-mers (1 substitution or indel) from queries.
Lowering `--ratiok` and/or specifying `--inexact` will result in more colour hits per unitig,
but will increase the probability of false positives and the run-time.
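
To make the effect of `--ratiok` concrete, the thresholding can be sketched as below. This illustrates the idea rather than Bifrost's own code: the function names are hypothetical, and it assumes a colour is called when the proportion of matching k-mers is at least `ratiok`.

```python
def kmer_count(unitig_length: int, k: int = 31) -> int:
    """Number of k-mers in a sequence of the given length."""
    return max(unitig_length - k + 1, 0)


def colour_is_called(kmers_found: int, unitig_length: int,
                     ratiok: float = 1.0, k: int = 31) -> bool:
    """True if the share of the unitig's k-mers found in a colour reaches ratiok."""
    total = kmer_count(unitig_length, k)
    if total == 0:
        return False  # sequence shorter than k: nothing to match
    # Assumption: the comparison is inclusive (>=)
    return kmers_found / total >= ratiok
```

A 50 bp unitig contains 20 k-mers at the default k of 31. If 15 of them are found in a given colour, the unitig is not called in that colour with the default `--ratiok 1.0`, but would be called with `--ratiok 0.75`.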

### Running simple mode
This uses suffix arrays (FM-index) provided by [SeqAn3](https://www.seqan.de/) to perform
string matches:
```
unitig-caller --simple --refs strain_list.txt --unitigs queries.txt --output calls
```

`--refs` is a required file listing input assemblies, name followed by location
of fasta file (tab separated), each on a new line. No header row.
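
A minimal sketch for writing this two-column file from a list of assemblies follows. The helper name `write_strain_list` is hypothetical, and using the file name without its final extension as the sample name is an assumption; any unique name would do.

```python
from pathlib import Path
from typing import List, Tuple


def write_strain_list(fasta_paths: List[str], list_file: str) -> List[Tuple[str, str]]:
    """Write 'name<TAB>path' lines with no header and return the (name, path) rows."""
    # Path.stem drops only the last extension, e.g. 'strainA.fasta' -> 'strainA'
    rows = [(Path(p).stem, p) for p in fasta_paths]
    Path(list_file).write_text("".join(f"{name}\t{path}\n" for name, path in rows))
    return rows
```

Note that for compressed files such as `x.fa.gz` the derived name would be `x.fa`, so names may need adjusting by hand in that case.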

`--unitigs` is a required list of the unitig sequences to call. The unitigs need
to be in the first column (tab separated). A header row is assumed, so
output from [pyseer](https://github.com/mgalardini/pyseer) etc can be directly used.
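
To preview which sequences would be queried from such a file, the first column can be read while skipping the header. This is a sketch of the file layout described above, not the parser unitig-caller itself uses; the function name `read_unitig_queries` is hypothetical.

```python
from typing import List


def read_unitig_queries(path: str) -> List[str]:
    """Return the first tab-separated column of each row, skipping the header row."""
    with open(path) as handle:
        next(handle, None)  # a header row is assumed, so discard the first line
        # Ignore blank lines; keep only the first column of each remaining row
        return [line.split("\t")[0].strip() for line in handle if line.strip()]
```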

With `--output calls` as above, `calls_pyseer.txt` will contain unitig calls in seer/pyseer k-mer format.
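
The calls file can be loaded for downstream checks with a short parser. This sketch assumes each line has the form `SEQUENCE | sample1:1 sample2:1 ...`, which is the usual seer/pyseer k-mer layout; the function name `read_pyseer_calls` is hypothetical.

```python
from typing import Dict, Set


def read_pyseer_calls(path: str) -> Dict[str, Set[str]]:
    """Map each unitig sequence to the set of sample names in which it was called."""
    calls: Dict[str, Set[str]] = {}
    with open(path) as handle:
        for line in handle:
            if "|" not in line:
                continue  # skip blank or unexpected lines
            sequence, _, samples = line.partition("|")
            # Each entry is 'sample:count'; rsplit keeps sample names containing ':'
            calls[sequence.strip()] = {entry.rsplit(":", 1)[0]
                                       for entry in samples.split()}
    return calls
```

For example, `read_pyseer_calls("calls_pyseer.txt")` returns a dictionary from which the number of samples carrying each unitig can be read off with `len()`.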

By default FM-indexes are saved in the same location as the assembly files so that they can
be quickly loaded by subsequent runs. To turn this off use `--no-save-idx`.

### Option reference
```
usage: unitig-caller [-h] (--build | --query | --simple) [--refs REFS]
                     [--reads READS] [--graph-prefix GRAPH_PREFIX]
                     [--unitigs UNITIGS] [--output OUTPUT] [--no_colour]
                     [--clean] [--ratiok RATIOK] [--inexact]
                     [--kmer_size KMER_SIZE] [--minimizer_size MINIMIZER_SIZE]
                     [--no-save-idx] [--threads THREADS] [--bifrost BIFROST]
                     [--version]

Call unitigs in a population dataset

optional arguments:
  -h, --help            show this help message and exit

Mode of operation:
  --build               Build coloured/uncoloured de Bruijn graph using
                        Bifrost
  --query               Query unitig presence/absence across input genomes
  --simple              Use FM-index to make calls

Unitig-caller input/output:
  --refs REFS           Ref file to use to --build bifrost graph (or with
                        --simple)
  --reads READS         Read file to use to --build bifrost graph
  --graph-prefix GRAPH_PREFIX
                        Prefix of bifrost graph to --query
  --unitigs UNITIGS     fasta file of unitigs to query (--query or --simple)
  --output OUTPUT       Prefix for output [default = 'unitig_caller']

Build Input/output:
  --no_colour           Specify for uncoloured de Bruijn Graph [default =
                        False]
  --clean               Clean DBG (clip tips and delete isolated contigs
                        shorter than k k-mers in length) [default = False]

Query Input/output:
  --ratiok RATIOK       ratio of k-mers from queries that must occur in the
                        graph to be considered as belonging to colour [default
                        = 1.0]
  --inexact             Graph is searched with exact and inexact k-mers (1
                        substitution or indel) from queries [default = False]

Bifrost options:
  --kmer_size KMER_SIZE
                        K-mer size for graph building/querying [default = 31]
  --minimizer_size MINIMIZER_SIZE
                        Minimizer size to be used for k-mer hashing [default =
                        23]

Simple mode options:
  --no-save-idx         Do not save FM-indexes for reuse

Other:
  --threads THREADS     Number of threads to use [default = 1]
  --bifrost BIFROST     Location of bifrost executable [default = Bifrost]
  --version             show program's version number and exit
```

## Citation

If you use this, please cite the Bifrost paper:

Holley, G., Melsted, P. Bifrost – Highly parallel construction and indexing of colored and compacted de Bruijn graphs.
bioRxiv 695338 (2019). doi: https://doi.org/10.1101/695338
