Metadata-Version: 2.1
Name: unitig-caller
Version: 1.1.0
Summary: unitig-caller: wrapper around Bifrost to detect presence of sequence elements
Home-page: https://github.com/johnlees/unitig-caller
Author: John Lees
Author-email: john@johnlees.me
License: Apache Software License
Description: # unitig-caller
        [![Dev build Status](https://dev.azure.com/jlees/unitig-caller/_apis/build/status/johnlees.unitig-caller?branchName=master)](https://dev.azure.com/jlees/unitig-caller/_build/latest?definitionId=1&branchName=master)
        [![Anaconda-Server Badge](https://anaconda.org/bioconda/unitig-caller/badges/version.svg)](https://anaconda.org/bioconda/unitig-caller)
        
        Determines presence/absence of sequence elements in bacterial sequence
        data using Bifrost Build and Query functions. Uses assemblies and/or reads as inputs.
        
        unitig-caller wraps [Bifrost](https://github.com/pmelsted/bifrost), formatting its
        output for use with pyseer, and also includes a separate implementation which calls
        sequences using an FM-index.
        
        Build mode creates a compact de Bruijn graph using Bifrost. Query mode converts the .gfa
        file produced by Build mode to a .fasta, using an associated colours file to query
        the presence of unitigs in the source genomes used to build the original de Bruijn graph.
        
        Simple mode finds presence of unitigs in a new population using an FM-index.
        
        ## Install
        
        Use `unitig-caller` if installed through pip/conda, or
        `python unitig_caller-runner.py` if using a clone of the code.
        
        ### With conda (recommended)
        Get it from [bioconda](http://bioconda.github.io/):
        ```
        conda install unitig-caller
        ```
        
        If you haven't set this up, first install
        [miniconda](https://docs.conda.io/en/latest/miniconda.html). Then
        add the correct channels:
        ```
        conda config --add channels defaults
        conda config --add channels bioconda
        conda config --add channels conda-forge
        ```
        
        ### With pip
        Get it from PyPI:
        ```
        pip install unitig-caller
        ```
        
        Requires [bifrost](https://github.com/pmelsted/bifrost) version 1.0.3 to be installed and accessible
        via PATH (see the installation steps on the Bifrost GitHub page).
        
        ### From source
        Requires `cmake`, `pthreads`, `pybind11` and a C++17 compiler (e.g. gcc >=7.3), in addition
        to the pip requirements.
        ```
        git clone https://github.com/johnlees/unitig-caller --recursive
        cd unitig-caller
        python setup.py install
        ```
        
        ## Usage
        
        There are three ways to use this package:
        1. Build a population graph to extract unitigs for GWAS with pyseer like [unitig-counter](https://github.com/johnlees/unitig-counter) (`--build`).
        2. Find these unitigs in a new population using a graph (`--build` and `--query`).
        3. Find these unitigs in a new population using an index (`--simple`).
        
        For 1), run `--build` mode followed by `--query` mode.
        
        Both 2) and 3) give the same results using different indexing tools: each finds the unitigs in a new population so that pyseer models can be applied to it.
        
        For 2), first run `--build` mode to make a graph for the
        new population. Then run `--query` mode with this graph, but the `--unitigs` from *the original population*.
        
        For 3), run `--simple` mode giving the new genomes as `--refs` and the `--unitigs` from *the original population*.
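        
        The three workflows above can be sketched as command lists built from the documented
        flags. This is only an illustration: the helper names and every file name are
        hypothetical, and it assumes the `--output` prefix given to `--build` is the prefix
        later passed to `--graph-prefix`. Nothing is executed; the commands are printed.
        
        ```python
        from typing import List, Optional
        
        
        def build_cmd(refs: str, output: str, reads: Optional[str] = None) -> List[str]:
            """Assemble a --build invocation (hypothetical helper)."""
            cmd = ["unitig-caller", "--build", "--refs", refs, "--output", output]
            if reads is not None:
                cmd += ["--reads", reads]
            return cmd
        
        
        def query_cmd(graph_prefix: str, output: str,
                      unitigs: Optional[str] = None) -> List[str]:
            """Assemble a --query invocation (hypothetical helper)."""
            cmd = ["unitig-caller", "--query", "--graph-prefix", graph_prefix,
                   "--output", output]
            if unitigs is not None:
                cmd += ["--unitigs", unitigs]
            return cmd
        
        
        def simple_cmd(refs: str, unitigs: str, output: str) -> List[str]:
            """Assemble a --simple invocation (hypothetical helper)."""
            return ["unitig-caller", "--simple", "--refs", refs,
                    "--unitigs", unitigs, "--output", output]
        
        
        # Workflow 2: graph the new population, then query it with the
        # unitigs from the original population. File names are placeholders.
        workflow_2 = [
            build_cmd("new_refs.txt", "new_pop"),
            query_cmd("new_pop", "new_pop_calls", unitigs="original_unitigs.fasta"),
        ]
        
        # Workflow 3: index the new genomes directly.
        workflow_3 = [simple_cmd("strain_list.txt", "original_unitigs.txt", "calls")]
        
        for cmd in workflow_2 + workflow_3:
            print(" ".join(cmd))
        ```
        
        To run a command rather than print it, each list could be passed to
        `subprocess.run(cmd, check=True)`.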
        
        These modes are detailed below.
        
        ### Running Build mode
        This uses Bifrost Build to generate a compact de Bruijn graph. By default this is a
        coloured compact de Bruijn graph.
        ```
        unitig-caller --build --refs refs.txt --reads reads.txt --output out_prefix
        ```
        
        `--refs` is a required .txt file listing paths of input assemblies or read files
        (.fasta or .fastq), each on a new line. Must be specified as either 'refs.txt' for assemblies
        or 'reads.txt' for read files. No header row.
        
        `--reads` is an optional .txt file listing paths to additional sequence files of a different type
        to those specified in `--refs` (e.g. if 'refs.txt' is given in `--refs`, then 'reads.txt' will
        be given in `--reads` and vice versa), each on a new line. No header row.
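        
        As a minimal sketch of the list-file format described above (one path per line, no
        header row), the snippet below writes such a file from a directory of assemblies.
        The helper name `write_path_list` and the sample file names are hypothetical; the
        demo runs in a temporary directory with empty placeholder files.
        
        ```python
        import tempfile
        from pathlib import Path
        from typing import Iterable, List
        
        
        def write_path_list(files: Iterable[Path], list_file: Path) -> List[str]:
            """Write one path per line, with no header row; return the lines written."""
            lines = [str(path) for path in sorted(files)]
            list_file.write_text("".join(line + "\n" for line in lines))
            return lines
        
        
        # Demo in a throwaway directory holding two empty placeholder assemblies.
        with tempfile.TemporaryDirectory() as tmp:
            asm_dir = Path(tmp)
            for name in ("sample_b.fasta", "sample_a.fasta"):
                (asm_dir / name).touch()
            written = write_path_list(asm_dir.glob("*.fasta"), asm_dir / "refs.txt")
            listed = (asm_dir / "refs.txt").read_text().splitlines()
        
        print(len(listed), "paths listed")
        ```
        
        The same helper could be pointed at `.fastq` files to produce a 'reads.txt' list.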
        
        `--output` is the prefix for output files.
        
        By default de Bruijn graphs are coloured, with an accompanying .bfg_colors file
        generated alongside the .gfa file. To turn this off, use `--no_colour`. Note that Query mode
        cannot be run without a .bfg_colors file.
        
        To generate a clean de Bruijn graph (clip tips and delete isolated contigs shorter
        than k k-mers in length), specify `--clean`.
        
        Build mode automatically generates a .fasta file containing unitigs found within the graph.
        
        ### Running Query mode
        Before running Query mode, generate a coloured compact de Bruijn graph using Build mode.
        Then run the Query command as below.
        ```
        unitig-caller --query --graph-prefix in_prefix --unitigs query_unitigs.fasta --output out_prefix
        ```
        
        `--graph-prefix` is the required prefix for the .gfa, .bfg_colors and unitigs .fasta files generated from
        `--build` mode applied to the new population.
        
        `--unitigs` is an optional .fasta file, specifying a separate unitigs .fasta file that was
        generated by `--build` mode on another graph. If not specified, unitigs from the graph will be used,
        generating calls for this population.
        
        `--output` is the prefix for output files.
        
        The sensitivity of querying can be altered by passing a float argument to `--ratiok`
        (between 0 and 1, default 1.0), which determines the threshold proportion of k-mers of a
        specific colour present in a unitig for colour classification. Specifying `--inexact` will
        search the graph for both exact and inexact k-mers (1 substitution or indel) from queries.
        Lowering `--ratiok` and/or specifying `--inexact` will result in more colour hits per unitig,
        but will increase the probability of false positives and the run-time.
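        
        The effect of the `--ratiok` threshold can be illustrated with a toy model. This is
        not Bifrost's implementation: it is a simplified sketch that ignores reverse
        complements and inexact matching, and the function names and sequences are made up
        for the example. A colour is called present when the fraction of the unitig's
        k-mers carrying that colour is at least the threshold.
        
        ```python
        from typing import Dict, Set
        
        
        def kmers(seq: str, k: int) -> Set[str]:
            """Return the distinct k-mers of seq."""
            return {seq[i:i + k] for i in range(len(seq) - k + 1)}
        
        
        def call_colours(unitig: str, colour_kmers: Dict[str, Set[str]],
                         k: int, ratiok: float = 1.0) -> Dict[str, bool]:
            """Call a colour present if >= ratiok of the unitig's k-mers carry it."""
            query = kmers(unitig, k)
            if not query:
                raise ValueError("unitig is shorter than k")
            calls = {}
            for colour, present in colour_kmers.items():
                fraction = len(query & present) / len(query)
                calls[colour] = fraction >= ratiok
            return calls
        
        
        # Toy data with k = 3: genome_b lacks one of the four query k-mers.
        unitig = "ACGTAC"
        colour_kmers = {
            "genome_a": {"ACG", "CGT", "GTA", "TAC"},
            "genome_b": {"ACG", "CGT", "GTA"},
        }
        
        strict = call_colours(unitig, colour_kmers, k=3, ratiok=1.0)
        relaxed = call_colours(unitig, colour_kmers, k=3, ratiok=0.75)
        print(strict)
        print(relaxed)
        ```
        
        With the strict threshold only `genome_a` is called; relaxing it to 0.75 also
        admits `genome_b`, which is how a lower ratio yields more hits per unitig.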
        
        ### Running simple mode
        This uses suffix arrays (FM-index) provided by [SeqAn3](https://www.seqan.de/) to perform
        string matches:
        ```
        unitig-caller --simple --refs strain_list.txt --unitigs queries.txt --output calls
        ```
        
        `--refs` is a required file listing input assemblies, name followed by location
        of fasta file (tab separated), each on a new line. No header row.
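        
        A short sketch of the `--refs` layout for simple mode (name, then fasta location,
        tab separated, no header row). The helper name `strain_list_text` and the sample
        names and paths are hypothetical.
        
        ```python
        import csv
        import io
        from typing import Iterable, Tuple
        
        
        def strain_list_text(samples: Iterable[Tuple[str, str]]) -> str:
            """Render (name, fasta_path) pairs as tab-separated lines, no header."""
            buffer = io.StringIO()
            writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
            writer.writerows(samples)
            return buffer.getvalue()
        
        
        text = strain_list_text([
            ("sample_a", "assemblies/sample_a.fasta"),
            ("sample_b", "assemblies/sample_b.fasta"),
        ])
        print(text, end="")
        ```
        
        The returned text could then be saved with, for example,
        `pathlib.Path("strain_list.txt").write_text(text)`.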
        
        `--unitigs` is a required list of the unitig sequences to call. The unitigs need
        to be in the first column (tab separated). A header row is assumed, so
        output from [pyseer](https://github.com/mgalardini/pyseer) and similar tools can be used directly.
        
        `calls_pyseer.txt` will contain unitig calls in seer/pyseer k-mer format.
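        
        For downstream inspection, the calls file can be read back into Python. The sketch
        below assumes each line has the form `SEQUENCE | sample_a:1 sample_b:1`, which is
        my understanding of the seer/pyseer k-mer format; check a real output file before
        relying on it. The function name is hypothetical.
        
        ```python
        from typing import List, Tuple
        
        
        def parse_call_line(line: str) -> Tuple[str, List[str]]:
            """Split one 'SEQUENCE | sample:1 ...' line into (sequence, samples)."""
            sequence, _, hits = line.strip().partition("|")
            # Each hit is assumed to look like 'sample_name:1'; keep the name only.
            samples = [hit.rsplit(":", 1)[0] for hit in hits.split()]
            return sequence.strip(), samples
        
        
        example = "ACGTACGT | sample_a:1 sample_b:1\n"
        sequence, samples = parse_call_line(example)
        print(sequence, samples)
        ```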
        
        By default FM-indexes are saved in the same location as the assembly files so that they can
        be quickly loaded by subsequent runs. To turn this off use `--no-save-idx`.
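        
        Conceptually, the string matching in simple mode answers the question "which
        genomes contain this unitig exactly?". The naive sketch below shows that question
        with Python's substring operator. It is not the FM-index implementation, it
        searches the forward strand only (how the tool treats the reverse strand is not
        stated here), and the sequences and names are made up.
        
        ```python
        from typing import Dict, List
        
        
        def naive_calls(unitigs: List[str], genomes: Dict[str, str]) -> Dict[str, List[str]]:
            """For each unitig, list the genomes containing it as an exact substring."""
            return {
                unitig: [name for name, seq in genomes.items() if unitig in seq]
                for unitig in unitigs
            }
        
        
        genomes = {"sample_a": "TTACGTACGGA", "sample_b": "TTACGAACGGA"}
        calls = naive_calls(["ACGTAC", "ACGG"], genomes)
        print(calls)
        ```
        
        An FM-index gives the same exact-match answers without rescanning each genome
        for every query, which is why the saved indexes speed up subsequent runs.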
        
        ### Option reference
        ```
        usage: unitig-caller [-h] (--build | --query | --simple) [--refs REFS]
                             [--reads READS] [--graph-prefix GRAPH_PREFIX]
                             [--unitigs UNITIGS] [--output OUTPUT] [--no_colour]
                             [--clean] [--ratiok RATIOK] [--inexact]
                             [--kmer_size KMER_SIZE] [--minimizer_size MINIMIZER_SIZE]
                             [--no-save-idx] [--threads THREADS] [--bifrost BIFROST]
                             [--version]
        
        Call unitigs in a population dataset
        
        optional arguments:
          -h, --help            show this help message and exit
        
        Mode of operation:
          --build               Build coloured/uncoloured de Bruijn graph using
                                Bifrost
          --query               Query unitig presence/absence across input genomes
          --simple              Use FM-index to make calls
        
        Unitig-caller input/output:
          --refs REFS           Ref file to use to --build bifrost graph (or with
                                --simple)
          --reads READS         Read file to use to --build bifrost graph
          --graph-prefix GRAPH_PREFIX
                                Prefix of bifrost graph to --query
          --unitigs UNITIGS     fasta file of unitigs to query (--query or --simple)
          --output OUTPUT       Prefix for output [default = 'unitig_caller']
        
        Build Input/output:
          --no_colour           Specify for uncoloured de Bruijn Graph [default =
                                False]
          --clean               Clean DBG (clip tips and delete isolated contigs
                                shorter than k k-mers in length) [default = False]
        
        Query Input/output:
          --ratiok RATIOK       ratio of k-mers from queries that must occur in the
                                graph to be considered as belonging to colour [default
                                = 1.0]
          --inexact             Graph is searched with exact and inexact k-mers (1
                                substitution or indel) from queries [default = False]
        
        Bifrost options:
          --kmer_size KMER_SIZE
                                K-mer size for graph building/querying [default = 31]
          --minimizer_size MINIMIZER_SIZE
                                Minimizer size to be used for k-mer hashing [default =
                                23]
        
        Simple mode options:
          --no-save-idx         Do not save FM-indexes for reuse
        
        Other:
          --threads THREADS     Number of threads to use [default = 1]
          --bifrost BIFROST     Location of bifrost executable [default = Bifrost]
          --version             show program's version number and exit
        ```
        
        ## Citation
        
        If you use this, please cite the Bifrost paper:
        
        Holley, G., Melsted, P. Bifrost – Highly parallel construction and indexing of colored and compacted de Bruijn graphs.
        bioRxiv 695338 (2019). doi: https://doi.org/10.1101/695338
        
Keywords: gwas bacteria k-mer unitig
Platform: UNKNOWN
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Science/Research
Classifier: Topic :: Scientific/Engineering :: Bio-Informatics
Classifier: License :: OSI Approved :: Apache Software License
Classifier: Programming Language :: Python :: 3.6
Classifier: Programming Language :: Python :: 3.7
Description-Content-Type: text/markdown
