Close assembly gaps using long-reads at high accuracy.

Keywords: bioinformatics, close-assembly-gaps, cluster, daligner, damapper, docker, dub, gap-filling, genome-assembly, long-reads, pacbio, singularity, snakemake

View the Project on GitHub a-ludi/dentist

DENTIST

Long sequencing reads can increase the contiguity and completeness of fragmented, short-read-based genome assemblies by closing assembly gaps, ideally at high accuracy. DENTIST is a sensitive, highly accurate, and automated pipeline for closing gaps in (short-read) assemblies with long reads.

API documentation: current, v4.0.0, v3.0.0, v2.0.0

First time here? Head over to the example and make sure it works.

Install

Make sure Mamba (a frontend for Conda) is installed on your system. You can then use DENTIST like so:

# run the whole workflow locally on all available cores using Conda
snakemake --configfile=snakemake.yml --use-conda -jall

# run the whole workflow on a cluster using Conda
snakemake --configfile=snakemake.yml --use-conda --profile=slurm

The last command is explained in more detail below in the usage section.

Note: If you do not have mamba installed, you may need to pass --conda-frontend=conda to Snakemake.
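
The note above can be folded into a small shell check. This is only a minimal sketch: it assumes nothing beyond `mamba` possibly being absent from your PATH, the variable name `frontend` is illustrative, and it prints the Snakemake call instead of running it.

```shell
# Choose the Conda frontend: prefer mamba, fall back to conda.
if command -v mamba >/dev/null 2>&1; then
    frontend=mamba
else
    frontend=conda
fi

# Print the resulting call; remove the `echo` to actually run it.
echo snakemake --configfile=snakemake.yml --use-conda \
    --conda-frontend="$frontend" --cores=all
```

Passing the frontend explicitly keeps the same command line usable on machines with and without Mamba.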

Use Conda to Manually Install Binaries

Make sure Mamba (a frontend for Conda) is installed on your system. Install DENTIST and all dependencies like so:

mamba create -n dentist -c a_ludi -c bioconda dentist-core
mamba activate dentist
mamba install -c conda-forge -c bioconda snakemake

# execute the workflow
snakemake --configfile=snakemake.yml --cores=all

More details on executing DENTIST can be found in the usage section.

Use Pre-Built Binaries

Download the latest pre-built binaries from the releases section and extract the contents. The pre-built binaries are stored in a subfolder called bin. Here are the instructions for v4.0.0:

# download & extract pre-built binaries
wget https://github.com/a-ludi/dentist/releases/download/v4.0.0/dentist.v4.0.0.x86_64.tar.gz
tar -xzf dentist.v4.0.0.x86_64.tar.gz

# make binaries available to your shell
cd dentist.v4.0.0.x86_64
PATH="$PWD/bin:$PATH"

# check installation with
dentist -d
# Expected output:
# 
#daligner (part of `DALIGNER`; see https://github.com/thegenemyers/DALIGNER) [OK]
#damapper (part of `DAMAPPER`; see https://github.com/thegenemyers/DAMAPPER) [OK]
#DAScover (part of `DASCRUBBER`; see https://github.com/thegenemyers/DASCRUBBER) [OK]
#DASqv (part of `DASCRUBBER`; see https://github.com/thegenemyers/DASCRUBBER) [OK]
#DBdump (part of `DAZZ_DB`; see https://github.com/thegenemyers/DAZZ_DB) [OK]
#DBdust (part of `DAZZ_DB`; see https://github.com/thegenemyers/DAZZ_DB) [OK]
#DBrm (part of `DAZZ_DB`; see https://github.com/thegenemyers/DAZZ_DB) [OK]
#DBshow (part of `DAZZ_DB`; see https://github.com/thegenemyers/DAZZ_DB) [OK]
#DBsplit (part of `DAZZ_DB`; see https://github.com/thegenemyers/DAZZ_DB) [OK]
#fasta2DAM (part of `DAZZ_DB`; see https://github.com/thegenemyers/DAZZ_DB) [OK]
#fasta2DB (part of `DAZZ_DB`; see https://github.com/thegenemyers/DAZZ_DB) [OK]
#computeintrinsicqv (part of `daccord`; see https://gitlab.com/german.tischler/daccord) [OK]
#daccord (part of `daccord`; see https://gitlab.com/german.tischler/daccord) [OK]

The tarball additionally contains the Snakemake workflow, example config files and this README. In short, everything you need to run DENTIST.
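
If `dentist` itself is not yet on your PATH, a rough pre-check of the external tools can be done with plain shell. The sketch below takes the tool names from the expected output above and only tests whether each one can be found; it does not check versions, so `dentist -d` remains the authoritative check. The variable names are illustrative.

```shell
# Tool names copied from the expected output of `dentist -d` above.
tools="daligner damapper DAScover DASqv DBdump DBdust DBrm DBshow DBsplit
fasta2DAM fasta2DB computeintrinsicqv daccord"

missing=0
for tool in $tools; do
    if command -v "$tool" >/dev/null 2>&1; then
        echo "$tool [OK]"
    else
        echo "$tool [MISSING]"
        missing=$((missing + 1))
    fi
done
echo "missing tools: $missing"
```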

Use a Singularity Container (discouraged)

Remark: depending on your system, the Singularity container may not work properly (see issue #30).

Make sure Singularity is installed on your system. You can then use the container like so:

# launch an interactive shell
singularity shell docker://aludi/dentist:stable

# execute a single command inside the container
singularity exec docker://aludi/dentist:stable dentist --version

# run the whole workflow on a cluster using Singularity
snakemake --configfile=snakemake.yml --use-singularity --profile=slurm

The last command is explained in more detail below in the usage section.

Build from Source

  1. Install the D package manager DUB.
  2. Install JQ 1.6.
  3. Build DENTIST using either

     dub install dentist

     or

     git clone --recurse-submodules https://github.com/a-ludi/dentist.git
     cd dentist
     dub build

Runtime Dependencies

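DENTIST requires several external software packages at run time. Running dentist -d, as shown under "Use Pre-Built Binaries", lists the required tools and reports whether each one is available.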

Please see their own documentation for installation instructions. Note that the packages currently available on Bioconda are outdated and should not be used at the moment; up-to-date packages are available using conda install -c a_ludi <dependency>.

If you run into trouble, please use the exact versions specified in the Conda recipe.

Usage

Before you start producing wonderful scientific results, you should skip over to the example section and try to run the small example. This will make sure your setup is working as expected.

Quick execution with Snakemake

TL;DR

wget https://github.com/a-ludi/dentist/releases/download/v4.0.0/dentist.v4.0.0.x86_64.tar.gz
tar -xzf dentist.v4.0.0.x86_64.tar.gz
cd dentist.v4.0.0.x86_64

# edit dentist.yml and snakemake.yml

# execute with CONDA:
snakemake --configfile=snakemake.yml --use-conda

# execute with SINGULARITY:
snakemake --configfile=snakemake.yml --use-singularity

# execute with pre-built binaries:
PATH="$PWD/bin:$PATH" snakemake --configfile=snakemake.yml

Install Snakemake version >=5.32.1 and prepare your working directory:

wget https://github.com/a-ludi/dentist/releases/download/v4.0.0/dentist.v4.0.0.x86_64.tar.gz
tar -xzf dentist.v4.0.0.x86_64.tar.gz

cp -r -t . \
    dentist.v4.0.0.x86_64/snakemake/dentist.yml \
    dentist.v4.0.0.x86_64/snakemake/Snakefile \
    dentist.v4.0.0.x86_64/snakemake/snakemake.yml \
    dentist.v4.0.0.x86_64/snakemake/envs \
    dentist.v4.0.0.x86_64/snakemake/scripts

Next, edit snakemake.yml and dentist.yml to fit your needs and optionally test your configuration with:

# see above for variants with pre-built binaries or Singularity
snakemake --configfile=snakemake.yml --use-conda --cores=1 -f -- validate_dentist_config

If no errors occurred, the whole workflow can be executed using:

# see above for variants with pre-built binaries or Singularity
snakemake --configfile=snakemake.yml --use-conda --cores=all

For small genomes of a few hundred Mbp this should run on a regular workstation. You can use Snakemake’s --cores to run independent jobs in parallel. Larger data sets may require a cluster, in which case you can use Snakemake’s cloud or cluster facilities.
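
On a shared workstation you may not want to hand every core to the workflow. The following is a minimal sketch, not part of DENTIST, that leaves one core free; it assumes `nproc` (GNU coreutils) or `getconf` is available, and it prints the command instead of running it.

```shell
# Number of available cores (nproc is part of GNU coreutils).
total=$(nproc 2>/dev/null || getconf _NPROCESSORS_ONLN)

# Keep one core free, but never go below one.
cores=$(( total > 1 ? total - 1 : 1 ))

# Print the resulting call; remove the `echo` to actually run it.
echo snakemake --configfile=snakemake.yml --use-conda --cores="$cores"
```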

Executing on a Cluster

Please follow the setup steps from above except for the actual execution.

To make execution on a cluster easy, DENTIST comes with example files, found in the snakemake directory, that make Snakemake use SLURM via DRMAA, sbatch or srun. If your cluster does not use SLURM, please modify the profiles to suit your needs or read the documentation of Snakemake. Another good starting point is the Snakemake-Profiles project.

After you have selected an appropriate cluster profile, make it available to Snakemake, e.g.:

# choose appropriate file from `snakemake/profile-slurm.*.yml`
mkdir -p ~/.config/snakemake/slurm
cp ./snakemake/profile-slurm.submit-async.yml ~/.config/snakemake/slurm/config.yaml

Adjust the profile according to your cluster, e.g. you may need to specify accounting information. Values defined in cluster.yml can be used in the profile as demonstrated in the examples. This file is also the place to modify resource allocations and job names.

Now, you can execute the workflow like this:

snakemake --configfile=snakemake.yml --profile=slurm --use-conda

Snakemake will now start submitting jobs to your cluster until all the work is done. If something fails, you can execute the same command again to continue from the latest state of the workflow.
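
Re-running the same command can be automated with a small retry helper. This is a sketch under the assumption that failures are transient (for example, a job killed by the scheduler); the function name `retry` is made up for illustration, and blind retries can hide real errors, so check the logs before relying on it.

```shell
# retry MAX CMD...: run CMD until it succeeds, at most MAX times.
retry() {
    max=$1
    shift
    attempt=1
    while ! "$@"; do
        if [ "$attempt" -ge "$max" ]; then
            echo "giving up after $attempt attempt(s)" >&2
            return 1
        fi
        attempt=$((attempt + 1))
        echo "retrying (attempt $attempt of $max)" >&2
    done
}

# Usage on the cluster (uncomment to run):
# retry 3 snakemake --configfile=snakemake.yml --profile=slurm --use-conda
```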

Manual execution

Please inspect the Snakemake workflow to get all the details. It might be useful to execute Snakemake with the -p switch which causes Snakemake to print the shell commands. If you plan to write your own workflow management for DENTIST please feel free to contact the maintainer!
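
One way to collect the shell commands without running anything is to combine -p with Snakemake's dry-run switch -n. The sketch below assumes it is started in a prepared working directory; the log file name is arbitrary, and a dry run can only list the jobs Snakemake is able to plan up front.

```shell
# -n: dry run (nothing is executed); -p: print each job's shell command.
log=dentist-commands.log

if command -v snakemake >/dev/null 2>&1 && [ -f snakemake.yml ]; then
    snakemake --configfile=snakemake.yml -n -p > "$log"
    echo "wrote planned commands to $log"
else
    echo "snakemake or snakemake.yml not found; skipping dry run" >&2
fi
```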

Example

Make sure you have Snakemake 5.32.1 or later installed.

You can also use the convenient Conda package to execute the rules. Just make sure you have Mamba installed.

First of all, download the test data and workflow and switch to the dentist-example directory.

wget https://github.com/a-ludi/dentist/releases/download/v4.0.0/dentist-example.tar.gz
tar -xzf dentist-example.tar.gz
cd dentist-example

Local Execution

Execute the entire workflow on your local machine using all cores:

# run the workflow
PATH="$PWD/bin:$PATH" snakemake --configfile=snakemake.yml --cores=all

# validate the files
md5sum -c checksum.md5

Execution takes approx. 7 minutes and a maximum of 1.7GB memory on my little laptop with an Intel® Core™ i5-5200U CPU @ 2.20GHz.
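
For readers unfamiliar with `md5sum -c`, here is a self-contained sketch of what the validation step does. It uses a scratch directory and a made-up file name (`result.fasta`), not the example's real outputs, and assumes GNU coreutils `md5sum`.

```shell
# Scratch directory so nothing real is touched.
demo=$(mktemp -d)

# Create a file and record its checksum in the format `md5sum -c` expects.
printf 'ACGT\n' > "$demo/result.fasta"
(cd "$demo" && md5sum result.fasta > checksum.md5)

# Verification succeeds while the file is unchanged ...
(cd "$demo" && md5sum -c checksum.md5) && status_ok=0 || status_ok=$?

# ... and exits non-zero as soon as the content differs.
printf 'ACGA\n' > "$demo/result.fasta"
(cd "$demo" && md5sum -c checksum.md5) && status_bad=0 || status_bad=$?

echo "unchanged: exit $status_ok, modified: exit $status_bad"
rm -rf "$demo"
```

In the example workflow, a non-zero exit status from `md5sum -c checksum.md5` therefore means at least one output file differs from the reference result.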

Execution with Conda

Make sure Mamba (a frontend for Conda) is installed on your system. Execute the workflow without explicit installation by adding --use-conda to the call to Snakemake:

# run the workflow
snakemake --configfile=snakemake.yml --use-conda --cores=all

# validate the files
md5sum -c checksum.md5

Note: If you do not have mamba installed, you may need to pass --conda-frontend=conda to Snakemake.

Execution in Singularity Container (discouraged)

Remark: depending on your system, the Singularity container may not work properly (see issue #30).

Execute the workflow inside a convenient Singularity image by adding --use-singularity to the call to Snakemake:

# run the workflow
snakemake --configfile=snakemake.yml --use-singularity --cores=all

# validate the files
md5sum -c checksum.md5

Cluster Execution

Please follow the instructions “Executing on a Cluster” above.

Configuration

DENTIST comprises a complex pipeline with many options for tweaking. This section points out some important parameters and their effect on the result or performance.

The default parameters are rather conservative, i.e. they focus on correctness of the result while not sacrificing too much sensitivity.

We also provide a greedy sample configuration (snakemake/dentist.greedy.yml) which focuses on sensitivity but may introduce more errors. Warning: Use with care! Always validate the closed gaps (e.g. by manual inspection).

In any case, the workflow creates an intermediate assembly workdir/{output_assembly}-preliminary.fasta that contains all closed gaps, i.e. before validation. It is accompanied by an AGP and a BED file. You may inspect these files for maximum sensitivity.
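
A quick first look at the BED file can be taken with awk. The sketch below assumes a standard tab-separated BED file (chromosome, 0-based start, half-open end) and uses a small made-up file in place of the real one; what the intervals represent is defined by DENTIST, not by this example.

```shell
# Hypothetical three-interval BED file standing in for the real one.
bed=$(mktemp)
printf 'contig1\t100\t250\n'  > "$bed"
printf 'contig1\t900\t1000\n' >> "$bed"
printf 'contig2\t0\t40\n'     >> "$bed"

# BED intervals are 0-based and half-open, so length = end - start.
# Header lines (track, browser, comments) are skipped.
summary=$(awk -F'\t' '
    !/^(track|browser|#)/ { n++; total += $3 - $2 }
    END { printf "%d intervals, %d bp", n, total }
' "$bed")
echo "$summary"   # prints: 3 intervals, 290 bp
rm -f "$bed"
```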

How to Choose DENTIST Parameters

While the list of all command-line parameters is a good reference, it does not provide an overview of the important parameters. Therefore, we provide this shorter list of important and influential parameters. Please also consider adjusting the performance parameters in the workflow configuration (snakemake/snakemake.yml).

Choosing the Read Type

The examples assume PacBio long reads, but DENTIST can be run with any kind of long reads; currently, this means either PacBio or Oxford Nanopore reads. To use non-PacBio reads, reads_type in snakemake.yml must be set to anything other than PACBIO_SMRT. The recommendation is to use OXFORD_NANOPORE for Oxford Nanopore reads. These names are borrowed from the NCBI. Further details on the rationale can be found in this issue.
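
The setting can be inspected and changed from the shell. The sketch below works on a one-line stand-in for snakemake.yml; the `key: value` layout on a single line is an assumption about the real file, which contains more keys and may nest this one. Note that the substitution replaces everything after the key, including any trailing comment.

```shell
# Stand-in for snakemake.yml (assumed layout; the real file is larger).
cfg=$(mktemp)
printf 'reads_type: PACBIO_SMRT\n' > "$cfg"

# Show the current setting with its line number.
grep -n 'reads_type' "$cfg"

# Switch to Oxford Nanopore, keeping any indentation in front of the key.
sed 's/^\([[:space:]]*reads_type:[[:space:]]*\).*/\1OXFORD_NANOPORE/' \
    "$cfg" > "$cfg.new"
mv "$cfg.new" "$cfg"

# Read the value back to confirm the change.
reads_type=$(sed -n 's/^[[:space:]]*reads_type:[[:space:]]*//p' "$cfg")
echo "reads_type is now: $reads_type"
rm -f "$cfg"
```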

Cluster/Cloud Execution

Cluster job schedulers can become unresponsive or even crash if too many short-running jobs are submitted to the cluster. It is therefore advisable to adjust the workflow accordingly. We tried to provide a default configuration that works in most cases as is, but application scenarios can be very diverse and manual adjustments may become necessary. Here is a small guide to the config parameters that influence the number of jobs and the resources they consume.

Troubleshooting

Regular ProtectedOutputException

Snakemake has a built-in facility to protect files from accidental overwrites. This is meant to avoid overwriting precious results that took many CPU hours to produce. If executing a rule would overwrite a protected file, Snakemake raises a ProtectedOutputException, e.g.:

ProtectedOutputException in line 1236 of /tmp/dentist-example/Snakefile:
Write-protected output files for rule collect:
workdir/pile-ups.db
  File "/usr/lib/python3.9/site-packages/snakemake/executors/__init__.py", line 136, in run_jobs
  File "/usr/lib/python3.9/site-packages/snakemake/executors/__init__.py", line 441, in run
  File "/usr/lib/python3.9/site-packages/snakemake/executors/__init__.py", line 230, in _run
  File "/usr/lib/python3.9/site-packages/snakemake/executors/__init__.py", line 155, in _run

Here workdir/pile-ups.db is the protected file that caused the error. If you are sure of what you are doing, you can simply lift the protection with chmod -R +w ./workdir and execute Snakemake again. Snakemake will then overwrite any files.
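
Before lifting the protection wholesale, it can help to see which files are affected. The sketch below mimics a write-protected file in a scratch directory (it does not touch a real workdir) and assumes that protection shows up as a missing owner write bit.

```shell
# Scratch directory standing in for ./workdir.
workdir=$(mktemp -d)
touch "$workdir/pile-ups.db"
chmod a-w "$workdir/pile-ups.db"   # mimic a write-protected output file

# List files the owner cannot write (octal 200 is the owner write bit).
protected=$(find "$workdir" -type f ! -perm -200)
echo "protected: $protected"

# Lift the protection as described above, then list again.
chmod -R +w "$workdir"
remaining=$(find "$workdir" -type f ! -perm -200)
echo "still protected: ${remaining:-none}"
rm -rf "$workdir"
```

On a real run, replacing `"$workdir"` with `./workdir` in the first `find` call shows what `chmod -R +w ./workdir` is about to unprotect.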

No internet connection on compute nodes

If you have no internet connection on your compute nodes or even the cluster head node and want to use Singularity for execution, you will need to download the container image manually and put it in a location accessible by all jobs. Assume /path/to/dir is such a location on your cluster. Then download the container image using

# IF internet connection on head node
singularity pull --dir /path/to/dir docker://aludi/dentist:stable

# ELSE (on local machine)
singularity pull docker://aludi/dentist:stable
# copy dentist_stable.sif to cluster
scp dentist_stable.sif cluster:/path/to/dir/dentist_stable.sif

When the image is in place you will need to adjust your configuration in snakemake.yml:

dentist_container: "/path/to/dir/dentist_stable.sif"

Now, you are ready for execution.

Note: if you want to use Conda without an internet connection, you can use the pre-built binaries instead, because they are exactly what Conda would install. Be sure to adjust your PATH accordingly, e.g.:

PATH="$PWD/bin:$PATH" snakemake --configfile=snakemake.yml --profile=slurm

Illegally formatted line from DBshow -n

This error message may appear in DENTIST’s log files. It is a known bug that will be fixed in a future release. In the meantime avoid FASTA headers that contain a literal " :: ".
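
To find affected headers before running DENTIST, a plain grep is enough. The sketch below uses a small made-up FASTA file; point the grep calls at your own input files instead.

```shell
# Hypothetical FASTA file with one offending header.
fasta=$(mktemp)
printf '>scaffold_1 length=8\nACGTACGT\n' > "$fasta"
printf '>scaffold_2 :: note\nTTGACCA\n' >> "$fasta"

# Count header lines (starting with ">") that contain the literal " :: ".
bad=$(grep -c '^>.* :: ' "$fasta")
echo "headers containing \" :: \": $bad"

# Show them with line numbers so they can be renamed.
grep -n '^>.* :: ' "$fasta"
rm -f "$fasta"
```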

Citation

Arne Ludwig, Martin Pippel, Gene Myers, Michael Hiller. DENTIST — using long reads for closing assembly gaps at high accuracy. GigaScience, Volume 11, 2022, giab100. https://doi.org/10.1093/gigascience/giab100

Maintainer

DENTIST is being developed by Arne Ludwig <ludwig@mpi-cbg.de> at the Max Planck Institute of Molecular Cell Biology and Genetics, Dresden, Germany.

Contributing

Contributions are warmly welcome. Just create an issue or pull request on GitHub. If you submit a pull request please make sure that:

It is recommended to install the Git hooks included in the repository to avoid premature pull requests. You can enable all shipped hooks with this command:

git config --local core.hooksPath .githooks/

If you want to enable just a subset, use ln -s .githooks/{hook} .git/hooks. If you want to audit code changes before they get executed on your machine, you can use cp .githooks/{hook} .git/hooks instead.

License

This project is licensed under the MIT License (see LICENSE).