Contact: Ruibang Luo Email: [email protected]

Manual of SOAPdenovo2

About MEGAHIT

MEGAHIT works with single-cell sequencing data and metagenomcis data. Compare to SOAPdenovo, it generates longer contigs and consumes less memory.

To scaffold the contigs generated by MEGAHIT, please use SOAPdenovo-fusion. It is a preparation module that takes contigs as input and generates files that could be used consecutively by SOAPdenovo's map and scaff module.

Reference:

MEGAHIT: An ultra-fast single-node solution for large and complex metagenomics assembly via succinct de Bruijn graph

Manuscript

Github

For MAC users

Please use brew to SOAPdenovo. SOAPdenovo's package in Homebrew-science is managed by Shaun Jackman.

Introduction

SOAPdenovo is a novel short-read assembly method that can build a de novo draft assembly for the human-sized genomes. The program is specially designed to assemble Illumina GA short reads. It creates new opportunities for building reference sequences and carrying out accurate analyses of unexplored genomes in a cost effective way.

System Requirement

SOAPdenovo aims for large plant and animal genomes, although it also works well on bacteria and fungi genomes. It runs on 64-bit Linux system with a minimum of 5G physical memory. For big genomes like human, about 150 GB memory would be required.

Installation

You can download the pre-compiled binary according to your platform, unpack using "tar -zxf ${destination folder} download.tgz" and execute directly.
Or download the source code, unpack to ${destination folder} with the method above, and compile by using GNU make with command "make" at ${destination folder}/SOAPdenovo-V2.04.

How to use it

1. Configuration file

For big genome projects with deep sequencing, the data is usually organized as multiple read sequence files generated from multiple libraries. The configuration file tells the assembler where to find these files and the relevant information. "example.config" is an example of such a file.

The configuration file has a section for global information, and then multiple library sections. Right now only "max_rd_len" is included in the global information section. Any read longer than max_rd_len will be cut to this length.

The library information and the information of sequencing data generated from the library should be organized in the corresponding library section. Each library section starts with tag [LIB] and includes the following items:

The assembler accepts read file in three kinds of formats: FASTA, FASTQ and BAM. Mate-pair relationship could be indicated in two ways: two sequence files with reads in the same order belonging to a pair, or two adjacent reads in a single file (FASTA only) belonging to a pair. If a read in bam file fails platform/vendor quality checks(the flag field 0x0200 is set), itself and it's paired read would be ignored.

In the configuration file single end files are indicated by "f=/path/filename" or "q=/pah/filename" for fasta or fastq formats separately. Paired reads in two fasta sequence files are indicated by "f1=" and "f2=". While paired reads in two fastq sequences files are indicated by "q1=" and "q2=". Paired reads in a single fasta sequence file is indicated by "p=" item. Reads in bam sequence files is indicated by "b=".

All the above items in each library section are optional. The assembler assigns default values for most of them. If you are not sure how to set a parameter, you can remove it from your configuration file.

2. Get started

Once the configuration file is available, a typical way to run the assembler is:

3.Options

3.1 Options for all (pregraph-contig-map-scaff)

3.2 Options for sparse_pregraph

4. Output files

4.1 These files are output as assembly results:

4.2 There are some other files that provide useful information for advanced users, which are listed in Appendix B.

5. FAQ

5.1 How to set K-mer size?

The program accepts odd numbers between 13 and 31. Larger K-mers would have higher rate of uniqueness in the genome and would make the graph simpler, but it requires deep sequencing depth and longer read length to guarantee the overlap at any genomic location.

The sparse pregraph module usually needs 2-10bp smaller kmer length to achieve the same performance as the original pregraph module.

5.2 How to set genome size(-z) for sparse pregraph module?

The -z parameter for sparse pregraph should be set a litter larger than the real genome size, it is used to allocate memory.

5.3 How to set library rank?

SOAPdenovo will use the pair-end libraries with insert size from smaller to larger to construct scaffolds. Libraries with the same rank would be used at the same time. For example, in a dataset of a human genome, we set five ranks for five libraries with insert size 200-bp, 500-bp, 2-Kb, 5-Kb and 10-Kb, separately. It is desired that the pairs in each rank provide adequate physical coverage of the genome.