assembly-scan

Generate basic stats for an assembly.

MIT License

assembly-scan reads an assembly in FASTA format and outputs summary statistics in TSV or JSON format

I wanted a quick method to output simple summary statistics of an input assembly in TSV or JSON format. There are alternatives including assemblathon-stats.pl and assembly-stats, but they didn't output what I wanted. Thus assembly-scan was born.

Installation

Bioconda

assembly-scan is available on Bioconda.

conda create -n assembly-scan -c conda-forge -c bioconda assembly-scan

From Source

While I will always recommend using the Bioconda installation, the only dependency assembly-scan has is Python >=3.7. So, if you have that already you can use the script directly.

git clone git@github.com:rpetit3/assembly-scan.git
cd assembly-scan
python3 bin/assembly-scan YOUR_ASSEMBLY.fasta

From there you can decide to add it to your PATH or not. But, again, I recommend just going the Bioconda route.

Usage

assembly-scan requires an assembly, gzip compressed or uncompressed, as input.
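For readers who want to handle both input forms in their own scripts, here is a minimal Python sketch of reading a FASTA file that may or may not be gzip compressed. It picks the opener from the .gz extension, which mirrors the simple extension check this README mentions. The function names `open_assembly` and `read_contig_lengths` are hypothetical and chosen for illustration; this is not assembly-scan's actual source.

```python
import gzip
from pathlib import Path


def open_assembly(path):
    """Open a FASTA file as text, using gzip when the name ends in .gz."""
    path = Path(path)
    if path.suffix == ".gz":
        # "rt" makes gzip yield str lines, like the built-in open() below
        return gzip.open(path, "rt")
    return open(path, "r")


def read_contig_lengths(path):
    """Return the length of every sequence in a (possibly gzipped) FASTA file."""
    lengths = []
    current = None  # None until the first header line is seen
    with open_assembly(path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                # A new header closes the previous record, if there was one
                if current is not None:
                    lengths.append(current)
                current = 0
            elif line and current is not None:
                current += len(line)
    if current is not None:
        lengths.append(current)
    return lengths
```

Because the check only looks at the file name, a gzip compressed file without a .gz extension would be read as plain text in this sketch.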

usage: assembly-scan [-h] [--json] [--transpose] [--prefix PREFIX] [--version] ASSEMBLY

Generate statistics for a given assembly.

positional arguments:
  ASSEMBLY         FASTA file to read (gzip or uncompressed)

options:
  -h, --help       show this help message and exit
  --json           Print output in a JSON format
  --transpose      Print output in a transposed tab-delimited format
  --prefix PREFIX  ID to use for output (Default: basename of assembly)
  --version        show program's version number and exit
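The help text above has the shape that Python's argparse produces, so the options can be sketched as follows. This is an illustrative reconstruction under that assumption, not the tool's real code: `build_parser` is a hypothetical name, and the version string is a placeholder because the README does not state one.

```python
import argparse
import os


def build_parser():
    """Build a parser that accepts the options listed in the help text."""
    parser = argparse.ArgumentParser(
        prog="assembly-scan",
        description="Generate statistics for a given assembly.",
    )
    parser.add_argument("assembly", metavar="ASSEMBLY",
                        help="FASTA file to read (gzip or uncompressed)")
    parser.add_argument("--json", action="store_true",
                        help="Print output in a JSON format")
    parser.add_argument("--transpose", action="store_true",
                        help="Print output in a transposed tab-delimited format")
    parser.add_argument("--prefix", metavar="PREFIX", default=None,
                        help="ID to use for output (Default: basename of assembly)")
    # Placeholder version text; the real version is not given in this README
    parser.add_argument("--version", action="version",
                        version="%(prog)s (version placeholder)")
    return parser


args = build_parser().parse_args(["test/saureus.fasta.gz", "--json"])
# Without --prefix, fall back to the basename of the assembly path
sample = args.prefix or os.path.basename(args.assembly)
print(sample, args.json, args.transpose)
```

The fallback on the last lines shows why the `sample` value is the file's basename unless `--prefix` is given.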

Example Usage

Many FASTA files are available in the test directory. These include an uncompressed complete phiX174 genome and a compressed Staphylococcus aureus assembly. This script reads the input and outputs summary statistics in tab-delimited format to STDOUT.

Uncompressed

By default, assembly-scan outputs the results in tab-delimited format. This example uses the --transpose option instead, because the transposed output is easier to read in the README.

assembly-scan test/phiX174.fna --transpose
test/phiX174.fna        sample  phiX174.fna
test/phiX174.fna        total_contig    1
test/phiX174.fna        total_contig_length     5386
test/phiX174.fna        max_contig_length       5386
test/phiX174.fna        mean_contig_length      5386
test/phiX174.fna        median_contig_length    5386
test/phiX174.fna        min_contig_length       5386
test/phiX174.fna        n50_contig_length       5386
test/phiX174.fna        l50_contig_count        1
test/phiX174.fna        num_contig_non_acgtn    0
test/phiX174.fna        contig_percent_a        23.97
test/phiX174.fna        contig_percent_c        21.48
test/phiX174.fna        contig_percent_g        23.28
test/phiX174.fna        contig_percent_t        31.27
test/phiX174.fna        contig_percent_n        0.00
test/phiX174.fna        contig_non_acgtn        0.00
test/phiX174.fna        contigs_greater_1m      0
test/phiX174.fna        contigs_greater_100k    0
test/phiX174.fna        contigs_greater_10k     0
test/phiX174.fna        contigs_greater_1k      1
test/phiX174.fna        percent_contigs_greater_1m      0.00
test/phiX174.fna        percent_contigs_greater_100k    0.00
test/phiX174.fna        percent_contigs_greater_10k     0.00
test/phiX174.fna        percent_contigs_greater_1k      100.00
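
If you want to use the transposed output in a script, a small Python sketch like the one below can turn it into a dictionary. It assumes each line has exactly three whitespace-separated fields, as in the example above, and that none of the fields contain spaces. `parse_transposed` is a hypothetical helper name, and the sample text is a shortened copy of the output shown above.

```python
def parse_transposed(text):
    """Turn --transpose output into a {column: value} dict of strings."""
    stats = {}
    for line in text.splitlines():
        fields = line.split()  # tolerant of tabs or runs of spaces
        if len(fields) != 3:
            continue  # skip blank or unexpected lines
        _source, column, value = fields
        stats[column] = value
    return stats


example = """\
test/phiX174.fna\tsample\tphiX174.fna
test/phiX174.fna\ttotal_contig\t1
test/phiX174.fna\ttotal_contig_length\t5386
test/phiX174.fna\tcontig_percent_a\t23.97
"""

stats = parse_transposed(example)
# Values are kept as strings; convert only the ones you need
total_length = int(stats["total_contig_length"])
percent_a = float(stats["contig_percent_a"])
print(total_length, percent_a)
```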

gzip Compressed

assembly-scan includes a simple check (.gz extension) for gzip compressed assemblies. This example also demonstrates the --json option output.

assembly-scan test/saureus.fasta.gz --json
{
    "sample": "saureus.fasta.gz",
    "total_contig": 139,
    "total_contig_length": 2761520,
    "max_contig_length": 269921,
    "mean_contig_length": 19867,
    "median_contig_length": 163,
    "min_contig_length": 56,
    "n50_contig_length": 86756,
    "l50_contig_count": 9,
    "num_contig_non_acgtn": 0,
    "contig_percent_a": "33.74",
    "contig_percent_c": "16.50",
    "contig_percent_g": "16.21",
    "contig_percent_t": "33.54",
    "contig_percent_n": "0.00",
    "contig_non_acgtn": "0.00",
    "contigs_greater_1m": 0,
    "contigs_greater_100k": 7,
    "contigs_greater_10k": 37,
    "contigs_greater_1k": 49,
    "percent_contigs_greater_1m": "0.00",
    "percent_contigs_greater_100k": "5.04",
    "percent_contigs_greater_10k": "26.62",
    "percent_contigs_greater_1k": "35.25"
}
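
In the JSON above, counts and lengths are numbers while the percentages are quoted strings. The following Python sketch shows one way to load the output and convert the percentages to floats. `load_stats` is a hypothetical helper, and the embedded text is a shortened copy of the example output.

```python
import json

raw = """{
    "sample": "saureus.fasta.gz",
    "total_contig": 139,
    "n50_contig_length": 86756,
    "contig_percent_c": "16.50",
    "contig_percent_g": "16.21",
    "percent_contigs_greater_1k": "35.25"
}"""


def load_stats(text):
    """Parse assembly-scan style JSON, converting string percentages to float."""
    parsed = json.loads(text)
    stats = {}
    for key, value in parsed.items():
        # "sample" is a name, not a number, so it is left untouched
        if key != "sample" and isinstance(value, str):
            stats[key] = float(value)
        else:
            stats[key] = value
    return stats


stats = load_stats(raw)
gc_percent = stats["contig_percent_g"] + stats["contig_percent_c"]
print(f"{stats['sample']}: GC content {gc_percent:.2f}%")
```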

Output Columns

Column Description
sample Either assembly file basename, or value of --prefix
total_contig Total number of contigs in the assembly
total_contig_length Sum of all contig lengths
max_contig_length Length of the longest contig
mean_contig_length Average length of all contigs
median_contig_length Median length of all contigs
min_contig_length Length of the shortest contig
n50_contig_length N50 length of the contigs
l50_contig_count L50, the number of contigs that make up half the total length
num_contig_non_acgtn Number of contigs containing characters other than A, C, G, T, or N
contig_percent_a Percent of A nucleotides in contigs
contig_percent_c Percent of C nucleotides in contigs
contig_percent_g Percent of G nucleotides in contigs
contig_percent_t Percent of T nucleotides in contigs
contig_percent_n Percent of N nucleotides in contigs
contig_non_acgtn Percent of characters other than A, C, G, T, or N in contigs
contigs_greater_1m Number of contigs greater than 1,000,000 bp
contigs_greater_100k Number of contigs greater than 100,000 bp
contigs_greater_10k Number of contigs greater than 10,000 bp
contigs_greater_1k Number of contigs greater than 1,000 bp
percent_contigs_greater_1m Percent of contigs greater than 1,000,000 bp
percent_contigs_greater_100k Percent of contigs greater than 100,000 bp
percent_contigs_greater_10k Percent of contigs greater than 10,000 bp
percent_contigs_greater_1k Percent of contigs greater than 1,000 bp
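
To make the length-based columns concrete, here is a hedged Python sketch that computes a few of them from a list of contig lengths. It uses the common definition of N50 (the length of the contig at which the running total, taken from longest to shortest, first reaches half the assembly length) and L50 (how many contigs that took). `length_stats` is a hypothetical function; assembly-scan's own handling of ties and thresholds may differ.

```python
def length_stats(lengths):
    """Compute a subset of the length columns from a list of contig lengths."""
    if not lengths:
        raise ValueError("at least one contig length is required")
    ordered = sorted(lengths, reverse=True)  # longest contig first
    total = sum(ordered)

    running = 0
    n50 = l50 = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        # Stop at the first contig where the running sum reaches half the total
        if running * 2 >= total:
            n50, l50 = length, count
            break

    # Strictly "greater than", following the column descriptions
    greater_1k = sum(1 for n in ordered if n > 1000)
    return {
        "total_contig": len(ordered),
        "total_contig_length": total,
        "max_contig_length": ordered[0],
        "min_contig_length": ordered[-1],
        "n50_contig_length": n50,
        "l50_contig_count": l50,
        "contigs_greater_1k": greater_1k,
        "percent_contigs_greater_1k": f"{100 * greater_1k / len(ordered):.2f}",
    }


# A single 5,386 bp contig, like the phiX174 example
print(length_stats([5386]))
```

For the single-contig case the sketch gives an N50 of 5386 and an L50 of 1, matching the phiX174 example output.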

Naming

Originally this was named assembly-stats, but after a quick Google search (which I didn't do beforehand, again; I really should do better!) I found another assembly-stats from Sanger Pathogens. So I decided to rename it to assembly-scan, similar to my fastq-scan tool, since this process is similar to the Scan ability found in some video games, movies, TV, etc. In other words, it 'scans' an assembly and provides the user with otherwise hidden information about the assembly.