vg

tools for working with genome variation graphs


vg - giant leaps

Published by ekg almost 9 years ago

Notable improvements in this release:

  1. MSGA -- a kind of alignment-driven overlap-based assembler that can align large genomes; vg's answer to cactus.
  2. call -- variant calling on the graph
  3. index -- index huge graphs!
  4. map -- map against huge graphs!

Indexing improvements

Assuming you've started with a big graph (and that node ids 1000000000 and 1000000001 are outside of your node space), let's begin by reducing the complexity of the graph so as to bound memory requirements during GCSA2 construction. We break the graph wherever we cross more than (approximately) 3 bifurcating edges in 16bp, and we remove fragments with less than 100bp of total sequence. This effectively masks out high-complexity regions of the graph. Despite the complexity reduction, the resulting system is sufficient for seeding global mapping:

vg mod -pl 16 -t 16 -e 3 $chr.vg \
    | vg mod -S -l 32 - >$chr.prune.vg
vg kmers -gB -k 16 -F -H 1000000000 -T 1000000001 -t 16 \
    $chr.prune.vg >$chr.graph

Now, we feed the resulting graphs, which are binary encodings of de Bruijn-like graphs, into GCSA2's build_gcsa tool:

build_gcsa -d 1 -o wg $( (seq 1 22; echo X; echo Y) | parallel -j 1 "echo -n ' {}.graph'" )

The resulting GCSA index is wg.gcsa.
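The command substitution above simply expands to a space-separated list of per-chromosome file names. If GNU parallel isn't handy, plain printf builds the same list; a sketch, assuming the .graph files are the per-chromosome outputs from vg kmers above:

```shell
# printf repeats its format string once per argument, so this expands to
# " 1.graph 2.graph ... 22.graph X.graph Y.graph" without GNU parallel.
graphs=$(printf ' %s.graph' $(seq 1 22) X Y)
echo "$graphs"
```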

Now, we build an xg index:

vg index -x wg.xg $( (seq 1 22; echo X; echo Y) | parallel -j 1 "echo -n ' {}.vg'" )

We can then drive our mapper using these indexes, and it only requires 20G of RAM!

zcat NA12878.x10.13502_7.fq.gz | head -4000000 | vg map -x wg.xg -g wg.k16e3l100d1.gcsa -if - -t 32 -k 27 -GFX 1.5 -T 50 -l 12 -S 5 >aln.gam

I'll generate a more-detailed writeup for the wiki, but hopefully this satisfies the very-curious.

MSGA

You can build graphs from sets of sequences by feeding them to vg msga in FASTA format.

vg msga -f ref.fa >ref.vg

Fun!

vg - ✩✭✮✫✯✨

Published by ekg about 9 years ago

Since v1.0.0:

🖬 111606B of patches.

  • multi-mapping (thanks to @adamnovak)
  • GCSA2 integration (nod to @jltsiren for helping me debug my many buggy attempts)

vg is now capable of generating GCSA and xg indexes of graphs! By default it will index both the forward and reverse complement of any variation graph you give it, and supports queries up to a length determined by the initial kmer size and number of doubling steps (e.g. 16 * 2^3 = 128 for the settings used below). At present, you can query the indexes using vg find, but stay tuned, because mapping with the new indexes is the top priority for v1.2.0.

You can try it out with this executable and the tarball (or your own copy of the repo):

➜  test git:(master) ✗ time vg construct -r 1mb1kgp/z.fa -v 1mb1kgp/z.vcf.gz -m 50 >z.vg -p
 loading variants for z         [==============================================]100.0%
 parsing variants               [==============================================]100.0%
 enforcing node size limit      [==============================================]100.0%
 planning construction          [==============================================]100.0%
 constructing graph             [==============================================]100.0%
 merging remaining graphs       [==============================================]100.0%
 joining graphs                 [==============================================]100.0%
 topologically sorting          [==============================================]100.0%
 compacting ids                 [==============================================]100.0%
 saving graph                   [==============================================]100.0%
vg construct -r 1mb1kgp/z.fa -v 1mb1kgp/z.vcf.gz -m 50 -p > z.vg  5.99s user 0.32s system 123% cpu 5.114 total

And then we'll index it with xg and gcsa2, using an initial kmer size of 16 and 3 doubling steps, which will yield a GCSA index that lets us find the positions of sequences of up to 16 * 2^3 = 128 in the graph. Meanwhile the xg index provides near-optimal query performance over elements of the graph (nodes, edges, paths, and path-relative positions) using minimal space.
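As a quick sanity check on that arithmetic, the maximum query length is just the initial kmer size doubled once per step:

```shell
# GCSA2 maximum query length: initial kmer size k, doubled once per step,
# i.e. k * 2^X.
k=16
steps=3
max_query=$(( k * (1 << steps) ))
echo "$max_query"    # 128
```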

➜  test git:(master) ✗ time vg index -x z.vg.xg -d z.vg.gcsa -g -k 16 -X 3 -V z.vg         
GCSA::prefixDoubling(): Initial path length 16
  mergePaths(): 3370686 -> 3265575 paths
  mergePaths(): 2988126 unique, 277449 unsorted, 0 nondeterministic paths
GCSA::prefixDoubling(): Step 1 (path length 16 -> 32)
  joinPaths(): 3265575 -> 3756204 paths (8280486 ranks)
  mergePaths(): 3756204 -> 3741775 paths
  mergePaths(): 3607232 unique, 113082 unsorted, 21461 nondeterministic paths
GCSA::prefixDoubling(): Step 2 (path length 32 -> 64)
  joinPaths(): 3741775 -> 6328908 paths (21039927 ranks)
  mergePaths(): 6328908 -> 6314604 paths
  mergePaths(): 6080484 unique, 61311 unsorted, 172809 nondeterministic paths
GCSA::prefixDoubling(): Step 3 (path length 64 -> 128)
  joinPaths(): 6314604 -> 8253191 paths (36664975 ranks)
  mergePaths(): 8253191 -> 8252375 paths
  mergePaths(): 7786268 unique, 39280 unsorted, 426827 nondeterministic paths
GCSA::mergeByLabel(): 8252375 -> 8017312 paths
GCSA::build(): 10511607 edges
GCSA::sample(): 2558469 samples at 2376049 positions
Index verification complete.
vg index -x z.vg.xg -d z.vg.gcsa -g -k 16 -X 3 -V z.vg  234.44s user 1.80s system 310% cpu 1:15.97 total

We can find things in it using a few features of vg find. Let's simulate a perfect read using vg sim and then look it up in the index:

➜  test git:(master) ✗ vg sim -f -n 1 -l 64 z.vg    
TCACTTAACAATGTCACTTCCAGTGGGTGTCTCCCACCTGAGGAGGGTGATACATGTACAAGGC
➜  test git:(master) ✗ vg find -g z.vg.gcsa -S TCACTTAACAATGTCACTTCCAGTGGGTGTCTCCCACCTGAGGAGGGTGATACATGTACAAGGC
58533:0

We can also pick up pieces of the graph from the xg index:

➜  test git:(master) ✗ vg find -x z.vg.xg -p z:200-300 | vg mod -o - | vg view - | head
H       HVN:Z:1.0
S       25      CCACATTTCCATTCACCCATCCTC
P       25      z       +       24M
L       25      +       26      +       0M
S       26      C
P       26      z       +       1M
L       26      +       28      +       0M
S       28      CATCCATCAACCCTCCAATCC
P       28      z       +       21M
L       28      +       29      +       0M
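Because this excerpt is ordered along the path z, concatenating the S-line (segment) sequences recovers the region's sequence. A small sketch, assuming the segment records above are saved to a file named region.gfa (a hypothetical name):

```shell
# Write the three segment records from the excerpt above, then keep only
# S lines and join their sequences (tab-separated field 3).
printf 'S\t25\tCCACATTTCCATTCACCCATCCTC\nS\t26\tC\nS\t28\tCATCCATCAACCCTCCAATCC\n' > region.gfa
awk -F'\t' '$1 == "S" { printf "%s", $3 } END { print "" }' region.gfa
```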

Indexing takes about the same amount of time as using the old strategy:

➜  test git:(master) ✗ time vg index -s -k 27 -e 7 z.vg
vg index -s -k 27 -e 7 z.vg  140.62s user 1.22s system 222% cpu 1:03.76 total

But uses less disk and RAM.

1.2M    z.vg        # the graph itself
32M     z.vg.index  # old index
2.6M    z.vg.xg     # xg index
18M     z.vg.gcsa   # 128mer gcsa

And the comparison isn't apples-to-apples. We've made something altogether more powerful: we can query the graph not just for 27-mers on the forward strand but for all 128-mers, on both the forward and reverse strands!

Both the .gcsa and .xg indexes use approximately the same amount of memory on disk and in RAM, so we'd need about 21M to load them into RAM. This is about 17 times the size of the input graph. Extrapolating, we should expect to use about 50G of RAM for the whole genome.
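That extrapolation is just a ratio; as a rough sketch (disk sizes from the table above; the ~3 GB whole-genome graph size is an assumption, not a measurement):

```shell
# Index-to-graph size ratio from the 1 Mbp test region:
# (2.6M xg + 18M gcsa) / 1.2M graph, which works out to roughly 17x.
ratio=$(awk 'BEGIN { printf "%.1f", (2.6 + 18) / 1.2 }')
echo "index/graph ratio: ${ratio}x"
# Applied to an assumed ~3 GB whole-genome graph, this lands in the
# ballpark of 50 GB of RAM.
awk -v r="$ratio" 'BEGIN { printf "estimated whole-genome RAM: ~%.0f GB\n", r * 3 }'
```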

Now, if you're itching to try to build such indexes for a human whole-genome variation graph, you will need to mask out some small regions of apparent high complexity (likely misalignments or highly divergent haplotypes):

(NB: I haven't tried this. But, when I do, this is how I will 😉)

# for each chromosome, reduce the graph complexity by masking
# complex regions yielding a lot of short subgraphs
# and retaining the original id space
for chr in $(seq 1 22; echo X; echo Y);
do
    vg mod -pl 16 -e 6 $chr.vg | vg mod -S -l 50 - >>in.vg
done
# now use these graphs as input to the indexer
# using only 2 doubling steps, and indexing only the forward strand
time vg index -x wg.xg -d wg.gcsa -g -k 16 -X 2 -V -F in.vg

Reboot!


vg - first binary distribution

Published by ekg about 9 years ago

Let's call this version 1, if for no other reason than that I have a static Linux 64-bit binary to distribute. Linux users should be able to download and run the attached executable after a quick chmod +x ./vg.